How to use Unsloth as an API endpoint
You can use local LLMs with tools like Claude Code and Codex by connecting them to Unsloth's OpenAI-compatible API endpoint. This lets you run models like Qwen and Gemma locally for agentic coding. Unsloth also offers features such as self-healing tool calling, code execution, and web search.
Unsloth makes it easy to deploy a fast API inference endpoint that provides:
- Self-healing tool calling, which helps reduce broken or malformed tool calls by 50%
- Code execution support, allowing Bash and Python execution for more accurate code outputs
- Advanced web search that visits and reads webpages to gather in-depth information
- Automatic inference settings for GGUF models (temperature, top-k, etc.)
Models loaded in Unsloth (including GGUFs) are exposed as an authenticated API via llama-server. A long API key is generated for security, much like the keys OpenAI issues.
Your local models can then be used directly in your preferred AI agent, SDK, or chat client. Unsloth speaks two dialects on the same port. Both support streaming, tool calling (OpenAI tools / Anthropic tools), and vision inputs:

- Anthropic-compatible: `/v1/messages` for Claude Code, OpenClaw, the Anthropic SDK, and any client that expects the Messages API.
- OpenAI-compatible: `/v1/chat/completions` and `/v1/responses` for the OpenAI SDK, OpenCode, Cursor, Continue, Cline, Open WebUI, SillyTavern, and any OpenAI-compatible tool.
⚡ Quickstart
1. Install or update Unsloth Studio, then launch Unsloth.
2. Load a model. Click New Chat, pick or search for a model (GGUF), and wait for it to finish loading.
3. Create an API key. Click your Unsloth avatar in the bottom-left → Settings → API → type a key name → Create. Copy the `sk-unsloth-…` value that appears; Unsloth only shows it once.
4. Point your client at Unsloth. Use `http://localhost:PORT` as the base URL and your `sk-unsloth-…` key for auth. Jump to the recipe for your tool below.
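As a minimal sketch of that flow, here's a request built with only the Python standard library. The port (`8000`), model id, and key value are placeholders, not Unsloth defaults I can vouch for; substitute the values Unsloth printed or that you created in Settings → API.

```python
# Sketch: call Unsloth's OpenAI-compatible endpoint with the stdlib only.
# BASE_URL port, API key, and model id below are placeholders.
import json
import urllib.request

BASE_URL = "http://localhost:8000"    # use the port Unsloth booted on
API_KEY = "sk-unsloth-REPLACE_ME"     # from Settings -> API (shown once)

payload = {
    "model": "your-model-id",         # GET /v1/models returns the real id
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",  # required on every request
        "Content-Type": "application/json",
    },
    method="POST",
)
# Uncomment once a model is loaded to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```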
🔑 Creating an API key
1. Open the sidebar and click your Unsloth avatar at the bottom-left.
2. Go to Settings → API (globe 🌐 icon).
3. Enter a friendly name (e.g. `claude-code-macbook`) and, optionally, set an expiry.
4. Click Create.
5. Copy the key. Unsloth stores only a hash, so you won't be able to view it again.

All keys start with the sk-unsloth- prefix. Revoke a key from the same page at any time. Requests made with a revoked key will fail with 401 Unauthorized.
Treat your API key like a password. Anyone with the key and network access to your Unsloth instance can send requests to your loaded model.
Unsloth run command
Install or update Unsloth; earlier versions don't expose the external API. See Installation.
Load a GGUF model using the `unsloth run` command. This also loads the UI on the default port. The endpoint URL and API key are printed to the console, ready to use with your client of choice.
Loading a model from the CLI
You can load a model and have an API key created for you automatically using the unsloth CLI tool. When the model finishes loading, the endpoint URL and API key are printed to your console. Copy them into your client of choice and you're ready to go.
Before you start
Make sure you're on a recent version of Unsloth Studio; earlier versions don't expose the external API. See Installation.
The quick way
Open a terminal and load a GGUF model:
That's it. This starts the server on the default port, loads the UI, and prints your endpoint URL and API key.
How the model name works
You can point at a model in a few different ways. Pick the one you find easiest:
Tuning the run (optional)
You don't need any of this for a basic load, but if you want more control, you can pass extra flags and they'll be forwarded straight to the underlying llama-server. Your values override Studio's defaults.
A few examples:
Most llama-server flags work: context size, GPU layers, sampling parameters, KV cache types, reasoning settings, and so on. See the llama-server docs for the full list.
Server-side tool policy
unsloth run controls whether server-side tools (web search, code execution, etc.) are exposed by the inference server. Defaults are based on the bind address:
- `127.0.0.1` (localhost): tools on by default. Only your machine can reach the server.
- `0.0.0.0` or any non-loopback address: tools off by default. A leaked API key on a network-exposed server means arbitrary code execution on the host.
Flags:
- `--enable-tools` / `--disable-tools`: force tools on or off. On `0.0.0.0`, `--enable-tools` shows a y/N security prompt.
- `--yes` / `-y`: skip the prompt (for automation).
The resolved policy is a process-level hard override — individual requests cannot bypass it via enable_tools=true in the request body.

🌐 Endpoints
Studio exposes these endpoints on whichever port it booted on (typically http://localhost:8000 or http://localhost:8888):
| Endpoint | Format | Typical clients |
| --- | --- | --- |
| `POST /v1/messages` | Anthropic Messages API | Claude Code, Anthropic SDK, OpenClaw, anything that speaks Anthropic |
| `POST /v1/chat/completions` | OpenAI Chat Completions API | OpenAI SDK, opencode, Cursor, Continue, Cline, Open WebUI, curl, etc. |
| `GET /v1/models` | OpenAI models list | List the models currently loaded in Unsloth |
Authenticate with an Authorization: Bearer sk-unsloth-… header on every request.
You don't need to run different servers for the two formats. Studio handles both on the same port.
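To illustrate the two dialects on one port, here's the same prompt expressed as both request bodies. The model id is a placeholder; field names follow the public OpenAI and Anthropic APIs.

```python
# The same prompt in both dialects Unsloth serves on one port.
# "your-model-id" is a placeholder; GET /v1/models returns the real id.
prompt = "Summarize this repo."

openai_body = {                    # POST /v1/chat/completions
    "model": "your-model-id",
    "messages": [{"role": "user", "content": prompt}],
}

anthropic_body = {                 # POST /v1/messages
    "model": "your-model-id",
    "max_tokens": 1024,            # required by the Anthropic Messages API
    "messages": [{"role": "user", "content": prompt}],
}
```

Only the path and a few required fields (like `max_tokens`) differ; the base URL, port, and API key are identical.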
🖇️ Connecting your client
Unsloth lets you run local LLMs with most frameworks, including Claude Code, Codex, OpenClaw, OpenCode, and more. Click a tool below for a guide:
🧰 Tool calling
Both endpoints support function / tool calling in their native format, plus an Unsloth-specific shorthand for Studio's built-in tools.
OpenAI-style tools: send tools and tool_choice to /v1/chat/completions exactly as you would with OpenAI. Claude Code (via /v1/messages), opencode, Cursor, Continue, and Cline all work out of the box.
Anthropic-style tools: send tools (with input_schema) and tool_choice to /v1/messages exactly as you would with Claude.
Studio server side tools: Studio can execute Python, web search, and bash server-side and stream the results back as tool_result events. Opt in by adding these extra fields to either endpoint:
The model sees each tool's output on its next turn. Deeper coverage (schemas, streaming events, chaining) is documented separately.
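As a sketch, a request body opting into the built-in tools might look like this. The `enable_tools` / `enabled_tools` field names and the `"python"` / `"web_search"` tool names are the ones used in the Troubleshooting section; the model id is a placeholder.

```python
# Sketch: opting into Studio's server-side tools on either endpoint.
# Field names per the Troubleshooting notes; model id is a placeholder.
body = {
    "model": "your-model-id",
    "messages": [
        {"role": "user", "content": "What is 2**32? Check with code."}
    ],
    "enable_tools": True,                       # opt in to server-side tools
    "enabled_tools": ["python", "web_search"],  # which built-ins to expose
}
```

Remember that the server-side tool policy (above) is a hard override: if tools are disabled at the process level, these fields have no effect.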
If you're using the Anthropic /v1/messages endpoint, tool_choice maps cleanly: Anthropic auto → OpenAI auto, Anthropic any → OpenAI required, Anthropic {type: "tool", name: "x"} → OpenAI {type: "function", function: {name: "x"}}, Anthropic none → OpenAI none.
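That mapping can be sketched as a small helper. This is illustrative only; the translation happens inside Unsloth, not in your client code.

```python
# Illustrative sketch of the Anthropic -> OpenAI tool_choice mapping above.
def map_tool_choice(anthropic_choice: dict):
    """Translate an Anthropic tool_choice value to its OpenAI equivalent."""
    kind = anthropic_choice["type"]
    if kind == "auto":
        return "auto"
    if kind == "any":
        return "required"
    if kind == "none":
        return "none"
    if kind == "tool":  # force a specific tool by name
        return {"type": "function",
                "function": {"name": anthropic_choice["name"]}}
    raise ValueError(f"unknown tool_choice type: {kind}")
```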
❔ Troubleshooting
401 Unauthorized: either the Authorization header is missing or the key is wrong. Keys must be passed as Authorization: Bearer sk-unsloth-…. If you lost the key, create a new one from Settings → API. Studio doesn't show old keys after creation.
Lost connection to the model server: Studio couldn't reach the underlying llama.cpp server. Usually the model finished loading but crashed, or the model tab was closed inside Studio. Reload the model from New Chat and retry.
Claude Code shows the default Anthropic model, not my local one : check all three env vars are exported in the same shell where you run claude:
Then run /model inside Claude Code to confirm. On Windows PowerShell use $env:ANTHROPIC_BASE_URL etc.
stream: true returns a single JSON blob instead of SSE : make sure you're hitting the right path (/v1/messages or /v1/chat/completions) and that your HTTP client is actually consuming the response as a stream, not buffering it.
I can't find the name of the model to add to opencode (or OpenClaw / any other client) : ask Studio directly. GET /v1/models returns the exact model ID you need to plug into the client's "Model ID" field:
You'll get back a JSON payload of the form {"data": [{"id": "gemma-4-26B-A4B-it-GGUF", ...}]}. Copy the id value; that's the string that opencode's Model ID field (left column) and OpenClaw's models[].id expect. The display name on the right is whatever you want users to see.
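A quick illustrative way to pull the id out of that payload (the sample data mirrors the shape shown above; the id itself is just an example):

```python
# Extract model ids from a GET /v1/models response.
# `raw` is a hard-coded sample; in practice it's the HTTP response body.
import json

raw = '{"data": [{"id": "gemma-4-26B-A4B-it-GGUF", "object": "model"}]}'
model_ids = [m["id"] for m in json.loads(raw)["data"]]
```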
Tool calls aren't executed : The model needs to support tool calling for client-side tools (tools / tool_choice). For Studio's built-in tools, remember to set enable_tools: true and list the ones you want in enabled_tools (e.g. ["python", "web_search"]).