How to use Unsloth as an API endpoint

You can use local LLMs with tools like Claude Code and Codex by connecting them to Unsloth’s OpenAI-compatible API endpoint. This lets you run models like Qwen and Gemma locally for agentic coding. Unsloth also offers features such as self-healing tool calling, code execution, and web search.

Unsloth makes it easy to deploy a fast API inference endpoint. Models loaded in Unsloth (including GGUFs) are exposed as an authenticated API via llama-server, and a long API key is generated for security, much like the keys OpenAI issues.

Your local models can then be used directly in your preferred AI agent, SDK, or chat client. Unsloth speaks two dialects on the same port. Both support streaming, tool calling (OpenAI tools / Anthropic tools), and vision inputs:

  • Anthropic-compatible /v1/messages for Claude Code, OpenClaw, the Anthropic SDK, and any client that expects the Messages API.

  • OpenAI-compatible /v1/chat/completions and /v1/responses for the OpenAI SDK, OpenCode, Cursor, Continue, Cline, Open WebUI, SillyTavern, and any OpenAI-compatible tool.

⚡ Quickstart

  1. Install or update Unsloth Studio. Then launch Unsloth.

  2. Load a model. Click New Chat, pick or search a model (GGUF), and wait for it to finish loading.

  3. Create an API key. Click your Unsloth avatar in the bottom-left → Settings → API → type a key name → Create. Copy the sk-unsloth-… value that appears. Unsloth only shows it once.

  4. Point your client at Unsloth. Use http://localhost:PORT as the base URL and your sk-unsloth-… key for auth. Jump to the recipe for your tool below.
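As a quick smoke test, you can hit the OpenAI-compatible endpoint with curl. The port, key, and model id below are placeholders; substitute the values Unsloth printed for you:

```shell
# Replace port, key, and model id with your own values
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-unsloth-REPLACE_ME" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-loaded-model-id",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```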

🔑 Creating an API key

  1. Open the sidebar and click your Unsloth avatar at the bottom-left.

  2. Go to Settings → API (globe 🌐 icon).

  3. Enter a friendly name (e.g. claude-code-macbook). Optionally set an expiry.

  4. Click Create.

  5. Copy the key. Unsloth stores only a hash and you won't be able to view it again.

All keys start with the sk-unsloth- prefix. Revoke a key from the same page at any time. Requests made with a revoked key will fail with 401 Unauthorized.

Unsloth run command

  1. Install or update Unsloth. Earlier versions don't expose the external API. See Installation.

  2. Load a GGUF model using the run command. This also loads the UI on the default port. The endpoint URL and API key are printed to the console, ready to use with your client of choice.

Loading a model from the CLI

You can load a model and have an API key created for you automatically using the unsloth CLI tool. When the model finishes loading, the endpoint URL and API key are printed to your console. Copy them into your client of choice and you're ready to go.

Before you start

Make sure you're on a recent version of Unsloth Studio; earlier versions don't expose the external API. See Installation.

The quick way

Open a terminal and load a GGUF model:
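A minimal invocation looks like the following sketch; the model id is an illustrative placeholder, so substitute any GGUF model you want to load:

```shell
# Load a GGUF model and start the server + UI on the default port
unsloth run unsloth/gemma-3-4b-it-GGUF
```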

That's it. This starts the server on the default port, loads the UI, and prints your endpoint URL and API key.

How the model name works

You can point at a model in a few different ways. Pick the one you find easiest:
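The exact accepted forms depend on your Unsloth version, so treat the following as illustrative assumptions rather than a definitive list; check the Unsloth docs for what your version supports:

```shell
# A Hugging Face GGUF repo id (illustrative)
unsloth run unsloth/gemma-3-4b-it-GGUF

# A local GGUF file path (illustrative)
unsloth run ./models/my-model.Q4_K_M.gguf
```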

Tuning the run (optional)

You don't need any of this for a basic load, but if you want more control, you can pass extra flags and they'll be forwarded straight to the underlying llama-server. Your values override Studio's defaults.

A few examples:
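For instance, the sketch below passes several common llama-server flags through unsloth run. The model id is a placeholder, and the flag names follow upstream llama-server conventions:

```shell
# Larger context window, all layers on GPU, custom sampling, quantized KV cache
unsloth run unsloth/gemma-3-4b-it-GGUF \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --cache-type-k q8_0
```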

Most llama-server flags work: context size, GPU layers, sampling parameters, KV cache types, reasoning settings, and so on. See the llama-server docs for the full list.

Server-side tool policy

unsloth run controls whether server-side tools (web search, code execution, etc.) are exposed by the inference server. Defaults are based on the bind address:

  • 127.0.0.1 (localhost) — tools on by default. Only your machine can reach the server.

  • 0.0.0.0 or any non-loopback address — tools off by default. A leaked API key on a network-exposed server means arbitrary code execution on the host.

Flags:

  • --enable-tools / --disable-tools — force on or off. On 0.0.0.0, --enable-tools shows a y/N security prompt.

  • --yes / -y — skip the prompt (for automation).

The resolved policy is a process-level hard override — individual requests cannot bypass it via enable_tools=true in the request body.
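As an example, to expose the server on the network with server-side tools deliberately enabled and the security prompt skipped, a sketch might look like this (the model id is a placeholder, and --host as the bind-address flag is an assumption carried over from llama-server):

```shell
# Network-exposed server with tools forced on, no interactive prompt
unsloth run unsloth/gemma-3-4b-it-GGUF --host 0.0.0.0 --enable-tools -y
```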

🌐 Endpoints

Studio exposes these endpoints on whichever port it booted on (typically http://localhost:8000 or http://localhost:8888):

  • POST /v1/messages (Anthropic Messages API): Claude Code, Anthropic SDK, OpenClaw, anything that speaks Anthropic.

  • POST /v1/chat/completions (OpenAI Chat Completions API): OpenAI SDK, opencode, Cursor, Continue, Cline, Open WebUI, curl, etc.

  • GET /v1/models (OpenAI models list): lists the models currently loaded in Unsloth.

Authenticate with an Authorization: Bearer sk-unsloth-… header on every request.
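The same Bearer key authenticates both dialects. For example, an Anthropic-style request might look like this sketch (port, key, and model id are placeholders):

```shell
# Anthropic-style Messages request with the same Bearer auth
curl http://localhost:8000/v1/messages \
  -H "Authorization: Bearer sk-unsloth-REPLACE_ME" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hi"}]
  }'
```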

You don't need to run different servers for the two formats. Studio handles both on the same port.

🖇️ Connecting your client

Unsloth enables you to run local LLMs via most frameworks, including Claude Code, Codex, OpenClaw, OpenCode, and more. Click a tool below for a guide:

🧰 Tool calling

Both endpoints support function / tool calling in their native format, plus an Unsloth-specific shorthand for Studio's built-in tools.

OpenAI-style tools: send tools and tool_choice to /v1/chat/completions exactly as you would with OpenAI. Claude Code (via /v1/messages), opencode, Cursor, Continue, and Cline all work out of the box.

Anthropic-style tools: send tools (with input_schema) and tool_choice to /v1/messages exactly as you would with Claude.

Studio server-side tools: Studio can execute Python, web search, and bash server-side and stream the results back as tool_result events. Opt in by adding these extra fields to either endpoint:
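Based on the field names used in the troubleshooting section (enable_tools and enabled_tools), a request opting into Studio's Python and web-search tools might look like this sketch (port, key, and model id are placeholders):

```shell
# Opt into Studio's server-side tools on the OpenAI-style endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-unsloth-REPLACE_ME" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "What is 2**100?"}],
    "enable_tools": true,
    "enabled_tools": ["python", "web_search"]
  }'
```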

The model sees each tool's output on its next turn. For deeper coverage (schemas, streaming events, chaining), see the dedicated tool-calling docs.

If you're using the Anthropic /v1/messages endpoint, tool_choice maps cleanly: Anthropic auto → OpenAI auto, Anthropic any → OpenAI required, Anthropic {type: "tool", name: "x"} → OpenAI {type: "function", function: {name: "x"}}, Anthropic none → OpenAI none.

❔ Troubleshooting

401 Unauthorized: either the Authorization header is missing or the key is wrong. Keys must be passed as Authorization: Bearer sk-unsloth-…. If you lost the key, create a new one from Settings → API. Studio doesn't show old keys after creation.

Lost connection to the model server: Studio couldn't reach the underlying llama.cpp server. Usually the model finished loading but crashed, or the model tab was closed inside Studio. Reload the model from New Chat and retry.

Claude Code shows the default Anthropic model, not my local one : check all three env vars are exported in the same shell where you run claude:
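A typical setup exports the base URL, the auth token, and the model id. ANTHROPIC_BASE_URL is named in this doc; ANTHROPIC_AUTH_TOKEN and ANTHROPIC_MODEL are Claude Code's usual companion variables, so verify them against your Claude Code version. The values below are placeholders:

```shell
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_AUTH_TOKEN="sk-unsloth-REPLACE_ME"
export ANTHROPIC_MODEL="your-model-id"
```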

Then run /model inside Claude Code to confirm. On Windows PowerShell use $env:ANTHROPIC_BASE_URL etc.

stream: true returns a single JSON blob instead of SSE: make sure you're hitting the right path (/v1/messages or /v1/chat/completions) and that your HTTP client is actually consuming the response as a stream, not buffering it.

I can't find the name of the model to add to opencode (or OpenClaw / any other client): ask Studio directly. GET /v1/models returns the exact model ID you need to plug into the client's "Model ID" field:
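For example (port and key are placeholders):

```shell
# List the model IDs Studio currently serves
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-unsloth-REPLACE_ME"
```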

You'll get back a JSON payload of the form {"data": [{"id": "gemma-4-26B-A4B-it-GGUF", ...}]}. Copy the id value; that's the string opencode's Model ID field (left column) and OpenClaw's models[].id expect. The display name on the right is whatever you want users to see.

Tool calls aren't executed: the model needs to support tool calling for client-side tools (tools / tool_choice). For Studio's built-in tools, remember to set enable_tools: true and list the ones you want in enabled_tools (e.g. ["python", "web_search"]).
