Prompt caching

When you send the same long system prompt on many requests, you can mark it with a cache_control breakpoint so the prefix is cached. Subsequent requests that reuse the prefix are billed at the cache-read rate instead of the full input rate.

cache_control works on both Gemini and Anthropic models, but the mechanism differs by provider. For Gemini models, OpenToken stores the marked prefix as a gateway-managed Vertex CachedContent. For Anthropic models, the cache_control breakpoint is forwarded to the Anthropic Messages API, which caches the prefix on its side.

Caching is best-effort: if the caching step fails for any reason, OpenToken silently falls back to a normal uncached request, so your call still succeeds.

Mark a prefix to cache

Put a cache_control block on the system message. The message content becomes an array of parts, and the cached part carries cache_control: { type: "ephemeral" }.

curl https://api.opentoken.kr/v1/chat/completions \
  -H "Authorization: Bearer $OPENTOKEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-2.5-pro",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "<a large system prompt, roughly 16k+ characters>",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      { "role": "user", "content": "Summarize the policy above." }
    ]
  }'
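The same request body can be assembled in code. The following is a minimal Python sketch that only builds the JSON payload and sends nothing; the helper name build_cached_request is hypothetical and not part of any OpenToken SDK.

```python
import json


def build_cached_request(model: str, system_prompt: str, user_message: str) -> dict:
    """Build a chat-completions body whose system prompt carries a cache breakpoint.

    Hypothetical helper for illustration; it mirrors the curl example above.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                # Content is an array of parts; the cached part carries cache_control.
                "content": [
                    {
                        "type": "text",
                        "text": system_prompt,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            },
            # The user turn stays a plain string and is not part of the cached prefix.
            {"role": "user", "content": user_message},
        ],
    }


body = build_cached_request(
    "google/gemini-2.5-pro",
    "<a large system prompt, roughly 16k+ characters>",
    "Summarize the policy above.",
)
print(json.dumps(body, indent=2))
```

Keeping the system prompt byte-for-byte identical across calls is what lets later requests reuse the cached prefix, so building it in one place like this can help.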

Cache TTL (Anthropic models)

For Anthropic models, cache_control accepts an optional ttl. The default is a 5-minute cache (billed at roughly 1.25x the input rate to write). Setting ttl: "1h" engages a 1-hour cache, billed at roughly 2x the input rate to write:

{ "type": "ephemeral", "ttl": "1h" }

A 1-hour cache write surfaces in usage as prompt_tokens_details.cache_creation_tokens_1h. Gemini models use a fixed gateway-side TTL and ignore the ttl field.
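To see what the two write multipliers mean in practice, here is a small arithmetic sketch in Python. It uses the approximate 1.25x and 2x figures quoted above. The function name cache_write_cost, the "5m" label for the default cache, and the per-token rate of 1.0 are all illustrative assumptions, not real prices or API values.

```python
# Approximate write multipliers quoted above; treat them as rough figures.
# "5m" is used here only as a label for the default 5-minute cache.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}


def cache_write_cost(tokens: int, input_rate: float, ttl: str = "5m") -> float:
    """Estimate the cost of writing `tokens` to the cache.

    `input_rate` is the normal per-token input price in whatever unit you use.
    """
    return tokens * input_rate * WRITE_MULTIPLIER[ttl]


# input_rate=1.0 is a placeholder unit, so the results read as
# "equivalent uncached input tokens".
print(cache_write_cost(8000, 1.0, "5m"))  # 10000.0
print(cache_write_cost(8000, 1.0, "1h"))  # 16000.0
```

Under these figures, an 8000-token prefix costs about as much to write as 10000 uncached input tokens with the default cache, and about 16000 with the 1-hour cache, so the longer TTL pays off only when the prefix is reused across gaps longer than five minutes.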

Requirements

Caching applies only when both of the following conditions hold. Otherwise the request runs uncached, with no error.

The prefix is large enough to meet the provider's minimum cacheable size. For Gemini this is a per-model minimum of roughly 1024 to 4096 tokens; Anthropic enforces its own minimum. OpenToken does not enforce a character threshold itself.
The request uses a DB-issued workspace key. Plain environment keys skip caching.

The first request that establishes the prefix is metered once as cache_creation_tokens. Later requests that hit the cached prefix are billed at the cache-read rate.

Reading cache usage

Cache activity shows up in the usage object. Both cached_tokens and cache_creation_tokens live inside prompt_tokens_details: cached_tokens counts prompt tokens served from the cache, and cache_creation_tokens counts tokens written when the cache is established.

On the request that creates the cache, cache_creation_tokens is populated:

{
  "usage": {
    "prompt_tokens": 8200,
    "completion_tokens": 120,
    "total_tokens": 8320,
    "prompt_tokens_details": { "cache_creation_tokens": 8000 }
  }
}

On later requests that hit the cached prefix, cached_tokens reflects the reused tokens and they are billed at the cache-read rate:

{
  "usage": {
    "prompt_tokens": 8200,
    "completion_tokens": 120,
    "total_tokens": 8320,
    "prompt_tokens_details": { "cached_tokens": 8000 }
  }
}
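Because a fallback to an uncached request raises no error, checking the usage object is the practical way to confirm that caching took effect. The sketch below classifies a response as a cache write, a cache hit, or uncached. The helper name summarize_cache_usage is hypothetical, and it assumes that missing fields mean zero and that a 1-hour write may appear in cache_creation_tokens_1h on its own.

```python
def summarize_cache_usage(usage: dict) -> dict:
    """Classify a usage object as a cache 'hit', 'write', or 'uncached'.

    Hypothetical helper; absent or null fields are treated as zero.
    """
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    created = details.get("cache_creation_tokens") or 0
    created_1h = details.get("cache_creation_tokens_1h") or 0

    if cached > 0:
        status = "hit"       # prefix was served from the cache
    elif created > 0 or created_1h > 0:
        status = "write"     # this request established the cache
    else:
        status = "uncached"  # caching did not apply, or fell back silently

    return {
        "status": status,
        "cached_tokens": cached,
        "cache_creation_tokens": created,
        "cache_creation_tokens_1h": created_1h,
    }


# The two usage objects shown above.
first = {"prompt_tokens": 8200, "completion_tokens": 120, "total_tokens": 8320,
         "prompt_tokens_details": {"cache_creation_tokens": 8000}}
later = {"prompt_tokens": 8200, "completion_tokens": 120, "total_tokens": 8320,
         "prompt_tokens_details": {"cached_tokens": 8000}}

print(summarize_cache_usage(first)["status"])  # write
print(summarize_cache_usage(later)["status"])  # hit
```

If a request you expected to hit the cache comes back as uncached, the Requirements section lists the usual causes: a prefix below the provider minimum or a key that is not a DB-issued workspace key.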
