Prompt caching
Cache a large system prefix and pay the cache-read rate on repeat requests.
When you send the same long system prompt on many requests, you can mark it with a cache_control breakpoint so the prefix is cached. Subsequent requests that reuse the prefix are billed at the cache-read rate instead of the full input rate.
cache_control works on both Gemini and Anthropic models, but the mechanism differs by provider. For Gemini models, OpenToken stores the marked prefix as a gateway-managed Vertex CachedContent. For Anthropic models, the cache_control breakpoint is forwarded to the Anthropic Messages API, which caches the prefix on its side.
Caching is best-effort: if caching fails for any reason, OpenToken silently falls back to a normal uncached request, so your call still succeeds.
Mark a prefix to cache
Put a cache_control block on the system message. The message content becomes an array of parts, and the cached part carries cache_control: { type: "ephemeral" }.
curl https://api.opentoken.kr/v1/chat/completions \
  -H "Authorization: Bearer $OPENTOKEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "google/gemini-2.5-pro",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "<a large system prompt, roughly 16k+ characters>",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Summarize the policy above." }
  ]
}'
Cache TTL (Anthropic models)
For Anthropic models, cache_control accepts an optional ttl. The default is a 5-minute cache (billed at roughly 1.25x the input rate to write). Setting ttl: "1h" engages a 1-hour cache, billed at roughly 2x the input rate to write:
{ "type": "ephemeral", "ttl": "1h" }
A 1-hour cache write surfaces in usage as prompt_tokens_details.cache_creation_tokens_1h. Gemini models use a fixed gateway-side TTL and ignore the ttl field.
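As a rough illustration of the request shape described above, the following Python sketch builds the messages array with an optional ttl. The helper names (cache_control_block, build_messages) are hypothetical and not part of any OpenToken SDK; only the field names are taken from the examples on this page.

```python
from typing import Any, Optional


def cache_control_block(ttl: Optional[str] = None) -> dict[str, str]:
    """Build a cache_control object. `ttl` is optional, for example "1h"."""
    block = {"type": "ephemeral"}
    if ttl is not None:
        # Per this page, only Anthropic models honor ttl; Gemini models ignore it.
        block["ttl"] = ttl
    return block


def build_messages(
    system_prompt: str, user_prompt: str, ttl: Optional[str] = None
) -> list[dict[str, Any]]:
    """Return a messages array whose system prefix is marked for caching."""
    return [
        {
            "role": "system",
            # The content is an array of parts; the cached part carries cache_control.
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": cache_control_block(ttl),
                }
            ],
        },
        {"role": "user", "content": user_prompt},
    ]
```

The returned list can be placed under the "messages" key of the request body shown in the curl example. Passing ttl="1h" requests the 1-hour cache, and omitting it leaves the default in effect.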
Requirements
Caching applies only when both of the following conditions hold. Otherwise the request runs uncached, with no error.
- A prefix large enough to meet the provider's minimum cacheable size. For Gemini this is a per-model minimum of roughly 1,024 to 4,096 tokens; Anthropic enforces its own minimum. OpenToken does not enforce a character threshold itself.
- A DB-issued workspace key. Plain environment keys skip caching.
The first request that establishes the prefix is metered once as cache_creation_tokens. Later requests that hit the cached prefix are billed at the cache-read rate.
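To get a feel for when caching pays off, the sketch below compares the cost of the cached prefix against sending it uncached every time. It assumes the write multipliers quoted above (roughly 1.25x for the default cache, roughly 2x for the 1-hour cache), that every request after the first hits the cache, and that only prefix tokens are counted. The cache-read multiplier is not stated on this page, so it is a required parameter, and the function name relative_cost is hypothetical.

```python
def relative_cost(requests: int, write_multiplier: float, read_multiplier: float) -> float:
    """Prefix cost with caching divided by prefix cost without caching.

    A result below 1.0 means caching is cheaper. Both multipliers are
    expressed relative to the normal input rate (1.0).
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    # One cache write, then (requests - 1) cache reads.
    cached = write_multiplier + (requests - 1) * read_multiplier
    # Without caching, each request pays the full input rate once.
    uncached = float(requests)
    return cached / uncached
```

With a single request the ratio equals the write multiplier, so a cache that is never reused costs more than no cache. The ratio falls as the number of reuses grows, and a 1-hour cache needs more reuses than the default cache to recover its higher write cost.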
Reading cache usage
Cache activity shows up in the usage object. Both cached_tokens and cache_creation_tokens live inside prompt_tokens_details: cached_tokens counts prompt tokens served from the cache, and cache_creation_tokens counts tokens written when the cache is established.
On the request that creates the cache, cache_creation_tokens is populated:
{
  "usage": {
    "prompt_tokens": 8200,
    "completion_tokens": 120,
    "total_tokens": 8320,
    "prompt_tokens_details": { "cache_creation_tokens": 8000 }
  }
}
On later requests that hit the cached prefix, cached_tokens reflects the reused tokens and they are billed at the cache-read rate:
{
  "usage": {
    "prompt_tokens": 8200,
    "completion_tokens": 120,
    "total_tokens": 8320,
    "prompt_tokens_details": { "cached_tokens": 8000 }
  }
}
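Because caching falls back silently, the usage object is the only signal that a request was actually cached. A minimal sketch of a check, assuming the field names shown above, is given below. The function name cache_status and its return labels are hypothetical.

```python
from typing import Any, Mapping


def cache_status(usage: Mapping[str, Any]) -> str:
    """Classify a response's usage object as "hit", "created", or "uncached"."""
    details = usage.get("prompt_tokens_details") or {}

    def count(key: str) -> int:
        # Treat a missing or null field as zero tokens.
        return details.get(key) or 0

    if count("cached_tokens") > 0:
        return "hit"
    # A 1-hour cache write is reported under cache_creation_tokens_1h.
    if count("cache_creation_tokens") > 0 or count("cache_creation_tokens_1h") > 0:
        return "created"
    return "uncached"
```

Applied to the two example responses above, the first classifies as "created" and the second as "hit". A response with no prompt_tokens_details classifies as "uncached", which is what a silent fallback would look like.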