Service tiers
Route Google chat models on the global Vertex endpoint through Vertex Flex PayGo with service_tier flex for cheaper, latency-tolerant requests.
The service_tier field selects the upstream serving lane for a request. The default lane is standard. Setting service_tier: "flex" routes Google generative (chat) models served on the global Vertex endpoint through Vertex Flex PayGo, which costs roughly 50% less in exchange for higher and more variable latency on a best-effort basis. It does not apply to the embeddings endpoint, which is regional and ignores service_tier.
How flex routing works
Flex is a Vertex-only lane served exclusively through the GLOBAL Vertex endpoint. The field is honored only when a request reaches Vertex through the global endpoint.
- On a regional Vertex endpoint, service_tier: "flex" is silently dropped and the request runs at standard pricing and latency.
- When the gateway reaches Google through the Gemini API key rather than a Vertex service account, there is no Vertex lane to select, so the field is dropped.
Because the field is dropped rather than rejected, an unsupported request still succeeds — it just runs in the standard lane.
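The routing rules above can be summarized as a small decision function. This is only an illustrative sketch, not the gateway's actual code: `UpstreamRoute`, its fields, and `effective_tier` are hypothetical names introduced here to model the behavior described in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpstreamRoute:
    """Hypothetical description of how the gateway reaches Google."""
    provider: str            # "vertex" (service account) or "gemini_api" (API key)
    endpoint: str = "global"  # "global" or "regional"; only meaningful for Vertex


def effective_tier(requested: Optional[str], route: UpstreamRoute) -> str:
    """Return the lane a request would run in under the rules described above."""
    if requested != "flex":
        # No tier requested, or an explicit standard request.
        return "standard"
    # Flex is honored only on Vertex through the global endpoint.
    if route.provider == "vertex" and route.endpoint == "global":
        return "flex"
    # Otherwise the field is dropped, not rejected: the request still succeeds.
    return "standard"
```

Under these assumptions, a flex request on a regional Vertex endpoint or on the Gemini API key path resolves to "standard", matching the silent fallback described above.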
Flex trades latency for price. Requests can take noticeably longer and the delay varies request to request, since flex capacity is best-effort. Do not use it for interactive, user-facing turns.
When to use it
Reach for flex when throughput and cost matter more than per-request speed:
- Batch summarization over a backlog of documents.
- Background extraction and enrichment jobs.
- Offline classification or tagging pipelines.
Keep standard for chat UIs, autocomplete, and anything a person is waiting on.
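One way to apply this guidance in client code is to choose the tier from the kind of workload when building the request body. The sketch below is an assumption-laden illustration: the workload labels and the `build_chat_request` helper are hypothetical and not part of the API; only the `service_tier` field and the request shape come from this page.

```python
from typing import Any, Dict

# Hypothetical workload labels for jobs where nobody is waiting on the reply.
LATENCY_TOLERANT = {
    "batch_summarization",
    "background_extraction",
    "offline_classification",
}


def build_chat_request(model: str, prompt: str, workload: str) -> Dict[str, Any]:
    """Build a chat completion body, opting into flex only for tolerant workloads."""
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if workload in LATENCY_TOLERANT:
        body["service_tier"] = "flex"
    # Interactive workloads omit the field and stay in the default standard lane.
    return body
```

Leaving the field out for interactive traffic relies on the default lane being standard, so a chat UI never opts into flex by accident.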
Example
Send service_tier: "flex" alongside a normal chat completion request on a Google model.
curl https://api.opentoken.kr/v1/chat/completions \
-H "Authorization: Bearer $OPENTOKEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemini-3-flash",
"service_tier": "flex",
"messages": [
{ "role": "user", "content": "Summarize this support thread in two sentences." }
]
}'

import os

from openai import OpenAI
client = OpenAI(
base_url="https://api.opentoken.kr/v1",
api_key=os.environ["OPENTOKEN_API_KEY"],
)
resp = client.chat.completions.create(
model="google/gemini-3-flash",
extra_body={"service_tier": "flex"},
messages=[
{"role": "user", "content": "Summarize this support thread in two sentences."},
],
)
print(resp.choices[0].message.content)

The response shape is identical to a standard request; only the upstream lane and pricing change. Billing still writes one immutable usage record and a USD credit-ledger debit per request.