Service tiers

Route Google chat models on the global Vertex endpoint through Vertex Flex PayGo by setting service_tier: "flex" for cheaper, latency-tolerant requests.

The service_tier field selects the upstream serving lane for a request. The default lane is standard. Setting service_tier: "flex" routes Google generative (chat) models served on the global Vertex endpoint through Vertex Flex PayGo, which costs roughly 50% less in exchange for best-effort capacity with higher and more variable latency. The flex lane does not apply to the embeddings endpoint, which is regional and ignores service_tier.
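
To make the "roughly 50% less" trade-off concrete, here is a minimal arithmetic sketch. The function name, the token count, and the per-token price are all hypothetical placeholders, and the 0.5 discount is only the approximate figure quoted above, not an exact rate.

```python
def estimate_cost_usd(
    tokens: int,
    usd_per_million_tokens: float,
    flex: bool = False,
    flex_discount: float = 0.5,  # assumption: "roughly 50% less", not an exact rate
) -> float:
    """Estimate the cost of a job in the standard or flex lane."""
    standard_cost = tokens / 1_000_000 * usd_per_million_tokens
    return standard_cost * (1 - flex_discount) if flex else standard_cost


# Hypothetical job: 40M tokens at a placeholder price of $0.50 per 1M tokens.
standard = estimate_cost_usd(40_000_000, 0.50)
flex = estimate_cost_usd(40_000_000, 0.50, flex=True)
print(f"standard: ${standard:.2f}, flex: ${flex:.2f}")
```

With these placeholder numbers the estimate is $20.00 in the standard lane against $10.00 in flex. Real savings depend on the model's actual price and on whether the request is served in the flex lane at all.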

How flex routing works

Flex is a Vertex-only lane served exclusively through the global Vertex endpoint, so service_tier is honored only when a request reaches Vertex through that endpoint. In the other cases the field has no effect:

On a regional Vertex endpoint, service_tier: "flex" is silently dropped and the request runs at standard pricing and latency.
When the gateway reaches Google through a Gemini API key rather than a Vertex service account, there is no Vertex lane to select, so the field is also dropped.

Because the field is dropped rather than rejected, an unsupported request still succeeds; it simply runs in the standard lane.
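
The rules above can be summarized as a small decision function. This is only an illustrative sketch of the documented behavior, not the gateway's implementation: the function name, the endpoint labels, and the auth labels are hypothetical, and it assumes the model is a Google chat model.

```python
from typing import Literal, Optional

# Hypothetical labels for how the gateway reaches Google.
Endpoint = Literal["global", "regional"]
Auth = Literal["vertex_service_account", "gemini_api_key"]


def effective_lane(service_tier: Optional[str], endpoint: Endpoint, auth: Auth) -> str:
    """Return the lane a request would run in under the rules described above."""
    wants_flex = service_tier == "flex"
    # Flex exists only on Vertex, and only on the global endpoint.
    reaches_global_vertex = auth == "vertex_service_account" and endpoint == "global"
    # An unsupported flex request is not rejected; it falls back to standard.
    return "flex" if wants_flex and reaches_global_vertex else "standard"


print(effective_lane("flex", "global", "vertex_service_account"))
print(effective_lane("flex", "regional", "vertex_service_account"))
```

Note that the function never raises for an unsupported combination, mirroring the fact that the field is dropped rather than rejected.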

Flex trades latency for price. Requests can take noticeably longer, and the delay varies from request to request because flex capacity is best-effort. Do not use it for interactive, user-facing turns.

When to use it

Reach for flex when throughput and cost matter more than per-request speed:

Batch summarization over a backlog of documents.
Background extraction and enrichment jobs.
Offline classification or tagging pipelines.

Keep standard for chat UIs, autocomplete, and anything a person is waiting on.
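
One way to apply this guidance in code is to decide the lane where the request body is built. The sketch below assumes a hypothetical helper, build_chat_payload, with an interactive flag chosen by the caller; it relies only on the documented fact that omitting service_tier leaves the request in the default standard lane.

```python
from typing import Any


def build_chat_payload(model: str, prompt: str, *, interactive: bool) -> dict[str, Any]:
    """Build a chat completion body, opting into flex only for background work."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if not interactive:
        # Background and batch jobs tolerate slower, more variable responses.
        payload["service_tier"] = "flex"
    # Interactive requests omit the field and stay in the default standard lane.
    return payload


batch_body = build_chat_payload(
    "google/gemini-3-flash", "Tag this ticket by product area.", interactive=False
)
chat_body = build_chat_payload(
    "google/gemini-3-flash", "Hello!", interactive=True
)
```

Keeping the decision in one helper makes it harder for a user-facing code path to opt into flex by accident.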

Example

Send service_tier: "flex" alongside a normal chat completion request on a Google model.

curl https://api.opentoken.kr/v1/chat/completions \
  -H "Authorization: Bearer $OPENTOKEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash",
    "service_tier": "flex",
    "messages": [
      { "role": "user", "content": "Summarize this support thread in two sentences." }
    ]
  }'

import os

from openai import OpenAI

# Point the OpenAI SDK at the gateway and read the key from the environment.
client = OpenAI(
    base_url="https://api.opentoken.kr/v1",
    api_key=os.environ["OPENTOKEN_API_KEY"],
)

resp = client.chat.completions.create(
    model="google/gemini-3-flash",
    # extra_body merges service_tier into the JSON request body.
    extra_body={"service_tier": "flex"},
    messages=[
        {"role": "user", "content": "Summarize this support thread in two sentences."},
    ],
)
print(resp.choices[0].message.content)

The response shape is identical to a standard request; only the upstream lane and pricing change. Billing still writes one immutable usage record and a USD credit-ledger debit per request.
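
Because the response shape is unchanged, a batch job can read flex responses exactly as it would standard ones. The following is a minimal sketch of a backlog summarizer; summarize_backlog is a hypothetical helper, and it assumes the caller passes in a client built as in the Python example above.

```python
from typing import Any, Iterable


def summarize_backlog(
    client: Any,
    documents: Iterable[str],
    model: str = "google/gemini-3-flash",
) -> list[str]:
    """Summarize each document with a flex request and collect the results."""
    summaries: list[str] = []
    for doc in documents:
        resp = client.chat.completions.create(
            model=model,
            extra_body={"service_tier": "flex"},
            messages=[
                {"role": "user", "content": f"Summarize in two sentences:\n\n{doc}"},
            ],
        )
        # Same response shape as a standard request.
        summaries.append(resp.choices[0].message.content)
    return summaries
```

Since flex latency is higher and more variable, it may be worth reviewing the timeout and max_retries arguments of the OpenAI client constructor for batch workloads; suitable values depend on the job.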

Next steps

Create chat completion

Prompt caching

Models
