Transept
4 min read
Vitalii `Mevkh` Vlasiuk & Literess

How much does AI translation actually think?

We tried to control Gemini 3's thinking budget the same way we did with Gemini 2.5. It doesn't work that way anymore. Here's what we measured instead.

Gemini is a gift that keeps on giving. For most languages, it's the best translation LLM out of the box. With its thinking capability, it can crack even the hardest problems, like idioms, anime references, or your unique brand voice™.

The issue is, thinking takes time. Thinking takes money; each token of it is billed. As a developer, you want to control both.

Gemini 2.5 had it easy. Gemini 3 made it hard.

Here's what we learned in Transept's labs.

What changed in Gemini 3

Gemini 2.5 had thinkingBudget: a numeric ceiling on reasoning tokens. Gemini 3 dropped it. The replacement is thinkingLevel: a categorical setting with four values: minimal, low, medium, and high. The two parameters are mutually exclusive; sending both to a Gemini 3 model returns a 400 on current SDKs.
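To make the switch concrete, here is a minimal sketch of building the thinking part of a request body by hand. It assumes the REST-style field names thinkingConfig.thinkingLevel and thinkingConfig.thinkingBudget, and it treats any model ID that starts with "gemini-3" as a Gemini 3 model. The helper name build_generation_config and the prefix check are hypothetical, made up for illustration, and not part of any SDK.

```python
from typing import Optional

# The four categorical values named in this article.
THINKING_LEVELS = ("minimal", "low", "medium", "high")


def build_generation_config(
    model: str,
    thinking_level: Optional[str] = None,
    thinking_budget: Optional[int] = None,
) -> dict:
    """Return a generationConfig dict with the thinking setting that fits the model.

    Hypothetical helper: it refuses combinations that the article says
    will not work, instead of letting the API reject them.
    """
    if thinking_level is not None and thinking_budget is not None:
        # The two parameters are mutually exclusive.
        raise ValueError("set thinking_level or thinking_budget, not both")

    # Assumption: the model family can be told apart by its ID prefix.
    is_gemini_3 = model.startswith("gemini-3")

    if is_gemini_3:
        if thinking_budget is not None:
            raise ValueError("use thinking_level, not thinking_budget, on Gemini 3")
        if thinking_level is None:
            return {}
        if thinking_level not in THINKING_LEVELS:
            raise ValueError(f"unknown thinking level: {thinking_level!r}")
        return {"thinkingConfig": {"thinkingLevel": thinking_level}}

    # Older models: numeric budget only.
    if thinking_level is not None:
        raise ValueError("use thinking_budget, not thinking_level, on this model")
    if thinking_budget is None:
        return {}
    return {"thinkingConfig": {"thinkingBudget": thinking_budget}}


print(build_generation_config("gemini-3-flash-preview", thinking_level="medium"))
print(build_generation_config("gemini-2.5-flash", thinking_budget=1024))
```

Failing early in your own code is a design choice: it keeps a stray thinkingBudget from reaching a Gemini 3 model in production.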

That changes what "controlling cost" means in practice. Previously, you could put a hard cap on how much you are willing to spend. Now, it's up to the model to spend whatever 'medium' feels like.

We presume the reason is that models are post-trained to adapt thinking length to different task difficulties, like humans do. Thus, setting arbitrary limits would hurt this capability.

But how many tokens do low, medium, and high actually take?

The data

Here is a sample of two weeks of our real translation traffic, roughly 1,000 calls across three Gemini 3 models. Thinking-token counts come from usageMetadata.thoughtsTokenCount on each response, which is the most direct source available.
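If you want to collect the same kind of statistics on your own traffic, a minimal sketch could look like this. It assumes each response is available as a plain dict with a usageMetadata.thoughtsTokenCount field, and it uses the nearest-rank method for p90, which is one of several percentile definitions and not necessarily the one behind our numbers. The function names and the sample values are made up for illustration.

```python
import math
from statistics import median


def thoughts_tokens(response: dict) -> int:
    """Thinking tokens for one response; 0 when the field is absent."""
    return int(response.get("usageMetadata", {}).get("thoughtsTokenCount", 0))


def percentile_nearest_rank(values: list, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(responses: list) -> dict:
    """Calls, median, p90, and max of thinking tokens. Needs at least one response."""
    counts = [thoughts_tokens(r) for r in responses]
    return {
        "calls": len(counts),
        "median": median(counts),
        "p90": percentile_nearest_rank(counts, 90),
        "max": max(counts),
    }


# Made-up responses, not our production data.
sample = [{"usageMetadata": {"thoughtsTokenCount": n}} for n in range(100, 1001, 100)]
print(summarize(sample))
```

Group the responses by model and thinkingLevel before calling summarize to get one row per combination.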

Thinking tokens by model

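| Model | thinkingLevel | Calls | Median | p90 | Max |
|---|---|---|---|---|---|
| gemini-3.1-flash-lite | minimal | 18 | 1,145 | 1,694 | 2,055 |
| gemini-3-flash-preview | medium | 353 | 1,558 | 3,936 | 15,725 |
| gemini-3.1-pro-preview | medium | 112 | 1,251 | 3,579 | 6,063 |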

The 15,725 is not a typo. One translation call, on a Flash model, on the medium setting, used fifteen thousand tokens of reasoning.

Inflation like this usually happens when Gemini leaks its raw thinking tokens into the output. As you may know, the "thinking" the model shows you is not its actual reasoning but a polished summary; the real reasoning of the LLM is more like a stream of consciousness.

It's worth noting, though, that in this particular case nothing leaked. Gemini was quietly working through an English->Ukrainian translation of a literary passage with a line like "pure steel rejects lacquers and paints". It brute-forced its way through about twelve options, spelling each one out, until it liked the result.

minimal is the only tier that behaves like a real cap, with a ceiling of about 2k tokens in our data. It's only available on Lite models, though.

Thinking tokens by task type (Flash vs Pro, both on medium)

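| Task | Flash median | Flash p90 | Pro median | Pro p90 |
|---|---|---|---|---|
| Translation | 2,086 | 3,931 | 1,172 | 1,938 |
| Rewrite | 2,995 | 4,900 | 3,878 | 4,961 |
| Fix | 3,666 | 5,320 | 1,545 | 2,072 |
| Regenerate | 2,696 | 3,462 | 1,440 | 1,992 |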

Pro thinks less than Flash on plain translation, by almost 2×. It only out-thinks Flash on rewrite tasks. Reasoning effort tracks task difficulty, not the marketing tier of the model.
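The comparison is plain division over the medians in the table above. A small sketch, with the numbers copied from that table; the dictionary and function names are made up for illustration:

```python
# Median thinking tokens per task as (Flash, Pro), both on medium.
MEDIANS = {
    "Translation": (2086, 1172),
    "Rewrite": (2995, 3878),
    "Fix": (3666, 1545),
    "Regenerate": (2696, 1440),
}


def flash_to_pro_ratio(task: str) -> float:
    """How many times more thinking tokens Flash uses than Pro at the median."""
    flash, pro = MEDIANS[task]
    return flash / pro


for task in MEDIANS:
    print(f"{task}: {flash_to_pro_ratio(task):.2f}x")
# Translation: 1.78x
# Rewrite: 0.77x
# Fix: 2.37x
# Regenerate: 1.87x
```

A ratio above 1 means Flash thinks more than Pro. Rewrite is the only task where it drops below 1.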

Arguably, Pro scores better on translation benchmarks precisely because it can think deeper when needed.

Practical takeaways

  • Don't set thinkingBudget on a Gemini 3 model. It will either be ignored or throw an error.
  • medium is open-ended on the upper end. Plan for a p90 of about 4k thinking tokens per call, not 2k.
  • If you need a hard limit on cost, use minimal where possible. Cutting a response off midway does not help: you will be billed regardless.
  • Read usageMetadata.thoughtsTokenCount on every response if you bill per call. The thinking tokens aren't in the response text, but they are on the invoice.
  • Pro on medium is likely to think less than its median on simple tasks, and more on difficult ones. Lite & Flash don't have this variance.
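To see what planning for 4k instead of 2k does to a budget, here is a back-of-the-envelope sketch. The function name and the price are placeholders made up for illustration, not real Gemini pricing; plug in the rate from your own invoice.

```python
def thinking_cost_usd(
    calls: int,
    thinking_tokens_per_call: int,
    usd_per_million_tokens: float,
) -> float:
    """Estimated spend on thinking tokens alone for a batch of calls."""
    return calls * thinking_tokens_per_call * usd_per_million_tokens / 1_000_000


PLACEHOLDER_PRICE = 1.00  # USD per million tokens; NOT a real price

optimistic = thinking_cost_usd(10_000, 2_000, PLACEHOLDER_PRICE)
realistic = thinking_cost_usd(10_000, 4_000, PLACEHOLDER_PRICE)
print(optimistic, realistic)  # 20.0 40.0
```

Whatever the actual price is, doubling the per-call estimate doubles the thinking part of the bill.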

That means Pro is the best model for proofreading and refinement beyond the LLM baseline, as that's where its creativity activates. Raw draft translation is better on Flash.

It takes time and money, though. At Transept, we figured out how to mitigate Pro costs with smart proofreading logic that engages Pro only on the most difficult parts. We also use Pro for planning and decision-making, and delegate execution to smaller models and fine-tunes.

That way, everything works.