
How much does AI translation actually think?

We tried to control Gemini 3's thinking budget the same way we did with Gemini 2.5. It doesn't work that way anymore. Here's what we measured instead — and what it means for AI translation quality and cost.

Vitalii Vlasiuk | May 25, 2026 | 5 min read | Updated June 12, 2026

The Literess mascot thinking, multiplied into a cascade of copies of herself, beside a stack trace repeating “java.lang.StackOverflowError at Character.think(Character.java:1)”

Gemini is a gift that keeps on giving. It's the best translation LLM out of the box, for most languages. Via its thinking capability, you can crack even the hardest problems, like idioms, anime references or your unique brand voice™.

The issue is, thinking takes time. Thinking takes money; each token of it is billed. As a developer, you want to control both.

Gemini 2.5 had it easy. Gemini 3 made it hard.

Here's what we learned in Transept's labs.

What changed in Gemini 3

Gemini 2.5 had thinkingBudget: a numeric ceiling on reasoning tokens. Gemini 3 dropped it. The replacement is thinkingLevel: a categorical setting with four values — minimal, low, medium, high. The two parameters are mutually exclusive; sending both to a Gemini 3 model returns a 400 on current SDKs.
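A minimal sketch of guarding against the mixed-parameter mistake before a request leaves your code. The helper name `build_thinking_config` and the "model ID starts with `gemini-3`" check are assumptions made for illustration. Only the two field names and the four level values come from the text above.

```python
from typing import Any, Dict, Optional

# The four categorical values named above.
THINKING_LEVELS = ("minimal", "low", "medium", "high")


def build_thinking_config(
    model: str,
    *,
    level: Optional[str] = None,
    budget: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a thinkingConfig fragment, rejecting mixed parameters locally."""
    if level is not None and budget is not None:
        raise ValueError("thinkingLevel and thinkingBudget are mutually exclusive")

    # Assumption for this sketch: Gemini 3 model IDs start with "gemini-3".
    if model.startswith("gemini-3"):
        if budget is not None:
            raise ValueError(f"{model} takes thinkingLevel, not thinkingBudget")
        if level not in THINKING_LEVELS:
            raise ValueError(f"thinkingLevel must be one of {THINKING_LEVELS}")
        return {"thinkingLevel": level}

    # Older models (e.g. Gemini 2.5) keep the numeric ceiling.
    if level is not None:
        raise ValueError(f"{model} takes thinkingBudget, not thinkingLevel")
    if budget is None:
        raise ValueError("thinkingBudget is required for this model")
    return {"thinkingBudget": budget}
```

Failing locally is cheaper than discovering the 400 in production logs, and it keeps the model-specific branching in one place.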

That changes what "controlling cost" means in practice. Previously, you could put a hard cap on how much you are willing to spend. Now, it's up to the model to spend whatever 'medium' feels like.

We presume the reason is that models are post-trained to adapt thinking length to different task difficulties, like humans do. Thus, setting arbitrary limits would hurt this capability.

But how many tokens do low, medium, and high actually take?

The data

Here is a sample of two weeks of our real translation traffic, roughly 1000 calls across three Gemini 3 models. The thinking-token counts come from usageMetadata.thoughtsTokenCount on each response, which is the most accurate source possible.
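A sketch of how a median / p90 / max summary can be computed from logged responses. It assumes each response was stored as a dict shaped like the REST JSON, and it uses the nearest-rank percentile method. The percentile method behind our own numbers is not specified here, so treat that choice as an assumption.

```python
import math
from statistics import median
from typing import Any, Dict, List


def percentile_nearest_rank(values: List[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest value covering pct% of the sample."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_thinking(responses: List[Dict[str, Any]]) -> Dict[str, float]:
    """Summarize thoughtsTokenCount across a batch of logged responses."""
    counts = [
        (r.get("usageMetadata") or {}).get("thoughtsTokenCount", 0)
        for r in responses
    ]
    if not counts:
        raise ValueError("no responses to summarize")
    return {
        "calls": len(counts),
        "median": median(counts),
        "p90": percentile_nearest_rank(counts, 90),
        "max": max(counts),
    }
```

Grouping the logged responses by model and thinkingLevel before calling `summarize_thinking` gives one row per group.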

Thinking tokens by model

Model	`thinkingLevel`	calls	median	p90	max
`gemini-3.1-flash-lite`	minimal	521	1,145	1,694	2,055
`gemini-3.1-flash-lite`	low	112	1,362	15,726	15,729
`gemini-3-flash-preview`	medium	682	1,558	3,936	15,725
`gemini-3.1-pro-preview`	medium	457	1,251	3,579	6,063

The 15,725 is not a typo. One translation call, on a Flash model, on the medium setting, used fifteen thousand tokens of reasoning.

Usually, such token inflations are caused by Gemini leaking its raw thinking tokens into output. As you may know, the model's "thinking" is not really what it thinks, but rather polished summaries; real reasoning of the LLM is more like a stream of consciousness.

It's worth noting, though, that in this particular case, Gemini was peacefully ruminating on an English-to-Ukrainian translation of a literary passage with a line like "pure steel rejects lacquers and paints". It brute-forced its way through about twelve options, reading each one aloud, until it liked the output.

A few weeks later, an update. That 15,725 wasn't a one-off act of Gemini's devotion, but a pattern specific to complex tasks.

With more traffic, especially on gemini-3.1-flash-lite, the same number keeps coming back. A suspiciously tight 15,724–15,729 band, dozens of calls deep, almost all of them Lite models chewing through glossary audits. Interestingly enough, they run on low thinking effort, which you wouldn't expect to be the most expensive setting.

A model that lands on the same number eleven times has hit a ceiling, not picked a phrasing. low thinking saturates around 15.7k tokens. Small models like Lite think less efficiently, so hard tasks push them to burn a lot of tokens. Larger models are far less prone to this, because each of their tokens is more likely to be a meaningful step towards the answer.
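One way to spot this kind of ceiling in your own logs is to check whether many calls pile up just under the maximum. This is a sketch: the function name, the 8-token band and the 5-hit threshold are arbitrary choices for illustration, and the sample counts in the usage note are invented.

```python
from typing import List, Optional, Tuple


def find_ceiling(
    counts: List[int], band: int = 8, min_hits: int = 5
) -> Optional[Tuple[int, int, int]]:
    """Return (low, high, hits) if at least min_hits calls sit within
    `band` tokens of the maximum, otherwise None.

    An organic distribution has a thin tail: very few calls land right
    next to the max. A hard ceiling shows up as a cluster there.
    """
    if not counts:
        return None
    top = max(counts)
    near_top = [c for c in counts if top - c <= band]
    if len(near_top) >= min_hits:
        return (min(near_top), top, len(near_top))
    return None
```

For example, `find_ceiling([1145, 1362, 900, 15724, 15725, 15726, 15727, 15729])` returns `(15724, 15729, 5)`, while `find_ceiling([1000, 2000, 3000, 6063])` returns `None`.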

The saturation has a cost, moreover. On Gemini 3, thinking tokens count against maxOutputTokens, so a call that spends 15.7k on thinking has almost nothing left for the answer, and all of that thinking is still billed at the output-token rate. On a 16K budget, the audit came back empty (finishReason=MAX_TOKENS) about half the time on small docs, which led us to investigate.

Raising the budget fixes the symptom, not the cause. We moved the audit budget to 32K on low, which leaves ~16k for output after a saturated think — the empty responses stopped. But the thinking didn't shrink. You can't cap it on Gemini 3, so it still burns ~15.7k; the bigger budget just gives the answer somewhere to go.
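The arithmetic behind that, plus a check for the "empty answer" symptom, as a rough sketch. It assumes "16K" and "32K" mean 16,000 and 32,000 tokens, that responses are dicts shaped like the REST JSON, and that thought-summary parts carry a `thought` flag. The helper names are hypothetical.

```python
from typing import Any, Dict


def answer_room(max_output_tokens: int, thoughts_tokens: int) -> int:
    """Tokens left for the visible answer once thinking has taken its share."""
    return max(max_output_tokens - thoughts_tokens, 0)


def starved_by_thinking(response: Dict[str, Any]) -> bool:
    """True when the call stopped on MAX_TOKENS without any answer text."""
    candidates = response.get("candidates") or []
    if not candidates:
        return False
    first = candidates[0]
    parts = (first.get("content") or {}).get("parts") or []
    # Skip thought summaries; only the answer text counts here.
    text = "".join(p.get("text", "") for p in parts if not p.get("thought"))
    return first.get("finishReason") == "MAX_TOKENS" and not text.strip()
```

With a saturated 15,729-token think, `answer_room(16_000, 15_729)` is 271 tokens and `answer_room(32_000, 15_729)` is 16,271. A retry policy can key off `starved_by_thinking` instead of treating an empty string as a valid answer.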

The only real fix to 16k thinkingmaxxing was to reduce the task scope.

minimal is the only tier that behaves like a real cap with a ~2k token ceiling. It's only available for Lite models, though.

Thinking tokens by task type (Flash vs Pro, both on `medium`)

Task	Flash median	Flash p90	Pro median	Pro p90
Translation	2,086	3,931	1,172	1,938
Rewrite	2,995	4,900	3,878	4,961
Fix	3,666	5,320	1,545	2,072
Regenerate	2,696	3,462	1,440	1,992

Pro thinks less than Flash on plain translation, by almost 2×. It only out-thinks Flash on rewrite tasks. Reasoning effort tracks task difficulty, not the marketing tier of the model.

Arguably, Pro achieves better results in translation benchmarks specifically due to its ability to think deeper, when needed.
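For reference, the Flash-to-Pro ratios implied by the medians in the task table can be checked in a few lines. The numbers are copied from the table; the function name is just for illustration.

```python
# (Flash median, Pro median) thinking tokens per task, both on `medium`.
MEDIANS = {
    "Translation": (2086, 1172),
    "Rewrite": (2995, 3878),
    "Fix": (3666, 1545),
    "Regenerate": (2696, 1440),
}


def flash_to_pro_ratio(task: str) -> float:
    """How many times more thinking tokens Flash spends than Pro (median)."""
    flash, pro = MEDIANS[task]
    return round(flash / pro, 2)


if __name__ == "__main__":
    for task in MEDIANS:
        print(f"{task}: {flash_to_pro_ratio(task)}x")
```

A ratio above 1 means Flash thinks more than Pro on that task. Rewrite is the only task where it drops below 1.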

Practical takeaways

Don't set thinkingBudget on a Gemini 3 model. It will either be ignored or throw an error.
medium is open-ended on the upper end. A realistic plan is ~4k p90 thinking tokens per call, not 2k.
If you need a hard limit on cost, use minimal where possible. Cutting the output short does not help: the thinking tokens are billed regardless.
Read usageMetadata.thoughtsTokenCount on every response if you bill per call. The number isn't in the response text, but it's on the invoice.
Pro on medium is likely to think less than its median on simple tasks, and more on difficult ones. Lite & Flash don't have this variance.
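A sketch of per-call accounting based on the takeaways above. It assumes the response is a dict shaped like the REST JSON and that candidatesTokenCount does not already include thinking tokens. The price argument is a placeholder, not a real rate, so take the actual figure from your own pricing page.

```python
from typing import Any, Dict


def output_cost(
    response: Dict[str, Any], usd_per_million_output: float
) -> Dict[str, float]:
    """Split billed output tokens into thinking and answer, and price them."""
    usage = response.get("usageMetadata") or {}
    thoughts = usage.get("thoughtsTokenCount", 0)
    answer = usage.get("candidatesTokenCount", 0)
    billed = thoughts + answer  # thinking is billed as output
    return {
        "thoughts_tokens": thoughts,
        "answer_tokens": answer,
        "billed_output_tokens": billed,
        "output_cost_usd": billed * usd_per_million_output / 1_000_000,
        "thinking_share": thoughts / billed if billed else 0.0,
    }
```

Logging `thinking_share` per call makes it easy to see which task types pay mostly for reasoning rather than for the translation itself.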

Because Pro scales its thinking with task difficulty, it is the best model for proofreading and refinement beyond the LLM baseline, as that's where its creativity activates. Raw draft translation is better on Flash.

It takes time and money, though. At Transept, we figured out how to mitigate Pro costs by using smart proofreading logic to only engage Pro on the most difficult parts. We also use Pro for planning and making decisions, delegating execution to smaller models and fine-tunes.

That way, everything works.

The author

Vitalii Vlasiuk, Co-founder

Co-founder of Transept, writing as “Mevkh.” A Language and Literature degree, then a turn into software: senior AI engineer shipping production LLM features to 50,000+ users — RAG, agentic tools, LLM-as-judge evaluation. A novelist on the slow path, with 120,000 words of satirical romance fantasy in a drawer. The friction between AI translation and his own prose is what set this whole thing in motion.