In nearly all cases, what we as consumers pay for AI-powered chat interfaces such as ChatGPT-4o is currently measured in tokens: invisible units of text that go unnoticed during use, yet are counted with exact precision for billing purposes. Though each exchange is priced by the number of tokens processed, the user has no direct way to verify the count.
Despite our (at best) imperfect understanding of what we get for our purchased ‘token’ unit, token-based billing has become the standard approach across providers, resting on what may prove to be a precarious assumption of trust.
Token Words
A token is not quite the same as a word, though it often plays a similar role, and most providers use the term ‘token’ to describe small units of text such as words, punctuation marks, or word-fragments. The same word may be counted as a single token by one system, while another might split it into several fragments, with each piece increasing the cost.
This method applies to both the text a user inputs and the model’s reply, with the price based on the total number of these units.
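As a minimal sketch of how differently tokenizers can slice the same text – here using OpenAI’s open-source tiktoken library, with the encoding names being the only assumption – the following shows that the very same sentence can yield different token counts, and therefore different bills, depending on the vocabulary in use:

```python
# Count tokens for the same sentence under two different tokenizers,
# using the open-source 'tiktoken' library (pip install tiktoken).
import tiktoken

text = "Tokenization is not the same as counting words."

for encoding_name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    token_ids = enc.encode(text)
    # Decode each token individually to show where the splits fall.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{encoding_name}: {len(token_ids)} tokens -> {pieces}")
```

The splits rarely align with dictionary words, which is part of why a user cannot eyeball a reply and estimate its cost.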
The issue lies in the fact that users cannot see any of this happening. Most interfaces don’t show token counts while a conversation is in progress, and the way tokens are calculated is hard to reproduce. Even when a count is shown after a reply, it is too late to tell whether it was fair, creating a mismatch between what the user sees and what they are paying for.
Recent research points to deeper problems: one study shows how providers can overcharge without ever breaking the rules, simply by inflating token counts in ways the user cannot see; another reveals the mismatch between what interfaces display and what is actually billed, leaving users with the illusion of efficiency where there may be none; and a third exposes how models routinely generate internal reasoning steps that are never shown to the user, yet still appear on the invoice.
The findings depict a system that appears precise, with exact numbers implying clarity, yet whose underlying logic remains hidden. Whether this is by design or a structural flaw, the result is the same: users pay for more than they can see, and often for more than they expect.
Cheaper by the Dozen?
In the first of these papers – titled Is Your LLM Overcharging You?, from four researchers at the Max Planck Institute for Software Systems – the authors argue that the risks of token-based billing extend beyond opacity, pointing to a built-in incentive for providers to inflate token counts:
The paper presents a heuristic capable of performing this kind of disingenuous calculation without altering visible output, and without violating plausibility under typical decoding settings. Tested on models from the LLaMA, Mistral and Gemma series, using real prompts, the method achieves measurable overcharges without appearing anomalous:
Source: https://arxiv.org/pdf/2505.21627
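To see why such inflation is even possible, consider that a byte-pair-encoding vocabulary contains every single byte as a token, so many different token sequences decode to exactly the same visible text. The sketch below is deliberately crude, and explicitly not the authors’ own algorithm (theirs keeps the inflated sequence plausible under typical decoding settings, where this one would look anomalous); it simply explodes each token into single-byte tokens to show that the decoded output is unchanged while the billed count balloons:

```python
# A crude demonstration that token counts can be inflated without
# changing the visible text. NOT the paper's algorithm -- just the
# simplest possible existence proof of the underlying ambiguity.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
honest = enc.encode("The answer to your question is forty-two.")

inflated = []
for token in honest:
    raw_bytes = enc.decode_single_token_bytes(token)
    # Every single byte is itself a valid token in the BPE vocabulary,
    # so this re-encoding always succeeds and decodes identically.
    inflated.extend(enc.encode_single_token(bytes([b])) for b in raw_bytes)

assert enc.decode(inflated) == enc.decode(honest)  # same visible output
print(f"honest: {len(honest)} tokens, inflated: {len(inflated)} tokens")
```

Since the user only ever sees the decoded text, nothing on their side distinguishes the honest count from the inflated one.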
To address the issue, the researchers call for billing based on characters rather than tokens, arguing that this is the only approach that gives providers a reason to report usage truthfully, and contending that if the goal is fair pricing, then tying cost to visible characters, not hidden processes, is the only option that stands up to scrutiny. Character-based pricing, they argue, would remove the motive to misreport while also rewarding shorter, more efficient outputs.
There are a number of extra considerations here, however (in most cases conceded by the authors). Firstly, the proposed character-based scheme introduces additional business logic that may favor the vendor over the consumer:
The optimistic reading here is that the vendor is thus encouraged to produce concise and more meaningful, helpful output. In practice, there are obviously less virtuous ways for a provider to reduce text-count.
Secondly, it is reasonable to assume, the authors state, that companies would likely require legislation in order to transition from the arcane token system to a clearer, text-based billing method. Down the line, an insurgent startup might decide to differentiate its product by launching with this kind of pricing model; but anyone with a truly competitive product (and operating at a lower scale than the EEE category) is disincentivized to do so.
Finally, larcenous algorithms such as the authors propose would incur their own computational cost; if the expense of calculating an ‘upcharge’ exceeded the potential profit, the scheme would clearly have no merit. However, the researchers emphasize that their proposed algorithm is effective and economical.
The authors provide the code for their theories on GitHub.
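The verifiability argument can be stated in a few lines of code: under character-based billing, the billable quantity is a pure function of the visible reply, so the user can recompute it locally. The rate below is invented purely for illustration:

```python
# Back-of-envelope sketch of character-based billing. The price is a
# hypothetical placeholder, not any provider's real rate.
PRICE_PER_CHARACTER = 0.000002  # hypothetical rate in USD

def character_bill(visible_reply: str) -> float:
    # No hidden tokenizer involved: anyone can recheck this figure
    # directly from the text they received.
    return len(visible_reply) * PRICE_PER_CHARACTER

print(f"${character_bill('The answer to your question is forty-two.'):.6f}")
```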
The Switch
The second paper – from researchers at the University of Maryland and Berkeley – argues that misaligned incentives in commercial language model APIs are not limited to token splitting, but extend to entire categories of hidden operations.
These include internal model calls, speculative reasoning, tool usage, and multi-agent interactions – all of which may be billed to the user without visibility or recourse.

Source: https://www.arxiv.org/pdf/2505.18471
Unlike conventional billing, where the quantity and quality of services are verifiable, the authors contend that today’s LLM platforms operate under conditions of information asymmetry: users are charged based on reported token and API usage, but have no means to confirm that these metrics reflect real or necessary work.
The paper identifies two key types of manipulation: quantity inflation, where the number of tokens or calls is increased without user benefit; and quality downgrading, where lower-performing models or tools are silently used in place of premium components:
The paper documents instances where more than ninety percent of billed tokens were never shown to users, with internal reasoning inflating token usage by a factor of more than twenty. Justified or not, the opacity of these steps denies users any basis for evaluating their relevance or legitimacy.
In agentic systems, the opacity increases, as internal exchanges between AI agents can each incur charges without meaningfully affecting the final output:
To confront these issues, the authors propose a layered auditing framework involving cryptographic proofs of internal activity, verifiable markers of model or tool identity, and independent oversight. The underlying concern, however, is structural: current LLM billing schemes rely on a persistent information asymmetry, leaving users exposed to costs that they cannot confirm or break down.
Counting the Invisible
The final paper – titled CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs, from ten researchers at the University of Maryland – re-frames the billing problem not as a matter of misuse or misreporting, but of structure. It observes that most commercial LLM services now hide the intermediate reasoning that contributes to a model’s final answer, yet still bill users for those hidden tokens.
The paper asserts that this creates an unobservable billing surface where entire token sequences can be fabricated, injected, or inflated without detection:
To counter this asymmetry, the authors propose CoIn, a third-party auditing system designed to verify hidden tokens without revealing their contents, and which uses hashed fingerprints and semantic checks to spot signs of inflation.

Source: https://arxiv.org/pdf/2505.13778
One component verifies token counts cryptographically using a Merkle tree; the other assesses the relevance of the hidden content by comparing it to the answer’s embedding. This enables auditors to detect padding or irrelevance – signs that tokens are being inserted simply to hike up the bill.
When deployed in tests, CoIn achieved a detection success rate of nearly 95% for some types of inflation, with minimal exposure of the underlying data. Though the system still depends on voluntary cooperation from providers, and has limited resolution in edge cases, its broader point is unmistakable: the very architecture of current LLM billing assumes an honesty that cannot be verified.
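The Merkle-tree idea is worth unpacking, since it is what lets an auditor check a count without seeing the content. The toy sketch below (stdlib only; not the CoIn implementation, whose tree construction and token fingerprinting differ) shows the general mechanism: the provider commits to its hidden tokens up front via a single root hash, and can later prove that any individual token was part of the billed sequence without disclosing the rest:

```python
# Toy Merkle commitment over a sequence of hidden tokens: a sketch of
# the general technique, not CoIn's actual scheme.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:              # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    # Sibling hashes (and whether each sibling sits on the right)
    # needed to recompute the root from one leaf.
    level, proof = [h(leaf) for leaf in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        index //= 2
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return proof

def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root

# The provider commits to its (hidden) reasoning tokens...
hidden_tokens = [t.encode() for t in ["chain", "of", "hidden", "reasoning", "steps"]]
root = merkle_root(hidden_tokens)
# ...and later proves token 3 was part of the billed sequence,
# without revealing any of the other tokens.
proof = merkle_proof(hidden_tokens, 3)
print(verify(b"reasoning", proof, root))   # True
```

A commitment of this kind prevents a provider from retroactively padding the count, but it cannot by itself prove the tokens were useful; hence the second, semantic component.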
Conclusion
Besides the advantage of securing pre-payment from users, a scrip-based currency (such as the ‘buzz’ system at CivitAI) helps to abstract users away from the true value of the currency they are spending, or the commodity they are buying. Likewise, giving a vendor leeway to define their own units of measurement further leaves the consumer in the dark about what they are actually spending, in terms of real money.
Like the absence of clocks in Las Vegas casinos, measures of this kind are often aimed at making the consumer reckless or indifferent to cost.
The scarcely-understood token, which can be consumed and defined in so many ways, may not be a suitable unit of measurement for LLM consumption – not least because it can cost many times more tokens to calculate a poorer LLM result in a non-English language than in an English-based session.
However, character-based billing, as suggested by the Max Planck researchers, would likely favor more concise languages and penalize naturally verbose ones. And since visual indications such as a depreciating token counter would probably make us rather less spendthrift in our LLM sessions, it seems unlikely that such useful GUI additions are coming anytime soon – at least not without legislative action.
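The language penalty is easy to check for oneself with tiktoken; the sample sentences below are rough translations of the same question, chosen only for illustration, and exact counts will vary by tokenizer:

```python
# Compare token costs for (roughly) the same sentence across scripts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "How do I renew my passport?",
    "Greek":   "Πώς ανανεώνω το διαβατήριό μου;",
    "Hindi":   "मैं अपना पासपोर्ट कैसे नवीनीकृत करूं?",
}
for language, sentence in samples.items():
    print(f"{language}: {len(enc.encode(sentence))} tokens")
```

Non-Latin scripts typically fragment into several times as many tokens, so the same question costs more to ask and answer.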