The Tokenmaxxing Problem Is Coming for Your Team

CivSafe Team·May 31, 2026·6 min read

Amazon quietly killed KiroRank last week.

If you haven't heard of it: KiroRank was the internal AI usage leaderboard Amazon built on their Kiro developer platform. It tracked token consumption across developer teams, ranked them against each other, and — alongside a mandate requiring more than 80% of developers to use AI tools every week — created exactly the kind of pressure that makes rational people do irrational things.

So developers did irrational things. They had MeshClaw, Amazon's internal agentic AI platform, generate code that would immediately be deleted. They piped AI output to nowhere. They fed the system questions they already knew the answers to, just to log sessions. The practice got a name — tokenmaxxing — spread fast enough to earn its own guides and vocabulary, and eventually rendered Amazon's own usage data meaningless. Compute costs spiked. The signal they were trying to capture disappeared.

On May 29, Amazon pulled the plug. KiroRank is gone. They're moving to "normalized deployments" — tracking AI-generated code that actually shipped, rather than raw token volume.

Goodhart's Law, again, right on schedule: when a measure becomes a target, it ceases to be a good measure.

You don't have Amazon's cushion

This story is being covered as a Big Tech cautionary tale. But the real audience isn't Amazon's engineering managers — they're already rebuilding. The real audience is every 15-person nonprofit, 30-person government team, or 50-person SMB that's currently trying to justify AI investment to a skeptical board or funder.

Here's what's happening in organizations like that right now: leadership absorbs the AI pressure from every news cycle, decides the team needs to "be using AI," and reaches for metrics to prove it. The metrics they reach for are the same ones Amazon used. Tools logged per week. Sessions in the platform. "Are people actually using this?" If the numbers look good, leadership assumes the program is working.

The problem is that employees — even good, well-intentioned ones — adapt to metrics. Not maliciously. Practically. If the question is "did you use AI this week?" people will use AI this week, whether or not it helped them. And unlike Amazon, a 30-person team doesn't have the data scale or the engineering oversight to notice when the signal has gone sideways. You can run a bad AI adoption program for six months before anyone figures out why the expected productivity gains never materialized.

What tokenmaxxing looks like at your scale

At the organizations we work with, tokenmaxxing doesn't look like developers running deletion scripts. It looks like this:

Someone runs a grant report draft through an AI tool, asks it to reformat three times, and then submits their original because the AI version wasn't better — but they logged the session. A team lead asks a chatbot to summarize a document they already read, because the monthly all-hands requires everyone to show an AI example. A staffer uses an AI assistant to answer questions they already know the answer to, because "checking with AI first" is now written into the workflow doc.

None of this is dishonest. It's just adaptation. When you measure activity, you get activity. You don't get outcomes.

What to track instead

If you're measuring AI adoption on your team right now — or being asked to justify AI investment to a board, an executive director, or a funder — here's what actually works:

Time-to-output on specific workflows. Pick two or three tasks your team does every week. Clock them before AI. Clock them after. "Our policy brief drafts went from four hours to ninety minutes" is a sentence your board can use. It requires no dashboard and no interpretation.

Volume handled without headcount change. Are you processing more constituent requests, more intake forms, more compliance documents with the same team? That's the cleanest possible ROI story. Not theoretical — actual throughput, actual dates.

Error rates on repeatable tasks. If AI is handling data entry, intake summaries, or routine correspondence, track how often output needs human correction. Rising correction rate means wrong tool for the job. Falling correction rate means you're winning.

Tool-specific feedback, not general AI sentiment. "Do you think AI is valuable?" gets you social desirability bias. "Does this specific tool save you time on this specific task, yes or no?" gets you information you can act on.

None of these require a leaderboard. None of them create incentives for theater. And all of them connect to something your organization actually cares about.
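For teams that keep these numbers in a spreadsheet or a short script, the arithmetic behind the first three measures is simple. Here is a minimal sketch in Python. The function names and all sample figures are hypothetical, chosen only to illustrate the calculation (the four-hours-to-ninety-minutes case mirrors the policy brief example above), and using the median rather than the average is our assumption, so that one unusually slow draft doesn't skew the result.

```python
from statistics import median


def time_saved_pct(before_minutes, after_minutes):
    """Percent reduction in median time-to-output for one workflow."""
    before = median(before_minutes)
    after = median(after_minutes)
    return round(100 * (before - after) / before, 1)


def throughput_per_person(items_completed, headcount):
    """Items handled per team member over a fixed period."""
    return round(items_completed / headcount, 1)


def correction_rate(outputs_reviewed, outputs_corrected):
    """Share of AI-assisted outputs that needed a human fix."""
    return round(outputs_corrected / outputs_reviewed, 3)


if __name__ == "__main__":
    # Hypothetical timings (minutes) for three policy brief drafts,
    # clocked before and after the tool was introduced.
    print(time_saved_pct([240, 250, 230], [90, 95, 85]))  # 62.5

    # Hypothetical monthly intake volume with the same six-person team.
    print(throughput_per_person(300, 6))  # 50.0 before
    print(throughput_per_person(420, 6))  # 70.0 after

    # Hypothetical month: 120 intake summaries reviewed, 18 corrected.
    print(correction_rate(120, 18))  # 0.15
```

The useful part is the comparison over time, not any single number. Record the same workflow, the same team, and the same review process each month, and note the dates so the before-and-after story holds up when a board member asks.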

The trust problem underneath the metric problem

There's a layer to the KiroRank story that the tech press isn't covering. When Amazon set an 80% weekly usage mandate and backed it with a leaderboard, they weren't really saying "we want you to have better tools." They were saying "we don't trust you to use them, so we'll watch." The developers who gamed the system weren't saboteurs. They were doing what rational people do when they're managed for compliance rather than supported for capability.

Small orgs can't afford to absorb that kind of trust damage. You probably know most of your team personally. If you introduce AI as something that gets monitored and scored, you'll get quiet, persistent resistance — people who technically comply and practically work around it. Usage numbers that look fine. Actual adoption that stays flat.

The rollout pattern that actually works: start with the two or three people on your team who are already curious about this. Let them find what's genuinely useful, build real wins on specific tasks, and teach their colleagues. That spreads faster than any mandate. And it produces something tokenmaxxing never could — organizational capacity that's actually there when you need it.

The one question worth asking yourself today

Amazon spent money building a metric, drove behavior toward that metric, and then dismantled the metric because the behavior it produced was useless. They can absorb that iteration cycle. Most of our clients can't.

If you're currently tracking AI adoption on your team, ask one question: could your people hit these numbers without AI actually doing anything useful? If the honest answer is yes, you're measuring the wrong thing.

The right thing to measure has always been the same: did the work get done, did it get done faster, did the team leave with more capacity than they started with? That's what changes when AI is working. That's what stays stubbornly flat when it isn't.

If you're not sure which side of that line you're on — or you're trying to build the case for AI investment and need outcomes you can actually point to — that's the kind of sprint we do. Two weeks, specific workflows, before-and-after numbers your board will understand. No leaderboard required.

CivSafe — Strategic Innovation. Community Impact.