There's a quiet workflow running in a lot of organizations right now: record the meeting, upload it to Whisper or Otter.ai or a similar service, get a transcript back. Nobody thinks much about it. It works.
Here's the part worth thinking about: that audio is leaving your building. It's going to OpenAI's servers, or Otter.ai's servers, or whoever you're paying. Your staff meeting where you discussed a sensitive HR situation. Your intake call with a vulnerable client. Your budget review with the executive team. Your donor conversation. All of it, uploaded, processed, stored somewhere you don't control.
Last week, that calculus changed. On March 26, Cohere — the enterprise AI company that tends to build quietly and ship well — dropped a transcription model called Transcribe. It's open-source, Apache 2.0, runs on a consumer GPU, and it just topped every benchmark in the field. VentureBeat's deep analysis landed March 30 with a headline that cuts to it: "Cohere's open-weight ASR model hits 5.4% word error rate — low enough to replace speech APIs in production pipelines."
That's not marketing copy. That's the actual number from the Hugging Face Open ASR Leaderboard, which is the closest thing to an independent truth-telling service this space has. Cohere Transcribe is sitting at number one with a 5.42% word error rate. Whisper Large v3 is behind it. ElevenLabs Scribe v2 is behind it. Every option you're probably paying for is behind it.
What a 5.4% word error rate actually means
Word error rate measures the fraction of words a model gets wrong — substitutions, deletions, and insertions, counted against a reference transcript. At 5.42%, Cohere Transcribe is making fewer mistakes than most human transcriptionists working at speed. For practical purposes — meeting notes, call summaries, interview transcription, document intake — this is good enough that you don't need to review it line by line. You need to skim it.
For context: Whisper Large v3 runs around 7-8% WER on comparable benchmarks. Services like Otter.ai are built on top of models in that range. So the accuracy jump is real, not marginal.
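To make the metric concrete, here's a minimal WER calculator — the standard edit-distance formulation (substitutions + deletions + insertions, divided by reference length), not any particular leaderboard's evaluation harness:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if the words match)
            ))
        prev = curr
    return prev[-1] / len(ref)

# One wrong word in seven ≈ 14% WER -- a 5.4% model gets roughly
# one word in twenty wrong, most of them inconsequential.
print(wer("the budget review is set for thursday",
          "the budget review is set for tuesday"))
```

At 5.4%, most of those errors land on filler words and homophones rather than the words that change a sentence's meaning — which is why skimming is enough.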
Speed is also real. The model processes 525 minutes of audio per minute — which means an hour-long meeting takes about seven seconds to transcribe. This opens up workflows that weren't practical before: transcribing every call automatically, processing archived recordings in bulk, running real-time transcription without a cloud API in the loop.
Why 2 billion parameters matters
The model has 2 billion parameters. That sounds like a technical detail. Here's why it matters for your organization: you can run this on hardware you might already have.
A machine with a mid-range gaming GPU from the last couple of years can run this model. A used workstation from your IT closet might be able to run it. An AWS g4dn instance (the cheap ones, around $0.50/hour) can run it. You don't need a $50,000 AI server or a fancy GPU cluster.
For comparison: the models that were achieving this accuracy level two years ago required infrastructure that only enterprises could afford. Now a 12-person nonprofit can run state-of-the-art transcription on a single machine in their office and never touch an external API.
The model ships as a standard Hugging Face checkpoint. You can download it today, load it in a few lines of Python, and point it at your audio files. There's also a free API tier through Cohere directly if you want to test before committing to self-hosting.
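A loading sketch, assuming the checkpoint works with the standard transformers ASR pipeline. The model ID and the language keyword below are placeholders, not confirmed names — check Cohere's model card for the real ones:

```python
def load_transcriber(model_id: str = "cohere/transcribe"):
    """Build an ASR pipeline from the checkpoint. The model ID is a placeholder."""
    from transformers import pipeline  # deferred import: pulls weights on first use
    return pipeline("automatic-speech-recognition", model=model_id, device_map="auto")

def transcribe(asr, audio_path: str, language: str = "en") -> str:
    # No automatic language detection (see below), so pass the expected
    # language explicitly. The exact kwarg may differ -- check the model card.
    out = asr(audio_path, return_timestamps=True,
              generate_kwargs={"language": language})
    return out["text"]

# Usage (downloads the checkpoint; wants a GPU for reasonable speed):
#   asr = load_transcriber()
#   print(transcribe(asr, "staff_meeting.wav"))
```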
The privacy case for small orgs
If you work with a legal aid nonprofit, a community health organization, a social services team, or any organization where client confidentiality matters — you probably should not be sending recordings to a third-party cloud service. You may not even be allowed to under your organization's policies, or under the privacy commitments you've made to clients.
This is the part that gets glossed over in the AI productivity conversation. Everyone's excited about the efficiency gains from transcription. Far fewer people are asking whose servers your audio is living on.
With a self-hosted Cohere Transcribe setup, the audio never leaves your infrastructure. The model runs on your machine. The transcript stays in your systems. If your organization is subject to PIPEDA, PHIPA, GDPR, or any similar framework — this is how you get the efficiency gains without the compliance headache.
The same logic applies to competitive sensitivity. Your strategy calls, your client intake recordings, your internal team discussions — you probably don't want those sitting in a third-party vendor's data retention policy.
What this replaces
For most organizations, Cohere Transcribe can replace all of this:
Otter.ai / Fireflies / similar — subscription-based meeting transcription tools that sync with your calendar and process everything in the cloud. Most are $15-40/month per user. A team of 10 is paying $150-400/month for transcription. Cohere Transcribe does the same thing, better, for the cost of a cheap VPS.
OpenAI Whisper API — still widely used for custom integrations and workflows. Billed per minute of audio, which adds up quickly at scale. Apache 2.0 Cohere Transcribe running locally costs effectively zero per minute once you have the infrastructure.
Rev.com and similar human + AI hybrid services — used when accuracy really matters. Rev charges around $1.50/minute for their AI service. For a 60-minute recording, that's $90. Cohere Transcribe, self-hosted, approaches that accuracy at a cost that rounds to zero.
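Using the figures above, the break-even arithmetic is blunt. The per-user midpoint, monthly audio volume, and instance cost here are illustrative assumptions — swap in your own:

```python
# Illustrative figures from the comparisons above; adjust for your own volume.
TEAM_SIZE = 10
AUDIO_HOURS_PER_MONTH = 40

otter_style = TEAM_SIZE * 25.0                    # $15-40/user/month; midpoint assumed
whisper_api = AUDIO_HOURS_PER_MONTH * 60 * 0.006  # OpenAI's published $0.006/min rate
rev_ai = AUDIO_HOURS_PER_MONTH * 60 * 1.50        # $1.50/min, per the figure above
# Self-hosted: 40 h of audio at 525x real time is under 5 min of compute,
# so one billed hour of a g4dn instance (~$0.50) more than covers a month.
self_hosted = 0.50

for name, cost in [("Otter-style seats", otter_style),
                   ("Whisper API", whisper_api),
                   ("Rev AI", rev_ai),
                   ("Self-hosted g4dn", self_hosted)]:
    print(f"{name:>18}: ${cost:,.2f}/month")
```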
Languages and real-world limitations
Cohere Transcribe handles 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, and Korean. For Canadian organizations doing work in both official languages, English and French both perform well.
Two real limitations worth knowing upfront: the model doesn't do automatic language detection — you need to tell it which language to expect. And it doesn't include speaker diarization (who said what). For most use cases this doesn't matter. For anything where you need speaker attribution, you'd combine Transcribe with a separate diarization step, which is a well-trodden path.
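That combination step is mostly bookkeeping: a diarization tool (pyannote.audio is the usual open-source choice) gives you speaker turns with timestamps, and you assign each transcript chunk to the speaker whose turn overlaps it most. A sketch of that alignment, assuming both tools hand you (start, end) times in seconds:

```python
def assign_speakers(chunks, turns):
    """chunks: [(start, end, text)] from the transcriber;
    turns: [(start, end, speaker)] from a diarizer."""
    labeled = []
    for c_start, c_end, text in chunks:
        best, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(c_end, t_end) - max(c_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

chunks = [(0.0, 4.2, "Let's review the intake numbers."),
          (4.5, 9.0, "They're up twelve percent this quarter.")]
turns = [(0.0, 4.3, "SPEAKER_00"), (4.3, 9.5, "SPEAKER_01")]
print(assign_speakers(chunks, turns))
```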
Also worth noting: the model doesn't perform as well on heavily code-switched audio — conversations that mix languages mid-sentence. If your work involves multilingual communities where this is common, test it on real samples before committing.
What to actually do this week
If your team is currently paying for a transcription service, run a test: pull five recent recordings, run them through Cohere Transcribe via the free API (no setup required), compare the output to what you're getting now. That test takes an afternoon and costs nothing.
If the accuracy is comparable or better — and it likely will be — then the question becomes whether self-hosting makes sense for your volume and privacy requirements. For organizations where audio privacy is a serious consideration, the answer is almost always yes.
The migration path from most cloud transcription tools is simpler than it looks. The real work is usually the integration: connecting transcription output to wherever your team actually uses it — notes in your CRM, summaries in Notion, action items in your task tool. That's the kind of workflow setup that takes a few days, not months.
We've been testing Cohere Transcribe on real organizational audio over the past few days — town hall recordings, multi-speaker intake calls, low-quality phone audio. We know where it performs well and where edge cases come up. If your organization does any significant volume of audio processing and you want to understand what a privacy-first, self-hosted setup looks like for your specific situation, that's exactly what we help with.