Gemma 4 Dropped This Week. The License Is the Real Story.

CivSafe Team · April 5, 2026 · 6 min read

On April 2nd, Google released Gemma 4. Multimodal. Four model sizes. Runs on Ollama. You can pull it to a Mac mini and process PDFs, images, audio, and video with zero API costs. The benchmarks put it at #3 on LMSYS Arena.

Most of the coverage has been about those benchmarks. The benchmark story is fine. The benchmark story is not the point.

The point is the license.

Why Apache 2.0 matters more than you think

Every previous version of Gemma shipped under a custom Google license. That license had usage restrictions. It gave Google the right to terminate your access. It created legal uncertainty that pushed most serious production teams toward Mistral or Alibaba's Qwen instead — both of which use Apache 2.0.

Gemma 4 is Apache 2.0. Same license as Qwen 3.5, Mistral, and most of the open-weight ecosystem that developers actually build products on.

What Apache 2.0 means in practice: no monthly-active-user caps. No acceptable-use clauses that change at Google's discretion. No scenario where you build a workflow on this model and Google decides eighteen months later that your use case violates the terms. You can run it commercially. You can modify it. You can ship it embedded in a product. Google cannot pull the rug.

This has been the real barrier to self-hosted AI for a lot of small teams. Not compute. Not capability. Legal risk. "Can we actually put this in production without a lawyer checking the terms every six months?" For Gemma 4, the answer is yes.

What you can run locally right now

Gemma 4 ships in four sizes. Here's how they map to real hardware:

E2B (effectively 2B parameters, MatFormer architecture) — runs on almost any modern machine with 8GB RAM. Fast. Good for simple document tasks and quick classification.

E4B (effectively 4B) — the sweet spot for a $599 Mac mini M4 with 16GB. Pull it with Ollama, set OLLAMA_FLASH_ATTENTION=1, and you have a local multimodal model running entirely offline. About a 9.6GB download. No API key. No usage cost. No data leaving your building. (A minimal setup sketch follows this size rundown.)

26B MoE (A4B, meaning roughly 4B parameters active per token) — needs 32GB unified memory on a Mac or a proper GPU. This is where the community hit a wall, which we'll get to.

31B Dense — 40GB+ required. Server-class hardware. Skip this unless you're self-hosting at scale.
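
If you want to try the E4B today, here is a minimal setup sketch using the official ollama Python package. The gemma4:e4b model tag is our guess at how the release gets tagged in the Ollama registry, so check the library listing for the real name before pulling:

    import ollama

    # Assumes the Ollama server is already running locally; start it with
    # OLLAMA_FLASH_ATTENTION=1 to get the faster attention path mentioned above.
    # The model tag is an assumption, not a confirmed registry name.
    ollama.pull("gemma4:e4b")  # roughly a 9.6GB download

    response = ollama.chat(
        model="gemma4:e4b",
        messages=[{
            "role": "user",
            "content": "Summarize this grant report in three bullet points: ...",
        }],
    )
    print(response["message"]["content"])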

The multimodal capability runs across all four sizes. Every Gemma 4 model processes images, video frames, and audio. Not a bolt-on. It's native.
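
For images, the path through Ollama's standard API looks like the sketch below. The images field on a chat message is existing Ollama behavior; the model tag and file path are placeholders, and we haven't confirmed how the launch builds expose audio input:

    import ollama

    # The images field is Ollama's standard multimodal interface; it accepts
    # local file paths. Model tag and path are hypothetical placeholders.
    response = ollama.chat(
        model="gemma4:e4b",
        messages=[{
            "role": "user",
            "content": "Extract the invoice number and total from this scan.",
            "images": ["scans/invoice-0042.png"],
        }],
    )
    print(response["message"]["content"])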

For a 10-person NGO that currently pays $400/month to process grant documents, meeting recordings, and funder reports through an API: the E4B on a Mac mini eliminates that cost entirely while keeping everything on your own hardware.
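
The breakeven arithmetic is short enough to write out. Using the numbers above (your API bill will vary, and this ignores electricity and setup time):

    # Rough breakeven for the NGO example above.
    monthly_api_cost = 400      # current API spend, USD per month
    mac_mini_cost = 599         # one-time hardware cost, USD

    breakeven_months = mac_mini_cost / monthly_api_cost
    first_year_savings = 12 * monthly_api_cost - mac_mini_cost

    print(f"Breakeven after {breakeven_months:.1f} months")  # about 1.5 months
    print(f"First-year savings: ${first_year_savings}")      # $4,201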

The speed problem the community found in 24 hours

Here's the part Google's launch post did not emphasize.

Within 24 hours of release, developers running benchmarks on their own hardware noticed the 26B MoE model generating text at around 11 tokens per second on GPUs where Qwen 3.5 27B was hitting three times that. The routing overhead in the mixture-of-experts architecture costs you throughput. Same quality, different speed.

On an RTX 4090 at Q4 quantization:

  • Qwen 3.5 27B: ~35 tokens/second
  • Gemma 4 31B Dense: ~25 tokens/second
  • Gemma 4 26B MoE: ~11 tokens/second
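
You don't have to take these numbers on faith. Ollama returns token counts and generation time with every response, so measuring throughput on your own hardware takes a few lines (the model tags below are assumptions; substitute whatever you actually pulled):

    import ollama

    # eval_count and eval_duration (in nanoseconds) are standard fields in
    # Ollama's response metadata, so this works for any locally pulled model.
    for model in ("gemma4:26b-moe", "qwen3.5:27b"):  # hypothetical tags
        response = ollama.generate(
            model=model,
            prompt="Write 300 words on grant reporting best practices.",
        )
        tps = response["eval_count"] / (response["eval_duration"] / 1e9)
        print(f"{model}: {tps:.1f} tokens/second")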

The first batch of tokenizer and quantization implementations also shipped broken — users reported models that wouldn't load at all. Ollama pushed a fix in v0.20.0 within 48 hours, but if you pulled it in the first two days and got nothing, that was why.

This is normal for a major open-weight release. The ecosystem catches up fast. But if you tried Gemma 4 this week and it felt wrong, check your Ollama version and re-pull; a quick check is sketched below.
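
The /api/version endpoint is standard Ollama; the model tag is again a placeholder:

    import json
    import urllib.request

    import ollama

    # Ollama's local server reports its version at /api/version.
    with urllib.request.urlopen("http://localhost:11434/api/version") as r:
        version = json.load(r)["version"]
    print(f"Ollama {version}")  # you want 0.20.0 or later, per the fix above

    # Re-pull to pick up the repaired quantization files.
    ollama.pull("gemma4:e4b")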

None of this affects the E2B and E4B models meaningfully. The speed hit is specific to the larger MoE architecture. For small-team workflows running on a Mac mini, you won't feel it.

Where Gemma 4 genuinely beats Qwen right now

Speed aside, the community found one area where Gemma 4 wins clearly: multilingual quality.

Developers testing German, Arabic, Vietnamese, and French outputs reported consistent quality improvements over Qwen 3.5. If your organization works in French — and if you're an Ottawa-based NGO or public sector org, you almost certainly do — this matters. French-language document processing, bilingual reporting, multilingual constituent communications: Gemma 4 handles these noticeably better.

For English-only, speed-critical workflows: Qwen 3.5 27B is still faster.

For bilingual or multilingual workflows where quality matters more than throughput: Gemma 4 is worth testing now. A minimal side-by-side test is sketched below.
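
Running the same prompt through both models on your own documents is the fastest way to see whether the multilingual gap matters for you. The prompt and model tags below are placeholders:

    import ollama

    # Same French prompt to both models; judge the outputs yourself.
    prompt = "Résumez ce rapport en trois points : ..."  # paste your own text

    for model in ("gemma4:e4b", "qwen3.5:27b"):  # hypothetical tags
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model} ---")
        print(response["message"]["content"])
        print()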

The practical playbook

If you're building or choosing a self-hosted AI model right now, the decision framework should look like this:

Apache 2.0 only. This is no longer negotiable. If a model ships under a custom or restrictive license, the production risk is real. Apache 2.0 is now the baseline. Gemma 4, Qwen 3.5, and Mistral all qualify. Meta's Llama 4 does not.

For speed-critical tasks (real-time chat, interactive tools, high-volume processing): Qwen 3.5 27B is still faster per token. Run it on the same Ollama setup.

For multilingual orgs or batch processing where quality > speed: Gemma 4 E4B is your new default. Set it up on a Mac mini, run it locally, and stop paying per-token for document work.

For video and audio: Gemma 4 has genuine native multimodal support across all sizes. If you're processing meeting recordings, training videos, or scanned documents with images, this opens up workflows that would have required stitching multiple models together a month ago. One model. One machine. No API cost. (A frame-sampling sketch follows this playbook.)

Watch the quantization situation. As Q4 and Q5 implementations stabilize over the next few weeks, the speed numbers on the 26B MoE will improve. The hardware requirements will drop. Teams that get familiar with the deployment setup now will be ready to upgrade.
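
On the video point above: one practical pattern is to sample frames with ffmpeg and pass them through the same images field shown earlier. A sketch, assuming ffmpeg is installed and assuming the launch builds expose Gemma 4's vision input through Ollama's standard API:

    import glob
    import os
    import subprocess

    import ollama

    # Sample one frame every 30 seconds from a meeting recording.
    os.makedirs("frames", exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", "meeting.mp4", "-vf", "fps=1/30", "frames/frame-%03d.png"],
        check=True,
    )

    # Send the sampled frames with a single summarization prompt.
    response = ollama.chat(
        model="gemma4:e4b",  # hypothetical tag
        messages=[{
            "role": "user",
            "content": "These are frames from a team meeting. Summarize what was presented.",
            "images": sorted(glob.glob("frames/frame-*.png")),
        }],
    )
    print(response["message"]["content"])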

The bigger shift happening here

Twelve months ago, self-hosted AI meant compromising on capability. You ran the smaller model because the bigger model needed a server room. That compromise is mostly gone: the Gemma 4 E4B on a 16GB Mac mini is genuinely competitive for most document and communication tasks a 10-50 person org needs to automate.

The cost math for small orgs has flipped. It is now cheaper to own your AI than to rent it, for almost any repeatable workflow. The only remaining barrier for most teams is setup knowledge — knowing how to deploy and integrate these tools.

Apache 2.0 removes the legal barrier. Ollama removes the technical barrier. What's left is someone who's actually done this before showing your team how to make it work.


We've been setting up local AI deployments for small orgs for the past year — NGOs, public sector teams, regional businesses. The Gemma 4 setup is one we're already running for clients. If you want to talk through whether it's the right fit for your workflows, reach out.

CivSafe — Strategic Innovation. Community Impact.