On April 29th, the News/Media Alliance — a trade group representing publishers including CNN, NBCUniversal, Vox Media, Ziff Davis, USA Today, and hundreds of regional news outlets — sent a formal demand letter to the executive director of the Common Crawl Foundation.
If you haven't heard of Common Crawl, you're not alone. Most nonprofit directors and SMB owners haven't. That's kind of the point.
Common Crawl is a nonprofit that has been crawling the open web since 2008. They archive those crawls and release the data publicly. Their datasets are enormous — petabytes of text scraped from millions of websites across the entire internet, updated on a rolling basis.
Those datasets are also the training backbone for virtually every major AI model you use. GPT. Llama. Gemma. Mistral. DeepSeek. Almost every large language model was partly trained on Common Crawl data. When you ask an AI assistant a question, some fraction of its answer comes from patterns it learned in Common Crawl text. That text includes your website.
If your organization has had a publicly accessible website for more than a few years, your content is almost certainly in there.
What the publishers just did
The demand letter is direct. The News/Media Alliance told Common Crawl to:
- Honor publisher removal requests, and do it promptly
- Publicly clarify that it does not own or authorize the use of scraped content
- Revise its terms to explicitly prohibit use of the archive for AI training
- Create an enforceable opt-out registry with real teeth
The frustration driving this has been building for years. Some of the publishers said they sent removal requests to Common Crawl over 2.5 years ago — and those requests were never honored. Common Crawl says it respects robots.txt going forward, but its historical archives — already downloaded and redistributed across research institutions worldwide — are a different problem entirely.
The phrase being used: "data laundering." The mechanism works like this. You block OpenAI's crawler (GPTBot) in your robots.txt. The AI company says it sourced training data from Common Crawl, not directly from your site. Common Crawl scraped your site years ago, before you had that block in place. Nobody technically lied. But your content ended up in the training data anyway, routed through a plausibly neutral intermediary: a nonprofit. That's the part publishers are now calling out publicly.
Why this matters if you're not a media company
The publishers who signed this letter have legal teams, and they care because they sell content; AI is competing directly with that content. Most small orgs don't have that problem.
But there are real questions worth asking for any org with a public website.
Have you ever published something that was meant to be temporary? A draft service description. A job posting that named internal contacts. A project update that mentioned a funder before the decision was public. A press release about a program that later got cancelled. A client testimonial with an identifiable name. Those things were indexed. They're in the archive. They may be baked into models used by millions of people right now.
Do you actually know what your public website says? Most organizations don't. Content accumulates over years. Old pages don't get deleted. "Quick campaign" pages from three years ago are still live. Your website, right now, likely contains a lot of historical context your team hasn't reviewed since the Obama administration.
Is your original work going out without attribution? For advocacy organizations, think tanks, and research-driven nonprofits specifically — if your policy positions, original analysis, or findings are on your public site, they've been fed into AI training sets. Those models can now summarize your positions without attribution, remix your research without credit, and produce content that resembles your work. For most advocacy work, spreading ideas is the goal. But that should be a deliberate choice, not a surprise.
What you can actually do
Let's be honest about the limits. The historical data is already out of your hands. There are hundreds of copies of Common Crawl archives distributed across cloud storage, research institutions, and private servers worldwide. Even if Common Crawl honored every removal request tomorrow, the training data already baked into existing models isn't changing. You cannot retrain GPT or Llama to forget your website. That ship has sailed.
What you can do is control what gets scraped from this point forward.
Update your robots.txt. Add this to your site's robots.txt file:
User-agent: CCBot
Disallow: /
This tells Common Crawl's crawler (CCBot) to skip your site in future crawls. Most modern CMS platforms (WordPress, Squarespace, Webflow, Drupal) make robots.txt editable without a developer. If you're on a managed platform, check the SEO or privacy settings — many now have AI crawler toggles.
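Once you've edited the file, it's worth verifying that the rule actually does what you think. A minimal sketch using Python's standard-library robots.txt parser (the robots.txt content below is an illustrative example, not any real site's file):

```python
# Sketch: confirm a robots.txt blocks CCBot while leaving other
# agents alone, using the standard-library parser. No network calls;
# we parse the file content directly.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# CCBot should be refused everywhere; ordinary agents are unaffected.
print(parser.can_fetch("CCBot", "https://example.org/about"))        # False
print(parser.can_fetch("Mozilla/5.0", "https://example.org/about"))  # True
```

In practice you'd point the parser at your live file with `set_url()` and `read()` instead of an inline string; the inline version just makes the check reproducible before you deploy.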
Block other AI crawlers while you're at it. Common Crawl isn't the only vector. You'll also want to consider blocking GPTBot (OpenAI), Google-Extended, PerplexityBot, and others. Dark Visitors maintains a current, well-organized list of AI crawler user agents with robots.txt snippets ready to copy.
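Combined with the CCBot rule above, a per-agent block might look like the fragment below. The user-agent tokens shown are the ones each vendor has published, but they change over time, so verify them against a maintained list (like Dark Visitors) before deploying:

```
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Note that Google-Extended is a control token rather than a separate crawler — it tells Google not to use your content for AI training without affecting regular Googlebot indexing.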
One caveat worth knowing: there's real evidence that blocking AI crawlers broadly via robots.txt can hurt your regular search traffic. A study from early 2026 found a 23.1% decline in monthly visits for publishers who blocked AI crawlers, without a corresponding reduction in AI citations of their content. Block specifically by user agent rather than writing a blanket Disallow: / that catches legitimate search crawlers too.
Do a content audit. Even if you decide not to block scrapers, spend two hours running a crawl of your own site with a tool like Screaming Frog (free up to 500 URLs) or Sitebulb. Get a list of every public URL. Review anything more than two years old. Delete what shouldn't be there. You'll find things.
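If your site publishes a sitemap, a lightweight first pass before a full crawl is to flag pages whose last-modified date is old. A minimal sketch, assuming a standard sitemap.xml with `<lastmod>` dates (the sitemap content, the `SITEMAP` constant, and the `stale_urls` helper are illustrative, not any real tool's API):

```python
# Sketch: flag stale pages from a sitemap as a quick pre-audit pass.
# The sitemap XML below is a made-up example.
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc><lastmod>2025-06-01</lastmod></url>
  <url><loc>https://example.org/2020-campaign</loc><lastmod>2020-03-15</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def stale_urls(sitemap_xml: str, cutoff: datetime) -> list[str]:
    """Return URLs whose <lastmod> predates the cutoff date."""
    root = ET.fromstring(sitemap_xml)
    stale = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod and datetime.strptime(lastmod, "%Y-%m-%d") < cutoff:
            stale.append(loc)
    return stale

two_years_ago = datetime.now() - timedelta(days=730)
print(stale_urls(SITEMAP, two_years_ago))  # includes the 2020 campaign page
```

This won't catch pages missing from the sitemap (which is exactly what a crawler like Screaming Frog is for), but it gives you a review queue in minutes.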
Write a basic web content policy. This doesn't have to be a 20-page governance document. A single internal page: what goes on the public site vs. behind a login, how long content lives before it gets reviewed for deletion, who's responsible. Most small orgs have never written this down, which is why their 2020 partnership announcement is still indexable and scraped into every AI model released since.
The bigger picture
The publishers' letter to Common Crawl is a proxy fight. They're unlikely to win it in the way they want — Common Crawl will probably update its FAQ, create a formal opt-out form, and the underlying issue of already-distributed historical archives will remain unresolved. The real target is the AI companies that used the archive as cover for mass-scale content ingestion without consent or compensation.
But the precedent matters. More enforcement mechanisms are coming. EU AI Act provisions on training data provenance are already being tested. The concept of "laundering" through neutral intermediaries is going to face legal and regulatory challenge — slowly, but it's coming.
For small orgs, the practical window is now. Not because you're at legal risk, but because the combination of public content accumulation + AI training + no intentional content policy is a risk profile that will only get harder to reverse.
What's already in the models is already in the models. What you publish next is still your call.
We help small teams audit their public data footprint, build practical web content policies, and make smart decisions about what to publish and what to protect. If you want to know what your website actually looks like from the outside, that's a half-day sprint. We've done it for NGOs and government orgs across Canada.