
News Sites Blocking Internet Archive: A Shocking New Trend

  • The number of news sites blocking the Internet Archive has surged past 340, with most owned by five of America’s largest local publishers.
  • Publishers blocking the Internet Archive cite AI training data fears, though none has confirmed that actual scraping occurred.
  • Researchers, historians, and working journalists depend on the Wayback Machine to access archived local reporting.
  • Alden Global Capital subsidiaries Tribune Publishing and MediaNews Group are among the publishers leading the blockade.

A Digital Preservation Crisis Nobody Saw Coming

The number of news sites blocking Internet Archive crawlers has now crossed 340 — and it’s still climbing. What started as a quiet shift in a few robots.txt files has turned into one of the more consequential battles in the history of online journalism preservation. The irony is sharp: the very publishers whose work most needs saving are the ones locking the door.

Back in January, Nieman Lab first reported that major outlets — The New York Times, The Guardian, USA Today Co. — had started blocking the Internet Archive’s web crawlers. The stated concern was that AI companies might scrape the nonprofit’s repositories to harvest training data. It was a reasonable fear in the abstract. But here’s the thing: as of the latest reporting, not a single news publisher has actually confirmed that any AI company scraped their content from the Wayback Machine. The threat remains theoretical. The damage to the archive is very real.

News Sites Blocking Internet Archive: Who’s Behind It

The publishers driving this trend aren’t household tech names — they’re the unglamorous machinery of American local media. According to Nieman Lab’s updated analysis, many of the 340-plus sites are owned by five of the seven largest local news publishers in the US: USA Today Co. (formerly Gannett), McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The pattern of news sites blocking Internet Archive access is concentrated almost entirely within these large ownership groups.

The last two, MediaNews Group and Tribune Publishing, are both subsidiaries of Alden Global Capital, the hedge fund that critics have long labelled a predatory asset stripper of local journalism. There’s a certain bitter logic here. Alden has spent years gutting newsrooms, cutting staff, and selling off properties. Now it’s also cutting off access to the historical record of those same newsrooms. The communities those papers once served can’t even rely on an archive to remember what they were.

Advance Local, a subsidiary of the Newhouse family-owned Advance Publications, confirmed to Nieman Lab that it began hard-blocking the Internet Archive last August. At least 13 of its properties are now listed in the data, including The Cleveland Plain Dealer, The Oregonian, and The Patriot-News — regional institutions with decades of civic journalism that researchers and historians now can’t easily access through the Wayback Machine.

The numbers are striking. In January, Nieman Lab found 241 news websites disallowing at least one Internet Archive-affiliated bot — roughly 80% of them USA Today Co. properties. By May, that figure had jumped by 141 more sites, bringing the total to 382 tracked outlets, 342 of which are local. That’s a 58% increase in five months, and every new addition to the list represents another case of news sites blocking Internet Archive crawlers from capturing civic history.
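For readers curious what "disallowing at least one Internet Archive-affiliated bot" can look like in practice, here is a minimal sketch using Python's standard robots.txt parser. It is not Nieman Lab's methodology. The user-agent tokens and the sample robots.txt are assumptions chosen for illustration, and a check like this only sees robots.txt rules, not server-level hard blocks.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens associated with the Internet Archive; the
# article does not list the exact bots that were checked.
ARCHIVE_AGENTS = ["ia_archiver", "archive.org_bot"]

# Hypothetical robots.txt, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents=ARCHIVE_AGENTS, url="/"):
    """Return the agents that this robots.txt disallows from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT))
```

A site would count as "disallowing at least one" bot whenever the returned list is non-empty. To check a live site, the robots.txt text would first have to be downloaded and passed in.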

What’s Actually at Stake

It’s easy to frame this as a corporate IP dispute. It’s actually much more than that. Local news archives are primary source material — the kind historians, legal researchers, civic advocates, and yes, other journalists, rely on constantly. The growing trend of news sites blocking Internet Archive bots puts that entire ecosystem at risk.

Edward McCain, a journalism librarian at the University of Missouri, put it plainly: “Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term. In the present we may have some workarounds, but in the long run, it weakens a vital link in primary source materials that we need to understand where we’ve been and where we want to go.”

Working journalists are among the Wayback Machine’s heaviest users, especially in regions where local outlets have closed or hollowed out. B.J. Mendelson, editor of The Monroe Gazette newsletter, described the stakes in a petition signed by over 200 journalists: “I cover news within a larger news desert in New York’s Rockland, Sullivan, and Rockland counties. This means I need to heavily rely on archival data of old news articles from now deceased, or zombie-fied, media outlets. Without the Internet Archive, my work would be incredibly difficult to do.”

That petition is one of several circulating online pushing back against the blockades. They haven’t moved the needle yet.

AI Scraping: The Catalyst That Changed Everything

To understand why the blocking of Internet Archive bots by news sites accelerated so sharply in 2024 and 2025, you have to understand the broader panic gripping media companies over generative AI. Publishers have watched companies like OpenAI, Google, and Anthropic hoover up vast swaths of the web to train large language models, often without payment or permission. Lawsuits are piling up. Licensing deals, with The New York Times’s agreement with Amazon among the most discussed, are becoming a template.

The Internet Archive became collateral damage. Publishers reasoned that even if the Archive itself isn’t selling training data, its open, publicly accessible repositories make it trivially easy for AI companies to scrape journalism at scale. Block the Archive, block the potential vector. The result is that news sites blocking Internet Archive crawlers are, in effect, making a calculated trade: sacrificing long-term public access to guard against a threat that remains, so far, unproven.

Meredith Broussard, a data journalist and professor at NYU, offered the clearest framing of the underlying dynamic: “This is the same fight that everybody has been having with the Internet Archive since its inception. Internet Archive is a very old-school, ‘information-should-be-free’ organization. But the people who are invested differently have different priorities. There are lots of different historical and legal and economic issues that are colliding in this situation. AI companies are the catalyst for the latest skirmish in a very old battle.”

She’s right that this isn’t new. The Internet Archive has faced legal challenges going back years — most recently a bruising copyright lawsuit brought by major book publishers that resulted in significant restrictions on its digital lending program. News publishers have been watching that case closely.

The Archive’s Response — and Its Limits

The Wayback Machine hasn’t been passive. Mark Graham, the director of the Wayback Machine, told Nieman Lab the organization has implemented systems to limit bulk downloading and is working with vendors like Cloudflare to monitor bot activity. “We are in conversation with many publishers and appreciate the opportunity to address their concerns,” he said. Those conversations, though, have done little to slow the pace of news sites blocking Internet Archive crawlers in the months since.

Source: https://www.niemanlab.org/2026/05/more-than-340-local-news-outlets-are-limiting-the-internet-archives-access-to-their-journalism/

Zara
I am a psychology undergraduate with a strong passion for technology, digital creativity, and innovation. Alongside my studies, I have experience in social media management, content writing, and exploring tech tools that enhance communication and problem-solving. As a tech enthusiast, I enjoy learning new digital skills, adapting to emerging trends, and using technology to create meaningful impact.