
Norway’s Sovereign LLM: 2PB of Huawei Flash and a Big Lesson

When Norway’s National Library set out to build a sovereign LLM — a large language model trained on Norwegian language and culture — the project quickly ran into a problem that nobody in the AI industry had properly documented: how do you move petabytes of archived national heritage data through a modern AI training pipeline without everything grinding to a halt? The answer, it turns out, involves 2 petabytes of Huawei flash storage, a national supercomputer, and a whole lot of trial and error.

  • Norway’s National Library is building a sovereign LLM trained entirely on Norwegian language, culture, and history.
  • The sovereign LLM pipeline relies on 2 petabytes of Huawei OceanStor Dorado all-flash storage for low-latency data processing.
  • The project’s biggest bottleneck isn’t compute — it’s data quality, cleaning, and moving petabytes from archive to pipeline.
  • No standard evaluation tools exist for the model, so the library’s team is building its own from scratch.

Why Norway Decided It Needed a Sovereign LLM

The logic here is straightforward, even if the execution isn’t. Marius Husnes, Head of IT Platform at Nasjonalbiblioteket (Norway’s National Library), made the case plainly at Huawei’s IDI Forum 2026 in Paris: no commercial LLM provider is building a model that genuinely understands Norwegian. OpenAI, Google, and Anthropic are all training on internet-scale, English-dominant data. A model trained that way will know Shakespeare but might struggle to contextualise Ibsen within Norwegian literary tradition, or to accurately reflect centuries of Norwegian-language news, law, and public life.

As Husnes put it, “Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history?” That framing matters. This isn’t a vanity project or nationalist posturing — it’s a genuine infrastructure problem. A country that relies entirely on foreign-built, English-centric AI models for public services, education, and civic life is effectively outsourcing its cultural memory to someone else’s training data. The case for a dedicated sovereign LLM becomes obvious the moment you frame it that way.

Norway’s Ministry of Culture saw this clearly enough to task the National Library with solving it. The library is a logical choice: it’s the single largest repository of Norwegian digital content in existence, holding books, newspapers, broadcast recordings, web archives, and more. Crucially, it operates under a legal deposit mandate, meaning it’s legally entitled to receive a copy of every book published and every broadcast produced in Norway. That’s a data moat no private company can replicate.

Husnes – training national LLM. — blocksandfiles.com

The library also struck a deal with Norwegian newspapers to use copyrighted content for sovereign LLM training — something Husnes was clearly proud of. “No private company has this,” he said. That’s not a small point. Copyright has become one of the central battlegrounds in AI development globally, with publishers and news organisations suing AI labs over training data. The library’s institutional status gave it a path through that legal maze that a startup or tech giant simply couldn’t navigate.

The Sovereign LLM Pipeline: Where Huawei Fits In

By the time the library started building its sovereign LLM training pipeline, it had accumulated roughly 20 petabytes of unique digitised content (books, audio, video, still images, and web crawls), all of it scanned and catalogued since 2005. With redundancy, that’s around 60 PB stored across disk and tape in a classic 3-2-1 configuration (three copies, two media types, one off-site); three copies of 20 PB is where the 60 PB figure comes from. That’s an impressive archive. It’s also, as Husnes’ team discovered, a serious engineering headache when you need to feed it into a machine learning pipeline.

The preservation system was built for durability and long-term access, not speed. It has high read latency and is optimised for infrequent retrieval, which is exactly the opposite of what an AI training pipeline demands. Training pipelines want high-throughput, low-latency, parallel I/O. These two systems have fundamentally different performance profiles, and bridging them turned out to be one of the project’s core technical challenges.
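To get a feel for why that mismatch matters, here is a back-of-the-envelope sketch in Python. The throughput figures are purely illustrative assumptions, not numbers from Husnes’ talk; only the 2 PB and 20 PB sizes come from the article. It uses decimal units (1 PB = 10^15 bytes).

```python
SECONDS_PER_DAY = 86_400


def transfer_days(petabytes: float, gb_per_second: float) -> float:
    """Days needed to stream `petabytes` at a sustained rate of `gb_per_second`.

    Decimal units are assumed: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes.
    """
    total_bytes = petabytes * 10**15
    seconds = total_bytes / (gb_per_second * 10**9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical sustained read rates; the article gives no such figures.
    for label, rate in [("slow archive tier", 1.0), ("flash staging tier", 50.0)]:
        for size_pb in (2, 20):
            days = transfer_days(size_pb, rate)
            print(f"{size_pb:>2} PB via {label} at {rate:g} GB/s: {days:.1f} days")
```

Under these assumed rates, reading the full 20 PB of unique content at 1 GB/s would take most of a year, which is why a faster staging layer between archive and training is attractive.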

Husnes – preservation and AI pipeline storage. — blocksandfiles.com

The solution involved standing up a dedicated in-house AI processing environment: an Nvidia DGX H200 system, a 384-core CPU cluster, and multiple Huawei OceanStor Dorado all-flash arrays totalling 2 PB of flash capacity. This staging layer acts as the performance buffer between the slow archive and the actual training runs. Data flows from the archive into this environment, gets cleaned, deduplicated, normalised, validated, and prepared — then gets pushed to Norway’s national supercomputer, Sigma2 Olivia, for the actual training.
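The article does not describe how the library implements those preparation steps, so the following is only a minimal sketch of what "normalise, validate, deduplicate" can look like for text, using the Python standard library. The function names, the 20-character minimum, and the choice of exact-match hashing are all assumptions made for illustration; a production pipeline would likely also need near-duplicate detection and format-specific handling.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def is_valid(text: str, min_chars: int = 20) -> bool:
    """Hypothetical validity rule: reject very short fragments."""
    return len(text) >= min_chars


def clean_corpus(docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield normalised, valid documents, dropping exact duplicates.

    Duplicates are detected by the SHA-256 digest of the normalised text,
    so documents differing only in whitespace or Unicode form collapse.
    """
    seen = set()
    for raw in docs:
        text = normalise(raw)
        if not is_valid(text, min_chars):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


if __name__ == "__main__":
    sample = [
        "Et dukkehjem  er et skuespill av Henrik Ibsen.",
        "Et dukkehjem er et skuespill\nav Henrik Ibsen.",  # whitespace variant
        "kort",  # too short, dropped by is_valid
    ]
    for doc in clean_corpus(sample):
        print(doc)
```

Normalising before hashing matters for Norwegian text in particular: characters such as å can be encoded either as a single code point or as a base letter plus a combining mark, and NFC maps both to the same form.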

Sigma2 Olivia is an HPE Cray Supercomputing EX system with 448 GPUs and 64,512 CPU cores, running a 5.3 PB Cray ClusterStor E1000 storage system. That’s serious horsepower. But here’s the thing Husnes kept emphasising: the bottleneck was never compute. It was data quality and pipeline throughput. You can have all the GPUs in the world; if your input data is dirty, duplicated, or arriving too slowly, your training run suffers. That lesson applies whether you’re building a sovereign LLM or any other large-scale model.

Marius Husnes. — blocksandfiles.com

The Problems Nobody Warned Them About

One of the most valuable things Husnes shared in Paris wasn’t a technical specification — it was an admission. His team had to figure out how to move petabyte-scale datasets from a cold archive through a live AI pipeline largely on their own, because nobody had published a playbook for it. The AI industry has spent enormous energy documenting model architectures, training techniques, and benchmarks. The unglamorous, operationally complex work of wrangling legacy archive systems into AI-ready pipelines? That’s been largely ignored.

That gap is going to bite a lot of institutions. National libraries, government archives, broadcasting organisations — many of them are sitting on exactly this kind of data: irreplaceable, historically significant, legally complex, and stored in systems designed for preservation rather than computation. If they want to build sovereign LLM systems, they’ll face the same infrastructure mismatch Norway encountered. And unlike a well-funded tech lab, most of these institutions will be starting without a roadmap.

The orchestration challenge compounds this. The library’s setup spans three distinct systems: the preservation archive, the on-premises AI processing environment, and the national Sigma2 supercomputer. Getting those three to work together smoothly — with different storage architectures, different access patterns, and different operational teams — is an ongoing project, not a solved problem.

Evaluation, Governance, and the Questions AI Labs Don’t Ask

Perhaps the most intellectually interesting part of Husnes’ talk was his summary of what his team is still working through. These aren’t infrastructure problems — they’re harder than that.

Evaluation is a genuine mess. Standard LLM benchmarks are built around English-language tasks. Norwegian has two official written forms — Bokmål and Nynorsk — plus a range of dialects and several centuries of linguistic evolution. How do you measure whether your sovereign LLM is actually good at Norwegian? There’s no off-the-shelf answer, so the library is building its own evaluation tools in parallel with training the model. That’s a significant undertaking on top of everything else.
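One small piece of that problem can be sketched concretely: reporting scores separately for each written form, so that a strong Bokmål result cannot hide a weak Nynorsk one. This is a hypothetical illustration, not the library’s tooling; the `nb` and `nn` labels (the standard language codes for Bokmål and Nynorsk) and the shape of the input data are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_variety(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per language variety.

    `results` is an iterable of (variety, is_correct) pairs, for example
    ("nb", True) for a correctly answered Bokmål test item.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variety, ok in results:
        totals[variety] += 1
        correct[variety] += int(ok)
    return {variety: correct[variety] / totals[variety] for variety in totals}


if __name__ == "__main__":
    # Made-up outcomes purely to show the call; not real evaluation data.
    outcomes = [
        ("nb", True), ("nb", True), ("nb", False),
        ("nn", True), ("nn", False),
    ]
    for variety, score in accuracy_by_variety(outcomes).items():
        print(f"{variety}: {score:.2f}")
```

The same grouping could be extended to dialects or historical periods, assuming test items are labelled accordingly.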

Governance raises harder questions still. Who controls access to a nationally built AI model? Can researchers use it freely? What about commercial applications? What content is it allowed to generate, and who decides when it gets something wrong about Norwegian history or culture? These are institutional and political questions as much as technical ones, and they don’t have clean answers yet. For a model trained on copyrighted newspaper content and national broadcast archives, getting this wrong has real legal and reputational consequences.

What This Means Beyond Norway

The geopolitical subtext here is worth pausing on. This project uses Huawei storage infrastructure at significant scale — 2 petabytes in a high-visibility national AI programme in Western Europe. That’s notable. Huawei’s presence in European enterprise and public sector infrastructure has been contested, particularly in telecoms, but in the storage market the company has continued to win meaningful contracts. Norway’s National Library project is a concrete data point that Huawei’s OceanStor Dorado line is being trusted for serious, sensitive workloads on the continent.

More broadly, Norway’s experience is a preview of what dozens of countries will eventually face. The global AI conversation is dominated by English-language models built by American companies. But language is inseparable from culture, law, history, and identity — and institutions that care about those things are going to want AI systems that reflect them accurately. Building a sovereign LLM isn’t just a technical project; it’s a cultural and political act.

The institutions best positioned to do this work aren’t necessarily the ones with the most GPUs. They’re the ones with the data — decades of digitised, legally cleared, culturally rich content that no tech company has access to. National libraries, archives, and broadcasters have that. What they haven’t had, until now, is a clear model for how to actually deploy it. Norway is writing that playbook in real time, and Husnes is right that other nations should be paying close attention. AI needs custodians, not just builders — and some of the best custodians are already sitting on the most valuable training data in the world.

Source: https://www.blocksandfiles.com/flash/2026/05/22/norways-2-petabytes-of-huawei-flash-storage-and-llm-training/5244910

Yasir Khursheed