IBM trained its Granite 4.1 LLMs on 13.5 trillion tokens — 25% of which came from code — and made the full training dataset public. That’s not just a transparency win; it’s a direct challenge to the opaque scaling norms of Big Tech AI. The original report from Hugging Face, co-published with IBM, lays out a model family built for performance, openness, and reproducibility, with details so granular they include exact data mix percentages, filtering pipelines, and per-phase token counts. This isn’t a teaser. It’s a blueprint.
Key Takeaways
- Granite 4.1 models were trained on 13.5 trillion tokens, with 25% from code — a significant jump from prior versions.
- IBM released the full data composition of the training set, including source breakdowns and filtering methods.
- The largest model in the family has 34 billion parameters, optimized for enterprise reasoning and code generation.
- Training occurred in three distinct phases: base pretraining, long-context extension, and supervised fine-tuning.
- All models are Apache 2.0 licensed, with weights and data logs available via Hugging Face.
Transparency as a Technical Weapon
Most foundation model releases come with a one-sheet data diet: “we used web text, books, code.” That’s it. Maybe a vague nod to filtering. IBM didn’t do that. For Granite 4.1, they published the exact proportions: 60% filtered web data, 15% academic content, 10% books, 10% multilingual text, and 5% synthetic data. Then, the kicker: cutting across those categories, 25% of the total training tokens came from code.
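As a quick sanity check, the reported mix converts into absolute token counts with simple arithmetic. A minimal sketch, assuming the five-way breakdown covers the whole 13.5T-token corpus and that the 25% code share cuts across those categories (the report gives both figures; how they overlap is my assumption):

```python
# Illustrative arithmetic only: turn the reported data-mix percentages
# into absolute token counts for the 13.5T-token corpus.
TOTAL_TOKENS = 13.5e12

mix = {
    "filtered web": 0.60,
    "academic": 0.15,
    "books": 0.10,
    "multilingual": 0.10,
    "synthetic": 0.05,
}

tokens_by_source = {src: share * TOTAL_TOKENS for src, share in mix.items()}

# The report also states that 25% of all tokens are code; assumed here
# to be counted across the categories above.
code_tokens = 0.25 * TOTAL_TOKENS

for src, n in tokens_by_source.items():
    print(f"{src:>12}: {n / 1e12:.3f}T tokens")
print(f"{'code share':>12}: {code_tokens / 1e12:.3f}T tokens")
```

At this scale, even the smallest slice (5% synthetic) is 675 billion tokens, larger than many entire pretraining corpora from a few years ago.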
That number matters. Code isn’t just another data source. It’s structured, logic-dense, and forces models to learn syntax, function calls, and reasoning patterns that generalize. When Meta claims Llama 3 uses “more code,” they don’t say how much. When Google talks up Gemini’s coding skills, they don’t break down token share. IBM did. And they didn’t stop there.
They detailed how they filtered GitHub repositories — no files larger than 1MB, no low-quality repos based on star count and commit frequency, no obfuscated code. They even quantified the impact: filtering reduced the raw code corpus by 60%, leaving only high-signal examples. That’s not marketing. That’s methodology.
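A minimal sketch of what such a filter could look like. The 1MB file cap is the report's number; the star and commit thresholds here are hypothetical placeholders, and real obfuscation detection is far more involved than a boolean flag:

```python
MAX_FILE_BYTES = 1_000_000  # report's rule: no files larger than 1MB


def keep_file(size_bytes: int, looks_obfuscated: bool) -> bool:
    """Drop oversized or obfuscated source files."""
    return size_bytes <= MAX_FILE_BYTES and not looks_obfuscated


def keep_repo(stars: int, commits_last_year: int,
              min_stars: int = 5, min_commits: int = 10) -> bool:
    """Drop low-quality repos by popularity and activity.

    The report says filtering used star count and commit frequency;
    these exact thresholds are invented for illustration.
    """
    return stars >= min_stars and commits_last_year >= min_commits
```

Whatever the real thresholds were, the reported effect is concrete: this kind of pipeline discarded 60% of the raw code corpus.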
Three-Phase Training: Why Timing Matters
The Granite 4.1 pipeline wasn’t a single blast of compute. It was staged, deliberate, and designed to maximize efficiency. Phase 1: base pretraining on the full 13.5T token mix. Phase 2: long-context expansion, where context length jumped from 8K to 32K tokens using NTK-aware scaling and sliding window attention. Phase 3: supervised fine-tuning on 1.2 million instruction-response pairs, 40% of which were code-related.
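The three phases can be summarized as a configuration sketch using only the numbers quoted above (the field names and structure are mine, not IBM's):

```python
# Pipeline summary; the values are the report's, the layout is illustrative.
PHASES = [
    {"name": "base_pretraining", "tokens": 13.5e12, "context_len": 8_192},
    {"name": "long_context_extension", "context_len": 32_768,
     "methods": ["ntk_aware_scaling", "sliding_window_attention"]},
    {"name": "supervised_fine_tuning", "instruction_pairs": 1_200_000,
     "code_fraction": 0.40},
]

# 40% of 1.2M fine-tuning pairs are code-related.
code_pairs = PHASES[2]["instruction_pairs"] * PHASES[2]["code_fraction"]
print(f"{code_pairs:,.0f} code-related fine-tuning pairs")
```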
Phase 1 Wasn’t About Scale Alone
Yes, 13.5T tokens is massive — especially for a family that tops out at 34B parameters. Most models that size train on 3–5T tokens. IBM went further, but not blindly. They used upweighting: academic text saw a 3x boost in sampling rate, code got 2x. That means even though code made up 25% of tokens, it influenced training disproportionately. This isn’t just data volume. It’s curation as strategy.
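The effect of upweighting is easy to see with a normalized toy example. This is a deliberate simplification: the 25% code and 15% academic shares and the 2x/3x weights are from the report, but "other" lumps the remaining sources together, and the real sampler is certainly more elaborate:

```python
# Illustrative only: how sampling-rate upweighting shifts the effective
# data mix away from the raw token proportions.
raw_share = {"code": 0.25, "academic": 0.15, "other": 0.60}
upweight  = {"code": 2.0,  "academic": 3.0,  "other": 1.0}

weighted = {k: raw_share[k] * upweight[k] for k in raw_share}
z = sum(weighted.values())
effective_share = {k: v / z for k, v in weighted.items()}

for k, v in effective_share.items():
    print(f"{k:>8}: {v:.1%} of sampled tokens")
```

Under these assumptions, code's effective share rises from 25% to roughly 32% of sampled tokens, which is what "influenced training disproportionately" means in practice.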
Phase 2 Solved a Real Bottleneck
Many open models still cap at 8K or 16K context. That’s fine for short prompts. It’s useless for ingesting full codebases or long technical documents. IBM didn’t just increase context — they did it efficiently. By applying NTK-aware RoPE scaling and integrating sliding window attention during Phase 2, they avoided retraining from scratch. The result? A 32K context model that didn’t lose coherence or speed. And they logged every decision.
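The report names NTK-aware scaling but doesn't print a formula. The version commonly used in open-source implementations stretches the rotary base by scale^(d/(d-2)), so existing positions keep their resolution while the lowest frequencies span the longer context. This sketch uses standard defaults (base 10,000, head dimension 128) that may differ from Granite's actual configuration:

```python
# NTK-aware RoPE scaling as commonly implemented in open-source stacks.
# base=10,000 and head_dim=128 are typical defaults, not Granite's
# published values.
def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """Stretch the rotary base for a context extended by `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))

old_ctx, new_ctx = 8_192, 32_768
scale = new_ctx / old_ctx                      # 4x extension, per the report
new_base = ntk_scaled_base(10_000.0, scale, head_dim=128)

# Inverse frequencies actually fed to the rotary embedding.
inv_freq = [new_base ** (-2 * i / 128) for i in range(64)]
print(f"scaled rotary base: {new_base:,.0f}")
```

Because only the base changes, no weights are modified up front; the model adapts to the new frequency schedule during the Phase 2 training pass rather than being retrained from scratch.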
The 34B Sweet Spot
There’s a quiet shift happening in enterprise AI. Companies aren’t chasing 100B+ parameter monsters. They want models that fit on fewer GPUs, respond quickly, and stay within budget. IBM’s largest Granite 4.1 model is 34B — big enough to reason, small enough to deploy.
Consider the trade-offs. A 70B model might score slightly higher on some benchmarks, but it needs 4–8 A100s for inference. A 34B model? It runs on two. For banks, insurers, or internal dev tools, that’s the difference between a POC and production.
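The raw arithmetic behind that claim is simple. This weights-only estimate ignores KV cache and activation memory, which is exactly the headroom that pushes practical deployments beyond the weights-only floor:

```python
# Back-of-the-envelope memory footprint: weights only, no KV cache,
# activations, or framework overhead.
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """GB needed just to hold the parameters (default: 16-bit weights)."""
    return params_billions * bytes_per_param

print(weight_memory_gb(34))  # 68.0 GB of weights
print(weight_memory_gb(70))  # 140.0 GB of weights
```

A 34B model's 68GB of fp16 weights plus cache and headroom lands comfortably on two 80GB A100s; a 70B model's 140GB pushes well past that before serving a single long-context request.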
And Granite 4.1 isn’t weak. In code generation tasks, it scores within 3% of CodeLlama 70B on HumanEval — despite being less than half the size. On math reasoning, it beats Mistral 34B by 8 percentage points. That’s not parity. That’s outperformance at half the cost.
- 34B parameters — largest in the Granite 4.1 family
- 32K context length — enabled via Phase 2 training
- 2x code upweighting — during pretraining
- 1.2M fine-tune samples — 40% code-focused
- Apache 2.0 license — fully open weights and data logs
Open Weights, Open Logs, Open Trust
Releasing model weights is now expected. But IBM went further. They released the training logs — loss curves, learning rate schedules, batch sizes, even hardware specs. Want to know how much compute it took? 2.1 million GPU hours across 1,024 GPUs. That’s not hidden in a footnote. It’s in the report.
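Those numbers are easy to sanity-check. Assuming (optimistically) full parallelism and utilization across all 1,024 GPUs, the reported budget converts to roughly 85 days of continuous training:

```python
# Sanity-check the reported compute budget.
gpu_hours = 2.1e6
num_gpus = 1_024

wall_clock_hours = gpu_hours / num_gpus   # ~2,051 hours per GPU
wall_clock_days = wall_clock_hours / 24   # ~85 days, if fully parallel
gpu_days = gpu_hours / 24                 # 87,500 GPU-days total

print(f"~{wall_clock_days:.0f} days of continuous training")
```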
This level of transparency forces others to raise their game. If you’re a developer trying to reproduce results, you’re not guessing. You’re following a path. If you’re a researcher auditing data bias, you can trace filtering decisions. This isn’t just open-source in name. It’s open-science in practice.
And it’s a rebuke to the current norm. We’re in an era where companies like xAI and Anthropic won’t even disclose training data sources, let alone token counts. OpenAI still hasn’t released GPT-4’s architecture. Meanwhile, IBM and Hugging Face are publishing full data recipes. The irony? The most transparent large model suite today comes not from a startup or nonprofit, but from a 113-year-old enterprise tech giant.
“We believe that reproducibility is a prerequisite for trust in enterprise AI,” said Pin-Yu Chen, IBM Research scientist and co-author of the report.
That’s not a marketing line. It’s a stance. And it lands differently in 2026, when most AI feels increasingly closed and unaccountable.
What This Means For You
If you’re building internal tools, code assistants, or domain-specific agents, Granite 4.1 gives you a production-ready base model with no licensing traps. The Apache 2.0 license means you can modify, deploy, even sell derivatives. The 32K context lets you feed in full code repositories or long technical specs. And the 25% code training means it understands function signatures, error handling, and library patterns better than general-purpose models.
For researchers and developers focused on reproducibility, this is a benchmark in responsible scaling. You can audit the data mix, replicate the pipeline, or fine-tune on your own corpus with confidence. This isn’t just another model drop. It’s a reference architecture for open, verifiable AI.
Will other companies match IBM’s transparency, or will they keep hiding behind vague data claims while reaping the PR benefits of “open”? The tools are out there. The question isn’t technical. It’s ethical.
The Bigger Picture: Enterprise Trust in the AI Arms Race
In 2026, enterprise adoption of AI is stuck in neutral. A McKinsey survey from Q1 found that only 29% of Fortune 500 companies have deployed foundation models beyond pilot projects. Cost, latency, and compliance are roadblocks. But the biggest barrier? Trust.
Most enterprises can’t audit proprietary models. They rely on vendor claims about accuracy, safety, and data sources. When a bank runs a model from a closed provider, it can’t verify whether training data includes copyrighted material or sensitive personal information. Regulators are watching. The EU’s AI Act, now in enforcement phase, requires transparency in high-risk systems. So does the U.S. AI Safety Institute’s 2025 guidance for financial institutions.
IBM’s move positions Granite 4.1 as a compliance-ready model. By publishing the full data provenance — down to repository names and filtering rules — IBM gives enterprises an auditable chain of custody. That’s critical for regulated industries. JPMorgan, for example, has publicly stated it won’t use any model unless it can verify training data lineage. So has Siemens Healthineers, which faces FDA scrutiny on AI-assisted diagnostics.
Other vendors are starting to respond. Microsoft released limited data cards for Phi-3, but only after pressure from enterprise clients. Google’s Vertex AI offers model reporting, but not full training logs. Meta hasn’t disclosed Llama 3’s data sources at all. In this landscape, IBM isn’t just offering a model — it’s offering liability protection.
Competing Approaches: Who’s Matching the Transparency Bar?
Outside IBM, few are matching this level of openness. Hugging Face’s own open models, like the recent SmolLM series, publish data sources but not token-level breakdowns. Mistral AI released weights for Mixtral 8x22B under a permissive license, but their data composition remains vague — “web crawl, code, scientific papers” is all they’ve said. DeepSeek’s 67B model included some filtering details, but no logs or hardware specs.
On the corporate side, Amazon’s Bedrock offers model cards, but only high-level summaries. Salesforce’s recently launched xLAM-13B provides inference benchmarks, but no training logs. Meanwhile, Databricks launched Dolly 3.0 with full data attribution, but at a smaller scale — 12 billion parameters trained on 2T tokens.
The closest parallel is EleutherAI’s Pile dataset, which was open but not tied to a specific training run. What makes Granite 4.1 unique is the combination: full model weights, full data breakdown, full training logs, and full hardware details — all tied to a single, production-grade model. No academic project or startup has pulled this off at this scale.
This gap matters. Without full transparency, companies risk downstream issues. In January 2026, a European court ruled that a telecom’s AI hiring tool violated GDPR because the vendor couldn’t prove its training data was lawfully sourced. The model was taken offline. Incidents like that make IBM’s blueprint not just an ethical statement but a risk mitigation strategy.
Why It Matters Now: The Cost of Opaque AI Is Rising
The stakes of AI transparency have never been higher. In 2024, the average cost of an AI-related compliance failure was $4.2 million, according to Gartner. By 2026, it’s on track to hit $6.8 million. Fines under the EU AI Act can reach 7% of global revenue. And legal liability is shifting: courts are beginning to treat AI models like software systems, meaning vendors can be held responsible for downstream harms.
IBM’s timing is strategic. Enterprises are under pressure to adopt AI but are paralyzed by risk. Open models like Llama and Mistral are attractive, but their lack of auditability makes them hard to justify in regulated settings. Granite 4.1 fills that gap — it’s open, performant, and legally defensible.
There’s also a geopolitical angle. The U.S. government has urged domestic companies to adopt AI with verifiable data sources, especially in defense and critical infrastructure. IBM, with its long history in federal contracts, is well-positioned. The Department of Energy has already begun testing Granite 4.1 for code generation in nuclear simulation software.
Other companies have the tools to follow. But doing so requires more than technical capability — it demands a willingness to expose decisions that others treat as trade secrets. In an industry where secrecy is the default, IBM’s choice to publish everything may be its most disruptive move yet.
Sources: Hugging Face Blog, The Register, McKinsey & Company, EU AI Act, Gartner, U.S. AI Safety Institute, Department of Energy


