When it comes to artificial intelligence, the math hasn’t changed: performance still climbs with scale. In April 2026, Meta unveiled its latest large language model, and while the company hasn’t released exact parameter counts, internal benchmarks suggest it exceeds 2.5 trillion operations per second during inference—nearly double the throughput of its 2025 predecessor.
Key Takeaways
- Meta’s newest LLM uses custom tensor cores to sustain over 2.5 trillion ops/sec during inference, nearly doubling 2025 performance.
- Despite warnings of diminishing returns, AI firms continue scaling models—driven by hardware advances, not just algorithmic gains.
- Specialized silicon now enables sparse activation at scale, letting models grow massive while keeping compute costs partially in check.
- For developers, larger models mean better few-shot reasoning—but also higher deployment barriers.
- The real bottleneck is no longer model size; it’s memory bandwidth and interconnect latency in training clusters.
Scale Still Scales
Some experts argue we’ve hit the wall on LLM scaling. Back in 2024, researchers at Google and Stanford pointed out that doubling model size now yields less than a 10 percent improvement in reasoning tasks. But in practice, that’s not stopping anyone.
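To put a number on "diminishing returns": if benchmark scores follow a rough power law in parameter count (an illustrative assumption, not a fitted curve from the Google or Stanford work), the payoff from doubling a model depends only on the exponent, and it shrinks fast.

```python
# Illustrative only: assume a task score scales as score(N) = c * N**alpha.
# The relative gain from doubling the parameter count N is then 2**alpha - 1,
# regardless of where on the curve you start.
for alpha in (0.30, 0.20, 0.10, 0.05):
    gain = 2 ** alpha - 1
    print(f"alpha = {alpha:.2f}: doubling parameters improves the score by {gain:.1%}")
# Below alpha ~ 0.13, doubling buys less than a 10 percent improvement,
# which is the regime the 2024 studies describe.
```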
Meta’s new model doesn’t just scale in parameters—it scales in hardware synergy. The company’s AI infrastructure team redesigned its inference pipeline around dynamic sparsity, a technique that activates only 30 to 40 percent of the model’s weights during any single query. That’s not new in theory. What’s different now is that custom chip firmware can route those sparse computations without the usual latency tax.
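The mechanics are easy to sketch outside any real framework. The snippet below is a minimal, hypothetical version of per-query dynamic sparsity: a cheap gate scores blocks of a layer's weights, and only the top fraction is ever multiplied. The block layout, scoring rule, and 35 percent keep rate are assumptions for illustration, not Meta's firmware logic.

```python
import numpy as np

def dynamic_sparse_layer(x, weight_blocks, keep_fraction=0.35):
    """Activate only a fraction of weight blocks for this query.

    x             : (d_in,) activation vector for one token or query
    weight_blocks : list of (d_in, d_out) blocks, conceptually one big layer
    keep_fraction : fraction of blocks to activate (30-40% in the article)
    """
    # Score each block cheaply, here by the norm of its projection of x.
    scores = np.array([np.linalg.norm(block.T @ x) for block in weight_blocks])
    k = max(1, int(len(weight_blocks) * keep_fraction))
    active = np.argsort(scores)[-k:]          # indices of blocks to keep

    # Only the selected blocks are ever multiplied; the rest are skipped,
    # which is where the compute savings come from.
    out = np.zeros(weight_blocks[0].shape[1])
    for i in active:
        out += weight_blocks[i].T @ x
    return out, active

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((256, 64)) for _ in range(16)]
y, active = dynamic_sparse_layer(rng.standard_normal(256), blocks)
print(f"activated {len(active)}/16 blocks")   # roughly a third of the layer
```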
This is the quiet breakthrough: AI isn’t just getting bigger. It’s getting smarter about where and when it computes.
Hardware Is the New Algorithm
For years, the AI race was about who had the best training data, the cleanest labels, or the most elegant attention mechanism. Now, the edge goes to those who control the silicon.
Meta’s training clusters run on next-gen MTIA v3 accelerators, custom ASICs built specifically for sparse LLM workloads. Unlike generic GPUs, these chips feature on-package HBM stacks with 4.8 TB/s of bandwidth and a mesh interconnect that reduces all-to-all communication latency by 40 percent compared to 2025 systems.
And that’s where the old rules break down. Yes, performance gains per watt are flattening across the industry. But Meta and others are sidestepping the plateau by redefining what “efficiency” means. It’s not just FLOPs per joule anymore. It’s ops per millisecond at scale. It’s how fast you can move data between chips, not just within them.
Sparse by Design, Not by Compromise
Sparsity used to be a compromise. You’d prune a model to fit it on weaker hardware. Now, it’s a core architectural choice—one that enables scale without linear cost growth.
Meta’s model uses structured sparsity patterns baked into the architecture. That means the hardware knows, in advance, which compute units will be idle during a given layer. It powers them down instantly. It reroutes memory fetches. It compresses gradient updates on the fly.
This isn’t just power savings. It’s a new kind of parallelism, one where the model and chip evolve together. (A minimal sketch of the idea follows the figures below.)
- Sparsity utilization in Meta’s v3 stack increased from 58% in 2025 to 82% in Q1 2026
- Training runs now complete 22% faster despite 60% larger models
- Per-chip memory bandwidth jumped from 3.2 TB/s to 4.8 TB/s
- Inter-chip latency dropped from 180 ns to 108 ns
- Energy per token during inference fell 17%, even as model size grew
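A rough way to picture the "known in advance" part: with structured sparsity, each layer ships with a fixed block mask, so the runtime can plan power-gating and weight prefetch before the layer runs, instead of discovering the pattern query by query. The grid size, tile size, and block density below are illustrative assumptions, not Meta's layout.

```python
import numpy as np

# A structured-sparsity pattern is fixed when the model is built: for each
# layer, a boolean block mask says which (row-block, col-block) tiles exist.
# Because the mask never changes at runtime, the scheduler knows before a
# layer starts which compute tiles to power-gate and which weights to prefetch.
BLOCK = 64
rng = np.random.default_rng(1)
layer_mask = rng.random((8, 8)) < 0.4          # 8x8 grid of 64x64 tiles, ~40% dense

def blocksparse_matmul(x, blocks, mask):
    """x: (8*BLOCK,) input; blocks: dict[(i, j)] -> (BLOCK, BLOCK) weight tile."""
    out = np.zeros(mask.shape[1] * BLOCK)
    for i, j in zip(*np.nonzero(mask)):        # static schedule, known ahead of time
        out[j*BLOCK:(j+1)*BLOCK] += blocks[(i, j)].T @ x[i*BLOCK:(i+1)*BLOCK]
    return out

blocks = {(i, j): rng.standard_normal((BLOCK, BLOCK))
          for i, j in zip(*np.nonzero(layer_mask))}
y = blocksparse_matmul(rng.standard_normal(8 * BLOCK), blocks, layer_mask)
print(f"{layer_mask.mean():.0%} of tiles computed; the rest are gated off")
```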
The Hidden Cost of Going Big
There’s a quiet tension building beneath all this progress. The models are getting more efficient per operation—but the total cost of deployment is still rising.
Running Meta’s latest model in production requires a minimum of eight MTIA v3 chips per inference instance. That’s double the hardware per query compared to 2025. And while sparsity reduces active compute, the memory footprint remains massive: the weights that go untouched on a given query still have to be loaded and kept resident.
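The eight-chip floor falls out of simple capacity arithmetic. None of the inputs below are disclosed figures; the parameter count, 8-bit precision, per-chip memory, and overhead factor are placeholders chosen to show the shape of the calculation and, for concreteness, to land on the article's eight-chip figure.

```python
# Back-of-the-envelope sizing: how many accelerators must one inference
# instance span just to hold the model? All inputs are illustrative assumptions.
params          = 1.2e12   # assumed total parameter count (not disclosed by Meta)
bytes_per_param = 1        # assumed 8-bit weights
hbm_per_chip_gb = 192      # assumed usable memory per accelerator, in GB
overhead        = 1.25     # assumed extra for KV cache, activations, buffers

weights_gb  = params * bytes_per_param / 1e9
resident_gb = weights_gb * overhead
chips = -(-resident_gb // hbm_per_chip_gb)   # ceiling division
print(f"{weights_gb:.0f} GB of weights -> {resident_gb:.0f} GB resident -> {chips:.0f} chips minimum")
# Sparsity trims per-token compute, but it doesn't reduce this resident footprint.
```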
Smaller companies are already feeling the squeeze. Startups that once fine-tuned Llama variants on cloud GPUs now face a new reality: even inference APIs from major providers are slower and more expensive, because the underlying models are so much larger.
One developer at a 12-person AI startup in Berlin told IEEE Spectrum that their query costs rose 60% in six months—without changing their code. “We’re not using more tokens,” they said. “The models just got heavier, and nobody asked us if we wanted that.”
What This Means For You
If you’re building on top of LLMs, you’re now operating in a world where the foundation models are less accessible than ever. The era of downloading a 7B-parameter model and running it locally is fading. Even distillation techniques struggle to preserve the reasoning capabilities of these new behemoths.
For developers, that means betting on APIs is riskier. When the underlying model shifts, your latency and cost profiles shift with it—without warning. The only real control you have is in prompt efficiency, caching strategies, and knowing when to avoid the largest models altogether. Lean models aren’t dead. They’re just no longer the headline act.
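Concretely, "prompt efficiency and caching" mostly comes down to two habits: don't pay twice for an identical completion, and don't send easy queries to the largest model. The sketch below is generic and not tied to any provider's API; `call_small_model` and `call_large_model` are hypothetical stand-ins.

```python
from functools import lru_cache

# Hypothetical stand-ins for a cheap model and an expensive frontier model;
# neither name refers to a real provider API.
def call_small_model(prompt: str) -> str:
    return f"[small-model answer to] {prompt}"

def call_large_model(prompt: str) -> str:
    return f"[large-model answer to] {prompt}"

@lru_cache(maxsize=10_000)
def complete(prompt: str, needs_deep_reasoning: bool = False) -> str:
    # Two levers: cache repeated prompts so provider-side price or latency
    # drift only bites on genuinely new queries, and route to the largest
    # model only when the task actually needs it.
    if needs_deep_reasoning:
        return call_large_model(prompt)
    return call_small_model(prompt)

print(complete("Summarize this changelog in two sentences."))
print(complete("Summarize this changelog in two sentences."))   # cache hit, no API call
print(complete("Prove this scheduling invariant holds.", needs_deep_reasoning=True))
```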
And for builders working inside larger orgs: hardware awareness is now a core skill. You can’t just treat accelerators as black boxes. Understanding memory bandwidth limits, sparsity patterns, and interconnect bottlenecks isn’t just for infrastructure teams. It’s where performance wins are now made.
The Real Bottleneck Isn’t What You Think
It’s tempting to frame this as a compute problem. It’s not. It’s a data movement problem.
At Meta’s scale, the time it takes to shuttle activations between chips now accounts for over 60% of total training time. That’s up from 42% in 2024. FLOPs are cheap. Moving data isn’t.
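One way to see why this dominates: a training step takes roughly the compute time plus whatever communication cannot be hidden behind it. The numbers below are invented, chosen only so the exposed-communication share matches the 60 percent figure above; the structure of the comparison is the point.

```python
# Toy model of one training step per accelerator: whichever of compute or
# exposed communication dominates sets the pace. All numbers are illustrative.
step_flops       = 1.2e15   # assumed FLOPs per step
achieved_flops_s = 6.0e14   # assumed sustained FLOP/s
bytes_exchanged  = 1.2e12   # assumed activation/gradient traffic per step
interconnect_bw  = 2.0e11   # assumed effective inter-chip bandwidth, bytes/s
overlap          = 0.5      # assumed fraction of communication hidden under compute

t_compute = step_flops / achieved_flops_s
t_exposed = (bytes_exchanged / interconnect_bw) * (1 - overlap)
step_time = t_compute + t_exposed
print(f"compute {t_compute:.1f}s + exposed comms {t_exposed:.1f}s "
      f"-> data movement is {t_exposed / step_time:.0%} of the step")
# Faster math shrinks t_compute but leaves t_exposed alone, so the
# communication share only grows as FLOPs get cheaper.
```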
The company’s engineers are now exploring 3D-stacked memory with through-silicon vias and optical interconnects for future generations. But these aren’t drop-in upgrades. They require rethinking chip packaging, cooling, and power delivery at every level.
And yet, despite all this, the model size race continues. Because at the top, there’s a feedback loop: bigger models produce better synthetic data, which improves the next generation of models, which demands more hardware. It’s self-sustaining—until it isn’t.
What happens when the hardware can’t keep up with the model architects’ ambitions? When memory bandwidth stops improving, but the parameter counts don’t? That moment isn’t here yet. But it’s closer than anyone at Meta wants to admit.
Competing Visions: How Google, Microsoft, and Others Are Responding
Meta isn’t the only player betting on custom silicon. Google has been refining its Tensor Processing Units for years, and its fifth-generation TPU, launched in early 2026, features a 2D mesh architecture with 5.2 TB/s of bidirectional bandwidth and support for dynamic model partitioning. Unlike Meta’s MTIA, however, Google’s TPU5 remains optimized for dense workloads, limiting its sparsity efficiency to around 65%, according to Google’s own benchmarks presented at the 2026 International Symposium on Computer Architecture.
Microsoft, meanwhile, is taking a hybrid path. Its Azure AI clusters now run on a mix of AMD’s MI350X GPUs and in-house Maia 100 AI accelerators. The Maia chip, co-designed with Intel’s foundry division, integrates HBM3E memory with 4.6 TB/s bandwidth and a novel ring-based interconnect that reduces cross-die latency by 30% compared to standard NVLink setups. But Microsoft’s deployment scale lags behind Meta’s—its largest training jobs still run on clusters half the size of Meta’s 32,000-chip configurations.
Amazon is betting on flexibility. Its Trainium2 chips, released in Q4 2025, support both dense and sparse inference with adaptive power gating, allowing AWS to offer tiered pricing for different workload types. Still, third-party analyses show Trainium2 lags in raw memory bandwidth (3.8 TB/s) and struggles with long-context LLMs beyond 128K tokens. That’s forcing AWS customers to stitch together multiple instances, increasing cost and latency.
The divergence in approaches reflects a deeper strategic split: Meta and Google are building vertically integrated stacks, while Microsoft and Amazon prioritize compatibility with existing cloud ecosystems. That trade-off—performance versus accessibility—will shape the next phase of AI deployment.
The Bigger Picture: Why It Matters Now
The push for scale isn’t just about better chatbots or faster code generation. It’s about control over the next layer of computing infrastructure. In 2026, large models are increasingly used to generate training data for smaller ones, validate code for autonomous systems, and simulate real-world environments for robotics. These tasks demand high reasoning fidelity, which, empirically, still correlates with scale—even if the returns are diminishing.
But the infrastructure required to sustain this growth is concentrating in the hands of a few. Meta, Google, Microsoft, and Amazon collectively operate over 70% of the world’s AI-optimized data centers, according to estimates from TrendForce. Their ability to deploy custom silicon at scale creates a moat that smaller players can’t cross. Open-source models like Llama and Mistral are still valuable, but they’re increasingly dependent on the same proprietary stacks for training and deployment.
This concentration raises practical and policy concerns. If a handful of companies control both the models and the hardware, they effectively set the rules for how AI evolves. That includes decisions about what tasks are prioritized, which languages are supported, and how safety constraints are enforced. Regulatory bodies in the EU and U.S. are beginning to scrutinize this vertical integration, with the FTC launching a non-public inquiry into whether dominant cloud providers are favoring their own AI services in infrastructure allocation.
The stakes are high. We’re not just building smarter models. We’re building a new kind of computational economy—one where access to intelligence is gated by hardware, not just software.
What Lies Beyond the Memory Wall
Memory bandwidth is the silent governor of AI progress. Even with perfect sparsity and flawless interconnects, models can’t exceed the rate at which weights are fetched from memory. This is known as the “memory wall,” and it’s becoming the primary constraint on model throughput.
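The wall reduces to one division. If each generated token has to stream the active weights out of memory, per-chip decode throughput cannot exceed bandwidth divided by bytes read per token. Only the 4.8 TB/s figure below comes from the article; the active-parameter count and precision are assumptions.

```python
# Bandwidth-bound ceiling on single-stream decode for one accelerator.
hbm_bandwidth   = 4.8e12   # bytes/s per chip (the MTIA v3 figure cited in this article)
active_params   = 5.0e11   # assumed weights actually touched per token, post-sparsity
bytes_per_param = 1        # assumed 8-bit weights

ceiling = hbm_bandwidth / (active_params * bytes_per_param)
print(f"memory-bound ceiling: about {ceiling:.1f} tokens/s per chip")
# More FLOPs don't move this number; only more bandwidth, fewer active bytes,
# or batching many queries against the same weight reads does.
```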
Meta’s current MTIA v3 chips use Samsung’s 16-Hi 3D-stacked HBM3 memory, providing 4.8 TB/s per die. That’s a 50% jump from the 3.2 TB/s on v2—but it came at a cost. Power density now exceeds 350 watts per square centimeter, requiring advanced liquid cooling systems across Meta’s DeKalb, Illinois cluster. Future gains won’t come from stacking more layers. HBM4, expected in late 2026, will max out at 18-Hi stacks due to thermal and yield limitations, offering only a 20% bandwidth boost over HBM3.
The next leap may come from alternative architectures. Intel and imec are prototyping hybrid-bonded memory that could deliver 8 TB/s per interface by 2028, using direct copper-to-copper bonds at roughly 10-micron pitches in place of today’s microbump arrays. Similarly, Ayar Labs is testing optical I/O links that replace electrical traces with light-based data transfer, reducing latency and power consumption over long interconnects. Meta has invested in Ayar through its AI Hardware Acceleration Fund and is testing optical links in a 512-chip pilot cluster.
But these technologies aren’t plug-and-play. Optical interconnects require new photonics fabrication lines. Hybrid bonding demands alignment tolerances tighter than today’s bonding equipment can reliably hit in high-volume production. And both add layers of complexity to chip design, testing, and repair. The transition will be slow, but it’s the only path forward once electrical bandwidth hits its physical limits.
Sources: IEEE Spectrum, original report


