The NVIDIA Blackwell Ultra NVL72 platform runs up to 20× more agents per megawatt than the Hopper‑based NVIDIA HGX H200, according to the first AgentPerf benchmark released by Artificial Analysis.
Key Takeaways
- AgentPerf is the industry’s first benchmark designed for agentic AI workloads.
- Blackwell Ultra NVL72 delivers up to 20× more agents per megawatt than the HGX H200.
- The performance edge stems from a 72‑GPU rack‑scale design and deep stack optimizations.
- DeepSeek V4 Pro, a large mixture‑of‑experts model, serves as the benchmark’s reference workload.
- Developers can now gauge responsiveness, concurrency, and energy efficiency for agents, not just single LLM calls.
Agentic AI Benchmark Reveals Blackwell’s Lead
When Artificial Analysis rolled out AgentPerf, it wasn’t just adding another number to the AI leaderboard; it was redefining what we should be measuring. Traditional inference tests focus on a single LLM request, but an agent is more like a relay, chaining dozens of calls, tool invocations, and context updates. That distinction matters because the latency and power cost of a relay‑style workload climb steeply as each step adds more data to the context. The benchmark’s methodology reflects that reality, and Blackwell’s numbers look striking: the GB300 NVL72 (the Blackwell Ultra rack) sustains far more concurrent agents per megawatt than the HGX H200 at both the 20‑ and 60‑token‑per‑second service‑level objectives.
What Makes Agentic Workloads Different from Traditional Inference
From Sprint to Relay: The Workload Shift
In a conventional chat completion you send one prompt, get one response, and you’re done. That’s a sprint. An agent, on the other hand, breaks a goal into many steps, keeps calling the LLM, and may also invoke external tools like compilers, databases, or web browsers. Each handoff passes a growing context, so the total work compounds across steps instead of simply adding up: every call has to account for everything that came before it. Existing benchmarks never captured that cascade, which is why they’ve been blind to the real pressure points of agentic AI.
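To get a feel for how quickly the context growth described above adds up, here is a minimal back‑of‑the‑envelope sketch. It assumes the worst case, where every step re‑reads the full context with no prefix or KV caching (real serving stacks usually cache some of it), and every number in it is a made‑up illustration, not an AgentPerf figure.

```python
def total_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Input tokens processed over a whole agent run, assuming no caching.

    Step k (0-indexed) reads the base context plus everything that the
    k earlier steps appended to it.
    """
    return sum(base_context + k * added_per_step for k in range(steps))


# Hypothetical workload: a 2,000-token prompt, 1,500 tokens of tool output
# and model text appended after every step.
single_call = total_input_tokens(steps=1, base_context=2_000, added_per_step=1_500)
agent_run = total_input_tokens(steps=30, base_context=2_000, added_per_step=1_500)

print(f"single call : {single_call:,} input tokens")
print(f"30-step run : {agent_run:,} input tokens")
print(f"ratio       : {agent_run / single_call:.1f}x")
```

Under these assumptions a 30‑step run processes several hundred times the input tokens of a single call, not 30 times, which is the gap a single‑request benchmark cannot see.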
Blackwell Ultra NVL72: Architecture That Delivers the Gains
72 GPUs in One Rack
At the heart of the advantage is the GB300 NVL72’s ability to connect 72 GPUs into a single rack‑scale system. That massive fabric lets a large mixture‑of‑experts model like DeepSeek V4 Pro distribute its expert shards efficiently. NVIDIA’s CUDA kernels overlap communication with compute, so the cost of moving data between experts gets absorbed instead of adding to latency. It’s a design that feels tailor‑made for the relay‑style patterns agents demand.
TensorRT LLM further boosts efficiency by decoupling input processing (prefill) from output generation (decode). When dozens of agents run in parallel, that separation lets the system optimize each stage independently, keeping throughput high even as context sizes balloon. The benchmark showed that, at the 20‑token‑per‑second target, the GB300 NVL72 supports more agents per megawatt than the HGX H200, and the gap widens at the 60‑token‑per‑second target.
Implications for Developers and Enterprises
For teams building autonomous assistants, the numbers translate into concrete cost and performance decisions. If you’re budgeting for a fleet of agents that need to respond within a second, power per agent becomes a first‑order concern. Blackwell’s advantage of up to 20× means you could potentially run as many as twenty times the number of agents on the same power budget, or cut the electricity those agents consume by a comparable factor.
- Higher agent density per megawatt reduces hardware spend for large‑scale deployments.
- Improved concurrency at low token rates eases latency spikes during peak demand.
- Stack‑wide optimizations mean you don’t have to rewrite your model to fit the hardware.
- Energy efficiency gains align with sustainability goals that many enterprises now track.
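The agents‑per‑megawatt trade‑off above comes down to simple arithmetic, sketched below. The baseline density of 500 agents per megawatt is a placeholder chosen for illustration only (the article does not quote absolute figures), and the 20× factor is the benchmark’s best case, so treat the output as an upper bound.

```python
import math


def agents_for_budget(agents_per_mw: float, power_budget_mw: float) -> int:
    """How many concurrent agents fit inside a fixed power budget."""
    return math.floor(agents_per_mw * power_budget_mw)


def power_for_fleet_mw(agents_per_mw: float, fleet_size: int) -> float:
    """Power (in MW) needed to run a fleet of a given size."""
    return fleet_size / agents_per_mw


# HYPOTHETICAL inputs: replace with measured values for your own stack.
baseline_agents_per_mw = 500.0   # placeholder, not a published number
best_case_speedup = 20.0         # "up to 20x" from the benchmark summary
improved_agents_per_mw = baseline_agents_per_mw * best_case_speedup

budget_mw = 0.25                 # e.g. a 250 kW power envelope
baseline_fleet = agents_for_budget(baseline_agents_per_mw, budget_mw)
improved_fleet = agents_for_budget(improved_agents_per_mw, budget_mw)

fleet_size = 1_000
baseline_power = power_for_fleet_mw(baseline_agents_per_mw, fleet_size)
improved_power = power_for_fleet_mw(improved_agents_per_mw, fleet_size)

print(f"Same budget ({budget_mw} MW): {baseline_fleet} vs {improved_fleet} agents")
print(f"Same fleet ({fleet_size}): {baseline_power} MW vs {improved_power} MW")
```

The two helper functions are the same relationship read in opposite directions: fix the power and solve for agents, or fix the fleet and solve for power.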
What This Means For You
If you’re a developer who’s been tuning models for single‑call latency, you’ll need to rethink your performance targets. Instead of measuring only how fast a model spits out a response, you should start profiling the entire agent loop: context growth, tool latency, and inter‑expert communication. The GB300 NVL72’s architecture suggests that you can get away with larger MoE models without paying a linear penalty in power consumption.
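As a starting point for that kind of profiling, the sketch below times the LLM call and the tool call separately at each step and records how large the context was going in. The `fake_llm` and `fake_tool` functions are hypothetical stand‑ins you would swap for your real client and tools, and the whitespace token count is a crude proxy for a real tokenizer. Inter‑expert communication happens inside the serving stack, so it needs hardware‑level profilers and is not covered here.

```python
import time
from dataclasses import dataclass


@dataclass
class StepRecord:
    step: int
    context_tokens: int   # context size going into the LLM call
    llm_seconds: float
    tool_seconds: float


def fake_llm(context: list[str]) -> str:
    """Hypothetical stand-in for a model call; returns the next action."""
    return f"action-{len(context)}"


def fake_tool(action: str) -> str:
    """Hypothetical stand-in for a tool (compiler, database, browser...)."""
    return f"result of {action}"


def count_tokens(context: list[str]) -> int:
    """Crude proxy: whitespace-separated words, not a real tokenizer."""
    return sum(len(message.split()) for message in context)


def profile_agent_loop(goal: str, steps: int, llm=fake_llm, tool=fake_tool) -> list[StepRecord]:
    context = [goal]
    records = []
    for step in range(steps):
        tokens_in = count_tokens(context)
        t0 = time.perf_counter()
        action = llm(context)
        t1 = time.perf_counter()
        observation = tool(action)
        t2 = time.perf_counter()
        # Both the action and the tool result are fed back into the context.
        context.extend([action, observation])
        records.append(StepRecord(step, tokens_in, t1 - t0, t2 - t1))
    return records


records = profile_agent_loop("fix the failing unit test", steps=5)
for record in records:
    print(record)
```

Plotting `context_tokens` against `llm_seconds` for a real run is one quick way to see whether later steps are getting slower as the context grows.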
Enterprises that are scaling agentic services should factor in the new “agents per megawatt” metric when sizing clusters. It isn’t just a bragging right—it’s a lever you can pull to lower OPEX and hit sustainability KPIs. With the benchmark’s methodology now public, you can compare your own infrastructure against Blackwell’s results and decide whether a rack‑scale upgrade is worth the investment.
Looking ahead, the industry will likely see more benchmarks that capture the full agentic workflow, and hardware vendors will keep tightening the stack to squeeze out every watt. The question is whether the next generation of AI servers will continue to bundle dozens of GPUs, or if new architectures will emerge that handle the relay pattern more natively.
Historical Context: From Single‑Shot Benchmarks to Agent‑Centric Metrics
For several years the community relied on inference suites that measured a single forward pass. Those tests were useful when most deployments ran static chat or completion endpoints. As developers began stitching together chains of calls, the gap between measured latency and real‑world experience widened. The first attempts to address that gap introduced multi‑step workloads, but they still treated each step as an isolated request. AgentPerf is the first effort that treats the whole chain as a single entity, mirroring how autonomous agents operate in production.
Earlier hardware generations, such as the Hopper line, excelled at raw FLOPs per watt. Their designs emphasized peak throughput for dense matrix multiplication. Those strengths translated well to single‑prompt scenarios, but they didn’t expose the communication overhead that becomes dominant when many experts need to exchange data. Blackwell’s rack‑scale fabric builds on that legacy, adding a layer of inter‑GPU bandwidth that directly tackles the bottleneck identified by the new benchmark.
Industry events over the past few cycles have highlighted the need for energy‑aware AI. Sustainability pledges, carbon‑reporting requirements, and rising electricity costs pushed vendors to publish power efficiency numbers. AgentPerf extends that conversation by attaching power draw to a concrete functional unit: agents. The shift from “energy per token” to “agents per megawatt” gives operators a more actionable figure when planning capacity.
Concrete Scenarios: How Different Teams Can Use the New Metric
Scenario 1 – Startup Building a Code‑Assist Bot
A small company wants a developer assistant that can fetch documentation, run a compiler, and suggest fixes. Their workload spikes when a new feature is released, pushing dozens of concurrent agents to run the same set of tool calls. Using the agents‑per‑megawatt figure, they can estimate whether a single GB300 NVL72 rack would support the peak load while staying under their power envelope. If it does, they can avoid over‑provisioning across multiple smaller servers, cutting both CAPEX and energy bills.
Scenario 2 – Enterprise Knowledge‑Base Search
A large corporation plans to roll out an internal search assistant that crawls documents, calls an LLM for summarization, and then formats answers for a chat interface. The agent loop includes a database query, a summarization step, and a final rendering step. By measuring each loop’s latency and energy draw, the team can decide whether to place the workload on a Blackwell rack or stick with existing infrastructure. The benchmark shows that, at low token rates, Blackwell delivers higher concurrency, meaning the search assistant can handle more simultaneous queries without a noticeable slowdown.
Scenario 3 – Cloud Provider Offering Agent‑as‑a‑Service
A cloud platform wants to expose a managed agent service where customers pay per‑agent‑hour. Pricing models often factor in compute cost, but they rarely account for the power draw of each agent. With the new metric, the provider can price the service based on a predictable energy consumption model, ensuring that margins stay healthy even as customers scale up their agent fleets. The provider can also advertise a greener footprint, appealing to sustainability‑focused customers.
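One way such an energy‑aware price could be derived is sketched below. It assumes the agents‑per‑megawatt figure holds at steady state, so one agent draws 1/agents_per_mw megawatts; the density, electricity price, non‑energy cost, and margin are all hypothetical inputs, not figures from the benchmark.

```python
def energy_cost_per_agent_hour(agents_per_mw: float, price_per_mwh: float) -> float:
    """Electricity cost of keeping one agent running for one hour.

    One agent draws 1 / agents_per_mw megawatts, so over an hour it
    consumes 1 / agents_per_mw megawatt-hours.
    """
    return price_per_mwh / agents_per_mw


def price_per_agent_hour(energy_cost: float, other_cost: float, margin: float) -> float:
    """List price: (energy + all other costs) marked up by a fractional margin."""
    return (energy_cost + other_cost) * (1.0 + margin)


# HYPOTHETICAL inputs for illustration only.
agents_per_mw = 10_000.0   # placeholder density, not a published number
price_per_mwh = 80.0       # electricity price in dollars per MWh
other_cost = 0.05          # amortized hardware, staff, etc. per agent-hour
margin = 0.30              # 30% target margin

energy_cost = energy_cost_per_agent_hour(agents_per_mw, price_per_mwh)
list_price = price_per_agent_hour(energy_cost, other_cost, margin)

print(f"energy cost per agent-hour: ${energy_cost:.4f}")
print(f"list price per agent-hour : ${list_price:.4f}")
```

Because the energy term scales inversely with agent density, a denser platform shrinks that line item directly, which is what keeps the margin predictable as fleets grow.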
Competitive Landscape: Where Blackwell Stands
Other hardware manufacturers have announced rack‑scale solutions that also aim to reduce inter‑GPU latency. Those offerings typically bundle fewer GPUs per chassis and rely on software tricks to hide communication costs. Blackwell’s 72‑GPU configuration gives it a raw bandwidth advantage that directly translates into the agentic workload gains observed in AgentPerf.
Software‑only approaches—such as model parallelism libraries and quantization toolchains—can improve efficiency, but they still depend on the underlying hardware’s ability to move tensors quickly. The benchmark suggests that, without a fabric designed for massive expert sharding, those software tricks hit a ceiling. Blackwell’s stack‑wide optimizations, from CUDA kernels to TensorRT LLM, provide a cohesive environment where each layer reinforces the others.
Vendors that focus on single‑GPU accelerators may excel at edge deployments, but they lack the rack‑scale cohesion needed for high‑density agent farms. The market is therefore splitting into two camps: those that double down on multi‑GPU orchestration, and those that pursue lightweight, low‑power nodes for specialized tasks. Blackwell clearly belongs to the former, and its benchmark performance validates that strategy.
Key Questions Remaining
- Will future benchmarks expand beyond DeepSeek V4 Pro to include other model families, and how might that affect the relative advantage of Blackwell?
- How will software ecosystems evolve to expose the “agents per megawatt” metric in everyday monitoring tools?
- Can emerging memory‑centric architectures close the bandwidth gap without scaling GPU counts?
Answers to those questions will shape the next wave of hardware and software co‑design. As the community embraces agentic workloads, the yardsticks we use to compare systems will keep shifting, and the ability to adapt will separate early adopters from the rest.
Sources: NVIDIA Blog, Artificial Analysis

