NVIDIA’s dominance of AI training hardware has been the defining competitive fact of the AI infrastructure buildout. The H100, and its Blackwell-generation successor the B200, became the closest thing the technology industry has produced to a strategic resource — rationed, hoarded, and priced at margins more typical of luxury goods than of semiconductors. A single H100 cluster capable of training a frontier model requires capital expenditure measured in the hundreds of millions. NVIDIA’s gross margins on data center GPU sales have consistently exceeded 70%.
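The arithmetic behind those figures is straightforward, even if every input is an estimate. A back-of-envelope sketch, with all numbers assumed for illustration rather than quoted from any vendor:

```python
# Rough cluster cost model. All figures are illustrative assumptions, not
# quoted prices: H100 SXM boards have been widely reported in the $25k-$40k
# range, and networking, power, and facilities add substantially on top.
gpu_unit_cost = 30_000     # USD per H100, assumed midpoint
gpus_per_node = 8
node_overhead = 100_000    # USD per node: CPUs, RAM, NICs, chassis (assumed)
network_factor = 1.3       # fabric, optics, switches as a multiplier (assumed)

def cluster_capex(num_gpus: int) -> float:
    nodes = num_gpus / gpus_per_node
    servers = num_gpus * gpu_unit_cost + nodes * node_overhead
    return servers * network_factor

for n in (1_024, 16_384):
    print(f"{n:>6} GPUs: ~${cluster_capex(n) / 1e6:,.0f}M")
# ~1k GPUs lands in the tens of millions; frontier-scale clusters of tens
# of thousands of GPUs land in the hundreds of millions.
```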

That dominance is structurally durable in the short term and structurally contested in the medium term. The contestation is coming from multiple directions simultaneously.

AMD’s Legitimate Challenge

AMD’s MI300X accelerator established that NVIDIA’s architectural lead is real but not unbridgeable. The MI300X offers competitive performance on inference workloads — particularly for large language model inference, where memory bandwidth is often the binding constraint — at a price point that creates genuine procurement decisions rather than the rubber-stamp NVIDIA purchases that characterized the 2023-2024 shortage period.
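The bandwidth constraint is worth making concrete: at small batch sizes, generating each token requires streaming essentially every model weight through the memory system once, so peak memory bandwidth caps tokens per second regardless of compute. A rough roofline sketch, using public peak-bandwidth figures (roughly 3.35 TB/s for the H100 SXM, 5.3 TB/s for the MI300X) and treating weight traffic as the only memory traffic:

```python
# Bandwidth-bound decode ceiling for a single accelerator at batch size 1.
# Assumes weight traffic dominates (KV-cache and activation traffic ignored).
def max_tokens_per_sec(model_params_b: float, bytes_per_param: int,
                       mem_bw_tb_s: float) -> float:
    bytes_per_token = model_params_b * 1e9 * bytes_per_param
    return mem_bw_tb_s * 1e12 / bytes_per_token

for name, bw in [("H100 SXM, ~3.35 TB/s", 3.35), ("MI300X, ~5.3 TB/s", 5.3)]:
    t = max_tokens_per_sec(model_params_b=70, bytes_per_param=2, mem_bw_tb_s=bw)
    print(f"{name}: ~{t:.0f} tokens/s ceiling for a 70B model at FP16")
```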

The MI300X’s 192GB of HBM3 memory was, at launch, larger than any NVIDIA configuration available, which translated to a meaningful advantage for inference of large models. Inference economics matter more than training economics for most commercial AI deployments — training happens once, inference runs continuously at scale. The market segment where AMD can compete most effectively is also the market segment that most enterprises actually care about.
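The capacity advantage is easy to quantify. A 70B-parameter model at 16-bit precision needs roughly 140 GB for weights alone, before any KV cache; a minimal sketch, assuming the launch-era 80 GB H100 as the comparison point and a nominal KV-cache allowance:

```python
import math

def min_gpus(params_b: float, bytes_per_param: int, kv_cache_gb: float,
             gpu_mem_gb: int) -> int:
    weights_gb = params_b * 1e9 * bytes_per_param / 1e9
    return math.ceil((weights_gb + kv_cache_gb) / gpu_mem_gb)

# 70B model, FP16 weights, ~20 GB of KV cache assumed for serving headroom.
for name, mem in [("H100 80GB", 80), ("MI300X 192GB", 192)]:
    print(name, "->", min_gpus(70, 2, 20.0, mem), "GPU(s) per replica")
# H100 80GB -> 2, MI300X 192GB -> 1: halving the GPU count per serving
# replica changes the cost-per-token arithmetic directly.
```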

AMD’s software stack — ROCm — remains the primary obstacle to broader adoption. The CUDA ecosystem represents nearly two decades of optimized libraries, tooling, and developer familiarity, none of which transfers to ROCm without friction. Model training frameworks, inference optimizers, and profiling tools all treat CUDA as the primary target. AMD has been closing this gap aggressively, but it remains the key competitive variable.
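The friction sits mostly below the framework layer. ROCm builds of PyTorch expose the same torch.cuda API (HIP answers to the "cuda" device string), so idiomatic framework-level code ports cleanly; what does not port is the long tail of hand-written CUDA kernels and CUDA-only libraries underneath. A minimal sketch of that framework-level portability:

```python
import torch

# On a ROCm build of PyTorch, HIP presents itself through the torch.cuda
# namespace: is_available() returns True and "cuda" device strings work,
# so this code runs unchanged on NVIDIA or AMD hardware.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(8, 4096, device=device)
y = model(x)

# On a GPU build, exactly one of these is non-None, identifying the backend.
print("CUDA:", torch.version.cuda, "| HIP:", torch.version.hip)
```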

Custom Silicon: Google, Amazon, Microsoft

The hyperscalers have concluded that building their own AI silicon is strategically necessary and economically justified at their scale. Google’s TPU v5, Amazon’s Trainium 2, and Microsoft’s Maia 100 are all in production deployment across their respective cloud infrastructure.

The economics are compelling at hyperscale. A chip designed specifically for the workloads that dominate a company’s infrastructure — attention mechanisms, matrix multiplications, transformer inference — can outperform general-purpose accelerators on those specific workloads while being manufactured at lower cost. The catch is that custom silicon only makes sense if you can amortize the $500M-$1B design and tape-out cost across sufficient volume. Google, Amazon, and Microsoft all have that volume. Almost nobody else does.
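The amortization logic is simple enough to sketch. All of the inputs below are illustrative assumptions; the real figures (per-chip cost delta versus a merchant GPU, program cost, deployment volume) are closely held:

```python
# Break-even volume for a custom accelerator program. All inputs are
# assumptions for illustration, not disclosed figures.
program_cost = 750e6       # USD: design, verification, tape-out, software
merchant_gpu_cost = 30e3   # USD per equivalent merchant GPU (assumed)
custom_chip_cost = 10e3    # USD per custom chip, built at cost (assumed)
perf_ratio = 1.0           # custom chip assumed at parity on target workload

savings_per_chip = merchant_gpu_cost - custom_chip_cost / perf_ratio
break_even_units = program_cost / savings_per_chip
print(f"Break-even at ~{break_even_units:,.0f} chips")   # ~37,500 chips
```

Under these assumptions the program pays for itself somewhere in the tens of thousands of chips. Hyperscalers deploy accelerators at that scale every year; almost no one else does.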

The strategic implication: hyperscalers are progressively reducing their dependency on NVIDIA for internal workloads while continuing to offer NVIDIA GPU instances to customers who require software compatibility. The customer-facing and internal infrastructure are diverging.

Startups: Groq, Cerebras, SambaNova

The dedicated AI chip startup category has been brutally competitive. Groq’s LPU (Language Processing Unit) architecture demonstrated that purpose-built inference hardware can produce latency numbers that GPU clusters cannot match — Groq’s published large language model inference benchmarks have consistently outperformed H100-based deployments by factors of 5-10x on per-user tokens per second. The bottleneck has been memory capacity and ecosystem integration, not raw compute.
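The memory-capacity bottleneck follows from the architecture: the LPU keeps weights in on-chip SRAM (roughly 230 MB per chip, per Groq’s public materials) rather than in HBM, so serving a large model means spreading it across hundreds of chips. A sketch, with the SRAM figure as the key assumption:

```python
import math

SRAM_PER_CHIP_GB = 0.23   # ~230 MB per LPU, per Groq's public figures

def chips_to_hold(params_b: float, bytes_per_param: float) -> int:
    weights_gb = params_b * bytes_per_param   # params in billions -> GB
    return math.ceil(weights_gb / SRAM_PER_CHIP_GB)

for p in (7, 70):
    print(f"{p}B model @ 8-bit: ~{chips_to_hold(p, 1.0)} chips minimum")
# 7B -> ~31 chips, 70B -> ~305 chips, before activations or redundancy.
# The same SRAM locality that forces this fan-out is what produces the
# latency advantage: no off-chip weight traffic at all.
```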

Cerebras built chips the size of a wafer — literally, a single silicon wafer rather than the individual dies that conventional chips are diced from — which eliminates the inter-chip communication overhead that limits the scaling of multi-chip AI accelerators. The CS-3 wafer-scale engine contains 4 trillion transistors and 900,000 AI-optimized cores on a single piece of silicon. The manufacturing complexity is extraordinary and the price point is commensurate.
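A toy model makes the overhead that wafer-scale integration removes concrete: in data-parallel training across n discrete chips, each step pays a gradient all-reduce over the inter-chip links, and that communication term is a tax compute cannot hide without overlap. All numbers below are assumed for illustration:

```python
# Toy scaling model: step time = compute + gradient all-reduce.
# Ring all-reduce moves ~2 * grad_bytes * (n-1)/n across each chip's links.
def scaling_efficiency(n_chips: int, grad_gb: float, link_gb_s: float,
                       compute_s: float) -> float:
    allreduce_s = 2 * grad_gb * (n_chips - 1) / n_chips / link_gb_s
    return compute_s / (compute_s + allreduce_s)

# Assumed: 20 GB of gradients, 100 GB/s effective link bandwidth,
# 0.5 s of compute per step. Illustrative only.
for n in (2, 8, 64):
    eff = scaling_efficiency(n, grad_gb=20, link_gb_s=100, compute_s=0.5)
    print(f"{n:>3} chips: {eff:.0%} efficiency")
# On-wafer fabric bandwidth is orders of magnitude higher than chip-to-chip
# links, pushing the communication term toward zero for workloads that fit
# on a single wafer.
```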

These startups are not competing for the hyperscaler market NVIDIA dominates. They are competing for specific inference applications where latency or throughput requirements cannot be met by GPU clusters, and for customers who want AI infrastructure that is not dependent on NVIDIA’s allocation decisions.

The Export Control Constraint

U.S. export controls on advanced semiconductors have created a bifurcated global AI hardware market. NVIDIA’s highest-performance chips cannot legally be exported to China. The H20 — a deliberately cut-down Hopper-generation part designed to sit below the export-control performance thresholds — is the ceiling for Chinese buyers. Huawei’s Ascend 910B has stepped into the gap, achieving performance that, while below H100 levels, is sufficient for many inference and fine-tuning workloads.
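The threshold is mechanical enough to compute. The October 2023 BIS rules key on "total processing performance" (TPP), defined as 2 × MAC throughput in tera-operations per second × operation bit width; since one MAC counts as two FLOPs, this reduces to dense TFLOPS × bit width, with 4,800 as the headline threshold. A sketch using approximate dense-FP16 figures from public spec sheets:

```python
# TPP = 2 * TMACs * bit_width = dense_TFLOPS * bit_width (1 MAC = 2 FLOPs).
# TFLOPS values are approximate dense FP16 numbers from public spec sheets.
TPP_THRESHOLD = 4800  # headline threshold in the October 2023 BIS rule

def tpp(dense_tflops: float, bit_width: int) -> float:
    return dense_tflops * bit_width

for name, tflops in [("A100", 312), ("H100 SXM", 990), ("H20", 148)]:
    v = tpp(tflops, 16)
    verdict = "controlled" if v >= TPP_THRESHOLD else "under threshold"
    print(f"{name}: TPP ~{v:,.0f} -> {verdict}")
# A100 ~4,992 (just over); H100 ~15,840 (far over); H20 ~2,368 (under),
# which is exactly why the H20 exists in its current form.
```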

The export control regime has accelerated Chinese domestic AI chip development in ways that will have long-term competitive consequences. The current performance gap between NVIDIA’s frontier chips and Chinese alternatives is real and significant. The trajectory of that gap — whether it widens, narrows, or closes — depends on factors that are not purely technical: equipment access, talent availability, and the policy decisions of TSMC and other foundries that manufacture advanced semiconductors under U.S. jurisdiction.