Subtitle: Navigating Inference Costs to Unlock AI Investment Opportunities

In September, OpenAI unveiled its new o1 model. While many are impressed by its advanced reasoning capabilities, there are also growing concerns about its cost-effectiveness. The API price has surged sixfold compared to GPT-4o, with actual usage expenses soaring up to thirty times higher. This stark price increase brings back memories of the early days of GenAI in 2022. In this article, we'll explore the shifting landscape of inference costs for large models and uncover the investment opportunities they present.

Section 1: Inference Costs Are Set to Become the Largest AI Expenditure

In the large-model ecosystem, GPUs serve two pivotal functions: training and inference. Training equips models with the capacity to understand and generate complex information, while inference powers the real-time responses that bring models to life by processing user queries and generating relevant outputs almost instantly.

In the 18 months following the launch of GPT-3.5, the market's focus was on training compute, with skyrocketing costs frequently making headlines. However, since June’s API price cuts across various models, attention has increasingly shifted toward inference compute. The industry is beginning to realize that while training comes with high costs, inference expenses are set to outstrip them.

Barclays estimates that training the GPT-4 series models consumed around $150 million in compute costs. Yet, by the end of 2024, cumulative inference compute costs are projected to reach $2.3 billion, fifteen times the training spend. Looking ahead to 2026, demand for inference compute in GenAI is expected to skyrocket, reaching three times the cost of training compute.

The release of o1 has further accelerated the shift of compute resources toward inference. Compared to GPT-4o, o1 generates roughly 50% more output tokens for the same prompt, and its additional reasoning tokens bring the total to about four times the inference tokens, supporting more advanced reasoning.

Moreover, with o1's per-token API price at six times that of GPT-4o, using o1 in similar scenarios drives API costs up by roughly thirty times. This reflects a 30x increase in inference computation and expense, assuming API fees scale proportionally with inference demand. Research from Arizona State University indicates that in practical applications this figure can soar as high as 70x. Consequently, access to o1-preview is currently limited to paid subscribers, with a weekly cap of 50 prompts.
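The multiplier compounds: total cost per prompt is per-token price times tokens consumed. A minimal sketch, assuming a 6x price ratio and roughly 5x token volume (both figures are illustrative, not official pricing):

```python
# Back-of-the-envelope check of the cost multipliers cited above.
# Both inputs are illustrative assumptions, not official pricing.
price_multiplier = 6.0   # per-token API price relative to the previous model
token_multiplier = 5.0   # tokens consumed per prompt, including reasoning tokens

cost_multiplier = price_multiplier * token_multiplier
print(f"Cost per prompt rises by ~{cost_multiplier:.0f}x")
```

With a 4x token multiplier the same arithmetic gives ~24x, so the 30x figure implies somewhat heavier token usage in practice.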

o1 and the concept of inference scaling laws highlight the trade-off between inference compute and reasoning capability. This stems from an "impossible triangle" in GenAI: under the same conditions, inference cost (compute), model performance, and latency are locked in a three-way trade-off.

However, if inference costs can be reduced through advances in models, systems, or hardware, this "triangle" can be expanded. The result? GenAI applications gain the flexibility to reduce costs, improve capabilities, or minimize latency—enhancing current products and unveiling potential new use cases. Ultimately, the pace of these cost reductions will dictate the speed of value creation in GenAI.

Section 2: Every Technological Revolution Faces Cost Challenges

James Watt improved the steam engine in 1776, but widespread adoption did not happen overnight. Over the next three decades, incremental advancements—including double-acting designs, centrifugal governors, and multi-stage steam engines—drove thermal efficiency from a modest 2% to between 5% and 10%. Only with these improvements did steam engines gradually become the leading power source for factories.

In 1871, the invention of the ring armature generator enabled stable direct current output. However, only later advancements—such as multiphase motors and transformers, which significantly enhanced efficiency and range of power transmission—made electricity economically feasible and led to its widespread adoption for lighting, industry, and transportation.

Every productivity revolution driven by technology faces the challenge of usage costs, and for GenAI applications, that challenge centers on inference expenses.

In both traditional SaaS and GenAI, usage costs are primarily server-based. Traditional SaaS relies on CPUs, keeping server costs relatively low. A 2023 survey of 400 startups revealed that server costs accounted for just 5% of total expenses and 4% of annual recurring revenue (ARR). This low expenditure results in minimal marginal costs, enabling traditional SaaS companies to scale profitably and leverage economies of scale.

In contrast, GenAI applications demand significant inference computation for user interactions, leading to escalating GPU rental costs or API fees. As mentioned earlier, GPT-4 is projected to accrue $2.3 billion in inference costs from its March 2023 launch through the end of 2024—accounting for 49% of its revenue.

These hefty inference costs drastically compress the gross margins of GenAI applications, increasing operational difficulty. For SaaS startups, expenses related to marketing, promotion, and customer support typically average around 30% of ARR. In comparison, a GenAI app with similar marketing spending sees its combined inference and sales costs nearly match total annual revenue. Thus, GenAI apps that succeed in scaling today either have substantial funding to uphold a high burn rate or employ highly efficient distribution strategies for rapid growth on a lean budget.
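To make the margin squeeze concrete, here is a sketch using the round figures above (all values are illustrative percentages of annual revenue):

```python
# Rough unit-economics comparison using the article's figures.
# All values are percentages of annual revenue; illustrative only.
sales_and_marketing = 30   # typical SaaS spend, per the survey cited above

saas_server_cost = 4       # traditional SaaS: servers ~4% of ARR
genai_inference_cost = 49  # GenAI: inference ~49% of revenue (GPT-4 example)

saas_remaining = 100 - saas_server_cost - sales_and_marketing
genai_remaining = 100 - genai_inference_cost - sales_and_marketing

print(f"Traditional SaaS: {saas_remaining}% of revenue left for everything else")
print(f"GenAI app:        {genai_remaining}% of revenue left for everything else")
```

Once payroll, R&D, and support come out of that remainder, the GenAI app has little or nothing left, which is why scale today requires either deep funding or unusually efficient distribution.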

Yet, many applications remain uncommercializable due to prohibitively high inference costs. Consider OpenAI’s Sora and Meta’s recently launched MovieGen video generation tools: producing a one-minute video incurs around $20 in inference costs, assuming only one in five videos meets quality standards. These substantial expenses hinder accessibility to such tools. Consequently, most popular video apps today focus on ultra-short video creation, offering clips that last from just 2 to 6 seconds.
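On one reading of these figures (assuming $20 per generated minute, with only one in five clips usable), the effective cost per usable minute is five times the headline number:

```python
# Effective cost per *usable* minute of generated video (illustrative).
cost_per_generated_minute = 20.0  # USD, figure cited above
acceptance_rate = 1 / 5           # one in five clips meets quality standards

cost_per_usable_minute = cost_per_generated_minute / acceptance_rate
print(f"Effective cost per usable minute: ${cost_per_usable_minute:.0f}")
```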

AI agents, which are a central focus in GenAI development, require multiple model invocations per prompt, resulting in considerable computational demands. Real-time AI agents can cost up to $1 per hour in computational expenses.

These costs are estimated under more economical conditions using GPU cloud services. Using model APIs directly would raise costs by two to five times.

Section 3: Innovations in Models and Systems Have Driven Down Inference Costs Over the Past Two Years

Inference costs have substantially decreased over the past two years. Since the launch of GPT-3.5 Turbo in March 2023, its price has plummeted by 70% in just seven months. Similarly, Google’s Gemini 1.5 Pro, updated in October, experienced a price reduction of over 50% compared to its initial release in May.

Comparing GPT-3.5 Turbo to GPT-4o mini, released in late July, the latter boasts significantly stronger reasoning capabilities, scoring 40% higher on the Artificial Analysis Quality Index (AAQI), yet is priced at just 10% of GPT-3.5 Turbo's initial cost. In a little over a year, access to a more powerful model now costs one-tenth as much, a testament to the remarkable pace of cost reduction.

Assuming GPU resource costs for inference have fallen roughly 90% in tandem with model API prices, about 20 percentage points of that drop stem from lower GPU cloud server costs, while the remaining 70 points come from improved inference efficiency.
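In percentage-point terms, the assumed decomposition looks like this (the split is an assumption, not measured data):

```python
# Decomposing the assumed ~90% drop in inference cost, in percentage points.
total_reduction_pts = 90
hardware_pts = 20                                    # cheaper GPU cloud servers
efficiency_pts = total_reduction_pts - hardware_pts  # inference efficiency gains

final_cost_pct = 100 - total_reduction_pts
print(f"Efficiency gains account for {efficiency_pts} of {total_reduction_pts} points")
print(f"Today's cost is ~{final_cost_pct}% of the starting level")
```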

Strategies to enhance inference efficiency include:

  1. Reducing Inference Computation: Lowering model parameter counts significantly decreases computational demands; in 2024, OpenAI, Google, and Meta all introduced smaller models. Mixture of Experts (MoE) reduces the number of parameters active during inference, balancing computational load against model performance; GPT and Mistral are prime examples. Additionally, low-precision (quantized) inference similarly reduces computation requirements and also improves Model FLOPs Utilization (MFU).
  2. Improving Compute Utilization (MFU): GPU computation involves two main activities: moving data and processing it. During inference, data movement often lags behind processing, leaving significant compute idle. To address this, the industry has innovated at both the model layer (e.g., attention variants like GQA and sparsification methods like Sliding Window Attention) and the system layer (e.g., optimized kernels like FlashAttention and batching techniques like Continuous Batching).
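One of the techniques above, top-k MoE routing, can be sketched in a few lines: only k of the n experts run per token, so active parameters, and hence inference FLOPs, shrink accordingly. This is an illustrative toy, not any specific model's implementation:

```python
import numpy as np

# Toy top-k Mixture-of-Experts forward pass (illustrative sketch).
rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    logits = x @ router                # router score for each expert
    top = np.argsort(logits)[-k:]      # pick the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the selected experts only
    # Weighted sum of just k expert outputs; the other n-k are never computed.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))
print(f"Output shape: {y.shape}, active experts per token: {k}/{n_experts}")
```

Per token, only 2 of 8 expert matrices are multiplied, so inference compute is roughly a quarter of a dense model with the same total parameter count.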

Mixtral 8×7B employs methods such as Sliding Window Attention, Continuous Batching, and MoE. With performance comparable to Llama 2 70B, released around the same time, it achieves this at approximately 35% of the inference cost. Meanwhile, DeepSeek-V2 enhances the MoE architecture and introduces the innovative Multi-head Latent Attention (MLA) mechanism, slashing inference costs by a further 50%.
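The sliding-window idea can be shown as an attention mask: each token attends only to a fixed window of recent tokens rather than the whole prefix, cutting attention cost from O(n²) toward O(n·w). A minimal numpy sketch (sequence length and window size are arbitrary choices here):

```python
import numpy as np

# Sliding-window attention mask: token i attends to tokens i-w+1 .. i.
n, w = 8, 3  # sequence length and window size (arbitrary for illustration)

causal = np.tril(np.ones((n, n), dtype=bool))          # standard causal mask
window = np.triu(np.ones((n, n), dtype=bool), -(w - 1))  # keep last w positions
mask = causal & window

print(mask.astype(int))
print(f"Attended positions: {mask.sum()} of {causal.sum()} causal pairs")
```

Here the window mask keeps 21 of the 36 causal pairs; the savings grow with sequence length, since attended pairs scale linearly rather than quadratically.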

Section 4: Hardware Breakthroughs Will Drive Future Inference Cost Reductions

While system and model innovations have primarily fueled inference cost cuts over the past two years, the next wave of reductions is likely to be driven by hardware breakthroughs.

For every dollar spent on GPU cloud services, 54% goes toward server holding costs, with more than half dedicated to GPU purchases—significantly overshadowing expenses like electricity and other server costs. Consequently, GPU prices are the main drivers behind inference expenses.

Emerging companies like Groq and Cerebras have targeted the GPU's memory-bandwidth bottleneck, greatly enhancing inference efficiency and economics. When serving the same Llama 3.1 70B model, APIs from Cerebras and Groq are 30–40% cheaper than mainstream APIs running on NVIDIA's H100, yet offer higher throughput.

Meanwhile, NVIDIA's upcoming B200, set for release next year, aims to substantially cut inference costs. MLPerf testing shows the B200 delivering 4x the throughput of the H100 (FP4 for the B200, FP8 for the H100). With a projected market price of $35,000, approximately 40% higher than the H100, the B200's enhanced performance could cut inference costs by around 70%. Even at identical precision, cost reductions of 30–40% are anticipated.
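The arithmetic behind that estimate, ignoring power and other operating costs (both multipliers are the projections cited above; the H100 base price is an implied assumption):

```python
# Price-per-throughput comparison, B200 vs. H100 (projections, not measurements).
throughput_multiplier = 4.0  # B200 throughput vs. H100 per MLPerf (FP4 vs. FP8)
price_multiplier = 1.4       # ~$35,000 vs. an implied ~$25,000 for the H100

cost_per_token_ratio = price_multiplier / throughput_multiplier
reduction_pct = (1 - cost_per_token_ratio) * 100
print(f"Cost per token falls to {cost_per_token_ratio:.2f}x, a ~{reduction_pct:.0f}% reduction")
```

The simple ratio gives roughly 65%, in the ballpark of the ~70% figure once other savings are counted.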

Enhanced GPU hardware performance, combined with ongoing innovations at the model and system levels, is poised to further reduce inference costs, unlocking new possibilities for GenAI applications.

Section 5: When Products Aren't Perfect, Investment Opportunities Emerge

Beyond cost considerations, GenAI applications face an additional challenge: optimizing user interaction and performance. As with any technological revolution, reducing costs is just one part of the equation; the quest for the "ideal interaction interface" is equally vital.

The widespread adoption of desktop computers hinged on the advent of the mouse and Windows OS. Similarly, the popularity of short videos stems from the swipe-up design that gained traction nearly a decade after the iPhone’s introduction through platforms like TikTok. This browsing mode takes full advantage of mature smartphone hardware, advances in video compression, mobile networks, and recommendation algorithms.

We believe GenAI has the potential to introduce entirely new modes of human interaction with hardware. However, realizing this potential may take years.

Delaying investments until products are fully mature and widely adopted often results in missed opportunities. The ideal time to invest is when a product begins to show promise but still has room for refinement.

The Human Performance Benchmark Index serves as a standard metric for evaluating AI capabilities, signifying how closely AI measures up to human performance across various skill areas.

By 2016, visual solutions had achieved 80% of human capabilities, with initial applications in specific fields—though many limitations remained. Between 2014 and 2019, the U.S. market witnessed a surge in early-stage investments in this area, while graduation rates from seed, Series A, and Series B rounds to the next stage remained high at 70% to 80%. However, post-2019, as visual solutions approached perfection with limited room for further optimization, overall investment interest declined.

Thus, when industry solutions still present gaps, it may be an ideal moment for Series B investments. For Series A and seed rounds, it’s crucial to position oneself 3 to 5 years ahead.

Currently, GenAI applications grapple with high costs and subpar performance, keeping them some distance from mass adoption. But if a product has already achieved perfection, the prime investment opportunity has likely passed. It is precisely these shortcomings that harbor investment potential: guiding these applications through their final challenges can yield tenfold returns.

A special thanks to the team at PaleBlueDot AI for their invaluable insights.