Subtitle: Unlocking the Mystery of Inference Costs to Crack the AI Investment Code
OpenAI introduced its new GPT-o1 model in September. While many are impressed by its advanced reasoning capabilities, there is also growing concern about its affordability. The API price has risen 6x compared to its predecessor, GPT-4, with actual usage expenses up to 30x higher. This sharp price increase recalls the early days of GenAI in 2022. In this article, we’ll explore the evolving landscape of inference costs for large models and uncover the investment opportunities they reveal.
Session 1: Inference Costs Are Set to Be the Largest Expenditure in AI
In the large-model ecosystem, GPUs serve two critical functions: training and inference. Training equips models with the capacity to understand and generate complex information, while inference powers the real-time responses that bring models to life, processing user queries and generating relevant outputs instantly.
In the 18 months following GPT-3.5’s launch, the market spotlight was on training compute, with soaring costs frequently making headlines. However, since June’s API price cuts across models, attention has increasingly shifted toward inference compute. The industry is realizing that while training is costly, inference expenses are poised to surpass it.
Barclays estimates that training the GPT-4 series models consumed around $150 million in compute costs. Yet, by the end of 2024, cumulative inference compute costs are projected to hit $2.3 billion, 15x the training spend. Looking ahead to 2026, demand for inference compute in GenAI is expected to skyrocket to 118x 2024 levels, reaching 3x the cost of training compute.
The release of GPT-o1 has further accelerated the shift in computational resources toward inference. Compared to GPT-4, GPT-o1 returns roughly 50% more output tokens for the same prompt, while generating about 4x the total inference tokens (including hidden reasoning tokens) to support more advanced reasoning.
Moreover, with the API cost per token for GPT-o1 priced at 6x that of GPT-4, using GPT-o1 in similar scenarios drives API costs up by roughly 30x. Assuming API fees scale proportionally with inference demand, this reflects a 30x increase in inference computation and expenses. Research from Arizona State University indicates that, in practical applications, this figure can soar to as high as 70x. Consequently, access to the ChatGPT-o1 preview is currently limited to paying members, with a weekly cap of 50 prompts.
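The relationship behind these multipliers is simple to express. A minimal sketch using the article's figures; the 5x token multiplier is the value implied by a 30x cost increase at 6x per-token pricing, not an official number:

```python
# API spend scales with tokens generated times price per token.
# Multipliers below follow the article's figures, not official OpenAI pricing.

def relative_cost(token_multiplier: float, price_multiplier: float) -> float:
    """Cost of serving a prompt on the new model relative to the old one."""
    return token_multiplier * price_multiplier

# A 30x cost increase at 6x per-token pricing implies ~5x the tokens generated.
print(relative_cost(5, 6))  # -> 30.0
```

The same relation explains why hidden reasoning tokens matter: even at unchanged per-token prices, a model that silently generates several times more tokens multiplies the bill accordingly.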
GPT-o1 and the concept of inference scaling laws highlight the trade-off between inference compute and reasoning capabilities. This stems from the "Bermuda triangle" of GenAI: under the same conditions, inference cost (compute), model performance, and latency are locked in a three-way trade-off.
However, if inference costs can be lowered through advances in models, systems, or hardware, this "triangle" can be expanded. The result? GenAI applications gain flexibility to reduce costs, improve capabilities, or minimize latency—enhancing current products and unlocking potential new use cases. Consequently, the pace of these cost reductions will ultimately dictate the speed of value creation in GenAI.
Session 2: Every Technological Revolution Confronts the Cost Challenge
James Watt improved the steam engine in 1776, but widespread adoption was far from immediate. Over the next 30 years, incremental advancements—such as double-acting designs, centrifugal governors, and multi-stage steam engines—pushed thermal efficiency from a modest 2% to between 5% and 10%. Only with these improvements did steam engines gradually emerge as the leading power source for factories.
In 1871, the invention of the ring armature generator enabled stable direct current output. However, it wasn't until later advancements, such as multiphase motors and transformers that significantly boosted the efficiency and range of power transmission, that electricity became economically viable and widely adopted for lighting, industrial machinery, and transportation.
Every productivity revolution driven by technology inevitably faces the challenge of usage costs, and for GenAI applications, that challenge centers on inference expenses.
In both traditional SaaS and GenAI, usage costs are primarily server-based. Traditional SaaS relies on CPUs, keeping server costs relatively low. A 2023 survey of 400 startups found that server costs represented just 5% of total expenses and 4% of annual recurring revenue (ARR). This low server expenditure translates to minimal marginal costs, allowing traditional SaaS companies to scale profitably and leverage economies of scale.
In contrast, GenAI applications demand significant inference computation for user interactions, driving up GPU rental costs or API fees. For instance, as previously noted, GPT-4 is projected to accumulate $2.3 billion in inference costs from its March 2023 launch through the end of 2024—accounting for 49% of its revenue.
These substantial inference costs significantly compress the gross margins of GenAI applications, making operations far more challenging. For SaaS startups, sales, marketing, and customer success expenses typically average around 30% of ARR. By comparison, for a GenAI app with similar marketing spend, the combined inference and marketing-related costs can nearly match total ARR. Consequently, GenAI apps that manage to scale today either rely on substantial funding to sustain high burn rates or leverage highly efficient distribution strategies to drive rapid growth on a lean budget.
Yet, many applications remain uncommercializable due to prohibitively high inference costs. Take OpenAI's Sora and Meta's newly launched MovieGen video generation tools as examples: producing one minute of usable video incurs around $20 in inference costs, assuming only one in five generated videos meets quality standards. These steep costs keep such tools from being widely accessible. As a result, most popular video apps today focus on ultra-short video creation, offering clips that range from just 2 to 6 seconds.
AI agents, a core focus in GenAI development, require multiple model invocations per prompt, resulting in significant computational demands. Real-time AI agents can cost up to $1 per hour in computational expenses.
These costs are estimated under more economical conditions using GPU cloud services. If model APIs are used directly, costs would increase by 2x to 5x.
Session 3: Model and System Innovations Have Driven Down Inference Costs Over the Past Two Years
Inference costs have dropped significantly over the past two years. Since the launch of GPT-3.5 Turbo in March 2023, its price has fallen by 70% in just seven months. Similarly, Google’s Gemini 1.5 Pro, updated in October, saw prices decrease by over 50% compared to its initial release in May.
Comparing GPT-3.5 Turbo to GPT-4o mini, released in late July, the latter delivers significantly stronger reasoning capabilities, scoring 40% higher on the Artificial Analysis Quality Index (AAQI), yet is priced at just 10% of GPT-3.5 Turbo's initial cost. In just over a year, access to a more powerful model now costs one-tenth as much, a remarkable pace of cost reduction.
Assuming the GPU resource costs for inference have fallen by the same 90% as model API prices, roughly 20 percentage points of that reduction come from lower GPU cloud server costs, while the remaining 70 points result from improved inference efficiency.
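In percentage-point terms, the decomposition works out as follows. A sketch of the article's estimate; the 20-point attribution is theirs, not measured data:

```python
# Per-token price after a ~90% drop, and the stated split of that drop.
# The percentage-point attribution is the article's estimate.

original_price = 1.0
total_drop = 0.90                   # today's price is ~10% of the original
gpu_cost_points = 20                # points from cheaper GPU cloud servers
efficiency_points = 90 - gpu_cost_points

print(round(original_price * (1 - total_drop), 2), efficiency_points)  # -> 0.1 70
```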
To boost inference efficiency, the main strategies are:
- Reduce Inference Computation: Lowering model parameter counts proportionally reduces computational demands; in 2024, OpenAI, Google, and Meta all introduced smaller models. Mixture of Experts (MoE) architectures reduce the number of parameters active during inference, balancing computational load with model performance; GPT (reportedly) and Mistral's models are typical examples. Additionally, low-precision (quantized) inference cuts computation requirements in much the same way, and also greatly improves MFU.
- Improve Compute Utilization (MFU): GPU computation involves two primary activities: moving data and processing it. During inference, data movement is often slower than computation, leaving significant compute idle. To address this, the industry has innovated at both the model layer (e.g., attention variants like GQA and sparsification methods like Sliding Window Attention) and the system layer (e.g., optimized kernels like FlashAttention and batching techniques like Continuous Batching).
Mixtral 8x7B uses methods such as Sliding Window Attention, Continuous Batching, and MoE. With performance comparable to the Llama 2 70B released around the same time, it achieves this at just about 35% of the inference cost. Meanwhile, DeepSeek-V2 refines the MoE architecture and introduces the novel MLA (Multi-head Latent Attention) mechanism, cutting inference costs by a further 50%.
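To make the MoE idea concrete, here is a minimal top-k routing sketch (illustrative Python with assumed dimensions; not the actual Mixtral or DeepSeek implementation):

```python
import numpy as np

# Minimal top-k Mixture-of-Experts routing sketch (illustrative only).
# With 8 experts and top-2 routing, each token touches only 2/8 of the expert
# parameters, which is how MoE cuts inference compute per token without
# shrinking total model capacity.

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

gate = rng.normal(size=(d_model, n_experts))              # router weights
experts = rng.normal(size=(n_experts, d_model, d_model))  # one FFN matrix per expert

def moe_layer(x):
    """Route one token vector through its top-k experts and mix the outputs."""
    logits = x @ gate                      # score every expert for this token
    top = np.argsort(logits)[-top_k:]      # keep only the k best-scoring experts
    w = np.exp(logits[top])
    w /= w.sum()                           # softmax over the chosen experts only
    # Only top_k of the n_experts weight matrices are read for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # -> (16,)
```

The design point is the comment in the middle: per token, the untouched experts cost nothing at inference time, even though they still count toward total model size.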
Session 4: Hardware Breakthroughs Will Become the New Driver for Reducing Inference Costs
While system and model innovations have primarily driven inference cost reductions over the past two years, the next wave of price cuts is likely to be propelled by breakthroughs in hardware.
For every dollar spent on GPU cloud services, 54% goes toward server holding costs, with more than half of that dedicated to GPU purchases—significantly outweighing expenses like electricity and other server costs. As a result, GPU prices are the primary driver of inference expenses.
Emerging companies like Groq and Cerebras have targeted GPU bandwidth constraints, significantly boosting inference efficiency and economics. When serving the same Llama 3.1 70B model, APIs from Cerebras and Groq are 30–40% cheaper than mainstream APIs running on NVIDIA's H100, while delivering higher throughput.
Meanwhile, NVIDIA's B200, slated for release next year, is set to significantly reduce inference costs. MLPerf testing shows that the B200 delivers 4x the throughput of the H100 (FP4 for the B200, FP8 for the H100). With a projected market price of $35,000, approximately 40% higher than the H100, the B200's enhanced performance could lead to a roughly 70% reduction in inference costs. Even at the same precision levels, cost reductions of 30–40% are anticipated.
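A back-of-envelope check on that claim. This is a sketch: the ~$25,000 H100 baseline is the price implied by the article's "40% higher" figure, not a quoted market price:

```python
# Cost per token is roughly proportional to GPU price divided by throughput.
# Throughput ratio per the MLPerf comparison cited above (FP4 vs. FP8);
# the H100 price is the baseline implied by "~40% higher", not a quote.

h100_price, h100_throughput = 25_000, 1.0   # throughput normalized to 1
b200_price, b200_throughput = 35_000, 4.0   # ~40% pricier, ~4x throughput

h100_cost = h100_price / h100_throughput
b200_cost = b200_price / b200_throughput
reduction = 1 - b200_cost / h100_cost
print(f"{reduction:.0%}")  # -> 65%, in the ballpark of the ~70% cited above
```

The gap between this 65% estimate and the cited 70% plausibly reflects factors beyond GPU purchase price, such as power and hosting costs amortized over higher throughput.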
Enhanced GPU hardware performance, coupled with continued innovations at the model and system levels, is set to further reduce inference costs, unlocking new potential for GenAI applications.
Session 5: When Products Aren't Perfect, Investment Opportunities Arise
Beyond cost considerations, GenAI applications face an additional challenge: optimizing user interaction and performance. As with any technological revolution, reducing costs is just one piece of the puzzle; the pursuit of the "ideal interaction interface" is equally crucial.
The widespread adoption of desktop computers relied on the mouse and the Windows OS. Likewise, short video's popularity depends on the swipe-up design, which became mainstream nearly a decade after the iPhone's debut, with platforms like TikTok. This browsing mode leverages mature smartphone hardware and advances in video compression, mobile networks, and recommendation algorithms.
Similarly, we believe GenAI could introduce entirely new ways for humans to interact with hardware. However, it may take years for this mode to fully develop.
Waiting to invest until a product is fully mature and widely adopted often means missing the prime opportunity. The ideal time to invest is when a product starts to show potential but still has room for improvement.
The Human Performance Benchmark Index is a standard metric used to assess AI capabilities, indicating how AI measures up to human performance across different skill areas.
By 2016, computer vision solutions had achieved 80% of human capability and saw initial applications in specific fields, though many limitations remained. Between 2014 and 2019, the U.S. market saw a surge in early-stage investments in the area, and graduation rates from seed, Series A, and Series B rounds to the next round remained high, at 70% to 80%. After 2019, as vision solutions neared perfection with limited room for further optimization, overall investment interest declined.
Therefore, when industry solutions still have gaps, it may be the ideal time for Series B investments. For Series A and seed rounds, positioning 3 to 5 years ahead is essential.
At present, GenAI applications commonly face high costs and performance challenges, keeping them a step away from mass adoption. Yet, if a product is already perfected, the prime investment window may have closed. It's these very shortcomings that harbor investment potential. Helping these applications overcome their final hurdles can yield 10x returns.
Special thanks to the team at PaleBlueDot AI for their invaluable insight.
