There has been a lot of noise in the mainstream press about the massive buildout of AI datacenters across the world, the latest being the audacious announcement by the Trump presidency of the $500 billion Project Stargate. The primary market signal of this buildout is “CapEx,” or capital expenditures, reported by the top 5 AI companies: OpenAI, Microsoft, Amazon, Google, and Meta (Apple & Tesla remain dark horses). Microsoft alone has famously committed $250 billion (yes, a quarter of a trillion dollars) to AI datacenter building over the next 3 years (well, 5 years, but we’re already on year 2). The others are not far behind. Meta (Facebook) may even be exceeding that number.
And, shockingly, that’s just the physical build-out of the AI datacenters: construction, chips, server racks, cooling towers, and (omg) power plants. You read that right. Both Microsoft and Google are building or funding the construction of entire electrical power stations, formerly the exclusive domain of governments and public utilities, and signing contracts to guarantee purchase of 100% of the power generated by those plants for the next 20 years. In the case of Google, this entails the construction of seven new nuclear power plants. (side quest: Google building 7 new nuclear power plants for the sole use of AI)
Are you starting to sense the seriousness of this moment?
In case you were wondering… yes, the term “unprecedented” applies. Read here my comparison of corporate AGI development budgets with the two most audacious human endeavors ever: the Manhattan Project (nuclear bomb) and the Apollo Program (putting a man on the moon):
(Note: there is a blurred line here between the Top 5 tech companies and the Top 5 AI companies. Big Tech is making massive investments in AI startups, and in parallel, many of them have their own frontier models. While Google and Meta have their own models, Amazon & Microsoft have largely decided to focus on large-scale venture investment and massive infrastructure buildout. This table tries to break down the cross-pollination a bit:)
[TL;DR]
AI DataCenter / Training Run Costs
assumptions:
- most of the frontier models utilize AI datacenters, which contain within them racks upon racks of >5,000 nVidia H100 GPUs networked in parallel to perform their training runs
- a training run generally lasts ~100 days, with those 5,000+ chips running at full tilt the entire time (some runs are as short as 6 days, some as long as 300 days, but 100 is a rough average)
top line conclusions
here’s the takeaway:
- one H100 running for 24 hours uses the same amount of electricity as a single US citizen uses in a day (700W × 24h)
- thus, as a general heuristic, 1 H100 = 1 human (in terms of electrical power consumption)
- using that metric… a modern AI training run takes the same amount of energy as it takes to power a city of 500,000 (half a million) people in America for one day
- or, more pragmatically, the electricity it takes to power a village of 5,000 people, running full tilt 24/7, for 100 days
- the total electricity cost (~$3 million) for the training run is slim compared to:
- the total hardware cost ($180 million) to build the datacenter, and that’s just for the chips
- that cost estimate does not include land, construction costs, gas, electricity, cooling, network, interconnect, staff, or operations…
- operational costs (aka inference), or “how much it costs to answer user prompts & queries,” will be the topic of a separate post
Questions that led to this post:
- I see compute measured in FLOP for AI training runs, such as 5e25 FLOP (the reported computation required to train Google’s Gemini Ultra frontier model, announced in late 2023). What does FLOP actually mean?
- is it power × time? can you achieve 5e25 FLOP in 100 days with X compute, or in a single day by deploying 100X the compute power?
- what is the level of infrastructure required to achieve 5e25 FLOP?
  - as in, how many nVidia H100s would it take?
  - running at 100% capacity across how many 24/7 days?
  - how much electricity would that take?
  - how much would the datacenter cost, in terms of CapEx (chips + infrastructure + construction)?
- how does 5e25 FLOP compare with the world’s fastest supercomputers (TOP500)?
  - how long would it take the world’s fastest supercomputer to perform 5e25 FLOP?
  - are the AI datacenters faster than the supercomputers?
AI DataCenter & Frontier Model Training Runs:
the down and dirty details:
- The total training compute for the final training run of Gemini Ultra, likely the most compute-intensive model to date, is estimated at 5e25 FLOP
- 5e25 FLOP = 5×10^25 floating-point operations; written out: 50,000,000,000,000,000,000,000,000 operations (in total, for the whole run)
- a careful distinction: FLOP (or FLOPs) is a count of FLoating-point OPerations, i.e. how many simple mathematical calculations were performed; FLOPS (or FLOP/s) is a rate, how many of those operations the specified hardware (be it a CPU, a GPU, or an entire datacenter) performs in one second
- FLOP is therefore a measure of computational work (effort), not energy or power
- efficiency is how much power (electricity) is needed by the GPU + machine / network architecture in order to achieve a fixed number of FLOP
- less efficient datacenters can achieve the same number of FLOP as state-of-the-art datacenters by using more electricity, more time, or both
- compute does not scale linearly: linear scaling assumes perfect parallelism. In reality, scaling efficiency can drop due to bottlenecks in:
  - hardware,
  - memory bandwidth, or
  - communication between nodes
  (the time-estimate sketch further below treats this as a simple utilization factor)
- An NVIDIA H100 GPU delivers approximately 1 petaflop per second (1 PFLOP/s) of FP16 compute performance
- 1 PFLOP = 10^15 FLOP
- FP16 means “16-bit floating point”
  - the number of bits indicates the precision of each calculation
  - supercomputers are typically benchmarked at 64-bit double precision (FP64)
  - super-efficient AI models often “dumb down” to 8-bit or even 4-bit
  - this gives up some accuracy, but yields massive improvements in speed and efficiency
- that H100 rating is a rate: FLOP per second
- there are 86,400 seconds in a day (24 h × 60 min/h × 60 s/min)
- a single H100 chip would take ~1,600 years to complete a 5e25 FLOP AI training run
- alternatively, an array of 6,000 H100s networked in parallel could complete that training compute in ~100 days
- or 600,000 H100s could complete the training in a single day
HARDWARE COST
- each H100 = $30,000 MSRP
- 6,000 H100s = $180 million
- that is chips only
- it does not include land, construction costs, gas, electricity, cooling, network, interconnect, staff, or operations
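The chips-only bill, as a two-line sanity check (using the rough $30,000 MSRP figure above, not an actual quote):

```python
# Chips-only CapEx for the hypothetical 6,000-GPU cluster (MSRP, no volume discounts)
H100_MSRP_USD = 30_000
NUM_GPUS = 6_000
print(f"chips-only cost: ${H100_MSRP_USD * NUM_GPUS:,.0f}")  # -> $180,000,000
```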
POWER / ELECTRICITY / ENERGY
- each H100 consumes 700W of power
- AI datacenters typically house between 5,000 and 20,000 H100s within a single facility
- so: 6,000 H100s ≈ 4.2 MW (megawatts) of power draw
DEFINING ENERGY
- the unit of energy is the joule
- 1 watt = 1 joule per second (1 J/s)
- in practical terms, we use watt-hours (Wh)
- Energy (in watt-hours) = Power (in watts) × Time (in hours)
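A worked example of that formula, using the single-chip figures already quoted in this post (700 W per H100, running for 24 hours):

```python
# Energy (Wh) = Power (W) x Time (h), applied to a single H100 for one day
power_w = 700          # one H100 under load
hours = 24
energy_wh = power_w * hours
print(f"{energy_wh:,} Wh = {energy_wh / 1000:.1f} kWh per H100-day")  # 16,800 Wh = 16.8 kWh
```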
GPU power consumption
- For 6,000 GPUs each consuming 700 W:
  - Total power = 6,000 × 700 W
  - = 4,200,000 W
  - = 4.2 MW of continuous power draw
- so, for a 100-day training run of 6,000 H100s, that’s:
  - 100 d × 24 h × 4.2 MW
  - ≈ 10,000 MWh (10,080 MWh, to be exact)
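The same Energy = Power × Time formula, scaled up to the assumed 6,000-GPU cluster over a 100-day run:

```python
# Cluster-level energy for the 100-day training run (assumes 700 W/GPU, 6,000 GPUs)
num_gpus = 6_000
watts_per_gpu = 700
hours = 100 * 24                                   # 100 days of 24/7 operation

power_mw = num_gpus * watts_per_gpu / 1e6          # continuous draw, in megawatts
energy_mwh = power_mw * hours                      # MWh = MW x hours
print(f"power draw: {power_mw} MW")                # 4.2 MW
print(f"energy used: {energy_mwh:,.0f} MWh")       # 10,080 MWh (~10,000 MWh)
```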
COST OF ELECTRICITY
how much does using 10,000 MWh cost at commercial rates?
how much at residential rates?
As of December 2024, the average electricity rate in California is approximately
30 cents per kWh
(roughly the same for commercial and residential)
10,000 MWh = 10,000,000 kWh
so: ~$3 MILLION USD
to power a single AI datacenter’s training run
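And the resulting utility bill, at the ~$0.30/kWh California rate quoted above:

```python
# Electricity cost of the training run at ~$0.30/kWh (California average, Dec 2024)
energy_mwh = 10_000                       # from the calculation above (rounded)
rate_per_kwh = 0.30
cost = energy_mwh * 1_000 * rate_per_kwh  # 1 MWh = 1,000 kWh
print(f"electricity cost: ${cost:,.0f}")  # -> $3,000,000
```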
POWER PLANT GENERATING CAPACITY
Q: what is the output of an average power plant?
A:
- 1,000 MW for traditional power generation
- 100 MW for renewables
| Power Plant Type | Rated Power Output | Total Annual Energy Output |
| --- | --- | --- |
| Coal | 500–1,500 MW | 4,500,000–13,000,000 MWh |
| Natural Gas (Combined Cycle) | 400–1,300 MW | 3,500,000–11,500,000 MWh |
| Nuclear | 1,000–1,600 MW | 9,000,000–14,000,000 MWh |
| Hydropower | 50–3,000 MW | 500,000–25,000,000+ MWh |
| Wind Farm (onshore) | 100–300 MW | 250,000–750,000 MWh |
| Solar Farm (utility-scale) | 50–250 MW | 100,000–500,000 MWh |
Nuclear Powered AI Datacenters
Google commissions buildout of 7 new nuclear power plants in US, guarantees purchase of 100% of their power output through 2045; electricity generation to be reserved for the sole use of AI models, entities & agents. [Wall Street Journal, Oct 14, 2024]
HUMAN POWER CONSUMPTION
what is the average power consumption of a US city of 100,000 people?
- average electricity consumption per person in the U.S. is approximately 12,000 kWh per year (12 MWh)
- 365 × 24 = 8,760 hours/year
- 100,000 ppl × 12,000 kWh
- = 1,200,000,000 kWh/year
- = 1.2 TWh / year
so the city on average needs generation of:
1.2 TWh ÷ 8,760 hours ≈ 140 MW
…BUT:
that’s the average, and the grid needs to sustain peak load
peak is generally double the average
so call it:
280 MW
as the standard power generation capacity
for a city of 100,000 people in America
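Here’s that city-scale arithmetic as a sketch. The 12,000 kWh/year per-capita figure and the “peak ≈ 2× average” rule of thumb are the rough assumptions used in this post, not utility-grade data:

```python
# Average and peak generation needed for a US city of 100,000 people
population = 100_000
kwh_per_person_per_year = 12_000         # rough US per-capita electricity use
hours_per_year = 365 * 24                # 8,760

annual_kwh = population * kwh_per_person_per_year   # 1.2 billion kWh = 1.2 TWh
average_mw = annual_kwh / hours_per_year / 1_000     # kW -> MW
peak_mw = 2 * average_mw                              # rule of thumb: peak ~ 2x average
print(f"average load: ~{average_mw:,.0f} MW")         # ~137 MW (call it 140 MW)
print(f"peak capacity needed: ~{peak_mw:,.0f} MW")    # ~274 MW (call it 280 MW)
```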
AI DATACENTERS VS. HUMAN CITIES
so basically, the
AI datacenter’s power draw
while training
is the equivalent of
~5,000 U.S. residents
running full steam
…and it does that for 100 days nonstop
if it were to compress the training run
into a single day,
that would be the power draw of
500,000 people
half a million people
AI CHIP / HUMAN EQUIVALENCIES
intriguingly:
a single nVidia H100 GPU
uses the same amount of electrical power
as a US citizen in a single day
…so there is your “human equivalent” in terms of AI brains…
1 human = 1 nVidia H100
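Here’s that chip-to-human bookkeeping as a sketch. It leans entirely on this post’s own heuristic (1 H100 running 24/7 ≈ 1 person’s daily electricity use), so treat the outputs as narrative math rather than a rigorous estimate:

```python
# The post's heuristic: 1 H100 (700 W, 24/7) ~ 1 person's daily electricity use.
# Under that bookkeeping, chip-days convert directly into person-days.
num_gpus = 6_000
run_days = 100

chip_days = num_gpus * run_days                    # 600,000 chip-days of compute
print(f"{chip_days:,} chip-days")
print(f"= a village of {num_gpus:,} people powered for {run_days} days")
print(f"= a city of ~{chip_days:,} people powered for a single day")
# (the post rounds this to "half a million," since the TL;DR assumes a 5,000-chip cluster)
```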
and finally:
AI DATACENTERS VS. SUPERCOMPUTERS
- the world’s fastest supercomputer is:
  - El Capitan
  - housed at Lawrence Livermore National Laboratory (USA)
  - powered by AMD silicon:
    - 11 million total chip cores (CPU + GPU)
    - ~44,000 CPUs: AMD 4th-gen EPYC “Genoa”, 24 cores @ 1.8 GHz
    - ~44,000 GPUs: AMD Instinct MI300A
  - peak performance: ~2.7e18 FLOP/s (FP64)
  - sustained performance (HPL benchmark): ~1.7e18 FLOP/s (FP64)
- so to do the 5e25 FLOP AI training run, it would take El Capitan:
  - ~330 days
  - that’s roughly a year, or about 3x longer than the 6,000 H100s at the AI datacenter
- and to answer the question: the world’s fastest supercomputer actually harnesses less than one-third the raw compute of a single medium- to high-end AI datacenter
…to which we say: Dayaaaaammm!!!
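A quick sanity check on that ~330-day figure, assuming El Capitan’s sustained HPL throughput of roughly 1.74e18 FLOP/s. One caveat worth keeping in mind: the supercomputer benchmark is 64-bit double precision (FP64), while the H100 figure used throughout this post is FP16, so this is a deliberately rough, apples-to-oranges comparison.

```python
# How long would El Capitan need to churn through a 5e25 FLOP training run?
TOTAL_FLOP = 5e25
EL_CAPITAN_SUSTAINED_FLOPS = 1.742e18    # ~HPL Rmax, FP64 (Nov 2024 TOP500 list)
H100_CLUSTER_FLOPS = 6_000 * 1e15        # the 6,000-GPU AI datacenter, FP16
SECONDS_PER_DAY = 86_400

days_supercomputer = TOTAL_FLOP / EL_CAPITAN_SUSTAINED_FLOPS / SECONDS_PER_DAY
days_datacenter = TOTAL_FLOP / H100_CLUSTER_FLOPS / SECONDS_PER_DAY
print(f"El Capitan:      ~{days_supercomputer:,.0f} days")             # ~332 days
print(f"6,000-H100 farm: ~{days_datacenter:,.0f} days")                # ~96 days
print(f"ratio: ~{days_supercomputer / days_datacenter:.1f}x slower")   # ~3.4x
```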
the original unabridged ChatGPT 4o conversation / Q&A transcript that led to this post is available for viewing / augmentation / follow-up:
• https://chatgpt.com/share/677308ec-1058-8003-b28f-5e9d9a9d5836
Related Posts:
- How much Power is AI consuming globally?
- How does your iPhone compare to the world’s fastest supercomputers?
- What use are Supercomputers at all in the Age of AI Datacenters?
engine: MidJourney v6.1
prompt: drone photo of AI datacenters stretching onward to an infinite horizon, smoke pouring from the cooling towers, large conduits & pipelines connecting them to a distant city. silhouettes of multiple nuclear power plants dot the horizon.