AI DataCenters & Training Runs 2025 : How much Energy, how much Money?

AI datacenters : nuclear energy to train frontier AIs

There has been a lot of noise in the mainstream press about the massive buildout of AI datacenters across the world… with the latest being the audacious announcement by the Trump administration of the $500 billion Project Stargate. The primary market signal of this buildout is “CapEx,” or capital expenditures, reported by the top 5 AI companies: OpenAI, Microsoft, Amazon, Google, Meta (Apple & Tesla remain dark horses). Microsoft alone has famously committed $250 billion (yes, a quarter of a trillion dollars) to AI datacenter building over the next 3 years (well, 5 years, but we’re on year 2 already). The others are not far behind. Meta (Facebook) may even be exceeding that number.

And, shockingly, that’s just the physical build-out of the AI datacenters. Construction, chips, server racks, cooling towers, and (omg) power plants. You read that right. Both Microsoft and Google are building / funding the construction of entire electrical power stations, formerly the exclusive domain of governments and public utilities, and signing contracts guaranteeing purchase of 100% of the power generated by those plants for the next 20 years. In the case of Google, this entails the construction of seven new nuclear power plants. (side quest: Google building 7 new nuclear power plants for the sole use of AI)

Are you starting to sense the seriousness of this moment?

In case you were wondering… yes, the term “unprecedented” applies. Read here my comparison of corporate AGI development budgets with the two most audacious human endeavors ever: the Manhattan Project (nuclear bomb) and the Apollo Program (putting a man on the moon):

(Note: there is a blurred line here between the Top 5 tech companies and the Top 5 AI companies. Big Tech is making massive investments in AI startups, and in parallel, many of them have their own Frontier models. While Google and Meta have their own models, Amazon & Microsoft have largely decided to focus on large-scale venture investment and massive infrastructure buildout. This table tries to break down the cross-pollination a bit.)

[TL;DR] 

AI DataCenter / Training Run Costs

assumptions:

  • most frontier models are trained in AI datacenters,
    which contain within them racks upon racks…
    >5,000 nVidia H100 GPUs networked in parallel
    to perform the training runs.
  • a training run generally lasts ~100 days,
    with those 5,000+ chips running at full tilt the entire time.
    (some runs are as short as 6 days, some as long as 300 days…
    but 100 is a rough average)

top line conclusions

here’s the takeaway:

  • one H100 running for 24 hours
    uses roughly the same amount of electricity as
    a single US citizen uses in a day.
    (700W x 24h ≈ 17 kWh)
  • thus, as a general heuristic,
    1 H100 = 1 human
    (in terms of electrical power consumption)
    using that metric…
    .
  • …a modern AI training run
    takes the same amount of energy
    as it does to power a city of 500,000
    (1/2 a million) ppl in America
    for one day
  • or, more pragmatically,
    the electricity that it takes
    to power a village of 5,000 people
    running full tilt 24/7 for 100 days
    .
  • total electrical cost (~$3 million)
    for the training run is slim,
    compared to:
  • total hardware cost ($180 million)
    to build the datacenter,
    and that’s just for the chips
  • that cost estimate does not include:
    land, construction costs, gas, electricity,
    cooling, network, interconnect, staff,
    or operations
    .
  • operational costs (aka inference), or
    “how much it costs to answer user prompts & queries”
    will be the topic of a separate post

 


Questions that led to this post:

  1. I see compute measured in FLOP for AI training runs,
    …such as 5e25 FLOP
    (the reported computation required to train
    Google’s Gemini Ultra Frontier AI, unveiled in q4.2023).
    …what does FLOPs mean?
  2. is it power x time?
    can you achieve 5e25 FLOP in 100 days with X compute,
    or in a single day by deploying 100X the compute power?
  3. what is the level of infrastructure required to achieve 5e25 FLOP?
    1. as in, how many nVidia H100s would it take?
    2. …running at 100% capacity
      across how many 24/7 days?
  4. how much electricity would that take?
    1. see also: How much of the World’s Energy is AI actually using?
  5. how much would the datacenter cost?
    1. in terms of CapEx (chips + infrastructure + construction)?
  6. how does 5e25 FLOP compare with the world’s fastest supercomputers (TOP500)?
  7. How long would it take the world’s fastest supercomputer to generate 5e25 FLOP?
  8. are the AI datacenters faster than the supercomputers?

AI DataCenter & Frontier Model Training Runs:

the down and dirty details:

  • The total training compute for the final training run of Gemini Ultra,
    likely the most compute-intensive model to date, is estimated at 5e25 FLOP
    .
  • 5e25 FLOP = 5×10^25 FLOPs
    written out:
    50,000,000,000,000,000,000,000,000 operations
    (a total count of operations, not a per-second rate)
  • a FLOP is a FLoating-point OPeration…
    basically, one simple mathematical calculation.
    add “per second” (FLOP/s, or FLOPS)
    and you get a rate: how many of those calculations
    the specified hardware
    (be it a CPU, a GPU, or an entire datacenter)
    performs in one second of time
  • FLOPs (the count) are a measure of computational work (effort),
    not energy or power specifically:
    total FLOP = rate (FLOP/s) × time (seconds)
  • efficiency is how much power (electricity) is needed
    by the GPU + machine / network architecture
    in order to achieve a fixed number of FLOPs
  • less efficient datacenters
    can achieve the same number of FLOPs
    as state-of-the-art datacenters,
    by using more electricity, more time,
    or both
  • compute does not scale linearly:
    Linear scaling assumes perfect parallelism.
    In reality, scaling efficiency can drop due to bottlenecks in:
    • hardware,
    • memory bandwidth, or
    • communication between nodes
  • An NVIDIA H100 GPU delivers
    approximately 1 petaflop per second (1 PFLOP/s)
    of FP16 compute performance
  • 1 PFLOP = 10^15 FLOP
    • FP16 means “16-bit floating point”
    • the number of bits indicates the level of accuracy of each calculation
    • supercomputer benchmarks (TOP500) run at 64-bit (FP64); much scientific code uses 32-bit (FP32)
    • super-efficient AI models often “dumb down” to 8-bit or even 4-bit
    • this costs some accuracy,
      but gives massive improvements in speed and efficiency
  • that H100 rating is a rate: FLOP per second
  • there are 86,400 seconds in a day (24h × 60 min/h × 60 s/min)
  • a single H100 chip
    would take 1,600 years
    to complete a 5e25 FLOP
    AI training run
    .
  • alternatively,
    an array of 6,000 H100s
    networked in parallel
    could complete the compute / training
    in 100 days
    .
  • or 600,000 H100s
    to complete the training in a single day
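
Here’s a minimal back-of-the-envelope sketch of that arithmetic in Python. The 1 PFLOP/s per-chip rate is the rounded FP16 figure from above, and the utilization factor is an assumption; real training runs typically sustain well below 100% of peak, so treat these as order-of-magnitude numbers.

```python
# Back-of-the-envelope: how long does a 5e25 FLOP training run take?
# Assumptions: ~1 PFLOP/s of FP16 per H100, plus a utilization factor
# (real-world efficiency is usually well below 100% of peak).

TOTAL_FLOP = 5e25            # estimated training compute for Gemini Ultra
FLOPS_PER_H100 = 1e15        # ~1 PFLOP/s FP16 per chip (rounded)
UTILIZATION = 1.0            # idealized; in practice often far lower
SECONDS_PER_DAY = 86_400

def training_days(num_gpus: int) -> float:
    """Days to reach TOTAL_FLOP with num_gpus chips running 24/7."""
    flop_per_day = num_gpus * FLOPS_PER_H100 * UTILIZATION * SECONDS_PER_DAY
    return TOTAL_FLOP / flop_per_day

print(f"1 H100:        {training_days(1) / 365:,.0f} years")   # ~1,600 years
print(f"6,000 H100s:   {training_days(6_000):,.0f} days")      # ~100 days
print(f"600,000 H100s: {training_days(600_000):,.1f} days")    # ~1 day
```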

HARDWARE COST

  • each H100 = $30,000 MSRP
  • 6,000 H100s == $180 million
  • that is chips only.
  • does not include land, construction costs, gas, electricity, cooling, network, interconnect, staff, or operations
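
The chips-only CapEx, as a one-line sketch (the $30,000 figure is the approximate MSRP quoted above; actual negotiated prices vary):

```python
# Chips-only capital cost (excludes land, construction, power, cooling,
# networking, interconnect, staff, and operations).
H100_PRICE_USD = 30_000      # approximate MSRP; street prices vary
NUM_GPUS = 6_000

chip_capex = NUM_GPUS * H100_PRICE_USD
print(f"GPU hardware cost: ${chip_capex:,}")   # $180,000,000
```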

POWER / ELECTRICITY / ENERGY

  • each H100 consumes 700W of power
  • AI datacenters typically house
    between 5,000 and 20,000 H100s
    within a single facility,
    so:
  • 6,000 H100s = ~4.2 MW (megawatts) of continuous power draw

DEFINING ENERGY

  • the unit of energy is the joule
  • 1 Watt = 1 joule/second (1 J/s)
  • in practical terms, electricity is metered in watt-hours (Wh)
  • Energy (in watt-hours) = Power (in watts) × Time (in hours)
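
The same relationships, written out as formulas (the joule conversion is included just for reference):

```latex
1\,\mathrm{W} = 1\,\mathrm{J/s},
\qquad
E\,[\mathrm{Wh}] = P\,[\mathrm{W}] \times t\,[\mathrm{h}],
\qquad
1\,\mathrm{MWh} = 10^{6}\,\mathrm{Wh} = 3.6\times 10^{9}\,\mathrm{J}
```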

GPU power consumption

  • For 6,000 GPUs each consuming 700 W:
    • Total power = 6,000 × 700W
    • = 4,200,000W
    • = 4.2MW of continuous power draw
      (i.e., 4.2 MWh of energy consumed every hour)
      .
  • so, for a 100 day training run of 6,000 H100s, that’s:
    • 100d x 24h x 4.2MW
  • which is:
    • ≈10,000 MWh (10,080 MWh, to be exact)
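
A quick sketch of that energy math, assuming every chip draws its full 700 W for the whole run (this ignores cooling and other facility overhead, so real-world consumption would be higher):

```python
# Energy consumed by the GPUs alone over a 100-day training run.
# Ignores cooling / facility overhead, so actual draw is higher.
WATTS_PER_H100 = 700
NUM_GPUS = 6_000
RUN_DAYS = 100

power_mw = NUM_GPUS * WATTS_PER_H100 / 1e6      # 4.2 MW continuous
energy_mwh = power_mw * RUN_DAYS * 24           # MW x hours = MWh
print(f"Power draw:  {power_mw:.1f} MW")        # 4.2 MW
print(f"Energy used: {energy_mwh:,.0f} MWh")    # 10,080 MWh
```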

COST OF ELECTRICITY

how much does using 10,000 MWh cost at commercial rates?
how much at residential rates?

As of December 2024, the average electricity rate in California is approximately

30 cents per kWh
(roughly same for commercial and residential)

10,000 MWh = 10,000,000 kWh

so: ~$3 MILLION USD
to power a single AI datacenter’s training run
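
And the cost calculation as a sketch (the $0.30/kWh rate is the California retail figure quoted above; large industrial consumers often negotiate much lower rates):

```python
# Electricity cost of the training run at retail California rates.
# $0.30/kWh is an approximation; industrial rates can be much lower.
ENERGY_MWH = 10_080
PRICE_PER_KWH = 0.30

cost = ENERGY_MWH * 1_000 * PRICE_PER_KWH    # MWh -> kWh, then x $/kWh
print(f"Electricity cost: ${cost:,.0f}")     # ~$3,024,000 (~$3M)
```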

POWER PLANT GENERATING CAPACITY

Q: what is the output of an average power plant?

A:

  • 1,000 MW for traditional power generation
  • 100 MW for renewables

Power Plant Type : Rated Power Output (MW = megaWatts) : Total Annual Energy Output (MWh = megaWatt-hours)

  • Coal : 500–1,500 MW : 4,500,000–13,000,000 MWh
  • Natural Gas (Combined Cycle) : 400–1,300 MW : 3,500,000–11,500,000 MWh
  • Nuclear : 1,000–1,600 MW : 9,000,000–14,000,000 MWh
  • Hydropower : 50–3,000 MW (varies widely) : 500,000–25,000,000+ MWh
  • Wind Farm (onshore) : 100–300 MW : 250,000–750,000 MWh (variable)
  • Solar Farm (utility-scale) : 50–250 MW : 100,000–500,000 MWh (dependent on sunlight)
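
The rated-power and annual-energy columns are linked by a capacity factor: the fraction of the year a plant actually runs at its rated output. A rough sketch, using ballpark capacity factors that are my assumptions rather than figures from the table:

```python
# Annual energy = rated power x 8,760 hours/year x capacity factor.
# Capacity factors below are rough, typical ballpark values.
HOURS_PER_YEAR = 8_760

def annual_mwh(rated_mw: float, capacity_factor: float) -> float:
    return rated_mw * HOURS_PER_YEAR * capacity_factor

print(f"Nuclear, 1,200 MW @ 90%: {annual_mwh(1_200, 0.90):,.0f} MWh/yr")  # ~9.5M
print(f"Coal,    1,000 MW @ 60%: {annual_mwh(1_000, 0.60):,.0f} MWh/yr")  # ~5.3M
print(f"Solar,     150 MW @ 25%: {annual_mwh(150, 0.25):,.0f} MWh/yr")    # ~330k
```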

Nuclear Powered AI Datacenters

AI Datacenters : Google to power AI with new nuclear power plants

Google commissions buildout of 7 new nuclear power plants in US, guarantees purchase of 100% of their power output through 2045; electricity generation to be reserved for the sole use of AI models, entities & agents. [Wall Street Journal, Oct 14, 2024]

HUMAN POWER CONSUMPTION

what is the average power consumption of a US city of 100,000 people?

  • average electricity consumption per person in the U.S. is approximately 12,000 kWh per year (12 MWh)
  • 365×24=8,760 hours/year
  • 100,000 ppl × 12,000 kWh
  • = 1,200,000,000 kWh/year
  • = 1.2 TWh / year

so the city on average needs generation of:

1,200,000,000 kWh ÷ 8,760 h ≈ 137,000 kW

≈ 140 MW

…BUT:

that’s average, and it needs to sustain peak load

peak is generally double

so call it:

280 MW

is the standard energy generation capacity
for a city of 100,000 people in America
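
Here’s that city math as a sketch (the 2x peak-to-average ratio is the rough rule of thumb used above, not a measured figure):

```python
# Average and peak generation needed for a US city of 100,000 people.
# Peak-to-average ratio of 2x is a rough rule of thumb.
KWH_PER_PERSON_PER_YEAR = 12_000
POPULATION = 100_000
HOURS_PER_YEAR = 8_760
PEAK_FACTOR = 2.0

annual_kwh = POPULATION * KWH_PER_PERSON_PER_YEAR      # 1.2 billion kWh
average_mw = annual_kwh / HOURS_PER_YEAR / 1_000       # ~137 MW
peak_mw = average_mw * PEAK_FACTOR                     # ~274 MW

print(f"Annual use:    {annual_kwh / 1e9:.1f} TWh")
print(f"Average load:  {average_mw:,.0f} MW")          # ~140 MW
print(f"Peak capacity: {peak_mw:,.0f} MW")             # ~280 MW
```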

AI DATACENTERS VS. HUMAN CITIES

so basically, the
AI datacenter power draw,
while training

is the equivalent of
roughly 5,000 people (a small village)
running full steam

…and it does that for 100 days nonstop

if it were to compress the training run
into a single day,

that would be the power draw of
500,000 people

half a million people

AI CHIP / HUMAN EQUIVALENCIES

intriguingly:

a single nVidia H100 GPU
uses roughly as much electricity in a day
as a single US citizen

…so there is your “human equivalent” in terms of AI brains…

1 human = 1 nVidia H100
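
A quick sanity check of that equivalence, using the 12,000 kWh/yr per-capita figure from above. Note that number counts all US electricity use (residential + commercial + industrial) divided by population; on a residential-only basis, the per-person figure drops to roughly a third of that and the two numbers land very close together:

```python
# Daily electricity: one H100 at full tilt vs. one average American.
# 12,000 kWh/yr is total US consumption per capita (all sectors);
# residential-only use per person is roughly a third of that.
H100_WATTS = 700
KWH_PER_PERSON_PER_YEAR = 12_000

h100_kwh_per_day = H100_WATTS * 24 / 1_000            # 16.8 kWh
person_kwh_per_day = KWH_PER_PERSON_PER_YEAR / 365    # ~32.9 kWh

print(f"H100:   {h100_kwh_per_day:.1f} kWh/day")
print(f"Person: {person_kwh_per_day:.1f} kWh/day")
# Same order of magnitude -> the rough "1 H100 = 1 human" heuristic.
```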

.

and finally:

AI DATACENTERS VS. SUPERCOMPUTERS

  • the world’s fastest supercomputer (TOP500, November 2024) is:
  • El Capitan
    • housed at Lawrence Livermore National Laboratory (USA)
    • powered by AMD silicon:
      • ~11 million total chip cores (CPU + GPU)
      • ~44,000 CPUs: AMD 4th-gen EPYC “Genoa” 24-core @ 1.8 GHz
      • ~44,000 GPUs: AMD Instinct MI300A
    • peak performance: ~3e18 FLOP/s (FP64)
    • sustained performance: ~2e18 FLOP/s (FP64, measured Rmax ≈ 1.7e18)
      .
  • so to do the 5e25 FLOP AI training run,
    it would take El Capitan:
    .
  • ~330 days
    that’s close to a year,
    or ~3x longer than the 6k H100s
    at the AI datacenter
    (see the sketch below)
    .
  • and to answer the question:
    the World’s Fastest Supercomputer actually delivers less than 1/3rd the raw compute of a single medium-to-high-end AI datacenter (granted, its rating is for higher-precision FP64 math, versus the datacenter’s FP16).
    …to which we say: Dayaaaaammm!!!
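
A sketch of that comparison. The ~1.7e18 FLOP/s figure is El Capitan’s reported sustained (Rmax) LINPACK rate; note that this pits FP64 supercomputer throughput against FP16 AI-datacenter throughput, so it’s a raw operations-per-second comparison, not precision-for-precision:

```python
# How long would El Capitan need for a 5e25 FLOP training run,
# and how does its throughput compare to a 6,000-H100 datacenter?
# (FP64 supercomputer rate vs. FP16 GPU rate -- raw ops only.)
TOTAL_FLOP = 5e25
EL_CAPITAN_FLOPS = 1.742e18              # sustained LINPACK (Rmax), FP64
DATACENTER_FLOPS = 6_000 * 1e15          # 6,000 H100s @ ~1 PFLOP/s FP16

el_capitan_days = TOTAL_FLOP / EL_CAPITAN_FLOPS / 86_400
print(f"El Capitan: {el_capitan_days:,.0f} days")                      # ~330 days
print(f"Throughput ratio: {EL_CAPITAN_FLOPS / DATACENTER_FLOPS:.2f}")  # ~0.29, i.e. < 1/3
```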

 

the original unabridged ChatGPT 4o conversation / Q&A transcript that led to this post is available for viewing / augmentation / follow-up:
https://chatgpt.com/share/677308ec-1058-8003-b28f-5e9d9a9d5836

.

.




engine: MidJourney v6.1

prompt: drone photo of AI datacenters stretching onward to an infinite horizon, smoke pouring from the cooling towers, large conduits & pipelines connecting them to a distant city. silhouettes of multiple nuclear power plants dot the horizon.