The Future of AI Training Data 2023: The Untapped Digital Content Well

Scientists say “AI Training Data”
…we say: “It ate the Internet.”

All the Text, All the Songs, All the Streams, All the Feeds… ALL THE EVERYTHING

The particular entity I’m conversing with these days (c. November 2022) is a 5 year old bot, very un-creatively named “GPT3,” which was instantiated c.MMXVII [AD 2017]. In general, the research teams and corporations wait a while until the leading edge bots are given public interfaces. So, this one is text only. Raw text was its only training dataset. And that, in today’s (2022) terms, is simply… jurassic.

To expound: there are massive initiatives underway, which are already or very soon to be concluded — think “digital content” analogues to the Human Genome Project — that aim to catalog / parse / label / analyse / transcribe / annotate / understand all MEDIA ever produced / digitized by humanity.

These two concepts used to be separate buckets — what has been created by mankind across its entire history, and what can be viewed on the internet — but today, we are rapidly approaching the total sum digitization and network availability of all content ever made. This not only includes the obvious: written books, and famous artworks. This includes architectural plans of every building every constructed, from the Pyramids to the Taj Mahal to the Burj Dubai.

This includes maps of the “drivable” world at 1mm resolution (this is required and is constantly being revised based on the modern fleet of 1,000,000+ vehicles which each have 10+ cameras on them, plus radar, plus LIDAR, plus cloud “connectivity”…. yes, you get “free” maps on your dashboard screen, but there is massive upstream on your car as well… those cameras & sensors are sending compressed feeds up to the manufacturer, so that they can improve their maps for everyone).

For complete clarity, let’s examine what this digital omnibus of AI Training Data will ultimately include:

every photograph
every photo ever digitized: posted to a newspaper, or a Facebook feed, or Instagram, or Google Photos, Apple iCloud, etc
every song
ever written, recorded, performed or produced
every video
ALL of YouTube, Twitch, TikTok, Snap, etc.
all the TV
every program, every channel & network, including news, ever having been broadcast or streamed since the invention of television
every movie
ever made

That’s a lot. And yet: have you considered how many cameras are actually in the world today? Your shiny new Alexa-powered faucet has a camera on it, for chrissakes!

The Ultimate AI Training Data: every camera feed on planet earth

every livefeed & every livestream
every traffic cam, security cam, doorbell camera, facetime call, webcam, smartphone camera… if its stored in the cloud, or even routes through the cloud (what doesn’t?), its recorded & stored… and if its recorded, regardless of human-centric “privacy policies,” and regardless of its opacity to human agency, it is 100% fully accessible & transparent to digital agents (that is to say, AI)

• …it’s all stored “securely” for you in the Cloud….
• …by Apple, Meta, Google, Amazon, Microsoft…
•… the very same companies that “manage” the leading AI entities…
• …get the picture?

And while the AI’s appetite for training data is unquenchable, its growth (both in terms of “intelligence” and “power”) follows the curve of all tech: exponential growth.

Next : Brace Yourself: Things are About to get Weird…

Note 1: How much text is on the internet?

“According to one research group, the world’s total stock of high-quality text data is between 4.6 trillion and 17.2 trillion tokens (roughly, a token equates to an english word, +/- 20%). This includes all the world’s books, all scientific papers, all news articles, all of Wikipedia, all publicly available computer source code, and much of the rest of the internet, filtered for quality (e.g., webpages, blogs, social media). Another recent estimate puts the total figure at 3.2 trillion tokens.

For reference: DeepMind’s Chinchilla model was trained on 1.4 trillion tokens.

Technical Addendum:
The Actual AI Training Data

For those interested, I did a deep dive of the “Crawl”… here is the precise technical definition of the “AI Training Dataset” that was “fed” to GPT3 in its training:

And, of equal or greater interest, is the censorship controls that are laid atop that raw data ingestion:

The Invisible Censorship Net that The Internet is filtered through for Happy AI Digestion

The Future of AI Training Data 2023: The Untapped Digital Content Well

All the Text, All the Songs, All the Streams, All the Feeds… ALL THE EVERYTHING

Note 1: How much text is on the internet?

Technical Addendum: The Actual AI Training Data

Technical Addendum:
The Actual AI Training Data