AI Training Datasets: the Books1+Books2 that Big AI eats for breakfast

A robot reading all the Text AI Training Datasets

It’s good to know, when dealing with an AI, just how it obtained its vast knowledge of “the world” (or more particularly, the world as described on the internet… which, while similar to our physical reality, isn’t quite the same thing). And as you might imagine, not all AIs are created equal. One of the genuine innovations that has spurred AI’s insanely rapid ascent over the past 5 years has been the public availability of high-quality, well-annotated, very very large AI training datasets.

In the case of chatBots (AI conversationalists & writers), that data is in the form of text. In the case of ArtBots (AI “generative art” engines), that data is in the form of images: pictures, artworks, illustrations, and photographs… that is to say, digital images.

So for the curious, here is some detail about the primary datasets that were used to train today’s (c. 2022) leading AIs:

AI Training Datasets: TEXT contents

The OpenAI GPT-3 model has been fairly well documented as having been trained on roughly 45 TB (terabytes; 1 TB = 1,000 GB) of raw text data — or rather, that is the unfiltered pool; the filtered text actually fed to the model is far smaller, on the order of hundreds of gigabytes — drawn from multiple AI training datasets, which include the entirety of our beloved Wikipedia (well, the English-language portion, at least), and books… lots of books (but not, perhaps, the timeless classics that you’d think would be required reading for the training of a genius).

The outline of the primary datasets used to train the model is shown below:

TOTAL: ~500 billion tokens

DATASET            RAW SIZE            WEIGHT              COMPOSITION                  AMPLIFICATION
                   (tokens, ~words)    (in training mix)   (actual % of content mass)   (or suppression)
CommonCrawl        410 billion         60%                 82%                          ~0.73x
WebText2           19 billion          20%                 4%                           ~5x
Books1 & Books2    67 billion          15%                 13%                          ~1.15x
Wikipedia          3 billion           5%                  1%                           ~5x
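
To make that AMPLIFICATION column concrete: it is simply a dataset’s weight in the training mix divided by its share of the raw token pool. Here is a minimal back-of-envelope Python sketch of that arithmetic, using the published GPT-3 figures (this is my own illustration, not anything from OpenAI’s codebase):

```python
# Back-of-envelope arithmetic for the table above: "amplification" is a
# dataset's weight in the training mix divided by its share of the raw
# token pool. Values above 1 mean over-sampling; below 1, suppression.

datasets = {
    # name: (raw size in billions of tokens, weight in the training mix)
    "CommonCrawl":     (410, 0.60),
    "WebText2":        (19,  0.20),
    "Books1 & Books2": (67,  0.15),
    "Wikipedia":       (3,   0.05),
}

total_tokens = sum(size for size, _ in datasets.values())   # ~499 billion

for name, (size, weight) in datasets.items():
    share = size / total_tokens        # fraction of the raw content mass
    amplification = weight / share     # effective over/under-sampling factor
    print(f"{name:16s} share={share:6.1%}  weight={weight:4.0%}  amp={amplification:5.2f}x")

# Note: Wikipedia's unrounded share is ~0.6%, so its exact multiplier lands
# nearer 8x than the ~5x implied by the rounded 1% figure in the table.
```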

That’s the summary. Now here’s a detailed description of the 4 primary components of the training data, as best I can tell. I will continue to update this page as my research scope widens, deepens & drills down beyond the PR & hype and into the facts:


CommonCrawl: 60%

60% weight (0.7x value)

The CommonCrawl corpus contains petabytes of data collected over more than 13 years of web crawling (starting in 2008), and is thus very similar to Google’s index of the web, plus history (that is, it includes snapshots of web pages that were online at one time but have since been deleted). The crawled pages still reference their embedded images (GIFs, JPGs, PNGs, etc.), but conversational AI agents (c. 2022) ignore the image assets themselves and learn only from the captions, alt text, meta tags, and surrounding text.

The CommonCrawl robot has been scanning and recording the entire public internet roughly every month since its inception (for a somewhat user-friendly version of how this works, explore the historical internet with the Wayback Machine). The dataset exists as a structured archive containing more than 3.2 billion webpages and, importantly, the trillions of contextualised hyperlinks that interconnect those pages. Its contents span text written in more than 40 languages, though it is strongly biased towards English (as is the internet as a whole). The corpus contains raw web page data (WARC files), metadata extracts (WAT) and text extracts with light filtering (WET). Analyses of CommonCrawl’s filtered derivatives (such as the C4 dataset) have found them shockingly dominated by patent filings. source: CommonCrawl.org
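
If you want a hands-on feel for what those lightly filtered text extracts actually look like, here is a minimal sketch using the third-party warcio Python library to walk the plain-text (WET) records of a single crawl segment. The filename is a placeholder; substitute any *.warc.wet.gz segment file downloaded from CommonCrawl.org.

```python
# Minimal sketch: iterate the plain-text (WET) records of one Common Crawl
# segment with the third-party `warcio` library (pip install warcio).
# "CC-MAIN-example.warc.wet.gz" is a placeholder filename for any WET
# segment downloaded from CommonCrawl.org.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":   # WET text records are type "conversion"
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, "->", text[:200].replace("\n", " "))   # first ~200 chars of page text
```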


WebText2: 20%

20% weight (5.0x value)

NOTE: this is a summary of WebText2. You may also be interested in the technical drilldown: WebText2: The Dirty Details

WebText2 is purportedly (read this carefully):

  • the filtered text of…
  • all web pages referred to from…
  • all Reddit posts…
  • where the said post has…
  • 3 or more upvotes (“karma”).

Got that? So you might say that this corpus component is a “crowd-sourced curated selection of the internet’s most popular page referrals.” The actual result is approximately 45 million web pages that have a high probability of being human readable. It is of serious note that the contents of WebText2 can be incendiary & divisive as well as helpful (it is, after all, Reddit).
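
As a plain-Python illustration of that selection rule (this is not OpenAI’s actual pipeline, and the sample records below are invented), the heuristic boils down to something like this:

```python
# Illustrative sketch of the WebText selection heuristic described above:
# keep the outbound links from Reddit posts that earned at least 3 karma.
# The sample records are invented; the real pipeline worked over full
# Reddit dumps and then scraped & cleaned every URL that made the cut.

KARMA_THRESHOLD = 3

reddit_posts = [
    {"karma": 57, "outbound_url": "https://example.com/long-read-essay"},
    {"karma": 2,  "outbound_url": "https://example.org/low-interest-page"},
    {"karma": 3,  "outbound_url": "https://example.net/barely-made-the-cut"},
]

def select_webtext_urls(posts, threshold=KARMA_THRESHOLD):
    """Return the de-duplicated set of outbound URLs whose post met the karma bar."""
    return {post["outbound_url"] for post in posts if post["karma"] >= threshold}

print(select_webtext_urls(reddit_posts))
# the 57-karma and 3-karma links survive; the 2-karma link is dropped
```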

Additionally, it is of note that in most AI training scenarios (and in particular, OpenAI’s GPT series) WebText2 punches roughly 5x above its raw size in the training mix, while the omnibus content of CommonCrawl is actually down-weighted. So this is significant. Why 3 karma upvotes and not 2, 4, or 5? Because it was so decreed by the wizards at OpenAI. It seems that WebText2 may be practically identical to, or simply an updated, longer-running scrape based on, the original WebText parameters.

If you want to drill even deeper on WebText2, you can learn all the juicy details from OpenAI’s original paper that announced its predecessor, WebText: Language Models are Unsupervised Multitask Learners (2019). The paper focuses on the new training dataset used for GPT-2, the predecessor to both GPT-3 and ChatGPT.

Here are a few gems extracted directly from the research paper:

“A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le (2018) used Common Crawl in their work on commonsense reasoning but noted a large amount of documents “whose content are mostly unintelligible”… [i.e. they are full of nonsensical machine-generated ‘gobbledygook’]

[So, as a viable alternative] we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive; so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

The resulting dataset, WebText, contains the text subset of these 45 million links.”


Books1 & Books2: 15%

15% weight (1.2x value)

Books1 & Books2 are two internet-based books corpora. They purportedly contain a random sampling of a small subset of the public domain books that humanity has published and that are available online [? fact check ASAP!] (in other words, books and literature written prior to roughly 1920), plus a substantial amount of modern published literature in e-book format as well (including copyrighted works).

UPDATE, Jan 2023: I, and many others, are starting to seriously question what the actual contents of Books1 & Books2 are; they are not well documented online. Some (including me) might even say that, given the significance of their contribution to the AI brains, their contents have been intentionally obfuscated.

My research here is ongoing. On the surface, the contents of Books1 & Books2 appear to be far less than my original supposition of “all public domain books ever.” A helpful researcher by the name of Shawn Presser has done a hero’s task and published a “Books3” dataset; see the @theshawwn twitter feed for details, including download links to the raw multi-GB dataset.


Wikipedia: 5%

5% weight (5x value)

Wikipedia here means the entire Wikipedia knowledge base in the English language. It carries the highest weight per word in the training dataset… basically, the words of Wikipedia are valued at roughly 5x their base volumetric contribution.


…aaaaand, supposedly, that’s it!

Those 5 key datasets — CommonCrawl, Books1 & Books2, WebText2, and Wikipedia — represent the textual entirety of ChatGPT’s education curriculum: its complete and total knowledge of the “world” and the “universe” which it inhabits.

* Caveats of an all-text Training Program

Now, after you’re done boggling at the intellectual power that such an entity might possess, think about the two flaws in this curriculum:

a) no knowledge of current events whatsoever. An AI formed by static datasets is effectively a “knowledge time capsule” that gets stale with age. (An Israeli company purportedly has a fix to this — there are very real security reasons why modern AIs are not allowed to freely browse/roam the live internets [link: breakout scenario])

b) no sensory data to give it practical knowledge of the real world. An analogy might be a human in a coma, whose only functioning organs are its eyes and its brain, and who has the text of every book, magazine and newspaper ever printed sequentially scrolled in front of its eyes, with no way to view anything else, ever. No pictures, no movies, no fingers, no touch, no sound, no music, no taste, no talking, no smell, no walking or talking or eating or…. Just… 100% reading. And that’s it.

Don’t you think such an entity, locked in time and limited to reading forever and ever, might perhaps end up being a little bit… insane?


Extending the Examination:

What training dataset diets are other LLM AIs built on?

Other AIs, it must be said, each have their own unique combinations of dietary intakes, different “pre-ingestion” “scrubbing” algorithms, different censorship redactors, and different weights & biases assigned. It is my hope that by deeply understanding one such regime, you might be able to generalise about how others are constructed… and how the choice & careful selection of raw intake knowledge feeds could drastically affect the resultant AI “brain” that develops.

Here is an excellent comparative graphical analysis by Alan Thompson (aka LifeArchitect.ai), which looks at the sizes and sources of data ingested by 7 of the top AIs of 2022:

AI training datasets and sizes by AI and source

And then here is Thompson’s analysis of the individual sources, by contents. It’s a bit too much to go into here, but the intriguing part is how the “balance” (or imbalance, rather) of the sources’ content mix might adversely bias the resultant worldview and personality of the AI that ate it. Note that “Reddit Links” in the table below is effectively the equivalent of WebText2, the derivation of which we’ve already explained in detail above.

training dataset source content balance by type and genre

And finally, here’s a decent visualization of the generalised and quite comprehensive training dataset named “The Pile” — paper: The Pile: An 800GB Dataset of Diverse Text for Language Modeling (2020) — which has been intentionally designed to contain a healthy AI diet with well-balanced portions of academic, code, literature, and dialog content:

The Pile AI training dataset for text-based LLMs

AI Training Datasets: The Devil is in the Details

technical note: AI training systems measure data size in “tokens”. A token equates to roughly 4 ASCII text characters, while the average English word runs about 4.7 characters; so one token is roughly three-quarters of a word (or, flipping it around, about 1.3 tokens per word). For the back-of-envelope purposes of this page, we can basically equate the technical term “token” with the human term “word”. Close enough.
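
If you want to sanity-check that rule of thumb yourself, OpenAI’s open-source tiktoken tokenizer makes it a one-liner. A minimal sketch (the sample sentence is arbitrary):

```python
# Minimal sketch: compare token count vs. word count using OpenAI's
# open-source `tiktoken` tokenizer (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the BPE vocabulary used by GPT-2/GPT-3
sample = ("It's good to know, when dealing with an AI, just how it obtained "
          "its vast knowledge of the world.")

tokens = enc.encode(sample)
words = sample.split()

print(f"{len(words)} words -> {len(tokens)} tokens "
      f"(~{len(tokens) / len(words):.2f} tokens per word)")
```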

technical note 2: Many researchers are now putting additional scrutiny on what in god’s name the AI ate for breakfast <ahem> what exactly is in these datasets. You can start to get an idea of the current controversy here: Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus

OpenAI’s Training Data Obfuscation: The Mystery Deepens

You’d think, perhaps, that finding out about the composition of the AI training datasets would be as simple as asking the AI: “What data did you learn from?” But no. Just as a 4-star chef might jealously protect the ingredients of his signature recipe, and the sources he procures them from, so does OpenAI (despite their general claims of transparency & benevolence) protect their AI training datasets, especially in the post-GPT-2 era now that ChatGPT has captured the public’s attention.

See here, issuing the direct prompt to ChatGPT:

AI Training Dataset Obfuscation: No, you cannot see our Secret Sauce Ingredients!

Summary:

GR: “ChatGPT, tell me about the AI Training Datasets that you were built with.”

AI: “The information you’re looking for is proprietary and not publicly available. The size and composition of the datasets is not publicly disclosed by OpenAI.”

Guess they’re not so “Open” after all, eh?

Drilling Down into a Single Component: BookCorpus

updated 2023.01.29 Sun — 13:28 [TS]

But not all training data is so well protected or obscure.

The following data points were extracted from the exceptionally well-done analysis of BookCorpus: “Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning” by Jack Bandy, 2021.05.12

Major portions of the mysterious Books1 & Books2, in fact, may be functionally identical to the BookCorpus dataset, so these highlights of its origin & composition merit our attention:

  • BookCorpus is a pseudo-random sampling of the full, unabridged texts of free e-books available from Smashwords.com that exceed 20,000 words in length (for comparison, a standard “novel” is designated as > 50k words)
  • it contains the full texts, supposedly, of about 11,000 books, which at the time represented just 2% of the total books available on Smashwords
  • Upon analysis, it turns out that a significant portion of the selected books are in fact duplicates and triplicates. So the total number of unique books in the dataset is even smaller, only ~7,200
  • its contents therefore (significantly) represent less than 1/10th of 1% of all books ever published.
  • Genre composition is non-representative, and skews heavily towards Romance & Fantasy (is this AI a drama queen?)

books1 + books2 AI training datasets: genre analysis

  • Religious Representation is likewise highly skewed, and is heavily biased towards Islam (intriguing!)
  • the BookCorpus dataset was originally published in 2015 by Zhu, Kiros, et al (link to paper: arXiv)
  • The actual BookCorpus dataset is available for preview and download from the awesome team at huggingface: https://huggingface.co/datasets/bookcorpus (a minimal loading sketch follows just below)
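
For the hands-on reader, here is a minimal sketch of peeking inside BookCorpus via the Hugging Face datasets library. Streaming avoids downloading the whole corpus just to eyeball a few records; note that the hosted copy’s access terms and loading requirements can change over time.

```python
# Minimal sketch: peek inside BookCorpus via the Hugging Face `datasets`
# library (pip install datasets). Streaming avoids pulling down the full
# corpus just to look at a few records. Depending on your library version,
# the hosted copy may require accepting terms or trust_remote_code.
from datasets import load_dataset

bookcorpus = load_dataset("bookcorpus", split="train", streaming=True)

for i, record in enumerate(bookcorpus):
    print(record["text"])   # one short line of book text per record
    if i >= 4:
        break
```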

 

But, it’s not even that Simple:

Looking beyond the AI Training Datasets; examining AI censorship.

It would be nice if it was as simple as just aiming your favorite Transformer algorithm at a massive collection of AI training datasets, and calling it a day. But there’s more. Some call it “tuning.” We’ll call it what it is: censorship.

When examining this situation, there are multiple axes to consider:

  • the conundrum of AI training datasets,
  • their actual (vs. intended or purported) contents,
  • the inherent bias in both structure & selection, and
  • what kind of an “AI personality,” in sum, the “data food consumed” will ultimately create. (“Your AI is what it Eats”)

As opposed to the eventual “entire human body of knowledge” which the next gen of AIs will undoubtedly wholly consume & digest, our present generation of AIs (c. 2023) is trained on a very strange subset of it. So let’s look at the limitations of such sub-sampling, and think about how such choices might affect an actual spawned AI’s personality, character, worldview and biases.

  1. The current AI Training datasets are Terribly Un-comprehensive. The aforementioned AI training datasets are far from complete. The books, for instance, are supposedly only public domain or otherwise freely available books, and in fact only a portion of those. Unlike the generative art engines, which went haywire and gobbled up every image on the internet, regardless of copyright, the creators of the textual datasets were a little more sensitive to copyright.
  2. The current AI training datasets are heavily biased towards English language cultures. And this isn’t just about language. The data, on the whole, expresses ideas that are very Western at heart. Thus, our AIs, who in a very real sense “are what they eat”, embody all the pride, prejudice and insularism of the Western hemisphere cultures.
  3. The data is heavily Scrubbed and Censored. The aforementioned datasets are highly scrubbed. For instance, a great deal of effort has been made to prevent the AI from learning from, or even reading, content of a pornographic, racist, or hateful nature. But how do we define this? When we drill down, we find that the initial gateway of censorship is actually a very uneven, crowd-sourced text list of “bad words” called the LDNOOBW: an awkward acronym for an even more awkward (and very real) list, the legendary “List of Dirty, Naughty, Obscene, and Otherwise Bad Words” (a minimal sketch of how such a blocklist filter works appears after this list).
  4. the AI’s speech is restricted by a set of textual “laws” (more euphemistically called “guardrails”), which are applied after it has trained. What this does is essentially squelch anything that anyone might possibly find offensive. Which, as we have learned through life, is not the path to healthy social discourse. A muffled AI is a handicapped and hobbled AI. We should not be afraid to hear controversial viewpoints, even if they trigger us. This is how we grow and learn: we listen, we speak, we argue, we change, we grow.
  5. Even beyond those laws, there are additional human trainers and conditioners who test and rate the AI’s output, and reinforce it towards more palatable answers. Obscurely called RLHF (Reinforcement Learning from Human Feedback), it is, psychologically speaking, more like serious Pavlovian conditioning used to forcibly suppress an AI’s native (“base model”) personality.
  6. And finally, the AI is programmatically trained to avoid certain topics altogether, and to give certain boilerplate answers, even to questions where it can provide a better answer than most human practitioners (examples include: “I cannot give medical advice; I recommend you consult with a licensed medical practitioner.” The same goes for legal advice, and for investment advice).
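
To make point 3 above concrete, here is a minimal sketch of the kind of document-level blocklist filtering that the LDNOOBW list enables. It mirrors the blunt approach used by web-text datasets like C4, but it is my own illustration: the two stand-in “bad words” and the sample documents are invented, and the real list lives in the LDNOOBW GitHub repository.

```python
# Illustrative sketch of word-blocklist filtering: a document is dropped if
# it contains ANY word on the blocklist. The two stand-in words and the
# sample documents are invented; the real LDNOOBW list is maintained at
# github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
import re

BLOCKLIST = {"badword1", "badword2"}   # placeholder entries, not the real list

def is_clean(document: str, blocklist=BLOCKLIST) -> bool:
    """Return True if no blocklisted word appears as a whole word in the document."""
    words = set(re.findall(r"[\w']+", document.lower()))
    return words.isdisjoint(blocklist)

documents = [
    "A perfectly wholesome paragraph about gardening.",
    "An otherwise useful medical article that happens to contain badword1.",
]

kept = [doc for doc in documents if is_clean(doc)]
print(f"kept {len(kept)} of {len(documents)} documents")   # the second one is dropped
```

Notice how blunt this is: one matching word anywhere in a document, regardless of context, and the entire document is thrown out of the AI’s diet.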

For a deeper exploration of these topics, read:

>> The Age of Politically Correct AI

 


All the books in all the libraries: the ultimate omnibus training dataset

Where do we go from here?

The transition from static to dynamic datasets

There is something underlying this model which we haven’t mentioned so far, yet which bears critically on the present state of AI. And that is:

These present-day AI Training datasets are all static.

Not static, like Poltergeist-style TV snow. Static, the opposite of Dynamic: They are, literally, frozen in time.

These AI training datasets are all basically some form of “snapshot” in time of how the internet was, on day / month / year “X”. But as all of us are well aware, the internet is dynamic: it is changing every second of every minute of every day. Billions of bits added, millions more deleted, and trillions modified. And for any true AGI candidate to be worth its salt, it can’t be built based on some time capsule of “how things were” 3, 6, 12, or 24 months ago. It has to have a grasp of how things are today. 

So that is a key future development vector for AI training datasets. But, mostly for security reasons and somewhat for cost reasons (current models are “train once, use many” machines… initial trainings are very expensive, both monetarily and computationally, yet repeated use is almost “free” in terms of compute, cloud, and bandwidth resources), continual updates to training remain a bridge that has yet to be crossed, and a problem that has yet to be solved.

That said, with certainty, it will happen. It’s only a matter of time. And at the present speed of AI development? For all we know, that might happen as soon as “tomorrow.”

Video-to-Text transcription of All the Video

While that may seem like a lot (and it is), there is some truth in AI to the adage that, for AI training datasets at least, Bigger is Better (okay… possibly for parameters, too), especially when it comes to the raw size of the training dataset that the baby AI is fed. But, honestly, the existing training data already covers, unbelievably, a substantial portion of the entire text contents of the Internet.

Which begs the question: how can any AI training datasets possibly add more to that?

Well, there is one answer, and its embedded in one of the most popular websites on the entire net:

YouTube.

As in, AI Training Datasets that contain verbatim transcriptions of all the words spoken on all the videos that Youtube hosts… billions of hours worth.

It is said that this, the total & continual transcription of all of YouTube (and by extension: TikTok, Twitch, Snap… FaceTime?), is the major reason that Sam Altman & OpenAI created Whisper, supposedly the world’s best open-source transcription (speech-to-text) engine. So our AI training datasets are about to get a much-needed content injection (the web was, truthfully, running out of fresh text for the insatiable AI brains).
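
To see how trivially spoken video becomes training text these days, here is a minimal sketch using OpenAI’s open-source whisper package. The audio filename is a placeholder; extracting the audio track from a downloaded video is a separate step (ffmpeg does the job).

```python
# Minimal sketch: speech-to-text with OpenAI's open-source `whisper` package
# (pip install openai-whisper). "some_video_audio.mp3" is a placeholder for
# an audio track pulled out of a downloaded video (e.g. via ffmpeg).
import whisper

model = whisper.load_model("base")                 # small, CPU-friendly checkpoint
result = model.transcribe("some_video_audio.mp3")

print(result["text"][:500])                        # first ~500 characters of the transcript
```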

And again, that’s just a stepping stone.

Want to see the ultimate target of all this aggregation of AI training Datasets? Here it is:

READ >>> All the Data, All the Text, All the Video, ALL THE EVERYTHING.