Attention is All You Need : The Beginning of AGI v1.0

Jenny Tang might say: Attention is All You Need

On Monday, June 12, 2017, there was a second shift in the Force. That afternoon, Ashish Vaswani and his colleagues at Google uploaded a 1 megabyte PDF file, and in that moment pre-published a seminal paper, playfully named “Attention is All You Need.” The paper outlined the basic Transformer model, which has been at the heart of much of the innovation in AI since, particularly ChatGPT and the numerous AI.Art engines that have begun to flood our internets with their content.

GPT (as in, GPT-3 and ChatGPT), by the way, stands for Generative Pre-trained Transformer… which directly references this paper.

Now, I knew all along that modern AI code was elegant… as in, reduced to a tight, efficient codebase. But in reading this paper, I just saw the actual code for today’s most advanced AI (want to geek out and stare at the genetic seed of a modern AI? see for yourself). I kid you not:

The core AI Deep Learning codebase, in its entirety, boils down to just 4 equations. The Python code that embodies them is 233 lines that fit quite comfortably on about 4 sheets of 8-1/2×11 paper.

Those algorithms are, according to my query to ChatGPT:

Attention is All You Need : Algorithms that define the Transformer Model
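To give a flavor of how small these pieces really are, here is a minimal sketch of the paper's central equation, scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. This is not the 233-line codebase mentioned above, just one equation written out in plain Python. The function names and the toy numbers are my own assumptions for illustration.

```python
import math

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on lists of lists."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(mixed)
    return output

# Toy input: 2 tokens, 2 dimensions (values are made up for illustration).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

Each output row is a blend of the value vectors, weighted toward whichever key best matches that row's query. A real model runs this on large matrices, many times in parallel, with learned weights rather than hand-picked ones.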

So, given that newsflash, how do we end up with these insanely clever conversational AIs, AIs that we can talk to, ask questions of, and have extended meaningful conversations with? (counterpoint: Chomsky)

The Elegant Recipe for Birthing a Modern AI

Well, the recipe is insanely simple, if a bit pricey:

1. Rent compute power.
…from the great cloud (generally, this would be Microsoft Azure or AWS). And not just a little power. We’re talking full throttle at multiple data centers for a few days, if not weeks. We’re talking multiple supercomputers’ worth of compute.

  • Total cost: roughly $1-10 million USD (per entity)
  • Total compute: 30,000 – 200,000 GPU years
  • Total compute (in human time): 3-21 days
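A quick back-of-the-envelope check shows what those ranges imply when read together. A GPU-year is one GPU running for one year, so squeezing many GPU-years into a few days means running a lot of GPUs at once. The sketch below only does that arithmetic on the figures quoted above; the function name is my own.

```python
DAYS_PER_YEAR = 365

def gpus_needed(gpu_years, wall_clock_days):
    """GPUs that must run in parallel to deliver gpu_years of work in wall_clock_days."""
    return gpu_years * DAYS_PER_YEAR / wall_clock_days

# Corners of the ranges quoted above.
fewest = gpus_needed(30_000, 21)   # least compute, longest run
most = gpus_needed(200_000, 3)     # most compute, shortest run
print(f"{fewest:,.0f} to {most:,.0f} GPUs running at once")
```

Read together, the ranges imply hundreds of thousands of GPUs or more running in parallel, so treat them as loose, order-of-magnitude estimates rather than a precise bill of materials.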

2. Make sure you have enough electricity.
As in, many megawatts’ worth. Yes, this is Frankenstein. Yes, we are riding the lightning. Yes, it takes hella electricity to power all that compute. As in, approximately what it would take to power a town of 100,000 residents. For the duration of its creation.

3. Load up that teeny weeny codebase… all few hundred lines of it.
For comparison: the codebase of a modern smartphone operating system is something on the order of 10-20 million lines of code.

4. Aim it at a big set of training data.
A really big dataset. LOTS of training data. Terabytes. Petabytes, if you can find it. And not just dumb shovel-fed raw data. Preferably Web 3.0-class data. In other words, data with metadata. As in, pictures with descriptive, accurate captions. Articles with sources, links, and summaries. Videos with accurate scene descriptions and transcripts. And so on.

5. Let ’er rip.
By “rip,” I mean, hit the “start” button, which begins an iterative loop of that tiny codebase across the entire set of training data. Sometimes once (in AI terms, one full pass is called an “epoch”), sometimes 2-3 times to get it right.

In total, your code will loop somewhere on the order of 10^24 times (a “1” followed by 24 zeroes… 1,000,000,000,000,000,000,000,000 times… because it actually loops across each token (a token is roughly 4 letters) before moving on to the next one… in other words, it loops an insane number of times).
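Here is a toy sketch of that loop structure: an outer loop over epochs and an inner loop over tokens. It is a counting exercise, not a trainer. The 4-characters-per-token chop is a crude stand-in for a real tokenizer, and the function name and sample text are my own assumptions.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb from the text above

def count_token_steps(corpus, epochs):
    """Count per-token steps when looping over the whole corpus `epochs` times."""
    # Crude stand-in for a real tokenizer: chop the text into 4-character chunks.
    tokens = [corpus[i:i + CHARS_PER_TOKEN]
              for i in range(0, len(corpus), CHARS_PER_TOKEN)]
    steps = 0
    for _ in range(epochs):      # one epoch = one full pass over the data
        for _token in tokens:    # a real trainer would update the network here
            steps += 1
    return len(tokens), steps

n_tokens, steps = count_token_steps("attention is all you need " * 100, epochs=3)
print(n_tokens, "tokens,", steps, "token-steps")
```

The step count grows as tokens times epochs. The astronomically large total quoted above also depends on how much arithmetic happens per token, which scales with the size of the network and which this toy ignores.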

6. Wait for it to finish.
Go on vacation. (For a fortnight, roughly.)

7. Voila! You now have a “baked” neural network.
With deep deep deep language structures and mysterious logic circuits. The whole “model,” as it is called, fits comfortably on a laptop SSD. No internet connection required.
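Why does it fit on a laptop? A baked model is essentially one big file of numbers (its “parameters”), so its size on disk is roughly the number of parameters times the bytes used to store each one. The sketch below does that multiplication. The parameter counts are hypothetical round numbers, and 2 bytes per parameter assumes 16-bit storage.

```python
def model_size_gb(n_parameters, bytes_per_parameter=2):
    """Approximate on-disk size of a model's weights in gigabytes (10^9 bytes)."""
    return n_parameters * bytes_per_parameter / 1e9

# Hypothetical parameter counts, stored as 16-bit (2-byte) numbers.
for n in (1e9, 7e9, 70e9):
    print(f"{n / 1e9:.0f}B parameters -> about {model_size_gb(n):.0f} GB")
```

By this estimate, models up to tens of billions of parameters fit on an ordinary laptop drive, while the very largest ones need more room than that.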

8. Smack a workable user interface on top of it.
Ironically, the codebase of this UI will probably be about 10,000 times the size of the original AI seed codebase. And the (purely human) design & development time of the UI will probably be an order of magnitude longer than the actual time it took to train & birth the AI.

9. Get ready to scale it.
launch it to the public.
cross your fingers.
hope that it doesn’t escape.

Attention is All You Need : TLDR

For completeness, here is the original abstract:

“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.”

Attention is All You Need Authors:

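These are the men (and one woman, thank god!) who made the Deep Learning breakthrough possible: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Most were / are with the “Google Brain” division of Alphabet.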

Full Paper @ arXiv:

If you’re like me and like to see how the sausage is made, here is the actual paper, as originally published, hosted on arXiv:

Oh, just One More Thing…

A picture, hopefully, says a lot more than a thousand words. This is a graphic visualization of a next-word-predicting LLM in action, in both analysis (input) and synthesis (output) modes:
Attention is All You Need - data visualization
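If you'd rather see “analysis” and “synthesis” as code, here is a deliberately tiny stand-in. It is not a Transformer: it swaps in simple word-pair counting (a bigram model) to show the same two modes, first reading text to learn what tends to follow what, then writing by repeatedly picking the most likely next word. The function names and training sentence are my own assumptions.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Analysis: count which word follows which in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, max_words=5):
    """Synthesis: repeatedly append the most likely next word."""
    out = [start]
    while len(out) < max_words and follows[out[-1]]:
        out.append(follows[out[-1]].most_common(1)[0][0])
    return " ".join(out)

model = train_bigrams("attention is all you need and attention is all you need")
sentence = generate(model, "attention")
print(sentence)
```

A real LLM replaces the word-pair table with a Transformer network that looks at the whole preceding context, but the outer loop is the same idea: predict one token, append it, repeat.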