Transformer Source Code 2017: The Heart of Deep Learning

I’ll admit it: my mind was blown (again, and that’s about a daily occurrence in the lightspeed world of AI design and development) when I heard a professor proclaim: “I’m not sure if you got this vibe: that this is something incredibly amazing… because it’s a generic system… it’s just 4 lines of equations… and you give it data… and it learns to translate… it learns to speak fluent French…” (link: AI masterclass video: Łukasz Kaiser). So he claimed: the Transformer source code, at its heart, is a mere four lines of equations. Well, then.

If that was so, I should be able to take a look at those equations with my own two eyes. So I downloaded the seminal paper: “Attention Is All You Need” (Vaswani, Shazeer, et al., 2017) and scanned it. Urgh. A little above my pay grade. So next up, I did what is, more and more, becoming second nature, and well on its way to replacing my Google reflex: I asked ChatGPT.

GR:

What are the four key algorithms that comprise a modern LLM
based on the paper “Attention is all you need”?

ChatGPT:

The four key algorithms that comprise a modern LLM based on the paper “Attention is all you need” are:

    1. Self-attention mechanism,
    2. Multi-head attention mechanism,
    3. Position-wise feed-forward network,
    4. Layer normalization.
Hm. Okay. So it’s plausible. But not being a PhD AI engineer, I remained somewhat nonplussed. I mean, what I wanted to know, really, was: what exactly were these mysterious equations? Were they as supremely elegant as E = mc², or were they massive confusions of integrals and summations and cryptic math glyphs?

So I did finally revert to Google, my trusty old friend, and I was very pleased with the results. Chief amongst them was this exceptional explainer by Eduardo Muñoz, entitled “Attention is all you need: Discovering the Transformer paper”. It is from that post that I pulled all the screenshots below. My read is that it presents the actual, complete Python source code for building a basic neural net that performs language translation (in this case, English to Spanish… but that’s all in the training data… as we know, ChatGPT speaks and translates across some 100 languages with remarkable fluency).

More than 4 lines of code, to be sure. But, hell, let’s actually count them. I want to see what it takes to build a modern AI. So here it is, the entire codebase. I begin with an outline of the total structure, including code-line counts, and then present the actual source code that comprises each of the modules. Nerd out!

Transformer Source Code: Structure

  1. The Scaled Dot-Product Attention — 14 lines (8 actual)
  2. Multi-head Attention — 50 lines (29 actual)
  3. Positional Encoding — 24 lines (14 actual)
  4. The Encoder
    1. layer — 42 lines (24 actual)
    2. component — 37 lines (17 actual)
  5. The Decoder
    1. layer — 53 lines (30 actual)
    2. component — 41 lines (17 actual)
  6. The Transformer — 62 lines (21 actual)
  7. Training
    1. Custom Loss Function (mask padding tokens) — 8 lines (6 actual)
    2. Adam Optimizer Schedule (variable learning rate) — 13 lines (9 actual)
    3. Main Train Function — 42 lines (26 actual)
  8. Perform the Training on the Dataset — 37 lines (13 actual)
  9. The actual runtime Program:
    1. Predict & Translate — 36 lines (19 actual)

TOTAL CODEBASE — 460 lines (233 actual lines of code)

  • lines of code are the simple totals in the supplied codebase.
  • actual lines are a truer measure, with blank lines & comments stripped out.

Transformer Source Code: 233 Lines that Define AI Reality

Defining the Key Components of the Neural Net Creation Engine

The Scaled Dot-Product Attention — 14 lines (8 actual)

Transformer Source Code: The Scaled Dot-Product Attention
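The screenshots carry the code in the original post; in their place, here and below I give minimal TensorFlow 2 sketches of what each block does, written in the style of the referenced implementation. The class and function names are mine and the line counts won’t match exactly; treat them as illustrations, not Muñoz’s verbatim code. This first one computes the paper’s core equation, Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V, masking out disallowed positions before the softmax.

import tensorflow as tf

def scaled_dot_product_attention(queries, keys, values, mask=None):
    # Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V
    product = tf.matmul(queries, keys, transpose_b=True)
    keys_dim = tf.cast(tf.shape(keys)[-1], tf.float32)
    scaled_product = product / tf.math.sqrt(keys_dim)
    # Push masked (padded or future) positions toward -inf before the softmax
    if mask is not None:
        scaled_product += (mask * -1e9)
    attention_weights = tf.nn.softmax(scaled_product, axis=-1)
    return tf.matmul(attention_weights, values)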

Multi-head Attention — 50 lines (29 actual)

Transformer Source Code: Multi-head Attention
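A sketch of the multi-head wrapper, again following the standard TensorFlow 2 pattern rather than the exact screenshot. It projects queries, keys and values, splits them into n_heads subspaces, runs the scaled_dot_product_attention function from the previous sketch on every head in parallel, then concatenates and projects the result back to d_model.

import tensorflow as tf

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, n_heads):
        super().__init__()
        self.n_heads = n_heads

    def build(self, input_shape):
        self.d_model = input_shape[-1]
        assert self.d_model % self.n_heads == 0
        self.d_head = self.d_model // self.n_heads
        self.query_lin = tf.keras.layers.Dense(self.d_model)
        self.key_lin = tf.keras.layers.Dense(self.d_model)
        self.value_lin = tf.keras.layers.Dense(self.d_model)
        self.final_lin = tf.keras.layers.Dense(self.d_model)

    def split_proj(self, inputs, batch_size):
        # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
        shape = (batch_size, -1, self.n_heads, self.d_head)
        split_inputs = tf.reshape(inputs, shape)
        return tf.transpose(split_inputs, perm=[0, 2, 1, 3])

    def call(self, queries, keys, values, mask):
        batch_size = tf.shape(queries)[0]
        queries = self.split_proj(self.query_lin(queries), batch_size)
        keys = self.split_proj(self.key_lin(keys), batch_size)
        values = self.split_proj(self.value_lin(values), batch_size)
        # Attention runs on all heads in parallel
        attention = scaled_dot_product_attention(queries, keys, values, mask)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.d_model))
        return self.final_lin(concat_attention)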

Positional Encoding — 24 lines (14 actual)

Transformer Source Code: Positional Encoding
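A sketch of the positional-encoding layer: it adds the paper’s sinusoids, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), to the token embeddings. It assumes fixed-length (padded) inputs, so the sequence length is known when the layer runs.

import numpy as np
import tensorflow as tf

class PositionalEncoding(tf.keras.layers.Layer):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000., (2 * (i // 2)) / np.float32(d_model))
        return pos * angle_rates

    def call(self, inputs):
        seq_length = inputs.shape.as_list()[-2]
        d_model = inputs.shape.as_list()[-1]
        angles = self.get_angles(np.arange(seq_length)[:, np.newaxis],
                                 np.arange(d_model)[np.newaxis, :],
                                 d_model)
        angles[:, 0::2] = np.sin(angles[:, 0::2])   # even indices get sine
        angles[:, 1::2] = np.cos(angles[:, 1::2])   # odd indices get cosine
        pos_encoding = angles[np.newaxis, ...]
        return inputs + tf.cast(pos_encoding, tf.float32)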

The Encoder Layer — 42 lines (24 actual)

Transformer Source Code: The Encoder Layer
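A sketch of one encoder layer: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in dropout, a residual connection and layer normalization, which is essentially the list ChatGPT gave above. It builds on the MultiHeadAttention sketch.

import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, FFN_units, n_heads, dropout_rate):
        super().__init__()
        self.FFN_units = FFN_units
        self.n_heads = n_heads
        self.dropout_rate = dropout_rate

    def build(self, input_shape):
        self.d_model = input_shape[-1]
        # Sub-layer 1: multi-head self-attention + residual + layer norm
        self.multi_head_attention = MultiHeadAttention(self.n_heads)
        self.dropout_1 = tf.keras.layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Sub-layer 2: position-wise feed-forward network + residual + layer norm
        self.ffn1_relu = tf.keras.layers.Dense(units=self.FFN_units, activation="relu")
        self.ffn2 = tf.keras.layers.Dense(units=self.d_model)
        self.dropout_2 = tf.keras.layers.Dropout(rate=self.dropout_rate)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs, mask, training):
        attention = self.multi_head_attention(inputs, inputs, inputs, mask)
        attention = self.dropout_1(attention, training=training)
        attention = self.norm_1(attention + inputs)
        outputs = self.ffn1_relu(attention)
        outputs = self.ffn2(outputs)
        outputs = self.dropout_2(outputs, training=training)
        return self.norm_2(outputs + attention)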

The Encoder Component — 37 lines (17 actual)
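A sketch of the full encoder stack: token embedding (scaled by √d_model as in the paper), positional encoding, dropout, then n_layers copies of the encoder layer above.

import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, n_layers, FFN_units, n_heads, dropout_rate,
                 vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = tf.keras.layers.Dropout(rate=dropout_rate)
        self.enc_layers = [EncoderLayer(FFN_units, n_heads, dropout_rate)
                           for _ in range(n_layers)]

    def call(self, inputs, mask, training):
        outputs = self.embedding(inputs)
        # Scale embeddings by sqrt(d_model), as in the paper
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training=training)
        for layer in self.enc_layers:
            outputs = layer(outputs, mask, training=training)
        return outputs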

The Decoder Layer — 53 lines (30 actual)

Deep Learning Source Code: The Decoder Layer
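A sketch of one decoder layer. It has three sub-layers instead of two: masked self-attention over the target tokens, encoder-decoder attention over the encoder outputs, and the position-wise feed-forward network, each with dropout, a residual connection and layer normalization.

import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, FFN_units, n_heads, dropout_rate):
        super().__init__()
        self.FFN_units = FFN_units
        self.n_heads = n_heads
        self.dropout_rate = dropout_rate

    def build(self, input_shape):
        self.d_model = input_shape[-1]
        # Sub-layer 1: masked self-attention over the target sequence
        self.multi_head_attention_1 = MultiHeadAttention(self.n_heads)
        self.dropout_1 = tf.keras.layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Sub-layer 2: encoder-decoder attention over the encoder outputs
        self.multi_head_attention_2 = MultiHeadAttention(self.n_heads)
        self.dropout_2 = tf.keras.layers.Dropout(rate=self.dropout_rate)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        # Sub-layer 3: position-wise feed-forward network
        self.ffn1_relu = tf.keras.layers.Dense(units=self.FFN_units, activation="relu")
        self.ffn2 = tf.keras.layers.Dense(units=self.d_model)
        self.dropout_3 = tf.keras.layers.Dropout(rate=self.dropout_rate)
        self.norm_3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        attention = self.multi_head_attention_1(inputs, inputs, inputs, mask_1)
        attention = self.dropout_1(attention, training=training)
        attention = self.norm_1(attention + inputs)
        attention_2 = self.multi_head_attention_2(attention, enc_outputs,
                                                  enc_outputs, mask_2)
        attention_2 = self.dropout_2(attention_2, training=training)
        attention_2 = self.norm_2(attention_2 + attention)
        outputs = self.ffn1_relu(attention_2)
        outputs = self.ffn2(outputs)
        outputs = self.dropout_3(outputs, training=training)
        return self.norm_3(outputs + attention_2)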

The Decoder Component — 41 lines (17 actual)

Deep Learning Source Code: The Decoder Class
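A sketch of the decoder stack, mirroring the encoder: embedding, positional encoding, dropout, then n_layers decoder layers, each of which also attends to the encoder outputs.

import tensorflow as tf

class Decoder(tf.keras.layers.Layer):
    def __init__(self, n_layers, FFN_units, n_heads, dropout_rate,
                 vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = tf.keras.layers.Dropout(rate=dropout_rate)
        self.dec_layers = [DecoderLayer(FFN_units, n_heads, dropout_rate)
                           for _ in range(n_layers)]

    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training=training)
        for layer in self.dec_layers:
            outputs = layer(outputs, enc_outputs, mask_1, mask_2,
                            training=training)
        return outputs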

The Transformer — 62 lines (21 actual)

The Transformer: Complete Source Code
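A sketch of the full model: an encoder, a decoder, and a final linear projection onto the target vocabulary, plus the two masks that make it work (a padding mask so attention ignores padded positions, and a look-ahead mask so the decoder can’t peek at future tokens).

import tensorflow as tf

class Transformer(tf.keras.Model):
    def __init__(self, vocab_size_enc, vocab_size_dec, d_model,
                 n_layers, FFN_units, n_heads, dropout_rate):
        super().__init__()
        self.encoder = Encoder(n_layers, FFN_units, n_heads, dropout_rate,
                               vocab_size_enc, d_model)
        self.decoder = Decoder(n_layers, FFN_units, n_heads, dropout_rate,
                               vocab_size_dec, d_model)
        self.last_linear = tf.keras.layers.Dense(units=vocab_size_dec)

    def create_padding_mask(self, seq):
        # 1.0 wherever the token id is 0 (padding), broadcastable over attention logits
        mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]

    def create_look_ahead_mask(self, seq):
        # Upper-triangular mask so position i cannot attend to positions after i
        seq_len = tf.shape(seq)[1]
        look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
        return tf.maximum(look_ahead_mask, self.create_padding_mask(seq))

    def call(self, enc_inputs, dec_inputs, training):
        enc_mask = self.create_padding_mask(enc_inputs)
        dec_mask_1 = self.create_look_ahead_mask(dec_inputs)
        dec_mask_2 = self.create_padding_mask(enc_inputs)
        enc_outputs = self.encoder(enc_inputs, enc_mask, training=training)
        dec_outputs = self.decoder(dec_inputs, enc_outputs,
                                   dec_mask_1, dec_mask_2, training=training)
        return self.last_linear(dec_outputs)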

Training the Neural Net

Custom Loss Function (mask padding tokens) — 8 lines (6 actual)
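A sketch of the masked loss: ordinary sparse categorical cross-entropy on the logits, with padding positions (token id 0) zeroed out so they don’t contribute.

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(target, pred):
    # Zero out the loss wherever the target token is padding (id 0)
    mask = tf.math.logical_not(tf.math.equal(target, 0))
    loss_ = loss_object(target, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)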

Adam Optimizer Schedule (variable learning rate) — 13 lines (9 actual)
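A sketch of the paper’s warm-up learning-rate schedule, lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), plugged into Adam with the paper’s beta and epsilon values. The d_model=128 below is just an illustrative small setting.

import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model=128)  # illustrative d_model
optimizer = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)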

Main Train Function — 42 lines (26 actual)

Deep Learning: Main Train Function
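A sketch of the training loop: teacher forcing (the decoder input is the target shifted right by one token), a GradientTape step using the loss_function and optimizer defined above, and running loss/accuracy metrics.

import tensorflow as tf

def main_train(dataset, transformer, n_epochs, print_every=50):
    losses, accuracies = [], []
    for epoch in range(n_epochs):
        # Fresh running metrics for each epoch
        train_loss = tf.keras.metrics.Mean(name="train_loss")
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")
        for (batch, (enc_inputs, targets)) in enumerate(dataset):
            # Teacher forcing: decoder input is the target shifted right by one
            dec_inputs = targets[:, :-1]
            dec_outputs_real = targets[:, 1:]
            with tf.GradientTape() as tape:
                predictions = transformer(enc_inputs, dec_inputs, training=True)
                loss = loss_function(dec_outputs_real, predictions)
            gradients = tape.gradient(loss, transformer.trainable_variables)
            optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
            train_loss(loss)
            train_accuracy(dec_outputs_real, predictions)
            if batch % print_every == 0:
                losses.append(train_loss.result().numpy())
                accuracies.append(train_accuracy.result().numpy())
                print(f"Epoch {epoch + 1} Batch {batch} "
                      f"Loss {losses[-1]:.4f} Accuracy {accuracies[-1]:.4f}")
    return losses, accuracies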

Perform the Training on the Dataset — 37 lines (13 actual)
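A sketch of kicking off training. The names dataset, num_words_inputs and num_words_output are assumptions standing in for the tf.data pipeline and tokenizer vocabulary sizes produced by the data-preparation code, which isn’t reproduced here; the hyperparameters are illustrative small values.

# Illustrative hyperparameters (small settings in the spirit of the tutorial)
D_MODEL = 128
N_LAYERS = 4
FFN_UNITS = 512
N_HEADS = 8
DROPOUT_RATE = 0.1

# num_words_inputs / num_words_output and dataset are assumed to come from
# the (not shown) tokenization and tf.data preparation steps
transformer = Transformer(vocab_size_enc=num_words_inputs,
                          vocab_size_dec=num_words_output,
                          d_model=D_MODEL,
                          n_layers=N_LAYERS,
                          FFN_units=FFN_UNITS,
                          n_heads=N_HEADS,
                          dropout_rate=DROPOUT_RATE)

losses, accuracies = main_train(dataset, transformer, n_epochs=10)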

The actual Program:

Predict & Translate — 36 lines (19 actual: 13 + 6)
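A sketch of greedy decoding at inference time: encode the input sentence, then feed the decoder its own output one token at a time until it emits the end token. The tokenizer_inputs, tokenizer_outputs and MAX_LENGTH names are assumptions standing in for the subword tokenizers and length cap built during preprocessing.

import tensorflow as tf

def predict(inp_sentence, tokenizer_in, tokenizer_out, target_max_len):
    # Tokenize the input and wrap it in start/end token ids
    # (assumed here to be vocab_size and vocab_size + 1)
    inp_tokens = ([tokenizer_in.vocab_size] + tokenizer_in.encode(inp_sentence)
                  + [tokenizer_in.vocab_size + 1])
    enc_input = tf.expand_dims(inp_tokens, axis=0)
    # Start the output with the target-language start token
    output = tf.expand_dims([tokenizer_out.vocab_size], axis=0)
    for _ in range(target_max_len):
        predictions = transformer(enc_input, output, training=False)
        prediction = predictions[:, -1:, :]          # logits for the last position
        predicted_id = tf.cast(tf.argmax(prediction, axis=-1), tf.int32)
        # Stop when the end token is produced
        if predicted_id == tokenizer_out.vocab_size + 1:
            return tf.squeeze(output, axis=0)
        output = tf.concat([output, predicted_id], axis=-1)
    return tf.squeeze(output, axis=0)

def translate(sentence):
    output = predict(sentence, tokenizer_inputs, tokenizer_outputs, MAX_LENGTH).numpy()
    # Drop the special start/end ids before decoding back to text
    return tokenizer_outputs.decode(
        [i for i in output if i < tokenizer_outputs.vocab_size])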


Transformer Code Repository:

https://github.com/edumunozsala/Transformer-NMT


Transformer Source Code: Example I/O

Finally, wouldn’t you like to see this code running, to see it live, in action? Me too. Here are a few example results:

#Show some translations
sentence = "you should pay for it."
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(sentence)
print("Output sentence: {}".format(predicted_sentence))

Input sentence: you should pay for it. 
Output sentence: Deberías pagar por ello.

#Show some translations
sentence = "we have no extra money."
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(sentence)
print("Output sentence: {}".format(predicted_sentence))

Input sentence: we have no extra money. 
Output sentence: No tenemos dinero extra.

#Show some translations
sentence = "This is a problem to deal with."
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(sentence)
print("Output sentence: {}".format(predicted_sentence))

Input sentence: This is a problem to deal with. 
Output sentence: Este problema es un problema con eso.

So there it is. 
The codebase that changed a civilisation.
Now, go enjoy your donuts.