Attention Is All You Need

Google Brain & University of Toronto

Author

Ashish Vaswani, et al.

Year

2017

Reading Date

June 2026

Difficulty

8/10

Status

Completed

Read Time

45 min

Description

The foundational paper that introduced the Transformer architecture, replacing recurrent layers with multi-headed self-attention.

Context

I read this paper during my deep dive into NLP architectures. It's the bedrock of modern LLMs, fundamentally shifting how sequential data is processed by allowing for massive parallelization.

Why I Picked This Paper

I started reading this paper because I wanted to understand how modern LLMs actually work. Every model I was studying—GPT, Llama, Gemini—kept referencing Transformers, so I decided to go back to the original source instead of relying on secondary explanations.

What I Knew Before Reading

I Understood

Embeddings

Recurrent Neural Networks (RNNs)

Convolutional Neural Networks (CNNs)

I Did Not Understand

Self Attention

Multi-Head Attention

Positional Encoding

Why recurrence disappears

My Reading Notes

The paper proposes a radical simplification: dispense with recurrence entirely and rely solely on an attention mechanism to draw global dependencies between input and output.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V

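Instead of processing tokens sequentially (where step t depends on step t-1), the Transformer projects every token into a query, a key, and a value, then compares all queries against all keys in a single matrix product. Every position is therefore processed at the same time.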
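To make the formula concrete, here is a minimal sketch of scaled dot-product attention in plain Python (standard library only). The helper names `softmax` and `attention` and the toy input `X` are illustrative choices, not taken from the paper, and the learned projections that would normally produce Q, K, and V are left out for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Returns (output, weights); weights[i][j] is how strongly
    query i attends to key j.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    d_v = len(V[0])
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Toy self-attention: Q, K, and V are all the same 3-token sequence.
# (Assumption: the learned projection matrices are omitted.)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

No loop here depends on the result of a previous position: each row of `weights` can be computed independently, which is what makes the operation easy to parallelize as a matrix product.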

Things That Clicked

The 'lightbulb' moment was realizing that Attention isn't just a weighting mechanism; it behaves like a differentiable database lookup. Each Query is scored against every Key, and the result is a blend of the Values, weighted by how well their Keys matched.
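The lookup analogy can be checked with a small sketch. The function `soft_lookup` and the toy keys and values below are hypothetical, chosen only for illustration: with one-hot keys, a weak query returns a blend of all values, while a strongly scaled query behaves almost like an exact dictionary lookup.

```python
import math

def soft_lookup(query, keys, values):
    """Attention for a single query: a similarity-weighted blend of values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

# Three "database rows": one-hot keys with scalar payloads (toy data).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# A weak match on the first key blends all three values (about 17.9).
soft = soft_lookup([1.0, 0.0, 0.0], keys, values)

# A strong match on the first key approaches a hard lookup (about 10.0).
sharp = soft_lookup([50.0, 0.0, 0.0], keys, values)

print(soft, sharp)
```

Because the result is a smooth function of the query and keys, gradients can flow through the "lookup", which a hard dictionary access would not allow.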

Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, the softmax averaging would blur these distinct patterns together.
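A rough sketch of the multi-head idea, in plain Python: project the input into several smaller subspaces, run attention in each one, concatenate the results, and apply an output projection. The random matrices below stand in for learned weights, and names such as `multi_head_attention` and `random_matrix` are assumptions made for this example rather than anything defined in the paper.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections into a d_k-dimensional subspace.
        W_q, W_k, W_v = (random_matrix(d_model, d_k, rng) for _ in range(3))
        heads.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads along the feature axis, then mix them with W_o.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    W_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, W_o)

rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # 4 tokens, d_model = 8
Y = multi_head_attention(X, num_heads=2, rng=rng)
print(len(Y), len(Y[0]))
```

Each head computes its own attention weights, so one head can focus on nearby tokens while another tracks a distant dependency; the output keeps the same shape as the input.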

Things I Struggled With

🤔

I initially struggled to intuitively grasp Positional Encoding. Since self-attention processes all tokens simultaneously and treats them as an unordered set, the model has no built-in notion of sequence order. Adding sine and cosine functions of different frequencies to the embeddings seemed like black magic at first.
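Writing the encoding out helped demystify it. The sketch below follows the sinusoidal scheme from the paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine; the function name `positional_encoding` and the sizes in the usage lines are illustrative assumptions.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position vectors."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even dimensions, so i / d_model equals 2k / d_model.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=16)

# Low dimensions oscillate quickly with position, high dimensions slowly,
# so every position gets a distinct pattern across the vector.
print([round(v, 3) for v in pe[1][:4]])
print([round(v, 3) for v in pe[2][:4]])
```

These vectors are added to the token embeddings before the first layer, so two identical tokens at different positions end up with different inputs.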

Experiments I Tried

[1]

Built a toy self-attention mechanism in PyTorch from scratch to visualize the attention matrix.

[2]

Replaced an LSTM layer in a sentiment analysis model with a single Transformer block.

My Biggest Takeaways

Recurrence is fundamentally a bottleneck for parallelization.
Attention is all you need—literally. The architecture is surprisingly simple once you strip away the math notation.

Questions I'm Still Exploring

How do models handle contexts longer than the sequence lengths seen during training, and how does the positional scheme (e.g., RoPE vs. absolute embeddings) affect that?
Why does the original paper use 8 attention heads in its base model? Is there a theoretical bound on how many heads are useful?

Personal Reflection

"Reading this paper felt like finding the missing puzzle piece in my understanding of modern AI. It's rare to read an academic paper that is so immediately intuitive yet profoundly impactful. I now feel much more comfortable dissecting modern LLM architectures."