Attention Is All You Need
Google Brain & University of Toronto
Author
Ashish Vaswani, et al.
Year
2017
Reading Date
June 2026
Category
Machine Learning
Difficulty
8/10
Status
Completed
Read Time
45 min
Description
The foundational paper that introduced the Transformer architecture, replacing recurrent layers with multi-headed self-attention.
Context
I read this paper during my deep dive into NLP architectures. It's the bedrock of modern LLMs, fundamentally shifting how sequential data is processed by allowing for massive parallelization.
Why I Picked This Paper
I started reading this paper because I wanted to understand how modern LLMs actually work. Every model I was studying—GPT, Llama, Gemini—kept referencing Transformers, so I decided to go back to the original source instead of relying on secondary explanations.
What I Knew Before Reading
I Understood
I Did Not Understand
My Reading Notes
The paper proposes a radical simplification: dispense with recurrence entirely and rely solely on an attention mechanism to draw global dependencies between input and output.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Instead of processing tokens sequentially (where step t depends on step t-1), the Transformer computes attention between the queries, keys, and values of every position in the sequence at once.
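The formula above can be sketched in a few lines of NumPy. This is a minimal illustration under toy assumptions, not the paper's reference implementation: the function names, the random inputs, and the shapes (4 tokens, d_k = d_v = 8) are all made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    # One matrix product scores every query against every key at once.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy inputs (hypothetical sizes, random values).
rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Note that there is no loop over positions: all pairwise scores come from a single matrix product, which is what makes the computation easy to parallelize.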
Things That Clicked
The 'lightbulb' moment was realizing that attention isn't just a weighting mechanism; it's a differentiable database lookup. Each query is scored against every key, and the softmax of those scores decides how much of each value ends up in the result.
Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. A single attention head would average this out.
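To make the "different representation subspaces" idea concrete, here is a rough NumPy sketch of multi-head self-attention. It assumes random, untrained projection matrices and toy sizes (d_model = 16, 4 heads); the names `multi_head_self_attention` and `split_heads` are chosen for this example. The paper's base model uses 8 heads with d_model = 512.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model). All weight matrices: (d_model, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head computes its own (seq_len, seq_len) attention matrix.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

# Toy setup: random inputs and random (untrained) projections.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

Because every head has its own slice of the projections, each one produces a separate attention pattern over the sequence, whereas a single head would have to blend them into one set of weights.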
Things I Struggled With
I initially struggled to intuitively grasp Positional Encoding. Self-attention on its own treats the input as an unordered set of tokens, so the model has no built-in notion of sequence order. Injecting sine and cosine functions of different frequencies to mark each position seemed like black magic at first.
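Writing the encoding out helps demystify it. The sketch below follows the paper's definition, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), using only the standard library. The function name and the toy sizes (50 positions, d_model = 8) are assumptions for illustration, and d_model is assumed to be even.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table as a list of lists (d_model assumed even)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index (written 2i in the paper's notation).
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # even dimensions use sine
            pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
print(pe[0])                          # position 0: sin(0) = 0, cos(0) = 1 in every pair
print([round(v, 3) for v in pe[1]])   # position 1: low dimensions change fastest
```

Each position gets a distinct vector, which is added to the token embedding so that otherwise identical tokens at different positions look different to the attention layers.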
Experiments I Tried
Built a toy self-attention mechanism in PyTorch from scratch to visualize the attention matrix.
Replaced an LSTM layer in a sentiment analysis model with a single Transformer block.
My Biggest Takeaways
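Recurrence is fundamentally a bottleneck for parallelization: each step has to wait for the previous one, so the work within a sequence cannot be spread across positions.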
Attention is all you need—literally. The architecture is surprisingly simple once you strip away the math notation.
Questions I'm Still Exploring
How do models handle contexts longer than their trained positional embeddings (e.g., RoPE vs absolute)?
Why does the original paper use 8 heads specifically? Is there a theoretical justification, or was it an empirical choice?
Personal Reflection
"Reading this paper felt like finding the missing puzzle piece in my understanding of modern AI. It's rare to read an academic paper that is so immediately intuitive yet profoundly impactful. I now feel much more comfortable dissecting modern LLM architectures."