Introduction
Grokking is a phenomenon in neural networks where, long after a model has overfit its training data, validation accuracy sometimes suddenly begins to rise from chance level toward perfect generalization. This literature review surveys the key papers in this area.
Papers
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
This paper introduced the concept of "grokking," where a neural network's generalization performance improves from random chance to perfect accuracy well after the point of overfitting. The authors show that smaller datasets require more optimization steps to reach generalization—plausibly because with fewer examples the model needs more iterations before it stops fitting noise, whereas a larger dataset exposes more of the underlying pattern and makes generalization easier. Among the regularization and optimization interventions tested, weight decay toward the origin was the most effective at inducing earlier generalization. The model was a decoder-only transformer with 2 layers, width 128, and 4 attention heads, trained for roughly 4 × 10^5 optimization steps. The datasets were small algorithmic ones consisting of equations over a binary operation, of the form a ∘ b = c.
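The binary-operation datasets described above can be sketched as follows. This is a minimal illustration of the a ∘ b = c structure with modular addition as the operation; the paper's exact tokenization and operations differ, and `p`, `train_frac`, and the function name are illustrative choices.

```python
import itertools
import random

def make_dataset(p=97, train_frac=0.5, seed=0):
    """All equations a + b = c (mod p), shuffled and split into
    a training set and a validation set."""
    triples = [(a, b, (a + b) % p)
               for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(triples)
    cut = int(train_frac * len(triples))
    return triples[:cut], triples[cut:]

train, val = make_dataset()
print(len(train), len(val))
```

Shrinking `train_frac` reproduces the paper's key manipulation: smaller training fractions delay generalization, lengthening the gap between memorization and grokking.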
Tracing the Representation Geometry of Language Models from Pretraining to Post-training
How do language models learn facts? Dynamics, curricula and hallucinations
A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks
Predicting Grokking Long Before it Happens: A Look into the Loss Landscape of Models Which Grok
Explaining Grokking Through Circuit Efficiency
Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics
Deep Networks Always Grok and Here is Why
Towards Uncovering How Large Language Model Works: An Explainability Perspective
Towards Understanding Grokking: An Effective Theory of Representation Learning
Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets
Grokking as the Transition from Lazy to Rich Training Dynamics
The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon
Unifying Grokking and Double Descent
This paper argues that grokking and double descent are driven by the same mechanism: the network learns different kinds of patterns in the data at different speeds. The authors distinguish three pattern types: 1) heuristic patterns that are fast to learn and generalize well, 2) overfitting patterns that are fast to learn but generalize poorly, and 3) slow patterns that are slow to learn but generalize well. The study also demonstrates model-wise grokking, showing that grokking can occur as model size varies (e.g., changing the embedding dimension), analogous to the usual epoch-wise grokking.
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
Omnigrok: Grokking Beyond Algorithmic Data
Grokking of Hierarchical Structure in Vanilla Transformers
This paper demonstrates grokking on datasets with hierarchical structure rather than purely algorithmic ones. A non-hierarchical rule is a shallow surface pattern that requires no deeper understanding, whereas a hierarchical rule requires grasping the structure of the sentence. Transformers exhibit an inverted U-shaped relationship between depth and generalization. The study introduces a "tree-structuredness" metric as a means to predict the optimal model depth for achieving structural grokking. The findings provide evidence that vanilla transformers, with sufficient training, can discover and exploit hierarchical sentence structure.
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Exploring Grokking: Experimental and Mechanistic Investigations
The Complexity Dynamics of Grokking
Grokking Explained: A Statistical Phenomenon
Grokking vs. Learning: Same Features, Different Encodings
Grokking at the Edge of Numerical Stability
Deep Grokking: Would Deep Neural Networks Generalize Better?
Progress Measures for Grokking via Mechanistic Interpretability
For the task of modular addition, this paper shows that the network's activations are periodic. Once the network groks, it essentially implements a function built from trigonometric identities: the trained weights are well approximated by sines and cosines at a few key frequencies, so the learned computation is sparse in the Fourier domain. The authors introduce continuous progress measures, restricted loss and excluded loss, to explain grokking, and identify three phases of training: memorization (the key frequencies are unused), circuit formation (the model learns to use the key frequencies), and cleanup (the model completes the task using only the key frequencies).
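The trigonometric-identity algorithm can be sketched directly: summing cos(w(a + b − c)) over a few frequencies w = 2πk/p peaks exactly at c = (a + b) mod p. This is a toy reconstruction of the mechanism, not the trained network; the specific frequencies and the prime below are illustrative, not the paper's.

```python
import numpy as np

def modadd_via_fourier(a, b, p=113, key_freqs=(1, 5, 17)):
    """Predict (a + b) mod p using only cosines at a few key frequencies,
    mirroring the circuit the paper reverse-engineers."""
    c = np.arange(p)
    logits = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc);
        # the network assembles this from cos/sin embeddings of a and b.
        logits += np.cos(w * (a + b - c))
    # Every cosine equals 1 exactly when c = (a + b) mod p, so the
    # correct residue gets the strictly largest logit.
    return int(np.argmax(logits))

print(modadd_via_fourier(50, 90))  # (50 + 90) mod 113 = 27
```

The sum over frequencies is why constructive interference at the correct answer dominates: any single frequency would already peak there, but multiple frequencies sharpen the peak against all other residues.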
Grokking Modular Arithmetic
Interpreting Grokked Transformers in Complex Modular Arithmetic
Bridging Lottery Ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?