Introduction
Grokking is a fascinating phenomenon in deep learning where neural networks suddenly transition from memorization to generalization, often long after achieving perfect training accuracy. This literature review explores the key papers and findings in this emerging area of research.
Foundational Work
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
This seminal work introduced the concept of "grokking," in which a network's test performance improves from chance level to near-perfect generalization long after it has overfit the training set. The paper reports that smaller training datasets require more optimization steps to reach generalization, plausibly because fewer examples give the model more opportunity to fit noise and fewer patterns from which to generalize. Among the interventions studied, weight decay toward the origin was the most effective at producing earlier generalization. The model was a decoder-only transformer with 2 layers, width 128, and 4 attention heads, trained for roughly 4 × 10^5 optimization steps. The datasets consisted of equations over binary operations of the form a ∘ b = c.
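As a concrete illustration of this setup, here is a minimal sketch of constructing such a dataset; the modulus, train fraction, and function names are illustrative rather than the paper's exact configuration:

```python
import itertools
import random

def make_modular_dataset(p=97, train_frac=0.5, seed=0, op=None):
    """Enumerate all equations a ∘ b = c over Z_p and split them into train/test.

    By default the operation is addition mod p; each example is returned as an
    (a, b, c) triple, with tokenization left to the model code.
    """
    op = op or (lambda a, b: (a + b) % p)
    pairs = list(itertools.product(range(p), repeat=2))
    random.Random(seed).shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    examples = [(a, b, op(a, b)) for a, b in pairs]
    return examples[:n_train], examples[n_train:]

# Smaller train fractions are the regime where delayed generalization is most pronounced.
train, test = make_modular_dataset(p=97, train_frac=0.3)
```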
Understanding Grokking Mechanisms
A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks
This paper connects the Lottery Ticket Hypothesis with grokking, exploring how grokking arises from the competition between a dense subnetwork, which initially dominates but generalizes poorly, and a sparse subnetwork that takes over after the grokking phase. The study highlights that the grokking phase transition corresponds to targeted norm growth in specific neurons, leading to the emergence of a sparse subnetwork. The authors speculate that the principles observed in grokking could extend to large language models, where targeted norm growth may facilitate emergent behaviors.
Predicting Grokking Long Before it Happens: A Look into the Loss Landscape of Models Which Grok
This paper proposes a method to predict grokking early in training by analyzing the loss curve during the initial epochs. Specific oscillatory patterns in the early loss indicate whether a model will eventually grok, allowing computational savings by stopping training early when grokking is unlikely. The study introduces spectral signatures, derived from the Fourier transform of the training loss curve, to quantify these oscillations; the low-frequency components are shown to be predictive of grokking.
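As an illustration of the general idea, here is a sketch that quantifies how much of the early loss curve's spectral power sits at low frequencies; the function name, the keep_frac parameter, and any thresholding procedure are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np

def low_frequency_power(loss_curve, keep_frac=0.1):
    """Fraction of spectral power in the lowest `keep_frac` of frequencies
    of the (mean-removed) early training loss curve."""
    x = np.asarray(loss_curve, dtype=float)
    x = x - x.mean()                      # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    k = max(1, int(keep_frac * len(power)))
    return power[:k].sum() / power.sum()

# Usage: log the training loss for the first few hundred steps, then compare the
# low-frequency share against a threshold tuned on runs that are known to grok.
early_losses = np.random.rand(512)        # placeholder for a real logged loss curve
score = low_frequency_power(early_losses)
```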
Explaining Grokking Through Circuit Efficiency
This work explains that grokking occurs due to the existence of two types of circuits: a memorizing circuit that learns quickly but generalizes poorly, and a generalizing circuit that learns more slowly but is more efficient in terms of parameter norm. The efficiency of the memorizing circuit decreases as the training dataset grows, while the efficiency of the generalizing circuit remains constant. The study introduces two novel behaviors: ungrokking and semi-grokking. Ungrokking occurs when a network that has successfully grokked is further trained on a smaller dataset, causing it to regress to memorization and lose generalization; semi-grokking is delayed generalization to only partial test accuracy, arising at dataset sizes where the two circuits are comparably efficient.
Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics
This paper focuses on understanding training dynamics and how they relate to the grokking phenomenon. The research provides insights into the latent states that models traverse during training and how these affect generalization behavior.
Theoretical Perspectives
Deep Networks Always Grok and Here is Why
This work provides theoretical foundations for understanding grokking. The paper proposes a new local complexity measure which assesses the density of "linear regions" in the DNN's input-output mapping. The paper discusses a phase transition in the linear regions of the DNN during training. After this transition, the linear regions migrate away from training samples and towards the decision boundary. The authors introduce the concept of "delayed robustness," where a DNN becomes robust to adversarial examples long after it has achieved generalization.
Towards Uncovering How Large Language Model Works: An Explainability Perspective
This paper examines grokking from an explainability perspective, providing insights into how large language models internalize patterns during the grokking process.
Towards Understanding Grokking: An Effective Theory of Representation Learning
The study identifies four distinct learning phases: comprehension, grokking, memorization, and confusion. These phases illustrate how models transition through different states of learning and generalization, with the grokking phase being particularly significant for delayed generalization. The concept of a "Goldilocks zone" is introduced, where representation learning occurs optimally between the extremes of memorization and confusion.
Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
This paper identifies a dichotomy between early and late phase implicit biases in neural network training. In the early phase, training is dominated by kernel predictors due to large initialization, which focuses on overfitting the training data. In contrast, the late phase bias shifts towards min-norm or max-margin predictors, which generalize better. The paper proves that a sharp transition from memorization to generalization occurs due to the shift from the early phase kernel regime to the late phase rich regime.
Corrupted Data and Memorization
To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets
This paper investigates the balance between memorization and generalization in neural networks trained on datasets with corrupted labels. It highlights that networks can memorize incorrect examples while simultaneously learning the underlying rules, achieving high generalization performance. A significant finding is that neurons responsible for memorization can be explicitly identified and pruned, leading to perfect generalization. The study emphasizes the effectiveness of regularization methods such as weight decay, dropout, and BatchNorm in helping the network ignore corrupted data during training.
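A minimal sketch of what identifying and pruning memorization neurons could look like; the selection rule (comparing mean activations on corrupted versus clean examples) is an illustrative assumption, not necessarily the paper's criterion:

```python
import torch

@torch.no_grad()
def prune_memorization_neurons(layer, clean_acts, corrupt_acts, top_k=16):
    """Zero out the `top_k` hidden units whose mean activation on corrupted
    examples most exceeds their mean activation on clean examples.

    clean_acts, corrupt_acts: tensors of shape (num_examples, hidden_dim)
    layer: the nn.Linear producing those activations; its rows are zeroed.
    """
    gap = corrupt_acts.mean(0) - clean_acts.mean(0)   # per-neuron activation gap
    idx = torch.topk(gap, top_k).indices
    layer.weight[idx] = 0.0
    if layer.bias is not None:
        layer.bias[idx] = 0.0
    return idx
```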
Training Dynamics and Optimization
Grokking as the Transition from Lazy to Rich Training Dynamics
This work proposes that grokking results from delayed feature learning, where networks initially memorize data before adapting their representations for better generalization. The authors show that grokking can happen even when weight norm increases, contradicting the idea that weight decay is essential for grokking. The paper posits that early training dynamics resemble lazy learning, where networks memorize data, while later stages involve rich learning as features adapt for generalization.
The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon
The Slingshot Mechanism is characterized by cyclic phase transitions between stable and unstable training regimes, observed as cyclic behavior in the norm of the last layer's weights. The mechanism plays a crucial role in the grokking phenomenon, where a sudden transition from poor to perfect generalization occurs. Grokking frequently coincides with the onset of Slingshot phases. The presence and frequency of Slingshot cycles are sensitive to hyperparameters, particularly the ε parameter of adaptive optimizers. Smaller ε values increase the frequency of Slingshot cycles and promote grokking.
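A sketch of how one might probe this ε-sensitivity: train with Adam at different ε values and log the norm of the last layer's weights, looking for the cyclic spikes described above. The training loop and logging choices are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def train_and_log_norm(model, loader, last_layer_weight, steps=10_000, eps=1e-8, lr=1e-3):
    """Train with Adam at a given eps and record the norm of the last layer's
    weight matrix, which is reported to cycle between stable and unstable
    regimes (the Slingshot effect)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, eps=eps)
    norms, it = [], iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        norms.append(last_layer_weight.detach().norm().item())
    return norms

# Smaller eps (e.g. 1e-16 instead of the default 1e-8) is reported to increase
# the frequency of Slingshot cycles and to promote grokking.
```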
Unifying Grokking and Double Descent
This paper argues that grokking and double descent are driven by the same underlying mechanism: the different speeds at which the network learns different types of patterns in the data. Three pattern types are distinguished: 1) heuristic patterns that are fast to learn and generalize well, 2) overfitting patterns that are fast to learn but generalize poorly, and 3) slow-generalizing patterns that are slow to learn but generalize well. The study also demonstrates model-wise grokking, showing that grokking can occur as model size (e.g., embedding dimension) is varied, analogous to epoch-wise grokking.
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
This paper introduces the Grokfast algorithm, designed to accelerate grokking by amplifying slow-varying gradient components. Treating each parameter's gradient over training iterations as a signal, the method decomposes it into fast-varying and slow-varying components and amplifies the low-frequency (slow-varying) part, which is argued to drive the eventual generalization. This significantly reduces the time needed to generalize, making models grok up to 50 times faster.
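The core idea can be sketched as a gradient filter applied between the backward pass and the optimizer step. The version below follows the EMA-style variant of the idea; the hyperparameter names and values are illustrative (see the paper and its official code for the exact formulation):

```python
import torch

def gradfilter_ema(model, ema_grads, alpha=0.98, lamb=2.0):
    """Amplify the slow-varying (low-frequency) component of each gradient.

    ema_grads: dict mapping parameter name -> running EMA of its gradient.
    Call this after loss.backward() and before optimizer.step().
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if name not in ema_grads:
            ema_grads[name] = p.grad.detach().clone()
        else:
            ema_grads[name].mul_(alpha).add_(p.grad.detach(), alpha=1 - alpha)
        p.grad.add_(ema_grads[name], alpha=lamb)   # boost the slow component
    return ema_grads
```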
Beyond Algorithmic Data
Omnigrok: Grokking Beyond Algorithmic Data
This study shows that grokking can occur in a variety of tasks beyond algorithmic data, such as image classification, sentiment analysis, and molecular prediction. The authors attribute grokking to a mismatch between the training and test loss landscapes as functions of weight norm, which resemble an L shape and a U shape, respectively. The paper identifies representation learning as crucial to grokking, especially in tasks that rely heavily on high-quality representations. Large initialization scales combined with small weight decay can induce grokking, showing how sensitive the training dynamics are to these factors.
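A sketch of the initialization-scale knob highlighted here: rescale the weights at initialization and pair that with a small weight decay. The scale factor and optimizer settings below are illustrative assumptions:

```python
import torch

def rescale_init(model, alpha=8.0):
    """Scale all weight matrices at initialization; large initialization scale
    (with small weight decay) is reported to push otherwise ordinary tasks,
    such as MNIST classification, into the grokking regime."""
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:            # leave biases at their default init
                p.mul_(alpha)
    return model

# model = rescale_init(MyMLP(), alpha=8.0)                 # MyMLP is a placeholder
# opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```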
Grokking of Hierarchical Structure in Vanilla Transformers
This paper demonstrates grokking on datasets with hierarchical structure rather than purely algorithmic datasets. Whereas a non-hierarchical rule is a shallow pattern that can be applied without any deeper understanding of the input, the hierarchical rule requires understanding the structure of the sentence. Transformers exhibit an inverted U-shaped relationship between depth and generalization. The study introduces a "tree-structuredness" metric as a means to predict the optimal model depth for achieving structural grokking. The findings provide evidence that vanilla transformers, with sufficient training, can discover and utilize hierarchical sentence structure.
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
This paper investigates whether transformers can learn implicit reasoning skills such as composition and comparison on datasets other than modulo arithmetic. It finds that transformers can learn implicit reasoning but only through grokking. The paper emphasizes that data distribution, rather than size, plays a critical role in grokking and generalization. A higher ratio of inferred facts to atomic facts accelerates grokking, underscoring the importance of data composition over sheer volume. Fully grokked transformers outperform state-of-the-art models like GPT-4-Turbo and Gemini-1.5-Pro in complex reasoning tasks.
Experimental Investigations
Exploring Grokking: Experimental and Mechanistic Investigations
This work provides comprehensive empirical experiments on grokking, exploring various aspects of the phenomenon through systematic experimentation.
The Complexity Dynamics of Grokking
This paper examines the rise and fall of complexity during training of a neural network, providing insights into how model complexity evolves throughout the grokking process.
Grokking Explained: A Statistical Phenomenon
This work explains grokking as being due to distribution shift in the dataset, providing a statistical perspective on the phenomenon.
Grokking vs. Learning: Same Features, Different Encodings
This paper compares grokking with standard learning, showing that while the features learned may be the same, the encodings differ between the two processes.
Grokking at the Edge of Numerical Stability
This work identifies softmax collapse, a floating-point failure mode in which the softmax and cross-entropy gradients round to zero once the logits grow large, and shows that mitigating this numerical instability leads to faster grokking, highlighting the importance of numerical considerations in training.
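The underlying numerical effect is easy to reproduce. The snippet below is an illustration of the general floating-point saturation issue (not the paper's experimental setup): once the correct-class logit dominates, float32 cross-entropy rounds to zero and its gradients become vanishingly small, so updates effectively stop:

```python
import torch

# As the correct-class logit grows, the float32 softmax rounds to one-hot,
# the cross-entropy loss rounds to exactly 0.0, and the gradient on the
# correct class becomes exactly zero, stalling further learning.
for scale in [5.0, 10.0, 40.0]:
    logits = torch.tensor([[scale, 0.0, 0.0]], requires_grad=True)
    loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0]))
    loss.backward()
    print(scale, loss.item(), logits.grad.abs().max().item())
```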
Deep Grokking: Would Deep Neural Networks Generalize Better?
The study finds that deep neural networks, such as 12-layer MLPs, are more susceptible to grokking than shallower models. Unlike shallow models that typically exhibit a single sharp increase in test accuracy, deep networks display a multi-stage generalization pattern with two distinct surges in test accuracy. The research identifies a "tunnel effect" in deep networks, where deeper layers compress features into low-rank representations.
Mechanistic Interpretability
Progress Measures for Grokking via Mechanistic Interpretability
For the task of modular addition, this paper shows that the network's activations are periodic. Once the network groks, it essentially implements the task through trigonometric identities: the trained weights are well approximated by sines and cosines at a few key frequencies, so the computation is sparse and periodic in the Fourier domain. The authors introduce continuous progress measures, such as restricted loss and excluded loss, to explain grokking, and they identify three phases of training: memorization (the key frequencies are unused), circuit formation (the model learns to use the key frequencies), and cleanup (the model completes the task using the key frequencies and sheds the memorization components).
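To make the reverse-engineered algorithm concrete, here is a toy numerical version of the trig-identity computation; the modulus and key frequencies are illustrative, and the trained network realizes this with learned embeddings and attention rather than explicit cosines:

```python
import numpy as np

def modular_add_via_trig(a, b, p=113, key_freqs=(1, 5, 17, 25, 44)):
    """Toy version of the 'Fourier / trig identity' algorithm: the logit for
    each candidate c is sum_w cos(2*pi*w*(a+b-c)/p), which is maximized exactly
    when c == (a + b) mod p, since every cosine term then equals 1."""
    c = np.arange(p)
    logits = np.zeros(p)
    for w in key_freqs:
        # cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc) == cos(w(a+b-c))
        logits += np.cos(2 * np.pi * w * (a + b - c) / p)
    return int(np.argmax(logits))

assert modular_add_via_trig(37, 99) == (37 + 99) % 113
```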
Grokking Modular Arithmetic
This paper shows that an MLP trained on modular arithmetic learns weights that are well approximated by simple trigonometric functions of a few frequencies of the inputs. The authors use the inverse participation ratio, defined on the Fourier transform of the weights, to indicate whether the weights are random or periodic. At initialization the weights are random, but once the model groks on the arithmetic task, the Fourier-transformed weights become localized in a few frequencies.
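A sketch of this inverse participation ratio diagnostic; the normalization below is my illustrative version and the paper's exact definition may differ:

```python
import numpy as np

def fourier_ipr(weight_row):
    """Inverse participation ratio of a weight vector's Fourier power spectrum.
    Values near 1/num_bins indicate spread-out (random-looking) spectra; values
    near 1 indicate power concentrated in a few frequencies (periodic weights)."""
    power = np.abs(np.fft.rfft(weight_row)) ** 2
    p = power / power.sum()
    return float((p ** 2).sum())

rng = np.random.default_rng(0)
random_row = rng.standard_normal(97)
periodic_row = np.cos(2 * np.pi * 7 * np.arange(97) / 97)
print(fourier_ipr(random_row), fourier_ipr(periodic_row))  # periodic >> random
```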
Interpreting Grokked Transformers in Complex Modular Arithmetic
This paper investigates how complex modular arithmetic is learned by models through reverse engineering. They identify distinct dynamics for different operations: subtraction introduces asymmetry in the Transformer model, multiplication requires cosine-biased components across all frequencies in the Fourier domain, and polynomials often reflect patterns from simpler arithmetic. The paper introduces two new metrics: Fourier Frequency Sparsity (FFS) and Fourier Coefficient Ratio (FCR), which help indicate late generalization and characterize the internal representations of models.
Pruning and Subnetworks
Bridging Lottery Ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?
The paper investigates the grokking phenomenon, arguing that identifying optimal subnetworks, or "grokking tickets," is key to achieving generalization. The authors demonstrate that when grokking tickets are used, generalization happens earlier than in networks without such lottery tickets, suggesting that finding these subnetworks plays a crucial role in the transition from memorization to generalization. They show that sparsity alone does not explain grokking, and they demonstrate empirically that the grokking ticket identifies the relevant patterns more quickly than the dense network, as measured by the entropy of the discrete Fourier transform of its representations.
Recent Papers (To Read)
The following papers are recent publications from 2024-2025 that have not yet been reviewed in detail; the descriptions below follow the papers' own abstracts.
Where to Find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
This paper presents the first study of grokking in practical LLM pretraining. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Unlike existing works studying when a small model generalizes to limited and specified tasks over thousands of epochs of training on algorithmic data, we focus on a practical setting for LLMs, i.e., one-epoch pretraining of next-token prediction on a cross-domain, large-scale corpus, and generalization on diverse benchmark tasks covering math/commonsense reasoning, code generation, and domain-specific retrieval. Our study, for the first time, verifies that grokking still emerges in pretraining mixture-of-experts (MoE) LLMs, though different local data groups may enter their grokking stages asynchronously due to the heterogeneity of their distributions and attributions to others. To find a mechanistic interpretation of this local grokking, we investigate the dynamics of training data's pathways (i.e., expert choices across layers in MoE). Our primary discovery is that the pathways evolve from random, non-smooth across layers, instance-specific to more structured and transferable across samples, despite the converged pretraining loss. This depicts a transition from memorization to generalization. Two novel metrics are developed to quantify these patterns: one computes the pathway similarity between samples, while the other measures the consistency of aggregated experts between subsequent layers for each sample. These training-data-based metrics incur zero cost but can faithfully track and monitor the generalization of LLMs on downstream tasks, which, in conventional settings, requires costly instruction tuning and benchmark evaluation.
Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, how and in which conditions it happens, and is closely related to the gradient dynamics of the training, for complex structured inputs. We propose a novel framework, named Li₂, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning. At the lazy learning stage, the top layer overfits to random hidden representations and the model appears to memorize; at the same time, the backpropagated gradient G_F from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn its representation independently. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function E, and its local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how G_F changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of feature emergence, memorization, and generalization, and reveals why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multiple layers.
Using Physics-Inspired Singular Learning Theory to Understand Grokking & Other Phase Transitions in Modern Neural Networks
Classical statistical inference and learning theory often fail to explain the success of modern neural networks. A key reason is that these models are non-identifiable (singular), violating core assumptions behind PAC bounds and asymptotic normality. Singular learning theory (SLT), a physics-inspired framework grounded in algebraic geometry, has gained popularity for its ability to close this theory-practice gap. In this paper, we empirically study SLT in toy settings relevant to interpretability and phase transitions. First, we study the SLT free energy ℱ_n by testing an Arrhenius-style rate hypothesis using both a grokking modulo-arithmetic model and Anthropic's Toy Models of Superposition. Second, we study the local learning coefficient λ_α by measuring how it scales with problem difficulty across several controlled network families (polynomial regressors, low-rank linear networks, and low-rank autoencoders). Some of our experiments recover known scaling laws, while others yield meaningful deviations from theoretical expectations. Overall, our paper illustrates the many merits of SLT for understanding neural network phase transitions, and poses open research questions for the field.
Grokking in the Ising Model
Delayed generalization, termed grokking, in a machine learning calculation occurs when the training accuracy approaches its maximum value long before the test accuracy. This paper examines grokking in the context of a neural network trained to classify 2D Ising model configurations. We find, partially with the aid of novel PCA-based network layer analysis techniques, that the grokking behavior can be qualitatively interpreted as a phase transition in the neural network in which the fully connected network transforms into a relatively sparse subnetwork. This in turn reduces the confusion associated with a multiplicity of paths. The network can then identify the common features of the input classes and hence generalize to the recognition of previously unseen patterns.
Grokking Beyond the Euclidean Norm of Model Parameters
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property P (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of P (e.g., ℓ₁ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the ℓ₂ norm is not a reliable proxy for generalization when the model is regularized toward a different property P, as the ℓ₂ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified solely through data selection, with any other hyperparameter fixed.
Is Grokking a Computational Glass Relaxation?
Understanding neural networks' (NNs') generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find the memorization process resembles a rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio φ_r of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing φ_r drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
NeuralGrok: Accelerate Grokking by Neural Gradient Transformation
Grokking is proposed and widely studied as an intricate phenomenon in which generalization is achieved after a long-lasting period of overfitting. In this work, we propose NeuralGrok, a novel gradient-based approach that learns an optimal gradient transformation to accelerate the generalization of transformers in arithmetic tasks. Specifically, NeuralGrok trains an auxiliary module (e.g., an MLP block) in conjunction with the base model. This module dynamically modulates the influence of individual gradient components based on their contribution to generalization, guided by a bilevel optimization algorithm. Our extensive experiments demonstrate that NeuralGrok significantly accelerates generalization, particularly in challenging arithmetic tasks. We also show that NeuralGrok promotes a more stable training paradigm, constantly reducing the model's complexity, while traditional regularization methods, such as weight decay, can introduce substantial instability and impede generalization. We further investigate the intrinsic model complexity leveraging a novel Absolute Gradient Entropy (AGE) metric, which explains that NeuralGrok effectively facilitates generalization by reducing the model complexity. We offer valuable insights on the grokking phenomenon of Transformer models, which encourages a deeper understanding of the fundamental principles governing generalization ability.
How to Explain Grokking
This paper explains grokking (delayed generalization) by modeling training as stochastic gradient Langevin dynamics (Brownian motion) and applying ideas from thermodynamics.
Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition
We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded ℓ∞ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small ℓ∞ regularization. We further provide empirical evidence that these networks as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.
Progress Measures for Grokking on Real-world Tasks
Grokking, a phenomenon where machine learning models generalize long after overfitting, has been primarily observed and studied in algorithmic tasks. This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss. We challenge the prevalent hypothesis that the L₂ norm of weights is the primary cause of grokking by demonstrating that grokking can occur outside the expected range of weight norms. To better understand grokking, we introduce three new progress measures: activation sparsity, absolute weight entropy, and approximate local circuit complexity. These measures are conceptually related to generalization and demonstrate a stronger correlation with grokking in real-world datasets compared to weight norms. Our findings suggest that while weight norms might usually correlate with grokking and our progress measures, they are not causative, and our proposed measures provide a better understanding of the dynamics of grokking.
Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity
In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression, linear regression and Bayesian neural networks. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures shows that grokking is not restricted to settings considered in current theoretical and empirical studies. Instead, grokking may be possible in any model where solution search is guided by complexity and error.
Additional Resources
For an interactive tutorial on modular arithmetic and grokking, see: Google PAIR Grokking Explorable
Conclusion
Grokking represents a fascinating and complex phenomenon in deep learning that challenges our understanding of generalization. The literature reveals multiple perspectives on grokking—from circuit competition and training dynamics to mechanistic interpretability and theoretical foundations. Key insights include the role of dataset size and composition, the importance of optimization hyperparameters, the existence of distinct learning phases, and the presence of both memorizing and generalizing circuits. Future research directions include extending grokking to more complex tasks, developing better prediction methods, and understanding its implications for large language models.