Nested Learning: a new HOPE for AI
For the last decade, we’ve been operating under a specific mental model: Deep Learning is about stacking layers. We treat these layers (Attention, MLPs, RNNs) as static “processors” that information flows through. We train them once, freeze them, and then expect them to handle the world.
But a new paper from Google Research, presented at NeurIPS 2025, argues that this view is an illusion.
The paper, titled “The Illusion of Deep Learning Architectures” [1], proposes a radical shift: Neural networks aren’t just stacks of filters; they are systems of nested optimization problems. Every layer in your model is actually a little “learner” trying to solve its own objective function in real-time.
If you’ve ever wondered why Transformers are so good or how we can make models that actually learn after training, this paper offers a mathematical explanation that is surprisingly elegant.
The Core Insight: Architecture = Optimization
The most mind-bending part of the paper is the proof that well-known architectures are mathematically identical to well-known optimization algorithms.
Let’s look at Linear Attention, a simplified form of standard attention used in modern efficient Transformers. Its update rule looks like this:
\[\mathcal{M}_{t} = \mathcal{M}_{t-1} + v_t k_t^\top\]Here, $\mathcal{M}$ is the memory state, and it updates by adding the outer product of the current value ($v$) and key ($k$).
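To make this concrete, here is a minimal NumPy sketch of that memory update applied over a short token sequence. This is my own toy code, not the paper’s; the dimensions and the read-out step at the end are made up purely for illustration.

```python
import numpy as np

d_k, d_v, seq_len = 4, 4, 8            # toy dimensions, chosen arbitrarily
rng = np.random.default_rng(0)
keys = rng.normal(size=(seq_len, d_k))
values = rng.normal(size=(seq_len, d_v))

# The memory starts empty and accumulates one outer product per token:
# M_t = M_{t-1} + v_t k_t^T
M = np.zeros((d_v, d_k))
for k_t, v_t in zip(keys, values):
    M += np.outer(v_t, k_t)

# Reading from the memory with a query is just a matrix-vector product.
q = keys[0]
print(M @ q)  # a noisy read-out of the value stored under the first key
```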
Now, let’s look at Gradient Descent (GD), the algorithm we use to train networks. The update rule for weights ($W$) is:
\[W_{t+1} = W_t - \eta \nabla \mathcal{L}\]The researchers asked a simple question: What if the Attention layer is actually running Gradient Descent on itself?
They define an “Associative Memory” objective, a loss function that measures how well the memory maps keys to values:
\[\mathcal{L}(\mathcal{M}) = - \langle \mathcal{M} k_t, v_t \rangle\]If you calculate the gradient of this loss with respect to the memory $\mathcal{M}$, you get $-v_t k_t^\top$. Plug that into the Gradient Descent formula (assuming a learning rate $\eta=1$), and you get:
\[\mathcal{M}_{t+1} = \mathcal{M}_t - (-v_t k_t^\top) = \mathcal{M}_t + v_t k_t^\top\]This is exactly the Linear Attention update from above (just written with the time index shifted by one).
This proves that a Linear Attention layer isn’t just a static block; it is a dynamic solver running one step of Gradient Descent on every token to “learn” the current context.
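As a sanity check, here is a small NumPy sketch (again my own, not from the paper) showing numerically that one gradient-descent step on the associative-memory loss above, with $\eta = 1$, produces exactly the same memory matrix as the linear-attention update.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
k_t = rng.normal(size=d)
v_t = rng.normal(size=d)
M = rng.normal(size=(d, d))            # current memory state

# Linear Attention view: M <- M + v_t k_t^T
M_attention = M + np.outer(v_t, k_t)

# Optimization view: one GD step on L(M) = -<M k_t, v_t> with eta = 1.
# The gradient of L with respect to M is -v_t k_t^T.
grad = -np.outer(v_t, k_t)
M_gradient_step = M - 1.0 * grad

print(np.allclose(M_attention, M_gradient_step))  # True
```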
From “Heuristic” to “Hope”
Once you realize that Architecture is Optimization, you can stop guessing new architectures and start deriving them.
If Linear Attention is equivalent to basic Gradient Descent (which we know is kind of slow and clunky), why not upgrade the “inner optimizer”?
- What if we used Momentum? The paper shows that Momentum acts as a “Meta-Memory” that compresses past gradients (see the sketch after this list).
- What if we used Adam?
- What if we used a Neural Network to predict the updates?
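To see what “upgrading the inner optimizer” could look like, here is a hedged sketch: the same associative-memory objective as before, but updated with momentum instead of plain gradient descent. The momentum buffer `S` (my notation, not necessarily the paper’s) is the extra state that gets interpreted as a meta-memory over past gradients.

```python
import numpy as np

def momentum_memory_update(M, S, k_t, v_t, lr=1.0, beta=0.9):
    """One momentum step on the inner loss L(M) = -<M k_t, v_t>.

    M is the memory matrix, S is the momentum buffer (a running mix of
    past gradients). With beta = 0 this reduces to plain Linear Attention.
    """
    grad = -np.outer(v_t, k_t)         # gradient of the associative-memory loss
    S = beta * S + grad                # the buffer compresses past gradients
    M = M - lr * S                     # memory update driven by the buffer
    return M, S

rng = np.random.default_rng(2)
d = 4
M, S = np.zeros((d, d)), np.zeros((d, d))
for _ in range(8):
    k_t, v_t = rng.normal(size=d), rng.normal(size=d)
    M, S = momentum_memory_update(M, S, k_t, v_t)
```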
This line of thinking led the authors to develop HOPE, a new architecture that treats the model as a “Continuum Memory System”.
The HOPE Architecture: Mimicking Brain Waves
Current LLMs suffer from “Anterograde Amnesia”. They can process the immediate context (short-term) and remember their pre-training (long-term), but they can’t form new long-term memories after deployment.
HOPE fixes this by mimicking the human brain’s oscillations (brain waves). The brain doesn’t have one clock; it has fast Gamma waves for immediate processing and slow Delta waves for long-term consolidation.
HOPE implements this mathematically by nesting optimization loops with different “frequencies” (a toy sketch follows the list):
- High-Frequency Neurons: These update rapidly (like standard attention) to handle the immediate token-by-token flow.
- Low-Frequency Neurons: These are “slow learners” that only update every $C$ steps, consolidating information into long-term storage.
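Here is a toy sketch of the frequency idea. This is a caricature I wrote to illustrate the two timescales, not the actual HOPE architecture: a fast memory updates on every token, while a slow memory only consolidates a summary of the fast one every C steps.

```python
import numpy as np

def run_continuum_memory(keys, values, C=4):
    """Toy two-level memory: the fast level updates on every token,
    the slow level consolidates only once every C tokens."""
    d_k, d_v = keys.shape[1], values.shape[1]
    fast = np.zeros((d_v, d_k))        # high-frequency memory (token-by-token)
    slow = np.zeros((d_v, d_k))        # low-frequency memory (long-term)

    for t, (k_t, v_t) in enumerate(zip(keys, values), start=1):
        fast += np.outer(v_t, k_t)     # fast inner loop: one GD-style step per token
        if t % C == 0:                 # slow outer loop: fires once every C steps
            slow += fast / C           # consolidate a summary into long-term storage
            fast *= 0.0                # reset the fast memory for the next chunk
    return fast, slow

rng = np.random.default_rng(3)
keys, values = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
fast, slow = run_continuum_memory(keys, values)
```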
This structure allows the model to be Self-Modifying. It doesn’t just read data; it fundamentally rewrites its own parameters on the fly, allowing for true continual learning without catastrophic forgetting.
Why This Matters
This paper is exciting because it moves us away from the black-box alchemy of “adding more layers”. It gives us a white-box theoretical framework: if you want a better architecture, design a better optimizer.
By unifying the math of how we learn (optimization) with what we build (architecture), Nested Learning suggests that the next generation of AI won’t just be bigger, it will be smarter about how it manages its own mind.
– Ali
References
[1] “The Illusion of Deep Learning Architectures”, Google Research, NeurIPS 2025.