ICML 2025: Top 50 Papers That Might Shape the Future of ML — According to Gemini
- How This List Was Made
- List of 50 Papers
- Why Use AI to Curate?
- 🤷 So… Which One Should You Read?
- 🧭 What’s Next?
- My thought
- A Surprise at the end
Each year, thousands of papers flood top conferences like ICML. But unless you’re a specialist in every subfield, it’s tough to tell what really matters, or what might matter six months from now.
So I tried something different.
I asked Gemini to go through the full list of ICML 2025 papers (titles, each with a link to the paper) and pick 50 that seemed especially promising. The selection favors papers with conceptual novelty, foundational depth, and relevance to foundation models, interpretability, or generative models.
Did I get it right? Maybe. Maybe not. That’s part of the experiment. A different prompt might lead to a totally different list.
How This List Was Made
This is not a leaderboard or a citation-based ranking. Gemini wasn’t given metrics, reviews, or author prestige: just the raw metadata from ICML 2025, titles and links.
But ultimately, this is a speculative curation. I haven’t read the full texts (and neither have most people, because they were just released). The goal is to surface ideas that seem worth deeper attention, not to declare winners.
List of 50 Papers
- Strategy Coopetition Explains the Emergence and Transience of In-Context Learning
- Abstract summary: This paper likely presents a new theory that models in-context learning as an emergent phenomenon arising from both competitive and cooperative dynamics among model components.
- Reason for inclusion: High relevance to foundational models, offering a novel theoretical explanation for a key, yet poorly understood, capability.
- MIB: A Mechanistic Interpretability Benchmark
- Abstract summary: Introduces a new benchmark (MIB) designed to systematically evaluate and compare techniques in mechanistic interpretability.
- Reason for inclusion: Foundational for the field of interpretability. A standardized benchmark is critical for making measurable progress.
- Elucidating the design space of language models for image generation
- Abstract summary: This work likely explores and maps out the key architectural and data-related choices when using LLMs to drive or control image generation processes.
- Reason for inclusion: Directly addresses the intersection of generative models and foundational models, a key area of current research.
- Subspace Optimization for Large Language Models with Convergence Guarantees
- Abstract summary: Proposes a new optimization method for LLMs that operates in a lower-dimensional subspace, likely improving efficiency while providing theoretical guarantees of convergence.
- Reason for inclusion: Addresses the critical practical challenge of training and fine-tuning massive foundational models with a theoretically grounded approach.
- R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models
- Abstract summary: Introduces a method that leverages LLMs to automatically design and optimize complex reward functions for reinforcement learning.
- Reason for inclusion: Novel application of LLMs to automate a traditionally human-intensive task in RL, with high potential impact on training more capable agents.
- Let LLM Tell What to Prune and How Much to Prune
- Abstract summary: This paper likely presents a novel method using a large language model to intelligently decide which parts of a neural network to prune for efficiency.
- Reason for inclusion: Innovative use of LLMs for model optimization, falling under the umbrella of interpretability and practical application of foundational models.
- MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning
- Abstract summary: Introduces MODULI, a method that uses diffusion models to better generalize user preferences in a multi-objective RL setting.
- Reason for inclusion: High novelty, combining generative diffusion models with reinforcement learning to solve a challenging preference alignment problem.
- MCU: An Evaluation Framework for Open-Ended Game Agents
- Abstract summary: Proposes a new framework (MCU) for evaluating agents in complex, open-ended gaming environments, a key challenge for modern AI.
- Reason for inclusion: Foundational contribution to evaluation methodology, which is crucial for progress in agentic AI and foundational models.
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- Abstract summary: This paper introduces an innovative caching technique (ShadowKV) to significantly speed up inference for LLMs with very long contexts.
- Reason for inclusion: Addresses a major bottleneck in the practical deployment of large foundational models, making it highly impactful.
- SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
- Abstract summary: Focuses on the crucial but understudied problem of how AI agents recover from errors and get “back in sync” when collaborating on software development tasks.
- Reason for inclusion: Tackles a frontier problem in agentic AI and foundational models, focusing on the practical challenges of multi-agent collaboration.
- UnHiPPO: Uncertainty-aware Initialization for State Space Models
- Abstract summary: Proposes a new initialization method for State Space Models (SSMs) that explicitly accounts for uncertainty, likely improving stability and performance.
- Reason for inclusion: SSMs are a major alternative to Transformers. Foundational work on improving their training and stability is highly relevant.
- SE(3)-Equivariant Diffusion Policy in Spherical Fourier Space
- Abstract summary: This paper likely develops a diffusion-based policy for robotics or 3D tasks that correctly handles 3D rotations and translations by operating in a spherical Fourier space.
- Reason for inclusion: Connects generative models (diffusion) with important geometric principles (equivariance) for physical interaction tasks, representing a key research direction.
- Fast Video Generation with Sliding Tile Attention
- Abstract summary: Introduces a novel attention mechanism (“Sliding Tile Attention”) to make video generation models faster and more efficient.
- Reason for inclusion: Directly addresses a key challenge in generative AI (video generation speed) with a new architectural innovation.
- Preference learning made easy: Everything should be understood through win rate
- Abstract summary: Proposes a unifying, simplified framework for preference learning, suggesting that various methods can be interpreted and analyzed through the simple metric of win rate.
- Reason for inclusion: Potentially foundational work that simplifies and unifies the theory behind aligning models with human preferences, a cornerstone of modern LLM development.
- How Far are We in Multimodal Large Language Models?
- Abstract summary: This paper likely provides a comprehensive survey and analysis of the current capabilities and limitations of multimodal LLMs.
- Reason for inclusion: Foundational benchmark/survey paper that helps structure the understanding of a rapidly evolving field.
- Generative Pre-training for Tabular Data
- Abstract summary: Explores the application of generative pre-training, a successful paradigm from LLMs, to the domain of tabular data.
- Reason for inclusion: Extends the foundational model paradigm to a different data modality, which has significant practical implications.
- Beyond Chain-of-Thought: A Survey on Complex Reasoning of Large Language Models
- Abstract summary: A survey paper that likely categorizes and reviews advanced reasoning techniques in LLMs that go beyond simple Chain-of-Thought prompting.
- Reason for inclusion: Provides a structured overview of the cutting edge in LLM reasoning, a central topic for foundational models.
- Do Large Language Models Know about Their Generation Probability?
- Abstract summary: Investigates whether LLMs have an internal, calibrated sense of the likelihood of the text they generate, which is key for reliability.
- Reason for inclusion: Addresses a fundamental question about the self-awareness and calibration of foundational models, relevant to both interpretability and safety.
- A Mechanistic Understanding of Alignment: A Case Study in Function Vectors
- Abstract summary: This work likely provides a mechanistic interpretability analysis of how alignment techniques like RLHF change the internal representations of a model.
- Reason for inclusion: High relevance to interpretability, aiming to explain the “how” and “why” of model alignment procedures at a mechanistic level.
- UniG-guided: A Unified and Guided Framework for All-in-one Image Generation
- Abstract summary: Proposes a single, unified framework that can handle various conditional image generation tasks (e.g., text-to-image, editing, inpainting) within one model.
- Reason for inclusion: High impact potential for generative models, moving towards more consolidated and versatile creative tools.
- Mega-T: A General-purpose Foundation Model for Tabular Data
- Abstract summary: This paper likely introduces a new large-scale, pre-trained model (“Mega-T”) specifically designed for a wide range of tasks on tabular data.
- Reason for inclusion: Clear attempt to create a foundational model for the tabular data domain, a significant and practical research direction.
- Scaling Laws for Forgetting in Large Language Models
- Abstract summary: Investigates the relationship between model scale and catastrophic forgetting, likely deriving scaling laws that predict how and when LLMs forget information.
- Reason for inclusion: Foundational theoretical work on a critical limitation of current models, relevant for continual learning and model updates.
- Emergent Communication Explains the Surprising Reasoning Skills of Large Language Models
- Abstract summary: Proposes a novel theory that the reasoning abilities of LLMs are an emergent consequence of learning to communicate effectively.
- Reason for inclusion: Offers a high-level conceptual explanation for the emergent properties of foundational models, a key scientific question.
- What happens when a transformer-based language model is a compression algorithm?
- Abstract summary: Explores the deep connection between language modeling and data compression, likely analyzing model internals from a compression perspective.
- Reason for inclusion: High relevance to interpretability and foundational understanding, framing model behavior through a powerful theoretical lens (Kolmogorov complexity).
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- Abstract summary: Introduces a fine-tuning method based on self-play, where a language model improves itself, and shows this can dramatically boost the capabilities of smaller models.
- Reason for inclusion: High potential impact, suggesting a path to create powerful models without needing to start with a massive base model.
- A Theory of Counter-Example Guided Interpretability
- Abstract summary: Develops a theoretical framework for interpretability methods that rely on generating counter-examples to explain model decisions.
- Reason for inclusion: Foundational contribution to the theory of interpretability, providing rigor to a popular class of explanation methods.
- Generative World Models for Transfer in Reinforcement Learning
- Abstract summary: This paper likely uses generative models to build “world models” that an RL agent can use to simulate future outcomes, specifically to improve transfer to new tasks.
- Reason for inclusion: Connects generative models with RL to tackle the key challenge of generalization and transfer learning.
- Diffusion Models as Interpretable and Controllable Recommender Systems
- Abstract summary: Proposes using diffusion models for recommendation tasks, arguing this approach offers better interpretability and user control over recommendations.
- Reason for inclusion: Novel application of generative models to a classic domain, with a specific focus on the benefits for interpretability.
- Vision-Language Models as a Source of Reward
- Abstract summary: Explores using pre-trained vision-language models to provide dense reward signals for training robotic or embodied agents, avoiding manual reward engineering.
- Reason for inclusion: Leverages the knowledge in foundational models to solve a bottleneck in another field (RL), a powerful and trending paradigm.
- The Expressive Power of Transformers with Chain-of-Thought
- Abstract summary: A theoretical paper analyzing the expressive power of the Transformer architecture specifically when it is prompted to use Chain-of-Thought reasoning.
- Reason for inclusion: Foundational theory for LLMs, aiming to understand why and how popular prompting techniques actually work.
- The Geometry of Truth: Visualizing Language Model Representations
- Abstract summary: This work likely introduces new techniques for visualizing the internal representation space of LLMs to understand how they represent concepts like truthfulness.
- Reason for inclusion: Directly addresses interpretability with a focus on novel visualization methods for understanding abstract properties.
- Watermarking Large Language Models with Backdoored Prompts
- Abstract summary: Proposes a method for watermarking LLMs to trace their outputs, likely by embedding a secret trigger via backdoored prompts during training.
- Reason for inclusion: Important topic at the intersection of model security, ownership, and interpretability (as it involves manipulating model behavior).
- Can We Trust the Interpretations of Local Interpretable Model-Agnostic Explanations (LIME)? A Theoretical and Empirical Analysis
- Abstract summary: A critical analysis of LIME, one of the most popular interpretability techniques, likely examining its theoretical soundness and empirical reliability.
- Reason for inclusion: Foundational interpretability work that provides a necessary critical perspective on widely used tools.
- Diff-Pruning: Pruning via Diffusion-based Weight Importance
- Abstract summary: Introduces a new model pruning technique where the importance of weights is determined using a diffusion process.
- Reason for inclusion: A novel and interesting combination of generative model techniques (diffusion) with the goal of model compression and efficiency.
- ExpProof: Operationalizing Explanations for Confidential Models with ZKPs
- Abstract summary: This paper likely proposes a method to provide verifiable explanations for a model’s prediction without revealing the model’s weights, using Zero-Knowledge Proofs.
- Reason for inclusion: Highly innovative work connecting interpretability with cryptography to solve the problem of explaining private or proprietary models.
- Towards a Foundation Model for Proteins
- Abstract summary: Likely describes the development and training of a large-scale, pre-trained model for a variety of tasks related to protein biology.
- Reason for inclusion: A clear example of extending the “foundational model” concept to a critical scientific domain beyond language.
- Training-Free Structure-Guided Diffusion Model for 3D Molecule Generation
- Abstract summary: Proposes a diffusion-based model for generating 3D molecules that can be guided by structural constraints without requiring additional training.
- Reason for inclusion: Strong relevance to generative models, applied to the high-impact area of drug discovery and molecular design.
- Emergent Modularity in Pre-trained Transformers
- Abstract summary: Investigates how and why modular structures (sub-networks with distinct functions) spontaneously emerge within large pre-trained Transformer models.
- Reason for inclusion: High relevance to interpretability and understanding the internal structure of foundational models.
- Are Large Language Models Born with Grand-Tour capability?
- Abstract summary: This paper likely investigates the innate ability of LLMs to perform “grand tour” data visualization, a technique for exploring high-dimensional data, in their latent spaces.
- Reason for inclusion: Connects interpretability of foundational models with established methods in data science and visualization in a novel way.
- Decoding the Internal State of a Large Language Model during Code Generation
- Abstract summary: An interpretability study focused specifically on understanding the step-by-step internal reasoning of an LLM as it generates computer code.
- Reason for inclusion: Highly relevant to interpretability, focusing on the important and complex domain of code generation.
- Unsupervised Data Selection for Language Model Fine-tuning
- Abstract summary: This work likely introduces methods for automatically selecting the best data for fine-tuning an LLM without needing labeled examples or human supervision.
- Reason for inclusion: Addresses a critical practical problem in the effective use of foundational models, with high potential impact on efficiency.
- General Purpose In-Context Learning Agents
- Abstract summary: Explores the use of in-context learning to create general-purpose AI agents that can adapt to new tasks on the fly without updating their weights.
- Reason for inclusion: Pushes the boundaries of what can be achieved with foundational models, focusing on the paradigm of agent-based AI.
- Does the Generator Mind All Specifications? Evaluating Object Consistency of Text-to-Image Generative Models
- Abstract summary: This paper likely introduces a new benchmark or methodology for systematically evaluating how well text-to-image models follow all parts of a complex prompt, focusing on object consistency.
- Reason for inclusion: Important evaluation work for generative models, addressing the key challenge of controllability and faithfulness to user intent.
- A Diffusion-based Framework for Counterfactual Explanation
- Abstract summary: Proposes using generative diffusion models as a new way to create counterfactual explanations, which describe the smallest change needed to alter a model’s prediction.
- Reason for inclusion: High novelty, combining state-of-the-art generative models with a key interpretability technique.
- G-Retriever: A General Generative-Retriever Framework for Large Language Models
- Abstract summary: Introduces a unified framework that combines generative and retrieval-based approaches for LLMs, likely to improve factuality and knowledge grounding.
- Reason for inclusion: Addresses the critical issue of augmenting foundational models with external knowledge, a very active and important research area.
- How do Language Models Bind Entities? A Study of the Naming-Related Circuit
- Abstract summary: A mechanistic interpretability paper that seeks to identify the specific neural circuit within a language model responsible for associating names with entities.
- Reason for inclusion: A prime example of mechanistic interpretability, digging deep into how a specific, fundamental capability is implemented in a model.
- Language Models are Weak Agents, But Strong Tool Users
- Abstract summary: This paper likely presents a study arguing that the main power of LLM-based agents comes not from their innate planning abilities, but from their effective use of external tools.
- Reason for inclusion: Offers a clear and potentially influential perspective on the architecture and capabilities of agents built on foundational models.
- Understanding and Mitigating Hallucinations in Large Vision-Language Models
- Abstract summary: This work provides an analysis of why multimodal vision-language models hallucinate facts and proposes methods to reduce this behavior.
- Reason for inclusion: Addresses a critical failure mode (hallucination) in foundational models, which is essential for building reliable systems.
- World Model on Million-Length Video and Language
- Abstract summary: Describes the creation of a “world model” trained on extremely long sequences of video and text, likely pushing the state-of-the-art in long-context modeling.
- Reason for inclusion: High-impact, large-scale work on generative and foundational models, tackling the frontier of long-sequence understanding.
- The Sparse Transformer: A New Architecture for Language Modeling
- Abstract summary: Proposes a new Transformer architecture based on sparse connectivity patterns, likely aiming for significantly improved efficiency without sacrificing performance.
- Reason for inclusion: Potential for a foundational architectural innovation that could influence the next generation of LLMs.
Why Use AI to Curate?
Because even experts can’t read everything, and sometimes it helps to bring a fresh (or weird) lens. You might disagree with some of the picks. That’s great. Use this list as a conversation starter, not a verdict. What would your top 10 look like? Which of these would you bet on becoming influential a year from now?
🤷 So… Which One Should You Read?
Honestly? I don’t know yet. That’s the point. Some of these may be overhyped. Others might be quiet classics. But all of them seem to reach beyond the incremental. As the full papers come out and people dig into the math, we’ll see which ones hold up, and which ones quietly change the way we think. In general, one good heuristic is to look at papers from authors or labs you already know to be strong researchers, which is something we haven’t done here. You can also follow your favourite researchers and watch for their tweets about their own papers 😉. But I think this list is a good starting point for finding some hidden gems.
🧭 What’s Next?
I’d love to do follow-ups:
- A deep dive into the most cited vs. least cited of these picks in six months
- A themed breakdown: top picks in interpretability, causality, generative models, etc.
- A comparison between Gemini-selected vs. community-selected papers

But for now, here’s the bet: at least a few of these 50 will turn out to matter more than their titles suggest.
My thought
I don’t know which of these papers are truly important. But the fact that Gemini selected them shows one thing clearly: it’s judging based only on titles. So if we build research agents (or LLM-based search engines), they might prefer papers with titles like these, and that could lead to those papers being cited more frequently. (You get the idea.) So I’m not even sure that citation count will be a reliable metric for comparison later on. Anyway, that’s my final note. If you find one you love, or one you hate, drop me a note.
This disclaimer is obvious, but I have to include it anyway.
A Surprise at the end
The papers without links are the ones I couldn’t find in the official ICML 2025 listings. I even searched for a few of them, and it turns out they’re not real papers at all, just hallucinations or clever mashups of existing research and academic-sounding phrases.
So hey, pick a title and write your own paper, Gemini might just love it. 😂
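If you want to reproduce this check programmatically, here is a minimal sketch. It assumes you’ve saved the official titles and Gemini’s picks to two text files, one title per line; both filenames (real_titles.txt, gemini_titles.txt) are hypothetical, not part of my original setup.
```python
# Hypothetical sanity check: flag any of Gemini's picks whose title does not
# appear in the official ICML listing. Both filenames are assumptions.
def load_titles(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

real = {t.lower() for t in load_titles("real_titles.txt")}
for title in load_titles("gemini_titles.txt"):
    if title.lower() not in real:
        print(f"Not found in ICML 2025 listing: {title}")
```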
PS: My Prompt was:
You are an expert machine learning researcher and assistant. Your task is to select the 50 most relevant papers from the ICML 2025 conference, based on a provided list of paper titles and relative URLs (e.g., "/virtual/2025/poster/44561").
Each item is an HTML list entry:
<li><a href="/virtual/2025/poster/44561">Strategy Coopetition Explains the Emergence and Transience of In-Context Learning</a></li>
### My goals:
- Select 50 papers from ~3000 entries
- Prioritize high-quality, novel, and relevant research
- I’m particularly interested in the following topics: Foundation models, interpretability, generative models.
- Papers should be:
- Innovative or influential
- Related to the above topics or clearly high-impact
- Either theoretically foundational or practically useful
### What you should do:
1. **Extract** the paper title and full URL (prepend "https://icml.cc" to the relative URL)
2. **Visit each paper’s ICML page**, if accessible, and extract:
- The abstract
- Authors
- Any listed keywords or presentation format (e.g., spotlight/oral)
3. **Score each paper** based on:
- Relevance to the topics I care about
- Novelty and clarity of the abstract
- Expected impact (e.g., new benchmarks, theory, or trends)
4. **Select the top 50 papers**
5. **Output** your result in the following format:
```
1. [Paper Title](https://icml.cc/virtual/2025/poster/44561)
* Abstract summary (1–2 lines)
* Reason for inclusion (e.g., topic fit, novelty, etc.)
```
Only output the final selected list of 50 papers — do not explain your methodology unless asked.
I am attaching the list of papers in a .txt file, read all of the papers from that.
– Ali
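As an aside: the mechanical extraction in step 1 of the prompt is simple enough that you could do it yourself before handing anything to the model. Below is a sketch under my own assumptions: the listing lives in a file called papers.txt, with one `<li>` entry per line.
```python
import re

# A sketch of step 1 of the prompt: pull each title and relative URL out of
# the <li> entries and prepend the site root. The filename papers.txt and the
# one-entry-per-line layout are assumptions, not part of the original setup.
LI_PATTERN = re.compile(r'<li><a href="([^"]+)">(.*?)</a></li>')
BASE_URL = "https://icml.cc"

papers = []
with open("papers.txt", encoding="utf-8") as f:
    for line in f:
        match = LI_PATTERN.search(line)
        if match:
            relative_url, title = match.groups()
            papers.append((title, BASE_URL + relative_url))

# Emit the same markdown link format the prompt asks Gemini to use.
for i, (title, url) in enumerate(papers, start=1):
    print(f"{i}. [{title}]({url})")
```
Pre-extracting titles and full URLs this way also gives you a ground-truth index to diff Gemini’s output against, which would have surfaced the hallucinated titles immediately.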