The Sequence Radar #559: Two Remarkable Papers This Week: Self-Improving Agents and the Limits of LLM Memorization | By The Digital Insider

Agents that improve themselves and the limits of memorization.

Created Using GPT-4o

Next Week in The Sequence:

We dive into safety evals as part of our series about benchmarking. Research covers Sakana AI's groundbreaking paper about self-evolving models. Our opinion section focuses on the case for spatial intelligence and world models. Engineering will discuss another cool AI framework.

You can subscribe to The Sequence below:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Two Remarkable Papers This Week: Self-Improving Agents and the Limits of LLM Memorization

This week featured two standout papers that reveal complementary frontiers of AI development: one that pushes the limits of open-ended, self-improving systems, and another that rigorously quantifies how much information large language models can retain. The first, Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents, presents one of the most credible instantiations yet of recursive self-modifying agents. The second, How Much Do Language Models Memorize?, introduces a principled and practically measurable framework for assessing the memorization capacity of modern LLMs. Both contributions illuminate core dynamics of how AI systems evolve, learn, and remember—and together, they paint a vivid picture of our current trajectory in scaling and aligning intelligent systems.

The Darwin Gödel Machine (DGM) operationalizes the theoretical idea of self-referential improvement by constructing agents that can rewrite their own code to enhance performance. Built atop frozen foundation models, DGM alternates between self-modification and evaluation, benchmarking candidate agents on real-world coding tasks like SWE-bench and Polyglot. It employs a Darwinian mechanism: each agent is added to an archive, and new agents are generated via mutations of prior ones. Crucially, this enables divergent exploration across the agent design space. The system autonomously discovers new tools, workflows, and strategies, leading to performance improvements from 20% to 50% on SWE-bench. The results suggest that recursive, self-directed optimization is not only feasible but increasingly competitive with manually engineered systems.
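For intuition, here is a minimal sketch of that archive-and-mutate loop in Python. Everything in it is a placeholder: `evaluate` stands in for running an agent against SWE-bench or Polyglot, `self_modify` stands in for the frozen foundation model proposing edits to the parent agent's own code, and the parent-selection heuristic is simplified relative to the paper's score- and novelty-weighted sampling.

```python
import random

def evaluate(agent: str) -> float:
    """Placeholder for benchmarking an agent on real coding tasks
    (SWE-bench / Polyglot); here it just returns a random score."""
    return random.random()

def self_modify(parent: str) -> str:
    """Placeholder for the self-modification step, in which a frozen
    foundation model rewrites the parent agent's own source code."""
    return parent + ".child"

def dgm_loop(initial_agent: str, iterations: int = 50):
    # The archive keeps every agent ever produced, not just the current
    # best, so later generations can branch from initially weak ancestors.
    archive = [(initial_agent, evaluate(initial_agent))]
    for _ in range(iterations):
        # Pick a parent, biased toward higher-scoring agents (the paper
        # also weights by novelty / how often an agent has been expanded).
        weights = [score + 1e-6 for _, score in archive]
        parent, _ = random.choices(archive, weights=weights, k=1)[0]
        child = self_modify(parent)
        archive.append((child, evaluate(child)))  # kept even if it scores worse
    return max(archive, key=lambda entry: entry[1])

best_agent, best_score = dgm_loop("agent-v0")
```

The key design choice the sketch tries to capture is that nothing is ever discarded: unlike hill-climbing on a single agent, the archive lets later breakthroughs branch from lineages that looked unpromising at first.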

What distinguishes DGM is its architecture for open-ended discovery. Rather than hill-climbing on a single agent, it maintains a population of diverse, evolving systems—allowing for breakthroughs to emerge from unlikely or initially suboptimal branches. This is a major departure from conventional agent fine-tuning pipelines, which often discard failed explorations. The paper demonstrates that key innovations often trace back to agents that initially underperformed, underscoring the value of preserving and revisiting earlier ideas. With strong safety protocols in place (e.g., sandboxing and human oversight), the DGM framework opens a credible path toward continuously evolving AI systems whose improvements compound autonomously over time.

Meanwhile, How Much Do Language Models Memorize? tackles a long-standing and under-specified question at the heart of LLM behavior: what does a model actually retain from its training data? The authors introduce a formal decomposition of memorization into "unintended memorization" (data-specific retention) and "generalization" (distribution-level abstraction). Using a compression-based method inspired by Kolmogorov complexity, they estimate the total number of bits a model can memorize. Their experiments—which span hundreds of transformer models trained on both synthetic and natural datasets—reveal a striking result: GPT-family models retain roughly 3.6 bits per parameter. This figure quantifies model capacity in a practical, interpretable way, and serves as a foundation for analyzing model behavior, privacy risks, and generalization thresholds.
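To make that number concrete, here is a quick back-of-envelope conversion: multiplying parameter count by 3.6 bits gives the total raw capacity the paper's framework attributes to a model. The model sizes below are hypothetical and chosen only for illustration.

```python
BITS_PER_PARAM = 3.6  # the paper's estimate for GPT-family transformers

def capacity_megabytes(num_params: float) -> float:
    """Rough memorization capacity implied by 3.6 bits per parameter."""
    total_bits = BITS_PER_PARAM * num_params
    return total_bits / 8 / 1e6  # bits -> bytes -> megabytes

# Hypothetical model sizes, for illustration only.
for params in (125e6, 1.3e9, 7e9):
    print(f"{params / 1e9:>5.2f}B parameters ≈ {capacity_megabytes(params):,.0f} MB of capacity")
```

Even a 7B-parameter model tops out in the low gigabytes of raw capacity, which is tiny relative to a modern multi-terabyte training corpus; that gap is what forces generalization once capacity saturates.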

Beyond static measurement, the paper derives scaling laws that predict how memorization patterns shift with data and model size. It reveals that models initially memorize data until their capacity saturates, after which generalization begins to dominate—providing a formal underpinning for the widely observed double descent phenomenon. It also shows how membership inference attacks become harder as datasets grow larger relative to model capacity. These results suggest that memorization is a predictable, quantifiable phenomenon, and not merely an emergent artifact. The framework sets the stage for more rigorous evaluation of privacy, reproducibility, and data influence in LLMs.

Together, these two papers reveal opposite yet deeply intertwined aspects of AI model development. The Darwin Gödel Machine charts the outer frontier of what self-improving systems might look like when left to explore and evolve. How Much Do Language Models Memorize? brings precision and clarity to a key limitation of such systems: their bounded capacity to retain specific information. One pushes forward the architecture of continual progress; the other grounds that progress in the mathematics of representation. As the field grapples with scale, autonomy, and alignment, both papers offer essential tools for understanding what models can become—and what they can (and cannot) remember along the way.

🔎 AI Research

"Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents" – UBC, Sakana AI, Vector Institute
This work introduces the Darwin Gödel Machine (DGM), a self-improving AI system that rewrites its own code to become a better coding agent using frozen foundation models. Inspired by biological evolution, it combines self-modification and open-ended search to achieve significant improvements on coding benchmarks like SWE-bench and Polyglot, representing a step toward safe, recursive self-improvement in AI systems.

"Self-Challenging Language Model Agents" – UC Berkeley & FAIR at Meta
This paper introduces the Self-Challenging Agent (SCA), a framework where language model agents autonomously generate and solve their own tasks using a formalism called Code-as-Task (CaT), which ensures task feasibility, verifiability, and difficulty. Using only self-generated data, the SCA achieves a 2× improvement in performance on multi-turn tool-use benchmarks (M3ToolEval, TauBench), outperforming prior self-improvement and distillation approaches without any human-curated tasks.

"REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards" – OpenThought Lab
This paper introduces Reasoning Gym, a suite of procedurally generated environments designed to train and evaluate reasoning models using reinforcement learning with verifiable rewards. It enables scalable, curriculum-driven learning across domains like math, logic, and games, and reveals that current LLMs struggle with general reasoning unless specifically trained for it.

"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics" – Hugging Face, Valeo.ai, Sorbonne University
SmolVLA is a small, efficient vision-language-action model built for low-cost robots, capable of real-world manipulation using only community-collected datasets. Despite its compact size, it matches or outperforms larger models by leveraging an interleaved attention architecture and asynchronous inference stack.

"How much do language models memorize?" – FAIR at Meta, Google DeepMind, Cornell, NVIDIA
This paper introduces a rigorous framework to quantify how much information language models memorize about specific datapoints, separating unintended memorization from generalization. Through experiments on synthetic and real datasets, it estimates GPT-style models store about 3.6 bits per parameter, and shows that memorization capacity defines a phase transition where generalization begins.

"Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism" – Pluralis Research
The authors propose a new low-rank compression method for model-parallel training that enables efficient decentralized training of billion-parameter models over low-bandwidth connections. Their approach compresses both activations and gradients with minimal overhead, achieving up to 100× communication efficiency and matching centralized training performance across geographically distributed GPUs.
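As a rough illustration of why low-rank factorization saves bandwidth, the sketch below compresses a (batch × hidden) activation matrix into two thin factors before "sending" it. The shapes, rank, and the truncated SVD itself are illustrative assumptions; the paper's actual scheme is a purpose-built compression of both activations and gradients, not a per-message SVD.

```python
import numpy as np

def low_rank_compress(x: np.ndarray, rank: int):
    """Factor a (batch x hidden) matrix into two thin factors; only the
    factors would cross the slow inter-node link."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]  # (batch x r), (r x hidden)

def decompress(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Rebuild the full matrix on the receiving side."""
    return a @ b

batch, hidden, rank = 512, 4096, 64   # illustrative sizes, not from the paper
activations = np.random.randn(batch, hidden)

a, b = low_rank_compress(activations, rank)
ratio = activations.size / (a.size + b.size)
print(f"floats sent shrink by ≈ {ratio:.1f}x at rank {rank}")
```

The savings in practice hinge on the exchanged tensors actually being close to low rank; random data, as here, will not reconstruct faithfully, so the printed ratio only shows the communication arithmetic.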

🤖 AI Tech Releases

Mistral Code

Mistral released Mistral Code, its new coding assistant.

🛠 AI in Production

Voice AI at Airbnb

Airbnb discusses its use of speech AI capabilities for customer support.

📡 AI Radar


Published on The Digital Insider at https://is.gd/YToFS5.
