The Sequence Radar #559: Two Remarkable Papers This Week: Self-Improving Agents and the Limits of LLM Memorization | By The Digital Insider
Agents that improve themselves and the limits of memorization.
Next Week in The Sequence:
We dive into safety evals as part of our series about benchmarking. Research covers Sakana AI's groundbreaking paper about self-evolving models. Our opinion section focuses on the case for spatial intelligence and world models. Engineering will discuss another cool AI framework.
📝 Editorial: Two Remarkable Papers This Week: Self-Improving Agents and the Limits of LLM Memorization
This week featured two standout papers that reveal complementary frontiers of AI development: one that pushes the limits of open-ended, self-improving systems, and another that rigorously quantifies how much information large language models can retain. The first, Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents, presents one of the most credible instantiations yet of recursive self-modifying agents. The second, How Much Do Language Models Memorize?, introduces a principled and practically measurable framework for assessing the memorization capacity of modern LLMs. Both contributions illuminate core dynamics of how AI systems evolve, learn, and remember—and together, they paint a vivid picture of our current trajectory in scaling and aligning intelligent systems.
The Darwin Gödel Machine (DGM) operationalizes the theoretical idea of self-referential improvement by constructing agents that can rewrite their own code to enhance performance. Built atop frozen foundation models, DGM alternates between self-modification and evaluation, benchmarking candidate agents on real-world coding tasks like SWE-bench and Polyglot. It employs a Darwinian mechanism: each agent is added to an archive, and new agents are generated via mutations of prior ones. Crucially, this enables divergent exploration across the agent design space. The system autonomously discovers new tools, workflows, and strategies, leading to performance improvements from 20% to 50% on SWE-bench. The results suggest that recursive, self-directed optimization is not only feasible but increasingly competitive with manually engineered systems.
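To make that loop concrete, here is a minimal sketch of the archive-based evolution described above, assuming two hypothetical helpers: propose_patch (the frozen foundation model rewriting an agent's code) and run_benchmark (scoring the agent on a coding suite). It illustrates the general shape of DGM's outer loop, not Sakana AI's actual implementation.

```python
import random

def dgm_loop(initial_agent, propose_patch, run_benchmark, steps=100):
    """Minimal sketch of a DGM-style outer loop: keep an archive of all
    agents, sample a parent, let a frozen foundation model rewrite its
    code, and keep the child if it runs. `propose_patch` and
    `run_benchmark` are hypothetical stand-ins, not the paper's API."""
    archive = [{"agent": initial_agent, "score": run_benchmark(initial_agent)}]
    for _ in range(steps):
        # Sample a parent from the whole archive, not just the current best,
        # so initially suboptimal branches can still seed later breakthroughs.
        parent = random.choice(archive)
        child = propose_patch(parent["agent"])  # self-modification step
        try:
            score = run_benchmark(child)        # e.g. a SWE-bench subset
        except Exception:
            continue                            # discard agents that crash
        archive.append({"agent": child, "score": score})
    return max(archive, key=lambda e: e["score"])
```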
What distinguishes DGM is its architecture for open-ended discovery. Rather than hill-climbing on a single agent, it maintains a population of diverse, evolving systems—allowing for breakthroughs to emerge from unlikely or initially suboptimal branches. This is a major departure from conventional agent fine-tuning pipelines, which often discard failed explorations. The paper demonstrates that key innovations often trace back to agents that initially underperformed, underscoring the value of preserving and revisiting earlier ideas. With strong safety protocols in place (e.g., sandboxing and human oversight), the DGM framework opens a credible path toward continuously evolving AI systems whose improvements compound autonomously over time.
Meanwhile, How Much Do Language Models Memorize? tackles a long-standing and under-specified question at the heart of LLM behavior: what does a model actually retain from its training data? The authors introduce a formal decomposition of memorization into "unintended memorization" (data-specific retention) and "generalization" (distribution-level abstraction). Using a compression-based method inspired by Kolmogorov complexity, they estimate the total number of bits a model can memorize. Their experiments—which span hundreds of transformer models trained on both synthetic and natural datasets—reveal a striking result: GPT-family models retain roughly 3.6 bits per parameter. This figure quantifies model capacity in a practical, interpretable way, and serves as a foundation for analyzing model behavior, privacy risks, and generalization thresholds.
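A hedged sketch of the compression-style intuition: unintended memorization of a training sequence can be read off as the bits the target model "saves" when encoding that sequence, relative to a reference model that captures only the distribution. The function names and nats-to-bits bookkeeping below are illustrative; the paper's exact estimator is more careful.

```python
import math

def memorized_bits(model_nll_nats, reference_nll_nats):
    """Bits the target model saves over a reference model on one
    training sequence (negative log-likelihoods given in nats).
    Clipped at zero: a worse fit than the reference means no
    unintended memorization is attributed to this example."""
    saved_nats = max(reference_nll_nats - model_nll_nats, 0.0)
    return saved_nats / math.log(2)  # convert nats to bits

def bits_per_parameter(per_example_nll_pairs, n_params):
    """Aggregate over a training set and normalize by parameter count,
    giving a capacity estimate comparable to the paper's ~3.6 figure."""
    total = sum(memorized_bits(m, r) for m, r in per_example_nll_pairs)
    return total / n_params
```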
Beyond static measurement, the paper derives scaling laws that predict how memorization patterns shift with data and model size. It reveals that models initially memorize data until their capacity saturates, after which generalization begins to dominate—providing a formal underpinning for the widely observed double descent phenomenon. It also shows how membership inference attacks become harder as datasets grow larger relative to model capacity. These results suggest that memorization is a predictable, quantifiable phenomenon, and not merely an emergent artifact. The framework sets the stage for more rigorous evaluation of privacy, reproducibility, and data influence in LLMs.
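A back-of-envelope consequence of the 3.6 bits-per-parameter figure: once the information content of the training set exceeds total model capacity, memorization must give way to generalization. The per-token entropy used below is an assumed, illustrative number, not a value from the paper.

```python
# Capacity check using the paper's ~3.6 bits/parameter estimate.
BITS_PER_PARAM = 3.6
params = 1e9                      # a hypothetical 1B-parameter model
capacity_bits = BITS_PER_PARAM * params

assumed_bits_per_token = 2.0      # illustrative entropy of training text
saturation_tokens = capacity_bits / assumed_bits_per_token
print(f"capacity ≈ {capacity_bits / 8 / 1e9:.2f} GB of raw storage")
print(f"memorization saturates around {saturation_tokens:.2e} tokens")
```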
Together, these two papers reveal opposite yet deeply intertwined aspects of AI model development. The Darwin Gödel Machine charts the outer frontier of what self-improving systems might look like when left to explore and evolve. How Much Do Language Models Memorize? brings precision and clarity to a key limitation of such systems: their bounded capacity to retain specific information. One pushes forward the architecture of continual progress; the other grounds that progress in the mathematics of representation. As the field grapples with scale, autonomy, and alignment, both papers offer essential tools for understanding what models can become—and what they can (and cannot) remember along the way.
🔎 AI Research
"Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents" – UBC, Sakana AI, Vector Institute
This work introduces the Darwin Gödel Machine (DGM), a self-improving AI system that rewrites its own code to become a better coding agent using frozen foundation models. Inspired by biological evolution, it combines self-modification and open-ended search to achieve significant improvements on coding benchmarks like SWE-bench and Polyglot, representing a step toward safe, recursive self-improvement in AI systems.
"Self-Challenging Language Model Agents" – UC Berkeley & FAIR at Meta
This paper introduces the Self-Challenging Agent (SCA), a framework where language model agents autonomously generate and solve their own tasks using a formalism called Code-as-Task (CaT), which ensures task feasibility, verifiability, and difficulty. Using only self-generated data, the SCA achieves a 2× improvement in performance on multi-turn tool-use benchmarks (M3ToolEval, TauBench), outperforming prior self-improvement and distillation approaches without any human-curated tasks.
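As an illustration of the Code-as-Task idea, the sketch below bundles an instruction with an executable verifier plus known passing and failing solutions, which is how feasibility and non-triviality can be checked automatically. The field names are hypothetical, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CodeAsTask:
    """Illustrative CaT container: a self-generated task carries its own
    executable check, a solution proving feasibility, and failing
    solutions proving the verifier is non-trivial."""
    instruction: str
    verify: Callable[[str], bool]        # executable check on a candidate answer
    passing_solution: str                # proves the task is feasible
    failing_solutions: list = field(default_factory=list)

    def is_well_formed(self) -> bool:
        return self.verify(self.passing_solution) and not any(
            self.verify(s) for s in self.failing_solutions
        )

# Toy usage: a task whose verifier checks an arithmetic result.
task = CodeAsTask(
    instruction="Compute 17 * 24 and reply with the number only.",
    verify=lambda ans: ans.strip() == "408",
    passing_solution="408",
    failing_solutions=["407", "I don't know"],
)
assert task.is_well_formed()
```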
"REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards" – OpenThought Lab
This paper introduces Reasoning Gym, a suite of procedurally generated environments designed to train and evaluate reasoning models using reinforcement learning with verifiable rewards. It enables scalable, curriculum-driven learning across domains like math, logic, and games, and reveals that current LLMs struggle with general reasoning unless specifically trained for it.
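The sketch below shows the flavor of such an environment: a seeded generator emits a fresh problem together with an exact checker, so the RL reward is programmatic rather than learned. The toy arithmetic task and function names are illustrative; Reasoning Gym's own domains are far richer.

```python
import random

def make_arithmetic_task(seed: int, difficulty: int):
    """Illustrative procedurally generated, verifiable task: a seeded
    generator yields a prompt plus a reward function, so every rollout
    can be scored exactly. A toy stand-in for Reasoning Gym's domains."""
    rng = random.Random(seed)
    nums = [rng.randint(1, 10 ** difficulty) for _ in range(3)]
    prompt = f"What is {nums[0]} + {nums[1]} * {nums[2]}?"
    answer = nums[0] + nums[1] * nums[2]

    def verify(response: str) -> float:
        # Exact-match check doubles as the RL reward signal.
        return 1.0 if response.strip() == str(answer) else 0.0

    return prompt, verify

prompt, verify = make_arithmetic_task(seed=0, difficulty=2)
```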
"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics" – Hugging Face, Valeo.ai, Sorbonne University
SmolVLA is a small, efficient vision-language-action model built for low-cost robots, capable of real-world manipulation using only community-collected datasets. Despite its compact size, it matches or outperforms larger models by leveraging an interleaved attention architecture and asynchronous inference stack.
"How much do language models memorize?" – FAIR at Meta, Google DeepMind, Cornell, NVIDIA
This paper introduces a rigorous framework to quantify how much information language models memorize about specific datapoints, separating unintended memorization from generalization. Through experiments on synthetic and real datasets, it estimates GPT-style models store about 3.6 bits per parameter, and shows that memorization capacity defines a phase transition where generalization begins.
"Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism" – Pluralis Research
The authors propose a new low-rank compression method for model-parallel training that enables efficient decentralized training of billion-parameter models over low-bandwidth connections. Their approach compresses both activations and gradients with minimal overhead, achieving 100× communication efficiency and matching centralized training performance across geographically distributed GPUs.
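A minimal sketch of the underlying idea, assuming truncated SVD as the compressor: instead of shipping a full activation (or gradient) matrix across a slow link, send two thin low-rank factors and reconstruct on the other side. The paper's actual method likely differs in how it chooses and amortizes the factorization.

```python
import numpy as np

def low_rank_compress(x: np.ndarray, rank: int):
    """Factor a matrix via truncated SVD and return the two thin
    factors, which cost far fewer bytes than the full tensor."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]      # fold singular values into U
    return u_r, vt[:rank]             # send these over the slow link

def decompress(u_r: np.ndarray, vt_r: np.ndarray) -> np.ndarray:
    return u_r @ vt_r                 # low-rank reconstruction

# A 1024 x 1024 activation at rank 16 shrinks ~32x in bytes transferred.
x = np.random.randn(1024, 1024).astype(np.float32)
u_r, vt_r = low_rank_compress(x, rank=16)
x_hat = decompress(u_r, vt_r)
```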
🤖 AI Tech Releases
Mistral Code
Mistral released Mistral Code, its new coding assistant.
🛠 AI in Production
Voice AI at Airbnb
Airbnb discusses their use of speech AI capabilities for customer support.
📡AI Radar
IBM Acquires Seek AI and Launches NYC AI Accelerator
IBM has acquired Seek AI, a data analytics startup, and launched the Watsonx AI Labs in New York City to bolster enterprise AI innovation and support local startups.
Salesforce Acquires Moonhub's Team to Enhance AI Hiring Tools
Salesforce has integrated key members of Moonhub, an AI-driven recruiting startup, to strengthen its talent acquisition capabilities, following its recent acquisition of Informatica.
Google Releases App for Running AI Models Locally
Google has quietly launched the AI Edge Gallery app, enabling users to download and run various AI models locally on their devices without an internet connection.
Console Raises $6.2M to Automate IT Tasks with AI
Startup Console secured $6.2 million in seed funding to develop AI tools that automate routine IT tasks, aiming to free up IT teams for more strategic work.
Reddit Sues Anthropic Over Unauthorized Data Use
Reddit has filed a lawsuit against AI startup Anthropic, alleging unauthorized scraping of Reddit content to train its AI models without proper licensing agreements.
Anthropic's AI Writes Its Own Blog with Human Oversight
Anthropic's AI model, Claude, is now generating content for the company's blog, with human editors refining the AI's drafts to ensure quality and accuracy.
Yoshua Bengio Launches LawZero for AI Safety
AI pioneer Yoshua Bengio has founded LawZero, a nonprofit focused on developing "honest" AI systems to detect and prevent potentially harmful behaviors in autonomous agents.
Snowflake to Acquire Crunchy Data for $250 Million
Snowflake plans to acquire Crunchy Data, a PostgreSQL database startup, for $250 million to enhance its data management offerings and support AI application development.
Grammarly Secures $1 Billion to Build AI Productivity Platform
Grammarly has obtained $1 billion in non-dilutive funding from General Catalyst to evolve its writing assistant into a comprehensive AI-driven productivity platform.
OpenAI Reaches 3 Million Business Users and Launches Workplace Tools
OpenAI has announced that it now serves over 3 million business users and has launched new workplace tools to compete with Microsoft's offerings.
Anthropic Unveils Claude Gov for U.S. National Security
Anthropic has launched Claude Gov, a tailored AI model designed to assist U.S. defense and intelligence agencies with tasks like strategic planning and threat analysis, featuring enhanced capabilities for handling classified information.
Anysphere's Cursor Achieves $9.9B Valuation
Anysphere, the developer of the AI coding assistant Cursor, has raised $900 million, reaching a $9.9 billion valuation and surpassing $500 million in annual recurring revenue, positioning itself as a leader in the AI coding tools market.
Rosebud Secures $6M to Enhance AI Journaling App
Rosebud has obtained $6 million in seed funding to advance its AI-driven journaling app, which offers personalized self-reflection and coaching by analyzing user entries, aiming to expand its features and reach.
Solidroad Raises $6.5M to Revolutionize Customer Service Training
Dublin-based startup Solidroad has raised $6.5 million to develop AI tools that coach customer service representatives, analyzing all customer interactions to provide personalized training and improve both human and AI agent performance.
Published on The Digital Insider at https://is.gd/YToFS5.