The Sequence Radar #700: From GPT-5 to Claude Opus, This Crazy Week in Model Releases | By The Digital Insider

One of the most incredible weeks in generative AI.

Created Using GPT-4o

Next Week in The Sequence:

  1. Knowledge: Our series about interpretability continues by discussing the different types of interpretability.

  2. AI of the Week: We dive into OpenAI’s gpt-oss models.

  3. Opinion: We explore the debate between specialized and generalist models.


📝 Editorial: From GPT-5 to Claude Opus, This Crazy Week in Model Releases

In a normal week, the release of GPT-5 alone would have been enough for this editorial, but not this week. Four major model releases—GPT-5, gpt-oss, Genie 3, and Claude Opus—signal where frontier systems are headed and how the ecosystem around them is consolidating. The headline isn’t just “bigger models”; it’s increasingly systems-first: planning, tool use, memory, and grounding are being treated as core capabilities rather than bolt-ons. Together, these launches sketch a stack: generalist reasoners at the top, open and efficient models in the middle, and simulation/generative environments at the bottom that make agents testable—and useful.

GPT-5 is framed less as “more params, more benchmarks” and more as a deliberative engine that can decompose tasks, call tools, and keep long-horizon objectives on track. The interesting bits are in orchestration: better control over reasoning depth vs. latency, more reliable function/tool calling, and guardrails that make high-stakes workflows auditable. In practice, that means moving from “answer my question” to “plan, execute across APIs and data sources, and justify the steps you took”—the difference between a chatbot and an operator.
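To make that “plan, execute, justify” shift concrete, here is a minimal sketch of such a loop. Everything in it is a hypothetical illustration rather than any vendor’s actual API: the tool registry, the canned `call_model` stub, and the trace format are assumptions; a production version would replace the stub with a real frontier-model call that supports function/tool calling.

```python
import json
from typing import Callable

# Hypothetical tool registry: the names and return payloads are made up for illustration.
TOOLS: dict[str, Callable[..., str]] = {
    "search_orders": lambda customer_id: json.dumps({"orders": ["A-1001", "A-1002"]}),
    "issue_refund": lambda order_id: json.dumps({"status": "refunded", "order": order_id}),
}

def call_model(messages: list[dict]) -> dict:
    # Stand-in for a frontier-model call. Here it follows a canned two-step plan
    # (look up the customer's orders, then refund the first one) so the loop runs end to end.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "search_orders", "args": {"customer_id": "C-42"}}
    if len(tool_results) == 1:
        first_order = json.loads(tool_results[0]["content"])["orders"][0]
        return {"tool": "issue_refund", "args": {"order_id": first_order}}
    return {"answer": "Refunded order A-1001; every step is recorded in the trace."}

def run_agent(task: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": task}]
    trace = []                                    # auditable record of every tool call
    for _ in range(max_steps):
        decision = call_model(messages)
        if "answer" in decision:                  # the model judges the objective met
            print("trace:", trace)
            return decision["answer"]
        name, args = decision["tool"], decision.get("args", {})
        result = TOOLS[name](**args)              # execute the requested tool
        trace.append({"tool": name, "args": args, "result": result})
        messages.append({"role": "tool", "name": name, "content": result})
    return "Stopped: step budget exhausted"

if __name__ == "__main__":
    print(run_agent("Refund the most recent order for customer C-42"))
```

The trace is the point: an operator-style agent is only auditable if every tool call, argument, and result is recorded alongside the final answer.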

On the open side, gpt-oss matters because it raises the floor. A strong, permissively licensed model with clean training and fine-tuning hooks gives teams a credible default for private, cost-sensitive workloads. You won’t route everything to a frontier model—nor should you. Expect usage patterns where gpt-oss handles the 80% of tasks that are routine (summaries, extraction, structured generation), while premium tokens are reserved for reasoning spikes, tricky edge cases, and safety-critical calls. The strategic value here is reproducibility and unit economics, not chasing the very last point on leaderboards.
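A minimal sketch of that routing pattern, under stated assumptions: the task taxonomy, the length heuristic, and both client functions below are placeholders invented for illustration, not recommended thresholds or real endpoints.

```python
ROUTINE_TASKS = {"summarize", "extract", "classify", "fill_template"}

def looks_routine(task_type: str, prompt: str) -> bool:
    # Crude heuristic: known routine task types under a length cap stay on the open model.
    return task_type in ROUTINE_TASKS and len(prompt) < 8_000

def call_open_model(prompt: str) -> str:
    # Placeholder for a locally hosted open-weight call (e.g. a self-served gpt-oss instance).
    return f"[open-weight draft] {prompt[:40]}..."

def call_frontier_model(prompt: str) -> str:
    # Placeholder for a hosted frontier-model API call.
    return f"[frontier answer] {prompt[:40]}..."

def route(task_type: str, prompt: str, safety_critical: bool = False) -> str:
    if safety_critical or not looks_routine(task_type, prompt):
        return call_frontier_model(prompt)   # reasoning spikes, tricky edge cases, safety-critical calls
    return call_open_model(prompt)           # the routine ~80%: summaries, extraction, structured generation

if __name__ == "__main__":
    print(route("summarize", "Summarize this support ticket: printer is on fire."))
    print(route("plan", "Design a migration plan for the billing system.", safety_critical=True))
```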

DeepMind’s Genie 3 pushes on a different frontier: world models that you can act in. It’s not just pretty video; it’s controllable, action-conditioned generation that turns prompts into playable scenes and interactive micro-worlds. That unlocks two things: (1) richer pretraining and evaluation beds for agents (you can probe planning, transfer, and failure modes safely), and (2) new creative tools where users sketch mechanics and constraints and the model instantiates a living environment. If the past few years were about text and images, Genie 3 is about dynamics—state that evolves under your actions.

Claude Opus remains the banner for careful, reliable reasoning. The emphasis is still on faithful long-form analysis, disciplined tool use, and safety scaffolding that keeps outputs steerable without turning sterile. In enterprise settings—policy generation, sensitive RAG, code reviews with provenance—Opus tends to win not by flash but by consistency under pressure. Think less “one-shot genius” and more “won’t hallucinate a policy clause at 2 a.m.” That reliability compounds when you wire it into agents, where a single ungrounded step can derail an entire run.

Put together, the pattern is clear. A modern AI stack will route between (a) a frontier planner (GPT-5/Opus) for decomposition and oversight, (b) efficient open models (gpt-oss) for bulk transformation, and (c) grounded simulators/environments (Genie 3) for training, testing, and human-in-the-loop design. Around that, you need infrastructure that was optional before: evaluation harnesses that catch regressions, telemetry for tool calls and traces, policy layers that are programmable, and memory that’s both cheap and compliant.
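As a toy illustration of the evaluation-harness point, the sketch below replays a fixed suite of prompts through whatever model function the router exposes and flags regressions before a change ships. The cases, the `generate` stub, and the substring pass criterion are all assumptions made for the example; real harnesses use graded rubrics or judge models.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    must_contain: str    # simplest possible check; stands in for a proper grading rubric

SUITE = [
    EvalCase("Extract the invoice total from: 'Total due: $1,204.50'", "1,204.50"),
    EvalCase("Summarize: 'The deploy failed because the TLS cert expired.'", "cert"),
]

def generate(prompt: str) -> str:
    # Stand-in for the routed model call under test (open-weight or frontier).
    return prompt

def run_suite() -> None:
    failures = [case for case in SUITE if case.must_contain not in generate(case.prompt)]
    print(f"{len(SUITE) - len(failures)}/{len(SUITE)} cases passed")
    for case in failures:
        print(f"REGRESSION: expected '{case.must_contain}' in output for: {case.prompt[:50]}")

if __name__ == "__main__":
    run_suite()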

🔎 AI Research

Title: CoAct-1: Computer-using Agents with Coding as Actions

AI Lab: University of Southern California & Salesforce Research
Summary:
CoAct-1 is a multi-agent computer-using system that augments GUI-based control with direct programmatic execution, using an Orchestrator to delegate tasks between a vision-language GUI Operator and a coding-capable Programmer. On the OSWorld benchmark, it achieves a state-of-the-art success rate of 60.76%, completing tasks with far fewer steps by replacing long, error-prone GUI sequences with precise Python or Bash scripts, yielding substantial gains in OS-level, multi-application, and file management scenarios.

Title: Cognitive Loop via In-Situ Optimization: Self-Adaptive Reasoning for Science

AI Lab: Microsoft Discovery and Quantum, Office of the CTO
Summary:
CLIO introduces a self-adaptive reasoning framework that allows non-reasoning LLMs like GPT-4.1 to formulate, reflect on, and revise their cognitive processes at inference time using recursive, uncertainty-aware optimization and graph-based belief aggregation. Without any post-training, CLIO surpasses OpenAI’s o3 in both low and high reasoning effort modes on the Humanity’s Last Exam benchmark, while also exposing belief structures, enabling user steerability, and reducing reasoning variance through graph-induced semantic reduction.

Title: Tool-integrated Reinforcement Learning for Repo Deep Search

AI Lab: Peking University & ByteDance
Summary:
This paper introduces ToolTrain, a two-stage training framework combining rejection-sampled supervised fine-tuning and reinforcement learning to improve large language models’ ability to navigate code repositories using retrieval tools for issue localization. Through its lightweight agent RepoSearcher, ToolTrain enables models to perform complex multi-hop reasoning and surpasses commercial models like Claude-3.7 on function-level localization tasks, achieving state-of-the-art performance on SWE-Bench-Verified.

Title: GOEDEL-PROVER-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

AI Lab: Princeton University, NVIDIA, Tsinghua University, Stanford University, Meta FAIR, Amazon, Shanghai Jiao Tong University, Peking University
Summary:
Goedel-Prover-V2 is a state-of-the-art open-source theorem proving model that achieves 90.4% pass@32 on MiniF2F and solves 86 problems on PutnamBench, outperforming much larger models like DeepSeek-Prover-V2-671B with a significantly smaller 32B architecture. This is accomplished via three key innovations: scaffolded data synthesis to generate progressively harder proofs, verifier-guided self-correction using Lean compiler feedback, and model averaging to boost diversity and reasoning generalization during fine-tuning and RL training.

Title: VeriTrail: Closed-Domain Hallucination Detection with Traceability

AI Lab: Microsoft Research
Summary:
VeriTrail is a closed-domain hallucination detection framework that introduces traceability by modeling generative processes as directed acyclic graphs, allowing it to localize where hallucinations emerge, especially in processes with multiple generative steps (MGS). It outperforms baseline methods like RAG, AlignScore, and long-context LMs on two novel benchmarks (FABLES+ and DiverseSumm+) by iteratively verifying claims through intermediate outputs, using LM-driven evidence selection, decomposition, and verdict generation.

🤖 AI Tech Releases

GPT-5

OpenAI released its highly anticipated GPT-5 model.

gpt-oss

OpenAI is back in the open-source race with the release of gpt-oss-120b and gpt-oss-20b, two open-weight models with robust capabilities in areas such as reasoning, tool use, and more.

Claude Opus 4.1

Anthropic released a new version of Claude Opus with strong reasoning, coding and agentic capabilities.

Genie 3

Google DeepMind released Genie 3, an impressive world model that can generate realistic, interactive 3D environments.

Harmony

OpenAI and HuggingFace open-sourced Harmony, a new structured format for LLM responses.

Game Arena

DeepMind and Kaggle collaborated on Game Arena, a new tournament environment for evaluating foundation models.

📡 AI Radar

Published on The Digital Insider at https://is.gd/V4vGJs.
