The Sequence Radar #711: Flash, But Precise: Inside Gemini 2.5 Flash Image | By The Digital Insider

The new release is one of the most impressive image generation and editing models to date.

Image Credit: Google DeepMind

Next Week in The Sequence:

  1. Our series about interpretability continues with an intro to mechanistic interpretability.

  2. The AI of the week will cover the new Hermes 4 model, trained largely on synthetic data.

  3. In the opinion section, we discuss the dynamics among NVIDIA, Huawei, and Intel in the geopolitical chip wars.

Subscribe Now to Not Miss Anything:

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

📝 Editorial: Flash, But Precise: Inside Gemini 2.5 Flash Image

Gemini 2.5 Flash Image (internally nicknamed “nano-banana”) is Google’s new native image generation and editing model, designed to combine low-latency, cost-efficient inference with materially better visual quality and controllability than the image features of Gemini 2.0 Flash. The model exposes four first-class capabilities: multi-image fusion (compositing), character/asset consistency across prompts and edits, fine-grained prompt-based local edits, and edits grounded by Gemini’s world knowledge. It’s available now in Google AI Studio, the Gemini API, and Vertex AI.

Architecturally and operationally, Flash Image is positioned as a native image model rather than a multimodal text model with an image head. That allows it to support targeted transformations and template-driven generation while leveraging the broader Gemini family’s semantic priors (“world knowledge”) for more faithful, instruction-following edits. In practice, this lets a single prompt both understand a sketched diagram and apply complex edits in one step, reducing orchestration overhead in apps that previously required separate vision and image-editing pipelines.
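To make the single-call pattern concrete, here is a minimal sketch using the google-genai Python SDK. The model name (gemini-2.5-flash-image-preview), the input file, and the instruction are illustrative assumptions, and the preview API may change as the model stabilizes.

```python
# pip install google-genai pillow
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()  # reads the API key from the environment

# One request both "reads" the sketch and applies the edit; no separate
# vision model, mask, or orchestration layer is needed.
sketch = Image.open("whiteboard_sketch.png")  # hypothetical input image
instruction = (
    "Interpret this hand-drawn architecture diagram and redraw it as a clean, "
    "color-coded system diagram with the same components and arrows."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[sketch, instruction],
)

# The response interleaves text and image parts; save any returned image.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    elif part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save("clean_diagram.png")
```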

For production asset pipelines, the biggest unlocked workflow is persistent character or product identity: developers can place the same character or object into diverse scenes while preserving appearance, or generate a catalog of consistent brand assets from a single visual spec. Google ships a Studio template demonstrating this behavior, and the model also adheres well to strict visual templates (e.g., real-estate cards, uniform badges), making it suitable for programmatic layout generation and bulk creative operations.
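A minimal sketch of that workflow, under the same SDK and model-name assumptions as above (the reference image and scene list are hypothetical): identity consistency reduces to reusing one reference image across many generation calls.

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()
MODEL = "gemini-2.5-flash-image-preview"  # preview name; may change at GA

reference = Image.open("brand_mascot.png")  # single visual spec for the asset
scenes = [
    "standing at a trade-show booth under studio lighting",
    "hiking a mountain trail at sunrise",
    "reading a book in a cozy library",
]

for i, scene in enumerate(scenes):
    response = client.models.generate_content(
        model=MODEL,
        contents=[
            reference,
            f"Place this exact character, unchanged, in a new scene: {scene}. "
            "Keep proportions, colors, and outfit identical.",
        ],
    )
    # Save each scene variant as a separate catalog asset.
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            Image.open(BytesIO(part.inline_data.data)).save(f"mascot_scene_{i}.png")
```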

The editing toolchain supports precise, text-addressed local edits—blurs, removals, pose changes, colorization, background swaps—without manual masks, enabling granular transformations controlled entirely by natural language. Because edits are semantics-aware, they can chain with understanding tasks (e.g., “read this hand-drawn diagram and color-code the vectors, then remove the annotation in the bottom-left”), which shortens multi-stage image processing flows.
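Because edits are addressed purely in natural language, chaining amounts to feeding each returned image into the next request. A minimal sketch under the same assumptions as above (file names and instructions are hypothetical):

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()
MODEL = "gemini-2.5-flash-image-preview"


def edit(image: Image.Image, instruction: str) -> Image.Image:
    """Apply one natural-language edit and return the edited image."""
    response = client.models.generate_content(model=MODEL, contents=[image, instruction])
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            return Image.open(BytesIO(part.inline_data.data))
    raise RuntimeError("no image part returned")


# Two mask-free edits chained back to back.
img = Image.open("diagram.png")  # hypothetical input
img = edit(img, "Color-code the vectors in this hand-drawn diagram by direction.")
img = edit(img, "Remove the annotation in the bottom-left corner.")
img.save("diagram_edited.png")
```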

Multi-image fusion lets the model ingest several input images and synthesize a coherent composite, such as dropping a photographed product into a new environment or restyling interiors with target textures and palettes. Google’s demo app exposes this as a drag-and-drop workflow; in code, the same operation is a multi-part prompt that mixes text and images and requests a single fused output. This capability is particularly useful for virtual staging, synthetic lifestyle photography, and rapid A/B creative generation.
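As a sketch of that code path (same SDK and model-name assumptions; the product and room photos are hypothetical), fusion is one generate_content call carrying several image parts in a single prompt:

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()

product = Image.open("lamp_product_shot.png")  # hypothetical studio photo
room = Image.open("living_room.png")           # hypothetical target scene

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[
        product,
        room,
        "Place the lamp from the first image on the side table in the second "
        "image, matching the room's lighting and perspective.",
    ],
)

# Save the single fused composite.
for part in response.candidates[0].content.parts:
    if part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save("staged_room.png")
```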

Ecosystem-wise, Flash Image is in preview (stabilization coming “in the coming weeks”), ships with an updated “build mode” in AI Studio (template apps, quick deploy, or save-to-GitHub), and is also being distributed via OpenRouter and fal.ai. All generated and edited images carry an invisible SynthID watermark to support provenance and attribution. On public human-preference leaderboards, the preview model currently ranks first for both Text-to-Image and Image Edit on LMArena, indicating strong early quality and edit fidelity. Google calls out active work on long-form text rendering, even tighter identity consistency, and finer factual detail in images.

🔎 AI Research

Title: STEPWISER: Stepwise Generative Judges for Wiser Reasoning

AI Lab: FAIR at Meta & collaborators
Summary: This paper introduces STEPWISER, a reinforcement learning–trained generative judge that evaluates intermediate reasoning steps in large language models through a meta-reasoning process. It outperforms traditional supervised process reward models by providing more accurate feedback, improving inference-time search, and enhancing downstream training.

Title: UQ: Assessing Language Models on Unsolved Questions

AI Lab: Stanford University & collaborators
Summary: The authors propose UQ, a benchmark of 500 challenging, unsolved Stack Exchange questions to evaluate models on problems without ground-truth answers. They introduce validator pipelines and a community-driven platform, showing that frontier models solve only a fraction of tasks, thereby pushing evaluations toward open-ended, real-world knowledge gaps.

Title: Hermes 4 Technical Report

AI Lab: Nous Research
Summary: Hermes 4 is a family of open-weight hybrid reasoning models trained with large-scale synthetic data, rejection sampling, and specialized environments to integrate multi-step reasoning with broad instruction following. It achieves strong benchmark results across math, coding, knowledge, and alignment tasks, while exhibiting flexible persona adoption and stylistic transfer.

Title: A Scalable Framework for Evaluating Health Language Models

AI Lab: Google Research
Summary: This work introduces Adaptive Precise Boolean rubrics, a framework for evaluating health LLMs that replaces subjective Likert scales with targeted yes/no rubrics to improve reliability and scalability. Tested on metabolic health queries and real patient data, the method halves evaluation time, boosts inter-rater agreement, and enables automated assessments comparable to expert review.

Title: Autoregressive Universal Video Segmentation Model (AUSM)

AI Lab: NVIDIA, CMU, Yonsei University & NTU
Summary: AUSM reformulates video segmentation as sequential mask prediction, unifying prompted and unprompted tasks within a single autoregressive, state-space architecture. It enables parallel training for long video streams, achieves state-of-the-art performance on multiple benchmarks, and delivers up to 2.5× faster training compared to iterative baselines.

Title: rStar2-Agent: Agentic Reasoning Technical Report

AI Lab: Microsoft Research
Summary: This work presents rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning that integrates Python tool use, reflection on code execution, and step refinement. Using an efficient infrastructure, a new GRPO-RoC algorithm, and a compute-light multi-stage RL recipe, the model achieves frontier-level performance—80.6% on AIME24 and 69.8% on AIME25—surpassing much larger models like DeepSeek-R1 while producing shorter, more efficient reasoning traces.

🤖 AI Tech Releases

Gemini 2.5 Flash Image

Google released Gemini 2.5 Flash Image, a native image generation and editing model.

gpt-realtime

OpenAI released gpt-realtime and made its Realtime API for voice agents generally available to developers.

Claude for Chrome

Anthropic unveiled Claude as a Chrome extension.

📡 AI Radar


Published on The Digital Insider at https://is.gd/xC255Y.
