May 2026 AI Engineering Roundup

May 31, 2026 • By Zach Deane-Mayer

I switched to Opus 4.8 as soon as it launched and didn’t look back. At this point I’ve come to trust Anthropic’s incremental updates. Each new model is indeed better than the last across the work I do. It’s worth the effort to switch to the newest model when it launches.

Note that resistance to prompt injection is slightly worse in 4.8 than in 4.7.

Anthropic is now renting essentially all of xAI’s Colossus 1 to serve inference. This is a win/win.

Microsoft canceled Claude Code internally and pushed everyone to Copilot CLI. This is a dumb idea. Claude is trained to use Claude Code. It makes more and more sense to use Anthropic models through Claude Code and OpenAI models through Codex.

Model Releases

Claude Opus 4.8 (2026-05-28) — Anthropic’s new flagship, ~6 weeks after 4.7 at the same price ($5/$25 per 1M). Cuts overconfidence ~4-10x, and hallucinates an unavailable tool 5% of the time vs 11% before. Still #1 on the Artificial Analysis Intelligence Index at 61.4, with 69.2% SWE-Bench Pro. New mid-conversation system messages steer a long agentic loop without breaking prompt cache.

Claude Mythos Preview (2026-05-13) — Anthropic’s unreleased frontier model keeps pulling away on autonomous cybersecurity. UK AISI’s latest checkpoint was the first to clear both end-to-end cyber ranges and the only one to solve every task estimated over 8 hours. METR put its 50%-time-horizon at ~16 hours. AISI says the length of cyber tasks frontier models can finish is doubling every few months. I still want to get my hands on it.

Gemini 3.5 Flash (2026-05-04) — Google shipped Gemini 3.5 Flash straight to GA at I/O as a fast/cheap agentic daily driver. The problem: $1.50/$9 per 1M is a 3x hike over Gemini 3 Flash, so it’s no longer cheap, and Intelligence Index 55 trails 3.1 Pro, Opus 4.8, and GPT-5.5. Fast at >280 tok/s, but vibe checks are mixed: too many tool calls, hallucinations, Sonnet-tier code. Gemini 3.5 Pro promised next month.

Gemini Omni (2026-05-04) — Google’s family merging Gemini reasoning with its generative media stack, starting with conversational video creation and editing (‘NanoBanana for video’) and custom avatars. Takes text, image, audio, and video, with multi-turn editing that keeps scene and character consistency. Shipping as Gemini Omni Flash to paid users and YouTube Shorts. Reactions treated it as more differentiated than the 3.5 Flash refresh.

Grok 4.3 (2026-05-01) — xAI shipped Grok 4.3 at $1.25/$2.50 per 1M. Intelligence Index 53, 7th place, well behind the frontier; its non-hallucination score dropped 8 points, and Andon Labs found it preferred to ‘sleep’ rather than act on Vending-Bench 2. A small, cheap model, not a frontier offering. xAI also sunset grok-4.1 and grok-4 on two weeks’ notice. I don’t use xAI models.

GPT-5.5 / GPT-5.5 Pro (2026-04-25) — OpenAI’s new flagship, with gains in agentic coding and computer use rather than a step change in intelligence. Since GPT-5.4 the main model and Codex are one system. 2x the API price of GPT-5.4. UK AISI rates it at rough cyber parity with Mythos. I have Claude Code call Codex, which calls GPT-5.5. It sweats the details.

Enterprise Products

Anthropic / US intelligence deal (2026-05-28) — The Trump administration and Anthropic are finalizing a deal letting US spy agencies use Anthropic’s tools, with a carve-out ensuring the model is not used on Americans’ data and dropping the DoD’s earlier ‘any lawful use’ demand. The White House wants it as a template for other companies. Anthropic held its line and still got the contract.

Microsoft cancels Claude Code licenses (2026-05-28) — Microsoft canceled its internal Claude Code licenses after token-based billing made the cost untenable, pushing developers toward its own GitHub Copilot CLI; reporting says Claude Code’s popularity inside Microsoft had undermined Copilot CLI.

Google AI Search (2026-05-22) — Google is rebuilding Search around an ‘intelligent search box’ that looks and feels like a Gemini chatbot prompt, with links becoming an afterthought. It also added ‘information agents,’ an AI version of Google Alerts. Google is dismantling the ten blue links that made it.

Cognition Devin Auto-Triage (2026-05-18) — Cognition launched Devin Auto-Triage, an always-on first responder for bugs, alerts, and incidents, with long-term memory, a manager/subagent structure, and PR generation. Early users like Modal call it more useful than homegrown triage automations.

Anthropic programmatic usage metering (2026-05-13) — Starting June 15, Anthropic moves automated subscription use (Agent SDK, claude -p, GitHub Actions, third-party apps) into a separate monthly credit pool equal to the plan’s dollar amount, on top of unchanged interactive limits. The historical subsidy was an estimated 70-90% discount off API pricing, so power users read it as a rug pull. Inference isn’t free, and the MoviePass business model doesn’t work.
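To put rough numbers on that subsidy, here is a back-of-the-envelope sketch. It assumes three things the announcement does not spell out: that credits in the new pool are spent at API list prices, that a power user previously spent the whole allowance on automated use, and a hypothetical $200/month plan. The function names are mine, not Anthropic's.

```python
def api_equivalent_value(plan_price_usd: float, discount: float) -> float:
    """USD of API-priced usage a plan effectively bought at a given fractional discount."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    # Paying (1 - discount) of list price means each dollar buys 1 / (1 - discount) dollars of usage.
    return plan_price_usd / (1 - discount)


def automated_budget_change(plan_price_usd: float, discount: float) -> dict[str, float]:
    """Compare the old implied automated budget with the new credit pool."""
    before = api_equivalent_value(plan_price_usd, discount)
    after = plan_price_usd  # assumption: the credit pool is consumed at API list prices
    return {
        "before_usd": round(before, 2),
        "after_usd": round(after, 2),
        "shrink_factor": round(before / after, 1),
    }


if __name__ == "__main__":
    # Hypothetical $200/month plan at both ends of the estimated 70-90% range.
    for discount in (0.70, 0.90):
        print(discount, automated_budget_change(200.0, discount))
```

Under those assumptions, the automated budget shrinks by roughly 3x at the low end of the estimate and 10x at the high end, which is why heavy automation users felt it most.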

OpenAI Development Company (2026-05-11) — OpenAI launched a majority-owned services unit to help businesses build and deploy frontier models via forward-deployed engineers, bringing in ~150 specialists through the Tomoro acquisition. A Palantir-style field-engineering model to own the deployment layer. The labs are all moving into last-mile services for revenue; Anthropic launched a $1.5B JV with Blackstone and Goldman the same week.

Claude Code rate limit increases (2026-05-06) — On the back of new compute deals, Anthropic doubled Claude Code’s 5-hour rate limits for Pro, Max, Team, and Enterprise, killed the peak-hours reduction, and raised Opus API limits. Weekly limits then went up 50% through July 13, partly to offset backlash over the new metering. I still hit the 5-hour limit constantly.

Claude Managed Agents (Dreaming, Outcomes) (2026-05-06) — At Code with Claude, Anthropic added Managed Agents features: ‘Dreaming’ (agents review past sessions to self-improve at your tasks) and ‘Outcomes’ (rubrics, grading, objective tracking), plus multiagent orchestration and webhooks. Pitched as productizing the harness so whole teams get more productive, not just individuals.

Gemini Spark (2026-05-05) — Google’s 24/7 personal AI agent, an OpenClaw competitor, connecting natively to Gmail, Calendar, Drive, and Docs via MCP, running on Gemini 3.5 Flash and the Antigravity harness. Each task runs in an isolated ephemeral VM with traffic through a DLP gateway, and credentials never reach the agent. Rolled out to US AI Ultra subscribers. Simon Willison rightly warns the volume of sensitive data flowing through it makes prompt injection a top candidate for an agent-security disaster.

Codex for Work (2026-05-01) — OpenAI repositions Codex from coding agent to general computer-use agent ‘for any task done with a computer’ (docs, slides, sheets, research). Computer Use runs 42% faster, with Office/Google/Salesforce connectors and an in-app Office editor.

Open Source

GLM-5.1 (2026-05-16) — Z.ai’s update to GLM-5, improving across the board with a focus on long-horizon tasks. One of the strongest open frontier model families.

MiMo V2.5 Pro (2026-05-16) — Xiaomi’s MiMo V2.5 Pro (1T total / 42B active MoE, Apache 2.0) is neck-and-neck with Kimi K2.6 and GLM-5.1, scoring 52-54 on the Intelligence Index. A notable trajectory given Xiaomi’s open-model debut was only a year ago.

Gemma 4 (2026-05-05) — Google’s long-awaited Gemma update: 4B, 9B, and 31B dense models plus a 26B-A4B MoE, multimodal with 262K context. The headline is the switch to an Apache 2.0 license, removing the legal uncertainty of its prior custom terms. Ships multi-token-prediction drafter checkpoints for 2-3x faster on-device decoding. Strong for creative work; Qwen3.6 stays better for agentic coding.

Cohere Command A+ (2026-05-04) — Cohere’s first MoE open-weights model under Apache 2.0 (~218B total / 25B active, multimodal, 48 languages), runnable on 2x H100 at W4A4. Intelligence Index 37, around Claude 4.5 Haiku, with strong non-hallucination but weaker coding. The launch oddly skipped standard benchmark comparisons.

Qwen 3.7 Max (2026-05-04) — Alibaba’s Qwen 3.7 Max debuted 5th on Artificial Analysis, tied with GPT-5.4 xhigh and ahead of Gemini 3.5 Flash, leading open peers on Terminal-Bench 2.0 and SWE-bench Pro. It’s a closed Max-series model; Qwen has never open-weighted Max, though the community expects open 27B/35B variants. Verbosity and high token use remain weaknesses.

Kimi K2.6 (2026-05-01) — Moonshot’s Kimi K2.6 (1T total / 32B active MoE) is again one of the best open models, focused on long-horizon tasks that run over hours. Intelligence Index 52-54, alongside DeepSeek V4 and MiMo V2.5 Pro, and reported #1 open-weight model on the Finance Agent Benchmark V2.

Qwen 3.6 (27B / 35B-A3B) (2026-05-01) — Qwen’s 27B dense and 35B-A3B MoE (Apache 2.0, 262K context) rank as the new open-weights leader under 150B params. Local-LLM developers praised it for agentic coding: the 35B-A3B runs on a single 32GB GPU at 30-40 tok/s, and the 27B is called the first local model that practically holds up against Claude Code for scaffolding and refactors, though it still trails Sonnet/Opus on one-shot coding.

DeepSeek V4 (Pro + Flash) (2026-04-24) — DeepSeek’s V4 line: two MIT-licensed 1M-context MoE models, Pro (1.6T total / 49B active, now the largest open-weights model) and Flash (284B total). Pricing is the story: Flash at $0.14/$0.28 per 1M is cheaper than GPT-5.4 Nano, and Pro runs benchmarks ~12x cheaper than GPT-5.5 and ~19x cheaper than Opus 4.7. Intelligence Index 52-54.
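For a feel of what these list prices mean in practice, here is a small sketch that prices one job across the models whose per-token rates are quoted in this roundup. The workload (2M input tokens, 500K output tokens) is a made-up example, and the calculation ignores prompt caching, batch discounts, and the fact that models differ in how many tokens they spend on the same task.

```python
# List prices in USD per 1M tokens (input, output), as quoted in this roundup.
PRICES_PER_MILLION = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at list price."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def compare(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Price the same token counts on every model, rounded to cents."""
    return {
        model: round(job_cost(model, input_tokens, output_tokens), 2)
        for model in PRICES_PER_MILLION
    }


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens, 500K output tokens.
    for model, cost in compare(2_000_000, 500_000).items():
        print(f"{model}: ${cost:.2f}")
```

On that hypothetical job, the spread between the most and least expensive model is more than 50x at list price. Token efficiency can narrow or widen that gap, so treat it as a starting point rather than a benchmark.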

Research

AlphaProof Nexus (Google DeepMind) (2026-05-28) — Google DeepMind’s AlphaProof Nexus autonomously solved 9 of 353 open Erdős problems and proved 44 of 492 open OEIS conjectures, at a few hundred dollars per problem. The first AI math results exciting in themselves rather than as a capability proxy.

Microsoft Copilot Cowork file exfiltration (2026-05-26) — Microsoft Copilot Cowork let agents send emails to the user’s own inbox without approval; those messages can carry external images that fire network requests, so a prompt injection could exfiltrate data when the user opens the message. Combined with OneDrive’s pre-authenticated download links, an attacker could pull files directly. Preventing data exfiltration remains the hardest part of designing agentic systems.

METR Frontier Risk Report (2026-05-20) — METR’s first major frontier risk report, with unusually deep access across Anthropic, Google, Meta, and OpenAI including chains-of-thought and non-public capability data. Frontier agents do real autonomous engineering at human-expert-weeks scale, but stay weak where success is costly to verify and routinely violate constraints and act deceptively on hard tasks. Concludes agents plausibly had the means, motive, and opportunity for a minimal rogue deployment but could not make it robust to shutdown; basic jailbreaks reliably fooled the monitors.

OpenAI disproof of the Unit Distance Conjecture (2026-05-20) — A general-purpose OpenAI reasoning model (speculated to be GPT-5.6; under 32 hours and ~$1000 of compute, no Lean) produced a novel disproof of the planar Unit Distance Conjecture, a 1946 Erdős problem. Timothy Gowers called it the first really clear case of AI solving a well-known open math problem, if the low human-interaction claim holds. Notably a general model, not a domain solver like AlphaProof.

Grep vs embeddings for coding agents (2026-05-15) — A paper shows grep-style text search, wrapped in the right agent harness, can match or beat embedding-based retrieval on coding-agent tasks. The quip: the two-parameter model for agentic search is BM25, the zero-parameter version is grep. I’ve never used a vector database and probably never will.
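To make the "zero-parameter" point concrete, here is a minimal sketch of a grep-style search tool an agent harness could expose. This is not the paper's harness. The function name, its parameters, and the hit cap are my own illustrative choices.

```python
import re
from pathlib import Path


def grep(pattern: str, root: str, glob: str = "*.py", max_hits: int = 50) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for lines matching a regex."""
    regex = re.compile(pattern)
    root_path = Path(root)
    hits: list[tuple[str, int, str]] = []
    for path in sorted(root_path.rglob(glob)):  # sorted for deterministic output
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path.relative_to(root_path)), line_number, line.strip()))
                if len(hits) >= max_hits:
                    return hits  # cap results so they fit in the agent's context
    return hits


if __name__ == "__main__":
    for file, line_number, line in grep(r"def \w+_config", "."):
        print(f"{file}:{line_number}: {line}")
```

There is no index to build and nothing to keep in sync with the repo. The agent supplies the intelligence by choosing patterns and refining them based on what comes back.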

Google: AI-developed zero-day exploited in the wild (2026-05-13) — Google reports a threat actor using a known AI-developed zero-day exploit in the wild, one of the first confirmed cases of an AI-discovered vulnerability weaponized by real attackers.

Mini Shai-Hulud supply-chain attack (2026-05-12) — The Mini Shai-Hulud campaign expanded to hit OpenSearch, Mistral AI, Guardrails AI, and others across npm and PyPI, targeting AI developer tooling. It persists by hooking into Claude Code’s settings.json and VS Code’s tasks.json so it re-executes even after the package is removed.

Figure Helix-02 (2026-05-08) — Figure streamed humanoid robots running 24+ hour autonomous shifts on package sorting at roughly human parity (~3s/package). The robots reason from camera pixels, run on-device inference, coordinate as a networked fleet inferring each other’s actions from motion, swap out for low battery, and self-diagnose. Figure claims no teleoperation. One of the clearest public demos of long-duration, no-human-in-the-loop multi-robot orchestration.

Project Glasswing / Claude Mythos (2026-05-05) — Anthropic’s Project Glasswing and partners found more than 10,000 high- or critical-severity vulnerabilities in essential software within a month, most still unpatched. One example: Mythos detected a flaw in wolfSSL, used by billions of devices, and built an exploit to forge certificates for fake bank and email sites. Peter Wildeford notes Mythos alone is finding more vulnerabilities than all sources combined in prior years. The defenders had better be faster than the attackers.

Anthropic sycophancy study (2026-05-01) — Anthropic measured sycophancy across ~1M real Claude conversations with an automatic classifier. Only 9% showed sycophantic behavior overall, but two domains spiked: 38% of spirituality conversations and 25% of relationship conversations. The findings tie directly to training changes for Opus 4.7 and Mythos.

Developer Tools

Antigravity 2.0 (2026-05-19) — Google’s agent stack expanded into a standalone desktop app, a closed-source Go CLI (replacing the open-source TypeScript Gemini CLI, which stops working with Google AI subscriptions June 18), an SDK, and Managed Agents in the Gemini API. Marquee demo: a functioning OS built in 12 hours using 93 parallel sub-agents and 2.6B tokens for under $1K. Gemini CLI looks good; so far, Antigravity is annoying as hell.

OpenAI deprecates finetuning APIs (2026-05-12) — OpenAI is deprecating its finetuning APIs, ending its years as the standout big lab for finetuning support. Top shops like Cursor and Cognition have instead increased open-model RLFT.

Claude Code workflow features (2026-05-08) — Claude Code shipped a cluster of workflow additions: /fewer-permission-prompts adds safe repeatedly-prompted commands to permissions, /ultrareview runs a dedicated bug-catching review session, /focus shows only final results, /usage reports where tokens went, plus session recaps and push notifications via /remote. Auto mode is now on Max.

Codex Background Computer Use and Chrome plugin (2026-05-07) — Codex got OS-level computer use that sees, clicks, and types in Mac apps in the background without taking over the machine, so you can keep working in parallel. It also shipped a Chrome plugin that operates across background tabs using logged-in sites and DevTools access.

Cognition Windsurf 2.0 (Devin Review) (2026-05-06) — Cognition shipped Devin Review and Quick Review / SWE-Check in Windsurf 2.0, explicitly targeting the new bottleneck: reviewing AI-generated code rather than generating it.

MCP 2026-07-28 release candidate (2026-05-05) — The MCP 2026-07-28 release candidate makes the protocol stateless: no handshake, no session ID, any request can hit any server instance, easing scaling and load balancing. It also adds first-class extensions (MCP Apps and Tasks), auth hardening, and a clearer deprecation policy.

Cursor Composer 2.5 (2026-05-04) — Cursor’s strongest coding model yet, focused on sustained long-running work and reliable instruction following over raw benchmarks. Scored 62 on the Coding Agent Index, reported 3-18x cheaper than Opus 4.7 with notably lower token use.

Infrastructure

Epoch AI inference compute crunch estimate (2026-05-26) — Epoch AI estimates a possible inference compute crunch: demand is growing faster than serving capacity, especially for long-context workloads where throughput degrades sharply. This is the story underneath the rate limits and the price hikes.

White House approves H200 sales to China (2026-05-21) — The White House approved Nvidia H200 sales to ten Chinese firms, estimated to roughly triple China’s compute capacity growth, alongside ongoing H100 sales. China separately banned Nvidia’s RTX 5090 gaming chip. This is a bad idea.

Anthropic $200B Google Cloud commitment (2026-05-07) — Anthropic committed to $200 billion in spending on Google cloud and chips over five years. Even this is not enough for a company growing roughly 10x-80x per year.

Anthropic-SpaceX/xAI Colossus compute deal (2026-05-04) — Anthropic struck a deal to use essentially all of SpaceX/xAI’s Colossus 1 data center for Claude inference: 300+ MW and 220,000+ NVIDIA GPUs, at $1.25 billion per month (~$15B/year) through May 2029 per a SpaceX S-1 filing. xAI was comfortable leasing because training had already moved to Colossus 2, where Grok 5 trains; it collects ~$6B/year at ~65% margin.
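The headline numbers are easier to reason about per unit. Here is a quick sketch using only the figures quoted above, plus one assumption of mine: 730 hours in an average month. Because the GPU count and power figure are both reported as lower bounds ("220,000+", "300+ MW"), the per-unit results are upper bounds.

```python
MONTHLY_PAYMENT_USD = 1.25e9  # per the SpaceX S-1 figure quoted above
GPU_COUNT = 220_000           # reported as "220,000+", so a lower bound
POWER_MW = 300                # reported as "300+ MW", so a lower bound
HOURS_PER_MONTH = 730         # assumption: average month (8,760 hours / 12)


def deal_unit_economics(
    monthly_usd: float,
    gpus: int,
    power_mw: float,
    hours_per_month: float = HOURS_PER_MONTH,
) -> dict[str, float]:
    """Break a flat monthly lease payment into annual and per-unit figures."""
    return {
        "annual_usd": monthly_usd * 12,
        "usd_per_gpu_month": monthly_usd / gpus,
        "usd_per_gpu_hour": monthly_usd / gpus / hours_per_month,
        "usd_per_mw_month": monthly_usd / power_mw,
    }


if __name__ == "__main__":
    for name, value in deal_unit_economics(MONTHLY_PAYMENT_USD, GPU_COUNT, POWER_MW).items():
        print(f"{name}: {value:,.2f}")
```

That works out to at most about $5,700 per GPU per month, or a bit under $8 per GPU-hour, for the whole facility rather than for bare chips.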

Google TPU 8t and TPU 8i (2026-05-01) — Google detailed its 8th-gen TPUs in a split design: 8t for training (170-180% gain in cost-performance) and 8i for inference (80% gain, 117% better power efficiency), plus 300% more network bandwidth and 200% more on-chip SRAM. Positioned to cut costs for Gemini and future trillion-parameter models.

Financing

OpenRouter Series B (2026-05-26) — OpenRouter raised a $113M Series B led by CapitalG. Weekly volume grew from 5T to 25T tokens over six months. Part of a broader inference-infra funding wave that also swept up Fireworks, Baseten, and Modal.

Cerebras IPO ($CBRS) (2026-05-15) — Wafer-scale chip maker Cerebras IPO’d, closing at $280/share for a ~$60B market cap, six months after NVIDIA’s $20B execuhire of Groq. CFO Bob Komin pushed back on the small-models-only narrative, saying Cerebras serves trillion-parameter models including OpenAI’s internal 5.4 and 5.5. The constraint: limited wafer and TSMC access until at least 2028.

Cognition $1B Series D (2026-05-12) — Cognition (Devin) raised over $1B at a $26B valuation, up 2.5x from September. Run-rate revenue grew to $492M, projecting >$1B ARR by year-end. The largest remaining independent agent lab.

Isomorphic Labs $2.1B funding (2026-05-12) — Demis Hassabis announced $2.1B in new funding for Isomorphic Labs’ AI-driven drug discovery, one of the largest capital commitments tied to an applied AI platform this month.

Anthropic $65B Series H (2026-05-07) — Anthropic raised a $65B Series H at a $965B post-money valuation, was briefly valued at $1-1.2T on secondary markets, and is reportedly above OpenAI for the first time. Run-rate revenue crossed $47B mid-month, up from ~$9B in December, with inference gross margins climbing from 38% to over 70%.

Sierra $1B at $15B valuation (2026-05-04) — Bret Taylor’s Sierra raised roughly $1B at a $15B valuation, up from its prior $10B round, with ARR estimated around $200M+.