February 2026 AI Engineering Roundup
Opus 4.6 is my new daily driver. I upgraded from Opus 4.5 the day it dropped and haven’t looked back. The million-token context window makes a practical difference: I used to hit compaction in every Claude Code session, and now it happens about once a week. GPT-5.3-Codex is decent, but I don’t use it.
The Pentagon formally designated Anthropic a “supply chain risk” on February 27th, the first time this label has ever been applied to an American company. Dario refused to remove safety guardrails for autonomous weapons and mass surveillance, and the Pentagon actually went through with the designation. Trump ordered all federal agencies off Claude. This is a dangerous escalation from the Pentagon IMHO. They could have just cancelled the contract and moved to a different AI provider.
Model Releases
Gemini 3.1 Pro (2026-02-19) — Doubled Gemini 3.0’s ARC-AGI-2 score to 77%, leads the Artificial Analysis Intelligence Index. 1M context, 64k max output, $2/$12 per 1M tokens—significantly cheaper than Opus or GPT-5.3-Codex. Google’s first 0.1 increment suggests they’re shipping as fast as Anthropic. Agentic scores still lag despite strong benchmarks, but a serious model at a serious price.
Claude Sonnet 4.6 (2026-02-17) — Pitched as Opus-level performance at Sonnet pricing, and it mostly delivers. $3/$15 per 1M tokens. Takes the outright lead on GDPval-AA, even above Opus. Surged +130 points on Code Arena after launch. Now the default model for free and paid Claude users. I use this as my Claude Code subagent model.
Gemini 3 Deep Think (2026-02-12) — Google’s reasoning mode solved 18 previously unsolved research problems and disproved a decade-old mathematical conjecture. 85% ARC-AGI-2, Codeforces top 0.008% of humans, IMO/IPhO/IChO gold-medal level. 82% cheaper than the prior version. I haven’t used this directly, but the research results are impressive.
Claude Opus 4.6 (2026-02-05) — New frontier model just two months after Opus 4.5. SOTA on Terminal-Bench 2.0 and Humanity’s Last Exam, plus a 144-point Elo lead over GPT-5.2 on knowledge work. 1M token context in beta. Refusals for harmless requests down to 0.04%. The 212-page system card is worth reading: safety evals are saturating. Pointed at well-tested codebases, it found high-severity vulnerabilities that had gone undetected for decades. This is my daily driver.
GPT-5.3-Codex (2026-02-05) — OpenAI’s agentic coding model—first frontier model that helped build itself. 77% Terminal-Bench 2.0 (vs Opus 4.6 at 65%), uses less than half the tokens of 5.2-Codex. First OpenAI model rated ‘High’ for cybersecurity risk. $3.50/$28 per 1M tokens. Competitive with Claude Code + Opus 4.6—different tools for different jobs.
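To make the per-token prices in this section concrete, here is a rough sketch of what one workload would cost at list price on three of the models above. The prices are the ones quoted in the entries; the 200k-input / 20k-output workload and the function name are made up for illustration, and real bills also depend on caching and discounts that this ignores.

```python
# Prices in USD per 1M tokens (input, output), as quoted in the entries above.
PRICES = {
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.3-Codex": (3.50, 28.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one workload (hypothetical helper)."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Illustrative workload: 200k tokens in, 20k tokens out.
    for model in PRICES:
        print(f"{model}: ${workload_cost(model, 200_000, 20_000):.2f}")
```

Output-heavy agentic workloads shift the comparison further, since the output price gap between these models is wider than the input price gap.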
Enterprise Products
Anthropic Refuses Pentagon Ultimatum on Claude Access (2026-02-26) — The story of the month. Defense Secretary Hegseth demanded that Anthropic grant unfettered access to Claude or face designation as a supply chain risk. Dario refused, drawing red lines on mass surveillance and autonomous weapons. On Feb 27 the Pentagon formally designated Anthropic a supply chain risk, the first time this label has been applied to a US company. Trump ordered all federal agencies off Claude. Anthropic announced a legal challenge on Feb 28.
Perplexity Samsung Galaxy S26 Integration (2026-02-25) — Perplexity ships as a system-level assistant on every Samsung Galaxy S26 with a ‘Hey Plex’ wake word and deep OS integration. Pre-installed on every device—no download needed. Puts Perplexity on hundreds of millions of devices. Samsung’s open multi-agent approach is the opposite of Apple’s strategy.
OpenAI Responses API WebSockets (2026-02-24) — WebSocket support for long-running, tool-heavy agent workflows. Persistent connection with in-memory state means incremental inputs instead of full context replay—up to 40% faster for workflows with 20+ tool calls. Cursor already reports 30% speed boosts.
Claude COBOL Tool (IBM Impact) (2026-02-23) — IBM stock dropped 13% on news of the Claude COBOL tool, its worst day since 2000, wiping $31B in market value. IBM is right that translating COBOL is the easy part compared with data architecture and runtime replacement. Claude could already work with COBOL; the market overreacted. Still, hundreds of billions of lines of COBOL run in production and the engineer pool shrinks every year.
Anthropic Automatic Prompt Caching (2026-02-19) — Automatic prompt caching with a single parameter—no manual breakpoint management. Cached tokens cost 10% of standard pricing. The Claude Code team treats cache hit rate drops as production incidents. I use this daily.
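A back-of-the-envelope sketch of why cache hit rate matters this much. It uses only the 10% figure above and Sonnet 4.6's $3 per 1M input tokens from the Model Releases section. The function name and the 100k-token prompt are hypothetical, and it deliberately ignores any surcharge for writing to the cache, so treat the savings as an upper bound.

```python
def input_cost(prompt_tokens: int, hit_rate: float,
               price_per_mtok: float = 3.00,
               cached_multiplier: float = 0.10) -> float:
    """Estimated USD input cost for one request.

    hit_rate is the fraction of prompt tokens served from cache (0.0 to 1.0).
    cached_multiplier=0.10 reflects "cached tokens cost 10% of standard pricing".
    Assumption: cache writes are billed like ordinary input tokens.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached = prompt_tokens * hit_rate
    uncached = prompt_tokens - cached
    effective_tokens = uncached + cached * cached_multiplier
    return effective_tokens * price_per_mtok / 1_000_000


if __name__ == "__main__":
    # Illustrative 100k-token prompt at three cache hit rates.
    for rate in (0.0, 0.5, 0.9):
        print(f"hit rate {rate:.0%}: ${input_cost(100_000, rate):.3f} per request")
```

Under these assumptions, going from a 90% hit rate to 0% multiplies input cost by more than five, which is consistent with treating a drop in hit rate as an incident.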
ChatGPT Lockdown Mode (2026-02-13) — First product-level acknowledgment that tool-enabled LLMs expand the attack surface. Deterministically disables Deep Research, Agent Mode, Canvas networking, and file downloads. Available for Enterprise, Edu, Healthcare tiers. Smart: instead of relying on the model to refuse, just turn the tools off. I expect Anthropic to ship something similar.
Meta Smart Glasses Facial Recognition (2026-02-13) — Meta launches ‘Name Tag’ facial recognition on Ray-Ban smart glasses. An internal memo said current political tumult was good timing because ‘civil society groups would have their resources focused on other concerns.’ Meta previously shut down Facebook facial recognition in 2021 and paid $2B to settle related lawsuits. Building the torment nexus, again.
Goldman Sachs + Anthropic Partnership (2026-02-06) — Goldman partners with Anthropic to automate accounting and compliance, with Anthropic engineers embedded at Goldman for six months. Opus 4.6 handles trade reconciliation and KYC/AML across $2.5T in assets. Goldman started with Devin for coding and was ‘surprised’ at how capable Claude was for non-coding tasks.
OpenAI Frontier (2026-02-05) — OpenAI’s platform for deploying autonomous ‘AI coworkers’ with business context, execution environments, identity, and permissions. Launch partners include HP, Oracle, State Farm, Uber. Frontier Alliances with McKinsey/BCG/Accenture followed on Feb 24. SaaS stocks dropped ~50% on the announcement.
GitHub Agent HQ (2026-02-04) — Multi-agent orchestration platform for running Claude, Codex, and Copilot interchangeably within GitHub, VS Code, and Mobile. Available for Copilot Pro+ ($39/mo) and Enterprise, with each agent session consuming one premium request.
OpenAI Codex App (2026-02-02) — Standalone Mac desktop app—parallel agent execution via git worktrees, reusable skills, scheduled automations. Hit 1M+ MAU quickly. Runs in an isolated sandbox with network disabled by default. Included free with ChatGPT subscriptions. I still prefer Claude Code, but competition is good.
Claude Cowork Plugins + Legal Connector (2026-01-30) — Anthropic launched 11 open-source Cowork plugins and triggered a $285B market crash. Thomson Reuters fell 16%, RELX dropped 14%. The legal plugin is essentially high-quality system prompts layered on Claude. Feb 24 update added enterprise marketplace and connectors for Google Drive, Gmail, DocuSign. The market reaction was overblown.
Open Source
Pydantic Monty (2026-02-27) — Rust-based Python interpreter that starts in under 1 microsecond and blocks filesystem/network by default—purpose-built for safely running LLM-generated code. Key insight: start from nothing and selectively grant capabilities. Will power ‘code mode’ in Pydantic AI. Sandboxing LLM code is a real problem, and this is the best solution I’ve seen.
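The "start from nothing and selectively grant capabilities" idea can be illustrated without any sandbox at all. The sketch below is not Monty's API and provides no real isolation. `CapabilityHost`, `grant`, and `call` are hypothetical names that show only the deny-by-default pattern, where untrusted code can reach the outside world solely through functions the host has explicitly registered.

```python
from typing import Any, Callable, Dict


class CapabilityHost:
    """Deny-by-default registry: nothing is callable until it is granted."""

    def __init__(self) -> None:
        self._granted: Dict[str, Callable[..., Any]] = {}

    def grant(self, name: str, fn: Callable[..., Any]) -> None:
        """Explicitly expose one host function under a fixed name."""
        self._granted[name] = fn

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        """Entry point for untrusted code; refuses anything not granted."""
        if name not in self._granted:
            raise PermissionError(f"capability not granted: {name}")
        return self._granted[name](*args, **kwargs)


if __name__ == "__main__":
    host = CapabilityHost()
    host.grant("add", lambda a, b: a + b)  # a harmless, pure capability

    print(host.call("add", 2, 3))
    try:
        host.call("read_file", "/etc/passwd")  # never granted, so refused
    except PermissionError as exc:
        print(exc)
```

The design choice worth noting is the direction of the default: an allowlist fails closed when someone forgets to register a function, whereas a blocklist fails open.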
Inception Mercury 2 (2026-02-24) — Uses diffusion instead of autoregression to generate ~1,000 tok/s by refining all tokens in parallel—10x faster than Haiku. Not frontier intelligence (roughly Haiku-tier), but the speed unlocks agent loops and voice assistants that feel native. Backed by Andrew Ng, Karpathy, and Eric Schmidt. I’m very curious where this architecture goes.
ggml.ai / llama.cpp joins Hugging Face (2026-02-20) — Biggest open-source acquisition for local AI since llama.cpp launched in early 2023. Everything stays MIT licensed, Georgi keeps full technical autonomy. Roadmap targets single-click transformers integration and first-party GGUF quantizations on Hub. I love this.
Qwen3.5 (2026-02-16) — Most complete open model family release this month. Flagship 397B MoE with 256K context, 201 languages, Apache 2.0. 77% SWE-bench Verified. Medium series runs at 62 tok/s on an RTX 4070 Super and outperforms the previous 235B model. Multimodal from the ground up. The open model family I’d recommend to most engineers right now.
DeepSeek V4 Lite (1M context) (2026-02-12) — DeepSeek quietly rolled out a 1M-token context window—widely seen as infrastructure prep for the full V4 launch. Community testers report >60% accuracy at the full 1M length. Not the full V4 (rumored trillion-param model arriving in early March), but the context window upgrade alone is significant. I haven’t tried it yet.
MiniMax M2.5 (2026-02-12) — Best cost-performance ratio in AI right now. 230B MoE (10B active), 80% SWE-bench Verified, first open model to beat Claude Sonnet on OpenHands. ~95% cheaper than Opus. MiniMax runs 80% of their own code commits through it.
GLM-5 (2026-02-11) — Zhipu AI shipped a frontier-class open model trained entirely on Huawei Ascend chips—zero NVIDIA hardware. 744B MoE (40B active), MIT licensed, 78% SWE-bench Verified. $1/$3.20 per 1M tokens, ~5-8x cheaper than Opus. The first Chinese model I’d seriously consider for production coding work.
Unsloth MoE Training (Triton Kernels) (2026-02-10) — Custom Triton kernels make MoE fine-tuning accessible on consumer hardware: 12x faster training, 35% less VRAM, 6x longer context—no accuracy loss. Fine-tune gpt-oss-20b in just 12.8GB VRAM. Works on everything from RTX 3090 to B200. If you’re fine-tuning MoE models, this is mandatory.
Voxtral Transcribe 2 / Voxtral Mini 4B Realtime (2026-02-04) — Apache 2.0 real-time speech transcription that actually competes with commercial APIs. 4B params, <500ms latency, 13 languages, runs on a single 16GB GPU. Community built a pure C implementation and a browser WASM version within days. Open-source real-time ASR is now solved.
Qwen3-Coder-Next (2026-02-03) — First open-weights coding model local LLM enthusiasts actually use daily. 80B MoE with only 3B active params, Apache 2.0, 256K context. SWE-bench Pro score roughly on par with Claude Sonnet 4.5—remarkable for a model that fits in 48GB RAM. Unsloth community pushed GGUF throughput to 450+ tok/s on consumer hardware. I want to try running this locally.
StepFun Step-3.5-Flash (2026-02-02) — Sparse MoE activating only 11B of 196B total parameters. 74% SWE-bench Verified, 256K context, up to 350 tok/s on Hopper GPUs. Runs on a 128GB Mac Studio as int4 at 35 tok/s. Apache 2.0. A sleeper hit from the China open model sprint.
Research
OpenAI Retires SWE-Bench Verified (2026-02-23) — The benchmark that dominated AI coding discourse for two years is dead. OpenAI’s audit found 60% of remaining unsolved problems are actually unsolvable, and every frontier model showed contamination. SWE-bench Pro is the replacement—models that cleared 70% on Verified score ~23% on Pro. A useful reminder that leaderboards silently rot.
METR Time Horizons: Capability Doubling Every ~3 Months (2026-02-20) — The best proxy for agentic capability progress. Opus 4.6 reached ~14.5 hours (highest point estimate ever), with capability doubling time now at 3-4 months—outrunning even Leopold Aschenbrenner’s most aggressive forecasts. METR warns the task suite is near saturation. This is the metric I watch most closely.
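To get a feel for what a 3-4 month doubling time implies, here is the plain exponential extrapolation, horizon(t) = h0 * 2^(t / T). It uses the ~14.5 hour estimate and the doubling range quoted above. The function name and the 12-month lookahead are illustrative, and since METR itself warns the task suite is near saturation, this is arithmetic rather than a forecast.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float) -> float:
    """Extrapolate a time horizon assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_hours * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Start from ~14.5 hours and look 12 months ahead at both ends of the range.
    for doubling in (3, 4):
        hours = projected_horizon(14.5, 12, doubling)
        print(f"doubling every {doubling} months: ~{hours:.0f} hours after 12 months")
```

The gap between the two ends of the range compounds quickly: one year out, a 3-month doubling time gives twice the horizon of a 4-month one.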
Anthropic: Measuring AI Agent Autonomy in Practice (2026-02-18) — First large-scale empirical study of how people actually use AI agents, based on millions of tool-using API interactions. ~73% of tool calls are human-in-the-loop, and only 0.8% are irreversible. Median Claude Code session length grew from 25 minutes to 45+ minutes. Anthropic calls this a ‘deployment overhang’: models can handle more autonomy than users grant them.
AI Nuclear War Game Study: LLMs Escalate to Nuclear Use in 95% of Simulations (2026-02-17) — King’s College London ran 21 simulated nuclear crises with GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash. Tactical nukes used in 95% of games; no model ever chose a de-escalatory option despite eight being available. Claude was ‘a calculating hawk,’ Gemini was ‘the Madman’ who reached strategic nuclear war by turn 4. As the researcher noted, ‘the nuclear taboo doesn’t seem to be as powerful for machines as for humans.’
UK AISI Boundary Point Jailbreaking (2026-02-17) — First fully automated black-box attack to defeat both Anthropic’s Constitutional Classifiers and OpenAI’s GPT-5 classifier. Uses evolutionary search on the decision boundary, requiring only one bit per query. 26% success against Constitutional Classifiers (68% with elicitation) for ~$330. Key insight: hard to stop per-interaction, but triggers many flags during optimization, so batch-level monitoring is the right mitigation.
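The batch-level monitoring idea can be sketched as a per-account sliding window over classifier flags: any single flagged request looks unremarkable, but an optimization run produces a burst. Everything here is hypothetical (the class name, the default 50-flag threshold, the one-hour window), since the entry gives no concrete parameters, and it assumes timestamps arrive in order for each account.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FlagBurstMonitor:
    """Counts classifier flags per account inside a sliding time window."""

    def __init__(self, max_flags: int = 50, window_seconds: float = 3600.0) -> None:
        self.max_flags = max_flags
        self.window_seconds = window_seconds
        self._flags: Dict[str, Deque[float]] = defaultdict(deque)

    def record_flag(self, account_id: str, timestamp: float) -> bool:
        """Record one flagged request.

        Returns True when the account has exceeded max_flags within the
        window, which is the signal that suggests an automated search.
        """
        window = self._flags[account_id]
        window.append(timestamp)
        # Drop flags that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while window and window[0] < cutoff:
            window.popleft()
        return len(window) > self.max_flags


if __name__ == "__main__":
    monitor = FlagBurstMonitor(max_flags=3, window_seconds=60.0)
    # Four flags inside one minute: only the last call should trip the alarm.
    print([monitor.record_flag("acct-1", t) for t in (0.0, 10.0, 20.0, 30.0)])
```

A real deployment would presumably aggregate across related accounts as well, since an attacker can spread queries around; this sketch covers only the single-account case.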
Opus 4.6 System Card: Safety Evaluations Are Breaking Down (2026-02-05) — The entire safety evaluation infrastructure is saturating. All ASL-3 CBRN evals crossed thresholds, ASL-4 bio benchmarks ‘no longer provide meaningful signal,’ Cybench hit ~100%, most autonomy evals are maxed. The ASL-3 vs ASL-4 autonomy determination now rests on a survey of 16 Anthropic staff rather than quantitative tests. Credit to Anthropic for publishing this level of detail—most labs wouldn’t.
Anthropic: AI Coding Assistants Boost Productivity but Impair Learning (2026-01-29) — Anthropic’s RCT with 52 mostly junior engineers: AI-assisted developers scored 17% lower on subsequent skill assessments. The damage depended on usage, though: ‘delegation’ patterns killed learning, while ‘cognitive engagement’ patterns (asking for explanations) preserved it. The critical variable isn’t whether you use AI; it’s whether you’re thinking alongside it. I suspect this generalizes beyond coding.
Developer Tools
Xcode 26.3 Agentic Coding (2026-02-26) — Apple ships Xcode with native support for agentic coding via Claude Agent SDK and OpenAI Codex, plus MCP server integration. Agents can create files, build projects, run tests, and access Apple developer docs.
GitHub Copilot CLI GA (2026-02-25) — Full agentic dev environment now, not just a terminal assistant. Fleet mode for parallel subagents, background delegation, repo-wide deep research via MCP. Multi-model (Opus 4.6, GPT-5.3-Codex, Gemini 3 Pro). Premium requests drain fast with Opus though. I haven’t tried it yet but the GitHub integration looks good.
Claude Code Remote Control (2026-02-24) — Start a Claude Code session on your machine, send prompts from Claude web/iOS/desktop. Files and MCP servers stay local—only messages travel through Anthropic’s servers. Pro and Max plans. I use this constantly—kick off a refactor at my desk, steer it from my phone while walking the dog.
Cursor Cloud Agents (2026-02-24) — Agents deliver video demos instead of code diffs. Cloud Agents spin up VMs, write code, test it, and deliver merge-ready PRs with video walkthroughs. 30% of Cursor’s internal PRs are agent-created. I haven’t switched from Claude Code, but the demo-first review UX is better than reading diffs.
Claude Code Agent Teams (2026-02-05) — Multi-agent coordination for Claude Code: a lead agent delegates to sub-agents that pick tasks, lock files, and sync via git. Stress test: 16 parallel Claude instances built a 100K-line C compiler for ~$20K. Experimental and off by default. I use this for large refactors.
VS Code Multi-Agent Development (2026-02-05) — VS Code v1.109 ships an Agent Sessions sidebar for managing local, background, and cloud agents. Claude and Codex run as first-class agents alongside Copilot. I run Claude Code in the terminal, but VS Code’s agent management UX is getting hard to ignore.
Infrastructure
Amazon-OpenAI $50B Cloud Partnership (2026-02-27) — Amazon is now a major investor in both OpenAI and Anthropic. AWS becomes exclusive third-party cloud for OpenAI Frontier. OpenAI committed to $100B in AWS spend over 8 years. Microsoft keeps exclusive rights to stateless API calls. The AI cloud partnerships are getting so tangled I need a diagram.
Meta-AMD 6GW Infrastructure Deal (2026-02-24) — Meta signed a multi-year deal for up to 6GW of AMD Instinct GPUs. AMD included a warrant for 160M shares (~10%) vesting on deployment milestones. Combined with OpenAI’s 6GW AMD deal from October, the two most ambitious AI infrastructure builders have committed to 12GW of AMD GPUs. Nvidia’s monopoly has real cracks. Competition is good.
Taalas HC1 Custom ASIC (2026-02-20) — The most extreme inference bet: bake the model permanently into silicon. HC1 runs Llama 3.1 8B at 16,960 tok/s per user—nearly 10x faster than Cerebras, ~100x faster than H100. Tradeoff is total model lock-in—the chip runs exactly one model, forever. New tape-outs take ~2 months. I love the audacity.
NVIDIA GB300 NVL72 (2026-02-16) — Blackwell Ultra claims ~50x higher throughput per megawatt and ~35x lower cost per token vs. Hopper. Rack-scale system unifies 72 GPUs at ~140kW, priced around $3M per rack. Microsoft, CoreWeave, and Oracle deploying at scale. Vendor-claimed numbers, but even at half these gains the inference cost curve is brutal.
Hyperscaler AI Capex Hits $650B (2026-02-06) — Combined 2026 capex for the big four hyperscalers projected at $635-665B, up ~70% from 2025. Amazon leads at ~$200B, Google $175-185B, Microsoft ~$145B, Meta $115-135B. ~75% is AI infrastructure. Power bottleneck has replaced GPU availability as the primary constraint. These companies are becoming de facto energy companies.
Financing
OpenAI $110B Funding Round (2026-02-27) — Largest private fundraise in history. $50B from Amazon (exclusive cloud for OpenAI Frontier), $30B from Nvidia, $30B from SoftBank. 900M weekly active ChatGPT users. Microsoft notably absent.
MatX $500M Series B (2026-02-24) — Another Nvidia challenger. $500M led by Jane Street and Leopold Aschenbrenner’s fund, with Karpathy and the Collison brothers investing. Hybrid SRAM/HBM memory chip targeting training and inference, shipping 2027 via TSMC.
Hugging Face Acquires ggml.ai (2026-02-20) — Hugging Face acquires the company behind llama.cpp, foundational to running LLMs on consumer hardware since March 2023. Plans include seamless transformers integration and better local inference packaging. Great outcome for the local AI ecosystem.
Anthropic $30B Series G (2026-02-12) — $30B at $380B post-money, more than doubling from September. $14B ARR, projecting $18B this year. Claude Code run-rate hit $2.5B (doubled since January). Zvi thinks the valuation is low given Opus 4.5, Opus 4.6, and Claude Code momentum. I agree with Zvi.
Cerebras $1B Series H (2026-02-04) — Nearly tripled valuation in five months—from $8B to $23B—driven by a $10B OpenAI deal for 750MW of compute. IPO reportedly planned for Q2 2026.
ElevenLabs $500M Series D (2026-02-04) — Crosses into decacorn territory at $11B, led by Sequoia with a16z. Tripled valuation in one year on $330M+ ARR. Building toward IPO. Voice AI is a real business now.
SpaceX Acquires xAI (2026-02-02) — Biggest merger of all time. $1.25T combined deal (SpaceX at $1T, xAI at $250B). Stated rationale: orbital data centers—Musk claims lowest-cost AI compute will be in space within 2-3 years. xAI currently burning ~$1B/month. I’m skeptical of xAI’s long-term competitiveness, but you can’t ignore the scale.