
Dan

2026-02-05 Daily AI News

#ai

GPT-5.2's "high" reasoning mode has raised the ceiling on scalable cognition, posting a 50% time horizon of 6.6 hours (95% CI: 3h20m to 17h30m) on METR's expanded software-engineering suite, the longest measured to date, while also leading rivals at the 80% success threshold in linear-scale evals and delivering 40% latency reductions alongside Codex. Perplexity AI simultaneously unveiled Deep Research Advanced on Opus 4.5, topping external benchmarks across finance, law, medicine, and tech under the open-sourced DRACO rubric, which scores synthesis over 100 real-world tasks, with a polished UI rollout for Max/Pro users. Together, the two launches signal a paradigm in which inference-time compute scales task horizons from minutes to half-days, hardening multi-hour agency into table stakes for production AI.
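
For context on the headline metric: METR's time horizon is estimated by fitting a logistic curve of success probability against the log of each task's human completion time, then reading off the task length at which the fit crosses 50% (or 80%). Below is a minimal sketch of that fit; the eval records are made up, and the real suite adds bootstrap confidence intervals and task weighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical eval records: human time-to-complete (hours) and pass/fail.
times = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 24.0])
passed = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit P(success) = sigmoid(a + b * log2(t)); success should fall with length.
X = np.log2(times).reshape(-1, 1)
fit = LogisticRegression(penalty=None).fit(X, passed)
a, b = fit.intercept_[0], fit.coef_[0, 0]

# The p-level horizon is the task length where the curve crosses p,
# i.e. a + b * log2(t) = logit(p).
def horizon(p: float) -> float:
    return 2.0 ** ((np.log(p / (1 - p)) - a) / b)

print(f"50% horizon: {horizon(0.5):.1f} h, 80% horizon: {horizon(0.8):.1f} h")
```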

Yet tensions persist: poker evals still expose logical brittleness across replayed hands, underscoring that raw duration amplifies flaws without architectural cures, while a Nature comment posits that LLMs already satisfy human-level general intelligence without full task mastery (like Einstein minus Mandarin), reframing AGI as probabilistic generality rather than exhaustive prowess.

*Figure: GPT-5.2 METR time-horizon plot*

One year after the "vibe coding" meme, Andrej Karpathy charts its evolution into "agentic engineering": orchestrating LLM agents on professional codebases under rigorous oversight. David Shapiro goes further, forecasting that full autonomy will obsolete such tools within 1-2 years via networks like OpenClaw/Moltbook built atop inference scaling, culminating in economy-wide automation. Allie K. Miller exemplifies the shift with Claude Code's Google Workspace integration, which parses screenshots into timezone-aware meetings and goal-aligned debriefs, while Matt Shumer spotlights multi-agent hierarchies in which a lead agent manages sub-agents for complex coding (a minimal sketch follows below). Meanwhile, OpenAI's Codex has racked up 500k app downloads since Monday's launch, fueling a builder exodus. China counters with qwen3-coder-next: an 80B-parameter MoE (3B active) trained on 800k verifiable tasks, runnable locally and free across Claude, Cursor, and browser stacks.
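
Shumer's hierarchies reduce, at their core, to an orchestrator that decomposes a goal and delegates to disposable sub-agents. Here is a minimal sketch of that pattern, assuming a generic `complete()` chat call (a stub here, not any vendor's API); production systems add shared memory, tool use, and retries.

```python
from dataclasses import dataclass, field

def complete(prompt: str) -> str:
    # Stand-in for a real chat-completion call; echoes so the sketch runs.
    return f"(model output for: {prompt[:48]}...)"

@dataclass
class Agent:
    role: str                                   # e.g. "planner", "coder", "reviewer"
    history: list = field(default_factory=list)

    def run(self, task: str) -> str:
        result = complete(f"You are the {self.role}.\nTask: {task}")
        self.history.append((task, result))
        return result

def orchestrate(goal: str) -> str:
    """Lead agent decomposes the goal; sub-agents execute; a reviewer gates the merge."""
    plan = Agent("planner").run(f"List independent coding subtasks, one per line: {goal}")
    subtasks = [line for line in plan.splitlines() if line.strip()]
    drafts = [Agent("coder").run(sub) for sub in subtasks]  # one sub-agent per subtask
    return Agent("reviewer").run("Critique and merge these drafts:\n" + "\n---\n".join(drafts))

print(orchestrate("add OAuth login to the billing service"))
```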

This inversion, with humans as conductors and agents as executants, compresses software lifecycles but opens arbitrage windows: John Rush warns that the next 36 months favor "dopers" (smarter users amplifying their leverage), birthing national AI monopolies as adaptation lags harden generational divides.

Sam Altman's retort to Anthropic's Super Bowl ad, which critiqued hypothetical OpenAI ads, exposes core rifts: OpenAI prioritizes free access (out-scaling Claude's US users in Texas alone) and a democratic ecosystem over Anthropic's controls, such as coding-API blocks on rivals and prescriptive usage rules, even as Amazon eyes a tens-of-billions investment in custom models to supercharge Alexa and its enterprise stack. Google quietly thrives with 17% YoY search-revenue growth despite GPT-4-era doomsaying, Gemini now processing 10B tokens per minute via the API and serving 750M MAU. Stability AI's Emad Mostaque demands open-source stacks for government, healthcare, and finance in 2026, echoing Yann LeCun's bazaar-model acceleration through fast publication.

The builder ethos triumphs in the short term, with Codex "winning" per Altman, but risks commoditization; proprietary moats like Perplexity's sandbox persist even as open challengers erode them, tilting the field toward resilient pluralism over singular authority.

NVIDIA's DreamZero World Action Model (WAM), trained on diverse video-first data rather than repeated task demonstrations, unlocks zero-shot prompting for novel verbs, nouns, and environments via pixel-dreamed futures, bridging robot morphologies and human videos with only 55 trajectories needed to adapt to unseen hardware. Humanoid frameworks accelerate in parallel: HUSKY applies physics-aware deep RL to skateboarding via kinematic truck-steering constraints, while HumanX converts monocular videos into blind/MoCap skills via XGen retargeting plus XMimic imitation. Kling 3.0 parallels this on the simulation side with 3-15s 1080p multi-character audio/video, frame-level control, and crisp text handling.
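
To make "pixel-dreamed futures" concrete: one common reading of world-action models is a video predictor paired with an inverse-dynamics head, so language-conditioned imagination in pixel space yields a frame sequence from which per-step actions are read off. Below is a toy sketch under that reading; every function is a hypothetical placeholder, not NVIDIA's published interface.

```python
import numpy as np

def dream_next_frame(frame: np.ndarray, instruction: str) -> np.ndarray:
    # Placeholder for the video world model: predict the next frame of the task.
    return frame

def inverse_dynamics(frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    # Placeholder action head: infer the command linking two consecutive frames.
    return np.zeros(7)  # e.g. a 7-DoF arm action

def act_by_dreaming(obs: np.ndarray, instruction: str, horizon: int = 16) -> list:
    """Dream a pixel-space future for the instruction, then read actions off it.
    If morphology enters mainly through the action head (a hedged reading of the
    55-trajectory result), a small trajectory set can adapt to unseen hardware."""
    actions, frame = [], obs
    for _ in range(horizon):
        nxt = dream_next_frame(frame, instruction)
        actions.append(inverse_dynamics(frame, nxt))
        frame = nxt
    return actions

plan = act_by_dreaming(np.zeros((224, 224, 3)), "pick up the red mug")
```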

Diversity over repetition redefines embodiment scaling and portends open-world physical agency; yet cross-embodiment gaps linger, demanding pixel-level universality to escape morphology silos.

*Figure: exponential task-duration wall for GPT-5.2 high*
