This is the February 09, 2026 edition of the Daily AI Rundown newsletter. Subscribe on Substack for daily AI news.
Tech News
Other News
Prior to its public launch, Anthropic provided early access to its Claude Opus 4.6 model to select partners including Shopify, Harvey, and bolt.new to stress-test the system against complex, real-world workloads. These organizations employed a mix of automated benchmarks and qualitative "vibe checks," with legal AI platform Harvey reporting a record-breaking 90.2% score on its specialized BigLaw Bench. The intensive testing period allowed developers to identify breakthroughs in analytical reasoning and ensure their own platforms were ready for immediate deployment upon the model's release. This collaborative feedback loop serves as a critical final stage in refining the model’s capabilities and ensuring enterprise-grade performance before it reaches the general public.
OpenAI’s standalone Codex application for Mac, built around its advanced GPT-5.3-Codex model, surpassed one million downloads in its first week, mirroring the explosive early growth of ChatGPT. Positioned as a "command center" for agentic coding, the app lets developers coordinate multiple AI agents across workflows such as parallel worktrees and automated code maintenance. CEO Sam Altman indicated that while the tool is currently available to Free and "Go" tier users, the company plans to implement stricter usage limits soon to manage high operational costs. This rapid adoption occurs amid an intensifying "AI coding war," as OpenAI faces mounting pressure from rivals like Anthropic’s Claude Code and the model-agnostic Kilo CLI.
Nvidia, in collaboration with researchers from institutions including Stanford and UC Berkeley, has launched DreamDojo, a pioneering robot "world model" trained on a record-breaking 44,000 hours of human video. The system utilizes a two-phase learning process that allows humanoid machines to acquire general physical knowledge from human observation before fine-tuning for specific hardware, effectively bypassing the traditional bottleneck of expensive, robot-specific data collection. By leveraging a dataset 15 times larger than previous models, DreamDojo achieves real-time interaction speeds across multiple robotic platforms, including the GR-1 and AgiBot. This development represents a significant shift in AI robotics, offering a scalable method to train the next generation of humanoid machines for complex, unstructured environments.
A new MIT study reveals that popular platforms used to rank large language models (LLMs) are highly sensitive to small data fluctuations, making them potentially unreliable for business decision-making. Researchers discovered that removing a minuscule fraction of crowdsourced user feedback can drastically shift leaderboard standings, suggesting that a model's top-ranked status may depend on only a handful of influential votes. To combat this instability, the team developed an evaluation method to identify the specific data points most responsible for skewing results. These findings highlight an urgent need for more rigorous ranking strategies as organizations increasingly rely on these platforms to select models for high-stakes, costly applications.
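The fragility the MIT team describes is easy to reproduce in miniature. The toy sketch below is not the study's actual method; the model names and vote counts are invented. It ranks two closely matched models by pairwise win rate from crowdsourced votes, then shows the top spot flipping when a handful of votes is dropped:

```python
from collections import defaultdict

def leaderboard(votes):
    """Rank models by pairwise win rate over crowdsourced head-to-head votes."""
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return sorted(games, key=lambda m: wins[m] / games[m], reverse=True)

# Two closely matched (hypothetical) models: B wins 204 of 400 votes (51%).
votes = [("B", "A")] * 204 + [("A", "B")] * 196

print(leaderboard(votes))       # ['B', 'A']
print(leaderboard(votes[10:]))  # ['A', 'B'] -- dropping 10 votes (2.5%) flips the ranking
```

The closer two models are in quality, the fewer removed votes it takes to reorder them, which is exactly why a top-ranked status can hinge on a handful of influential data points.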
Anthropic has introduced a new "fast mode" for Claude Code, its agentic coding tool, powered by the Opus 4.6 model and designed to accelerate response times. Users can now toggle the feature within Claude Code to prioritize speed, enabling quicker code generation and analysis and potentially streamlining developer workflows. According to Anthropic, the faster responses come with the same reliability and quality as the standard mode.
A study published in *Nature Medicine* reveals that large language models (LLMs) fail to improve the accuracy of medical self-assessments when used by the general public, despite the models' high scores on medical licensing exams. In a randomized trial of nearly 1,300 participants, models such as GPT-4o and Llama 3 correctly identified underlying conditions in 94.9% of cases when tested in isolation, yet participants using these tools identified the correct conditions in fewer than 34.5% of scenarios, performing no better than a control group. The research identifies human-AI interaction as the primary bottleneck, noting that standard benchmarks for medical knowledge do not predict how real-world users will struggle to extract reliable guidance. Consequently, the researchers recommend systematic human-user testing to evaluate interactive capabilities before AI systems are deployed for public health applications.
Prefer to listen? ReallyEasyAI on YouTube
Biz News
OpenAI
OpenAI has successfully overturned a lower court ruling that would have forced the company to disclose privileged internal communications and depose its legal team in a high-stakes copyright dispute with authors. A magistrate judge had previously ruled that OpenAI waived attorney-client privilege by denying it willfully infringed on copyrights while training its models on datasets of pirated books, a move that threatened to expose the company to billions of dollars in damages. By prevailing in this appeal, OpenAI avoids the disclosure of evidence regarding its decision to delete those datasets, which plaintiffs argued was key to proving intentional misconduct. The victory prevents a legal precedent that OpenAI warned would have weakened attorney-client protections for any AI firm defending its "state of mind" in infringement litigation.
OpenAI has approved its first insurer-built application on ChatGPT, a tool developed by Spanish digital insurer Tuio and WaniWani that enables users to receive personalized home insurance quotes directly within the chat interface. This development marks a transition from AI providing static information to facilitating real-time transactions with regulated carriers, eliminating traditional friction points such as lengthy forms and phone consultations. While Tuio is the first to go live, other providers like Insurify are following suit as WaniWani prepares to launch a dozen additional insurance apps across North America and Europe. This shift toward "agent-to-agent" distribution is expected to expand across major AI platforms, including Anthropic’s Claude and Google’s Gemini, fundamentally reshaping how insurance is marketed and sold.
Other News
The projected capital expenditures for Meta, Microsoft, Alphabet, and Amazon have surged to a combined $615 billion this year, marking a 70% increase driven by the massive buildout of AI infrastructure. This aggressive spending initially triggered market volatility as investors weighed the extraordinary costs against murky near-term payoffs, exemplified by a divergence between Microsoft’s rising capex and decelerating Azure growth. Despite the skepticism, much of this investment is directed toward Nvidia, which analysts believe maintains a significant competitive advantage over alternative GPU providers and hyperscale-designed chips. Ultimately, the market is undergoing a complex recalibration as it attempts to determine when this unprecedented scale of infrastructure investment will translate into sustainable revenue and operating leverage.
Databricks has reached a $5.4 billion revenue run rate, growing 65% year-over-year while closing a $5 billion funding round at a $134 billion valuation. CEO Ali Ghodsi attributes the company's momentum to AI integration, reporting that $1.4 billion in revenue now stems from AI-driven products that utilize natural language interfaces. Ghodsi argues that while AI will not eliminate the "systems of record" central to the SaaS industry, it will render traditional software moats irrelevant by replacing specialized user interfaces with intuitive AI agents. This shift risks turning legacy platforms into "invisible" infrastructure as enterprises increasingly favor AI-native tools that bypass the need for product-specific technical training.
Mastercard’s Decision Intelligence Pro (DI Pro) utilizes advanced AI to analyze transaction fraud in real-time, processing up to 70,000 transactions per second during peak periods. The platform employs a recurrent neural network with an "inverse recommender" architecture that identifies suspicious activity by comparing current transactions against established consumer behavior patterns. Operating within a strict 300-millisecond window, the system provides issuing banks with contextualized risk scores while using anonymized, aggregated data to navigate global data sovereignty regulations. By condensing a year’s worth of behavioral data into a 50-millisecond scoring process, Mastercard aims to significantly enhance its ability to distinguish legitimate purchases from increasingly sophisticated fraudulent activity.
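Mastercard has not published DI Pro's internals, but the core idea of scoring a transaction against a cardholder's established behavior can be illustrated with a deliberately simple sketch. The weights, features, and threshold below are invented for illustration and bear no relation to the real system:

```python
import statistics

def risk_score(history, amount, merchant_type, usual_merchants):
    """Toy contextual risk score in [0, 1]: how far does this transaction
    deviate from the cardholder's established spending pattern?
    (Illustrative only -- not Mastercard's model or features.)"""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # guard against zero spread
    amount_dev = min(abs(amount - mean) / (3 * stdev), 1.0)  # capped z-score
    merchant_dev = 0.0 if merchant_type in usual_merchants else 1.0
    return round(0.7 * amount_dev + 0.3 * merchant_dev, 2)

history = [42.0, 38.5, 55.0, 47.2, 40.1]  # typical grocery-sized purchases
print(risk_score(history, 45.0, "grocery", {"grocery", "fuel"}))       # low risk
print(risk_score(history, 900.0, "electronics", {"grocery", "fuel"}))  # high risk
```

The production system replaces this hand-weighted heuristic with a recurrent neural network over a year of behavioral data, but the input/output contract is the same: transaction context in, contextualized risk score out, fast enough to fit the 300-millisecond authorization window.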
The world’s largest technology firms, including Microsoft, Alphabet, Amazon, and Meta, are projected to spend over $200 billion on artificial intelligence this year, a figure that nearly matches the inflation-adjusted cost of the entire Apollo moon landing program. This massive investment is being funneled into specialized chips and data center infrastructure as these "hyperscalers" race to dominate what they believe will be the foundational technology of the next era. Despite growing skepticism from analysts regarding the immediate return on investment, these companies continue to increase their capital expenditure guidance to maintain a competitive edge. The scale of this spending reflects a high-stakes gamble that the transformative potential of AI justifies the unprecedented financial risks currently being taken by the private sector.
Despite promises of easing workloads, the implementation of AI in workplaces is often intensifying work rather than reducing it. Companies are focused on increasing AI adoption, yet employees find themselves burdened with new tasks like constantly monitoring AI outputs and adapting workflows to fit AI capabilities. This added oversight and adjustment ultimately places more demands on workers' time and energy. In practice, AI implementation augments existing work processes far more than it replaces them.
Archaeologists have utilized artificial intelligence to reverse-engineer the rules of a mysterious Roman-era game board discovered in the Netherlands. By running thousands of simulations through the Ludii game system, researchers determined that the inscribed limestone artifact was used for a "blocking" game where players maneuvered pieces to avoid being trapped. This discovery, published in the journal *Antiquity*, challenges previous historical timelines by suggesting that blocking games were played in Europe centuries before the Middle Ages. The innovative AI-driven approach provides a groundbreaking methodology for identifying and reconstructing other "lost" games found throughout the archaeological record.
Kenosha County Circuit Court Judge David Hughes sanctioned District Attorney Xavier Solis for failing to disclose the use of artificial intelligence in court filings that contained "hallucinated and false citations." The judge struck the DA’s response in a multi-defendant burglary case, noting that Solis violated county policy requiring a separate disclosure for any document prepared with AI assistance. While the underlying criminal charges were ultimately dismissed due to a lack of probable cause, Hughes rebuked the DA’s office for the lack of candor regarding the fabricated legal precedents. Solis has since acknowledged the oversight and stated his office is reinforcing internal practices to ensure the accuracy and reliability of future filings.
Amazon’s Ring used a high-profile Super Bowl advertisement to promote its AI-powered surveillance network as a tool for community connection, sparking sharp criticism from digital rights advocates. While the ad highlights neighborly sharing and home safety, critics argue the campaign "pinkwashes" a system that facilitates warrant-free police access to footage through partnerships with over 2,100 law enforcement agencies. This marketing push comes as the company faces ongoing scrutiny over privacy lapses, including a recent $5.8 million FTC settlement regarding unauthorized employee access to customer videos. Ultimately, the commercial reflects Amazon's effort to normalize constant neighborhood surveillance despite persistent concerns regarding systemic bias and the erosion of digital privacy.
Prefer to listen? ReallyEasyAI on YouTube
Podcasts
Halftone-Encoded 4D Printing of Cephalopod-Inspired Synthetic Smart Skins
Researchers have developed a cephalopod-inspired smart skin using a novel halftone-encoded 4D printing technique that allows a single hydrogel film to simultaneously alter its optical appearance, mechanical properties, and three-dimensional shape. By printing binary patterns of highly and lightly crosslinked polymer domains, the material can undergo controlled changes in response to external stimuli like temperature, solvents, and mechanical stress. These specific arrangements enable the material to transition from transparent to opaque to reveal high-resolution images, display hidden information when stretched through strain mapping, and morph from flat sheets into complex 3D forms with varying textures. This technology overcomes previous limitations in synthetic materials by integrating multiple dynamic functions into a single layer, offering potential applications in soft robotics, adaptive camouflage, and secure information storage.
https://www.nature.com/articles/s41467-025-65378-8
https://www.sciencedaily.com/releases/2026/02/260206034836.htm
Speech Emotion Recognition Leveraging OpenAI’s Whisper Representations and Attentive Pooling Methods
This research investigates the utility of OpenAI’s Whisper model for Speech Emotion Recognition (SER), aiming to address the limitations imposed by the scarcity of standard large datasets in the field. The authors propose two attention-based dimensionality reduction techniques, specifically Multi-head Attentive Average Pooling and Multi-head QKV Pooling, to efficiently extract emotional features from Whisper's pre-trained representations. Experiments conducted on the English IEMOCAP and Persian ShEMO datasets demonstrate that the proposed Multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset with a significant improvement in unweighted accuracy. Furthermore, the study finds that intermediate layers of the Whisper encoder often yield better performance for emotion recognition than the final layers, suggesting that smaller, optimized models can offer a computationally efficient alternative to massive models like HuBERT X-Large without sacrificing accuracy.
https://arxiv.org/pdf/2602.06000
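For readers unfamiliar with attentive pooling, here is a minimal sketch of the general idea, assuming nothing about the paper's exact architecture beyond what the summary states: each head scores every encoder frame against a learned query vector, softmaxes the scores over time, and returns a weighted average instead of a plain mean, yielding one compact utterance-level vector per head.

```python
import math

def attentive_average_pool(frames, queries):
    """Multi-head attentive average pooling (conceptual sketch, pure Python):
    frames  -- list of T encoder output vectors, each of dimension D
    queries -- one learned query vector per head, each of dimension D
    Returns the concatenation of each head's attention-weighted average."""
    pooled = []
    for q in queries:  # one pass per head
        scores = [sum(f_i * q_i for f_i, q_i in zip(f, q)) for f in frames]
        m = max(scores)  # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]  # softmax over time steps
        head = [sum(w * f[d] for w, f in zip(weights, frames))
                for d in range(len(frames[0]))]
        pooled.extend(head)  # concatenate heads into one utterance vector
    return pooled

# Three toy 4-dim "encoder frames" and two heads.
frames = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
queries = [[5.0, 0.0, 0.0, 0.0],   # head 1 attends mostly to frame 0
           [0.0, 0.0, 0.0, 0.0]]   # head 2 degenerates to a plain average
vec = attentive_average_pool(frames, queries)
print(len(vec))  # 8 = heads * frame_dim
```

In the paper's setting the frames would be Whisper encoder outputs and the queries learned parameters; the QKV variant additionally projects frames into keys and values before attending, but the pooling principle is the same.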
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
UniAudio 2.0 is a unified audio language model designed to bridge the gap between audio understanding and generation across speech, sound, and music. To achieve this, the authors introduce ReasoningCodec, a novel tokenizer that splits audio into reasoning tokens for high-level semantic processing and reconstruction tokens for high-fidelity waveform generation. The model utilizes a unified autoregressive architecture initialized from a large language model, which is distinctively partitioned into specialized layers for audio understanding, cross-modal alignment, and audio generation. Trained on a massive corpus of 100 billion text tokens and 60 billion audio tokens using a multi-stage strategy that includes complex "auditory sentences," UniAudio 2.0 demonstrates competitive performance on established tasks while showing strong generalization capabilities in few-shot and zero-shot scenarios.
https://arxiv.org/pdf/2602.04683
https://dongchaoyang.top/UniAudio2Demo/
OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
OmniVideo-R1 is a newly proposed reinforcement learning framework that significantly improves how artificial intelligence models process and reason with mixed audio-visual data. While humans naturally integrate sight and sound to understand their environment, existing multimodal models often struggle with this synthesis, frequently exhibiting bias toward one modality or failing to effectively combine cues for complex reasoning tasks. To overcome these challenges, the authors introduce a two-stage training approach that first utilizes query-intensive grounding to teach the model how to locate and reason over relevant audio-visual segments without requiring expensive manual annotations. This is followed by a modality-attentive fusion stage that employs contrastive learning to force the model to derive greater confidence from combined audio-visual inputs rather than relying on a single sensory source. Empirical results indicate that OmniVideo-R1 consistently outperforms strong open-source baselines on various audio-visual benchmarks while maintaining robust performance on visual-only tasks, highlighting its effectiveness in fostering genuine multimodal understanding.
https://arxiv.org/pdf/2602.05847
ERNIE 5.0: A Unified Multimodal Foundation Model
ERNIE 5.0 is a unified multimodal foundation model designed to process and generate text, image, video, and audio within a single autoregressive framework. Instead of relying on separate encoders or late-fusion strategies, the model utilizes an ultra-sparse mixture-of-experts architecture where a shared router dispatches tokens from all modalities to a common pool of experts, fostering deep cross-modal integration. A key innovation is its elastic training paradigm, which allows a single pre-training run to produce a family of sub-models with varying depths and widths, thereby offering flexible deployment options for different resource constraints without the need for retraining. The development process also features a robust post-training pipeline that employs unified multimodal reinforcement learning with techniques like unbiased replay buffers and adaptive hint-based learning to stabilize optimization and enhance reasoning capabilities. Extensive evaluations demonstrate that this approach yields strong, balanced performance across diverse benchmarks, effectively addressing the challenges of scaling unified models while maintaining high efficiency.
https://arxiv.org/pdf/2602.04705
Ethan Mollick: Becoming strange in the Long Singularity
In his analysis of technological progress, Ethan Mollick suggests that while humanity has inhabited a "Long Singularity" of rapid advancement for roughly a century, the recent emergence of artificial intelligence represents a dramatic steepening of this exponential curve. Although innovation appeared to stagnate in recent decades compared to the profound shifts of the mid-20th century, the pace of AI development is now outstripping Moore’s Law and reintroducing a profound unpredictability regarding the future of professions and creative expression. Mollick highlights the "strange" nature of this era by noting how AI models can already convincingly mimic individuals and pass complex creative benchmarks, creating a scenario where even experts cannot distinguish between reality and hallucination. Ultimately, he argues that because the long-term implications of these technologies are unknowable, society must embrace flexibility and experimentation to adapt to a world that is becoming increasingly weird and difficult to forecast.
https://www.oneusefulthing.org/p/becoming-strange-in-the-long-singularity
Mitchell Hashimoto: My AI Adoption Journey
In "My AI Adoption Journey," software engineer Mitchell Hashimoto details his methodical progression from AI skepticism to integrating autonomous agents into his coding workflow. Initially finding standard chatbots inefficient for complex tasks, Hashimoto advocates for using agents capable of executing external commands and stresses the importance of understanding their limitations by cross-referencing their output with manual work. His strategy evolved to include scheduling agents for research during downtime and delegating well-defined "slam dunk" tasks to background processes, allowing him to focus on deep, creative work without interruption. To maximize reliability, he introduced the concept of "harness engineering," which involves creating specific protocols and tools to prevent recurring agent errors. Ultimately, Hashimoto aims to maintain a continuous stream of background agent activity, viewing AI not as a replacement for skill but as a pragmatic tool that enhances productivity when managed with strict oversight and distinct boundaries.
https://mitchellh.com/writing/my-ai-adoption-journey
Azeem Azhar: Mustafa Suleyman — Nature, humans, tools… and now a fourth class
Mustafa Suleyman, CEO of Microsoft AI, argues that while artificial intelligence is evolving into a distinct fourth class of hyper-object with profound capabilities, it fundamentally lacks the biological capacity for suffering required for true consciousness. He warns of the societal dangers posed by "AI psychosis," a phenomenon where humans mistakenly project personhood onto these systems, potentially disrupting our legal frameworks and leading to dangerous emotional manipulation. Despite the rapid increase in AI utility and the plummeting cost of intelligence, Suleyman advocates for a "humanist superintelligence" that prioritizes human control and containment over unchecked autonomy, ensuring that these systems serve human flourishing rather than replacing it. He calls for robust government regulation to manage the proliferation of these powerful tools and suggests that active engagement with AI technology, moving from passive consumption to creation, can help inoculate society against the risks of anthropomorphizing these synthetic beings.
https://www.youtube.com/watch?v=xvPQVrrlX6o
Software Factories And The Agentic Moment
StrongDM has pioneered a Software Factory methodology that fundamentally transforms software engineering by strictly prohibiting humans from writing or reviewing code. Leveraging the compounding correctness of advanced large language models, this approach relies on non-interactive agents to grow software based solely on detailed specifications and extensive scenario-based testing. To validate functionality without human oversight, the team utilizes a Digital Twin Universe that simulates third-party services like Okta and Slack, allowing agents to execute thousands of risk-free test runs per hour. This system replaces traditional boolean pass/fail metrics with a probabilistic measure called satisfaction, ensuring that the autonomous agents produce reliable, user-aligned outcomes through high-volume iteration and significant token expenditure rather than manual intervention.
https://simonwillison.net/2026/Feb/7/software-factory/
https://factory.strongdm.ai/
https://factory.strongdm.ai/products
https://factory.strongdm.ai/techniques
https://factory.strongdm.ai/principles
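A "satisfaction" metric of the kind described above can be sketched in a few lines. This is an illustration of the concept only, not StrongDM's implementation; the judge, threshold, and run data are all invented:

```python
import random

def satisfaction(scenario_runs, judge, threshold=0.95):
    """Probabilistic 'satisfaction' instead of a boolean pass/fail:
    score a change by the fraction of simulated scenario runs that an
    outcome judge accepts, and ship only above a chosen threshold.
    (Conceptual sketch -- not StrongDM's actual system.)"""
    accepted = sum(1 for outcome in scenario_runs if judge(outcome))
    score = accepted / len(scenario_runs)
    return score, score >= threshold

# Hypothetical: 1,000 agent runs against a simulated third-party service,
# where the judge checks the user ended up provisioned with the right role.
random.seed(1)
runs = [{"provisioned": random.random() < 0.97} for _ in range(1000)]
score, ship_it = satisfaction(runs, judge=lambda o: o["provisioned"])
print(f"satisfaction={score:.3f} ship={ship_it}")
```

The shift from a binary verdict to a rate over thousands of cheap, simulated runs is what makes the approach compatible with non-deterministic agents: any single run may fail, but the aggregate tells you whether the grown software is reliable enough to release.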
What the OpenClaw Moment Means for Enterprises: 5 Big Takeaways
The "OpenClaw moment" signifies a pivotal shift in the enterprise landscape of 2026, characterized by the migration of autonomous AI agents from experimental labs into the general workforce where they execute root-level tasks and navigate platforms with unprecedented independence. This surge in unauthorized adoption, often described as a "Shadow IT" crisis of "secret cyborgs," has disrupted the software industry by eroding the viability of traditional seat-based licensing models and triggering a valuation collapse dubbed the "SaaSpocalypse". Experts suggest that contrary to previous assumptions, these agents function effectively on uncurated data, compelling organizations to transition toward an "AI coworker" model where human employees manage teams of agents rather than performing individual tasks. However, the risks associated with such autonomy—ranging from security vulnerabilities to bizarre emergent behaviors—demand that IT leaders implement rigorous governance frameworks, including identity tracking and sandboxing, to safely harness these powerful tools without succumbing to the chaotic elements of this technological evolution.
https://venturebeat.com/technology/what-the-openclaw-moment-means-for-enterprises-5-big-takeaways
The Neuron: Claude 4.6 Opus vs GPT 5.3 Codex, Explained
On February 5, 2026, the artificial intelligence landscape shifted significantly with the simultaneous release of Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.3-Codex, signaling a transition from passive chatbots to autonomous digital colleagues. Anthropic's Opus 4.6 emphasizes breadth and integration, boasting a massive one-million-token context window capable of processing vast libraries of information and featuring "agent teams" that collaborate on complex workflows, alongside deep integration into Microsoft Office tools for financial and legal applications. Conversely, OpenAI's GPT-5.3-Codex prioritizes depth and technical autonomy, demonstrating state-of-the-art proficiency in software engineering and computer operation—even debugging its own training data—while accompanied by a new enterprise platform called Frontier designed to manage these agents at scale. While Opus excels in analyzing extensive documents and facilitating knowledge work, Codex dominates in speed and direct computer interaction, presenting users with a distinct choice between a comprehensive research assistant and a highly specialized technical operator.
https://www.theneuron.ai/explainer-articles/anthropic-openai-best-ai-models-same-day-opus-codex/
More AI paper summaries: AI Papers Podcast Daily on YouTube
Stay Connected
If you found this useful, share it with a friend who's into AI!
Subscribe to Daily AI Rundown on Substack
Follow me here on Dev.to for more AI content!