<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yujia Zhang</title>
    <description>The latest articles on DEV Community by Yujia Zhang (@yujia_zhang_0328).</description>
    <link>https://dev.to/yujia_zhang_0328</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3868704%2F0311e085-9698-4fe5-be0c-64eead99ed15.jpg</url>
      <title>DEV Community: Yujia Zhang</title>
      <link>https://dev.to/yujia_zhang_0328</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yujia_zhang_0328"/>
    <language>en</language>
    <item>
      <title>PJM has moved from auction tuning to market redesign</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:21:16 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/pjm-has-moved-from-auction-tuning-to-market-redesign-4obb</link>
      <guid>https://dev.to/yujia_zhang_0328/pjm-has-moved-from-auction-tuning-to-market-redesign-4obb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;PJM's new market-design paper is not a pricing note. It is a structural question about whether common reliability can survive rising demand and constrained supply without changing who hedges, who curtails, and who pays.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Market Design - May 6, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;PJM's May 6 paper does something unusual for a market operator: it admits that the design assumptions underlying the current capacity market no longer match the resource mix or load structure the market has to clear. The Reliability Pricing Model was built for dispatchable thermal generation and diffuse load growth. It assumes that resources can be valued on a common ICAP-to-UCAP conversion, that the Variable Resource Requirement demand curve correctly prices the social cost of reliability, and that forward capacity obligations can be met by a pool of similar resources with similar availability profiles. All three assumptions are now under stress simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The UCAP framework is the first technical seam.&lt;/strong&gt; UCAP weights each resource by its Effective Load-Carrying Capability (ELCC), a probabilistic measure of how much firm load a resource can reliably support during system-stress hours. Variable renewables and battery storage receive ELCC values significantly below their nameplate capacity because their availability during critical hours correlates with weather or state of charge, not with system need. As the resource mix shifts toward weather-dependent generation, aggregate UCAP declines relative to nameplate capacity. The market can still clear formally while actual reliability margins tighten, because ELCC assumptions underestimate tail-correlated failures at scale.&lt;/p&gt;
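&lt;p&gt;A minimal sketch of the accreditation arithmetic described above. The resource names, nameplate figures, and ELCC fractions are hypothetical and chosen only for illustration; they are not PJM's published values. The sketch shows how the ratio of accredited to nameplate capacity falls as the mix shifts, even when total nameplate stays the same.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    nameplate_mw: float
    elcc: float  # fraction of nameplate counted as firm; illustrative values only


def accredited_mw(resource: Resource) -> float:
    """Accredited (UCAP-style) capacity: nameplate scaled by the ELCC fraction."""
    return resource.nameplate_mw * resource.elcc


def fleet_ratio(fleet: list[Resource]) -> float:
    """Aggregate accredited capacity as a share of aggregate nameplate."""
    total_nameplate = sum(r.nameplate_mw for r in fleet)
    total_accredited = sum(accredited_mw(r) for r in fleet)
    return total_accredited / total_nameplate


# Two hypothetical fleets with the same 10,000 MW nameplate total.
thermal_heavy = [
    Resource("gas_cc", 8000, 0.80),
    Resource("solar", 1000, 0.10),
    Resource("storage_4h", 1000, 0.50),
]
renewable_heavy = [
    Resource("gas_cc", 4000, 0.80),
    Resource("solar", 4000, 0.10),
    Resource("storage_4h", 2000, 0.50),
]

print(round(fleet_ratio(thermal_heavy), 2))
print(round(fleet_ratio(renewable_heavy), 2))
```

&lt;p&gt;Under these assumed inputs the accredited share drops from 0.70 to 0.46 of nameplate. The sketch does not model the tail-correlation problem itself, since it takes each ELCC fraction as given.&lt;/p&gt;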

&lt;p&gt;&lt;strong&gt;The VRR demand curve is the second technical seam.&lt;/strong&gt; PJM calibrates the curve to reflect the value of lost load (VOLL) relative to the cost of new entry (CONE): steep at low capacity levels, flatter above adequacy. Neither VOLL nor CONE has been fully recalibrated for an environment where large-load data centres represent a concentrated, politically visible share of incremental demand. If VOLL is understated relative to the new load profile's sensitivity to outages, the curve is too flat at the top and does not pay enough for the last marginal unit of adequacy. That underpricing of tail risk is one mechanism through which the market appears to clear adequately while physical reliability worsens.&lt;/p&gt;
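&lt;p&gt;To make the shape of such a curve concrete, here is a rough sketch of a piecewise-linear demand curve. The breakpoints, the prices, and the &lt;code&gt;vrr_price&lt;/code&gt; helper are assumptions made for illustration, not PJM's actual VRR parameters. The curve is steep below the adequacy point and flatter above it, as the paragraph describes.&lt;/p&gt;

```python
def vrr_price(capacity_mw: float, points: list[tuple[float, float]]) -> float:
    """Linearly interpolate a downward-sloping demand curve.

    points: (capacity_mw, price) pairs sorted by ascending capacity.
    """
    first_cap, first_price = points[0]
    if first_cap >= capacity_mw:
        return first_price  # price is capped at the left-most point
    for (cap_lo, price_lo), (cap_hi, price_hi) in zip(points, points[1:]):
        if cap_hi >= capacity_mw:
            share = (capacity_mw - cap_lo) / (cap_hi - cap_lo)
            return price_lo + share * (price_hi - price_lo)
    return points[-1][1]  # beyond the last point the curve pays the floor price


# Hypothetical curve in dollars per MW-day: steep up to 100,000 MW, flatter beyond.
CURVE = [(95_000, 450.0), (100_000, 150.0), (110_000, 0.0)]

print(vrr_price(97_500, CURVE))   # midway along the steep segment
print(vrr_price(105_000, CURVE))  # midway along the flat segment
```

&lt;p&gt;In this toy curve an extra 1,000 MW is worth 60 per MW-day on the steep segment and 15 on the flat one. Raising the assumed VOLL would amount to lifting or steepening the right-hand segment.&lt;/p&gt;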

&lt;p&gt;&lt;strong&gt;The paper's three strategic paths differ precisely in how they handle these seams.&lt;/strong&gt; Path A (hedged common model) maintains the shared reliability standard but requires longer-dated forward capacity hedges from load-serving entities so scarcity signals reach investment decisions earlier. Path B (differential reliability model) admits that not all load will receive the same adequacy standard  - interruptible large loads accept a lower tier in exchange for different cost treatment, which requires new metering, communication, and control infrastructure to implement fairly. Path C (energy market recovery shift) recovers more fixed costs through real-time scarcity pricing and ancillary service revenue, which requires more volatile spot prices and better hedging tools across all participants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path B is technically the hardest to implement and the most politically contested.&lt;/strong&gt; Differential reliability means the system knows, in real time, which loads are on firm service and which are interruptible, and can curtail the second class first without affecting the first. That requires advanced metering infrastructure capable of sub-second load-shedding commands, clear contractual demarcation at the meter level, and AGC systems that can target specific load clusters rather than issuing system-wide curtailment signals. PJM's current infrastructure was not built for that granularity. Building it requires capital expenditure and FERC tariff changes measured in years, not months. The paper names the path but does not fully confront that implementation gap.&lt;/p&gt;

&lt;p&gt;The legitimacy argument deserves more attention than a standard capacity-price analysis gives it. PJM argues that the market's effectiveness depends on participants believing the cost-allocation logic is defensible: that customers paying capacity charges receive a reliable service and that new entrants can recover prudent investments. Concentrated hyperscale load breaks both conditions simultaneously. It introduces a class of customers that can self-supply or negotiate, giving them structural leverage over rule-setting that diffuse residential customers do not have. That asymmetry is what 'legitimacy' means in practice, and it is a harder problem to solve than recalibrating the VRR curve.&lt;/p&gt;




&lt;h3&gt;About the author&lt;/h3&gt;

&lt;p&gt;Yujia Zhang — Energy Modeller &amp;amp; Quant Researcher (PhD). I cover AI infrastructure, power markets, and financial systems.&lt;/p&gt;

&lt;p&gt;🔗 live market intelligence at &lt;a href="https://yujiazhang.co.uk/news" rel="noopener noreferrer"&gt;yujiazhang.co.uk/news&lt;/a&gt;&lt;/p&gt;

</description>
      <category>energy</category>
      <category>markets</category>
      <category>finance</category>
    </item>
    <item>
      <title>PJM is now treating large-load curtailment as strategy</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:15:58 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/pjm-is-now-treating-large-load-curtailment-as-strategy-2opm</link>
      <guid>https://dev.to/yujia_zhang_0328/pjm-is-now-treating-large-load-curtailment-as-strategy-2opm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;PJM's latest strategy update puts data-centre and large-load curtailment inside the formal planning agenda. That is a sign that demand flexibility is moving from emergency tool to design assumption.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Planning Strategy - May 20, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;PJM's May 20 strategy update normalises curtailment by placing it on the same planning agenda as capacity procurement and transmission expansion. That institutional placement matters more than the headline. Once an operator lists demand-side curtailment alongside generation and wires investment as a formal path to reliability, it is signalling that the demand side is expected to contribute to the adequacy solution, not just consume it. The precise instruments PJM has to achieve that contribution differ significantly in their operational characteristics, and those differences determine how much flexibility the operator can actually count on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PJM currently operates demand-response programmes with different trigger conditions and performance obligations.&lt;/strong&gt; The Capacity Performance demand-response programme requires enrolled resources to respond within 30 minutes of a curtailment order during CP commitment periods, with strict financial penalties for non-performance  - similar in structure to the obligations on generators. For data centres running at full inference load, accepting a CP-style obligation on their full facility is operationally implausible. A data centre with flexible training workloads may accept CP obligations on non-critical compute clusters. The relevant design question is whether curtailment can be scoped at the workload level rather than the building level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch interruptibility at the workload level is the contractual architecture that makes this work.&lt;/strong&gt; A data centre's aggregate metered load looks like a single block from the grid's perspective, but internally it is composed of inference serving (high priority), training jobs (lower priority, deferrable), HVAC systems (variable within bounds), and support infrastructure. Curtailment as a planning tool becomes operationally credible only when the interface between the grid and the data centre can address the workload-priority hierarchy. That requires either API-driven load control  - where the operator dispatches curtailment signals to a workload scheduler, not a building management system  - or pre-negotiated block shedding backed by automated systems that can shed training jobs reliably within the contractual window.&lt;/p&gt;
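&lt;p&gt;The workload-priority idea can be sketched as a small scheduler-side routine. Everything here is hypothetical: the &lt;code&gt;Workload&lt;/code&gt; fields, the priority numbers, and the MW figures are assumptions, and a real interface would also need timing guarantees and telemetry back to the operator.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    load_mw: float
    priority: int        # higher number = more critical
    interruptible: bool


def plan_curtailment(workloads: list[Workload], target_mw: float) -> tuple[list[str], float]:
    """Pick interruptible workloads to shed, least critical first.

    Returns the names chosen and the MW actually shed, which can fall
    short of the target if too little load is interruptible.
    """
    shed_names: list[str] = []
    shed_mw = 0.0
    candidates = sorted(
        (w for w in workloads if w.interruptible),
        key=lambda w: w.priority,
    )
    for workload in candidates:
        if shed_mw >= target_mw:
            break
        shed_names.append(workload.name)
        shed_mw += workload.load_mw
    return shed_names, shed_mw


site = [
    Workload("inference_serving", 50.0, priority=3, interruptible=False),
    Workload("training_cluster_a", 25.0, priority=1, interruptible=True),
    Workload("training_cluster_b", 15.0, priority=1, interruptible=True),
    Workload("hvac_flex_band", 5.0, priority=2, interruptible=True),
]

names, shed = plan_curtailment(site, target_mw=30.0)
print(names, shed)
```

&lt;p&gt;With a 30 MW request, the sketch sheds both training clusters (40 MW) and leaves inference serving untouched. That is the behaviour a building-level signal cannot express.&lt;/p&gt;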

&lt;p&gt;&lt;strong&gt;The site-selection implications are direct and quantifiable.&lt;/strong&gt; A hyperscaler comparing two PJM-region sites is comparing four variables beyond land and fibre: capacity cost per MW (the cleared auction price passed through their tariff), expected curtailment frequency under the stressed scenarios PJM now models explicitly, the cost of backup generation or on-site storage to manage curtailment events, and the regulatory trajectory on cost allocation. A difference of 50 hours per year in expected curtailment exposure, at a loading of 100 MW and a service-loss cost of $5,000 per MWh, is $25 million annually (50 h × 100 MW × $5,000/MWh), which is material relative to permitting cost differences between sites in the same region.&lt;/p&gt;
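&lt;p&gt;The arithmetic behind the $25 million figure can be checked in a few lines. This sketch reads the service-loss cost as $5,000 per MWh of curtailed load, which is the reading under which the stated total follows. The function name and parameter names are illustrative.&lt;/p&gt;

```python
def annual_curtailment_cost(hours_per_year: float, load_mw: float, loss_per_mwh: float) -> float:
    """Expected annual cost of lost service from curtailment, in dollars."""
    return hours_per_year * load_mw * loss_per_mwh


cost = annual_curtailment_cost(hours_per_year=50, load_mw=100, loss_per_mwh=5_000)
print(f"${cost:,.0f}")
```

&lt;p&gt;The three inputs multiply out to $25,000,000. Because the result is linear in each input, halving the expected curtailment hours halves the exposure.&lt;/p&gt;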

&lt;p&gt;The scenario analysis in the strategy document is revealing precisely because it presents scenarios rather than a single forecast. PJM's capacity-squeeze scenario  - defined by load growth accelerating while interconnection and transmission additions lag further  - is an admission that the operator can construct a plausible trajectory in which current planning tools, including demand flexibility, may not be sufficient. Publishing that scenario in a formal strategy document signals to large loads that current terms of grid access may not remain fixed. The appropriate response for any hyperscaler with regional interconnection pending is to treat that signal as a contracting and planning input, not background noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper structural point is about lead-time asymmetry.&lt;/strong&gt; Interconnection queues for new generation take four to seven years to clear. Transmission upgrades take five to ten years to permit and build. Contractual demand flexibility can be structured in six to eighteen months if the regulatory pathway and technical interfaces exist. PJM's strategy update is implicitly acknowledging that flexibility is the only adequacy tool with a lead time short enough to address near-term capacity-squeeze risk. That makes the commercial architecture of curtailment  - who gets paid how much, under what trigger, with what performance obligations  - one of the most consequential market-design questions PJM has open right now.&lt;/p&gt;





</description>
      <category>energy</category>
      <category>markets</category>
      <category>finance</category>
    </item>
    <item>
      <title>FERC's summer outlook makes concentrated load hard to ignore</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:15:51 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/fercs-summer-outlook-makes-concentrated-load-hard-to-ignore-4nno</link>
      <guid>https://dev.to/yujia_zhang_0328/fercs-summer-outlook-makes-concentrated-load-hard-to-ignore-4nno</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;FERC says summer 2026 electricity consumption should exceed each of the previous five summers. In a grid already wrestling with large new loads, that keeps reliability and cost allocation on the front foot.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Summer Reliability - May 21, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;FERC's summer assessment quantifies the adequacy picture regionally rather than as a single national headline  - which is where the analytically useful content lives. The Mid-Atlantic BPS, which includes most of PJM, is rated adequate under normal conditions but carries elevated risk under extreme conditions, particularly given the concentration of new large loads in Northern Virginia. New England faces a structurally tighter margin, with ISO-NE's reserve margin already below its 15% target and natural gas supply constraints that worsen under cold snaps. Those are different risk profiles driven by different mechanisms, and conflating them into a single 'summer adequacy' narrative misses the operative variable.&lt;/p&gt;

&lt;p&gt;The probabilistic methodology FERC staff uses is worth understanding, because it shapes what 'adequate' means in the report's terms. The assessment uses Loss of Load Probability and Expected Unserved Energy as reliability metrics, calculated by running a Monte Carlo simulation over a large ensemble of weather scenarios, demand realisations, and resource outage probabilities. A region passes the adequacy threshold if it meets the one-day-in-ten-years LOLP criterion across the simulation ensemble. The critical assumption embedded in that criterion is that weather, demand, and resource outages are drawn from independent distributions. When they are correlated  - as they are during prolonged heat events that simultaneously increase air-conditioning load, reduce thermal plant efficiency, and stress natural gas supply infrastructure  - the tail probabilities can be substantially larger than the Monte Carlo criterion implies.&lt;/p&gt;
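&lt;p&gt;A toy Monte Carlo comparison illustrates why the independence assumption matters. This is not FERC's model: the two-event setup, the probabilities, and the &lt;code&gt;shortfall_probability&lt;/code&gt; function are assumptions made for illustration. In both cases the marginal outage probability is held at 10%; only its dependence on heat changes.&lt;/p&gt;

```python
import random


def shortfall_probability(trials: int, correlated: bool, seed: int = 0) -> float:
    """Estimate P(heat-driven peak demand coincides with a major outage).

    All probabilities are illustrative. In both cases the marginal
    outage probability is 0.10; only the dependence on heat changes.
    """
    rng = random.Random(seed)
    p_heat = 0.10
    if correlated:
        p_outage_given_heat = 0.50
        # Chosen so the overall outage probability stays at 0.10:
        # 0.10 * 0.50 + 0.90 * p = 0.10  ->  p = 0.05 / 0.90
        p_outage_otherwise = 0.05 / 0.90
    else:
        p_outage_given_heat = 0.10
        p_outage_otherwise = 0.10

    shortfalls = 0
    for _ in range(trials):
        heat = p_heat > rng.random()
        p_outage = p_outage_given_heat if heat else p_outage_otherwise
        outage = p_outage > rng.random()
        if heat and outage:
            shortfalls += 1
    return shortfalls / trials


independent = shortfall_probability(200_000, correlated=False)
dependent = shortfall_probability(200_000, correlated=True)
print(independent, dependent)
```

&lt;p&gt;Analytically the joint probability is 0.01 under independence and 0.05 under the assumed correlation, so the simulated estimates should land near those values. The headline inputs are identical in both runs, which is the point: the margin looks the same while the tail is several times fatter.&lt;/p&gt;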

&lt;p&gt;&lt;strong&gt;Concentrated large loads create a specific technical challenge for this methodology.&lt;/strong&gt; Traditional demand-side variability is geographically diffuse and weather-correlated in predictable ways: hot days in Northern Virginia drive cooling load up proportionally across a large number of small commercial and residential customers. A 500 MW data centre cluster in the same geography adds a load component that is weather-independent  - its compute load is driven by inference demand and training schedules, not temperature  - but concentrated enough that a single interconnection failure or local transmission constraint creates a step-change in the local load balance. FERC's Monte Carlo model, calibrated on historical data without this load profile, will systematically underestimate the probability of coincident events in regions with high data-centre concentrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The natural gas linkage adds a correlation pathway that standard LOLP models handle imperfectly.&lt;/strong&gt; Gas-fired generation provides roughly 40% of PJM's summer capacity. In extreme heat events, gas demand for cooling and industrial processes competes with gas demand for power generation on the same pipeline infrastructure. Pipeline constraints can cause gas-fired generators to derate or trip at exactly the moment peak power demand is highest  - introducing a supply-demand correlation that classical probabilistic adequacy models, which treat fuel availability as exogenous, do not capture. FERC notes this risk qualitatively; quantifying it requires coupled gas-power flow modelling that most RTOs do not run operationally.&lt;/p&gt;

&lt;p&gt;The practical implication for infrastructure investors is that the standard adequacy metric  - reserve margin against peak demand  - is a less reliable guide to actual risk in a correlated-stress environment than it was in the historical period it was calibrated on. A region with 15% theoretical reserve margin but high gas dependency, high data-centre concentration, and ageing transmission infrastructure has a materially different tail-risk profile than a region with the same headline margin but a more diverse resource mix and distributed load. FERC's summer reports are valuable precisely because they show the regional distribution of these risk factors, but they are not designed to quantify tail risk in the correlated-stress regime  - which is where serious reliability modelling now needs to go.&lt;/p&gt;

&lt;p&gt;The policy consequence is that summer reliability reports are becoming de facto pre-approval documents for large-load interconnection requests. State commissions in Virginia and Maryland already require data-centre developers to demonstrate that their load addition does not materially worsen regional reliability metrics. FERC's assessment provides the reference baseline against which those demonstrations are evaluated. Once that feedback loop is established  - adequacy report shapes procedural obligations for the load categories that drove its stress findings  - the time from final investment decision to energisation for a large data-centre project will include a reliability impact study as a critical-path item.&lt;/p&gt;





</description>
      <category>energy</category>
      <category>markets</category>
      <category>finance</category>
    </item>
    <item>
      <title>OpenAI is treating coding agents like governed infrastructure</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:09:57 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/openai-is-treating-coding-agents-like-governed-infrastructure-9pm</link>
      <guid>https://dev.to/yujia_zhang_0328/openai-is-treating-coding-agents-like-governed-infrastructure-9pm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;OpenAI's Codex safety notes are notable because they focus on approvals, network policy, and logs rather than raw coding benchmarks. That is what production agent deployment looks like when risk is taken seriously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Agent Governance - May 8, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;OpenAI's Codex safety architecture is built around three layers that matter in sequence: process isolation, network policy, and approval routing. The execution environment is a Windows sandbox using App Container isolation  - not a Linux container  - which is a deliberate choice. App Container restricts filesystem access, inter-process communication, and network connectivity at the OS level without requiring a separate hypervisor. Every tool call Codex makes  - git, npm, a compiler, a test runner  - runs inside that boundary. The default-deny network posture allowlists package registries and VCS hosts and blocks everything else. That default is what makes autonomous execution safe to enable in the first place.&lt;/p&gt;
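&lt;p&gt;The default-deny posture can be illustrated with a few lines of host filtering. This is a sketch of the general pattern only. The host list and the &lt;code&gt;is_allowed&lt;/code&gt; helper are hypothetical and say nothing about how Codex enforces its policy, which the article describes as happening at the OS level.&lt;/p&gt;

```python
from urllib.parse import urlparse

# Hypothetical allowlist of package registries and VCS hosts.
ALLOWED_HOSTS = {"registry.npmjs.org", "pypi.org", "github.com"}


def is_allowed(url: str) -> bool:
    """Default deny: permit a request only if its host is explicitly allowlisted."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS


print(is_allowed("https://pypi.org/simple/requests/"))
print(is_allowed("https://example.com/exfiltrate"))
```

&lt;p&gt;Matching is exact in this sketch, so subdomains and unparseable inputs are denied unless they are listed. Anything not named is blocked without needing a rule of its own.&lt;/p&gt;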

&lt;p&gt;&lt;strong&gt;The approval routing model is where the practical enterprise architecture lives.&lt;/strong&gt; Codex classifies each planned action into a risk tier before executing it. Read-only operations  - file reads, test runs, local builds  - run automatically. Write operations that cross repository boundaries  - git push, external API calls, file writes outside the working directory  - trigger asynchronous approval requests. Operations with production or security implications  - credential access, schema modifications, infrastructure changes  - require synchronous human approval before the step proceeds. That three-tier model mirrors the coarse-to-fine permission structure in any well-designed RBAC system. What is novel is applying it dynamically to agent action sequences rather than to static resource access.&lt;/p&gt;
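&lt;p&gt;The three-tier routing can be sketched as a lookup that checks the strictest tier first. The action labels and the fail-closed default for unknown actions are assumptions of this sketch, not Codex's documented taxonomy.&lt;/p&gt;

```python
from enum import Enum


class Tier(Enum):
    AUTO = "run automatically"
    ASYNC_APPROVAL = "request asynchronous approval"
    SYNC_APPROVAL = "block until a human approves"


# Hypothetical action labels grouped as the article describes; not Codex's real taxonomy.
SYNC_ACTIONS = {"credential_access", "schema_modification", "infrastructure_change"}
ASYNC_ACTIONS = {"git_push", "external_api_call", "write_outside_workdir"}
AUTO_ACTIONS = {"file_read", "test_run", "local_build"}


def classify(action: str) -> Tier:
    """Map an action label to a tier, checking the strictest tier first."""
    if action in SYNC_ACTIONS:
        return Tier.SYNC_APPROVAL
    if action in ASYNC_ACTIONS:
        return Tier.ASYNC_APPROVAL
    if action in AUTO_ACTIONS:
        return Tier.AUTO
    # Unknown actions fail closed: an assumption of this sketch, not a documented rule.
    return Tier.SYNC_APPROVAL


for action in ["test_run", "git_push", "credential_access", "unknown_tool"]:
    print(action, "->", classify(action).value)
```

&lt;p&gt;Because classification happens per planned action and not per resource, the same agent session can move between tiers from one step to the next.&lt;/p&gt;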

&lt;p&gt;&lt;strong&gt;Agent-native telemetry is the capability that makes this auditable at scale.&lt;/strong&gt; Codex emits structured trace events at each step: tool call name, arguments, and the decision rationale are logged before the tool is invoked, and return values and latency are recorded once it completes. Those events follow an OpenTelemetry-compatible format, which means they can be ingested by any observability stack an enterprise already runs. The critical invariant is that the record of intent is written before the action executes, not after, so the log cannot be reconstructed post-hoc to explain an unexpected outcome. That write-ahead logging pattern is borrowed directly from database transaction systems. It is the correct pattern for any stateful agent that needs to explain itself to an auditor.&lt;/p&gt;
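&lt;p&gt;A minimal sketch of the write-ahead pattern, assuming an in-memory list of JSON lines as the log. The &lt;code&gt;run_with_trace&lt;/code&gt; wrapper and the event field names are hypothetical. This is neither the OpenTelemetry API nor Codex's actual trace schema. The intent record is appended before the tool runs, and the outcome is appended as a separate record afterwards.&lt;/p&gt;

```python
import json
import time
from typing import Any, Callable


def run_with_trace(
    log: list[str],
    tool_name: str,
    rationale: str,
    tool: Callable[..., Any],
    **arguments: Any,
) -> Any:
    """Record the intent before calling the tool, then record the outcome."""
    intent = {
        "event": "intent",
        "tool": tool_name,
        "arguments": arguments,
        "rationale": rationale,
    }
    log.append(json.dumps(intent))  # written before the action executes

    started = time.perf_counter()
    result = tool(**arguments)
    outcome = {
        "event": "outcome",
        "tool": tool_name,
        "result": result,
        "latency_s": round(time.perf_counter() - started, 6),
    }
    log.append(json.dumps(outcome))  # appended afterwards; the intent record is never rewritten
    return result


def add(a: int, b: int) -> int:
    return a + b


trace: list[str] = []
value = run_with_trace(trace, "add", "sum two test inputs", add, a=2, b=3)
print(value)
print(json.loads(trace[0])["event"], json.loads(trace[1])["event"])
```

&lt;p&gt;If the tool raises, the intent record is already in the log and no outcome record follows. That gap is itself evidence an auditor can use.&lt;/p&gt;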

&lt;p&gt;&lt;strong&gt;The enterprise value implication is quantitative.&lt;/strong&gt; An autonomous coding agent that completes 65% of assigned tasks but generates one critical incident per month in a high-stakes codebase does not have positive expected value. The bounded-risk model attempts to shift the distribution: keep the 65% task completion rate while reducing the tail of incidents that consume more engineering time than the automation saves. The approval gate on high-risk actions is not primarily about risk aversion  - it is about preserving the net-positive expected value of autonomy at scale. A system that lets 90% of actions run automatically while surfacing the 10% that need human judgment can achieve higher net throughput than a system with higher raw task completion but unpredictable tail events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise deployment requires configuration before deployment, not after.&lt;/strong&gt; Teams should define their risk tiers explicitly: which paths in their repository are high-risk (production configs, auth modules, database schemas), which operations should never run automatically regardless of risk tier (force push, secret rotation, DNS changes), and which team members should receive approval requests for which categories. That configuration is the governance artefact that makes autonomous coding a managed process rather than an experiment. Teams that build that configuration before enabling Codex will have markedly different incident rates than teams that enable autonomy with default settings and tune later.&lt;/p&gt;
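&lt;p&gt;One way such a governance artefact might look is a small policy table with a lookup function. The path patterns, operation names, team names, and the &lt;code&gt;required_approver&lt;/code&gt; helper are placeholders invented for this sketch and do not reflect any real Codex configuration format.&lt;/p&gt;

```python
from fnmatch import fnmatch
from typing import Optional

# Hypothetical policy; path patterns, operation names and team names are placeholders.
POLICY = {
    "high_risk_paths": ["config/production/*", "auth/*", "db/schema/*"],
    "never_automatic": {"force_push", "secret_rotation", "dns_change"},
    "approvers": {"high_risk_path": "platform-team", "never_automatic": "security-team"},
}


def required_approver(operation: str, path: str) -> Optional[str]:
    """Return the team that must approve, or None if the step may run automatically."""
    if operation in POLICY["never_automatic"]:
        return POLICY["approvers"]["never_automatic"]
    if any(fnmatch(path, pattern) for pattern in POLICY["high_risk_paths"]):
        return POLICY["approvers"]["high_risk_path"]
    return None


print(required_approver("edit", "auth/login.py"))
print(required_approver("force_push", "README.md"))
print(required_approver("edit", "docs/guide.md"))
```

&lt;p&gt;Keeping the policy as data makes it reviewable in the same pull-request flow as the code it protects.&lt;/p&gt;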

&lt;p&gt;&lt;strong&gt;The Windows sandbox selection is worth noting as a distribution signal, not just a security choice.&lt;/strong&gt; Most cloud-native engineering infrastructure runs on Linux; most enterprise on-premises infrastructure runs Windows Server. Building the Codex sandbox around App Container rather than Docker or gVisor means enterprise IT departments that manage Windows fleets already know the security model, its Group Policy integration, and its audit logging. The security review for Codex deployment in a Windows-primary enterprise is substantially shorter than it would be for a Linux-container-based alternative. OpenAI is targeting the enterprise procurement process as deliberately as it is targeting the security engineering review.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>finance</category>
      <category>markets</category>
    </item>
    <item>
      <title>GameDevBench shows where multimodal coding agents still break</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:09:50 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/gamedevbench-shows-where-multimodal-coding-agents-still-break-4c5h</link>
      <guid>https://dev.to/yujia_zhang_0328/gamedevbench-shows-where-multimodal-coding-agents-still-break-4c5h</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A new OpenReview benchmark pushes agents into game-engine tasks with code, visuals, and assets in one loop. The result is sobering: the best baseline solves only 49% of tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Agent Evals - May 23, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GameDevBench is structured around 358 tasks drawn from publicly available web and video tutorials for Unity and Godot.&lt;/strong&gt; The evaluation protocol is notable for what it measures: each task requires the agent to read a natural-language instruction, write or modify code in the engine's scripting language, and have the result evaluated against a behavioural test  - not a unit test, but a runtime check of whether the game object behaves correctly. That is a materially harder evaluation target than the patch-generation benchmarks that have dominated agent evaluation, because the ground truth is not a diff but a running system state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The performance breakdown by task type is mechanistically informative.&lt;/strong&gt; The paper reports 56.1% success on gameplay tasks, where the relationship between code and behaviour is relatively direct  - a movement script either produces the expected velocity or it does not. Success falls to 37.0% on 2D graphics tasks, where the agent must align code changes with visual output: the sprite must be positioned correctly, the animation must run at the right frame rate, the collision mesh must match the rendered shape. The 19-percentage-point gap is a measurement of the visual-code alignment penalty. The agent understands the code but cannot reliably verify its visual consequences without seeing them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The video feedback experiment is the most technically significant result.&lt;/strong&gt; Adding video feedback  - streaming the rendered game output back to the model as it works  - lifts Claude Sonnet 4.5 from 34.4% to 44.7%, a 30% relative improvement from a single systems change. The mechanism is straightforward: the agent can observe the delta between its expected visual outcome and the actual rendered frame, which activates a correction loop the text-only version cannot close. For sequential multi-step tasks, each unverified step compounds the error. By the time the agent reaches step five, the cumulative divergence between its internal world model and the actual game state can be large enough that its next action is based on false premises. Video feedback recalibrates the world model at each step, which is why its benefit is disproportionate relative to its implementation cost.&lt;/p&gt;
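&lt;p&gt;Two quick calculations make the paragraph's numbers concrete. The first reproduces the relative improvement from the reported scores. The second illustrates compounding with a hypothetical 90% per-step success rate and an independence assumption, neither of which comes from the paper.&lt;/p&gt;

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from before to after, as a fraction of before."""
    return (after - before) / before


def chain_success(per_step_success: float, steps: int) -> float:
    """Chance that every step in a sequence succeeds, assuming independent steps."""
    return per_step_success ** steps


gain = relative_improvement(34.4, 44.7)
print(f"{gain:.1%}")

# Hypothetical per-step rate of 0.9 over five unverified steps.
print(round(chain_success(0.9, 5), 3))
```

&lt;p&gt;The first figure comes out at roughly 30%, matching the text. The second shows how a seemingly reliable step rate erodes to about 59% over five steps when nothing corrects the drift in between.&lt;/p&gt;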

&lt;p&gt;&lt;strong&gt;The underlying failure mode is what control theory calls open-loop operation: acting without feedback from the system being controlled.&lt;/strong&gt; A text-only coding agent operates with a world model derived entirely from code, documentation, and prior context. It can predict what a piece of code should do, but it cannot verify what it actually did in a rendered or interactive environment. This property is not specific to games. Web automation that handles dynamically rendered pages, design systems that require visual regression testing, scientific visualisation pipelines where the output is a figure rather than a number, and robotic process automation targeting legacy GUIs all share the same open-loop limitation. GameDevBench is the first benchmark to measure it cleanly.&lt;/p&gt;

&lt;p&gt;The practical implication for anyone deploying multimodal agents today is that the evaluation protocol matters as much as the model choice. An agent evaluated only on code generation quality will appear to perform well on tasks where the code is correct but the observable output is wrong. Teams deploying agents on visual-feedback-dependent tasks should build benchmark suites that include output verification  - screenshot comparison, rendered state validation, UI element detection  - as part of the task completion criteria. The 44.7% versus 34.4% result shows that investment in visual feedback loops has measurable return even with current models. The teams that build that infrastructure now will have a more accurate picture of where their agents actually fail.&lt;/p&gt;
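&lt;p&gt;Output verification can start very simply. The sketch below compares two greyscale frames pixel by pixel and folds the result into the completion criterion. The &lt;code&gt;Frame&lt;/code&gt; representation, the tolerance, and the 1% mismatch threshold are assumptions. A production suite would more likely use an image library and perceptual metrics.&lt;/p&gt;

```python
Frame = list[list[int]]  # greyscale pixels, 0-255, row-major


def mismatch_ratio(expected: Frame, actual: Frame, tolerance: int = 8) -> float:
    """Share of pixels whose values differ by more than the tolerance."""
    if len(expected) != len(actual) or any(
        len(row_e) != len(row_a) for row_e, row_a in zip(expected, actual)
    ):
        raise ValueError("frames must have the same dimensions")
    total = sum(len(row) for row in expected)
    differing = sum(
        1
        for row_e, row_a in zip(expected, actual)
        for pixel_e, pixel_a in zip(row_e, row_a)
        if abs(pixel_e - pixel_a) > tolerance
    )
    return differing / total


def task_passed(code_tests_ok: bool, expected: Frame, actual: Frame, max_mismatch: float = 0.01) -> bool:
    """A task counts as complete only if code checks and the rendered output both pass."""
    return code_tests_ok and max_mismatch >= mismatch_ratio(expected, actual)


reference = [[0, 0], [255, 255]]
flipped = [[255, 255], [0, 0]]
print(task_passed(True, reference, reference))
print(task_passed(True, reference, flipped))
```

&lt;p&gt;The second call fails even though the code-level check passed, which is exactly the case a code-only evaluation would score as a success.&lt;/p&gt;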

&lt;p&gt;&lt;strong&gt;The benchmark also has a less obvious implication for model scaling.&lt;/strong&gt; The standard assumption is that better models solve more tasks because they reason better over code. GameDevBench suggests that some of the remaining performance gap is not primarily a reasoning deficit  - it is a feedback deficit. Providing the model with the right environmental state at each step matters more than scaling up its parameter count, at least for multimodal control tasks. That shifts the near-term research priority from training-time capability to inference-time environment design: how to give models the right inputs, not just how to make models smarter at processing the wrong ones.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>finance</category>
      <category>markets</category>
    </item>
    <item>
      <title>Mastercard and Yellow Card are productising stablecoin corridors</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:09:08 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/mastercard-and-yellow-card-are-productising-stablecoin-corridors-ljo</link>
      <guid>https://dev.to/yujia_zhang_0328/mastercard-and-yellow-card-are-productising-stablecoin-corridors-ljo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Mastercard's partnership with Yellow Card targets remittances, B2B settlement, loyalty, and treasury. The important move is not generic blockchain support. It is corridor-specific infrastructure with local compliance attached.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Cross-Border Rails - May 7, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Mastercard's contribution to this partnership operates through the Multi-Token Network, Mastercard's permissioned blockchain infrastructure expanded across EMEA corridors in 2025. MTN provides the settlement and token lifecycle management layer: it issues, transfers, and redeems tokenised representations of fiat-backed assets across participating financial institutions, with Mastercard's network settlement logic providing finality guarantees. Yellow Card provides the local operating layer: licensed stablecoin infrastructure with banking connectivity across more than 20 African markets, a network of local liquidity providers for on-ramp and off-ramp conversion, and regulatory relationships with central banks where MTN alone does not have direct operating access. The partnership is the combination of a global settlement network with local operating infrastructure in markets where neither party can serve the full stack independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The FX and liquidity problem in African corridor payments deserves more technical attention than it typically receives.&lt;/strong&gt; A USDC transfer from a UK sender to a Nigerian recipient looks simple on the blockchain: tokens move from one address to another in seconds. The hard part is what happens at each end. On the UK side, the sender must convert GBP to USDC, requiring an on-ramp with adequate GBP liquidity and favourable spread economics. On the Nigerian side, the recipient needs NGN, requiring an off-ramp that can convert USDC at a rate acceptable to the recipient and clear within a window the recipient's institution supports. Yellow Card acts as the market maker: it quotes a spread, provides liquidity at each end, and manages currency risk on its own balance sheet during the settlement window. NGN liquidity is segmented between official and parallel exchange rates, and the corridor's economics depend critically on which rate a provider can offer.&lt;/p&gt;
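
&lt;p&gt;The endpoint economics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a model of Yellow Card's pricing: every rate and spread below is an invented placeholder, and the function name &lt;code&gt;corridor_payout&lt;/code&gt; is hypothetical. It shows how the spread taken at each end, and the choice between an official and a parallel NGN rate, determine what the recipient receives.&lt;/p&gt;

```python
from decimal import Decimal


def corridor_payout(gbp_amount, gbp_usd_mid, onramp_spread, usd_ngn_rate, offramp_spread):
    """Return (usdc_sent, ngn_received) for a GBP to USDC to NGN transfer.

    Spreads are fractions of the quoted rate kept by the provider at each end.
    """
    # On-ramp: the sender buys USDC slightly below the mid rate.
    usdc_sent = gbp_amount * gbp_usd_mid * (Decimal("1") - onramp_spread)
    # Off-ramp: the recipient sells USDC slightly below the quoted NGN rate.
    ngn_received = usdc_sent * usd_ngn_rate * (Decimal("1") - offramp_spread)
    return usdc_sent, ngn_received


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
amount = Decimal("200")
gbp_usd = Decimal("1.25")
official_rate = Decimal("1500")   # assumed official USD/NGN rate
parallel_rate = Decimal("1600")   # assumed parallel-market USD/NGN rate

usdc, ngn_official = corridor_payout(amount, gbp_usd, Decimal("0.005"), official_rate, Decimal("0.01"))
_, ngn_parallel = corridor_payout(amount, gbp_usd, Decimal("0.005"), parallel_rate, Decimal("0.01"))

print(usdc)                         # USDC that crosses the chain
print(ngn_official, ngn_parallel)   # payout under each rate assumption
```

&lt;p&gt;With these placeholder numbers the on-chain leg is identical in both cases; the whole difference in payout comes from which NGN rate the off-ramp can offer.&lt;/p&gt;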

&lt;p&gt;&lt;strong&gt;This corridor model differs from SWIFT GPI on the same routes in ways that matter operationally.&lt;/strong&gt; SWIFT GPI reduced international payment latency from days to hours by adding tracking and settlement timing guarantees on top of the correspondent banking chain. It did not reduce the number of intermediaries or the FX conversion costs  - those are structural features of the correspondent model, not latency problems. A stablecoin corridor replaces the correspondent chain with a single token transfer and moves FX conversion to the endpoints, reducing the number of institutions taking margin and shortening settlement to near-real-time. The trade-off is that stablecoin corridors require functioning on-ramp and off-ramp infrastructure at both endpoints  - a dependency SWIFT does not have, since it operates over pre-existing bank relationships. Building those endpoints is Yellow Card's core capability.&lt;/p&gt;

&lt;p&gt;The B2B settlement use case has different economics than remittances, and the partnership's explicit inclusion of both signals that Yellow Card's infrastructure is designed to handle commercial volumes. A remittance of $200 needs an efficient on-ramp, fast transfer, and reliable off-ramp; the margin opportunity per transaction is small, so volume drives economics. A B2B settlement of $50,000 for a trade finance payment has different requirements: the buyer and seller need settlement assurance, the currency risk window matters when the amount is large, and compliance documentation requirements are stricter. Yellow Card's banking relationships and multi-currency account infrastructure at both corridor ends are what make the higher-value B2B case viable; a pure crypto wallet solution lacks the bank account connectivity that most corporate treasuries require for settlement documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The digital loyalty use case will scale fastest but receives the least analysis.&lt;/strong&gt; Loyalty point issuance, redemption, and exchange across markets is a payments problem that existing rails solve poorly  - each programme runs a closed loop, points expire in local economies, and programme operators cannot easily interoperate across borders. A stablecoin representation of loyalty value that settles on MTN's infrastructure and redeems at Yellow Card's local off-ramp points creates a programmable cross-border loyalty asset without requiring each programme operator to build corridor infrastructure independently. Loyalty liability is typically valued at a discount to face value on programme operators' balance sheets, so the ability to monetise or transfer that liability more efficiently has direct P&amp;amp;L implications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The template value of this partnership is its most important signal.&lt;/strong&gt; A workable corridor requires a global settlement layer with network effects (what Mastercard provides), a licensed local operator with banking connectivity (what Yellow Card provides), and a defined set of use cases with clear economics. That three-component template applies to every emerging-market corridor where traditional correspondent banking is expensive or slow. Mastercard cannot replicate Yellow Card's local infrastructure in every market; it can replicate the partnership model with locally licensed operators in each new corridor. The competitive question for the next three to five years is which stablecoin-native operators in each region can build the Yellow Card profile  - licensed, banked, multi-currency  - before either Mastercard or a direct competitor locks up those relationships.&lt;/p&gt;





</description>
      <category>fintech</category>
      <category>finance</category>
      <category>markets</category>
    </item>
    <item>
      <title>SAP is pulling payments closer to the enterprise ledger</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:03:50 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/sap-is-pulling-payments-closer-to-the-enterprise-ledger-f7o</link>
      <guid>https://dev.to/yujia_zhang_0328/sap-is-pulling-payments-closer-to-the-enterprise-ledger-f7o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Adyen's integration with SAP Unified Payment is a reminder that the next payments battle is not only at checkout. It is inside reconciliation, data consistency, and who owns the financial system of record.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Enterprise Integration - May 13, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;SAP Unified Payment is architecturally an abstraction layer over SAP Commerce Cloud's order management system and the financial document flows in S/4HANA. It intercepts payment events at checkout, routes them to configured PSPs through a standardised adapter interface, receives status callbacks, and writes the resulting financial documents  - revenue recognition entries, tax postings, receivables  - directly into the S/4HANA general ledger without a batch import step. Adyen's integration implements that adapter interface natively, which means every authorisation, capture, refund, and chargeback event from Adyen's acquiring network maps to a corresponding SAP financial document in real time. The technical elimination of the batch import is not a minor convenience  - it is the mechanism that closes the reconciliation gap between what the payment system recorded and what the ledger shows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reconciliation gap is where most enterprise payment operations lose time and create risk.&lt;/strong&gt; In a typical multi-system architecture, the payment gateway records a transaction, a middleware layer batches and transforms those records overnight, and the ERP imports and posts them the following morning. Any discrepancy between what the gateway recorded and what the ERP posted  - due to transformation errors, timing cutoffs, or currency rounding at different layers  - creates an exception requiring manual investigation. At scale, a merchant processing 100,000 transactions per day with a 0.1% exception rate is managing 100 manual investigations daily. Embedded payment infrastructure that writes gateway events directly to the ledger eliminates the transformation layer and therefore eliminates the class of exception that transformation creates.&lt;/p&gt;
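
&lt;p&gt;A minimal Python sketch of the matching step and the workload arithmetic above. The record layout (a transaction id mapped to an amount in minor units) and the function name &lt;code&gt;find_exceptions&lt;/code&gt; are illustrative assumptions, not SAP's or Adyen's data model.&lt;/p&gt;

```python
def find_exceptions(gateway, ledger):
    """Compare gateway records with ledger postings, both keyed by transaction id.

    Amounts are integers in minor units (pence, cents) to avoid rounding drift.
    Returns a dict of transaction id to a short reason string.
    """
    exceptions = {}
    for txn_id, amount in gateway.items():
        if txn_id not in ledger:
            exceptions[txn_id] = "missing in ledger"
        elif ledger[txn_id] != amount:
            exceptions[txn_id] = f"amount mismatch: gateway {amount}, ledger {ledger[txn_id]}"
    for txn_id in ledger:
        if txn_id not in gateway:
            exceptions[txn_id] = "missing in gateway"
    return exceptions


# Hypothetical overnight batch: one rounding difference, one record lost to a cutoff.
gateway = {"t1": 1999, "t2": 4500, "t3": 1250}
ledger = {"t1": 1999, "t2": 4501}

print(find_exceptions(gateway, ledger))

# Expected daily workload at the volume and exception rate quoted above.
daily_transactions = 100_000
exception_rate_bps = 10   # 0.1% expressed in basis points
print(daily_transactions * exception_rate_bps // 10_000)   # 100
```

&lt;p&gt;Each entry the function returns stands for one manual investigation. When gateway events post directly to the ledger, the two dictionaries are effectively the same record, so this class of mismatch has nowhere to arise.&lt;/p&gt;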

&lt;p&gt;&lt;strong&gt;The data model unification is the less-visible but strategically significant consequence.&lt;/strong&gt; When payment events and financial document events share an SAP data model, a merchant can run queries across both layers without a join across systems: which products generated chargebacks in which geographies last quarter; which payment methods correlate with higher return rates; which customer segments have the highest lifetime payment reliability. In a separated architecture, those queries require exporting data from two systems, aligning on a common identifier, and managing schema divergence between the payment platform's data model and the ERP's. SAP Unified Payment's native integration makes those analytics a standard report rather than a data engineering project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The IDoc-versus-API question shapes implementation cost for most SAP customers.&lt;/strong&gt; Traditional SAP payment integration used IDocs (Intermediate Documents) - SAP's proprietary data exchange format - which required mapping tables, batch processing, and custom ABAP development to connect non-SAP systems. SAP Unified Payment uses modern REST APIs and standardised event schemas, which reduces the integration effort from months of custom development to days of configuration in the SAP Business Technology Platform integration suite. That reduction in integration cost is what makes the Adyen-SAP combination commercially significant for SAP's installed base: the barrier to switching from a legacy gateway arrangement to the embedded model has dropped materially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The buyer profile shift is the most underappreciated consequence.&lt;/strong&gt; When payment capability is delivered through SAP, the buying decision involves not only the digital commerce team but also the CFO organisation, the group controller, and enterprise architecture. Those stakeholders care about different things: uptime and SLA for the CFO, audit trail completeness for the controller, vendor consolidation for enterprise architecture. A PSP that answers all three requirements in one conversation wins a different, larger, and more durable contract than a gateway that wins only the digital commerce team. Adyen's SAP partnership is an investment in accessing that wider buying committee rather than competing on checkout performance metrics alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The strategic consequence for future payment innovation is what the market is underpricing.&lt;/strong&gt; An enterprise merchant with embedded payment infrastructure in SAP does not need a new infrastructure project every time a new payment method appears. USDC acceptance, account-to-account payments, buy-now-pay-later integrations, and agent-triggered purchases can all be added as configuration changes to the existing SAP payment hub, provided the PSP supports them through the same adapter interface. That configurability converts each new payment method from a development project with its own security review and testing cycle into a policy decision. Merchants with embedded infrastructure can activate new rails in weeks; merchants with point-to-point integrations will take months. The speed advantage compounds over time into competitive separation.&lt;/p&gt;





</description>
      <category>fintech</category>
      <category>finance</category>
      <category>markets</category>
    </item>
    <item>
      <title>Checkout.com turns stablecoin acceptance into a PSP feature</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:03:44 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/checkoutcom-turns-stablecoin-acceptance-into-a-psp-feature-2el</link>
      <guid>https://dev.to/yujia_zhang_0328/checkoutcom-turns-stablecoin-acceptance-into-a-psp-feature-2el</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Coinbase and Checkout.com are making USDC and USDT acceptance available through an existing enterprise PSP stack. That shifts stablecoins from a separate crypto integration into a payments-operations decision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Merchant Acceptance - June 2, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The Coinbase-Checkout.com integration is a three-layer stack, and the commercial insight lies in understanding which layer does what. The bottom layer is Base, Coinbase's Ethereum Layer 2 network: USDC or USDT transfers execute as EVM transactions, are confirmed in blocks produced approximately every two seconds (full finality, which is anchored to Ethereum, takes longer), and are cryptographically irreversible once final. The middle layer is Coinbase Commerce's acceptance SDK, which abstracts on-chain mechanics into a payment-request-and-verify flow - the merchant's system calls an API endpoint, gets back a payment address, polls for confirmation, and receives a webhook when the transaction settles. The top layer is Checkout.com's PSP integration, which wraps that SDK into the same merchant API surface used for card and bank transfer acceptance. A merchant already integrated with Checkout.com does not see the on-chain layer at all.&lt;/p&gt;

&lt;p&gt;The USD settlement design is the engineering decision that determines whether enterprise finance teams approve the integration. An enterprise merchant cannot carry USDC as a balance-sheet asset without triggering accounting and treasury complexity that most finance teams will reject in a normal payments discussion. The integration solves this by running an automated off-ramp: as soon as the on-chain transfer confirms, Checkout.com's liquidity management system converts USDC to USD at a real-time rate, nets the conversion against the daily settlement batch, and delivers USD to the merchant's bank on the same settlement cadence as card transactions. From the merchant's accounting perspective, the receivable is always USD  - the on-chain leg is an infrastructure detail, not a balance-sheet event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The refund mechanism is where most stablecoin integrations fail in practice, and the design here matters.&lt;/strong&gt; A stablecoin transfer is cryptographically irreversible  - the blockchain cannot undo a confirmed transaction. Refunds must be new outbound transfers from the merchant to the original sender address. For B2C payments, this requires the merchant to hold a stablecoin operational balance specifically for refunds, which reintroduces the treasury complexity the USD settlement process was designed to avoid. The likely design is a hybrid: refunds are processed as USD credits through Checkout.com's existing refund rails, with the on-chain leg settled internally through Coinbase's liquidity book. That keeps refund accounting in USD while using Coinbase's balance sheet to absorb the timing mismatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know-your-transaction compliance is the other integration point that determines enterprise adoption velocity.&lt;/strong&gt; Card networks have decades of transaction monitoring built into their authorisation flow; stablecoin networks do not have an equivalent at the protocol level. Coinbase addresses this through its on-chain analytics layer  - the same infrastructure used for Exchange and Institutional custody  - which screens every inbound payment address against sanctions lists, flags addresses associated with mixer protocols or high-risk DeFi contracts, and can trigger a hold before the merchant's order is confirmed. That screening happens at the Coinbase Commerce layer, not at the PSP layer. Checkout.com does not need to build its own blockchain analytics capability: the compliance boundary sits between the two providers, with each contributing its core competency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The market structure consequence is that the PSP layer becomes the natural aggregator of stablecoin acceptance capacity.&lt;/strong&gt; A merchant accepting stablecoins through Checkout.com outsources custody, off-ramp, refund settlement, and KYT to Coinbase's infrastructure, while outsourcing checkout UX, fraud rules, and merchant reconciliation to Checkout.com's. Neither party manages the full stack; both contribute the layer where they have existing economies of scale. That division of labour is why PSP-native stablecoin acceptance is more likely to reach mass adoption than wallet-native acceptance, which requires merchants to integrate independently with custody, compliance, and checkout infrastructure most of them do not have the engineering bandwidth to manage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scale effect on adoption is worth quantifying.&lt;/strong&gt; Checkout.com serves more than 1,000 enterprise merchants. If 10% activate stablecoin acceptance in the first twelve months  - a conservative assumption given that activation is a configuration choice rather than a new integration project  - that is over 100 large merchants adding stablecoin as a payment method simultaneously. That creates a step-change in wallet-holder demand signal: the case for carrying a USDC balance becomes more compelling when a meaningful fraction of major e-commerce merchants accept it at checkout. Network effects in payment acceptance are driven by simultaneous growth on both sides of the market; the PSP distribution model is the most credible path to achieving that simultaneity.&lt;/p&gt;





</description>
      <category>fintech</category>
      <category>finance</category>
      <category>markets</category>
    </item>
    <item>
      <title>Oil's return to the centre of the tape is forcing portfolios back into supply-shock math</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Wed, 15 Apr 2026 20:56:26 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/oils-return-to-the-centre-of-the-tape-is-forcing-portfolios-back-into-supply-shock-math-1hen</link>
      <guid>https://dev.to/yujia_zhang_0328/oils-return-to-the-centre-of-the-tape-is-forcing-portfolios-back-into-supply-shock-math-1hen</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;With U.S. gas prices above $4 and crude back above $100, the market is treating energy as a regime variable again  - not a sector detail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Oil · March 31, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When oil moves far enough and long enough, it stops behaving like a commodity story and starts behaving like a market regime variable. The reason is simple: energy feeds into transport, production, consumer budgets, inflation expectations, and policy assumptions at the same time. Once those channels move together, the shock propagates through the whole discounting system. The 2022 energy shock demonstrated this clearly  - it was not merely an oil price event but a repricing of the real economy's energy dependency at every level of the production chain, with consequences for inflation, real wages, sovereign fiscal positions, and central bank policy that played out over two years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The supply-side dynamics in the current move are distinct from prior cycles in important ways.&lt;/strong&gt; OPEC+ production cuts have held with unusual cohesion, reflecting both the disciplinary role of Saudi Arabia's fiscal breakeven price requirements and the reduced internal political pressure to cheat on quotas that characterises periods of genuine supply constraint. Russian production has been more resilient than the architects of Western sanctions expected, but exports have been rerouted at a cost - adding a logistical friction premium that did not exist before 2022. Meanwhile, U.S. shale's responsiveness to price signals has moderated from prior cycles, as operators have prioritised returns to shareholders over volume growth at high prices. The combination of OPEC+ discipline, Russian logistics friction, and U.S. supply restraint is unusual, and it implies that the supply response to elevated prices will be slower than in previous cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is why spot price alone is a poor summary statistic.&lt;/strong&gt; The relevant variable is persistence. A short-lived spike can often be absorbed as noise by businesses that hedge or defer investment. A sustained move changes margin assumptions, hedging behaviour, and the cross-asset relationship between rates and equities. Investors then have to price not only a higher input cost, but the duration of that higher-cost state. Stated more formally, if the supply shock has expected persistence P, pass-through elasticity ε, and size ΔP_oil (the change in the oil price), then the cumulative cost to the economy is roughly proportional to P × ε × ΔP_oil: the price change integrated over the time it lasts. It is that cumulative quantity - not the spot price - that drives the change in the equity risk premium and the growth forecast.&lt;/p&gt;
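
&lt;p&gt;To make the persistence argument concrete, here is a small Python sketch under stated assumptions: the price shock decays geometrically at a monthly retention rate &lt;code&gt;rho&lt;/code&gt;, the pass-through elasticity is constant, and costs are discounted at a flat monthly rate. None of the parameter values come from market data; they are placeholders. The point is only that two shocks with the same spot move can carry very different cumulative costs.&lt;/p&gt;

```python
def cumulative_shock_cost(delta_price, elasticity, rho, months, monthly_discount=0.0):
    """Discounted sum of the passed-through price shock over a horizon.

    delta_price: initial rise in the oil price (for example, dollars per barrel)
    elasticity: share of the price rise passed through to the wider economy
    rho: fraction of the shock that survives from one month to the next
    """
    total = 0.0
    for t in range(months):
        shock_t = delta_price * (rho ** t)                  # remaining shock in month t
        discount = 1.0 / ((1.0 + monthly_discount) ** t)    # present-value weight
        total += elasticity * shock_t * discount
    return total


# Same spot move, different persistence (all values hypothetical).
spike = cumulative_shock_cost(delta_price=30.0, elasticity=0.2, rho=0.5, months=24)
sustained = cumulative_shock_cost(delta_price=30.0, elasticity=0.2, rho=0.95, months=24)

print(round(spike, 2), round(sustained, 2))
print(sustained / spike)   # how much larger the sustained shock is in cumulative terms
```

&lt;p&gt;With &lt;code&gt;rho&lt;/code&gt; set to 1 and no discounting, the sum collapses to the simple product of persistence, elasticity, and price change described above, with the number of months playing the role of persistence.&lt;/p&gt;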

&lt;p&gt;&lt;strong&gt;For equity portfolios, the effect is nonlinear.&lt;/strong&gt; Sectors with pricing power and energy leverage  - integrated oil companies, commodity producers, industrial businesses that can pass costs through  - can benefit in nominal terms even as the overall market re-rates lower. Energy-intensive businesses facing a double squeeze from costs and softer demand  - chemicals, aviation, discretionary retail  - are disproportionately affected. For multi-asset portfolios, the more difficult issue is that supply shocks weaken the traditional stock-bond hedge when inflation expectations rise at the same time growth expectations weaken. In that stagflationary configuration, bonds sell off as inflation pricing pushes yields higher, while equities sell off on weaker earnings and slower growth  - and the two assets that are supposed to diversify each other move in the same direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The central bank response function is the critical uncertainty.&lt;/strong&gt; If major central banks treat the supply shock as transitory and maintain accommodative policy, nominal asset prices may hold even as real returns erode. If central banks tighten in response to energy-driven inflation, as in 2022, the growth hit is compounded by higher discount rates. The 2022 playbook  - where both bonds and equities fell sharply together  - rewarded commodity exposure and real assets while devastating duration. Whether that playbook repeats depends on inflation expectations: if they remain well-anchored, central banks have room to look through a commodity shock; if they become unanchored, they do not. That conditional is now one of the primary macro risk variables in any multi-asset portfolio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why energy deserves a more explicit place in market models again.&lt;/strong&gt; The post-2014 decade of low and stable energy prices trained a generation of portfolio managers to treat oil as a second-order factor  - a sector detail rather than a systemic risk. The current move is a reminder that energy is always a first-order variable in the real economy; it only appears to disappear from the analysis during periods when it is stable. When the regime shifts, the portfolios built on the assumption of stable energy lose their calibration. Repositioning for structurally elevated energy costs  - through direct commodity exposure, equity tilts toward energy-leveraged sectors, or inflation-linked duration  - requires recognising the shift before it is fully priced, which in practice means watching persistence signals rather than waiting for the spot price to confirm what the forward curve is already implying.&lt;/p&gt;





</description>
      <category>energy</category>
      <category>markets</category>
      <category>finance</category>
    </item>
    <item>
      <title>Tech companies are building a shadow grid - and 30% of data centre power may soon be off-grid</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Wed, 15 Apr 2026 20:51:09 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/tech-companies-are-building-a-shadow-grid-and-30-of-data-centre-power-may-soon-be-off-grid-2o06</link>
      <guid>https://dev.to/yujia_zhang_0328/tech-companies-are-building-a-shadow-grid-and-30-of-data-centre-power-may-soon-be-off-grid-2o06</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Chevron is building a dedicated gas plant for a Microsoft data centre in Texas. Amazon secured 1.5 GW of dedicated solar. Roughly 30% of all planned data centre capacity is now expected to be on-site. The regulated grid is being bypassed at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Infrastructure · April 3, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The regulated electricity grid was designed around a simple topology: centralised generation, transmission across long distances, and distribution to a dispersed population of end users. What it was not designed for is a class of industrial users large enough to require the equivalent of a small city's power supply in a single location, growing fast enough to outpace any utility planning cycle. The response from those users has been to stop waiting for the grid and start building their own. This is not a workaround or a temporary measure  - it is an architectural shift in how the most capital-intensive technology infrastructure in history is being powered, and it is happening across every major hyperscaler simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chevron is working on a deal to build a dedicated natural gas plant for a Microsoft data centre in Texas.&lt;/strong&gt; Amazon secured 1.5 gigawatts of dedicated solar capacity in the same state. According to a February 2026 report by Cleanview, a market intelligence firm, roughly 30% of all planned data centre power capacity is now expected to be on-site  - up from almost nothing a year earlier. Forty-six data centre projects with a combined planned capacity of 56 GW are pursuing dedicated generation infrastructure outright. Microsoft's agreement with Constellation Energy to restart the Three Mile Island Unit 1 nuclear reactor is the most prominent example of a broader pattern: when grid power is too expensive, too slow to contract, or too unreliable in capacity-constrained regions, hyperscalers are moving directly to the supply side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nuclear power purchase agreements have emerged as the premium end of the dedicated generation market.&lt;/strong&gt; The economics are straightforward: nuclear plants provide 24/7 baseload power with little fuel-price exposure and near-zero carbon emissions, making them ideal for data centres with sustainability commitments and reliability requirements. Google's agreement with Kairos Power for six to seven small modular reactors, Amazon's investment in X-energy, and several other hyperscaler nuclear PPA announcements in 2025-2026 collectively represent a revival of commercial nuclear power demand that the existing fleet of large reactors - most of them built before 1990 - cannot fully satisfy. The result is a private-sector nuclear construction market developing in parallel to, and largely independent of, the public policy debates about nuclear subsidy and regulation.&lt;/p&gt;

&lt;p&gt;This divergence  - between AI infrastructure that is increasingly self-powered and everything else that depends on the regulated grid  - has material consequences for both electricity markets and for the ratepayers who remain on it. Dedicated generation removes high-volume, technically predictable load from the grid's demand base, which would normally reduce capacity market costs. The complication is that it does not reduce the fixed infrastructure costs of the grid itself  - transmission lines, substations, distribution networks  - which were built to serve a certain total load and must still be maintained regardless of how much of that load migrates off-grid. Those fixed costs are then allocated over a smaller remaining customer base, implying rising per-unit costs for residential customers and small businesses who have no alternative.&lt;/p&gt;
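
&lt;p&gt;The cost-shift mechanism reduces to simple arithmetic, sketched here in Python. The figures are invented for illustration and do not describe any real utility, and the function name &lt;code&gt;per_mwh_fixed_charge&lt;/code&gt; is hypothetical. The sketch assumes fixed network costs are recovered per MWh and do not fall when load leaves the grid.&lt;/p&gt;

```python
def per_mwh_fixed_charge(fixed_cost, load_mwh):
    """Fixed network cost recovered per MWh of remaining grid load."""
    if load_mwh == 0:
        raise ValueError("no remaining load to recover costs from")
    return fixed_cost / load_mwh


fixed_cost = 1_000_000_000   # hypothetical annual network cost in dollars
total_load = 50_000_000      # hypothetical annual grid load in MWh
migrating_pct = 20           # assumed percentage of load that moves to on-site supply

remaining_load = total_load * (100 - migrating_pct) // 100

before = per_mwh_fixed_charge(fixed_cost, total_load)
after = per_mwh_fixed_charge(fixed_cost, remaining_load)

print(before, after)                 # dollars per MWh before and after migration
print((after / before - 1) * 100)    # percentage rise for customers who stay
```

&lt;p&gt;Under these assumptions, a 20% loss of load raises the fixed charge on remaining customers by 25%, because the same cost is divided over four-fifths of the original volume. The effect is nonlinear: the larger the migrating share, the faster the per-unit charge climbs.&lt;/p&gt;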

&lt;p&gt;&lt;strong&gt;The energy island model also creates a new category of infrastructure investment.&lt;/strong&gt; Developers who can originate, finance, and build dedicated generation assets at data centre scale  - whether gas, nuclear, or large-scale solar with storage  - are operating in a market that did not meaningfully exist three years ago. The project economics are structurally attractive: long-dated offtake at contracted prices from creditworthy counterparties, with demand visibility that is orders of magnitude better than merchant generation. The critical bottleneck is not capital  - there is significant institutional interest in infrastructure assets with technology-company counterparties  - but execution: the engineering, permitting, and interconnection work required to deliver firm power to a specific site at a specific date is resource-constrained in ways that capital alone cannot solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The permitting and regulatory dimension is underappreciated.&lt;/strong&gt; Dedicated generation that operates behind the meter  - physically connected to a data centre without flowing through the public grid  - faces a different regulatory framework than grid-connected generation in most U.S. jurisdictions. State utility commissions, FERC, and in some cases the Nuclear Regulatory Commission all have jurisdictional interests depending on the technology and configuration. Some states are moving to streamline permitting for data centre power agreements; others are resisting, viewing the shadow grid as a threat to their integrated resource planning authority. The regulatory patchwork is one reason that Texas  - with its deregulated ERCOT market and lighter state oversight of merchant generation  - has attracted a disproportionate share of early dedicated generation deals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For energy modellers and power market practitioners, the shadow grid is already a significant modelling variable.&lt;/strong&gt; The traditional assumption that data centre demand flows through the regulated grid is becoming incorrect at scale. Understanding the fraction of AI load that is off-grid, and how that changes marginal pricing, capacity market clearing, and transmission utilisation, is now a first-order input into any serious power market analysis. A model that assumes full grid dependency for AI load will overestimate forward demand on regulated networks and underestimate the rate at which fixed cost socialisation pressures accumulate for remaining grid customers  - both errors with significant consequence for utility valuation, capacity investment decisions, and regulatory rate cases.&lt;/p&gt;





</description>
      <category>energy</category>
      <category>markets</category>
      <category>finance</category>
    </item>
    <item>
      <title>Data centres are forcing a fight over who pays for grid adequacy</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Wed, 15 Apr 2026 20:51:02 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/data-centres-are-forcing-a-fight-over-who-pays-for-grid-adequacy-2pd5</link>
      <guid>https://dev.to/yujia_zhang_0328/data-centres-are-forcing-a-fight-over-who-pays-for-grid-adequacy-2pd5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The question is no longer just how much firm capacity PJM needs. It is whether the new load that creates the need should pay for it directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Rate Design - April 7, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Capacity markets socialise the cost of reliability across all ratepayers.&lt;/strong&gt; That works when load growth is diffuse and slow. It becomes controversial when the incremental demand comes from a small number of industrial users with enough bargaining power to shape the rules. AI data centres sit exactly in that category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The policy response is starting to move away from the auction itself and toward tariff design.&lt;/strong&gt; If a data centre creates the need for new firm capacity, one obvious answer is to charge it through a dedicated rate class or a direct procurement obligation. That would stop the cost from being spread across households and small businesses that did not create the new load.&lt;/p&gt;

&lt;p&gt;Virginia has already created a separate electricity rate class for data centres, which is a sign that regulators are no longer treating the issue as a generic utility problem. Other jurisdictions are likely to follow with their own versions of the same question: should a hyperscaler that needs a large block of firm power be required to sponsor the generation it consumes, or should the grid recover the cost through the normal socialised tariff structure?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The supply side still matters, but it is not the only constraint.&lt;/strong&gt; New dispatchable capacity takes years to build, PJM's interconnection queue remains long, and FERC reforms will not erase the bottleneck overnight. That means the allocation decision is being made before the supply response arrives, which makes the rate-design choice more consequential than the auction print itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For investors and operators, the key variable is no longer just the level of capacity prices.&lt;/strong&gt; It is the regulatory regime that determines who is exposed to them. That determines whether AI load becomes a broadly socialised grid transition or a more explicit industrial user-pays model.&lt;/p&gt;





</description>
      <category>energy</category>
      <category>markets</category>
      <category>finance</category>
    </item>
    <item>
      <title>OpenAI's Promptfoo deal puts evaluation and red-teaming at the centre of the agent stack</title>
      <dc:creator>Yujia Zhang</dc:creator>
      <pubDate>Wed, 15 Apr 2026 20:45:30 +0000</pubDate>
      <link>https://dev.to/yujia_zhang_0328/openais-promptfoo-deal-puts-evaluation-and-red-teaming-at-the-centre-of-the-agent-stack-2208</link>
      <guid>https://dev.to/yujia_zhang_0328/openais-promptfoo-deal-puts-evaluation-and-red-teaming-at-the-centre-of-the-agent-stack-2208</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The acquisition signals that agent quality is no longer judged only by fluency; it is judged by whether organisations can test, document, and govern failure before deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;AI Security - March 9, 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;When AI systems are connected to tools, data, and production workflows, average-case quality stops being enough.&lt;/strong&gt; What matters is the tail of the distribution: prompt injection, tool misuse, hidden data leakage, escalation pathways, and brittle behaviour under edge conditions. Those are not branding problems. They are operational risk problems, and they are exactly what evaluation frameworks like Promptfoo are designed to surface before deployment. Promptfoo provides an open-source CLI for running structured test suites against LLM applications, with built-in attack libraries covering indirect prompt injection, jailbreaks, and tool-call manipulation: the failure modes that matter most when an agent has write access to production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is what makes the acquisition strategically significant.&lt;/strong&gt; It represents the institutionalisation of evals, security testing, and structured reporting into the build cycle itself. The agent stack is acquiring the equivalent of a serious QA and risk function. This is precisely what happens when a technology moves from experimentation into managed production: the discipline of testing catches up with the pace of capability. Promptfoo was already used by over 35,000 developers and had processed millions of test runs before the acquisition, giving OpenAI an installed base and a workflow integration point across a significant fraction of enterprise AI teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From a mathematical perspective, the case is direct.&lt;/strong&gt; A system with high average productivity but fat-tailed failure modes can still have negative expected value once deployed into sensitive workflows. Evaluation is the discipline of shrinking that loss distribution before it shows up in incidents, compliance failures, or broken customer journeys. The ROI on evals is not measured in benchmark scores; it is measured in production incidents avoided at scale. If a single agent failure in a high-stakes financial or legal workflow costs more than the entire monthly productivity gain, the expected value of deployment is negative until the tail is controlled. That framing is now routine in risk-aware enterprise AI adoption, and it explains why evaluation tooling is no longer a niche product category.&lt;/p&gt;
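
&lt;p&gt;A minimal sketch of that arithmetic follows, assuming at most one tail incident per month and treating the productivity gain as certain. The function names and all figures are hypothetical; they are not drawn from Promptfoo, OpenAI, or any real deployment.&lt;/p&gt;

```python
# Hypothetical figures for illustration only; none come from the article.

def expected_monthly_value(monthly_gain, p_incident, incident_loss):
    """Expected monthly value of a deployed agent.

    monthly_gain:  productivity gain in a month (currency units)
    p_incident:    probability of one tail incident in that month
    incident_loss: cost of that incident (currency units)
    """
    return monthly_gain - p_incident * incident_loss


def break_even_probability(monthly_gain, incident_loss):
    """Incident probability at which expected value is exactly zero."""
    return monthly_gain / incident_loss


gain = 100_000      # assumed monthly productivity gain
loss = 5_000_000    # assumed cost of a single high-stakes failure

# Assumed incident probabilities before and after an evaluation programme.
before_evals = expected_monthly_value(gain, p_incident=0.05, incident_loss=loss)
after_evals = expected_monthly_value(gain, p_incident=0.01, incident_loss=loss)

print(f"break-even incident probability: {break_even_probability(gain, loss):.3f}")
print(f"expected value before evals: {before_evals:,.0f}")
print(f"expected value after evals:  {after_evals:,.0f}")
```

&lt;p&gt;With these assumed inputs the break-even incident probability is 2%: at 5% the expected monthly value is negative, and it turns positive only once testing pushes the probability below that threshold.&lt;/p&gt;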

&lt;p&gt;&lt;strong&gt;The regulatory and compliance dimension adds urgency.&lt;/strong&gt; The EU AI Act's high-risk provisions require documented testing, ongoing monitoring, and audit trails for AI systems in consequential domains. In the United States, sector regulators, including the OCC and SEC, are publishing guidance that implies evaluable, documented AI behaviour as a condition of supervised deployment. Enterprises can no longer treat agent behaviour as a best-effort property. They need records of what was tested, what failed, how it was fixed, and who approved the deployment, and that documentation must survive an examination. Promptfoo's structured reporting output is a direct input into that compliance workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper implication is competitive.&lt;/strong&gt; Platform providers that can make testing native to the build cycle will be better positioned than those that leave safety and oversight to external wrappers. Enterprises do not merely want capable agents. They want agents whose behaviour can be inspected, challenged, and defended to a compliance team, a regulator, or a customer who received an incorrect output. OpenAI's move effectively embeds evaluation infrastructure inside its developer toolchain, making it harder for enterprises to justify using a separate foundation model provider when the testing and governance tools are already integrated into the platform they are building on.&lt;/p&gt;

&lt;p&gt;For practitioners in energy modelling and quantitative research, domains where model outputs feed directly into financial and operational decisions, the evals framing is already familiar. Backtesting, stress-testing, and out-of-sample validation are the analogue of red-teaming in quantitative work. The distinction is that quantitative models are typically applied to well-defined tasks with clear ground truth, whereas LLM agents operate in open-ended task spaces where failure modes are harder to enumerate in advance. That gap is what red-teaming libraries exist to partially close, by adversarially probing the system before it encounters novel inputs in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The broader pattern is one of industrial maturation.&lt;/strong&gt; Every technology that has moved from innovation to regulated production has eventually acquired a testing and certification infrastructure: aviation has airworthiness standards, pharmaceuticals have clinical trial protocols, financial models have model risk management frameworks. AI agents are arriving at the same inflection point. OpenAI's acquisition of Promptfoo is not merely a product decision; it is a bet that the evaluation layer will become a mandatory cost of doing business, and that the company which owns the best tooling for it will have a structural advantage in enterprise accounts where compliance is non-negotiable.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>finance</category>
      <category>markets</category>
    </item>
  </channel>
</rss>
