OpenAI baked a chip called Jalapeño, DeepSeek cracked 85% faster responses, and your local LLM might be getting dumber by the hour
Monday morning and the AI news feed is already smoking. Three things stood out this weekend: OpenAI finally showed its custom silicon, DeepSeek dropped an inference speed trick that actually works without quality loss, and there's this growing unease about local LLMs slowly losing it the longer you run them. Plus HP just made a massive bet on OpenAI's Frontier platform, and Shopify quietly built the kind of model-agnostic stack most enterprises only dream about. Let me untangle all of this.
OpenAI's Jalapeño Chip — Not Just a Spicy Name
So OpenAI's first custom inference chip is real. Codenamed Jalapeño, co-developed with Broadcom, designed from scratch for LLM inference — not general-purpose compute, not training, just running models efficiently. Greg Brockman says they're already testing it internally with GPT-5.3-Codex-Spark, and the plan is to deploy at gigawatt scale with Microsoft by end of 2026.
A few things here. First, the nine-month design-to-tape-out cycle is pretty impressive for a first-gen custom chip. Second, this is clearly OpenAI's long-term hedge against Nvidia dependency — something they've been hinting at since early this year when Reuters reported they were unhappy with some Nvidia hardware. But let's be real: Jalapeño is for inference, not training. Nvidia still owns training, and that's not changing overnight.
What I'm actually curious about is whether Jalapeño delivers on the "more compute with less energy" claim. OpenAI hasn't released benchmarks yet, which is typical for first silicon, but without numbers it's just marketing speak. If this thing can run GPT-5.5-tier models at half the power cost of H100s, that's a genuinely big deal for API pricing. If it's marginally better, well, Broadcom gets a nice reference design and OpenAI gets a PR win.
Either way, the multi-generation roadmap suggests we'll see Jalapeño 2 and 3 within the next 18 months. The chip infrastructure game is accelerating hard.
DeepSeek DSpark — 85% Faster, No Quality Sacrifice
DeepSeek dropped DSpark on June 27, and this one deserves attention. It's a speculative decoding framework baked into DeepSeek-V4 — not a new model, but an optimization layer. On the Flash variant, per-user response speed improves 60-85%. On Pro, 57-78%. Zero quality regression, and they open-sourced it.
Speculative decoding isn't new conceptually — Google and Anthropic have used variants of it for a while — but DeepSeek made it practical at scale as a drop-in framework. What this means in practice: if you're running V4 Flash in production, your p50 latency goes from maybe 800ms to under 200ms. That's the difference between "waiting for the AI" and "the AI replying while I'm still typing."
The open-source part is what makes this interesting. Anyone can take DSpark, integrate it, and speed up their own inference pipeline without retraining. DeepSeek is leaning hard into the "inference efficiency > raw model size" narrative, and honestly, I think they're right. The industry has been obsessed with parameter counts and benchmarks while ignoring that slow inference kills user experience in real products.
One gripe: speculative decoding adds some memory overhead on the draft model side. If you're already GPU-constrained at 4-bit quantization, this might not be a free lunch. It's worth testing in your own setup.
Your Local LLM Isn't Imagining Things — It's Actually Getting Dumber
This XDA Developers piece hit close to home. Someone ran a local LLM continuously for hours on an RTX 5090 and watched output quality degrade in real time. It's not the model "learning" bad patterns — it's context window saturation combined with thermal throttling and memory fragmentation.
I've noticed the same thing running Qwen3.6-27B. The advertised 262k context window works in theory, but past about 32k tokens the model starts losing thread coherence. Responses get shorter, more repetitive, and occasionally it just repeats the last user message back. Andrew Zhu on Medium documented this systematically — performance collapse past a critical threshold is real.
For people running local models daily (and I know a lot of you are), here's the practical take: restart sessions more often than you think you need to. Every 50-100 messages, or whenever the model starts feeling "off." Use tools that periodically trim context. And don't trust those huge context window numbers in model cards — test your own workload and find the real sweet spot.
The hardware side matters too. Running a 70B model at Q4 for 6 hours straight on consumer GPUs pushes thermals into throttling territory. Your LLM might be getting "dumber" because your VRAM is literally overheating.
HP Goes All-In on OpenAI Frontier — And Shopify Shows How to Stay Unstuck
HP announced a strategic partnership with OpenAI on June 28, adopting the Frontier platform across customer-facing experiences, internal ops, and software development. They're integrating it with WXP (their workforce experience platform, a Gartner Magic Quadrant leader). Prakash Arunkundrum frames it as AI becoming "an operating layer" for HP.
This is HP making a big bet — and spending big money — on OpenAI's enterprise play. The Frontier platform includes agentic capabilities, which HP plans to use for customer support, telemetry insights, and employee productivity.
But here's where Shopify's approach, detailed by VentureBeat a few days earlier, offers a useful counterpoint. Shopify built an LLM proxy that routes across multiple providers with automatic failover. When Anthropic shut down Claude Fable 5, Shopify engineers didn't even notice — the proxy moved them to Opus or GPT 5.5 seamlessly. They also run a distillation pipeline that has produced models 30x cheaper and faster for specific subtasks.
The contrast is instructive. HP is going deep with one vendor. Shopify is staying model-agnostic. Which strategy is smarter? Honestly, both work for their contexts — HP needs enterprise-grade SLAs and deep integration; Shopify has 10,000+ microservices and needs flexibility. The real lesson is: know your risk tolerance. If one model provider going dark would break your pipeline, you need a proxy layer. If you can afford lock-in for better integration, go deep. Just don't pretend neither risk exists.
Gartner's Reality Check — AI Coding Costs Will Overtake Developer Salaries by 2028
Gartner released a forecast that stopped me mid-sip: by 2028, AI coding costs (token consumption + licensing) will exceed the average developer's salary. The logic is straightforward — token usage is growing exponentially, and pricing is shifting to consumption-based models without ceilings.
Right now, most teams think of AI coding tools as cheap. A $20/month Copilot subscription per dev feels like a no-brainer. But when you factor in heavy Codex usage, custom model fine-tuning, and agent orchestration costs, the bill adds up fast. Gartner's data suggests enterprises aren't tracking token spend per developer accurately today, which means budget surprises are coming.
My take: this doesn't mean AI-assisted development is overhyped. It means the economics are about to change. We'll see more fixed-price licensing, more caching layers, and probably more distilled models optimized for specific tasks (exactly what Shopify's doing). The teams that survive the cost crunch are the ones that measure token spend the way they measure cloud compute spend — per feature, per team, with alerts and circuit breakers.
Quick Bites
- Perplexity's $34.5B Chrome bid — The search startup wants to buy Chrome if Google is forced to divest it. Open Web Advocacy warns this could slash web platform investment by 70%. Perplexity previously argued nobody else should run Chrome. Awkward.
- Firmus + Nvidia building in Batam — Singapore-based AI startup Firmus partnering with Nvidia and DayOne for a data centre in Batam, Indonesia. Another chunk of Southeast Asian AI infrastructure taking shape.
- Agent memory benchmark — New open benchmark on GitHub (Kausha3/agent-memory-bench) testing failure modes of agent memory: retraction, collision, recall, conflict. Offline, zero-dependency. If you're shipping agents, this is worth a look.
- ICAI + Sarvam AI LLM for Indian CAs — Domain-specific LLM for chartered accountants handling sensitive financial data, trained on Indian regulatory frameworks. Privacy-first approach.
If you're sizing up your AI stack this week, the takeaway is pretty clear: don't put all your inference eggs in one basket, don't trust context window numbers at face value, and start measuring your token spend now before it becomes a surprise line item. The Jalapeño chip and DSpark are genuine engineering wins, but they're optimizations on top of a landscape that's still shifting fast.
Running cost calculations for your next project? PayCalc has some handy tools that might save you a spreadsheet headache.
AI Pulse — June 29, 2026
Disclaimer: Some information in this post is based on early reports and may be updated as official details emerge. Unverified industry rumor, for reference only.

Top comments (0)