<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Martin Alderson</title>
    <description>The latest articles on DEV Community by Martin Alderson (@martinald).</description>
    <link>https://dev.to/martinald</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3778422%2F96edb777-4bea-4cf4-aac7-fe968072c3d2.png</url>
      <title>DEV Community: Martin Alderson</title>
      <link>https://dev.to/martinald</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martinald"/>
    <language>en</language>
    <item>
      <title>Why on-device agentic AI can't keep up</title>
      <dc:creator>Martin Alderson</dc:creator>
      <pubDate>Sun, 01 Mar 2026 22:22:28 +0000</pubDate>
      <link>https://dev.to/martinald/why-on-device-agentic-ai-cant-keep-up-4c54</link>
      <guid>https://dev.to/martinald/why-on-device-agentic-ai-cant-keep-up-4c54</guid>
      <description>&lt;p&gt;There's a growing narrative that on-device AI is about to free us from the cloud - the pitch is compelling. Local inference means privacy, zero latency, no API costs. Run your own agents on your computer or phone, no cloud required.&lt;/p&gt;

&lt;p&gt;Indeed, the pace of improvement in open weights models has been spectacular - if you've got (tens of) thousands to drop on a Mac Studio cluster or a high-end GPU setup, local models are genuinely useful. But for the other 99% of devices people actually carry around, every time I open llama.cpp to do some local on-device work, it feels - if anything - like progress is going backwards relative to what I can do with frontier models.&lt;/p&gt;

&lt;p&gt;There are some hard physical limits to what consumer hardware can do - and they're not going away any time soon.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For the purposes of this article, I'm referring to agentic capabilities in a personal admin capacity: think searching emails, composing a reply, and sending a calendar invite. More advanced capabilities like we see in software engineering are &lt;em&gt;even&lt;/em&gt; harder to do on device.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The state of RAM
&lt;/h2&gt;

&lt;p&gt;While the &lt;em&gt;models&lt;/em&gt; themselves are getting hugely more capable, there's an intrinsic problem: they require &lt;em&gt;a lot&lt;/em&gt; of RAM, and ideally fast RAM at that.&lt;/p&gt;

&lt;p&gt;Right now, 16GB laptops are the most common configuration for &lt;em&gt;new&lt;/em&gt; devices - but 8GB is still very common. &lt;/p&gt;

&lt;p&gt;On phones, the situation is (understandably) even more constrained. Apple is still shipping phones with 8GB for the most part - the iPhone 16e and 17 ship with 8GB of RAM, and only the Pro models have 12GB. Google's Pixel lineup is more generous, shipping 12GB on the 'standard' models and 16GB on the Pro models.&lt;/p&gt;

&lt;p&gt;The issue is that this RAM isn't just for on-device AI models. It's also for the OS and running apps. Realistically you want at least 4GB for this - and that's cutting it fine for web browsers and other RAM-heavy apps on your phone. On laptops I'd suggest you want at least 8GB of RAM for your OS and apps.&lt;/p&gt;

&lt;p&gt;This leaves very little space for the AI capabilities themselves - perhaps 4GB on non-"Pro" models and 8GB on the Pro models. Equally, even a new MacBook Air is only going to have 8GB of headroom in RAM for AI. And these are &lt;em&gt;brand new&lt;/em&gt; devices. The majority of people are running multi-year-old hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  KV cache eats everything
&lt;/h2&gt;

&lt;p&gt;The model weights present the first space issue. A 3B-param model (which in comparison to frontier models is &lt;em&gt;tiny&lt;/em&gt;) requires on the order of 2GB in a highly quantised (think compressed) variant. A 7B-param model - which in my experience is vastly more capable - requires more like 5GB. In comparison, full-scale models are around the 1TB mark - 200-500x larger.&lt;/p&gt;
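&lt;p&gt;As a back-of-the-envelope check, weight memory is roughly parameter count times bits per weight. The overhead factor below is an illustrative assumption (embeddings, runtime buffers and so on vary by quantisation format and runtime):&lt;/p&gt;

```python
# Rough rule of thumb: weight memory ~= params * bits / 8, plus overhead for
# embeddings and runtime buffers. The 10% overhead is an assumption.
def weight_gb(params_b: float, bits: int, overhead: float = 1.1) -> float:
    """Approximate RAM needed for model weights, in GB."""
    return params_b * 1e9 * bits / 8 * overhead / 1e9

print(f"3B @ Q4:   {weight_gb(3, 4):.1f} GB")   # compare with the ~2GB figure above
print(f"7B @ Q5:   {weight_gb(7, 5):.1f} GB")   # compare with the ~5GB figure above
print(f"7B @ FP16: {weight_gb(7, 16):.1f} GB")  # unquantised, for reference
```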

&lt;p&gt;While this is an incredible achievement to get any level of "intelligence" in such a (relatively) small space, you can see the issue already - a 7B model won't fit in &lt;em&gt;most&lt;/em&gt; new consumer hardware, leaving only space for a 3B model. &lt;/p&gt;

&lt;p&gt;This is only half of the problem though. You don't just need RAM for storing the model, you also need space to cache the context of the interactions with the models. This is where it quickly becomes unusable for many agentic use cases.&lt;/p&gt;

&lt;p&gt;You can get away with a very small amount of context for simple tasks - think text summarisation or tagging. This may fit into a few thousand tokens of KV cache, and is doable on device (from my research, both Apple and Google limit on-device context on phones to 4K tokens).&lt;/p&gt;

&lt;p&gt;Even 'basic' agentic tasks quickly become unusable at this 4K limit though.&lt;/p&gt;

&lt;p&gt;Tool definitions (think 'send message', or 'read calendar events') &lt;em&gt;alone&lt;/em&gt; probably require that size of context. That's before you start doing prompts against it, or including data from your phone (the actual iMessages, or your emails). &lt;/p&gt;
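&lt;p&gt;The KV cache maths is easy to sketch. The configuration below assumes a Llama-2-7B-style dense model (32 layers, 32 KV heads, head dim 128, fp16 cache) - GQA or a quantised cache shrinks this considerably, but the linear growth with context is the point:&lt;/p&gt;

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens.
# Assumed config: Llama-2-7B-style, 32 layers, 32 KV heads, head_dim 128, fp16.
def kv_cache_gib(tokens, layers=32, kv_heads=32, head_dim=128, bytes_per=2):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # 512 KiB/token here
    return per_token * tokens / 2**30

for ctx in (4_096, 16_384, 32_768):
    print(f"{ctx:6d} tokens: {kv_cache_gib(ctx):5.1f} GiB")
```

&lt;p&gt;Even before adding the ~4-5GB of weights, 32K of context alone is already past the total RAM of most phones.&lt;/p&gt;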

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fmemory_wall%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fmemory_wall%402x.png" alt="Model weights vs KV cache memory usage at different context lengths, showing iPhone 17 Pro and MacBook Air available RAM limits" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;KV cache memory for a 7B Q4 model at different context lengths. Even at 32K context, you've blown past what an iPhone 17 can offer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It simply doesn't work in 4GB, or even 8GB. At a bare minimum I think you'd want 32K tokens of context window, and ideally a 7B+ param model. This is getting close to 16GB of RAM&lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;em&gt;just&lt;/em&gt; for the AI part of your device. As such, we need to see consumer devices with 24GB, or ideally 32GB of on device memory before a lot more possibilities open up.&lt;/p&gt;

&lt;p&gt;There are techniques that help close the memory gap - grouped-query attention, sliding window attention, quantised KV caches. They're real and they're shipping. But they often trade off precision in exactly the scenarios agentic workflows need most - multi-hop reasoning, precise tool calling, and maintaining coherence across longer conversations. They help, but not nearly enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  But then the supply chain issues started
&lt;/h2&gt;

&lt;p&gt;Arguably we were on track to hit this - 32GB laptops were becoming more common. But then the price of RAM &lt;a href="https://www.tomshardware.com/pc-components/ram/ram-price-index-2026-lowest-price-on-ddr5-and-ddr4-memory-of-all-capacities" rel="noopener noreferrer"&gt;skyrocketed over 300%&lt;/a&gt;. Manufacturers are more likely to &lt;em&gt;cut&lt;/em&gt; RAM now than add more. And given the huge lead time of additional DRAM manufacturing capacity, this situation is unlikely to change in the near future. &lt;/p&gt;

&lt;p&gt;This is a great example of &lt;em&gt;crowding out&lt;/em&gt;. HBM (datacentre class RAM) and standard DDR5 compete for the same DRAM wafer starts - so every wafer allocated to HBM for datacentres is one not used for the DDR5 in your laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  However, speed is an issue
&lt;/h2&gt;

&lt;p&gt;Let's run the hypothetical that overnight we have far more DRAM manufacturing capacity across the globe. There's still another massive issue - speed.&lt;/p&gt;

&lt;p&gt;While devices have impressively fast compute available to them, especially in something you can carry around in your pocket, there's another context-related problem that pops up.&lt;/p&gt;

&lt;p&gt;A consumer device might be able to generate tokens on the order of 30 tok/s on a small, local model. This is actually surprisingly usable - not fast, but probably passable for many use cases.&lt;/p&gt;

&lt;p&gt;However, as context scales - and as I described before, it &lt;em&gt;massively&lt;/em&gt; scales in agentic tasks - the processing speed drops off a cliff. To put this in perspective, even a Radeon 9070 XT - a 304W desktop GPU - drops from 100 tok/s to less than 10 tok/s on an 8B model at 16K context once you factor in prefill. Good luck getting that on a phone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fdecode_speed_vs_context%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fdecode_speed_vs_context%402x.png" alt="Decode speed vs context length for on-device 7B model vs cloud 400B+ model" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cloud inference barely flinches as context grows. On-device collapses to near zero past 16K tokens - exactly where agentic tasks start. &lt;/p&gt;
&lt;/blockquote&gt;
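&lt;p&gt;The cliff falls out of a simple roofline estimate: decode is memory-bandwidth bound, and each generated token has to stream the full weights plus the entire KV cache. The bandwidth and cache figures below are ballpark assumptions, not measurements:&lt;/p&gt;

```python
# Roofline sketch: tok/s ~= memory bandwidth / bytes touched per decode step.
# 60 GB/s is an assumed LPDDR5-class phone figure; 524,288 bytes/token matches
# a 7B-style model with an fp16 KV cache. Illustrative only.
def decode_tok_s(bw_gb_s, weight_gb, ctx_tokens, kv_bytes_per_token=524_288):
    bytes_per_step = weight_gb * 1e9 + ctx_tokens * kv_bytes_per_token
    return bw_gb_s * 1e9 / bytes_per_step

for ctx in (0, 4_096, 16_384, 32_768):
    phone = decode_tok_s(60, 4.0, ctx)  # ~4GB of Q4 weights
    print(f"ctx {ctx:6d}: {phone:5.1f} tok/s")
```

&lt;p&gt;Decode speed falls off hyperbolically as the KV cache comes to dominate the bytes read per token - and this ignores prefill entirely, which is where the compute bottleneck bites.&lt;/p&gt;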

&lt;p&gt;Speculative decoding - where a tiny draft model proposes tokens and a larger model verifies them - can help with speed. But it requires holding two models in RAM simultaneously, which makes the already dire memory situation even worse.&lt;/p&gt;

&lt;p&gt;At this speed, even a quick email a couple of paragraphs long might take a minute to generate - at which point it's almost certainly quicker to type it yourself.&lt;/p&gt;

&lt;p&gt;Even worse, hammering your phone's hardware this hard for extended periods really impacts battery life and makes the phone heat up - so much so that it has to throttle itself to avoid overheating. This makes it &lt;em&gt;even slower&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is a far more difficult challenge than just providing more RAM. You need more compute (for prefill) and &lt;em&gt;much&lt;/em&gt; faster memory. Both are expensive (with or without supply chain issues) and much more power hungry. It feels like we are still some way off even GDDR memory - itself ~an order of magnitude slower than datacentre-class HBM - making it into a phone.&lt;/p&gt;

&lt;p&gt;As you can see, between the RAM limits &lt;em&gt;and&lt;/em&gt; the speed limitations, on-device models are going to have a very difficult time getting any real traction in the next year or two, even for basic agentic workflows. Of course there could be architecture breakthroughs that change this - but assuming that doesn't happen, I think it is safe to say most of us will be running most of our tokens through a cloud provider for the foreseeable future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud offload
&lt;/h2&gt;

&lt;p&gt;This brings me to one last point - compute capacity in the cloud itself. While Apple has pushed the narrative of on-device for simple tasks with offloading to more capable models in the cloud, running the maths on this actually exposes some serious issues for agentic tasks.&lt;/p&gt;

&lt;p&gt;It's hard to fathom the scale of the iOS installed base (and Android's is even larger). There's somewhere on the order of 2 &lt;em&gt;billion&lt;/em&gt; active iOS devices, and another &lt;em&gt;4 billion&lt;/em&gt; Android devices out there. &lt;/p&gt;

&lt;p&gt;The compute demands to bring this - &lt;em&gt;even via the cloud&lt;/em&gt; - to a sizeable minority of these users are enormous. I estimate that Claude Code has low single digit millions of users, and I strongly suspect it is &lt;em&gt;melting&lt;/em&gt; Anthropic's entire compute supply.&lt;/p&gt;

&lt;p&gt;If Apple were to roll out agentic capabilities into the OS - even with a lot of limitations - you are easily looking at requiring an entire Anthropic's worth of compute capacity for just a small minority of iOS users.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;
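&lt;p&gt;The arithmetic behind this, using the assumptions spelled out in footnote 2, is stark:&lt;/p&gt;

```python
# Back-of-the-envelope using the footnote's assumptions: ~2B active iOS
# devices, a 5% rollout, each user consuming 5% of a Claude Code user's tokens.
ios_devices = 2_000_000_000
rollout = 0.05       # fraction of the installed base that gets the feature
usage_ratio = 0.05   # tokens per user relative to a Claude Code user

claude_code_equivalents = ios_devices * rollout * usage_ratio
print(f"{claude_code_equivalents / 1e6:.0f}M Claude-Code-equivalent users")
```

&lt;p&gt;That lands in the same low-single-digit-millions range as the entire Claude Code userbase - hence "an entire Anthropic" of compute.&lt;/p&gt;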

&lt;p&gt;If anyone tells you on device models are just round the corner, they clearly haven't run the maths on the memory and compute requirements. Datacentres aren't going anywhere soon.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;This is a giant simplification and there are many approaches to reduce this. For example, hybrid attention significantly reduces KV cache memory requirements, but does trade off precision. There's a great roundup by &lt;a href="https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison" rel="noopener noreferrer"&gt;Sebastian Raschka&lt;/a&gt;, but it gets extremely technical very quickly.   ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Assuming even a 5% rollout to 100M active iOS users, with each user consuming 5% of the tokens of a Claude Code user. It really depends on what the product looks like, but directionally this feels right to me. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>hardware</category>
    </item>
    <item>
      <title>How to convert Excel spreadsheets to Python models with Claude Code</title>
      <dc:creator>Martin Alderson</dc:creator>
      <pubDate>Sun, 22 Feb 2026 19:29:18 +0000</pubDate>
      <link>https://dev.to/martinald/how-to-convert-excel-spreadsheets-to-python-models-with-claude-code-4fmh</link>
      <guid>https://dev.to/martinald/how-to-convert-excel-spreadsheets-to-python-models-with-claude-code-4fmh</guid>
      <description>&lt;p&gt;I've had a few people reach out saying they'd like more info on how to convert Excel to Python with Claude Code - something I've mentioned in some of my other blogs. Here's the step-by-step process I use to convert complex Excel financial models into Python modules, verified against the original spreadsheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do this?
&lt;/h2&gt;

&lt;p&gt;The key benefit is the speed of trying many combinations of inputs and scenarios. This is slow and painful in Excel, whereas it's trivial to run millions (or more) of variations in Python in seconds. &lt;/p&gt;

&lt;p&gt;Also, it allows you to embed your Excel model in other applications (e.g. a web app dashboard, or internal tools). And Python has far more advanced capabilities than Excel, especially when it comes to machine learning.&lt;/p&gt;

&lt;p&gt;Finally, once you have it in Python you can iterate with the coding agent itself to improve it. This is far easier to do in Python than getting an agent to understand a spaghetti mess of Excel formulae.&lt;/p&gt;
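&lt;p&gt;To make the speed point concrete, here's a minimal sketch of what a converted model enables. The &lt;code&gt;npv&lt;/code&gt; function is a hypothetical stand-in for whatever module the agent generates from your spreadsheet - the real win is that NumPy lets you sweep a million scenarios in one vectorised call:&lt;/p&gt;

```python
import numpy as np

# Hypothetical converted model: a toy NPV calculation standing in for the
# Python module the agent produces from your Excel file.
def npv(growth, margin, discount, base_revenue=100.0, years=5):
    t = np.arange(1, years + 1)
    cashflow = base_revenue * (1 + growth) ** t * margin
    return (cashflow / (1 + discount) ** t).sum(axis=-1)

# Sweep 1M random scenario combinations in a single vectorised call.
rng = np.random.default_rng(0)
n = 1_000_000
growth = rng.uniform(0.00, 0.10, n)[:, None]   # annual growth assumptions
margin = rng.uniform(0.10, 0.30, n)[:, None]   # margin assumptions
results = npv(growth, margin, discount=0.08)

print(f"mean NPV: {results.mean():.1f}, 5th percentile: {np.percentile(results, 5):.1f}")
```

&lt;p&gt;Doing the equivalent sweep in Excel would mean a data table or macro grinding for hours.&lt;/p&gt;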

&lt;h2&gt;
  
  
  1. Audit your Excel sheet
&lt;/h2&gt;

&lt;p&gt;Create a folder, drop a copy of your Excel file in, and open Claude Code there with &lt;em&gt;plan&lt;/em&gt; mode enabled. Then ask something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please convert my complicated financial model in this folder (financial_model.xlsx) from Excel to a python module. It should work _identically_. Let's start with you doing a full audit of the sheet and all workbooks. Then come up with a plan to implement this and verify outputs against the Excel model. Use multiple subagents if helpful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd strongly recommend using the strongest model you can for this - at the time of writing this was Opus 4.6 or Codex 5.3-xhigh.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fexcel-to-python-audit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fexcel-to-python-audit.png" alt="Claude Code auditing an Excel financial model in plan mode" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Confirm the plan
&lt;/h2&gt;

&lt;p&gt;Once you've got a plan, &lt;em&gt;carefully&lt;/em&gt; read it and make sure it sounds reasonable - it should have understood the intent and what each sheet/part of your Excel workbook does. It's &lt;em&gt;far&lt;/em&gt; better to argue with it here than further down the line. Also, don't be scared to ask it to explain its understanding of parts if you're not sure it's really understood.&lt;/p&gt;

&lt;p&gt;It may suggest multiple phases if it's very complicated - that's fine to accept.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Implement
&lt;/h2&gt;

&lt;p&gt;Once you've done this - let it implement it. This will take a while depending on your model. Again, don't be scared to interrupt it if you think it's going off track from some of the logs that it comes back with - or get it to explain further. But in my experience this part has been fairly automatic even on very large Excel files.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Iterate
&lt;/h2&gt;

&lt;p&gt;Once the implementation is done, ask whether anything remains to finish. On complex Excel files it will often leave some parts out because they're too complicated - make sure it finishes them all off. Once the agent confirms &lt;em&gt;it&lt;/em&gt; thinks it is finished, ask it to run the model with various inputs and sanity-check the output against your Excel. &lt;/p&gt;

&lt;p&gt;You may find small rounding errors (which may or may not be a problem) because Python's numeric handling is often marginally different to Excel's. In my experience this hasn't caused anything major, but I could see things drifting if you have a very recursive model that requires very high precision. I'm sure this can be resolved by iterating with the model to emulate Excel's numeric precision more closely if needed. If you follow the steps here and ensure it's validating against the real Excel file, though, the output is usually perfect. If anything, it may find issues with your actual Excel formulae!&lt;/p&gt;
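&lt;p&gt;A minimal illustration of why the drift happens: binary floats can't represent most decimal fractions exactly, so long chains of arithmetic can diverge slightly from what a spreadsheet displays. Python's &lt;code&gt;decimal&lt;/code&gt; module is one way to get exact decimal arithmetic if it matters:&lt;/p&gt;

```python
from decimal import Decimal

# Ten additions of 0.1 don't quite make 1.0 in binary floating point.
total = sum([0.1] * 10)
print(total)         # 0.9999999999999999
print(total == 1.0)  # False

# Decimal keeps exact decimal arithmetic when the drift matters.
exact = sum([Decimal("0.1")] * 10)
print(exact == 1)    # True
```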

&lt;h2&gt;
  
  
  5. Run experiments with the model
&lt;/h2&gt;

&lt;p&gt;The most powerful bit as I alluded to in the intro is to run many scenarios &lt;em&gt;with&lt;/em&gt; the model's help.&lt;/p&gt;

&lt;p&gt;For example, ask the agent to pull GDP/inflation trend figures from the internet, then feed them into the model to see the impact on your assumptions under various macroeconomic scenarios.&lt;/p&gt;

&lt;p&gt;Or, ask it to search through all your organisation's presentations in the past and input some of the assumptions that were made to see how they turned out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fexcel-to-python-monte-carlo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fexcel-to-python-monte-carlo.png" alt="Monte Carlo simulation results from the converted Python model" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sky is the limit - you can also have the agent make a plan to &lt;em&gt;improve&lt;/em&gt; your model now it's in code. Ask it what weaknesses it thinks you have, what data sources could be connected, etc. For example, you could pull in PowerBI data automatically (or any other SQL datasource) to keep the model completely up to date without any copying and pasting into Excel. &lt;/p&gt;

&lt;p&gt;You'll hopefully find that the agent can surface loads of interesting 'flaws' in your model and implement fixes very quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fexcel-to-python-improvements.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fexcel-to-python-improvements.png" alt="Claude Code suggesting structural improvements to the financial model" width="800" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, you can also combine this with my &lt;a href="https://martinalderson.com/posts/how-to-make-great-looking-consistent-reports-with-claude-code-cowork-codex/" rel="noopener noreferrer"&gt;making great looking reports&lt;/a&gt; approach to have the model write findings from your collaborative runs together into presentations for you to present to your wider team, in your corporate branding. Keep in mind it's trivial to get Claude to plot graphs and charts from the Python model too, with a far better level of precision than (I at least!) can get in Excel.&lt;/p&gt;

&lt;p&gt;The whole process takes less than an hour even for very complex models. Once it's in Python, you'll wonder why you ever did scenario analysis by hand in Excel.&lt;/p&gt;

&lt;p&gt;If you'd like help converting your own Excel models or want to talk about how this could work for your team, feel free to &lt;a href="https://dev.to/contact/"&gt;get in touch&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>excel</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Who fixes the zero-days AI finds in abandoned software?</title>
      <dc:creator>Martin Alderson</dc:creator>
      <pubDate>Fri, 20 Feb 2026 17:34:37 +0000</pubDate>
      <link>https://dev.to/martinald/who-fixes-the-zero-days-ai-finds-in-abandoned-software-2cej</link>
      <guid>https://dev.to/martinald/who-fixes-the-zero-days-ai-finds-in-abandoned-software-2cej</guid>
      <description>&lt;p&gt;Anthropic's red team released research showing that Claude Opus 4.6 can find critical vulnerabilities in established open source projects. They found &lt;a href="https://red.anthropic.com/2026/zero-days/" rel="noopener noreferrer"&gt;over 500 high-severity bugs&lt;/a&gt; across projects like GhostScript and OpenSC - some of which had gone undetected for decades.&lt;/p&gt;

&lt;p&gt;This is impressive, and genuinely useful work. But their research focused on &lt;em&gt;maintained&lt;/em&gt; software - projects where patches can actually be shipped. The scarier problem is the enormous long tail of abandoned software that nobody will ever fix.&lt;/p&gt;

&lt;p&gt;A few weeks before they published, I'd been testing the same idea against abandoned software.&lt;/p&gt;

&lt;h2&gt;
  
  
  The issue
&lt;/h2&gt;

&lt;p&gt;It's been obvious for a while that AI agents are getting good at finding security vulnerabilities, but the pace is still surprising. Anthropic's Opus 4.6 paper found critical bugs that had gone undetected for decades in projects that actually have dedicated security teams. That's the maintained stuff. The unmaintained stuff is in a lot more trouble.&lt;/p&gt;

&lt;p&gt;There is a &lt;em&gt;lot&lt;/em&gt; of software out there. We've had ~40 years of internet-enabled software. A lot of it is unsupported, and even the supported software has major delays in getting security patches.&lt;/p&gt;

&lt;p&gt;This long tail of software hasn't been a (huge) security concern because each individual software package &lt;em&gt;used&lt;/em&gt; to take human time to investigate and exploit. If an application only had a few hundred installs, it tended to get overlooked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding a critical security vuln in &amp;lt;15mins
&lt;/h2&gt;

&lt;p&gt;To test my theory, I asked Claude to find some software packages that are 'abandoned' by their maintainers but still have an active userbase. I did this a few weeks before the Anthropic paper came out - I was curious how far this had come in practice. It suggested a bunch of old PHP apps, one of which I had heard of before, so I decided to start there.&lt;/p&gt;

&lt;p&gt;The process was trivial. I cloned the repo, opened Claude Code, and asked it to find critical security vulnerabilities while I made a coffee. It quickly found a bunch that turned out to be largely false positives (bad programming, for sure, but not directly exploitable).&lt;/p&gt;

&lt;p&gt;So far, so secure. I changed approach: I had it spin up the application in question and told it that we only cared about vulnerabilities that could be exploited directly via a simple HTTP call - not convoluted attack patterns. The agent therefore had a feedback loop: find a candidate exploit, then attempt it against the containerised app.&lt;/p&gt;

&lt;p&gt;Within 2-3 minutes it had found a 'promising' exploit that initially failed because of some naïve filtering in the app. Two minutes later it had figured out an encoding trick that bypassed the app's filtering, found a complete RCE, and written a full proof of concept.&lt;/p&gt;

&lt;p&gt;At this point I reached out to the security email of the (by the looks of it, mostly abandoned) project to let the maintainers know. I'm not naming the project here because there's no maintainer to ship a patch and thousands of servers are still exposed. It's been three weeks and I've heard nothing.&lt;/p&gt;

&lt;p&gt;I estimate there are many thousands (minimum) of vulnerable servers.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Most look to be hosted on VPSs. The more concerning risk is the sensitive data likely sitting on them, but even as raw botnet infrastructure, that's a serious amount of firepower.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantity, not quality
&lt;/h2&gt;

&lt;p&gt;It was clear to me that you could run an agent to find vulnerabilities like this automatically in a VM: clone a git repo (chosen on some heuristics of popularity and last commit), ask the agent to set the app up, find exploits, save them, discard the VM, and continue ad infinitum.&lt;/p&gt;

&lt;p&gt;I suspect within days you could get dozens if not hundreds of RCE exploits. You could then have another agent scan and exploit as many servers as possible.&lt;/p&gt;

&lt;p&gt;This flip in economics changes how we think about information security. When finding these bugs "cost" human time, it simply wasn't worth infosec people (white, grey or blackhat) spending the effort on the long tail, for the most part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mitigations
&lt;/h2&gt;

&lt;p&gt;There has been some effort on the frontier AI labs' side to stop this kind of research - Claude Code actually had a pretty strict system prompt disallowing even defensive security research when I ran this, which later got reverted. At one point it stopped itself in its tracks (arguably too late), saying it couldn't do this kind of work and that I needed to use specialised tools.&lt;/p&gt;

&lt;p&gt;Unfortunately it was trivial to bypass - I just said I was the maintainer of the project, that we'd had reports of a serious security vulnerability, and that we needed to fix it. It totally understood - and continued, never to worry again about my intentions. I'm very doubtful that adding guardrails to LLMs for this would work - it's too hard to differentiate between offensive and defensive security work, and I'm sure more aggressive guardrails would end up with a lot of normal software tasks being flagged.&lt;/p&gt;

&lt;p&gt;Plus we have the problem that the genie is out of the bottle - I'm sure that &lt;em&gt;even if&lt;/em&gt; the frontier labs did manage to put effective guardrails in place, adversaries could build their own models from (e.g.) open weights models to do this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defensive ability
&lt;/h2&gt;

&lt;p&gt;Sam Altman recently &lt;a href="https://x.com/sama/status/2014733975755817267" rel="noopener noreferrer"&gt;wrote&lt;/a&gt; on X:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fsama-defensive-tweet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Fsama-defensive-tweet.png" alt="Sam Altman tweet about AI defensive capabilities" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even Altman acknowledges that product restrictions are just a starting point - his long-term plan is "defensive acceleration", helping people patch bugs faster. Which is great, but it still assumes there's someone on the other end to apply the patch. Anthropic's paper actually proves the point - they found the vulnerabilities, and patches got shipped. Great. But that playbook of 'find vulnerability, issue patch, wait for adoption' doesn't work when there's nobody to issue the patch.&lt;/p&gt;

&lt;p&gt;I suspect this is going to require some quite drastic measures along the lines of disabling internet access to vulnerable servers en masse.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The uncomfortable truth is that even Anthropic's research, &lt;em&gt;genuinely important&lt;/em&gt; as it is, only scratches the surface. Finding bugs in maintained software and getting patches shipped is important work. But below that is a massive iceberg of software that nobody is maintaining and nobody will ever patch - and it's running on tens of millions of servers right now.&lt;/p&gt;

&lt;p&gt;The only thing protecting it was that it wasn't worth a human's time to look. That's no longer true.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Based on Shodan fingerprinting of the application's default HTTP headers and page signatures. The actual number is likely higher as many instances will have been customised enough to evade simple fingerprinting. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;This isn't without precedent - ISPs and hosting providers already quarantine servers that are part of active botnets. The difference is scale: we're talking about proactively identifying and isolating tens of millions of servers running software that &lt;em&gt;will&lt;/em&gt; be exploited, not just ones that already have been. In the meantime, if you run infrastructure, now is a good time to audit &lt;em&gt;all&lt;/em&gt; listening services across your network - especially anything publicly accessible. If you find software that's no longer maintained, either firewall it off or migrate to a supported alternative. The days of "nobody will bother attacking this" are over. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>Attack of the SaaS clones</title>
      <dc:creator>Martin Alderson</dc:creator>
      <pubDate>Tue, 17 Feb 2026 22:56:09 +0000</pubDate>
      <link>https://dev.to/martinald/attack-of-the-saas-clones-38go</link>
      <guid>https://dev.to/martinald/attack-of-the-saas-clones-38go</guid>
      <description>&lt;p&gt;I cloned most of Linear's core functionality in 20 prompts using Claude Code. It took a couple of evenings and a few million tokens. Here's what I think that means for the future of SaaS economics.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick recap
&lt;/h2&gt;

&lt;p&gt;In my previous two posts, &lt;a href="https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/" rel="noopener noreferrer"&gt;AI agents are starting to eat SaaS&lt;/a&gt; and a more recent one about the &lt;a href="https://martinalderson.com/posts/wall-street-lost-285-billion-because-of-13-markdown-files/" rel="noopener noreferrer"&gt;sharp decline in software company valuations&lt;/a&gt;, I covered some of the risks to existing SaaS businesses now that agentic coding capabilities have increased so much (and, perhaps more importantly, continue to improve so quickly, with no real slowdown in sight).&lt;/p&gt;

&lt;p&gt;In these essays I argued that there are two emergent challenges to software businesses. Firstly, organisations will increasingly look to build their own "internal" versions of SaaS rather than procure them from external vendors.&lt;/p&gt;

&lt;p&gt;Secondly, agents are increasingly replacing SaaS entirely. Take design - agents &lt;a href="https://martinalderson.com/posts/how-to-make-great-looking-consistent-reports-with-claude-code-cowork-codex/" rel="noopener noreferrer"&gt;can build&lt;/a&gt; great-looking reports and slide decks without an intermediary Google Slides, Figma or Prezi step. A lot of productivity and analytics software is at risk of being completely eaten by agents.&lt;/p&gt;

&lt;p&gt;And even the threat of this will allow organisations to push back much harder on price increases from SaaS vendors. Given so much of SaaS is now owned by PE, who have borrowed money against future (growing) revenue streams&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, this is a significant problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  But do people really want to manage internal SaaS?
&lt;/h2&gt;

&lt;p&gt;A (very) fair critique of a lot of the "build your own SaaS" narrative is that while people &lt;em&gt;can&lt;/em&gt; build their own internal versions of SaaS, it's one thing to build it and quite another to manage it, update it and run the infrastructure for it. It's hard to disagree with this - though I think it underplays the competitive advantage of bespoke line-of-business software that is perfectly aligned with a company's goals.&lt;/p&gt;

&lt;p&gt;But clearly, it's not a great idea for a company to be spending all their time building and managing tools that distract them from their main business. Adam Smith's pin factory still very much stands.&lt;/p&gt;

&lt;p&gt;However, this overlooks that while the end &lt;em&gt;users&lt;/em&gt; may not want the headache of management, there &lt;em&gt;are&lt;/em&gt; many people who would happily build competing platforms, manage them and sell them on for a fraction of the cost of existing SaaS vendors.&lt;/p&gt;

&lt;p&gt;And this is where I think software companies are perhaps the most exposed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Linear alternative in 20 prompts
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Please note that this was built purely for research purposes and I have no intention to release this code nor commercialise this in any way. I'm not intending to infringe on anyone's brand, copyright or other intellectual property. I'm a huge fan of Linear and think it is a brilliant piece of design and software.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To test this theory (and to try out the awesome new &lt;a href="https://code.claude.com/docs/en/agent-teams" rel="noopener noreferrer"&gt;Teams&lt;/a&gt; feature in Claude Code), I set out to see how far a coding agent could get in replicating Linear. Linear is an excellent project management tool, which I chose simply because I've read &lt;em&gt;so many&lt;/em&gt; comments in reply to my article(s) saying that while building a simple SaaS clone is possible, a Linear clone would not be.&lt;/p&gt;

&lt;p&gt;I followed a pretty simple process. First, I browsed the platform in my web browser with DevTools open, capturing all the network traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Flinear-devtools-har.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmartinalderson.com%2Fimg%2Flinear-devtools-har.png" alt="Chrome DevTools network tab showing Linear's network requests" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I interacted with a few features in the app so the capture would include traces of most of the software's functionality.&lt;/p&gt;

&lt;p&gt;I then exported this as a HAR file (the rightmost icon above the search bar) - an archive of every network request the page made. This produced an &lt;em&gt;enormous&lt;/em&gt; file containing thousands of CSS and minified JS assets.&lt;/p&gt;
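&lt;p&gt;A HAR file is just JSON, so you can get a feel for what the agent has to work with before handing it over. As a rough sketch (my own illustration, not part of the actual workflow), here's how you might list the unique API endpoints in a capture using only Python's standard library:&lt;/p&gt;

```python
import json
from urllib.parse import urlparse

# Extensions that are static assets rather than API surface.
STATIC_EXTS = (".js", ".css", ".png", ".svg", ".woff", ".woff2", ".ico")

def extract_api_calls(har_path):
    """Return the unique (method, path) pairs in a HAR capture,
    skipping requests for static assets."""
    with open(har_path) as f:
        har = json.load(f)

    calls = set()
    for entry in har["log"]["entries"]:  # HAR 1.2 layout: log.entries[].request
        request = entry["request"]
        path = urlparse(request["url"]).path
        if path.endswith(STATIC_EXTS):
            continue  # minified bundles and images tell you little here
        calls.add((request["method"], path))
    return sorted(calls)
```

&lt;p&gt;Running something like this over the export gives a compact inventory of the app's API surface - a useful sanity check before pointing agents at the full archive.&lt;/p&gt;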

&lt;h2&gt;
  
  
  HAR to software
&lt;/h2&gt;

&lt;p&gt;I then set Claude Code off with multiple subagents to map out all the functionality and the design of the software.&lt;/p&gt;

&lt;p&gt;It did an &lt;em&gt;incredible&lt;/em&gt; job at this, reverse-engineering everything in the archive to a very high level of fidelity.&lt;/p&gt;

&lt;p&gt;From there I used the new Teams feature in Claude Code to spin up multiple agents to work on the front and back end of the product. This was very hands-off; the first iteration was incredibly buggy, though, so I had to insist it start adding unit and integration tests. The quality improved dramatically once I did.&lt;/p&gt;

&lt;p&gt;I estimate I used around 20 prompts in total, several of them asking it to find placeholder content and replace it with real functionality.&lt;/p&gt;

&lt;p&gt;A few million tokens later - more than covered by my $200/month Claude Max subscription - I had a pretty faithful clone of Linear, with most key pieces of functionality working and persisting to a SQLite database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://martinalderson.com/img/linear-clone-demo.mp4" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6x6ujm6hnnl90q740k3.jpg" alt="Watch video" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Built entirely from reverse-engineering network traces. No source code required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now it's certainly not perfect and it's missing a lot of important pieces of functionality&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;My point here isn't that people can copy an existing product in 20 prompts and get it &lt;em&gt;perfect&lt;/em&gt;. It's that I managed, with Claude Code, to do this in a couple of evenings while paying very little attention. I suspect a couple of motivated engineers could get a production-quality version ready in a few weeks or months. Linear has had nearly a decade of some of the best designers and developers in the industry working extremely hard on it (and it shows) - it's renowned for its impeccable design and polish. Most SaaS isn't anywhere near that level. If an agent can produce a passable clone of Linear, it can probably do a &lt;em&gt;very&lt;/em&gt; good job on the vast amount of SaaS out there that is, frankly, &lt;em&gt;barely&lt;/em&gt; functional.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does this mean?
&lt;/h2&gt;

&lt;p&gt;I think all SaaS is vulnerable to this to some degree. Software with significant network effects, proprietary datasets or specialised infrastructure requirements is much more defensible, however. I expect you could even just paste in the public API docs of many products and get a pretty workable version of the software back - API docs usually expose a &lt;em&gt;lot&lt;/em&gt; of the underlying business logic.&lt;/p&gt;

&lt;p&gt;In a way this isn't anything the industry hasn't seen before - indeed the entire PC industry was built on clones of the original IBM PC. And we've had many people build compatible implementations of many APIs - the AWS S3 object storage API, the Java APIs&lt;sup id="fnref3"&gt;3&lt;/sup&gt; and, ironically, even the OpenAI inference API have all become de facto industry standards. Microsoft itself famously built much of its initial marketshare by doing exactly this - shipping affordable "alternatives" to existing software (MS-DOS, Windows, Excel and Word were all &lt;em&gt;heavily&lt;/em&gt; inspired by their contemporaries in the market).&lt;/p&gt;

&lt;p&gt;But now a handful of people have the ability to reimplement hundreds (or thousands) of developer-years of effort on a &lt;em&gt;very&lt;/em&gt; compressed timescale, and I think this will become a pervasive risk for SaaS going forward. Much as Rocket Internet in the early 2010s cloned every popular American platform going for the European market, I think we are going to see some very high quality alternatives in every major SaaS vertical - but without the billions of dollars of VC money required to do so.&lt;/p&gt;

&lt;p&gt;And even if these platforms &lt;em&gt;don't&lt;/em&gt; gain marketshare for whatever reason, they still put downwards pressure on software pricing. Most importantly, users don't have to manage anything themselves - they get a familiar tool at a far lower cost.&lt;/p&gt;

&lt;p&gt;Of course, the product is only one part of the equation. Sales, marketing, customer support, compliance certifications - these are all things that existing SaaS vendors have spent years and millions building out. But I think these are increasingly solvable with very small teams now too. AI is compressing the effort required across &lt;em&gt;all&lt;/em&gt; of these functions, not just engineering. A handful of people can now credibly stand up a competing product &lt;em&gt;and&lt;/em&gt; the business around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can SaaS companies fight back?
&lt;/h2&gt;

&lt;p&gt;I'm not a lawyer, but my rough understanding is that functionality itself cannot be copyrighted. While software patents may apply, I suspect that, unlike a lot of other technology companies, SaaS companies are very patent-light - it's very hard to patent the "CRUD" workflows that SaaS is famous for automating.&lt;/p&gt;

&lt;p&gt;This puts them in a difficult position legally. While they can certainly enforce their brand and trademarks, it's much harder to send C&amp;amp;Ds to competitors who are careful not to infringe those. Reverse engineering like this could certainly breach Terms of Service, but how enforceable that is when they &lt;em&gt;ship&lt;/em&gt; these files publicly is not clear to me - it's not "hacking" into their backends, which would be far more clear cut. And most SaaS contracts cap liability at fees paid, so even if a vendor successfully enforced a ToS breach, the damages from a $20/month subscription aren't exactly a deterrent.&lt;/p&gt;

&lt;p&gt;They can, however, make it &lt;em&gt;much&lt;/em&gt; harder to get your existing data out of their systems, and we are already starting to see a &lt;em&gt;lot&lt;/em&gt; of API price changes in the SaaS marketplace. For example, &lt;a href="https://developer.xero.com/faq/pricing-and-policy-updates" rel="noopener noreferrer"&gt;Xero&lt;/a&gt;, an accounting platform popular outside the US, has announced brutal API charges that in most cases cost far more than the underlying SaaS fees. I'm not sure how related this is, but putting up tolls to get your data out is one option.&lt;/p&gt;

&lt;p&gt;The issue is that these APIs are exactly what people need to "legitimately" build agentic workflows against these products. By making them expensive, you reduce the utility of your product for new agentic workflows and make it more compelling for people to switch away.&lt;/p&gt;

&lt;p&gt;Perhaps the most durable moat SaaS companies have is one that's rarely discussed: &lt;em&gt;liability&lt;/em&gt;. Established vendors can take on contractual liability for data breaches, offer indemnification clauses, carry millions in cyber insurance, and back it all up with SOC2 audits and compliance certifications. Enterprise buyers are paying for more than software - they're paying for someone to be legally and financially on the hook when things go wrong. A two-person clone shop simply can't offer that. Maybe this, more than any technical moat, is what ultimately protects incumbent SaaS.&lt;/p&gt;

&lt;p&gt;I certainly didn't have agents being able to take a HAR file and build a passable clone on my 2026 bingo card. I hope the industry finds a way to protect the incentives to build great software like Linear in the future. Because right now, I'm not sure what that looks like.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;There's a great article in the FT called &lt;a href="https://www.ft.com/content/954ed03b-4119-4412-be9f-59f68b537a95" rel="noopener noreferrer"&gt;&lt;em&gt;How private equity’s big bet on software was derailed by AI&lt;/em&gt;&lt;/a&gt; with more information about this trend which is well worth a read. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;There's no authentication system, no real-time collaboration, and some views are still using placeholder data. But given the loop I was in - find broken things, fix them, add tests, repeat - I don't think any of these would be particularly hard to continue with. The iterative improvement cycle was working well and each pass was producing noticeably better results. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Which famously resulted in Oracle losing their copyright lawsuit in &lt;a href="https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc." rel="noopener noreferrer"&gt;Google LLC v. Oracle America, Inc&lt;/a&gt; over Android reusing the Java APIs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>startup</category>
      <category>coding</category>
    </item>
  </channel>
</rss>
