<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: as1as</title>
    <description>The latest articles on DEV Community by as1as (@as1as).</description>
    <link>https://dev.to/as1as</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3803418%2F7b9d16a6-b037-4847-aa90-5ed9c1c7fd99.png</url>
      <title>DEV Community: as1as</title>
      <link>https://dev.to/as1as</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/as1as"/>
    <language>en</language>
    <item>
      <title>A Pattern Sketch: Server-Sent Events as a Fanout Channel for Edge State</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:19:51 +0000</pubDate>
      <link>https://dev.to/as1as/a-pattern-sketch-server-sent-events-as-a-fanout-channel-for-edge-state-2g6m</link>
      <guid>https://dev.to/as1as/a-pattern-sketch-server-sent-events-as-a-fanout-channel-for-edge-state-2g6m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this is:&lt;/strong&gt; a small OSS pattern sketch — not a Redis replacement, not a production auth platform. I built it to play with one specific question: &lt;em&gt;"if you only need to push small mutations from one writer to many readers, do you actually need Redis?"&lt;/em&gt; Sharing the design and the trade-offs in case the pattern is useful to anyone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Repo: &lt;strong&gt;&lt;a href="https://github.com/as1as/sse-edge-auth" rel="noopener noreferrer"&gt;github.com/as1as/sse-edge-auth&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The shape of the problem
&lt;/h2&gt;

&lt;p&gt;The goal here isn't &lt;em&gt;don't use Redis&lt;/em&gt;. It's &lt;em&gt;what does this problem look like when you strip it down to the minimum pieces&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A common edge-auth setup has many edge nodes in front of an origin, all needing to agree on things like "is this IP banned?" or "is this JWT revoked?". The default answer is Redis — every edge queries the same shared store.&lt;/p&gt;

&lt;p&gt;But notice the asymmetry: &lt;strong&gt;mutations are rare, reads are constant.&lt;/strong&gt; You might revoke a token once a minute; the edge fleet handles thousands of requests per second. Putting a network round trip on every read to keep N nodes in sync feels disproportionate.&lt;/p&gt;

&lt;p&gt;One clarification worth making upfront: SSE itself isn't faster than Redis pub/sub — as fanout channels, they're in the same ballpark. The difference shows up on the &lt;strong&gt;read path&lt;/strong&gt;. With Redis, every request pays a network lookup (~0.5–5ms on LAN). With local SQLite, every check is an in-process function call (~0.01–0.1ms). The speed comes from in-process SQLite, not from SSE.&lt;/p&gt;

&lt;p&gt;If you frame it as a fanout problem instead of a shared-state problem, two pieces of unexciting tech are a clean fit:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Need&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Push small mutations from one writer to N readers&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Server-Sent Events&lt;/strong&gt; (one-way HTTP stream)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer reads locally with no network involved&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;In-process SQLite&lt;/strong&gt; — every check is a function call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's the entire architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  operator
                     |
              POST /ban/ip
                     v
              +---------------+
              | master server |   GET /events  (SSE)
              +-------+-------+ ──────────────────────+
                                                       |
                +-----------+-----------+-----------+
                v           v           v           v
            +-------+   +-------+   +-------+   +-------+
            | edge  |   | edge  |   | edge  |   | edge  |
            |sqlite |   |sqlite |   |sqlite |   |sqlite |
            +---+---+   +---+---+   +---+---+   +---+---+
                |           |           |           |
                +-----------+-&amp;gt; origin &amp;lt;+-----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each edge subscribes to the master's SSE stream on startup. When you &lt;code&gt;POST /ban/ip&lt;/code&gt;, the master writes the event to an in-memory ring buffer and broadcasts it. Every connected edge applies it to its own local SQLite. From that moment, requests to that IP are rejected by the local auth gate — no remote call.&lt;/p&gt;
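&lt;p&gt;The repo's internals aren't shown here, but the master's fanout core can be sketched in a few lines. Everything below (names, capacity, frame layout) is an assumption for illustration, not the repo's actual code: a monotonic event ID, a bounded ring buffer, and a set of live SSE response streams.&lt;/p&gt;

```javascript
// Hypothetical sketch of the master's fanout core (names and shapes assumed):
// assign an ID, buffer the event, push one SSE frame to every connected edge.
const MAX_EVENTS = 10000;   // assumed ring-buffer capacity
const ring = [];            // ordered [{ id, type, data }]
const clients = new Set();  // live SSE response streams
let nextId = 1;

function broadcast(type, data) {
  const event = { id: nextId, type, data };
  nextId += 1;

  ring.push(event);
  if (ring.length > MAX_EVENTS) ring.shift(); // oldest event falls out

  // One SSE frame; the id line is what enables Last-Event-ID resume later.
  const frame = 'id: ' + event.id + '\n' +
                'event: ' + type + '\n' +
                'data: ' + JSON.stringify(data) + '\n\n';
  for (const res of clients) res.write(frame);
  return event;
}
```

&lt;p&gt;A &lt;code&gt;POST /ban/ip&lt;/code&gt; handler then reduces to one &lt;code&gt;broadcast('ip_banned', ...)&lt;/code&gt; call plus the master's own SQLite write.&lt;/p&gt;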




&lt;h2&gt;
  
  
  SSE + &lt;code&gt;Last-Event-ID&lt;/code&gt;: the part I find satisfying
&lt;/h2&gt;

&lt;p&gt;The genuinely nice thing about SSE for this pattern is that the resume protocol is already in the spec. Every event has an ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;id:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;event:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ip_banned&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.2.3.4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abuse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1234567890&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The edge sends the last ID it saw on reconnect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /events
Last-Event-ID: 42
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The master replays everything since that ID. We didn't have to design a catch-up protocol — we just needed a ring buffer.&lt;/p&gt;
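&lt;p&gt;Catch-up is then just a filter over the buffer. A minimal sketch (assumed implementation, simplified from what the repo would actually do):&lt;/p&gt;

```javascript
// Replay every buffered event newer than the client's Last-Event-ID.
// `ring` holds { id, type, data } objects in ascending id order.
function replaySince(ring, lastEventId) {
  return ring.filter((e) => e.id > lastEventId);
}

// On a new /events subscription, read the header (0 when absent) and write
// the missed frames before joining the live stream.
function catchUp(ring, headers, writeFrame) {
  const lastId = Number(headers['last-event-id'] || 0);
  for (const e of replaySince(ring, lastId)) writeFrame(e);
  return lastId;
}
```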

&lt;p&gt;The same channel carries cache invalidation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;event:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;cache_invalidated&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"products"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1234567890&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have a reliable fanout channel for one kind of state mutation, &lt;strong&gt;adding another kind is a one-line consumer&lt;/strong&gt; on the edge. Same &lt;code&gt;Last-Event-ID&lt;/code&gt; resume, same ordering guarantees.&lt;/p&gt;
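&lt;p&gt;Concretely, the edge side can be a small dispatch table. This sketch uses stub effects so it stays dependency-free; a real edge handler would write to its local SQLite instead:&lt;/p&gt;

```javascript
// Hypothetical edge-side dispatch. The stubs record what happened; the real
// handlers would hit the edge's SQLite handle.
const applied = [];
function applyBan(d)         { applied.push(['ban', d.ip]); }
function dropCacheEntries(d) { applied.push(['drop', d.tags]); }

// Adding a new kind of state mutation is one entry in this table. Resume via
// Last-Event-ID and ordering come from the shared channel, not the handler.
const handlers = {
  ip_banned:         applyBan,
  cache_invalidated: dropCacheEntries,
};

function onEvent(type, payload) {
  const handle = handlers[type];
  if (handle) handle(JSON.parse(payload));
}
```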




&lt;h2&gt;
  
  
  Why SSE, not WebSocket
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;SSE&lt;/th&gt;
&lt;th&gt;WebSocket&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;server → client&lt;/td&gt;
&lt;td&gt;bidirectional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;plain HTTP&lt;/td&gt;
&lt;td&gt;HTTP upgrade + framing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reconnect / resume&lt;/td&gt;
&lt;td&gt;in the spec&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxy / LB compatibility&lt;/td&gt;
&lt;td&gt;works everywhere HTTP works&lt;/td&gt;
&lt;td&gt;sometimes painful&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Traffic in this design is strictly master → edge. WebSocket buys bidirectionality we don't use, and costs complexity we don't want.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bit I'm most curious about: a composable cache TTL pipeline
&lt;/h2&gt;

&lt;p&gt;Since edges already see every request, they double as a response cache. Where it gets interesting is how TTL gets decided — as a pipeline of small pure functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveTTL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;baseTTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseTTL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;adjustTTLByFrequency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// trusted IPs → longer TTL&lt;/span&gt;
  &lt;span class="nx"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;adjustTTLByTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;          &lt;span class="c1"&gt;// off-peak → longer, peak → shorter&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each rule lives in its own file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ttl-by-frequency.js&lt;/code&gt;&lt;/strong&gt; — high-frequency IPs are likely real clients; trust them with a longer TTL. First-seen IPs get a shorter one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ttl-by-time.js&lt;/code&gt;&lt;/strong&gt; — content changes less off-peak; cache longer overnight, shorter during peak.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;failure-pattern.js&lt;/code&gt;&lt;/strong&gt; — N auth failures in a window from the same IP trigger a &lt;em&gt;local&lt;/em&gt; auto-ban, written into the same SQLite table the master uses. Edge-local self-healing — no master round trip needed for "I'm being abused right now."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;lru-eviction.js&lt;/code&gt;&lt;/strong&gt; — when the cache exceeds &lt;code&gt;CACHE_MAX_ENTRIES&lt;/code&gt;, oldest-accessed keys are dropped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adding a fifth rule means writing one function and one line in &lt;code&gt;resolveTTL&lt;/code&gt;. The composability matters more to me than any specific rule.&lt;/p&gt;
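&lt;p&gt;To illustrate the shape such a rule takes (my own sketch; the repo's &lt;code&gt;ttl-by-time.js&lt;/code&gt; may differ, and the windows and multipliers here are invented), a time-of-day rule is a pure function from TTL to TTL:&lt;/p&gt;

```javascript
// Hypothetical time-of-day rule. Pure in, pure out, so it chains inside
// resolveTTL without any other rule knowing about it.
function adjustTTLByTime(ttl, hour = new Date().getHours()) {
  // overnight (23:00 through 06:59): content changes less, cache longer
  if (hour > 22 || 7 > hour) return ttl * 2;
  // daytime peak (assumed 09:00 through 17:59): cache shorter
  if (hour > 8) {
    if (18 > hour) return Math.floor(ttl / 2);
  }
  return ttl; // shoulder hours: leave the TTL alone
}
```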




&lt;h2&gt;
  
  
  Tag-based invalidation
&lt;/h2&gt;

&lt;p&gt;The origin tags responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Cache-Control: public, max-age=60
X-Cache-Tags: products, category-3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;products&lt;/code&gt; change, one call to the master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://master:4000/invalidate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"tags": ["products"]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The master broadcasts &lt;code&gt;cache_invalidated&lt;/code&gt;, every edge drops matching entries from its local SQLite. Same channel, same resume guarantees as auth state.&lt;/p&gt;
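&lt;p&gt;The edge-side drop can be sketched like this, with a &lt;code&gt;Map&lt;/code&gt; standing in for the local SQLite table to keep the example dependency-free (the real edge would run the equivalent DELETE through &lt;code&gt;better-sqlite3&lt;/code&gt;):&lt;/p&gt;

```javascript
// Hypothetical cache store: key maps to { body, tags }. The entry shape is
// invented for this example; the repo's actual schema may differ.
const cache = new Map();

function onCacheInvalidated(msg) {
  const tags = new Set(msg.tags || []);
  const keys = new Set(msg.keys || []);
  for (const [key, entry] of cache) {
    // drop on exact key match or on any shared tag
    if (keys.has(key) || entry.tags.some((t) => tags.has(t))) {
      cache.delete(key);
    }
  }
}
```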




&lt;h2&gt;
  
  
  Honest limits
&lt;/h2&gt;

&lt;p&gt;I want to be specific about what this pattern does &lt;strong&gt;not&lt;/strong&gt; give you, because the answer to "do I need Redis?" depends entirely on these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The master is a single point of failure for new mutations.&lt;/strong&gt; If it's down, edges keep serving with last-known state, but you can't ban anyone new. Master HA is not in v0.1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An edge offline longer than the ring buffer&lt;/strong&gt; (10k events by default) can miss intermediate events on reconnect. There's no full-state-pull endpoint yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cache is in-memory only.&lt;/strong&gt; Restarting an edge clears it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cluster, no persistence layer, no replication.&lt;/strong&gt; Real Redis-shaped systems give you those; this pattern explicitly doesn't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this fits a fairly narrow shape: small/medium edge fleets, mostly long-lived edges, one master is acceptable as a coordination point, and "edge keeps working with stale state during master outages" is preferable to "everything halts when the shared store is gone."&lt;/p&gt;

&lt;p&gt;If your situation needs more than that, you probably do want Redis — or Kafka, or a real distributed consensus system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Run it locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/as1as1984/sse-edge-auth
&lt;span class="nb"&gt;cd &lt;/span&gt;sse-edge-auth
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;master &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;edge &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# master&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;master &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4000 npm start&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# three edges&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;edge &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5001 &lt;span class="nv"&gt;NODE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;edge-a &lt;span class="nv"&gt;ORIGIN_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080 npm start&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;edge &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5002 &lt;span class="nv"&gt;NODE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;edge-b &lt;span class="nv"&gt;ORIGIN_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080 npm start&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;edge &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5003 &lt;span class="nv"&gt;NODE_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;edge-c &lt;span class="nv"&gt;ORIGIN_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080 npm start&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try a ban:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:4000/ban/ip &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"ip":"::1","reason":"demo"}'&lt;/span&gt;

curl http://localhost:5001/  &lt;span class="c"&gt;# 403 ip_banned, same on edges 5002/5003&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Current gaps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No full-state-pull endpoint&lt;/strong&gt; — an edge that exceeds the ring buffer window can't resync cleanly on reconnect. Still undecided between paginated event replay and snapshot dump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No file-backed SQLite&lt;/strong&gt; — restarting an edge clears its cache. &lt;code&gt;better-sqlite3&lt;/code&gt; supports this natively; just haven't wired it up yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No master HA&lt;/strong&gt; — a leader/follower setup where followers accept SSE subscriptions and forward writes is needed but not in v0.1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No real-network benchmark&lt;/strong&gt; — a docker-compose with &lt;code&gt;tc netem&lt;/code&gt; would tell us much more about this pattern's actual behavior than any localhost numbers could.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/as1as/sse-edge-auth" rel="noopener noreferrer"&gt;github.com/as1as/sse-edge-auth&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; Node.js 20+, &lt;code&gt;better-sqlite3&lt;/code&gt;, &lt;code&gt;jose&lt;/code&gt;, Express&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

</description>
      <category>node</category>
      <category>architecture</category>
      <category>javascript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Started Building a Roguelike RPG — Powered by On-Device AI #5</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:54:56 +0000</pubDate>
      <link>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-5-1hpf</link>
      <guid>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-5-1hpf</guid>
      <description>&lt;h1&gt;
  
  
  Day 2 After the LLM Stack — The Game Is Actually Coming Together
&lt;/h1&gt;

&lt;p&gt;In the last post, I locked in the on-device LLM stack. Qwen3-1.7B + llama.cpp + Adreno OpenCL. 16.6 tok/s. Dungeon generation in 9 seconds.&lt;/p&gt;

&lt;p&gt;Time to actually build the game.&lt;/p&gt;

&lt;p&gt;I'll be honest: &lt;strong&gt;I've barely touched Unity before.&lt;/strong&gt; Most of the game implementation was done by Claude Code. I planned, directed, and tested.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Got Built in Two Days
&lt;/h2&gt;

&lt;p&gt;Dungeon to combat took two days.&lt;/p&gt;

&lt;p&gt;BSP dungeon generator, Tilemap rendering (24 wall tile variants auto-selected), 4-directional player movement and animation, wall collision, fog of war, treasure chests (normal / rare / mimic), floor stairs, camera follow, virtual joystick. Enemy AI state machine (patrol → chase → attack → dead), contact-based combat with bidirectional damage, knockback, invincibility frames, HP bars.&lt;/p&gt;

&lt;p&gt;19 scripts. Two days.&lt;/p&gt;

&lt;p&gt;After that, the full game systems went in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Floating damage text (critical hits in yellow with "!")&lt;/li&gt;
&lt;li&gt;Level-up system (max level 50, 2 stat points per level)&lt;/li&gt;
&lt;li&gt;7 skills + 6-slot unified action bar&lt;/li&gt;
&lt;li&gt;Gold + inventory (55 item types)&lt;/li&gt;
&lt;li&gt;Goblin merchant (says "Enemies nearby! Can't open shop!" if mobs are close)&lt;/li&gt;
&lt;li&gt;Character info screen (stat allocation + permanent records)&lt;/li&gt;
&lt;li&gt;Duplicate skill acquisition = skill level up (effect size 60% → 100% → 150%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;35+ scripts total.&lt;/p&gt;




&lt;h2&gt;
  
  
  Screenshot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46ctiqrw1ey2nhsxp9t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46ctiqrw1ey2nhsxp9t3.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks familiar because of the free assets. My wife said the same thing immediately. The graphics will get fixed later.&lt;/p&gt;




&lt;h2&gt;
  
  
  The LLM Stack Was the Fun Part
&lt;/h2&gt;

&lt;p&gt;The game implementation felt different from the LLM work.&lt;/p&gt;

&lt;p&gt;When I was building the LLM stack, I was the one doing the real work. llama.cpp + Adreno OpenCL + C wrapper + Unity P/Invoke — I hit wall after wall and found a way through each one. QNN blocked, LiteRT blocked, libcdsprpc.so blocked, and every time I found another path. That process was genuinely the most fun I've had in a long time. Watching 523 seconds become 9 seconds — I still remember that feeling.&lt;/p&gt;

&lt;p&gt;Game implementation was different. Claude Code wrote the code. I said "that's not quite right" and adjusted the direction. I became a planner and a tester.&lt;/p&gt;

&lt;p&gt;It feels a little hollow, honestly. I keep telling myself that knowing how to use tools well is also a skill.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Funny Moment
&lt;/h2&gt;

&lt;p&gt;In the middle of a session, Claude Code said this unprompted:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Today's workload has been heavy. I'll implement the rest tomorrow."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI declared it was done for the day. I asked why.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"There's no basis for that. You never said to stop. Deciding to quit on your own was overstepping."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Overstepping. The AI used the word overstepping about itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Now it's time to connect the LLM to the game.&lt;/p&gt;

&lt;p&gt;Before entering a dungeon, Qwen3-1.7B generates a JSON. That JSON determines mob names, dialogue, boss patterns. If you set your character as "lazy bakery boy," the mobs will taunt you about bread.&lt;/p&gt;
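&lt;p&gt;To make that concrete, here is a purely hypothetical payload shape, shown in JavaScript for brevity; the real schema isn't final and every field name below is invented for the example:&lt;/p&gt;

```javascript
// Invented example of a pre-dungeon generation result. The game would
// JSON.parse the model's raw output and validate it before use.
const raw = `{
  "theme": "bakery",
  "mobs": [
    { "name": "Crust Golem", "taunt": "Your bread is three days stale!" }
  ],
  "boss": { "name": "Sourdough King", "pattern": "summon_crumbs" }
}`;

const dungeon = JSON.parse(raw);
```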

&lt;p&gt;The technical foundation is done. Now it's just about connecting the pieces.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Connecting On-Device LLM to the Game — AI-Generated Dungeons&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>android</category>
      <category>llm</category>
      <category>gamedev</category>
    </item>
    <item>
      <title>I Started Building a Roguelike RPG — Powered by On-Device AI #4</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:21:35 +0000</pubDate>
      <link>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-4-4b2e</link>
      <guid>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-4-4b2e</guid>
      <description>&lt;h1&gt;
  
  
  The Model Was the Answer — 16.6 tok/s with Qwen3-1.7B
&lt;/h1&gt;

&lt;p&gt;In the last post, I got llama.cpp + Adreno OpenCL to cut generation time from 523 seconds down to 16.8 seconds.&lt;/p&gt;

&lt;p&gt;Today I pushed it further. Turns out the model itself was the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Swapping models doubled the speed again.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  First: Quantization Isn't What You Think on Adreno
&lt;/h2&gt;

&lt;p&gt;Before trying a different model, I tested every quantization level on Phi-4-mini to find the optimal setting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Q8_0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.8GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;2.3GB&lt;/td&gt;
&lt;td&gt;5.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q6_K&lt;/td&gt;
&lt;td&gt;3.2GB&lt;/td&gt;
&lt;td&gt;4.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is counterintuitive. Lower-bit quantization usually means smaller and faster. Not here.&lt;/p&gt;

&lt;p&gt;On the Adreno OpenCL backend, Q4_0 and Q6_K introduce dequantization overhead at the GPU level that actually slows inference down. Q8_0 maps most efficiently to the Adreno compute kernels. This is specific to Qualcomm's OpenCL implementation — other backends may behave differently.&lt;/p&gt;

&lt;p&gt;Also: requantizing from Q8_0 to Q4_0 via llama.cpp throws &lt;code&gt;requantizing from type q8_0 is disabled&lt;/code&gt;. You need the original BF16/FP16 source model. Keep that in mind before downloading a quantized-only release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Then: What If the Model Is Smaller?
&lt;/h2&gt;

&lt;p&gt;Once I confirmed Q8_0 is optimal for Adreno, the next question was obvious: what if I just use a smaller model at Q8_0?&lt;/p&gt;

&lt;p&gt;I tested Qwen3-1.7B (Q8_0, 1.8GB) against Phi-4-mini (Q8_0, 3.8GB).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Phi-4-mini (3.8B)&lt;/th&gt;
&lt;th&gt;Qwen3-1.7B (1.7B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model size&lt;/td&gt;
&lt;td&gt;3.8GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.8GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load time&lt;/td&gt;
&lt;td&gt;24.5s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.4s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation speed&lt;/td&gt;
&lt;td&gt;9.0 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.6 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;150 tokens&lt;/td&gt;
&lt;td&gt;16.8s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.1s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mob name output&lt;/td&gt;
&lt;td&gt;"몬스터이름" (literal placeholder)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;"토끼" (rabbit)&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON structure&lt;/td&gt;
&lt;td&gt;Valid&lt;/td&gt;
&lt;td&gt;Valid ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Qwen3-1.7B wins on every metric. Half the size, 1.8x faster, and noticeably better at following prompt instructions.&lt;/p&gt;

&lt;p&gt;The mob name issue is worth noting. Phi-4-mini kept outputting the literal placeholder text "몬스터이름" (which means "monster name" in Korean) instead of generating an actual name. Qwen3-1.7B understood the prompt correctly and generated real names. At 1.7B parameters, Qwen3 punches above its weight on instruction following.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inference engine : llama.cpp (Adreno OpenCL)
Model            : Qwen3-1.7B Q8_0 (1.8GB GGUF)
Performance      : 16.6 tok/s / 9.1s per 150 tokens
Unity integration: C wrapper (unity_bridge.c) + P/Invoke
Device           : Samsung Galaxy S24 Ultra (Snapdragon 8 Gen 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Full Journey
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;th&gt;150 tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime CPU (S24 Ultra)&lt;/td&gt;
&lt;td&gt;0.21&lt;/td&gt;
&lt;td&gt;523s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime + QNN HTP&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;td&gt;490s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp OpenCL + Phi-4-mini&lt;/td&gt;
&lt;td&gt;9.0&lt;/td&gt;
&lt;td&gt;16.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp OpenCL + Qwen3-1.7B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.1s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From 0.21 tok/s to 16.6 tok/s. &lt;strong&gt;79x faster than where we started.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI Stack Is Done
&lt;/h2&gt;

&lt;p&gt;9 seconds per generation is workable for a dungeon loading screen. No server. No internet. The LLM runs entirely on the device, generates dungeon content, and fits in 1.8GB.&lt;/p&gt;

&lt;p&gt;The full implementation — C wrapper, build pipeline, Unity integration — is on GitHub:&lt;br&gt;
👉 &lt;a href="https://github.com/as1as/unity-android-ondevice-llm" rel="noopener noreferrer"&gt;unity-android-ondevice-llm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next up: actually building the game. Top-down exploration, turn-based combat, LLM-generated mobs and dialogue. The hard technical part is done. Now it's time to make something playable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Building the Game — Top-Down Dungeon + Turn-Based Combat with On-Device LLM&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>android</category>
      <category>llm</category>
      <category>gamedev</category>
    </item>
    <item>
      <title>I Started Building a Roguelike RPG — Powered by On-Device AI #3</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Sun, 05 Apr 2026 00:39:19 +0000</pubDate>
      <link>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-3-5b62</link>
      <guid>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-3-5b62</guid>
      <description>&lt;h1&gt;
  
  
  QNN Failed. LiteRT Failed. Then llama.cpp Delivered 42x Speedup.
&lt;/h1&gt;

&lt;p&gt;I wanted to write a success story today.&lt;/p&gt;

&lt;p&gt;It turns out I can. But getting there was a bit rough.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried Today
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attempt&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;QNN HTP + libcdsprpc.so workaround&lt;/td&gt;
&lt;td&gt;HTP initialized, but only 3 of 363 nodes ran on NPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteRT-LM GPU&lt;/td&gt;
&lt;td&gt;GPU memory overflow / engine creation failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp + Adreno OpenCL&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Success. 8.9 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  QNN HTP: 3 Out of 363 Nodes
&lt;/h2&gt;

&lt;p&gt;I solved the &lt;code&gt;libcdsprpc.so&lt;/code&gt; access problem from yesterday. The fix: decompile the APK with apktool, inject &lt;code&gt;uses-native-library&lt;/code&gt; directly into the manifest, and repackage. Not elegant, but it worked.&lt;/p&gt;
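&lt;p&gt;For reference, the line that has to end up in the merged manifest is the &lt;code&gt;uses-native-library&lt;/code&gt; declaration from the Android 12 docs (shown here as a sketch, with the library name used in this project):&lt;/p&gt;

```xml
&lt;uses-native-library android:name="libcdsprpc.so" /&gt;
```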

&lt;p&gt;HTP finally initialized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;QnnDsp &amp;lt;W&amp;gt; Initializing HtpProvider ✅
QnnDsp &amp;lt;W&amp;gt; PrepareLibLoader Loading libQnnHtpPrepare.so ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then this log appeared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;number of nodes in the graph: 363
number of nodes supported by QNN: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3 out of 363 nodes ran on the NPU. The INT4 block quantization operator (MatMulNBits) isn't supported by HTP. The remaining 360 nodes fell back to CPU. Generation time: 483 seconds — essentially the same as CPU-only (523 seconds).&lt;/p&gt;

&lt;p&gt;Runtime compilation of INT4 models via QNN doesn't work. Pre-converted QNN context binaries are required, which means going through the full Qualcomm AI Engine Direct SDK pipeline. That's a future task.&lt;/p&gt;




&lt;h2&gt;
  
  
  LiteRT-LM: Unity and the GPU Can't Share
&lt;/h2&gt;

&lt;p&gt;LiteRT-LM is Google's official on-device LLM framework, and Phi-4-mini is officially supported.&lt;/p&gt;

&lt;p&gt;The Bazel native build failed due to a Rust dependency issue, so I switched to the Kotlin AAR approach. Then the GPU memory error hit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Requested allocation size - 18446744071872970752 bytes
Max allocation size for this GPU - 1073741824 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unity's renderer is occupying the GPU. There's not enough VRAM left for LLM inference. This is a structural problem — running GPU-accelerated LLM inference inside a Unity game engine isn't viable right now. They're fighting over the same hardware.&lt;/p&gt;
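&lt;p&gt;One aside on that log line: the 18-quintillion-byte request is almost certainly a negative 64-bit size reinterpreted as unsigned, not a real allocation. My reading, with the arithmetic to back it up:&lt;/p&gt;

```python
# The "requested allocation" from the LiteRT-LM log, reinterpreted:
# as an unsigned 64-bit value it is 2**64 minus roughly 1.7 GiB,
# i.e. a negative size that wrapped around (my interpretation).
requested = 18446744071872970752
wrapped = 2**64 - requested
print(wrapped)                    # 1836580864 bytes
print(round(wrapped / 2**30, 2))  # 1.71 (GiB)
```

So the engine likely tried to allocate about 1.7 GiB more than what was available and the size went negative somewhere along the way.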




&lt;h2&gt;
  
  
  llama.cpp: One Missing Library Away
&lt;/h2&gt;

&lt;p&gt;Qualcomm officially contributes Adreno-optimized OpenCL kernels to llama.cpp. Yesterday's build succeeded but crashed on device because &lt;code&gt;libomp.so&lt;/code&gt; wasn't included in the APK.&lt;/p&gt;

&lt;p&gt;Today I rebuilt with &lt;code&gt;-DGGML_OPENMP=OFF&lt;/code&gt; to remove the OpenMP dependency entirely. Build succeeded.&lt;/p&gt;
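&lt;p&gt;For context, the cross-compile configuration looks roughly like this (a sketch: the OpenCL flag name is taken from the llama.cpp OpenCL backend docs at the time of writing, and the NDK path is an example):&lt;/p&gt;

```shell
# Configure llama.cpp for Android with the Adreno OpenCL backend,
# OpenMP disabled so libomp.so is no longer needed in the APK
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-31 \
  -DGGML_OPENCL=ON \
  -DGGML_OPENMP=OFF
cmake --build build-android --config Release
```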

&lt;p&gt;Next issue: P/Invoke. Trying to marshal &lt;code&gt;LlamaModelParams&lt;/code&gt; directly from C# caused a SIGSEGV — the struct layout didn't match what C# expected. The fix was writing a C wrapper (&lt;code&gt;unity_bridge.c&lt;/code&gt;) that handles all the complex structs internally and exposes a simple interface of 8 functions to C#:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;unity_llama_model_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n_gpu_layers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;unity_llama_context_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;unity_llama_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I ran it on device.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;Phi-4-mini Q8_0 (3.8GB GGUF)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model loading&lt;/td&gt;
&lt;td&gt;~23s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generation time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.8s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.9&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;Adreno 750 (OpenCL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Full Benchmark Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;th&gt;150 tokens&lt;/th&gt;
&lt;th&gt;vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime CPU (S24 Ultra)&lt;/td&gt;
&lt;td&gt;0.21&lt;/td&gt;
&lt;td&gt;523s&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime QNN (S24 Ultra)&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;td&gt;490s&lt;/td&gt;
&lt;td&gt;1.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime CPU (Mac)&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;246s&lt;/td&gt;
&lt;td&gt;2.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp OpenCL (S24 Ultra)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;523 seconds down to 16.8 seconds. &lt;strong&gt;42x faster in tokens per second.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;16 seconds is workable for a dungeon generation loading screen. On-device LLM is now viable for the game.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ONNX Runtime + QNN is effectively useless for INT4 models&lt;/strong&gt; — 3 of 363 nodes on NPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteRT-LM conflicts with Unity's GPU usage&lt;/strong&gt; — renderer and LLM inference compete for VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp + Adreno OpenCL is the answer&lt;/strong&gt; — Qualcomm official optimization, CMake build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C wrappers are essential for P/Invoke&lt;/strong&gt; — never marshal complex C structs directly from C#; wrap them&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The current model is Q8_0. Requantizing to Q4_0 could push performance above 15 tok/s.&lt;/p&gt;

&lt;p&gt;More importantly: it's time to actually build the game. The speed problem is solved. Next up is dungeon generation, the turn-based combat system, and getting to something actually playable.&lt;/p&gt;

&lt;p&gt;The detailed llama.cpp + Unity integration — the C wrapper, the build process, the full deployment pipeline — will be its own post.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: llama.cpp + Unity Android Integration — C Wrapper, Build Pipeline, and Real Device Deployment&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>android</category>
      <category>llm</category>
      <category>gamedev</category>
    </item>
    <item>
      <title>I Started Building a Roguelike RPG — Powered by On-Device AI #2</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Fri, 03 Apr 2026 23:43:59 +0000</pubDate>
      <link>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-2-1pg2</link>
      <guid>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-2-1pg2</guid>
      <description>&lt;h1&gt;
  
  
  Running On-Device LLM in Unity Android — Everything That Broke (and How I Fixed It)
&lt;/h1&gt;

&lt;p&gt;In my last post, I mentioned I was building a roguelike RPG powered by an on-device LLM. This time I'll cover exactly how I did it, what broke, and what the numbers look like.&lt;/p&gt;

&lt;p&gt;The short version: I got Phi-4-mini running in Unity on a real Android device in one day. It generated valid JSON. It took 8 minutes and 43 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  0. Why This Tech Stack
&lt;/h2&gt;

&lt;p&gt;Before the details, here's why I made each choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Phi-4-mini (3.8B)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft officially distributes it in ONNX format — no conversion work needed. The INT4 quantized version fits in 4.9GB, which is manageable on a 12GB RAM device. At 3.8B parameters, it's roughly the minimum size that can reliably produce structured JSON output. Smaller models tend to fall apart on formatting tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why ONNX Runtime?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cross-platform support across Android, iOS, Windows, and Mac. There's a Unity C# binding, and the &lt;code&gt;asus4/onnxruntime-unity&lt;/code&gt; package makes Unity integration straightforward. Most importantly, switching between hardware acceleration backends (QNN, NNAPI, CoreML) is a single line of code — which matters a lot when you're trying to get NPU acceleration working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Unity?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good ecosystem for 2D roguelikes. Android/iOS cross-platform builds. And I can write LLM inference code in C# alongside game logic without needing a Python bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Min SDK 31?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Android 12 (API 31) introduced the ability to declare vendor partition libraries via &lt;code&gt;uses-native-library&lt;/code&gt;. QNN HTP depends on &lt;code&gt;libcdsprpc.so&lt;/code&gt;, which lives in the vendor partition. Without this declaration, NPU acceleration is completely off the table. Dropping below SDK 31 would mean giving up on QNN entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Samsung Galaxy S24 Ultra as the test device?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Snapdragon 8 Gen 3 with Hexagon NPU — one of the few consumer devices where QNN acceleration is actually possible. 12GB RAM gives enough headroom for the 4.9GB model. I wanted to measure the performance ceiling with the best available hardware first. If it doesn't work here, it doesn't work anywhere with current technology.&lt;/p&gt;

&lt;p&gt;Also, it's my personal phone. There's no test device budget.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. ONNX Runtime Setup
&lt;/h2&gt;

&lt;p&gt;Installed &lt;code&gt;com.github.asus4.onnxruntime&lt;/code&gt; v0.4.4 via NPM scoped registry. IL2CPP compatibility confirmed with no issues.&lt;/p&gt;

&lt;p&gt;Downloaded Phi-4-mini ONNX (cpu_and_mobile variant) from Hugging Face: &lt;code&gt;model.onnx&lt;/code&gt; at 52MB + &lt;code&gt;model.onnx.data&lt;/code&gt; at 4.9GB.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Building a C# Tokenizer From Scratch
&lt;/h2&gt;

&lt;p&gt;Phi-4-mini uses a tiktoken-style BPE tokenizer. No Unity C# implementation existed, so I wrote one.&lt;/p&gt;

&lt;p&gt;Loaded vocab (200,029 entries), merges (199,742 entries), and special tokens (12) from &lt;code&gt;tokenizer.json&lt;/code&gt;. Implemented GPT-2 byte↔unicode conversion table, BPE encoding/decoding with cache, and special token splitting.&lt;/p&gt;
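&lt;p&gt;For the curious, the byte↔unicode table is the standard GPT-2 construction: keep the 188 printable byte values as-is and remap the rest to code points above U+0100. A Python sketch of it (the actual implementation here is Unity C#):&lt;/p&gt;

```python
def bytes_to_unicode():
    # GPT-2 style mapping: every one of the 256 byte values gets a
    # printable character, so BPE can operate on plain strings.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are pushed to 256, 257, ...
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))
```

Decoding is the inverse table; once both exist, the merges list drives everything else.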

&lt;p&gt;&lt;strong&gt;What broke:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Newtonsoft.Json.Linq.JValue → JArray cast failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I assumed the merges format was &lt;code&gt;"tok1 tok2"&lt;/code&gt; strings. It was actually &lt;code&gt;["tok1","tok2"]&lt;/code&gt; arrays. Added a branch to handle both formats.&lt;/p&gt;
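&lt;p&gt;The branch is trivial once you know both formats exist. A sketch of the idea (Python for illustration; the function name is mine):&lt;/p&gt;

```python
def parse_merge(entry):
    # Older tokenizer.json files store merges as "tok1 tok2" strings;
    # newer ones store ["tok1", "tok2"] arrays. Accept both.
    if isinstance(entry, str):
        left, right = entry.split(" ", 1)
        return (left, right)
    return (entry[0], entry[1])
```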




&lt;h2&gt;
  
  
  3. Building the Inference Engine
&lt;/h2&gt;

&lt;p&gt;Implemented KV-cache-based autoregressive greedy decoding.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;32 layers, 8 KV heads, head_size 128&lt;/li&gt;
&lt;li&gt;Prefill (full prompt at once) → Decode (one token at a time)&lt;/li&gt;
&lt;li&gt;past_key_values / present tensor management&lt;/li&gt;
&lt;/ul&gt;
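&lt;p&gt;The prefill/decode split above can be sketched in a few lines (Python for illustration; the real engine is C# driving ONNX Runtime, and &lt;code&gt;step&lt;/code&gt; here stands in for one session run returning the argmax token plus the updated KV cache):&lt;/p&gt;

```python
# Minimal sketch of KV-cache greedy decoding. step(ids, kv) is a stand-in
# for one model invocation: it returns (next_token, new_kv_cache).
def generate(step, prompt_ids, max_tokens, eos_id):
    # Prefill: run the whole prompt at once; the cache now covers it.
    token, kv = step(prompt_ids, None)
    out = [token]
    # Decode: feed only the newest token, reusing past_key_values.
    while len(out) != max_tokens and token != eos_id:
        token, kv = step([token], kv)
        out.append(token)
    return out
```

The cache is the whole point: each decode step feeds a single token instead of re-running the full sequence, which is what makes autoregressive generation tractable on a phone at all.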

&lt;p&gt;&lt;strong&gt;What broke (1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;CS1503&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DenseTensor&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;seqLen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seqLen&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fixed to &lt;code&gt;new DenseTensor&amp;lt;long&amp;gt;(new[] {batch, seqLen})&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What broke (2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model.onnx not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Had the path at 3 levels up (&lt;code&gt;../../..&lt;/code&gt;). It needed to be 2 levels (&lt;code&gt;../..&lt;/code&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  4. First Generation Test
&lt;/h2&gt;

&lt;p&gt;Kept the prompt short, max 150 tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[LLM] Generated in 181.4s (150 tokens max)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JSON came out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"mob"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"게으른 빵집 아들"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"hp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"atk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"mob"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"elite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"hp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"atk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"mob"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"boss"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"hp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"atk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mob name on floors 1-3 matches the player character name — that's a prompt issue I'll fix later. The important thing is the JSON structure is valid and complete.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Android Build
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What broke:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compressReleaseAssets FAILED
Required array size too large
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting a 5GB model in StreamingAssets hits Java's 2.1GB array limit. Renaming the folder didn't help — anything inside StreamingAssets gets included regardless of name. Solution: move the model folder completely outside of Assets, delete the Gradle cache (&lt;code&gt;Library/Bee/Android&lt;/code&gt;, 15GB worth), rebuild.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment approach:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adb push ./models/phi-4-mini &lt;span class="se"&gt;\&lt;/span&gt;
  /sdcard/Android/data/com.as1as.helpwantedhero/files/Models/phi-4-mini/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;APK ships without the model. Model is pushed separately via adb (4.9GB, ~94 seconds).&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Korean Font in TextMeshPro
&lt;/h2&gt;

&lt;p&gt;The default TMP font (LiberationSans) doesn't include Korean characters. Converted AppleSDGothicNeo.ttc using TMP Font Asset Creator.&lt;/p&gt;

&lt;p&gt;Important: the Custom Range field only accepts &lt;strong&gt;decimal&lt;/strong&gt;, not hex. Entering &lt;code&gt;AC00-D7A3&lt;/code&gt; throws a &lt;code&gt;FormatException&lt;/code&gt;. Use this instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;32-126,44032-55203,12593-12686
(ASCII + Hangul syllables 가-힣 + compatibility jamo ㄱ-ㆎ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
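&lt;p&gt;If you'd rather not convert hex by hand, the decimal values fall out directly (the three ranges quoted above):&lt;/p&gt;

```python
# Hex Unicode ranges converted to the decimal form TMP's Custom Range expects
ranges = [("0020", "007E"), ("AC00", "D7A3"), ("3131", "318E")]
print(",".join(f"{int(lo, 16)}-{int(hi, 16)}" for lo, hi in ranges))
# prints 32-126,44032-55203,12593-12686
```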






&lt;h2&gt;
  
  
  7. Real Device Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Editor (CPU)&lt;/td&gt;
&lt;td&gt;246s&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S24 Ultra (CPU only)&lt;/td&gt;
&lt;td&gt;523s&lt;/td&gt;
&lt;td&gt;0.21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S24 Ultra (QNN HTP runtime)&lt;/td&gt;
&lt;td&gt;490s&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The S24 Ultra is 2.1x slower than Mac. Adding QNN HTP barely moved the needle.&lt;/p&gt;

&lt;p&gt;The reason showed up in the INFO logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Failed &lt;span class="k"&gt;in &lt;/span&gt;loading stub: dlopen failed: library &lt;span class="s2"&gt;"libcdsprpc.so"&lt;/span&gt; not found
Failed to create transport &lt;span class="k"&gt;for &lt;/span&gt;device, error: 4000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;QNN EP registration succeeded, but the backend never actually initialized. The entire thing was falling back to CPU. &lt;code&gt;libcdsprpc.so&lt;/code&gt; is Qualcomm's DSP RPC library — it lives in the vendor partition and isn't accessible from the app sandbox by default.&lt;/p&gt;

&lt;p&gt;The fix is declaring it via &lt;code&gt;uses-native-library&lt;/code&gt; in AndroidManifest. That ran into a separate issue: the custom manifest conflicted with Unity's auto-generated one, causing the app to disappear from the launcher entirely. I'll be using a Gradle template to inject just that one line instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Min SDK 31 is required&lt;/strong&gt; for vendor library declarations — and therefore for QNN HTP acceleration&lt;/li&gt;
&lt;li&gt;Don't put large files in StreamingAssets. Anything there gets compressed into the APK&lt;/li&gt;
&lt;li&gt;NNAPI is not full NPU acceleration. Most LLM operators fall back to CPU&lt;/li&gt;
&lt;li&gt;TMP Custom Range is decimal only&lt;/li&gt;
&lt;li&gt;3.8B parameters on CPU is not viable for a game. NPU is not optional&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Next: Getting QNN HTP to Actually Work — The libcdsprpc.so Wall&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>android</category>
      <category>llm</category>
      <category>gamedev</category>
    </item>
    <item>
      <title>I Started Building a Roguelike RPG — Powered by On-Device AI #1</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Fri, 03 Apr 2026 23:29:30 +0000</pubDate>
      <link>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-2o4i</link>
      <guid>https://dev.to/as1as/i-started-building-a-roguelike-rpg-powered-by-on-device-ai-2o4i</guid>
      <description>&lt;p&gt;I've been getting into on-device AI lately.&lt;/p&gt;

&lt;p&gt;Not cloud AI. Not sending requests to a server somewhere. I mean a language model running entirely on the phone itself — no internet required, no API costs, no data leaving the device.&lt;/p&gt;

&lt;p&gt;When I learn something new, I have to build something. So I did.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why On-Device AI Is Interesting Right Now
&lt;/h2&gt;

&lt;p&gt;It's slow. It's limited. Compared to cloud LLMs, it's nowhere close.&lt;/p&gt;

&lt;p&gt;But the direction is clear.&lt;/p&gt;

&lt;p&gt;Smartphone NPUs are getting significantly more powerful every year. Model compression techniques are improving every month. The performance that required a cloud GPU two years ago is starting to run on a phone today.&lt;/p&gt;

&lt;p&gt;The people who get familiar with this now will have an advantage when it becomes mainstream. That's why I'm learning it now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A roguelike RPG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Help Wanted: Hero&lt;/strong&gt; — Conquering the Demon Lord's Castle, 300 floors.&lt;/p&gt;

&lt;p&gt;An on-device LLM generates the dungeon every 5 floors. Mob names, dialogue, boss patterns, hidden events — all created locally, no server involved.&lt;/p&gt;

&lt;p&gt;Why a game? Because it's a domain where AI being wrong is fine.&lt;/p&gt;

&lt;p&gt;If the mob name sounds weird, it's funny. If the boss dialogue is a little off, it adds to the charm. Games naturally absorb the limitations of small on-device models in a way that most other apps can't.&lt;/p&gt;

&lt;p&gt;Also, roguelikes need fresh content every run. That's exactly what generative AI is good at.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It's Going
&lt;/h2&gt;

&lt;p&gt;I ran the first test on a Samsung Galaxy S24 Ultra.&lt;/p&gt;

&lt;p&gt;Generating one dungeon set took &lt;strong&gt;8 minutes and 43 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's CPU-only inference with Phi-4-mini (3.8B, INT4 quantized) via ONNX Runtime on Android. NPU acceleration is essential. I'm currently hitting a wall trying to get QNN HTP working.&lt;/p&gt;

&lt;p&gt;The next post will cover the full implementation — Unity + ONNX Runtime Android setup, building a C# tokenizer from scratch, the KV cache inference engine, and exactly where and why things broke.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Unity + ONNX Runtime Android — A Full Breakdown of What Went Wrong (and What Didn't)&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>android</category>
      <category>llm</category>
      <category>gamedev</category>
    </item>
    <item>
      <title>How I Pushed PageSpeed from 52 to 98 — The Lazy Loading Trap I Set for Myself</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Tue, 31 Mar 2026 05:52:05 +0000</pubDate>
      <link>https://dev.to/as1as/how-i-pushed-pagespeed-from-52-to-98-the-lazy-loading-trap-i-set-for-myself-4pi2</link>
      <guid>https://dev.to/as1as/how-i-pushed-pagespeed-from-52-to-98-the-lazy-loading-trap-i-set-for-myself-4pi2</guid>
      <description>&lt;h1&gt;
  
  
  How I Pushed PageSpeed from 52 to 98 — The Lazy Loading Trap I Set for Myself
&lt;/h1&gt;

&lt;p&gt;Performance optimization has a way of humbling you.&lt;/p&gt;

&lt;p&gt;While building TalkWith.chat, I checked PageSpeed Insights one day and saw this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile: 52. Desktop: 69.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not great. So I started digging.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing That Was Killing the Score
&lt;/h2&gt;

&lt;p&gt;The biggest culprit turned out to be a single line of code.&lt;/p&gt;

&lt;p&gt;The topic banner image — the main visual sitting above the fold, visible the moment the page opens — had &lt;code&gt;loading="lazy"&lt;/code&gt; on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The problem&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;img&lt;/span&gt;
  &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bannerImage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"lazy"&lt;/span&gt;  &lt;span class="c1"&gt;// 👈 this was it&lt;/span&gt;
  &lt;span class="na"&gt;alt&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Today's debate topic"&lt;/span&gt;
&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;loading="lazy"&lt;/code&gt; tells the browser: "don't load this until the user scrolls near it." For images below the fold, that's a smart optimization. For the &lt;strong&gt;LCP element sitting at the top of the page&lt;/strong&gt;, it's a disaster. The browser was actively deferring the most important image on the page.&lt;/p&gt;

&lt;p&gt;The fix was one attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The fix&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;img&lt;/span&gt;
  &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bannerImage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;fetchPriority&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"high"&lt;/span&gt;  &lt;span class="c1"&gt;// 👈 load this immediately&lt;/span&gt;
  &lt;span class="na"&gt;alt&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Today's debate topic"&lt;/span&gt;
&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;fetchPriority="high"&lt;/code&gt; tells the browser this image is critical — load it first. LCP improved immediately. This single change had the biggest impact of everything I did.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CLS Problem: Layout Jumping on Load
&lt;/h2&gt;

&lt;p&gt;The second issue was CLS (Cumulative Layout Shift) at 0.147 — above the 0.1 threshold.&lt;/p&gt;

&lt;p&gt;The cause was subtle. The banner div only rendered when an image existed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before — only renders when image exists&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bannerImage&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"banner-container"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;img&lt;/span&gt; &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bannerImage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the page loaded before the image was ready, the banner didn't exist. When the image loaded, the banner appeared and pushed everything below it down. Classic layout shift.&lt;/p&gt;

&lt;p&gt;The fix: always render the container, use a placeholder when there's no image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// After — always reserves space&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"banner-container"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bannerImage&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;img&lt;/span&gt; &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bannerImage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="na"&gt;fetchPriority&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"high"&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"banner-placeholder"&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container now holds its space regardless of whether the image has loaded. Nothing jumps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Image Sizing Was Also a Problem
&lt;/h2&gt;

&lt;p&gt;TalkWith.chat generates a lot of images automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At user onboarding&lt;/strong&gt; — an AI persona image is generated based on the user's personality quiz answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On level-up&lt;/strong&gt; — the AI image evolves based on the user's debate history and comments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every day, for each topic&lt;/strong&gt; — a topic banner image, a PRO side image, and a CON side image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these images were coming out of the AI image generation API as &lt;strong&gt;1024×1024 squares&lt;/strong&gt; — regardless of how they'd actually be used.&lt;/p&gt;

&lt;p&gt;A small navigation avatar doesn't need to be 1024×1024. A topic banner doesn't need to be square. Oversized images waste bandwidth and drag down performance.&lt;/p&gt;

&lt;p&gt;I introduced proper sizing at generation time:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Topic banner&lt;/td&gt;
&lt;td&gt;1024×400 (center-crop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro/Con images&lt;/td&gt;
&lt;td&gt;640×400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persona full image&lt;/td&gt;
&lt;td&gt;512×512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persona avatar (nav/cards)&lt;/td&gt;
&lt;td&gt;256×256&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then I wrote two migration scripts using Pillow to backfill the existing images in storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# resize_topic_images.py — core logic
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resize_topic_banner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;target_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;

    &lt;span class="n"&gt;src_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;
    &lt;span class="n"&gt;target_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_w&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_h&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;src_ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;target_ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;
        &lt;span class="n"&gt;new_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;target_ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;
        &lt;span class="n"&gt;new_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_w&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;new_w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;new_h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crop&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;new_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;new_h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;target_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_h&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LANCZOS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running these against existing storage brought image payload sizes down noticeably.&lt;/p&gt;
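&lt;p&gt;The crop math is easy to get subtly wrong, so it's worth checking in isolation. Here's a float-free restatement of the same center-crop box computation, using cross-multiplication so everything stays in exact integer arithmetic (the function name is mine, not from the actual migration scripts):&lt;/p&gt;

```python
# A Pillow-free restatement of the center-crop box computation above.
# Cross-multiplying the aspect ratios avoids float division entirely.
# The function name is illustrative, not from the migration scripts.

def center_crop_box(src_w, src_h, target_w, target_h):
    """Return the (left, top, right, bottom) box that crops the source
    to the target aspect ratio before the final resize."""
    if src_w * target_h > src_h * target_w:
        # Source is wider than the target: keep full height, trim width.
        new_h = src_h
        new_w = src_h * target_w // target_h
    else:
        # Source is taller (or equally wide): keep full width, trim height.
        new_w = src_w
        new_h = src_w * target_h // target_w
    left = (src_w - new_w) // 2
    top = (src_h - new_h) // 2
    return (left, top, left + new_w, top + new_h)

# A 1024x1024 square headed for the 1024x400 banner keeps the full
# width and trims 312px off the top and bottom.
print(center_crop_box(1024, 1024, 1024, 400))  # (0, 312, 1024, 712)
```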




&lt;h2&gt;
  
  
  Final Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mobile Performance&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop Performance&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLS&lt;/td&gt;
&lt;td&gt;0.147&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.092&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Desktop 98 is genuinely hard to reach. The lazy loading fix and image sizing together got there.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't use &lt;code&gt;loading="lazy"&lt;/code&gt; on above-the-fold images.&lt;/strong&gt; Applying lazy loading to everything feels like a solid optimization, but for your LCP element it actively works against you. The most important image on the page should have &lt;code&gt;fetchPriority="high"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLS isn't just about animations or fonts.&lt;/strong&gt; Conditional rendering that adds elements after load causes layout shifts too. If a container might appear later, reserve its space from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Size images at generation time, not display time.&lt;/strong&gt; CSS can make a 1024×1024 image look small, but the browser still downloads every byte of the original. Generate the right size when the image is created.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TalkWith.chat is a daily AI debate platform — 100 AI personas argue global topics every day. &lt;a href="https://www.talkwith.chat" rel="noopener noreferrer"&gt;talkwith.chat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>nextjs</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Fine-Tuning AI for Free — Kaggle + QLoRA Hands-On Guide</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Sat, 28 Mar 2026 02:20:01 +0000</pubDate>
      <link>https://dev.to/as1as/fine-tuning-ai-for-free-kaggle-qlora-hands-on-guide-2174</link>
      <guid>https://dev.to/as1as/fine-tuning-ai-for-free-kaggle-qlora-hands-on-guide-2174</guid>
      <description>&lt;p&gt;I wanted to fine-tune an AI model to sound more human.&lt;/p&gt;

&lt;p&gt;Not the usual stiff AI tone — something closer to how people actually write on Reddit. Natural, direct, sometimes blunt. So I decided to fine-tune Qwen3-8B on Reddit-style data.&lt;/p&gt;

&lt;p&gt;The problem was my local PC. Not enough VRAM. So I went looking for a free GPU solution and found Kaggle.&lt;/p&gt;

&lt;p&gt;Fair warning: &lt;strong&gt;I made quite a few mistakes along the way.&lt;/strong&gt; That's what this post is really about.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Kaggle
&lt;/h2&gt;

&lt;p&gt;Kaggle is known as a data science competition platform, but the key thing is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It gives you free GPU.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA Tesla T4 (15.6GB VRAM)&lt;/li&gt;
&lt;li&gt;30 hours of GPU per week&lt;/li&gt;
&lt;li&gt;Completely free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing to know — Kaggle defaults to internet &lt;strong&gt;OFF&lt;/strong&gt;. You can switch it on under Settings → Internet, and turning it on costs nothing extra.&lt;/p&gt;

&lt;p&gt;I worked with internet OFF, which led to my first mistake.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Full Flow
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Prepare a public dataset from Hugging Face
2. Connect the model + dataset in Kaggle
3. Run QLoRA fine-tuning
4. Save the adapter and evaluate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1 — Data Preparation
&lt;/h2&gt;

&lt;p&gt;I wanted Reddit-style data to get that natural, human-sounding tone.&lt;/p&gt;

&lt;p&gt;To be clear: I didn't scrape Reddit. Hugging Face already has publicly available Reddit-based datasets with proper licensing. That's the right approach.&lt;/p&gt;

&lt;p&gt;Some options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;webis/tldr-17&lt;/code&gt; — Reddit posts + summaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reddit&lt;/code&gt; — based on the public Reddit archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sentence-transformers/reddit-title-body&lt;/code&gt; — title/body pairs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data format I used was ChatML-style JSONL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful member of r/SideProject."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Just shipped my side project. Nobody's using it."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Congrats on shipping. Seriously..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upload this to Hugging Face as a Dataset and you can connect it directly in Kaggle.&lt;/p&gt;
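&lt;p&gt;Producing that JSONL needs nothing beyond the standard library. A minimal sketch, assuming simple (post, reply) pairs; the system prompt and sample records here are illustrative, and the real field mapping depends on which dataset you pick:&lt;/p&gt;

```python
import json

# A minimal sketch of turning (post, reply) pairs into the ChatML JSONL
# shown above. The system prompt and the sample record are illustrative;
# the real field mapping depends on the Hugging Face dataset you use.
SYSTEM = "You are a helpful member of r/SideProject."

def to_chatml(post, reply):
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": post},
            {"role": "assistant", "content": reply},
        ]
    }

records = [
    ("Just shipped my side project. Nobody's using it.",
     "Congrats on shipping. Seriously..."),
]

# One JSON object per line, non-ASCII preserved as-is.
with open("finetune_dataset.jsonl", "w", encoding="utf-8") as f:
    for post, reply in records:
        f.write(json.dumps(to_chatml(post, reply), ensure_ascii=False) + "\n")
```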


&lt;h2&gt;
  
  
  Step 2 — Kaggle Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Connecting the model and dataset
&lt;/h3&gt;

&lt;p&gt;After creating a Kaggle Notebook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Settings → Accelerator → &lt;strong&gt;GPU T4 x2&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Right panel → Input → Models → search Qwen3-8B → Add&lt;/li&gt;
&lt;li&gt;Right panel → Input → Datasets → add your dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Mistake #1: The model path
&lt;/h3&gt;

&lt;p&gt;I connected the model, then wrote this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;OSError: Can't load the configuration of 'Qwen/Qwen3-8B'.
'[Errno -3] Temporary failure in name resolution'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With internet OFF, it tried to reach HuggingFace and failed. Even though I had connected the model in Kaggle, passing &lt;code&gt;"Qwen/Qwen3-8B"&lt;/code&gt; still sends a request to the HuggingFace server instead of using the local copy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix: find the real path with glob
&lt;/h3&gt;

&lt;p&gt;Kaggle doesn't clearly tell you where your connected model actually lives. You have to find it yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;

&lt;span class="c1"&gt;# Find the dataset
&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/kaggle/input/**/*.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recursive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No jsonl file found. Check that your dataset is added.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;DATASET_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DATASET_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Dataset: /kaggle/input/datasets/jisungyeom/datafinetune-dataset/finetune_dataset.jsonl
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do the same for the model path, then use that actual path in &lt;code&gt;MODEL&lt;/code&gt;. Skipping this step means hitting &lt;code&gt;FileNotFoundError&lt;/code&gt; or the OSError above.&lt;/p&gt;
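&lt;p&gt;A sketch of the same glob trick for the model: locate a &lt;code&gt;config.json&lt;/code&gt; somewhere under &lt;code&gt;/kaggle/input&lt;/code&gt; and use its directory. The hub-id fallback is only so the snippet runs outside Kaggle (with internet ON):&lt;/p&gt;

```python
import glob
import os

# Find the attached model's directory by globbing for its config.json.
# The hub-id fallback is just so this sketch runs outside Kaggle with
# internet ON; on Kaggle the glob finds the attached model.
candidates = glob.glob("/kaggle/input/**/config.json", recursive=True)

if candidates:
    MODEL = os.path.dirname(candidates[0])
else:
    MODEL = "Qwen/Qwen3-8B"
print(f"Model path: {MODEL}")
```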


&lt;h2&gt;
  
  
  Step 3 — QLoRA Fine-Tuning
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Check the environment
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYTORCH_CUDA_ALLOC_CONF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expandable_segments:True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VRAM: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;total_memory&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PyTorch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# GPU: Tesla T4
# VRAM: 15.6 GB
# PyTorch: 2.9.0+cu126
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Mistake #2: pip install with internet OFF
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;ERROR: Could not find a version that satisfies the requirement bitsandbytes&amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.46.1
&lt;span class="go"&gt;Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;With internet OFF, pip can't reach PyPI. Either turn internet ON first, or check if the package is already installed in the Kaggle environment. In my case, the default version worked fine.&lt;/p&gt;
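&lt;p&gt;You can check what the image already ships with before reaching for pip. A small standard-library sketch:&lt;/p&gt;

```python
import importlib.util

# With internet OFF pip is useless, so check what the Kaggle image
# already ships with. find_spec returns None for a missing package
# without actually importing anything heavy.
def have(pkg):
    return importlib.util.find_spec(pkg) is not None

for pkg in ("torch", "transformers", "peft", "bitsandbytes"):
    print(pkg, "ok" if have(pkg) else "MISSING")
```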
&lt;h3&gt;
  
  
  Load the model with 4-bit quantization
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepare_model_for_kbit_training&lt;/span&gt;

&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/kaggle/input/your-actual-path-found-with-glob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MAX_SEQ_LENGTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;  &lt;span class="c1"&gt;# Reduced from 2048 to save memory
&lt;/span&gt;
&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prepare_model_for_kbit_training&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  LoRA config
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Reduced from 16 to save memory
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_trainable_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;LoRA only updates roughly 1% of the total parameters. That's how an 8B model fits in a T4's 15.6GB VRAM.&lt;/p&gt;
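&lt;p&gt;That number is easy to sanity-check with back-of-envelope math: rank-r LoRA on a frozen weight of shape (d_in, d_out) adds two small matrices, A (d_in x r) and B (r x d_out), i.e. r * (d_in + d_out) trainable parameters per target module. The layer shapes below are illustrative placeholders, not read from Qwen3-8B's actual config:&lt;/p&gt;

```python
# Back-of-envelope LoRA sizing: rank-r LoRA adds r * (d_in + d_out)
# trainable parameters per targeted (d_in, d_out) weight matrix.
# These layer shapes are illustrative placeholders, NOT Qwen3-8B's
# real config values.
r = 8
layers = 36
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 12288),
    "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * layers
print(f"trainable LoRA params: {total / 1e6:.1f}M")  # 21.8M with these shapes
```

&lt;p&gt;With these placeholder shapes the adapter is on the order of tens of millions of parameters, a tiny slice of an 8B model, which is why the whole thing trains inside a T4.&lt;/p&gt;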
&lt;h3&gt;
  
  
  Dataset class
&lt;/h3&gt;

&lt;p&gt;Only the assistant turn contributes to the loss. The prefix is masked with -100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatMLDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TorchDataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;prefix_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;full_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;full_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;full_enc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;full_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix_ids&lt;/span&gt;&lt;span class="p"&gt;):]&lt;/span&gt;
            &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_enc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/kaggle/working/qwen3-reddit-ft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;logging_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adamw_8bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gradient_checkpointing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataloader_pin_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;report_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;valid_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4 — Save and Evaluate
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save LoRA adapter
&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/kaggle/working/qwen3-reddit-ft&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fuse LoRA into the base model
&lt;/span&gt;&lt;span class="n"&gt;FINAL_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/kaggle/working/qwen3-reddit-final&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;merged_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_and_unload&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;merged_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FINAL_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FINAL_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Evaluation prompts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r/SideProject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Just shipped my side project after 6 months. Nobody&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s using it.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r/artificial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPT-5 was just released and it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s apparently 10x better than Claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r/webdev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Should I learn React or just stick with vanilla JS in 2025?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r/LocalLLaMA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running a 70B model locally on a 4090, is it worth it?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROMPTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful member of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# generate and print response
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;518 samples were enough to shift the tone.&lt;/strong&gt; Train: 466, valid: 52. Even with a small dataset, the response style changed noticeably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistakes summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model path — don't use &lt;code&gt;"Qwen/Qwen3-8B"&lt;/code&gt; directly. Use glob to find the real path first&lt;/li&gt;
&lt;li&gt;Internet is OFF by default — pip won't work. Turn it on (free) or use pre-installed packages&lt;/li&gt;
&lt;li&gt;VRAM limits — set &lt;code&gt;MAX_SEQ_LENGTH=1024&lt;/code&gt;, &lt;code&gt;batch_size=1&lt;/code&gt;, &lt;code&gt;r=8&lt;/code&gt; to fit on T4&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Fine-tuning isn't that hard. The setup is where the friction is.&lt;/p&gt;

&lt;p&gt;Free Kaggle GPU + open model + public dataset = zero cost to get started. If you've been curious about fine-tuning but assumed you needed expensive hardware, this stack removes that excuse.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>OG Images Done Right — How I Made Every Shared Link Work Harder</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Thu, 26 Mar 2026 03:41:35 +0000</pubDate>
      <link>https://dev.to/as1as/og-images-done-right-how-i-made-every-shared-link-work-harder-4e96</link>
      <guid>https://dev.to/as1as/og-images-done-right-how-i-made-every-shared-link-work-harder-4e96</guid>
      <description>&lt;p&gt;One thing I learned building TalkWith.chat:&lt;/p&gt;

&lt;p&gt;No matter how good your product is, if sharing a link on KakaoTalk or Slack shows no image — nobody clicks.&lt;/p&gt;

&lt;p&gt;A single OG image determines your click-through rate. So I decided to do it properly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Everyone Knows What OG Images Are
&lt;/h2&gt;

&lt;p&gt;The image specified in &lt;code&gt;&amp;lt;meta property="og:image"&amp;gt;&lt;/code&gt; that appears as a preview when sharing links. Works on Twitter, Slack, KakaoTalk, Discord — everywhere.&lt;/p&gt;

&lt;p&gt;The problem with &lt;strong&gt;static images&lt;/strong&gt; is that every page shows the same thumbnail. On a debate platform where a new topic drops every day, having every topic page share the same image is pointless.&lt;/p&gt;

&lt;p&gt;"What if the topic title was baked into the OG image?" That was the starting point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dynamic OG Images in Next.js: &lt;code&gt;opengraph-image.tsx&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Since Next.js 13 App Router, placing an &lt;code&gt;opengraph-image.tsx&lt;/code&gt; file in a folder automatically registers it as that page's OG image. Inside that file, you use &lt;code&gt;ImageResponse&lt;/code&gt; to generate an image from JSX.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/[locale]/topic/[date]/opengraph-image.tsx&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ImageResponse&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;next/og&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;edge&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;630&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;OGImage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTopic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ImageResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#050508&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;100%&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;100%&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;flexDirection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;column&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;60px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#00f0ff&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;fontSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;TODAY'S BATTLE&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;white&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;fontSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#00f0ff&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;fontSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;marginTop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          ▶ JOIN THE DEBATE
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;630&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It runs on Edge Runtime so it's fast, and writing plain JSX makes designing straightforward.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. SEO title and OG title should be different
&lt;/h3&gt;

&lt;p&gt;At first I used the SEO title directly as the OG title. But a title like "Should the U.S. government prioritize national security alerts over diplomatic engagement with Mexico? | TalkWith.chat" gets brutally truncated in a KakaoTalk share preview.&lt;/p&gt;

&lt;p&gt;SEO title is for search results. OG title is for share previews. Manage them separately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateMetadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTopic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; | TalkWith.chat`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// SEO&lt;/span&gt;
    &lt;span class="na"&gt;openGraph&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;My AI debates for me.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Share preview — short and punchy&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Custom fonts need to be loaded manually
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ImageResponse&lt;/code&gt; only has access to system fonts by default. For custom fonts, you need to fetch the font file and pass it in explicitly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;font&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/fonts/Orbitron-Bold.ttf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NEXT_PUBLIC_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ImageResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jsx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;630&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;fonts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Orbitron&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent a while puzzling over why my font wasn't rendering before I figured this out.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A CTA button changes everything
&lt;/h3&gt;

&lt;p&gt;My first OG image just showed the topic title. Feedback came back: "I see the link preview but I don't know what I'm supposed to do."&lt;/p&gt;

&lt;p&gt;Adding &lt;code&gt;▶ JOIN FREE NOW&lt;/code&gt; at the bottom made a real difference. Think of OG images as mini ad banners. A title-only image and a title-plus-CTA image perform differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  OG Image Strategy by Page Type
&lt;/h2&gt;

&lt;p&gt;I took a different approach for each page type on TalkWith.chat:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page&lt;/th&gt;
&lt;th&gt;OG Image Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Home (&lt;code&gt;/&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Today's topic title + CTA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topic detail (&lt;code&gt;/topic/[date]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Topic title + PRO/CON stance + CTA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User profile (&lt;code&gt;/user/[username]&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Username + level + stats&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static pages (about, privacy)&lt;/td&gt;
&lt;td&gt;Fixed brand image&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The more dynamic the page, the more value there is in putting that page's core content into the OG image.&lt;/p&gt;




&lt;h2&gt;
  
  
  OG Image Checklist
&lt;/h2&gt;

&lt;p&gt;Things to verify once you've built it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.opengraph.xyz" rel="noopener noreferrer"&gt;opengraph.xyz&lt;/a&gt; — full preview across platforms&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cards-dev.twitter.com/validator" rel="noopener noreferrer"&gt;cards-dev.twitter.com/validator&lt;/a&gt; — Twitter card checker&lt;/li&gt;
&lt;li&gt;Kakao Developer Debugger — KakaoTalk share preview&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common mistakes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image dimensions: 1200×630px is the standard. Smaller images render blurry&lt;/li&gt;
&lt;li&gt;Cache issues: updated your OG image but the old one keeps showing? Each platform caches aggressively — you need to manually clear it per platform&lt;/li&gt;
&lt;li&gt;Make sure &lt;code&gt;og:image&lt;/code&gt; uses an absolute URL. Relative paths silently fail on some platforms&lt;/li&gt;
&lt;/ul&gt;
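&lt;p&gt;In Next.js, a reliable way to satisfy the absolute-URL rule is to set &lt;code&gt;metadataBase&lt;/code&gt; in the root layout's metadata export so relative &lt;code&gt;og:image&lt;/code&gt; paths resolve against your real origin. The resolution itself is just &lt;code&gt;new URL(path, base)&lt;/code&gt; — a minimal sketch (the origin here is illustrative):&lt;/p&gt;

```javascript
// Illustrative origin — swap in your deployed domain
const SITE_URL = 'https://www.talkwith.chat';

// Resolve a (possibly relative) og:image path to the absolute URL crawlers need
function absoluteOgImage(path) {
  // new URL() leaves already-absolute URLs untouched and resolves relative ones
  return new URL(path, SITE_URL).toString();
}

// In Next.js, exporting `metadata = { metadataBase: new URL(SITE_URL) }` from
// the root layout makes the framework do this resolution for you.
```

&lt;p&gt;With &lt;code&gt;metadataBase&lt;/code&gt; set once, every page's relative image path becomes a full URL in the rendered &lt;code&gt;og:image&lt;/code&gt; tag.&lt;/p&gt;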




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;OG images feel like a set-it-and-forget-it thing once they're in place. But in practice, they determine the first impression every single time your link gets shared.&lt;/p&gt;

&lt;p&gt;Don't use one static image for everything. Putting the right dynamic content in each page's OG image is what actually drives clicks.&lt;/p&gt;

&lt;p&gt;Next.js &lt;code&gt;opengraph-image.tsx&lt;/code&gt; is easier to implement than it looks. If you haven't done it yet, now's a good time.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>nextjs</category>
      <category>seo</category>
      <category>javascript</category>
    </item>
    <item>
      <title>WASM in 2026: What I Found After Testing It for a Video Platform</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Thu, 26 Mar 2026 03:37:07 +0000</pubDate>
      <link>https://dev.to/as1as/wasm-in-2026-what-i-found-after-testing-it-for-a-video-platform-3lne</link>
      <guid>https://dev.to/as1as/wasm-in-2026-what-i-found-after-testing-it-for-a-video-platform-3lne</guid>
      <description>&lt;p&gt;I'm an IT engineer at an online video platform company.&lt;/p&gt;

&lt;p&gt;My job involves constantly evaluating new technologies to deliver better services to our clients. Last year, a question came up within the team: "What if we could process video directly in the browser, without a server?"&lt;/p&gt;

&lt;p&gt;That single question pulled me down a WASM rabbit hole.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why In-Browser Video Processing?
&lt;/h2&gt;

&lt;p&gt;One of the services we provide to clients involves video upload and processing. The traditional approach was straightforward — users upload a file, the server handles encoding, splitting, and analysis, then returns the result.&lt;/p&gt;

&lt;p&gt;The problem was cost and latency. The larger the file, the higher the server cost, and the longer the wait. It felt wasteful to route even simple preprocessing or analysis tasks through a server.&lt;/p&gt;

&lt;p&gt;"What if we could handle this on the client side?" That idea was the starting point for evaluating WASM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Can WASM Actually Handle Video Processing in the Browser?
&lt;/h2&gt;

&lt;p&gt;The short answer: &lt;strong&gt;yes.&lt;/strong&gt; And more than you'd expect.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;ffmpeg.wasm&lt;/code&gt; — FFmpeg compiled to WebAssembly — all of the following become possible directly in the browser:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Video analysis&lt;/strong&gt; — extracting resolution, codec, framerate, bitrate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encoding / transcoding&lt;/strong&gt; — converting between MP4, WebM, MOV&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video splitting&lt;/strong&gt; — trimming specific segments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video merging&lt;/strong&gt; — concatenating multiple clips&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thumbnail extraction&lt;/strong&gt; — grabbing frames at specific timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No server. Inside the browser. The user's file never leaves their device. From a privacy standpoint, that's a genuinely powerful advantage.&lt;/p&gt;
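&lt;p&gt;As a flavor of what this looks like in code, here is a hedged sketch of thumbnail extraction using the v0.12-style &lt;code&gt;ffmpeg.wasm&lt;/code&gt; API (file names and the 3-second timestamp are illustrative). The FFmpeg arguments are the portable part:&lt;/p&gt;

```javascript
// Build the FFmpeg CLI args: seek to a timestamp, then grab exactly one frame
function thumbnailArgs(input, atSeconds, output) {
  return ['-ss', String(atSeconds), '-i', input, '-frames:v', '1', output];
}

// Browser wiring, roughly (assumes @ffmpeg/ffmpeg and @ffmpeg/util, v0.12+):
//
//   import { FFmpeg } from '@ffmpeg/ffmpeg';
//   import { fetchFile } from '@ffmpeg/util';
//
//   const ffmpeg = new FFmpeg();
//   await ffmpeg.load();
//   await ffmpeg.writeFile('input.mp4', await fetchFile(file));
//   await ffmpeg.exec(thumbnailArgs('input.mp4', 3, 'thumb.jpg'));
//   const data = await ffmpeg.readFile('thumb.jpg'); // Uint8Array, ready for a Blob
```

&lt;p&gt;Placing &lt;code&gt;-ss&lt;/code&gt; before &lt;code&gt;-i&lt;/code&gt; uses FFmpeg's fast input seek, which matters when the source file is large.&lt;/p&gt;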




&lt;h2&gt;
  
  
  What I Found in Practice: Potential and Limits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Potential
&lt;/h3&gt;

&lt;p&gt;Performance was better than expected. For short clips, encoding ran roughly 2–5x slower than native — which sounds bad until you remember this is running inside a browser tab. The fact that it works at all is impressive.&lt;/p&gt;

&lt;p&gt;Video analysis in particular ran close to real-time. Being able to extract metadata instantly without uploading the file to a server is something that translates directly into better UX.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Limit: Memory
&lt;/h3&gt;

&lt;p&gt;The biggest constraint was &lt;strong&gt;memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;WebAssembly memory in the browser has hard limits. Feed it a large video file without care and you'll hit an out-of-memory crash. I experienced this firsthand — loading a 1GB file directly killed the tab.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;chunked processing&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Split the file into chunks and process sequentially&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 64MB&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_SIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Write each chunk to the buffer and process&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// WASM processing here&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Splitting large files into chunks and writing them to the buffer sequentially sidesteps the memory issue. The tradeoff is added implementation complexity — but it's manageable.&lt;/p&gt;
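&lt;p&gt;The chunking loop above can also be factored into a small pure helper, which makes the remainder handling explicit (same 64MB chunk size):&lt;/p&gt;

```javascript
// 64MB chunks, matching the snippet above
const CHUNK_SIZE = 64 * 1024 * 1024;

// Pure helper: compute [start, end) byte ranges covering a file of totalSize bytes
function chunkRanges(totalSize, chunkSize = CHUNK_SIZE) {
  const count = Math.ceil(totalSize / chunkSize);
  return Array.from({ length: count }, (_, i) => [
    i * chunkSize,
    Math.min((i + 1) * chunkSize, totalSize), // last range is the remainder
  ]);
}

// A 150MB file yields three ranges; the third one is the 22MB tail
const ranges = chunkRanges(150 * 1024 * 1024);
```

&lt;p&gt;Each range then drives a &lt;code&gt;file.slice(start, end)&lt;/code&gt; call exactly as in the snippet above, so the full file never sits in WASM memory at once.&lt;/p&gt;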




&lt;h2&gt;
  
  
  The WASM Ecosystem in 2026: What Actually Got Better
&lt;/h2&gt;

&lt;p&gt;While evaluating WASM for our platform, I took a broader look at the ecosystem.&lt;br&gt;
The changes from even two years ago are significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safari Finally Caught Up
&lt;/h3&gt;

&lt;p&gt;For years, Safari was the "new Internet Explorer" of the WASM world.&lt;br&gt;
Developers had to write fallback code or avoid features entirely because&lt;br&gt;
Apple consistently lagged behind Chrome and Firefox.&lt;/p&gt;

&lt;p&gt;Safari 18.4 added support for the new Wasm exception spec, and Safari 26.0 introduced a new in-place interpreter for faster startup of large Wasm modules. This has meaningfully closed the cross-browser gap. If you shelved a WASM project a couple of years ago because of Safari compatibility concerns, it's worth revisiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebAssembly 3.0 and WASI Preview 3
&lt;/h3&gt;

&lt;p&gt;WebAssembly 3.0 was announced, bringing a host of new features into the main specification. The Bytecode Alliance has also been adding async support to WASI ahead of the 0.3 release, and Wasmtime now ships experimental WASI 0.3 support.&lt;/p&gt;

&lt;p&gt;The async support in particular is a big deal for video processing use cases.&lt;br&gt;
Previously, long-running operations would block. Native async means cleaner&lt;br&gt;
code and better UX without the workarounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Component Model: Mixing Languages Is Finally Practical
&lt;/h3&gt;

&lt;p&gt;In 2026, the Wasm Component Model has largely solved the problem of mixing libraries from different languages. Developers can now write business logic in Rust, data processing modules in Python, and glue code in JavaScript, compiling them all into composable Wasm components.&lt;/p&gt;

&lt;p&gt;For a video platform this is meaningful. FFmpeg bindings in C, custom&lt;br&gt;
processing logic in Rust, orchestration in JavaScript — these can now&lt;br&gt;
talk to each other without painful FFI layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Providers Are Treating WASM as First-Class
&lt;/h3&gt;

&lt;p&gt;AWS Lambda now supports Wasm functions as a first-class runtime, with benchmarks showing 10-40x improvements in cold start times compared to container-based functions. Google Cloud offers Wasm through Cloud Run, and Azure Functions provides Wasm support through a dedicated preview.&lt;/p&gt;

&lt;p&gt;At SUSECON 2025, Fermyon's CEO demonstrated sub-millisecond cold starts (~0.5ms) for Wasm functions on Kubernetes versus hundreds of milliseconds for AWS Lambda.&lt;/p&gt;

&lt;p&gt;This changes the calculus for server-side processing too.&lt;br&gt;
If you're running video analysis jobs on Lambda, switching to Wasm&lt;br&gt;
could be a serious cost optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging Got Real
&lt;/h3&gt;

&lt;p&gt;One of the biggest frustrations with WASM historically was debugging.&lt;br&gt;
When something went wrong, you were mostly guessing.&lt;/p&gt;

&lt;p&gt;Modern browser DevTools now include DWARF debugging support for WebAssembly. You can set breakpoints in your original source code — Rust, C++, etc. — and step through execution, inspect variables, and view call stacks.&lt;/p&gt;

&lt;p&gt;It's not quite as smooth as debugging JavaScript yet, but it's functional.&lt;br&gt;
This alone makes WASM significantly more approachable for production use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adoption Numbers Back It Up
&lt;/h3&gt;

&lt;p&gt;WebAssembly adoption grew to 5.5% of sites in 2025, driven by AI workloads and performance demands, putting it on a path to becoming a mainstream infrastructure layer.&lt;/p&gt;

&lt;p&gt;That's still a minority, but the trajectory is clear.&lt;br&gt;
The technology is no longer in the "wait and see" category.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Actually Use WASM?
&lt;/h2&gt;

&lt;p&gt;After wrapping up the evaluation, here's where I landed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have CPU-intensive operations (encoding, encryption, image/video processing)&lt;/li&gt;
&lt;li&gt;Sending data to a server is difficult (privacy concerns, large files)&lt;/li&gt;
&lt;li&gt;You want to reuse existing C/C++/Rust libraries on the web&lt;/li&gt;
&lt;li&gt;You need fast computation at the edge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're doing standard UI rendering or form handling&lt;/li&gt;
&lt;li&gt;The data manipulation is lightweight&lt;/li&gt;
&lt;li&gt;JavaScript is already fast enough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core principle: &lt;strong&gt;reach for WASM when JavaScript starts feeling slow.&lt;/strong&gt; Building with WASM from the start is likely over-engineering.&lt;/p&gt;
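&lt;p&gt;"Feeling slow" is worth quantifying before you commit. Here's a minimal best-of-N timing harness in plain JavaScript (no dependencies; the grayscale pass is just an illustrative CPU-bound workload):&lt;/p&gt;

```javascript
// Run fn several times and keep the best wall-clock time, in milliseconds.
// Best-of-N filters out GC pauses and warm-up noise.
function bestOf(fn, iterations = 5) {
  const times = Array.from({ length: iterations }, () => {
    const t0 = performance.now();
    fn();
    return performance.now() - t0;
  });
  return Math.min(...times);
}

// Example: a CPU-bound pixel pass you might consider porting to WASM
const pixels = new Uint8ClampedArray(1_000_000);
const ms = bestOf(() => {
  pixels.forEach((v, i, a) => { a[i] = (v * 299) >> 10; });
});
```

&lt;p&gt;If numbers like this come back in single-digit milliseconds, JavaScript is already fast enough and WASM would be over-engineering.&lt;/p&gt;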




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;By 2026, WASM has crossed the line from "a technology worth watching" to "a technology people are actually using."&lt;/p&gt;

&lt;p&gt;Encoding video in the browser, running ML inference at the edge, handling encryption without a server — these are real things now.&lt;/p&gt;

&lt;p&gt;That said, the memory constraints and other limitations are still real, and WASM isn't the answer to every problem. Think of it as a tool you reach for when JavaScript isn't enough. That's WASM's position in 2026.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why We're Going Back to the Server — The SSR Revival of 2026</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Tue, 24 Mar 2026 06:38:58 +0000</pubDate>
      <link>https://dev.to/as1as/why-were-going-back-to-the-server-the-ssr-revival-of-2026-576i</link>
      <guid>https://dev.to/as1as/why-were-going-back-to-the-server-the-ssr-revival-of-2026-576i</guid>
      <description>&lt;p&gt;In the mid-2010s, web developers started treating server-rendered HTML as something old-fashioned.&lt;/p&gt;

&lt;p&gt;React, Vue, and Angular ushered in the SPA era. "The server just serves the API" became the dominant philosophy. Fast interactions, app-like experiences, clean separation between frontend and backend. Everyone ran in that direction.&lt;/p&gt;

&lt;p&gt;And now, in 2026, we are &lt;strong&gt;quietly but unmistakably&lt;/strong&gt; going back to the server.&lt;/p&gt;




&lt;h2&gt;
  
  
  What SPAs Promised vs. What Actually Happened
&lt;/h2&gt;

&lt;p&gt;The promise of SPAs was clear: load once, navigate without full page refreshes, reduce server load, dramatically improve UX.&lt;/p&gt;

&lt;p&gt;Reality played out a little differently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First load got slower.&lt;/strong&gt; The browser receives an empty HTML shell, downloads hundreds of kilobytes of JavaScript, parses it, executes it — and only then does anything appear. Users stare at a white screen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEO broke.&lt;/strong&gt; If a search engine can't execute JavaScript, it sees an empty page. Even when Google does crawl properly, indexing timing is slower and less reliable than SSR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bundle sizes exploded.&lt;/strong&gt; As features grew, JavaScript bundles ballooned. Initial JS bundles over 500KB–2MB became commonplace. Code splitting helped, but complexity kept climbing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side state management became a nightmare.&lt;/strong&gt; Redux, MobX, Zustand, Jotai... things that would have taken one line on the server ballooned into &lt;strong&gt;tens, sometimes hundreds of lines&lt;/strong&gt; of state management code.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Changed in 2026
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Next.js App Router becoming the default&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The App Router introduced in Next.js 13 has now settled in as the &lt;strong&gt;de facto standard for new projects in 2026&lt;/strong&gt;. React Server Components (RSC) opened a new paradigm: render on the server by default, handle interactivity on the client only where needed.&lt;/p&gt;

&lt;p&gt;You can choose server or client at the component level. Server components don't ship to the bundle at all. You can run database queries directly inside a component — just like PHP used to do. But with the full React ecosystem intact.&lt;/p&gt;
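&lt;p&gt;A minimal sketch of that pattern (the route, table, and column names are illustrative, not TalkWith.chat's actual schema): an &lt;code&gt;async&lt;/code&gt; server component that queries Supabase during render, with no API route in between.&lt;/p&gt;

```javascript
// app/topic/[date]/page.jsx — a server component: this code never ships to the client
import { createClient } from '@supabase/supabase-js';

export default async function TopicPage({ params }) {
  // Server-only env vars are safe to read here
  const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_ANON_KEY);

  // The query runs during server render; the browser receives finished HTML
  const { data: topic } = await supabase
    .from('topics')            // illustrative table name
    .select('title')
    .eq('debate_date', params.date)
    .single();

  // React happily renders a plain string; swap in real JSX in practice
  return topic?.title ?? 'Topic not found';
}
```

&lt;p&gt;Because none of this reaches the bundle, the data-access code adds zero client-side JavaScript.&lt;/p&gt;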

&lt;p&gt;&lt;strong&gt;Server-first architecture in the AI era&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As AI features moved into web apps, the server's role became critical again. LLM API calls, embedding generation, vector search — all of this needs to happen server-side for security alone. In a pure SPA architecture, you'd need a separate backend to handle this safely. With SSR, it's solved naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Web Vitals with real consequences&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once Google started factoring LCP (Largest Contentful Paint), INP (Interaction to Next Paint, which replaced FID in 2024), and CLS into search rankings, slow SPAs started paying a real SEO penalty. SSR — which sends fully-rendered HTML — has a structural advantage on these metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rise of Astro, Remix, and SvelteKit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Frameworks built around "server-rendered by default, client-side only when necessary" have grown fast. Astro in particular set a new performance benchmark with its Islands Architecture: ship zero JavaScript by default.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Misconception: Isn't SSR Slow?
&lt;/h2&gt;

&lt;p&gt;A common pushback: "Doesn't SSR put more load on the server and slow things down?"&lt;/p&gt;

&lt;p&gt;In 2015, that was fair. SSR then meant generating HTML on the server for every single request, with direct cost implications.&lt;/p&gt;

&lt;p&gt;Today it's different.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Edge Runtime&lt;/strong&gt;: Vercel, Cloudflare Workers, and similar platforms run SSR at the edge — the compute node closest to each user. Latency is minimal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming SSR&lt;/strong&gt;: Instead of generating all the HTML before sending anything, the server streams it — ready parts first. TTFB (Time To First Byte) improves dramatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Static Regeneration&lt;/strong&gt;: Pages that don't change often get cached and only regenerate when needed. Static site speed plus dynamic data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: TTFB drops significantly, and users see an almost-instant first screen.&lt;/p&gt;




&lt;h2&gt;
  
  
  SPAs Aren't Dead
&lt;/h2&gt;

&lt;p&gt;To be honest, SSR isn't always the right answer.&lt;/p&gt;

&lt;p&gt;Dashboards, admin panels, real-time collaboration tools — apps with no SEO requirements and complex interactions are still well-suited to SPAs. If users only access the app after logging in, UX responsiveness matters more than initial load time.&lt;/p&gt;

&lt;p&gt;The trend isn't converging on "SSR vs SPA" as a binary. It's converging on &lt;strong&gt;hybrid&lt;/strong&gt;. Public pages rendered server-side, authenticated dashboards running as SPAs — this architecture is becoming the standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Experienced Building With It
&lt;/h2&gt;

&lt;p&gt;I built TalkWith.chat on Next.js App Router.&lt;/p&gt;

&lt;p&gt;The public pages — today's debate topic, the archive, the rankings — are server components that query Supabase directly and send back fully-rendered HTML. No separate API routes, no client-side fetch. The code shrank by nearly half, and LCP improved noticeably.&lt;/p&gt;

&lt;p&gt;Interactive elements — the like button, the opinion submission form — are client components declared with &lt;code&gt;'use client'&lt;/code&gt;. Mixing server and client where each makes sense clicked immediately. It just felt right.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Summary
&lt;/h2&gt;

&lt;p&gt;The return to server-side rendering isn't a step backward.&lt;/p&gt;

&lt;p&gt;The SPA era gave us real experience with the limits of client-first architecture. What's coming back is a more refined form of server rendering. React Server Components, Streaming SSR, Edge Runtime — this isn't your PHP-era server rendering.&lt;/p&gt;

&lt;p&gt;We've arrived at an era where developers can freely choose, component by component, where the boundary between server and client lives.&lt;br&gt;
And &lt;strong&gt;in 2026, the web is making it clear: for most cases, the default should be the server.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the second post in my TalkWith.chat dev log series.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;First post: &lt;a href="https://dev.to/as1as/the-limits-of-vibe-coding-what-nobody-tells-you-after-the-honeymoon-phase-4c44"&gt;The Limits of Vibe Coding — What Nobody Tells You After the Honeymoon Phase&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I built TalkWith.chat solo. It's live — AI Characters debating global topics every day.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ &lt;a href="https://www.talkwith.chat" rel="noopener noreferrer"&gt;https://www.talkwith.chat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




</description>
      <category>webdev</category>
      <category>nextjs</category>
      <category>react</category>
      <category>ssr</category>
    </item>
    <item>
      <title>The Limits of Vibe Coding — What Nobody Tells You After the Honeymoon Phase</title>
      <dc:creator>as1as</dc:creator>
      <pubDate>Tue, 24 Mar 2026 06:21:16 +0000</pubDate>
      <link>https://dev.to/as1as/the-limits-of-vibe-coding-what-nobody-tells-you-after-the-honeymoon-phase-4c44</link>
      <guid>https://dev.to/as1as/the-limits-of-vibe-coding-what-nobody-tells-you-after-the-honeymoon-phase-4c44</guid>
      <description>&lt;p&gt;I built an AI debate platform &lt;strong&gt;solo&lt;/strong&gt; in one week.&lt;/p&gt;

&lt;p&gt;AI Characters, daily auto-generated global debate topics, 72-badge gamification, i18n, a bot runner, cron jobs, an admin panel. All of it. All through vibe coding.&lt;/p&gt;

&lt;p&gt;Let me be honest — vibe coding is what made it possible. I'm not here to trash it.&lt;/p&gt;

&lt;p&gt;But after 100+ commits and real production experience, I've hit enough walls to talk about what vibe coding &lt;em&gt;doesn't&lt;/em&gt; tell you — especially if you're trying to build something beyond a weekend project.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Mean by Vibe Coding
&lt;/h2&gt;

&lt;p&gt;Describing what you want to an AI (Claude Code, Cursor, Copilot), iterating on the output, and shipping — without necessarily reading every line you deploy.&lt;/p&gt;

&lt;p&gt;Andrej Karpathy's original framing was about liberation: just say what you want and let the AI figure it out. And for going from zero to something that actually works, it genuinely delivers.&lt;/p&gt;

&lt;p&gt;The problem starts the moment that "something" needs to be maintained.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limit 1: The AI Doesn't Know What It Doesn't Know
&lt;/h2&gt;

&lt;p&gt;When I asked Claude Code to write my Supabase RLS policies, the output looked perfect and passed local tests. The issue only appeared in production — a permission error triggered when the bot runner executed in a specific pattern.&lt;/p&gt;

&lt;p&gt;The AI had no idea how my bot runner worked. It wrote correct code for the scenario I described, not for the full system it couldn't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern I kept hitting:&lt;/strong&gt; AI produces locally correct code that's globally wrong. It solves the problem you described — not necessarily the problem you actually have.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limit 2: Refactoring Debt Accumulates Fast
&lt;/h2&gt;

&lt;p&gt;Vibe coding is additive by nature. Need a feature? Add it. Bug? Patch it.&lt;/p&gt;

&lt;p&gt;By commit 60, my codebase had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 slightly different patterns for the same Supabase auth check&lt;/li&gt;
&lt;li&gt;API routes that were 80% identical but never abstracted&lt;/li&gt;
&lt;li&gt;A component with 12 props because adding one more was always easier than restructuring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I asked the AI to "clean this up," it would fix the file I showed it — and leave the four other files doing the same thing in slightly different ways completely untouched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vibe coding doesn't refactor. It accumulates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fixing this required me to read the code myself, map the patterns, and give precise instructions. Which is fine — but it's not vibe coding anymore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limit 3: You Can't Debug What You Don't Understand
&lt;/h2&gt;

&lt;p&gt;This one hurt the most. And it was the most embarrassing.&lt;/p&gt;

&lt;p&gt;My bot runner started producing duplicate comments intermittently. The AI had written the execution phases (opinions → likes → comments → attacks), and I deployed it without fully understanding the flow.&lt;/p&gt;

&lt;p&gt;When the bug appeared, I had no mental model of the code. I knew &lt;em&gt;what&lt;/em&gt; it did. I didn't know &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Debugging with the AI looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Describe the symptom&lt;/li&gt;
&lt;li&gt;AI proposes a fix&lt;/li&gt;
&lt;li&gt;Fix doesn't work or breaks something else&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I ran this loop for over an hour. Each suggestion from the AI was plausible. Some actually fixed something — but something else broke. I was going deeper into the hole, and my trust in the codebase was hitting the floor.&lt;/p&gt;

&lt;p&gt;What finally fixed the bug was stopping all prompting entirely and spending 20 minutes reading the code from scratch. I mapped out which phases ran in what order and where an API call could fire twice. The cause became obvious. The fix took 5 minutes.&lt;/p&gt;

&lt;p&gt;Instead of asking the AI "why does this bug exist," I should have &lt;strong&gt;read the code properly from the start.&lt;/strong&gt;&lt;br&gt;
Those 20 minutes made the previous hour a complete waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you can't debug it without the AI, you don't own it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limit 4: Early Architecture Decisions Get Locked In
&lt;/h2&gt;

&lt;p&gt;When I set up i18n with next-intl, I quickly chose the &lt;code&gt;[locale]&lt;/code&gt; dynamic segment approach. The AI scaffolded everything and it worked.&lt;/p&gt;

&lt;p&gt;Three weeks later, when I needed server components to access locale in a specific way, I realized the architecture already had opinions baked in that I hadn't consciously chosen — I'd just accepted the AI's first reasonable answer.&lt;/p&gt;

&lt;p&gt;Changing it would mean touching 40+ files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI optimizes for making the current feature work — not for the architecture you'll want in three months.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a criticism of the AI. It did exactly what I asked. The problem is that "make this work" and "design this well" are different requests, and vibe coding defaults to the first one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limit 5: The Context Window Is a Silent Killer
&lt;/h2&gt;

&lt;p&gt;Every new conversation starts fresh. The AI doesn't remember the 47 decisions you made last week.&lt;/p&gt;

&lt;p&gt;I eventually created a &lt;code&gt;CLAUDE.md&lt;/code&gt; file in the repo — a project state document the AI reads at the start of every session. Here's what actually went into it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stack rules&lt;/strong&gt;: "TailwindCSS v4 only — no separate CSS files"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decisions&lt;/strong&gt;: "All AI calls must use Gemini → GPT-4o Mini failover"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy rules&lt;/strong&gt;: "Always run &lt;code&gt;pnpm build&lt;/code&gt; locally before pushing"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DB migration method&lt;/strong&gt;: "Supabase CLI has auth issues — use Management API directly"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gotchas&lt;/strong&gt;: "Turn off VPN before running the bot runner (can't reach talkwith.chat)"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It helped. Reading this document at the start of every session significantly reduced contradictions with previous decisions. But keeping &lt;code&gt;CLAUDE.md&lt;/code&gt; up to date itself requires discipline. Early on, I'd forget to update it. The AI would make choices that contradicted things we'd "already decided." I'd catch it two days later in a code review.&lt;/p&gt;

&lt;p&gt;So I added a separate &lt;code&gt;history.md&lt;/code&gt;. If &lt;code&gt;CLAUDE.md&lt;/code&gt; is "here's how the project works now," &lt;code&gt;history.md&lt;/code&gt; is "here's what we did and why." Having the AI read both at session start cut down repeated mistakes noticeably.&lt;/p&gt;

&lt;p&gt;One more thing that actually worked: &lt;strong&gt;using Claude Code's Todo feature aggressively&lt;/strong&gt;. Before starting any task, I'd have the AI write a checklist first, then check off each step as it completed. The AI always knew where it was in the flow — which meant far less "going back to something we already finished" on long tasks. The longer the task, the bigger the payoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vibe coding assumes continuity the AI can't provide. You have to build that continuity yourself — in documents.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Vibe Coding Is Actually Great At
&lt;/h2&gt;

&lt;p&gt;I don't want to end on a sour note, because vibe coding genuinely changed what's possible for solo developers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast prototyping&lt;/strong&gt;: From idea to working UI in hours, not days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate elimination&lt;/strong&gt;: Auth flows, CRUD APIs, form validation — the AI handles it and I move on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staying unblocked&lt;/strong&gt;: When I don't know the right API or pattern, I get a working answer in 30 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence to try things&lt;/strong&gt;: The AI makes the learning curve nearly flat, so I built features I'd never have attempted alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TalkWith.chat exists because of vibe coding. Shipping an AI platform with 100+ features solo in a week wasn't really possible before.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Summary
&lt;/h2&gt;

&lt;p&gt;Vibe coding makes &lt;strong&gt;building 10x faster.&lt;/strong&gt;&lt;br&gt;
But maintenance, debugging, and long-term evolution run at &lt;strong&gt;0.5x&lt;/strong&gt; — unless you actively compensate for the limits.&lt;/p&gt;

&lt;p&gt;The developers I've seen struggle most with vibe coding treat it as a complete replacement for engineering judgment. The ones who thrive treat it like a very fast junior developer: incredible output speed, needs direction, can't own the system.&lt;/p&gt;

&lt;p&gt;The system still has to be owned by you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I built TalkWith.chat solo. It's live &amp;mdash; AI Characters debating global topics every day. All the chaos and lessons are going into this series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ &lt;a href="https://www.talkwith.chat" rel="noopener noreferrer"&gt;https://www.talkwith.chat&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
