No team. No budget. Just Go. This is the engineering deep dive behind pastaay, a chaos engine that breaks everything from HTTP headers to physical memory.
1. Why This Exists
I started this project with a simple question: why do all chaos engineering tools stop at the network?
Netflix’s Chaos Monkey kills instances. That’s useful if your failure mode is “pod died.” Gremlin adds CPU spikes. Litmus runs pod level experiments in Kubernetes. They’re all valuable tools. But none of them let you walk into a single Go binary and say: “intercept this gRPC bidirectional stream, drop every third Kafka message, corrupt this MongoDB aggregation pipeline, and while you’re at it, eat 2GB of RAM and burn 4 CPU cores for 30 seconds.” None of them give you one YAML file that defines destruction across seven different protocols plus the OS itself, then deploys that destruction through a CLI with a dead man’s switch, a Kubernetes operator managing CRDs, or an AI that reads your live Prometheus metrics and writes the attack plan for you.
This is the gap pastaay fills. It’s not a wrapper around existing tools. It’s not a YAML generator for litmus experiments. Every interceptor, every operator controller, every JavaScript panel, every line of Go is built from scratch.
The name comes from paSta’ay, a ritual of Taiwan’s Saisiyat people honoring the Da’ai, a tribe of diminutive but uncommonly powerful spirits. Chaos, remembrance, and things that punch above their weight. Seemed fitting.
Let me walk you through how each piece works, with real code, real architecture diagrams, and the real engineering decisions behind them.
2. System Architecture
Before we dive into individual components, here’s how the entire system fits together:
The engine has three entry points, web console, CLI, and Kubernetes operator, all feeding into the same Config Manager. The Manager holds the source of truth: an atomic pointer to the current configuration. Every interceptor reads from this pointer on every request. The file watcher updates this pointer when the YAML changes on disk. The telemetry bus collects events from every layer and feeds them to the web console, Prometheus, and OpenTelemetry.
3. The Policy Engine
3.1 O(1) Lookup Without a Single Heap Allocation
This is the most important performance decision in the entire project. On every HTTP request, literally every request your application serves, pastaay needs to check: “is there an active policy that targets this path?” If that check allocates memory, and you’re serving 50,000 requests per second, you’re creating 50,000 heap allocations per second. The garbage collector will eat you alive.
The solution is a type-indexed cache behind an atomic pointer. Thanks to Go 1.19’s atomic.Pointer, we can do this safely and cleanly without interface{} type assertions:
type Manager struct {
mu sync.Mutex
cfg atomic.Pointer[PastaayConfig]
typedPolicies atomic.Pointer[map[string][]Policy]
startTime time.Time
sensorStatus sync.Map
}
func (m *Manager) GetActivePolicies(policyType string) []Policy {
ptr := m.typedPolicies.Load()
cfg := m.cfg.Load()
if ptr == nil || (cfg != nil && time.Since(m.startTime) < cfg.WarmupDuration) {
return nil
}
return (*ptr)[policyType]
}
When the configuration is updated (by file watcher, webhook, or operator), the Manager takes a write lock once, rebuilds the type-indexed map, and atomically swaps the pointer. Every subsequent read, and there are potentially millions of reads per second, is just an atomic load followed by a map lookup. Zero allocation. Zero lock contention.
The warmup check at the top prevents chaos from triggering during the initial configuration load. If your application starts and chaos policies fire before the server is ready, you get a false-positive outage. The warmup period (default 10 seconds) gives everything time to stabilize.
3.2 Stable Hashing for Policy Identity
Every policy gets a deterministic FNV-1a hash computed from all its fields:
func generateStableHash(p *Policy) uint64 {
// FNV-1a: a non-cryptographic hash that distributes well
// and is deterministic, same input always produces same output.
var h uint64 = 14695981039346656037 // FNV offset basis
const fnvPrime uint64 = 1099511628211
// Hash string fields with field separators to prevent
// "ab" + "c" from colliding with "a" + "bc".
sep := func() { h ^= 0; h *= fnvPrime }
for _, s := range []string{p.Name, p.Target, p.Type, p.ErrorBody, p.StreamRollMode} {
for i := 0; i < len(s); i++ {
h ^= uint64(s[i])
h *= fnvPrime
}
sep()
}
// ... numeric fields, match headers ...
return h
}
Why does this matter? The resource sabotage daemon uses policy hashes for deduplication. If you change a policy’s RAM chunk from 256MB to 512MB, the hash changes, and the daemon kills the old RAM leaker and spawns a new one. If nothing changed, the daemon does nothing. This prevents the classic “restart all goroutines on every config reload” problem.
3.3 SQL Command Normalization
SQL interception has a unique challenge: the same logical query can have different textual representations. SELECT * FROM users and select * from users and SELECT * FROM users are the same query. The CleanSQLCommand function normalizes SQL text by stripping comments, collapsing whitespace, and uppercasing:
func CleanSQLCommand(cmd string) string {
var result strings.Builder
inString := false
var stringChar byte
for i := 0; i < len(cmd); i++ {
char := cmd[i]
if char == '\'' || char == '"' {
// Handle escaped quotes
isEscaped := false
for j := i - 1; j >= 0 && cmd[j] == '\\'; j-- {
isEscaped = !isEscaped
}
if !isEscaped {
if inString && char == stringChar {
inString = false
} else if !inString {
inString = true
stringChar = char
}
}
}
if !inString {
if char == '-' && i+1 < len(cmd) && cmd[i+1] == '-' {
// Skip SQL line comment
for i < len(cmd) && cmd[i] != '\n' {
i++
}
result.WriteByte(' ')
continue
}
}
result.WriteByte(char)
}
return strings.ToUpper(strings.Trim(result.String(), " \r\n\t;()"))
}
This is one of those functions that looks simple but took three iterations to get right. The string aware comment stripping (so inside a SQL string literal doesn't get treated as a comment) was the hardest part. Version 1 used a regex. Version 2 used a simple loop but broke on escaped quotes. Version 3 (this one) is the first that handles all edge cases correctly.
4. The Config Manager
4.1 Atomic Pointer Swap: How Policy Updates Don’t Block Requests
The Manager’s Update method is the only write path. It's protected by a mutex, but that mutex is never contended during normal operation because updates happen rarely (once per config file change). Here's the full method:
func (m *Manager) Update(newCfg *PastaayConfig) {
m.mu.Lock()
defer m.mu.Unlock()
if newCfg != nil {
// Pre-compute all policy metadata
for i := range newCfg.Policies {
p := &newCfg.Policies[i]
// Generate metric tag (type:target, truncated to 64 chars)
tag := p.Type + ":" + p.Target
if len(tag) > 64 {
tag = tag[:61] + "..."
}
p.MetricTag = tag
// SQL regex compilation
if strings.EqualFold(p.Type, "sql") {
// ... compile regex for SQL target matching ...
}
// Compute stable FNV-1a hash
p.PolicyHash = generateStableHash(p)
}
m.cfg.Store(newCfg)
}
// Rebuild type-indexed cache
cache := make(map[string][]Policy)
if newCfg != nil {
for _, p := range newCfg.Policies {
cache[p.Type] = append(cache[p.Type], p)
}
}
m.typedPolicies.Store(&cache)
}
The critical design choice here: the metadata pre-computation happens under the write lock, not on the hot path. Every interceptor calls GetActivePolicies which does nothing but an atomic load and a map lookup. The regex compilation, hash computation, and metric tag generation all happen once during the update, not millions of times during request processing.
5. The Hot Reload Watcher
5.1 Why File Watching Is Harder Than You Think
The naive approach to watching a configuration file:
watcher, _ := fsnotify.NewWatcher()
watcher.Add("pastaay.yaml")
for event := range watcher.Events {
if event.Has(fsnotify.Write) {
reload()
}
}
This breaks in production. Here’s why:
- Vim writes use a temp file then rename. The watcher sees a
RENAMEevent, notWRITE. - Kubernetes ConfigMaps create a symlink swap. The old inode is deleted, a new one appears.
- Atomic editors (like
sed -i) create a new file and rename it over the old one. - Multiple write events fire for a single logical change. Without debouncing, you reload the config 4 times in 10 milliseconds.
Pastaay’s watcher handles all of these:
const debounceWindow = 300 * time.Millisecond
func WatchConfig(filePath string, reloadCallback func(*PastaayConfig)) (cancel func(), err error) {
ctx, cancelCtx := context.WithCancel(context.Background())
watcher, werr := fsnotify.NewWatcher()
if werr != nil {
cancelCtx()
return nil, werr
}
if aerr := watcher.Add(filePath); aerr != nil {
_ = watcher.Close()
cancelCtx()
return nil, aerr
}
var (
timer *time.Timer
timerMu sync.Mutex
fireWG sync.WaitGroup // tracks in-flight reloads so cancel() waits for them
stopFlag = make(chan struct{})
)
// scheduleReload is the single entry point for ALL file events.
// Remove, Rename, Chmod (inode changes), Write, Create, and even
// watcher errors all go through here. This eliminates the race
// condition that existed when Remove and Write had separate paths.
scheduleReload := func(reason string) {
timerMu.Lock()
defer timerMu.Unlock()
// If stopFlag is closed, cancel() has been called.
// Don't schedule anything new.
select {
case <-stopFlag:
return
default:
}
// Reset the debounce timer. Every new event pushes the reload
// 300ms into the future. This prevents 4 rapid reloads when an
// editor saves a file.
if timer != nil {
timer.Stop()
}
timer = time.AfterFunc(debounceWindow, func() {
// Register BEFORE doing any work so cancel() can track us.
fireWG.Add(1)
defer fireWG.Done()
// Re-check: if cancel() was called during the 300ms wait,
// bail out immediately.
select {
case <-stopFlag:
return
default:
}
// Reattach: remove the old inode, then retry Add.
// vim writes a temp file then renames. sed -i creates a
// new file. K8s ConfigMaps swap symlink targets. All of
// these are handled by Remove + Add.
_ = watcher.Remove(filePath)
for attempt := 0; attempt < 10; attempt++ {
if err := watcher.Add(filePath); err == nil {
break
}
select {
case <-stopFlag:
return
case <-time.After(50 * time.Millisecond):
}
}
newCfg, loadErr := LoadConfig(filePath)
if loadErr != nil {
log.Printf("[Pastaay-Config] reload skipped (%s): %v", reason, loadErr)
return
}
// Validate rejects broken YAML. Engine keeps the last valid config.
if vErr := newCfg.Validate(); vErr != nil {
log.Printf("[Pastaay-Config] reload rejected (%s): %v", reason, vErr)
return
}
// Final gate: if cancel() was called during LoadConfig or
// Validate, do NOT invoke the callback. The caller has
// already torn down and a stale config push would be a bug.
select {
case <-stopFlag:
return
default:
}
reloadCallback(newCfg)
})
}
var loopWG sync.WaitGroup
loopWG.Add(1)
go func() {
defer loopWG.Done()
defer watcher.Close()
for {
select {
case <-ctx.Done():
return
case event, ok := <-watcher.Events:
if !ok { return }
// All event types go to the same debounced path.
// No more separate CAS reattach goroutine.
if event.Has(fsnotify.Write) || event.Has(fsnotify.Create) ||
event.Has(fsnotify.Remove) || event.Has(fsnotify.Rename) ||
event.Has(fsnotify.Chmod) {
scheduleReload(event.Op.String())
}
case errEv, ok := <-watcher.Errors:
if !ok { return }
log.Printf("[Pastaay-Config] watcher error (forcing reattach): %v", errEv)
scheduleReload("watcher-error")
}
}
}()
cancel = func() {
close(stopFlag) // signal all goroutines to stop
cancelCtx() // cancel the root context
timerMu.Lock()
if timer != nil {
timer.Stop() // stop any pending debounce timer
}
timerMu.Unlock()
loopWG.Wait() // wait for the event loop goroutine
fireWG.Wait() // wait for any in-flight reload callback
}
return cancel, nil
}
All file events; Write, Create, Remove, Rename, Chmod, and watcher
errors, go through a single scheduleReload path. A 300ms debounce
prevents burst reloads. Inside the callback, watcher.Remove + Add
handles inode changes from vim, sed -i, and Kubernetes ConfigMap
symlink swaps. If LoadConfig fails or Validate rejects, the engine
keeps running with the last valid config.
The critical detail: fireWG.Add(1) happens inside the AfterFunc body
but BEFORE the I/O work. cancel() closes stopFlag, stops the timer,
then calls loopWG.Wait() then fireWG.Wait(). Any callback past the
stopFlag gate completes before Wait returns. Any callback still in
the debounce window sees stopFlag closed and bails. Zero use-after-cancel.
Invalid configurations are rejected. If you write broken YAML, the engine keeps running with the last valid configuration. The error is logged. No crash, no rollback, no surprise.
6. The Interceptor Pattern
6.1 HTTP: The Reference Implementation
Every protocol interceptor in pastaay follows the same skeleton. Here’s the HTTP middleware in full:
func Middleware(mgr *config.Manager) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Step 1: Check if this path is protected (whitelist)
if mgr.IsCommandIgnored("http", r.URL.Path) {
next.ServeHTTP(w, r)
return
}
// Step 2: Get active policies for this protocol (zero alloc atomic load)
policies := mgr.GetActivePolicies("http")
// Step 3: Iterate policies, match path and headers
for i := range policies {
p := &policies[i]
if !matchPath(r.URL.Path, p) || !matchHeaders(r, p.MatchHeaders) {
continue
}
// Step 4a: Latency injection with context cancellation support
if p.LatencyChance > 0 && rand.Float64() < p.LatencyChance {
metrics.InjectedFaultsTotal.WithLabelValues(metricTag, "latency").Inc()
ctx, span := tracing.StartChaosSpan(r.Context(),
"pastaay.http.latency", p.Target, "latency")
timer := time.NewTimer(p.LatencyDuration)
select {
case <-timer.C:
span.End()
case <-ctx.Done():
timer.Stop()
span.End()
w.WriteHeader(499) // 499: Nginx standard for Client Closed Request
return
}
}
// Step 4b: Error injection
if p.ErrorChance > 0 && rand.Float64() < p.ErrorChance {
metrics.InjectedFaultsTotal.WithLabelValues(metricTag, "error").Inc()
_, span := tracing.StartChaosSpan(r.Context(),
"pastaay.http.error", p.Target, "error")
defer span.End()
status := p.ErrorCode
if status == 0 { status = http.StatusInternalServerError }
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(status)
if p.ErrorBody != "" {
io.WriteString(w, p.ErrorBody)
}
return // Stop chain, don't call next handler
}
}
// Step 5: No matching policy with active probability, pass through
next.ServeHTTP(w, r)
})
}
}
6.2 The Wildcard Matching Problem
Path matching isn’t as simple as strings.HasPrefix. Consider:
Policy target: /api/users/*
Request path: /api/users/123 ✓ Match
Request path: /api/usersettings ✗ Should NOT match
The solution requires checking that the character immediately after the prefix is either / or end of string:
func matchPath(reqPath string, p *config.Policy) bool {
// Exact match or wildcard "all" catches everything.
if strings.EqualFold(p.Target, "all") || strings.EqualFold(reqPath, p.Target) {
return true
}
if p.IsWildcard {
reqPathUpper := strings.ToUpper(reqPath)
if strings.HasPrefix(reqPathUpper, p.WildcardPrefix) {
remaining := reqPathUpper[len(p.WildcardPrefix):]
// The character AFTER the prefix must be '/' or end-of-string.
// This prevents /api/users* from matching /api/usersettings.
if strings.HasSuffix(p.WildcardPrefix, "/") ||
len(remaining) == 0 ||
remaining[0] == '/' {
return true
}
}
}
return false
}
The third condition (remaining[0] == '/') is the key: after stripping the wildcard prefix, the next character must be a path separator. This prevents /api/users* from matching /api/usersettings.
6.3 gRPC: Streaming vs Unary
gRPC interception is more complex than HTTP because of bidirectional streaming. A unary RPC that fails returns an error code to the client immediately. A streaming RPC that fails mid-stream leaves the client hanging unless the server actively closes the stream.
The gRPC interceptor handles this by distinguishing between unary and streaming contexts. For streaming, the interceptor wraps the server stream to inject faults on individual messages rather than the entire stream. This allows you to drop message #3 in a stream of 100 messages without affecting the rest, a far more realistic failure mode than killing the entire connection.
7. The SQL Driver Wrapper
7.1 Intercepting database/sql at the Driver Level
SQL interception is fundamentally different from HTTP interception. With HTTP, you wrap a handler. With SQL, you wrap the database driver itself, at the database/sql/driver interface level.
type WrapperDriver struct {
original driver.Driver // the real driver (pgx, sqlite3, mysql, etc.)
cfgManager *config.Manager
}
type WrapperConnector struct {
original driver.Connector // Go 1.10+ connector interface
cfgManager *config.Manager
}
func (d *WrapperDriver) Open(name string) (driver.Conn, error) {
// Apply chaos BEFORE opening: can drop connections entirely
// or add latency before the handshake.
if err := applyConnectionChaos(context.Background(), d.cfgManager); err != nil {
return nil, err
}
conn, err := d.original.Open(name)
if err != nil {
return nil, err
}
// Wrap the real connection so every Query/Exec/Begin gets intercepted.
return &WrapperConn{originalConn: conn, cfgManager: d.cfgManager}, nil
}
Every connection that passes through the wrapper gets a WrapperConn that intercepts Prepare, Exec, Query, Begin, and Commit. Each of these methods checks the policy engine before passing through to the real connection.
The OpenConnector method handles Go 1.10+ driver connector interface, which is what modern database drivers use internally:
func (d *WrapperDriver) OpenConnector(name string) (driver.Connector, error) {
// Go 1.10+ database/sql uses connectors internally.
// We must wrap the connector too, or the driver bypasses our interceptor.
if dc, ok := d.original.(driver.DriverContext); ok {
connector, err := dc.OpenConnector(name)
if err != nil {
return nil, err
}
return &WrapperConnector{original: connector, cfgManager: d.cfgManager}, nil
}
// Fallback for older drivers that don't implement DriverContext.
return &fallbackConnector{driver: d, name: name}, nil
}
7.2 The “Don’t Break Transactions” Rule
SQL chaos has a critical safety constraint: never inject faults inside an active transaction’s lifecycle. Breaking a COMMIT is the worst thing you can do to a database, it leaves locks held, connections stuck, and the application in an undefined state.
SQL chaos checks are applied at connection open and statement execution
points. BEGIN also receives a chaos check, but Commit and Rollback are
passed through without interception. You can still introduce latency (the transaction takes longer), but you can't drop the connection or inject query errors during the transaction body.
This is the kind of constraint that only comes from production experience. Version 1 of the SQL driver didn’t have this protection, and it corrupted test data. Version 2 added it after a very long debugging session.
8. OS Resource Sabotage
This is the feature that makes people do a double take. Most chaos tools stop at the network layer. Pastaay eats your CPU and RAM.
8.1 CPU Starvation: Beating the Compiler Optimizer
func BurnCPU(ctx context.Context, cores int, threshold int) {
if cores <= 0 {
cores = runtime.NumCPU() // saturate all cores by default
}
if threshold <= 0 {
threshold = 100000 // empirically ~95% CPU on a single core
}
var wg sync.WaitGroup
for i := 0; i < cores; i++ {
wg.Add(1)
go func() {
defer wg.Done()
payload := []byte("pastaay-cpu-vector")
var localSink [32]byte // compiler can't optimize this away
for {
for j := 0; j < threshold; j++ {
// SHA-256 is in assembly (crypto/sha256).
// The compiler cannot elide it as dead code.
localSink = sha256.Sum256(payload)
}
// runtime.KeepAlive tells the compiler "localSink is still
// live", prevents the entire loop from being optimized out.
runtime.KeepAlive(localSink)
select {
case <-ctx.Done(): // clean exit when TTL expires or HALT is pressed
return
default:
continue
}
}
}()
}
wg.Wait()
}
The Go compiler is smart. If it detects that your computation has no observable side effects, it eliminates the entire loop. The empty for {} burns CPU in debug builds but vanishes in optimized builds. So we use SHA-256 hashing, a cryptographic operation the compiler cannot elide because it's implemented in assembly via the crypto/sha256 package.
Two subtleties here:
**runtime.KeepAlive(localSink)**, Without this, the compiler might notice that localSink is never read after the loop and eliminate the assignment. KeepAlive is a compiler directive that says "this variable is still live." It's a no op at runtime but prevents dead code elimination.
**select** with **ctx.Done()**, This is the clean escape hatch. The default branch keeps the loop spinning. When the context is cancelled (experiment TTL expires, halt button pressed, policy removed), the goroutine exits immediately. No leaks, no orphans.
The threshold parameter controls intensity through iteration count. Empirically:
- 10,000 iterations = ~30% CPU on a single core
- 100,000 iterations = ~95% CPU
- 500,000 iterations = effectively 100%
8.2 Memory Exhaustion: Page Forcing and Global Ceilings
func LeakRAM(ctx context.Context, chunkMB int, interval time.Duration) {
if interval <= 0 { interval = 1 * time.Second }
if chunkMB <= 0 { return }
ticker := time.NewTicker(interval)
defer ticker.Stop()
chunkSize := chunkMB * 1024 * 1024
var pool [][]byte
defer func() {
atomic.AddInt64(¤tPoolMB, -int64(len(pool)*chunkMB))
pool = nil
runtime.GC()
}()
allocate := func() bool {
if atomic.LoadInt64(¤tPoolMB)+int64(chunkMB) > maxRAMPoolMB {
log.Printf("[Pastaay-Resource] RAM ceiling %dMB reached, refusing new chunk (%dMB)",
maxRAMPoolMB, chunkMB)
return false
}
chunk := make([]byte, chunkSize)
// Page forcing: touch every 4096th byte
for i := 0; i < chunkSize; i += 4096 {
chunk[i] = 1
}
pool = append(pool, chunk)
atomic.AddInt64(¤tPoolMB, int64(chunkMB))
return true
}
allocate()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
allocate()
}
}
}
This function embodies three engineering principles I consider non negotiable for any chaos tool:
1. Page forcing. make([]byte, chunkSize) allocates virtual memory but doesn't commit physical pages. The OS uses demand paging, physical RAM is only assigned when a page is actually touched. By writing to every 4096th byte (the standard page size on both x86 and ARM64), we force the kernel to commit real, physical memory. Without this, your /proc/meminfo looks scary but the kernel hasn't actually allocated anything meaningful.
2. Global ceiling. maxRAMPoolMB is hard coded at 4096MB. I hardcoded this to 4GB because it’s a safe upper bound that prevents catastrophic node eviction (OOMKill) on standard 8GB/16GB cloud worker nodes, while still being devastating enough for any application-level test. No matter what the YAML says, the engine will never allocate more than 4GB across all RAM leaking policies combined. The check is an atomic load compare at the allocation site, if you're already at 3.9GB and request another 256MB chunk, the allocation is refused and logged. The guard module catches unrealistic values at validation time, but the runtime ceiling is the absolute last line of defense.
3. Guaranteed cleanup, the Amnesia Protocol. The defer block zeros the pool slice, subtracts the atomic counter, and triggers runtime.GC(). When the context is cancelled (because the experiment's TTL expired, you clicked the HALT button, or the policy was removed from the YAML), all memory returns to the OS. No restart required. No dangling allocations. The name "Amnesia Protocol" comes from the idea that the engine should forget it ever held that memory, complete, verifiable amnesia.
The resource sabotage daemon MonitorAndTrigger polls policies every 2 seconds, comparing policy hashes to detect changes. This is a state machine with exactly two states: running (with a cancel function) and idle:
func MonitorAndTrigger(ctx context.Context, mgr *config.Manager) {
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
var lastResourceHash uint64 // tracks policy identity across reloads
var activeCancel context.CancelFunc
for {
select {
case <-ctx.Done():
if activeCancel != nil {
activeCancel() // kill any running sabotage goroutines
}
return
case <-ticker.C:
policies := mgr.GetActivePolicies("resource")
// Kill-switch: if all resource policies are removed from YAML,
// immediately stop sabotage and release all memory.
if len(policies) == 0 {
if activeCancel != nil {
activeCancel()
activeCancel = nil
lastResourceHash = 0
}
continue
}
// Compute combined hash of all active resource policies.
// Only restart sabotage if policies actually changed.
var combinedHash uint64
for _, p := range policies {
combinedHash = (combinedHash<<1 | combinedHash>>63) ^ p.PolicyHash
}
if combinedHash != lastResourceHash {
if activeCancel != nil {
activeCancel() // kill old sabotage before starting new
}
lastResourceHash = combinedHash
var chaosCtx context.Context
chaosCtx, activeCancel = context.WithCancel(ctx)
for _, p := range policies {
TriggerResourceSabotage(chaosCtx, buildResourcePolicy(p))
}
}
}
}
}
9. The Blast Radius Guard
The guard is not a gatekeeper, it doesn’t stop you from deploying dangerous policies. It’s an analysis tool that tells you exactly how dangerous your policies are before you deploy them.
func Analyze(cfg *config.PastaayConfig) PlanResult {
res := PlanResult{Issues: make([]string, 0)}
if cfg == nil || len(cfg.Policies) == 0 {
return PlanResult{Status: "SAFE", Score: 0, TotalRisk: 0.0, Issues: []string{}}
}
systemSurvival := 1.0 // multiplicative survival probability
// Core guard: disabled default-ignored = 15% flat risk penalty
if !cfg.EnableDefaultIgnored {
res.Issues = append(res.Issues,
"CORE GUARD DISABLED: System base vulnerability increased.")
systemSurvival *= 0.85
}
targets := make(map[string]string) // detect overlapping policies
for _, p := range cfg.Policies {
// Skip no-op policies (all probabilities zero, no resource effects)
if p.LatencyChance == 0 && p.ErrorChance == 0 &&
!p.DropConnection && p.RAMChunkMB == 0 && p.ThrottleThreshold == 0 {
continue
}
// Factor 1: Scope — how much of the system does this hit?
// 0.4 = targeted endpoint, 1.0 = entire protocol layer
scopeWeight := 0.4
if p.Target == "all" || p.Target == "database" || p.Target == "*" || p.Target == "" {
scopeWeight = 1.0
res.Issues = append(res.Issues,
fmt.Sprintf("[%s] Global Target: Exposes entire '%s' infrastructure layer.",
p.Name, p.Type))
}
// Factor 2: Collision — two policies hitting the same target
// compound. Each overlap adds +0.2 to scope weight.
key := p.Type + ":" + p.Target
if orig, exists := targets[key]; exists {
res.Issues = append(res.Issues,
fmt.Sprintf("[%s] Collision: Overlaps with '%s'. Cascading failure probability increased.",
p.Name, orig))
scopeWeight = min(1.0, scopeWeight+0.2)
}
targets[key] = p.Name
// Factor 3: Fault severity — the worst single effect this policy can cause
maxSeverity := 0.0
if p.DropConnection {
maxSeverity = 1.0 // hard TCP drop = maximum severity
res.Issues = append(res.Issues,
fmt.Sprintf("[%s] Hard TCP Drop: Triggers immediate network circuit-breakers.", p.Name))
}
if p.ErrorChance > 0 {
maxSeverity = max(maxSeverity, p.ErrorChance)
}
if p.LatencyChance > 0 {
latMultiplier := 0.3
if p.LatencyDuration >= 5*time.Second {
latMultiplier = 1.0 // thread pool exhaustion territory
res.Issues = append(res.Issues,
fmt.Sprintf("[%s] 5s+ Timeout: Causes severe thread pool exhaustion.", p.Name))
} else if p.LatencyDuration >= 1*time.Second {
latMultiplier = 0.6
}
maxSeverity = max(maxSeverity, p.LatencyChance*latMultiplier)
}
if p.Type == "resource" {
resSeverity := 0.0
if p.RAMChunkMB >= 1024 {
resSeverity = 0.9 // OOM territory
} else if p.RAMChunkMB >= 256 {
resSeverity = 0.5
} else if p.RAMChunkMB > 0 {
resSeverity = 0.2
}
if p.ThrottleThreshold >= 100000 {
resSeverity = max(resSeverity, 0.8) // near-100% CPU lock
}
maxSeverity = max(maxSeverity, resSeverity)
}
// Policy risk = severity × scope. Multiplicative survival model.
policyRisk := maxSeverity * scopeWeight
systemSurvival *= (1.0 - policyRisk)
}
finalRisk := 1.0 - systemSurvival
res.TotalRisk = finalRisk
res.Score = int(finalRisk * 100.0)
switch {
case res.Score >= 75: res.Status = "CRITICAL"
case res.Score >= 50: res.Status = "HIGH"
case res.Score >= 25: res.Status = "ELEVATED"
default: res.Status = "SAFE"
}
return res
}
The scoring model is multiplicative survival analysis: each policy reduces the “survival probability” of the system by maxSeverity × scopeWeight. Policies with global scope and high severity compound dramatically. Two CRITICAL policies don't add, they multiply.
10. The Telemetry Bus
The telemetry system uses a lock-free circular buffer that can handle concurrent writes from multiple goroutines (one per protocol) and concurrent reads from the web console polling endpoint:
type LogEntry struct {
Pod string `json:"pod"`
Source string `json:"source"`
Name string `json:"name"`
Message string `json:"msg"`
Ts int64 `json:"ts"`
}
var (
mu sync.RWMutex
buf [256]LogEntry
head int
size int
nodeName string
)
func Emit(source, name, msg string) {
mu.Lock()
defer mu.Unlock()
buf[head] = LogEntry{
Source: source, Name: name, Pod: source + "/" + name,
Message: msg, Ts: time.Now().UnixMilli(),
}
head = (head + 1) % 256
if size < 256 { size++ }
}
func Snapshot() []LogEntry {
mu.RLock()
defer mu.RUnlock()
out := make([]LogEntry, size)
start := (head - size + 256) % 256
for i := 0; i < size; i++ {
out[i] = buf[(start+i)%256]
}
return out
}
The ring buffer holds 256 entries. When it fills, older entries are silently overwritten. This is intentional, the telemetry bus is not a persistence layer. If you need long term storage, pipe the logs to your existing observability stack via OpenTelemetry or scrape the Prometheus metrics.
The EmitError and EmitInfo helpers automatically attach OpenTelemetry trace and span IDs when available, creating a direct link between chaos events in the log and spans in your tracing backend:
func EmitError(protocol, target, msg, payload string, span trace.Span) {
logData := map[string]interface{}{
"level": "ERROR", "protocol": protocol, "target": target,
"message": msg, "payload": payload,
}
if span != nil && span.SpanContext().IsValid() {
logData["trace_id"] = span.SpanContext().TraceID().String()
logData["span_id"] = span.SpanContext().SpanID().String()
}
jsonLog, _ := json.Marshal(logData)
Emit(nodeName, protocol, string(jsonLog))
}
This trace correlation was one of the best decisions I made. When you see “Kafka message dropped” in the web console journal, you can copy the trace ID, paste it into Jaeger, and see the entire request journey, which services it touched, where the fault was injected, and what the downstream effects were.
11. The Oracle
11.1 Teaching an LLM to Write Valid Chaos Engineering YAML
The Oracle is the feature that gets the most attention, so let me explain exactly how it works.
The problem: designing good chaos engineering policies requires SRE expertise. Most developers don’t know what failure modes to test, what probability thresholds make sense, or how to combine faults into realistic scenarios. An experienced SRE might spend hours designing a multi vector attack. Most teams never do it at all.
The solution: give an LLM structured context about your running system and let it figure out the attack plan.
But LLMs hallucinate. They invent YAML fields. They use range durations like 5s-15s which are syntactically valid YAML but semantically meaningless. They output error_chance: 0 instead of omitting the field. They generate single policy tests when you need multi vector attacks.
The system prompt for the Oracle is 40 lines of engineering specification that exist to constrain the LLM into producing only valid, deployable output.
The actual system prompt, constructed in Go and sent to the LLM, looks like this:
You are Pastaay Oracle, a Senior Site Reliability Engineering (SRE) AI.
Analyze the provided telemetry and system configuration matrices.
Your ONLY output must be a highly complex, devastating, multi layered
Chaos Engineering blueprint in valid Pastaay V1 YAML wrapped in a markdown
yaml block.
CRITICAL DIRECTIVES:
1. Output ONLY valid Pastaay V1 YAML wrapped in a markdown yaml block
(using triple backticks and yaml specifier). NO conversational text.
2. DO NOT write single policy basic tests. Generate a Multi Vector Attack
containing at least 3 concurrent policies.
3. INTENSITY LEVEL HIGH: Use severe probabilities (0.7-0.9), extreme latency
(3s-8s), and enable drop_connection where appropriate.
4. STRICT SCHEMA RULES:
- NEVER use ranges for durations (e.g., '5s-15s' is ILLEGAL.
Use exactly '5s' or '15s').
- NEVER invent types like 'multi'. Stick EXACTLY to the provided schema.
- FATAL GUARD: For 'resource' policies, NEVER exceed ram_chunk_mb: 512.
- CLEAN YAML RULE: NEVER output error_chance: 0 or latency_chance: 0.
If a probability is 0, completely OMIT the field from the YAML.
- RESOURCE TYPE RULE: For 'resource' policies, completely OMIT
error_chance, error_code, latency_chance, and drop_connection.
They are invalid for OS level sabotage.
Notice the style: imperative, negative constraints, all caps enforcement words (NEVER, ILLEGAL, FATAL GUARD, STRICT, ONLY). I went through four iterations of this prompt. Version 1 was conversational and polite. It produced garbage YAML. Version 2 added schema rules but was too permissive. Version 3 added negative constraints but still allowed ranges. Version 4 (this one) works reliably across all four LLM providers.
I learned something important about prompt engineering: LLMs are not creative collaborators. They are literal instruction followers. If you leave room for interpretation, they will interpret, usually incorrectly. If you say “don’t use ranges,” they might still use them. If you say “NEVER use ranges, this is ILLEGAL,” they comply. The difference in output quality between “avoid X” and “NEVER do X. X is ILLEGAL” is dramatic.
11.2 Live Telemetry Context Injection
The Oracle doesn’t operate in a vacuum. Before calling the LLM, the engine constructs a telemetry matrix from live Prometheus data:
finalPrompt := fmt.Sprintf(
"User Request: %s\n\n--- LIVE TELEMETRY MATRIX ---\n%s",
userPrompt, sysContext,
)
The sysContext contains:
- Active policy count and types
- Prometheus counter values for
pastaay_injected_faults_totalby target and fault type - Currently targeted services
- Sensor health status
This means if you type “stress the payment service,” the Oracle sees that payments is already receiving latency faults but not errors, and generates an attack plan that introduces error injection to the payment endpoints while adding latency to the notification service for a multi vector scenario. It’s not guessing, it’s making data informed decisions.
11.3 Multi Model Architecture
The Oracle supports four providers through a unified interface. OpenAI and DeepSeek share a code path because both use OpenAI compatible APIs:
func callLLM(ctx context.Context, apiKey, model, url, sysPrompt, userPrompt, provider string) (string, error) {
// OpenAI and DeepSeek share the same API shape: Chat Completions.
payload := map[string]interface{}{
"model": model,
"messages": []map[string]string{
{"role": "system", "content": sysPrompt},
{"role": "user", "content": userPrompt},
},
"temperature": 0.5,
}
req, err := buildJSONRequest(ctx, http.MethodPost, url, payload)
if err != nil {
return "", err
}
req.Header.Set("Authorization", "Bearer "+apiKey) // standard OpenAI-style auth
return executeRequest(req, provider, func(b []byte) (string, error) {
var res struct {
Choices []struct {
Message struct {
Content string `json:"content"`
} `json:"message"`
} `json:"choices"`
}
if err := json.Unmarshal(b, &res); err != nil {
return "", fmt.Errorf("%s decode: %w", provider, err)
}
if len(res.Choices) == 0 {
return "", fmt.Errorf("%s: no choices", provider)
}
return res.Choices[0].Message.Content, nil
})
}
Gemini uses a fundamentally different API:
func callGemini(ctx context.Context, apiKey, model, sysPrompt, userPrompt string) (string, error) {
// Gemini uses a different API shape than OpenAI/Anthropic.
// The key goes into a header, NOT the URL query string.
// Query params leak into proxy logs and OTel spans.
url := "https://generativelanguage.googleapis.com/v1beta/models/" +
model + ":generateContent"
// Gemini's system instruction is a top-level field, not a message role.
payload := map[string]interface{}{
"system_instruction": map[string]interface{}{
"parts": map[string]string{"text": sysPrompt},
},
"contents": []map[string]interface{}{
{"parts": []map[string]string{{"text": userPrompt}}},
},
"generationConfig": map[string]interface{}{"temperature": 0.5},
}
req, err := buildJSONRequest(ctx, http.MethodPost, url, payload)
if err != nil {
return "", err
}
req.Header.Set("x-goog-api-key", apiKey) // NOT ?key= in URL
return executeRequest(req, "gemini", func(b []byte) (string, error) {
var res struct {
Candidates []struct {
Content struct {
Parts []struct{ Text string `json:"text"` } `json:"parts"`
} `json:"content"`
} `json:"candidates"`
}
if err := json.Unmarshal(b, &res); err != nil {
return "", fmt.Errorf("gemini decode: %w", err)
}
// Gemini wraps responses deeper than OpenAI.
// Guard against empty responses from rate limits or safety filters.
if len(res.Candidates) == 0 || len(res.Candidates[0].Content.Parts) == 0 {
return "", fmt.Errorf("gemini: no candidates")
}
return res.Candidates[0].Content.Parts[0].Text, nil
})
}
Anthropic uses yet another format:
func callAnthropic(ctx context.Context, apiKey, model, sysPrompt, userPrompt string) (string, error) {
// Anthropic uses its own Messages API, different from both
// OpenAI and Gemini. System prompt is a top-level field.
payload := map[string]interface{}{
"model": model,
"max_tokens": 4096,
"system": sysPrompt,
"messages": []map[string]string{{"role": "user", "content": userPrompt}},
}
req, err := buildJSONRequest(ctx, http.MethodPost,
"https://api.anthropic.com/v1/messages", payload)
if err != nil {
return "", err
}
req.Header.Set("x-api-key", apiKey)
req.Header.Set("anthropic-version", "2023-06-01")
return executeRequest(req, "anthropic", func(b []byte) (string, error) {
var res struct {
Content []struct{ Text string `json:"text"` } `json:"content"`
}
if err := json.Unmarshal(b, &res); err != nil {
return "", fmt.Errorf("anthropic decode: %w", err)
}
if len(res.Content) == 0 {
return "", fmt.Errorf("anthropic: no content")
}
return res.Content[0].Text, nil
})
}
Each model has a carefully chosen default: DeepSeek R1 (deepseek-reasoner) for the Oracle's reasoning heavy workload, GPT-4o-mini for cost efficiency, Gemini 2.5 Flash for speed, and Claude Sonnet for Anthropic's ecosystem.
12. The Web Console
12.1 Vanilla Everything
There is no React. No Node.js. No npm. No webpack. No build step. The entire frontend is embedded in the Go binary:
//go:embed templates/*
var TemplatesFS embed.FS
//go:embed static/*
var StaticFS embed.FS
When pastaay starts, it registers routes that serve these embedded files directly from memory:
func RegisterHandlers(mux *http.ServeMux, mgr *config.Manager) {
tmpl := template.Must(template.ParseFS(TemplatesFS, "templates/*.html"))
mux.Handle("/static/", http.FileServer(http.FS(StaticFS)))
mux.Handle("/console/docs/raw/",
http.StripPrefix("/console/docs/raw/", http.FileServer(http.FS(docs.FS))))
mux.HandleFunc("/console", func(w http.ResponseWriter, r *http.Request) {
renderHTML(tmpl, w, "dashboard")
})
mux.HandleFunc("/console/docs", func(w http.ResponseWriter, r *http.Request) {
renderHTML(tmpl, w, "docs")
})
mux.HandleFunc("/console/builder", func(w http.ResponseWriter, r *http.Request) {
renderHTML(tmpl, w, "builder")
})
mux.HandleFunc("/console/api/state", requireConsoleToken(adminToken, handleState(mgr)))
mux.HandleFunc("/console/api/oracle", requireConsoleToken(adminToken, handleOracle))
mux.HandleFunc("/console/api/plan", requireConsoleToken(adminToken, handlePlan))
mux.HandleFunc("/console/api/rollback", requireConsoleToken(adminToken, handleRollback(mgr)))
mux.HandleFunc("/console/api/probe", requireConsoleToken(adminToken, handleProbe))
mux.HandleFunc("/console/api/metrics", handleMetrics)
mux.HandleFunc("/console/api/discover", requireConsoleToken(adminToken, handleDiscover))
}
The authentication middleware uses constant time comparison to prevent timing attacks on the API token. If PASTAAY_WEBHOOK_TOKEN is set as an environment variable, every API call requires it in the X-Pastaay-Token header. If it's empty, the console is open access, which is fine for local development but should be configured in production.
12.2 The Widget System
The dashboard grid is a proper drag and drop implementation. Each panel is a self contained widget that manages its own data fetching, rendering, and lifecycle. The layout persists in localStorage:
// From widget.js, layout persistence
saveLayout() {
const order = [];
this.grid.querySelectorAll('[data-widget-id]').forEach(el => {
order.push(el.dataset.widgetId);
});
localStorage.setItem('pastaay_layout', JSON.stringify(order));
}
loadLayout() {
const saved = localStorage.getItem('pastaay_layout');
if (!saved) return;
const order = JSON.parse(saved);
order.forEach(id => {
const el = this.grid.querySelector(`[data-widget-id="${id}"]`);
if (el) this.grid.appendChild(el);
});
}
Widgets communicate with the engine through a REST API served by the same Go process. The state endpoint returns the current policy state, telemetry snapshot, and sensor health in a single JSON response. Widgets poll this endpoint at configurable intervals and update themselves. There’s no WebSocket, no Server Sent Events, no complex state management, just HTTP polling that works through any firewall.
A closer look at two of the panels. The Global Fault Velocity chart tracks the total fault injection rate across all targets in real time:
The Blast Radius Matrix breaks down errors, latency spikes, and dropped connections across the most targeted services:
12.3 The Resilience Probe
The resilience probe uses Apdex scoring. Apdex (Application Performance Index) classifies response times into three buckets relative to a configurable threshold T:
- Satisfied: response ≤ T
- Tolerating: T < response ≤ 4T
- Frustrated: response > 4T or error
The Apdex score = (Satisfied + Tolerating/2) / Total. A score of 0.94+ means the target is healthy. Below 0.70 means it’s degrading. Below 0.50 means it’s failing.
The probe sends HTTP requests through a server side proxy to bypass browser CORS restrictions:
func handleProbe(w http.ResponseWriter, r *http.Request) {
var req struct{ URL string `json:"url"` }
json.NewDecoder(r.Body).Decode(&req)
start := time.Now()
// Server-side proxy, bypasses browser CORS restrictions.
// The probe targets services the browser can't reach directly.
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Get(req.URL)
elapsed := time.Since(start).Milliseconds()
result := map[string]interface{}{"elapsed_ms": elapsed}
if err != nil {
result["status"] = 0
result["error"] = err.Error()
} else {
defer resp.Body.Close()
result["status"] = resp.StatusCode
}
json.NewEncoder(w).Encode(result)
}
The probe supports multi target round robin probing, EMA smoothed scoring, and clickable diagnostic popovers that explain each field in plain English.
12.4 Kubernetes Log Streaming
The web console can stream pod logs in real time. The backend uses the Kubernetes Watch API to detect new pods and attach log streams:
// Watch for new pods and stream their logs in real time.
w, err := clientset.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{})
for event := range w.ResultChan() {
pod, ok := event.Object.(*corev1.Pod)
if !ok { continue }
switch event.Type {
case watch.Added, watch.Modified:
// Only attach to Running pods, pending/terminating pods ignored.
if pod.Status.Phase == corev1.PodRunning {
streamCtx, cancel := context.WithCancel(ctx)
activeStreams[pod.Name] = cancel // store cancel for cleanup
go startKubeLogStreamer(streamCtx, clientset, namespace, pod.Name)
}
case watch.Deleted:
// Pod removed — cancel its log stream to free resources.
if cancel, exists := activeStreams[pod.Name]; exists {
cancel()
delete(activeStreams, pod.Name)
}
}
}
Log streams use the Kubernetes Pod Log API with Follow: true. Each line is emitted to the telemetry bus, which feeds the System Output Journal panel in the web console. The journal supports hierarchical filtering (Pod → Protocol → Method), text search, live/pause toggle, and clic -to decrypt payload tracing.
The watch loop reconnects on failure with exponential backoff, capping at 30 seconds. If the watch channel closes (which happens periodically), the loop restarts the watch and reattaches to any running pods.
Here is the full console in action:
13. The Kubernetes Operator
The pastaay Kubernetes operator manages chaos policies as Custom Resources. When you apply a ChaosPolicy CRD to your cluster, the operator detects it, translates it into the engine's configuration format, and pushes it to the target pods through the webhook API:
apiVersion: chaos.pastaay.io/v1
kind: ChaosPolicy
metadata:
name: payment-service-stress
spec:
policies:
- name: http-latency-spike
type: http
target: /api/payments/*
latency_chance: 0.8
latency_duration: 3s
- name: db-connection-drop
type: sql
target: database
drop_connection: true
error_chance: 0.3
The operator controller watches for CRD changes and reconciles the desired state against the engine. It also handles the RBAC necessary for the log streaming feature, a separate ClusterRole that grants pod log read access.
[IMAGE: docs/assets/operator_header.png, the operator architecture header]
14. Zero Allocation
14.1 The Performance Guarantee
Every policy lookup in pastaay is zero allocation.
1. Atomic pointers for configuration. The policy cache is behind an atomic.Pointer. Reading it requires no lock. The pointer swap during updates takes microseconds and doesn't block readers.
2. Type indexed slice cache. Instead of iterating all policies on every request, policies are pre grouped by protocol type. An HTTP request only checks HTTP policies. A Kafka message only checks Kafka policies. The map lookup is O(1).
3. Lock free PCG random number generator. Go’s math/rand/v2 uses the PCG (Permuted Congruential Generator) algorithm, which is lock free and allocation free. Comparing rand.Float64() < p.ErrorChance is a single atomic operation.
4. No intermediate allocations. The policy struct is stack allocated. The path matching uses only stack variables. The context based delay uses time.NewTimer which pre allocates its internal structures. No fmt.Sprintf, no strings.Builder, no temporary slices.
5. OpenTelemetry BatchSpanProcessor. Spans are batched and flushed asynchronously. If your tracing backend is slow or offline, chaos spans queue up in memory and flush later. The critical path (injecting the fault) never blocks on span export.
15. What I Learned
This section is the part you won’t find in the README or the documentation. These are the things that cost me days of debugging, rewrites, and staring at stack traces at 2 AM.
Filesystem watchers will betray you
I spent three days debugging why the watcher stopped working after a Kubernetes ConfigMap update. The ConfigMap gets mounted as a symlink to a directory containing the file. When the ConfigMap changes, Kubernetes replaces the symlink target, which means the old file’s inode is deleted and a new one appears. fsnotify sees this as RENAME followed by nothing, because the watcher was attached to the old inode. The fix took an hour to code and three days to discover was necessary.
I learned more about Linux filesystems from this one bug than from any course or book. Actual knowledge: fsnotify watches inodes, not paths. Vim writes to a temp file and renames. sed -i does the same. Kubernetes ConfigMaps create symlink forests. Your watcher needs to handle all of these, or your tool silently stops working.
LLMs need negative constraints, not positive guidance
My first Oracle prompt was friendly. “Please generate a chaos engineering YAML configuration.” The output: a single HTTP policy with error_chance: 0 and a conversational preamble thanking me for the opportunity.
I tried making it more specific. “Generate a multi vector attack with at least 3 policies.” The output: three policies, but one had type multi (not a valid type), one used latency_duration: 5s-15s (range, not a valid duration), and one had ram_chunk_mb: 2048 which the guard would reject.
The working prompt is imperatives and prohibitions. “NEVER use ranges. NEVER invent types. output_chance: 0 is ILLEGAL. OMIT the field entirely.” The difference in output quality between “please avoid” and “NEVER do X” is the difference between a working tool and a demo that crashes.
This is general advice for anyone building LLM powered tools: don’t be polite. Be precise. Negative constraints are more powerful than positive ones. Tell the model what NOT to do, and exactly why it’s wrong.
Zero allocation is a mindset, not a technique
The first version of the policy engine used sync.Map and allocated a new slice on every lookup. It worked. At 100 requests per second, nobody noticed. At 10,000 requests per second, the garbage collector woke up every 50 milliseconds and ate 30% of CPU.
Rewriting to use atomic.Pointer and type indexed slices took two weeks. The first week was figuring out what to change. The second week was discovering all the places that implicitly allocated, a fmt.Sprintf here, a strings.Join there, a temporary map created for header matching.
The lesson: design for zero allocation from the start. Adding it later is refactoring your entire hot path. The success rate of that kind of refactoring is low. Mine worked because pastaay was small enough at that point. For a larger codebase, it would have been impossible.
Trace correlation closes the observability gap
I almost cut the OpenTelemetry integration from scope. It seemed like a “nice to have” that would add weeks of work for marginal benefit. I was wrong. The first time I traced a single HTTP request through latency injection, to a Kafka consumer that dropped the message, to a Redis hook that returned a timeout, all visible in a single Jaeger trace tree, the entire value proposition of distributed tracing clicked.
Chaos engineering without tracing is guesswork. You inject a fault and observe the outcome, but you don’t know the causal chain. Did the payment service fail because of the latency, or because the notification service dropped a message that the payment service was waiting for? Without tracing, you can’t tell. With trace correlation, every chaos event has a span ID that links it to the specific request journey it affected.
16. What’s Next
Pastaay is not finished. Here’s what’s on the roadmap:
CEL Driven Rule Engine. Google’s Common Expression Language compiles expression strings to ASTs and evaluates them with zero allocation overhead. Imagine writing policies like:
policies:
- name: conditional-latency
type: http
target: /api/*
condition: "request.header('X-Priority') == 'low' && error_chance > 0.5"
latency_duration: 5s
The condition is compiled once at policy load time and evaluated on every request inline with the existing interceptor. This turns pastaay from a probability based chaos tool into a truly programmable chaos platform.
Trace Aware Injection. Using OpenTelemetry Baggage to propagate chaos decisions across service boundaries. If a request enters your system with a specific baggage header, pastaay can inject faults based on the complete distributed request journey, targeting specific end to end transaction flows rather than individual services.
Community Interceptors. The interceptor pattern is designed to be extensible. Adding a new protocol requires implementing a single interface. I’d like to see community contributed interceptors for Cassandra, Elasticsearch, S3, and whatever else people are running.
If you’ve read this far, you now understand pastaay better than most people who will ever use it. This article is my attempt to document not just what the code does, but why it does it that way, the constraints, the failed approaches, the late night realizations.
Pastaay is at github.com/CemAkan/pastaay. It’s a single Go binary. It has no external runtime dependencies. It works on macOS, Linux, and anywhere Go compiles. Everything in this article is backed by real, working, tested code.
I built this alone, from scratch, during my senior year. If you try it and something breaks, open an issue. If you build an interceptor for a protocol I haven’t covered, open a PR. If you use it to find a bug in production before your users do, that’s exactly why it exists.















Top comments (0)