Adam Leskis

Posted on Jun 10 • Edited on Jun 27

Automating AI Out of a Job

#cli #learning #linux #performance

A deliberately provocative title, to be sure, but it's a good way to frame the following discussion. A big theme in the industry these days is whether AI is going to replace junior engineers, and whether current juniors can actually learn anything when they use AI in the course of their roles.

This article is going to flip the script on those upstart AIs though, and get them to create some tools to replace the need to constantly query them, so we can keep learning without burning down 5 rainforests per hour!

For a video walkthrough of these tools, check here

Current situation for learning things

And actually, this isn't a problem just isolated to junior engineers!

Both new and experienced engineers still have a need to learn about systems, and even though AI (more specifically LLMs) can seem like a one-stop shop in terms of learning resources, they have a number of drawbacks:

The best ones cost money (unequal access)
Their output is not necessarily correct (evaluating it requires background knowledge, the lack of which is the exact reason for trying to learn via this medium)
The map is not the territory (it facilitates high confidence but without the requisite experience...or, said another way, knowing that servers might need a reboot at 3AM isn't the same as actually having done that at least once)
The data has to flow through a central institution's data centers

But what if instead of just asking AI, we built our own tools to help ourselves learn what we want to learn?

In contrast to using AI FOR learning, AI-built learning tools have the following advantages:

They’re free to run
Using real tools in real systems will expose users to real conditions (scaffolding can further enhance the acquisition of requisite background knowledge)
These tools can be structured around running Linux environments, which are themselves the territory
No data necessarily leaves the user’s system
They're infinitely customizable and shareable via the open-source ecosystem

Julia Evans (and other folks) have done some stuff like this, for example, https://messwithdns.net. So I decided to do something similar, and the rest of the article will detail my experiences with vibe-coding my own self-scaffolding learning materials.

Case Study: I wanted to learn about some Linux perf

I'm coming from basically zero experience with running the Linux performance tooling to investigate an active Linux system, and this is what informed the approach I took to teach myself more about this domain.

There's clearly...a lot here.

And I wasn't trying to learn everything there is to know about Linux perf. Brendan Gregg (alias: the God of Linux perf) has some good guidance in this article about the first 60,000 milliseconds on a server.

TL;DR - run these:

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
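Several of these commands keep printing until you interrupt them (the trailing 1 is a one-second interval). If you want to run the whole checklist unattended, here is a minimal sketch that appends a sample count where one is needed. The helper names (bounded, run_checklist) and the default of 5 samples are my own choices for illustration, not something from Brendan Gregg's article.

```python
import subprocess

# The checklist from the article, as listed above.
CHECKLIST = [
    "uptime",
    "dmesg | tail",
    "vmstat 1",
    "mpstat -P ALL 1",
    "pidstat 1",
    "iostat -xz 1",
    "free -m",
    "sar -n DEV 1",
    "sar -n TCP,ETCP 1",
    "top",
]

# Tools that accept "<interval> [count]" and repeat forever without a count.
INTERVAL_TOOLS = {"vmstat", "mpstat", "pidstat", "iostat", "sar"}


def bounded(command: str, samples: int = 5) -> str:
    """Return a variant of `command` that exits on its own."""
    words = command.split()
    if words[0] == "top":
        # Batch mode with one iteration prints a single snapshot and exits.
        return "top -b -n 1"
    if words[0] in INTERVAL_TOOLS and words[-1] == "1":
        # Append a sample count after the one-second interval.
        return f"{command} {samples}"
    return command


def run_checklist(samples: int = 5) -> None:
    """Run every command in order and print whatever it produced."""
    for command in CHECKLIST:
        print(f"$ {command}")
        # shell=True so that the pipe in "dmesg | tail" works.
        result = subprocess.run(
            bounded(command, samples), shell=True, capture_output=True, text=True
        )
        print(result.stdout or result.stderr)
```

Calling run_checklist() takes roughly 25 seconds with the defaults, since five of the commands each collect five one-second samples. Some of the tools come from the sysstat package and may not be installed by default.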

Alright, so that all sounds good, but how can I teach myself not just to memorize these commands, but also to get more experience interpreting their output to diagnose and fix a running system?

And just to contextualize the whole point of the exercise, I want to eventually be able to go from some sort of alert in our system to actually fixing the issue. It's possible this could surface from a regression or even just be an exploration of whether there's an opportunity to tune a system for better performance.

But I didn't start with trying to create one tool to guide me through the entire end-to-end process of doing that. I thought I'd start at the start, and make sure that I understood the basic concepts in Linux performance.

What are the main constraints in a system, and what's the difference between workload characterization (a top-down approach) and resource characterization (eg, the USE methodology, which is a bottom-up approach)? The tools presented here are for the USE methodology, but I'm hoping to make even more stuff later to do something a bit more top-down...stay tuned!

So with the much narrower focus on "what are the commands to run, and what does the output of the commands mean?", we're ready to start making some tools to help us practice that!

A great pattern: using AI to create tools that investigate cloud environments, instead of giving the AI direct access

This is similar to how you would (normally) use AI to help debug issues in production environments: rather than giving the AI direct access, you get it to write a script that will answer the questions you have. Maybe this is "What are the current workloads running?", or "Why have our storage costs gone up 10x in the last month?".

The main point is that the script that AI outputs is very quick to audit/review and then you can run that as many times as you want, parameterized for different environments, without any tokens burned.

You only burn tokens once to create your learning assistant, then you can run the tool/script as much as you want.
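To make the pattern concrete, here is the kind of small, read-only script you might ask for. It's a hypothetical stand-in for the storage question above: instead of a cloud bill, it answers "which top-level directories under this path hold the most data?". The function names and the idea of passing the root path as the per-environment parameter are assumptions for illustration.

```python
import os
from pathlib import Path


def directory_sizes(root: str) -> dict[str, int]:
    """Total bytes of regular files under each immediate subdirectory of `root`."""
    sizes = {}
    for child in Path(root).iterdir():
        if not child.is_dir() or child.is_symlink():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(child):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # do not follow or count symlinks
                try:
                    total += os.path.getsize(path)
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        sizes[child.name] = total
    return sizes


def largest(root: str, top: int = 5) -> list[tuple[str, int]]:
    """The `top` biggest subdirectories of `root`, largest first."""
    ranked = sorted(directory_sizes(root).items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

The whole thing fits on one screen, only reads metadata, and can be pointed at a different path per environment, for example largest("/var", top=3). That is what makes it quick to audit and cheap to re-run.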

You can do the same thing for learning tools. AI can scaffold it out super quick and get an MVP up and running for iterative feedback.

Quick digression on learning objectives and materials design

It's obviously not as simple as "make me a tool to learn linux perf". Creating something useful involves a lot more things like:

identification of learning objectives
differentiation of learning taxonomy goals (eg, Bloom's Taxonomy)
principles of psychometric assessment and construct validity
item analysis (if your thing uses multiple-choice questions, like mine does) and how to evaluate distractors
etc...

This is no doubt the trickiest part of the whole enterprise, and I don't mean to downplay it, but it's a much (like, MUCH) bigger discussion than we have the space to really cover here.

About 10 years ago, I gave a presentation delving a bit more into these things in the context of apps that teach languages, if you want a slightly more in-depth exploration.

Use-tool

https://github.com/lpmi-13/use-tool

I wanted to start at a very basic level, which was teaching myself:

what are the commands to run for each of the 4 resource types involved in Brendan Gregg's USE Methodology?
what does the output of those commands mean (eg, which columns mean what)?
how is the output of the commands related to the actual state of an observed system (eg, which output means high cpu utilization)?
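One way to picture the first question is as a lookup table from resource and USE dimension to commands. The sketch below is my own rough mapping, built from the checklist earlier in the post plus a few common commands. It is not taken from use-tool's source, and the empty entry for CPU errors just means I'm not listing a simple command for it here.

```python
# Rough, non-authoritative mapping: resource -> USE dimension -> commands.
USE_COMMANDS = {
    "cpu": {
        "utilization": ["mpstat -P ALL 1", "vmstat 1"],  # per-CPU and overall us/sy/id
        "saturation": ["vmstat 1", "cat /proc/loadavg"],  # r column, load averages
        "errors": [],  # nothing simple listed in this sketch
    },
    "memory": {
        "utilization": ["free -m"],
        "saturation": ["vmstat 1", "cat /proc/pressure/memory"],  # si/so, PSI
        "errors": ["dmesg | tail"],  # OOM killer messages
    },
    "disk": {
        "utilization": ["iostat -xz 1"],  # %util
        "saturation": ["iostat -xz 1"],  # queue size and await columns
        "errors": ["dmesg | tail"],  # kernel-logged I/O errors
    },
    "network": {
        "utilization": ["sar -n DEV 1"],  # rx/tx throughput per interface
        "saturation": ["sar -n TCP,ETCP 1"],  # retransmits
        "errors": ["sar -n EDEV 1", "ip -s link"],  # per-interface errors and drops
    },
}


def commands_for(resource: str, dimension: str) -> list[str]:
    """Commands worth trying for one resource and one USE dimension."""
    return list(USE_COMMANDS[resource][dimension])
```

For example, commands_for("memory", "saturation") returns the two commands that expose swapping activity and memory pressure.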

So for the first part, I wanted to just give myself practice typing the commands and quizzing myself on the output. That's what use-tool guide cpu does:

=== CPU — Utilization, Saturation, Errors — guided walkthrough ===
Investigate CPU using Brendan Gregg's USE method.
Run commands at the prompt; the harness captures their output
and asks specific questions about what you saw.

Detected system: 8 logical CPUs.
At each step, run the suggested command (or an alternative if shown). Type `skip` to move on, `exit` to quit.
During a check, answer with a number; use `$ <command>` to inspect more data first.

--- Step 1/4: loadavg ---
Step 1: Load averages and recent run-queue counters give a coarse
picture of CPU pressure over 1-, 5-, and 15-minute windows.
Suggested: cat /proc/loadavg
[guide] $
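If you're wondering what's actually in /proc/loadavg, it's a single line with five fields: the 1-, 5-, and 15-minute load averages, a runnable/total count of scheduling entities, and the most recently created PID. Here's a minimal sketch of reading it and comparing against the CPU count. The names are mine, and the sample line in the usage note is made up.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float      # 1-minute load average
    five: float     # 5-minute load average
    fifteen: float  # 15-minute load average
    runnable: int   # currently runnable scheduling entities
    total: int      # scheduling entities that currently exist
    last_pid: int   # PID of the most recently created process


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the single line found in /proc/loadavg."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return LoadAvg(
        float(one), float(five), float(fifteen),
        int(runnable), int(total), int(last_pid),
    )


def load_per_cpu(load: LoadAvg, cpus: int) -> float:
    """1-minute load divided by logical CPU count; above 1.0 hints at queuing."""
    return load.one / cpus
```

With a made-up line like "6.00 3.20 1.10 9/812 40213" on the 8-CPU system from the walkthrough, load_per_cpu gives 0.75. Treat the ratio as a coarse hint only: on Linux the load average also counts tasks in uninterruptible sleep (often waiting on I/O), so it isn't purely a CPU number.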

You can literally just type along with the suggestions, and it asks you questions to check understanding of the meaning of different values in different columns. For example, in the guided memory mode, you see things like:

--- Check ---
In `free -h` output, what does the `available` column report for the `Mem:` row?
  1. Free memory plus free swap
  2. Memory currently held by processes that could be killed cleanly
  3. Completely unused memory only
  4. The kernel's estimate of memory usable for new allocations without swapping (including reclaimable cache)
Choice: 4

--- Feedback ---
Your answer: The kernel's estimate of memory usable for new allocations without swapping (including reclaimable cache)
Result: correct
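The available column that this question asks about corresponds to the MemAvailable field in /proc/meminfo. As a small sketch (function names are mine, and the numbers in the usage note are invented), you can read the same estimate straight from the kernel:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo into {field: value}. Most values are in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # The first token is the number; an optional "kB" unit follows.
            values[key.strip()] = int(fields[0])
    return values


def available_percent(meminfo: dict[str, int]) -> float:
    """MemAvailable as a percentage of MemTotal."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]
```

On a live system you would pass in open("/proc/meminfo").read(). With MemTotal at 16000000 kB and MemAvailable at 8000000 kB, available_percent returns 50.0.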

For a bit more of a free-form exploration, you can use the use-tool practice memory mode (here investigating memory rather than the CPU from the first example):

=== Memory — Utilization, Saturation, Errors — practice mode ===
Investigate memory using Brendan Gregg's USE method.
Run commands at the prompt; the harness captures their output
and asks specific questions about what you saw.

Detected system: 8 logical CPUs.
Shell commands run on this live system.
Builtins: `report` (snapshot of what you've gathered), `commands` (cheatsheet),
          `diagnose` (check the system's USE state from what you saw), `help`, `exit`.

[practice] $

You get dropped into something that looks like a normal shell session, but prefixed with [practice], and you can run any commands you want. Then, when you're ready to see a high-level report of the findings, you can run report to see what the tool has captured and from which commands:

[practice] $ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       6.9Gi       784Mi       527Mi       7.7Gi       7.4Gi
Swap:          2.0Gi       2.0Gi        44Mi
[practice] $ report

============================================================
Captured from: free -h; ifconfig

Utilization
  Memory used                    50.7%  (7.6 / 15.0 GiB used (not counting reclaimable cache) — from rounded `free -h`)
  Memory available               7.40 GiB  (from rounded `free -h`)
  Cache + buffers                7.70 GiB  (reclaimable — from rounded `free -h`)
  Swap used                      97.9%  (2.0 / 2.0 GiB — from rounded `free -h`)

Not captured (no data yet):
  vmstat si (swap-in)
  vmstat so (swap-out)
  PSI memory some (avg10)
  PSI memory full (avg10)
  OOM events in dmesg

[practice] $
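The percentages in that report can be reproduced by hand from the free -h output. Here's a sketch of the arithmetic, assuming "memory used" means total minus available and "swap used" means total minus free. Those formulas match the numbers shown, but I haven't confirmed that this is how use-tool computes them, and the function names are mine.

```python
SAMPLE = """\
               total        used        free      shared  buff/cache   available
Mem:            15Gi       6.9Gi       784Mi       527Mi       7.7Gi       7.4Gi
Swap:          2.0Gi       2.0Gi        44Mi
"""

# Binary suffixes printed by `free -h`, expressed as a factor relative to GiB.
UNITS_IN_GIB = {"Ki": 1 / 1024**2, "Mi": 1 / 1024, "Gi": 1.0, "Ti": 1024.0}


def to_gib(value: str) -> float:
    """Convert a human-readable size such as '784Mi' to GiB."""
    for suffix, factor in UNITS_IN_GIB.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    if value.endswith("B"):  # plain bytes, e.g. "0B"
        return float(value[:-1]) / 1024**3
    raise ValueError(f"unrecognized size: {value!r}")


def parse_free(text: str) -> dict[str, dict[str, float]]:
    """Parse `free -h` output into {row: {column: GiB}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = {}
    for line in lines[1:]:
        label, *values = line.split()
        # The Swap row has only total/used/free, so zip stops after three.
        rows[label.rstrip(":")] = {name: to_gib(v) for name, v in zip(header, values)}
    return rows


def summarize(rows: dict[str, dict[str, float]]) -> dict[str, float]:
    """Utilization percentages derived from the parsed rows."""
    mem, swap = rows["Mem"], rows["Swap"]
    summary = {"mem_used_pct": 100 * (mem["total"] - mem["available"]) / mem["total"]}
    if swap["total"] > 0:
        summary["swap_used_pct"] = 100 * (swap["total"] - swap["free"]) / swap["total"]
    return summary
```

Running summarize(parse_free(SAMPLE)) and rounding to one decimal gives 50.7 for memory ((15 - 7.4) / 15) and 97.9 for swap ((2048 - 44) / 2048 in MiB). Because free -h rounds its figures, the results are only as precise as that rounding, which is presumably why the report labels them as coming from rounded output.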

In the report above, the "Captured from" line shows that the tool recorded me running ifconfig, which has nothing to do with memory. The tool doesn't care which commands you run: you still see the regular stdout output, and it captures only the relevant output for the report and for the later diagnose step, which asks questions like this:

=== Diagnose: Memory — Utilization, Saturation, Errors ===
Signals you've observed this session:
  [1] Memory available = 7.40 GiB  (from rounded `free -h`)
  [2] Swap used = 97.9%  (2.0 / 2.0 GiB — from rounded `free -h`)
  [3] vmstat si (swap-in) = 0, 0  (max 0, mean 0)
  [4] vmstat so (swap-out) = 0, 0  (max 0, mean 0)
  [5] PSI memory some (avg10) = 0%
  [6] PSI memory full (avg10) = 0%
  [7] OOM events in dmesg = 134/134 lines mention OOM

For each USE dimension, give your verdict, then cite the evidence
that supports it (the bracketed numbers above, comma-separated; `none` if none).

Utilization — your verdict?
> 1. high
  2. moderate
  3. low
  4. not enough data
↑/k ↓/j move | 1-4 jump | Enter choose | q quit | ? help

You then have to select supporting evidence from the data you captured. After submitting, you get some nice feedback on whether your decisions are actually supported by the evidence:

--- Diagnosis feedback ---                                                                                                              

Utilization                                                                                                                             
  Verdict:    low                                                                                                                       
  Assessment: supported, but thin — strong claims want a second, independent signal                                                     

  Evidence:                                                                                                                             

    ✓ Memory available                                                                                                                  
      Supports this verdict.                                                                                                            
      Observed: 7.40 GiB (from rounded `free -h`)                                                                                       
      Why: available memory in the low-GiB range means real pressure (Linux already                                                     
      counts reclaimable cache toward available)                                                                                        

  Other relevant evidence you captured:                                                                                                 
    • Swap used                                                                                                                         
      Reads "high" and would point against this verdict.                                                                                


  To gather more supporting evidence:                                                                                                   
    • cat /proc/meminfo                                                                                                                 
      Canonical kernel memory accounting in kB.                                                                                         

  Note:                                                                                                                                 
    Strongest utilization signal reads "high".
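Two of the signals listed in the diagnose step, "PSI memory some" and "PSI memory full", are pressure stall information, which the kernel exposes in /proc/pressure/memory on systems with PSI enabled. Whether use-tool reads that exact file is my assumption. Here's a minimal sketch of parsing its two-line format, with a function name of my own choosing.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse a /proc/pressure/<resource> file into {"some": {...}, "full": {...}}.

    avg10/avg60/avg300 are percentages of time stalled over 10, 60, and
    300 seconds; total is cumulative stall time in microseconds.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result
```

A "some" value means at least one task was stalled waiting on memory for that share of time, while "full" means all non-idle tasks were stalled at once. Non-zero avg10 values are a more direct saturation signal than swap usage alone, since swap can stay full long after the pressure has passed.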

The point is to have two different modes, each targeting a slightly different learning outcome.

The guide mode is mainly for the identification of what the columns mean and what the numbers in the columns represent.

The practice mode is mainly for understanding how the results of a command indicate the status of a system. It asks you, the user, to apply what you saw in the tool output to assess the state of the system.

Use-practice

https://github.com/lpmi-13/use-practice

Okay, these names are kind of annoying, but I liked the focus on USE (Utilization/Saturation/Errors), so that's why they're named that.

This tool is for providing a ready-made environment for the use-tool to actually interpret various states of a Linux server. If you only have your local system to run it on, that's probably not very interesting.

So we want to generate some load so we can test out our observations and conclusions from the earlier use-tool. Even better if we can do that on a remote ephemeral system like iximiuz Labs where it doesn't matter if we put extra load on a system. If anything goes wrong, we can just destroy the VM.

It has a random mode that provides varying resource loads, so there are no clues about where to go with the investigation. It also has a slightly more scaffolded mode where you provide a resource type (cpu/memory/disk/network), and the tool starts a random type of load for that resource (this is basically either utilization or errors). It's also possible to have the tool reveal the particular type of resource load, which provides a slightly different type of scaffolding.
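The selection logic behind those modes can be sketched in a few lines. This is a hypothetical illustration of the idea, not use-practice's actual code: the names pick_scenario and describe, and the exact strings, are made up.

```python
import random
from typing import Optional, Tuple

RESOURCES = ("cpu", "memory", "disk", "network")
LOAD_KINDS = ("utilization", "errors")


def pick_scenario(
    resource: Optional[str] = None, rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Choose a (resource, kind) pair.

    With no resource given, this is the fully random mode. With a resource
    given, only the kind of load is random (the scaffolded mode).
    """
    rng = rng or random.Random()
    if resource is None:
        resource = rng.choice(RESOURCES)
    elif resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    return resource, rng.choice(LOAD_KINDS)


def describe(scenario: Tuple[str, str], reveal: bool = False) -> str:
    """What the learner is told: everything if revealed, otherwise nothing."""
    resource, kind = scenario
    return f"{resource} {kind}" if reveal else "load running (type hidden)"
```

Passing a seeded random.Random makes a session repeatable, which is handy if you want to retry the same scenario after getting a diagnosis wrong.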

Combined with the use-tool (which also has a use-tool practice system mode for running all the USE commands when you don't know which resource might be the constraint), you can safely explore different load profiles and test out your ability to reason about the output of different Linux perf commands on a real live system.

Where to run these things

I mentioned this a bit earlier, and if you've read any of my other recent posts, you know exactly where this is going...

I designed the use-tool so that you could run it on your own system if you want (or any random system). It's load-agnostic, and a very quiet system is just another valid state to explore, so analyzing and interpreting that is still a reasonable and authentic use case.

However, if we want to generate random load without stressing our own systems, it's gotta be iximiuz Labs. You get ephemeral VMs where you can configure as much load as you want, and then you run the use-tool to see if you can find out what's going on.

And for a preloaded lab with the tooling already installed and ready to go, you can check out a free custom playground I made (requires a GitHub account):

https://labs.iximiuz.com/playgrounds/use-practice-4ce4816f