Jorge Luis Rueda Beirana

Integration of sos report in Incident Management Pipelines

Monitoring and observability tools (Grafana, Prometheus, traces, logs) tell you that something is wrong and where. They do not tell you what the host operating system was doing at that moment: which processes were consuming memory, what the kernel OOM killer decided, whether a filesystem was having an I/O contention problem, what the block device queue looked like, what firewall rules were in effect. That data lives on the node, is often ephemeral, and disappears or changes as the system recovers.

The purpose of integrating the widely available open-source sos report Linux command into the pipeline is to capture that OS-level snapshot automatically, at the moment of the alert, before the evidence degrades, without requiring a human to log into the node and collect it manually.
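As a rough illustration of what "capture at the moment of the alert" could look like, the sketch below builds a non-interactive sos report invocation from an alert name. The function names and the alert name are hypothetical, and how the alert reaches the node (webhook, Ansible play, or something else) is left out; only the `--batch`, `--tmp-dir` and `--label` options are taken from the sos report command itself.

```python
import re
import subprocess
from typing import List


def build_sos_command(alert_name: str, tmp_dir: str = "/var/tmp") -> List[str]:
    """Build a non-interactive `sos report` command for one alert."""
    # Reduce the alert name to letters, digits and hyphens, because the
    # label becomes part of the archive file name.
    label = re.sub(r"[^A-Za-z0-9]+", "-", alert_name).strip("-").lower()
    return [
        "sos", "report",
        "--batch",             # run without interactive prompts
        "--tmp-dir", tmp_dir,  # directory where the archive is written
        "--label", label,      # tag the archive with the alert name
    ]


def collect_on_alert(alert_name: str) -> int:
    """Run the collection on the local node (assumes root and sos installed)."""
    result = subprocess.run(build_sos_command(alert_name), check=False)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical alert name, used only for illustration.
    print(" ".join(build_sos_command("NodeMemoryPressure")))
```

Keeping the command construction separate from its execution means the invocation can be reviewed and tested on a machine where sos is not installed.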

More specifically, it achieves four things:

Speed of diagnosis. The data is already collected and analysed by the time the SRE opens the alert. They review findings instead of gathering evidence.

Evidence preservation. Memory state, kernel ring buffer entries, and process tables are ephemeral. Automated collection catches them before the system recovers and overwrites them.

Reduced toil. Manual OS diagnostics during an incident are slow, error-prone, and inconsistent between engineers. Presets make the collection reproducible and automatic.

Completeness. Every incident of the same type produces the same shape of data, making cross-incident comparison and pattern recognition (including by an AI analysis tool) meaningful and reliable.
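The "same shape of data for the same type of incident" idea can be sketched as a fixed mapping from alert type to the sos plugins that get collected. The alert types, the mapping, and the function name below are assumptions made for illustration, not the author's actual presets; the plugin names should be checked against `sos report --list-plugins` on the target distribution. sos also has its own `--preset` option, which a real pipeline might use instead of this dictionary.

```python
from typing import Dict, List

# Hypothetical alert-type -> sos plugin mapping (illustrative only).
PLUGINS_BY_ALERT: Dict[str, List[str]] = {
    "oom": ["memory", "process", "kernel"],
    "disk_io": ["block", "filesys", "kernel"],
}


def command_for_alert(alert_type: str) -> List[str]:
    """Return the same sos report command every time for a given alert type."""
    plugins = PLUGINS_BY_ALERT[alert_type]  # KeyError for unknown alert types
    return [
        "sos", "report",
        "--batch",                            # run without interactive prompts
        "--only-plugins", ",".join(plugins),  # restrict collection to this set
    ]


if __name__ == "__main__":
    for alert_type in PLUGINS_BY_ALERT:
        print(alert_type, "->", " ".join(command_for_alert(alert_type)))
```

Because the plugin set depends only on the alert type, two OOM incidents weeks apart yield archives with the same sections, which is what makes them comparable.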

In short, monitoring tells you the what, tracing tells you the where, and sos report presets tell you the why, and they do so automatically, consistently, and fast enough to be useful during the incident rather than after it.

The best part is that you do not need to install anything.

If you would like to know how this can be done, this article contains detailed instructions for a concrete production environment involving Kubernetes, Grafana and Ansible.

Visit sos-vault for a complete reference on how to use the sos report command effectively.
