LaTerral Williams

πŸ› οΈ Mastering Text Processing in Linux with `awk`, `sed`, and Friends

In a Linux environment, working with text-based files is part of the daily workflow. Whether you're parsing logs, cleaning structured data, or automating system reporting, tools like awk, sed, grep, find, and their counterparts are indispensable.

This article walks you through a realistic scenario so you can learn and apply these powerful tools step by step.


πŸ“š Table of Contents

  • Scenario: Audit and Clean a Login Activity Log
  • Step 1: Set Up the Practice Environment
  • Step 2: Find the Log Files (find)
  • Step 3: Filter Failed Logins (grep)
  • Step 4: Remove Comments (sed)
  • Step 5: Extract IP Addresses (awk)
  • Step 6: Cut Out Usernames (cut)
  • Step 7: Remove Duplicates (sort + uniq)
  • Step 8: Normalize Case (tr)
  • Step 9: Chain Commands (xargs)
  • Full Workflow Example: Cleaned Login Report
  • Command Cheat Sheet
  • Summary


πŸ§‘β€πŸ’» Scenario: Audit and Clean a Login Activity Log

You're a junior system administrator tasked with auditing a collection of login activity logs. These files:

  • Are spread across multiple directories
  • Contain both successful and failed logins
  • Include redundant entries
  • Use a structured key=value format common in system-generated logs

Your goal: extract meaningful data, clean it, and produce a simple report.


πŸ”§ Step 1: Set Up the Practice Environment

To simulate a real-world situation, you'll create a set of log files to work with.

πŸ“ Create a Directory

mkdir -p ~/log_audit/logs/2025
cd ~/log_audit/logs/2025

πŸ“ Create Sample Log Files

cat <<EOF > log1.txt
[2025-06-01 08:12:55] LOGIN: user=john src=192.168.1.10 status=SUCCESS
[2025-06-01 08:13:02] LOGIN: user=mary src=192.168.1.15 status=FAIL
[2025-06-01 08:13:02] LOGIN: user=mary src=192.168.1.15 status=FAIL
[2025-06-01 08:15:42] LOGIN: user=alice src=10.0.0.5 status=SUCCESS
EOF

cat <<EOF > log2.txt
# Generated by login system
[2025-06-01 08:18:22] LOGIN: user=bob src=192.168.1.20 status=FAIL
[2025-06-01 08:19:00] LOGIN: user=john src=192.168.1.10 status=SUCCESS
EOF

These files mimic logs from a PAM-enabled login system. Each entry contains a timestamp, user, source IP, and login result.
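
A quick sanity check that both files landed as expected (log2.txt still contains its comment line):

wc -l *.txt

This should report 4 lines for log1.txt and 3 for log2.txt.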



πŸ”Ž Step 2: Find the Log Files (find)

find ~/log_audit/logs -name "*.txt"
  • find: Searches for files in a directory tree.
  • -name "*.txt": Filters for .txt log files.

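You can tighten the search with additional tests; for example, -type f restricts matches to regular files, and -mtime -7 (not needed for this exercise, just a common variant) keeps only files modified within the last week:

find ~/log_audit/logs -type f -name "*.txt" -mtime -7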


πŸ” Step 3: Filter Failed Logins (grep)

grep "status=FAIL" *.txt
  • grep: Searches for patterns in text.
  • "status=FAIL": Finds entries for failed login attempts.

Use the -i flag for case-insensitive matching:

grep -i "status=fail" *.txt

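grep can also count instead of print. To tally failures per file:

grep -c "status=FAIL" *.txt

With the sample data this reports 2 for log1.txt (the duplicated mary entry) and 1 for log2.txt.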


πŸͺ„ Step 4: Remove Comments (sed)

sed '/^#/d' log2.txt
  • sed: A stream editor for filtering and transforming text.
  • /^#/d: Removes lines that begin with # (comments or metadata).

To update the file directly:

sed -i '/^#/d' log2.txt

Without -i, sed only writes the edited stream to standard output and leaves the file untouched; the -i flag makes it rewrite the file in place.


Be cautious with in-place deletion: a line that starts with # may carry meaningful information that was commented out intentionally, for example:

#2025-06-01: Backup completed
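
A safer habit when editing in place is to give -i a suffix so sed keeps a backup of the original (GNU sed syntax shown; BSD/macOS sed expects the suffix as a separate argument):

sed -i.bak '/^#/d' log2.txt

The untouched original is preserved as log2.txt.bak.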

πŸ“Š Step 5: Extract IP Addresses (awk)

awk is a powerful text-processing tool used to scan, extract, and manipulate text, especially structured data.

awk -F 'src=| status' '{print $2}' log1.txt

  • -F 'src=| status': Defines a custom field separator using a regular expression.
  • {print $2}: Prints the second field, which here is the IP address.

βœ… In Greater Detail:

-F 'src=| status'

  • -F sets the field separator (the character(s) that split each line into fields).

  • 'src=| status' is a regular expression with two alternative delimiters:

    • src= β†’ marks the beginning of the IP address.
    • ' status' (a space followed by the word status) β†’ marks the end of the IP address.

This means:

  • Field $1: everything before src=

  • Field $2: the value between src= and status β†’ the IP address

  • Field $3: whatever remains after the ' status' delimiter (the = stays attached)

Example:

[2025-06-01 08:12:55] LOGIN: user=john src=192.168.1.10 status=SUCCESS

Becomes:

  • $1: [2025-06-01 08:12:55] LOGIN: user=john

  • $2: 192.168.1.10

  • $3: =SUCCESS (the = stays with the third field because the delimiter is ' status', not ' status=')


If you get different results, double-check the delimiter: an extra or missing space inside 'src=| status' changes how each line is split.
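
Beyond field extraction, awk's associative arrays let you tally values in a single pass. A minimal sketch, assuming the same status= layout as the sample logs:

awk -F 'status=' '{count[$2]++} END {for (s in count) print s, count[s]}' log1.txt

For log1.txt this prints SUCCESS 2 and FAIL 2 (in no guaranteed order).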


βœ‚οΈ Step 6: Cut Out Usernames (cut)

grep "user=" log1.txt | cut -d '=' -f 2 | cut -d ' ' -f1
  • First cut: Splits on = and keeps the second field (john src).
  • Second cut: Splits that on spaces and keeps the first word, leaving just the username.

βœ… In Greater Detail:

  • grep "user=" log1.txt

    • Filters lines in log1.txt that contain the string user=.
    • Why? Ensures you're only processing lines with login information.
  • cut -d '=' -f 2

    • Extracts the text between the first and second = signs (e.g., john src).
    • -d '=': Sets the delimiter to =.
    • -f 2: Selects the second =-delimited field.
  • cut -d ' ' -f1

    • Trims off everything after the username.
    • -d ' ': Sets the delimiter to a space.
    • -f1: Selects the first field (the username only).

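The same extraction can be done in a single awk call, reusing the two-delimiter trick from Step 5 (equivalent output for this log format):

awk -F 'user=| src' '{print $2}' log1.txt

This is also the form the pipelines later in this article use.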


πŸ“‘ Step 7: Remove Duplicates (sort + uniq)

sort log1.txt | uniq

Because uniq only removes adjacent duplicate lines, sorting first is essential. To count how many times each line appears:

sort log1.txt | uniq -c

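To rank entries by how often they occur, feed the counts into a numeric reverse sort:

sort log1.txt | uniq -c | sort -nr

The duplicated mary entry rises to the top with a count of 2.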


πŸ” Step 8: Normalize Case (tr)

awk -F 'status=' '{print $2}' log1.txt | tr '[:upper:]' '[:lower:]'
  • tr: Translates characters.
  • Converts uppercase status values to lowercase.

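tr can also delete characters outright with -d. As a throwaway illustration, this strips the square brackets around the timestamps:

tr -d '[]' < log1.txt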


πŸ”— Step 9: Chain Commands (xargs)

find . -name "*.txt" | xargs grep "status=SUCCESS" | awk -F 'user=| src' '{print $2}' | sort | uniq
  • xargs: Converts find output into arguments for grep.
  • Result: Clean list of users with successful logins.

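One caveat: if filenames could ever contain spaces or newlines, the null-delimited form is the safe variant:

find . -name "*.txt" -print0 | xargs -0 grep "status=SUCCESS"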


πŸ§ͺ Full Workflow Example: Cleaned Login Report

find . -name "*.txt" | xargs sed '/^#/d' | grep "status=SUCCESS" | awk -F 'user=| src' '{print $2}' | sort | uniq > clean_logins.txt

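With the sample files from Step 1, the finished report contains exactly the two users who logged in successfully. cat clean_logins.txt prints:

alice
john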


🧰 Command Cheat Sheet

Tool    Purpose                                Example
find    Locate files in directories            find . -name "*.txt"
grep    Search for matching lines              grep "status=FAIL"
sed     Edit or delete lines in streams        sed '/^#/d'
awk     Extract and process fields             awk -F 'src=| status' '{print $2}'
cut     Remove selected portions of text       cut -d '=' -f 2
sort    Sort lines in text files               sort log.txt
uniq    Remove or count duplicate lines        uniq -c
tr      Translate or delete characters         tr '[:upper:]' '[:lower:]'
xargs   Pass input as arguments to commands    xargs grep "SUCCESS"

βœ… Summary

  • awk and sed are essential for processing structured text on Linux systems.
  • When combined with tools like grep, cut, sort, uniq, tr, xargs, and find, you have a full-featured toolkit for automated log parsing, data cleaning, and report generation.
  • The scenario above mirrors a common task in system administration: making sense of raw logs and producing actionable insights.

πŸ’¬ Need a Challenge?

Now that you've mastered the basics, try the following:

  • Use awk to generate formatted CSV reports (see the sketch after this list)
  • Automate your workflows with cron
  • Explore real system logs with journalctl or /var/log/secure
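
For the CSV challenge, here is one possible starting sketch (it assumes the exact key=value layout of the sample logs; adapt the field positions if your format differs):

awk 'BEGIN {print "user,src,status"}
     /LOGIN:/ {split($4, u, "="); split($5, s, "="); split($6, t, "=")
               print u[2] "," s[2] "," t[2]}' log1.txt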
