LaTerral Williams

πŸ› οΈ Mastering Text Processing in Linux with `awk`, `sed`, and Friends

In a Linux environment, working with text-based files is part of the daily workflow. Whether you're parsing logs, cleaning structured data, or automating system reporting, tools like awk, sed, grep, find, and their counterparts are indispensable.

This article walks you through a realistic scenario so you can learn and apply these powerful tools step by step.


πŸ“š Table of Contents

  • Scenario: Audit and Clean a Login Activity Log
  • Step 1: Set Up the Practice Environment
  • Step 2: Find the Log Files (find)
  • Step 3: Filter Failed Logins (grep)
  • Step 4: Remove Comments (sed)
  • Step 5: Extract IP Addresses (awk)
  • Step 6: Cut Out Usernames (cut)
  • Step 7: Remove Duplicates (sort + uniq)
  • Step 8: Normalize Case (tr)
  • Step 9: Chain Commands (xargs)
  • Full Workflow Example: Cleaned Login Report
  • Command Cheat Sheet
  • Summary


πŸ§‘β€πŸ’» Scenario: Audit and Clean a Login Activity Log

You're a junior system administrator tasked with auditing a collection of login activity logs. These files:

  • Are spread across multiple directories
  • Contain both successful and failed logins
  • Include redundant entries
  • Use a structured key=value format common in system-generated logs

Your goal: extract meaningful data, clean it, and produce a simple report.


πŸ”§ Step 1: Set Up the Practice Environment

To simulate a real-world situation, you'll create a set of log files to work with.

πŸ“ Create a Directory

mkdir -p ~/log_audit/logs/2025
cd ~/log_audit/logs/2025

πŸ“ Create Sample Log Files

cat <<EOF > log1.txt
[2025-06-01 08:12:55] LOGIN: user=john src=192.168.1.10 status=SUCCESS
[2025-06-01 08:13:02] LOGIN: user=mary src=192.168.1.15 status=FAIL
[2025-06-01 08:13:02] LOGIN: user=mary src=192.168.1.15 status=FAIL
[2025-06-01 08:15:42] LOGIN: user=alice src=10.0.0.5 status=SUCCESS
EOF

cat <<EOF > log2.txt
# Generated by login system
[2025-06-01 08:18:22] LOGIN: user=bob src=192.168.1.20 status=FAIL
[2025-06-01 08:19:00] LOGIN: user=john src=192.168.1.10 status=SUCCESS
EOF

These files mimic logs from a PAM-enabled login system. Each entry contains a timestamp, user, source IP, and login result.
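
A quick sanity check that both files landed as expected (log2.txt still contains its comment line):

wc -l *.txt

This should report 4 lines for log1.txt and 3 for log2.txt.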



πŸ”Ž Step 2: Find the Log Files (find)

find ~/log_audit/logs -name "*.txt"
  • find: Searches for files in a directory tree.
  • -name "*.txt": Filters for .txt log files.

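You can tighten the search with additional tests; for example, -type f restricts matches to regular files, and -mtime -7 (not needed for this exercise, just a common variant) keeps only files modified within the last week:

find ~/log_audit/logs -type f -name "*.txt" -mtime -7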


πŸ” Step 3: Filter Failed Logins (grep)

grep "status=FAIL" *.txt
  • grep: Searches for patterns in text.
  • "status=FAIL": Finds entries for failed login attempts.

Use the -i flag for case-insensitive matching:

grep -i "status=fail" *.txt

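grep can also count instead of print. To tally failures per file:

grep -c "status=FAIL" *.txt

With the sample data this reports 2 for log1.txt (the duplicated mary entry) and 1 for log2.txt.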


πŸͺ„ Step 4: Remove Comments (sed)

sed '/^#/d' log2.txt
  • sed: A stream editor for filtering and transforming text.
  • /^#/d: Removes lines that begin with # (comments or metadata).

To update the file directly:

sed -i '/^#/d' log2.txt

Without -i, sed only writes the edited stream to standard output and leaves the file untouched; the -i flag makes it rewrite the file in place.


Be cautious with in-place deletion: a line that starts with # may carry meaningful information that was commented out intentionally, for example:

#2025-06-01: Backup completed
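
A safer habit when editing in place is to give -i a suffix so sed keeps a backup of the original (GNU sed syntax shown; BSD/macOS sed expects the suffix as a separate argument):

sed -i.bak '/^#/d' log2.txt

The untouched original is preserved as log2.txt.bak.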

πŸ“Š Step 5: Extract IP Addresses (awk)

awk is a powerful text-processing tool used to scan, extract, and manipulate text, especially structured data.

awk -F 'src=| status' '{print $2}' log1.txt

  • -F 'src=| status': Defines a custom field separator using a regular expression.
  • {print $2}: Prints the second field, which here is the IP address.

βœ… In Greater Detail:

-F 'src=| status'

  • -F sets the field separator (the character(s) that split each line into fields).

  • 'src=| status' is a regular expression with two alternative delimiters:

    • src= β†’ marks the beginning of the IP address.
    • ' status' (a space followed by the word status) β†’ marks the end of the IP address.

This means:

  • Field $1: everything before src=

  • Field $2: the value between src= and status β†’ the IP address

  • Field $3: whatever remains after the ' status' delimiter (the = stays attached)

Example:

[2025-06-01 08:12:55] LOGIN: user=john src=192.168.1.10 status=SUCCESS

Becomes:

  • $1: [2025-06-01 08:12:55] LOGIN: user=john

  • $2: 192.168.1.10

  • $3: =SUCCESS (the = stays with the third field because the delimiter is ' status', not ' status=')


If you get different results, double-check the delimiter: an extra or missing space inside 'src=| status' changes how each line is split.
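
Beyond field extraction, awk's associative arrays let you tally values in a single pass. A minimal sketch, assuming the same status= layout as the sample logs:

awk -F 'status=' '{count[$2]++} END {for (s in count) print s, count[s]}' log1.txt

For log1.txt this prints SUCCESS 2 and FAIL 2 (in no guaranteed order).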


βœ‚οΈ Step 6: Cut Out Usernames (cut)

grep "user=" log1.txt | cut -d '=' -f 2 | cut -d ' ' -f1
  • First cut: Splits on = and keeps the second field (john src).
  • Second cut: Splits that on spaces and keeps the first word, leaving just the username.

βœ… In Greater Detail:

  • grep "user=" log1.txt

    • Filters lines in log1.txt that contain the string user=.
    • Why? Ensures you're only processing lines with login information.
  • cut -d '=' -f 2

    • Extracts the text between the first and second = signs (e.g., john src).
    • -d '=': Sets the delimiter to =.
    • -f 2: Selects the second =-delimited field.
  • cut -d ' ' -f1

    • Trims off everything after the username.
    • -d ' ': Sets the delimiter to a space.
    • -f1: Selects the first field (the username only).

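The same extraction can be done in a single awk call, reusing the two-delimiter trick from Step 5 (equivalent output for this log format):

awk -F 'user=| src' '{print $2}' log1.txt

This is also the form the pipelines later in this article use.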


πŸ“‘ Step 7: Remove Duplicates (sort + uniq)

sort log1.txt | uniq

Because uniq only removes adjacent duplicate lines, sorting first is essential. To count how many times each line appears:

sort log1.txt | uniq -c

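To rank entries by how often they occur, feed the counts into a numeric reverse sort:

sort log1.txt | uniq -c | sort -nr

The duplicated mary entry rises to the top with a count of 2.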


πŸ” Step 8: Normalize Case (tr)

awk -F 'status=' '{print $2}' log1.txt | tr '[:upper:]' '[:lower:]'
  • tr: Translates characters.
  • Converts uppercase status values to lowercase.

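tr can also delete characters outright with -d. As a throwaway illustration, this strips the square brackets around the timestamps:

tr -d '[]' < log1.txt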


πŸ”— Step 9: Chain Commands (xargs)

find . -name "*.txt" | xargs grep "status=SUCCESS" | awk -F 'user=| src' '{print $2}' | sort | uniq
  • xargs: Converts find output into arguments for grep.
  • Result: Clean list of users with successful logins.

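One caveat: if filenames could ever contain spaces or newlines, the null-delimited form is the safe variant:

find . -name "*.txt" -print0 | xargs -0 grep "status=SUCCESS"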


πŸ§ͺ Full Workflow Example: Cleaned Login Report

find . -name "*.txt" | xargs sed '/^#/d' | grep "status=SUCCESS" | awk -F 'user=| src' '{print $2}' | sort | uniq > clean_logins.txt

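With the sample files from Step 1, the finished report contains exactly the two users who logged in successfully. cat clean_logins.txt prints:

alice
john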


🧰 Command Cheat Sheet

Tool    Purpose                                Example
find    Locate files in directories            find . -name "*.txt"
grep    Search for matching lines              grep "status=FAIL"
sed     Edit or delete lines in streams        sed '/^#/d'
awk     Extract and process fields             awk -F 'src=| status' '{print $2}'
cut     Remove selected portions of text       cut -d '=' -f 2
sort    Sort lines in text files               sort log.txt
uniq    Remove or count duplicate lines        uniq -c
tr      Translate or delete characters         tr '[:upper:]' '[:lower:]'
xargs   Pass input as arguments to commands    xargs grep "SUCCESS"

βœ… Summary

  • awk and sed are essential for processing structured text on Linux systems.
  • When combined with tools like grep, cut, sort, uniq, tr, xargs, and find, you have a full-featured toolkit for automated log parsing, data cleaning, and report generation.
  • The scenario above mirrors a common task in system administration: making sense of raw logs and producing actionable insights.

πŸ’¬ Need a Challenge?

Now that you've mastered the basics, try the following:

  • Use awk to generate formatted CSV reports (see the sketch after this list)
  • Automate your workflows with cron
  • Explore real system logs with journalctl or /var/log/secure
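
For the CSV challenge, here is one possible starting sketch (it assumes the exact key=value layout of the sample logs; adapt the field positions if your format differs):

awk 'BEGIN {print "user,src,status"}
     /LOGIN:/ {split($4, u, "="); split($5, s, "="); split($6, t, "=")
               print u[2] "," s[2] "," t[2]}' log1.txt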
