DEV Community

Grace Valerie

Useful Linux Commands For Data Engineers

Linux servers are the most widely used compute environment for large-scale data systems. Mastering Linux helps data engineers manage data pipelines and process data efficiently.

This article is a deep dive into the most useful Linux commands that are relevant for data engineering tasks.

Prerequisites

  1. Set up a Linux server environment for testing purposes.
  2. Be familiar with the command line.

Table of Contents

  1. File and Directory Operations
  2. File System and Storage Management
  3. File Attributes and Permissions
  4. User and Group Management
  5. Networking and Security
  6. File Compression and Encryption
  7. Editors
  8. File transfer commands

File and Directory Operations

These commands are used for navigating and manipulating the Linux file system. They emulate CRUD operations, but on the file structure.

# print the current working directory
pwd

# list directory contents, including files and subdirectories
ls
# include hidden files in the list
ls -a
# change the current working directory
cd /path/to/directory
# delete a file (use rm -r to delete a directory and its contents)
rm filename
# create an empty file
touch filename
# print the contents of a file
cat filename
# view only the first or last 10 lines of a file without reading the whole file
head filename
tail filename


Data engineers can use these commands to inspect logs and raw data dumps.
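As a rough illustration of that kind of log inspection, the sketch below builds a small sample log and summarizes it. The file name and log lines are made up for the example, and `wc` and `grep` are standard tools that are not covered in the list above.

```shell
#!/bin/sh
# Hypothetical sample log; the path and contents are invented for illustration.
workdir=$(mktemp -d)
log="$workdir/pipeline.log"
printf '%s\n' \
  'INFO  job started' \
  'INFO  loaded 1000 rows' \
  'ERROR failed to parse row 42' \
  'INFO  loaded 2000 rows' \
  'ERROR connection reset' \
  'INFO  job finished' > "$log"

# First and last line of the file
first_line=$(head -n 1 "$log")
last_line=$(tail -n 1 "$log")

# Total number of lines, and number of lines containing ERROR
total_lines=$(wc -l < "$log" | tr -d ' ')
error_count=$(grep -c 'ERROR' "$log")

echo "first: $first_line"
echo "last:  $last_line"
echo "lines: $total_lines, errors: $error_count"

# Remove the temporary directory
rm -r "$workdir"
```

On a real server you would point `log` at an existing file instead of generating one.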

File System and Storage Management

The file system partitions, organizes and stores data on disk.

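A disk partition is exposed as a device file such as /dev/sda1, where:

/dev ~ directory that holds device files, including storage disks
sd ~ prefix for SCSI/SATA (and USB) disks
a ~ the first disk
1 ~ the first partition on that disk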


cat /proc/partitions ~ view all storage disks and partitions recognized by the system
sudo fdisk /dev/sda ~ create, manage and delete partitions on /dev/sda
sudo mkfs.xfs /dev/sdb1 ~ create an XFS filesystem on partition /dev/sdb1 (this erases any data on it)
sudo mount /dev/sdb1 /myfolder ~ attach the filesystem to the /myfolder directory, which acts as the mount point
df -h /myfolder ~ verify that the storage space is mounted and show its usage in human-readable units
sudo pvcreate /dev/sdb1 ~ alternatively, initialize an unused partition as a physical volume for the Logical Volume Manager (LVM)

Because data engineers work on data-intensive systems, they need to handle partitions and filesystems carefully to avoid losing data. Back up your data before running these commands.
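Checking free space is the one storage task here that is safe to script without root. The sketch below reads the usage percentage that `df` reports for a mount point and compares it to a threshold. The 90% threshold is an arbitrary assumption, and `/` is used only because it exists on every system.

```shell
#!/bin/sh
# Mount point to check; "/" is used here only because it always exists.
mount_point="/"
# Hypothetical alert threshold, in percent.
threshold=90

# df -P prints one POSIX-format line per filesystem; column 5 is the
# used capacity as a percentage. Strip the trailing "%" sign.
usage=$(df -P "$mount_point" | awk 'NR == 2 { gsub("%", "", $5); print $5 }')

if [ "$usage" -ge "$threshold" ]; then
  echo "WARNING: $mount_point is ${usage}% full"
else
  echo "OK: $mount_point is ${usage}% full"
fi
```

A check like this could run before a pipeline writes a large batch of data to disk.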

File Attributes and Permissions

Linux file permissions offer a security mechanism for determining who can read, write or execute files on a system.

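# show file metadata, including permissions, owner and group
ls -l myapp.py

# make a file immutable (requires root; undo with chattr -i)
sudo chattr +i myapp.py

# list file attributes
lsattr myapp.py

# add or remove execute permission
chmod +x myapp.py
chmod -x myapp.py

# set read/write for owner, read-only for group and others
chmod 644 myapp.py
# change file ownership
sudo chown username:groupname myapp.py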


Using these commands, a data engineer can restrict access to sensitive files and protect them from accidental modification by setting strict permissions.
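To see how the numeric modes change, the following sketch applies a few `chmod` calls to a throwaway file and reads the resulting mode back with `stat`. It assumes GNU `stat` (the `-c '%a'` format option), which is what most Linux distributions ship.

```shell
#!/bin/sh
workdir=$(mktemp -d)
file="$workdir/myapp.py"
touch "$file"

# 644: owner read/write, group and others read-only
chmod 644 "$file"
mode_after_644=$(stat -c '%a' "$file")

# Add execute permission for the owner only
chmod u+x "$file"
mode_after_x=$(stat -c '%a' "$file")

# Remove all access for group and others (owner-only file)
chmod go-rwx "$file"
mode_private=$(stat -c '%a' "$file")

echo "$mode_after_644 $mode_after_x $mode_private"

# Remove the temporary directory
rm -r "$workdir"
```

The last mode, 700, is a reasonable choice for scripts that contain credentials or touch sensitive data.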

User and Group Management

Use the useradd command to create a new user:

sudo useradd username

# alternatively, pass -m to also create the user's home directory
sudo useradd -m username

# create a new group
sudo groupadd groupname

# add the existing user to a group (-a appends, so current groups are kept)
sudo usermod -aG groupname username

View user information with

id username
# returns UID and GID

To add a password for the new user, run:

sudo passwd username

All local user accounts are listed in the /etc/passwd file:

cat /etc/passwd

Switch to a different user with:

# the dash starts a login shell with the target user's environment
su - username


To maintain least-privilege access to resources, data engineers need to manage users and groups properly. Work from regular user accounts and escalate with sudo only when needed, rather than logging in as root.
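A script can apply the same rule to itself. The sketch below, which is only one possible approach, checks whether it is running as root and then prints the name, UID and login shell of each account in /etc/passwd using `awk`.

```shell
#!/bin/sh
# Numeric ID of the current user; 0 means the script is running as root.
current_uid=$(id -u)
if [ "$current_uid" -eq 0 ]; then
  echo "Running as root; consider switching to a regular account."
else
  echo "Running as a regular user (UID $current_uid)."
fi

# /etc/passwd fields are colon-separated:
#   name:password:UID:GID:comment:home:shell
# Print the name, UID and login shell of each account.
accounts=$(awk -F: '{ printf "%s uid=%s shell=%s\n", $1, $3, $7 }' /etc/passwd)
echo "$accounts"
```

Accounts whose shell is something like /usr/sbin/nologin are typically service accounts that cannot log in interactively.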

Networking and Security

Four key mechanisms help block malicious access on a Linux server: firewalls filter traffic, encryption protects data in transit, authentication verifies user identities, and monitoring analyzes traffic.

UFW is a firewall front end for iptables. You can list the application profiles it knows about with:

sudo ufw app list

To allow traffic for an application profile, run:

sudo ufw allow appname

You can also write rules that specify a port instead of a profile name:

# check whether ufw is active and list the current rules
sudo ufw status

# allow SSH (port 22)
sudo ufw allow 22

# allow a port range (a protocol is required for ranges)
sudo ufw allow 6000:6009/tcp

# allow connections from a specific IP address
sudo ufw allow from 201.8.139.4


With the rules in place, enable the firewall. Make sure SSH is allowed first, otherwise you can lock yourself out of a remote server:

sudo ufw enable

Use sudo ss -tulnp | grep <port> to show listening ports, and ping <host> to test connectivity.

File Compression and Encryption

Compressing a file reduces the amount of storage needed and will help speed up data transfer.

Use gzip to compress a single file as:

# results in filename.gz
gzip filename


Archiving combines multiple files into a single archive file. Create a simple archive with tar:

# -c creates the archive, -f names the archive file
tar -cf myarchive.tar app.py main.py project/

To extract an archive use the command:

tar -xf myarchive.tar
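Archiving and compression are often combined in one step with tar's `-z` option. The sketch below creates a gzip-compressed archive from a sample directory, lists its contents, extracts it elsewhere and compares the result with the original. The directory and file names are placeholders.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample files standing in for raw logs
mkdir logs
echo "row 1" > logs/a.log
echo "row 2" > logs/b.log

# -c create, -z compress with gzip, -f archive file name
tar -czf logs.tar.gz logs

# -t lists the archive contents without extracting
listing=$(tar -tzf logs.tar.gz | sort)

# Extract into a separate directory (-C) and compare with the originals
mkdir restored
tar -xzf logs.tar.gz -C restored
if diff -r logs restored/logs > /dev/null; then
  roundtrip="ok"
else
  roundtrip="mismatch"
fi

echo "$listing"
echo "round trip: $roundtrip"

# Leave and remove the temporary directory
cd / && rm -r "$workdir"
```

Listing an archive with `-t` before extracting is a cheap way to confirm what it contains.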

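Encryption protects data while it is transmitted over the internet. Use GPG or OpenSSL for encryption:

# encrypt a file with a passphrase (symmetric); writes filename.txt.gpg
gpg -c filename.txt

# AES-256 encryption; -pbkdf2 (OpenSSL 1.1.1+) derives the key from the passphrase more securely
openssl enc -aes-256-cbc -pbkdf2 -salt -in data.txt -out data.enc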


Data engineers can compress raw log files to save space, then encrypt the archived files before uploading them to cloud storage.
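Before relying on an encrypted file, it is worth confirming that it decrypts back to the original. The sketch below does an encrypt and decrypt round trip with `openssl enc`, assuming OpenSSL 1.1.1 or newer for the `-pbkdf2` option. The passphrase is a hard-coded placeholder for the demo only. Passing it on the command line exposes it to other users through the process list, so a real pipeline should read it from a protected file or secret store.

```shell
#!/bin/sh
workdir=$(mktemp -d)
echo "id,amount" > "$workdir/data.txt"
echo "1,9.99" >> "$workdir/data.txt"

# Hypothetical passphrase, for demonstration only.
passphrase="example-passphrase"

# Encrypt: -pbkdf2 derives the key from the passphrase with PBKDF2,
# -salt adds a random salt to the key derivation.
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in "$workdir/data.txt" -out "$workdir/data.enc" \
  -pass "pass:$passphrase"

# Decrypt with -d and the same cipher options
openssl enc -d -aes-256-cbc -pbkdf2 \
  -in "$workdir/data.enc" -out "$workdir/data.dec" \
  -pass "pass:$passphrase"

# cmp -s is silent and succeeds only if the files are byte-identical
if cmp -s "$workdir/data.txt" "$workdir/data.dec"; then
  decrypt_check="match"
else
  decrypt_check="differ"
fi
echo "decrypted copy: $decrypt_check"

# Remove the temporary directory
rm -r "$workdir"
```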

Editors

Two popular terminal text editors on Linux are Nano and Vim.

nano file.txt

vi file.txt

Both commands open file.txt. If the file does not exist, it is created when you save.

File transfer commands

These commands can be used to handle file transfer either locally or remotely.

# SFTP (SSH File Transfer Protocol) ~ interactive file transfer over SSH
# (hostname can also be an IP address)
sftp username@hostname

# copy files and directories locally (add -r for directories)
cp file1.txt /backup/file1.txt

# move or rename files and directories
mv filename renamedfile

# copy files to a remote host over SSH
scp data.csv username@hostname:/data

# synchronize a directory to a remote host; only changed files are sent
rsync -avz /backup/log username@hostname:/path/to/destination

With these commands, a data engineer can efficiently sync files from a local server to a remote one.
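The incremental behavior of rsync can be tried locally before pointing it at a remote host. The sketch below syncs one temporary directory into another twice, adding a file in between. The directory names are placeholders, and the `cp -a` fallback is only there so the script still runs on a machine without rsync installed.

```shell
#!/bin/sh
workdir=$(mktemp -d)
src="$workdir/backup/log"
dst="$workdir/remote_copy"
mkdir -p "$src" "$dst"
echo "day 1" > "$src/app.log"

sync_dir() {
  # A trailing slash on the source copies its contents, not the directory itself.
  if command -v rsync > /dev/null 2>&1; then
    rsync -a "$src/" "$dst/"
  else
    # Fallback for systems without rsync (copies everything, not incremental)
    cp -a "$src/." "$dst/"
  fi
}

# First run copies app.log
sync_dir

# Second run: with rsync, only the new file is transferred
echo "day 2" > "$src/new.log"
sync_dir

synced_files=$(ls "$dst" | sort | tr '\n' ' ')
echo "synced: $synced_files"

# Remove the temporary directory
rm -r "$workdir"
```

For a remote target, the destination would take the form username@hostname:/path/to/destination, and adding `-z` compresses data in transit.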

Conclusion

This article covers several categories of Linux commands and how data engineers use them.

