Linux servers are the preferred compute environment for large-scale data systems. Mastering Linux helps data engineers manage data pipelines and process data efficiently.
This article is a deep dive into the most useful Linux commands that are relevant for data engineering tasks.
Prerequisites
- A Linux server environment set up for testing purposes.
- Familiarity with the command line.
Table of Contents
- File and Directory Operations
- File System and Storage Management
- File Attributes and Permissions
- User and Group Management
- Networking and Security
- File Compression and Encryption
- Editors
- File Transfer Commands
File and Directory Operations
These commands are used for navigating and manipulating the Linux file system. They emulate CRUD operations, but on the file structure.
# returns the current working directory
pwd
# lists directory contents including files and subdirectories
ls
# includes hidden files in the list
ls -a
# changes the current working directory
cd /path/to/directory
# deletes a file (use rm -r to delete a directory and its contents)
rm filename
# creates an empty file
touch filename
# prints the contents of a file
cat filename
# quickly view the first or last lines of a file without opening the entire file
head filename
tail filename
Data engineers can use these commands to inspect logs and raw data dumps.
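As a minimal sketch of that kind of inspection, the script below builds a throwaway log file and then looks at it with ls, wc, head and tail. The file name pipeline.log and its contents are hypothetical sample data, not something from a real pipeline.

```shell
#!/bin/sh
set -eu

# Hypothetical sample data: a scratch directory holding a 100-line log file.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

i=1
while [ "$i" -le 100 ]; do
    echo "2024-01-01 INFO processed batch $i"
    i=$((i + 1))
done > pipeline.log

# Confirm the file exists and count its lines.
ls -l pipeline.log
line_count=$(wc -l < pipeline.log)

# Peek at the start and the end of the file without printing all of it.
first_line=$(head -n 1 pipeline.log)
last_line=$(tail -n 1 pipeline.log)

echo "lines: $line_count"
echo "first: $first_line"
echo "last:  $last_line"
```

On a real server you would point head and tail at an existing log instead of generating one; tail -f additionally follows a file as new lines are appended.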
File System and Storage Management
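The file system partitions, organizes and stores data on disk. A device name such as /dev/sda1 breaks down as follows:
/dev ~ directory containing the device files that represent storage disks
sd ~ the storage disk (the prefix used for SCSI, SATA and USB disks)
a ~ the first disk
1 ~ the first partition on the disk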
cat /proc/partitions ~ view all storage disks and partitions recognized by the system
sudo fdisk /dev/sda ~ create, manage and delete partitions on /dev/sda
sudo mkfs.xfs /dev/sdb1 ~ create an XFS filesystem on partition /dev/sdb1 (this erases any data already on it)
sudo mount /dev/sdb1 /myfolder ~ attach the partition to the /myfolder directory, which acts as the mount point
df -h /myfolder ~ verify that the storage space is available
sudo pvcreate /dev/sdb1 ~ initialize a partition as a physical volume for the Logical Volume Manager (an alternative to formatting it directly with mkfs)
As data engineers work on systems with intensive data, they need to skillfully manipulate partitions and filesystems to prevent losing data. Backing up data is recommended before running these commands.
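Before touching partitions or starting a large load, a read-only check of free space is a safe first step. The sketch below uses df for that; the target directory and the required size are hypothetical values chosen only for illustration.

```shell
#!/bin/sh
set -eu

target_dir="."       # hypothetical: directory where the data will land
required_kb=1024     # hypothetical: space the job needs, in KiB

# -P forces the portable one-line-per-filesystem format and -k reports
# sizes in 1024-byte blocks, so the fourth column is the available KiB.
available_kb=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')

if [ "$available_kb" -ge "$required_kb" ]; then
    space_ok=yes
else
    space_ok=no
fi
echo "available: ${available_kb} KiB, enough space: $space_ok"
```

Because df only reads filesystem statistics, this check needs no sudo and cannot damage data.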
File Attributes and Permissions
Linux file permissions offer a security mechanism for determining who can read, write or execute files on a system.
# returns file metadata including file permissions
ls -l
# makes a file immutable
sudo chattr +i filename
# list file attributes
lsattr filename
# add or remove execute permissions
chmod +x filename
chmod -x filename
# set Read/Write for owner, read for group/others
chmod 644 myapp.py
# change file ownership (owner and group)
sudo chown username:groupname filename
Using these commands, a data engineer can restrict access to sensitive file data and protect files from accidental modification by setting strict permissions.
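The following sketch shows how octal and symbolic modes translate into actual permission bits. The file names are hypothetical, and stat -c '%a' assumes GNU coreutils, which is what most Linux distributions ship.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical file names used only for this illustration.
secret="$workdir/credentials.env"
script="$workdir/load_data.sh"
touch "$secret" "$script"

# Owner may read and write; group and others get no access.
chmod 600 "$secret"

# Owner read/write, group and others read-only, then add execute for the owner.
chmod 644 "$script"
chmod u+x "$script"

# stat -c '%a' (GNU coreutils) prints the permission bits in octal.
secret_mode=$(stat -c '%a' "$secret")
script_mode=$(stat -c '%a' "$script")
echo "credentials.env: $secret_mode"
echo "load_data.sh:    $script_mode"
```

The sketch uses u+x rather than a bare +x because a bare +x is filtered through the current umask, so its result can differ between systems.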
User and Group Management
Use the useradd command to create a new user:
sudo useradd username
# create a new group
sudo groupadd groupname
# -m creates the home directory for the user
sudo useradd -m username
# add an existing user to a group (-a appends instead of replacing the user's other groups)
sudo usermod -aG groupname username
View user information with:
# returns the UID, GID and group memberships
id username
To set a password for the new user, run:
sudo passwd username
All existing users are listed in the /etc/passwd file:
cat /etc/passwd
Switch to a different user with:
# the - starts a login shell with that user's environment
su - username
In order for data engineers to maintain least privilege access to resources, they need to properly implement user and group management. It is recommended to use user accounts instead of the root user to minimize access.
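One way a script can respect that advice is to check which account it runs under before doing any work. This is a minimal sketch that relies only on id and the standard colon-separated layout of /etc/passwd; the warning text is illustrative.

```shell
#!/bin/sh
set -eu

# Numeric user ID of the account running this script.
current_uid=$(id -u)

# /etc/passwd is colon-separated: field 1 is the user name, field 3 the UID.
# Look up the name registered for the current UID (empty if there is none).
passwd_name=$(awk -F: -v uid="$current_uid" '$3 == uid { print $1; exit }' /etc/passwd)
echo "uid $current_uid is listed as: ${passwd_name:-<not listed>}"

# Root always has UID 0; warn instead of silently running with full privileges.
if [ "$current_uid" -eq 0 ]; then
    echo "warning: running as root; prefer a regular user account"
fi
```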
Networking and Security
Linux security relies on four key components that block malicious access: firewalls filter traffic, encryption protects data in transit, authentication verifies user identities, and monitoring analyzes traffic.
UFW is a firewall interface for iptables. Check the currently registered application profiles with:
sudo ufw app list
To enable an application profile, run:
sudo ufw allow appname
You can also write a rule that specifies the port instead of the profile name:
# check if ufw is running
sudo ufw status
# allow ssh
sudo ufw allow 22
# specify port ranges (a protocol is required for ranges)
sudo ufw allow 6000:6009/tcp
# allow connections from an IP address
sudo ufw allow from 201.8.139.4
With the rules applied, you can enable the firewall with:
sudo ufw enable
Use sudo ss -tulnp | grep PORT to show listening ports (replace PORT with the port number) and ping to test connectivity.
File Compression and Encryption
Compressing a file reduces the amount of storage needed and will help speed up data transfer.
Use gzip to compress a single file:
# replaces filename with filename.gz
gzip filename
Archiving combines multiple files into a single archive file. Create a simple archive using tar:
# -c creates the archive and -f names it
tar -cf myarchive.tar app.py main.py project/
To extract an archive use the command:
tar -xf myarchive.tar
Encryption ensures that you can safely transmit data over the internet. Use GPG or OpenSSL for encryption:
# encrypt a file with a passphrase (symmetric encryption); results in filename.txt.gpg
gpg -c filename.txt
# AES-256 encryption; -pbkdf2 derives the key from the passphrase with PBKDF2 (OpenSSL 1.1.1 or later)
openssl enc -aes-256-cbc -pbkdf2 -in data.txt -out data.enc
Data engineers can compress raw log files to save space, then encrypt the archived files before uploading them to cloud storage.
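The archive-and-compress half of that workflow can be sketched as below. The log files are hypothetical sample data, and the encryption step is left out because gpg -c prompts for a passphrase interactively. The -z flag tells tar to run the archive through gzip in the same step.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical raw log files.
mkdir logs
echo "job started" > logs/app.log
echo "job failed"  > logs/error.log

# -c create, -z compress through gzip, -f name the archive file.
tar -czf logs.tar.gz logs

# -t lists the contents without extracting, which is a cheap sanity check.
archived_files=$(tar -tzf logs.tar.gz | sort)
echo "$archived_files"

# Extract into a separate directory and compare with the originals.
mkdir restored
tar -xzf logs.tar.gz -C restored
if diff -r logs restored/logs > /dev/null; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

Listing and test-extracting an archive before deleting the originals is a small cost compared with discovering a corrupt archive later.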
Editors
The most popular text editors on Linux are Nano and Vim.
nano file.txt
vi file.txt
Both commands open file.txt; if the file does not exist, it is created when you save.
File Transfer Commands
These commands handle file transfers, either locally or to a remote server.
# Secure File Transfer Protocol ~ opens an interactive session for transferring files (use the server's IP address or hostname)
sftp username@ip_address
# copy files locally (add -r to copy directories)
cp file1.txt /backup/file1.txt
# move or rename files
mv filename renamedfile
# transfer files remotely over ssh
scp data.csv username@ip_address:/data
# synchronize directories remotely for incremental updates (the path after the colon is a placeholder for the remote destination)
rsync -avz /backup/log username@ip_address:/path/on/remote
A data engineer will be able to efficiently sync files on the local server to a remote server.
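The local commands can be tried without a remote server. The sketch below copies a file, renames the copy and confirms the two are identical; the file and directory names are hypothetical, and cmp is used here only as a simple integrity check.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical export file and backup directory.
printf 'id,value\n1,10\n2,20\n' > data.csv
mkdir backup

# Copy into the backup directory; -p preserves mode and timestamps.
cp -p data.csv backup/data.csv

# Rename the copy in place.
mv backup/data.csv backup/data_backup.csv

# cmp -s exits 0 only when the two files are byte-for-byte identical.
if cmp -s data.csv backup/data_backup.csv; then
    copy_status=identical
else
    copy_status=different
fi
echo "copy status: $copy_status"
```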
Conclusion
This article covers several categories of Linux commands and how data engineers use them.