Pandas for Data Science Series — Article #1
What is Pandas and Why Does It Matter?
Pandas is an open-source Python library built specifically for data manipulation and analysis. Released in 2008 and named after "Panel Data" — a term from econometrics — it has since become one of the most essential tools in the entire Python ecosystem.
At its core, Pandas gives you two powerful data structures:
- Series — a single column of data
- DataFrame — a full table with rows and columns (think of it as a turbocharged spreadsheet you control with code)
Whether you're working in data science, machine learning, financial analysis, or business intelligence, chances are you'll be loading, exploring, and transforming data with Pandas before you do anything else. It integrates seamlessly with libraries like NumPy, Matplotlib, Seaborn, Scikit-learn, and TensorFlow, making it the essential starting point in nearly every data workflow.
In this first article of the series, we'll cover two essential skill sets: reading data into Pandas from virtually any file format, and sorting and displaying that data so you can understand it quickly.
Let's get started 🎯
Part 1 — Reading Files into Pandas
Before you can analyze anything, you need to load your data. Pandas makes this simple with a family of read_*() functions — one for almost every file format you'll encounter in the real world.
CSV File — Most Common
CSV (Comma-Separated Values) files are the bread and butter of data work. read_csv() is almost certainly the first Pandas function you'll use in any project.
import pandas as pd
df = pd.read_csv("file.csv") # basic read
df = pd.read_csv("file.csv", header=0) # first row as header
df = pd.read_csv("file.csv", index_col="order_id") # set column as index
df = pd.read_csv("file.csv", nrows=100) # read only first 100 rows
df = pd.read_csv("file.csv", skiprows=2) # skip first 2 rows
df = pd.read_csv("file.csv", usecols=["price","quantity"]) # specific columns only
df = pd.read_csv("file.csv", na_values=["NA","?"]) # define missing values
df = pd.read_csv("file.csv", sep=";") # semicolons instead of commas
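To make these options concrete, here's a minimal, self-contained sketch that combines several of them on an in-memory CSV (the data and column names are invented for illustration; io.StringIO stands in for a real file path):

```python
import io
import pandas as pd

# An in-memory stand-in for "file.csv"; the data is invented for illustration
raw = "order_id;price;quantity\n1;9.99;2\n2;4.50;?\n3;12.00;1\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",               # semicolon-delimited
    index_col="order_id",  # use order_id as the index
    na_values=["?"],       # treat "?" as missing (NaN)
)
```

The resulting DataFrame has three rows indexed by order_id, and the "?" in the quantity column comes through as NaN.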
Excel File
Excel files are ubiquitous in business environments. read_excel() handles them cleanly, and you can target specific sheets by name or position.
df = pd.read_excel("file.xlsx") # basic read
df = pd.read_excel("file.xlsx", sheet_name="Sheet1") # specific sheet by name
df = pd.read_excel("file.xlsx", sheet_name=0) # first sheet by index
df = pd.read_excel("file.xlsx", skiprows=2) # skip first 2 rows
df = pd.read_excel("file.xlsx", usecols="A:D") # read columns A to D
JSON File
JSON is the standard format for web APIs. read_json() can parse various JSON structures, including a flat list of records.
df = pd.read_json("file.json") # basic read
df = pd.read_json("file.json", orient="records") # list of records format
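A quick self-contained example with a records-style JSON string held in memory (the names and scores are made up; wrapping the string in StringIO avoids a deprecation warning on newer pandas versions):

```python
import io
import pandas as pd

# A list-of-records JSON string, the shape most web APIs return
raw = '[{"name": "Alice", "score": 90}, {"name": "Bob", "score": 85}]'

df = pd.read_json(io.StringIO(raw), orient="records")
# df has two rows and the columns "name" and "score"
```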
HTML File — Web Scraping Made Easy
One of Pandas' more surprising capabilities: it can extract tables directly from HTML pages, including live websites. It does need an HTML parser such as lxml or html5lib installed behind the scenes, but your own code stays a single line.
tables = pd.read_html("file.html") # returns list of all tables
df = pd.read_html("https://website.com/table")[0] # first table from a URL
SQL Database
When your data lives in a database, Pandas can query it directly. You bring a connection object and a SQL query, and it returns a DataFrame.
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM table_name", conn) # full SQL query
# read_sql_table() needs a SQLAlchemy engine, not a raw DBAPI connection:
# engine = sqlalchemy.create_engine("sqlite:///database.db")
df = pd.read_sql_table("table_name", engine) # read entire table directly
Text File
read_table() is the text-file counterpart to read_csv(). The only real difference is its default separator — a tab instead of a comma — but it accepts all the same parameters.
df = pd.read_table("file.txt") # tab separated (default)
df = pd.read_table("file.txt", sep=",") # comma separated
df = pd.read_table("file.txt", sep=";") # semicolon separated
df = pd.read_table("file.txt", sep="|") # pipe separated
df = pd.read_table("file.txt", header=None) # file has no header row
df = pd.read_table("file.txt", names=["col1","col2"]) # add column names manually
df = pd.read_table("file.txt", skiprows=2) # skip first 2 rows
df = pd.read_table("file.txt", nrows=100) # read only 100 rows
df = pd.read_fwf("file.txt") # fixed-width text file
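A runnable sketch: a headerless, pipe-separated text file simulated in memory (the SKU data is invented), with column names supplied manually:

```python
import io
import pandas as pd

raw = "101|widget|3\n102|gadget|7\n"

df = pd.read_table(
    io.StringIO(raw),
    sep="|",                       # pipe-separated
    header=None,                   # no header row in the file
    names=["sku", "item", "qty"],  # supply column names manually
)
```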
read_csv() vs read_table() — The only practical difference is the default separator: read_csv() uses a comma (,) while read_table() uses a tab (\t). Every other parameter works identically in both.
| | read_csv() | read_table() |
|---|---|---|
| Default separator | , comma | \t tab |
| File type | CSV files | TXT files |
| Speed | Same | Same |
| Parameters | Same | Same |
Other Formats
# Clipboard — copy any table, then run this
df = pd.read_clipboard()
# Parquet — the preferred format for big data (very fast)
df = pd.read_parquet("file.parquet")
# XML
df = pd.read_xml("file.xml")
Summary — All Read Functions at a Glance
| Function | File Type | Common Use |
|---|---|---|
| read_csv() | CSV | Most common — daily use |
| read_excel() | XLSX | Excel files from business |
| read_json() | JSON | Web APIs and REST data |
| read_html() | HTML | Web scraping tables |
| read_sql() | Database | SQL queries |
| read_table() | TXT | Tab-separated text files |
| read_clipboard() | Clipboard | Quick copy-paste workflow |
| read_parquet() | Parquet | Large-scale / big data |
| read_xml() | XML | Structured XML data |
Most Important Parameters — Shared Across Most Read Functions
nrows → how many rows to read
skiprows → how many rows to skip at the top
usecols → which columns to read
index_col → which column to use as index
na_values → what to treat as missing value (NaN)
header → which row to use as column header
sep → what separator/delimiter to use
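Several of these parameters combine naturally in a single call. A sketch on an in-memory CSV (the cities and values are invented; StringIO stands in for a file path):

```python
import io
import pandas as pd

raw = (
    "exported 2024-01-01\n"   # a junk banner line, common in real exports
    "city,temp,humidity\n"
    "Oslo,4,80\n"
    "Cairo,30,20\n"
    "Lima,19,85\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                # skip the banner line
    header=0,                  # the next line holds the column names
    usecols=["city", "temp"],  # load only two of the three columns
    nrows=2,                   # stop after two data rows
)
# df now holds Oslo and Cairo with columns city and temp
```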
Part 2 — Sorting and Displaying Data
Once your data is loaded, the first thing you want to do is understand it. The functions in this section are your primary tools for exploring structure, distributions, relationships, and ordering — before you write a single line of analysis.
We'll use this sample DataFrame throughout all the examples:
import pandas as pd
data = {
'Country': ['China', 'India', 'USA', 'Brazil', 'UK'],
'Continent': ['Asia', 'Asia', 'North America', 'South America', 'Europe'],
'2022 Population': [1412000000, 1380000000, 331000000, 214000000, 67000000],
'Area (km2)': [9597000, 3287000, 9834000, 8516000, 243000]
}
df = pd.DataFrame(data)
Sorting Functions
1. sort_values()
The most commonly used sorting function. Reorders the DataFrame by the values in one or more columns. By default it sorts ascending; pass ascending=False to flip it.
# Ascending order (default)
df.sort_values('2022 Population')
# Descending order
df.sort_values('2022 Population', ascending=False)
# Sort by multiple columns: Continent A→Z, then Population largest first
df.sort_values(['Continent', '2022 Population'], ascending=[True, False])
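Here's how the multi-column sort plays out on the sample data (the DataFrame is rebuilt so the snippet runs on its own):

```python
import pandas as pd

# The sample DataFrame from above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Country': ['China', 'India', 'USA', 'Brazil', 'UK'],
    'Continent': ['Asia', 'Asia', 'North America', 'South America', 'Europe'],
    '2022 Population': [1412000000, 1380000000, 331000000, 214000000, 67000000],
})

# Continent A→Z, then population largest-first within each continent
ordered = df.sort_values(['Continent', '2022 Population'], ascending=[True, False])
print(ordered['Country'].tolist())
# ['China', 'India', 'UK', 'USA', 'Brazil']
```

China and India stay together (both Asia, population descending), then the remaining continents follow alphabetically.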
2. sort_index()
Sorts by the DataFrame's index rather than column values. This becomes important after filtering, merging, or shuffling rows, which can leave the index disordered.
df.sort_index() # sort index ascending
df.sort_index(ascending=False) # sort index descending
3. nlargest()
Returns the top N rows with the largest values in a specified column. More efficient and more readable than combining sort_values() with head().
# Top 3 most populated countries
df.nlargest(3, '2022 Population')
4. nsmallest()
Returns the N rows with the smallest values in a specified column, the mirror image of nlargest().
# 3 smallest countries by area
df.nsmallest(3, 'Area (km2)')
5. rank()
Assigns a rank to each row based on a column's values, without changing the row order. Useful for adding a ranking column to your existing DataFrame.
# Add a population rank column (1 = largest)
df['Population Rank'] = df['2022 Population'].rank(ascending=False)
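On the sample data, rank(ascending=False) assigns 1.0 to the largest population without moving any rows (note that rank() returns floats by default):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'India', 'USA', 'Brazil', 'UK'],
    '2022 Population': [1412000000, 1380000000, 331000000, 214000000, 67000000],
})

df['Population Rank'] = df['2022 Population'].rank(ascending=False)
# China gets rank 1.0, UK gets rank 5.0; the row order is unchanged
```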
Displaying Functions
1. head() & tail()
The simplest inspection tools. head() shows the first N rows (default 5) and tail() shows the last N rows. These are usually the first things you run on any new dataset.
df.head() # first 5 rows
df.head(10) # first 10 rows
df.tail() # last 5 rows
df.tail(3) # last 3 rows
2. sample()
Returns N randomly selected rows. Ideal for a quick look at a large dataset where the first or last rows may be unrepresentative (for example, when the file is sorted by date).
df.sample(3) # 3 random rows
3. info()
Displays the DataFrame's structure: column names, data types, number of non-null values, and memory usage. This should be one of the very first functions you call on any new dataset — it immediately reveals missing data and wrong data types.
df.info()
4. describe()
Returns a statistical summary for all numeric columns: count, mean, standard deviation, min, max, and quartiles. A quick way to understand the scale and distribution of your data.
df.describe()
5. value_counts()
Returns the frequency of each unique value in a column. Invaluable for understanding the distribution of categorical data at a glance.
# How many countries exist per continent?
df['Continent'].value_counts()
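On the sample data, value_counts() immediately shows that Asia is the only continent appearing more than once (the DataFrame is rebuilt so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'India', 'USA', 'Brazil', 'UK'],
    'Continent': ['Asia', 'Asia', 'North America', 'South America', 'Europe'],
})

counts = df['Continent'].value_counts()
# Asia appears twice; every other continent appears once
```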
6. corr()
Calculates the pairwise correlation between all numeric columns. Returns values from -1.0 (perfect negative relationship) to 1.0 (perfect positive relationship), with 0.0 meaning no linear relationship.
Important:
corr() works on numeric columns only. Filter with select_dtypes() first, or pass numeric_only=True (available since pandas 1.5), to avoid a ValueError on DataFrames that contain text columns.
# Safe pattern — works on any DataFrame
df.select_dtypes(include='number').corr()
# Equivalent built-in option
df.corr(numeric_only=True)
# Or select specific columns manually
df[['2022 Population', 'Area (km2)']].corr()
7. select_dtypes()
Filters and returns only the columns that match a specified data type. Useful when you need to work on a subset of columns without knowing their names in advance.
df.select_dtypes(include='number') # numeric columns only (int and float)
df.select_dtypes(include='object') # string columns only
df.select_dtypes(include='bool') # boolean columns only
df.select_dtypes(include=['int64', 'float64']) # specific numeric types
df.select_dtypes(exclude='object') # drop all string columns
df.select_dtypes(exclude='number') # drop all numeric columns
df.select_dtypes(include='number', exclude='bool') # numeric but not boolean
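A small mixed-type DataFrame (the columns are made up for illustration) shows the behavior, including the detail that boolean columns do not count as 'number':

```python
import pandas as pd

# Invented data with one column of each common dtype
df = pd.DataFrame({
    'name': ['a', 'b'],
    'score': [1.5, 2.5],
    'count': [3, 4],
    'active': [True, False],
})

numeric = df.select_dtypes(include='number')  # score and count; bool is excluded
bools = df.select_dtypes(include='bool')      # active only
```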
8. Select Specific Columns
Returns a new DataFrame containing only the columns you need. Helps reduce output to what is relevant.
df[['Country', '2022 Population']]
Matplotlib Global Settings — plt.rcParams
While not a Pandas function, plt.rcParams is something you'll configure right alongside your Pandas setup. It controls the default appearance of all plots in your session — set it once at the top of your notebook and every chart inherits those settings automatically.
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20, 8) # 20 wide, 8 tall (in inches)
plt.rcParams['font.size'] = 14 # default font size for all text
plt.rcParams['figure.dpi'] = 100 # resolution (higher = sharper)
plt.rcParams['lines.linewidth'] = 2 # default line width
plt.rcParams['axes.grid'] = True # show grid on all plots by default
Common Patterns
Filter, then Sort
The most frequent real-world pattern: narrow the dataset to a subset, then rank within that subset.
# Top 3 Asian countries by population
df[df['Continent'] == 'Asia'].nlargest(3, '2022 Population')
Sort and Preview
Both lines produce the same result — the second is preferred for its clarity.
df.sort_values('2022 Population', ascending=False).head(10)
df.nlargest(10, '2022 Population') # ✅ preferred
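You can check the equivalence directly on the sample data (rebuilt here so the snippet is self-contained); with no tied values, both calls return the same rows in the same order:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'India', 'USA', 'Brazil', 'UK'],
    '2022 Population': [1412000000, 1380000000, 331000000, 214000000, 67000000],
})

a = df.sort_values('2022 Population', ascending=False).head(3)
b = df.nlargest(3, '2022 Population')
# Both hold China, India, USA — identical rows, identical order
```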
Clean Display of Numbers
Control how many decimal places are shown across the entire session.
pd.set_option('display.float_format', '{:.2f}'.format)
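For example, after setting the option every float in printed output is rendered with two decimal places (pd.reset_option('display.float_format') restores the default):

```python
import pandas as pd

pd.set_option('display.float_format', '{:.2f}'.format)

s = pd.Series([3.14159, 2.71828])
print(s)  # the values now render as 3.14 and 2.72
```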
Complete Summary Table
| Function | Category | Purpose |
|---|---|---|
| sort_values() | Sorting | Sort by column values |
| sort_index() | Sorting | Sort by index |
| nlargest() | Sorting | Get top N largest values |
| nsmallest() | Sorting | Get top N smallest values |
| rank() | Sorting | Assign rank to each row |
| head() | Displaying | Show first N rows |
| tail() | Displaying | Show last N rows |
| sample() | Displaying | Show random N rows |
| info() | Displaying | Show structure and data types |
| describe() | Displaying | Statistical summary |
| value_counts() | Displaying | Count frequency of unique values |
| corr() | Displaying | Correlation between numeric columns |
| select_dtypes() | Displaying | Filter columns by data type |
| plt.rcParams | Matplotlib | Set global defaults for all plots |
This is Part 1 of the Pandas for Data Science series. Next up: Part 2, Filtering, Selecting & Cleaning Data.
References
- GitHub Repo : https://github.com/Hu8MA/Mastering-Pandas-Reference
- Website : https://pandas.pydata.org/
- GitHub Library : https://github.com/pandas-dev/pandas
- Course : https://youtu.be/Mdq1WWSdUtw