DEV Community

Juhi Kushwah

The next basic concept of Machine Learning after NumPy: Pandas

The heading mentions NumPy even though this post is about the Pandas library, because I am documenting my learning journey step by step on this platform as part of the #100DaysOfCode challenge. For more on NumPy, see: Understanding NumPy in the context of Python for Machine Learning

After NumPy, the next basic concept for Machine Learning is Pandas, followed closely by data preprocessing concepts.

Let me explain this as a clear learning path, not just a list.

Recap NumPy
We learned:

  • Arrays & matrices
  • Vectorized operations
  • Basic linear algebra

This is the math engine of ML.
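As a quick refresher, the three recap points can be sketched in a few lines. The matrix and vector values below are made up purely for illustration.

```python
import numpy as np

# A 2x2 matrix and a length-2 vector (values chosen only for illustration)
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([10.0, 20.0])

# Vectorized operation: applied element-wise, no Python loop needed
scaled = A * 2

# Basic linear algebra: matrix-vector product
product = A @ v

print(scaled)
print(product)
```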

Next Core Concept: Pandas

What is Pandas?
Pandas is a Python library for data handling and analysis.
While NumPy handles numeric arrays, Pandas handles labeled, tabular real-world datasets.

In ML, most of your time (a commonly quoted rough figure is ~70%) is spent on data, not modeling.

Why Does Pandas Come Next in ML?

  1. Real ML data is messy
    Datasets usually arrive:

    • As CSV / Excel / JSON files
    • With missing values
    • With mixed data types (numbers + text)

    Pandas makes this easier:

    import pandas as pd
    df = pd.read_csv("data.csv")
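To try this without a file on disk, here is a minimal sketch that reads CSV text from memory instead. The column names and values are hypothetical stand-ins for whatever "data.csv" actually contains. They are chosen to show both missing values and mixed data types.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for "data.csv"
csv_text = """Age,Salary,City,Purchased
25,50000,Delhi,No
32,,Mumbai,Yes
,61000,Pune,No
41,72000,,Yes
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)          # numeric columns with gaps are read as floats
print(df.isnull().sum())  # number of missing values per column
```

Empty fields are read as missing values (NaN), which is why the numeric columns end up as floats rather than integers.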
    
  2. Data cleaning & preprocessing (CRUCIAL for ML)
    This is where ML actually begins.
    Common tasks:

    • Handling missing values
    • Encoding categorical variables
    • Feature selection
    • Filtering rows/columns
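
     df.isnull()                             # True wherever a value is missing
     df.dropna()                             # returns a copy without rows that have missing values
     df.fillna(df.mean(numeric_only=True))   # fill numeric gaps with column means (text columns are skipped)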
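The common tasks listed above can be sketched end to end on a tiny made-up table. The column names and values are hypothetical, and one-hot encoding with `pd.get_dummies` is only one of several ways to encode categorical variables.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps and a text column
df = pd.DataFrame({
    "Age": [25.0, 32.0, np.nan, 41.0],
    "Salary": [50000.0, np.nan, 61000.0, 72000.0],
    "City": ["Delhi", "Mumbai", "Pune", "Delhi"],
})

# Handling missing values: fill numeric gaps with each column's mean
numeric_cols = ["Age", "Salary"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Filtering rows and picking columns (a simple form of feature selection)
over_30 = df.loc[df["Age"] > 30, ["Age", "Salary"]]

# Encoding a categorical variable as one-hot indicator columns
encoded = pd.get_dummies(df, columns=["City"])

print(over_30)
print(encoded.columns.tolist())
```

Note that `fillna` and `dropna` return new objects by default, so the result has to be assigned back, as done for the numeric columns here.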
    
  3. Bridge between raw data and ML models
    ML libraries such as scikit-learn work with NumPy arrays under the hood (most estimators also accept DataFrames directly).

    Pandas makes conversion seamless:

     X = df[['Age', 'Salary']].values
     y = df['Purchased'].values
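Here is a small self-contained version of that conversion, assuming a hypothetical, already-cleaned table that uses the same column names. It uses `.to_numpy()`, which the pandas documentation recommends over `.values`. Both return a NumPy array here.

```python
import pandas as pd

# Hypothetical, already-cleaned data using the column names from the snippet above
df = pd.DataFrame({
    "Age": [25, 32, 47],
    "Salary": [50000, 61000, 82000],
    "Purchased": [0, 1, 1],
})

X = df[["Age", "Salary"]].to_numpy()  # feature matrix: one row per sample
y = df["Purchased"].to_numpy()        # target vector: one label per sample

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

The double brackets in `df[["Age", "Salary"]]` select a two-column DataFrame, so `X` is 2-D, while the single-bracket selection gives a 1-D `y`.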
    
  4. Tabular data representation (DataFrames)

    Pandas introduces the DataFrame, a table with labeled rows and columns (like an Excel sheet):

    [Image: sample Excel data]

     df.head()
     df.columns
     df.shape
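To see what these three calls return, here is a minimal sketch on a hand-built DataFrame. The names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical table built from a dict of columns
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera"],
    "Age": [25, 32, 47],
    "Salary": [50000, 61000, 82000],
})

print(df.head(2))           # first two rows (head() defaults to five)
print(df.columns.tolist())  # column labels
print(df.shape)             # (number of rows, number of columns)
```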

One-line takeaway

After NumPy, learn Pandas — because Machine Learning starts with data, not models.
