Blog Post 2: NumPy Through a C++ Programmer's Eyes

Week Two: Finally Writing Code That Feels Fast

Week two of my ML learning journey, and I'm starting to see why Python dominates machine learning despite being "slow."

The secret? Most of the time, you're not actually running Python.

This week was all about NumPy and pandas - the foundations of pretty much every ML library. And as someone who's written a lot of C++ code focused on performance, watching NumPy operations run was genuinely satisfying. These aren't slow Python loops. They're compiled C code operating on contiguous arrays, using SIMD instructions where possible.

It's basically everything I love about C++ performance, wrapped in Python's convenience.

Day 8: Building Image Transformations Without Image Libraries

The first challenge: implement image transformations (rotate, flip, crop, brightness adjustment) using only NumPy. No OpenCV, no PIL for the actual transformations.

The rotation algorithm was the fun part. I knew I needed to rotate 90° clockwise, but which operations exactly? After some debugging with test patterns (red left half, blue right half), I figured it out:

import numpy as np

def rotate_90(image):
    # Step 1: Transpose (swap rows and columns)
    transposed = np.transpose(image, (1, 0, 2))

    # Step 2: Flip horizontally (reverse the columns) for a clockwise
    # rotation - flipping axis=0 instead gives counterclockwise
    return np.flip(transposed, axis=1)

Transpose alone doesn't give rotation - you need transpose + flip. I only really understood this after printing intermediate steps and tracing through what should happen to each quadrant.

When Claude suggested np.transpose(image, (1, 0, 2)), I made myself stop and ask: what does that tuple actually mean? Turns out (1, 0, 2) means "put axis 1 first, axis 0 second, keep axis 2 third." So columns become rows, rows become columns, color channels stay unchanged. The debugging process of creating test patterns and visualizing transformations taught me more than just reading documentation would have.
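The quickest way to see it is with a throwaway array - shapes only, no real image needed:

import numpy as np

# Dummy "image": 4 rows, 6 columns, 3 color channels
image = np.zeros((4, 6, 3), dtype=np.uint8)

swapped = np.transpose(image, (1, 0, 2))
print(image.shape)    # (4, 6, 3)
print(swapped.shape)  # (6, 4, 3) - rows and columns traded places, channels untouched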

The performance difference is wild. Every operation works on entire arrays at once. No loops over millions of pixels. image * brightness_factor multiplies every single pixel value in one vectorized operation. This is the SIMD parallelism I'm used to from C++, but I didn't have to write it myself.
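One wrinkle the exercise forced me to handle (the clip-and-cast below is my addition, not something spelled out in the assignment): multiplying a uint8 array by a float upcasts the result to float64, and bright pixels can overshoot 255, so a brightness helper needs to saturate before converting back. A minimal sketch:

import numpy as np

def adjust_brightness(image, factor):
    # uint8 * float upcasts to float64; clip so bright pixels
    # saturate at 255 instead of wrapping around on the cast back
    return np.clip(image * factor, 0, 255).astype(np.uint8)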

Days 9-10: Pandas Element-Wise Operators Are Not Python Operators

Pandas threw me for a loop because it looks like regular Python but behaves completely differently.

The element-wise operator confusion:

I kept trying to write conditionals like normal Python:

# This doesn't work:
df[df['age'] > 120 or df['age'] < 0]  # ERROR!

# You need element-wise operators:
df[(df['age'] > 120) | (df['age'] < 0)]  # Works!

Use | for OR, & for AND, ~ for NOT. Always. And keep the parentheses: &, |, and ~ bind tighter than comparisons like >, so df['age'] > 120 | df['age'] < 0 parses as something else entirely. This tripped me up for a solid day until it finally clicked: these operators work on entire columns at once, not single values.
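The reason or fails is that Python asks for a single bool of the whole column, which pandas refuses to give. A two-line demo:

import pandas as pd

ages = pd.Series([150, 30, -5])

# bool(ages) raises "ValueError: The truth value of a Series is
# ambiguous" - which is exactly what `or` does under the hood
mask = (ages > 120) | (ages < 0)
print(mask.tolist())  # [True, False, True]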

The groupby-aggregate pattern is everywhere:

This pattern appears constantly in ML preprocessing:

# Calculate total spending per customer
customer_totals = df.groupby('customer_id')['amount'].sum()

# Map those totals back to every row
df['customer_total'] = df['customer_id'].map(customer_totals)

Split the data into groups, apply some aggregation, combine the results back. Once I understood this pattern, tons of feature engineering operations made sense.
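Pandas can also collapse the two steps into one with transform(), which broadcasts the aggregate straight back to the original rows - same result, one line:

import pandas as pd

df = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10.0, 5.0, 7.0]})

# transform('sum') returns one value per original row, aligned by index
df['customer_total'] = df.groupby('customer_id')['amount'].transform('sum')
print(df['customer_total'].tolist())  # [15.0, 15.0, 7.0]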

The CSV string conversion gotcha:

My favorite bug of the week: I had integration tests failing because my CSV columns came back from pandas as strings, not numbers (pandas leaves a column as strings when it can't infer a clean numeric dtype). My unit tests all passed (they used real Python numbers), but when I tested the complete pipeline reading from a file, everything broke.

# CSV gives you strings:
df.iloc[:, 0].tolist()  # ['1', '2', '3'] - all strings!

# Need explicit conversion:
pd.to_numeric(df.iloc[:, 0], errors='coerce')  # [1, 2, 3]

This is exactly why you need integration tests, not just unit tests. Different test types catch different bugs.
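One way to catch this at the source is to declare expected dtypes when reading, so read_csv fails loudly instead of quietly handing back strings. A sketch - the file and column names here are made up:

import pandas as pd

# Raises ValueError at read time if a column can't be parsed,
# instead of silently producing strings downstream
df = pd.read_csv('transactions.csv', dtype={'amount': 'float64', 'customer_id': 'int64'})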

Day 12: The 150x Speedup

This was the most satisfying day. I had a function that processed transactions using .apply() with lambdas and some iterrows loops. It worked. It was slow. Claude challenged me to optimize it using vectorization.

The results:

  • Slow version: 0.46 seconds for 10k rows (21,559 rows/second)
  • Fast version: 0.003 seconds for 10k rows (3,249,635 rows/second)
  • Speedup: 150x faster

Same input. Same output (verified with pd.testing.assert_frame_equal()). Just replaced Python loops with vectorized NumPy operations.

The key transformations:

# SLOW - apply with lambda
df['total'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)

# FAST - vectorized multiplication  
df['total'] = df['price'] * df['quantity']

# SLOW - apply with if/elif/else function
def categorize(row):
    if row['amount'] < 50:
        return 'small'
    elif row['amount'] < 200:
        return 'medium'
    else:
        return 'large'
df['category'] = df.apply(categorize, axis=1)

# FAST - np.select with conditions
conditions = [
    df['amount'] < 50,
    (df['amount'] >= 50) & (df['amount'] < 200),
    df['amount'] >= 200
]
choices = ['small', 'medium', 'large']
df['category'] = np.select(conditions, choices)

The lesson: .apply() and .iterrows() were 150x slower here because they're Python loops in disguise. Every iteration pays interpreter overhead. Vectorized operations run in compiled C code with no per-element overhead.
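The timing harness doesn't need to be fancy. A sketch along these lines reproduces the shape of the comparison (synthetic data and a different machine, so the exact numbers will differ):

import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'price': rng.random(10_000),
                   'quantity': rng.integers(1, 10, 10_000)})

start = time.perf_counter()
slow = df.apply(lambda row: row['price'] * row['quantity'], axis=1)
slow_s = time.perf_counter() - start

start = time.perf_counter()
fast = df['price'] * df['quantity']
fast_s = time.perf_counter() - start

pd.testing.assert_series_equal(slow, fast, check_names=False)  # same values
print(f"apply: {slow_s:.4f}s  vectorized: {fast_s:.5f}s  ({slow_s / fast_s:.0f}x)")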

This isn't "premature optimization." This is fundamental to how you write pandas code. You can't just "optimize later" - you need to think vectorized from the start.

Days 13-14: Making Data Problems Visible

The weekend project was building a data quality dashboard. I took the matplotlib visualizations from Day 13 and wrapped them in a Streamlit app.

The result: upload any CSV, instantly see:

  • Amount distribution (with outliers highlighted in red)
  • Time series (with missing data periods shaded)
  • Age distribution (valid vs impossible values)
  • Category balance (class imbalance visualization)

Plus automated detection of:

  • Missing values
  • Statistical outliers
  • Invalid ages (negative or >120)
  • Negative amounts
Each issue comes with specific counts and a recommendation.
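The detection logic itself is plain pandas. The outlier check, for instance, is roughly the textbook IQR rule - a sketch, not the dashboard's exact code:

import pandas as pd

def find_outliers_iqr(values: pd.Series) -> pd.Series:
    # Box-plot rule: flag anything beyond 1.5 * IQR from the middle 50%
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]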

What I learned about Streamlit:

It's refreshingly simple. The entire script reruns on every user interaction, which sounds inefficient but makes the programming model dead simple. No state management, no callbacks, no frontend/backend separation.

uploaded_file = st.file_uploader("Choose a CSV")

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.dataframe(df.head(10))
    # Show visualizations...

That's it. Upload → Process → Display. No web development required.
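One consequence of the rerun model: anything expensive reruns too, unless you cache it. Streamlit ships st.cache_data for exactly this. A sketch, hashing the raw bytes of the upload so the cache has something stable to key on:

import io

import pandas as pd
import streamlit as st

@st.cache_data
def load_csv(raw_bytes: bytes) -> pd.DataFrame:
    # Cached on the bytes, so script reruns skip the re-parse
    return pd.read_csv(io.BytesIO(raw_bytes))

uploaded_file = st.file_uploader("Choose a CSV")
if uploaded_file is not None:
    df = load_csv(uploaded_file.getvalue())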

The "calculate once, use twice" pattern:

I caught myself calling the same detection functions multiple times:

# Inefficient - calls function twice:
st.metric("Missing", find_missing(df))
if find_missing(df) > 0:
    st.warning("Found missing values")

My C++ performance instincts kicked in:

# Better - calculate once:
missing_count = find_missing(df)
st.metric("Missing", missing_count)
if missing_count > 0:
    st.warning("Found missing values")

Not a huge deal for small datasets, but good habits matter.

What My C++ Background Got Right and Wrong

What transferred well:

  • Performance awareness: I instinctively noticed when operations might be slow and looked for vectorized alternatives.
  • Memory layout intuition: Understanding that NumPy arrays are contiguous in memory made sense immediately.
  • Type thinking: Python's type hints feel natural. When pandas operations convert uint8 to float64, I notice.
  • Debugging mindset: Add logging, test edge cases, isolate the problem systematically.

What I had to unlearn:

  • Loops are fine → Loops are death: In C++, loops are normal. In pandas, they're 150x slower. This is a fundamental mental shift.
  • Control flow is explicit → Control flow is vectorized: Can't use if/elif/else on arrays. Must use np.select() or np.where() (see the sketch after this list).
  • Build from scratch → Use the ecosystem: C++ culture is "roll your own." Python ML culture is "there's definitely a library for that."
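np.where() is the simpler of the two: a vectorized ternary for plain if/else cases. A throwaway example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'amount': [25, -10, 300]})

# condition, value-if-true, value-if-false - evaluated per element
df['sign'] = np.where(df['amount'] < 0, 'invalid', 'ok')
print(df['sign'].tolist())  # ['ok', 'invalid', 'ok']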

The biggest surprise: NumPy gives me C++ performance without writing C++. Most of the time. When I eventually need even more speed, the roadmap has me implementing custom C++ extensions later. But for now, vectorized NumPy is fast enough.

The Discovery-Based Learning Struggle

The hardest part of this week wasn't the code - it was staying curious instead of copying solutions.

When Claude suggested using np.transpose(image, (1, 0, 2)) for rotation, I had to force myself to stop and ask:

  • What does the (1, 0, 2) tuple actually mean?
  • Why those specific numbers?
  • What happens if I change the order?

This turns a 5-minute "just make it work" into a 20-minute learning session where I actually understand axis manipulation.

Same with pd.to_numeric(..., errors='coerce'):

  • What does 'coerce' do?
  • What are the alternatives?
  • When would I use 'raise' or 'ignore' instead?
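For the record, the answers are easy to check with a throwaway Series:

import pandas as pd

s = pd.Series(['1', 'oops', '3'])

# 'coerce' turns anything unparseable into NaN;
# 'raise' (the default) would throw ValueError on 'oops' instead
print(pd.to_numeric(s, errors='coerce').tolist())  # [1.0, nan, 3.0]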

It's slower. Sometimes frustrating. But it's the difference between having code that works vs understanding why it works.

What Actually Tripped Me Up

The "Rumpelstiltskin problem" is real. The hardest part of learning pandas isn't understanding concepts - it's knowing what operations exist and what they're called.

I can't use .mask() if I don't know it exists. I can't search for "how to do X" if I don't know X is called "broadcasting." This is where having Claude as a guide helps - it can suggest the right operation for the problem, then I go understand how it works.

NaN propagation is weird. Coming from languages where NULL works differently, pandas' NaN behavior took getting used to. It silently propagates through operations in ways that break boolean logic:

# Without na=False, NaN breaks filtering:
df['email'].str.contains('@')  # Returns [True, False, NaN, True]
df[df['email'].str.contains('@')]  # ERROR!

# Must handle explicitly:
df['email'].str.contains('@', na=False)  # Returns [True, False, False, True]

Week 2 vs Week 1

Week 1 was about development practices (testing, error handling, packaging). Week 2 was about the actual data manipulation tools (NumPy, pandas, visualization).

Both feel essential. You can't build production ML without both:

  • Clean code that doesn't crash (Week 1)
  • Fast data processing that scales (Week 2)

The combination is what makes ML engineering work in production.

What's Next

Week 3 starts traditional machine learning - linear models, decision trees, ensemble methods. Still using Claude's discovery-based approach: here's the problem, here's the documentation, now figure it out.

I'm getting more comfortable with this pattern. The first few days I wanted explicit instructions. Now I appreciate the struggle - it's where the learning happens.

Also: I've told Claude to start writing most of my tests because I understand the patterns now. Learning to delegate to AI is part of learning with AI.

Two weeks in. Still no neural networks. Just data engineering foundations. And honestly? I'm starting to understand why everyone says data engineering is 80% of ML work.


About this series: I'm a software engineer learning ML using a custom roadmap designed by Claude. The approach focuses on production skills and problem-solving over tutorials. Week 2 complete: NumPy, pandas, and an interactive data quality dashboard. All code and daily summaries on [GitHub link].


Feedback welcome: Did the C++ perspective add value or just clutter? Should I include more code examples or keep it high-level?
