yuval mehta

Posted on Jun 21

Unleashing GPU Power: Supercharge Your Data Processing with cuDF

#datascience #data #python #programming

While scrolling through a blog post about recent AI advances, I came across cuDF, which is part of RAPIDS, a family of software libraries and APIs for accelerating data operations and machine learning on GPUs. cuDF runs DataFrame operations in parallel on NVIDIA GPUs, which can pay off on large datasets. This post gives an overview of what cuDF is, its key features, and how to use it for data manipulation.

What is cuDF?

cuDF is a GPU DataFrame library with a pandas-like API for handling data on the GPU. It lets data scientists and engineers work with large amounts of data in GPU memory, which makes it well suited to pre-processing steps.

Key Features of cuDF

High Performance: By running on GPUs, cuDF can perform many data operations faster than CPU-based libraries, especially on large datasets.
Pandas Compatibility: cuDF is built with an interface similar to pandas, so pandas users can move to GPU-based processing without learning a new system.
Seamless Integration: cuDF interoperates with other libraries in the RAPIDS ecosystem, such as cuML for machine learning and cuGraph for graph analytics.

Getting Started with cuDF

Now, without further ado, let’s go over the basic setup and how to use cuDF for data manipulation.

Step 1: Installing cuDF

cuDF requires a compatible NVIDIA GPU and a suitable version of the CUDA toolkit. With those in place, you can install it by following the RAPIDS installation guide.

This is the command the RAPIDS installation guide gave me for my system:

conda create -n rapids-24.06 -c rapidsai -c conda-forge -c nvidia  \
    rapids=24.06 python=3.11 cuda-version=12.2
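
Before moving on, it can help to confirm that the active environment actually has cuDF available. The following is a minimal sketch using only the standard library: `importlib.util.find_spec` reports whether a package can be found without importing it. The helper name `has_package` is my own, chosen for illustration.

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if has_package("cudf"):
    print("cuDF is installed in this environment")
else:
    print("cuDF not found; activate the rapids-24.06 environment first")
```

This only checks that the package is present. It does not verify that a usable GPU or CUDA driver is visible.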

Step 2: Importing cuDF

import cudf
import numpy as np
import pandas as pd

Step 3: Creating a cuDF DataFrame
You can create a cuDF DataFrame from various data sources, including pandas DataFrames, CSV files, and more.

# Create a cuDF DataFrame from a pandas DataFrame
pdf = pd.DataFrame({
    'a': np.random.randint(0, 100, size=10),
    'b': np.random.random(size=10)
})
gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)

Step 4: Data Manipulation with cuDF
cuDF provides a rich set of functions for data manipulation, similar to pandas.

# Adding a new column
gdf['c'] = gdf['a'] + gdf['b']

# Filtering data
filtered_gdf = gdf[gdf['a'] > 50]

# Grouping and aggregation
grouped_gdf = gdf.groupby('a').mean()
print(grouped_gdf)

Step 5: Reading and Writing Data
cuDF supports reading from and writing to various file formats, such as CSV, Parquet, and ORC.

# Reading from a CSV file
gdf = cudf.read_csv('data.csv')

# Writing to a Parquet file
gdf.to_parquet('output.parquet')

Step 6: Performance Comparison with Pandas

import time

# Create a large pandas DataFrame
pdf = pd.DataFrame({
    'a': np.random.randint(0, 100, size=100000000),
    'b': np.random.random(size=100000000)
})

# Create a cuDF DataFrame from the pandas DataFrame
gdf = cudf.DataFrame.from_pandas(pdf)

# Timing the pandas operation
start = time.time()
pdf['c'] = pdf['a'] + pdf['b']
end = time.time()
print(f"Pandas operation took {end - start} seconds")

# Timing the cuDF operation
start = time.time()
gdf['c'] = gdf['a'] + gdf['b']
end = time.time()
print(f"cuDF operation took {end - start} seconds")

In my run, cuDF was about 40 times faster than pandas on this operation.
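
A single `time.time()` measurement can be noisy, and the first GPU call may include one-time setup cost. As a hedged sketch, the small helper below (the name `best_of` is hypothetical) uses `time.perf_counter`, runs an untimed warm-up call, and reports the fastest of several repeats. It accepts any zero-argument callable, so the same helper can time both the pandas and the cuDF version.

```python
import time


def best_of(fn, repeats=5, warmup=1):
    """Return the fastest wall-clock time (in seconds) of `repeats` calls to fn."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs anywhere; swap in the real operations.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed:.6f} seconds")
```

With the DataFrames from this step, the calls would look like `best_of(lambda: pdf['a'] + pdf['b'])` and `best_of(lambda: gdf['a'] + gdf['b'])`.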

Step 7: Using cuDF as a no-code-change accelerator for pandas

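%load_ext cudf.pandas

# pandas operations now run on the GPU where supported, with CPU fallback otherwise.
# Load the extension before importing pandas.
import numpy as np
import pandas as pd
import time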

# Create a large pandas DataFrame
pdf = pd.DataFrame({
    'a': np.random.randint(0, 100, size=100000000),
    'b': np.random.random(size=100000000)
})

# Timing the pandas operation with cudf.pandas
start = time.time()
pdf['c'] = pdf['a'] + pdf['b']
end = time.time()
print(f"Pandas operation with cuDF loaded took {end - start} seconds")

In my run, this gave nearly the same performance as using the cuDF API directly.
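
The `%load_ext` magic only works in IPython and Jupyter. For plain Python scripts, the cuDF documentation describes two alternatives: running the script as `python -m cudf.pandas script.py`, or calling `cudf.pandas.install()` before pandas is imported. The sketch below wraps the second option in a helper (the name `enable_gpu_pandas` is my own) that falls back to regular pandas when cuDF is not installed.

```python
def enable_gpu_pandas() -> bool:
    """Try to turn on the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas  # only importable when cuDF is installed
    except ImportError:
        return False
    cudf.pandas.install()  # must run before pandas is imported
    return True


accelerated = enable_gpu_pandas()
print("cudf.pandas active" if accelerated else "falling back to CPU pandas")

# Import pandas only after this point so the accelerator can take effect.
```

The helper only handles the case where cuDF is missing. A machine with cuDF installed but no usable GPU may raise a different error at import time.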

Conclusion
cuDF speeds up data processing pipelines by running them in parallel on GPUs. Its biggest strength is that its API closely mirrors pandas, so existing pandas users can switch and start seeing performance improvements quickly. Adding cuDF to a data science workflow makes it practical to work with larger datasets and run complex operations faster than on the CPU alone.

If you have any questions about cuDF, or if you have used it in a project, feel free to share your questions and experiences in the comments section below. Happy computing!
