DEV Community

yuval mehta


Unleashing GPU Power: Supercharge Your Data Processing with cuDF

While scrolling through blog posts about recent AI advances, I came across cuDF, part of RAPIDS, a family of software libraries and APIs for accelerating data operations and machine learning on GPUs. cuDF runs DataFrame operations in parallel on NVIDIA GPUs, which can pay off on large datasets. This post gives an overview of what cuDF is, its main features, and how to use it for data manipulation.

Rapids cuDF

What is cuDF?

cuDF is a GPU DataFrame library with a pandas-like API for handling data on the GPU. It lets data scientists and engineers process large datasets in GPU memory, which makes it a good fit for preprocessing steps.

Key Features of cuDF

  1. High Performance: Because it runs on GPUs, cuDF can perform many data operations faster than CPU-based libraries.
  2. Pandas Compatibility: cuDF's interface closely follows pandas, so pandas users can move to the GPU without learning a new API.
  3. Seamless Integration: cuDF interoperates with other libraries in the RAPIDS ecosystem, such as cuML for machine learning and cuGraph for graph analytics.

Getting Started with cuDF

Now, without further ado, let’s go over the basic setup and how to use cuDF for data manipulation.

Step 1: Installing cuDF

cuDF requires a supported NVIDIA GPU and a compatible version of the CUDA toolkit. Installation instructions are available from RAPIDS AI.

This is the command I got from the RAPIDS AI installation guide for my system:

conda create -n rapids-24.06 -c rapidsai -c conda-forge -c nvidia  \
    rapids=24.06 python=3.11 cuda-version=12.2
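
After activating the environment, a quick sanity check can confirm that the packages used in this post are visible to Python. The sketch below uses only the standard library; the helper name `has_module` is made up for illustration and is not part of RAPIDS.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Report which of the libraries used in this post are installed.
for module in ("cudf", "pandas", "numpy"):
    status = "found" if has_module(module) else "missing"
    print(f"{module}: {status}")
```

This only checks that the package is installed. It does not verify that a GPU is present or that the CUDA version matches.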

Step 2: Importing cuDF

import cudf
import numpy as np
import pandas as pd

Step 3: Creating a cuDF DataFrame
You can create a cuDF DataFrame from various data sources, including pandas DataFrames, CSV files, and more.

# Create a cuDF DataFrame from a pandas DataFrame
pdf = pd.DataFrame({
    'a': np.random.randint(0, 100, size=10),
    'b': np.random.random(size=10)
})
gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf)
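
A cuDF DataFrame can also be built directly from a Python dictionary and copied back to pandas with `to_pandas()`. The following is a minimal sketch, assuming a working RAPIDS install; the function name `roundtrip_demo` and the `find_spec` guard are illustrative additions so the file still loads on a machine without cuDF.

```python
import importlib.util


def roundtrip_demo():
    """Build a small cuDF DataFrame on the GPU and copy it back to pandas."""
    import cudf  # imported lazily so the file still loads without RAPIDS

    gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
    pdf = gdf.to_pandas()  # copies the data from GPU memory to host memory
    return gdf, pdf


# Only run the demo when cuDF is actually installed.
if importlib.util.find_spec("cudf") is not None:
    gdf, pdf = roundtrip_demo()
    print(type(gdf), type(pdf))
```

Converting back to pandas is handy when a later step relies on a library that only accepts pandas objects.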

Step 4: Data Manipulation with cuDF
cuDF provides a rich set of functions for data manipulation, similar to pandas.

# Adding a new column
gdf['c'] = gdf['a'] + gdf['b']

# Filtering data
filtered_gdf = gdf[gdf['a'] > 50]

# Grouping and aggregation
grouped_gdf = gdf.groupby('a').mean()
print(grouped_gdf)
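
Because the API follows pandas, a helper written against one library can often be reused with the other. The sketch below is one way to illustrate that: `summarize_by_key` is a hypothetical helper (not part of cuDF) that only calls `groupby`, `agg`, and `sort_index`, which both libraries provide.

```python
def summarize_by_key(df, key="a", value="b"):
    """Group `df` by `key` and compute mean, min, and max of `value`.

    `df` may be a cuDF or a pandas DataFrame; no library import is needed here
    because the function only uses methods the two APIs share.
    """
    return df.groupby(key)[value].agg(["mean", "min", "max"]).sort_index()


# Example call, using the DataFrame built in Step 3:
# print(summarize_by_key(gdf))
```

Passing `gdf` keeps the work on the GPU, while passing `pdf` runs the same logic on the CPU.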

Step 5: Reading and Writing Data
cuDF supports reading from and writing to various file formats, such as CSV, Parquet, and ORC.

# Reading from a CSV file
gdf = cudf.read_csv('data.csv')

# Writing to a Parquet file
gdf.to_parquet('output.parquet')
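
Parquet is a columnar format, so a reader can load only the columns it needs. Here is a small sketch of that idea, assuming cuDF is installed; it writes to a temporary directory rather than the `output.parquet` path above, and the function name `parquet_roundtrip` is illustrative.

```python
import importlib.util
import os
import tempfile


def parquet_roundtrip():
    """Write a small frame to Parquet, then read back a single column."""
    import cudf  # lazy import: requires a working RAPIDS install

    gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "output.parquet")
        gdf.to_parquet(path)
        # `columns` restricts the read to the listed columns.
        return cudf.read_parquet(path, columns=["a"])


# Only run the demo when cuDF is actually installed.
if importlib.util.find_spec("cudf") is not None:
    print(parquet_roundtrip())
```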

Step 6: Performance Comparison with Pandas

import time

# Create a large pandas DataFrame
pdf = pd.DataFrame({
    'a': np.random.randint(0, 100, size=100000000),
    'b': np.random.random(size=100000000)
})

# Create a cuDF DataFrame from the pandas DataFrame
gdf = cudf.DataFrame.from_pandas(pdf)

# Timing the pandas operation
start = time.time()
pdf['c'] = pdf['a'] + pdf['b']
end = time.time()
print(f"Pandas operation took {end - start} seconds")

# Timing the cuDF operation
start = time.time()
gdf['c'] = gdf['a'] + gdf['b']
end = time.time()
print(f"cuDF operation took {end - start} seconds")
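
A single timed run can be noisy, and the first call of an operation can include one-time setup costs. One way to get a steadier number is to warm up once and keep the best of several runs. The helper below is a sketch using only the standard library; `best_of` and its parameters are made-up names, and the workload is a stand-in so the snippet runs on any machine.

```python
import time


def best_of(fn, repeats=3, warmup=1):
    """Return the fastest wall-clock time in seconds over `repeats` calls to `fn`."""
    for _ in range(warmup):
        fn()  # untimed warm-up call
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace the lambda with the column addition being measured.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f} seconds")
```

For the comparison above, the callables could be `lambda: pdf['a'] + pdf['b']` and `lambda: gdf['a'] + gdf['b']`.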

Comparison output
From the image we can see that, in this run, cuDF was about 40 times faster than pandas.

Step 7: Using cuDF as a no-code-change accelerator for pandas

%load_ext cudf.pandas 
# Pandas operations now use the GPU!
import numpy as np
import pandas as pd
import time

# Create a large pandas DataFrame
pdf = pd.DataFrame({
    'a': np.random.randint(0, 100, size=100000000),
    'b': np.random.random(size=100000000)
})

# Timing the pandas operation with cudf.pandas
start = time.time()
pdf['c'] = pdf['a'] + pdf['b']
end = time.time()
print(f"Pandas operation with cuDF loaded took {end - start} seconds")
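
The `%load_ext` magic is specific to Jupyter and IPython. For a plain Python script, the accelerator can be enabled either by running the script with `python -m cudf.pandas script.py` or by calling `cudf.pandas.install()` before pandas is imported. The sketch below shows the second option; the wrapper `enable_gpu_pandas` is a hypothetical helper that falls back to regular pandas when cuDF is not installed.

```python
import importlib.util


def enable_gpu_pandas() -> bool:
    """Turn on cudf.pandas if cuDF is installed; return whether it was enabled."""
    if importlib.util.find_spec("cudf") is None:
        return False
    import cudf.pandas

    cudf.pandas.install()  # must run before `import pandas`
    return True


gpu_enabled = enable_gpu_pandas()
print("cudf.pandas enabled:", gpu_enabled)
# import pandas as pd  # import pandas only after the call above
```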

Result
We can see from the image that this gives nearly the same performance as using the cuDF API directly.

Conclusion
cuDF speeds up data processing pipelines by running them on GPUs, which are built for parallel computation. Its biggest strength is that its API closely mirrors pandas, so users can switch and start seeing performance improvements quickly. Bringing cuDF into a data science workflow makes it possible to work with larger datasets and run complex operations faster than on a CPU alone.


If you have any questions about cuDF, or have used it in a project, feel free to share your questions and experiences in the comments below. Happy computing!
