Sophia Parafina

Speeding up geodata processing

I've been using the excellent geopandas for working with largish geodata sets and CSV files. While geopandas has been great for working with data, it is slow to ingest geodata. I ran a simple test to time reading a 1.2GB line shapefile into a dataframe.

import geopandas as gpd
import time
import pickle

# read shapefile
read_start = time.process_time()
data = gpd.read_file("Streets.shp")
read_end = time.process_time()

read_time = read_end - read_start
print(str(read_time/60)+" minutes")

25.43723483333333 minutes
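
A note on the timing itself: time.process_time() counts CPU time only, so any time the process spends waiting on disk I/O is excluded. If you want wall-clock numbers instead, time.perf_counter() is the drop-in alternative:

# wall-clock timing; process_time() would exclude I/O wait
read_start = time.perf_counter()
data = gpd.read_file("Streets.shp")
read_end = time.perf_counter()
print(str((read_end - read_start)/60)+" minutes")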

Martin Fleischmann suggested using pyogrio, which does vectorized vector I/O through GDAL/OGR, in place of geopandas' built-in reader. The result was quite impressive.

# alternate package for reading data
from pyogrio import read_dataframe
import time
import pickle

# read shapefile
read_start = time.process_time()
data = read_dataframe("Streets.shp")
read_end = time.process_time()

read_time = read_end - read_start
print(str(read_time/60)+" minutes")

2.9936875333333335 minutes
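
If you only need part of the data, pyogrio can cut read time further by subsetting at read time. A quick sketch; the column names and bounding box below are made up for illustration:

# read only selected attribute columns and features within a bounding box
subset = read_dataframe(
    "Streets.shp",
    columns=["NAME", "SPEED_LIMIT"],  # hypothetical column names
    bbox=(-98.6, 29.3, -98.3, 29.6),  # (xmin, ymin, xmax, ymax)
)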

While going from 25 minutes to 3 minutes is quite an improvement, I'm building out a data processing pipeline and I want to reduce read time even more. My next experiment uses Python's pickle module, which serializes the dataframe into a byte stream and writes it to a file. Here are the results from pickling the dataframe and reading the pickled data back.

# pickle the dataframe to a file
with open('streets', 'wb') as picklefile:
    pickle_write_start = time.process_time()
    pickle.dump(data, picklefile)
    pickle_write_end = time.process_time()

pickle_write = (pickle_write_end - pickle_write_start)/60
print(str(pickle_write)+" minutes")

4.362236583333333 minutes
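
One tweak I haven't benchmarked here: pickle.dump() doesn't necessarily use the newest protocol by default, and passing pickle.HIGHEST_PROTOCOL explicitly can speed up dumping and loading large objects:

# opt into the newest pickle protocol, which can be faster for large objects
with open('streets', 'wb') as picklefile:
    pickle.dump(data, picklefile, protocol=pickle.HIGHEST_PROTOCOL)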

# unpickle the dataframe from the file
with open('streets', 'rb') as picklefile:
    pickle_read_start = time.process_time()
    df = pickle.load(picklefile)
    pickle_read_end = time.process_time()

pickle_read = (pickle_read_end - pickle_read_start)/60
print(str(pickle_read)+" minutes")

0.9217719833333339 minutes

Wow! Reading the 1.2GB shapefile has gone from 25 minutes to under a minute.

Finally, pickling also shrinks the data on disk: the pickled dataframe is 984MB versus the 1.2GB shapefile. Better still, pickled data compresses efficiently.

import gzip
import shutil
with open('streets', 'rb') as f_in:
    with gzip.open('streets.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

The compressed file is 78.8MB, a bit more than a 10x reduction in file size.
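
To use the compressed file later, the dataframe can be unpickled straight from the gzip stream, with no separate decompression step:

import gzip
import pickle

# load the dataframe directly from the compressed pickle
with gzip.open('streets.gz', 'rb') as f:
    df = pickle.load(f)

Expect decompressing on the fly to add some read time compared to the raw pickle.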

Conclusion

If you're working with geodata that remains static, you can improve geopandas' read times by using pyogrio and pickling the resulting dataframe. Additionally, pickle files compress efficiently, which can lower data egress costs when using cloud storage.
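
Here's a rough sketch of how I might wire this into a pipeline, using the compressed pickle as a cache and falling back to pyogrio on a miss. The load_streets helper and cache path are just illustrative:

import gzip
import os
import pickle

from pyogrio import read_dataframe

def load_streets(shapefile, cache='streets.gz'):
    # hypothetical helper: load from the compressed pickle cache if present
    if os.path.exists(cache):
        with gzip.open(cache, 'rb') as f:
            return pickle.load(f)
    # otherwise read the shapefile and build the cache for next time
    df = read_dataframe(shapefile)
    with gzip.open(cache, 'wb') as f:
        pickle.dump(df, f)
    return df

data = load_streets("Streets.shp")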

