DEV Community

Cris Crawford

Colab and DuckDB

Last week we had a dlt (Data Load Tool) workshop in Data Engineering class. I'll talk about dlt in another post. Briefly, it's an open-source Python library that makes data loading easy. To complete the homework, I used Google Colab to run the Python notebook and DuckDB for the database. The verdict: Colab was not worth the trouble, and DuckDB is awesome.

First, why was I using Colab and not Jupyter Notebook? The reason is this line of code: %%capture. It preceded the line !pip install dlt[duckdb]. Jupyter Notebook gave me the error UsageError: Line magic function '%%capture' not found. I should have asked ChatGPT about this right away, but instead I assumed I had to use Colab. More on this later.

Colab is Google's answer to Jupyter Notebook. It runs in the browser. I suppose there are reasons it's better than Jupyter Notebook, but for me there wasn't any advantage, and there were a few annoyances. First of all, I couldn't just run Colab and open the notebook. I had to upload my notebook to my Google Drive. Maybe that isn't strictly necessary, but that's what ChatGPT said to do. I don't like Google Drive. I have a ton of files shared with me that I would like to never see again, but I don't know how to unshare them. Then, when there is a file that I do want to see, I have to scroll down many screens to find it. And if I exceed the 15 GB quota, I have to pay. Anyway, I uploaded my file and then launched Colab. It didn't see the file. I looked again, and the file was there in Drive, but Colab didn't recognize it. So I asked ChatGPT how to deal with this. It told me to control-click on the file and open it with Colab. I did that, but I didn't see Colab in the "Open With" menu. I had to install it first. Once I did, I was able to open the notebook and proceed with the homework assignment.

Had I asked ChatGPT about the error message, I could have saved myself some time and trouble. I wondered about that and asked it later. It said to just delete the line %%capture. I did that, ran the cell that now said !pip install dlt[duckdb], and got another error: zsh:1: no matches found: dlt[duckdb]. ChatGPT said to put single quotes around dlt[duckdb]. That worked, and I could complete the assignment in Jupyter Notebook. So basically I did the homework twice, but now I don't have to download my notebook from Google Drive.
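
The no matches found error comes from zsh rather than from pip: zsh reads the square brackets in dlt[duckdb] as a filename pattern (dlt followed by one of the characters d, u, c, k or b) and, by default, stops when no file matches. The sketch below uses only the Python standard library to illustrate that reading and to show the quoted form. It is an analogy (Python's fnmatch is not zsh's globbing engine), it installs nothing, and the variable names are made up for illustration.

```python
import fnmatch
import shlex

requirement = "dlt[duckdb]"

# Read as a glob, the brackets are a character class:
# "dlt" followed by exactly ONE of the characters d, u, c, k, b.
matches_dltd = fnmatch.fnmatchcase("dltd", requirement)

# The literal text "dlt[duckdb]" does not match its own pattern.
matches_itself = fnmatch.fnmatchcase(requirement, requirement)

# shlex.quote adds single quotes so a POSIX-style shell
# passes the text through unchanged instead of expanding it.
quoted = shlex.quote(requirement)

print(matches_dltd, matches_itself)  # True False
print("pip install " + quoted)       # pip install 'dlt[duckdb]'
```

In a notebook cell, the quoted form is the one that worked for me: !pip install 'dlt[duckdb]'.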

I still don't know what the Line magic function %%capture was for.
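
For reference, %%capture is an IPython cell magic rather than a line magic: it runs the rest of the cell and captures the output instead of displaying it, which is why it often sits above a noisy pip install. Cell magics have to be the first line of a cell, so a comment or other code above it is a common cause of the Line magic function error. As a rough analogy only (this is standard-library Python, not what IPython does internally, and the printed text is a stand-in rather than real pip output), capturing output looks like this:

```python
import contextlib
import io

buffer = io.StringIO()

# Everything printed inside this block goes into the buffer
# instead of appearing on screen.
with contextlib.redirect_stdout(buffer):
    print("stand-in for installer output, line 1")
    print("stand-in for installer output, line 2")

# The text is still available afterwards if you want to inspect it.
captured = buffer.getvalue()
print(len(captured.splitlines()), "lines were captured and hidden")
```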

DuckDB is a different story. It's an open-source, in-process database that is easy to install. It has no external dependencies. It runs on Windows, macOS (both Intel and Apple silicon), and Linux. You can query it easily using SQL. It's fast, and it's completely free. You can read more about it at https://duckdb.org.

Top comments (2)

adrian

Ahh, good to know! We chose Colab for the course as it's a uniform environment and should not add challenges. On the other hand, it did :) I think for the next workshop we should offer a Jupyter notebook in the repo and Colab just as a link.

patrick pointdujour

Thanks for sharing, Cris. I have been playing with DuckDB in Colab with files in my Drive, and everything was working great until I tried to connect to Azure Data Lake. The connection succeeded and I could loop through the files, but when I tried to select using the "file path", it started giving errors about a missing certificate and a file not found.
I chose Colab because it's been great not to have to set up anything else locally and just do a POC within that isolated environment, but this is a roadblock.