Cris Crawford


Colab and DuckDB

Last week we had a dlt (Data Load Tool) workshop in Data Engineering class. I'll talk about dlt in another post. Briefly, it's an open-source Python library that makes data loading easy. To complete the homework, I used Google Colab to read the Python notebook and DuckDB for the database. The verdict: Colab is not worth the trouble, and DuckDB is awesome.

First, why was I using Colab and not Jupyter notebook? The reason is this line of code: %%capture. It preceded the line !pip install dlt[duckdb]. Jupyter notebook gave me the error UsageError: Line magic function '%%capture' not found. I should have asked ChatGPT about this right away, but instead, I assumed I had to use Colab. More on this later.

Colab is Google's answer to Jupyter notebook. It runs in the browser. I suppose there are reasons it's better than Jupyter notebook, but for me there wasn't any advantage, and there were a few annoyances. First of all, I couldn't just run Colab and open the notebook. I had to upload my notebook to my Google Drive. Maybe not, but that's what ChatGPT said to do. I don't like Google Drive. I have a ton of files shared with me that I would like to never see again, but I don't know how to unshare them. Then when there is a file that I do want to see, I have to scroll down many screens to find it. And if I exceed the 15 GB quota, I have to pay.

Anyway, I uploaded my file and then invoked Colab. It didn't see the file. I looked again, and there it was, but Colab didn't recognize it. So I asked ChatGPT how to deal with this. It told me to control-click on the file and open it with Colab. I did that, but I didn't see Colab in the "Open With" menu. I had to install it. I did that, and was able to open the notebook and proceed with the homework assignment.

Had I asked ChatGPT about the error message, I could have saved myself some time and trouble. I wondered about that and asked it later. It said to just delete the line %%capture. I did that, ran the cell that now said !pip install dlt[duckdb], and I got another error: zsh:1: no matches found: dlt[duckdb]. ChatGPT said to put single quotes around dlt[duckdb]. That worked, and I could complete the assignment in Jupyter notebook. So basically I did the homework twice, but now I don't have to download my notebook from Google Drive.
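Here is a minimal sketch of why the single quotes help, as far as I can tell. zsh reads the square brackets in dlt[duckdb] as a filename pattern, and by default it reports "no matches found" when no file matches. The sketch uses Python's fnmatch module as a stand-in for the shell's pattern matching (an assumption for illustration; zsh has its own globbing rules) and shlex.quote to build the quoted form.

```python
import fnmatch
import shlex

requirement = "dlt[duckdb]"

# Read as a glob pattern, the brackets are a character class: the pattern
# means "dlt" followed by exactly one of the letters d, u, c, k, b.
matches_dltd = fnmatch.fnmatchcase("dltd", requirement)
# The literal text "dlt[duckdb]" does not match its own pattern.
matches_itself = fnmatch.fnmatchcase(requirement, requirement)

# shlex.quote wraps the string in single quotes, so a POSIX-style shell
# hands it to pip unchanged instead of expanding it.
quoted = shlex.quote(requirement)
command = f"pip install {quoted}"

print(matches_dltd, matches_itself)
print(command)
```

The last line prints the same command that ended up working in the notebook cell, minus the leading exclamation mark.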

I still don't know what the Line magic function %%capture was for.

DuckDB is a different story. It's an open-source, in-process database that is easy to install. It has no outside dependencies. It runs on Windows, macOS (both Intel and Apple silicon), and Linux. You can query it easily using SQL. It's fast, and it's completely free. You can read more about it at https://duckdb.org.

Top comments (1)

adrian

Ahh, good to know! We chose Colab for the course because it's a uniform environment and should not add challenges. On the other hand, it did :) I think for the next workshop we should offer a Jupyter notebook in the repo and Colab just as a link.