DEV Community

Raman Butta

Getting started with Kaggle

Common libraries such as NumPy, pandas, seaborn, and Matplotlib come pre-installed in Kaggle notebooks, so you only need to import them.
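As a quick sanity check, here is a minimal sketch that reports whether each of these libraries can be imported in the current environment. It only uses Python's standard importlib module; the `libraries` list and the `availability` dictionary are names chosen for this example.

```python
import importlib.util

# Import names of the libraries mentioned above.
libraries = ["numpy", "pandas", "seaborn", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
availability = {
    name: importlib.util.find_spec(name) is not None
    for name in libraries
}

for name, found in availability.items():
    print(f"{name}: {'available' if found else 'missing'}")
```

On Kaggle all four should show up as available, after which the usual `import numpy as np`, `import pandas as pd`, `import seaborn as sns` and `import matplotlib.pyplot as plt` lines are enough.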

Loading datasets in your notebook

A Kaggle notebook's file system is organised into directories.
/kaggle/working/ is your present working directory, where your notebook can write files.
/kaggle/input/ is where the datasets you attach to your notebook are mounted.
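To see these two directories from inside a notebook, something like the sketch below could work. The `describe` helper is hypothetical (written for this example), and it falls back to a message when the directory is missing, so it also runs outside Kaggle.

```python
from pathlib import Path

WORKING_DIR = Path("/kaggle/working")
INPUT_DIR = Path("/kaggle/input")

def describe(path):
    """Return a one-line summary of a directory (hypothetical helper)."""
    path = Path(path)
    if not path.is_dir():
        return f"{path}: not found (probably not running on Kaggle)"
    # Collect the names of everything directly inside the directory.
    entries = sorted(p.name for p in path.iterdir())
    return f"{path}: {len(entries)} entries {entries}"

print("Current directory:", Path.cwd())
print(describe(WORKING_DIR))
print(describe(INPUT_DIR))
```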

So you can go to the sidebar in your Kaggle notebook and click "Add Input" (for datasets hosted on Kaggle) or "Upload" (for your own, non-Kaggle dataset). For example, if you add the "Titanic" dataset, it appears in your /kaggle/input directory.

To print all the files in the datasets added to your notebook, you can use Python's built-in os module, which can walk through a directory tree (like /kaggle/input).
os.walk('/kaggle/input') returns a generator that walks the tree under /kaggle/input and yields three things for each folder it visits:

  • dirname: current folder path
  • subdirs: a list of the subdirectory names in that folder (which we can ignore with _)
  • filenames: a list of the names of all the files in that folder

So you can run the following code:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

and it will output something like

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
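If you would rather collect the paths than print them, a variation along these lines may help. `list_files` and its `suffix` parameter are hypothetical names for this sketch; the walk itself is the same os.walk loop as above.

```python
import os

def list_files(root, suffix=""):
    """Return a sorted list of file paths under root that end with suffix."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(suffix):
                paths.append(os.path.join(dirname, filename))
    return sorted(paths)

# os.walk yields nothing for a missing directory, so this is an
# empty list when no dataset is attached (or when not on Kaggle).
csv_files = list_files("/kaggle/input", suffix=".csv")
print(csv_files)
```

Keeping the paths in a list makes it easy to pass one of them to a loader later instead of retyping it.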

Note that you only see the datasets mounted in your notebook, not every dataset on Kaggle.
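Once a file is mounted, loading it is usually a single pandas call. The sketch below assumes the Titanic path shown in the listing above; `load_csv` is a hypothetical wrapper that gives a clearer error when the dataset has not been added to the notebook.

```python
from pathlib import Path

import pandas as pd

# Path taken from the example listing; it only exists once the
# Titanic dataset has been added to the notebook.
TRAIN_PATH = Path("/kaggle/input/titanic/train.csv")

def load_csv(path):
    """Read a CSV into a DataFrame, failing early if it is not mounted."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} is not mounted; add the dataset to the notebook first"
        )
    return pd.read_csv(path)

if TRAIN_PATH.is_file():
    train_df = load_csv(TRAIN_PATH)
    print(train_df.shape)
    print(train_df.head())
else:
    print(f"{TRAIN_PATH} not found; skipping the load")
```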

To master data science projects (like the Titanic one), it's important to follow a structured pipeline:

  1. Data Loading
  2. Data Preprocessing:
    • Data Cleaning (you can merge all dataframes before cleaning it)
    • Feature Engineering
    • Encoding
  3. Exploratory Data Analysis (EDA)
  4. Preprocessing
  5. Model Building
  6. Model Tuning
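To make the preprocessing step more concrete, here is a minimal sketch of cleaning, feature engineering, and encoding on a toy DataFrame. The column names (Sex, Age, SibSp, Parch) are real Titanic columns, but the rows are made up for illustration. The `preprocess` function, the `FamilySize` feature, and the choice of median imputation are assumptions of this example, not a prescribed recipe.

```python
import pandas as pd

# Toy rows with made-up values, using a few Titanic column names.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 30.0],
    "SibSp": [1, 0, 2],
    "Parch": [0, 0, 1],
})

def preprocess(frame):
    """Return a cleaned, feature-engineered, encoded copy of frame."""
    out = frame.copy()
    # Data cleaning: fill missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Feature engineering: siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Encoding: map the text category to integers.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    return out

clean = preprocess(df)
print(clean)
```

Working on a copy keeps the raw DataFrame untouched, which is handy when you want to compare it with the processed one during EDA.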
