Kaira Kelvin.

Top 30 Pandas Interview Questions and Answers.

A quick recap of pandas ✨

1️⃣ What is Pandas in Python?

  • Pandas is an open-source Python library with powerful built-in methods to efficiently clean, analyze, and manipulate datasets. Developed by Wes McKinney in 2008, the package integrates easily with other data science modules in Python. Pandas is built on top of the NumPy library: its data structures, Series and DataFrame, build on NumPy arrays and add row and column labels.

2️⃣ Why doesn't DataFrame.shape have parentheses?

  • In pandas, shape is an attribute and not a method. df.shape returns a tuple with the number of rows and columns in a DataFrame.
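
As a small illustration, the sketch below reads shape from a made-up DataFrame; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical example data: three rows, two columns.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]})

# shape is an attribute, so it is read without parentheses.
n_rows, n_cols = df.shape
print(df.shape)  # (3, 2)
```

Writing df.shape() instead raises a TypeError, because the tuple that shape returns is not callable.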

3️⃣ What is an index in pandas?

  • The index is a sequence of labels that identify each row of a DataFrame. Labels can be of any hashable data type, such as integers, strings, or timestamps. They are usually unique, although pandas does not require it.
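
A minimal sketch of the idea, assuming a made-up inventory table (the sku and stock columns are hypothetical): the default index is a range of integers, and set_index promotes a column to be the row labels.

```python
import pandas as pd

# Hypothetical inventory data.
df = pd.DataFrame({"sku": ["A1", "B2", "C3"], "stock": [5, 0, 12]})

# Without an explicit index, pandas labels the rows 0, 1, 2, ...
print(df.index)

# Use the "sku" column as the row labels instead.
by_sku = df.set_index("sku")

# Rows can now be looked up by their label.
stock_b2 = by_sku.loc["B2", "stock"]

# is_unique reports whether every label identifies exactly one row.
print(by_sku.index.is_unique)
```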

4️⃣ What is the difference between loc and iloc?

  • Both loc and iloc are indexers used to select subsets of a DataFrame, and in practice they are widely used for filtering a DataFrame based on conditions. loc selects data using the actual labels of rows and columns (or a boolean mask), while iloc selects data based on the integer positions of rows and columns.
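
The difference is easiest to see when the row labels are not integers. The sketch below uses a made-up table with hypothetical fruit and price columns.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"fruit": ["apple", "banana", "cherry"], "price": [3, 1, 7]},
    index=["a", "b", "c"],
)

# loc: select by row label and column label.
by_label = df.loc["b", "fruit"]

# iloc: select by integer position (second row, first column).
by_position = df.iloc[1, 0]

# Label slices include both endpoints; position slices exclude the end.
label_slice = df.loc["a":"b"]   # rows "a" and "b"
position_slice = df.iloc[0:1]   # row at position 0 only

# loc also accepts a boolean mask, which is the usual way to filter rows.
expensive = df.loc[df["price"] > 2]
```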

5️⃣ How do you get the count of all unique values of a categorical column in a DataFrame?

  • Series.value_counts() returns the number of occurrences of each unique value in a Series or a DataFrame column, sorted from most to least frequent.
df['Column_name'].value_counts()
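
A short sketch on a made-up Series of color names (hypothetical data) shows the counts and, with normalize=True, the relative frequencies.

```python
import pandas as pd

# Hypothetical categorical data.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Occurrences of each unique value, most frequent first.
counts = colors.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts.
shares = colors.value_counts(normalize=True)
```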

6️⃣ What is Timedelta?

  • Timedelta represents a duration, i.e. the difference between two dates or times, expressed in units such as days, hours, minutes, and seconds.
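
As an illustration, subtracting two timestamps yields a Timedelta, and a Timedelta can be added back to a timestamp. The dates below are arbitrary example values.

```python
import pandas as pd

# Arbitrary example timestamps.
start = pd.Timestamp("2024-01-01 08:00")
end = pd.Timestamp("2024-01-03 20:30")

# Subtracting two timestamps gives a Timedelta.
elapsed = end - start
print(elapsed)                  # 2 days 12:30:00
print(elapsed.days)             # 2
print(elapsed.total_seconds())  # 217800.0

# A Timedelta can also be built directly and added to a timestamp.
deadline = start + pd.Timedelta(days=7, hours=4)
```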

7️⃣ What is the difference between the append and concat methods?

  • concat combines DataFrames along either rows or columns, while append could only add rows. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the one to use in current code.
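
The sketch below shows both directions with pd.concat on made-up frames; the names q1, q2, and regions are hypothetical.

```python
import pandas as pd

# Hypothetical quarterly data with the same columns.
q1 = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
q2 = pd.DataFrame({"id": [3, 4], "sales": [120, 90]})

# Along rows (the default, axis=0); ignore_index renumbers the result 0..3.
stacked = pd.concat([q1, q2], ignore_index=True)

# Along columns (axis=1): frames are aligned on their index.
regions = pd.DataFrame({"region": ["east", "west"]})
side_by_side = pd.concat([q1, regions], axis=1)
```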

8️⃣ What is the pandas method to get the statistical summary of all the columns in a DataFrame?
df.describe()

  • df.describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution. By default it covers only the numeric columns; pass include='all' to summarize every column.
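
A minimal sketch, assuming a made-up table with one numeric and one text column (both names are hypothetical), shows the default summary and the include='all' variant.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame(
    {"height_cm": [150.0, 160.0, 170.0, 180.0], "team": ["a", "b", "a", "a"]}
)

# Default: count, mean, std, min, quartiles, and max for numeric columns only.
numeric_summary = df.describe()
print(numeric_summary)

# include="all" adds non-numeric columns (unique, top, freq).
full_summary = df.describe(include="all")
```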

9️⃣ What is the difference between Series and DataFrame?

  • DataFrame: a two-dimensional, tabular data structure with rows and columns, similar to a two-dimensional array or a table, in which each column can have a different data type.

  • Series: a one-dimensional labeled array that can hold any data type, such as integers, strings, or Python objects, with all of its values sharing a single data type. A Series is like a single column of a DataFrame, so it consumes less memory than a multi-column DataFrame.
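
The relationship can be sketched with made-up data (the names and values are hypothetical): selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# A one-dimensional labeled array.
ages = pd.Series([31, 25, 40], index=["ann", "bob", "cy"], name="age")

# A two-dimensional table whose columns have different data types.
people = pd.DataFrame(
    {"age": [31, 25, 40], "city": ["Nairobi", "Kisumu", "Mombasa"]},
    index=["ann", "bob", "cy"],
)

# Selecting a single column returns a Series.
column = people["age"]

print(ages.ndim, people.ndim)  # 1 2
print(type(column).__name__)   # Series
```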

1️⃣0️⃣ How do you access the top 6 rows and the last 7 rows of a pandas DataFrame? (Also known as viewing data.)

  • The head() method returns the first n rows of a DataFrame, and its counterpart, tail(), returns the last n rows. Both default to n = 5.

To access the top 6 rows: dataframe_name.head(6)
To access the last 7 rows: dataframe_name.tail(7)
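
For illustration, the sketch below applies both methods to a made-up 20-row frame with a hypothetical column n.

```python
import pandas as pd

# Hypothetical data: 20 rows numbered 1..20.
df = pd.DataFrame({"n": range(1, 21)})

top6 = df.head(6)         # rows with n = 1..6
last7 = df.tail(7)        # rows with n = 14..20
default_head = df.head()  # no argument: first 5 rows
```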

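1️⃣1️⃣ How do you get the index of the max and min values in a DataFrame column?

  • idxmax() and idxmin() return the index label of the first occurrence of the largest and smallest value in the column.

df['Column_name'].idxmax()
df['Column_name'].idxmin()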
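
A short sketch on made-up temperature readings (the temp_c column and day labels are hypothetical) shows that the result is a label, which can then be passed to loc.

```python
import pandas as pd

# Hypothetical readings indexed by day.
temps = pd.DataFrame({"temp_c": [21.5, 30.1, 18.2]}, index=["mon", "tue", "wed"])

hottest_day = temps["temp_c"].idxmax()  # "tue"
coldest_day = temps["temp_c"].idxmin()  # "wed"

# The returned label can be used to fetch the whole row.
hottest_row = temps.loc[hottest_day]
```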

1️⃣2️⃣ How do you describe the relationship between the items or columns in a DataFrame?

  • Pandas provides tools for this kind of analysis, most directly correlation analysis through df.corr().

Correlation analysis is a statistical technique used to measure the
strength and direction of the linear relationship between two variables.

Interpretation of correlation coefficients:

  • The correlation coefficient (often denoted by the symbol ρ or r)
    ranges from -1 to 1.

  • A correlation coefficient of 1 indicates a perfect positive
    linear relationship, meaning that as one variable increases, the
    other variable also increases proportionally.

  • A correlation coefficient of -1 indicates a perfect negative
    linear relationship, meaning that as one variable increases, the
    other variable decreases proportionally.

  • A correlation coefficient of 0 indicates no linear relationship
    between the variables.

Values between -1 and 1 represent the strength and direction of the linear relationship:

Values closer to 1 indicate a stronger positive correlation.
Values closer to -1 indicate a stronger negative correlation.
Values closer to 0 indicate a weaker or no linear relationship.
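
The two extreme cases can be reproduced with DataFrame.corr(), which computes pairwise Pearson correlations by default. The columns x, up, and down below are made-up data chosen to be perfectly linear.

```python
import pandas as pd

# Hypothetical data: "up" rises with x, "down" falls with x.
df = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5], "up": [2, 4, 6, 8, 10], "down": [10, 8, 6, 4, 2]}
)

# Pairwise correlation matrix of all numeric columns.
corr = df.corr()
print(corr)

# Correlation between two specific columns.
r_up = df["x"].corr(df["up"])
r_down = df["x"].corr(df["down"])
```

Here r_up is 1 and r_down is -1 up to floating-point rounding; real data usually falls somewhere in between.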

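1️⃣3️⃣ How do you get the values in a column, or the value of a single item in a row?

  • value_counts() lists each distinct value in a column together with how often it occurs. To read one specific cell, select it by row and column label with df.loc[row_label, 'Column_name'].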

1️⃣4️⃣ What does the info() method print in pandas, and how do you export a DataFrame to a CSV file?

  • info(): prints a concise summary of the DataFrame, including the index, column names, data types, non-null counts, and memory usage.

  • to_csv(): after you have cleaned and preprocessed your data, the next step may be to export the DataFrame to a file. Pass index=False if you do not want the index written as an extra column.

iris_data.to_csv("cleaned_iris_data.csv")
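
The sketch below runs both calls on a tiny made-up frame (hypothetical species and petal_length columns). To avoid touching the disk, it captures the info() output in a buffer and lets to_csv() return a string, which it does when no path is given.

```python
import io

import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame(
    {"species": ["setosa", "virginica"], "petal_length": [1.4, None]}
)

# info() prints to stdout by default; buf redirects the text.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Without a path, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Missing values are written as empty fields by default, so the last row ends with a trailing comma.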

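1️⃣5️⃣ How do you get the highest value of a column in pandas, and how do you create a copy of a dataset?

  • Call max() on the column to get its highest value:

df['Column_name'].max()

  • Call copy() to create an independent copy of the DataFrame, so that changes to the copy do not affect the original:

states2 = states.copy()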
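
Both calls can be sketched on a made-up states table (the column names and values are hypothetical), including a check that editing the copy leaves the original untouched.

```python
import pandas as pd

# Hypothetical data.
states = pd.DataFrame({"state": ["A", "B", "C"], "population": [5, 12, 8]})

# Highest value in one column.
highest = states["population"].max()  # 12

# copy() makes a deep copy by default.
states2 = states.copy()
states2.loc[0, "population"] = 99

# The original DataFrame still holds the old value.
print(states.loc[0, "population"])   # 5
print(states2.loc[0, "population"])  # 99
```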

1️⃣6️⃣ What are some effective methods for handling missing data in pandas?

  • Forward fill (ffill): fills missing values by propagating the last valid observation forward.

df.ffill(inplace=True)

  • Backward fill (bfill): fills missing values in a DataFrame or Series by propagating the next valid observation backward.

df.bfill(inplace=True)

Both methods also accept axis and limit arguments, for example df.bfill(axis=0, limit=1). The older spellings df.fillna(method='ffill') and df.fillna(method='bfill') are deprecated in recent pandas releases.

  • Interpolation: Interpolate missing values based on the values of neighboring data points. This method works well for ordered data such as time series.
    df['column_name'] = df['column_name'].interpolate(method='linear')

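  • Imputation with scikit-learn: SimpleImputer replaces missing values with a column statistic such as the mean, median, or most frequent value. To predict missing values from the other features in the dataset, a model-based imputer such as KNNImputer can be used instead, although that may be overkill for small datasets.

    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='mean')
    df['column_name'] = imputer.fit_transform(df[['column_name']]).ravel()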
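
To compare the fill strategies above side by side, the sketch below applies them to one made-up Series with a two-value gap; the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered data with a gap in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()                      # 1.0, 1.0, 1.0, 4.0
backward = s.bfill()                     # 1.0, 4.0, 4.0, 4.0
linear = s.interpolate(method="linear")  # 1.0, 2.0, 3.0, 4.0
```

Forward and backward fill repeat a neighboring value, while linear interpolation spreads the gap evenly between the two known points, which is why it tends to suit ordered data such as time series.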
    

1️⃣7️⃣ Briefly describe the function of df.isnull().sum().

  • df.isnull() marks every missing value as True, and .sum() then adds these up column by column, so df.isnull().sum() returns the total number of missing values per column.
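
A minimal sketch on a made-up frame with hypothetical columns a, b, and c:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two of three columns.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": ["x", None, None], "c": [1, 2, 3]}
)

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Summing again gives the total for the whole DataFrame.
total_missing = int(missing_per_column.sum())
```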

1️⃣8️⃣ Which syntax is correct for the add method for a set?
set.add(item)
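
As a quick illustration with made-up values, add() inserts one element and silently ignores it if it is already in the set.

```python
# Hypothetical starting set.
seen = {"apple", "banana"}

seen.add("cherry")  # new element is inserted
seen.add("apple")   # already present, so the set is unchanged

print(len(seen))  # 3
```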

1️⃣9️⃣ Which brackets would you use to create a Python tuple?
Parentheses: ()

2️⃣0️⃣ Which data structure allows you to add duplicate, heterogeneous data type elements to it?

`list`
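
A small sketch with made-up values shows a list holding mixed types and repeated elements.

```python
# Hypothetical list mixing an int, a str, and a float, with a repeated int.
items = [42, "pandas", 3.14, 42]

# Appending a value that is already present is allowed.
items.append("pandas")

print(len(items))             # 5
print(items.count("pandas"))  # 2
```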

2️⃣1️⃣ You have an if statement consisting of two expressions with a
logical and operator between them. Which case would lead the Python
interpreter to run the body of the if statement?

When both expressions evaluate to True.
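
The rule can be sketched with a hypothetical can_enter function; the name and the age threshold are made up for illustration.

```python
def can_enter(age, has_ticket):
    """Return True only when both conditions hold."""
    # The body of the if runs only when both operands of `and` are true.
    if age >= 18 and has_ticket:
        return True
    return False


print(can_enter(20, True))   # True
print(can_enter(20, False))  # False
print(can_enter(16, True))   # False
```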

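2️⃣2️⃣ In Python, which name would you choose for a variable?

    _count

A valid variable name starts with a letter or an underscore and contains only letters, digits, and underscores.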