"Programs must be written for people to read, and only incidentally for machines to execute." - Harold Abelson and Gerald Jay Sussman, Structure and Interpretation of Computer Programs
Introduction to Pandas
Pandas is an open-source Python library that is widely used for data manipulation and analysis. With Pandas we can:
- Summarize the data.
- Read and write different file formats such as CSV, JSON, Excel, and HTML.
- Filter and modify the data based on multiple conditions.
- Merge multiple files.
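As a small preview of the filtering, modifying, and merging mentioned in the list, here is a minimal sketch. The two tables and their column names (dev_id, country, years_coding, salary_usd) are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical data, invented only for this example
developers = pd.DataFrame({
    "dev_id": [1, 2, 3, 4],
    "country": ["India", "Germany", "India", "Brazil"],
    "years_coding": [2, 10, 7, 4],
})
salaries = pd.DataFrame({
    "dev_id": [1, 2, 3],
    "salary_usd": [30000, 85000, 52000],
})

# Filter on multiple conditions: wrap each condition in parentheses
# and combine them with & (and) or | (or)
experienced_in_india = developers[
    (developers["country"] == "India") & (developers["years_coding"] > 5)
]

# Modify only the rows that match a condition (creates the "level" column)
developers.loc[developers["years_coding"] < 3, "level"] = "beginner"

# Merge two tables on a shared column; how="left" keeps every developer
merged = developers.merge(salaries, on="dev_id", how="left")

print(experienced_in_india)
print(merged)
```

Developers without a matching salary row keep their place in the merged table, with a missing value in salary_usd.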
Difference between Attributes and Methods
Attributes are used to represent properties or state of an object, while methods are used to represent behaviors or operations on its data. Attributes are accessed using the dot notation without parentheses, while methods are called using the dot notation with parentheses and optional arguments.
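To make the difference concrete, here is a short sketch on a tiny made-up DataFrame (the column names and values are only for illustration). shape and columns are attributes, so they take no parentheses, while head() and sum() are methods.

```python
import pandas as pd

# Small illustrative DataFrame with invented values
df = pd.DataFrame({
    "language": ["Python", "Rust", "Go"],
    "users": [120, 45, 60],
})

# Attributes: accessed with dot notation, no parentheses
n_rows, n_cols = df.shape
column_names = list(df.columns)

# Methods: called with parentheses and optional arguments
first_two = df.head(2)
total_users = df["users"].sum()

print(n_rows, n_cols)
print(column_names)
print(first_two)
print(total_users)
```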
Importing Pandas
To use the Pandas library in Python, we first need to import it into our code. There are different ways to import Pandas, but the most common one is the statement import pandas as pd.
This statement imports the entire Pandas library, and we can access its functions and classes through the pd alias.
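A minimal sketch of the import and of using the pd alias afterwards (the Series values are arbitrary):

```python
# Import the whole library under the conventional alias "pd"
import pandas as pd

# Everything in Pandas is now reachable through pd
numbers = pd.Series([1, 2, 3])
print(numbers.sum())

# The installed version is available as an attribute
print(pd.__version__)
```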
Reading and Viewing the CSV file
To work with real-world data, I have selected the Stack Overflow Annual Developer Survey file, which is a widely used dataset for data analysis and machine learning. This dataset contains information about the demographics, education, employment, and technology preferences of software developers from different parts of the world. The survey is conducted annually by Stack Overflow, a popular Q&A website for programmers.
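To read a CSV file using Pandas, we use the pd.read_csv() function, which returns a DataFrame (called df here). Once the data is loaded, the following are useful for a first look: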
- df.head(n): Displays the first n rows of the DataFrame (by default, n=5).
- df.tail(n): Displays the last n rows of the DataFrame (by default, n=5).
- df.shape: Returns a tuple containing the number of rows and columns in the DataFrame.
- df.columns: Returns the column names of the DataFrame (as an Index object, which can be converted to a list with list(df.columns)).
- df.dtypes: Returns the data type of each column in the DataFrame.
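The reading and viewing steps above can be sketched as follows. To keep the example runnable without downloading anything, it reads a tiny made-up CSV from a string. The column names and values are assumptions for illustration and are not the real survey data. With a real file, we would pass its path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Stand-in for a real CSV file; the contents are invented for this example
csv_text = """Country,Age,YearsCode
India,24,3
Germany,35,12
Brazil,29,
Canada,41,20
Japan,31,8
Kenya,27,5
"""

# pd.read_csv() accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first 5 rows by default
print(df.tail(2))   # last 2 rows
print(df.shape)     # (number of rows, number of columns)
print(df.columns)   # column names
print(df.dtypes)    # data type of each column
```

Note that head() and tail() are methods, while shape, columns, and dtypes are attributes and take no parentheses.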
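To check for null values in the data we can use df.isnull().sum(). Here isnull() marks every missing value and sum() adds them up, giving the number of missing values in each column.
To get a summary of the data we use df.describe(). By default it only includes columns that are numerical and not strings.
df.info() gives an overview of each column, such as the number of non-null values and the data type, along with the total number of rows.
By default Pandas does not display all columns of a wide DataFrame, so to see every column we use pd.set_option('display.max_columns', None).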
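Here is a minimal sketch of these inspection steps using df.isnull().sum(), df.describe(), df.info(), and pd.set_option(). The small DataFrame is made up for illustration, with a couple of missing values added on purpose.

```python
import pandas as pd

# Invented data with deliberately missing values (None)
df = pd.DataFrame({
    "Country": ["India", "Germany", None, "Canada"],
    "Age": [24, 35, 29, None],
    "YearsCode": [3, 12, 7, 20],
})

# Count missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Summary statistics; by default only numerical columns are included
summary = df.describe()
print(summary)

# Overview of columns: non-null counts and data types (prints directly)
df.info()

# Show every column when printing wide DataFrames (None means no limit)
pd.set_option("display.max_columns", None)
```

To also summarize text columns, df.describe(include="all") is one option.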
DataFrame
In Pandas, a DataFrame is a two-dimensional table-like data structure that consists of rows and columns. Once created, a DataFrame can be manipulated, transformed, and analyzed using various Pandas functions and methods.
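As a rough sketch, a DataFrame can be built from a dictionary where each key becomes a column, and then transformed with ordinary column operations. The names and numbers below are invented for the example.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "hours_coded": [2.5, 1.0, 3.0],
})

# Transform: add a new column computed from an existing one
df["minutes_coded"] = df["hours_coded"] * 60

# Manipulate: sort rows by a column, largest value first
sorted_df = df.sort_values("hours_coded", ascending=False)

print(df)
print(sorted_df)
```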
iloc and loc are two indexers in Pandas that allow us to select subsets of rows and columns from a DataFrame. Unlike methods, they are used with square brackets rather than parentheses. iloc is used for integer-based (positional) indexing, while loc is used for label-based indexing.
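A short sketch of both indexers on a made-up DataFrame with string row labels, so that the difference between positions and labels is visible. One detail worth noticing: slices with iloc exclude the end position, while slices with loc include the end label.

```python
import pandas as pd

# Invented data; the index labels "a", "b", "c" are chosen for illustration
df = pd.DataFrame(
    {"country": ["India", "Germany", "Brazil"], "years_coding": [3, 12, 7]},
    index=["a", "b", "c"],
)

# iloc: select by integer position
first_row = df.iloc[0]          # first row, returned as a Series
top_left = df.iloc[0:2, 0:1]    # rows 0 and 1, column 0 (end is excluded)

# loc: select by label
row_b = df.loc["b"]                     # row labelled "b"
subset = df.loc["a":"b", ["country"]]   # rows "a" through "b" (end is included)
value = df.loc["c", "years_coding"]     # a single cell

print(first_row)
print(top_left)
print(row_b)
print(subset)
print(value)
```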
Conclusion
I am interested in continuing my exploration of the Pandas library because there is a lot to learn from it that can be helpful for my future applications. I will continue listing my daily progress and try to remain consistent. Please do share your feedback on how I can make my 100daysofcode challenge more productive. I'll see you tomorrow for my daily update.