DEV Community

Cover image for Excel for Data Science: Analyze, Visualize, and Excel
John
John

Posted on

Excel for Data Science: Analyze, Visualize, and Excel

Excel is a tool that is useful to people working with data in their own personal projects or in their organizations. It is a most essential tool for data analysts and data scientists.

Most articles I read when learning to use Excel stated that most renowned data scientists had Excel as their foundational tool. Excel is a great tool for performing analytics. It offers a user — friendly interface for data manipulation, analysis, and visualization.

Microsoft Excel is a spreadsheet developed by Microsoft. It has a total of 1,048,576 rows and 16,384 columns per worksheet.

Excel has computational capabilities, graphing tools, pivot tables, and a Macro- programming language called Visual Basic Applications (VBA).
Excel combines these capabilities to:

  • Calculate data

  • Format data

  • Organize data

Excel offers an all-in-one toolpak, that is;

  • A data repository

  • Graphical User Interface

  • Analysis tool

  • Scripting environment with VBA.

Below is common Excel terminology:

  1. Row — They run horizontally across cells. They are commonly called observations, records, tuples, and trials.

  2. Column— They run vertically across cells. They are also referred to as features, fields, attributes, predictors, and also variables.

  3. Cell— An intersection between a row and a column. Users enter data in cells.

  4. Cell reference — A set of coordinates showing the location of a cell. Rows are numbered while Columns are vertical and are assigned letters.

  5. Address bar — This bar is located on the upper left side of the home ribbon. It displays the coordinates of the selected cell.

  6. Workbook — An Excel file that contains one or more worksheets.

  7. Worksheet tab — These are tabs arranged from the bottom left of the spreadsheets that are used to select or navigate between worksheets.

  8. Column and row headers — Numbered and lettered cells located outside of the columns and rows. Selecting a header highlights the whole row or column

  9. Auto-fill — A feature that enables users to add more than one cell that occurs in a series automatically by selecting the cell with the value and dragging across the cells they wish to auto-fill with the value.
    It is important to note that after getting a value using a formula, one can autofill on other cells.

  10. Auto-sum — This enables users to add multiple values. Users can select the cells they want to add and press the Alt and Equal keys. There is also a button to enable this feature on the top right of the home page, above “Fill” and to the left of “Sort and Filter”.

  11. Pivot table — A data summarization tool that sorts and calculates data automatically. It is located under the insert tab on the far left.

  12. Pivot Chart — A visual aid to the pivot table, providing graph representations of the data. Located under the middle of the insert page, next to maps.

  13. Source data — Information used to create a pivot table

  14. Formula bar — A long input bar located to the right of the address bar. It is used to visualize and enter values and formulas in cells. It is denoted by an FX label

  15. Formula — Mathematical equations with cell references and functions relating to the cells that work together to output a desired value/ computation. Formulas have to be initialized with an “ = ” sign.

Excel can perform a variety of tasks in data analytics. These include:

  • Data entry and storage
  • Sorting and filtering data to find specific information
  • Statistical analysis
  • Strategic analysis
  • Accounting and budgeting
  • Conditional formatting cells to closely observe trends in data and pull insights

Excel files are denoted by .xls extensions or .csv extensions. A .xls is a spreadsheet file that can be created by Excel or other spreadsheet programs. The file type represents an Excel Binary file format.

A CSV file is an abbreviation for Comma Separated Values. Excel files and CSV files can be integrated with Python for data analysis.

Excel allows the user to copy data and therefore can perform analysis and changes on data while still retaining the original dataset. This becomes useful when an analyst needs to start the process all over again or has effected changes that they wish to revert.

Excel has advanced features used to perform complex analyses such as regression analysis, time series analysis, and statistical analysis. Excel performs statistical analysis for measures of central tendency including mean, median, mode, and measures of variance including standard deviation, regression, and correlation.

Excel for data science can be used to perform:

  • Data cleaning and processing — remove duplicates, fill in missing values, and transform data
  • Data transformation — convert data from one format to another, from a source system’s format to the required format of a destination system.
  • Data wrangling — This is removing errors and combining complex data sets to make them more accessible and easier for data analysis.
  • Data exploration and visualization — Excel has a rich variety of visualization tools including graphs, charts, and pivot tables which are useful in data exploration.

To perform its tasks excel uses:

  1. Formulas — Mathematical equations that are formed using functions. Desired data is passed through the functions as input and the desired output is derived based on the formula function applied
  2. Conditional formatting — This is a way to see patterns and trends in data. It helps users and analysts to easily spot trends and patterns in data using bars, colors, and icons to easily highlight important values.
  3. Charts — Used to visually represent data in bar, line, and pie charts. Each type of visualization has a preferred data set where it is more applicable. E. g line graphs work well with time and trend analysis.
  4. Pivot tables — These are features in Excel that help to summarize data and sort of give a different-eye view of the data set. This is achieved by comparing desired columns to those that directly relate to them without interfering with the structure of the data set.

Top comments (0)