DEV Community

preeti deshmukh

Beginner's Data Analyst Glossary

Every term you’ve been nodding along to in meetings, finally explained unambiguously so you stop Googling them under the table. 🤣

(Disclaimer: I did not Google these. Possibly LLM-ed. 😜)

1. Core Data Concepts

Start here. These are the foundational building blocks every data analyst must understand before anything else.

| Term | Acronym | Definition | Example |
|---|---|---|---|
| Data | - | Raw facts and figures that have not yet been processed or analyzed | Sales numbers, customer names, timestamps |
| Dataset | - | A structured collection of data organized for analysis | A spreadsheet of 10,000 customer orders |
| Database | DB | An organized system for storing and retrieving structured data | MySQL, PostgreSQL, Microsoft SQL Server |
| Spreadsheet | - | A grid-based tool for organizing, calculating, and visualizing data | Microsoft Excel, Google Sheets |
| Row / Record | - | A single entry in a table - represents one item or event | One customer's order details |
| Column / Field | - | A category or attribute shared across all rows in a table | Customer name, order date, price |
| Data Type | - | The kind of value a field holds - number, text, date, boolean, etc. | Age = integer; Name = text; Active = boolean |
| Structured Data | - | Data organized in rows and columns with a defined format | A sales table in a database |
| Unstructured Data | - | Data with no fixed format or schema | Customer emails, social media posts, images |
| Semi-Structured Data | - | Data with some organization but not a strict table format | JSON files, XML documents, log files |
| Metadata | - | Data that describes other data - its structure, origin, and meaning | A file's creation date, author, and size |
| Primary Key | PK | A unique identifier for each row in a database table | Customer ID = 10042 (no two customers share it) |
| Foreign Key | FK | A field in one table that links to the primary key of another table | Order table has a Customer ID column that links to the Customer table |
| Schema | - | The blueprint that defines the structure of a database - its tables, columns, and data types | A schema specifying that the Orders table has 5 columns with specific types |
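
To make primary keys, foreign keys, and schemas a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample values are made up for illustration, and SQLite is just a convenient stand-in for whatever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# The schema: which tables exist, their columns, and their data types.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique for every row
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    price       REAL
);
""")

conn.execute("INSERT INTO customers VALUES (10042, 'Asha')")
conn.execute("INSERT INTO orders VALUES (1, 10042, 49.99)")


def try_insert(sql):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A second customer with the same ID breaks the primary key rule.
duplicate_pk_ok = try_insert("INSERT INTO customers VALUES (10042, 'Someone else')")
# An order pointing at a customer that does not exist breaks the foreign key rule.
orphan_fk_ok = try_insert("INSERT INTO orders VALUES (2, 99999, 10.0)")

print(duplicate_pk_ok, orphan_fk_ok)
```

Both inserts should be rejected, which is the whole point: the keys stop bad rows from getting in.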

2. SQL & Querying

SQL is the most important skill for a data analyst. These are the terms and commands you will use every single day.

| Term | Acronym | Definition | Example |
|---|---|---|---|
| Query | - | A question or request you send to a database to retrieve or manipulate data | SELECT * FROM orders WHERE date > '2024-01-01' |
| SQL | SQL | Structured Query Language - the standard language for querying and managing relational databases | Used in MySQL, PostgreSQL, SQL Server, BigQuery |
| SELECT | - | SQL command to retrieve data from a table | SELECT name, age FROM customers |
| WHERE | - | SQL clause to filter rows based on a condition | WHERE country = 'India' |
| JOIN | - | SQL operation to combine rows from two or more tables based on a related column | JOIN orders ON customers.id = orders.customer_id |
| GROUP BY | - | SQL clause that groups rows sharing a value so aggregate functions can be applied | GROUP BY city - then COUNT() per city |
| ORDER BY | - | SQL clause that sorts results by one or more columns | ORDER BY sales DESC - highest sales first |
| Aggregate Function | - | A function that performs a calculation on a set of values and returns a single result | SUM(), AVG(), COUNT(), MIN(), MAX() |
| Subquery | - | A query nested inside another query | SELECT * FROM sales WHERE amount > (SELECT AVG(amount) FROM sales) |
| Window Function | - | A function that calculates values across a set of rows related to the current row, without collapsing them | ROW_NUMBER(), RANK(), LAG(), LEAD() for rankings and row-to-row comparisons; SUM() OVER (...) for running totals |
| CTE | CTE | Common Table Expression - a temporary named result set defined within a query, making complex queries easier to read | WITH top_customers AS (SELECT ...) SELECT * FROM top_customers |
| Index | - | A database structure that speeds up data retrieval by providing a fast lookup path | An index on Customer ID makes searches by ID much faster |
| View | - | A saved query that acts like a virtual table | A 'monthly_sales' view that always returns the latest month's data |
| Stored Procedure | - | A saved set of SQL statements that can be executed on demand | A procedure that calculates monthly bonuses for all employees |
| NULL | - | A missing or unknown value in a database - not zero, not blank, but absent | A customer's phone number field with no value entered |
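
Here is a small sketch that puts several of these terms into one runnable script, again using Python's built-in sqlite3 module. The customers and orders tables and their rows are invented for illustration; the SQL itself should carry over to most databases with little or no change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha', 'Mumbai'), (2, 'Ben', 'Pune'), (3, 'Chen', 'Mumbai');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0), (4, 3, 30.0);
""")

# JOIN + GROUP BY + aggregate function + ORDER BY:
# total sales per city, highest first.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON c.id = o.customer_id
    GROUP BY c.city
    ORDER BY total DESC
""").fetchall()

# Subquery: orders larger than the average order.
above_average = conn.execute("""
    SELECT id, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# NULL is "unknown", so NULL = NULL is itself unknown; test with IS NULL instead.
null_check = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

print(sales_by_city)
print(above_average)
print(null_check)
```

The last query is a common beginner trap: comparing with `= NULL` never matches anything, so filters on missing values need `IS NULL`.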

3. Statistics for Data Analysts

You don't need a statistics degree, but understanding these core concepts will make your analysis trustworthy and rigorous.

| Term | Acronym | Definition | Example |
|---|---|---|---|
| Mean | - | The arithmetic average of a set of values | Average order value = total revenue / number of orders |
| Median | - | The middle value when data is sorted - less affected by outliers than mean | If salaries are 20K, 30K, 35K, 40K, 200K - median is 35K, while the mean is 65K |
| Mode | - | The most frequently occurring value in a dataset | If most customers are from Mumbai, Mumbai is the mode |
| Standard Deviation | SD | Measures how spread out values are from the mean | Low SD = values clustered near average; high SD = wide spread |
| Variance | - | The average of the squared differences from the mean - related to standard deviation | Variance = SD squared; used in many statistical models |
| Distribution | - | How values are spread across a dataset | Normal (bell curve), skewed, uniform distributions |
| Percentile | - | The value below which a given percentage of data falls | 90th percentile salary = 90% of employees earn below this amount |
| Correlation | - | A measure of how closely two variables move together, from -1 to +1 | Temperature and ice cream sales have positive correlation |
| Causation | - | One variable directly causes a change in another - stronger than correlation | Smoking causes lung cancer; correlation alone does not imply this |
| Outlier | - | A data point that is significantly different from the rest | A transaction of $1,000,000 in a dataset of typical $50 purchases |
| Hypothesis | - | A testable statement about a relationship or effect in data | Customers who receive emails spend 20% more on average |
| P-value | - | The probability of seeing results at least as extreme as those observed if there were no real effect (the null hypothesis) - below 0.05 is typically treated as significant | P-value of 0.02 = if there were no real effect, results this extreme would show up only about 2% of the time |
| Confidence Interval | CI | A range within which the true value is expected to fall, with a stated level of certainty | Average delivery time is 3.5 days ± 0.4 days at 95% confidence |
| Sample | - | A subset of data drawn from a larger population for analysis | Surveying 1,000 customers to represent all 100,000 customers |
| Bias | - | A systematic error that skews results in a particular direction | Surveying only premium users skews satisfaction data upward |
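
A quick way to get a feel for these numbers is to compute them yourself. The sketch below uses Python's built-in statistics module on the salary example from the Median row; the temperature and ice cream figures are made up, and the pearson helper is a hand-rolled illustration rather than a library function.

```python
import statistics

salaries = [20, 30, 35, 40, 200]  # in thousands, from the Median row above

mean_salary = statistics.mean(salaries)      # pulled upward by the 200K outlier
median_salary = statistics.median(salaries)  # the middle value, barely affected
variance = statistics.pvariance(salaries)    # average squared distance from the mean
spread = statistics.pstdev(salaries)         # standard deviation = sqrt(variance)

most_common_city = statistics.mode(["Mumbai", "Pune", "Mumbai"])


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    x_spread = sum((x - mx) ** 2 for x in xs) ** 0.5
    y_spread = sum((y - my) ** 2 for y in ys) ** 0.5
    return covariance / (x_spread * y_spread)


temperature = [20, 25, 30, 35]          # hypothetical daily temperatures
ice_cream_sales = [110, 140, 180, 210]  # hypothetical sales on those days
r = pearson(temperature, ice_cream_sales)

print(mean_salary, median_salary, most_common_city)
print(round(spread, 2), round(r, 3))
```

Notice how one large salary drags the mean far away from the median. That gap is often the first hint that a dataset contains outliers.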

4. Python & Data Tools

Python is the go-to language for data analysis. These are the tools and concepts you will encounter in real workflows.

| Term | Acronym | Definition | Example |
|---|---|---|---|
| Python | - | One of the most popular programming languages for data analysis, known for simplicity and a rich library ecosystem | Used for cleaning data, building models, and automating workflows |
| Pandas | - | A Python library for data manipulation and analysis using DataFrames | df.groupby('city')['sales'].sum() - sales by city in one line |
| NumPy | - | A Python library for fast numerical computing with arrays and matrices | Used under the hood by Pandas and most ML libraries |
| Jupyter Notebook | - | An interactive coding environment that combines code, output, and narrative text | Run Python cells, see charts, and add explanations all in one file |
| DataFrame | - | A 2-dimensional table-like data structure with labeled rows and columns - the core object in Pandas | Like a spreadsheet inside Python |
| Data Cleaning | - | The process of fixing errors, removing duplicates, handling missing values, and standardizing formats | Replacing blank cells with averages, fixing typos in city names |
| Data Wrangling | - | The process of transforming raw data into a format suitable for analysis | Merging tables, reshaping wide to long format, parsing dates |
| ETL | ETL | Extract, Transform, Load - the process of pulling data from sources, cleaning it, and loading it into a destination | Pulling raw sales data from an API, cleaning it, and loading it into a data warehouse |
| Data Pipeline | - | An automated sequence of steps that moves and transforms data from source to destination | A daily pipeline that refreshes a dashboard with yesterday's orders |
| Regular Expressions | Regex | A syntax for pattern matching and text manipulation in strings | Extracting all email addresses from a column of free text |
| API | API | Application Programming Interface - a way for software systems to communicate; data analysts use them to pull data from services | Pulling weather or financial data directly into Python via an API call |
| Web Scraping | - | Automatically extracting data from websites using code | Scraping product prices from an e-commerce site with BeautifulSoup |
| Version Control | - | Tracking changes to code over time so you can undo, review, and collaborate safely | Git - saves snapshots of your analysis scripts |
| Git | - | The standard version control system for tracking code changes | git commit -m 'cleaned null values in orders table' |
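
ETL, data cleaning, and regex tend to show up together, so here is a toy pipeline that does all three using only Python's standard library. The CSV text, column names, and email pattern are assumptions made up for this sketch; the pattern is deliberately simple and would miss some valid addresses. In real work you would more likely reach for Pandas, but the steps are the same.

```python
import csv
import io
import re
import sqlite3

# Extract: a CSV export standing in for a real source (hypothetical data).
raw = """order_id,city,amount,note
1, mumbai ,100,contact asha@example.com
2,Pune,,no email given
2,Pune,,no email given
3,MUMBAI,50,cc ben@example.org and ops@example.com
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: drop duplicates, standardize city names, keep missing amounts
# as None, and pull email addresses out of the free-text note with a regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

seen, clean = set(), []
for row in rows:
    if row["order_id"] in seen:  # repeated order, skip it
        continue
    seen.add(row["order_id"])
    clean.append((
        int(row["order_id"]),
        row["city"].strip().title(),
        float(row["amount"]) if row["amount"] else None,
        ", ".join(EMAIL.findall(row["note"])),
    ))

# Load: write the cleaned rows into a destination table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL, emails TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", clean)

# Sales by city, the same idea as df.groupby('city')['sales'].sum() in Pandas.
sales_by_city = dict(conn.execute("SELECT city, SUM(amount) FROM orders GROUP BY city"))
print(sales_by_city)
```

Pune ends up with no total at all, because its only amount was missing. Whether to leave that gap, fill it, or drop the row is a judgment call that depends on the analysis.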

5. Data Visualization & Reporting

Turning numbers into clear visuals and stories is one of the highest-value skills a data analyst can develop.

| Term | Acronym | Definition | Example |
|---|---|---|---|
| Dashboard | - | An interactive visual display of key metrics and data, updated in real time or on schedule | A sales dashboard showing daily revenue, top products, and regional performance |
| KPI | KPI | Key Performance Indicator - a measurable value that indicates how well a goal is being achieved | Monthly Active Users, Conversion Rate, Customer Acquisition Cost |
| Chart | - | A visual representation of data - bars, lines, pies, scatter plots, etc. | A bar chart comparing revenue by product category |
| Bar Chart | - | A chart using rectangular bars to compare values across categories | Revenue by department |
| Line Chart | - | A chart showing how a value changes over time using connected data points | Monthly website traffic over 12 months |
| Scatter Plot | - | A chart plotting individual data points on two axes to show relationships between variables | Customer age vs. total spend |
| Heatmap | - | A grid where values are represented by color intensity | Hour-by-day view of website traffic - darker = more visitors |
| Histogram | - | A chart showing the frequency distribution of a single numeric variable | Distribution of customer ages in 10-year bins |
| Tableau | - | A leading business intelligence and data visualization platform | Drag-and-drop dashboards connected to live databases |
| Power BI | - | Microsoft's business intelligence tool for building interactive reports and dashboards | Common in organizations already using Microsoft 365 |
| Data Storytelling | - | The practice of communicating insights from data using a narrative with visuals | A slide deck explaining why sales dropped last quarter using charts and context |
| Exploratory Data Analysis | EDA | An initial analysis phase to summarize data, find patterns, and detect anomalies before modeling | Running df.describe() and df.hist() in Pandas to understand a new dataset |
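
A histogram is really just counting how many values fall into each bin. This sketch bins a made-up list of customer ages into 10-year buckets and prints a text version of the chart, so you can see what a plotting library does behind the scenes.

```python
from collections import Counter

ages = [23, 27, 31, 34, 35, 38, 41, 45, 52, 67]  # hypothetical customer ages

# Map each age to the start of its 10-year bin (23 -> 20, 38 -> 30, ...).
bins = Counter((age // 10) * 10 for age in ages)

# One row per bin, one '#' per customer in that bin.
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

In Pandas, df.hist() handles the binning and drawing for you, but the idea is the same.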

6. Advanced & Enterprise Concepts

Once you have the fundamentals, these terms appear constantly in data engineering, warehousing, and larger organizations.

| Term | Acronym | Definition | Example |
|---|---|---|---|
| Data Warehouse | DW | A central repository storing large volumes of historical, structured data from multiple sources - optimized for analysis | Google BigQuery, Amazon Redshift, Snowflake |
| Data Lake | - | A storage system for raw data in any format - structured, unstructured, or semi-structured | An AWS S3 bucket holding raw logs, JSON files, and CSVs |
| Data Mart | - | A subset of a data warehouse focused on a specific business area | A marketing data mart containing only ad spend and campaign data |
| OLAP | OLAP | Online Analytical Processing - systems designed for complex queries and aggregations on large historical data | Running multi-dimensional sales analysis: by region, product, and quarter simultaneously |
| OLTP | OLTP | Online Transaction Processing - systems designed for fast, frequent, small read/write operations | A system processing thousands of online orders per second |
| Snowflake Schema | - | A database schema where dimension tables are normalized into multiple related tables, reducing redundancy | A date dimension split into year, quarter, month sub-tables |
| Star Schema | - | A simple warehouse schema with one central fact table linked to multiple dimension tables | A Sales fact table linked to Date, Product, and Customer dimension tables |
| Fact Table | - | A table in a data warehouse storing measurable, quantitative data about events | Each row = one sales transaction with amount, date, and customer ID |
| Dimension Table | - | A table storing descriptive attributes used to filter and group fact data | Customer table with name, city, age, and segment |
| Data Governance | - | The policies and processes that ensure data quality, security, privacy, and proper use across an organization | Rules defining who can access customer PII data and how long it is retained |
| Data Lineage | - | A record of where data originated and how it has moved and transformed over time | Knowing that a dashboard number traces back to a specific raw database table |
| Data Catalog | - | A searchable inventory of all data assets in an organization with metadata and documentation | A company-wide system where analysts can search for available tables and understand what each column means |
| Partitioning | - | Dividing a large database table into smaller segments based on a column value for faster querying | Partitioning a billion-row table by month so each query only scans one month |
| Apache Spark | - | An open-source big data processing framework for distributed computation on large datasets | Processing terabytes of clickstream data across a cluster of servers |
| dbt | dbt | Data Build Tool - a framework for transforming raw data in a warehouse using SQL, with version control and testing | Writing SQL models that are automatically run, tested, and documented |
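
Star schemas sound abstract until you query one. Below is a tiny sketch with one fact table and two dimension tables, built in an in-memory SQLite database. The fact_ and dim_ prefixes are a common naming habit rather than a rule, and every table, column, and value here is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes to filter and group by.
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per sale, with measures and keys into the dimensions.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount      REAL
);

INSERT INTO dim_product  VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_customer VALUES (1, 'Mumbai'), (2, 'Pune');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 20.0), (2, 2, 1, 15.0), (3, 1, 2, 30.0), (4, 1, 1, 5.0);
""")

# A typical analytical query: join the fact table to its dimensions,
# then aggregate the measure by the descriptive attributes.
revenue = conn.execute("""
    SELECT p.category, c.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY p.category, c.city
    ORDER BY p.category, c.city
""").fetchall()

print(revenue)
```

The shape of that query (fact table in the middle, dimensions joined around it) is where the "star" name comes from.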

7. Popular Tools & Platforms

| Category | Tools |
|---|---|
| Query & SQL | MySQL, PostgreSQL, BigQuery, Snowflake, SQL Server |
| Python Analysis | Pandas, NumPy, Jupyter Notebook |
| Data Visualization | Tableau, Power BI, Matplotlib, Seaborn, Plotly |
| Spreadsheets | Microsoft Excel, Google Sheets |
| Big Data & Pipelines | Apache Spark, Apache Airflow, dbt, Kafka |
| Cloud Platforms | AWS (Redshift, S3), Google Cloud (BigQuery), Azure Synapse |
| Version Control | Git, GitHub |
| Data Cleaning | OpenRefine, Pandas, Excel Power Query |
| BI & Reporting | Looker, Metabase, Superset, Grafana |
| Statistics & ML | Scikit-learn, R, SciPy, Statsmodels |

Recommended Learning Path

Don't try to learn everything at once. Follow these phases in order and build momentum with small wins.

Phase 1 - Foundations (1-2 weeks)

Get comfortable with the core vocabulary before touching any tools.

  • Data, datasets, and data types
  • What a database is and how tables work
  • Rows, columns, primary keys, and foreign keys
  • The difference between structured and unstructured data

Phase 2 - SQL Basics (2-4 weeks)

SQL is the single most important skill. Learn to query data before anything else.

  • SELECT, WHERE, ORDER BY, GROUP BY
  • Aggregate functions: SUM, COUNT, AVG, MAX, MIN
  • JOINs: INNER, LEFT, RIGHT
  • Filtering with WHERE and HAVING
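
WHERE and HAVING trip up a lot of beginners, so here is a small sketch of the difference: WHERE filters individual rows before grouping, HAVING filters whole groups after aggregation. The orders table and its rows are made up, and SQLite via Python is just a convenient way to run the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL);
INSERT INTO orders VALUES
    (1, 'Mumbai', 100.0), (2, 'Mumbai', 5.0), (3, 'Pune', 40.0),
    (4, 'Pune', 30.0), (5, 'Delhi', 20.0);
""")

big_cities = conn.execute("""
    SELECT city, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10          -- row filter: ignore tiny orders first
    GROUP BY city
    HAVING SUM(amount) > 50     -- group filter: keep only cities above 50 in total
    ORDER BY total DESC
""").fetchall()

print(big_cities)
```

Mumbai's 5.0 order never reaches the SUM because WHERE removed it, and Delhi drops out at the HAVING step because its total is too small.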

Phase 3 - Statistics & Python (4-8 weeks)

Build analytical thinking and learn to work with data programmatically.

  • Mean, median, standard deviation, correlation
  • Python basics and Pandas DataFrames
  • Data cleaning - handling nulls, duplicates, outliers
  • Exploratory Data Analysis (EDA)
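
For a taste of what data cleaning looks like in code, here is a minimal sketch with plain Python lists and the built-in statistics module. The records are invented, filling gaps with the median is only one of several reasonable choices, and the 1.5 x IQR rule used to flag outliers is a common rule of thumb rather than the only definition.

```python
import statistics

records = [
    {"id": 1, "amount": 52.0},
    {"id": 2, "amount": 48.0},
    {"id": 2, "amount": 48.0},          # duplicate row
    {"id": 3, "amount": None},          # missing value
    {"id": 4, "amount": 50.0},
    {"id": 5, "amount": 1_000_000.0},   # suspicious outlier
    {"id": 6, "amount": 47.0},
    {"id": 7, "amount": 55.0},
]

# 1. Duplicates: keep the first row seen for each id.
deduped = {}
for record in records:
    deduped.setdefault(record["id"], record)
rows = list(deduped.values())

# 2. Nulls: fill missing amounts with the median of the known ones.
known = [r["amount"] for r in rows if r["amount"] is not None]
fill = statistics.median(known)
filled = [{**r, "amount": fill if r["amount"] is None else r["amount"]} for r in rows]

# 3. Outliers: flag anything more than 1.5 * IQR outside the middle 50%.
q1, _, q3 = statistics.quantiles([r["amount"] for r in filled], n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outlier_ids = [r["id"] for r in filled if not low <= r["amount"] <= high]

print(len(rows), fill, outlier_ids)
```

The sketch flags the outlier instead of deleting it. Whether a huge transaction is an error or your best customer is something only context can tell you.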

Phase 4 - Visualization & Reporting (2-4 weeks)

Turn analysis into insights that decision-makers can understand.

  • Choosing the right chart type for your data
  • Building a dashboard in Tableau or Power BI
  • Data storytelling - narrative structure for presentations
  • KPIs and business metrics

Phase 5 - Advanced Topics (Ongoing)

As you grow, these skills will set you apart in the job market.

  • Advanced SQL: CTEs, window functions, subqueries
  • Data warehouses, star schemas, and ETL pipelines
  • dbt for data transformation
  • Big data fundamentals: Spark, Airflow, Kafka
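
When you get to advanced SQL, CTEs and window functions usually click faster with an example. This sketch computes a running total and an order number per customer. The table and data are made up, and it assumes your Python ships with SQLite 3.25 or newer, since older versions do not support window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
    (1, 'Asha', '2024-01-05', 100.0),
    (2, 'Asha', '2024-02-10', 50.0),
    (3, 'Ben',  '2024-01-20', 70.0),
    (4, 'Ben',  '2024-03-01', 30.0),
    (5, 'Ben',  '2024-03-15', 20.0);
""")

rows = conn.execute("""
    WITH customer_orders AS (          -- CTE: a named, temporary result set
        SELECT customer, order_date, amount
        FROM orders
    )
    SELECT customer,
           order_date,
           amount,
           -- Window functions: computed per customer, without collapsing rows.
           SUM(amount)  OVER (PARTITION BY customer ORDER BY order_date) AS running_total,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_date) AS order_number
    FROM customer_orders
    ORDER BY customer, order_date
""").fetchall()

for row in rows:
    print(row)
```

Compare this with GROUP BY: you still get all five rows back, each one carrying its own running total, instead of one summary row per customer.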

Final Advice

Don't try to master all the theory before you start. The fastest way to truly understand data analysis is through doing.

  • Learn the basics - Use this glossary as your starting reference.
  • Get SQL practice - Use free platforms like Mode Analytics, LeetCode SQL, or SQLZoo.
  • Work on real data - Download a dataset from Kaggle and explore it in Python or Excel.
  • Build a portfolio project - A simple dashboard or analysis writeup is more valuable than any certificate.
  • Experiment continuously - Break things. Try queries. See what happens.

Practical exposure beats passive study every time. Start today.
