Every term you’ve been nodding along to in meetings, finally explained unambiguously so you stop Googling them under the table. 🤣
(Disclaimer: I did not Google them. Possibly LLM-ed. 😜)
1. Core Data Concepts
Start here. These are the foundational building blocks every data analyst must understand before anything else.
| Term | Acronym | Definition | Example |
|---|---|---|---|
| Data | - | Raw facts and figures that have not yet been processed or analyzed | Sales numbers, customer names, timestamps |
| Dataset | - | A structured collection of data organized for analysis | A spreadsheet of 10,000 customer orders |
| Database | DB | An organized system for storing and retrieving structured data | MySQL, PostgreSQL, Microsoft SQL Server |
| Spreadsheet | - | A grid-based tool for organizing, calculating, and visualizing data | Microsoft Excel, Google Sheets |
| Row / Record | - | A single entry in a table - represents one item or event | One customer's order details |
| Column / Field | - | A category or attribute shared across all rows in a table | Customer name, order date, price |
| Data Type | - | The kind of value a field holds - number, text, date, boolean, etc. | Age = integer; Name = text; Active = boolean |
| Structured Data | - | Data organized in rows and columns with a defined format | A sales table in a database |
| Unstructured Data | - | Data with no fixed format or schema | Customer emails, social media posts, images |
| Semi-Structured Data | - | Data with some organization but not a strict table format | JSON files, XML documents, log files |
| Metadata | - | Data that describes other data - its structure, origin, and meaning | A file's creation date, author, and size |
| Primary Key | PK | A unique identifier for each row in a database table | Customer ID = 10042 (no two customers share it) |
| Foreign Key | FK | A field in one table that links to the primary key of another table | Order table has a Customer ID column that links to the Customer table |
| Schema | - | The blueprint that defines the structure of a database - its tables, columns, and data types | A schema specifying that the Orders table has 5 columns with specific types |
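
To make primary keys, foreign keys, and schemas concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and the customer "Asha" are made up for illustration. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# The schema: tables, columns, and data types (all names are hypothetical)
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customers VALUES (10042, 'Asha')")
conn.execute("INSERT INTO orders VALUES (1, 10042, 59.90)")

# A second customer with the same primary key is rejected
try:
    conn.execute("INSERT INTO customers VALUES (10042, 'Someone else')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# An order pointing at a customer that does not exist is rejected
try:
    conn.execute("INSERT INTO orders VALUES (2, 99999, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(duplicate_rejected, orphan_rejected)
```

Both inserts fail with an `IntegrityError`, which is the database protecting the rules the schema declares.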
2. SQL & Querying
SQL is the most important skill for a data analyst. These are the terms and commands you will use every single day.
| Term | Acronym | Definition | Example |
|---|---|---|---|
| Query | - | A question or request you send to a database to retrieve or manipulate data | SELECT * FROM orders WHERE date > '2024-01-01' |
| SQL | SQL | Structured Query Language - the standard language for querying and managing relational databases | Used in MySQL, PostgreSQL, SQL Server, BigQuery |
| SELECT | - | SQL command to retrieve data from a table | SELECT name, age FROM customers |
| WHERE | - | SQL clause to filter rows based on a condition | WHERE country = 'India' |
| JOIN | - | SQL operation to combine rows from two or more tables based on a related column | JOIN orders ON customers.id = orders.customer_id |
| GROUP BY | - | SQL clause that groups rows sharing a value so aggregate functions can be applied | GROUP BY city - then COUNT() per city |
| ORDER BY | - | SQL clause that sorts results by one or more columns | ORDER BY sales DESC - highest sales first |
| Aggregate Function | - | A function that performs a calculation on a set of values and returns a single result | SUM(), AVG(), COUNT(), MIN(), MAX() |
| Subquery | - | A query nested inside another query | SELECT * FROM sales WHERE amount > (SELECT AVG(amount) FROM sales) |
| Window Function | - | A function that calculates values across a set of rows related to the current row, without collapsing them | ROW_NUMBER(), RANK(), LAG(), LEAD() for rankings and row-to-row comparisons; SUM() OVER (...) for running totals |
| CTE | CTE | Common Table Expression - a temporary named result set defined within a query, making complex queries easier to read | WITH top_customers AS (SELECT ...) SELECT * FROM top_customers |
| Index | - | A database structure that speeds up data retrieval by providing a fast lookup path | An index on Customer ID makes searches by ID much faster |
| View | - | A saved query that acts like a virtual table | A 'monthly_sales' view that always returns the latest month's data |
| Stored Procedure | - | A saved set of SQL statements that can be executed on demand | A procedure that calculates monthly bonuses for all employees |
| NULL | - | A missing or unknown value in a database - not zero, not blank, but absent | A customer's phone number field with no value entered |
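
Several of these terms are easier to see working together. The sketch below runs a JOIN with GROUP BY and ORDER BY, a subquery, and a NULL check against a tiny in-memory SQLite database. The tables and rows are invented for the example, and other databases may differ slightly in syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, phone TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES
    (1, 'Asha', 'Mumbai', '555-0101'),
    (2, 'Ravi', 'Delhi', NULL),
    (3, 'Meera', 'Mumbai', NULL);
INSERT INTO orders VALUES
    (1, 1, 100.0), (2, 1, 50.0), (3, 2, 200.0), (4, 3, 25.0);
""")

# JOIN + GROUP BY + aggregates + ORDER BY: sales per city, highest first
sales_by_city = conn.execute("""
    SELECT c.city, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON c.id = o.customer_id
    GROUP BY c.city
    ORDER BY total DESC
""").fetchall()

# Subquery: orders larger than the average order
above_average = conn.execute("""
    SELECT id, amount FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# NULL is not a value you can compare with '='; use IS NULL instead
equals_null = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE phone = NULL").fetchone()[0]
is_null = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE phone IS NULL").fetchone()[0]

print(sales_by_city)
print(above_average)
print(equals_null, is_null)
```

The last line is the classic NULL gotcha: `phone = NULL` matches no rows, while `phone IS NULL` finds the two customers without a phone number.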
3. Statistics for Data Analysts
You don't need a statistics degree, but understanding these core concepts will make your analysis trustworthy and rigorous.
| Term | Acronym | Definition | Example |
|---|---|---|---|
| Mean | - | The arithmetic average of a set of values | Average order value = total revenue / number of orders |
| Median | - | The middle value when data is sorted - less affected by outliers than mean | If salaries are 20K, 30K, 35K, 40K, 200K - median is 35K, not 65K |
| Mode | - | The most frequently occurring value in a dataset | If most customers are from Mumbai, Mumbai is the mode |
| Standard Deviation | SD | Measures how spread out values are from the mean | Low SD = values clustered near average; high SD = wide spread |
| Variance | - | The average of the squared differences from the mean - related to standard deviation | Variance = SD squared; used in many statistical models |
| Distribution | - | How values are spread across a dataset | Normal (bell curve), skewed, uniform distributions |
| Percentile | - | The value below which a given percentage of data falls | 90th percentile salary = 90% of employees earn below this amount |
| Correlation | - | A measure of how closely two variables move together, from -1 to +1 | Temperature and ice cream sales have positive correlation |
| Causation | - | One variable directly causes a change in another - stronger than correlation | Smoking causes lung cancer; correlation alone does not imply this |
| Outlier | - | A data point that is significantly different from the rest | A transaction of \$1,000,000 in a dataset of typical \$50 purchases |
| Hypothesis | - | A testable statement about a relationship or effect in data | Customers who receive emails spend 20% more on average |
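| P-value | - | The probability of seeing results at least as extreme as those observed if there were truly no effect (the null hypothesis) - below 0.05 is conventionally treated as significant | P-value of 0.02 = if there were no real effect, a result this extreme would show up only about 2% of the time |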
| Confidence Interval | CI | A range within which the true value is expected to fall, with a stated level of certainty | Average delivery time is 3.5 days ± 0.4 days at 95% confidence |
| Sample | - | A subset of data drawn from a larger population for analysis | Surveying 1,000 customers to represent all 100,000 customers |
| Bias | - | A systematic error that skews results in a particular direction | Surveying only premium users skews satisfaction data upward |
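
The salary example from the Median row can be checked with Python's standard `statistics` module. The sketch also shows that variance is the standard deviation squared, and computes a Pearson correlation by hand. The temperature and ice cream numbers are made up, and `pearson` is a helper written for this example, not a library function.

```python
import statistics

salaries = [20_000, 30_000, 35_000, 40_000, 200_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the 200K outlier
median_salary = statistics.median(salaries)  # the middle value, barely affected

# Population standard deviation and variance (variance = SD squared)
sd = statistics.pstdev(salaries)
var = statistics.pvariance(salaries)


def pearson(xs, ys):
    """Pearson correlation coefficient, from -1 to +1."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


temperature = [20, 22, 25, 28, 30, 33]            # made-up data
ice_cream_sales = [110, 125, 150, 170, 190, 220]  # made-up data
r = pearson(temperature, ice_cream_sales)

print(mean_salary, median_salary)
print(round(sd, 1), var)
print(round(r, 3))
```

The mean comes out at 65,000 and the median at 35,000, which is why the median is usually the safer summary when a few extreme values are present.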
4. Python & Data Tools
Python is the go-to language for data analysis. These are the tools and concepts you will encounter in real workflows.
| Term | Acronym | Definition | Example |
|---|---|---|---|
| Python | - | One of the most popular programming languages for data analysis, known for simplicity and a rich library ecosystem | Used for cleaning data, building models, and automating workflows |
| Pandas | - | A Python library for data manipulation and analysis using DataFrames | df.groupby('city')['sales'].sum() - sales by city in one line |
| NumPy | - | A Python library for fast numerical computing with arrays and matrices | Used under the hood by Pandas and most ML libraries |
| Jupyter Notebook | - | An interactive coding environment that combines code, output, and narrative text | Run Python cells, see charts, and add explanations all in one file |
| DataFrame | - | A 2-dimensional table-like data structure with labeled rows and columns - the core object in Pandas | Like a spreadsheet inside Python |
| Data Cleaning | - | The process of fixing errors, removing duplicates, handling missing values, and standardizing formats | Replacing blank cells with averages, fixing typos in city names |
| Data Wrangling | - | The process of transforming raw data into a format suitable for analysis | Merging tables, reshaping wide to long format, parsing dates |
| ETL | ETL | Extract, Transform, Load - the process of pulling data from sources, cleaning it, and loading it into a destination | Pulling raw sales data from an API, cleaning it, and loading it into a data warehouse |
| Data Pipeline | - | An automated sequence of steps that moves and transforms data from source to destination | A daily pipeline that refreshes a dashboard with yesterday's orders |
| Regular Expressions | Regex | A syntax for pattern matching and text manipulation in strings | Extracting all email addresses from a column of free text |
| API | API | Application Programming Interface - a way for software systems to communicate; data analysts use APIs to pull data from services | Pulling weather or financial data directly into Python via an API call |
| Web Scraping | - | Automatically extracting data from websites using code | Scraping product prices from an e-commerce site with BeautifulSoup |
| Version Control | - | Tracking changes to code over time so you can undo, review, and collaborate safely | Git - saves snapshots of your analysis scripts |
| Git | - | The standard version control system for tracking code changes | git commit -m 'cleaned null values in orders table' |
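
Data cleaning, ETL, and pipelines are really the same idea at different scales. Here is a tiny extract-transform-load sketch using only the standard library, so it runs anywhere. In practice you would more likely reach for Pandas. The CSV content, the function names, and the `orders` table are all invented for the example.

```python
import csv
import io
import sqlite3

# Extract: a real pipeline would read a file or an API response (made-up data)
RAW_CSV = """order_id,city,amount
1, mumbai ,120.50
2,Delhi,
3,MUMBAI,80.00
3,MUMBAI,80.00
4,delhi,40.00
"""


def extract(text):
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    seen, cleaned = set(), []
    for row in rows:
        if row["order_id"] in seen:  # drop repeated order IDs
            continue
        seen.add(row["order_id"])
        amount = row["amount"].strip()
        cleaned.append({
            "order_id": int(row["order_id"]),
            "city": row["city"].strip().title(),          # standardize city names
            "amount": float(amount) if amount else None,  # keep missing as NULL
        })
    return cleaned


def load(rows, conn):
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :city, :amount)", rows)


conn = sqlite3.connect(":memory:")
clean_rows = transform(extract(RAW_CSV))
load(clean_rows, conn)

totals = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city").fetchall()
print(totals)
```

Wrap those three functions in a scheduler that runs them every morning and you have a (very small) data pipeline.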
5. Data Visualization & Reporting
Turning numbers into clear visuals and stories is one of the highest-value skills a data analyst can develop.
| Term | Acronym | Definition | Example |
|---|---|---|---|
| Dashboard | - | An interactive visual display of key metrics and data, updated in real time or on schedule | A sales dashboard showing daily revenue, top products, and regional performance |
| KPI | KPI | Key Performance Indicator - a measurable value that indicates how well a goal is being achieved | Monthly Active Users, Conversion Rate, Customer Acquisition Cost |
| Chart | - | A visual representation of data - bars, lines, pies, scatter plots, etc. | A bar chart comparing revenue by product category |
| Bar Chart | - | A chart using rectangular bars to compare values across categories | Revenue by department |
| Line Chart | - | A chart showing how a value changes over time using connected data points | Monthly website traffic over 12 months |
| Scatter Plot | - | A chart plotting individual data points on two axes to show relationships between variables | Customer age vs. total spend |
| Heatmap | - | A grid where values are represented by color intensity | Hour-by-day view of website traffic - darker = more visitors |
| Histogram | - | A chart showing the frequency distribution of a single numeric variable | Distribution of customer ages in 10-year bins |
| Tableau | - | A leading business intelligence and data visualization platform | Drag-and-drop dashboards connected to live databases |
| Power BI | - | Microsoft's business intelligence tool for building interactive reports and dashboards | Common in organizations already using Microsoft 365 |
| Data Storytelling | - | The practice of communicating insights from data using a narrative with visuals | A slide deck explaining why sales dropped last quarter using charts and context |
| Exploratory Data Analysis | EDA | An initial analysis phase to summarize data, find patterns, and detect anomalies before modeling | Running df.describe() and df.hist() in Pandas to understand a new dataset |
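
A histogram is just counting values per bin, which you can do without any charting library. This sketch bins some made-up customer ages into 10-year bins and prints a text version of the chart. `histogram` is a helper written for this example. With Pandas you would typically call `df.hist()` instead.

```python
from collections import Counter

BIN_WIDTH = 10
ages = [23, 27, 31, 35, 36, 38, 42, 44, 45, 51, 58, 63]  # made-up ages


def histogram(values, bin_width=BIN_WIDTH):
    """Count how many values fall in each [start, start + bin_width) bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


bins = histogram(ages)
for start, count in bins.items():
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * count}")
```

Each row of `#` marks is one bar. The 30-39 bin is the tallest here, which is the kind of pattern EDA is meant to surface quickly.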
6. Advanced & Enterprise Concepts
Once you have the fundamentals, these terms appear constantly in data engineering, warehousing, and larger organizations.
| Term | Acronym | Definition | Example |
|---|---|---|---|
| Data Warehouse | DW | A central repository storing large volumes of historical, structured data from multiple sources - optimized for analysis | Google BigQuery, Amazon Redshift, Snowflake |
| Data Lake | - | A storage system for raw data in any format - structured, unstructured, or semi-structured | An AWS S3 bucket holding raw logs, JSON files, and CSVs |
| Data Mart | - | A subset of a data warehouse focused on a specific business area | A marketing data mart containing only ad spend and campaign data |
| OLAP | OLAP | Online Analytical Processing - systems designed for complex queries and aggregations on large historical data | Running multi-dimensional sales analysis: by region, product, and quarter simultaneously |
| OLTP | OLTP | Online Transaction Processing - systems designed for fast, frequent, small read/write operations | A system processing thousands of online orders per second |
| Snowflake Schema | - | A database schema where dimension tables are normalized into multiple related tables, reducing redundancy | A date dimension split into year, quarter, month sub-tables |
| Star Schema | - | A simple warehouse schema with one central fact table linked to multiple dimension tables | A Sales fact table linked to Date, Product, and Customer dimension tables |
| Fact Table | - | A table in a data warehouse storing measurable, quantitative data about events | Each row = one sales transaction with amount, date, and customer ID |
| Dimension Table | - | A table storing descriptive attributes used to filter and group fact data | Customer table with name, city, age, and segment |
| Data Governance | - | The policies and processes that ensure data quality, security, privacy, and proper use across an organization | Rules defining who can access customer PII data and how long it is retained |
| Data Lineage | - | A record of where data originated and how it has moved and transformed over time | Knowing that a dashboard number traces back to a specific raw database table |
| Data Catalog | - | A searchable inventory of all data assets in an organization with metadata and documentation | A company-wide system where analysts can search for available tables and understand what each column means |
| Partitioning | - | Dividing a large database table into smaller segments based on a column value for faster querying | Partitioning a billion-row table by month so each query only scans one month |
| Apache Spark | - | An open-source big data processing framework for distributed computation on large datasets | Processing terabytes of clickstream data across a cluster of servers |
| dbt | dbt | Data Build Tool - a framework for transforming raw data in a warehouse using SQL, with version control and testing | Writing SQL models that are automatically run, tested, and documented |
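
Star schema, fact table, and dimension table make the most sense side by side. Below is a minimal sketch in SQLite with one fact table and two dimension tables. Every table name, column, and row is hypothetical, and a real warehouse such as BigQuery or Snowflake would hold far more data but follow the same shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to filter and group
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per sales transaction, with keys into the dimensions
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount      REAL
);

INSERT INTO dim_product  VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_customer VALUES (1, 'Mumbai'), (2, 'Pune');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 20.0), (2, 1, 2, 30.0), (3, 2, 1, 15.0),
    (4, 2, 2, 5.0),  (5, 1, 1, 10.0);
""")

# A typical analytical query: measures from the fact table,
# sliced by attributes from the dimension tables
revenue = conn.execute("""
    SELECT p.category, c.city, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY p.category, c.city
    ORDER BY p.category, c.city
""").fetchall()

for row in revenue:
    print(row)
```

The fact table sits in the middle and every dimension joins directly to it, which is where the "star" in the name comes from.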
7. Popular Tools & Platforms
| Category | Tools |
|---|---|
| Query & SQL | MySQL, PostgreSQL, BigQuery, Snowflake, SQL Server |
| Python Analysis | Pandas, NumPy, Jupyter Notebook |
| Data Visualization | Tableau, Power BI, Matplotlib, Seaborn, Plotly |
| Spreadsheets | Microsoft Excel, Google Sheets |
| Big Data & Pipelines | Apache Spark, Apache Airflow, dbt, Kafka |
| Cloud Platforms | AWS (Redshift, S3), Google Cloud (BigQuery), Azure Synapse |
| Version Control | Git, GitHub |
| Data Cleaning | OpenRefine, Pandas, Excel Power Query |
| BI & Reporting | Looker, Metabase, Superset, Grafana |
| Statistics & ML | Scikit-learn, R, SciPy, Statsmodels |
Recommended Learning Path
Don't try to learn everything at once. Follow these phases in order and build momentum with small wins.
Phase 1 - Foundations (1-2 weeks)
Get comfortable with the core vocabulary before touching any tools.
- Data, datasets, and data types
- What a database is and how tables work
- Rows, columns, primary keys, and foreign keys
- The difference between structured and unstructured data
Phase 2 - SQL Basics (2-4 weeks)
SQL is the single most important skill. Learn to query data before anything else.
- SELECT, WHERE, ORDER BY, GROUP BY
- Aggregate functions: SUM, COUNT, AVG, MAX, MIN
- JOINs: INNER, LEFT, RIGHT
- Filtering with WHERE and HAVING
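
WHERE and HAVING trip up a lot of beginners, so here is a small sketch of the difference, again using SQLite from Python with invented data. WHERE filters individual rows before grouping, while HAVING filters the groups after aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL);
INSERT INTO orders VALUES
    (1, 'Mumbai', 100.0), (2, 'Mumbai', 5.0), (3, 'Delhi', 40.0),
    (4, 'Delhi', 30.0),   (5, 'Pune', 500.0), (6, 'Mumbai', 60.0);
""")

big_cities = conn.execute("""
    SELECT city, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10          -- row filter: ignore tiny orders first
    GROUP BY city
    HAVING SUM(amount) > 100    -- group filter: keep cities above 100 in total
    ORDER BY total DESC
""").fetchall()

print(big_cities)
```

Delhi drops out because its total (70) fails the HAVING condition, and Mumbai's 5.0 order never reaches the SUM because WHERE removed it first.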
Phase 3 - Statistics & Python (4-8 weeks)
Build analytical thinking and learn to work with data programmatically.
- Mean, median, standard deviation, correlation
- Python basics and Pandas DataFrames
- Data cleaning - handling nulls, duplicates, outliers
- Exploratory Data Analysis (EDA)
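
For the outlier part of data cleaning, one common rule of thumb (an assumption here, not the only approach) is to flag values more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch with made-up purchase amounts:

```python
import statistics

# Made-up purchase amounts with one extreme value
purchases = [48, 52, 50, 47, 55, 51, 49, 53, 50, 1_000_000]

# Quartiles, then the 1.5 x IQR fences
q1, _, q3 = statistics.quantiles(purchases, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in purchases if x < low or x > high]
typical = [x for x in purchases if low <= x <= high]

print("outliers:", outliers)
print("mean with outlier:   ", statistics.mean(purchases))
print("mean without outlier:", statistics.mean(typical))
print("median with outlier: ", statistics.median(purchases))
```

Flagging an outlier is not the same as deleting it. Whether a million-dollar transaction is an error or your most important customer is a question for the business, not the formula.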
Phase 4 - Visualization & Reporting (2-4 weeks)
Turn analysis into insights that decision-makers can understand.
- Choosing the right chart type for your data
- Building a dashboard in Tableau or Power BI
- Data storytelling - narrative structure for presentations
- KPIs and business metrics
Phase 5 - Advanced Topics (Ongoing)
As you grow, these skills will set you apart in the job market.
- Advanced SQL: CTEs, window functions, subqueries
- Data warehouses, star schemas, and ETL pipelines
- dbt for data transformation
- Big data fundamentals: Spark, Airflow, Kafka
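
As a taste of the advanced SQL items, this sketch combines a CTE with two window functions to get a running total and a previous-month comparison. The `sales` table and its rows are invented. Window functions need SQLite 3.25 or newer, which recent Python installs normally bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (month TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES
    ('2024-01', 'North', 100.0), ('2024-01', 'South', 80.0),
    ('2024-02', 'North', 120.0), ('2024-02', 'South', 90.0),
    ('2024-03', 'North', 90.0),  ('2024-03', 'South', 110.0);
""")

rows = conn.execute("""
    WITH monthly AS (                 -- CTE: one total per month
        SELECT month, SUM(amount) AS total
        FROM sales
        GROUP BY month
    )
    SELECT
        month,
        total,
        SUM(total) OVER (ORDER BY month) AS running_total,  -- running total
        LAG(total) OVER (ORDER BY month) AS previous_month  -- prior row's value
    FROM monthly
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, the window functions keep one output row per month while still looking at the other rows, and the first month has no previous value, so `LAG` returns NULL (`None` in Python).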
Final Advice
Don't try to master all the theory before you start. The fastest way to truly understand data analysis is through doing.
- Learn the basics - Use this glossary as your starting reference.
- Get SQL practice - Use free platforms like Mode Analytics, LeetCode SQL, or SQLZoo.
- Work on real data - Download a dataset from Kaggle and explore it in Python or Excel.
- Build a portfolio project - A simple dashboard or analysis writeup is more valuable than any certificate.
- Experiment continuously - Break things. Try queries. See what happens.
Practical exposure beats passive study every time. Start today.