preeti deshmukh

Posted on May 29

Beginner's Data Analyst Glossary

#datascience #beginners #data #analytics

Every term you’ve been nodding along to in meetings, finally explained unambiguously so you stop Googling them under the table. 🤣

(Disclaimer: I did not Google. Possibly LLM-ed. 😜)

1. Core Data Concepts

Start here. These are the foundational building blocks every data analyst must understand before anything else.

Term	Acronym	Definition	Example
Data	-	Raw facts and figures that have not yet been processed or analyzed	Sales numbers, customer names, timestamps
Dataset	-	A structured collection of data organized for analysis	A spreadsheet of 10,000 customer orders
Database	DB	An organized system for storing and retrieving structured data	MySQL, PostgreSQL, Microsoft SQL Server
Spreadsheet	-	A grid-based tool for organizing, calculating, and visualizing data	Microsoft Excel, Google Sheets
Row / Record	-	A single entry in a table - represents one item or event	One customer's order details
Column / Field	-	A category or attribute shared across all rows in a table	Customer name, order date, price
Data Type	-	The kind of value a field holds - number, text, date, boolean, etc.	Age = integer; Name = text; Active = boolean
Structured Data	-	Data organized in rows and columns with a defined format	A sales table in a database
Unstructured Data	-	Data with no fixed format or schema	Customer emails, social media posts, images
Semi-Structured Data	-	Data with some organization but not a strict table format	JSON files, XML documents, log files
Metadata	-	Data that describes other data - its structure, origin, and meaning	A file's creation date, author, and size
Primary Key	PK	A unique identifier for each row in a database table	Customer ID = 10042 (no two customers share it)
Foreign Key	FK	A field in one table that links to the primary key of another table	Order table has a Customer ID column that links to the Customer table
Schema	-	The blueprint that defines the structure of a database - its tables, columns, and data types	A schema specifying that the Orders table has 5 columns with specific types
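
To see primary keys, foreign keys, and a schema working together, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample values are made up for illustration; a real database will have its own.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# The schema: which tables exist, their columns, and their data types.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    amount      REAL
);
""")

conn.execute("INSERT INTO customers VALUES (10042, 'Asha')")
conn.execute("INSERT INTO orders VALUES (1, 10042, 499.0)")


def insert_is_rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A second customer with the same ID violates the primary key.
duplicate_pk_rejected = insert_is_rejected("INSERT INTO customers VALUES (10042, 'Ravi')")
# An order pointing at a customer that does not exist violates the foreign key.
orphan_fk_rejected = insert_is_rejected("INSERT INTO orders VALUES (2, 99999, 10.0)")

print(duplicate_pk_rejected, orphan_fk_rejected)
```

Both inserts should be refused, which is the point: the keys let the database protect the relationships between tables for you.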

2. SQL & Querying

SQL is the most important skill for a data analyst. These are the terms and commands you will use every single day.

Term	Acronym	Definition	Example
Query	-	A question or request you send to a database to retrieve or manipulate data	SELECT * FROM orders WHERE date > '2024-01-01'
SQL	SQL	Structured Query Language - the standard language for querying and managing relational databases	Used in MySQL, PostgreSQL, SQL Server, BigQuery
SELECT	-	SQL command to retrieve data from a table	SELECT name, age FROM customers
WHERE	-	SQL clause to filter rows based on a condition	WHERE country = 'India'
JOIN	-	SQL operation to combine rows from two or more tables based on a related column	JOIN orders ON customers.id = orders.customer_id
GROUP BY	-	SQL clause that groups rows sharing a value so aggregate functions can be applied	GROUP BY city - then COUNT() per city
ORDER BY	-	SQL clause that sorts results by one or more columns	ORDER BY sales DESC - highest sales first
Aggregate Function	-	A function that performs a calculation on a set of values and returns a single result	SUM(), AVG(), COUNT(), MIN(), MAX()
Subquery	-	A query nested inside another query	SELECT * FROM sales WHERE amount > (SELECT AVG(amount) FROM sales)
Window Function	-	A function that calculates values across a set of rows related to the current row, without collapsing them	ROW_NUMBER(), RANK(), LAG(), LEAD() - used for rankings and row-to-row comparisons; SUM() OVER (ORDER BY ...) gives a running total
CTE	CTE	Common Table Expression - a temporary named result set defined within a query, making complex queries easier to read	WITH top_customers AS (SELECT ...) SELECT * FROM top_customers
Index	-	A database structure that speeds up data retrieval by providing a fast lookup path	An index on Customer ID makes searches by ID much faster
View	-	A saved query that acts like a virtual table	A 'monthly_sales' view that always returns the latest month's data
Stored Procedure	-	A saved set of SQL statements that can be executed on demand	A procedure that calculates monthly bonuses for all employees
NULL	-	A missing or unknown value in a database - not zero, not blank, but absent	A customer's phone number field with no value entered
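
Several of the terms above are easier to grasp side by side. The sketch below runs them against a tiny in-memory SQLite database from Python. The tables and rows are invented for illustration, and the window function part assumes your SQLite is version 3.25 or newer (most recent Python installs should be).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha', 'Mumbai'), (2, 'Ravi', 'Pune'), (3, 'Meera', 'Mumbai');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 300.0), (3, 2, 50.0), (4, 3, 150.0);
""")

# JOIN + GROUP BY + aggregate + ORDER BY: total sales per city, highest first.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON c.id = o.customer_id
    GROUP BY c.city
    ORDER BY total DESC
""").fetchall()

# Subquery: orders larger than the average order amount.
above_average = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# CTE + window function: rank customers by total spend.
# RANK() adds a column to each row instead of collapsing rows.
ranked = conn.execute("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, total, RANK() OVER (ORDER BY total DESC) AS spend_rank
    FROM customer_totals
    ORDER BY spend_rank
""").fetchall()

# NULL means "unknown": comparing with = yields NULL, so test with IS NULL instead.
null_check = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

print(sales_by_city, above_average, ranked, null_check, sep="\n")
```

Notice that the NULL comparison comes back as Python's None rather than true or false, which is why filters like WHERE phone = NULL silently match nothing.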

3. Statistics for Data Analysts

You don't need a statistics degree, but understanding these core concepts will make your analysis trustworthy and rigorous.

Term	Acronym	Definition	Example
Mean	-	The arithmetic average of a set of values	Average order value = total revenue / number of orders
Median	-	The middle value when data is sorted - less affected by outliers than mean	If salaries are 20K, 30K, 35K, 40K, 200K - median is 35K, not 65K
Mode	-	The most frequently occurring value in a dataset	If most customers are from Mumbai, Mumbai is the mode
Standard Deviation	SD	Measures how spread out values are from the mean	Low SD = values clustered near average; high SD = wide spread
Variance	-	The average of the squared differences from the mean - related to standard deviation	Variance = SD squared; used in many statistical models
Distribution	-	How values are spread across a dataset	Normal (bell curve), skewed, uniform distributions
Percentile	-	The value below which a given percentage of data falls	90th percentile salary = 90% of employees earn below this amount
Correlation	-	A measure of how closely two variables move together, from -1 to +1	Temperature and ice cream sales have positive correlation
Causation	-	One variable directly causes a change in another - stronger than correlation	Smoking causes lung cancer; correlation alone does not imply this
Outlier	-	A data point that is significantly different from the rest	A transaction of \$1,000,000 in a dataset of typical \$50 purchases
Hypothesis	-	A testable statement about a relationship or effect in data	Customers who receive emails spend 20% more on average
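P-value	-	The probability of seeing results at least as extreme as those observed if there were no real effect (the null hypothesis) - below 0.05 is conventionally treated as significant	P-value of 0.02 = if there were truly no effect, a result this extreme would show up only about 2% of the time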
Confidence Interval	CI	A range within which the true value is expected to fall, with a stated level of certainty	Average delivery time is 3.5 days ± 0.4 days at 95% confidence
Sample	-	A subset of data drawn from a larger population for analysis	Surveying 1,000 customers to represent all 100,000 customers
Bias	-	A systematic error that skews results in a particular direction	Surveying only premium users skews satisfaction data upward
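
A few of these measures can be checked with nothing more than Python's standard library. The sketch below reuses the salary example from the table; the temperature and ice cream numbers are hypothetical values invented for illustration, and the correlation helper is a plain hand-written Pearson formula rather than any particular library's implementation.

```python
import math
import statistics

salaries = [20_000, 30_000, 35_000, 40_000, 200_000]  # the salary example from the table

mean_salary = statistics.mean(salaries)      # pulled upward by the 200K outlier
median_salary = statistics.median(salaries)  # middle value, barely affected by it

# Population variance and standard deviation: SD is the square root of variance.
variance = statistics.pvariance(salaries)
std_dev = statistics.pstdev(salaries)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient, always between -1 and +1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical daily temperatures (in Celsius) and ice cream sales.
temperature = [20, 24, 28, 32, 36]
ice_cream_sales = [110, 135, 160, 190, 230]
r = pearson_correlation(temperature, ice_cream_sales)

print(mean_salary, median_salary, round(r, 3))
```

A value of r close to +1 says the two series rise together. It says nothing about whether one causes the other.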

4. Python & Data Tools

Python is the go-to language for data analysis. These are the tools and concepts you will encounter in real workflows.

Term	Acronym	Definition	Example
Python	-	The most popular programming language for data analysis, known for simplicity and a rich library ecosystem	Used for cleaning data, building models, and automating workflows
Pandas	-	A Python library for data manipulation and analysis using DataFrames	df.groupby('city')['sales'].sum() - sales by city in one line
NumPy	-	A Python library for fast numerical computing with arrays and matrices	Used under the hood by Pandas and most ML libraries
Jupyter Notebook	-	An interactive coding environment that combines code, output, and narrative text	Run Python cells, see charts, and add explanations all in one file
DataFrame	-	A 2-dimensional table-like data structure with labeled rows and columns - the core object in Pandas	Like a spreadsheet inside Python
Data Cleaning	-	The process of fixing errors, removing duplicates, handling missing values, and standardizing formats	Replacing blank cells with averages, fixing typos in city names
Data Wrangling	-	The process of transforming raw data into a format suitable for analysis	Merging tables, reshaping wide to long format, parsing dates
ETL	ETL	Extract, Transform, Load - the process of pulling data from sources, cleaning it, and loading it into a destination	Pulling raw sales data from an API, cleaning it, and loading it into a data warehouse
Data Pipeline	-	An automated sequence of steps that moves and transforms data from source to destination	A daily pipeline that refreshes a dashboard with yesterday's orders
Regular Expressions	Regex	A syntax for pattern matching and text manipulation in strings	Extracting all email addresses from a column of free text
API	API	Application Programming Interface - a defined way for software systems to communicate; data analysts use APIs to pull data from services	Pulling weather or financial data directly into Python via an API call
Web Scraping	-	Automatically extracting data from websites using code	Scraping product prices from an e-commerce site with BeautifulSoup
Version Control	-	Tracking changes to code over time so you can undo, review, and collaborate safely	Git - saves snapshots of your analysis scripts
Git	-	The standard version control system for tracking code changes	git commit -m 'cleaned null values in orders table'
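
Data cleaning, regex, and ETL tend to show up together, so here is one small sketch that strings them into a toy pipeline using only the standard library. The CSV content, the column names, and the deliberately simple email pattern are all assumptions made for illustration, and the in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import re
import sqlite3

# Extract: a hypothetical raw export (in practice this might come from a file or an API).
RAW_CSV = """order_id,city,email,amount
1, mumbai ,asha@example.com,250
2,Pune,not-an-email,
2,Pune,not-an-email,
3,MUMBAI,ravi@example.com,100
"""

# Deliberately simple pattern for illustration; real email validation is messier.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def transform(rows):
    """Drop duplicate IDs, standardize city names, blank out bad emails, keep missing amounts as None."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen_ids:  # remove duplicates
            continue
        seen_ids.add(order_id)
        city = row["city"].strip().title()  # standardize format
        email = row["email"] if EMAIL_PATTERN.match(row["email"]) else None
        amount = float(row["amount"]) if row["amount"] else None  # missing value becomes NULL
        cleaned.append((order_id, city, email, amount))
    return cleaned


def load(rows, conn):
    """Load cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, email TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)


raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))  # Extract
clean_rows = transform(raw_rows)                       # Transform
warehouse = sqlite3.connect(":memory:")                # stand-in for a real warehouse
load(clean_rows, warehouse)                            # Load

print(clean_rows)
```

In day-to-day work you would more likely do the transform step in Pandas, but the shape of the pipeline stays the same: extract, transform, load.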

5. Data Visualization & Reporting

Turning numbers into clear visuals and stories is one of the highest-value skills a data analyst can develop.

Term	Acronym	Definition	Example
Dashboard	-	An interactive visual display of key metrics and data, updated in real time or on schedule	A sales dashboard showing daily revenue, top products, and regional performance
KPI	KPI	Key Performance Indicator - a measurable value that indicates how well a goal is being achieved	Monthly Active Users, Conversion Rate, Customer Acquisition Cost
Chart	-	A visual representation of data - bars, lines, pies, scatter plots, etc.	A bar chart comparing revenue by product category
Bar Chart	-	A chart using rectangular bars to compare values across categories	Revenue by department
Line Chart	-	A chart showing how a value changes over time using connected data points	Monthly website traffic over 12 months
Scatter Plot	-	A chart plotting individual data points on two axes to show relationships between variables	Customer age vs. total spend
Heatmap	-	A grid where values are represented by color intensity	Hour-by-day view of website traffic - darker = more visitors
Histogram	-	A chart showing the frequency distribution of a single numeric variable	Distribution of customer ages in 10-year bins
Tableau	-	A leading business intelligence and data visualization platform	Drag-and-drop dashboards connected to live databases
Power BI	-	Microsoft's business intelligence tool for building interactive reports and dashboards	Common in organizations already using Microsoft 365
Data Storytelling	-	The practice of communicating insights from data using a narrative with visuals	A slide deck explaining why sales dropped last quarter using charts and context
Exploratory Data Analysis	EDA	An initial analysis phase to summarize data, find patterns, and detect anomalies before modeling	Running df.describe() and df.hist() in Pandas to understand a new dataset
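
A histogram is really just counting values per bin, which a rough sketch can show without any plotting library. The ages below are hypothetical, and the text bars are only a stand-in for what Matplotlib or a BI tool would draw.

```python
from collections import Counter

# Hypothetical customer ages, invented for illustration.
ages = [23, 27, 31, 34, 35, 38, 41, 42, 47, 52, 58, 63]


def bin_start(age, width=10):
    """Lower edge of the bin an age falls into, e.g. 34 -> 30."""
    return (age // width) * width


counts = Counter(bin_start(age) for age in ages)

# One row per 10-year bin; the bar length is the number of customers in that bin.
for start in sorted(counts):
    print(f"{start}-{start + 9}: {'#' * counts[start]}")
```

Changing the bin width changes the shape you see, so it is worth trying a couple of widths before drawing conclusions from a histogram.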

6. Advanced & Enterprise Concepts

Once you have the fundamentals, these terms appear constantly in data engineering, warehousing, and larger organizations.

Term	Acronym	Definition	Example
Data Warehouse	DW	A central repository storing large volumes of historical, structured data from multiple sources - optimized for analysis	Google BigQuery, Amazon Redshift, Snowflake
Data Lake	-	A storage system for raw data in any format - structured, unstructured, or semi-structured	An AWS S3 bucket holding raw logs, JSON files, and CSVs
Data Mart	-	A subset of a data warehouse focused on a specific business area	A marketing data mart containing only ad spend and campaign data
OLAP	OLAP	Online Analytical Processing - systems designed for complex queries and aggregations on large historical data	Running multi-dimensional sales analysis: by region, product, and quarter simultaneously
OLTP	OLTP	Online Transaction Processing - systems designed for fast, frequent, small read/write operations	A system processing thousands of online orders per second
Snowflake Schema	-	A database schema where dimension tables are normalized into multiple related tables, reducing redundancy	A date dimension split into year, quarter, month sub-tables
Star Schema	-	A simple warehouse schema with one central fact table linked to multiple dimension tables	A Sales fact table linked to Date, Product, and Customer dimension tables
Fact Table	-	A table in a data warehouse storing measurable, quantitative data about events	Each row = one sales transaction with amount, date, and customer ID
Dimension Table	-	A table storing descriptive attributes used to filter and group fact data	Customer table with name, city, age, and segment
Data Governance	-	The policies and processes that ensure data quality, security, privacy, and proper use across an organization	Rules defining who can access customer PII data and how long it is retained
Data Lineage	-	A record of where data originated and how it has moved and transformed over time	Knowing that a dashboard number traces back to a specific raw database table
Data Catalog	-	A searchable inventory of all data assets in an organization with metadata and documentation	A company-wide system where analysts can search for available tables and understand what each column means
Partitioning	-	Dividing a large database table into smaller segments based on a column value for faster querying	Partitioning a billion-row table by month so each query only scans one month
Apache Spark	-	An open-source big data processing framework for distributed computation on large datasets	Processing terabytes of clickstream data across a cluster of servers
dbt	dbt	Data Build Tool - a framework for transforming raw data in a warehouse using SQL, with version control and testing	Writing SQL models that are automatically run, tested, and documented
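
Star schema, fact table, and dimension table make more sense once you query one. Below is a minimal sketch in SQLite via Python; the dim_ and fact_ table names are a common naming habit rather than a rule, and every row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to filter and group.
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per sales transaction, with a measure and keys into the dimensions.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    sale_month  TEXT,
    amount      REAL
);

INSERT INTO dim_product  VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_customer VALUES (1, 'Mumbai'), (2, 'Pune');
INSERT INTO fact_sales VALUES
    (1, 1, 1, '2024-01', 20.0),
    (2, 2, 1, '2024-01', 35.0),
    (3, 1, 2, '2024-02', 15.0),
    (4, 1, 1, '2024-02', 30.0);
""")

# An OLAP-style question: revenue by product category and customer city at once.
revenue = conn.execute("""
    SELECT p.category, c.city, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY p.category, c.city
    ORDER BY p.category, c.city
""").fetchall()

print(revenue)
```

The fact table stays narrow (keys plus numbers) while the dimensions carry the descriptive text, so adding a new way to slice the data usually means joining one more dimension.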

7. Popular Tools & Platforms

Category	Tools
Query & SQL	MySQL, PostgreSQL, BigQuery, Snowflake, SQL Server
Python Analysis	Pandas, NumPy, Jupyter Notebook
Data Visualization	Tableau, Power BI, Matplotlib, Seaborn, Plotly
Spreadsheets	Microsoft Excel, Google Sheets
Big Data & Pipelines	Apache Spark, Apache Airflow, dbt, Kafka
Cloud Platforms	AWS (Redshift, S3), Google Cloud (BigQuery), Azure Synapse
Version Control	Git, GitHub
Data Cleaning	OpenRefine, Pandas, Excel Power Query
BI & Reporting	Looker, Metabase, Superset, Grafana
Statistics & ML	Scikit-learn, R, SciPy, Statsmodels

Recommended Learning Path

Don't try to learn everything at once. Follow these phases in order and build momentum with small wins.

Phase 1 - Foundations (1-2 weeks)

Get comfortable with the core vocabulary before touching any tools.

Data, datasets, and data types
What a database is and how tables work
Rows, columns, primary keys, and foreign keys
The difference between structured and unstructured data

Phase 2 - SQL Basics (2-4 weeks)

SQL is the single most important skill. Learn to query data before anything else.

SELECT, WHERE, ORDER BY, GROUP BY
Aggregate functions: SUM, COUNT, AVG, MAX, MIN
JOINs: INNER, LEFT, RIGHT
Filtering with WHERE and HAVING
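
WHERE and HAVING are easy to mix up, so here is a small sketch of the difference, run in SQLite from Python. The orders table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL);
INSERT INTO orders VALUES
    (1, 'Mumbai', 100.0), (2, 'Mumbai', 30.0), (3, 'Pune', 60.0),
    (4, 'Pune', 70.0), (5, 'Delhi', 500.0);
""")

# WHERE filters individual rows before grouping: orders under 50 are dropped first.
# HAVING filters the groups afterwards: only cities whose remaining total exceeds 120 survive.
big_cities = conn.execute("""
    SELECT city, SUM(amount) AS total
    FROM orders
    WHERE amount >= 50
    GROUP BY city
    HAVING SUM(amount) > 120
    ORDER BY city
""").fetchall()

print(big_cities)
```

Mumbai drops out here only because WHERE removed its 30.0 order before the totals were computed. Without the WHERE clause its total would be 130.0 and it would pass the HAVING filter.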

Phase 3 - Statistics & Python (4-8 weeks)

Build analytical thinking and learn to work with data programmatically.

Mean, median, standard deviation, correlation
Python basics and Pandas DataFrames
Data cleaning - handling nulls, duplicates, outliers
Exploratory Data Analysis (EDA)
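
For the data cleaning item, a common first pass at spotting outliers is the 1.5 x IQR rule. The sketch below is one way to do it with the standard library, assuming Python 3.8 or newer for statistics.quantiles. The purchase amounts are hypothetical, and dropping nulls is just one of several reasonable choices.

```python
import statistics

# Hypothetical purchase amounts with one suspicious value and one missing entry.
purchases = [45, 52, 48, 50, None, 55, 47, 1_000_000, 49, 51]

# Handle nulls: here they are simply dropped; filling in the median is another option.
known = [p for p in purchases if p is not None]

# Quartiles from the standard library (its default "exclusive" method).
q1, _, q3 = statistics.quantiles(known, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged for a closer look, not automatically deleted.
outliers = [p for p in known if p < lower or p > upper]
typical = [p for p in known if lower <= p <= upper]

print(outliers, typical)
```

Flagging is the safe default: an outlier may be a data entry error, or it may be your most important customer.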

Phase 4 - Visualization & Reporting (2-4 weeks)

Turn analysis into insights that decision-makers can understand.

Choosing the right chart type for your data
Building a dashboard in Tableau or Power BI
Data storytelling - narrative structure for presentations
KPIs and business metrics

Phase 5 - Advanced Topics (Ongoing)

As you grow, these skills will set you apart in the job market.

Advanced SQL: CTEs, window functions, subqueries
Data warehouses, star schemas, and ETL pipelines
dbt for data transformation
Big data fundamentals: Spark, Airflow, Kafka

Final Advice

Don't try to master all the theory before you start. The fastest way to truly understand data analysis is through doing.

Learn the basics - Use this glossary as your starting reference.
Get SQL practice - Use free platforms like Mode Analytics, LeetCode SQL, or SQLZoo.
Work on real data - Download a dataset from Kaggle and explore it in Python or Excel.
Build a portfolio project - A simple dashboard or analysis writeup is more valuable than any certificate.
Experiment continuously - Break things. Try queries. See what happens.

Practical exposure beats passive study every time. Start today.

DEV Community