Understanding 6 Common Data Formats in Data Analytics

Prasanna Kumar T — Tue, 07 Oct 2025 15:44:58 +0000

In the world of data analytics, the way we store and share data matters just as much as the insights we gain from it. Different formats are optimized for different purposes — whether it’s human readability, query speed, or compression efficiency.

We’ll use the same simple dataset and represent it in all six formats.

Sample Dataset:

Let’s take a small example of student marks:

1. CSV (Comma-Separated Values)

CSV is one of the simplest and most widely used data formats. Each line represents a record, and each field is separated by a comma.
It’s human-readable, lightweight, and easy to open in Excel or Google Sheets.

Dataset Representation (CSV):

✅ Pros: Easy to read, supported by all tools
❌ Cons: No data types, no schema, not efficient for large data

2. SQL (Relational Table Format)

SQL stores data in tables with rows and columns. Data can be inserted using SQL statements.

Dataset Representation (JSON):

✅ Pros: Structured, supports queries and relationships
❌ Cons: Not ideal for semi-structured or nested data

3. JSON (JavaScript Object Notation)

JSON is a lightweight, text-based format used to store structured and semi-structured data.
It’s commonly used in APIs and NoSQL databases like MongoDB.

Dataset Representation (JSON):

✅ Pros: Human-readable, supports nested structures
❌ Cons: Larger file size, slower parsing for very big data

4. Parquet (Columnar Storage Format)

Apache Parquet is a binary columnar format optimized for analytical queries in big data systems like Hadoop, Spark, and AWS Athena.
It stores data by columns, allowing faster reads for specific fields.

Dataset Representation (Parquet):

Parquet is a binary format, so you can’t read it directly like text.
However, the same dataset conceptually looks like this when stored in Parquet:

In Python (using PyArrow or Pandas), you’d write:

✅ Pros: Highly efficient, supports compression, great for analytics
❌ Cons: Not human-readable

5. XML (Extensible Markup Language)

XML is a markup language similar to HTML that represents data with tags.
It’s self-descriptive and widely used in web services and configuration files.

Dataset Representation (XML):

✅ Pros: Hierarchical, supports metadata
❌ Cons: Verbose, larger file size, slower parsing

6. Avro (Row-Based Storage Format)

Apache Avro is a binary row-based format often used in data streaming and serialization (especially with Apache Kafka).
It stores data with a schema, making it compact and fast.

Dataset Representation (Avro Schema + Example):

Schema (in JSON):

Sample Data (Avro JSON representation):

✅ Pros: Compact, schema-based, great for streaming
❌ Cons: Requires tools to read/write (not human-readable)

MongoDB Hands-on: Working with a Zomato Restaurants Dataset

Prasanna Kumar T — Tue, 26 Aug 2025 04:43:34 +0000

This professional guide demonstrates a compact, repeatable workflow for common MongoDB tasks using a Zomato-style restaurants dataset. It covers:

inserting sample documents,
computing top restaurants by rating,
counting reviews that mention the word “good”,
fetching all reviews for a restaurant,
updating and deleting records, and
exporting results as JSON/CSV.

All commands are ready to paste into mongosh (except mongoexport, which runs in your system terminal). The examples assume a database named zomato and a collection named restaurants.

Dataset shape (example document)

Typical document structure used in the examples:

Note: the rate field in this dataset is a string of the form "4.1/5". Queries below show how to parse the numeric portion.

1. Insert at least 10 sample records

Use insertMany to add sample documents if you don’t already have data.

2. Top 5 restaurants by rating

Because rate is stored as "4.1/5", use $split + $toDouble to convert it to a numeric value, then aggregate and sort. The pipeline below computes an average per restaurant (useful if a restaurant has multiple documents).

What this returns: top 5 restaurant names with their average numeric rating and total votes.

3. Count reviews that contain the word “good” (case-insensitive)

If reviews_list is an array of subdocuments with a review text field:

Notes:

This counts documents (restaurants) where at least one review contains “good”.

If you need the exact count of matching reviews across all documents, use $unwind and $match then $count:

4. Retrieve all reviews for a specific restaurant (by name)

Replace "Jalsa" with the restaurant of interest.

5. Update a review and delete a record
Update one matching review (in-place)

Example: update a review in San Churro Cafe whose text mentions “Ambience”.

6. Delete a record (example: remove "Spice Elephant")

7. Export results as JSON/CSV

Important: mongoexport is a command-line tool. Run these commands in your system terminal (PowerShell, Bash, or Command Prompt), not inside mongosh.

Export entire collection as JSON

mongoexport --db=zomato --collection=restaurants --out=restaurants.json

Export selected fields as CSV

mongoexport --db=zomato --collection=restaurants \
--type=csv --fields=name,location,rate,votes,cuisines \
--out=restaurants.csv

Export the Top 5 aggregation result (recommended approach: write the aggregation to a temporary collection, then export):

Write the aggregation to a collection named top5_restaurants:

Export that collection:

mongoexport --db=zomato --collection=top5_restaurants --out=top5_restaurants.json
mongoexport --db=zomato --collection=top5_restaurants --type=csv --fields=_id,location,avgRating,totalVotes --out=top5_restaurants.csv

Authentication / remote server: if your MongoDB uses authentication or is remote, supply --uri with the full connection string:

mongoexport --uri="mongodb://user:pass@host:27017/zomato" --collection=restaurants --out=restaurants.json

Practical notes & best practices

Run mongoexport from the OS terminal — not from mongosh. Running it inside the shell causes a syntax error.
Normalize numeric fields: if you will frequently query by rating, consider storing rate_num as a numeric field (e.g., 4.1) during ingestion to simplify queries and improve performance.
Text search: for more advanced review searches (stemming, ranking), consider enabling text indexes and using $text.

Backups: keep JSON exports of critical collections for reproducibility and backup.

Closing

This guide supplies a compact, production-minded set of queries and commands you can use to demonstrate MongoDB skills on the Zomato dataset. Use the commands as-is in mongosh and run the mongoexport commands from your terminal to produce JSON/CSV deliverables.

DEV Community: Prasanna Kumar T

Understanding 6 Common Data Formats in Data Analytics

MongoDB Hands-on: Working with a Zomato Restaurants Dataset