Data Formats

Vignesh K — Mon, 06 Oct 2025 04:45:24 +0000

Understanding Popular Data Formats: CSV, SQL, JSON, XML, Avro, and Parquet

When working with data engineering, analytics, or backend systems, you’ll often come across multiple data formats. Each has its own strengths, weaknesses, and use cases. Let’s explore the most common ones:

1. CSV (Comma-Separated Values)

What it is: A plain text file where values are separated by commas.
Use case: Simple tabular data, easy import/export.
Pros: Human-readable, lightweight.
Cons: No support for nested data, no schema enforcement.

2. SQL (Structured Query Language)

What it is: A language to interact with relational databases. Data is stored in tables with defined schema.
Use case: Storing structured data in relational databases (MySQL, PostgreSQL, etc.).
Pros: Strong schema, powerful queries, ACID compliance.
Cons: Not flexible for unstructured data.

3. JSON (JavaScript Object Notation)

What it is: A lightweight data format for representing structured, nested data using key-value pairs.
Use case: Web APIs, configurations, NoSQL databases.
Pros: Human-readable, supports hierarchy.
Cons: Can become verbose for large datasets.

4. XML (eXtensible Markup Language)

What it is: A markup-based format using tags to represent data.
Use case: Legacy systems, document storage, SOAP APIs.
Pros: Supports metadata, validation with DTD/XSD.
Cons: Verbose, harder to read compared to JSON.

5. Avro

What it is: A row-based binary format developed by Apache, commonly used in data pipelines.
Use case: Kafka messaging, big data serialization.
Pros: Compact, schema evolution supported.
Cons: Not human-readable (binary format).

6. Parquet

What it is: A columnar storage format optimized for analytics.
Use case: Big data processing (Spark, Hadoop, AWS Athena).
Pros: Compressed, fast query performance, great for large-scale analytics.
Cons: Not human-readable.

Quick Comparison

Format	Type	Readable?	Best Use Case
CSV	Row-based	✅ Yes	Simple tabular data
SQL	Relational	✅ Yes	Databases with strong schema
JSON	Hierarchical	✅ Yes	APIs, configs, NoSQL
XML	Hierarchical	✅ Yes	Legacy systems, structured documents
Avro	Row-based binary	❌ No	Messaging, streaming pipelines
Parquet	Columnar binary	❌ No	Big data analytics, fast queries

Final Thoughts

Each format shines in different scenarios:

Use CSV for small tabular data.
Use SQL for structured databases.
Use JSON for modern APIs.
Use XML when working with older systems.
Use Avro for messaging pipelines.
Use Parquet for analytics at scale.

Working with Yelp Data in MongoDB: CRUD Operations and Queries

Vignesh K — Wed, 27 Aug 2025 10:27:50 +0000

In this blog post, we'll explore how to perform basic CRUD operations and queries on a MongoDB database named mydb with a collection called yelp, based on a Yelp dataset. We'll use MongoDB's JavaScript-based query syntax to insert records, find top-rated businesses, count reviews containing specific words, retrieve reviews for a business, update a review, and delete a record. Let's dive in!

1. Inserting Records into the Yelp Collection
To populate the yelp collection, we can use the insertMany method to add multiple documents at once. Below is an example of inserting 10 review documents, each with fields like business_id, date, review_id, stars, text, type, user_id, cool, useful, and funny.

2. Finding the Top 5 Businesses by Average Rating
To find the top 5 businesses with the highest average rating, we use MongoDB's aggregation pipeline. The pipeline groups reviews by 'business_id', computes the average 'stars', sorts in descending order, and limits to 5 results.

For the inserted data, this query returns:

'C3D4E5F6G7H8I9J0K1L2M3': 4.5
'E5F6G7H8I9J0K1L2M3N4O5': 4.5
'A1B2C3D4E5F6G7H8I9J0K1': 4.5
'D4E5F6G7H8I9J0K1L2M3N4': 4
'B2C3D4E5F6G7H8I9J0K1L2': 3.5

3. Counting Reviews Containing the Word “Good”
To count reviews with the word 'good' in the 'text' field, we use countDocuments with a case-insensitive regex.

This returns '3' for the inserted data, as three reviews contain 'good' (review IDs 'R4V5W6X7Y8Z9A0B1C2D3E4', 'R6X7Y8Z9A0B1C2D3E4F5G6', 'R9A0B1C2D3E4F5G6H7I8J9').

4. Retrieving All Reviews for a Specific Business
To fetch all reviews for a specific 'business_id', we use the find method. For example, to get reviews for '9yKzy9PApeiPPOUJEtnvkg' (from the sample Yelp data):

5. Updating a Review
To update a review, we use updateOne to modify specific fields. For example, to update the review with 'review_id' 'fWKvX83p0-ka4JS3dc6E5A' (from the sample data, if inserted):

Notes
These queries assume you're using MongoDB with the 'yelp' collection in the 'mydb' database.
The 'date' field uses MongoDB's Date object for proper storage. When importing a CSV, ensure dates are parsed correctly (e.g., new Date("YYYY-MM-DD")).
The regex in the 'good' query is case-insensitive ('$options: "i"').
To import a large 'yelp.csv' file, you can use MongoDB's mongoimport tool or a script to parse and insert the data.
If you encounter issues or need help with specific MongoDB versions, importing data, or running these queries, feel free to share more details!
This setup provides a solid foundation for working with Yelp data in MongoDB. You can extend these queries for more complex analyses, like filtering by date or combining conditions. Happy coding!

DEV Community: Vignesh K