<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shrutti Kannan</title>
    <description>The latest articles on DEV Community by Shrutti Kannan (@shrutti_kannan_4d6b7159e2).</description>
    <link>https://dev.to/shrutti_kannan_4d6b7159e2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3413708%2Fffb743b4-a973-4ada-9d82-fc15c682325f.png</url>
      <title>DEV Community: Shrutti Kannan</title>
      <link>https://dev.to/shrutti_kannan_4d6b7159e2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shrutti_kannan_4d6b7159e2"/>
    <language>en</language>
    <item>
      <title>🔍 Understanding 6 Common Data Formats in Data Analytics (With Examples)</title>
      <dc:creator>Shrutti Kannan</dc:creator>
      <pubDate>Wed, 08 Oct 2025 18:16:54 +0000</pubDate>
      <link>https://dev.to/shrutti_kannan_4d6b7159e2/understanding-6-common-data-formats-in-data-analytics-with-examples-4mh7</link>
      <guid>https://dev.to/shrutti_kannan_4d6b7159e2/understanding-6-common-data-formats-in-data-analytics-with-examples-4mh7</guid>
      <description>&lt;p&gt;When working in data analytics, we often need to store, share, and transform data in various formats. Each format has its own strengths, ideal use cases, and limitations. Whether you're wrangling data for machine learning, storing logs, or building a data pipeline, understanding these formats is key.&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk through six widely used data formats, using one small example dataset throughout. We'll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV (Comma Separated Values)&lt;/li&gt;
&lt;li&gt;SQL (Relational Table Format)&lt;/li&gt;
&lt;li&gt;JSON (JavaScript Object Notation)&lt;/li&gt;
&lt;li&gt;Parquet (Columnar Storage Format)&lt;/li&gt;
&lt;li&gt;XML (Extensible Markup Language)&lt;/li&gt;
&lt;li&gt;Avro (Row-Based Storage Format)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🎓 Example Dataset&lt;/p&gt;

&lt;p&gt;To make things concrete, let’s use this small dataset of student exam scores:&lt;/p&gt;

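&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Name&lt;/th&gt;&lt;th&gt;Register No&lt;/th&gt;&lt;th&gt;Subject&lt;/th&gt;&lt;th&gt;Marks&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alice&lt;/td&gt;&lt;td&gt;1001&lt;/td&gt;&lt;td&gt;Math&lt;/td&gt;&lt;td&gt;89&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Bob&lt;/td&gt;&lt;td&gt;1002&lt;/td&gt;&lt;td&gt;Science&lt;/td&gt;&lt;td&gt;92&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Charlie&lt;/td&gt;&lt;td&gt;1003&lt;/td&gt;&lt;td&gt;History&lt;/td&gt;&lt;td&gt;85&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;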

&lt;p&gt;We'll now represent this dataset in each format.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;📄 CSV (Comma Separated Values)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 What is CSV?&lt;/p&gt;

&lt;p&gt;CSV is one of the simplest data formats. It stores tabular data as plain text, with rows separated by line breaks and columns separated by commas. It’s human-readable, easy to generate, and widely supported.&lt;/p&gt;

&lt;p&gt;🧾 Example:&lt;/p&gt;

&lt;p&gt;Name,Register No,Subject,Marks&lt;br&gt;
Alice,1001,Math,89&lt;br&gt;
Bob,1002,Science,92&lt;br&gt;
Charlie,1003,History,85&lt;/p&gt;

&lt;p&gt;✅ Pros:&lt;/p&gt;

&lt;p&gt;Simple and readable&lt;/p&gt;

&lt;p&gt;Easy to parse&lt;/p&gt;

&lt;p&gt;Ideal for small data&lt;/p&gt;

&lt;p&gt;❌ Cons:&lt;/p&gt;

&lt;p&gt;No support for nested data&lt;/p&gt;

&lt;p&gt;No data types (everything is text)&lt;/p&gt;

&lt;p&gt;Not efficient for large datasets&lt;/p&gt;
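
&lt;p&gt;To see what "everything is text" means in practice, here is a minimal sketch using Python's standard csv module. The CSV_TEXT constant and the load_scores helper are illustrative names, not part of any library.&lt;/p&gt;

```python
import csv
import io

# The CSV text from the example above, held in memory instead of a file.
CSV_TEXT = """Name,Register No,Subject,Marks
Alice,1001,Math,89
Bob,1002,Science,92
Charlie,1003,History,85
"""


def load_scores(text):
    """Parse the CSV and convert the numeric columns explicitly."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # The csv module hands back every field as a string; casting is our job.
        row["Register No"] = int(row["Register No"])
        row["Marks"] = int(row["Marks"])
        rows.append(row)
    return rows


raw_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
typed_rows = load_scores(CSV_TEXT)

print(type(raw_rows[0]["Marks"]).__name__)    # str
print(type(typed_rows[0]["Marks"]).__name__)  # int
```

&lt;p&gt;Every consumer of the file has to repeat this conversion step, because the file itself carries no type information.&lt;/p&gt;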

&lt;ol start="2"&gt;
&lt;li&gt;🗄️ SQL (Relational Table Format)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 What is SQL?&lt;/p&gt;

&lt;p&gt;SQL is the language used for managing relational databases rather than a file format in itself. The data lives in tables with defined columns and types. You read it with SELECT statements and change it with statements such as INSERT, UPDATE, and DELETE.&lt;/p&gt;

&lt;p&gt;🧾 Example:&lt;/p&gt;

&lt;p&gt;CREATE TABLE student_scores (&lt;br&gt;
  Name TEXT,&lt;br&gt;
  RegisterNo INT,&lt;br&gt;
  Subject TEXT,&lt;br&gt;
  Marks INT&lt;br&gt;
);&lt;/p&gt;

&lt;p&gt;INSERT INTO student_scores (Name, RegisterNo, Subject, Marks) VALUES&lt;br&gt;
('Alice', 1001, 'Math', 89),&lt;br&gt;
('Bob', 1002, 'Science', 92),&lt;br&gt;
('Charlie', 1003, 'History', 85);&lt;/p&gt;

&lt;p&gt;✅ Pros:&lt;/p&gt;

&lt;p&gt;Strong data typing&lt;/p&gt;

&lt;p&gt;Powerful querying capabilities&lt;/p&gt;

&lt;p&gt;Supports constraints and indexing&lt;/p&gt;

&lt;p&gt;❌ Cons:&lt;/p&gt;

&lt;p&gt;Requires a database engine&lt;/p&gt;

&lt;p&gt;Less flexible for hierarchical data&lt;/p&gt;
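
&lt;p&gt;The statements above can be tried without installing a server. The sketch below assumes SQLite as the engine, through Python's built-in sqlite3 module; that choice is ours for illustration, and the same SQL should run on most relational databases.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database. SQLite is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_scores (Name TEXT, RegisterNo INT, Subject TEXT, Marks INT)"
)
conn.executemany(
    "INSERT INTO student_scores (Name, RegisterNo, Subject, Marks) VALUES (?, ?, ?, ?)",
    [
        ("Alice", 1001, "Math", 89),
        ("Bob", 1002, "Science", 92),
        ("Charlie", 1003, "History", 85),
    ],
)
conn.commit()

# Sorting and aggregation happen inside the engine, not in Python.
ranked = conn.execute(
    "SELECT Name, Marks FROM student_scores ORDER BY Marks DESC"
).fetchall()
average = conn.execute("SELECT AVG(Marks) FROM student_scores").fetchone()[0]

print(ranked)             # [('Bob', 92), ('Alice', 89), ('Charlie', 85)]
print(round(average, 2))  # 88.67
conn.close()
```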

&lt;ol start="3"&gt;
&lt;li&gt;🌐 JSON (JavaScript Object Notation)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 What is JSON?&lt;/p&gt;

&lt;p&gt;JSON is a lightweight format for storing structured data, often used in web APIs. It supports nested and hierarchical data.&lt;/p&gt;

&lt;p&gt;🧾 Example:&lt;/p&gt;

&lt;p&gt;[&lt;br&gt;
  {&lt;br&gt;
    "Name": "Alice",&lt;br&gt;
    "RegisterNo": 1001,&lt;br&gt;
    "Subject": "Math",&lt;br&gt;
    "Marks": 89&lt;br&gt;
  },&lt;br&gt;
  {&lt;br&gt;
    "Name": "Bob",&lt;br&gt;
    "RegisterNo": 1002,&lt;br&gt;
    "Subject": "Science",&lt;br&gt;
    "Marks": 92&lt;br&gt;
  },&lt;br&gt;
  {&lt;br&gt;
    "Name": "Charlie",&lt;br&gt;
    "RegisterNo": 1003,&lt;br&gt;
    "Subject": "History",&lt;br&gt;
    "Marks": 85&lt;br&gt;
  }&lt;br&gt;
]&lt;/p&gt;

&lt;p&gt;✅ Pros:&lt;/p&gt;

&lt;p&gt;Human-readable and flexible&lt;/p&gt;

&lt;p&gt;Supports nested structures&lt;/p&gt;

&lt;p&gt;Common in APIs and web apps&lt;/p&gt;

&lt;p&gt;❌ Cons:&lt;/p&gt;

&lt;p&gt;Larger file size than binary formats&lt;/p&gt;

&lt;p&gt;No built-in schema enforcement (validation needs a separate tool such as JSON Schema)&lt;/p&gt;
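
&lt;p&gt;Because a JSON parser accepts any well-formed document, checking the shape of the data is left to the application. Here is a small sketch with Python's standard json module. EXPECTED_TYPES and find_problems are hypothetical names for a hand-rolled check, not a standard API.&lt;/p&gt;

```python
import json

# Field names and types taken from the example dataset above.
EXPECTED_TYPES = {"Name": str, "RegisterNo": int, "Subject": str, "Marks": int}


def find_problems(record):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


good = json.loads('{"Name": "Alice", "RegisterNo": 1001, "Subject": "Math", "Marks": 89}')
# Still valid JSON, but Subject is missing and Marks arrived as a string.
bad = json.loads('{"Name": "Bob", "RegisterNo": 1002, "Marks": "92"}')

print(find_problems(good))  # []
print(find_problems(bad))   # ['missing field: Subject', 'Marks should be int']
```

&lt;p&gt;Both documents parse without error; only the explicit check notices that the second one does not match the expected shape.&lt;/p&gt;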

&lt;ol start="4"&gt;
&lt;li&gt;🧱 Parquet (Columnar Storage Format)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 What is Parquet?&lt;/p&gt;

&lt;p&gt;Parquet is an open-source columnar storage format optimized for analytical workloads, especially with big data tools like Apache Spark and Hive. It stores data by columns instead of rows. Because the values of one column sit together, similar values compress well and a query can read only the columns it needs.&lt;/p&gt;

&lt;p&gt;🧾 Example (Pseudo view):&lt;/p&gt;

&lt;p&gt;We can’t represent Parquet directly as plain text, but here’s how the schema might look:&lt;/p&gt;

&lt;p&gt;parquet_schema:&lt;br&gt;
  Name: BYTE_ARRAY (UTF8)&lt;br&gt;
  RegisterNo: INT32&lt;br&gt;
  Subject: BYTE_ARRAY (UTF8)&lt;br&gt;
  Marks: INT32&lt;/p&gt;

&lt;p&gt;Columnar view:&lt;br&gt;
Name: ["Alice", "Bob", "Charlie"]&lt;br&gt;
RegisterNo: [1001, 1002, 1003]&lt;br&gt;
Subject: ["Math", "Science", "History"]&lt;br&gt;
Marks: [89, 92, 85]&lt;/p&gt;

&lt;p&gt;You’d use a library like pandas (DataFrame.to_parquet() to write, pandas.read_parquet() to read) or Apache Spark to work with this format.&lt;/p&gt;

&lt;p&gt;✅ Pros:&lt;/p&gt;

&lt;p&gt;Great for analytics (fast column reads)&lt;/p&gt;

&lt;p&gt;Efficient storage and compression&lt;/p&gt;

&lt;p&gt;Preferred in big data environments&lt;/p&gt;

&lt;p&gt;❌ Cons:&lt;/p&gt;

&lt;p&gt;Not human-readable&lt;/p&gt;

&lt;p&gt;Complex to debug without tools&lt;/p&gt;
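
&lt;p&gt;The columnar idea can be illustrated in plain Python. This sketch only shows the row-to-column pivot and why a single-column aggregate is cheap. It does not write a real Parquet file, and to_columns is a made-up helper name.&lt;/p&gt;

```python
# The example dataset as row-oriented records.
rows = [
    {"Name": "Alice", "RegisterNo": 1001, "Subject": "Math", "Marks": 89},
    {"Name": "Bob", "RegisterNo": 1002, "Subject": "Science", "Marks": 92},
    {"Name": "Charlie", "RegisterNo": 1003, "Subject": "History", "Marks": 85},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over Marks touches one list and never looks at Name or Subject.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])

print(columns["Marks"])         # [89, 92, 85]
print(round(average_marks, 2))  # 88.67
```

&lt;p&gt;A real Parquet file adds typed encodings, compression, and metadata on top of this layout, which is why a library is needed to read it.&lt;/p&gt;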

&lt;ol start="5"&gt;
&lt;li&gt;🪵 XML (Extensible Markup Language)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 What is XML?&lt;/p&gt;

&lt;p&gt;XML is a markup language that stores structured data with custom tags. It’s verbose but flexible and still used in enterprise and legacy systems.&lt;/p&gt;

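&lt;p&gt;🧾 Example:&lt;/p&gt;

&lt;p&gt;&amp;lt;students&amp;gt;&lt;br&gt;
  &amp;lt;student&amp;gt;&lt;br&gt;
    &amp;lt;Name&amp;gt;Alice&amp;lt;/Name&amp;gt;&lt;br&gt;
    &amp;lt;RegisterNo&amp;gt;1001&amp;lt;/RegisterNo&amp;gt;&lt;br&gt;
    &amp;lt;Subject&amp;gt;Math&amp;lt;/Subject&amp;gt;&lt;br&gt;
    &amp;lt;Marks&amp;gt;89&amp;lt;/Marks&amp;gt;&lt;br&gt;
  &amp;lt;/student&amp;gt;&lt;br&gt;
  &amp;lt;student&amp;gt;&lt;br&gt;
    &amp;lt;Name&amp;gt;Bob&amp;lt;/Name&amp;gt;&lt;br&gt;
    &amp;lt;RegisterNo&amp;gt;1002&amp;lt;/RegisterNo&amp;gt;&lt;br&gt;
    &amp;lt;Subject&amp;gt;Science&amp;lt;/Subject&amp;gt;&lt;br&gt;
    &amp;lt;Marks&amp;gt;92&amp;lt;/Marks&amp;gt;&lt;br&gt;
  &amp;lt;/student&amp;gt;&lt;br&gt;
  &amp;lt;student&amp;gt;&lt;br&gt;
    &amp;lt;Name&amp;gt;Charlie&amp;lt;/Name&amp;gt;&lt;br&gt;
    &amp;lt;RegisterNo&amp;gt;1003&amp;lt;/RegisterNo&amp;gt;&lt;br&gt;
    &amp;lt;Subject&amp;gt;History&amp;lt;/Subject&amp;gt;&lt;br&gt;
    &amp;lt;Marks&amp;gt;85&amp;lt;/Marks&amp;gt;&lt;br&gt;
  &amp;lt;/student&amp;gt;&lt;br&gt;
&amp;lt;/students&amp;gt;&lt;/p&gt;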

&lt;p&gt;✅ Pros:&lt;/p&gt;

&lt;p&gt;Hierarchical and self-descriptive&lt;/p&gt;

&lt;p&gt;Schema (XSD) support for validation&lt;/p&gt;

&lt;p&gt;❌ Cons:&lt;/p&gt;

&lt;p&gt;Verbose&lt;/p&gt;

&lt;p&gt;Typically slower to parse than JSON&lt;/p&gt;
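
&lt;p&gt;Here is a rough sketch of building and reading such a document with Python's standard xml.etree.ElementTree module. The element names students and student are assumptions chosen for this illustration; the field names follow the other examples.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# The example dataset as plain tuples.
records = [
    ("Alice", 1001, "Math", 89),
    ("Bob", 1002, "Science", 92),
    ("Charlie", 1003, "History", 85),
]


def build_document(rows):
    """Build an element tree with one student element per row."""
    root = ET.Element("students")
    for name, register_no, subject, marks in rows:
        student = ET.SubElement(root, "student")
        ET.SubElement(student, "Name").text = name
        # XML text nodes are strings, so numbers are converted on the way in.
        ET.SubElement(student, "RegisterNo").text = str(register_no)
        ET.SubElement(student, "Subject").text = subject
        ET.SubElement(student, "Marks").text = str(marks)
    return root


# Serialize to text, then parse it back as a consumer of the file would.
xml_text = ET.tostring(build_document(records), encoding="unicode")
parsed = ET.fromstring(xml_text)

marks_by_name = {
    student.findtext("Name"): int(student.findtext("Marks"))
    for student in parsed.findall("student")
}
print(marks_by_name)  # {'Alice': 89, 'Bob': 92, 'Charlie': 85}
```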

&lt;ol start="6"&gt;
&lt;li&gt;📦 Avro (Row-Based Storage Format)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔍 What is Avro?&lt;/p&gt;

&lt;p&gt;Avro is a row-based binary format developed within the Apache Hadoop ecosystem. An Avro data file stores the schema alongside the data, so any reader can decode it, which makes Avro well suited to serialization and transport in distributed systems.&lt;/p&gt;

&lt;p&gt;🧾 Example:&lt;/p&gt;

&lt;p&gt;Avro data is binary, but the schema itself is written in JSON. Here’s the schema, followed by a conceptual view of the data.&lt;/p&gt;

&lt;p&gt;Schema (JSON):&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "type": "record",&lt;br&gt;
  "name": "Student",&lt;br&gt;
  "fields": [&lt;br&gt;
    {"name": "Name", "type": "string"},&lt;br&gt;
    {"name": "RegisterNo", "type": "int"},&lt;br&gt;
    {"name": "Subject", "type": "string"},&lt;br&gt;
    {"name": "Marks", "type": "int"}&lt;br&gt;
  ]&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;Data (stored in binary, shown here as JSON for readability):&lt;/p&gt;

&lt;p&gt;[&lt;br&gt;
  {"Name": "Alice", "RegisterNo": 1001, "Subject": "Math", "Marks": 89},&lt;br&gt;
  {"Name": "Bob", "RegisterNo": 1002, "Subject": "Science", "Marks": 92},&lt;br&gt;
  {"Name": "Charlie", "RegisterNo": 1003, "Subject": "History", "Marks": 85}&lt;br&gt;
]&lt;/p&gt;

&lt;p&gt;✅ Pros:&lt;/p&gt;

&lt;p&gt;Schema evolution support&lt;/p&gt;

&lt;p&gt;Compact binary format&lt;/p&gt;

&lt;p&gt;Ideal for streaming and Hadoop&lt;/p&gt;

&lt;p&gt;❌ Cons:&lt;/p&gt;

&lt;p&gt;Not human-readable&lt;/p&gt;

&lt;p&gt;Requires Avro libraries to read/write&lt;/p&gt;
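
&lt;p&gt;To give a feel for why Avro is compact, the sketch below hand-encodes one record following the binary encoding rules in the Avro specification: integers use zigzag variable-length coding, strings are a length followed by UTF-8 bytes, and a record is its fields in schema order. The function names are invented for this illustration. The result is only the record body, not a complete Avro data file with its header and embedded schema, so in real projects you would still rely on an Avro library.&lt;/p&gt;

```python
import json


def encode_long(number):
    """Encode an integer as Avro does for int and long: zigzag, then base-128 varint."""
    # Zigzag maps 0, -1, 1, -2, 2 to 0, 1, 2, 3, 4 so small magnitudes stay short.
    value = number * 2 if number == abs(number) else -number * 2 - 1
    out = bytearray()
    while True:
        value, low_bits = divmod(value, 128)
        if value:
            out.append(low_bits + 128)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_string(text):
    """A string is its UTF-8 byte length (encoded as a long) followed by the bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_student(record):
    """A record is its fields in schema order, with no field names or separators."""
    return (
        encode_string(record["Name"])
        + encode_long(record["RegisterNo"])
        + encode_string(record["Subject"])
        + encode_long(record["Marks"])
    )


alice = {"Name": "Alice", "RegisterNo": 1001, "Subject": "Math", "Marks": 89}
payload = encode_student(alice)

print(len(payload))            # 15
print(payload.hex())           # 0a416c696365d20f084d617468b201
print(len(json.dumps(alice)))  # size of the same record as JSON text, for comparison
```

&lt;p&gt;The field names never appear in the payload. The reader gets them from the schema, which is where most of the size saving over JSON comes from.&lt;/p&gt;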

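&lt;p&gt;📊 Summary Comparison&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Format&lt;/th&gt;&lt;th&gt;Human-readable&lt;/th&gt;&lt;th&gt;Supports Nested&lt;/th&gt;&lt;th&gt;Best For&lt;/th&gt;&lt;th&gt;Compression&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CSV&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;Simple data exchange&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SQL&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;Structured storage&lt;/td&gt;&lt;td&gt;✅ (DB level)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;JSON&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;APIs, web apps&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Parquet&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;Big data analytics&lt;/td&gt;&lt;td&gt;✅✅✅&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;XML&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;Legacy systems&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Avro&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✅&lt;/td&gt;&lt;td&gt;Serialization, Kafka&lt;/td&gt;&lt;td&gt;✅✅&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;🧠 Conclusion&lt;/p&gt;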

&lt;p&gt;Choosing the right data format depends on your use case. For small, simple data, CSV or JSON works well. For large-scale analytics, go with Parquet. If you're dealing with streaming systems or need schema evolution, Avro is your friend.&lt;/p&gt;

&lt;p&gt;Understanding these formats—and how to work with them—will help you be more effective whether you're doing ETL, data science, or building data pipelines.&lt;/p&gt;

&lt;p&gt;👉 Which format do you use the most in your projects? Let me know in the comments!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>beginners</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
