HARISH B

Data in the Cloud: 6 Common Formats for Data Analytics

In the world of data analytics, how we store, exchange, and process data depends heavily on the data format used.

From simple text files like CSV to efficient binary formats like Parquet, each format makes different trade-offs between performance, scalability, and compatibility.

In this article, we’ll explore six common data formats used in analytics:

  1. CSV (Comma-Separated Values)
  2. SQL (Relational Table Format)
  3. JSON (JavaScript Object Notation)
  4. Parquet (Columnar Storage Format)
  5. XML (Extensible Markup Language)
  6. Avro (Row-based Storage Format)

We’ll use this simple dataset as an example:

Name    Register_No   Subject   Marks
Hari    101           Math      89
Asha    102           Science   92
Kiran   103           English   85

Step 1: Install Dependencies

!pip install pandas pyarrow fastavro lxml

Step 2: Create the Dataset
We’ll first create a small dataset using pandas.

import pandas as pd  
data = {     
    "Name": ["Hari", "Asha", "Kiran"],     
    "Register_No": [101, 102, 103],     
    "Subject": ["Math", "Science", "English"],     
    "Marks": [89, 92, 85] 
}  

df = pd.DataFrame(data)
df

1. CSV (Comma-Separated Values)

CSV is one of the simplest and most common formats for storing tabular data.
Each line represents a row, and columns are separated by commas. It’s human-readable and works easily with Excel or Python.

# Save to CSV
df.to_csv("students.csv", index=False)

# Verify content
print(open("students.csv").read())

Output:
Name,Register_No,Subject,Marks
Hari,101,Math,89
Asha,102,Science,92
Kiran,103,English,85
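
One thing CSV doesn't store is type information: everything on disk is plain text, and types are re-inferred when the file is read back. A quick round-trip check with pandas (df_csv is just an illustrative name):

# Read the CSV back; pandas infers column dtypes from the text
df_csv = pd.read_csv("students.csv")
print(df_csv.dtypes)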

2. SQL (Relational Table Format)

SQL organizes data into relational tables using columns and rows.
You can store, query, and join tables efficiently in databases like MySQL, SQLite, or PostgreSQL.

import sqlite3  

# Create SQLite database and table
conn = sqlite3.connect("students.db")  
df.to_sql("students", conn, if_exists="replace", index=False)  

# Verify content
pd.read_sql("SELECT * FROM students", conn)
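
Because the data now lives in a relational table, the database can filter and aggregate it before anything reaches Python. A small example query (the threshold of 85 is purely illustrative):

# Let SQLite do the filtering instead of loading the whole table
pd.read_sql("SELECT Name, Marks FROM students WHERE Marks > 85", conn)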

3. JSON (JavaScript Object Notation)

JSON represents data as key-value pairs, making it ideal for APIs and web applications.
It’s lightweight and language-independent.

# Save to JSON (pretty format)
df.to_json("students.json", orient="records", indent=2)

# Verify content
print(open("students.json").read())

Output:

[
  {
    "Name": "Hari",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 89
  },
  {
    "Name": "Asha",
    "Register_No": 102,
    "Subject": "Science",
    "Marks": 92
  },
  {
    "Name": "Kiran",
    "Register_No": 103,
    "Subject": "English",
    "Marks": 85
  }
]
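
Since the file was written with orient="records", it round-trips cleanly through pandas; a quick verification sketch:

# Read the JSON back; orient="records" matches how it was written
pd.read_json("students.json", orient="records")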

4. Parquet (Columnar Storage Format)

Parquet is a columnar storage format optimized for big data analytics.
It stores data by columns instead of rows, which improves compression and query performance — especially for analytical workloads.

# Save to Parquet
df.to_parquet("students.parquet", index=False)

# Verify by reading it back
pd.read_parquet("students.parquet")
Efficient and compact, Parquet is a great fit for distributed systems.
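
The columnar layout is what enables column pruning: a query can load only the columns it touches. A minimal sketch:

# Load just two columns; the reader skips the others entirely
pd.read_parquet("students.parquet", columns=["Name", "Marks"])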

5. XML (Extensible Markup Language)

XML stores data using nested tags, similar to HTML.
It’s human-readable and useful for hierarchical or document-like data.

import xml.etree.ElementTree as ET  

root = ET.Element("Students")  

for _, row in df.iterrows():     
    student = ET.SubElement(root, "Student")     
    ET.SubElement(student, "Name").text = str(row["Name"])     
    ET.SubElement(student, "Register_No").text = str(row["Register_No"])     
    ET.SubElement(student, "Subject").text = str(row["Subject"])     
    ET.SubElement(student, "Marks").text = str(row["Marks"])  

tree = ET.ElementTree(root)  
tree.write("students.xml", encoding="utf-8", xml_declaration=True)

# Verify content
print(open("students.xml").read())

Output:
<?xml version='1.0' encoding='utf-8'?>
<Students>
  <Student><Name>Hari</Name><Register_No>101</Register_No><Subject>Math</Subject><Marks>89</Marks></Student>
  <Student><Name>Asha</Name><Register_No>102</Register_No><Subject>Science</Subject><Marks>92</Marks></Student>
  <Student><Name>Kiran</Name><Register_No>103</Register_No><Subject>English</Subject><Marks>85</Marks></Student>
</Students>
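
If you're on pandas 1.3 or newer, you can also parse the file straight back into a DataFrame via the lxml dependency installed in Step 1. A minimal sketch, assuming the default XPath (one row per child of the root) suits our flat structure:

# Parse each <Student> element into a row (requires pandas >= 1.3 and lxml)
pd.read_xml("students.xml")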

6. Avro (Row-based Storage Format)

Avro is a binary, row-based serialization format from the Apache Software Foundation.
It embeds the schema alongside the data, making it ideal for streaming systems like Apache Kafka.

from fastavro import writer, reader, parse_schema  

schema = {     
    "doc": "Student data",     
    "name": "Student",     
    "namespace": "example.avro",     
    "type": "record",     
    "fields": [         
        {"name": "Name", "type": "string"},         
        {"name": "Register_No", "type": "int"},         
        {"name": "Subject", "type": "string"},         
        {"name": "Marks", "type": "int"}     
    ] 
}  

records = df.to_dict(orient="records")  

with open("students.avro", "wb") as out:     
    writer(out, parse_schema(schema), records)  

print("students.avro created")

Verify Avro Data

from fastavro import reader  
with open("students.avro", "rb") as fo:     
    for record in reader(fo):         
        print(record)

Output:

{'Name': 'Hari', 'Register_No': 101, 'Subject': 'Math', 'Marks': 89}
{'Name': 'Asha', 'Register_No': 102, 'Subject': 'Science', 'Marks': 92}
{'Name': 'Kiran', 'Register_No': 103, 'Subject': 'English', 'Marks': 85}
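
The "schema evolution" mentioned in the summary below means Avro can read old files against a newer schema, provided new fields declare defaults. A minimal sketch, assuming we later add a hypothetical Grade field:

# Hypothetical evolved schema: one extra field with a default value
new_schema = parse_schema({
    "doc": "Student data",
    "name": "Student",
    "namespace": "example.avro",
    "type": "record",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
        {"name": "Grade", "type": "string", "default": "NA"},
    ],
})

with open("students.avro", "rb") as fo:
    # fastavro resolves the old writer schema against the new reader schema
    for record in reader(fo, new_schema):
        print(record["Name"], record["Grade"])  # Grade falls back to its default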

Summary Check

print("Files created in Colab working directory:")
!ls -lh students.*

Conclusion
Each data format serves a specific purpose in analytics:

Format    Type           Best For
CSV       Text           Simple tabular data
SQL       Relational     Structured, queryable data
JSON      Hierarchical   APIs, NoSQL data
Parquet   Columnar       Big data and analytics
XML       Tagged         Configs and document data
Avro      Binary (Row)   Streaming, schema evolution

Understanding when to use each format helps you build scalable, efficient, and interoperable data systems in the cloud.

You can view and run this notebook yourself on Google Colab:
View Colab Notebook - https://colab.research.google.com/drive/1vEfM0Cv9ir_FXmjibvG1pNGUazFJKbNF?usp=sharing
