<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: HARISH B</title>
    <description>The latest articles on DEV Community by HARISH B (@harish_b_0079384348ed66e9).</description>
    <link>https://dev.to/harish_b_0079384348ed66e9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3461120%2F7d1cab37-2514-4393-b429-dd256290d82e.jpg</url>
      <title>DEV Community: HARISH B</title>
      <link>https://dev.to/harish_b_0079384348ed66e9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harish_b_0079384348ed66e9"/>
    <language>en</language>
    <item>
      <title>Data in the Cloud: 6 Common Formats for Data Analytics</title>
      <dc:creator>HARISH B</dc:creator>
      <pubDate>Tue, 07 Oct 2025 14:33:58 +0000</pubDate>
      <link>https://dev.to/harish_b_0079384348ed66e9/data-in-the-cloud-6-common-formats-for-data-analytics-3e6h</link>
      <guid>https://dev.to/harish_b_0079384348ed66e9/data-in-the-cloud-6-common-formats-for-data-analytics-3e6h</guid>
      <description>&lt;p&gt;In the world of data analytics, how we store, exchange, and process data depends heavily on the data format used.&lt;br&gt;&lt;br&gt;
From simple text files like CSV to efficient formats like Parquet, each serves a unique purpose for performance, scalability, and compatibility.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore six common data formats used in analytics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CSV (Comma-Separated Values)&lt;/li&gt;
&lt;li&gt;SQL (Relational Table Format)&lt;/li&gt;
&lt;li&gt;JSON (JavaScript Object Notation)&lt;/li&gt;
&lt;li&gt;Parquet (Columnar Storage Format)&lt;/li&gt;
&lt;li&gt;XML (Extensible Markup Language)&lt;/li&gt;
&lt;li&gt;Avro (Row-based Storage Format)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ll use this simple dataset as an example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Register_No&lt;/th&gt;
&lt;th&gt;Subject&lt;/th&gt;
&lt;th&gt;Marks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hari&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asha&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Science&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kiran&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Step 1: Install Dependencies&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install pandas pyarrow fastavro lxml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: Create the Dataset&lt;br&gt;
We’ll first create a small dataset using pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd  
data = {     
    "Name": ["Hari", "Asha", "Kiran"],     
    "Register_No": [101, 102, 103],     
    "Subject": ["Math", "Science", "English"],     
    "Marks": [89, 92, 85] 
}  

df = pd.DataFrame(data)
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  1. CSV (Comma-Separated Values)
&lt;/h1&gt;

&lt;p&gt;CSV is one of the simplest and most common formats for storing tabular data.&lt;br&gt;
Each line represents a row, and columns are separated by commas. It’s human-readable and works easily with Excel or Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Save to CSV
df.to_csv("students.csv", index=False)

# Verify content
print(open("students.csv").read())

Output:
Name,Register_No,Subject,Marks
Hari,101,Math,89
Asha,102,Science,92
Kiran,103,English,85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
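&lt;p&gt;Since CSV is plain text, reading it back into pandas is just as easy, and the values come back typed. A minimal sketch (the CSV content is inlined as a string here so the snippet runs standalone, without the file written above):&lt;/p&gt;

```python
import io

import pandas as pd

# Same rows as students.csv, inlined so the snippet is self-contained
csv_text = """Name,Register_No,Subject,Marks
Hari,101,Math,89
Asha,102,Science,92
Kiran,103,English,85
"""

df = pd.read_csv(io.StringIO(csv_text))

# Marks is parsed as an integer column, so numeric operations just work
print(df["Marks"].sum())                      # 266
print(df.loc[df["Marks"].idxmax(), "Name"])   # top scorer
```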



&lt;h1&gt;
  
  
  2. SQL (Relational Table Format)
&lt;/h1&gt;

&lt;p&gt;SQL organizes data into relational tables using columns and rows.&lt;br&gt;
You can store, query, and join tables efficiently in databases like MySQL, SQLite, or PostgreSQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3  

# Create SQLite database and table
conn = sqlite3.connect("students.db")  
df.to_sql("students", conn, if_exists="replace", index=False)  

# Verify content
pd.read_sql("SELECT * FROM students", conn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
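&lt;p&gt;The point of the relational format is that the database does the filtering and sorting for you. A small sketch of a query against an in-memory SQLite database (nothing is written to disk; table and column names match the example dataset):&lt;/p&gt;

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "Name": ["Hari", "Asha", "Kiran"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Science", "English"],
    "Marks": [89, 92, 85],
})

# ":memory:" keeps the database in RAM, handy for quick experiments
conn = sqlite3.connect(":memory:")
df.to_sql("students", conn, index=False)

# SQL does the filtering and ordering; pandas just receives the result
top = pd.read_sql(
    "SELECT Name, Marks FROM students WHERE Marks >= 89 ORDER BY Marks DESC",
    conn,
)
print(top)
```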



&lt;h1&gt;
  
  
  3. JSON (JavaScript Object Notation)
&lt;/h1&gt;

&lt;p&gt;JSON represents data as key-value pairs, making it ideal for APIs and web applications.&lt;br&gt;
It’s lightweight and language-independent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Save to JSON (pretty format)
df.to_json("students.json", orient="records", indent=2)

# Verify content
print(open("students.json").read())

Output:

[
  {
    "Name": "Hari",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 89
  },
  {
    "Name": "Asha",
    "Register_No": 102,
    "Subject": "Science",
    "Marks": 92
  },
  {
    "Name": "Kiran",
    "Register_No": 103,
    "Subject": "English",
    "Marks": 85
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
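&lt;p&gt;Because JSON is language-independent, parsing it back requires nothing beyond the standard library. A sketch using the json module (two of the records above are inlined as a string so it runs standalone):&lt;/p&gt;

```python
import json

# Two records from students.json, inlined for a standalone example
json_text = """[
  {"Name": "Hari", "Register_No": 101, "Subject": "Math", "Marks": 89},
  {"Name": "Asha", "Register_No": 102, "Subject": "Science", "Marks": 92}
]"""

records = json.loads(json_text)  # a list of plain Python dicts

# Key-value access mirrors the structure of the file
for r in records:
    print(f'{r["Name"]} scored {r["Marks"]} in {r["Subject"]}')
```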



&lt;h1&gt;
  
  
  4. Parquet (Columnar Storage Format)
&lt;/h1&gt;

&lt;p&gt;Parquet is a columnar storage format optimized for big data analytics.&lt;br&gt;
It stores data by columns instead of rows, which improves compression and query performance, especially for analytical workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Save to Parquet
df.to_parquet("students.parquet", index=False)

# Verify by reading it back (efficient and compact: great for distributed systems)
pd.read_parquet("students.parquet")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  5. XML (Extensible Markup Language)
&lt;/h1&gt;

&lt;p&gt;XML stores data using nested tags, similar to HTML.&lt;br&gt;
It’s human-readable and useful for hierarchical or document-like data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import xml.etree.ElementTree as ET  

root = ET.Element("Students")  

for _, row in df.iterrows():     
    student = ET.SubElement(root, "Student")     
    ET.SubElement(student, "Name").text = str(row["Name"])     
    ET.SubElement(student, "Register_No").text = str(row["Register_No"])     
    ET.SubElement(student, "Subject").text = str(row["Subject"])     
    ET.SubElement(student, "Marks").text = str(row["Marks"])  

tree = ET.ElementTree(root)  
tree.write("students.xml", encoding="utf-8", xml_declaration=True)

# Verify content
print(open("students.xml").read())

Output:
&amp;lt;?xml version='1.0' encoding='utf-8'?&amp;gt;
&amp;lt;Students&amp;gt;
  &amp;lt;Student&amp;gt;&amp;lt;Name&amp;gt;Hari&amp;lt;/Name&amp;gt;&amp;lt;Register_No&amp;gt;101&amp;lt;/Register_No&amp;gt;&amp;lt;Subject&amp;gt;Math&amp;lt;/Subject&amp;gt;&amp;lt;Marks&amp;gt;89&amp;lt;/Marks&amp;gt;&amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;&amp;lt;Name&amp;gt;Asha&amp;lt;/Name&amp;gt;&amp;lt;Register_No&amp;gt;102&amp;lt;/Register_No&amp;gt;&amp;lt;Subject&amp;gt;Science&amp;lt;/Subject&amp;gt;&amp;lt;Marks&amp;gt;92&amp;lt;/Marks&amp;gt;&amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;&amp;lt;Name&amp;gt;Kiran&amp;lt;/Name&amp;gt;&amp;lt;Register_No&amp;gt;103&amp;lt;/Register_No&amp;gt;&amp;lt;Subject&amp;gt;English&amp;lt;/Subject&amp;gt;&amp;lt;Marks&amp;gt;85&amp;lt;/Marks&amp;gt;&amp;lt;/Student&amp;gt;
&amp;lt;/Students&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
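&lt;p&gt;Parsing XML back uses the same standard-library module. A sketch that builds a small tree the way the article does, then serializes and re-parses it as a round trip (so it runs standalone, without the file written above):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Build a small tree the same way the article does
root = ET.Element("Students")
for name, marks in [("Hari", 89), ("Asha", 92), ("Kiran", 85)]:
    student = ET.SubElement(root, "Student")
    ET.SubElement(student, "Name").text = name
    ET.SubElement(student, "Marks").text = str(marks)

# Serialize to bytes and parse back: a round trip instead of touching disk
parsed = ET.fromstring(ET.tostring(root, encoding="utf-8"))

# findall selects child elements; findtext reads a tag's text content
names = [s.findtext("Name") for s in parsed.findall("Student")]
print(names)
```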



&lt;h1&gt;
  
  
  6. Avro (Row-based Storage Format)
&lt;/h1&gt;

&lt;p&gt;Avro is a binary row-based format developed by Apache for data serialization.&lt;br&gt;
It includes the schema with the data, making it ideal for streaming systems like Apache Kafka.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastavro import writer, reader, parse_schema  

schema = {     
    "doc": "Student data",     
    "name": "Student",     
    "namespace": "example.avro",     
    "type": "record",     
    "fields": [         
        {"name": "Name", "type": "string"},         
        {"name": "Register_No", "type": "int"},         
        {"name": "Subject", "type": "string"},         
        {"name": "Marks", "type": "int"}     
    ] 
}  

records = df.to_dict(orient="records")  

with open("students.avro", "wb") as out:     
    writer(out, parse_schema(schema), records)  

print("students.avro created")

# Verify the Avro data
with open("students.avro", "rb") as fo:     
    for record in reader(fo):         
        print(record)

Output:

{'Name': 'Hari', 'Register_No': 101, 'Subject': 'Math', 'Marks': 89}
{'Name': 'Asha', 'Register_No': 102, 'Subject': 'Science', 'Marks': 92}
{'Name': 'Kiran', 'Register_No': 103, 'Subject': 'English', 'Marks': 85}

# Summary check

print("Files created in Colab working directory:")
!ls -lh students.*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Each data format serves a specific purpose in analytics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Simple tabular data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;Relational&lt;/td&gt;
&lt;td&gt;Structured, queryable data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;Hierarchical&lt;/td&gt;
&lt;td&gt;APIs, NoSQL data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parquet&lt;/td&gt;
&lt;td&gt;Columnar&lt;/td&gt;
&lt;td&gt;Big data and analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XML&lt;/td&gt;
&lt;td&gt;Tagged&lt;/td&gt;
&lt;td&gt;Configs and document data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avro&lt;/td&gt;
&lt;td&gt;Binary (Row)&lt;/td&gt;
&lt;td&gt;Streaming, schema evolution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Understanding when to use each format helps you build scalable, efficient, and interoperable data systems in the cloud.&lt;/p&gt;

&lt;p&gt;You can view and run this notebook yourself on Google Colab:&lt;br&gt;
View Colab Notebook - &lt;a href="https://colab.research.google.com/drive/1vEfM0Cv9ir_FXmjibvG1pNGUazFJKbNF?usp=sharing" rel="noopener noreferrer"&gt;https://colab.research.google.com/drive/1vEfM0Cv9ir_FXmjibvG1pNGUazFJKbNF?usp=sharing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataanalytics</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
