What is Data?

#database #data #sre #programming

There are three categories of data: structured, semi-structured, and unstructured. Each type has its own characteristics and use cases, and understanding the differences between them is crucial for effective data management and analysis.

Structured data is organized and easily searchable. It is typically stored in relational databases, and its format is well-defined with pre-determined columns, data types, and relationships.

Examples include data from enterprise resource planning (ERP) systems, customer relationship management (CRM) databases, and financial records. Structured data can be easily queried and analyzed using SQL and other database tools. An example of structured data:

import sqlite3

# Connect to the database
conn = sqlite3.connect('example.db')

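# Create a table with structured data (IF NOT EXISTS keeps reruns from failing)
conn.execute('''CREATE TABLE IF NOT EXISTS employees
             (id INT PRIMARY KEY NOT NULL,
             name TEXT NOT NULL,
             age INT NOT NULL);''')

# Insert data into the table (OR REPLACE avoids a primary-key error on reruns)
conn.execute("INSERT OR REPLACE INTO employees (id, name, age) VALUES (1, 'John Doe', 25)")
conn.execute("INSERT OR REPLACE INTO employees (id, name, age) VALUES (2, 'Jane Smith', 30)")

# Commit the transaction so the inserted rows are saved to example.db
conn.commit()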

# Query the data from the table
cursor = conn.execute("SELECT * FROM employees")
for row in cursor:
    print("ID = ", row[0])
    print("Name = ", row[1])
    print("Age = ", row[2])

# Close the database connection
conn.close()

This example creates a structured database table with pre-defined columns for id, name, and age. We insert data into the table and query it using SQL.
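Because the schema is fixed, the database can both answer precise questions and reject rows that don't fit. Here is a minimal sketch of that idea, assuming the same employees table but built in memory so it doesn't touch example.db. The helper names (build_employee_db, names_older_than, average_age, insert_is_rejected) are made up for illustration.

```python
import sqlite3

def build_employee_db():
    """Create an in-memory database with the same employees schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE employees
           (id INT PRIMARY KEY NOT NULL,
            name TEXT NOT NULL,
            age INT NOT NULL)"""
    )
    # Parameterized inserts keep the values separate from the SQL text
    conn.executemany(
        "INSERT INTO employees (id, name, age) VALUES (?, ?, ?)",
        [(1, "John Doe", 25), (2, "Jane Smith", 30)],
    )
    conn.commit()
    return conn

def names_older_than(conn, min_age):
    """Return the names of employees older than min_age, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM employees WHERE age > ? ORDER BY name", (min_age,)
    )
    return [row[0] for row in rows]

def average_age(conn):
    """Return the average age across all employees."""
    return conn.execute("SELECT AVG(age) FROM employees").fetchone()[0]

def insert_is_rejected(conn, row):
    """Return True if the schema rejects the row (for example, a missing name)."""
    try:
        conn.execute("INSERT INTO employees (id, name, age) VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False

conn = build_employee_db()
print(names_older_than(conn, 26))
print(average_age(conn))
print(insert_is_rejected(conn, (3, None, 41)))  # name is NULL, so NOT NULL fails
```

The filter and the average each take one line of SQL, and the NOT NULL constraint stops an incomplete row before it reaches the table. That is the trade-off with structured data: you do more work up front defining the schema, and querying gets easier later.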

Semi-structured data falls somewhere between structured and unstructured data. It has a defined structure, but it's not as rigid as structured data.

Semi-structured data often includes metadata and tags that provide additional context. Examples include XML and JSON files, which are commonly used to exchange data between web applications.

import json

# Define a Python dictionary holding semi-structured data
employee = {
  "id": 1,
  "name": "John Doe",
  "age": 25,
  "department": {
    "name": "Engineering",
    "manager": "Jane Smith"
  }
}

# Serialize the dictionary to a JSON string
employee_json = json.dumps(employee)

# Print the JSON string
print(employee_json)

# Convert the JSON string back to a Python object
employee_dict = json.loads(employee_json)

# Access the data in the Python object
print(employee_dict['id'])
print(employee_dict['name'])
print(employee_dict['age'])
print(employee_dict['department']['name'])
print(employee_dict['department']['manager'])

Here we define a Python dictionary with semi-structured data that includes a nested department object. We serialize the dictionary to a JSON string and parse it back into a Python object, accessing the data using dictionary keys.
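XML works the same way, with tags and attributes carrying the metadata. Below is a rough sketch using Python's built-in xml.etree.ElementTree. The XML document and the parse_employees helper are invented for this example. Note that the second employee has no department element at all, which a rigid table schema would not allow without a nullable column.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the second employee has no <department> element
XML_TEXT = """
<employees>
  <employee id="1">
    <name>John Doe</name>
    <age>25</age>
    <department manager="Jane Smith">Engineering</department>
  </employee>
  <employee id="2">
    <name>Jane Smith</name>
    <age>30</age>
  </employee>
</employees>
"""

def parse_employees(xml_text):
    """Turn each <employee> element into a dictionary, tolerating missing parts."""
    root = ET.fromstring(xml_text.strip())
    employees = []
    for node in root.findall("employee"):
        dept = node.find("department")  # None when the element is absent
        employees.append({
            "id": int(node.get("id")),  # attribute on the <employee> tag
            "name": node.findtext("name"),
            "age": int(node.findtext("age")),
            "department": dept.text if dept is not None else None,
            "manager": dept.get("manager") if dept is not None else None,
        })
    return employees

for employee in parse_employees(XML_TEXT):
    print(employee)
```

The structure travels with each record, so the reading code has to decide what to do when a field is missing. That flexibility is what makes the data "semi" structured.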

Unstructured data lacks any predefined structure. It's the most challenging type of data to work with because it includes text, images, and multimedia files that don't fit neatly into a database schema.

Examples include emails, social media posts, images, and videos. Unstructured data can be challenging to analyze using traditional data analysis tools, but advancements in natural language processing (NLP) and machine learning algorithms are making it easier to derive insights from it.

import pytesseract
from PIL import Image

# Open an image file with unstructured data
image = Image.open('example.png')

# Use Tesseract OCR to extract text from the image
text = pytesseract.image_to_string(image)

# Print the extracted text
print(text)

Here we open an image file with unstructured data and use Tesseract OCR to extract text from the image. The extracted text doesn't have a pre-defined structure and is challenging to analyze without advanced NLP techniques.
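A common first step with free-form text is to pull out the few pieces that do follow a pattern. The sketch below uses plain regular expressions rather than real NLP, and the sample text is a made-up stand-in for what an OCR step might return. The patterns are deliberately simple and would miss many real-world email and date formats.

```python
import re

# Hypothetical text, standing in for what an OCR step might return
SAMPLE_TEXT = """
Hi team, please send the Q3 report to jane.smith@example.com by 2024-09-30.
If Jane is out, copy john.doe@example.com. Invoice total was $1,250.00.
"""

# Simplified patterns for illustration only
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_fields(text):
    """Pull email addresses and ISO-style dates out of free-form text."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }

print(extract_fields(SAMPLE_TEXT))
```

The result is a small dictionary that could be stored as structured or semi-structured data. Everything the patterns don't recognize, such as who is asking for what and why, is still unstructured, and that is where NLP techniques come in.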

Each type of data has its own characteristics and uses. Programmers and organizations can better manage and analyze their data by understanding the use cases, costs, and benefits of each type, leading to more informed decision-making and better business outcomes.

A great resource for learning about working with data within the Azure ecosystem is the Microsoft Certified: Azure Data Fundamentals certification. It looks like this course is even freely available for students.

Here is something I came across today that adds context to how the industry is thinking about data and data types in the ongoing conversation around AI.
