<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Phatsawut Duangkaew</title>
    <description>The latest articles on DEV Community by Phatsawut Duangkaew (@phatsawut_duangkaew_f92ff).</description>
    <link>https://dev.to/phatsawut_duangkaew_f92ff</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3293514%2F302d5014-a74f-4ea5-82a3-e0c52c50508d.png</url>
      <title>DEV Community: Phatsawut Duangkaew</title>
      <link>https://dev.to/phatsawut_duangkaew_f92ff</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/phatsawut_duangkaew_f92ff"/>
    <language>en</language>
    <item>
      <title>5 Essential Stages of Data Management Before You Start AI Projects</title>
      <dc:creator>Phatsawut Duangkaew</dc:creator>
      <pubDate>Wed, 25 Jun 2025 16:00:02 +0000</pubDate>
      <link>https://dev.to/phatsawut_duangkaew_f92ff/5-essential-stages-of-data-management-before-you-start-ai-projects-4iik</link>
      <guid>https://dev.to/phatsawut_duangkaew_f92ff/5-essential-stages-of-data-management-before-you-start-ai-projects-4iik</guid>
      <description>&lt;p&gt;My visit to the PSU Cybersecurity &amp;amp; Data Privacy Days 2 proved to be incredibly insightful. I found myself in the audience for "Accelerate Modernize Applications with Nutanix AI Platform," a session led by Khun Surak Thammarak of Nutanix (Thailand) Ltd.&lt;/p&gt;

&lt;p&gt;As he presented a slide titled "One Platform to Simplify Data Management," it sparked a critical question in my mind. In the world of tech, we are always talking about revolutionary AI models and the magic of training them. But what about the journey of the data itself? We always hear about the final product, but what about the data pipeline behind it?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F766hk04kfw5hmuw1grwv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F766hk04kfw5hmuw1grwv.jpg" alt="Image description" width="430" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That slide laid out a clear roadmap, and I realized that understanding this data lifecycle is the true starting point for anyone serious about AI.&lt;/p&gt;

&lt;h2&gt;Why Data Management Matters in AI&lt;/h2&gt;

&lt;p&gt;An AI model is only as good as the data it learns from. You can have the most brilliant algorithm in the world, but if you feed it messy, disorganized, or inaccessible data, the results will be disappointing. Think of it like cooking: even a world-class chef can't make a great meal with poor-quality ingredients.&lt;/p&gt;

&lt;p&gt;Poor data management can ruin AI projects before they even begin. A solid data lifecycle isn't just a "nice-to-have"; it's the foundational backbone of any modern AI application.&lt;/p&gt;

&lt;h2&gt;The 5 Key Stages of an AI Data Pipeline&lt;/h2&gt;

&lt;p&gt;The presentation slides elegantly broke down the complex data journey into five logical stages. Let's walk through each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest Raw Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What it means: This is the starting line. It’s the process of collecting all the raw, unprocessed data from its original sources. This could be anything from sensor readings in a factory and user clicks in an app to millions of text files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key term: The slide mentions Geo-Distributed High Capacity, which means the storage system must be able to collect massive amounts of data from different physical locations, all at the same time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why it's important: Your AI project needs a wide and deep pool of raw material. This stage ensures you have a scalable and robust way to gather it all in one place.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
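
&lt;p&gt;To make the ingest stage concrete, here is a minimal sketch in plain Python. The source names and event shapes are invented for illustration; they are not from the talk:&lt;/p&gt;

```python
# Illustrative ingest step: gather raw events from several
# geographically distributed sources into one landing pool.
# Source names and record shapes are made up for the example.
from itertools import chain

factory_events = [{"site": "factory-a", "reading": 41.2}]
app_events = [{"site": "web-app", "click": "signup"}]
text_files = [{"site": "docs", "text": "hello"}]

def ingest(*sources):
    """Merge every source into a single raw-data pool, untouched."""
    landing_zone = list(chain(*sources))
    print(f"ingested {len(landing_zone)} raw records")
    return landing_zone

raw_pool = ingest(factory_events, app_events, text_files)
```

&lt;p&gt;The point is that nothing is cleaned or transformed yet; the goal of this stage is simply to get everything into one scalable place.&lt;/p&gt;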

&lt;p&gt;&lt;strong&gt;Prepare Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What it means: Raw data is messy. This stage is all about cleaning, structuring, and transforming that data into a neat, consistent format that a machine can understand. It involves tasks like removing errors, labeling information, and organizing it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key term: The slide points to Low Latency, Unified Block storage. "Low latency" means fast access: the system needs to be quick so that data scientists can experiment and prepare data without long delays. "Unified" means it can handle different data types (like files and database blocks) in one place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why it's important: This is arguably the most critical step. High-quality, well-prepared data leads to much more accurate and effective AI models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
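
&lt;p&gt;Here is a minimal sketch of what "prepare" can look like in Python. The field names and cleaning rules are my own illustrative assumptions, not something from the slide:&lt;/p&gt;

```python
# Minimal data-preparation sketch: drop broken rows, normalize
# types, and label each record. Field names are illustrative.

def prepare(raw_records):
    """Clean raw records into a consistent, machine-readable shape."""
    cleaned = []
    for rec in raw_records:
        # Remove errors: skip rows missing required fields
        if rec.get("sensor_id") is None or rec.get("value") is None:
            continue
        try:
            value = float(rec["value"])  # enforce a consistent type
        except (TypeError, ValueError):
            continue  # drop rows whose value cannot be parsed
        # Organize and label: one tidy, uniform structure
        cleaned.append({
            "sensor_id": str(rec["sensor_id"]),
            "value": value,
            "label": "high" if value >= 100 else "normal",
        })
    return cleaned

raw = [
    {"sensor_id": 7, "value": "120.5"},
    {"sensor_id": None, "value": "3"},   # missing field: dropped
    {"sensor_id": 8, "value": "oops"},   # bad type: dropped
]
print(prepare(raw))
```

&lt;p&gt;Even this toy version shows why the stage matters: two of the three "rows" would have quietly poisoned a training run.&lt;/p&gt;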

&lt;p&gt;&lt;strong&gt;Tune or Train Model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What it means: This is the part we hear about most often. It’s where data scientists feed the prepared data into their AI algorithms, allowing the model to learn and find patterns. This process is computationally intensive and requires reading the data over and over again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key term: The slide highlights Parallel Access (Cloud). This means multiple computers can access and process the data simultaneously ("in parallel"), dramatically speeding up training time. This needs to work whether the computers are in a local data center or in the cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why it's important: Faster training allows for more experimentation and quicker development cycles. Strong, parallel data access is the fuel that powers the heavy engine of model training.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
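
&lt;p&gt;The parallel-access idea can be sketched with a few worker threads each reading a different shard of the prepared data at the same time. The shards and the "processing" here are stand-ins, not a real training loop:&lt;/p&gt;

```python
# Toy illustration of parallel data access during training:
# several workers read different shards of the data at once.
from concurrent.futures import ThreadPoolExecutor

# Four shards of "prepared data" (just numbers for the example)
shards = [list(range(i * 100, i * 100 + 100)) for i in range(4)]

def read_shard(shard):
    """Stand-in for one worker streaming and processing its share."""
    return sum(shard)

# All four shards are consumed concurrently instead of one by one
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(read_shard, shards))

print(sum(partial_sums))
```

&lt;p&gt;Real training frameworks do something analogous at far larger scale, which is why the storage layer must serve many readers in parallel without becoming the bottleneck.&lt;/p&gt;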

&lt;p&gt;&lt;strong&gt;Run AI Inferencing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What it means: Once a model is trained, "inferencing" is the act of putting it to work in the real world to make predictions. This could be a recommendation engine on a website or a facial recognition system on a security camera.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key term: Run at Edge, Fast Reads. "The Edge" refers to a location closer to where the data is generated, like a retail store or a factory floor, rather than a central data center. For real-time results, the model needs to read data and make a decision instantly ("fast reads").&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why it's important: For AI to be useful, it often needs to provide answers immediately. Placing the model at the edge reduces lag and allows for real-time decision-making.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
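
&lt;p&gt;A tiny sketch of edge inference: the trained model's parameters are cached locally, so each prediction is a fast local read with no round trip to a central data center. The threshold and values are invented:&lt;/p&gt;

```python
# Illustrative edge inference: parameters live next to the data
# source, so decisions are instant. The numbers are made up.

MODEL = {"threshold": 0.7}   # tiny stand-in for a trained model

def infer(score, model=MODEL):
    """Classify instantly using locally cached parameters."""
    return "alert" if score >= model["threshold"] else "ok"

print(infer(0.9))  # real-time decision at the edge
print(infer(0.2))
```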

&lt;p&gt;&lt;strong&gt;Archive Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What it means: What happens to the data after it’s been used? You can't just delete it. Archiving is the process of moving older, less frequently accessed data to a cheaper storage tier for long-term retention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key term: Dense, Low-Cost, Low-Performance storage. Since you don't need to access this data quickly, you can store it on slower, less expensive hardware, which saves a lot of money. "Dense" means you can pack a lot of data into a small space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why it's important: This practice is crucial for both cost optimization and legal compliance, as many industries require data to be kept for several years.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
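
&lt;p&gt;An archive policy can be as simple as a rule on last access time. Here is a minimal sketch; the 90-day threshold is an assumption for the example, not a figure from the talk:&lt;/p&gt;

```python
# Illustrative tiering policy: data that has not been read
# recently moves to a dense, low-cost tier. The 90-day cutoff
# is an invented example value.
from datetime import datetime, timedelta

ARCHIVE_AFTER = timedelta(days=90)

def storage_tier(last_accessed, now=None):
    """Return which storage tier a dataset should live on."""
    now = now or datetime.now()
    if now - last_accessed >= ARCHIVE_AFTER:
        return "archive"   # dense, low-cost, low-performance
    return "hot"           # fast storage for active workloads

now = datetime(2025, 6, 25)
print(storage_tier(datetime(2025, 1, 1), now))   # cold data
print(storage_tier(datetime(2025, 6, 20), now))  # recent data
```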

&lt;h2&gt;The Role of a Unified Platform&lt;/h2&gt;

&lt;p&gt;The presentation logically tied these stages together with the underlying Nutanix Unified Storage platform. The idea is to have one system that can manage data across this entire lifecycle, providing crucial capabilities like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data Mobility: Easily move data between stages or locations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Access Anywhere: Allow teams to access the data they need, wherever they are.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Scaling and Agility: Grow your storage and performance as your AI needs evolve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security &amp;amp; Governance: Control who can access the data and track what happens to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Classification: Automatically identify what kind of data you have (e.g., sensitive personal info).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The slide also mentioned Data Lens, a tool that provides visibility and control over this entire data landscape, helping to protect against security risks and manage data effectively.&lt;/p&gt;

&lt;h2&gt;Summary: Why Developers Should Care&lt;/h2&gt;

&lt;p&gt;As a developer, it's easy to think of data as something that just "exists." But understanding this pipeline is essential for building successful, real-world AI applications.&lt;/p&gt;

&lt;p&gt;AI isn’t just about writing Python code or using a machine learning library. It's a complete system that is heavily reliant on a well-oiled data machine. Understanding this helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collaborate better with data engineers and data scientists.&lt;/li&gt;
&lt;li&gt;Build scalable projects that won't break when data volumes explode.&lt;/li&gt;
&lt;li&gt;Prepare for real-world deployments where data security, speed, and cost truly matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next time you start an AI project, remember the five stages. Building a solid data foundation isn't the most glamorous part of AI, but it is the most important.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From DevOps to DevSecOps</title>
      <dc:creator>Phatsawut Duangkaew</dc:creator>
      <pubDate>Wed, 25 Jun 2025 15:07:43 +0000</pubDate>
      <link>https://dev.to/phatsawut_duangkaew_f92ff/from-devops-to-devsecops-3nj5</link>
      <guid>https://dev.to/phatsawut_duangkaew_f92ff/from-devops-to-devsecops-3nj5</guid>
      <description>&lt;p&gt;Recently, I had the opportunity to attend the PSU Cybersecurity and Data Privacy Days 2. One of the sessions that caught my attention was "Accelerate Modernize Applications with Nutanix AI Platform," presented by Khun Surak Thammarak from Nutanix (Thailand) Ltd.&lt;/p&gt;

&lt;p&gt;During the talk, Khun Surak presented a slide detailing a "Software Factory - DevSecOps with GitOps". A question immediately came to my mind. We often hear about "DevOps," but what was this "DevSecOps"? What did the "Sec" part add?&lt;/p&gt;

&lt;p&gt;That single question sent me on a learning journey. I decided to dive deeper into the topic, and I'm writing this blog to share what I discovered and explain, in simple terms, what DevSecOps is all about.&lt;/p&gt;

&lt;h2&gt;What is DevOps?&lt;/h2&gt;

&lt;p&gt;Before we talk about DevSecOps, let's quickly talk about DevOps. For years, the team that writes the code (Developers) and the team that manages the running software (Operations) worked in separate silos. This often led to slow, error-prone processes.&lt;/p&gt;

&lt;p&gt;DevOps changed that. It’s a culture and a set of practices that bring these two teams together. The goal is simple: to shorten the development lifecycle and deliver high-quality software faster and more reliably. Think of it as turning a clumsy, multi-stage process into a single, smooth, automated assembly line.&lt;/p&gt;

&lt;h2&gt;Building Better, Faster, and Safer: A Guide to DevSecOps&lt;/h2&gt;

&lt;p&gt;Now, let's get back to the topic. DevSecOps takes the great ideas of DevOps (speed and automation) and adds a crucial ingredient: security.&lt;/p&gt;

&lt;p&gt;Instead of having a security check at the very end of the process (which is slow and expensive), DevSecOps integrates security into every single step. Think of it as a smart, automated factory for creating software where security guards are present on the entire assembly line, not just at the final gate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87f30loqfis48dnhn5u1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87f30loqfis48dnhn5u1.jpg" alt="Image description" width="491" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through the factory map from the presentation.&lt;/p&gt;

&lt;h2&gt;The Two Main Characters: The Developer and the IT Operator&lt;/h2&gt;

&lt;p&gt;Our story has two key players:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Developer: Their job is to write the code that creates the features you use in an app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The IT Operator: Their job is to ensure the app runs smoothly for everyone to use (what we call "production").&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DevSecOps uses an automated process to connect their work smoothly.&lt;/p&gt;

&lt;h2&gt;Part 1: The "Building" Phase (Continuous Integration)&lt;/h2&gt;

&lt;p&gt;As soon as a developer writes new code and saves it (an action called a "commit"), an automated process starts running step-by-step behind the scenes.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Code Check-up (By SonarQube): Imagine an expert reviewer who instantly scans the new code. This tool automatically checks for bugs, security vulnerabilities, and quality issues. This is our first security checkpoint, ensuring problems are caught early.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building and Packaging (App Build, Image Build): The code is then "built" and packaged into a secure, ready-to-go container. Think of it like putting all the ingredients for a meal into a sealed box, complete with instructions. This container has everything the application needs to run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storing the Box (By Harbor): This packaged container is stored in a secure warehouse, called a repository. It’s now an official software version that’s ready for the next step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Final Security Scan: Before it can be sent out, the packaged container gets one more security scan to make sure nothing dangerous was packed inside.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keeping an Eye on Things (By Grafana): Throughout this entire process, a monitoring tool acts like an inspector. If something fails, it immediately alerts the team so they can fix it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This whole automated building and testing process is called Continuous Integration (CI).&lt;/p&gt;
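
&lt;p&gt;The CI steps above can be sketched as a chain of functions, where a failure at any step stops the pipeline. The real tools (SonarQube, Harbor, Grafana) have their own APIs; the stubs below only borrow their roles to illustrate the shape of the flow:&lt;/p&gt;

```python
# A toy model of the CI flow described above. Each step is a
# stub standing in for a real tool; treat this purely as an
# illustration of the pipeline shape, not those tools' APIs.

def scan_code(commit):
    """SonarQube-style static analysis on the new commit."""
    return {"commit": commit, "issues": 0}

def build_image(report):
    """App build and container image build."""
    assert report["issues"] == 0, "fix code issues before building"
    return f"registry/app:{report['commit']}"

def push_to_registry(image):
    """Store the image in a Harbor-style repository."""
    return {"image": image, "stored": True}

def scan_image(artifact):
    """Final security scan on the packaged container."""
    artifact["scanned"] = True
    return artifact

def run_ci(commit):
    """Chain the stages; any failure halts the pipeline."""
    artifact = scan_image(push_to_registry(build_image(scan_code(commit))))
    print(f"CI passed, artifact: {artifact['image']}")
    return artifact

run_ci("a1b2c3")
```

&lt;p&gt;The design point is the ordering: security scans sit inside the chain, so an insecure build simply never produces a deployable artifact.&lt;/p&gt;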

&lt;h2&gt;Part 2: The "Release" Phase (Continuous Deployment with GitOps)&lt;/h2&gt;

&lt;p&gt;Once our software package is built and approved, it’s time to deliver it to the users. This part of the process uses a method called GitOps, where all automation is managed through a central code repository.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Operator's Command (Commit): The IT Operator makes a simple change in a configuration file — like updating the version number from 1.0 to 1.1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Automation Watcher (By Flux): A special tool (Flux) is always watching this configuration file. As soon as it sees the change, it knows it's time to act.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic Deployment (Deploy Kubernetes): The tool automatically takes the new, approved software package and deploys it to the live environment using a system called Kubernetes, which manages running applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Final Check (E2E Tests): Even after release, automated tests run to make sure everything is working as expected from a user's perspective.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This automated release process is called Continuous Deployment (CD).&lt;/p&gt;
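
&lt;p&gt;A Flux-style watcher can be thought of as a reconcile loop: compare the version declared in the config repository with what is actually running, and redeploy when they differ. This sketch is my own simplification, not Flux's real API:&lt;/p&gt;

```python
# A toy GitOps reconcile loop in the spirit of Flux: bring the
# running state in line with the declared state. Not Flux's API.

desired_config = {"app_version": "1.1"}  # what the operator committed
cluster_state = {"app_version": "1.0"}   # what Kubernetes is running

def reconcile(desired, running):
    """Redeploy whenever declared and running versions differ."""
    if desired["app_version"] != running["app_version"]:
        print(f"deploying version {desired['app_version']}")
        running["app_version"] = desired["app_version"]
    else:
        print("in sync, nothing to do")
    return running

reconcile(desired_config, cluster_state)
print(cluster_state)
```

&lt;p&gt;The key idea is that the config repository is the single source of truth: nobody deploys by hand; the watcher makes reality match whatever is declared in Git.&lt;/p&gt;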

&lt;h2&gt;Why Does This All Matter?&lt;/h2&gt;

&lt;p&gt;This DevSecOps approach, as shown in the presentation, is a game-changer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Speed: New ideas get to users in hours or days, not months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security: With security checks built into every step (the "Sec" in DevSecOps), applications are safer from the start, not as an afterthought.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reliability: Automation reduces the chance of human error, meaning fewer bugs and less downtime for users.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My curiosity about a single word on a slide led me to understand a whole new philosophy for building software. It’s not just about being fast; it’s about being fast and safe.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>developer</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
