Your ML/AI Success Begins Here: Data Ingestion & Storage on AWS

#mlops #aws #genai #datascience

Data is the King, but how do you treat the King?

Please start this blog ONLY if you will COMPLETE it.
If not, no worries, read it later.

AI, ML, GenAI: all these “market” as well as “brain” consuming terms are basically hungry for data, and not just any data but GOOD data.

See, it’s simple. You want to stitch a dress, and you know you need something sharp. But a needle, a knife, and a sword are all sharp.
If you use a sword instead of a needle, it’s gone: the task, the work, all gone.

So before jumping into ML, AI, and all, it’s important to make ourselves a bit aware of the DATA:
its types, its properties, and which tool to use for which use case.

It starts with understanding your Data TYPES:

🌾Structured Data: Well-organized in relational databases.
🌾Semi-Structured Data: Loose schema like JSON or XML.
🌾Unstructured Data: Freeform formats like videos or images.
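
To make the three types concrete, here is a tiny sketch using only the Python standard library. The table name, the fields, and the sample values are made up for illustration; the in-memory SQLite table simply stands in for a relational database.

```python
import json
import sqlite3

# Structured: fixed columns, enforced by a relational table.
# (An in-memory SQLite table stands in for a real relational database.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (1, 49.9))
structured_row = conn.execute("SELECT id, amount FROM orders").fetchone()

# Semi-structured: self-describing JSON; each record may carry different fields.
semi_structured = [
    json.loads('{"id": 1, "tags": ["new"]}'),
    json.loads('{"id": 2, "address": {"city": "Pune"}}'),
]

# Unstructured: raw bytes with no schema at all (an image, a video, ...).
# These 8 bytes are the standard PNG file signature, used here as a stand-in.
unstructured = b"\x89PNG\r\n\x1a\n"

print(structured_row)                           # one row with exactly two columns
print([sorted(doc) for doc in semi_structured]) # field names differ per record
print(len(unstructured))                        # just a count of bytes
```

Notice that only the first case can promise you the same columns in every record. That difference is what drives the choice of storage.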

Depending on the type of data you have, you will choose your AWS tools accordingly.

🌾Structured: Amazon RDS, Redshift.
🌾Semi-structured: DynamoDB, S3.
🌾Unstructured: S3, S3 Glacier (for archives).

See, this is what an Architect’s work is: to select the proper tools and services.
Otherwise you will mess up the project flow, the project cost, or the entire project.

It’s like when you want to travel somewhere: you don’t just focus on going from A to B. You weigh many factors like comfort, train or bus, distance, money, etc.

So now that Data TYPES are covered, let’s quickly move to Data Properties.
They are just as important, because they also help you decide which tools to select.

Key Data Properties can be divided into the “3 V’s”
— — — — — — — — — — — —

🌾Volume: The scale of your data, whether it is in MB, GB, or TB.
Small datasets? Use RDS or DynamoDB.
Massive datasets? Go for S3 or Redshift.
🌾Velocity: The speed of data generation, meaning how fast the data is arriving.
Real-time ingestion? Use Amazon Kinesis.
Batch processing? Use AWS Glue or S3 pipelines.
🌾Variety: Diversity in the data formats.
Diverse formats? S3 is your versatile solution.

You see how data, its type, and its properties are creating an initial blueprint of your project.
These steps are what will help you later in MLOps.
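
The 3 V's rules of thumb above can be sketched as a small helper. This is only an illustration, not an official AWS decision tool: the `DataProfile` class, the `suggest_services` function, and especially the 100 GB cut-off between "small" and "massive" are my own assumptions, since real sizing depends on your workload.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical cut-off between "small" and "massive"; pick your own number.
SMALL_DATASET_LIMIT_GB = 100


@dataclass
class DataProfile:
    size_gb: float     # Volume
    real_time: bool    # Velocity: True for streaming, False for batch
    format_count: int  # Variety: number of distinct data formats


def suggest_services(profile: DataProfile) -> Dict[str, List[str]]:
    """Turn the 3 V's rules of thumb into a list of candidate services."""
    # Volume decides the starting point for storage.
    if profile.size_gb <= SMALL_DATASET_LIMIT_GB:
        storage = ["Amazon RDS", "Amazon DynamoDB"]
    else:
        storage = ["Amazon S3", "Amazon Redshift"]

    # Velocity decides how the data is ingested.
    if profile.real_time:
        ingestion = ["Amazon Kinesis"]
    else:
        ingestion = ["AWS Glue", "S3 pipelines"]

    # Variety: more than one format pulls S3 into the picture.
    if profile.format_count > 1 and "Amazon S3" not in storage:
        storage.append("Amazon S3")

    return {"storage": storage, "ingestion": ingestion}


clickstream = DataProfile(size_gb=5000, real_time=True, format_count=3)
print(suggest_services(clickstream))
```

Treat the output as a shortlist to discuss, not a final answer. Cost, team skills, and query patterns still need an Architect's judgment.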

Now, depending on the above factors, you will decide which storage to use, but there are types of STORAGE as well.

Data Storage Architectures:
— — — — — — — — — — — — — —

Data Warehouses
It’s centralized storage, optimized for “structured” data & analytics.
For scenarios like reporting, business intelligence (BI), or structured queries.
AWS Service: Amazon Redshift.
Key Feature: Schema-on-write, ensuring organized and consistent data.

Data Lakes
It’s scalable storage for “structured, semi-structured, and unstructured data”. It’s like a dumping ground where raw data lands as-is.
For scenarios like big data analytics and ML model training.
AWS Service: Amazon S3.
Key Feature: Schema-on-read, enabling flexibility for data scientists.

Data Lakehouse
It’s a hybrid model combining the best of data lakes and warehouses.
For scenarios like unified analytics and ML pipelines.
AWS Service: Amazon Redshift Spectrum, Athena (for querying S3 data).
Key Feature: Seamless integration between structured and unstructured data.
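
The difference between schema-on-write and schema-on-read is easier to see in code. Below is a minimal sketch in plain Python, where two lists stand in for a warehouse table and a data lake. The `ORDER_SCHEMA`, the function names, and the sample records are all hypothetical, and this is not how Redshift or S3 work internally; it only shows *when* the schema check happens.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"id": int, "amount": float}


def validate(record):
    """Return the record if it matches ORDER_SCHEMA, else raise ValueError."""
    if not isinstance(record, dict) or set(record) != set(ORDER_SCHEMA):
        raise ValueError("record does not have exactly the expected fields")
    for field, expected_type in ORDER_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return record


warehouse = []  # stand-in for a warehouse table
lake = []       # stand-in for raw objects in a data lake


def write_to_warehouse(record):
    # Schema-on-write: bad records are rejected before they are stored.
    warehouse.append(validate(record))


def write_to_lake(raw_line):
    # Schema-on-read: store the raw text as-is, with no checks at write time.
    lake.append(raw_line)


def read_orders_from_lake():
    # The schema is applied only now, at read time. This particular reader
    # skips anything that is not valid JSON or does not match the schema.
    orders = []
    for raw_line in lake:
        try:
            orders.append(validate(json.loads(raw_line)))
        except ValueError:
            continue
    return orders


write_to_warehouse({"id": 1, "amount": 9.5})
try:
    write_to_warehouse({"id": "two", "amount": 3.0})
except ValueError as error:
    print("warehouse rejected:", error)

write_to_lake('{"id": 1, "amount": 9.5}')
write_to_lake('{"id": "two", "note": "free text"}')
print(len(warehouse), len(lake), read_orders_from_lake())
```

The warehouse ends up with one clean record, while the lake happily keeps both lines and leaves the cleanup to whoever reads them. That is the flexibility, and the risk, of a data lake.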

I won’t dig deeper, as I know that for a first step and an initial awareness this much info is fine. We will slowly dig down, step by step, in further posts into how AWS handles ML, AI, and GenAI projects and processes.

So just remember: without bread, you can’t make a sandwich.
And without data, and without knowing your data, you can’t bring an ML project to life.

If you’re worried about AI replacing jobs or questioning your ability to keep up, remember that every expert was once a beginner.

Stay curious & motivated — keep learning & moving forward.

Keep Calm, Keep Aware, Keep the Chin and Thinking UP !! You will do it !!

If you want any personal suggestion or a one-to-one call with me, I will be more than happy to have one🌿
Let’s connect on LinkedIn for a Hi !!

Now, take a deep breath and Go Learn🌏