Sayista Yazdani

Posted on Jun 27

# How AI Actually Understands Things 🤖

#machinelearning #ai #webdev #beginners

We use AI daily — ChatGPT, Gemini, Claude, Midjourney…
But have you ever wondered how AI actually processes information before giving an answer?

As I’m learning Generative AI, I explored some amazing concepts like:

Multimodal AI

AI that understands multiple types of data together: text, images, audio, and video.

Example: If AI sees a dog image + hears barking + reads “This is a dog” → it combines everything and predicts: Dog.

What is Fusion in AI?

Fusion means:
Combining information from different sources before making a decision.
There are mainly 2 types:

Early Fusion vs Late Fusion

Early Fusion
AI combines all inputs first, then processes them together.
👉 Like mixing ingredients before cooking.

Late Fusion
AI processes each input separately and combines final predictions later.
👉 Like multiple judges giving opinions before the final decision.
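To make the difference concrete, here is a minimal sketch in plain Python. The feature values, the weights, and the two classes ("dog" and "car") are all made up for illustration; a real system would learn them from data. Early fusion runs one model on the joined features, while late fusion runs one model per input and averages their predictions.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(features, weights):
    """One score per class: dot product of the features with each weight row."""
    return [sum(f * w for f, w in zip(features, row)) for row in weights]

labels = ["dog", "car"]

# Hypothetical features already extracted from each modality.
image_feat = [0.9, 0.1]   # e.g. "looks furry", "has wheels"
audio_feat = [0.8]        # e.g. "sounds like barking"

# Early fusion: concatenate the features, then run ONE model on the joint vector.
early_weights = [[2.0, -1.0, 2.0],    # scores for "dog"
                 [-1.0, 2.0, -1.0]]   # scores for "car"
early_probs = softmax(linear(image_feat + audio_feat, early_weights))

# Late fusion: run one model PER modality, then average their predictions.
image_weights = [[2.0, -1.0], [-1.0, 2.0]]
audio_weights = [[2.0], [-1.0]]
image_probs = softmax(linear(image_feat, image_weights))
audio_probs = softmax(linear(audio_feat, audio_weights))
late_probs = [(i + a) / 2 for i, a in zip(image_probs, audio_probs)]

early_label = labels[early_probs.index(max(early_probs))]
late_label = labels[late_probs.index(max(late_probs))]
print("early fusion:", early_label, "| late fusion:", late_label)
```

Averaging is only one way to combine the final predictions; voting or a small learned combiner are other common choices.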

CNN (Convolutional Neural Network)

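Used mostly for images. A CNN slides small learned filters over an image. Early layers pick up simple features and deeper layers combine them into more complex ones, such as: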

Edges
Shapes
Patterns
Objects

Example: AI sees ears + whiskers + fur → predicts Cat 🐱
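Here is a rough sketch of the core operation inside a CNN layer: sliding a small filter over an image. The toy image and the hand-written Sobel-style filter are assumptions for illustration; in a real CNN the filter values are learned during training rather than written by hand.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Toy 4x6 grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Sobel-style filter that responds to left-to-right changes in brightness.
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The output is zero over the flat regions and large exactly where the dark and bright halves meet, which is what "detecting an edge" means here.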

LSTM (Long Short-Term Memory)

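Used for sequential data such as text, speech, and time series. An LSTM reads its input one step at a time and uses gates to decide what to keep in memory and what to forget, which lets it carry earlier context forward.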

Example: “I grew up in France, so I speak fluent ___”
AI remembers “France” → predicts French.
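As a sketch of how that memory works, here is a single-unit LSTM cell in plain Python. The gate weights are hand-picked assumptions so the behavior is easy to see; a trained LSTM learns them and uses vectors instead of single numbers. The input `1.0` stands in for an important word like "France", and `0.0` stands in for filler words.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name to (input weight, hidden weight, bias).
    """
    def pre(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre("forget"))          # how much old memory to keep
    write = sigmoid(pre("input"))            # how much new information to store
    read = sigmoid(pre("output"))            # how much memory to expose
    candidate = math.tanh(pre("candidate"))  # the new information itself

    c = forget * c_prev + write * candidate  # updated cell state (the memory)
    h = read * math.tanh(c)                  # updated hidden state (the output)
    return h, c

# Hand-picked, hypothetical weights: keep memory by default,
# and only write to it when the input is large.
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 5.0),
    "candidate": (5.0, 0.0, 0.0),
}

def run(sequence):
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h, c

h_signal, c_signal = run([1.0, 0.0, 0.0, 0.0, 0.0])  # important word, then filler
h_blank, c_blank = run([0.0, 0.0, 0.0, 0.0, 0.0])    # filler only
print("memory after signal:", round(c_signal, 3))
print("memory without it:  ", round(c_blank, 3))
```

The cell state stays high four steps after the signal was seen, while the filler-only run leaves it at zero. That persistent cell state is what lets the model still "know about France" when it reaches the blank.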

Transformers & Attention 🚀

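Modern LLMs like ChatGPT are built mainly on Transformers. Instead of passing context along one word at a time the way an LSTM does, they use Attention, which lets every word look directly at every other word in the input and weigh how relevant each one is.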

“The animal didn’t cross the road because it was tired.”

Attention lets the model link “it” to “the animal” rather than “the road”. That’s contextual understanding.
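Here is a minimal sketch of scaled dot-product attention, the calculation at the heart of a Transformer, for the word "it" in the sentence above. The 2-number vectors are invented for illustration; real models learn separate query, key, and value vectors with far more dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is a weighted mix of the value vectors.
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, output

# Hypothetical toy vectors for three words in the sentence.
tokens = ["animal", "road", "tired"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = keys              # reuse the keys as values to keep the sketch small
query_for_it = [3.0, 0.0]  # made-up query vector for the word "it"

weights, output = attention(query_for_it, keys, values)
most_attended = tokens[weights.index(max(weights))]
print("'it' attends most to:", most_attended)
```

With these made-up vectors the largest weight lands on "animal". In a trained model that preference comes from the learned vectors, not from anything hard-coded.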

Embeddings

Neural networks cannot work on raw words or images directly. Inputs are first converted into dense numeric vectors called Embeddings.

Related concepts end up close together in that vector space, which is how AI captures the similarity in meaning between words like:

King 👑
Queen 👑
Prince 🤴
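A common way to measure how close two embeddings are is cosine similarity. The sketch below uses tiny hand-made 3-number vectors, which are purely an assumption for illustration (the extra word "banana" is added only for contrast). Real embeddings are learned, have hundreds or thousands of dimensions, and their individual numbers are not human-readable.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, NOT taken from any real model.
embeddings = {
    "king":   [0.9, 0.9, 0.0],
    "queen":  [0.9, 0.1, 0.0],
    "prince": [0.8, 0.9, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_prince = cosine_similarity(embeddings["king"], embeddings["prince"])
king_banana = cosine_similarity(embeddings["king"], embeddings["banana"])

print("king vs queen: ", round(king_queen, 2))
print("king vs prince:", round(king_prince, 2))
print("king vs banana:", round(king_banana, 2))
```

The royal words score much closer to each other than to the unrelated word, which is the kind of similarity an embedding space is meant to capture.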

SSMs (State Space Models)

A newer architecture designed for very long sequences.

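Why it matters: the compute cost of self-attention in a Transformer grows quadratically with context length, so very long inputs get expensive. SSMs instead carry a fixed-size state forward step by step, so their cost grows linearly with length and their memory use stays much lower.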

Example: Reading a 500-page book while keeping a running summary in your head, instead of re-reading every earlier page 📚
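The basic idea can be sketched as a simple linear recurrence: a small state is updated at every step and the output is read from it. The decay values and the two-number state below are assumptions chosen for illustration; real SSM architectures learn these parameters and add a lot more machinery on top.

```python
def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state space model over a sequence.

    state_t  = a * state_(t-1) + b * x_t   (element-wise)
    output_t = sum(c * state_t)
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state = [a_i * s_i + b_i * x for a_i, s_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * s_i for c_i, s_i in zip(c, state)))
    return outputs, state

# Hypothetical parameters: one slow-decaying and one fast-decaying state channel.
a = [0.999, 0.5]
b = [1.0, 1.0]
c = [1.0, 1.0]

# A single signal at the very start, followed by 999 empty steps.
inputs = [1.0] + [0.0] * 999
outputs, final_state = ssm_scan(inputs, a, b, c)

print("state size:", len(final_state))
print("slow channel after 1000 steps:", round(final_state[0], 3))
```

The state stays two numbers wide no matter how long the sequence gets, and the slow-decaying channel still carries a clear trace of the first input after 1000 steps. The trade-off is that the whole history has to be squeezed into that fixed-size state.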

💡 Final Thought

AI responses may look simple on the surface, but internally:

CNNs detect visual patterns
LSTMs carry context through a sequence
Transformers apply attention across text
Fusion combines multiple modalities
Embeddings turn meaning into numbers
SSMs handle very long sequences efficiently

The more I learn about AI, the more I realize: Using AI is easy. Understanding how AI thinks is the real game.

#ai #machinelearning #deeplearning #multimodal

Sayista Yazdani • Jun 27

I'm curious: which of these AI concepts (CNN, LSTM, Transformers, or SSM) did you find the most fascinating, or the hardest to wrap your head around, when you first started learning? Let's discuss in the comments!