Ruben Ghafadaryan

Posted on May 18

Why Clothing Matching Is More Complicated Than Asking ChatGPT

#ai #fft #fashion

Brief Summary

At first glance, AI-based clothing matching looks deceptively simple: upload two garment photos into a modern multimodal AI model and ask whether they suit each other. In reality, the problem quickly expands into a combination of color theory, computer vision, pattern analysis, visual psychology, trend modeling, and semantic style understanding. This article describes why a dedicated matching system may still be needed even in the era of large multimodal AI models, what technical ideas stand behind such systems, and why the final solution will most likely require a hybrid architecture and a significant amount of experimentation.

Disclaimer:

This article is not AI-generated. However, as English is not my native language, AI tools were used to help polish the style, improve readability, and double-check several technical and historical references. The ideas, conclusions, and technical direction remain entirely mine. In cases where suitable illustrations could not be found online, or where copyright limitations existed, AI-generated images were used as placeholders or conceptual illustrations.

Calling AI?

While discussing a potential fashion-related AI project, I requested a surprisingly large amount of resources for a seemingly simple feature: clothing matching. More specifically, a system that could look at two garments and answer the familiar question:

“Do these clothes actually suit each other?”

Very quickly the obvious question appeared:

“Why don’t we simply upload two images into ChatGPT or another multimodal AI and ask whether they match?”

At first glance, this sounds entirely reasonable. Modern AI models can already describe images, recognize objects, explain paintings, generate realistic photographs, and analyze visual content surprisingly well. Surely they can decide whether trousers and a shirt work together.

The deeper we discussed the task, however, the more we discovered an uncomfortable truth: fashion is one of those areas where humans make highly subjective visual decisions and still expect them to be consistent.

And unfortunately, consistency is exactly what production systems require.

A Small Personal Problem

I should probably clarify that I am not a fashion expert.

Like many engineers, my practical understanding of clothing historically operated somewhere between “looks acceptable” and “at least the colors are not actively dangerous.” Most successful color combinations in my wardrobe were selected not by me, but by my wife.

Unfortunately, I am a mathematician first.

And mathematicians have a predictable habit: whenever they cannot rely on intuition, they start building models.

So while many people may simply feel that two garments do not work together, my brain immediately starts asking which parameters conflict, whether saturation is involved, how visual attention is distributed, and whether the problem can be represented numerically.

At that point, the fashion problem quietly transforms into signal processing with visual side effects.

The Black-and-Yellow Blouse Problem

Imagine a black blouse with yellow circles.

Now combine it with plain white shorts.

Most people would probably say:

“Looks good.”

Now replace the white shorts with bright red trousers.

Suddenly the same blouse starts producing a completely different visual impression.

What changed?

Not the blouse. Not even the number of colors.

The issue is that fashion compatibility is not simply:

“Are the colors similar?”

In reality, the human brain evaluates contrast, saturation, visual balance, dominant versus accent colors, pattern interaction, complexity, texture, style, and overall visual harmony.

At this point, the project stopped looking like a small image comparison tool and started looking much closer to computational psychology with RGB values.

Image 1. Same blouse. Completely different visual balance

Why Generic Multimodal AI Is Not Enough

Modern multimodal systems like ChatGPT are genuinely impressive. If you upload two garments and ask whether they match, the answer will often sound convincing.

The problem is that “convincing” and “reliable production logic” are not the same thing.

A general-purpose AI model behaves more like a highly educated assistant with broad visual knowledge, but without stable fashion rules. Sometimes the answer will be excellent, sometimes inconsistent, and sometimes the same images will produce different answers after a model update.

This becomes problematic for a real product.

A customer-facing fashion system requires stable behavior, measurable quality, explainable reasoning, controllable business logic, predictable latency, and the ability to evolve into recommendation systems later.

“Ask a large neural network and hope for the best” becomes less convincing once product teams, customers, and legal departments enter the discussion.

The First Technical Surprise: RGB Is Almost Useless

Initially, one might imagine a straightforward solution: extract image colors, compare RGB values, calculate distances, produce a score.

This idea survives until the first real examples appear.

Humans do not perceive colors the way RGB encodes them: equal numeric steps in RGB do not correspond to equal perceived differences.

Two shades of green may technically belong to the same color family while visually clashing completely. An acid neon green and a soft watercolor green may both be “green,” but placing them together can create a distinctly uncomfortable visual effect.

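This is why serious color analysis usually avoids RGB and instead uses perceptual color spaces such as CIELAB (often written LAB) or its cylindrical form, LCH. LAB separates lightness from two color-opponent axes (green-red and blue-yellow), while LCH expresses the same information as lightness, chroma, and hue, which is much closer to how humans describe color.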

In simplified form, the difference between two colors may be represented as the Euclidean distance between them in LAB space (the classic CIE76 color-difference formula):

$d = \sqrt{(L_1 - L_2)^2 + (a_1 - a_2)^2 + (b_1 - b_2)^2}$

Fortunately, the practical meaning is much simpler than the formula itself: the higher the value, the more differently humans perceive the colors.

Suddenly the system no longer sees:

“green”

but instead:

“high-chroma aggressive green with strong visual dominance.”

Which is considerably more useful for fashion analysis.
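As a minimal sketch of the idea, the distance formula above can be computed in plain Python. The conversion below assumes 8-bit sRGB input and a D65 white point; the function names and the two sample greens are illustrative choices, not part of any existing system.

```python
import math

def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (assumes a D65 white point)."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    # Linear sRGB -> CIE XYZ
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def delta_e(lab1, lab2):
    """Euclidean distance in LAB space, i.e. the formula d shown above."""
    return math.dist(lab1, lab2)

def chroma(lab):
    """Colorfulness: distance from the neutral gray axis in the a/b plane."""
    return math.hypot(lab[1], lab[2])

neon_green = srgb_to_lab((57, 255, 20))     # hypothetical "acid" green
soft_green = srgb_to_lab((170, 200, 170))   # hypothetical "watercolor" green
print(delta_e(neon_green, soft_green), chroma(neon_green), chroma(soft_green))
```

Both colors are "green" by name, yet the neon sample has far higher chroma, which is the kind of signal a matching engine can act on.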

Fashion Is Not Always Logical

Another important realization appears quite quickly: even if we successfully model “good” visual combinations mathematically, fashion trends may completely ignore logic.

History repeatedly demonstrates that objectively questionable combinations can suddenly become fashionable because celebrities popularized them, luxury brands promoted them, social media amplified them, or entire industries decided they looked modern.

Acid neon colors periodically return every few years, oversized silhouettes repeatedly cycle back into fashion, and combinations once considered excessive suddenly become trendy again.

Meanwhile, Coco Chanel helped popularize elegant black clothing as a timeless classic, even though black had historically been associated more with practicality or mourning.

Fashion is full of such contradictions.

This creates an additional challenge for AI systems: they must distinguish between timeless visual harmony and temporary cultural trends.

A system based purely on “classical matching rules” might reject combinations that become fashionable for entirely cultural reasons. This means future systems may eventually require trend-awareness, contextual modes, or “experimental fashion” profiles.

After all, people sometimes intentionally wear combinations designed to attract attention rather than to achieve visual harmony.

Image 2. Fashion periodically ignores logic and remains successful anyway.

The Unexpected Complexity of Patterns

The next surprise came from ornaments and patterns.

Even if colors match perfectly, patterns may still conflict visually. A floral blouse combined with plaid trousers may technically share compatible colors, yet the result still feels visually overloaded.

The problem is no longer color harmony. It becomes visual attention competition.

Fashion turns out to have surprisingly strict unwritten rules regarding which garment is allowed to dominate visually. Usually one expressive item works well, while multiple expressive items quickly become risky.

This means the system needs to analyze not only colors, but also ornament density, texture complexity, edge distribution, and pattern frequency.

At this point FFT unexpectedly enters the fashion industry.

Yes, Fast Fourier Transform.

The same mathematical tool used in signal processing can help estimate how visually “busy” a garment is. Repetitive patterns generate different frequency signatures than smooth minimalistic surfaces.
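One possible way to turn this into a number is to measure how much of an image's spectral energy sits at high spatial frequencies. The sketch below uses a small recursive radix-2 FFT written in plain Python so that it has no dependencies; a real pipeline would more likely call a library routine such as `numpy.fft.fft2`. The function name `high_frequency_ratio`, the 0.25 cycles-per-pixel cutoff, and the requirement of a square grayscale grid with a power-of-two side are all assumptions made for illustration.

```python
import cmath

def fft(values):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft2(grid):
    """2D FFT: transform every row, then every column."""
    rows = [fft(row) for row in grid]
    cols = [fft(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def high_frequency_ratio(gray, cutoff=0.25):
    """Share of non-DC spectral energy at or above `cutoff` cycles per pixel."""
    n = len(gray)
    mean = sum(map(sum, gray)) / (n * n)
    # Subtracting the mean removes the DC term, so a flat surface has zero energy.
    spectrum = fft2([[v - mean for v in row] for row in gray])
    total = high = 0.0
    for u in range(n):
        for v in range(n):
            energy = abs(spectrum[u][v]) ** 2
            # Fold indices so frequency runs from 0 (DC) to 0.5 (Nyquist).
            fu, fv = min(u, n - u) / n, min(v, n - v) / n
            total += energy
            if max(fu, fv) >= cutoff:
                high += energy
    return high / total if total > 1e-12 else 0.0

flat = [[0.5] * 8 for _ in range(8)]
wide_stripes = [[1.0 if x < 4 else 0.0 for x in range(8)] for _ in range(8)]
checkerboard = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
print(high_frequency_ratio(flat),
      high_frequency_ratio(wide_stripes),
      high_frequency_ratio(checkerboard))
```

A plain surface scores zero, broad stripes score low, and a fine checkerboard scores close to one, which matches the intuitive ordering of "busyness".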

In simplified terms, compatibility itself may eventually become a weighted combination of many visual parameters:

$S = w_1 C + w_2 P + w_3 T + w_4 V$

Where:

$C$ — color harmony
$P$ — pattern compatibility
$T$ — texture and style semantics
$V$ — visual balance
$w_i$ — configurable weights defining importance of each factor

Fortunately, the real implementation would look slightly less frightening than the equation suggests.
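To make the formula concrete, here is a minimal sketch of the weighted sum. It assumes each component score is already normalized to the range 0 to 1 and that the weights sum to 1; the default weights are arbitrary placeholders, not tuned values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Weights:
    """Hypothetical default weights w_1..w_4; they should sum to 1."""
    color: float = 0.4          # w_1, applied to C
    pattern: float = 0.3        # w_2, applied to P
    texture_style: float = 0.2  # w_3, applied to T
    balance: float = 0.1        # w_4, applied to V

def compatibility(c, p, t, v, w=Weights()):
    """S = w_1*C + w_2*P + w_3*T + w_4*V, with all inputs in [0, 1]."""
    return w.color * c + w.pattern * p + w.texture_style * t + w.balance * v

score = compatibility(c=0.9, p=0.5, t=0.7, v=0.6)
print(round(score, 2))
```

Keeping the weights in a separate object means a different brand or style profile can swap them without touching the scoring logic.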

Image 3. Signal processing meets fashion analysis

Fashion Is Largely About Visual Hierarchy

One of the most interesting discoveries was that outfits are not evaluated garment-by-garment.

Humans evaluate the balance of visual attention.

A pair of neutral white shorts works well with a loud patterned blouse because the shorts visually “step back” and allow the blouse to dominate. A second visually loud garment creates competition, and the eye no longer knows where to focus.

This is why neutral colors are so powerful in fashion: black, white, gray, beige, navy. These colors help stabilize combinations containing stronger visual elements.

A good matching engine therefore does not simply compare colors. It tries to estimate which garment dominates, which one supports, and whether the outfit becomes visually overloaded.
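A rough sketch of that estimate might look like the following. It assumes each garment has already been reduced to a single "loudness" score between 0 and 1 (for example, derived from chroma and pattern busyness); the function name, the score values, and the 0.6 threshold are hypothetical.

```python
def assess_hierarchy(loudness, loud_threshold=0.6):
    """Rank garments by visual loudness and flag competing loud items.

    `loudness` maps a garment name to an assumed score in [0, 1].
    """
    ranked = sorted(loudness, key=loudness.get, reverse=True)
    loud_items = [name for name in ranked if loudness[name] >= loud_threshold]
    return {
        "dominant": ranked[0],
        "supporting": ranked[1:],
        # More than one loud garment means they compete for attention.
        "overloaded": len(loud_items) > 1,
    }

calm = assess_hierarchy({"patterned_blouse": 0.8, "white_shorts": 0.1})
busy = assess_hierarchy({"patterned_blouse": 0.8, "red_trousers": 0.75})
print(calm["overloaded"], busy["overloaded"])
```

With white shorts the blouse dominates unchallenged; with bright red trousers both items cross the threshold and the outfit is flagged as overloaded.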

Where Modern AI Actually Becomes Useful

Ironically, after all this discussion about not relying entirely on multimodal AI, we still ended up wanting multimodal AI.

Just not alone.

Models like CLIP turned out to be extremely interesting because they understand higher-level visual semantics.

Traditional computer vision can detect dominant colors, saturation, contrast, texture complexity, edge density, FFT-based pattern frequencies, and visual attention balance. CLIP, by contrast, embeds images and text into a shared space, so a garment photo can be scored against concepts such as elegant, sporty, streetwear, minimalist, vintage, or formal.

This creates a hybrid architecture: classical deterministic visual analysis combined with semantic embedding systems.

The deterministic layer provides stability and explainability. The multimodal layer provides aesthetic understanding.

Together they form a much more reliable system than either approach alone.
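One way the two layers could be combined is a simple blend of a deterministic score with the similarity of two style embeddings. In this sketch the embeddings are short placeholder vectors standing in for whatever a CLIP-like encoder would return, and the function names and the 0.3 semantic weight are assumptions, not a description of any real system.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_score(deterministic_score, embedding_a, embedding_b, semantic_weight=0.3):
    """Blend a rule-based score in [0, 1] with semantic style similarity."""
    # Map cosine similarity from [-1, 1] to [0, 1] so both terms share a scale.
    semantic = (cosine_similarity(embedding_a, embedding_b) + 1) / 2
    return (1 - semantic_weight) * deterministic_score + semantic_weight * semantic

# Placeholder embeddings; real ones would have hundreds of dimensions.
same_style = hybrid_score(0.5, [1.0, 0.0], [1.0, 0.0])
unrelated_style = hybrid_score(0.5, [1.0, 0.0], [0.0, 1.0])
print(same_style, unrelated_style)
```

Because the deterministic term keeps most of the weight, the result stays explainable even if the embedding model changes between versions.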

Image 4. The system does considerably more than compare two JPEG files.

Can Fashion Trends Be Formalized?

One particularly interesting extension of such systems is the ability to formalize trends and allow customers to define their own style preferences instead of relying on a single universal “good taste” model.

Because the uncomfortable truth is that fashion rules are not fixed.

A luxury fashion retailer, a youth streetwear platform, and a conservative office clothing brand may evaluate the same outfit very differently.

This means the system eventually needs something resembling a configurable style-definition layer.

Conceptually, this could work as a lightweight “fashion policy language,” where customers define preferred visual patterns and forbidden combinations.

For example, one profile may prefer high neutral color ratios, low pattern complexity, limited accent colors, and conservative combinations. Another may intentionally allow strong contrasts, multiple accent colors, mixed patterns, and visually aggressive combinations.

In practice, the configuration may look like structured metadata or policy definitions:

{
  "style_profile": "minimalist_modern",
  "preferred": {
    "neutral_ratio_min": 0.5,
    "pattern_complexity_max": 0.4
  },
  "avoid": {
    "high_saturation_pairs": true
  }
}

Or, for a more experimental audience:

{
  "style_profile": "streetwear_experimental",
  "allow": {
    "multiple_accents": true,
    "mixed_patterns": true
  },
  "preferred": {
    "visual_attention_score_min": 0.7
  }
}
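To illustrate how such a profile could be enforced, the sketch below checks measured outfit metrics against the first profile shown above. The convention that `_min` and `_max` suffixes define bounds, the metric names, and the `violations` helper are assumptions made for this example.

```python
import json

PROFILE_JSON = """
{
  "style_profile": "minimalist_modern",
  "preferred": {
    "neutral_ratio_min": 0.5,
    "pattern_complexity_max": 0.4
  },
  "avoid": {
    "high_saturation_pairs": true
  }
}
"""

def violations(profile, metrics):
    """Return human-readable reasons why an outfit breaks the profile."""
    problems = []
    for key, limit in profile.get("preferred", {}).items():
        if key.endswith("_min"):
            name = key[: -len("_min")]
            if metrics[name] < limit:
                problems.append(f"{name} is below {limit}")
        elif key.endswith("_max"):
            name = key[: -len("_max")]
            if metrics[name] > limit:
                problems.append(f"{name} is above {limit}")
    for key, forbidden in profile.get("avoid", {}).items():
        # An "avoid" flag is matched against a boolean metric of the same name.
        if forbidden and metrics.get(key, False):
            problems.append(f"{key} is not allowed")
    return problems

profile = json.loads(PROFILE_JSON)
quiet_outfit = {"neutral_ratio": 0.7, "pattern_complexity": 0.2,
                "high_saturation_pairs": False}
loud_outfit = {"neutral_ratio": 0.3, "pattern_complexity": 0.6,
               "high_saturation_pairs": True}
print(violations(profile, quiet_outfit))
print(violations(profile, loud_outfit))
```

Returning reasons rather than a bare pass or fail keeps the explainability that the deterministic layer is meant to provide.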

Over time, such systems could also become trend-aware: “prefer Scandinavian minimalism,” “allow neon revival trends,” “follow contemporary streetwear aesthetics,” or “avoid runway-style combinations.”

At that point the system stops behaving like a static rule engine and starts acting more like a configurable visual recommendation platform.

Ironically, this is very similar to many engineering systems: eventually the difficult part is not the algorithm itself, but allowing humans to customize the behavior without completely destroying the logic underneath.

How One Simple Task Becomes a Research Direction

In this case, we were not asked only for a small “do these two clothes match?” feature. The broader request was closer to an intelligent fashion platform, and clothing compatibility was just one selected problem from a much larger set.

However, this single problem is already enough to show the scale of the domain.

So the important conclusion is not that one matching feature is unexpectedly large. The more important conclusion is that this “simple” task is a good warning signal: the whole fashion intelligence domain may contain many similar subproblems, each of which can turn into a separate research and experimentation track.

In other words, clothing matching is not the whole platform.

It is just one doorway into the larger problem.

Conclusion

The original question was:

“Why don’t we just ask ChatGPT whether two garments suit each other?”

The answer turned out to be surprisingly complicated.

Because fashion compatibility is not merely image recognition, color comparison, or text generation. It is a combination of perceptual color theory, visual hierarchy, pattern interaction, texture analysis, semantic style understanding, human psychological perception, and evolving cultural trends.

Modern multimodal AI is absolutely useful in this domain. In fact, it may become one of the most important components of the final architecture.

But relying solely on a generic AI prompt is unlikely to produce a robust production-grade solution.

Most likely, the real implementation will require experimentation, hybrid architectures, handcrafted visual descriptors, semantic embedding systems, configurable trend profiles, and a considerable amount of testing.

Which leads to the final and slightly uncomfortable conclusion:

This task may still require a lot of experimenting before the right approach is found.

DEV Community