<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ujjawal Tyagi</title>
    <description>The latest articles on DEV Community by Ujjawal Tyagi (@ujjawaltyagi).</description>
    <link>https://dev.to/ujjawaltyagi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1273744%2Fa67c260b-0ce4-454a-aecb-fcc1f0caaf1d.png</url>
      <title>DEV Community: Ujjawal Tyagi</title>
      <link>https://dev.to/ujjawaltyagi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ujjawaltyagi"/>
    <language>en</language>
    <item>
      <title>Decoding Amazon's Recommendation Engine</title>
      <dc:creator>Ujjawal Tyagi</dc:creator>
      <pubDate>Fri, 01 Mar 2024 17:16:15 +0000</pubDate>
      <link>https://dev.to/ujjawaltyagi/decoding-amazons-recommendation-engine-3gpo</link>
      <guid>https://dev.to/ujjawaltyagi/decoding-amazons-recommendation-engine-3gpo</guid>
      <description>&lt;p&gt;I am sure that while browsing &lt;a href="https://www.amazon.com/" rel="noopener noreferrer"&gt;Amazon&lt;/a&gt;, you must have experienced this while casually looking at something, and then you get bombarded with suggestions for "&lt;strong&gt;similar items you might like&lt;/strong&gt;"? It's almost like the website can read your mind!&lt;/p&gt;

&lt;p&gt;Well, while it may not be telepathy, there's a &lt;strong&gt;powerful recommendation engine&lt;/strong&gt; behind the scenes, carefully crafting personalized suggestions just for you. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;But the question is&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How does it work?&lt;/li&gt;
&lt;li&gt;How does Amazon balance speed and accuracy in delivering personalized recommendations?&lt;/li&gt;
&lt;li&gt;How do they handle user privacy while utilizing their data for personalized recommendations?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;So let's try to understand the &lt;strong&gt;inner workings&lt;/strong&gt; of Amazon's Recommendation Engine, and don't worry, I won't make it complicated!&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond "Customers Who Bought This Also Bought"
&lt;/h2&gt;

&lt;p&gt;While "Customers Who Bought This Also Bought" is a familiar sight, it's just one piece of the puzzle. We've all encountered those appealing product suggestions while browsing Amazon. &lt;strong&gt;But have you ever wondered&lt;/strong&gt; how Amazon curates these recommendations amidst its vast inventory? How do they filter or what technique do they use for leveraging user behavior data to predict preferences?&lt;/p&gt;

&lt;p&gt;Well, the answer lies in two primary techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative Filtering&lt;/strong&gt;: This method analyzes the behavior of similar users. Imagine a giant network where users and items are connected based on their interactions. By analyzing the buying habits and ratings of users with similar tastes, the engine predicts what you might like based on what others like you have chosen.
Here's the technical breakdown:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User-item matrix&lt;/strong&gt;: This matrix represents interactions 
(purchases, ratings, etc.) between users and items. Each cell 
holds a value signifying the interaction strength.
(e.g. purchase = 1, no interaction = 0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity measures&lt;/strong&gt;: Techniques like &lt;a href="https://naomy-gomes.medium.com/the-cosine-similarity-and-its-use-in-recommendation-systems-cb2ebd811ce1" rel="noopener noreferrer"&gt;cosine similarity&lt;/a&gt; or 
&lt;a href="https://new.pythonforengineers.com/blog/machine-learning-with-an-amazon-like-recommendation-engine/" rel="noopener noreferrer"&gt;Pearson correlation&lt;/a&gt; coefficients measure the similarity between 
user profiles based on their interaction patterns within the 
matrix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nearest neighbor algorithms&lt;/strong&gt;: These algorithms identify 
users with the highest similarity scores to the target user. 
Their past interactions are then used to recommend items they 
haven't encountered yet but might enjoy based on their similar 
preferences.
 
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqskf8tk99ul5minj3uq.png" alt="Similar preferences"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content-Based Filtering&lt;/strong&gt;: This technique focuses on the item itself. The engine analyzes features, descriptions, and categories of products you've interacted with, and then recommends similar items based on these characteristics.
It can involve: 

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Item-item matrix&lt;/strong&gt;: This matrix represents the relationships 
between items based on shared features, categories, or 
descriptions. Each cell holds a similarity score between items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature engineering&lt;/strong&gt;: Techniques like &lt;a href="https://towardsdatascience.com/sentiment-analysis-and-product-recommendation-on-amazons-electronics-dataset-reviews-part-2-de71649de42b" rel="noopener noreferrer"&gt;TF-IDF &lt;/a&gt;(Term 
Frequency-Inverse Document Frequency) are employed to extract 
relevant features and represent them numerically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nearest neighbor algorithms&lt;/strong&gt;: Similar to collaborative 
filtering, these algorithms identify items with the highest 
similarity scores to items the user has interacted with. These 
similar items are then presented as recommendations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
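&lt;p&gt;To make the collaborative-filtering steps concrete, here's a toy Python sketch (the users, items, and matrix are entirely made up; Amazon's real system is far more sophisticated):&lt;/p&gt;

```python
import math

# Toy user-item matrix: rows are users, columns are items (1 = purchased).
items = ["book", "lamp", "mug", "rug"]
ratings = {
    "alice": [1, 1, 0, 0],
    "bob":   [1, 1, 1, 0],
    "carol": [0, 0, 1, 1],
}

def cosine(u, v):
    """Cosine similarity between two interaction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, k=1):
    """Suggest items the target hasn't bought, taken from the k nearest users."""
    scored = [(cosine(ratings[target], vec), name)
              for name, vec in ratings.items() if name != target]
    scored.sort(reverse=True)  # highest-similarity neighbors first
    recs = set()
    for _, neighbor in scored[:k]:
        for i, bought in enumerate(ratings[neighbor]):
            if bought and not ratings[target][i]:
                recs.add(items[i])
    return recs

print(recommend("alice"))  # bob is alice's nearest neighbor, and he bought the mug
```

&lt;p&gt;Content-based filtering ends the same way, except the vectors being compared describe item features (e.g., TF-IDF weights of product descriptions) rather than user interactions.&lt;/p&gt;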

&lt;h2&gt;
  
  
  Hybridization:
&lt;/h2&gt;

&lt;p&gt;With millions of products and customers, efficiently sorting through all that data is a huge challenge. To deal with this, Amazon employs a technique called &lt;strong&gt;Matrix Factorization&lt;/strong&gt;.&lt;br&gt;
Also, Amazon doesn't rely solely on one technique. It often employs a &lt;strong&gt;hybrid approach&lt;/strong&gt;, combining the strengths of collaborative and content-based filtering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weighted combination&lt;/strong&gt;: The recommendations from both 
techniques are combined using weights based on their individual 
effectiveness for the specific user or item.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matrix factorization&lt;/strong&gt;: Advanced techniques like matrix 
factorization can be used to create a lower-dimensional 
representation of the user-item and item-item matrices, capturing 
latent factors influencing user preferences and item 
relationships. This allows for more efficient and accurate 
recommendations.&lt;/li&gt;
&lt;/ul&gt;
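&lt;p&gt;Here's a minimal sketch of matrix factorization trained with stochastic gradient descent on a made-up 3x3 rating matrix (illustrative only, not Amazon's implementation):&lt;/p&gt;

```python
import random

random.seed(0)

# Toy 3x3 rating matrix; 0 means "unknown", the value we want to predict.
R = [
    [5, 3, 0],
    [4, 0, 1],
    [1, 1, 5],
]
n_users, n_items, k = 3, 3, 2   # k latent factors per user and per item
P = [[random.random() for _ in range(k)] for _ in range(n_users)]  # user factors
Q = [[random.random() for _ in range(k)] for _ in range(n_items)]  # item factors

def predict(u, i):
    """Predicted rating = dot product of user and item factor vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

# Stochastic gradient descent over the observed entries only.
lr, reg = 0.01, 0.02
for _ in range(5000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i]:                       # skip the unknown cells
                err = R[u][i] - predict(u, i)
                for f in range(k):
                    P[u][f] += lr * (err * Q[i][f] - reg * P[u][f])
                    Q[i][f] += lr * (err * P[u][f] - reg * Q[i][f])

print(round(predict(0, 2), 2))  # the model's guess for the missing rating R[0][2]
```

&lt;p&gt;The low-dimensional factors are the "latent" preferences: once learned, every empty cell of the matrix gets a prediction for free.&lt;/p&gt;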



&lt;h2&gt;
  
  
  What about data?
&lt;/h2&gt;

&lt;p&gt;These algorithms are only as good as the data they are fed. Amazon leverages a vast amount of user data to personalize recommendations, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit feedback&lt;/strong&gt;: It includes purchase history, ratings, 
reviews, and wish list additions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implicit feedback&lt;/strong&gt;: It involves browsing behavior, search 
queries, clicks on product images, and time spent on product 
pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual data&lt;/strong&gt;: Location, time of day, and device type can 
be used to tailor recommendations to specific situations (e.g., 
suggesting raincoats during a rainfall).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Personalization
&lt;/h2&gt;

&lt;p&gt;Amazon employs additional techniques to personalize the recommendation experience:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-based recommendations&lt;/strong&gt;: Products are suggested based 
on seasonal trends or upcoming events (e.g., recommending 
cookbooks around holidays).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time recommendations&lt;/strong&gt;: User behavior is analyzed in 
real-time to dynamically adjust recommendations on the fly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing&lt;/strong&gt;: Different recommendation strategies are 
tested on different user segments to identify the most effective 
approach for each individual.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb22rhsr9z77ngtnoog16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb22rhsr9z77ngtnoog16.png" alt="personalization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But don't you think that &lt;strong&gt;scaling&lt;/strong&gt; this recommendation engine to serve &lt;strong&gt;millions of users&lt;/strong&gt; requires more than just clever algorithms? Yes, it demands a robust infrastructure. Amazon's recommendation engine operates atop a &lt;strong&gt;distributed computing framework&lt;/strong&gt;, where data is partitioned across multiple servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; what happens if a server fails under the weight of &lt;strong&gt;user queries?&lt;/strong&gt; For that Amazon has implemented fault-tolerant mechanisms, ensuring uninterrupted service by replicating data across redundant servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the role of Caching?
&lt;/h2&gt;

&lt;p&gt;Amazon utilizes &lt;a href="https://www.cloudflare.com/en-in/learning/cdn/what-is-caching/" rel="noopener noreferrer"&gt;caching&lt;/a&gt; to store frequently accessed data closer to users, reducing the need to fetch information from the main database repeatedly. By keeping popular data in a cache, Amazon &lt;strong&gt;minimizes&lt;/strong&gt; the computational overhead and latency associated with retrieving data, thus enhancing the overall user experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffae53cgiayg2nap9sc4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffae53cgiayg2nap9sc4.jpg" alt="caching"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reducing Load Times&lt;/strong&gt;: Caching strategies enable Amazon to load web pages and display product information more quickly, leading to shorter wait times for users. With cached data readily available, users experience &lt;strong&gt;faster page load times&lt;/strong&gt;, allowing for smoother browsing and quicker access to desired products.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhanced User Experience&lt;/strong&gt;: By optimizing data retrieval with caching, Amazon ensures a seamless and efficient shopping experience for its users. Reduced latency and faster access to information contribute to a more responsive website, improving user satisfaction and encouraging increased engagement and sales.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
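&lt;p&gt;The idea can be illustrated with a tiny LRU (least-recently-used) cache. The product names are hypothetical, and production systems use dedicated caches like Redis or Memcached rather than an in-process dictionary:&lt;/p&gt;

```python
from collections import OrderedDict

class LRUCache:
    """Tiny cache: keeps the most recently used entries, evicts the oldest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]
        return None  # cache miss: the caller falls back to the database

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(2)
cache.put("product:1", "Echo Dot")
cache.put("product:2", "Kindle")
cache.get("product:1")             # touch product:1 so it stays hot
cache.put("product:3", "Fire TV")  # evicts product:2, the coldest entry
print(cache.get("product:2"))      # None, so this request would hit the database
```

&lt;p&gt;Popular ("hot") products stay in the cache and are served instantly; rarely viewed ones fall out and cost a slower database round trip.&lt;/p&gt;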

&lt;h2&gt;
  
  
  What about user privacy &amp;amp; data?
&lt;/h2&gt;

&lt;p&gt;Even in the name of a personalized experience, the vast amount of user data collected by Amazon raises concerns about &lt;strong&gt;potential misuse&lt;/strong&gt; or &lt;strong&gt;unauthorized access&lt;/strong&gt;. &lt;br&gt;
In particular, personalized recommendations can inadvertently create &lt;strong&gt;filter bubbles&lt;/strong&gt;, limiting users' exposure to diverse information and viewpoints, which can perpetuate existing &lt;strong&gt;biases&lt;/strong&gt; and lead to discriminatory recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, what does Amazon have to say about this?&lt;/strong&gt; Well, Amazon outlines its data collection and usage practices in its privacy policy, allowing users to make informed choices, manage their data, and &lt;strong&gt;opt out&lt;/strong&gt; of personalized recommendations altogether.&lt;br&gt;
Also, Amazon &lt;strong&gt;anonymizes data&lt;/strong&gt; before using it for recommendation purposes, and trends and patterns are analyzed using &lt;strong&gt;aggregated data sets&lt;/strong&gt;, minimizing the use of individual user information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But still&lt;/strong&gt; the balance between personalization and privacy remains a complex and evolving debate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Final Verdict
&lt;/h2&gt;

&lt;p&gt;Amazon's recommendation engine is a complex &lt;strong&gt;combination&lt;/strong&gt; of algorithms, data analysis, and machine learning, constantly evolving and improving. &lt;strong&gt;While the specifics remain proprietary&lt;/strong&gt;, understanding the interplay between user behavior, data analysis, and recommendation algorithms gives us a glimpse of how things work behind the scenes.&lt;/p&gt;

&lt;p&gt;I wonder if other e-commerce giants like &lt;strong&gt;eBay&lt;/strong&gt; or &lt;strong&gt;Walmart&lt;/strong&gt; employ similar recommendation strategies, or if they have methods of their own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What do you think about it? Do let me know in the comments.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;If you enjoyed this blog, you can follow me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ujjawaltyagii" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/ujjawal_tyagiii" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/ujjawal-tyagi/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to support me, you can &lt;a href="https://github.com/sponsors/ujjawaltyagii" rel="noopener noreferrer"&gt;sponsor me on GitHub&lt;/a&gt; or &lt;a href="https://www.buymeacoffee.com/codewithuj" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>aws</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Why YouTube Never Runs Out of Storage? It's NOT just CLOUD!</title>
      <dc:creator>Ujjawal Tyagi</dc:creator>
      <pubDate>Thu, 22 Feb 2024 20:19:15 +0000</pubDate>
      <link>https://dev.to/ujjawaltyagi/why-youtube-never-runs-out-of-storage-its-not-just-cloud-225f</link>
      <guid>https://dev.to/ujjawaltyagi/why-youtube-never-runs-out-of-storage-its-not-just-cloud-225f</guid>
      <description>&lt;p&gt;Have you ever wondered, despite all these years and an &lt;strong&gt;absolutely insane amount of video data&lt;/strong&gt; being generated. Why &lt;a href="https://www.youtube.com/" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt; haven't run out of space? &lt;strong&gt;Especially&lt;/strong&gt; with hits like these:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fringjbwpyt5kb02g2pbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fringjbwpyt5kb02g2pbs.png" alt="Absurd yt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3377y1jlf7kvhjq1vd1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3377y1jlf7kvhjq1vd1h.png" alt="more absurd"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;This is insane&lt;/strong&gt;, right? Imagine a platform bursting with millions of videos, yet &lt;strong&gt;never facing a space crunch&lt;/strong&gt;. And even if you try to counter it with &lt;strong&gt;cloud computing&lt;/strong&gt;, at the end of the day, it's just physical hardware or hard disks sitting somewhere in a data center in the name of the cloud :)&lt;br&gt;
 &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Petabytes to Exabytes:
&lt;/h2&gt;

&lt;p&gt;YouTube operates at an unprecedented scale, storing &lt;strong&gt;petabytes&lt;/strong&gt; and &lt;strong&gt;exabytes&lt;/strong&gt; of video content to cater to its vast user base. To put this into perspective, &lt;strong&gt;a single petabyte is equivalent to one million gigabytes&lt;/strong&gt;, while &lt;strong&gt;an exabyte is one billion gigabytes&lt;/strong&gt;. Managing such immense volumes of data is insane🤯.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So, the question arises:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's the limit? 🤔&lt;/li&gt;
&lt;li&gt;How do they never lose anything?&lt;/li&gt;
&lt;li&gt;How can any data be accessed instantly from anywhere in the world?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Let's&lt;/strong&gt; delve into a deeper, more fascinating story behind YouTube's seemingly infinite storage capabilities.&lt;/p&gt;

&lt;p&gt;And, &lt;strong&gt;don't worry&lt;/strong&gt;, I'm not gonna fool you into cloud computing XD&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond the Cloud
&lt;/h2&gt;

&lt;p&gt;Well, it made some sense back when the maximum quality was 720p, but now many videos need to be stored in 4K. They must have developed &lt;strong&gt;some special&lt;/strong&gt; compression algorithms or &lt;strong&gt;methods&lt;/strong&gt; to minimize the size.  &lt;/p&gt;

&lt;p&gt;If they were to rely &lt;strong&gt;solely on cloud storage&lt;/strong&gt;, it would require enormous space and be costly, regardless of the company's size, especially considering that anyone can upload vast amounts of data for free.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  First take: Compression Magic
&lt;/h2&gt;

&lt;p&gt;The only reasonable explanation involves &lt;strong&gt;data compression&lt;/strong&gt;. Videos are compressed before storage using cutting-edge codecs like &lt;strong&gt;VP9, H.264, H.265 (HEVC)&lt;/strong&gt;, and &lt;strong&gt;AV1&lt;/strong&gt;. This reduces file size by up to &lt;strong&gt;50%&lt;/strong&gt;, significantly stretching storage capacity without compromising quality. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However&lt;/strong&gt;, this must be done in a way that barely compromises quality. Even so, no matter how effective the compression is, there is still &lt;strong&gt;some minimal loss&lt;/strong&gt;, a trade-off made to maintain performance and speed.&lt;/p&gt;

&lt;p&gt;This does sound like Pied Piper's revolutionary compression algorithm from the series "Silicon Valley" XD &lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5270nrt6ftol9e5kv5zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5270nrt6ftol9e5kv5zb.png" alt="Pied piper"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
In addition, YouTube utilizes &lt;strong&gt;advanced transcoding&lt;/strong&gt; and &lt;strong&gt;optimization&lt;/strong&gt; techniques to encode uploaded videos into multiple formats and resolutions, catering to various devices and network conditions. &lt;strong&gt;Adaptive bitrate streaming&lt;/strong&gt; further enhances the user experience by dynamically adjusting video quality based on available bandwidth and device capabilities.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Second take: Storage Tiers
&lt;/h2&gt;

&lt;p&gt;Tiered storage is one of the main factors: videos aren't stored in one &lt;strong&gt;monolithic cloud&lt;/strong&gt;. YouTube employs a tiered system, where frequently accessed content resides in &lt;strong&gt;high-performance, readily accessible storage&lt;/strong&gt; (think lightning-fast SSDs), while &lt;strong&gt;less-viewed videos&lt;/strong&gt; migrate to colder, more cost-effective tiers (like hard drives). This optimizes latency, performance, and storage costs.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
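&lt;p&gt;A toy sketch of tier selection (the thresholds are entirely made up; real systems tier on much richer signals than raw view counts):&lt;/p&gt;

```python
# Hypothetical thresholds: real systems use far richer signals than raw views.
def pick_tier(views_last_30_days):
    """Map a video's recent popularity to a storage tier."""
    if views_last_30_days > 100_000:
        return "hot (SSD)"       # frequently watched: low-latency storage
    if views_last_30_days > 1_000:
        return "warm (HDD)"      # occasional views: cheaper spinning disks
    return "cold (archive)"      # rarely watched: cheapest, slowest tier

for views in (2_500_000, 40_000, 12):
    print(views, "->", pick_tier(views))
```

&lt;p&gt;A background job would periodically re-evaluate each video and migrate it between tiers as its popularity changes.&lt;/p&gt;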

&lt;h2&gt;
  
  
  Third take: Content Lifecycle Management
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content Assessment:&lt;/strong&gt; YouTube constantly analyzes videos to understand their popularity and engagement. Videos with low viewership or engagement are flagged for archival or removal, freeing up space for fresh content.&lt;br&gt;
(&lt;strong&gt;But&lt;/strong&gt; still there are tons of inactive accounts with all their old videos)&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partner Programs:&lt;/strong&gt; YouTube offers monetization options for creators. Videos enrolled in such programs are typically retained longer due to their potential revenue generation. &lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technology Advancements:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emerging Technologies&lt;/strong&gt;: YouTube actively explores cutting-edge technologies like &lt;a href="https://www.popularmechanics.com/technology/infrastructure/a29008852/dna-storage-future/" rel="noopener noreferrer"&gt;DNA storage&lt;/a&gt;, which offers exponentially denser storage compared to traditional methods. While still in its early stages, it holds vast potential for the future.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Moore's Law&lt;/strong&gt;: Storage capacity consistently increases, driven by advancements in hardware technology. This allows YouTube to accommodate growing video libraries while maintaining cost-effectiveness.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  What about availability?
&lt;/h2&gt;

&lt;p&gt;Well, if we're talking just about the availability of this huge amount of data, it comes down to: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Global Network:&lt;/strong&gt; YouTube's storage infrastructure isn't confined to a single location. It's distributed across data centers worldwide, ensuring redundancy and resilience. If one data center experiences an outage, others can seamlessly take over, preventing service interruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Replication:&lt;/strong&gt; Popular content is replicated across different data centers. This ensures it's readily available to viewers near them, minimizing latency and buffering issues.

&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's the available information?
&lt;/h2&gt;

&lt;p&gt;Google uses &lt;strong&gt;Google File System (GFS)&lt;/strong&gt; and &lt;strong&gt;BigTable&lt;/strong&gt; to manage this large amount of data. They have millions of disks in a &lt;strong&gt;RAID&lt;/strong&gt; configuration across multiple data centers. I found an answer on Twitter from 'TechWelthEngine' that sounds plausible. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"At 4.3 petabytes a day, it takes just over 232 days to get to an exabyte. If we assume that they have 15 EB of storage, then that means it'll take them 9.5 years to fill it all at this pace."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; if this is true, do they have to build a &lt;strong&gt;new 15 EB&lt;/strong&gt; facility &lt;strong&gt;every 9.5 years&lt;/strong&gt;?&lt;br&gt;
I am not really sure. Maybe they will just &lt;strong&gt;dedupe&lt;/strong&gt; any redundant data?&lt;br&gt;
And don't forget that the &lt;strong&gt;4.3 petabytes&lt;/strong&gt; a day will only &lt;strong&gt;increase&lt;/strong&gt; over the coming years, especially with a huge number of videos being created and narrated by &lt;strong&gt;AI&lt;/strong&gt;!&lt;/p&gt;
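&lt;p&gt;The tweet's arithmetic checks out, assuming decimal units (1 EB = 1,000 PB):&lt;/p&gt;

```python
# Sanity-checking the quoted figures, using decimal units (1 EB = 1,000 PB).
PB_PER_EB = 1_000
daily_pb = 4.3                      # ingest per day, as quoted in the tweet

days_per_eb = PB_PER_EB / daily_pb
years_for_15_eb = 15 * PB_PER_EB / daily_pb / 365

print(round(days_per_eb))           # 233, i.e. "just over 232 days" per exabyte
print(round(years_for_15_eb, 1))    # 9.6 years to fill 15 EB at this pace
```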

&lt;p&gt;&lt;strong&gt;And if they are really just constantly upgrading their servers(which obviously they are not) then it explains why we have to watch 2 ads, then 1.5 minutes of the actual video, then 2 ads, then 3 minutes, then the process repeats :)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I believe there must be a way because they can't keep building server farms forever and ever....&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I tried to contact&lt;/strong&gt; YouTube and some senior developers at YouTube to get a clearer view on this, but so far there has been &lt;strong&gt;no response&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hence&lt;/strong&gt;, the questions remain unanswered: just &lt;strong&gt;how long can YouTube hold onto our data in the cloud&lt;/strong&gt;, and &lt;strong&gt;what are YouTube's archival processes?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What do you think about it? Do let me know in the comments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inspired by twitter/X talks with Ben Weddle.&lt;/p&gt;

&lt;p&gt;If you enjoyed this blog, you can follow me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ujjawaltyagii" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/ujjawal_tyagiii" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/ujjawal-tyagi/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to support me, you can &lt;a href="https://github.com/sponsors/ujjawaltyagii" rel="noopener noreferrer"&gt;sponsor me on GitHub&lt;/a&gt; or &lt;a href="https://www.buymeacoffee.com/codewithuj" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>discuss</category>
      <category>cloudstorage</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Unmasking LinkedIn's Connection Logic 🤯</title>
      <dc:creator>Ujjawal Tyagi</dc:creator>
      <pubDate>Wed, 14 Feb 2024 14:45:19 +0000</pubDate>
      <link>https://dev.to/ujjawaltyagi/unmasking-linkedins-connection-logic-29ff</link>
      <guid>https://dev.to/ujjawaltyagi/unmasking-linkedins-connection-logic-29ff</guid>
      <description>&lt;p&gt;Have you ever wondered how &lt;a href="https://www.linkedin.com/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; knows you and your potential connections are just a few clicks away? It's neither magic, nor some clever engineering hidden behind the scenes. What if I tell you that it's just &lt;strong&gt;&lt;a href="https://www.geeksforgeeks.org/graph-data-structure-and-algorithms/" rel="noopener noreferrer"&gt;Graph&lt;/a&gt;&lt;/strong&gt;? &lt;strong&gt;Yes It is&lt;/strong&gt;!&lt;br&gt;
So let's dive into the fascinating world of &lt;strong&gt;graph algorithms&lt;/strong&gt; and see how LinkedIn connects you to your professional network!&lt;br&gt;
And don't worry It's not going to be too complex.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond 1st, 2nd, and 3rd Degrees:
&lt;/h2&gt;

&lt;p&gt;We all know those little icons next to LinkedIn profiles indicating our connection level (1st, 2nd, or 3rd degree). But how does LinkedIn calculate these connections? It all starts with a powerful tool called a &lt;strong&gt;graph algorithm&lt;/strong&gt;. Imagine a giant map where people are represented as dots and connections are lines. This complex map, known as a graph, stores information about who's connected to whom.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxueqtg1y6htcgyyh5tc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxueqtg1y6htcgyyh5tc2.png" alt="1st, 2nd, 3rd connections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How is it possible with millions of users and connections?
&lt;/h2&gt;

&lt;p&gt;The challenge of scale is real: with millions of users and connections, navigating this map efficiently becomes hard. To overcome this, LinkedIn uses a special type of graph algorithm called &lt;strong&gt;bi-directional BFS&lt;/strong&gt; &lt;strong&gt;(Breadth-First Search)&lt;/strong&gt;. This algorithm searches simultaneously from you and from your potential connection, meeting somewhere in the middle to determine your connection level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3iq5ay6mexz65rc3a7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3iq5ay6mexz65rc3a7k.png" alt="connections"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;But wait&lt;/strong&gt;, there's more! As you interact and your network grows, constantly searching the entire graph becomes impractical. So, LinkedIn employs a clever caching strategy. It stores your &lt;strong&gt;second-degree connections&lt;/strong&gt; (friends of your friends) locally, allowing for &lt;strong&gt;faster lookups&lt;/strong&gt; without needing to traverse the entire network every time.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
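&lt;p&gt;Here's a toy bidirectional BFS over a made-up network (the names are invented; LinkedIn's production graph store is, of course, far more elaborate):&lt;/p&gt;

```python
from collections import deque

# Toy network: who is directly connected to whom.
graph = {
    "you":   ["aman", "neha"],
    "aman":  ["you", "ravi"],
    "neha":  ["you"],
    "ravi":  ["aman", "priya"],
    "priya": ["ravi"],
}

def connection_degree(src, dst):
    """Bidirectional BFS: expand from both ends, meet in the middle."""
    if src == dst:
        return 0
    dist_src, dist_dst = {src: 0}, {dst: 0}
    q_src, q_dst = deque([src]), deque([dst])
    while q_src and q_dst:
        # Real implementations expand the smaller frontier; here we alternate.
        for q, dist, other in ((q_src, dist_src, dist_dst),
                               (q_dst, dist_dst, dist_src)):
            for _ in range(len(q)):
                node = q.popleft()
                for nb in graph.get(node, []):
                    if nb in other:               # the two frontiers met
                        return dist[node] + 1 + other[nb]
                    if nb not in dist:
                        dist[nb] = dist[node] + 1
                        q.append(nb)
    return None  # not connected at all

print(connection_degree("you", "priya"))  # 3, i.e. a 3rd-degree connection
```

&lt;p&gt;Because each side only explores half the distance, the visited portion of the graph stays dramatically smaller than with a one-sided search.&lt;/p&gt;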

&lt;h2&gt;
  
  
  Scaling to Millions:
&lt;/h2&gt;

&lt;p&gt;Imagine storing everyone's second-degree connections on a single server! Not feasible. That's why LinkedIn distributes this data across multiple servers, dividing it based on user IDs. &lt;strong&gt;But what happens if a server crashes?&lt;/strong&gt; To ensure redundancy, each shard (portion of data) is replicated on different servers.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real twist: Speed vs Efficiency
&lt;/h2&gt;

&lt;p&gt;Now comes the real twist. While replicating data ensures availability, it also adds complexity. To avoid hitting every server for each query, LinkedIn uses a technique called &lt;strong&gt;set cover&lt;/strong&gt;. This fancy term basically means finding the smallest number of servers that hold all the information needed for your query, minimizing the number of hops and maximizing speed.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Secret Sauce: Greedy Set Cover:
&lt;/h2&gt;

&lt;p&gt;LinkedIn uses a modified version of the greedy set cover algorithm, prioritizing servers that hold connections most relevant to your search. Think of it like finding the shortest route on a map by visiting only the essential points. This clever approach reduces the number of servers needed, making queries faster and more efficient.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
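&lt;p&gt;The greedy heuristic is easy to sketch; the servers and user IDs below are invented for illustration:&lt;/p&gt;

```python
# Toy example: which replicas (servers) cover all the user IDs we need?
needed = {1, 2, 3, 4, 5}
servers = {
    "s1": {1, 2, 3},
    "s2": {3, 4},
    "s3": {4, 5},
    "s4": {1, 5},
}

def greedy_set_cover(universe, sets):
    """Repeatedly pick the server covering the most still-uncovered IDs."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(sets, key=lambda name: len(uncovered.intersection(sets[name])))
        if not uncovered.intersection(sets[best]):
            return None  # some IDs aren't held by any server
        chosen.append(best)
        uncovered.difference_update(sets[best])
    return chosen

print(greedy_set_cover(needed, servers))  # ['s1', 's3'] covers all five IDs
```

&lt;p&gt;Exact set cover is NP-hard, so the greedy approximation is the standard practical choice: it answers the query by contacting two servers instead of four.&lt;/p&gt;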

&lt;h2&gt;
  
  
  The End Result: A Connected us!
&lt;/h2&gt;

&lt;p&gt;Thanks to these complex algorithms and clever caching strategies, LinkedIn can efficiently navigate its massive network and show you relevant connections within milliseconds. So, the next time you see those degree icons, remember the invisible technology working tirelessly to connect you with your professional world!&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  And for the tech-savvy:
&lt;/h2&gt;

&lt;p&gt;A question: &lt;strong&gt;Does Facebook's 'Friends of friends' feature work the same way?&lt;/strong&gt;&lt;br&gt;
Or do they use TAO (Facebook's graph store, built over a Memcached-style cache) or something else entirely?&lt;br&gt;
What do you think about it?&lt;/p&gt;

&lt;p&gt;If you're curious about the nitty-gritty, the &lt;a href="https://engineering.linkedin.com/real-time-distributed-graph/using-set-cover-algorithm-optimize-query-latency-large-scale-distributed" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; linked in the original content delves deeper into the specific algorithms and optimizations used by LinkedIn. &lt;/p&gt;

&lt;p&gt;But for everyone else, hopefully, this blog has shed some light on the magic behind those connection degrees!&lt;br&gt;
There still might be things I'm missing, so do let me know in the comments.&lt;/p&gt;

&lt;p&gt;Inspired by various online discussions and Gaurav Sen.&lt;/p&gt;

&lt;p&gt;If you enjoyed this blog, you can follow me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ujjawaltyagii" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/ujjawal_tyagiii" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/ujjawal-tyagi/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;It's Valentine's Day and I'm not feeling lonely because my keyboard is definitely getting touched tonight XD&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

#SingleCodersDay


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you'd like to support me, you can &lt;a href="https://github.com/sponsors/ujjawaltyagii" rel="noopener noreferrer"&gt;sponsor me on GitHub&lt;/a&gt; or &lt;a href="https://www.buymeacoffee.com/codewithuj" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>algorithms</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Docker alternative Podman on rise 🚀: The Future of DevOps?</title>
      <dc:creator>Ujjawal Tyagi</dc:creator>
      <pubDate>Mon, 05 Feb 2024 22:33:43 +0000</pubDate>
      <link>https://dev.to/ujjawaltyagi/docker-alternative-podman-on-rise-the-future-of-devops-31i2</link>
      <guid>https://dev.to/ujjawaltyagi/docker-alternative-podman-on-rise-the-future-of-devops-31i2</guid>
      <description>&lt;p&gt;As a developer, I've long relied on &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; for its robust features and ease of use. However, with the emergence of &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;Podman&lt;/a&gt;, a new player in the containerization arena, the landscape is shifting. In this article, I'll delve into my experiences with both Docker and Podman, highlighting their key differences, advantages, and potential impact on the future of DevOps.&lt;br&gt;
Sit tight &amp;amp; explore!&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's wrong with Docker?
&lt;/h2&gt;

&lt;p&gt;For years, Docker has been my go-to tool for containerizing applications. Its intuitive interface, extensive community support, and seamless integration with orchestration tools like Kubernetes have made it an indispensable part of my workflow. From developing microservices to deploying scalable applications, Docker has been my trusted companion.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbxzq1gld1ni02ihbw0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbxzq1gld1ni02ihbw0f.png" alt="docker pic"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;But is it enough?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the root access you need every time?&lt;/li&gt;
&lt;li&gt;the dependence on a daemon process?&lt;/li&gt;
&lt;li&gt;the attack surface it exposes to vulnerabilities?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;



&lt;h2&gt;
  
  
  Podman: The New Kid on the Block
&lt;/h2&gt;

&lt;p&gt;Enter Podman, a container engine that promises a fresh perspective on containerization. Initially, I was skeptical about Podman's capabilities, especially given its lack of a dedicated desktop application and limited compatibility with orchestration tools. &lt;strong&gt;However&lt;/strong&gt;, recent updates have transformed my perception.&lt;/p&gt;

&lt;p&gt;Despite my allegiance to Docker, I couldn't ignore the buzz surrounding Podman. Intrigued by its security features, lightweight architecture, and rootless operation, I decided to give it a try. To my surprise, &lt;strong&gt;transitioning from Docker to Podman&lt;/strong&gt; was smoother than expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vlfqejbabis4oak9mca.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vlfqejbabis4oak9mca.jpg" alt="docker vs podman"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is this transition worth it?
&lt;/h2&gt;

&lt;p&gt;Well, you don't have to worry much about that: Podman's CLI is designed to be compatible with Docker's, so the vast majority of Docker commands work unchanged, and most tooling that talks to a Docker-style CLI works with Podman as well.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
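&lt;p&gt;For example (illustrative commands only; the image name and flags are arbitrary):&lt;/p&gt;

```shell
# The Podman CLI mirrors Docker's almost verbatim:
docker build -t myapp .           # Docker
podman build -t myapp .           # Podman: same subcommand, same flags

docker run -d -p 8080:80 myapp    # Docker
podman run -d -p 8080:80 myapp    # Podman equivalent

# Many people simply alias one to the other:
alias docker=podman
```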

&lt;h2&gt;
  
  
  Security Matters:
&lt;/h2&gt;

&lt;p&gt;One of Podman's standout features is its enhanced security model. Unlike Docker, which requires root access for container management, Podman operates in a rootless fashion, significantly reducing the attack surface and minimizing security risks. For organizations prioritizing security, Podman offers a compelling alternative.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring Pods: Kubernetes Integration
&lt;/h2&gt;

&lt;p&gt;While Docker's orchestration capabilities have long been lauded, Podman introduces a new concept: Pods. As in Kubernetes, a Podman Pod lets multiple containers share the same network namespace, volumes, and even port mappings, simplifying complex deployments and enabling seamless scaling.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
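&lt;p&gt;A quick illustrative sketch (the pod name and images are made up for the example):&lt;/p&gt;

```shell
# Containers in one pod share the pod's network namespace and ports.
podman pod create --name mypod -p 8080:80   # ports are published on the pod
podman run -d --pod mypod nginx             # joins mypod's network namespace
podman run -d --pod mypod redis             # reachable from nginx via localhost
```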

&lt;h2&gt;
  
  
  The Future of Containerization
&lt;/h2&gt;

&lt;p&gt;Whether you are team Docker or team Podman, one thing is clear: adaptability is key. For seasoned Docker enthusiasts and curious newcomers alike, exploring what Podman offers can lead to new insights, improved workflows, and enhanced security. In the dynamic world of DevOps, embracing innovation is not just a choice; it's a necessity.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
Now, you've got everything you need to start your Podman journey.&lt;br&gt;
There are more things to cover but that's a story for another time!&lt;/p&gt;

&lt;p&gt;I'll be back with more technicalities about Podman; till then, see ya!&lt;br&gt;
&lt;/p&gt;
&lt;p&gt;If you enjoyed this blog, you can follow me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ujjawaltyagii" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/ujjawal_tyagiii" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/ujjawal-tyagi/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to support me, you can &lt;a href="https://github.com/sponsors/ujjawaltyagii" rel="noopener noreferrer"&gt;sponsor me on GitHub&lt;/a&gt; or &lt;a href="https://www.buymeacoffee.com/codewithuj" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>opensource</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
