Why do so many AI projects feel like déjà vu?
You start with bold ambitions, tackle a proof of concept, and… it stalls. Again.
At ML Vanguards, we know this story all too well. It’s the cycle we’ve broken countless times in our work at Cube Digital.
The truth? Building production-grade AI isn’t about chasing buzzwords — it’s about combining engineering knowledge with practical AI to deliver systems that actually work, scale, and drive real-world impact.
Table of contents:
Business strategy revolves around data
Gathering relevant data points
Challenges & pitfalls
Conclusion
1. Business strategy revolves around data
It’s no surprise that data is power. The stock market is data, consumer behavior is data, and even clicks on buttons are relevant from a business perspective.
There is a pressing need for automated solutions that streamline social media data collection and analysis. Knowing what your users want, and what they actually use from your business, is the key to decision making and strategy.
This article outlines an end-to-end solution for a highly scalable data-ingestion pipeline tailored to a specific area: marketing intelligence. The architecture caters to various analytical processes: sales, competitor analysis, market analysis, and customer insights, to name a few.
2. Gathering relevant data points
Scheduler: Despite its name, it plays multiple roles, but the most important one is triggering a crawler lambda for each page link it holds.
Crawler: The name states its purpose. If you’re not familiar with the term crawling, pause here and read up on it before proceeding. This component takes the page link and starts crawling/extracting the posts and various information about them. More details will come in the implementation part.
Database: Most posts are unstructured textual data, but we can extract other useful information from them, and MongoDB shines at handling semi-structured data.
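To make the database component concrete, here is a minimal sketch of what a stored post document might look like with pymongo; the connection string, collection, and field names are illustrative assumptions rather than the exact schema from the repository.

```python
# A minimal sketch of storing a crawled post, assuming pymongo and a
# hypothetical "posts" collection; field names are illustrative only.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
posts = client["marketing_intelligence"]["posts"]

post_document = {
    "page_name": "example_page",            # which page the post came from
    "link": "https://example.com/post/1",   # link to the original post
    "raw_content": "Text of the post...",   # the unstructured textual data
    "created_at": datetime.now(timezone.utc),
    "correlation_id": "abc-123",            # ties the post to a crawler run
}
posts.insert_one(post_document)
```

Because MongoDB doesn’t enforce a fixed schema, platform-specific fields (reactions, shares, media links) can simply be added to the document whenever a platform exposes them.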
To walk through the complete flow of the solution: the scheduler triggers a crawler lambda instance for each page, sending it the page name and the link. The crawler extracts the posts from the last week and stores the raw content, the post’s creation date, the link, and the page name, but it doesn’t have to stop there; you can extract more information depending on what the platform offers.
The scheduler then waits for all lambda instances to finish their execution, aggregates the extracted posts from the database, and, using a set of prompt templates, sends them to ChatGPT to generate reports.
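As a rough illustration of the trigger step, the sketch below fans out one asynchronous lambda invocation per page using boto3; the function name and page list are placeholders, not the project’s actual values.

```python
# A minimal sketch of the scheduler's fan-out step, assuming boto3 and a
# hypothetical crawler lambda name; pages are hard-coded for illustration.
import json

import boto3

lambda_client = boto3.client("lambda")

pages = {
    "example_page": "https://example.com/example_page",  # illustrative page
}

for name, link in pages.items():
    lambda_client.invoke(
        FunctionName="crawler-lambda",   # hypothetical function name
        InvocationType="Event",          # asynchronous, fire-and-forget
        Payload=json.dumps({"name": name, "link": link}),
    )
```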
2.1 Scheduler
The reporting part is not the focus, although you can find it here, along with all the code from this article. The leading actor is the scheduler itself: the main entry point of the system, where the whole flow is started and orchestrated.
It then stores the correlation ID of each lambda invocation in a list and waits for all lambdas to finish their execution. The wait interval is set to 15 seconds; you can tune it according to the average time your crawler takes to complete its task, so CloudWatch is not called too often.
Finally, it fetches all the crawled posts from these pages and sends them to the report-generation phase.
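The waiting logic can be pictured roughly as below: the scheduler polls CloudWatch Logs for each correlation ID, sleeping 15 seconds between rounds. The log group name and the "finished" marker are assumptions made for illustration, and this naive version still has the empty-page corner case discussed in the pitfalls section.

```python
# A minimal sketch of the wait loop, assuming boto3, a hypothetical log group
# and a hypothetical "finished" log line emitted by each crawler lambda.
import time

import boto3

logs_client = boto3.client("logs")


def wait_for_crawlers(correlation_ids: list[str], sleep_seconds: int = 15) -> None:
    pending = set(correlation_ids)
    while pending:
        for correlation_id in list(pending):
            response = logs_client.filter_log_events(
                logGroupName="/aws/lambda/crawler-lambda",        # hypothetical
                filterPattern=f'"{correlation_id}" "finished"',   # hypothetical marker
            )
            if response.get("events"):
                pending.discard(correlation_id)
        if pending:
            time.sleep(sleep_seconds)  # avoid hammering CloudWatch
```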
2.2 Crawler
We’ve defined a main abstraction point for all types of crawlers. It establishes a common interface that every derived crawler must implement: each subclass provides its own implementation of the extract() method, so building a new crawler means implementing a single method. Besides bringing a lot of reusability and uniformity, this design has another valuable advantage, covered right after the interface sketch below:
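A minimal sketch of that common interface, with hypothetical class names around the extract() method, might look like this:

```python
# A minimal sketch of the crawler interface; class names are illustrative.
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    """Common interface that all derived crawlers must implement."""

    @abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        """Crawl the given link and persist the extracted posts."""


class ExamplePageCrawler(BaseCrawler):
    """Illustrative subclass; a real crawler would drive Selenium here."""

    def extract(self, link: str, **kwargs) -> None:
        # fetch the page, parse the posts, store them in the database
        print(f"Crawling {link}")
```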
Each crawler can be registered and invoked automatically. In this case, we have a dispatcher whose job is to select and instantiate the correct crawler class based on the link you provide for processing. It essentially acts as a registry and a factory for the crawlers, managing them under the unified interface and structure we’ve created for them (a minimal dispatcher sketch follows the list below). The advantages?
Flexibility & scalability: This component makes it easy to add new crawlers without modifying the existing codebase. The system is easily expandable; you can include more domains and specialized crawlers, just plug and play them.
Encapsulation & modularity: The dispatcher encapsulates the logic for determining which crawler to use based on the link. This makes the system more modular and allows each crawler to focus on its core business logic without worrying about pattern matching.
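A minimal dispatcher sketch, building on the BaseCrawler interface above, could look like the following; the registration mechanism and URL patterns are illustrative, not the repository’s exact implementation.

```python
# A minimal sketch of the dispatcher: a registry plus factory that maps URL
# patterns to crawler classes. Assumes the BaseCrawler/ExamplePageCrawler
# sketch above; domain names are illustrative.
import re


class CrawlerDispatcher:
    """Selects and instantiates the right crawler for a given link."""

    def __init__(self) -> None:
        self._crawlers: dict[str, type[BaseCrawler]] = {}

    def register(self, domain: str, crawler: type[BaseCrawler]) -> None:
        pattern = rf"https://(www\.)?{re.escape(domain)}\.com/.*"
        self._crawlers[pattern] = crawler

    def get_crawler(self, link: str) -> BaseCrawler:
        for pattern, crawler_cls in self._crawlers.items():
            if re.match(pattern, link):
                return crawler_cls()
        raise ValueError(f"No crawler registered for link: {link}")


# Usage: supporting a new platform means one subclass and one register() call;
# the dispatcher logic itself never changes.
dispatcher = CrawlerDispatcher()
dispatcher.register("example", ExamplePageCrawler)
crawler = dispatcher.get_crawler("https://www.example.com/some-page")
crawler.extract("https://www.example.com/some-page")
```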
3. Challenges & pitfalls
- Running a headless browser instance with Selenium in the Lambda runtime environment
The Lambda execution environment is read-only, so anything you want to write to disk has to go into the temporary directory. This mostly ruins any hope of installing the browser driver automatically at runtime, so you need to install it directly in the Docker image and reference it manually in Selenium’s driver options. The only driver that worked for this setup was the Google Chrome driver binary.
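In practice, the Selenium setup ends up looking something like the sketch below: the browser binary and driver paths are placeholders for wherever you bake them into your Docker image, and every writable location is redirected to /tmp.

```python
# A minimal sketch of launching headless Chrome inside Lambda; binary and
# driver paths are placeholders for wherever they live in your Docker image.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.binary_location = "/opt/chrome/chrome"          # baked into the image
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--user-data-dir=/tmp/user-data")  # only /tmp is writable
options.add_argument("--data-path=/tmp/data-path")
options.add_argument("--disk-cache-dir=/tmp/cache-dir")

driver = webdriver.Chrome(
    service=Service("/opt/chromedriver"),  # driver installed in the image
    options=options,
)
```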
- Aggregate empty pages
The initial monitoring algorithm was quite basic. It involved looping over the correlation IDs of each Lambda invocation and checking the database for any generated posts. However, we found a corner case where no new posts had been created for some pages within the searched time range, causing the algorithm to enter an infinite loop.
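One way to guard against this, sketched below with a hypothetical posts_exist_for() check against the database, is to bound the wait with a deadline and treat anything still pending afterwards as an empty page rather than polling it forever.

```python
# A minimal sketch of a bounded monitoring loop; posts_exist_for() is a
# hypothetical callable that checks the database for a correlation ID.
import time


def wait_for_posts(correlation_ids: list[str],
                   posts_exist_for,
                   sleep_seconds: int = 15,
                   max_wait_seconds: int = 300) -> set[str]:
    """Return the correlation IDs that produced posts before the deadline."""
    deadline = time.time() + max_wait_seconds
    pending = set(correlation_ids)
    completed: set[str] = set()
    while pending and time.time() < deadline:
        for correlation_id in list(pending):
            if posts_exist_for(correlation_id):
                completed.add(correlation_id)
                pending.discard(correlation_id)
        if pending:
            time.sleep(sleep_seconds)
    # whatever is still pending is treated as an empty page, not an error
    return completed
```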
- Avoid being blocked by social media platforms
A common issue, and one that could have consumed days of effort, required approaching the problem from a different perspective. Popular social media platforms employ numerous anti-bot protection mechanisms to prevent crawling, such as request-header analysis, rate limiting, and IP blocking.
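As a general illustration (not necessarily the exact countermeasures used here), a common first step against header analysis and rate limiting is to send realistic browser headers and pace the requests, as in the sketch below; the user agents and delays are arbitrary example values.

```python
# A general illustration of politer crawling: rotate realistic headers and
# pace requests. User agents and delay ranges are arbitrary example values.
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


def polite_get(url: str) -> requests.Response:
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    time.sleep(random.uniform(1.0, 3.0))  # spread requests out over time
    return requests.get(url, headers=headers, timeout=30)
```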
Conclusion
In this article, we’ve explored a complete, robust, end-to-end solution for building a highly scalable data-ingestion pipeline that can leverage existing data from multiple crawlable sources for various processes like ML training and data analysis.
We’ve gone through specific challenges you might face and how to overcome them in this process.
🔗 Check out the code on GitHub and support us with a ⭐️
Thanks for reading ML Vanguards! Subscribe for free to receive new posts and support our work.
Within our newsletter, we keep things short and sweet.
If you enjoyed reading this article, consider checking out the full version on Medium. It’s still free ↓