<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 👨🏻‍💻</title>
    <description>The latest articles on DEV Community by 👨🏻‍💻 (@decipher).</description>
    <link>https://dev.to/decipher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1383391%2Fada855b7-95c6-4014-959b-4997adbc6d33.png</url>
      <title>DEV Community: 👨🏻‍💻</title>
      <link>https://dev.to/decipher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/decipher"/>
    <language>en</language>
    <item>
      <title>Facilitating Real-Time Competitive Analysis</title>
      <dc:creator>👨🏻‍💻</dc:creator>
      <pubDate>Tue, 30 Apr 2024 15:28:53 +0000</pubDate>
      <link>https://dev.to/decipher/enabling-real-time-competitive-analysis-56d0</link>
      <guid>https://dev.to/decipher/enabling-real-time-competitive-analysis-56d0</guid>
      <description>&lt;p&gt;When I joined the team, it had recently launched a digital financial portal that helps consumers find suitable financial and insurance products.&lt;/p&gt;

&lt;h2&gt;Issue Description&lt;/h2&gt;

&lt;p&gt;One goal was to understand how our offerings compared with our competitor's. This required focusing on two aspects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Comparing our offerings with the competitor's offerings sourced from the same vendors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring changes in vendor offerings available through our competitor.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Existing Approach and Obstacles&lt;/h2&gt;

&lt;p&gt;The existing method relied on a handful of user profiles representing different potential user types. Twice a day, a team member manually entered these details into both our platform and the competitor's, then compiled the results into a standardized Google Sheet for our analysts. This approach faced several challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited Insights due to Fixed User Profiles&lt;/strong&gt;: A small set of fictitious personas restricted our view to those specific cases, leaving broader user segments unexamined.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of Scalability&lt;/strong&gt;: The manual process was not scalable, making it hard to increase the frequency of data collection or expand the number of user profiles beyond a few. Consequently, our ability to capture real-time market dynamics and adapt to evolving user profiles was limited.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Resolution&lt;/h2&gt;

&lt;p&gt;To overcome these challenges, I introduced the following system:&lt;/p&gt;

&lt;h3&gt;Using Real User Profiles and an Auto-Scaling Crawler&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3601jphbyxfw53auapj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3601jphbyxfw53auapj.png" alt="System Design" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of relying on a handful of artificial personas, I proposed using pseudonymized real user data drawn from our portal. An auto-scaling crawler then fetched competitor offers in real time, eliminating both the manual data entry and the cap on collection frequency. This ensured our analysis reflected current market conditions and the competitive landscape.&lt;/p&gt;
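&lt;p&gt;As a sketch of the pseudonymization step (the field names and salting scheme here are illustrative assumptions, not our actual schema), each user identifier can be replaced with a salted hash token before profiles enter the crawler pipeline:&lt;/p&gt;

```scala
import java.security.MessageDigest

// Illustrative pseudonymization: field names and the salting scheme
// are assumptions for this sketch, not the portal's actual schema.
case class UserProfile(userId: String, age: Int, income: Int)
case class PseudonymizedProfile(token: String, age: Int, income: Int)

object Pseudonymizer {
  // A per-deployment secret salt makes tokens hard to reverse by
  // brute-forcing known user ids.
  private val salt = "example-secret-salt"

  // Deterministic SHA-256 token: the same user always maps to the
  // same token, so repeated crawls can be joined downstream.
  def tokenize(userId: String): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    digest.digest((salt + userId).getBytes("UTF-8")).map(b => f"$b%02x").mkString
  }

  def pseudonymize(p: UserProfile): PseudonymizedProfile =
    PseudonymizedProfile(tokenize(p.userId), p.age, p.income)
}
```

&lt;p&gt;Because the token is deterministic, repeated crawls for the same user can be joined in the warehouse without ever exposing the raw identifier.&lt;/p&gt;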

&lt;h3&gt;Benefits&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhanced Data Accuracy&lt;/strong&gt;: Leveraging real user data improved the relevance and accuracy of our analysis, enabling more informed decision-making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved Scalability&lt;/strong&gt;: Automation let data collection scale to cover the full set of user profiles rather than a handful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Insights&lt;/strong&gt;: Real-time offer fetching provided immediate visibility into competitor strategies and market conditions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Obstacles&lt;/h3&gt;

&lt;p&gt;The system faced two significant challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IP Address Blocking&lt;/strong&gt;: The competitor's website blocked the IP range of our Lambda workers, halting the crawler. We routed requests through a rotating IP proxy service so that no single address could be banned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Website Updates&lt;/strong&gt;: The crawler broke whenever the competitor changed its page elements or network request contracts. We routed failed crawl attempts to a dead letter queue and monitored it, so breaking changes were identified and adapted to quickly.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
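&lt;p&gt;The dead-letter pattern above can be sketched as follows. In production the queue was a Kafka topic; here an in-memory buffer (and the illustrative type names) stand in:&lt;/p&gt;

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Sketch of the dead-letter pattern: failed crawl attempts are
// captured with their error instead of being lost, so a monitor can
// alert on queue growth. Names here are illustrative assumptions.
case class CrawlJob(profileToken: String, url: String)
case class DeadLetter(job: CrawlJob, error: String)

class CrawlerRunner(fetch: CrawlJob => String) {
  val deadLetters: mutable.Buffer[DeadLetter] = mutable.Buffer.empty

  // Run one crawl job; on failure, record it in the dead letter
  // buffer and return None instead of propagating the exception.
  def run(job: CrawlJob): Option[String] =
    Try(fetch(job)) match {
      case Success(body) => Some(body)
      case Failure(e) =>
        deadLetters += DeadLetter(job, e.getMessage)
        None
    }
}
```

&lt;p&gt;A monitor alerting on dead-letter growth then turns silent crawler breakage into an actionable signal.&lt;/p&gt;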

&lt;p&gt;Despite these obstacles, we successfully crawled the competitor's portal for 95% of our user profiles, with an end-to-end latency of 90 seconds at the 99th percentile.&lt;/p&gt;
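&lt;p&gt;For reference, a 99th-percentile figure like the one quoted above can be computed with the standard nearest-rank method over the collected end-to-end latency samples (a generic sketch, not our monitoring code):&lt;/p&gt;

```scala
// Nearest-rank percentile over latency samples (in seconds).
// Generic metric sketch, not the actual monitoring implementation.
def percentile(samples: Seq[Double], p: Double): Double = {
  require(samples.nonEmpty)
  val sorted = samples.sorted
  // rank is 1-based; clamp it to a valid index of the sorted sequence
  val rank = math.ceil(p / 100.0 * sorted.size).toInt
  sorted(math.min(math.max(rank - 1, 0), sorted.size - 1))
}
```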

&lt;h3&gt;Technology Stack&lt;/h3&gt;

&lt;p&gt;The solution incorporated the following components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.confluent.io/confluent-cloud/"&gt;&lt;strong&gt;Kafka&lt;/strong&gt;&lt;/a&gt;, for real-time communication.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.scala-lang.org/"&gt;&lt;strong&gt;Scala&lt;/strong&gt;&lt;/a&gt;, for powering our stream processor.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cypress.io/"&gt;&lt;strong&gt;Cypress&lt;/strong&gt;&lt;/a&gt;, for web crawling.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://proxymesh.com/"&gt;&lt;strong&gt;Proxymesh&lt;/strong&gt;&lt;/a&gt;, for a rotating IP proxy service to bypass IP ban.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.docker.com/"&gt;&lt;strong&gt;Docker&lt;/strong&gt;&lt;/a&gt;, for containerization.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/lambda/"&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;&lt;/a&gt;, for enabling serverless execution of the crawler.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.snowflake.com/en/"&gt;&lt;strong&gt;Snowflake&lt;/strong&gt;&lt;/a&gt;, for data warehousing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getdbt.com/"&gt;&lt;strong&gt;dbt&lt;/strong&gt;&lt;/a&gt;, for automating data transformation pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://argoproj.github.io/workflows/"&gt;&lt;strong&gt;Argo Workflows&lt;/strong&gt;&lt;/a&gt;, for orchestrating DBT jobs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/looker/docs"&gt;&lt;strong&gt;Looker&lt;/strong&gt;&lt;/a&gt;, for business intelligence and data visualization.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>systemdesign</category>
      <category>kafka</category>
      <category>dataengineering</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
