<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tony Wang</title>
    <description>The latest articles on DEV Community by Tony Wang (@tonywangca).</description>
    <link>https://dev.to/tonywangca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F148876%2F2c831bd8-52c8-44be-bf32-6653835db6b9.jpeg</url>
      <title>DEV Community: Tony Wang</title>
      <link>https://dev.to/tonywangca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tonywangca"/>
    <language>en</language>
    <item>
      <title>27.6% of the Top 10 Million Sites are Dead</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Wed, 30 Oct 2024 08:48:52 +0000</pubDate>
      <link>https://dev.to/tonywangca/276-of-the-top-10-million-sites-are-dead-fgi</link>
      <guid>https://dev.to/tonywangca/276-of-the-top-10-million-sites-are-dead-fgi</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76kfnc0mfkmz24dz2pym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76kfnc0mfkmz24dz2pym.png" alt="The internet has a memory" width="800" height="759"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The internet, in many ways, has a memory. From archived versions of old websites to search engine caches, there's often a way to dig into the past and uncover information—even for websites that are no longer active. You may have heard of the Internet Archive, a popular tool for exploring the history of the web, which has experienced outages lately due to hacks and other challenges. But what if there was no Internet Archive? Does the internet still "remember" these sites?&lt;/p&gt;

&lt;p&gt;In this article, we'll dive into a study of the top 10 million domains and reveal a surprising finding: &lt;strong&gt;over a quarter of them—27.6%—are effectively dead&lt;/strong&gt;. Below, I'll walk you through the steps and infrastructure involved in analyzing these domains, along with the system requirements, code snippets, and statistical results of this research.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Analyzing 10 Million Domains
&lt;/h2&gt;

&lt;p&gt;Thanks to resources like &lt;a href="https://www.domcop.com/files/top/top10milliondomains.csv.zip" rel="noopener noreferrer"&gt;DomCop&lt;/a&gt;, we can access a list of the top 10 million domains, which serves as our starting point. Processing such a large volume of URLs requires significant computing resources, parallel processing, and optimized handling of HTTP requests.&lt;/p&gt;

&lt;p&gt;To get accurate results quickly, we needed a well-designed scraper capable of handling millions of requests in minutes. Here’s a breakdown of our approach and the system design.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Design for High-Volume Domain Scraping
&lt;/h2&gt;

&lt;p&gt;To analyze 10 million domains in a reasonable timeframe, we set a target of completing the task in &lt;strong&gt;10 minutes&lt;/strong&gt;. This required a system that could process &lt;strong&gt;approximately 16,667 requests per second&lt;/strong&gt;. By splitting the load across &lt;strong&gt;100 workers&lt;/strong&gt;, each would need to handle around &lt;strong&gt;167 requests per second&lt;/strong&gt;.&lt;/p&gt;
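
&lt;p&gt;As a quick sanity check, the arithmetic behind these targets can be reproduced in a few lines of Go. Nothing here is new data; the constants simply mirror the numbers above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    const (
        domains = 10_000_000 // top-10M domain list
        seconds = 10 * 60    // 10-minute target
        workers = 100
    )
    total := float64(domains) / float64(seconds)
    fmt.Printf("cluster: %.0f req/s, per worker: %.0f req/s\n", total, total/workers)
    // Output: cluster: 16667 req/s, per worker: 167 req/s
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;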

&lt;h3&gt;
  
  
  1. Efficient Queue Management with Redis
&lt;/h3&gt;

&lt;p&gt;Redis, which easily handles over 10,000 requests per second, played a key role in managing the job queue. However, even with Redis, tracking status codes for millions of domains can overload the system. To prevent this, we used Redis pipelines, which batch many commands into a single round trip and reduce the load on our Redis cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// SPopN retrieves multiple items from a Redis set efficiently.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;SPopN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;cmders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmder&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;cmders&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;spopCmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cmder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringCmd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;spopCmd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using this method, we could pull large batches from Redis with minimal impact on performance, fetching up to 100 jobs at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;fetchJobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;SPopN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;jobQueue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Optimizing DNS Requests
&lt;/h3&gt;

&lt;p&gt;To resolve domains efficiently, we used multiple public DNS servers (e.g., Google DNS, Cloudflare) and handled up to &lt;strong&gt;16,667 requests per second&lt;/strong&gt;. Public DNS servers typically throttle large volumes of requests, so we implemented error handling and retries for DNS timeouts and throttling errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dnsServers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"8.8.8.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"8.8.4.4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1.1.1.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"208.67.222.222"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"208.67.220.220"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By balancing the load across multiple servers, we could avoid rate limits imposed by individual DNS providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. HTTP Request Handling
&lt;/h3&gt;

&lt;p&gt;To check domain statuses, we attempted direct HTTP/HTTPS requests to each IP address. The following code retries with HTTPS if the HTTP request encounters a protocol error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ips&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IPAddr&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;customDNSServer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dnsServers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Intn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dnsServers&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
        &lt;span class="n"&gt;resolver&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resolver&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;PreferGo&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Dial&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dialer&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"udp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":53"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resolver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupIPAddr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Retry %d: Failed to resolve %s on DNS server: %s, error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to resolve %s on DNS server: %s after retries, error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customDNSServer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;customDialer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dialer&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Timeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;customTransport&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"80"&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"https://"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"443"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;customDialer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Timeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;customTransport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CheckRedirect&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;via&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrUseLastResponse&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequestWithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"http://"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to create request: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"User-Agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userAgent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;urlErr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urlErr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"http: server gave HTTP response to HTTPS client"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request failed due to HTTP response to HTTPS client: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c"&gt;// Retry with HTTPS&lt;/span&gt;
            &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scheme&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https"&lt;/span&gt;
            &lt;span class="n"&gt;customTransport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;customDialer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ips&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;":443"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"HTTPS request failed: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Request failed: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Received response from %s: %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deployment Strategy
&lt;/h2&gt;

&lt;p&gt;Our scraping deployment consisted of &lt;strong&gt;400 worker replicas&lt;/strong&gt;, each handling &lt;strong&gt;200 concurrent requests&lt;/strong&gt;. This configuration required &lt;strong&gt;20 instances, 160 vCPUs, and 450GB of memory&lt;/strong&gt;. With CPU usage at only around 30%, the setup was efficient and cost-effective, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;400&lt;/span&gt;
  &lt;span class="s"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/tonywangcn/ten-million-domains:20241028150232&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000m"&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300Mi"&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The approximate cost for this setup was around &lt;strong&gt;$0.0116 per 10 million requests&lt;/strong&gt;, totaling less than $1 for the entire analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo91tskuanxr6recxkf0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo91tskuanxr6recxkf0d.png" alt="Cost of servers" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Analysis: How Many Sites Are Actually Accessible?
&lt;/h2&gt;

&lt;p&gt;The status code data from the scraper allowed us to classify domains as "accessible" or "inaccessible." Here are the criteria we used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accessible: Status codes other than 1000 (DNS not found), 0 (timeout), 404 (not found), or 5xx (server error).&lt;/li&gt;
&lt;li&gt;Inaccessible: Domains with the status codes above, indicating they are either unreachable or no longer in service.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;accessible_condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;599&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;inaccessible_condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;accessible_condition&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After aggregating the results, we found that &lt;strong&gt;27.6% of the domains were either inactive or inaccessible&lt;/strong&gt;. This meant that over &lt;strong&gt;2.75 million domains&lt;/strong&gt; from the top 10 million were dead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Status Code | Count     | Rate |
| ----------- | --------- | ---- |
| 301         | 4,989,491 | 50%  |
| 1000        | 1,883,063 | 19%  |
| 200         | 1,087,516 | 11%  |
| 302         | 659,791   | 7%   |
| 0           | 522,221   | 5%   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With a dataset as large as 10 million domains, there are bound to be formatting inconsistencies that affect accuracy. For example, domains with a &lt;code&gt;www&lt;/code&gt; prefix should ideally be treated the same as those without, yet variations in how URLs are constructed can lead to mismatches. Additionally, some domains serve specific functions, like content delivery networks (CDNs) or API endpoints, which may not have a traditional homepage or may return a &lt;code&gt;404&lt;/code&gt; status by design. This adds a layer of complexity when interpreting accessibility.&lt;/p&gt;
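
&lt;p&gt;To make the normalization point concrete, here is a minimal sketch of the kind of helper that could fold &lt;code&gt;www&lt;/code&gt; variants together before counting. The name &lt;code&gt;normalizeDomain&lt;/code&gt; is hypothetical, and the study did not necessarily clean its data exactly this way.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package crawler

import "strings"

// normalizeDomain is a hypothetical helper: it lowercases a hostname,
// strips a trailing dot, and folds "www.example.com" into "example.com".
func normalizeDomain(host string) string {
    host = strings.ToLower(strings.TrimSpace(host))
    host = strings.TrimSuffix(host, ".")
    return strings.TrimPrefix(host, "www.")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;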

&lt;p&gt;Achieving complete data cleanliness and uniform formatting would require substantial additional processing time. However, with the large volume of data, minor inconsistencies likely constitute around 1% or less of the overall dataset, meaning they don’t significantly affect the final result: &lt;strong&gt;more than a quarter of the top 10 million domains are no longer accessible&lt;/strong&gt;. This suggests that as time passes, your history and contributions on the internet could gradually disappear.&lt;/p&gt;

&lt;p&gt;While the scraper itself completes the task in around 10 minutes, the research, development, and testing required to reach this point took days or even weeks of effort.&lt;/p&gt;

&lt;p&gt;If this research resonates with you, please consider supporting more work like this by sponsoring me on &lt;a href="https://www.patreon.com/tonywang_dev" rel="noopener noreferrer"&gt;Patreon&lt;/a&gt;. Your support fuels the creation of articles and research projects, helping to keep these insights accessible to everyone. Additionally, if you have questions or projects where you could use consultation, feel free to reach out via email.&lt;/p&gt;

&lt;p&gt;The source code for this project is available on &lt;a href="https://github.com/tonywangcn/ten-million-domains" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Please use it responsibly—this is meant for ethical and constructive use, not for overwhelming or abusing servers.&lt;/p&gt;

&lt;p&gt;Thank you for reading, and I hope this research inspires a deeper appreciation for the impermanence of the internet.&lt;/p&gt;

</description>
      <category>domainanalysis</category>
      <category>topdomains</category>
      <category>webcrawler</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>The Architecture of a Web Crawler: Building a Google-Inspired Distributed Web Crawler. Part 1</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Fri, 13 Oct 2023 12:37:00 +0000</pubDate>
      <link>https://dev.to/tonywangca/the-architecture-of-a-web-crawler-building-a-google-inspired-distributed-web-crawler-part-1-87f</link>
      <guid>https://dev.to/tonywangca/the-architecture-of-a-web-crawler-building-a-google-inspired-distributed-web-crawler-part-1-87f</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--njNeOyas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A8xrYFatdSREBw1eSmt7QKA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--njNeOyas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A8xrYFatdSREBw1eSmt7QKA.jpeg" alt="Source: earth.com" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Support me on &lt;a href="https://www.patreon.com/tonywang_dev"&gt;Patreon&lt;/a&gt; to write more tutorials like this!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving digital landscape, accessing and analyzing vast troves of web data has become imperative for businesses and researchers alike. In real-world scenarios, the need for scaling web crawling operations is paramount. Whether it’s dynamic pricing analysis for e-commerce, sentiment analysis of social media trends, or competitive intelligence, the ability to gather data at scale offers a competitive advantage. Our goal is to guide you through the development of a Google-inspired distributed web crawler, a powerful tool capable of efficiently navigating the intricate web of information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Imperative of Scaling: Why Distributed Crawlers Matter
&lt;/h2&gt;

&lt;p&gt;The significance of distributed web crawlers becomes evident when we consider the challenges of traditional, single-node crawling. These limitations encompass issues such as speed bottlenecks, scalability constraints, and vulnerability to system failures. To effectively harness the wealth of data on the web, we must adopt scalable and resilient solutions.&lt;/p&gt;

&lt;p&gt;Ignoring this necessity can result in missed opportunities, incomplete insights, and a loss of competitive edge. For instance, consider a scenario where a retail business fails to employ a distributed web crawler to monitor competitor prices in real-time. Without this technology, they may miss out on adjusting their own prices dynamically to remain competitive, potentially losing customers to rivals offering better deals.&lt;/p&gt;

&lt;p&gt;In the field of academic research, a researcher investigating trends in scientific publications may find that manually collecting data from hundreds of journal websites is not only time-consuming but also prone to errors. A distributed web crawler, on the other hand, could automate this process, ensuring comprehensive and error-free data collection.&lt;/p&gt;

&lt;p&gt;In the realm of social media marketing, timely analysis of trending topics is crucial. Without the ability to rapidly gather data from various platforms, a marketing team might miss the ideal moment to engage with a viral trend, resulting in lost opportunities for brand exposure.&lt;/p&gt;

&lt;p&gt;These examples illustrate how distributed web crawlers are not just convenient tools but essential assets for staying ahead in the modern digital landscape. They empower businesses, researchers, and marketers to harness the full potential of the internet, enabling data-driven decisions and maintaining a competitive edge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the Multifaceted Tech Stack: Kubernetes and More
&lt;/h2&gt;

&lt;p&gt;Our journey into distributed web crawling will be guided by a multifaceted technology stack, carefully selected to address each facet of the challenge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: This powerful orchestrator is the cornerstone of our solution, enabling the dynamic scaling and efficient management of containerized applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golang, Python, NodeJS&lt;/strong&gt;: We chose these programming languages for their strengths in specific components of the crawler, offering a blend of performance, versatility, and developer-friendly features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana and Prometheus&lt;/strong&gt;: These monitoring tools provide real-time visibility into the performance and health of our crawler, ensuring we stay on top of any issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus Exporters&lt;/strong&gt;: Along with Prometheus, exporters capture customized metrics from various services, enhancing our monitoring capabilities of distributed crawlers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELK Stack (Elasticsearch, Logstash, Kibana)&lt;/strong&gt;: This trio constitutes our log analysis toolkit, enabling comprehensive log collection, processing, analysis, and visualization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Preparing Your Development Environment
&lt;/h2&gt;

&lt;p&gt;A robust development environment is the foundation of any successful project. Here, we’ll guide you through setting up the environment for building our distributed web crawler:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1). Install Dependencies&lt;/strong&gt;: We highly recommend using a Unix-like operating system to install the packages listed below. For this demonstration, we will use Ubuntu 22.04.3 LTS.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install -y awscli docker.io docker-compose make kubectl (check https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/ for detailed tutorial about how to install)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2). Configure AWS and Set Up an EKS Cluster&lt;/strong&gt;: Create a dedicated AWS access key, then run &lt;code&gt;aws configure&lt;/code&gt; in the terminal of your development machine, following the tutorial available &lt;a href="https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws configure
AWS Access Key ID [****************3ZL7]: 
AWS Secret Access Key [****************S3Fu]: 
Default region name [us-east-1]: 
Default output format [None]:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After creating a Kubernetes cluster on AWS EKS by following the steps outlined in &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html"&gt;this guide&lt;/a&gt;, it’s time to generate the kubeconfig using the following command.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws eks update-kubeconfig - name distributed-web-crawler
Added new context arn:aws:eks:us-east-1:************:cluster/distributed-web-crawler to /home/ubuntu/.kube/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this point, you can run &lt;em&gt;kubectl get pods&lt;/em&gt; to verify if you can successfully connect to the remote cluster. Sometimes, you may encounter the following error. In such cases, we suggest following this &lt;a href="https://gist.github.com/Zheaoli/335bba0ad0e49a214c61cbaaa1b20306"&gt;tutorial&lt;/a&gt; to debug and resolve the version conflict issue.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods
error: exec plugin: invalid apiVersion "client.authentication.k8s.io/v1alpha1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3). Setting up Redis and MongoDB Instances:&lt;/strong&gt; In a distributed system, a message queue is essential for distributing tasks among workers. Redis was chosen for its rich data structures, such as lists, sets, and strings, which let it serve not only as a message queue but also as a cache and duplication filter; a minimal sketch of that pattern follows the links below. MongoDB was selected for its native scalability as a document database, which avoids the challenge of scaling a database to handle billions or more records in the future. Follow the tutorials below to create a Redis instance and a MongoDB instance, respectively:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Redis: &lt;a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Clusters.Create.html"&gt;https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Clusters.Create.html&lt;/a&gt;&lt;br&gt;
MongoDB: &lt;a href="https://www.mongodb.com/docs/atlas/getting-started/"&gt;https://www.mongodb.com/docs/atlas/getting-started/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
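
&lt;p&gt;For illustration, here is a minimal sketch of how Redis can serve as both queue and duplication filter at once, assuming the go-redis client; the key names and the &lt;code&gt;enqueueIfNew&lt;/code&gt; helper are invented for this example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package crawler

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// Hypothetical key names, for illustration only.
const (
    queueKey = "crawler:jobs"
    seenKey  = "crawler:seen"
)

// enqueueIfNew pushes a URL onto the job queue unless it has been seen
// before, using a Redis set as a simple duplication filter.
func enqueueIfNew(ctx context.Context, rdb *redis.Client, url string) error {
    added, err := rdb.SAdd(ctx, seenKey, url).Result()
    if err != nil || added == 0 { // added == 0 means already seen
        return err
    }
    return rdb.LPush(ctx, queueKey, url).Err()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;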

&lt;p&gt;&lt;strong&gt;4). Lens:&lt;/strong&gt; The most powerful IDE for Kubernetes, allowing you to visually manage your clusters. Once you have it installed, you will see charts like those in the screenshot below. Note that you will need to install a few components to enable real-time CPU and memory usage monitoring for your cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9lNKYbHq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5030/1%2AlJUyqTPE9SuEDpo123nY3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9lNKYbHq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5030/1%2AlJUyqTPE9SuEDpo123nY3A.png" alt="" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Constructing the Initial Project Structure
&lt;/h2&gt;

&lt;p&gt;With your environment set up, it’s time to establish the foundation of the project. An organized and modular project structure is essential for scalability and maintainability. Since this is a demonstration project, I suggest consolidating everything into a monolithic repository for simplicity, instead of splitting it into multiple repositories based on languages, purposes, or other criteria:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&lt;strong&gt;./&lt;/strong&gt;

&lt;p&gt;├── &lt;strong&gt;docker&lt;/strong&gt;&lt;br&gt;
│   ├── &lt;strong&gt;go&lt;/strong&gt;&lt;br&gt;
│   │   └── Dockerfile&lt;br&gt;
│   └── &lt;strong&gt;node&lt;/strong&gt;&lt;br&gt;
│       └── Dockerfile&lt;br&gt;
├── docker-compose.yml&lt;br&gt;
├── &lt;strong&gt;elk&lt;/strong&gt;&lt;br&gt;
│   └── docker-compose.yml&lt;br&gt;
├── &lt;strong&gt;go&lt;/strong&gt;&lt;br&gt;
│   └── &lt;strong&gt;src&lt;/strong&gt;&lt;br&gt;
│       ├── main.go&lt;br&gt;
│       ├── &lt;strong&gt;metric&lt;/strong&gt;&lt;br&gt;
│       │   └── metric.go&lt;br&gt;
│       ├── &lt;strong&gt;model&lt;/strong&gt;&lt;br&gt;
│       │   └── model.go&lt;br&gt;
│       └── &lt;strong&gt;pkg&lt;/strong&gt;&lt;br&gt;
│           ├── &lt;strong&gt;constant&lt;/strong&gt;&lt;br&gt;
│           │   └── constant.go&lt;br&gt;
│           └── &lt;strong&gt;redis&lt;/strong&gt;&lt;br&gt;
│               └── redis.go&lt;br&gt;
├── &lt;strong&gt;k8s&lt;/strong&gt;&lt;br&gt;
│   ├── config.yaml&lt;br&gt;
│   ├── deployment.yaml&lt;br&gt;
│   └── service.yaml&lt;br&gt;
├── makefile&lt;br&gt;
└── &lt;strong&gt;node&lt;/strong&gt;&lt;br&gt;
    └── index.js&lt;/p&gt;

&lt;p&gt;13 directories, 14 files&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Designing the Distributed Crawler Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eG_jF5W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4230/1%2AfTzojPCTgwv_xSqmuCeskQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eG_jF5W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4230/1%2AfTzojPCTgwv_xSqmuCeskQ.png" alt="Architecture of Distributed Crawler. Click to see original image." width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand the architecture of a distributed web crawler, it’s essential to grasp the core components that come together to make the system function seamlessly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1). Worker Nodes:&lt;/strong&gt; These are the cornerstone of our distributed crawler, and we’ll dedicate significant attention to them in the following sections. The Golang crawler handles straightforward, server-side-rendered webpages, while the NodeJS crawler tackles complex webpages using a headless browser, such as Chrome. It’s important to note that a single HTTP request issued from a language like Golang or Python is significantly more resource-efficient (often 10 times or more) than a request made through a headless browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2). Message Queue:&lt;/strong&gt; For its simplicity and remarkable built-in features, we rely on Redis. The inclusion of Bloom filters stands out here; they are invaluable for filtering duplicates among billions of records, offering high performance with minimal resource consumption. A small sketch of that deduplication step follows.&lt;/p&gt;
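
&lt;p&gt;As a rough sketch of the deduplication step, assuming your Redis server has the RedisBloom module loaded (the key name is illustrative): &lt;em&gt;BF.ADD&lt;/em&gt; returns 1 only when the item was not already present, so a single round trip both checks and records a URL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// seen reports whether url was already recorded in the Bloom filter.
// BF.ADD returns 0 when the item probably existed, 1 when it was added.
func seen(ctx context.Context, rdb *redis.Client, url string) (bool, error) {
    added, err := rdb.Do(ctx, "BF.ADD", "crawler:seen", url).Int()
    if err != nil {
        return false, err
    }
    return added == 0, nil
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&amp;amp;redis.Options{Addr: "localhost:6379"})

    for _, u := range []string{"https://example.com", "https://example.com"} {
        dup, err := seen(ctx, rdb, u)
        if err != nil {
            panic(err)
        }
        fmt.Println(u, "duplicate:", dup)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;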

&lt;p&gt;&lt;strong&gt;3). Data Storage:&lt;/strong&gt; A key-value database such as MongoDB works well for storage. However, if you aspire to make your textual data searchable, akin to Google, Elasticsearch is the preferred option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4). Logging:&lt;/strong&gt; Within our ecosystem, the ELK stack shines. We deploy Filebeat as a DaemonSet, so one worker runs on each node to collect and ship logs to Elasticsearch via Logstash. This is a critical aspect of any distributed system, as logs play a pivotal role in debugging issues, crashes, and unexpected behaviors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5). Monitoring:&lt;/strong&gt; Prometheus takes the lead here, enabling us to monitor common metrics like CPU and memory usage per pod or node. With a customized metric exporter, we can also track metrics specific to crawling, such as the real-time status of each crawler, the total number of processed URLs, and crawling rates per hour, and we can set up alerts based on them. Blindly managing a distributed system with numerous instances is not advisable; Prometheus gives us clear insight into the system’s health.&lt;/p&gt;
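
&lt;p&gt;As a minimal sketch of such a customized exporter (the metric name is illustrative), the official &lt;em&gt;client_golang&lt;/em&gt; library lets each worker expose its own counters on a &lt;em&gt;/metrics&lt;/em&gt; endpoint that Prometheus scrapes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// processedURLs counts crawled URLs, labeled by outcome.
var processedURLs = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "crawler_processed_urls_total",
        Help: "Total number of URLs processed, labeled by status.",
    },
    []string{"status"},
)

func main() {
    // A worker would call this after each crawl attempt.
    processedURLs.WithLabelValues("success").Inc()

    // Expose /metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9100", nil))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;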

&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;p&gt;With a strong foundation laid, the series is poised to delve into the technical intricacies of each component. In the upcoming articles, we’ll start to develop the core code of crawlers and extract data from webpages.&lt;/p&gt;

&lt;p&gt;Stay engaged and follow the series closely to gain a comprehensive understanding of building a cutting-edge distributed web crawler. You can find the source code for this project in the GitHub repository &lt;a href="https://github.com/tonywangcn/distributed-web-crawler"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>webcrawler</category>
      <category>go</category>
      <category>distributedsystem</category>
    </item>
    <item>
      <title>How to efficiently scrape millions of Google Businesses on a large scale using a distributed crawler</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 31 Jul 2023 16:46:33 +0000</pubDate>
      <link>https://dev.to/tonywangca/how-to-efficiently-scrape-millions-of-google-businesses-on-a-large-scale-using-a-distributed-crawler-3lkp</link>
      <guid>https://dev.to/tonywangca/how-to-efficiently-scrape-millions-of-google-businesses-on-a-large-scale-using-a-distributed-crawler-3lkp</guid>
      <description>&lt;p&gt;&lt;em&gt;Support me on (Patreon)[&lt;a href="https://www.patreon.com/tonywang_dev"&gt;https://www.patreon.com/tonywang_dev&lt;/a&gt;] to write more tutorials like this!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8"&gt;previous post&lt;/a&gt;, we covered the process of analyzing the network panel of a webpage to identify the relevant RESTful API for scraping the desired data. While this approach works for many websites, some employ techniques like JavaScript encryption that make it difficult to extract valuable information through RESTful APIs alone. This is where a “headless browser” comes in, enabling us to simulate the actions of a real user browsing the website.&lt;/p&gt;

&lt;p&gt;A headless browser is essentially a web browser without a graphical user interface (GUI). It allows automated web browsing and page interaction, providing a means to access and extract information from websites that employ dynamic content and JavaScript encryption. By using a headless browser, we can overcome some of the challenges posed by traditional scraping methods, as it allows us to execute JavaScript, render web pages, and access dynamically generated content.&lt;/p&gt;

&lt;p&gt;Here I will demonstrate the process of creating a distributed crawler with a headless browser, using Google Maps as our target website.&lt;/p&gt;

&lt;p&gt;Throughout my experience, I have explored various headless browser frameworks, such as Selenium, Puppeteer, Playwright, and Chromedp. Among them, I believe that Crawlee stands out as the most powerful tool I have ever used for web scraping purposes. Crawlee is a JavaScript-based library, which means you can easily adapt it to work with other frameworks of your choice, making it highly versatile and flexible for different project requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to list all the businesses in a country&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In general, when using Google Maps to find businesses we want to visit, we typically conduct searches based on the business category type and location. For instance, we may use a keyword like “shop near Holtsville” to locate any shops in a small town in New York. However, a challenge arises when multiple towns share the same name within the same country. To overcome this, Google Maps offers a helpful feature: querying by postal code. Consequently, the initial query can be refined to “shop near 00501,” with 00501 being the postal code of a specific location in Holtsville. This approach provides greater clarity and reduces confusion compared to using town names.&lt;/p&gt;

&lt;p&gt;With this clear path for efficient searches, our next objective is to compile a comprehensive list of all postal codes in the USA. To accomplish this, I used a free postal code database accessible &lt;a href="https://www.unitedstateszipcodes.org/zip-code-database/"&gt;here&lt;/a&gt;. If you happen to know of a better database, leave a comment below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qkzVoWPI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4264/1%2Agi-qjyfD_1YlejkweYQk4A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qkzVoWPI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4264/1%2Agi-qjyfD_1YlejkweYQk4A.png" alt="Snapshot of the postal code list of the US" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have downloaded the postal code list file, we can begin testing its functionality on Google Maps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VBY1hnS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5324/1%2Aw1BtzyC48o7rJXBOWk6vSA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VBY1hnS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5324/1%2Aw1BtzyC48o7rJXBOWk6vSA.png" alt="Search shop near 00501 USA in Google Map" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the keyword shop near 00501 USA in the Google Maps search bar, we can see a list of shops located in Holtsville. Since our aim is to scrape all the businesses from this search, it is essential to retrieve a comprehensive list. To achieve this, we scroll down through the search results until we reach the bottom, where Google Maps displays a clear message: You’ve reached the end of the list. This indicator is our cue to stop scrolling and move on to data extraction, confident that we have gathered every relevant business for the specified location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HPYocz8H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4880/1%2A6MHDwUw61qV-3GXGfbmLcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HPYocz8H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4880/1%2A6MHDwUw61qV-3GXGfbmLcw.png" alt="Scroll down until seeing the message “You’ve reached the end of the list”" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have compiled the list of businesses from Google Maps, we can proceed to extract the detailed information we need from each business entry. This process involves going through the list one by one and scraping relevant data, such as the business’s address, operating hours, phone number, star ratings, number of reviews, and all available reviews.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ajjn9vS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3352/1%2A9UvtNAzQbX5VwJghl8p6VQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ajjn9vS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3352/1%2A9UvtNAzQbX5VwJghl8p6VQ.png" alt="" width="800" height="779"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WFKhDr6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Au6B5KwMwiAiqvgjihGTpkA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WFKhDr6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Au6B5KwMwiAiqvgjihGTpkA.png" alt="" width="800" height="1301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Mmf87cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3336/1%2A_lK_BYoGi1JJhnqMGXKNng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Mmf87cx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3336/1%2A_lK_BYoGi1JJhnqMGXKNng.png" alt="" width="800" height="827"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implementing the code of the Google Maps scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google Maps Businesses scraper&lt;/strong&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

The provided source code mainly focuses on extracting information from Google Maps using CSS selectors, which is relatively straightforward. As spot instances can be terminated at any time, it is essential to handle this situation carefully.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve this issue, we need to implement code that listens for the SIGTERM and SIGINT events. These events indicate that the instance is about to be terminated. When these events are triggered, we should take appropriate actions to backup any pending tasks in the job queue and also preserve the state of any running tasks that haven’t been completed yet.&lt;/p&gt;

&lt;p&gt;By listening to these signals, we can intercept the termination process and ensure that critical data and tasks are not lost. The backup mechanism enables us to store any unfinished work safely, allowing for a seamless continuation of tasks when new instances are launched in the future.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Back up pending tasks and shut down gracefully when the instance
// is about to be terminated (e.g. a spot instance reclaim)
['SIGINT', 'SIGTERM', 'uncaughtException'].forEach(signal =&amp;gt; process.on(signal, async () =&amp;gt; {
 await backupRequestQueue(queue, store, signal) // persist the job queue state
 await crawler.teardown()                       // stop the Crawlee crawler cleanly
 await sleep(200)
 process.exit(1)
}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;2. Google Maps Business Detail Scraper&lt;/strong&gt;&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3. Deployment file for Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Monitoring and Optimizing the performance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As of now, everything with Crawlee appears to be functioning well, except for one critical issue. After running in the Kubernetes (k8s) cluster for approximately one hour, the performance of Crawlee experiences a significant drop, resulting in the extraction of only a few hundred items per hour, whereas initially, it was extracting at a much higher rate. Interestingly, this issue is not encountered when using a standalone container with Docker Compose on a dedicated machine.&lt;/p&gt;

&lt;p&gt;Moreover, while monitoring the cluster, you may observe a drastic decrease in CPU utilization from around 90% to merely 10%, especially if you have metrics-server installed. This unexpected behavior is concerning and requires investigation to identify the underlying cause.&lt;/p&gt;

&lt;p&gt;To address this performance degradation and ensure efficient resource utilization, we leverage the Kubernetes API through &lt;code&gt;client-go&lt;/code&gt;, the Golang SDK for Kubernetes. With it, we can monitor the CPU utilization of every instance in the cluster and automatically terminate instances that exhibit very low CPU utilization after being active for at least 30 minutes.&lt;/p&gt;

&lt;p&gt;Automatically terminating such instances avoids inefficient resource allocation and ensures that underperforming instances do not hamper the overall data extraction process. This proactive approach keeps the cluster healthy and Crawlee operating optimally, delivering consistent and reliable results even in the dynamic and challenging Kubernetes environment.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;p&gt;The provided code addresses low CPU utilization by querying the Kubernetes metrics API to identify underperforming nodes, then terminating the corresponding instances through the AWS Go SDK.&lt;/p&gt;
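
&lt;p&gt;A rough sketch of that query, using the official metrics client (the threshold is illustrative, and the uptime check and AWS termination call are trimmed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
    ctx := context.Background()

    // Runs as a CronJob inside the cluster, authenticated by the
    // ServiceAccount described below.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    core, _ := kubernetes.NewForConfig(cfg)
    metrics, _ := metricsclient.NewForConfig(cfg)

    nodeMetrics, err := metrics.MetricsV1beta1().NodeMetricses().List(ctx, metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, m := range nodeMetrics.Items {
        node, err := core.CoreV1().Nodes().Get(ctx, m.Name, metav1.GetOptions{})
        if err != nil {
            continue
        }
        used := m.Usage.Cpu().MilliValue()
        capacity := node.Status.Capacity.Cpu().MilliValue()
        pct := float64(used) / float64(capacity) * 100

        // Candidate for termination, e.g. below 10% CPU. Checking uptime and
        // calling ec2.TerminateInstances via the AWS Go SDK are omitted here;
        // node.Spec.ProviderID carries the EC2 instance ID.
        if pct &amp;lt; 10 {
            fmt.Printf("low-utilization node %s: %.1f%% (%s)\n", m.Name, pct, node.Spec.ProviderID)
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;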

&lt;p&gt;To ensure the successful implementation of this solution in a Kubernetes (k8s) cluster, additional steps are required. Specifically, we need to create a &lt;strong&gt;ServiceAccount&lt;/strong&gt;, &lt;strong&gt;ClusterRole&lt;/strong&gt;, and &lt;strong&gt;ClusterRoleBinding&lt;/strong&gt; to properly assign the necessary permissions to the &lt;strong&gt;nodes-cleanup-cron-task&lt;/strong&gt;. These permissions are essential for the task to effectively query the relevant Kubernetes resources and perform the required actions.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ServiceAccount&lt;/strong&gt; is responsible for providing an identity to the &lt;strong&gt;nodes-cleanup-cron-task&lt;/strong&gt;, allowing it to authenticate with the Kubernetes API server. The &lt;strong&gt;ClusterRole&lt;/strong&gt; defines a set of permissions that the task requires to interact with the necessary resources, in this case, the metrics API and other Kubernetes objects. Finally, the &lt;strong&gt;ClusterRoleBinding&lt;/strong&gt; connects the &lt;strong&gt;ServiceAccount&lt;/strong&gt; and &lt;strong&gt;ClusterRole&lt;/strong&gt;, granting the task the permissions specified in the &lt;strong&gt;ClusterRole&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By establishing this set of permissions and associations, we ensure that the &lt;strong&gt;nodes-cleanup-cron-task&lt;/strong&gt; can access and query the metrics API and other Kubernetes resources, effectively identifying nodes with low CPU utilization and terminating instances using the AWS Go SDK.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
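
&lt;p&gt;For reference, a minimal sketch of these three objects might look like the following; the names, namespace, and rule scope are illustrative, and you should tighten them to exactly what your task needs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: nodes-cleanup-cron-task
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nodes-cleanup-cron-task
rules:
  # Read node objects for capacity and provider IDs.
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  # Read node usage from the metrics API.
  - apiGroups: ["metrics.k8s.io"]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nodes-cleanup-cron-task
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nodes-cleanup-cron-task
subjects:
  - kind: ServiceAccount
    name: nodes-cleanup-cron-task
    namespace: default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;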


&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At this stage, the majority of the code is complete, and you have the capability to deploy it on any cloud server with Kubernetes (k8s). This flexibility allows you to scale the application effortlessly, expanding the number of instances as needed to meet your specific requirements.&lt;/p&gt;

&lt;p&gt;One of the key advantages of the design lies in its termination tolerance. With the implemented safeguards to handle &lt;strong&gt;SIGTERM&lt;/strong&gt; and &lt;strong&gt;SIGINT&lt;/strong&gt; events, you can deploy spot instances without concerns about potential data loss. Even when spot instances are terminated unexpectedly, the application gracefully manages the data in the job queue and running tasks.&lt;/p&gt;

&lt;p&gt;By leveraging this termination tolerance feature, the application can handle spot instance terminations smoothly. This ensures that any pending tasks in the job queue are backed up safely and that the state of running tasks, which haven’t completed yet, is preserved. Consequently, you can rest assured that the integrity of your data and tasks will be maintained throughout the operation.&lt;/p&gt;

&lt;p&gt;Deploying the application with Kubernetes and taking advantage of termination tolerance empowers you to scale the Google Maps scraper efficiently, managing numerous instances to meet your data extraction needs effectively. The combination of Kubernetes and the termination tolerance design enhances the overall robustness and reliability of the application, allowing for seamless operation even in the dynamic and unpredictable cloud environment. If you have any questions regarding this article or any suggestions for future articles, please leave a comment below. Additionally, I am available for remote work or contracts, so please feel free to reach out to me via email.&lt;/p&gt;

</description>
      <category>googlemap</category>
      <category>crawler</category>
      <category>k8s</category>
      <category>javascript</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Building a Scalable Distributed Crawler for Scraping Millions of Top TikTok Profiles</title>
      <dc:creator>Tony Wang</dc:creator>
      <pubDate>Mon, 12 Jun 2023 04:54:05 +0000</pubDate>
      <link>https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8</link>
      <guid>https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8</guid>
      <description>&lt;p&gt;&lt;em&gt;Support me on (Patreon)[&lt;a href="https://www.patreon.com/tonywang_dev" rel="noopener noreferrer"&gt;https://www.patreon.com/tonywang_dev&lt;/a&gt;] to write more tutorials like this!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
In this tutorial, we will walk you through the process of building a distributed crawler that can efficiently scrape millions of top TikTok profiles. Before we embark on this tutorial, it is crucial to have a solid grasp of fundamental concepts like &lt;strong&gt;web scraping&lt;/strong&gt;, &lt;strong&gt;the Golang programming language&lt;/strong&gt;, &lt;strong&gt;Docker, and Kubernetes (k8s)&lt;/strong&gt;. Additionally, being familiar with essential libraries such as Golang Colly for efficient web scraping and Golang Gin for building powerful APIs will greatly enhance your learning experience. By following this tutorial, you will gain insight into building a scalable and distributed system to extract profile information from TikTok.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developing a Deeper Understanding of the Website You Want to Scrape.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before delving into writing the code, it is imperative to thoroughly analyze and understand the structure of TikTok’s website. To facilitate this process, we recommend the convenient “&lt;strong&gt;Quick Javascript Switcher&lt;/strong&gt;” Chrome plugin, available &lt;a href="https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This invaluable tool lets you disable and re-enable JavaScript with a single mouse click. By doing so, we aim to optimize our scraping workflow, increase efficiency, and minimize costs by reducing reliance on JavaScript rendering.&lt;/p&gt;

&lt;p&gt;Upon disabling JavaScript using the plugin, we will focus our attention on TikTok’s profile page — the specific page we aim to scrape. Analyzing this page thoroughly will enable us to gain a comprehensive understanding of its underlying structure, crucial elements, and relevant data points. By examining the HTML structure, identifying key tags and attributes, and inspecting the network requests triggered during page loading, we can unravel the essential information we seek to extract.&lt;/p&gt;

&lt;p&gt;Furthermore, by scrutinizing the structure and behavior of TikTok’s profile page without the interference of JavaScript, we can ensure our scraper’s efficiency and effectiveness. Bypassing the rendering of JavaScript code allows us to directly target the necessary HTML elements and retrieve the desired data swiftly and accurately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARGykTHcs0MvsTS4kMjkVJw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARGykTHcs0MvsTS4kMjkVJw.png" alt="the Network of Requests in TikTok Profile Page with JavaScript Enabled"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine visiting a TikTok profile, such as &lt;a href="https://www.tiktok.com/@linisflorez09" rel="noopener noreferrer"&gt;https://www.tiktok.com/@linisflorez09&lt;/a&gt;, with JavaScript enabled. You would witness approximately 300 requests being made, transferring a whopping 10MB of data. Loading the entire page, complete with CSS style files, JavaScript files, images, and videos, takes roughly 5 seconds. &lt;strong&gt;Now, let’s put this into perspective: if we aim to scrape millions of data records, the total number of requests would skyrocket into the billions, and the data transferred would amount to over ten terabytes.&lt;/strong&gt; And that’s not even factoring in the computing resources consumed by headless Chrome instances. Analyzing the page upfront not only streamlines the scraping process but also helps avoid unnecessary expenses, ultimately saving you, your boss, or your customers substantial amounts of money.&lt;/p&gt;

&lt;p&gt;It is crucial to acknowledge the monumental task at hand when dealing with such large-scale data scraping operations. By investing time and effort into analyzing the webpage upfront, we can discover innovative ways to extract the desired data while minimizing the number of requests, reducing data transfer size, and optimizing resource utilization. This strategic approach ensures that our scraping process is not only efficient but also cost-effective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2Aj0LBFNb-Wk0EjmMA0IKJSA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2Aj0LBFNb-Wk0EjmMA0IKJSA.png" alt="TikTok Profile Page with JavaScript Disabled"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementing the Code for Scraping TikTok Profiles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When it comes to scraping TikTok’s profile page, the Golang built-in &lt;em&gt;net/http&lt;/em&gt; package provides a reliable solution for making HTTP requests. If you prefer a more straightforward approach without the need for callback features like &lt;em&gt;OnError&lt;/em&gt; and &lt;em&gt;OnResponse&lt;/em&gt; offered by Golang Colly, &lt;em&gt;net/http&lt;/em&gt; is a suitable choice.&lt;/p&gt;

&lt;p&gt;Below, you’ll find a code snippet to guide you in building your TikTok profile scraper. However, certain parts of the code are intentionally omitted to prevent potential misuse, such as sending an excessive number of requests to the TikTok platform. &lt;strong&gt;It’s crucial to adhere to ethical scraping practices and respect the platform’s terms of service&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To extract information from HTML pages using CSS selectors in Golang, various tutorials and resources are available that demonstrate the use of libraries like goquery. Exploring these resources will provide you with comprehensive guidance on extracting specific data points from HTML pages.&lt;/p&gt;

&lt;p&gt;Please note that the provided code snippet is meant for reference. Ensure that you modify and augment it as per your requirements and adhere to responsible data scraping practices.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
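
&lt;p&gt;To illustrate the &lt;em&gt;net/http&lt;/em&gt; plus goquery pattern in isolation, here is a generic sketch; the URL and selector are placeholders, not TikTok’s actual markup, which changes frequently:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "fmt"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    req, err := http.NewRequest("GET", "https://example.com", nil)
    if err != nil {
        panic(err)
    }
    // A realistic User-Agent helps avoid trivial bot blocking.
    req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        panic(err)
    }
    // Extract whatever the CSS selector matches; "title" is a placeholder.
    fmt.Println(doc.Find("title").Text())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;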


&lt;p&gt;&lt;strong&gt;Discovering the Entry Points for Popular Videos and Profiles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By now, we have completed the TikTok profile scraper. However, there’s more to explore. How can we find millions of top profiles to scrape? That’s precisely what I’ll discuss next.&lt;/p&gt;

&lt;p&gt;If you visit the TikTok homepage at &lt;a href="https://www.tiktok.com/" rel="noopener noreferrer"&gt;https://www.tiktok.com/&lt;/a&gt;, you’ll notice four sections on the top left: &lt;em&gt;For You&lt;/em&gt;, &lt;em&gt;Following&lt;/em&gt;, &lt;em&gt;Explore&lt;/em&gt;, and &lt;em&gt;Live&lt;/em&gt;. Clicking on the &lt;em&gt;For You&lt;/em&gt; and &lt;em&gt;Explore&lt;/em&gt; sections will yield random popular videos each time. Hence, these two sections serve as entry points for us to discover a vast number of viral videos. Let’s analyze them individually:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explore Page&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we navigate to the explore page, it’s advisable to clean up the network section of DevTools for better clarity before proceeding with any further operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3962%2F1%2AU8YGJWarysBzTRe75o7oiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3962%2F1%2AU8YGJWarysBzTRe75o7oiw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure accurate filtering of requests, remember to select the &lt;em&gt;Fetch/XHR&lt;/em&gt; option. This selection will exclude any requests that are not made by JavaScript from the frontend. Once you have everything set up, proceed by scrolling down the &lt;em&gt;explore&lt;/em&gt; page. As you do so, TikTok will continue recommending viral videos based on factors such as your country and behavior. Simultaneously, keep a close eye on the network panel. Your goal is to locate the specific request containing the keyword “explore” among the numerous requests being made.&lt;/p&gt;

&lt;p&gt;Initially, it may not be immediately clear which exact request to focus on. Take your time and carefully inspect each request. We are looking for the request that returns essential information, such as author details, video content, view count, and other relevant data. Although the inspection process may require some patience, it is definitely worth the effort.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2076%2F1%2Arvy0HaNQwk2cgYu_RDW2Pg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2076%2F1%2Arvy0HaNQwk2cgYu_RDW2Pg.png" alt="The response of a request from explore page."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Continuing with the process, scroll down the explore page to explore more viral videos tailored to your country, behavior, and other factors. As you delve deeper, among the numerous requests being made, you will eventually come across a specific request containing the keyword &lt;em&gt;explore&lt;/em&gt;. This particular request is the one we are searching for to extract the desired data. To proceed, right-click on this request and select the option &lt;em&gt;Copy as cURL&lt;/em&gt;, as illustrated in the accompanying screenshot. By choosing this option, you can capture the request details in the form of a cURL command, which will serve as a valuable resource for further analysis and integration into your scraping workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4428%2F1%2Ag85tP6sIqaE4tqGO_dFjuQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4428%2F1%2Ag85tP6sIqaE4tqGO_dFjuQ.png" alt="Scroll down the explore page until you find the correct request."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9mx1hDaGkbVvBakcFkPxAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A9mx1hDaGkbVvBakcFkPxAg.png" alt="Copy the request as cURL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the previously identified request, we can import it into Postman to simulate the same request. Upon clicking the “Send” button, we should receive a similar response. This indicates that the request does not require the bothersome &lt;em&gt;CSRF&lt;/em&gt; token for encryption and can be sent multiple times to obtain different results.&lt;/p&gt;

&lt;p&gt;To further explore the request, we will examine it in Postman. Within the Params and Headers panel, you have the option to uncheck various boxes and then click the &lt;em&gt;Send&lt;/em&gt; button. By doing so, you can verify if the response is successfully returned without including specific parameters. If the response is indeed returned, it implies that the corresponding parameter can be omitted in further development and requests. This step allows us to determine which parameters are required and which ones can be excluded for more efficient scraping.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2AQuxn9DkOf0nvR0NXlONntw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2746%2F1%2AQuxn9DkOf0nvR0NXlONntw.png" alt="Import the cURL from above step to Postman, and click *Send* button"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before diving into the code implementation, there is an essential piece of information we need to acquire — the category IDs. On the explore page, you will find a variety of categories displayed at the top, including popular ones like &lt;em&gt;Dance and Music&lt;/em&gt;, &lt;em&gt;Sports&lt;/em&gt;, and &lt;em&gt;Entertainment&lt;/em&gt;. These categories play a crucial role in targeting specific types of content for scraping.&lt;/p&gt;

&lt;p&gt;To proceed, we will follow a similar approach as mentioned earlier. Begin by cleaning up the network session to enhance clarity and ensure a focused analysis. Then, systematically click on each category button, one by one, and observe the value of the &lt;em&gt;categoryType&lt;/em&gt; parameter associated with each request. By examining the &lt;em&gt;categoryType&lt;/em&gt; values, we can identify the corresponding IDs for each category.&lt;/p&gt;

&lt;p&gt;This step is vital as it enables us to tailor our scraping process to specific categories of interest. By retrieving the relevant category IDs, we can precisely target the desired content and extract the necessary data. So, take your time to explore and document the category IDs, as it will significantly enhance the effectiveness of your scraping implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3796%2F1%2AWYcmkzQFjy3l_xLm5s_b8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3796%2F1%2AWYcmkzQFjy3l_xLm5s_b8g.png" alt="Click the second section *Sports* and find the corresponding **categoryType** of **Sports**"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the end, after performing the necessary analysis, we will compile a comprehensive map that associates each category type with its unique ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var categoryTypeMap = map[string]string{
    "1":  "comedy &amp;amp; drama",
    "2":  "dance &amp;amp; music",
    "3":  "relationship",
    "4":  "pet &amp;amp; nature",
    "5":  "lifestyle",
    "6":  "society",
    "7":  "fashion",
    "8":  "entertainment",
    "10": "informative",
    "11": "sport",
    "12": "auto",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;At this point, we have almost completed the analysis of the explore page and are ready to begin the code implementation phase. To simplify the process and save time, several online services can convert JSON data into Go struct format. One that I highly recommend is &lt;a href="https://mholt.github.io/json-to-go/" rel="noopener noreferrer"&gt;https://mholt.github.io/json-to-go/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This convenient tool allows us to paste the JSON response obtained from the explore page and automatically generates the corresponding Go struct representation. By utilizing this service, we can effortlessly convert the retrieved JSON data into structured Go objects, which will greatly facilitate data manipulation and extraction in our code.&lt;/p&gt;
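
&lt;p&gt;For instance, a heavily trimmed struct for an explore-style response might come out looking like this; the field names are hypothetical and depend on the actual payload you paste in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ExploreResponse is a trimmed, hypothetical example of what json-to-go
// produces; keep only the fields your project actually needs.
type ExploreResponse struct {
    ItemList []struct {
        Author struct {
            ID       string `json:"id"`
            UniqueID string `json:"uniqueId"`
        } `json:"author"`
        Stats struct {
            DiggCount int `json:"diggCount"`
            PlayCount int `json:"playCount"`
        } `json:"stats"`
    } `json:"itemList"`
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;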

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4406%2F1%2APT7aTUShJqI6md_fv9u0ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F4406%2F1%2APT7aTUShJqI6md_fv9u0ww.png" alt="Copy the JSON response from the Postman response to any online *JSON to Go struct* website, and convert it to Go struct for later use."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The criteria I have set for determining popular profiles on TikTok is based on two factors: the number of likes on their content and the number of followers they have. Specifically, I consider a profile to be popular if they have any content with at least 250K likes or if they have accumulated at least 10K followers. These thresholds help identify profiles that have gained significant attention and engagement on the platform.&lt;/p&gt;

&lt;p&gt;The key information I aim to extract from these popular profiles includes their unique identifier (ID), which serves as an input variable for scraping profile details, and their follower count, which provides insight into their audience reach and influence. Additionally, I am interested in capturing the “digg” count of their videos, which represents the number of times users have interacted with and appreciated their content. These metrics offer a valuable basis for assessing the popularity and impact of TikTok profiles.&lt;/p&gt;

&lt;p&gt;It is worth noting that while the above-mentioned information is essential for my specific project, you have the flexibility to customize and retain any additional data that aligns with the requirements and objectives of your own undertaking. This allows you to tailor the scraping process to suit your unique needs and extract the most relevant information for your analysis or application.&lt;/p&gt;
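
&lt;p&gt;Expressed in code, the popularity filter described above is tiny; this is a sketch that assumes the two counts have already been extracted:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// isPopular implements the thresholds described above: any video with at
// least 250K diggs, or an account with at least 10K followers.
func isPopular(maxDiggCount, followerCount int) bool {
    return maxDiggCount &amp;gt;= 250000 || followerCount &amp;gt;= 10000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;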


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;For the parameters inside the &lt;em&gt;getUrl&lt;/em&gt; function, you have the flexibility to remove or customize any specific parameters based on the analysis we conducted earlier. This allows you to fine-tune the request and retrieve more accurate results from the &lt;em&gt;explore&lt;/em&gt; response. In this demonstration, I have chosen to keep all the parameters as they are, except for &lt;em&gt;categoryType&lt;/em&gt;, which I have left as a variable. This approach will enable us to scrape data from all categories, providing a comprehensive view of the TikTok profiles we intend to extract.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Building an API service to monitor scraper stats&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By now, we have completed the majority of the TikTok scraper. As we are utilizing Redis as the message queue to store tasks, it is crucial to monitor key statistics to ensure the smooth functioning of the scraper. We need to track metrics such as the number of times each category has been scraped, the count of successes and failures, and the remaining tasks in the job queue. To achieve this, it is necessary to build a service that offers an API endpoint for querying the statistics information at any time. Additionally, to safeguard sensitive stats, it is advisable to secure the endpoints, implementing appropriate authentication and authorization measures. This will ensure that only authorized individuals can access the scraper’s monitoring API and maintain the confidentiality of the collected data.&lt;/p&gt;
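
&lt;p&gt;A minimal sketch of such an endpoint with Gin might look like the following, with basic auth guarding the route; the stat names, credentials, and Redis keys are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "context"
    "net/http"

    "github.com/gin-gonic/gin"
    "github.com/redis/go-redis/v9"
)

func main() {
    rdb := redis.NewClient(&amp;amp;redis.Options{Addr: "localhost:6379"})

    r := gin.Default()
    // Protect the stats endpoint with basic auth.
    authorized := r.Group("/", gin.BasicAuth(gin.Accounts{"admin": "change-me"}))

    authorized.GET("/stats", func(c *gin.Context) {
        ctx := context.Background()
        // Errors are ignored for brevity; missing keys read as zero.
        pending, _ := rdb.LLen(ctx, "crawler:tasks").Result() // tasks left in the queue
        success, _ := rdb.Get(ctx, "stats:success").Int64()   // successful scrapes
        failure, _ := rdb.Get(ctx, "stats:failure").Int64()   // failed scrapes
        c.JSON(http.StatusOK, gin.H{
            "pending": pending,
            "success": success,
            "failure": failure,
        })
    })
    r.Run(":8080")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;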

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOunpTZMN5P4yDU2sSNi83Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AOunpTZMN5P4yDU2sSNi83Q.png" alt="Scraper statistics returned through API endpoint"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we are going to complete the final part of the code, which is the main function. To simplify the deployment process, we will compile all the Golang code into a single binary file and package it into a Docker image. However, a question arises: How can we deploy different services, such as the profile scraper, explore scraper, and API service, with different numbers of replicas?&lt;/p&gt;

&lt;p&gt;To address this challenge, we will use the main function with different arguments when running the &lt;em&gt;tiktok-crawler&lt;/em&gt; binary. By modifying the &lt;code&gt;workerMap&lt;/code&gt;, we can add as many different types of workers as we need to expand the functionality. For example, for the profile scraper, we may require 20 workers and 3 replicas, while for the explore scraper, we may need 40 workers and 4 replicas. The flexibility of the main function allows us to configure the desired number of workers for each scraper. By default, we set the number of workers for each scraper to 20.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
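
&lt;p&gt;The dispatch pattern itself is small; a sketch might look like this, where the worker functions are stand-ins for the real scrapers and server:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "flag"
    "log"
)

// Stand-ins for the real workers; each takes a concurrency level.
func runProfileScraper(workers int) { /* ... */ }
func runExploreScraper(workers int) { /* ... */ }
func runAPIServer(workers int)      { /* ... */ }

var workerMap = map[string]func(int){
    "profile": runProfileScraper,
    "explore": runExploreScraper,
    "server":  runAPIServer,
}

func main() {
    role := flag.String("role", "profile", "which worker this container runs")
    workers := flag.Int("workers", 20, "number of concurrent workers") // 20 by default, as above
    flag.Parse()

    run, ok := workerMap[*role]
    if !ok {
        log.Fatalf("unknown role %q", *role)
    }
    run(*workers)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;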


&lt;p&gt;&lt;strong&gt;Building a Docker Image and Deploying it into a Kubernetes Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the Dockerfile that enables us to build the binary file and package it into a Docker image, which can then be deployed into a Kubernetes (k8s) cluster.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Before deploying the code into a Kubernetes (k8s) cluster, it’s advisable to test the functionality of both the code and the Docker image locally using Docker Compose. Docker Compose allows us to define and manage multi-container applications. In this case, we can use the provided &lt;em&gt;docker-compose.yml&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;By running the command &lt;em&gt;docker-compose up --scale tiktok-profile=3 --scale tiktok-server=1 --scale tiktok-explore=5 -d&lt;/em&gt;, you can launch multiple instances of the desired services. This command lets you scale the number of replicas for each service up or down as needed, ensuring that the services, such as &lt;em&gt;tiktok-profile&lt;/em&gt;, &lt;em&gt;tiktok-server&lt;/em&gt;, and &lt;em&gt;tiktok-explore&lt;/em&gt;, are properly orchestrated and running concurrently.&lt;/p&gt;

&lt;p&gt;Testing the code and Docker image locally with Docker Compose allows for a comprehensive evaluation of the application’s behavior and performance before deploying it into the production Kubernetes cluster. It helps ensure that the application functions as expected and can handle the desired scaling requirements.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;After executing the provided command, you will observe that the specified number of profile scrapers, explore scrapers, and API servers are successfully launched and operational.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5648%2F1%2AfEyANNQPS6H3mHY6zTSIZQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F5648%2F1%2AfEyANNQPS6H3mHY6zTSIZQ.png" alt="Running scraper services locally with docker-compose"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying the Scraper to Kubernetes Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything is prepared for the next stage, which involves deploying the application to a Kubernetes (k8s) cluster. Below is a sample k8s deployment file for your reference. You have the flexibility to customize the number of replicas for the scrapers and adjust the parameters for the scraper command as needed. It is important to note that the value for &lt;em&gt;alb.ingress.kubernetes.io/subnets&lt;/em&gt; in the Ingress controller should be set according to the subnets associated with your k8s cluster during its creation. This ensures proper networking configuration for the Ingress controller.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;To optimize cost while running the scraper, it is recommended to utilize &lt;em&gt;Spot Instances&lt;/em&gt; when adding a new node group. Spot Instances offer a significant cost advantage, as they are typically priced 20%-90% lower than On-Demand instances. Since the scraper is designed to be stateless and can be terminated at any time, Spot Instances are suitable for this use case. By leveraging Spot Instances, you can achieve substantial cost savings while maintaining the required functionality of the scraper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARyRtKo3t6QNl2dApeGZiCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ARyRtKo3t6QNl2dApeGZiCQ.png" alt="Set compute and scaling configuration for new node group"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the node group has been successfully created and the state of the nodes has changed to &lt;em&gt;ready&lt;/em&gt;, you are ready to deploy the scraper using the command &lt;strong&gt;&lt;em&gt;kubectl apply -f deployment.yaml&lt;/em&gt;&lt;/strong&gt;. This command will apply the configurations specified in the deployment file to the Kubernetes cluster. It will ensure that the desired number of replicas for the scraper services are up and running.&lt;/p&gt;

&lt;p&gt;One of the advantages of using Kubernetes is its flexibility in scaling the number of replicas. You can easily adjust the number of workers that should be running at any given time by updating the deployment configuration. This allows you to scale up or down the number of scraper workers based on the workload or performance requirements.&lt;/p&gt;

&lt;p&gt;By executing the appropriate &lt;em&gt;kubectl&lt;/em&gt; commands, you have the flexibility to manage and control the deployment of the scraper services within the Kubernetes cluster, ensuring optimal performance and resource utilization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2578%2F1%2AIvUfdv7AKgtY_o3Mjb36Iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2578%2F1%2AIvUfdv7AKgtY_o3Mjb36Iw.png" alt="Nodes state page in AWS Kubernetes cluster"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on my extensive experience with the scraper, I have observed that the initial speed can reach an impressive rate of up to 1 million records per day when using the criteria I have set. However, it’s important to note that as time progresses, the speed may gradually decrease to a few thousand records per day. This decline occurs due to the nature of the explore page, where many of the popular contents have been created months ago. As we continue to scrape more profiles, we naturally cover a significant portion of the popular ones. Consequently, it becomes increasingly challenging to discover new viral content.&lt;/p&gt;

&lt;p&gt;Considering this, it is advisable to consider temporarily halting the scraper for a few weeks or even longer. By pausing the scraping process, you allow time for new viral content to emerge and accumulate. Once a sufficient period has passed, restarting the scraper will help maintain efficiency and optimize costs, as you will be able to focus on capturing the latest popular profiles and videos.&lt;/p&gt;

&lt;p&gt;With the successful completion of the TikTok scraper and its deployment in a distributed system using Kubernetes, we have achieved a robust and scalable solution. The combination of scraping techniques, data processing, and deployment infrastructure has allowed us to harness the full potential of TikTok’s platform. &lt;strong&gt;&lt;em&gt;If you have any questions regarding this article or any suggestions for future articles, I encourage you to leave a comment below. Additionally, I am available for remote work or contracts, so please feel free to reach out to me via email.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>tiktok</category>
      <category>crawler</category>
      <category>go</category>
      <category>k8s</category>
    </item>
  </channel>
</rss>
