<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Theo Vasilis</title>
    <description>The latest articles on DEV Community by Theo Vasilis (@theovasilis1).</description>
    <link>https://dev.to/theovasilis1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026677%2F65c49bec-901b-42a8-9dab-71b1e8ce3c12.jpg</url>
      <title>DEV Community: Theo Vasilis</title>
      <link>https://dev.to/theovasilis1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/theovasilis1"/>
    <language>en</language>
    <item>
      <title>Groupon reaches new merchants thanks to web data collection</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Mon, 04 Dec 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/groupon-reaches-new-merchants-thanks-to-web-data-collection-bb5</link>
      <guid>https://dev.to/apify/groupon-reaches-new-merchants-thanks-to-web-data-collection-bb5</guid>
      <description>&lt;p&gt;&lt;a href="https://www.groupon.com/" rel="noopener noreferrer"&gt;Groupon&lt;/a&gt; (NASDAQ: &lt;a href="https://www.nasdaq.com/market-activity/stocks/grpn" rel="noopener noreferrer"&gt;GRPN&lt;/a&gt;) is the worlds most popular marketplace to find deals for activities, travel, goods, and services offered by local merchants in hundreds of cities around the globe. Groupon, originally meant as "group" + "coupon, was founded on the idea that the collective bargaining power of a large number of people can get them better deals than they could get individually.&lt;/p&gt;

&lt;p&gt;In March 2023, Dušan Šenkypl from &lt;a href="https://palefirecapital.com/" rel="noopener noreferrer"&gt;Pale Fire Capital&lt;/a&gt; became Groupon's new interim CEO and set an ambitious goal: rapidly expand the business by reaching new merchants and thus offering more deals to consumers. Recognizing the potential of web data to find new leads and enrich existing ones, Šenkypl turned to &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; for its expertise in web data collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4eu0vqc3kntsp963sjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4eu0vqc3kntsp963sjw.jpg" alt="Groupon is using web data collection for smart lead generation at scale" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Groupon is using web data collection for smart lead generation at scale&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Groupon was looking for a way to update information about existing merchants, as well as find new ones to invite to join the network. Such information can be found on search engines, travel sites, online maps, and various other websites.&lt;/p&gt;

&lt;p&gt;The web data-based lead generation and enrichment pipeline had to provide accurate and up-to-date data about tens of thousands of businesses and seamlessly integrate into Groupon's existing Salesforce CRM platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The solution&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apify operates a cloud platform that provides serverless computation, data storage, proxies, open-source SDKs, and hundreds of &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;ready-made web scraping Actors&lt;/a&gt; built by community developers. Apify's Enterprise solutions team helped Groupon set up various Actors to extract the required data and run them at scale in the cloud.&lt;/p&gt;

&lt;p&gt;To ensure the data fits into Groupons specific Salesforce implementation, Apify built a new Actor to filter, organize, and match the business data. Thanks to the modularity of the Apify platform, this custom solution was prepared in a very short time, helping Groupon reach new merchants faster than with other solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The outcome&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Groupon's sales team now has a rich database of potential leads right at their fingertips. The automation of the entire data journey, from extraction to integration, translated into significant time savings, heightened efficiency, and, ultimately, a stronger position within the e-commerce space.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We selected Apify because of their vast experience with web data collection. The project has been delivered on a short schedule, and our sales teams are now empowered with fresh, unique leads that drive targeted campaigns and strategic outreach."&lt;/p&gt;

&lt;p&gt;Filip Popovic, SVP Transformation &amp;amp; Product &amp;amp; HR at Groupon&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical details&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The solution was composed of the following parts:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Configuring existing Actors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data extraction process commenced with a custom-designed Actor, &lt;strong&gt;&lt;em&gt;New Leads Runner&lt;/em&gt;&lt;/strong&gt;, delivered by Apify to fine-tune Groupon's search criteria and ensure that the data sourced from other Actors is as relevant and targeted as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Mining business information&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After precise input preparation, Apify could pinpoint and collate business information aligning with Groupon's focus areas. This phase was not just about gathering data legally and ethically but doing so in a way that adhered to Groupon's stringent quality standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Ensuring data quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data duplication can be a significant issue when handling vast amounts of information. Thanks to Apify's &lt;a href="https://apify.com/lukaskrivka/dedup-datasets" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Merge, Dedup, and Transform Datasets&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; Actor, we could ensure each business entry was unique by eliminating duplicates and containing the most relevant information by merging attributes from various sources.&lt;/p&gt;
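&lt;p&gt;To illustrate the idea behind deduplication and merging (this is a simplified sketch in plain Python, not the Actor's actual implementation, and the field names are hypothetical):&lt;/p&gt;

```python
# Minimal sketch of dedup-and-merge: business records from several sources
# are keyed by a stable field and their attributes combined, keeping the
# first non-empty value seen for each attribute.

def dedup_and_merge(records, key="website"):
    merged = {}
    for rec in records:
        k = rec.get(key)
        if k is None:
            continue
        if k not in merged:
            merged[k] = dict(rec)
        else:
            for field, value in rec.items():
                # Fill in attributes the earlier source was missing.
                if value and not merged[k].get(field):
                    merged[k][field] = value
    return list(merged.values())
```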

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Integrating data with Salesforce&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the lead generation pipeline was producing clean data, the next step was to integrate it into Groupon's existing CRM. With another custom-built Actor, &lt;strong&gt;&lt;em&gt;Salesforce Uploader&lt;/em&gt;&lt;/strong&gt;, Groupon could transfer its newfound leads into Salesforce. The uploader also cross-references the new data with existing entries to ensure that only new businesses are added.&lt;/p&gt;
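&lt;p&gt;The cross-referencing step can be pictured as follows (a hypothetical sketch, not the Uploader's actual code; the "website" key is an assumed matching field):&lt;/p&gt;

```python
# Keep only the scraped leads whose matching key is absent from the CRM,
# so existing Salesforce entries are never duplicated.

def select_new_leads(scraped_leads, existing_crm_keys, key="website"):
    existing = set(existing_crm_keys)
    return [lead for lead in scraped_leads if lead.get(key) not in existing]
```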

&lt;h2&gt;
  
  
  &lt;strong&gt;Who are Groupon and Apify?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.groupon.com/" rel="noopener noreferrer"&gt;Groupon&lt;/a&gt; (NASDAQ: GRPN) is a global e-commerce marketplace based in Chicago that connects subscribers with local merchants.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; is a full-stack &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and browser automation platform. In addition to its vast range of pre-built data extraction tools, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;enterprise solutions&lt;/a&gt; with its team of experts who know how to handle the challenges of collecting data from arbitrary websites at scale.&lt;/p&gt;

</description>
      <category>casestudy</category>
      <category>webscraping</category>
      <category>data</category>
    </item>
    <item>
      <title>Web scraping for machine learning</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 26 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-for-machine-learning-3834</link>
      <guid>https://dev.to/apify/web-scraping-for-machine-learning-3834</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;, a cloud platform that helps you build reliable web scrapers fast and automate anything you can do manually in a web browser. This article on web scraping for machine learning was inspired by our work on&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;collecting data for AI and ML applications&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At its simplest, &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; is the automated extraction of data from websites. It is closely related to &lt;a href="https://blog.apify.com/what-are-web-crawlers-and-how-do-they-work/" rel="noopener noreferrer"&gt;web crawling&lt;/a&gt;, but the two differ in purpose: crawling is about finding and discovering web links, while scraping is focused on extracting the data those pages contain.&lt;/p&gt;

&lt;p&gt;Initially, web scraping was a manual, cumbersome process, but with technological advances being what they are, it has become an automated, sophisticated practice. Web scrapers can navigate websites, understand their structure, and extract specific information based on predefined criteria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping 101: learn the basics&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is web scraping used in machine learning?&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;In most cases, you can't build high-quality predictive models with just internal data.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asif Syed, Vice President of Data Strategy, Hartford Steam Boiler&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ability to harvest and process data from a myriad of web sources is what makes web scraping indispensable for machine learning. Web scraping isn't just about accessing the data but transforming it from the unstructured format of web pages into &lt;a href="https://blog.apify.com/when-data-gets-too-big-why-you-need-structured-data/" rel="noopener noreferrer"&gt;structured&lt;/a&gt; datasets that can be efficiently used in machine learning algorithms.&lt;/p&gt;

&lt;p&gt;You can't teach a machine to make predictions or carry out tasks based on data unless you have an awful lot of data to train it. From social media analytics to competitive market research, web scraping enables the gathering of diverse datasets to teach machines, such as today's so-called 'AI models', and provide them with a rich and nuanced understanding of the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparing data collection methods for machine learning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are multiple ways to collect data for machine learning. These range from traditional surveys and manually curated databases to cutting-edge techniques that utilize IoT devices. So, why choose web scraping over other methods of data acquisition?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surveys:&lt;/strong&gt; They can provide highly specific data but often suffer from biases and limited scope.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Databases:&lt;/strong&gt; These offer structured information, yet they may lack the real-time aspect essential for certain machine learning applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IoT devices:&lt;/strong&gt; These bring in a wave of real-time, sensor-based data, but they're constrained by the type and quantity of data they can collect. It's worth noting that implementing &lt;a href="https://cedalo.com/blog/mqtt-authentication-and-authorization-on-mosquitto/" rel="noopener noreferrer"&gt;MQTT authentication&lt;/a&gt; enhances the security and efficiency of data transmission and allows these devices to communicate more reliably.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web scraping:&lt;/strong&gt; In contrast, web scraping provides access to an almost infinite amount of data available online, from text and images to metadata and more. Unlike surveys or databases, web scraping taps into real-time data, which is crucial for models requiring up-to-date information. Moreover, the diversity of data that can be scraped from the web is unparalleled, which allows for a more comprehensive training of machine learning models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/building-functional-ai-models-for-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn about building functional AI models for web scraping&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quality and quantity of data in ML&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You can have all of the fancy tools, but if your data quality is not good, you're nowhere.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veda Bawo, Director of Data Governance, Raymond James&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The adage "quality over quantity" holds a significant place in many fields, but in the world of machine learning, it's not a matter of choosing one over the other. The success of ML models is deeply rooted in the quality and quantity of data they're trained on.&lt;/p&gt;

&lt;p&gt;Quality of data refers to its accuracy, completeness, and relevance. High-quality data is free from errors, inconsistencies, and redundancies, making it indispensable for dependable analysis and sound decision-making. On the other hand, the quantity of data pertains to its volume. A larger dataset provides more information, leading to more reliable models and improved outcomes. However, an abundance of low-quality data can be detrimental, potentially leading to inaccurate predictions and suboptimal decision-making.&lt;/p&gt;

&lt;p&gt;When it comes to quantity, web scraping allows for the collection of vast amounts of data from various online sources. However, the web is full of low-quality data, so simply extracting raw data isn't enough. It needs to be cleaned and processed before it can be used for machine learning. More about that later.&lt;/p&gt;

&lt;p&gt;Another crucial aspect of data for machine learning is variety. Web scraping provides access to diverse data to enhance a model's ability to understand and interpret varied inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud-based real-time data acquisition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the context of machine learning, the ability to collect and process data in real time is increasingly a necessity rather than a luxury. This is where cloud-based data acquisition plays a vital role: unlike edge-based data acquisition, it offers the scalability and flexibility critical for handling the voluminous and dynamic nature of web data.&lt;/p&gt;

&lt;p&gt;Cloud computing, with its vast storage and computational capabilities, allows for the handling of massive datasets that web scraping generates. It provides the infrastructure needed to collect, store, and process data from varied sources in real-time. This real-time aspect is especially important in applications like market analysis, social media monitoring, and predictive modeling, where the timeliness of data can be the difference between relevance and obsolescence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/edge-ai-vs-cloud-ai/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn about the differences between Edge AI and Cloud AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Web scraping challenges and techniques for machine learning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The efficacy of web scraping in machine learning hinges on several key techniques. These not only ensure the extraction of relevant data but also its transformation into a format that machine learning algorithms can effectively utilize.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic websites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A major challenge in web scraping is dealing with &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic websites&lt;/a&gt; that continually update their content. These sites often use technologies like JavaScript, AJAX, and infinite scrolling, making data extraction more complex. Scraping them effectively requires advanced techniques and tools, such as executing JavaScript, handling AJAX requests, and navigating through dynamically loaded content, or the expertise of specialists such as companies offering &lt;a href="https://tsh.io/services/" rel="noopener noreferrer"&gt;software development services&lt;/a&gt;. Mastering these techniques enables the scraping of real-time data from complex websites, a critical requirement for many machine learning applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Blocking and blacklisting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many websites have measures in place to detect and block scraping bots to prevent unauthorized data extraction. These measures include blacklisting IP addresses, deploying CAPTCHAs, and analyzing browser fingerprints. To &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;counteract blocking&lt;/a&gt;, web scrapers employ techniques like rotating proxies, mimicking real browser behaviors, and making use of CAPTCHA-solving services.&lt;/p&gt;
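&lt;p&gt;Rotating proxies is the simplest of these countermeasures. A minimal sketch (the proxy URLs are placeholders; a real scraper would pass the selected proxy to its HTTP client on each request):&lt;/p&gt;

```python
import itertools

# Round-robin proxy rotation: each request goes out through a different
# IP address, so no single address accumulates a suspicious request rate.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(proxy_pool)
```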

&lt;h3&gt;
  
  
  &lt;strong&gt;Heavy server load&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Web scrapers can inadvertently overload servers with too many requests, leading to performance issues or even server crashes. To prevent this, it's essential to implement intelligent crawl delays, randomize scraping times, and distribute the load across multiple proxies. This approach ensures a polite and responsible scraping process that minimizes the impact on website servers.&lt;/p&gt;
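&lt;p&gt;A randomized crawl delay can be as simple as the sketch below (the base and jitter values are illustrative; tune them to the target site's tolerance):&lt;/p&gt;

```python
import random
import time

# Wait a randomized interval between requests so traffic does not arrive
# in a predictable, bursty pattern that strains the server.
def polite_delay(base_seconds=2.0, jitter_seconds=1.0):
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay
```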

&lt;h2&gt;
  
  
  &lt;strong&gt;What do you do with the scraped data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data preprocessing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We said earlier that scraping raw data isn't enough. The next critical step involves &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/#preprocessing" rel="noopener noreferrer"&gt;cleaning and transforming the raw data&lt;/a&gt; into a structured format suitable for machine learning models. This stage includes removing duplicates and inconsistencies, handling missing values, and normalizing data to ensure that it's free from noise and ready for analysis. Preprocessing ensures that the data fed into machine learning models is of high quality, which is essential for accurate results.&lt;/p&gt;
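&lt;p&gt;A toy version of these preprocessing steps on scraped records might look like this (field names are illustrative, and real pipelines typically use a library like pandas):&lt;/p&gt;

```python
# Minimal preprocessing sketch: drop exact duplicates, fill missing numeric
# values with the column mean, then min-max normalize the column to [0, 1].

def preprocess(rows, numeric_field="price"):
    # Remove exact duplicate records while preserving order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # Fill missing numeric values with the column mean.
    values = [r[numeric_field] for r in unique if r.get(numeric_field) is not None]
    mean = sum(values) / len(values) if values else 0.0
    for r in unique:
        if r.get(numeric_field) is None:
            r[numeric_field] = mean

    # Min-max normalize the numeric column.
    lo = min(r[numeric_field] for r in unique)
    hi = max(r[numeric_field] for r in unique)
    span = (hi - lo) or 1.0
    for r in unique:
        r[numeric_field] = (r[numeric_field] - lo) / span
    return unique
```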

&lt;h3&gt;
  
  
  &lt;strong&gt;Feature selection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the data is preprocessed, the next step is to identify and extract the most relevant features from the dataset. This involves analyzing the data to determine which attributes are most significant for the problem at hand. By focusing on the most relevant features, the efficiency and performance of machine learning models are significantly enhanced. This step - known also as &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/#feature-engineering" rel="noopener noreferrer"&gt;feature engineering&lt;/a&gt; - can also help in reducing the complexity of the model to make it faster and more efficient.&lt;/p&gt;
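&lt;p&gt;One of the simplest filter methods for this, shown purely for illustration, is ranking features by variance and keeping the top few, since near-constant features carry little signal for most models:&lt;/p&gt;

```python
# Rank features by variance and keep the k most variable ones.
# This is one basic filter method, not a full feature-engineering pipeline.

def top_variance_features(dataset, k=2):
    """dataset: dict mapping feature name to a list of numeric values."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    ranked = sorted(dataset, key=lambda name: variance(dataset[name]), reverse=True)
    return ranked[:k]
```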

&lt;h2&gt;
  
  
  &lt;strong&gt;Integrating web data with ML applications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you have your data, you need a way to integrate it with other tools for machine learning. Here are some of the most renowned libraries and databases for ML:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This open-source framework is revolutionizing the way developers integrate large language models (LLMs) with external components in ML applications. It simplifies the interaction with LLMs, facilitating data communication and the generation of vector embeddings. LangChain's ability to connect with diverse model providers and data stores makes it the ML developer's library of choice for building on top of large language models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-langchain/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about LangChain&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hugging Face&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Renowned for its datasets library, Hugging Face is one of the most popular frameworks in the machine learning community. It provides a platform for easily accessing, sharing, and processing datasets for a variety of tasks, including audio, &lt;a href="https://blog.apify.com/data-collection-for-computer-vision/" rel="noopener noreferrer"&gt;computer vision&lt;/a&gt;, and &lt;a href="https://blog.apify.com/text-classification-in-nlp/" rel="noopener noreferrer"&gt;NLP&lt;/a&gt;, making it a crucial tool for ML data readiness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-hugging-face/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Hugging Face&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Haystack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This tool's ecosystem is vast, integrating with technologies like &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; and various model providers. It serves as a flexible and dynamic solution for developers looking to incorporate complex functionalities in their ML projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-haystack-nlp-framework/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Haystack&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LlamaIndex&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LlamaIndex represents a significant advancement in the field of machine learning, particularly in its ability to augment large language models with custom data. This tool addresses a key challenge in ML: the integration of LLMs with private or proprietary data. It offers an approachable platform for even those with limited ML expertise, allowing for the effective use of private data in generating personalized insights.&lt;/p&gt;

&lt;p&gt;With functionalities like &lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;retrieval-augmented generation (RAG)&lt;/a&gt;, LlamaIndex enhances the capabilities of LLMs, making them more precise and informed in their responses. Its indexing and querying stages, coupled with various types of indexes, such as List, Vector Store, Tree, and Keyword indexes, provide a stable infrastructure for precise data retrieval and use in ML applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/platform/integrations/llama" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn how to integrate Apify with LlamaIndex to feed vector databases and LLMs with data crawled from the web&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pinecone and other vector databases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ML models need numerical representations of data, known as &lt;a href="https://blog.apify.com/what-are-embeddings-in-ai/" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt; in machine learning, so any data you've collected has to be embedded and then stored in and retrieved from a vector database.&lt;/p&gt;
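&lt;p&gt;At its core, what a vector database does is nearest-neighbor search over embeddings. A toy sketch of that idea, with tiny hand-made vectors standing in for real embeddings:&lt;/p&gt;

```python
import math

# Store embeddings and return the stored item whose vector is closest to a
# query vector by cosine similarity. Real vector databases do this at scale
# with approximate-nearest-neighbor indexes.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(store, query):
    """store: dict mapping item id to embedding vector."""
    return max(store, key=lambda item_id: cosine(store[item_id], query))
```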

&lt;h3&gt;
  
  
  &lt;strong&gt;Pinecone&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This vector database stands out for its high performance and scalability, which are crucial for ML applications. It's developer-friendly and allows for the creation and management of indexes with simple API calls. Pinecone excels in efficiently retrieving insights and offers capabilities like metadata filtering and namespace partitioning, making it a reliable tool for ML projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Pinecone&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Chroma&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As an AI-native open-source embedding database, Chroma provides a comprehensive suite of tools for working with embeddings. It features rich search functionalities and integrates with other ML tools, including LangChain and LlamaIndex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more vector databases, check out&lt;/strong&gt; &lt;a href="https://blog.apify.com/pinecone-alternatives/" rel="noopener noreferrer"&gt;&lt;strong&gt;6 open-source Pinecone alternatives&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Your first web scraping challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you haven't done web scraping before, we've made it easy (and free) for you to get started. Apify has created a tool ideal for data acquisition for machine learning: &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn how to use Website Content Crawler&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tool was specifically designed to extract data for feeding, fine-tuning, or training machine learning models such as LLMs. You can retrieve the results via the API in formats such as JSON or CSV, which can be fed directly to your LLM or vector database. You can also integrate the data with LangChain using the &lt;a href="https://python.langchain.com/docs/integrations/tools/apify" rel="noopener noreferrer"&gt;Apify LangChain integration&lt;/a&gt;.&lt;/p&gt;
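&lt;p&gt;For example, results land in a dataset that can be fetched over the public Apify API (v2 dataset items endpoint); the dataset ID and token below are placeholders:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Build the URL for fetching a run's dataset items in a given format.
# Fetch it with any HTTP client to get the items as JSON, CSV, etc.
def dataset_items_url(dataset_id, fmt="json", token="YOUR_API_TOKEN"):
    params = urlencode({"format": fmt, "token": token})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{params}"
```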

&lt;p&gt;🌐 &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to do question answering from a PDF</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Wed, 15 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/theovasilis1/how-to-do-question-answering-from-a-pdf-24fn</link>
      <guid>https://dev.to/theovasilis1/how-to-do-question-answering-from-a-pdf-24fn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're Apify, a full-stack&lt;/strong&gt; &lt;a href="https://blog.apify.com/what-is-web-scraping/"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and browser automation platform. This article about QA from PDFs was inspired by our work on getting better&lt;/strong&gt; &lt;a href="https://apify.com/data-for-generative-ai"&gt;&lt;strong&gt;data for AI&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apify + LangChain = QA from PDFs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Question answering is often used to create conversational client applications, such as &lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/"&gt;custom AI chatbots&lt;/a&gt;. In many cases, web content is all you need. But sometimes, the information required for the AI model in question is contained in PDF documents (legal information, HR documents, and other papers that are more likely to be in PDF files than on a web page).&lt;/p&gt;

&lt;p&gt;That's why we're going to show you how to collect content from PDF files, split the text into parts to feed an AI model, and create a QA system that can extract answers from the collected PDF docs.&lt;/p&gt;

&lt;p&gt;For this, you'll need two things: Apify (for the data extraction and chunking) and LangChain (to create the QA system).&lt;/p&gt;

&lt;p&gt;💬 &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/"&gt;Learn How Intercom uses Apify to feed web data to its AI chatbot for customer support&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Scraping and chunking texts in PDFs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The tool we'll use to scrape and chunk PDF files is Apify's &lt;strong&gt;PDF Text Extractor&lt;/strong&gt;. Like all the other web scrapers and automation tools in &lt;a href="https://apify.com/store"&gt;Apify Store&lt;/a&gt;, you can run PDF Text Extractor via:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web UI&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apify API&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apify CLI&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're new to Apify, using the UI is the easiest way to test it out, so that's the method we're going to use in this tutorial.&lt;/p&gt;
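&lt;p&gt;For reference, the same run can be driven from code. The sketch below only builds a plausible run input; the field names are illustrative, not the Actor's exact input schema, so check the Actor's input tab for the real ones:&lt;/p&gt;

```python
# Assemble a run input for a PDF-extraction Actor run started via the API.
# Field names here are assumptions for illustration, not the real schema.
def build_run_input(urls, perform_chunking=True, chunk_size=1000, chunk_overlap=100):
    return {
        "urls": [{"url": u} for u in urls],
        "performChunking": perform_chunking,
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
    }

# With the apify-client package this input would be passed along the lines of:
#   ApifyClient(token).actor("jirimoravcik/pdf-text-extractor").call(run_input=...)
```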

&lt;p&gt;To use this tool and follow along, go to &lt;a href="https://apify.com/jirimoravcik/pdf-text-extractor"&gt;&lt;strong&gt;PDF Text Extractor&lt;/strong&gt;&lt;/a&gt; in Apify Store and click the &lt;strong&gt;Try for free&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--edwu6Mvq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/AD2266F8-18C8-4B58-B49F-4991624DE487_4_5005_c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--edwu6Mvq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/AD2266F8-18C8-4B58-B49F-4991624DE487_4_5005_c.jpeg" alt="PDF Text Extractor.png" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll need an Apify account. If you don't have one, you'll be prompted to sign up when you click that button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up"&gt;https://console.apify.com/sign-up&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Otherwise, you'll be taken straight to Apify Console (which is basically your dashboard), and you'll see the UI that we're about to walk you through.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. PDF URLs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To demonstrate, we'll extract the URL provided in the default setting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T5kvOMnM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-PDF-URLs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T5kvOMnM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-PDF-URLs.png" alt="PDF Text Extractor. PDF URLs.png" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use the &lt;strong&gt;+ Add&lt;/strong&gt; button to add other URLs to the list, as well. These will be added to the crawler queue, and the extractor will process them one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Perform chunking&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Large language models have limited memory and a fixed context window, which means they can only process a certain amount of text at a time. That's why PDF Text Extractor offers a chunking option. Chunking means splitting long text into smaller parts so they can be fed to an LLM one at a time.&lt;/p&gt;

&lt;p&gt;The default setting has the &lt;strong&gt;perform chunking&lt;/strong&gt; option disabled. To demonstrate the difference, we'll later show you the final dataset with chunking disabled and enabled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bSfEjpzi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Perform-chunking.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bSfEjpzi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Perform-chunking.png" alt="PDF Text Extractor. Perform chunking.png" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below &lt;strong&gt;Perform chunking&lt;/strong&gt; is the &lt;strong&gt;Chunk size&lt;/strong&gt; option. This value is measured in characters, not tokens, and is set to 1,000 by default. Since the context window of a typical LLM is around 4,000 tokens, 1,000 characters should be sufficient in most cases.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Chunk overlap&lt;/strong&gt; option sets the number of characters that adjacent text chunks share. Imagine splitting text into two parts: a portion of the text appears at the end of the first chunk and again at the beginning of the second. This overlap exists because it's otherwise difficult to determine the precise point at which to make the split.&lt;/p&gt;
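&lt;p&gt;To make the chunk size and overlap settings concrete, here's a minimal, hypothetical sketch of character-based chunking with overlap. This is only an illustration of the idea, not PDF Text Extractor's actual implementation, and the parameter defaults are invented for the example:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into character chunks; adjacent chunks share chunk_overlap characters."""
    # chunk_size must be larger than chunk_overlap, or the window never advances
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A tiny alphabet example: 4-character chunks overlapping by 2
chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

&lt;p&gt;With a 1,000-character chunk size and a 100-character overlap, each new chunk would start 900 characters after the previous one.&lt;/p&gt;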

&lt;p&gt;These are the only settings that matter for our demo, so we'll click &lt;strong&gt;Start&lt;/strong&gt; to execute the code and begin the run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pO_S738G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Start-run..png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pO_S738G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Start-run..png" alt="PDF Text Extractor. Start run..png" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we ran the extractor with chunking disabled, we got 27 results, one for each of the PDF's 27 pages: each dataset item is a page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u890Ok1u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Chunking-disabled.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u890Ok1u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Chunking-disabled.png" alt="PDF Text Extractor. Chunking disabled.png" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you click on the &lt;strong&gt;Output&lt;/strong&gt; tab, you can see a preview of the dataset, which includes &lt;code&gt;url&lt;/code&gt; (the file from which the text was extracted), &lt;code&gt;index&lt;/code&gt; (the order in which the text was extracted from the file), and the extracted &lt;code&gt;text&lt;/code&gt; itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ekiNy73X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-JSON-dataset..png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ekiNy73X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-JSON-dataset..png" alt="PDF Text Extractor. JSON dataset." width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 &lt;a href="https://blog.apify.com/content/files/2023/11/Dataset-from-PDF-Text-Extractor-with-chunking-disabled-1.json"&gt;Dataset from PDF Text Extractor with chunking disabled&lt;/a&gt;&lt;/p&gt;
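&lt;p&gt;Since each item carries &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;index&lt;/code&gt;, and &lt;code&gt;text&lt;/code&gt;, you can easily reassemble the full text per file in plain Python. The sample items below are made up and stand in for a real downloaded dataset:&lt;/p&gt;

```python
# Sample items mirroring the extractor's dataset schema:
# url (source file), index (extraction order), text (extracted content)
items = [
    {"url": "https://arxiv.org/pdf/2307.12856.pdf", "index": 1, "text": "second chunk"},
    {"url": "https://arxiv.org/pdf/2307.12856.pdf", "index": 0, "text": "first chunk"},
]

def join_text(items):
    """Reassemble the full text for each source URL, respecting extraction order."""
    by_url = {}
    for item in sorted(items, key=lambda i: i["index"]):
        by_url.setdefault(item["url"], []).append(item["text"])
    return {url: " ".join(parts) for url, parts in by_url.items()}

full = join_text(items)
print(full)  # {'https://arxiv.org/pdf/2307.12856.pdf': 'first chunk second chunk'}
```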

&lt;p&gt;When we ran the extractor with chunking enabled, we got 98 results, as the text had been split.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lgjoh5jP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Chunking-enabled.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lgjoh5jP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Chunking-enabled.png" alt="PDF Text Extractor. Chunking enabled.png" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the chunked dataset:&lt;/p&gt;

&lt;p&gt;💡 &lt;a href="https://blog.apify.com/content/files/2023/11/Dataset-from-PDF-Text-Extractor-with-chunking-enabled-1.json"&gt;Dataset from PDF Text Extractor with chunking enabled&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've provided the datasets in JSON format, but you can choose from other formats by going to the &lt;strong&gt;Storage&lt;/strong&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6SWVF-X1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Storage.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6SWVF-X1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/11/PDF-Text-Extractor.-Storage.png" alt="PDF Text Extractor. Storage.png" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Question answering using PDF Text Extractor with LangChain&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the last part of this blog post, we're going to use PDF Text Extractor combined with LangChain to ask questions about the content of a PDF file. We'll use Python as the programming language of choice.&lt;/p&gt;

&lt;p&gt;First, let's install the required packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install apify-client chromadb langchain openai python-dotenv tiktoken
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's create a &lt;code&gt;.env&lt;/code&gt; file for the API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY=your_api_keyAPIFY_API_TOKEN=your_api_key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, we'll create a &lt;code&gt;main.py&lt;/code&gt; file that will use the PDF Text Extractor to retrieve text from a PDF file and then perform question answering over it.&lt;/p&gt;

&lt;p&gt;The process is quite simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We use the Apify integration in LangChain to run the PDF Text Extractor with our specified parameters (the PDF URL refers to the WebAgent paper from the AI lab DeepMind)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We store all the text from the PDF in a vector store (Chroma in our case)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can now perform QA using our defined queries&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the Python implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.document_loaders.base import Documentfrom langchain.indexes import VectorstoreIndexCreatorfrom langchain.utilities import ApifyWrapperfrom dotenv import load_dotenv# Load environment variables from a .env fileload_dotenv()if __name__ == " __main__": # Use the Apify integration to run the extractor apify = ApifyWrapper() loader = apify.call_actor( actor_id="jirimoravcik/pdf-text-extractor", run_input={"urls": ["https://arxiv.org/pdf/2307.12856.pdf"], "chunk_size": 1000}, dataset_mapping_function=lambda item: Document( page_content=item["text"], metadata={"source": item["url"]} ), ) # Create a vector index and store all the text from the PDF index = VectorstoreIndexCreator().from_loaders([loader]) # Ask some questions about the PDF query = "What is the WebAgent? Who is behind its creation?" result = index.query_with_sources(query) print(result["answer"]) print('Sources:', result["sources"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We simply run the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the output for our query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WebAgent is an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. It was created by Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust from Google DeepMind and The University of Tokyo.Sources: https://arxiv.org/pdf/2307.12856.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Before you go&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We've shown you how to utilize Apify and LangChain to do QA from a PDF, but if you want to learn more about scraping web data for AI, we have more helpful content for you below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-use-langchain/"&gt;How to use LangChain with Open AI, Pinecone, and Apify&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/"&gt;Web scraping for AI: how to collect data for LLMs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-integrate-google-search-into-your-ai-apps/"&gt;How to integrate Google search into your AI apps&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/talk-to-your-website-with-large-language-models/"&gt;Using LLMs to talk to websites&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>dataextraction</category>
      <category>dataprocessing</category>
    </item>
    <item>
      <title>Synthetic data generation vs. real data for AI</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Thu, 09 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/synthetic-data-generation-vs-real-data-for-ai-5d5f</link>
      <guid>https://dev.to/apify/synthetic-data-generation-vs-real-data-for-ai-5d5f</guid>
<description>&lt;p&gt;&lt;strong&gt;Hi, we're Apify, a full-stack&lt;/strong&gt; &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and browser automation platform. This article about synthetic vs. real data was inspired by our work on getting&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;better data for AI&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is synthetic data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Synthetic data is artificially created data used to replace real data in machine learning applications. It's generated by computer algorithms and used to train neural networks and generative AI models.&lt;/p&gt;
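&lt;p&gt;As a toy illustration of the idea, the snippet below fabricates purchase records that mimic the shape of real transaction data. All field names, categories, and distributions here are invented for the example:&lt;/p&gt;

```python
import random

random.seed(42)  # reproducible "fake" data

def synthetic_transactions(n):
    """Fabricate purchase records that mimic the shape of real transaction data."""
    categories = ["travel", "dining", "goods", "services"]
    return [
        {
            "id": i,
            "category": random.choice(categories),
            # Log-normal amounts loosely imitate real, right-skewed spend data
            "amount": round(random.lognormvariate(3.0, 0.8), 2),
        }
        for i in range(n)
    ]

data = synthetic_transactions(1000)
print(len(data), data[0])
```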

&lt;h2&gt;
  
  
  &lt;strong&gt;Synthetic data generation and the 3 Big Bangs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Although synthetic data goes back to the 1970s, it didn't get much attention until the Big Bang of AI in 2012, when a team of researchers used it to train an artificial neural network to win an image classification competition by a huge margin.&lt;/p&gt;

&lt;p&gt;The second Big Bang of AI occurred in 2017 with the arrival of Transformers - the deep learning architectures on which today's generative AI models are based.&lt;/p&gt;

&lt;p&gt;Despite these two significant events in AI history, in 2021, only 1% of all data for AI was synthetic. But, &lt;a href="https://www.notion.so/Synthetic-data-generation-vs-real-data-ee698a7d5ea449f9a93293c2e0590082?pvs=21" rel="noopener noreferrer"&gt;according to Gartner&lt;/a&gt;, that number is expected to rise to 60% by the end of 2024. And by 2030, synthetic data is predicted to overshadow real-world data completely.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1674370195383545858-637" src="https://platform.twitter.com/embed/Tweet.html?id=1674370195383545858"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-1674370195383545858-637');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=1674370195383545858&amp;amp;theme=dark"
  }



.&lt;/p&gt;

&lt;p&gt;It's easy to understand why. The third Big Bang occurred in 2022 when a Generative Pretrained Transformer was put into the hands of consumers for the first time in the form of ChatGPT.&lt;/p&gt;

&lt;p&gt;Since then, the data race for AI models has been escalating at a breathtaking rate. This, in turn, has caused the popularity of synthetic data generation to rise sharply. But why is synthetic data generation the answer to data demand?&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why use synthetic data for AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift towards synthetic data usage in AI development is fueled by several practical factors.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Privacy issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firstly, synthetic data helps overcome privacy issues associated with using real-world data, especially when the data contains sensitive personal information. By using synthetic data that mimics real user data, developers can train models without risking privacy breaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ease and customizability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Moreover, synthetic data is essential for scenarios where real data is scarce or difficult to collect. For instance, in autonomous vehicle development, simulating millions of driving hours with diverse conditions is safer and more feasible than recording actual driving data. Synthetic data generation also makes it easier to customize data to meet the specific needs of a business.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Speed&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Synthetic data isn't captured from real-world events. That means it's possible to construct a dataset much faster than by collecting real data. In other words, huge volumes of artificial data can be made available in a shorter period of time.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Lastly, synthetic data can be generated at a lower cost than gathering and &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/" rel="noopener noreferrer"&gt;labeling real-world data&lt;/a&gt;. This makes the development of AI models more efficient and allows for rapid iteration and improvement.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A single image that would cost $6 from a labeling service can be artificially generated for 6 cents.&lt;/p&gt;

&lt;p&gt;Paul Walborsky, co-founder of AI.Reverie&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's the problem with synthetic data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Notwithstanding the advantages of generating synthetic data, there's a major problem with it.&lt;/p&gt;

&lt;p&gt;A study called &lt;a href="https://arxiv.org/pdf/2305.17493v2.pdf" rel="noopener noreferrer"&gt;The Curse of Recursion: Training on Generated Data Makes Models Forget&lt;/a&gt; has demonstrated that training an AI model on synthetic data or samples from another generative model can induce a distribution shift, which over time, causes model collapse. This, in turn, causes an AI model to misperceive the underlying learning task.&lt;/p&gt;

&lt;p&gt;The study concludes that access to the original data source needs to be preserved, and additional data not generated artificially must remain available over time.&lt;/p&gt;

&lt;p&gt;The problem of synthetic data can be mitigated by a hybrid approach that involves augmenting real-world data, thus combining real and synthetic data. This technique is known as &lt;a href="https://blog.apify.com/improve-ai-models-web-scraping-data-augmentation/" rel="noopener noreferrer"&gt;data augmentation&lt;/a&gt;, which uses real-world datasets to create new synthetic examples and maintains the quality and diversity of the training dataset. But the prediction that purely synthetic data will overshadow real data is worrying.&lt;/p&gt;
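&lt;p&gt;To illustrate the augmentation idea, here's a minimal sketch that creates synthetic variants of real numeric samples by adding small random jitter. It's a deliberately simplified stand-in for real augmentation pipelines:&lt;/p&gt;

```python
import random

random.seed(0)  # reproducible jitter

def augment(real_samples, copies=3, noise=0.05):
    """Create synthetic variants of real numeric samples by jittering each value by up to ±noise."""
    synthetic = []
    for sample in real_samples:
        for _ in range(copies):
            synthetic.append([x * (1 + random.uniform(-noise, noise)) for x in sample])
    return synthetic

real = [[1.0, 2.0], [3.0, 4.0]]   # the "ground truth" samples
augmented = real + augment(real)  # hybrid dataset: real plus synthetic
print(len(augmented))  # 8 (2 real + 6 synthetic)
```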

&lt;p&gt;Synthetic data can introduce bias or lose realism, which is why it's the main cause of model collapse. Furthermore, the problem of &lt;a href="https://www.nature.com/articles/s41598-022-15245-z" rel="noopener noreferrer"&gt;quality degradation in AI models&lt;/a&gt; means there's a need to retrain them with new sources of ground truth in the form of fresh real-world datasets. Otherwise, our AI models not only won't improve; they'll get worse at what we trained them to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Advantages of real data for AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Despite the rise of synthetic data, real data holds invaluable benefits for AI. It captures the complexity and unpredictability of the real world, which is often difficult to emulate through synthetic means. Training AI models on real-world data ensures they're exposed to actual scenarios they'll encounter post-deployment. Naturally, this improves their reliability and performance.&lt;/p&gt;

&lt;p&gt;Real data can also help in validating and testing AI models trained on synthetic data. This dual approach ensures that the models are not only fed with diverse and comprehensive data during training but also checked against real-world benchmarks to guarantee their efficacy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to collect real-world data for AI models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Granted that real-world data collection is necessary for training AI models, what are the options? Vast volumes of data are needed for AI, so how do you collect it?&lt;/p&gt;

&lt;p&gt;There are several methods depending on the volume and type of data required.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Existing databases and crowdsourcing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For structured data, organizations can tap into existing databases and records. For unstructured data, like images or sounds, crowdsourcing platforms can be valuable, where individuals contribute data in exchange for compensation or on a voluntary basis.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partnerships and collaborations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Another method is through partnerships and collaborations with institutions and companies that already possess vast amounts of relevant data. These collaborations can provide access to a wide array of real-world data while benefiting all parties involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;IoT devices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Additionally, IoT devices and sensors are prolific data collectors and can provide a continuous stream of real-world information that can be used for AI models, especially in domains like environmental monitoring, healthcare, and smart cities.&lt;/p&gt;

&lt;p&gt;The problem with all of the above? Scale and customizability. To collect enormous quantities of relevant, up-to-date data, there's really only one solution when it comes to the needs of AI:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; involves using software to extract information from websites. This technique can harvest large amounts of data by navigating the web automatically, mimicking human browsing to collect specific data from various web pages.&lt;/p&gt;

&lt;p&gt;The advantages of web scraping are numerous. For one, it enables the collection of data at scale, which is beneficial for training AI models that require extensive datasets to improve their accuracy. It's also a time-efficient method, as once set up, web scrapers can gather data much faster than a human manually could.&lt;/p&gt;

&lt;p&gt;Web scraping tools are usually customizable, allowing for the targeted collection of data. This means that if you're looking to train a model on a specific type of data, scrapers can be programmed to look for and collect just that. It's particularly useful for gathering structured data such as product information, prices, descriptions, and reviews from e-commerce sites or for collecting unstructured data like posts and comments from social media platforms.&lt;/p&gt;
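&lt;p&gt;As a tiny illustration of targeted collection, the sketch below pulls just the prices out of a page's text with a pattern match. The sample text and pattern are invented for the example, and real scrapers typically use a proper HTML parser rather than a regex:&lt;/p&gt;

```python
import re

def extract_prices(text):
    """Pull dollar amounts out of raw page text with a targeted pattern."""
    return [float(match) for match in re.findall(r"\$(\d+\.\d{2})", text)]

# A made-up snippet of page text standing in for a fetched deal listing
page = "Two-course dinner for two: $19.99 (regular price $39.99), 500 bought"
prices = extract_prices(page)
print(prices)  # [19.99, 39.99]
```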

&lt;p&gt;Moreover, web scraping democratizes data collection. Even individuals or small companies without access to large datasets can collect the data they need from the public domain. This levels the playing field and fosters innovation in AI.&lt;/p&gt;

&lt;p&gt;The challenges? The two primary obstacles are scrapers &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;getting blocked&lt;/a&gt; by anti-bot protections and scraping &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic web pages&lt;/a&gt;. But these challenges are not insurmountable.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Collecting real-world data for AI is not unfeasible&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Given the amount of data required for AI applications, web scraping isn't feasible with just a web scraping tool. You need infrastructure, tools, and expertise that allow you to tackle and overcome the various challenges of opening websites and &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;extracting data for LLMs and other AI applications&lt;/a&gt; legally, ethically, and at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apify&lt;/strong&gt; provides all of these things. Its &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;platform&lt;/a&gt; gives developers easy access to serverless computation, data storage, distributed queues, and hundreds of web scrapers built by other developers. It's also deeply integrated with &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, an open-source web scraping library that allows you to crawl and scrape websites at scale.&lt;/p&gt;

&lt;p&gt;In addition to its vast range of pre-built data extraction tools, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;Enterprise solutions&lt;/a&gt; with its team of experts who know how to handle the challenges of collecting data from arbitrary websites.&lt;/p&gt;

&lt;p&gt;So, should you come to realize that synthetic data just won't cut it and you need a way to &lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;collect real data for AI&lt;/a&gt;, we'll be here for you!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>How to integrate Google Search into your AI apps</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 05 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/how-to-integrate-google-search-into-your-ai-apps-12a</link>
      <guid>https://dev.to/apify/how-to-integrate-google-search-into-your-ai-apps-12a</guid>
<description>&lt;p&gt;&lt;strong&gt;Hi, we're Apify, a full-stack&lt;/strong&gt; &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and browser automation platform. This article about integrating Google search into AI apps was inspired by our work on getting&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;better data for AI&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Does ChatGPT really have access to the internet?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You've probably heard that ChatGPT can browse the internet, not least because OpenAI announced it in September 2023.&lt;/p&gt;

&lt;p&gt;Incidentally, that announcement is already outdated, as ChatGPT's knowledge cut-off date was very recently extended to April 2023:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BREAKING 🚨&lt;/p&gt;

&lt;p&gt;OpenAI has updated ChatGPT's knowledge cutoff to April 2023!&lt;/p&gt;

&lt;p&gt;Even though the premature AGI has access to real time information through Bing, its non-Bing agent now has updated knowledge up to 6 months ago! &lt;a href="https://t.co/4LeTiejaL2" rel="noopener noreferrer"&gt;pic.twitter.com/4LeTiejaL2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shahaf (@shahaf_dan)&lt;/strong&gt; &lt;a href="https://twitter.com/shahaf_dan/status/1717921501611999406?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;&lt;strong&gt;October 27, 2023&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The cut-off date aside, does GPT-4 really have access to the internet? It would be more accurate to say that it supports a 'browse with Bing' feature (for those willing to part with $20 per month).&lt;/p&gt;

&lt;p&gt;This is significant for two reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Bing vs. Google&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An older &lt;a href="https://www.researchgate.net/publication/236839185_A_Comparative_Study_of_Google_and_Bing_Search_Engines_in_Context_of_Precision_and_Relative_Recall_Parameter" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; demonstrated that there are performance differences between Google and Bing.&lt;/p&gt;

&lt;p&gt;Overall, Google did a better job for single-word searches, and Bing was more precise for searches with multiple words, whether simple or complex. But Bing didn't show as many relevant results as Google.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg97y0u10544an48nsj3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg97y0u10544an48nsj3a.png" alt="Untitled" width="560" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. GPT-4 doesn't really have access to the internet&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The mention of a browser, regardless of whether it's Bing or another web browser, suggests that GPT-4 has access to current information, which is not entirely accurate. Such functionality isn't actually part of the AI model's architecture. In other words, the Bing search integration is a feature of the ChatGPT application, not the GPT model itself.&lt;/p&gt;

&lt;p&gt;Why does this matter?&lt;/p&gt;

&lt;p&gt;Because when building our own app (using the OpenAI API or accessing GPT models), we don't get the search integration automatically (the models still can't access the internet).&lt;/p&gt;

&lt;p&gt;So the question is, is it possible to create an internet-access-enabled AI app? Yes, it is! And we're going to show you how to do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Function calling to the rescue!&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;LLMs are only text-in-text-out tools - they cannot control any other tools directly.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's imagine we want to create an AI app that calculates the prices of products on a website in different currencies. Without access to external tools, the answer will look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5av2xei5hc0qkynffsxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5av2xei5hc0qkynffsxb.png" alt="Untitled" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the above example is from GPT-3.5, GPT-4 would behave the same way, as the model wouldn't have current exchange rates unless it accesses the internet.&lt;/p&gt;

&lt;p&gt;Some GPT models, though (specifically &lt;code&gt;gpt-4-0613&lt;/code&gt; and &lt;code&gt;gpt-3.5-turbo-0613&lt;/code&gt;) are trained to understand function declarations. We can specify different functions in the input of those models to perform certain tasks.&lt;/p&gt;

&lt;p&gt;For our currency example, we can tell the model it can use a &lt;code&gt;currency_converter&lt;/code&gt; function by passing the following parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "model": "gpt-3.5-turbo-0613", "messages": [{ "role": "user", "content": "A t-shirt costs 1499 Czech Crowns. How much is that is US Dollars?" }], "functions": [{ "name": "currency_converter", "description": "Calculate live currency and foreign exchange rates using this tool.", "parameters": { "type": "object", "properties": { "fromCurrency": { "type": "string", "description": "The source currency code (eg. USD)", "enum": ["USD", "EUR", "CZK"] }, "toCurrency": { "type": "string", "description": "The target currency code (eg. USD)", "enum": ["USD", "EUR", "CZK"] }, "value": { "type": "number", "description": "The amount of money in the source currency." } }, "required": ["fromCurrency", "toCurrency", "value"] } } ]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aside from the regular &lt;code&gt;messages&lt;/code&gt; field with the user prompt, we're passing an array of functions which have a &lt;code&gt;name&lt;/code&gt;, a short &lt;code&gt;description&lt;/code&gt; (to tell the LLM when to use this tool), and a JSON schema of &lt;code&gt;parameters&lt;/code&gt; (so the LLM knows what format to generate the parameters in).&lt;/p&gt;

&lt;p&gt;In this example, we tell the LLM that we have a function (&lt;code&gt;currency_converter&lt;/code&gt;) it can use if needed. This function can convert currencies (and takes three parameters - &lt;code&gt;fromCurrency&lt;/code&gt;, &lt;code&gt;toCurrency&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Called with these parameters, the model returns the following response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "id": "id-of-the-response", "object": "chat.completion", "created": 1698758802, "model": "gpt-3.5-turbo-0613", "choices": [{ "index": 0, "message": { "role": "assistant", "content": null, "function_call": { "name": "currency_converter", "arguments": "{\n \"fromCurrency\": \"CZK\",\n \"toCurrency\": \"USD\",\n \"value\": 1499\n}" } }, "finish_reason": "function_call" }], "usage": { "prompt_tokens": 131, "completion_tokens": 34, "total_tokens": 165 }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;function_call&lt;/code&gt; part of the response shows that the model indeed wants to call the specified &lt;code&gt;currency_converter&lt;/code&gt; function, with the arguments given in the &lt;code&gt;arguments&lt;/code&gt; field.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Keep in mind that LLMs are still &lt;strong&gt;only text-in-text-out&lt;/strong&gt;, so in the end, you still have to run the function yourself (even though the LLM gives you the parameters).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want to run the function and pass the result to the LLM in another API call, check out the example in &lt;a href="https://openai.com/blog/function-calling-and-other-api-updates" rel="noopener noreferrer"&gt;OpenAI's documentation&lt;/a&gt;.&lt;/p&gt;
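&lt;p&gt;To make that loop concrete, here's a minimal Node.js sketch of the dispatch step. It assumes a response shaped like the one above and a hypothetical local &lt;code&gt;convertCurrency&lt;/code&gt; implementation (the exchange rate below is a made-up placeholder, not live data): the LLM only picks the function and supplies its arguments; your code does the actual work.&lt;/p&gt;

```javascript
// Stub converter standing in for a real exchange-rate lookup.
// The rate below is a made-up placeholder, not live data.
const convertCurrency = ({ fromCurrency, toCurrency, value }) => {
  const rates = { "CZK->USD": 0.044 };
  return value * rates[`${fromCurrency}->${toCurrency}`];
};

// Dispatch table mapping function names the LLM may request
// to local implementations.
const functions = { currency_converter: convertCurrency };

// Inspect the chat completion: if the model asked for a function call,
// parse the JSON-string arguments and run the matching local function.
const handleResponse = (response) => {
  const choice = response.choices[0];
  if (choice.finish_reason !== "function_call") {
    return choice.message.content;
  }
  const { name, arguments: args } = choice.message.function_call;
  return functions[name](JSON.parse(args));
};

// A response shaped like the example above.
const response = {
  choices: [{
    message: {
      role: "assistant",
      content: null,
      function_call: {
        name: "currency_converter",
        arguments: '{ "fromCurrency": "CZK", "toCurrency": "USD", "value": 1499 }',
      },
    },
    finish_reason: "function_call",
  }],
};

console.log(handleResponse(response)); // amount in USD
```

&lt;p&gt;In a real app, you'd then append the function's return value as a &lt;code&gt;role: "function"&lt;/code&gt; message and call the API again so the model can phrase the final answer for the user.&lt;/p&gt;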

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;This was just a simple example of custom functions, but it doesn't really do them justice. They can do anything from&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;web scraping&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;&lt;em&gt;(aka data extraction) to&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://blog.apify.com/5-ways-web-rpa-can-help-your-business/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;RPA&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to enhance LLMs with Apify and Langchain&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So far, we've only talked about how to interface LLMs with functions, but what do we do now?&lt;/p&gt;

&lt;p&gt;Using Apify's &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Search Results Scraper&lt;/a&gt; and the &lt;a href="https://python.langchain.com/docs/get_started/introduction" rel="noopener noreferrer"&gt;LangChain library&lt;/a&gt;, we can enhance our LLM apps with the power of &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While LangChain provides support for &lt;a href="https://js.langchain.com/docs/modules/model_io/models/chat/how_to/function_calling" rel="noopener noreferrer"&gt;full-fledged function calling&lt;/a&gt;, we'll be fine with a simplified version of it called &lt;strong&gt;Tools&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As with function calling, you define a Tool by giving it a name and a short description. The difference is that LangChain Tools accept only a single string as a parameter, and you also have to provide the function's implementation. Consider the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { initializeAgentExecutorWithOptions } from "langchain/agents";import { ChatOpenAI } from "langchain/chat_models/openai";import { DynamicTool } from "langchain/tools";import { ApifyClient } from 'apify-client';(async () =&amp;gt; { const tools = [new DynamicTool({ name: "google-scraper", description: "Call this to retrieve information about a subject. Parameter is a search query (string).", func: async (query) =&amp;gt; { const client = new ApifyClient(); const input = { "queries": query, "maxPagesPerQuery": 1, "resultsPerPage": 10, }; const run = await client.actor("apify/google-search-scraper").call(input); const { items } = await client.dataset(run.defaultDatasetId).listItems(); return JSON.stringify(items); }, }),]; const executor = await initializeAgentExecutorWithOptions( tools, new ChatOpenAI({ modelName: "gpt-3.5-turbo-16k", temperature: 0, }), { agentType: "openai-functions", } ); const result = await executor.run("How much did the Barbie movie make?"); console.log(result);})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a chance you'll hit the context length limit with this approach. To fix it, you can lower the number of &lt;code&gt;resultsPerPage&lt;/code&gt; in the Apify Actor call, filter out unused fields from the result before returning it from the Tool function, or do some additional indexing on the data and retrieve only the relevant parts in the Tool function.&lt;/p&gt;
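&lt;p&gt;As a sketch of the field-filtering option, the helper below trims each dataset item down to three fields before it's stringified for the LLM. The field names (&lt;code&gt;organicResults&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;) are assumptions based on Google Search Results Scraper's output — check your Actor's actual output schema and adjust accordingly:&lt;/p&gt;

```javascript
// Keep only the fields the LLM actually needs from each search result.
// Field names assume the Google Search Results Scraper output schema;
// adjust them to whatever your Actor actually returns.
const trimResults = (items, maxItems = 5) =>
  items
    .flatMap((page) => page.organicResults ?? [])   // one dataset item per result page
    .slice(0, maxItems)                             // cap the number of results
    .map(({ title, url, description }) => ({ title, url, description }));

// Example with a hand-made dataset item (not real scraper output):
const items = [{
  organicResults: [
    { title: "Barbie (film)", url: "https://example.com/barbie", description: "Box office...", position: 1 },
    { title: "Barbie box office", url: "https://example.com/box-office", description: "Gross revenue...", position: 2 },
  ],
}];

console.log(JSON.stringify(trimResults(items, 1)));
```

&lt;p&gt;Inside the Tool's &lt;code&gt;func&lt;/code&gt;, you would then return &lt;code&gt;JSON.stringify(trimResults(items))&lt;/code&gt; instead of the full dataset.&lt;/p&gt;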

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Try Google SERPs Scraper for free&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, there you have it: a complete example of an internet-access-enabled AI app that uses Google Search Scraper and LangChain. But why stop here?&lt;/p&gt;

&lt;blockquote&gt;
&lt;h4&gt;
  
  
  &lt;a href="https://blog.apify.com/add-custom-actions-to-your-gpts/" rel="noopener noreferrer"&gt;&lt;strong&gt;Give your GPTs web scraping capability&lt;/strong&gt;&lt;/a&gt;
&lt;/h4&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Learn more&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to find out more about improving LLMs, using LangChain with scraped data, and storing and indexing data with vector databases, we leave you with the famous words of Augustine of Hippo: "Take up and read!"&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stuff about LLMs&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/gpt-scraper-chatgpt-access-internet/" rel="noopener noreferrer"&gt;How to use GPT Scraper to let ChatGPT access the internet&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;How to collect data for LLMs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;How to build a custom AI chatbot&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;What is retrieval-augmented generation, and why use it for chatbots?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/improve-ai-models-web-scraping-data-augmentation/" rel="noopener noreferrer"&gt;How to improve AI models with web scraping and data augmentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stuff about LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-langchain/" rel="noopener noreferrer"&gt;What is LangChain?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-use-langchain/" rel="noopener noreferrer"&gt;How to use LangChain with OpenAI, Pinecone, and Apify&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/langchain-alternatives/" rel="noopener noreferrer"&gt;LangChain alternatives&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stuff about vector databases&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;What is a vector database?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/" rel="noopener noreferrer"&gt;What is Pinecone?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/pinecone-alternatives/" rel="noopener noreferrer"&gt;6 open-source Pinecone alternatives&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llms</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>What is Playwright?</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Wed, 11 Oct 2023 08:46:12 +0000</pubDate>
      <link>https://dev.to/apify/what-is-playwright-7m1</link>
      <guid>https://dev.to/apify/what-is-playwright-7m1</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;We explore the features of Playwright that make it an awesome tool for web automation, testing, and scraping 🎭&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Playwright?
&lt;/h2&gt;

&lt;p&gt;We all know that technology moves fast, but even by modern standards, the rapid rise of &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; is impressive.&lt;/p&gt;

&lt;p&gt;Microsoft released Playwright in 2020 as an open-source Node library to automate Chromium, Firefox, and WebKit with a single API. Today, Playwright is one of the most popular frameworks for web automation, testing, and scraping. It provides automated control of a web browser with a few lines of code, making it particularly useful for data extraction, end-to-end testing, automating web page interaction, taking screenshots of web pages, and running automated tests for JavaScript libraries.&lt;/p&gt;

&lt;p&gt;While similar to &lt;a href="https://github.com/puppeteer/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt;, &lt;a href="https://github.com/cypress-io/cypress" rel="noopener noreferrer"&gt;Cypress&lt;/a&gt;, and &lt;a href="https://github.com/SeleniumHQ/selenium" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt;, there are some differences. Let’s find out what they are.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/V5DEx5x7I0w"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Playwright a headless browser?
&lt;/h2&gt;

&lt;p&gt;Not exactly. Playwright can run in either headful mode or headless mode (without a graphical user interface). By default, Playwright runs in headless mode, which means you won’t see what is happening in the browser when you run your script, but it will run faster. When you write and debug your scripts, it’s advisable to disable headless mode so you can see what your script is doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const browser = await chromium.launch({ headless: false })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the other hand, if performance is the most important thing for you, headless mode is the way to go since &lt;a href="https://blog.apify.com/headless-browsers-what-are-they-and-how-do-they-work/" rel="noopener noreferrer"&gt;headless browsers&lt;/a&gt; are quicker than real browsers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about Puppeteer and Selenium?
&lt;/h2&gt;

&lt;p&gt;Speaking of headless browsers, the names Puppeteer and Selenium immediately spring to mind. So how do they compare to their younger sibling? Puppeteer supports only JavaScript and TypeScript and works with Chromium, with experimental support for Firefox. Playwright supports Chromium, Firefox, and Safari via WebKit. Both Playwright and Selenium work with many programming languages, with Selenium supporting one more (Ruby). But Playwright’s greatest advantage over Selenium is its auto-waiting function.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔖 Related: &lt;a href="https://blog.apify.com/puppeteer-vs-selenium-for-automation/" rel="noopener noreferrer"&gt;Compare Puppeteer and Selenium for automation&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What languages does Playwright support?
&lt;/h2&gt;

&lt;p&gt;Playwright works with some of the most popular programming languages, including JavaScript, Python, Java, and C#. Its support of Chromium, Firefox, and WebKit provides a wide range of cross-browser automation and web testing capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  What platform does Playwright support?
&lt;/h2&gt;

&lt;p&gt;Playwright is a cross-platform framework. The browser binaries for Chromium, Firefox, and WebKit work across three platforms: Windows (and WSL), macOS (10.14 or above), and Linux (though you may need to install additional dependencies, depending on your Linux distribution).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔖 Related: &lt;a href="https://blog.apify.com/puppeteer-vs-selenium-for-automation/" rel="noopener noreferrer"&gt;Compare Cypress and Selenium for automation and web testing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How do I get started with Playwright?
&lt;/h2&gt;

&lt;p&gt;One thing that isn’t said enough about Playwright: its &lt;a href="https://playwright.dev/docs/intro" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; is superb. There you will find out how to install Playwright to get started.&lt;/p&gt;

&lt;p&gt;You can install the &lt;a href="https://marketplace.visualstudio.com/items?itemName=ms-playwright.playwright" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt;. After installation, open the command panel and type &lt;code&gt;Install Playwright&lt;/code&gt;. Alternatively, you can use the command line interface (CLI) and &lt;a href="https://playwright.dev/docs/intro#installing-playwright" rel="noopener noreferrer"&gt;install Playwright&lt;/a&gt; using the appropriate package manager for your language. For example, NPM with Node.js:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm init playwright@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That will give you the browsers and files you need to begin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;playwright.config.
tspackage.json
package-lock.json
tests/ 
    example.spec.ts
tests-examples/ 
    demo-todo-app.spec.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tests&lt;/code&gt; folder contains a basic example test to get you started, and the &lt;code&gt;tests-examples&lt;/code&gt; folder contains a more detailed example with tests for a demo todo app.&lt;/p&gt;
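&lt;p&gt;For orientation, a pared-down &lt;code&gt;playwright.config.ts&lt;/code&gt; might look like the sketch below. The option names come from Playwright's test configuration; the values are illustrative defaults rather than the exact file the installer generates:&lt;/p&gt;

```typescript
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',          // where the runner looks for *.spec.ts files
  fullyParallel: true,         // run test files in parallel
  reporter: 'html',            // produce an HTML report of the test run
  use: {
    headless: true,            // flip to false while debugging
    trace: 'on-first-retry',   // collect a trace when a test retries
  },
  // One project per browser engine to get cross-browser coverage:
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox',  use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit',   use: { ...devices['Desktop Safari'] } },
  ],
});
```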

&lt;p&gt;Alternatively, you can simply add Playwright to your existing project by calling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why use Playwright for web automation and testing?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Faster communication with the Chrome DevTools Protocol
&lt;/h3&gt;

&lt;p&gt;Most automation solutions use the WebDriver protocol to communicate with Chromium browsers, but Playwright provides much faster and more straightforward communication with the &lt;a href="https://chromedevtools.github.io/devtools-protocol/" rel="noopener noreferrer"&gt;Chrome DevTools protocol&lt;/a&gt;. And Playwright isn’t just for Chrome and Edge: &lt;a href="https://playwright.dev/docs/test-configuration" rel="noopener noreferrer"&gt;Playwright can be configured&lt;/a&gt; to test sites in Firefox and Safari as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The auto-waiting function
&lt;/h3&gt;

&lt;p&gt;Cross-browser and cross-language support aside, the auto-waiting function is Playwright’s greatest advantage over Puppeteer and Selenium. You don’t have to figure out when something is clickable because Playwright performs that action for you. You can emulate mouse clicks by using &lt;code&gt;await page.click()&lt;/code&gt;, and wait for actions in the browser to finish by using convenient APIs like &lt;code&gt;await page.waitForSelector()&lt;/code&gt; or &lt;code&gt;await page.waitForFunction()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This unique automatic waiting feature eliminates the need to write custom waits or sleep statements in your test scripts. That means you can focus on writing high-quality tests instead of worrying about writing the perfect waiting logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Record scripts with Codegen
&lt;/h3&gt;

&lt;p&gt;The Playwright documentation includes a &lt;a href="https://playwright.dev/docs/codegen-intro" rel="noopener noreferrer"&gt;test generator&lt;/a&gt; that shows you how to record your scripts with Codegen. You just need a single CLI command to kick off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx playwright codegen playwright.dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will open up an interactive browser and the Playwright inspector. Every action in the browser will be recorded in the inspector. You can then replay and adjust the generated script. In other words, Playwright generates test script code based on your interaction with the page. That means you can author tests out of the box without having to write the script manually.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Great debugging capabilities
&lt;/h3&gt;

&lt;p&gt;Playwright has some excellent &lt;a href="https://playwright.dev/docs/debug" rel="noopener noreferrer"&gt;debugging&lt;/a&gt; features. You can debug scripts while you run them, which is handy during local development, and you can also analyze and debug failed tests. You can open Playwright Inspector to enable debug mode with &lt;code&gt;npx playwright test --debug&lt;/code&gt; to debug all tests or &lt;code&gt;npx playwright test example --debug&lt;/code&gt; to debug one test. Alternatively, you can set the &lt;code&gt;PWDEBUG&lt;/code&gt; environment variable to run your scripts in debug mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Native mobile emulation
&lt;/h3&gt;

&lt;p&gt;Playwright supports native mobile emulation, which means you can test your web applications on mobile devices without having to set up an actual device. Playwright can emulate Safari on iOS as well as Android devices. Playwright's test runner provides numerous predefined configurations, making it easy to test your web application on multiple devices and screen sizes to ensure that it works as expected for all users without having to manually set up each configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Comprehensive reports
&lt;/h3&gt;

&lt;p&gt;Playwright provides comprehensive reporting options for test results. You can:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Export results as a machine-readable JSON file.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is useful if you want to integrate Playwright tests into a larger test suite or if you want to programmatically analyze the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Export the results as a stylish HTML page.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a great option if you want to share the test results with other members of your team or with stakeholders. The HTML report includes detailed information about the test runs, including the number of passed and failed tests, the duration of each test, and any errors that occurred during the test run.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔖 Related: &lt;a href="https://blog.apify.com/puppeteer-vs-selenium-for-automation/" rel="noopener noreferrer"&gt;Playwright end-to-end testing tutorial&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why use Playwright for web scraping?
&lt;/h2&gt;

&lt;p&gt;We’ve touched upon the brilliance of Playwright when it comes to web testing and automation, but its capabilities can also come in very handy when it comes to &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and data mining. Here’s why:&lt;/p&gt;

&lt;p&gt;It can be very difficult to scrape some websites with regular HTML tools. &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;Dynamic pages&lt;/a&gt; and &lt;a href="https://blog.apify.com/building-functional-ai-models-for-web-scraping/#3-fingerprint-and-header-generatorsa-case-for-anti-anti-scraping-protections" rel="noopener noreferrer"&gt;browser fingerprinting&lt;/a&gt; are two of the biggest challenges. Playwright’s headless mode helps overcome these problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Loading dynamic pages
&lt;/h3&gt;

&lt;p&gt;When it comes to pages loaded dynamically with AJAX or data rendered using JavaScript, you’ll need to render the page like a real user. HTML scrapers can’t do that. Headless browsers can. So, in such cases, you’ll need web scraping tools like &lt;a href="https://apify.com/apify/playwright-scraper" rel="noopener noreferrer"&gt;Playwright Scraper&lt;/a&gt; or &lt;a href="https://apify.com/apify/puppeteer-scraper" rel="noopener noreferrer"&gt;Puppeteer Scraper&lt;/a&gt; to load the page, execute its JavaScript, and scrape the required data.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Combatting browser fingerprinting
&lt;/h3&gt;

&lt;p&gt;Some websites now use fingerprinting to track users and block scraping bots. A scraper that uses a headless browser can emulate the fingerprint of a real device. Without a headless browser, it’s nearly impossible to pass the various anti-bot challenges that block your access to a website. This makes using Puppeteer or Playwright Scraper your best bet when getting blocked.&lt;/p&gt;

&lt;p&gt;Also, you can go even further and develop your own web scraper with &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, a Node.js library that helps you pass those challenges automatically using Puppeteer or Playwright.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/g1Ll9OlFwEQ"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;em&gt;Crawlee helps you build reliable scrapers fast. Quickly scrape data, store it, and avoid getting blocked with headless browsers, smart proxy rotation, and auto-generated human-like headers and fingerprints.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Web scraping with Playwright
&lt;/h2&gt;

&lt;p&gt;If you want to find out more about Playwright and web scraping, &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" rel="noopener noreferrer"&gt;this tutorial shows you how to build a scraper with Playwright&lt;/a&gt; in Node.js to extract data about GitHub topics.&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>automation</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Multimodal AI: what can it do, and why is it a game-changer?</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Tue, 03 Oct 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/multimodal-ai-what-can-it-do-and-why-is-it-a-game-changer-9gc</link>
      <guid>https://dev.to/apify/multimodal-ai-what-can-it-do-and-why-is-it-a-game-changer-9gc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're Apify, a full-stack&lt;/strong&gt; &lt;a href="https://apify.com/web-scraping"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and browser automation platform. This article about multimodal AI was inspired by our work on getting better data for LLMs and other machine learning models.&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai"&gt;&lt;strong&gt;Check us out&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is multimodal AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I'm not one for hype, but honestly, I don't think people have yet realized the &lt;a href="https://blog.apify.com/ai-web-scraping-trends-predictions/#the-rising-tide-of-multimodal-ai-%F0%9F%8C%8A"&gt;implications of multimodal AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For those not familiar with it, multimodal AI is an AI system that can understand multiple data types (image, audio, text, and numerical data) and can use them together to establish content and interpret context.&lt;/p&gt;

&lt;p&gt;If you haven't heard, OpenAI's ChatGPT now includes &lt;a href="https://openai.com/research/gpt-4v-system-card"&gt;GPT-4V(ision)&lt;/a&gt;, which can analyze graphics and photos. That means you can now get a language model to combine images with text (and audio) prompts. (More about audio later.)&lt;/p&gt;

&lt;p&gt;FYI, to use GPT-4V, you need a paid membership to ChatGPT Plus (currently priced at $20 per month), and it's still being rolled out in phases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blogs.bing.com/search/july-2023/Bing-Chat-Enterprise-announced,-multimodal-Visual-Search-rolling-out-to-Bing-Chat"&gt;Bing has also introduced multimodality&lt;/a&gt; to its chat mode, and Google is planning to launch its own multimodal competitor, &lt;a href="https://www.uctoday.com/collaboration/google-reportedly-nearing-release-of-gpt-4-competitor-gemini/"&gt;Gemini&lt;/a&gt;, any time now.&lt;/p&gt;

&lt;p&gt;I won't say, "The possibilities are endless!" because they're not yet. But already, GPT-4V has been used to do front-end web development, describe visual content, review product images, troubleshoot problems based on a photo, get advice on design, and decipher illegible text in historical manuscripts.&lt;/p&gt;

&lt;p&gt;It took about six months for OpenAI to make good on its promise that GPT-4 would be multimodal, but it's finally happening, and we've already seen examples of what it's capable of.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Y3SXr1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/10/multimodal-ai-3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Y3SXr1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/10/multimodal-ai-3.jpg" alt="Multimodal AI can understand and explain multimedia content" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Uses of vision in multimodal AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;👨💻 Front-end development&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Multimodality can recreate a website dashboard from screenshots or sketches. While its execution is not yet flawless, it shows promising signs of being able to reduce the time needed to go from design to prototype.&lt;/p&gt;

&lt;p&gt;Another front-end development possibility that multimodality has opened up is the ability to improve code. By using the result of one run as the prompt for the next run, the model can keep refining the code independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;📸 Explaining visual and multimedia content&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ChatGPT can now describe images in detail. It can provide captions, explain the humor in a meme or editorial cartoon, break down complex &lt;a href="https://www.adobe.com/express/create/infographic"&gt;infographics&lt;/a&gt; into simple text explanations, and describe the difference between one product photo and another.&lt;/p&gt;

&lt;p&gt;This has opened up a range of possibilities: creating text for a comic or graphic novel, getting advice on interior design, identifying poisonous food, and comparing and &lt;a href="https://blog.apify.com/product-matching-ai-pricing-intelligence-web-scraping/"&gt;matching products&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;📜 Optical character recognition for ancient manuscripts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I have a background in theology and late antiquity, so this one excites me more than it does most people. But its a big deal for anyone deeply involved in the humanities.&lt;/p&gt;

&lt;p&gt;While OCR is only a small part of GPT-4 Vision, it has immense significance for historians and scholars of ancient languages and literature. GPT-4V is capable of deciphering, translating, and analyzing historical manuscripts. Turns out that all those years I spent learning Ancient Greek and Latin were a waste of time.&lt;/p&gt;

&lt;p&gt;🤖 Curious about AI's ability to do &lt;a href="https://apify.com/web-scraping"&gt;&lt;strong&gt;web scraping&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;?&lt;/strong&gt; Check out &lt;a href="https://blog.apify.com/ai-web-scraping-tools/"&gt;&lt;strong&gt;AI web scraping tools: do they really work?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Uses of voice recognition in multimodal AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🔎 Search engines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Bing Chat has already made searching for online content easier. You needn't type your search query anymore: just say it aloud, and Bing will act just as it does when you type.&lt;/p&gt;

&lt;p&gt;It will be interesting to see how this impacts keyword research in the near future. People are unlikely to be as brief in speaking as they are when they type search queries. This might make it easier to understand the intent behind keyword searches.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🤖 Intelligent assistants&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The same ease can be applied to other things we've used ChatGPT for, but with the spoken word instead of the written one. Instead of typing prompts into an LLM, you can just state your intent, and the AI model will respond.&lt;/p&gt;

&lt;p&gt;How is this different from Siri or Alexa? With multimodal AI or a voice-to-text system like OpenAI's &lt;a href="https://openai.com/research/whisper"&gt;Whisper&lt;/a&gt;, you don't need to dictate every word; you only need to say what action you want it to perform, and the model will handle the rest for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The implications of multimodal AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now imagine combining these multimodal AI capabilities with a tool like &lt;a href="https://blog.apify.com/what-is-langchain/"&gt;LangChain&lt;/a&gt;, which enables you to link AI models to outside sources, such as your Google Docs or Notion.&lt;/p&gt;

&lt;p&gt;Soon, AI models will be able to function as intelligent assistants like the computer from Star Trek's Enterprise. You'll be able to ask an LLM to retrieve a document or image from your files just by stating your request, and it will fetch it for you.&lt;/p&gt;

&lt;p&gt;In other words, multimodal AI is on its way to taking automation to another level. We may never have to type in a search box or look through files ever again. And that's just the tip of the iceberg.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://blog.apify.com/what-is-machine-learning-doing-for-us/"&gt;machine learning&lt;/a&gt; continues to develop and AI models are trained to become more performant and capable of a broader range of things, it's not implausible that we'll all have our own personal AI assistants in our devices in the near future, and not just the user interface of ChatGPT.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;
      apify.com
    &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>llms</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Of centaurs and cyborgs: how AI is reshaping work</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 24 Sep 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/of-centaurs-and-cyborgs-how-ai-is-reshaping-work-32n0</link>
      <guid>https://dev.to/apify/of-centaurs-and-cyborgs-how-ai-is-reshaping-work-32n0</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;The well-intentioned know-it-all&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Christopher Lash was a tall, imposing, spectacled figure with a long white beard. A very erudite man, he seemed to have a profound knowledge of just about everything.&lt;/p&gt;

&lt;p&gt;I worked with him for many years, and for a long time, I took his word on most things. But, despite all his knowledge and confidence, I had learned enough about some of his favorite topics to start seeing holes in his arguments. When I questioned or pressed him on his spurious claims, he'd blush and apologize for misleading me.&lt;/p&gt;

&lt;p&gt;For him, whether he was saying anything true was secondary to whether he appeared to be teaching me something. It was better to provide a wrong answer than no answer at all.&lt;/p&gt;

&lt;p&gt;AI reminds me of him.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;I is for&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI is like that old friend of mine inasmuch as it's easy to be convinced that the information a generative AI model imparts is accurate because of the certainty with which it's communicated. But if you're even moderately versed in a subject, you might be surprised by the amount of error and bias there is in the information.&lt;/p&gt;

&lt;p&gt;I say AI, but of course - like most people nowadays - what I'm actually referring to is the deep learning models known as LLMs. (Sorry, image models: I do like you, but you're not relevant to this particular conversation).&lt;/p&gt;

&lt;p&gt;These LLMs (large language models) have become the go-to solution for just about anything related to knowledge and communication. A veritable jack of all trades, ChatGPT is now the first thing people think of when it comes to creating emails, articles, tables, code, ideas, and a whole host of other things.&lt;/p&gt;

&lt;p&gt;The problem is no one actually knows what LLMs are really capable of, much less the best way to use them, and - most worryingly of all - when and how they fail.&lt;/p&gt;

&lt;p&gt;Despite the tsunami of prompt tutorials which followed the whirlwind that ChatGPT unleashed on us, there's no instruction manual for using generative AI models. They're incredibly effective in some tasks and total (or partial) failures in others. Sometimes, Gen AI has flashes of brilliance, and at other times, it will fail enough to make you think the I in AI stands for idiot.&lt;/p&gt;

&lt;p&gt;This confusion about the use of AI models in the workplace was the impetus for &lt;a href="https://deliverypdf.ssrn.com/delivery.php?ID=459088119025094005101098105076086098010024009039040035126007003014097104119127123031122003024052103055112122090102119088015094123053009076092001103109021111106118070021024002111004119005127031006116069096083124005121005019106120116110092116027120004&amp;amp;EXT=pdf&amp;amp;INDEX=TRUE"&gt;Navigating the Jagged Technological Frontier&lt;/a&gt;, released on September 15, 2023. This is the first working paper by a team of social scientists in collaboration with Boston Consulting Group. It contains the findings of an experiment conducted to measure the effects of AI on knowledge worker productivity and quality.&lt;/p&gt;

&lt;p&gt;The evidence presented in the paper is pretty interesting, but it doesn't exactly make for easy reading. So, I'll unpack it for you without going into too much detail. By the end, you'll be able to determine whether you're a centaur or a cyborg.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The jagged technological frontier&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;758 consultants and 18 realistic tasks representing work done at a consulting company. That's what the experiment involved. Tasks were assigned according to three different conditions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No AI access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. GPT-4 access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. GPT-4 access with a prompt engineering overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The headline? Consultants who used GPT-4 outperformed those who didn't by a large margin and in every way.&lt;/p&gt;

&lt;p&gt;Consultants using AI were:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More productive&lt;/strong&gt; (they completed 12.2% more tasks)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster&lt;/strong&gt; (they completed tasks 25.1% more quickly)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Able to produce higher quality work&lt;/strong&gt; (results were 40% better)&lt;/p&gt;

&lt;p&gt;But there's a lot more to it than that.&lt;/p&gt;

&lt;p&gt;The conclusion of the study is that AI creates a jagged technological frontier where some tasks are easily done by AI while others, though seemingly similar in difficulty level, are outside the current capability of LLMs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;On tasks within the frontier, AI significantly improved human performance. Outside of it, humans relied too much on the AI and were more likely to make mistakes. Not all users navigated the jagged frontier with equal adeptness. While some completed their task incorrectly, others showcased a remarkable ability to harness the power of AI effectively.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Inside the frontier are the things AI is capable of, and outside the frontier are tasks it can't perform. When it came to the second category, &lt;strong&gt;consultants using AI were 19 percentage points less likely to produce correct solutions&lt;/strong&gt; compared to those without AI.&lt;/p&gt;

&lt;p&gt;The problem is that no one can see the frontier, and tasks you'd assume are jobs for a machine (like basic math) are things LLMs struggle with, while activities you'd associate with human creativity (like generating ideas or writing a sonnet) are things they can do pretty well.&lt;/p&gt;

&lt;p&gt;What makes this even more surprising when you think carefully about it is that both activities should be the same for AI. Math obviously involves numbers, but for language models, so does writing a poem. AI perceives in tokens, not words.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Is AI a skills leveler?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It's becoming apparent in other studies of AI that using LLMs acts as a skills leveler in the workplace. The Jagged Frontier paper confirms this. Consultants who scored the lowest at the start of the experiment had a 43% jump in performance when they got to use AI. The consultants who started with the highest score also got a boost but a less significant one.&lt;/p&gt;

&lt;p&gt;This suggests that companies should consider using AI technology more to raise workers to the top tiers of performance.&lt;/p&gt;

&lt;p&gt;But there are two sides to every coin. While AI can act as a skills leveler, relying on it too much can backfire. A &lt;a href="https://static1.squarespace.com/static/604b23e38c22a96e9c78879e/t/62d5d9448d061f7327e8a7e7/1658181956291/Falling+Asleep+at+the+Wheel+-+Fabrizio+DellAcqua.pdf"&gt;previous study&lt;/a&gt; had already demonstrated that recruiters who used high-quality AI became lazy, careless, and less skilled in their own judgment. They made worse decisions than those who used low-quality AI or none at all. Instead of using the LLM as a tool, they let it take over.&lt;/p&gt;

&lt;p&gt;The Jagged Frontier study found the same thing to be true. On tasks outside the frontier, the workers who used GPT-4 produced less accurate answers than those who were not allowed to use it.&lt;/p&gt;

&lt;p&gt;This aptly demonstrates the problem of the invisible frontier. It's easy to be deceived by AI if you can't see the line between what it can and can't do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FuhGgF0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705323518716/73f4807f-4cfb-4339-90fe-dc821427f7e8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FuhGgF0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705323518716/73f4807f-4cfb-4339-90fe-dc821427f7e8.jpeg" alt="" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Centaur or cyborg?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;According to the frontier study, there are two types of workers who use AI most effectively: centaurs and cyborgs.&lt;/p&gt;

&lt;p&gt;🐴 &lt;strong&gt;The centaurs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The mythical horse-human hybrid analogy refers to workers who divide activities between themselves and AI, delegating select tasks to the models. Work is thus part human, part AI.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Users with this strategy switch between AI and human tasks, allocating responsibilities based on the strengths and capabilities of each entity. They discern which tasks are best suited for human intervention and which can be efficiently managed by AI.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;🤖 &lt;strong&gt;The cyborgs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cyborgs are workers who integrate their task flow with the AI and continually interact with the technology. Every task involves AI-assisted human labor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cyborg users don't just delegate tasks; they intertwine their efforts with AI at the very frontier of capabilities. This strategy might manifest as alternating responsibilities at the subtask level, such as initiating a sentence for the AI to complete or working in tandem with the AI.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A centaur's thoughts on using LLMs for writing&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why should AI get to do all the fun stuff?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's something the frontier study doesn't consider:&lt;/p&gt;

&lt;p&gt;I barely use AI for writing, especially for first drafts. I'm a writer, and as such, I love the writing process. For me, the blank sheet is where the magic begins. It's where I start to form ideas: attention-grabbing opening lines, interesting angles, clever structure, and thought-provoking ways to end before I've even begun.&lt;/p&gt;

&lt;p&gt;Delegating the first draft to an AI model would take away one of the things I most enjoy about my work.&lt;/p&gt;

&lt;p&gt;That poses the question: will using AI too much make people's work boring? Could it mean employees will become dissatisfied doing a job they chose because they enjoy its processes?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AI doesn't make me faster&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not only does AI make writing less interesting, but it also fails to make it quicker and more efficient.&lt;/p&gt;

&lt;p&gt;I don't mean to brag, but I've always been an insanely fast writer. Despite the findings of that frontier study, I don't think using AI would help me churn out content at greater speed without sacrificing quality. The amount of time I'd have to spend prompting an LLM to produce something I'm happy with and then editing it is the same amount of time I'd take to write it myself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hkB4hSOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705323609492/6b28b19a-21e8-4f8a-be4b-4e8405780ea2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hkB4hSOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705323609492/6b28b19a-21e8-4f8a-be4b-4e8405780ea2.jpeg" alt="" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why I'm a centaur (how I use ChatGPT)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Notwithstanding the above, I have a confession to make: I have, at times, elicited the help of an LLM with varying degrees of success. So, permit me to conclude with three examples of how I've used AI (ChatGPT in this case) for content.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1) Finalizing content&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not long ago, I wrote this &lt;a href="https://blog.apify.com/wheels-of-innovation/"&gt;career story&lt;/a&gt;. I crafted the entire body of text without any AI assistance. I interviewed the subject of the story a couple of times and tried to find ways to put the information together in a fun and coherent way.&lt;/p&gt;

&lt;p&gt;What I ended up with was one long block of text: no subheadings and no title. For some reason, I just couldn't figure out how to break it all up into sections or what to call the article. So, I copied and pasted the whole article into ChatGPT and asked it to create a title, identify suitable breaks in the text, and create catchy subheadings for each section.&lt;/p&gt;

&lt;p&gt;The title and all the subheadings you see in that article (except for the emojis) were produced by ChatGPT, untouched and unadulterated. Not bad for a generative pre-trained transformer!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2) Writing and explaining code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of my first serious attempts at using ChatGPT for work was an article on &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/"&gt;web scraping with Selenium and Python&lt;/a&gt;. I didn't begin by using AI. Instead, I researched the topic to identify what aspects of &lt;a href="https://apify.com/web-scraping"&gt;web scraping&lt;/a&gt; with Selenium to focus on.&lt;/p&gt;

&lt;p&gt;Based on that research, I created sub-headings and used them for my prompt. I asked ChatGPT to show me how to scrape a website with Selenium in Python with Chrome (and headless Chrome), how to locate and wait for elements, take a screenshot, and execute JavaScript code.&lt;/p&gt;

&lt;p&gt;I checked the output against the Selenium documentation, video tutorials, and other articles and made some changes where the AI version seemed incomplete or too general. In the process, I discovered there was a new way to launch headless Chrome that ChatGPT couldn't have been aware of (LLMs lived in the past until very recently), so I updated the code.&lt;/p&gt;

&lt;p&gt;Now, with a basic structure for the tutorial ready, I began drafting the rest of the text. I tried to come up with some interesting things beyond the usual SEO content that often sounds even more robotic than ChatGPT. In the process, I decided to bring Monty Python into it. That's when it occurred to me to use the Monty Python website as the target for the tutorial. I then updated the tutorial to demonstrate how to scrape that particular website.&lt;/p&gt;

&lt;p&gt;Quite proud of what I'd managed to accomplish with the help of AI, I shared it with a developer to ensure accuracy. The first question the dev asked was, "Did you use ChatGPT for this?"&lt;/p&gt;

&lt;p&gt;Developers pick up on AI-generated code the way writers detect those trite and hackneyed ChatGPT conclusions. Also, the code was too generic to be functional, so he made some small fixes to make it usable.&lt;/p&gt;

&lt;p&gt;The moral of the story? Don't use AI to write things you wouldn't be able to write without it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3) Fixing (or generating) tables&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Tables in articles can come in very handy, as you can provide a lot of information without long-winded writing. They're particularly popular amongst developers for comparing tools and methods. But sometimes, the various tools at hand being what they are, I run into problems.&lt;/p&gt;

&lt;p&gt;A recent example is an HTML table that looked fine in light mode but invisible in dark mode. I had three choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;a) spend ages trying to figure out how to fix it to work in both modes (I suck at CSS),&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;b) interrupt a colleague to help me out,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;c) ask my underworked AI assistant to fix it for me.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I went for option c) since ChatGPT had nothing better to do. I copied and pasted the HTML and asked it to fix the table so it would appear in both light and dark modes. It produced the altered HTML in a flash, and it worked like a charm.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How do you use AI in the workplace?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Navigating the Jagged Technological Frontier paper demonstrates that when AI is used for the right things, it makes workers faster, more efficient, and more productive. It can act as a skills leveler and can take a lot of the drudgery out of everyday tasks. But it can also make us lazy, inefficient, and downright stupid.&lt;/p&gt;

&lt;p&gt;The frontier study suggests that the choice isn't whether or not to use AI but when and how to use it. The evidence points to two effective approaches to the problem: dividing tasks between human work and AI (centaurs) and combining human oversight and AI in every aspect of work (cyborgs).&lt;/p&gt;

&lt;p&gt;Based on my own use of AI in the workplace and the findings of the frontier study, I know I'm a centaur.&lt;/p&gt;

&lt;p&gt;Which one are you?&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;
      apify.com
    &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>llms</category>
    </item>
    <item>
      <title>How to improve AI models with web scraping and data augmentation</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Mon, 11 Sep 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/how-to-improve-ai-models-with-web-scraping-and-data-augmentation-id7</link>
      <guid>https://dev.to/apify/how-to-improve-ai-models-with-web-scraping-and-data-augmentation-id7</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Why you need to improve AI models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Did you know that &lt;a href="https://www.nature.com/articles/s41598-022-15245-z?fbclid=IwAR3S7mgs0u8p9gtyK9sVyVAvCRxNQRwzCLm9iGlSQOA8I9JlPTYRV-YZBDg" rel="noopener noreferrer"&gt;AI models degrade over time&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;No? Hardly surprising. It's not something that often pops up in the deluge of content about the latest AI news, planned upgrades, and the next big step for deep learning.&lt;/p&gt;

&lt;p&gt;So, sorry to burst that bubble. But it's a fact.&lt;/p&gt;

&lt;p&gt;No matter how sophisticated the algorithm or how diverse the training dataset is, if you don't retrain and improve your AI models, not only will they not get better, they'll get worse.&lt;/p&gt;

&lt;p&gt;Why is that? You're lucky I'm not busy. I'll explain it to you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.nannyml.com/blog/91-of-ml-perfomance-degrade-in-time?fbclid=IwAR2bQquuJlJV1ZY0JFiYyyYADlb86Y4gDuLbPveijxziet8cJ142-u3mfQ0" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;91% of machine learning models degrade over time&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The problem of temporal quality degradation in AI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The quality degradation of AI models stems from the fact that they become dependent on the data as it was at the time of training. Data-producing environments often alter over time, and their statistical properties change alongside them. In other words, as a model is tested on current datasets in quickly changing contexts, the model's predictive ability inevitably declines. This is known as concept drift, model drift, or AI aging, which can significantly impact the quality of AI models.&lt;/p&gt;
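&lt;p&gt;To make concept drift concrete, here's a toy sketch in plain Python (my own invented illustration, not from the study cited above): a one-feature threshold classifier is fitted on data from one distribution, and its accuracy falls once the data-producing environment shifts.&lt;/p&gt;

```python
import random
import statistics

random.seed(42)

def sample(mean_neg, mean_pos, n=1000):
    """Draw labeled points from two Gaussian clusters."""
    data = [(random.gauss(mean_neg, 1.0), 0) for _ in range(n)]
    data += [(random.gauss(mean_pos, 1.0), 1) for _ in range(n)]
    return data

def fit_threshold(data):
    """'Train' a one-feature classifier: the midpoint of the class means."""
    neg = [x for x, y in data if y == 0]
    pos = [x for x, y in data if y == 1]
    return (statistics.mean(neg) + statistics.mean(pos)) / 2

def accuracy(data, threshold):
    correct = sum(1 for x, y in data if int(x > threshold) == y)
    return correct / len(data)

# Train on the world as it looked at training time...
train = sample(mean_neg=0.0, mean_pos=2.0)
t = fit_threshold(train)

# ...then the data-producing environment shifts (concept drift).
fresh = sample(mean_neg=0.0, mean_pos=2.0)
drifted = sample(mean_neg=1.5, mean_pos=3.5)

print(round(accuracy(fresh, t), 2))    # holds up on similar data
print(round(accuracy(drifted, t), 2))  # degrades after the shift
```

The model itself never changed; only the data did, which is exactly why deployed models need monitoring and retraining.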

&lt;h3&gt;
  
  
  &lt;strong&gt;The problem of AI model collapse&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Another form of AI degeneration is a phenomenon known as model collapse. It occurs when AI is trained on &lt;a href="https://research.ibm.com/blog/what-is-synthetic-data" rel="noopener noreferrer"&gt;synthetic data&lt;/a&gt;. By &lt;em&gt;synthetic data&lt;/em&gt;, we mean artificially generated information created to augment or replace real data to improve machine learning models.&lt;/p&gt;

&lt;p&gt;This problem of model collapse has been exacerbated by people filling the internet with AI-generated content and then feeding that content to AI models.&lt;/p&gt;

&lt;p&gt;Researchers from the UK and Canada have demonstrated that the &lt;a href="https://arxiv.org/abs/2305.17493v2" rel="noopener noreferrer"&gt;use of model-generated content in training causes irreversible defects&lt;/a&gt; in the resulting models. This is because they forget information about important but less common aspects of the data in the underlying data distribution. As a result, they begin to produce increasingly similar outputs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2305.17493v2" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Model collapse is a degenerative process affecting generations of learned generative models, where generated data ends up polluting the training set of the next generation of models&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3 ways to improve AI models&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Web scraping to feed AI models with the right data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With their ever-growing user base, AI solutions continue to become more complex as better and more diverse data is needed to develop them. &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; is the go-to solution for this problem. Web scraping is a method of harvesting data from websites and the most efficient way to &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;collect web data to expand your training dataset&lt;/a&gt; and improve LLMs.&lt;/p&gt;

&lt;p&gt;Web scraping is also used for customizing and fine-tuning generative AI. By adding information relevant to your use case (for example, feeding a language model data from your website to &lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;create an AI chatbot&lt;/a&gt;), you can make sure that the information the language model provides to customers is accurate.&lt;/p&gt;

&lt;p&gt;💡 Related: &lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;What is retrieval-augmented generation, and why use it for chatbots?&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/8uvHH-ocSes"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;If you're &lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;web scraping for large language models&lt;/a&gt; like ChatGPT or image models like MidJourney, you want to avoid extracting AI-generated content to get quality training datasets.&lt;/p&gt;

&lt;p&gt;The problem is there's no mass labeling mechanism to differentiate between AI-generated and human-generated data. So unless you're extracting information that you're confident is human content (from before the time of ChatGPT, for example), there's no easy way to make the distinction and therefore be certain that you're introducing human-generated datasets back into the AI model's training.&lt;/p&gt;

&lt;p&gt;A more reliable solution is to ensure that you retain a copy of an exclusively human-produced dataset to periodically refresh and retrain the model. But the problem of degradation means you can't have one dataset of human-produced content forever. You'll need reliable, up-to-date information as well, which means you'll need new datasets of human-produced content to retrain the AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Enhancing data quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Web scraping to expand your dataset isn't enough to improve generative AI. The extracted data needs to be processed before feeding it to the model. For example, if you use a web scraping tool like &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;, you can simultaneously use it to remove duplicate text lines and HTML elements you don't want in the training dataset. This is useful for data annotation, data labeling, and improving model accuracy.&lt;/p&gt;
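&lt;p&gt;The deduplication part of that cleanup can be sketched in a few lines of plain Python. The pages and the repetition threshold below are made up for illustration (a tool like Website Content Crawler handles this for you): lines that recur across many scraped pages are likely navigation or footer boilerplate, not training content.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical scraped pages: each repeats the same navigation and footer lines.
pages = [
    "Home | Blog | Pricing\nWhy web scraping matters for AI.\nCopyright 2023",
    "Home | Blog | Pricing\nHow to fine-tune a language model.\nCopyright 2023",
    "Home | Blog | Pricing\nData augmentation for NLP, explained.\nCopyright 2023",
]

# Count how many times each exact line appears across the whole crawl.
line_counts = Counter(line for page in pages for line in page.splitlines())

def clean(page, threshold=2):
    """Keep only lines that occur fewer than `threshold` times overall."""
    kept = [ln for ln in page.splitlines() if threshold > line_counts[ln]]
    return "\n".join(kept)

cleaned = [clean(p) for p in pages]
print(cleaned[0])  # only the article text survives
```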

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Augmenting data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data augmentation, like synthetic data, aims to increase the size and diversity of the training data for machine learning models. A key difference is that synthetic data is generated from scratch, while data augmentation uses an existing training dataset to create new examples.&lt;/p&gt;

&lt;p&gt;Synthetic data can introduce bias or lose realism, which is why it's the main cause of model collapse. Augmented data, however, maintains the quality and diversity of the training dataset. Combining the two achieves the best results in machine learning applications.&lt;/p&gt;

&lt;p&gt;While data augmentation is most popular in the area of computer vision applications (for example, flipping, rotating, or scaling an image to create a new data entry), it's also one of many handy &lt;a href="https://blog.apify.com/nlp-techniques/" rel="noopener noreferrer"&gt;techniques for NLP&lt;/a&gt; (natural language processing). In the context of language models, augmenting data involves altering the text (for example, replacing words with synonyms without changing the meaning) to create a new data entry.&lt;/p&gt;
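&lt;p&gt;In plain Python, synonym-replacement augmentation might look like this minimal sketch. The synonym table here is hand-made for illustration; a real pipeline would draw on a thesaurus resource such as WordNet.&lt;/p&gt;

```python
import random

random.seed(7)

# Tiny hand-made synonym table, purely for illustration.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "improve": ["enhance", "boost"],
    "model": ["system"],
}

def augment(sentence, p=0.5):
    """Create a new training example by swapping some words for synonyms."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and p > random.random():
            out.append(random.choice(options))
        else:
            out.append(word)
    return " ".join(out)

original = "a quick way to improve your model"
for _ in range(3):
    print(augment(original))
```

Each output keeps the sentence's meaning and length but gives the model a new surface form to learn from.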

&lt;h2&gt;
  
  
  &lt;strong&gt;Should you retrain your AI model?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Retraining AI regularly is the most obvious solution to AI degeneration. You should monitor how your model performs after deployment compared with how well it functioned during training. If you see a decline in performance, then it's time to retrain the model with additional sources of ground truth, manual &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/" rel="noopener noreferrer"&gt;data labeling&lt;/a&gt;, and large data volumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The final word: AI needs human-generated datasets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I said earlier that the problem with collecting web data for re-training AI models is the data pollution caused by AI-generated content. While web scraping is definitely the best way to feed specific web data to LLMs (to customize AI chatbots, for example), it won't solve the problem of degradation and model collapse on its own.&lt;/p&gt;

&lt;p&gt;You need to monitor your AI model, retrain it with fresh data, and make sure you have human-generated content for the retraining. For that, content produced by LLMs won't help. You'll need human writers like me! 😀&lt;/p&gt;

&lt;p&gt;We've all heard about the AI arms race, but there'll soon be a scramble for human content, too. Only those companies and platforms with access to human-generated data will be able to create the best quality generative AI models.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions about AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I train an AI model?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Whether your AI is an LLM or an image model, the basic steps for training AI are the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Prepare the training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Train the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluate and iterate on your model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Get predictions from your model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interpret prediction results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
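&lt;p&gt;The steps above can be sketched end to end with a deliberately tiny "model" (a single learned threshold). This is my own toy illustration of the workflow, not a production recipe.&lt;/p&gt;

```python
import random

random.seed(0)

# 1. Prepare the training data: numbers to be labeled small (0) or large (1).
raw = [random.uniform(0, 10) for _ in range(200)]

# 2. Create a dataset of (feature, label) pairs; the true boundary is 5.
dataset = [(x, int(x > 5)) for x in raw]

# 3. Train the model: search for the threshold that best fits the data.
def evaluate(threshold):
    hits = sum(1 for x, y in dataset if int(x > threshold) == y)
    return hits / len(dataset)

best = 0.0
for candidate in [i / 10 for i in range(100)]:  # 4. Evaluate and iterate.
    if evaluate(candidate) > evaluate(best):
        best = candidate

# 5. Get predictions from the model.
predictions = [int(x > best) for x in [1.2, 7.8]]

# 6. Interpret the results: the learned boundary should sit near 5.
print(best, predictions)
```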

&lt;h3&gt;
  
  
  &lt;strong&gt;How can I make my AI more reliable?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To improve an AI model, you need to train and optimize it with appropriate and diverse datasets and algorithms. This will improve accuracy and efficiency and help to reduce variance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I build a generative AI model?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The process of &lt;a href="https://blog.apify.com/what-is-generative-ai/#how-do-you-build-a-generative-ai-model" rel="noopener noreferrer"&gt;building generative AI models&lt;/a&gt; has three main steps: prototyping, development, and deployment. The prototyping stage begins with data collection and ends with result analysis. The development stage begins with data preparation and ends with model optimization. The deployment stage begins with pipelining and ends with scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How are large language models trained?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LLMs like GPT and BERT are trained with a large dataset from various data sources scraped from the web. This is what enables large language models to generate output for a wide range of tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is web scraping?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; is a technique used to automatically extract data from websites and online sources. It involves using software tools or scripts to access web pages, parse the HTML content, and retrieve specific information.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is data augmentation?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data augmentation is a collection of techniques that manage the process of automatically generating high-quality data on top of existing data. It is common in computer vision applications and sometimes used in natural language processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is degradation in AI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As a model is tested on current datasets in quickly changing contexts, the model's predictive ability inevitably declines. This change in accuracy leads to model degradation. This process of decreasing performance is also known as model drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is AI model collapse?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Model collapse is a degenerative process affecting generations of learned generative models, where generated data ends up polluting the training set of the next generation of models.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>AI web scraping tools: do they really work?</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 20 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/ai-web-scraping-tools-do-they-really-work-5hfg</link>
      <guid>https://dev.to/apify/ai-web-scraping-tools-do-they-really-work-5hfg</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;What is AI-powered web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The internet is currently saturated with terms like AI-driven and AI-powered web scrapers. But is AI web scraping really a thing?&lt;/p&gt;

&lt;p&gt;While there are certainly some excellent &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;web scraping tools for AI&lt;/a&gt; out there, most of those advertised as AI-powered web scrapers are, well, just web scrapers. AI has been stuffed in there because that's what everyone's obsessed with at the moment.&lt;/p&gt;

&lt;p&gt;So, I did my due diligence and tried out a few of these so-called AI scraping tools, identified two that are arguably worthy of the name, and explored the possibility of using GPT models to do web data extraction, as well.&lt;/p&gt;

&lt;p&gt;Here are my findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AI-powered web scrapers&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;BrowseAI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First up is &lt;a href="https://www.browse.ai/" rel="noopener noreferrer"&gt;BrowseAI&lt;/a&gt;. It's basically an Apify-like SaaS platform. The Chrome extension/web app allows you to record user actions in the browser, and you can upload the recording on the platform and run it there.&lt;/p&gt;

&lt;p&gt;You provide the URL of a page you want to scrape, like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcz4upk57m0u5jiqiagv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcz4upk57m0u5jiqiagv.png" alt="AI web scraper - BrowseAI - Apify Blog URL" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Naturally, I chose the &lt;a href="https://blog.apify.com/" rel="noopener noreferrer"&gt;Apify blog&lt;/a&gt; because the content is awesome!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6ddrzcb3xgq7yv43z7t.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6ddrzcb3xgq7yv43z7t.jpeg" alt="AI web scraper - BrowseAI - Apify Blog" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a while, the web page opens in your browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn40r0l7a31ytrfd8zg7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn40r0l7a31ytrfd8zg7d.png" alt="AI web scraper - BrowseAI - opening Apify blog web page" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can then extract data with point-and-click tools that automatically recognize repeating components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F136zead2siptvl3qt8lo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F136zead2siptvl3qt8lo.jpeg" alt="AI web scraper - BrowseAI - repeating components" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg6dj7731sd53rmslzdj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg6dj7731sd53rmslzdj.jpeg" alt="AI web scraper - BrowseAI - point and click tools" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can pick parts of those repeating components (things like title and author in this case), select them by clicking, and name the columns in the resulting table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr6wsrr9ucadcsdnbd0u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr6wsrr9ucadcsdnbd0u.jpeg" alt="AI web scraper - BrowseAI - resulting table" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of BrowseAI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The recorder has intuitive controls and a smart UI for selecting data to scrape.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It's a no-code solution, so it's easy for non-developers to use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons of BrowseAI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Performance dips while recording.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Because it's a no-code solution, there's little room for customization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is BrowseAI an AI web scraping tool?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BrowseAI is basically &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; plus a recorder. It provides Prebuilt Robots (which are essentially what &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actors&lt;/a&gt; are) and a platform to run the bots on (just like the Apify platform).&lt;/p&gt;

&lt;p&gt;While BrowseAI is a pretty neat no-code web scraping tool, I wouldn't go so far as to call it an AI-powered web scraper. And if you're a dev who wants more customization, anti-blocking features, proxies, datasets, and other crucial things for serious data extraction projects, &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;web scraping with Apify&lt;/strong&gt;&lt;/a&gt; is an alternative solution you should consider.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/store/categories/ai" rel="noopener noreferrer"&gt;https://apify.com/store/categories/ai&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Kadoa&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next up is &lt;a href="http://Kadoa.com" rel="noopener noreferrer"&gt;Kadoa.com&lt;/a&gt;, an online service that uses &lt;a href="https://blog.apify.com/what-is-generative-ai/" rel="noopener noreferrer"&gt;generative AI models&lt;/a&gt; for automated &lt;a href="https://blog.apify.com/web-data-extraction/" rel="noopener noreferrer"&gt;data extraction&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.kadoa.com/playground" rel="noopener noreferrer"&gt;Kadoa Playground&lt;/a&gt;, you input a URL, and the service will analyze the page using AI models to extract data automatically.&lt;/p&gt;

&lt;p&gt;You can then select which data you want to scrape, making the process quick and efficient.&lt;/p&gt;

&lt;p&gt;This can be especially useful for those who need to collect large amounts of data from websites for research or business purposes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsbgi6plzetkdrzin4xs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsbgi6plzetkdrzin4xs.jpeg" alt="AI web scraper - Kadoa - Automated data extraction with AI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, I went with the Apify blog. Did I mention how awesome it is?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cenmhuqr3ymgnv83u1y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cenmhuqr3ymgnv83u1y.jpeg" alt="AI web scraper - Kadoa - Apify Blog" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After analyzing the page, the service asks what data you want to extract. In this case, it found out that &lt;a href="http://blog.apify.com" rel="noopener noreferrer"&gt;blog.apify.com&lt;/a&gt; contains links to blog posts and articles, so it offered to scrape these:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3yb2ojitwm5ehokev8e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3yb2ojitwm5ehokev8e.jpeg" alt="AI web scraper - Kadoa - Apify blog posts" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After picking "Blog posts", Kadoa gave me the option to customize the scrape even more:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf46uzmqbjhcuauc2v4w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf46uzmqbjhcuauc2v4w.jpeg" alt="AI web scraper - Kadoa - customize the scrape" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's cool is that it didn't ask for CSS/XPath selectors but let me give commands in plain natural language (English, in this case).&lt;/p&gt;

&lt;p&gt;As I wanted to scrape the titles of the blog posts, their respective authors, and the publication dates, I just added 3 fields named &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, and &lt;code&gt;pub_date&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There's no required syntax, as Kadoa makes extensive use of generative AI models to deal with that.&lt;/p&gt;

&lt;p&gt;After a while, the service gave me the result as a neatly formatted JSON array:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bs2ws2rh80ipbl8yoog.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bs2ws2rh80ipbl8yoog.jpeg" alt="AI web scraper - Kadoa - JSON array" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Kadoa.com&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast and easy to use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons of Kadoa.com&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The whole project is still at an early stage, so it has some limitations:&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The playground doesn't work for generic homepages, sites behind a login, sites with scraping preventions, or sites that require click automation.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're a developer who needs to scrape those things (and frankly, for any large-scale scraping task, you really do), then &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;&lt;strong&gt;Website Content Crawler&lt;/strong&gt;&lt;/a&gt; is an alternative you should consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Kadoa an AI web scraping tool?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think Kadoa is worthy of the "AI" in "AI-powered web scraping". The AI is what makes Kadoa so easy for non-developers to use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;https://blog.apify.com/webscraping-ai-data-for-llms/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Using GPT models for data extraction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, those are two ready-made AI web scraping products you could try, but another possibility is to use AI (&lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/#what-are-large-language-models-llms" rel="noopener noreferrer"&gt;LLMs&lt;/a&gt; in this case) directly.&lt;/p&gt;

&lt;p&gt;For example, you can &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;build a scraper with Crawlee&lt;/a&gt; that extracts text from a page, feeds it to an LLM, and says, "Make a JSON out of this."&lt;/p&gt;

&lt;p&gt;💻 "Here follows &lt;code&gt;body.innerText&lt;/code&gt; for a blog listing page. Extract article names, descriptions, author names, and creation dates, and format those as a JSON array..."&lt;/p&gt;
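&lt;p&gt;As a minimal sketch of that idea (the schema keys, the sample text, and the commented-out OpenAI client call are illustrative, not from the article):&lt;/p&gt;

```python
# Sketch: wrap a page's visible text in an instruction that names the target
# JSON schema, then hand the prompt to an LLM. Schema keys are illustrative.
def build_extraction_prompt(page_text: str) -> str:
    """Wrap body.innerText in an instruction describing the wanted schema."""
    return (
        "Here follows body.innerText for a blog listing page. "
        "Extract article names, descriptions, author names, and creation "
        "dates, and format those as a JSON array of objects with keys "
        "articleName, description, authorName, creationDate.\n\n" + page_text
    )

prompt = build_extraction_prompt("Example post\nBy Jane Doe, Aug 1, 2023")

# Feeding it to a model (OpenAI Python client shown for shape; needs a key):
# from openai import OpenAI
# completion = OpenAI().chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )
# articles = completion.choices[0].message.content  # the JSON text
```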

&lt;p&gt;You can see an example in the &lt;a href="https://platform.openai.com/playground/p/ulZCZx2rHxGgMwW34B9COpMy?model=gpt-3.5-turbo" rel="noopener noreferrer"&gt;OpenAI playground&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmpai9ex87rwhrqbk344.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmpai9ex87rwhrqbk344.png" alt="AI web scraping - data extraction with GPT model - prompt" width="734" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here's the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "articleName": "How web scraping and AI are helping to find missing children",
    "description": "The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.",
    "authorName": "Theo Vasilis",
    "creationDate": "Aug 7, 2023"
  },
  {
    "articleName": "Google Maps scraping manual: how to extract reviews, images, restaurants, and more 📍 📚",
    "description": "Welcome to your comprehensive guide to extracting valuable data from Google Maps. In this manual, we will walk you through various techniques and tools to help you scrape images, extract restaurant data, gather contact details, scrape reviews, and much more.",
    "authorName": "Natasha Lekh",
    "creationDate": "Jul 26, 2023"
  },
  {
    "articleName": "What is generative AI?",
    "description": "What you need to know about generative AI and what it means for developers.",
    "authorName": "Theo Vasilis",
    "creationDate": "Jun 23, 2023"
  },
  {
    "articleName": "Apify tutorial contest 🏆",
    "description": "Create guides on how to use the Apify platform or Crawlee to build web scrapers, and win up to $1,000 of free platform credits for your own projects!",
    "authorName": "Theo Vasilis",
    "creationDate": "May 19, 2023"
  },
  {
    "articleName": "Web scraping for AI: how to collect data for LLMs",
    "description": "A tutorial that shows you how to crawl, extract, and process web data to feed, fine-tune, or train large language models.",
    "authorName": "Theo Vasilis",
    "creationDate": "Aug 10, 2023"
  },
  {
    "articleName": "Puppeteer tutorial: submitting forms, clicking buttons, and handling inputs",
    "description": "Find out how to use Puppeteer to handle forms, buttons, and inputs. Learn about type method, click method, and how to deal with text fields, dropdowns, and checkboxes.",
    "authorName": "Ayodele Aransiola",
    "creationDate": "Aug 9, 2023"
  },
  {
    "articleName": "How web scraping and AI are helping to find missing children",
    "description": "The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.",
    "authorName": "Theo Vasilis",
    "creationDate": "Aug 7, 2023"
  },
  {
    "articleName": "Selenium Grid: what it is and how to set it up",
    "description": "Learn about the Selenium Grid architecture and explore its use in scenarios like large test suites, cross-browser testing, and continuous integration. This comprehensive guide also includes steps for setting up Selenium Grid and practical tips for efficient parallel test execution.",
    "authorName": "Percival Villalva",
    "creationDate": "Aug 3, 2023"
  },
  {
    "articleName": "Is web scraping legal?",
    "description": "Web scraping is legal if you scrape data publicly available on the internet. But you should be careful when scraping personal data or intellectual property. We cover the confusion surrounding the legality of web scraping and give you tips for compliant and ethical scrapers.",
    "authorName": "Ondra Urban",
    "creationDate": "Aug 3, 2023"
  },
  {
    "articleName": "Traditional NLP techniques and the rise of LLMs",
    "description": "The field of NLP has changed with the rise of LLMs, but NLP still has a role to play. Apply NLP techniques to scraped data and learn about tokenization, stemming, lemmatization, removing stop words, and more NLP techniques.",
    "authorName": "Usama Jamil",
    "creationDate": "Aug 2, 2023"
  },
  {
    "articleName": "10 reasons tourists hate European landmarks (according to data from Google Maps)",
    "description": "A small data project to visualize and analyze bad Google Maps reviews of popular European landmarks.",
    "authorName": "Natasha Lekh",
    "creationDate": "Aug 1, 2023"
  },
  {
    "articleName": "Python and machine learning",
    "description": "Learn how Python and machine learning intersect to solve complex problems that defeat traditional programming methods. Find out about Pandas, TensorFlow, Scikit-learn, and how they can transform data.",
    "authorName": "Percival Villalva",
    "creationDate": "Jul 31, 2023"
  },
  {
    "articleName": "Top 5 books on AI",
    "description": "Explore the world of AI through a comprehensive selection of books recommended by business leaders. These reads provide an in-depth understanding of AI's history, machine learning, generative AI, diversity in AI, and AI for cybersecurity.",
    "authorName": "Guest Author",
    "creationDate": "Jul 29, 2023"
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is more resilient to page changes than regular scraping approaches, as it doesn't use CSS selectors. These can stop working after a redesign or when developers change the page layout, for example.&lt;/p&gt;

&lt;p&gt;Data consistency depends heavily on the prompt you provide. You should be as specific as possible and always describe the schema (attribute names, such as &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;author_name&lt;/code&gt;, &lt;code&gt;publication_date&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Also, keep in mind that the GPT model only remembers the current context and will not keep references to articles, authors, or concepts it has seen on previous pages.&lt;/p&gt;

&lt;p&gt;As a result, while it will transform pages to JSON arrays just fine, it will struggle to transform the data to RDF (Resource Description Framework).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvjt5v1ljrvh8dpgu37g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvjt5v1ljrvh8dpgu37g.png" alt="AI web scraping - data extraction with GPT model - RDF graph" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of an RDF graph (taken from &lt;a href="https://docs.stardog.com/tutorials/rdf-graph-data-model" rel="noopener noreferrer"&gt;Stardog&lt;/a&gt;). Using GPT to create these graphs from web content can be difficult because of the limited context size (e.g., "I'm scraping the Love Me Do page, but I don't know how to refer to the The Beatles node because its definition was on a different page").&lt;/p&gt;

&lt;p&gt;RDF models are cool because they can answer questions like, "Which friends of John Lennon live in Liverpool and have worked with him on at least two albums?" Regular databases struggle with this because you have to label all the entities and their relations manually.&lt;/p&gt;
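&lt;p&gt;To make that concrete, here's a toy sketch of the triple idea in Python (the identifiers and facts are invented for the example; a real RDF store would answer this with a SPARQL query):&lt;/p&gt;

```python
# Toy illustration of the RDF idea: facts live as (subject, predicate, object)
# triples keyed by stable identifiers, and queries join across triples.
# All identifiers and facts below are made up for the sketch.
triples = [
    ("ex:JohnLennon", "ex:friendOf", "ex:PersonA"),
    ("ex:JohnLennon", "ex:friendOf", "ex:PersonB"),
    ("ex:JohnLennon", "ex:workedOn", "ex:Album1"),
    ("ex:JohnLennon", "ex:workedOn", "ex:Album2"),
    ("ex:PersonA", "ex:livesIn", "ex:Liverpool"),
    ("ex:PersonA", "ex:workedOn", "ex:Album1"),
    ("ex:PersonA", "ex:workedOn", "ex:Album2"),
    ("ex:PersonB", "ex:livesIn", "ex:London"),
]

def objects(subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def friends_in_city_with_shared_albums(person, city, min_albums):
    """Friends of `person` living in `city` who share >= min_albums albums."""
    return sorted(
        f for f in objects(person, "ex:friendOf")
        if city in objects(f, "ex:livesIn")
        and len(objects(f, "ex:workedOn") & objects(person, "ex:workedOn"))
        >= min_albums
    )

print(friends_in_city_with_shared_albums("ex:JohnLennon", "ex:Liverpool", 2))
# → ['ex:PersonA']
```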

&lt;p&gt;Connecting a large language model with a web crawler may seem like the go-to solution for parsing webpages and creating RDF graphs out of them, but it's tricky because of the limited context memory of today's LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Can AI do web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, can you use AI to do web scraping, and, more to the point, should you?&lt;/p&gt;

&lt;p&gt;As with other uses of GPT models, AI tools are most helpful to those who know their field well enough to moderate and correct them.&lt;/p&gt;

&lt;p&gt;If you don't know how to code, you shouldn't trust an AI to do it for you.&lt;/p&gt;

&lt;p&gt;If you're a developer, then you may find GPT models helpful for certain aspects of web scraping, especially if you're particularly good at prompt engineering, but I don't think they're ready to steal your job just yet.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web scraping for AI: how to collect data for LLMs</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Wed, 09 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-for-ai-how-to-collect-data-for-llms-9e5</link>
      <guid>https://dev.to/apify/web-scraping-for-ai-how-to-collect-data-for-llms-9e5</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Generative AI solutions begin with web scraping&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For months now, I've been bleating on about the &lt;a href="https://blog.apify.com/large-language-models-are-they-ai/" rel="noopener noreferrer"&gt;limitations of large language models&lt;/a&gt;. Don't get me wrong. I think they're great for certain things. But people tend to overestimate their capabilities or misinterpret what they're supposed to be used for.&lt;/p&gt;

&lt;p&gt;One of the limitations of LLMs, which I think is pretty well-documented now, is their inability to produce current, reliable information. &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; is the go-to solution for this problem.&lt;/p&gt;

&lt;p&gt;📢 As of September 27, 2023, GPT-4's knowledge is no longer limited to data before September 2021:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43mt4bwziad6usw0u8pt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43mt4bwziad6usw0u8pt.jpeg" alt="ChatGPT can browse the internet" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Web scraping is not only one of the &lt;a href="https://blog.apify.com/chatgpt-web-scraping/" rel="noopener noreferrer"&gt;methods used to train LLMs&lt;/a&gt;; it's also the technique developers use to &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;improve and customize generative AI models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/8uvHH-ocSes"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Watch Website Content Crawler in action and learn how to integrate it with LangChain in this live demo&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;web scraping tools&lt;/a&gt; (such as the one I'm going to use in the tutorial below) can help feed, fine-tune, or train LLMs or provide context for prompts for ChatGPT and similar language models. This can come in handy for a number of things, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating custom chatbots for customer support&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generating personalized content&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Summarizing, translating, and proofreading texts at scale&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing Website Content Crawler for data ingestion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To feed and fine-tune LLMs, it's not enough to just scrape data. You need to process and clean it before you can use it for &lt;a href="https://blog.apify.com/what-is-generative-ai/" rel="noopener noreferrer"&gt;generative AI&lt;/a&gt; and &lt;a href="https://blog.apify.com/what-is-machine-learning-doing-for-us/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt;. So in this tutorial, I'm going to use &lt;strong&gt;Website Content Crawler&lt;/strong&gt;, which was designed specifically for this purpose. This guide will demonstrate why WCC is useful for &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/" rel="noopener noreferrer"&gt;collecting data for LLMs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Website Content Crawler&lt;/strong&gt; is what Apify calls an &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Actor (a serverless cloud program)&lt;/a&gt;. Actors can perform anything from a simple action, such as filling out a web form or sending an email, to complex operations, such as crawling an entire website and removing duplicates from a large dataset.&lt;/p&gt;

&lt;p&gt;Like all Apify Actors, you can run WCC via:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web UI&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apify API&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apify CLI&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
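&lt;p&gt;For the second option, here's a minimal sketch of what an API call could look like (the endpoint shape follows the Apify API v2 run-Actor endpoint; the token and input values are placeholders, so treat this as illustrative only):&lt;/p&gt;

```python
# Sketch: starting Website Content Crawler via the Apify REST API instead of
# the web UI. Token and input values below are placeholders.
import json
from urllib import request

def build_run_request(actor_id: str, token: str, run_input: dict) -> request.Request:
    """Build a POST request that would start an Actor run on the Apify platform."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs?token={token}"
    return request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_run_request(
    "apify~website-content-crawler",
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}]},
)
# request.urlopen(req) would actually start the run (needs a valid token).
```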

&lt;p&gt;If you're new to Apify, using the UI is the easiest way to test it out, so that's the method I'm going to use in this tutorial.&lt;/p&gt;

&lt;p&gt;To use this tool and follow along with me, go to &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;&lt;strong&gt;Website Content Crawler&lt;/strong&gt;&lt;/a&gt; in Apify Store and click the &lt;strong&gt;Try for free&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f4aci8vpla0rdo2lxvv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f4aci8vpla0rdo2lxvv.jpeg" alt="AI web scraping: Website Content Crawler" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll need an Apify account. If you don't have one, you'll be prompted to sign up when you click that button.&lt;/p&gt;

&lt;p&gt;Otherwise, you'll be taken straight to Apify Console (which is basically your dashboard), and you'll see the UI that I'm about to walk you through.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://apify.com/apify/website-content-crawler" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fapify.com%2Fog-image%2Factor%3FactorName%3DWebsite%2BContent%2BCrawler%26uniqueName%3Dapify%252Fwebsite-content-crawler%26categories%3DAI%252CDEVELOPER_TOOLS%26users%3D68K%26runs%3D17M%26pictureUrl%3Dhttps%253A%252F%252Fapify-image-uploads-prod.s3.amazonaws.com%252FaYG0l9s7dbB7j3gbS%252FPfToENkJZxahzPDu3-CleanShot_2023-03-28_at_10.40.20_2x.png%26authorName%3DApify%26userPictureUrl%3Dhttps%253A%252F%252Fimages.apifyusercontent.com%252FsnHfLu01TAx4ETYB8UGzHoI5EmQTbvrMB_Z2hKfwzzo%252Frs%253Afill%253A224%253A224%252Fcb%253A1%252FaHR0cHM6Ly9hcGlmeS1pbWFnZS11cGxvYWRzLXByb2QuczMudXMtZWFzdC0xLmFtYXpvbmF3cy5jb20vWnNjTXdGUjVIN2VDdFd0eWgtcHJvZmlsZS1PbjhsSE5lbkJvLWFwaWZ5LXN5bWJvbC1jb2xvcnMtbWFyZ2luLnN2Zy5wbmc.webp" height="420" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer" class="c-link"&gt;
          Website Content Crawler · Apify
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting usin...
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fapify.com%2Ffavicon.ico" width="48" height="48"&gt;
        apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Start URLs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I'm going to use the default input and scrape the Apify docs using the following start URL: &lt;a href="https://docs.apify.com/academy/web-scraping-for-beginners" rel="noopener noreferrer"&gt;https://docs.apify.com/academy/web-scraping-for-beginners&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsyv5rcyak5ayvrcg9sj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsyv5rcyak5ayvrcg9sj.jpg" alt="AI web scraping: Start URLs" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;First step: Start URLs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this case, the crawler will only crawl the links beginning with &lt;code&gt;academy/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can add other URLs to the list, as well. These will be added to the crawler queue, and the Actor will process them one by one.&lt;/p&gt;

&lt;p&gt;You can use the &lt;strong&gt;Text file&lt;/strong&gt; option for batch processing if you have lots of URLs and want to crawl them all. Either upload a file with each URL on a separate line, or provide the URL of such a file.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;2. Crawler settings&lt;/strong&gt;
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Crawler type&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The default crawler type is &lt;strong&gt;Firefox&lt;/strong&gt;. This can load most pages and is usually better at avoiding &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;anti-bot blocking&lt;/a&gt;, but it's the slowest option. Apify set it as the default because it gets you the most consistent results. However, it requires more compute units, takes longer, and therefore &lt;a href="https://apify.com/pricing" rel="noopener noreferrer"&gt;costs&lt;/a&gt; more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd98o588w0805z4jx5f41.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd98o588w0805z4jx5f41.jpeg" alt="AI web scraping crawler types" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Choose crawler type&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you need a browser to render client-side JavaScript but want more speed, you can use the &lt;strong&gt;Chrome&lt;/strong&gt; browser instead. It's faster and requires less memory, but keep in mind that it's more easily detected by &lt;a href="https://blog.apify.com/bypass-antiscraping-protections/" rel="noopener noreferrer"&gt;anti-bot protections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Use the &lt;strong&gt;Raw HTTP client (Cheerio)&lt;/strong&gt; if you don't need client-side JavaScript rendering, as it will be about 20 times faster.&lt;/p&gt;

&lt;p&gt;If you're feeling adventurous, you could try the experimental &lt;strong&gt;JSDOM&lt;/strong&gt; option. It's much faster than browsers and provides some JS execution support. However, at this point, JSDOM's coverage of standard web APIs is still incomplete (see this &lt;a href="https://github.com/jsdom/jsdom/issues/1724" rel="noopener noreferrer"&gt;ancient issue&lt;/a&gt; tracking the still missing &lt;code&gt;fetch&lt;/code&gt; implementation). So I can't wholeheartedly recommend it.&lt;/p&gt;
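&lt;p&gt;To see why skipping the browser is so much faster, note that a raw HTTP client only parses the HTML string it downloads; no client-side code ever runs. A minimal Python analogue (static made-up HTML, no network):&lt;/p&gt;

```python
from html.parser import HTMLParser

# Invented sample page: the script tag would add text in a real browser,
# but a raw HTML parser never executes it.
HTML = (
    "<html><body><h1>Title</h1>"
    '<script>document.write("added by JS")</script>'
    "</body></html>"
)

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script contents."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.chunks.append(data)

p = TextExtractor()
p.feed(HTML)
print("".join(p.chunks))  # only "Title"; the JS-written text never appears
```

&lt;p&gt;This is the trade-off in miniature: no browser overhead, but also no dynamically injected content.&lt;/p&gt;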
&lt;h3&gt;&lt;strong&gt;Exclude URLs (globs)&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;By default, the crawler will visit all the web pages in the Start URLs field (plus all the linked pages - but only if their path prefixes match). However, there might be some you don't want to visit. If that's the case, you can use the &lt;strong&gt;exclude URLs (globs)&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffhsbdft8q0o7nw0ve8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffhsbdft8q0o7nw0ve8l.png" alt="AI web scraping Exclude URLs" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fill in webpages you don't want to visit by using the exclude URLs option&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can also check &lt;a href="https://blog.apify.com/glob-vs-regex/" rel="noopener noreferrer"&gt;if the glob matches what you want&lt;/a&gt; with the &lt;strong&gt;Test Glob&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98j1xf9ji416vn3c6s5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98j1xf9ji416vn3c6s5n.png" alt="AI web scraping Glob matches" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check whether glob matches your interest with the Test Glob tool&lt;/em&gt;&lt;/p&gt;
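&lt;p&gt;If you'd like to prototype exclusion patterns offline first, here's a rough sketch using Python's &lt;code&gt;fnmatch&lt;/code&gt;. The patterns and URLs are invented, and note that the platform's minimatch-style globs differ from &lt;code&gt;fnmatch&lt;/code&gt; in details such as &lt;code&gt;**&lt;/code&gt;, so treat this only as an approximation.&lt;/p&gt;

```python
from fnmatch import fnmatch

# Hypothetical exclusion patterns, similar in spirit to the UI's globs.
EXCLUDE_GLOBS = [
    "https://docs.apify.com/academy/*/old-*",
    "*://*/login*",
]

def is_excluded(url: str) -> bool:
    """Return True if the URL matches any exclusion glob."""
    return any(fnmatch(url, g) for g in EXCLUDE_GLOBS)

print(is_excluded("https://docs.apify.com/academy/course/old-lesson"))  # True
print(is_excluded("https://docs.apify.com/academy/course/lesson"))      # False
```

&lt;p&gt;The &lt;strong&gt;Test Glob&lt;/strong&gt; button in the UI does the authoritative check; a local sketch like this just helps you iterate on the patterns faster.&lt;/p&gt;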

&lt;h3&gt;&lt;strong&gt;Initial cookies&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Cookies are sometimes used to identify the user to the server it's trying to access. You can use the initial cookies option if you want to access content behind a login or authenticate your crawler with the website you're scraping. Here are a couple of examples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxakudz4u9u4rewtgehgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxakudz4u9u4rewtgehgq.png" alt="AI web scraping Cookies" width="776" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use the initial cookie option to scrape the content behind a log-in&lt;/em&gt;&lt;/p&gt;
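&lt;p&gt;The option takes a JSON array of cookie objects. The shape below is illustrative only: the cookie names and values are placeholders you would copy from your own logged-in browser session, and any fields beyond &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; should be checked against the Actor's input schema.&lt;/p&gt;

```python
import json

# Placeholder cookies - copy real name/value pairs from your browser's
# DevTools (Application > Cookies) for the site you're scraping.
initial_cookies = [
    {"name": "sessionid", "value": "<your-session-token>"},
    {"name": "csrftoken", "value": "<your-csrf-token>"},
]

# The UI field accepts this serialized as a JSON string.
payload = json.dumps(initial_cookies, indent=2)
print(payload)
```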
&lt;h2&gt;&lt;strong&gt;3. HTML processing&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;There are two steps to HTML processing: a) waiting for content to load and b) processing the HTML from the web page (data cleaning). Although the UI doesn't strictly follow this order, I've decided to break it up this way: &lt;strong&gt;3. HTML processing&lt;/strong&gt; and &lt;strong&gt;4. Data cleaning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg2oq4gujl1aq9zeqmjm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg2oq4gujl1aq9zeqmjm.jpeg" alt="AI webscraping HTML processing" width="800" height="745"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use the HTML processing tool&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Wait for dynamic content&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Some web pages use lazy loading, meaning that more content loads as you scroll down. In such cases, you can tell the crawler to wait for dynamic content to load. The crawler will wait for up to 10 seconds, as long as the web page keeps changing.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Maximum scroll height&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;The maximum scroll height is how far down the crawler scrolls before it starts processing the page. It exists to prevent infinite scrolling: imagine an online store that loads more and more products as you scroll.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Remove cookie warnings&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Once the content has loaded, the crawler can deal with cookie consent modals. With the remove cookie warnings option, which is enabled by default, it will click through and hide them.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Expand clickable elements&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;The expand clickable elements option lets you add CSS selectors of elements the crawler should click on. If you don't set this, the Actor won't crawl any links hidden in collapsed content, so use this option to scrape content from collapsed sections of a webpage.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;4. Data cleaning&lt;/strong&gt;&lt;/h2&gt;
&lt;h3&gt;&lt;strong&gt;Remove HTML elements&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;You can clean the data by removing HTML elements. These are the selectors of things you don't want to include in your results (banners, ads, menus, alerts, and so on). The default setting covers most things, but you can add more to the list if you need to. This way, you'll have only the content you need to feed your language model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclm8cvy4tmyj67yg7jc1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclm8cvy4tmyj67yg7jc1.jpg" alt="AI web scraping: Remove HTML elements" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Remove banners, ads, menus, and other HTML elements&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;HTML transformer&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;With this option, the Actor tries to remove even more elements, but it may strip useful parts of the content you want to extract. If you discover that this is the case after running the Actor, set it to &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxz05jcpcx81o5xggt84.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxz05jcpcx81o5xggt84.jpg" alt="AI web scraping: HTML transformer options" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pay attention to HTML transformer options&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Remove duplicate text lines&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;You can remove duplicate text lines if the crawler keeps seeing the same line again and again. Enable this if you keep seeing parts of footers or menus in your output but don't want to hunt for the correct CSS selectors. The Actor strips the repeated content after 4 or 5 occurrences. This prevents saving the same information repeatedly and so keeps the data clean.&lt;/p&gt;
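&lt;p&gt;The idea behind this de-duplication can be sketched as follows. This is a toy version, not the Actor's code: the pages are invented, and the threshold of 4 occurrences is taken from the description above.&lt;/p&gt;

```python
from collections import Counter

# Invented pages, each ending with the same footer line.
pages = [
    "Unique content A\nFooter: © Example Corp",
    "Unique content B\nFooter: © Example Corp",
    "Unique content C\nFooter: © Example Corp",
    "Unique content D\nFooter: © Example Corp",
]

# Count every text line across all crawled pages.
counts = Counter(line for p in pages for line in p.splitlines())

# Drop any line seen at least THRESHOLD times (a footer, menu item, etc.).
THRESHOLD = 4
cleaned = [
    "\n".join(l for l in p.splitlines() if counts[l] < THRESHOLD)
    for p in pages
]
print(cleaned[0])  # the footer line is gone
```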

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8kxjwe1p4rz5q57d63p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8kxjwe1p4rz5q57d63p.jpg" alt="AI web scraping: Remove duplicate text lines" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use HTML transformer to remove duplicate text lines&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;5. Output settings&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;You can save the data as HTML or Markdown or save screenshots if you're using a headless browser. The &lt;strong&gt;Save files&lt;/strong&gt; option deserves some special attention, though.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f2kkrsrlqd0o324h8mf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5f2kkrsrlqd0o324h8mf.jpeg" alt="AI web scraping: output settings" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Choose the output setting that fits your needs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you choose &lt;strong&gt;Save files&lt;/strong&gt;, the crawler inspects the web page, and whenever it sees a link that goes to, say, a PDF, Word doc, or Excel sheet, it will download it to the Apify key-value store.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;6. Running the Actor&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;With the UI, you can execute code with the click of a button (the Start button at the bottom of the screen).&lt;/p&gt;

&lt;p&gt;While running, you'll see what the crawler is up to in the log and can check if it's experiencing any issues. You can abort the run at any point.&lt;/p&gt;

&lt;p&gt;When the crawler has completed a successful run, you can retrieve the data from the output tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflfewzl8hgex2wavsulx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflfewzl8hgex2wavsulx.png" alt="AI web scraping: Website Content Crawler run" width="800" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run the Website Content Crawler&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;7. Storing the data&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;The results of the Actor are stored in the default &lt;a href="https://docs.apify.com/platform/storage/dataset" rel="noopener noreferrer"&gt;Dataset&lt;/a&gt; associated with the Actor run, from where you can access them via API and export them to formats like JSON, XML, or CSV.&lt;/p&gt;

&lt;p&gt;With the UI, you need only click the &lt;strong&gt;Export results&lt;/strong&gt; button to view or download the data in your preferred format.&lt;/p&gt;
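&lt;p&gt;For programmatic access, dataset items are served from the Apify API's &lt;code&gt;/v2/datasets/{datasetId}/items&lt;/code&gt; endpoint. The sketch below only builds that URL (the dataset ID is a placeholder); add your API token and an HTTP client to actually download the data.&lt;/p&gt;

```python
# Sketch: constructing the dataset items export URL.
API_BASE = "https://api.apify.com/v2"
DATASET_ID = "<your-dataset-id>"  # placeholder - taken from your Actor run

def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the API URL for exporting a run's dataset items."""
    return f"{API_BASE}/datasets/{dataset_id}/items?format={fmt}"

# The same endpoint serves other formats, e.g. CSV.
print(dataset_items_url(DATASET_ID, "csv"))
```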

&lt;p&gt;By way of example, here's the data in JSON from the first of the 26 results I got from this demo run using the UI's default settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "crawl": {
    "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "loadedTime": "2023-08-01T09:48:51.180Z",
    "referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "depth": 0,
    "httpStatusCode": 200
  },
  "metadata": {
    "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners | Academy | Apify Documentation",
    "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
    "author": null,
    "keywords": null,
    "languageCode": "en"
  },
  "screenshotUrl": null,
  "text": "Web scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying this tutorial instead.\nThis course is made by Apify, the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the Apify platform course, where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\nWhy learn scraper development?\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. 
You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\nCourse Summary\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\nThis is what you'll learn in the Web scraping for beginners course:\nWeb scraping for beginners\nBasics of data extraction\nBasics of crawling\nBest practices\nRequirements\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. A seemingly insignificant thing like using [] instead of () can make a lot of difference.\nIf you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a JavaScript course and learning about CSS Selectors.\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\nIdeally, you should have at least a moderate understanding of the following concepts:\nJavaScript + Node.js\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. 
If you are not yet comfortable with asynchronous programming (with promises and async...await), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\nasync...await (YouTube)\nJavaScript loops (MDN)\nModularity in Node.js\nGeneral web development\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be assumed (unless we're showing something out of the ordinary).\nHTML\nHTTP protocol\nDevTools\njQuery or Cheerio\nWe'll be using the Cheerio package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\nNext up\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. So let's get to it!\nIf you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the Basics of crawling section.",
  "markdown": "## Web scraping for beginners\n\n **Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.** \n\n* * *\n\nWelcome to **Web scraping for beginners** , a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying [this tutorial](https://blog.apify.com/web-scraping-javascript-nodejs/) instead.\n\nThis course is made by [Apify](https://apify.com/), the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\n\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the [Apify platform course](https://docs.apify.com/academy/apify-platform), where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\n\n## Why learn scraper development?[](#why-learn \"Direct link to Why learn scraper development?\")\n\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\n\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. 
The possibilities are endless once you know how scraping really works.\n\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\n\n## Course Summary[](#summary \"Direct link to Course Summary\")\n\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\n\nThis is what you'll learn in the **Web scraping for beginners** course:\n\n* [Web scraping for beginners](https://docs.apify.com/academy/web-scraping-for-beginners)\n * [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction)\n * [Basics of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling)\n * [Best practices](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices)\n\n## Requirements[](#requirements \"Direct link to Requirements\")\n\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. 
A seemingly insignificant thing like using `[]` instead of `()` can make a lot of difference.\n\n&amp;gt; If you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a [JavaScript course](https://www.codecademy.com/learn/introduction-to-javascript) and learning about [CSS Selectors](https://www.w3schools.com/css/css_selectors.asp).\n\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\n\nIdeally, you should have at least a moderate understanding of the following concepts:\n\n### JavaScript + Node.js[](#javascript-and-node \"Direct link to JavaScript + Node.js\")\n\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\n\n* [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&amp;amp;ab_channel=Fireship)\n* [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)\n* [Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)\n\n### General web development[](#general-web-development \"Direct link to General web development\")\n\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. 
This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).\n\n* [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)\n* [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)\n* [DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/browser-devtools)\n\n### jQuery or Cheerio[](#jquery-or-cheerio \"Direct link to jQuery or Cheerio\")\n\nWe'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\n\n## Next up[](#next \"Direct link to Next up\")\n\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. So [let's get to it!](https://docs.apify.com/academy/web-scraping-for-beginners/introduction)\n\n&amp;gt; If you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the [Basics of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling) section."
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;&lt;strong&gt;Integrating your data with LangChain, Pinecone, and other tools&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;You can now use the data you've collected to feed and fine-tune LLMs by &lt;a href="https://blog.apify.com/what-is-langchain/" rel="noopener noreferrer"&gt;integrating your data with LangChain&lt;/a&gt; or with a &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector database&lt;/a&gt; such as &lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt; or any &lt;a href="https://blog.apify.com/pinecone-alternatives/" rel="noopener noreferrer"&gt;Pinecone alternatives&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a detailed example, check out &lt;a href="https://blog.apify.com/how-to-use-langchain/" rel="noopener noreferrer"&gt;this tutorial on how to use LangChain and Pinecone with Apify&lt;/a&gt;.&lt;/p&gt;
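&lt;p&gt;Whichever integration you pick, the core step is mapping each crawler result to a piece of text plus metadata for your vector store. A minimal, hypothetical mapping over the JSON fields shown above (&lt;code&gt;url&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;markdown&lt;/code&gt;, &lt;code&gt;metadata.title&lt;/code&gt;); the helper name is invented for illustration:&lt;/p&gt;

```python
# Hypothetical helper: turn one Website Content Crawler item into a
# (text, metadata) pair that a vector database loader can ingest.
def to_document(item: dict) -> tuple[str, dict]:
    # Prefer the Markdown rendition; fall back to plain text.
    text = item.get("markdown") or item.get("text", "")
    metadata = {
        "source": item["url"],
        "title": item.get("metadata", {}).get("title"),
    }
    return text, metadata

# Truncated sample mirroring the dataset item shown earlier.
sample = {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "text": "Web scraping for beginners...",
    "markdown": "## Web scraping for beginners...",
    "metadata": {"title": "Web scraping for beginners | Academy | Apify Documentation"},
}

text, meta = to_document(sample)
print(meta["source"])
```

&lt;p&gt;Keeping the source URL and title in the metadata makes it easy to cite where an LLM's answer came from later on.&lt;/p&gt;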

</description>
      <category>ai</category>
      <category>llms</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How web scraping and AI are helping to find missing children</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 06 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/how-web-scraping-and-ai-are-helping-to-find-missing-children-4j0p</link>
      <guid>https://dev.to/apify/how-web-scraping-and-ai-are-helping-to-find-missing-children-4j0p</guid>
      <description>&lt;p&gt;The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;The Missing Children initiative&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Ever since the &lt;a href="https://en.wikipedia.org/wiki/2011_Egyptian_revolution" rel="noopener noreferrer"&gt;2011 Egyptian revolution&lt;/a&gt;, child abduction has been on the rise in Egypt. During the revolution alone, &lt;a href="https://www.icmp.int/the-missing/where-are-the-missing/egypt/" rel="noopener noreferrer"&gt;1,200 persons were reported missing&lt;/a&gt;, and the problem of lost persons is still not going away. Children make up the largest proportion of victims. Thousands of children go missing in Egypt every year. There are five main reasons: adoption, begging, sex trade, organ trade, and ransom.&lt;/p&gt;

&lt;p&gt;This tragic situation was the motivation behind the &lt;a href="https://atfalmafkoda.com/en/home" rel="noopener noreferrer"&gt;Missing Children (Atfal Mafkoda) initiative&lt;/a&gt;. Atfal Mafkoda is a community effort and &lt;a href="https://www.facebook.com/atfalmafkoda/" rel="noopener noreferrer"&gt;Facebook page&lt;/a&gt; with over 2 million followers that works on identifying missing and trafficked children and those mistreated in orphanages. The initiative has so far tracked down over 3,000 people to reunite them with their families.&lt;/p&gt;

&lt;p&gt;The Missing Children initiative was launched in 2016 by engineer Rami el-Gebali, beginning with a "No to using children as beggars" campaign. Mr. Gebali asked people to take pictures of children begging on the streets and send them to the page. As a result, he collected tens of thousands of such photos, and a match-making process between begging children and misplaced children began.&lt;/p&gt;

&lt;p&gt;While that campaign led to only three children being found, the vast number of images led to the page having the largest database of missing children with pictures in Egypt.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Our motto is that no family should suffer the pain of missing a living loved one. We want to spread our model across the world. We proved the concept, and we know it works."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;- Rami el-Gebali, founder of Atfal Mafkoda&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fkif6vop1pilhyfkt7o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fkif6vop1pilhyfkt7o.jpeg" alt="Founder of the Missing Children initiative which uses AI and webscraping to track abducted children" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rami el-Gebali, founder of the Missing Children initiative&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;AI face recognition to the rescue&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;In 2023, Rami el-Gebali realized that there had to be a way to speed up the complicated process of finding lost children. A community member pointed out a photo that looked just like a person who had been missing for ten years. That person has now finally been reunited with his family, but Atfal Mafkoda had received the photo just two months after he went missing. AI could have solved that case in a minute. Relying on human beings to do in ten years what AI could do in the blink of an eye didn't make sense.&lt;/p&gt;

&lt;p&gt;The problem was that the process of finding lost children is not effective when the photos are too old, the picture quality is low, or the children have been missing for so long that their photos no longer match their current facial features. Atfal Mafkoda needed AI face recognition technology: face detection, face enhancement, face comparison, and face aging.&lt;/p&gt;

&lt;p&gt;Enter Rami el-Gebali's neighbor, Youssef A. Abukwaik - a highly experienced software engineer and author of the &lt;a href="https://www.amazon.com/SuSE-Linux-Installation-Configuration-Handbook/dp/0789723557" rel="noopener noreferrer"&gt;SuSE Linux Installation &amp;amp; Configuration Handbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Rami asked Youssef for his help. So Mr. Abukwaik consulted a former manager who teaches at Boston University and inquired about the possibility of starting a graduation project for the semester that would benefit both the students and the Missing Children initiative.&lt;/p&gt;

&lt;p&gt;What followed was the &lt;a href="https://www.bu.edu/spark/2023/05/10/demo-day-2023/" rel="noopener noreferrer"&gt;Spark project&lt;/a&gt;, for which the students won an Audience Choice Award. The students applied generative deep learning methods to improve forensic face aging to provide higher-quality aged face photos. Thanks to those brilliant students, Mr. Abukwaik was able to implement what the students did (&lt;a href="https://github.com/BU-Spark/ml-atfal-mafkoda-missing-children#readme" rel="noopener noreferrer"&gt;see the GitHub repo here&lt;/a&gt;) and apply it to the Missing Children initiative.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7tc1gmuxk9oxfj42lk5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7tc1gmuxk9oxfj42lk5.jpeg" alt="The Missing Children Facebook page: Web scraping Facebook for AI and data labeling" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Atfal Mafkoda (Missing Children) Facebook page&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Web scraping Facebook for data labeling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The next step was &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;. Youssef needed to scrape the Atfal Mafkoda Facebook page in order to turn it into a dataset for &lt;a href="https://blog.apify.com/what-is-data-labeling-in-ai/" rel="noopener noreferrer"&gt;data labeling&lt;/a&gt;. And this is where Youssef ran into problems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Apify allowed me to fully scrape our own Facebook page without the limitations I had experienced with alternative solutions. I was able to drill down up to 5,000 posts without blocks. No other open-source solution or alternative I tried compared to that."&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Youssef A. Abukwaik, Software Engineer&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Youssef first tried downloading the Facebook page, but it wasn't easily parsable. So he turned to an open-source Facebook scraper, and Facebook blocked him after just 30 requests. He then tried a proxy application, but the configuration was difficult, and after scraping 500 posts, he hit problems and had to start all over again.&lt;/p&gt;

&lt;p&gt;So, Youssef did what any of us would do: he turned to search engines and googled "&lt;a href="https://blog.apify.com/how-to-scrape-facebook-pages-posts-comments-photos-and-more/" rel="noopener noreferrer"&gt;How to scrape Facebook&lt;/a&gt;." Google autocompleted his search with 'How to scrape Facebook in 2023', and the first result was Apify.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Apify has incredible potential for AI and machine learning. It was a turnkey solution that let me harvest the data in our Facebook community for data labeling and put it to use without any extra work."&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Youssef A. Abukwaik, Software Engineer&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Apify is helping the Missing Children initiative&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apify allowed Youssef to fully scrape the Atfal Mafkoda Facebook page without the limitations he had experienced with the alternative solutions he came across. He was able to extract up to 5,000 posts without getting blocked.&lt;/p&gt;
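A run like this can be sketched with Apify's Python client. Everything here is an assumption for illustration, not a detail from the story: the apify-client package (pip install apify-client), the Actor ID "apify/facebook-posts-scraper" from the Apify Store, and the startUrls/resultsLimit input fields.

```python
import os

def build_input(page_url: str, max_posts: int) -> dict:
    """Construct the run input for a Facebook posts scraper Actor
    (field names assumed from Apify's public Store listing)."""
    return {"startUrls": [{"url": page_url}], "resultsLimit": max_posts}

def scrape_page(page_url: str, max_posts: int, token: str) -> list:
    """Run the Actor on the Apify platform and return the scraped items."""
    from apify_client import ApifyClient  # assumed dependency
    client = ApifyClient(token)
    run = client.actor("apify/facebook-posts-scraper").call(
        run_input=build_input(page_url, max_posts)
    )
    # Each finished run stores its results in a dataset on the platform.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

if __name__ == "__main__" and "APIFY_TOKEN" in os.environ:
    items = scrape_page("https://www.facebook.com/atfalmafkoda",
                        5000, os.environ["APIFY_TOKEN"])
    print(f"scraped {len(items)} posts")
```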

&lt;p&gt;The ability to keep the results in a dataset is also profoundly helpful. The scraped data doesn't disappear, so he doesn't have to use it immediately. He can run queries against the dataset, which persists long enough to apply business logic without having to re-scrape. From a single run, Youssef can re-download and work with the output as many times as he needs.&lt;/p&gt;
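That persistence might look something like the sketch below. The dataset ID is a placeholder, apify-client is an assumed dependency, and the "text" field name is an assumption about the scraper's output; iterate_items() reads items already stored on the platform without re-running the scraper.

```python
import os

def fetch_items(dataset_id: str, token: str) -> list:
    """Re-download the items a past run saved to an Apify dataset."""
    from apify_client import ApifyClient  # assumed dependency
    client = ApifyClient(token)
    return list(client.dataset(dataset_id).iterate_items())

def post_texts(items: list) -> list:
    """One example of a query over the stored items: pull out the post
    text ('text' is an assumed field name)."""
    return [item.get("text", "") for item in items]

if __name__ == "__main__" and "APIFY_TOKEN" in os.environ:
    items = fetch_items("YOUR_DATASET_ID", os.environ["APIFY_TOKEN"])
    print(post_texts(items)[:5])
```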

&lt;p&gt;Mr. Abukwaik now has an administration site where he can pull in the Facebook data. It arrives in a structured form, and he can see all the posts, so he knows when he needs to do face matching with the John/Jane Does and can label each one according to whether the person is missing or has been reunited with their family.&lt;/p&gt;
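In spirit, that labeling step could be sketched like this. The keyword list, field names, and label values are pure assumptions for illustration; the real admin site's logic (which also involves face matching) is not public.

```python
# Hypothetical keywords that suggest a post announces a reunion.
REUNITED_KEYWORDS = ("reunited", "found", "back home")

def label_post(post_text: str) -> str:
    """Label a post 'reunited' if it mentions a reunion keyword,
    otherwise 'missing'."""
    text = post_text.lower()
    if any(kw in text for kw in REUNITED_KEYWORDS):
        return "reunited"
    return "missing"

def label_dataset(posts: list) -> list:
    """Attach a label to each scraped post dict (assumed 'text' field)."""
    return [dict(post, label=label_post(post.get("text", "")))
            for post in posts]

posts = [
    {"text": "Ahmed was reunited with his family today"},
    {"text": "Please share: this boy has been missing since 2019"},
]
print([p["label"] for p in label_dataset(posts)])  # → ['reunited', 'missing']
```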

&lt;p&gt;💡&lt;a href="https://apify.com/data-for-generative-ai?ref=top_nav" rel="noopener noreferrer"&gt;Get fast, reliable data for your AI and machine learning at Apify!&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A global database of missing people&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Missing Children initiative is focused on Egypt for now, but as &lt;a href="https://www.bbc.co.uk/news/world-middle-east-53564935.amp?fbclid=IwAR09W8qn86399bUPBHAW-qUCFAXASQLB24enHWJS3Hc7N_fdgHtj0qsEYnk" rel="noopener noreferrer"&gt;its founder has said&lt;/a&gt;, "Our dream is to have one global database of missing people around the world."&lt;/p&gt;

&lt;p&gt;Apify shares that dream, and with this and the &lt;a href="https://apify.com/success-stories/fighting-child-traffickers-with-technology" rel="noopener noreferrer"&gt;Thorn project&lt;/a&gt;, which uses Apify to help find trafficked children in the US, let's hope that the dream becomes a reality very soon!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to read more of Apify's web scraping &amp;amp; AI-related success stories? Then check out the content below.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-web-scraping-is-bridging-language-barriers-for-migrants-and-refugees-in-the-eu/" rel="noopener noreferrer"&gt;&lt;strong&gt;How web scraping is bridging language barriers&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-web-scraping-ai-and-the-eu-have-come-together-to-sweep-away-fake-discounts-in-europe/" rel="noopener noreferrer"&gt;&lt;strong&gt;How web scraping, AI, and the EU have come together to sweep away fake discounts in Europe&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/success-stories/helping-the-planet-with-data" rel="noopener noreferrer"&gt;&lt;strong&gt;Helping the planet with data&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/success-stories/content-data-extraction-content-agency" rel="noopener noreferrer"&gt;&lt;strong&gt;Scaling up an AI-powered content agency with data extraction&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Read even&lt;/em&gt; &lt;a href="https://apify.com/success-stories" rel="noopener noreferrer"&gt;&lt;em&gt;more success stories here&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>usecases</category>
      <category>webscraping</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
