Retiago Drago
Leveraging GraphQL API Over Web Scraping: A Backend Approach

Introduction 🌟

In this post, we will analyze two different approaches to extracting information from a website. The first is web scraping from the frontend, which didn't work well for our specific use case. The second is using a GraphQL API to fetch data directly from the backend. We will dive into why the API-based approach proved more advantageous than web scraping in our case.

Our task was to retrieve Ethereum addresses from a particular website, filtered to the Blockchain category with a Lock Creation Date of the previous day.

Where I Went Wrong 🤦‍♂️

Initially, we attempted to scrape data from the website using frontend scripting, but it was neither efficient nor accurate. Here's the baseline when a user applies the filter manually:

Let's focus on the lower and upper attributes of the request. The image below shows a manual request:

[Image: manual request time range]

The automated request was produced by the following script:

from datetime import date, timedelta

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

URL = "https://www.team.finance/view-all-coins"

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get(URL)
print("Chrome Browser Invoked")

# open the filter dialog
filter_button = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/div[2]/div[1]/div/div/button[3]")
ActionChains(driver).click(filter_button).perform()
print("Filter button clicked")

# tick the Ethereum checkbox
eth = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[2]/div[2]/div/div[1]/input")
ActionChains(driver).click(eth).perform()
print("Ethereum checked")

# type the end date (today) into the datepicker
end_date = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[2]/div[3]/div/div[2]/div/div/input")
today = date.today()
end_date.send_keys(Keys.CONTROL + 'a' + Keys.BACKSPACE)
end_date.send_keys(f"{today}" + Keys.ENTER)
print(f"End Date entered: {today}")

# type the start date (yesterday) into the datepicker
start_date = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[2]/div[3]/div/div[1]/div/div/input")
yesterday = today - timedelta(days=1)
start_date.send_keys(Keys.CONTROL + 'a' + Keys.BACKSPACE)
start_date.send_keys(f"{yesterday}" + Keys.ENTER)
print(f"Start Date entered: {yesterday}")

# apply the filter
apply_filter = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[3]/div/button[1]")
ActionChains(driver).click(apply_filter).perform()
print("Apply filter")
# driver.quit()

My script typed the dates into the datepicker instead of clicking through it, because simulating keystrokes was easier to code than simulating clicks. However, the two input methods produced different values than I expected. Now, take a look at the lower and upper attributes of the automated request.

[Image: automated request time range]

The boundaries above are incorrect because they do not always land exactly at midnight (00:00) at the start of the previous day. After discussing it with a friend, we decided to try a backend approach using the GraphQL API.
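For reference, here is a minimal sketch of the boundaries we actually wanted, anchored at exactly midnight (UTC is my assumption here):

from datetime import date, datetime, time, timedelta, timezone

today = date.today()
yesterday = today - timedelta(days=1)

# boundaries anchored at 00:00, which the typed-in dates did not
# reliably produce; UTC is assumed
lower = datetime.combine(yesterday, time.min, tzinfo=timezone.utc)
upper = datetime.combine(today, time.min, tzinfo=timezone.utc)

print(lower.isoformat())  # e.g. 2023-05-07T00:00:00+00:00
print(upper.isoformat())  # e.g. 2023-05-08T00:00:00+00:00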

Why the API (Backend) Is Better than Web Scraping (Frontend) 🆚💻

By inspecting the website using the Developer Tools, we discovered that the client sends GraphQL requests to the server. GraphQL is a query language and runtime for APIs, enabling clients to request exactly the data they need. This allows for more efficient and flexible data retrieval than traditional REST APIs.

In our case, the GraphQL API offers several advantages over web scraping:

  1. Accuracy: The data we need is fetched directly from the backend, ensuring accuracy and consistency with the website's actual data.
  2. Efficiency: Using the API reduces the need to parse and extract information from the HTML code, making the process more efficient.
  3. Reliability: APIs are designed to be consumed programmatically, making them more reliable than scraping the ever-changing structure of a web page.
  4. Flexibility: GraphQL allows us to request only the data we need, reducing the amount of data transfer and processing required.
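Under the hood, a GraphQL call is just an HTTP POST whose JSON body carries the query string. Here's a minimal sketch of such a raw request using the requests library; the endpoint and field names are taken from the full query shown later, and whether the server accepts this stripped-down, unfiltered query is an assumption on my part:

import requests

URL = 'https://team-finance-backend-origdfl2wq-uc.a.run.app/graphql'

# ask for exactly the fields we care about -- nothing more comes back
query = """
query {
  lockEvents(paging: {limit: 3, offset: 0}) {
    nodes {
      tokenAddress
      createdAt
    }
  }
}
"""

response = requests.post(URL, json={"query": query})
response.raise_for_status()
print(response.json())

The gql library used in the walkthrough below wraps this same request/response exchange.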

Code Walkthrough 💻🚶‍♂️

In the Jupyter Notebook approach that succeeded, we used the gql library to interact with the GraphQL API. Here's a high-level overview of the process:

  1. Install and import the required libraries.
    try:
        import gql
    except ImportError as e:
        print(e)
        %pip install gql[all]
    from json import dump
    from gql import gql, Client
    from gql.transport.aiohttp import AIOHTTPTransport
    from datetime import datetime, timezone, timedelta
  2. Define the GraphQL endpoint URL and other constants, including the 24-hour window size.
    # constants
    URL = 'https://team-finance-backend-origdfl2wq-uc.a.run.app/graphql'
    HOURS = 24
    UTC_FORMAT = '%Y-%m-%dT%H:%M:%S.%f%z'
    LOG_FILE = 'result.json'
  3. Create a GraphQL client using the AIOHTTPTransport and compute the time range for the previous 24 hours.
    # Select your transport with a defined url endpoint
    transport = AIOHTTPTransport(url=URL)
    # Create a GraphQL client using the defined transport
    client = Client(transport=transport)
    # Get the current datetime and the previous 24 hours from now in UTC
    now_utc = datetime.now(timezone.utc)
    previous_1d_utc = now_utc - timedelta(hours=HOURS)
    # format the datetimes in ISO 8601 format with UTC offset
    upper = now_utc.strftime(UTC_FORMAT)
    lower = previous_1d_utc.strftime(UTC_FORMAT)
  4. Define the GraphQL query with placeholders for the time range and pagination limit.
    query_str = """
    query Previous1DEth($lower: DateTime!, $upper: DateTime!, $limit: Int!) {
      lockEvents(
        sorting: {field: unlockTime, direction: ASC}
        filter: {unlockTime: {gt: 0}
          and: [
            {chainId: {in: ["0x1", ]}},
            {createdAt: {between: {lower: $lower, upper: $upper}}},
          ]
        }
        paging: {limit: $limit, offset: 0}
      ) {
        pageInfo {
          hasNextPage
          hasPreviousPage
        }
        nodes {
          unlockTime
          tokenAddress
          network
          tokenTotalSupply
          chainId
          createdAt
          tokenId
          isWithdrawn
        }
      }
    }
    """
    # create the GraphQL request object
    query = gql(query_str)
  5. Execute the query asynchronously, increasing the pagination limit until all data is retrieved (an offset-based alternative is sketched after this walkthrough).
    limit = 10
    # keep increasing the limit and re-querying until pagination is exhausted
    hasNextPage = True
    while hasNextPage:
        variables = {
            "lower": lower,
            "upper": upper,
            "limit": limit
        }
        result = await client.execute_async(query, variable_values=variables)
        hasNextPage = result['lockEvents']['pageInfo']['hasNextPage']
        limit += 10
  6. Extract the token addresses from the query result and store the complete response in a JSON file.
    # store the complete response as a log
    with open(LOG_FILE, 'w') as f:
        dump(result, f)
    nodes = result['lockEvents']['nodes']
    addresses = [node['tokenAddress'] for node in nodes]
    print(addresses)
    print(f"{len(addresses)} addresses")

This approach allowed us to fetch the Ethereum addresses directly from the backend, providing a more efficient and reliable solution compared to web scraping. The complete code is available here for exploration.
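As a side note, the loop in step 5 re-executes the whole query with a larger limit on every pass, so earlier pages are downloaded repeatedly. A lighter variant would keep the page size fixed and advance the offset instead. This is only a sketch, since I haven't verified that the backend accepts a parameterized offset:

# hypothetical variant: fixed page size, advancing offset
# (assumes the backend's paging argument supports $offset)
paged_query = gql("""
query PagedEth($lower: DateTime!, $upper: DateTime!, $limit: Int!, $offset: Int!) {
  lockEvents(
    sorting: {field: unlockTime, direction: ASC}
    filter: {unlockTime: {gt: 0}
      and: [
        {chainId: {in: ["0x1"]}},
        {createdAt: {between: {lower: $lower, upper: $upper}}},
      ]
    }
    paging: {limit: $limit, offset: $offset}
  ) {
    pageInfo { hasNextPage }
    nodes { tokenAddress }
  }
}
""")

PAGE_SIZE = 50
offset = 0
addresses = []
while True:
    page = await client.execute_async(paged_query, variable_values={
        "lower": lower, "upper": upper, "limit": PAGE_SIZE, "offset": offset,
    })
    events = page['lockEvents']
    addresses += [node['tokenAddress'] for node in events['nodes']]
    if not events['pageInfo']['hasNextPage']:
        break
    offset += PAGE_SIZE

With this shape, each lock event is fetched exactly once.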

Conclusion 🤝✅

In this post, we compared two approaches to extracting information from a website: web scraping from the frontend and using the GraphQL API from the backend. For our specific use case, leveraging the GraphQL API proved to be a more advantageous solution due to its accuracy, efficiency, reliability, and flexibility.

It's important to note that the best approach may vary depending on the specific website and data requirements. In some cases, web scraping might be the only option if no API is available. However, when possible, using an API is generally a more efficient and reliable way to fetch data programmatically.

We hope this post provided valuable insights into the benefits of using APIs, particularly GraphQL, over web scraping when extracting data from websites. I will be busy building out my current project in this repo, but there is more to come! Check out my portfolio repository here:

GitHub: ranggakd / DAIly

A bunch of Data Analysis and Artificial Intelligence notebooks 🤖 I'd worked on almost a daiLY basis 👨‍💻

Follow me anywhere

ranggakd - Link in Bio & Creator Tools | Beacons
