Introduction
In this post, we will analyze two different approaches to extracting information from a website. The first approach is web scraping from the frontend, which didn't work effectively for our specific use case. The second approach is using a GraphQL API to fetch data directly from the backend. We will dive into why the API-based approach is more advantageous than web scraping in our case.
Our task was to retrieve Ethereum addresses from a particular website by setting its Blockchain category filter to Ethereum and its Lock Creation Date filter to the previous day.
Where I Went Wrong
Initially, we attempted to scrape data from the website with a frontend script, but it was neither efficient nor accurate. Here's the baseline when a user applies the filters manually:
Let's focus on the `lower` and `upper` attributes of the request. The image below shows a manual request:
This video shows an automated request using a script:
```python
from datetime import date, timedelta
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

URL = "https://www.team.finance/view-all-coins"

# start a Chrome session managed by webdriver-manager
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get(URL)
print("Chrome Browser Invoked")

# the absolute XPaths below were copied from DevTools and break easily
# whenever the page layout changes
filter_button = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/div[2]/div[1]/div/div/button[3]")
ActionChains(driver).click(filter_button).perform()
print("Filter button clicked")

eth = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[2]/div[2]/div/div[1]/input")
ActionChains(driver).click(eth).perform()
print("Ethereum checked")

# type the dates into the datepicker inputs instead of clicking the calendar
end_date = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[2]/div[3]/div/div[2]/div/div/input")
today = date.today()
end_date.send_keys(Keys.CONTROL + 'a' + Keys.BACKSPACE)
end_date.send_keys(f"{today}" + Keys.ENTER)
print(f"End Date entered: {today}")

start_date = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[2]/div[3]/div/div[1]/div/div/input")
yesterday = today - timedelta(days=1)
start_date.send_keys(Keys.CONTROL + 'a' + Keys.BACKSPACE)
start_date.send_keys(f"{yesterday}" + Keys.ENTER)
print(f"Start Date entered: {yesterday}")

apply_filter = driver.find_element(By.XPATH, "/html/body/div[3]/div/div/div/div/div[2]/div[3]/div/button[1]")
ActionChains(driver).click(apply_filter).perform()
print("Apply filter")

# driver.quit()
```
My script simulated typing into the datepicker instead of clicking through it. I chose this approach because simulating keystrokes was easier to code than clicking the calendar widget. However, the two methods produced different values than I expected. Now take a look at the `lower` and `upper` attributes of the automated request:
The boundaries above are incorrect because they do not always fall at midnight (00:00) of the previous day. After discussing the problem with a friend, we decided to try a backend approach using the GraphQL API.
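For reference, here is a minimal sketch (my own, not part of the original script) of how boundaries that always point at midnight UTC of the previous day could be computed with the standard library; the variable names and example dates are only illustrative:

```python
from datetime import datetime, time, timedelta, timezone

# midnight (00:00 UTC) at the start of today is the upper boundary
today = datetime.now(timezone.utc).date()
upper = datetime.combine(today, time.min, tzinfo=timezone.utc)

# the previous day starts exactly 24 hours earlier
lower = upper - timedelta(days=1)

print(lower.isoformat())  # e.g. 2023-05-14T00:00:00+00:00
print(upper.isoformat())  # e.g. 2023-05-15T00:00:00+00:00
```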
Why API (Backend) is Better than Web Scraping (Frontend)
By inspecting the website using the Developer Tools, we discovered that the client sends GraphQL requests to the server. GraphQL is a query language and runtime for APIs, enabling clients to request exactly the data they need. This allows for more efficient and flexible data retrieval than traditional REST APIs.
In our case, the GraphQL API offers several advantages over web scraping:
- Accuracy: The data we need is fetched directly from the backend, ensuring accuracy and consistency with the website's actual data.
- Efficiency: Using the API reduces the need to parse and extract information from the HTML code, making the process more efficient.
- Reliability: APIs are designed to be consumed programmatically, making them more reliable than scraping the ever-changing structure of a web page.
- Flexibility: GraphQL allows us to request only the data we need, reducing the amount of data transferred and processed (see the example below).
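As a quick illustration of that last point, a query names only the fields it actually needs. The snippet below is a hypothetical, trimmed-down version of the query used later in the walkthrough; the field names `tokenAddress` and `createdAt` come from that query, while the simplified filter shape is an assumption:

```python
# hypothetical minimal query: only the fields we care about, nothing else
minimal_query_str = """
query Previous1DEth($lower: DateTime!, $upper: DateTime!) {
  lockEvents(
    filter: {createdAt: {between: {lower: $lower, upper: $upper}}}
    paging: {limit: 10, offset: 0}
  ) {
    nodes {
      tokenAddress  # the only field we really need
      createdAt
    }
  }
}
"""
```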
Code Walkthrough
In the Jupyter Notebook that finally worked, we used the `gql` library to interact with the GraphQL API. Here's a high-level overview of the process:
- Install and import the required libraries.
```python
try:
    import gql
except ImportError as e:
    print(e)
    %pip install gql[all]

from json import dump
from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport
from datetime import datetime, timezone, timedelta
```

- Define the GraphQL endpoint URL and set the time range for the previous 24 hours.
```python
# constants
URL = 'https://team-finance-backend-origdfl2wq-uc.a.run.app/graphql'
HOURS = 24
UTC_FORMAT = '%Y-%m-%dT%H:%M:%S.%f%z'
LOG_FILE = 'result.json'
```

- Create a GraphQL client using the AIOHTTPTransport and the endpoint URL.
```python
# Select your transport with a defined url endpoint
transport = AIOHTTPTransport(url=URL)

# Create a GraphQL client using the defined transport
client = Client(transport=transport)

# Get the current datetime and the previous 24 hours from now in UTC
now_utc = datetime.now(timezone.utc)
previous_1d_utc = now_utc - timedelta(hours=HOURS)

# format the datetimes in ISO 8601 format with UTC offset
upper = now_utc.strftime(UTC_FORMAT)
lower = previous_1d_utc.strftime(UTC_FORMAT)
```

- Define the GraphQL query with placeholders for the time range and pagination limit.
```python
query_str = """
query Previous1DEth($lower: DateTime!, $upper: DateTime!, $limit: Int!) {
  lockEvents(
    sorting: {field: unlockTime, direction: ASC}
    filter: {
      unlockTime: {gt: 0}
      and: [
        {chainId: {in: ["0x1"]}},
        {createdAt: {between: {lower: $lower, upper: $upper}}},
      ]
    }
    paging: {limit: $limit, offset: 0}
  ) {
    pageInfo {
      hasNextPage
      hasPreviousPage
    }
    nodes {
      unlockTime
      tokenAddress
      network
      tokenTotalSupply
      chainId
      createdAt
      tokenId
      isWithdrawn
    }
  }
}
"""

# create GraphQL request
query = gql(query_str)
```

- Execute the query asynchronously, increasing the pagination limit until all data is retrieved.
```python
limit = 10  # increase the limit if there is pagination until exhausted
hasNextPage = True
while hasNextPage:
    variables = {
        "lower": lower,
        "upper": upper,
        "limit": limit
    }
    result = await client.execute_async(query, variable_values=variables)
    hasNextPage = result['lockEvents']['pageInfo']['hasNextPage']
    limit += 10
```

- Extract the token addresses from the query result and store the complete response in a JSON file.
```python
# store the complete response for log
with open(LOG_FILE, 'w') as f:
    dump(result, f)

nodes = result['lockEvents']['nodes']
addresses = [address['tokenAddress'] for address in nodes]
print(addresses)
print(f"{len(addresses)} addresses")
```
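A side note on the pagination loop in step 5: it re-runs the whole query with a progressively larger limit until `hasNextPage` is false, which refetches the earlier rows on every pass. Since the query already exposes a `paging` argument with `limit` and `offset`, an offset-based loop is a possible alternative. The sketch below is mine, not the notebook's; it reuses `client`, `lower`, and `upper` from the walkthrough and assumes the API honors an `$offset` variable the way the `paging` argument suggests:

```python
# sketch: assumes lockEvents accepts an arbitrary offset in its paging argument
paged_query = gql("""
query Previous1DEthPaged($lower: DateTime!, $upper: DateTime!, $limit: Int!, $offset: Int!) {
  lockEvents(
    sorting: {field: unlockTime, direction: ASC}
    filter: {
      unlockTime: {gt: 0}
      and: [
        {chainId: {in: ["0x1"]}},
        {createdAt: {between: {lower: $lower, upper: $upper}}},
      ]
    }
    paging: {limit: $limit, offset: $offset}
  ) {
    pageInfo { hasNextPage }
    nodes { tokenAddress createdAt }
  }
}
""")

PAGE_SIZE = 100
offset = 0
all_nodes = []
has_next_page = True

while has_next_page:
    variables = {"lower": lower, "upper": upper, "limit": PAGE_SIZE, "offset": offset}
    result = await client.execute_async(paged_query, variable_values=variables)
    all_nodes.extend(result['lockEvents']['nodes'])  # accumulate each page
    has_next_page = result['lockEvents']['pageInfo']['hasNextPage']
    offset += PAGE_SIZE

print(f"{len(all_nodes)} nodes fetched")
```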
This approach allowed us to fetch the Ethereum addresses directly from the backend, providing a more efficient and reliable solution compared to web scraping. The complete code is available here for exploration.
Conclusion
In this post, we compared two approaches to extracting information from a website: web scraping from the frontend and using the GraphQL API from the backend. For our specific use case, leveraging the GraphQL API proved to be a more advantageous solution due to its accuracy, efficiency, reliability, and flexibility.
It's important to note that the best approach may vary depending on the specific website and data requirements. In some cases, web scraping might be the only option if no API is available. However, when possible, using an API is generally a more efficient and reliable way to fetch data programmatically.
We hope this post provided valuable insights into the benefits of using APIs, particularly GraphQL, over web scraping when extracting data from websites. I will be busy filling out my current project in this repo, but there is more to come! Check out my portfolio repository here:
DAIly: A bunch of Data Analysis and Artificial Intelligence notebooks

- Ideas: This directory might contain notes or outlines of potential data analysis or AI projects that I'm considering working on in the future. These might be brainstorming notebooks, rough outlines or slide decks of project ideas, or notes on interesting data sources or tools that I want to explore further.
- Tips: This directory might contain more practical information, such as code snippets or tutorials that I've found helpful in my data analysis and AI work. These could be tips on how to use specific libraries or tools, how to preprocess data for analysis, or how to approach common data analysis or AI tasks.
- Fantastic Docs and Where to Find Them: Reading and understanding any documentation with minimum effort on Google Colab.
Follow me anywhere