In tourism, many people still get cheated by scam companies. This happens a lot with umrah packages, tourist guides, and travel agencies. Why? Because it is not easy to check whether a company is legitimate.
The government has official websites with lists of banned, blacklisted, or registered names. There is a search function, but the problem is that the data is split across many different lists: one for tourist guides, one for umrah, one for travel agencies, and so on. You must choose the right list first and then search. On top of that, each list uses pagination, so you still need to click through page by page, which is slow and not user-friendly.
I started to think: what if we made one simple website where people just type a keyword, and it shows whether the name exists in any of the lists? This way, travelers can quickly check if a company is real or a scam. Btw, I'm doing this for fun, since I can't go anywhere during the school holidays with the roads all jammed, so I'm just spending my time on a little project.
The Challenge
The laziest part of this is actually getting all the related data. Copying and pasting by hand is possible and easy, but it is too much work; it would turn a fun project into a depressing one haha. So why not use Mage AI, since I already used it for a previous project related to data.
At first, I created a normal block with a loop. It worked, but it was too slow because it went through every page one by one (tolerable only for lists that don't have many pages). Then I realized: why not try a dynamic block? With dynamic blocks, I can run many requests at the same time with parallel processing. Much faster, much smarter.
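For context, here is a minimal sketch of that first sequential approach (a hypothetical helper, not the actual block, assuming the same offset-based pagination the site uses):

import requests

def scrape_all_pages(base_url: str, total_records: int, page_size: int = 20):
    # Fetch every page one after another; each request blocks until
    # the previous one finishes, which is why this gets slow.
    pages = []
    for offset in range(0, total_records, page_size):
        page_url = base_url if offset == 0 else f"{base_url}?s=&n=&v={offset}"
        response = requests.get(page_url, timeout=30)
        response.raise_for_status()
        pages.append(response.text)  # parsing omitted for brevity
    return pages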
Mage AI Dynamic Blocks
Here is where Mage AI helps. Mage AI has dynamic blocks. With this feature, we can scrape many pages in parallel, which means faster and easier. To learn more, see the Mage AI documentation on dynamic blocks.
This is how it works:
- Generate a list of URLs, including the pagination parameter, using a loader block. Keep in mind that a dynamic block must return a list of two lists of dictionaries (see the sketch after this list)
- Scrape each page based on the URL stored in its dictionary, and reduce the data into one set
- Export the data to the destination
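To make that return shape concrete, here is a minimal sketch (the URLs are made up): the first inner list holds the data passed to each child run, and the second holds the metadata for each run.

output = [
    # one dict per child run (the data each run receives)
    [
        dict(id=1, url="https://example.com/list"),
        dict(id=2, url="https://example.com/list?s=&n=&v=20"),
    ],
    # one metadata dict per child run (used to name the spawned blocks)
    [
        dict(block_uuid="scrape_page_1"),
        dict(block_uuid="scrape_page_2"),
    ],
]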
Example
First Step
Create a loader block. Ensure you set this block as dynamic.
Once you set it as dynamic, you can write the following as your loader. The purpose is to gather all the target URLs that we want to scrape.
from typing import Dict, List

import requests
from bs4 import BeautifulSoup

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data(*args, **kwargs) -> List[List[Dict]]:
    """
    This loader prepares tasks for scraping multiple MOTAC pages.
    Each entry in 'tasks' becomes a separate block run if used with dynamic blocks.
    """
    url = "https://the-targeted-url"
    response = requests.get(url, timeout=20)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The total record count is shown inside a disabled pagination item,
    # e.g. "Jumlah Rekod: 480".
    record_tag = soup.select_one("li.uk-disabled span")
    jumlah_rekod = 0  # default to 0 so a missing tag yields no pages instead of a crash
    if record_tag:
        text = record_tag.get_text(strip=True)
        jumlah_rekod = int(text.split(":")[-1].strip())

    # Build one URL per page; the site paginates 20 records at a time
    # via the 'v' offset query parameter.
    urls = []
    for offset in range(0, jumlah_rekod, 20):
        if offset == 0:
            urls.append(url)
        else:
            urls.append(f"{url}?s=&n=&v={offset}")

    # A dynamic block must return a list of two lists: the data for each
    # child run, and the metadata (e.g. block_uuid) for each run.
    tasks = []
    metadata = []
    for idx, page_url in enumerate(urls, start=1):
        tasks.append(dict(id=idx, url=page_url))
        metadata.append(dict(block_uuid=f"scrape_page_{idx}"))

    return [
        tasks,
        metadata,
    ]
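When the pipeline runs, Mage spawns one run of the downstream block for each dict in tasks, and the block_uuid values in metadata label those spawned runs, so every page gets fetched in its own parallel run.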
Second Step
Create a transformer. This transformer does the scraping and extracts all the data from the page. It will automatically be set as dynamic if you set the first block as dynamic. The only thing we need to do is reduce the output (in Mage this is the block's "Reduce output" option). The reason we reduce is that we want to export in one step, so we don't spawn an extra block per page for the export.
import requests
from bs4 import BeautifulSoup

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def scrape_page(row, *args, **kwargs):
    # Each dynamic child run receives one task dict from the loader.
    url = row["url"]
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The records live in a plain HTML table: the first row holds the
    # headers, and every following row is one record.
    results = []
    table = soup.find("table")
    if table:
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        for tr in table.find_all("tr")[1:]:  # skip header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                results.append(dict(zip(headers, cells)))

    return {
        "page_id": row["id"],
        "url": url,
        "records": results,
    }
Third Step
Add another block to clean up the data format, column names, etc. before exporting.
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test
import pandas as pd


@transformer
def transform(data, *args, **kwargs):
    # With the previous block's output reduced, 'data' arrives as the
    # combined output of every page run. If it is not already a DataFrame,
    # flatten the per-page 'records' lists into one (this assumes the
    # reduced output is a list of the per-page dicts).
    if isinstance(data, pd.DataFrame):
        df = data
    else:
        records = []
        for page in data:
            records.extend(page.get('records', []))
        df = pd.DataFrame(records)

    # Drop the row-number column; it carries no information.
    if '#' in df.columns:
        df = df.drop(columns=['#'])

    # Rename the Malay headers to snake_case column names.
    df = df.rename(columns={
        'Nama': 'nama',
        'No. TG': 'no_tg',
        'Tempoh Sah': 'tempoh_sah',
        'Tarikh Batal': 'tarikh_batal',
        'Seksyen': 'seksyen',
    })

    # Parse the date columns; unparseable values become NaT.
    for col in ['tempoh_sah', 'tarikh_batal']:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], format='%d/%m/%y', errors='coerce')

    return df


@test
def test_output(output, *args) -> None:
    # Ensure required columns exist
    required_cols = ['nama', 'no_tg', 'tempoh_sah', 'tarikh_batal', 'seksyen']
    for col in required_cols:
        assert col in output.columns, f'Missing column: {col}'
Fourth Step
For now, since this is only a daily project, I'm just going to export with a simple full load first. No worries, if I'm in the mood, I'll write a better approach for this :)
from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:
    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.
    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    schema_name = 'public'
    table_name = 'pemandu_pelancong'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        loader.export(
            df,
            schema_name,
            table_name,
            index=False,
            if_exists='replace',
        )
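A quick note on the full load: if_exists='replace' drops and recreates the table on every run, which is exactly what makes this a simple full load. An incremental approach would switch to 'append' with some deduplication on top.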
Why Use Dynamic Blocks for Scraping?
Dynamic blocks are powerful because they make scraping large datasets much faster. Instead of one request after another, you can run many requests at the same time. For websites with hundreds of pages, this saves a lot of time.
But there are also things to keep in mind:
- Respect rate limits: Some websites may block you if you send too many requests at once
- Error handling: Always add retries in case some requests fail (a small sketch follows this list)
- Data consistency: Make sure to clean and validate data before saving
- Ethics and legality: Always check if scraping the website is allowed
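For the first two points, here is a minimal sketch of a fetch helper with retries and a delay between attempts. It is an illustration rather than part of the pipeline above; the function name and the numbers are assumptions to tune for whatever site you scrape.

import time
import requests

def fetch_with_retries(url: str, retries: int = 3, delay_seconds: float = 2.0) -> requests.Response:
    # Retry failed requests a few times, sleeping between attempts so we
    # don't hammer the server; re-raise the last error if every attempt fails.
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            last_error = error
            time.sleep(delay_seconds * attempt)  # simple linear backoff
    raise last_error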
Closing Thoughts
This little holiday project showed me how useful Mage AI’s dynamic blocks can be. With just a few blocks, I turned a slow and boring manual process into a fast, automated pipeline. The scraped data can now be used to build a simple search directory, helping people quickly check if a company is real or a scam.
Dynamic blocks are not only fun; they're practical, powerful, and a great tool for anyone working with pagination or large API calls.
So remember: when you face hundreds of pages, don't suffer like Anakin, let the blocks be with you.