BI teams that rely only on internal data — CRM, billing, product analytics — are working with half a map. Your dashboards tell you what you are doing, not what the market is doing around you. Competitor price changes, new 13F filings, fresh Y Combinator batches, Shopify app rankings, press releases — all of it is public, all of it is structured, and all of it can land in your warehouse on a nightly schedule if you wire it up correctly.
This tutorial walks through exactly that wiring. By the end you'll have a working pattern for: scheduling an Apify actor to run nightly, exporting the dataset to S3 (or GCS, or directly to Snowflake), loading it into a warehouse table, building a dashboard view on top, and firing a Slack alert when a metric moves. We'll use real BI tooling (Snowflake, BigQuery, dbt, Airflow, Google Sheets, Looker) and real actors from the NexGenData catalog. Code samples are copy-pasteable. If you maintain a competitive-intel dashboard, a lead-gen pipeline, or a market-monitoring report, you should be able to ship something useful from this post in an afternoon.
Anatomy of an Apify actor in a BI context
An actor is a serverless scraper. You pass it JSON input (URLs, search terms, filters), it runs in Apify's cloud, and it writes results to a dataset — a key-value store that exposes the rows as JSON, CSV, XLSX, JSONL, RSS, or HTML via a stable URL.
For BI purposes, three properties matter:
-
Output is structured. Each actor publishes a schema. Records come back as flat JSON objects, easy to
COPY INTOa warehouse table orpd.read_jsonin a notebook. -
Billing is pay-per-event (PPE). Most NexGenData actors charge $0.05–$0.50 per result, not per minute. Your cost scales with rows ingested, which makes monthly forecasting trivial:
rows_per_run × runs_per_month × $/row. -
Live-running beats batch downloads. You can hit the
run-sync-get-dataset-itemsendpoint, block until the actor finishes, and stream the dataset back in one HTTP call — ideal for ad-hoc analyst queries. Or you schedule a batch run and have Apify webhook the dataset URL to your ingest layer.
Three patterns for ingesting actor data
Pattern A: Direct API call from your ETL
Simplest pattern. Your Airbyte custom source, Fivetran function, or hand-rolled Python job calls the actor and writes results to the warehouse. Good for low-frequency runs (daily/weekly) and small-to-medium volumes.
curl -X POST \
"https://api.apify.com/v2/acts/nexgendata~saas-pricing-tracker/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"vendors": ["notion.so", "airtable.com", "monday.com"],
"plans": ["all"]
}' \
> pricing_$(date +%Y-%m-%d).json
From Python, the equivalent using the official SDK:
from apify_client import ApifyClient
import pandas as pd
client = ApifyClient("apify_api_xxx")
run = client.actor("nexgendata/saas-pricing-tracker").call(
run_input={"vendors": ["notion.so", "airtable.com"]}
)
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
df = pd.DataFrame(items)
df.to_sql("stg_competitor_pricing", warehouse_engine, if_exists="append")
Pattern B: Scheduled run + webhook to object storage
The production pattern. Apify's scheduler triggers the actor; on ACTOR.RUN.SUCCEEDED a webhook posts to your endpoint with the dataset URL. Your endpoint (Lambda, Cloud Function, or a small Flask container) streams the dataset to S3/GCS, and Snowpipe/BigQuery auto-ingest picks it up.
A webhook payload looks like this:
{
"userId": "abc123",
"createdAt": "2026-05-24T02:00:00.000Z",
"eventType": "ACTOR.RUN.SUCCEEDED",
"eventData": {
"actorId": "nexgendata~serp-rank-tracker-lite",
"actorRunId": "xYzRunId789",
"actorTaskId": "nightlyRankCheck"
},
"resource": {
"id": "xYzRunId789",
"status": "SUCCEEDED",
"defaultDatasetId": "dsId456",
"stats": { "computeUnits": 0.02, "datasetItemCount": 240 }
}
}
Your receiver fetches https://api.apify.com/v2/datasets/dsId456/items?format=json&clean;=true and writes the body to s3://bi-raw/apify/serp/2026-05-24.json. A Snowpipe definition pointed at that prefix loads it automatically.
Pattern C: Google Sheets via IMPORTDATA
For analysts who live in Sheets, you don't need a warehouse at all. Apify exposes dataset items as a CSV URL that IMPORTDATA can consume. Drop this in cell A1:
=IMPORTDATA("https://api.apify.com/v2/acts/nexgendata~yc-companies-directory-scraper/run-sync-get-dataset-items?token=APIFY_TOKEN&format;=csv&clean;=true")
Sheets refreshes the formula on edit and on a one-hour interval. Wrap it in QUERY() to filter, or feed it into a Looker Studio data source for a free dashboard layer.
Worked example: a daily competitor pricing BI dashboard
Goal: a Snowflake-backed Looker dashboard that, every morning at 7am, shows yesterday's pricing changes across your top 10 SaaS competitors, their SERP rank movement on five priority keywords, and any new music-gear pricing trends if you happen to sell music gear. We'll use three actors:
Step 1: schedule actor runs nightly
In the Apify Console, go to Schedules -> Create new. Set cron to 0 2 * * * (2am UTC). Add three "actions", one per actor, each pointing at a saved task with its input pre-filled. Save. You're done with scheduling.
Step 2: configure dataset export via webhook
On each actor task, open Integrations -> Webhooks. Add a webhook for ACTOR.RUN.SUCCEEDED pointing at your ingest endpoint:
https://ingest.yourcompany.com/apify?source={{eventData.actorId}}&run;={{eventData.actorRunId}}
A minimal Python receiver (FastAPI on Cloud Run):
@app.post("/apify")
async def ingest(source: str, run: str, payload: dict):
dataset_id = payload["resource"]["defaultDatasetId"]
url = f"https://api.apify.com/v2/datasets/{dataset_id}/items?format=jsonl&clean;=true"
data = httpx.get(url).content
key = f"apify/{source.replace('~','/')}/{date.today()}.jsonl"
s3.put_object(Bucket="bi-raw", Key=key, Body=data)
return {"status": "ok", "rows": payload["resource"]["stats"]["datasetItemCount"]}
Step 3: load to data warehouse
Snowflake DDL for the raw and curated layers:
-- Raw landing table, schemaless
CREATE TABLE raw.apify_pricing (
load_ts TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,
source_file STRING,
payload VARIANT
);
-- Snowpipe pointed at s3://bi-raw/apify/nexgendata/saas-pricing-tracker/
CREATE PIPE raw.pipe_apify_pricing AS
COPY INTO raw.apify_pricing (source_file, payload)
FROM (SELECT METADATA$FILENAME, $1 FROM @bi_raw_stage/apify/nexgendata/saas-pricing-tracker/)
FILE_FORMAT = (TYPE = JSON);
-- Curated view, one row per vendor-plan-day
CREATE OR REPLACE VIEW analytics.competitor_pricing AS
SELECT
payload:vendor::STRING AS vendor,
payload:plan_name::STRING AS plan_name,
payload:price_usd::FLOAT AS price_usd,
payload:billing_period::STRING AS billing_period,
payload:scraped_at::DATE AS observation_date
FROM raw.apify_pricing;
BigQuery equivalent (for teams on GCP): create an external table on the GCS prefix, then a scheduled query that materializes the curated view nightly at 3am.
Step 4: build the dashboard view
The query the dashboard sits on top of — day-over-day price deltas plus a flag for movers:
WITH ranked AS (
SELECT
vendor, plan_name, price_usd, observation_date,
LAG(price_usd) OVER (PARTITION BY vendor, plan_name ORDER BY observation_date) AS prev_price
FROM analytics.competitor_pricing
WHERE observation_date >= CURRENT_DATE - 30
)
SELECT
vendor, plan_name, observation_date,
prev_price, price_usd,
price_usd - prev_price AS delta_usd,
ROUND((price_usd - prev_price) / NULLIF(prev_price, 0) * 100, 2) AS delta_pct,
CASE WHEN ABS(price_usd - prev_price) > 0 THEN TRUE ELSE FALSE END AS price_changed
FROM ranked
WHERE prev_price IS NOT NULL
ORDER BY observation_date DESC, ABS(delta_pct) DESC;
Point Looker, Mode, Hex, Tableau, or Power BI at that view. A "Price Movers" tile filtering on price_changed = TRUE AND observation_date = CURRENT_DATE - 1 gives you the morning briefing.
Step 5: alert on changes via Slack
A 50-line cron job converts the same query into a Slack ping:
import snowflake.connector, requests, os
conn = snowflake.connector.connect(...)
rows = conn.cursor().execute("""
SELECT vendor, plan_name, prev_price, price_usd, delta_pct
FROM analytics.competitor_pricing_changes
WHERE observation_date = CURRENT_DATE - 1 AND ABS(delta_pct) >= 5
""").fetchall()
if rows:
text = "*Competitor price moves (>=5%) overnight:*\n" + "\n".join(
f"• {v} {p}: ${pp} -> ${cp} ({d:+.1f}%)" for v, p, pp, cp, d in rows
)
requests.post(os.environ["SLACK_WEBHOOK"], json={"text": text})
Schedule it with Airflow, Prefect, GitHub Actions, or a plain crontab. You now have a closed loop: scrape -> load -> transform -> alert.
Cost calibration
NexGenData actors price between $0.05 and $0.50 per result. A realistic monthly bill for the pipeline above:
- saas-pricing-tracker: 10 vendors × 4 plans each = 40 rows/night × 30 nights = 1,200 rows/month × $0.10 = $120/mo
- serp-rank-tracker-lite: 5 keywords × 10 SERP positions = 50 rows/night × 30 = 1,500 × $0.05 = $75/mo
- reverb-musical-instrument-scraper: 500 listings/night × 30 × $0.05 = $750/mo (only worth it if music gear is core to your business; otherwise drop it)
So the lean pricing-and-SERP pipeline lands around $200/month — cheaper than a single seat of most competitive-intel SaaS tools, and you own the data. If you sign up for Apify through the NexGenData referral link 30% of your usage flows back as affiliate credit, which effectively rebates the cost.
Rule of thumb for estimating before you commit: prototype with a 1-day run, multiply datasetItemCount by 30, multiply by the actor's per-result price (visible on the actor page), and that's your monthly floor.
Best actors for BI use cases
| Use case | Actor | What it returns | Typical run cost |
|---|---|---|---|
| Market intel | pr-newswire-press-releases-scraper | Press release feed by company/topic | $0.05–$0.10/release |
| eastmoney-china-stock-screener | A-share screener data, fundamentals | $0.05/ticker | |
| sec-form-13f-tracker-pro | Institutional holdings from 13F filings | $0.10/position | |
| Competitor intel | saas-pricing-tracker | Vendor pricing tiers, plan features | $0.10/plan |
| serp-rank-tracker-lite | Keyword SERP positions | $0.05/result | |
| shopify-app-store-scraper | App listings, ratings, install counts | $0.05/app | |
| Lead-gen feeds | contact-info-scraper | Emails, phones, social handles from any URL | $0.05/contact |
| yc-companies-directory-scraper | YC batch directory: companies, founders, status | $0.05/company |
For deeper dives on specific verticals, see our sister posts on YC data for VC workflows, Eastmoney for China equity research, and sanctions/compliance scraping. Browse the full category collections at /market-intelligence-tools/, /lead-generation-data-tools/, and /financial-data-tools/, or hit the full resources hub.
Limitations and when NOT to use actors
Actors are powerful, but they're not always the right tool:
- If a vendor has a real API, use the API. Salesforce, HubSpot, Stripe, Segment — these have first-party connectors in Fivetran/Airbyte. Don't scrape what you can query.
- If a SaaS already aggregates the dataset, weigh the math. SimilarWeb, Ahrefs, Sensor Tower, Pitchbook all charge $1k–$10k/month but give you historical depth that's expensive to backfill via scraping. Use actors when you need a specific narrow slice (10 competitors, 5 keywords) at 1/10th the price, or when the SaaS doesn't cover your niche.
- Respect ToS and robots.txt. Public data is fair game in most jurisdictions, but a target site's terms may prohibit automated access. LinkedIn, Glassdoor, and a few others are litigious. Check the actor's documentation for compliance notes, and never scrape gated/authenticated content you don't have rights to.
- If you need sub-minute freshness, scraping isn't the architecture. Actors run on a schedule. For real-time price changes or fraud signals, you need streaming sources, not nightly batches.
FAQ
Can I run actors from a Jupyter notebook? Yes — install apify-client, call client.actor("name").call(run_input={...}), iterate the dataset with client.dataset(run["defaultDatasetId"]).iterate_items(). Results come back as Python dicts, ready for pandas.
Does Apify support webhooks? Yes. Every actor run can fire webhooks on STARTED, SUCCEEDED, FAILED, TIMED_OUT, or ABORTED. Payload includes the dataset ID so your receiver can pull results immediately.
How do I refresh nightly? Apify Console -> Schedules -> Create new, cron syntax. Or trigger from Airflow's SimpleHttpOperator against the run-sync endpoint if you want orchestration in one place.
What about long-running actors ( >1h)? Don't use run-sync. Use the async POST /acts/{id}/runs endpoint, then either poll GET /actor-runs/{id} or wait for the SUCCEEDED webhook. Apify runs can go up to 168 hours with the right memory configuration.
Can I pipe data to dbt/Airflow? Yes. Land actor results in a raw warehouse table (Pattern A or B), then dbt models the curated layer. For Airflow, a DAG with a PythonOperator that calls the Apify SDK, followed by a SnowflakeOperator that runs the dbt build, is the canonical shape.
Is the JSON format stable? Each actor publishes a JSON Schema on its detail page and versions it. NexGenData actors include a schema_version field on every row so your downstream models can branch on changes. Subscribe to actor update notifications in the console to catch breaking changes before they hit production.
Operational tips from production deployments
A few hard-won lessons from teams that have run this pattern in production for 6+ months:
-
Idempotency keys matter. Set
?clean=true&fields;=id,vendor,plan_name,price_usd,scraped_aton dataset pulls and dedupe in your raw layer on the actor's natural key. Webhooks can fire twice during Apify platform incidents; without dedup you'll double-count. - Land raw, model later. Resist the urge to flatten in the ingest layer. Drop the full payload as VARIANT/JSON and let dbt or your transformation layer pick fields. When the actor schema evolves (it will), you won't have to backfill — the raw rows are intact.
-
Tag every run with a build_id. Pass
{"build": "1.2"}inrun_input.metadataso you can join scraped rows back to the actor version that produced them. Useful when investigating why a metric jumped one Tuesday. -
Monitor actor health, not just data freshness. Apify exposes
GET /v2/users/me/usage/monthly— alert when daily compute units drift more than 30% from baseline. A silently broken actor returns zero rows, which looks like "no price changes" on the dashboard. Set a freshness SLA in dbt withdbt source freshnessand page on stale data. - Backfill on first deploy. Run the actor 7–14 times with date inputs (where supported) before going live, so day-over-day deltas have a baseline.
Airflow DAG sketch
The canonical orchestration shape, for teams already running Airflow:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from apify_client import ApifyClient
from datetime import datetime
def run_actor(actor_id, **ctx):
client = ApifyClient(Variable.get("APIFY_TOKEN"))
run = client.actor(actor_id).call(run_input={"vendors": COMPETITORS})
ctx["ti"].xcom_push(key="dataset_id", value=run["defaultDatasetId"])
with DAG("competitor_pricing", schedule="0 2 * * *", start_date=datetime(2026,1,1)) as dag:
scrape = PythonOperator(task_id="scrape", python_callable=run_actor,
op_kwargs={"actor_id":"nexgendata/saas-pricing-tracker"})
load = SnowflakeOperator(task_id="load", sql="CALL sp_load_apify_pricing('{{ ti.xcom_pull(key=\"dataset_id\") }}')")
transform = SnowflakeOperator(task_id="dbt", sql="CALL run_dbt_models('competitor_pricing')")
scrape >> load >> transform
Three tasks, one DAG, runs in under a minute end-to-end for the volumes in our worked example. Plug the same shape into Prefect, Dagster, or Mage if Airflow isn't your stack.
Top comments (0)