NexGenData

Posted on Jun 25 • Originally published at thenextgennexus.com

Using Apify Actors for Business Intelligence Workflows

#finance #marketing #ecommerce #automation

BI teams that rely only on internal data — CRM, billing, product analytics — are working with half a map. Your dashboards tell you what you are doing, not what the market is doing around you. Competitor price changes, new 13F filings, fresh Y Combinator batches, Shopify app rankings, press releases — all of it is public, all of it is structured, and all of it can land in your warehouse on a nightly schedule if you wire it up correctly.

This tutorial walks through exactly that wiring. By the end you'll have a working pattern for: scheduling an Apify actor to run nightly, exporting the dataset to S3 (or GCS, or directly to Snowflake), loading it into a warehouse table, building a dashboard view on top, and firing a Slack alert when a metric moves. We'll use real BI tooling (Snowflake, BigQuery, dbt, Airflow, Google Sheets, Looker) and real actors from the NexGenData catalog. Code samples are copy-pasteable. If you maintain a competitive-intel dashboard, a lead-gen pipeline, or a market-monitoring report, you should be able to ship something useful from this post in an afternoon.

Anatomy of an Apify actor in a BI context

An actor is a serverless scraper. You pass it JSON input (URLs, search terms, filters), it runs in Apify's cloud, and it writes results to a dataset — a key-value store that exposes the rows as JSON, CSV, XLSX, JSONL, RSS, or HTML via a stable URL.

For BI purposes, three properties matter:

Output is structured. Each actor publishes a schema. Records come back as flat JSON objects, easy to COPY INTO a warehouse table or pd.read_json in a notebook.
Billing is pay-per-event (PPE). Most NexGenData actors charge $0.05–$0.50 per result, not per minute. Your cost scales with rows ingested, which makes monthly forecasting trivial: rows_per_run × runs_per_month × $/row.
Live-running beats batch downloads. You can hit the run-sync-get-dataset-items endpoint, block until the actor finishes, and stream the dataset back in one HTTP call — ideal for ad-hoc analyst queries. Or you schedule a batch run and have Apify webhook the dataset URL to your ingest layer.

Three patterns for ingesting actor data

Pattern A: Direct API call from your ETL

Simplest pattern. Your Airbyte custom source, Fivetran function, or hand-rolled Python job calls the actor and writes results to the warehouse. Good for low-frequency runs (daily/weekly) and small-to-medium volumes.


    curl -X POST \
      "https://api.apify.com/v2/acts/nexgendata~saas-pricing-tracker/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "vendors": ["notion.so", "airtable.com", "monday.com"],
        "plans": ["all"]
      }' \
      > pricing_$(date +%Y-%m-%d).json

From Python, the equivalent using the official SDK:


    from apify_client import ApifyClient
    import pandas as pd

    client = ApifyClient("apify_api_xxx")
    run = client.actor("nexgendata/saas-pricing-tracker").call(
        run_input={"vendors": ["notion.so", "airtable.com"]}
    )
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    df = pd.DataFrame(items)
    df.to_sql("stg_competitor_pricing", warehouse_engine, if_exists="append")

Pattern B: Scheduled run + webhook to object storage

The production pattern. Apify's scheduler triggers the actor; on ACTOR.RUN.SUCCEEDED a webhook posts to your endpoint with the dataset URL. Your endpoint (Lambda, Cloud Function, or a small Flask container) streams the dataset to S3/GCS, and Snowpipe/BigQuery auto-ingest picks it up.

A webhook payload looks like this:


    {
      "userId": "abc123",
      "createdAt": "2026-05-24T02:00:00.000Z",
      "eventType": "ACTOR.RUN.SUCCEEDED",
      "eventData": {
        "actorId": "nexgendata~serp-rank-tracker-lite",
        "actorRunId": "xYzRunId789",
        "actorTaskId": "nightlyRankCheck"
      },
      "resource": {
        "id": "xYzRunId789",
        "status": "SUCCEEDED",
        "defaultDatasetId": "dsId456",
        "stats": { "computeUnits": 0.02, "datasetItemCount": 240 }
      }
    }

Your receiver fetches https://api.apify.com/v2/datasets/dsId456/items?format=json&clean;=true and writes the body to s3://bi-raw/apify/serp/2026-05-24.json. A Snowpipe definition pointed at that prefix loads it automatically.

Pattern C: Google Sheets via IMPORTDATA

For analysts who live in Sheets, you don't need a warehouse at all. Apify exposes dataset items as a CSV URL that IMPORTDATA can consume. Drop this in cell A1:


    =IMPORTDATA("https://api.apify.com/v2/acts/nexgendata~yc-companies-directory-scraper/run-sync-get-dataset-items?token=APIFY_TOKEN&format;=csv&clean;=true")

Sheets refreshes the formula on edit and on a one-hour interval. Wrap it in QUERY() to filter, or feed it into a Looker Studio data source for a free dashboard layer.

Worked example: a daily competitor pricing BI dashboard

Goal: a Snowflake-backed Looker dashboard that, every morning at 7am, shows yesterday's pricing changes across your top 10 SaaS competitors, their SERP rank movement on five priority keywords, and any new music-gear pricing trends if you happen to sell music gear. We'll use three actors:

Step 1: schedule actor runs nightly

In the Apify Console, go to Schedules -> Create new. Set cron to 0 2 * * * (2am UTC). Add three "actions", one per actor, each pointing at a saved task with its input pre-filled. Save. You're done with scheduling.

Step 2: configure dataset export via webhook

On each actor task, open Integrations -> Webhooks. Add a webhook for ACTOR.RUN.SUCCEEDED pointing at your ingest endpoint:


    https://ingest.yourcompany.com/apify?source={{eventData.actorId}}&run;={{eventData.actorRunId}}

A minimal Python receiver (FastAPI on Cloud Run):


    @app.post("/apify")
    async def ingest(source: str, run: str, payload: dict):
        dataset_id = payload["resource"]["defaultDatasetId"]
        url = f"https://api.apify.com/v2/datasets/{dataset_id}/items?format=jsonl&clean;=true"
        data = httpx.get(url).content
        key = f"apify/{source.replace('~','/')}/{date.today()}.jsonl"
        s3.put_object(Bucket="bi-raw", Key=key, Body=data)
        return {"status": "ok", "rows": payload["resource"]["stats"]["datasetItemCount"]}

Step 3: load to data warehouse

Snowflake DDL for the raw and curated layers:


    -- Raw landing table, schemaless
    CREATE TABLE raw.apify_pricing (
      load_ts TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP,
      source_file STRING,
      payload VARIANT
    );

    -- Snowpipe pointed at s3://bi-raw/apify/nexgendata/saas-pricing-tracker/
    CREATE PIPE raw.pipe_apify_pricing AS
    COPY INTO raw.apify_pricing (source_file, payload)
    FROM (SELECT METADATA$FILENAME, $1 FROM @bi_raw_stage/apify/nexgendata/saas-pricing-tracker/)
    FILE_FORMAT = (TYPE = JSON);

    -- Curated view, one row per vendor-plan-day
    CREATE OR REPLACE VIEW analytics.competitor_pricing AS
    SELECT
      payload:vendor::STRING       AS vendor,
      payload:plan_name::STRING    AS plan_name,
      payload:price_usd::FLOAT     AS price_usd,
      payload:billing_period::STRING AS billing_period,
      payload:scraped_at::DATE     AS observation_date
    FROM raw.apify_pricing;

BigQuery equivalent (for teams on GCP): create an external table on the GCS prefix, then a scheduled query that materializes the curated view nightly at 3am.

Step 4: build the dashboard view

The query the dashboard sits on top of — day-over-day price deltas plus a flag for movers:


    WITH ranked AS (
      SELECT
        vendor, plan_name, price_usd, observation_date,
        LAG(price_usd) OVER (PARTITION BY vendor, plan_name ORDER BY observation_date) AS prev_price
      FROM analytics.competitor_pricing
      WHERE observation_date >= CURRENT_DATE - 30
    )
    SELECT
      vendor, plan_name, observation_date,
      prev_price, price_usd,
      price_usd - prev_price AS delta_usd,
      ROUND((price_usd - prev_price) / NULLIF(prev_price, 0) * 100, 2) AS delta_pct,
      CASE WHEN ABS(price_usd - prev_price) > 0 THEN TRUE ELSE FALSE END AS price_changed
    FROM ranked
    WHERE prev_price IS NOT NULL
    ORDER BY observation_date DESC, ABS(delta_pct) DESC;

Point Looker, Mode, Hex, Tableau, or Power BI at that view. A "Price Movers" tile filtering on price_changed = TRUE AND observation_date = CURRENT_DATE - 1 gives you the morning briefing.

Step 5: alert on changes via Slack

A 50-line cron job converts the same query into a Slack ping:


    import snowflake.connector, requests, os
    conn = snowflake.connector.connect(...)
    rows = conn.cursor().execute("""
      SELECT vendor, plan_name, prev_price, price_usd, delta_pct
      FROM analytics.competitor_pricing_changes
      WHERE observation_date = CURRENT_DATE - 1 AND ABS(delta_pct) >= 5
    """).fetchall()

    if rows:
        text = "*Competitor price moves (>=5%) overnight:*\n" + "\n".join(
            f"• {v} {p}: ${pp} -> ${cp} ({d:+.1f}%)" for v, p, pp, cp, d in rows
        )
        requests.post(os.environ["SLACK_WEBHOOK"], json={"text": text})

Schedule it with Airflow, Prefect, GitHub Actions, or a plain crontab. You now have a closed loop: scrape -> load -> transform -> alert.

Cost calibration

NexGenData actors price between $0.05 and $0.50 per result. A realistic monthly bill for the pipeline above:

saas-pricing-tracker: 10 vendors × 4 plans each = 40 rows/night × 30 nights = 1,200 rows/month × $0.10 = $120/mo
serp-rank-tracker-lite: 5 keywords × 10 SERP positions = 50 rows/night × 30 = 1,500 × $0.05 = $75/mo
reverb-musical-instrument-scraper: 500 listings/night × 30 × $0.05 = $750/mo (only worth it if music gear is core to your business; otherwise drop it)

So the lean pricing-and-SERP pipeline lands around $200/month — cheaper than a single seat of most competitive-intel SaaS tools, and you own the data. If you sign up for Apify through the NexGenData referral link 30% of your usage flows back as affiliate credit, which effectively rebates the cost.

Rule of thumb for estimating before you commit: prototype with a 1-day run, multiply datasetItemCount by 30, multiply by the actor's per-result price (visible on the actor page), and that's your monthly floor.

Best actors for BI use cases

Use case	Actor	What it returns	Typical run cost
Market intel	pr-newswire-press-releases-scraper	Press release feed by company/topic	$0.05–$0.10/release
eastmoney-china-stock-screener	A-share screener data, fundamentals	$0.05/ticker
sec-form-13f-tracker-pro	Institutional holdings from 13F filings	$0.10/position
Competitor intel	saas-pricing-tracker	Vendor pricing tiers, plan features	$0.10/plan
serp-rank-tracker-lite	Keyword SERP positions	$0.05/result
shopify-app-store-scraper	App listings, ratings, install counts	$0.05/app
Lead-gen feeds	contact-info-scraper	Emails, phones, social handles from any URL	$0.05/contact
yc-companies-directory-scraper	YC batch directory: companies, founders, status	$0.05/company

For deeper dives on specific verticals, see our sister posts on YC data for VC workflows, Eastmoney for China equity research, and sanctions/compliance scraping. Browse the full category collections at /market-intelligence-tools/, /lead-generation-data-tools/, and /financial-data-tools/, or hit the full resources hub.

Limitations and when NOT to use actors

Actors are powerful, but they're not always the right tool:

If a vendor has a real API, use the API. Salesforce, HubSpot, Stripe, Segment — these have first-party connectors in Fivetran/Airbyte. Don't scrape what you can query.
If a SaaS already aggregates the dataset, weigh the math. SimilarWeb, Ahrefs, Sensor Tower, Pitchbook all charge $1k–$10k/month but give you historical depth that's expensive to backfill via scraping. Use actors when you need a specific narrow slice (10 competitors, 5 keywords) at 1/10th the price, or when the SaaS doesn't cover your niche.
Respect ToS and robots.txt. Public data is fair game in most jurisdictions, but a target site's terms may prohibit automated access. LinkedIn, Glassdoor, and a few others are litigious. Check the actor's documentation for compliance notes, and never scrape gated/authenticated content you don't have rights to.
If you need sub-minute freshness, scraping isn't the architecture. Actors run on a schedule. For real-time price changes or fraud signals, you need streaming sources, not nightly batches.

FAQ

Can I run actors from a Jupyter notebook? Yes — install apify-client, call client.actor("name").call(run_input={...}), iterate the dataset with client.dataset(run["defaultDatasetId"]).iterate_items(). Results come back as Python dicts, ready for pandas.

Does Apify support webhooks? Yes. Every actor run can fire webhooks on STARTED, SUCCEEDED, FAILED, TIMED_OUT, or ABORTED. Payload includes the dataset ID so your receiver can pull results immediately.

How do I refresh nightly? Apify Console -> Schedules -> Create new, cron syntax. Or trigger from Airflow's SimpleHttpOperator against the run-sync endpoint if you want orchestration in one place.

What about long-running actors ( >1h)? Don't use run-sync. Use the async POST /acts/{id}/runs endpoint, then either poll GET /actor-runs/{id} or wait for the SUCCEEDED webhook. Apify runs can go up to 168 hours with the right memory configuration.

Can I pipe data to dbt/Airflow? Yes. Land actor results in a raw warehouse table (Pattern A or B), then dbt models the curated layer. For Airflow, a DAG with a PythonOperator that calls the Apify SDK, followed by a SnowflakeOperator that runs the dbt build, is the canonical shape.

Is the JSON format stable? Each actor publishes a JSON Schema on its detail page and versions it. NexGenData actors include a schema_version field on every row so your downstream models can branch on changes. Subscribe to actor update notifications in the console to catch breaking changes before they hit production.

Operational tips from production deployments

A few hard-won lessons from teams that have run this pattern in production for 6+ months:

Idempotency keys matter. Set ?clean=true&fields;=id,vendor,plan_name,price_usd,scraped_at on dataset pulls and dedupe in your raw layer on the actor's natural key. Webhooks can fire twice during Apify platform incidents; without dedup you'll double-count.
Land raw, model later. Resist the urge to flatten in the ingest layer. Drop the full payload as VARIANT/JSON and let dbt or your transformation layer pick fields. When the actor schema evolves (it will), you won't have to backfill — the raw rows are intact.
Tag every run with a build_id. Pass {"build": "1.2"} in run_input.metadata so you can join scraped rows back to the actor version that produced them. Useful when investigating why a metric jumped one Tuesday.
Monitor actor health, not just data freshness. Apify exposes GET /v2/users/me/usage/monthly — alert when daily compute units drift more than 30% from baseline. A silently broken actor returns zero rows, which looks like "no price changes" on the dashboard. Set a freshness SLA in dbt with dbt source freshness and page on stale data.
Backfill on first deploy. Run the actor 7–14 times with date inputs (where supported) before going live, so day-over-day deltas have a baseline.

Airflow DAG sketch

The canonical orchestration shape, for teams already running Airflow:


    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
    from apify_client import ApifyClient
    from datetime import datetime

    def run_actor(actor_id, **ctx):
        client = ApifyClient(Variable.get("APIFY_TOKEN"))
        run = client.actor(actor_id).call(run_input={"vendors": COMPETITORS})
        ctx["ti"].xcom_push(key="dataset_id", value=run["defaultDatasetId"])

    with DAG("competitor_pricing", schedule="0 2 * * *", start_date=datetime(2026,1,1)) as dag:
        scrape = PythonOperator(task_id="scrape", python_callable=run_actor,
                                op_kwargs={"actor_id":"nexgendata/saas-pricing-tracker"})
        load = SnowflakeOperator(task_id="load", sql="CALL sp_load_apify_pricing('{{ ti.xcom_pull(key=\"dataset_id\") }}')")
        transform = SnowflakeOperator(task_id="dbt", sql="CALL run_dbt_models('competitor_pricing')")
        scrape >> load >> transform

Three tasks, one DAG, runs in under a minute end-to-end for the volumes in our worked example. Plug the same shape into Prefect, Dagster, or Mage if Airflow isn't your stack.

DEV Community