Bulk Apple App Store + Google Play Review Monitoring (2026 Guide)
Every mobile product manager eventually hits the same wall. App Store Connect shows you ratings in aggregate, maybe a scrolling feed of the latest reviews, but nothing that answers "what are users complaining about this week compared to last week" across both iOS and Android. Google Play Console is marginally better for Android-only shops. Neither tool lets you monitor competitor apps. Neither lets you export 10,000 reviews to a notebook for serious sentiment work. Neither gives you webhooks when a new 1-star hits.
The result is that indie devs and small PM teams end up manually copy-pasting reviews into a spreadsheet every Monday, or paying $199/month for AppFollow, Appfigures, or Sensor Tower — tools that give you everything except the raw, structured review stream you can actually pipe into your own analytics.
This post walks through building that stream yourself: a dual-platform scraper stack that pulls reviews from both Apple App Store and Google Play, runs sentiment classification, clusters complaints by theme, and lands you with a weekly digest you can share in Slack. The goal is not to replicate AppFollow's UI. The goal is to own the raw data so you can do things AppFollow will never support — custom sentiment models, competitor diffs, cohort analysis by app version, correlation between review spikes and App Store ranking changes.
Grounding Numbers
A few numbers to frame the problem. Apple's App Store has roughly 1.8 million active apps as of Q4 2025 (Apple's own Services report, reconciled with data.ai). Google Play has about 2.3 million. Combined, users leave approximately 65 million reviews per month across both stores, per Sensor Tower's 2025 State of Mobile. The median app gets 0.4 reviews per 1,000 downloads; top-decile apps hit 3.2 reviews per 1,000.
For a single mid-sized consumer app doing 500k MAU and 10k weekly downloads, you can expect 30-60 fresh reviews per week, with spikes of 500+ when a release goes badly or a viral moment hits. For a competitor-tracking setup covering 20 apps in a category, you are looking at 600-1,200 reviews per week to ingest and classify.
On the classification side, modern transformer-based sentiment models (DistilBERT multilingual fine-tuned on app reviews, or any of the OpenAI/Anthropic small models) hit 88-92% agreement with human labels on the standard app-review benchmark (GAR, Maalej et al., 2016). That is good enough for weekly digests, not good enough for legal or HR consequences. Keep that framing.
Why This Is Hard
Four things make bulk review monitoring harder than it looks.
Neither store offers a public reviews API. Apple's App Store Connect API covers only your own apps, and even that is rate-limited to ~50 calls per hour. Google Play Developer API has a reviews endpoint, but it only returns the last 7 days and only for apps you own. Competitor tracking is officially unsupported.
Pagination is awful on both. Apple's RSS feed caps at 500 reviews per country and doesn't go further back. The iTunes Lookup API gives you ratings counts but not review text. Google Play's web interface paginates with a continuation token that changes format every few months.
Localization fragments everything. Apple reviews are partitioned by country (155 storefronts). A US-only pull misses your Japanese, German, and Brazilian review streams entirely. Google Play is language-partitioned rather than country-partitioned, which is different but equally fragmenting.
Review text is messy. Emojis, transliterations, auto-translated reviews (Google Play auto-translates into the viewer's language, which means the same review appears twice if you query in two languages), and spam from incentivized-review farms. Cleaning this before you classify is a real step.
None of these are unsolvable. They are just not a weekend project if you start from scratch.
Architecture
Here is the end-to-end pipeline:
                 [app ID list]
      (iOS + Android, competitors included)
             |                 |
             v                 v
+---------------------+  +----------------------+
| apple-app-store-    |  | google-play-reviews- |
| reviews-scraper     |  | scraper              |
+---------------------+  +----------------------+
             |                 |
             +--------+--------+
                      |
                      v
              [normalize schema]
          (app_id, platform, rating,
           text, lang, version, date)
                      |
                      v
          [dedupe + language filter]
                      |
                      v
           [sentiment classifier]
           (DistilBERT or LLM call)
                      |
                      v
             [theme clustering]
           (BERTopic or LLM-guided)
                      |
                      v
          [Postgres + weekly digest]
The two scrapers run in parallel. Output lands in a normalized schema that hides the per-platform differences. Sentiment and theme clustering happen in a separate step, which lets you swap classifiers without re-scraping.
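If it helps to see the normalization step as code, here is a minimal sketch. The output keys match the schema in the diagram; the field names read from the raw items (rating, text, submitted_at, country, language, app_version) are assumptions about the actor output, so check an actual dataset item before relying on them.
def normalize(item: dict, platform: str) -> dict:
    # Map a raw review item from either actor into the unified schema.
    # Source field names are assumed, not guaranteed by the actors.
    return {
        "app_id": item["app_id"],
        "platform": platform,  # "ios" or "android"
        "rating": int(item["rating"]),
        "text": item.get("text") or "",
        "lang": item.get("country") if platform == "ios" else item.get("language"),
        "version": item.get("app_version"),
        "date": item["submitted_at"],
    }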
Code: Pull 200 Reviews Across Both Stores
The apple-app-store-reviews-scraper and google-play-reviews-scraper actors share a compatible output schema. You can fire both from one script.
from apify_client import ApifyClient
import pandas as pd

client = ApifyClient("APIFY_TOKEN")

ios_apps = ["544007664", "310633997"]  # YouTube, WhatsApp
android_apps = ["com.google.android.youtube", "com.whatsapp"]

ios_run = client.actor("nexgendata/apple-app-store-reviews-scraper").call(run_input={
    "app_ids": ios_apps,
    "countries": ["us", "gb", "de", "jp", "br"],
    "max_reviews_per_app": 200,
    "sort": "most_recent",
})

android_run = client.actor("nexgendata/google-play-reviews-scraper").call(run_input={
    "app_ids": android_apps,
    "languages": ["en", "de", "ja", "pt"],
    "max_reviews_per_app": 200,
    "sort": "newest",
})

ios_reviews = list(client.dataset(ios_run["defaultDatasetId"]).iterate_items())
android_reviews = list(client.dataset(android_run["defaultDatasetId"]).iterate_items())

df = pd.DataFrame([
    {"platform": "ios", "app": r["app_id"], "rating": r["rating"],
     "text": r["text"], "lang": r.get("country"), "version": r.get("app_version"),
     "date": r["submitted_at"]}
    for r in ios_reviews
] + [
    {"platform": "android", "app": r["app_id"], "rating": r["rating"],
     "text": r["text"], "lang": r.get("language"), "version": r.get("app_version"),
     "date": r["submitted_at"]}
    for r in android_reviews
])

print(df.groupby(["platform", "rating"]).size().unstack(fill_value=0))
A typical output cross-tab for the two apps above:
rating      1   2   3   4    5
platform
android    43  18  31  55  353
ios        27   9  22  40  402
The rating skew is expected — reviewers self-select toward extremes. What you are interested in is the 1- and 2-star tail, because that is where actionable product complaints live.
Sentiment Classification
Once the reviews are in a dataframe, run a classifier. For a weekly digest at this volume, a small local model is cheaper than an LLM API call per review. Here is a minimal transformers pass:
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    device=0,  # set to -1 for CPU
)

# Pre-trim long reviews; the model's limit is 512 tokens, so also pass
# truncation=True at call time and let the tokenizer enforce the hard cap
df["text_trunc"] = df["text"].fillna("").str.slice(0, 512)
df["sentiment"] = clf(df["text_trunc"].tolist(), batch_size=32, truncation=True)
df["sentiment_stars"] = df["sentiment"].apply(lambda x: int(x["label"][0]))
df["sentiment_score"] = df["sentiment"].apply(lambda x: x["score"])
This model outputs 1 star through 5 stars. You now have two signals per review: the user-provided rating and the model-predicted rating. When they disagree by 2+ stars it usually means sarcasm (5-star review trashing the app) or a rating-without-comment (1-star with empty text). Both are worth flagging.
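One quick way to surface those disagreements from the dataframe built above:
# Flag reviews where the user rating and the predicted stars differ by 2+
# (sarcastic 5-star rants, 1-star ratings with empty or throwaway text)
df["disagreement"] = (df["rating"] - df["sentiment_stars"]).abs()
flagged = df[df["disagreement"] >= 2].sort_values("disagreement", ascending=False)
print(flagged[["app", "rating", "sentiment_stars", "text"]].head(10))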
For theme clustering, BERTopic works out of the box on a few thousand reviews. For smaller volumes, an LLM with a clustering prompt is faster to ship:
import json
import openai

# Sample up to 150 negative (1-2 star) reviews for the clustering prompt
neg = df[df["rating"] <= 2]
negative = neg.sample(min(150, len(neg)))

prompt = "Cluster these app reviews into 5 themes. Return JSON: {theme: [review_ids]}\n\n"
prompt += "\n".join(f"{i}: {t[:200]}" for i, t in enumerate(negative["text"]))

resp = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)
themes = json.loads(resp.choices[0].message.content)
print(themes)
For 150 reviews, this is a ~$0.01 call and returns something like {"crashes_on_launch": [3,7,22,41], "ads_too_aggressive": [1,5,12,18,33,...], ...}.
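At a few thousand reviews and up, the BERTopic route mentioned above is the better fit. A minimal sketch; the embedding model named here is one reasonable multilingual choice, not a requirement:
from bertopic import BERTopic

# Cluster negative reviews into themes; any sentence-transformers model works
# as the embedding backend, multilingual ones handle mixed-language sets better
docs = df[df["rating"] <= 2]["text"].fillna("").tolist()
topic_model = BERTopic(
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
    min_topic_size=15,
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head(10))  # top themes with representative terms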
Worked Example: Weekly Competitor Digest
Say you are PM on a meditation app. You track three competitors plus your own app across iOS and Android. Every Monday morning you want a Slack post that lists, per app: total reviews this week, average rating, top 3 complaint themes, and any rating drop of more than 0.3 stars week-over-week.
from datetime import datetime, timedelta, timezone

# Pull last 7 days across 4 apps × 2 platforms
apps = [
    {"name": "Calm", "ios": "571800810", "android": "com.calm.android"},
    {"name": "Headspace", "ios": "493145008", "android": "com.getsomeheadspace.android"},
    {"name": "Insight Timer", "ios": "337472899", "android": "com.spotlightsix.zentimerlite2"},
    {"name": "YourApp", "ios": "...", "android": "..."},
]

# Parse review dates as UTC so the comparison against a timezone-aware cutoff is safe
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
weekly = df[pd.to_datetime(df["date"], utc=True) >= cutoff]

digest = weekly.groupby(["app", "platform"]).agg(
    n=("rating", "count"),
    avg_rating=("rating", "mean"),
    neg_pct=("rating", lambda x: (x <= 2).mean()),
).round(2)
print(digest)
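The digest above covers counts and averages. The 0.3-star week-over-week alert needs the previous week for comparison; a minimal sketch, assuming df holds at least two weeks of reviews:
# Compare this week's average rating per app against the previous week
prev_cutoff = cutoff - timedelta(days=7)
dates = pd.to_datetime(df["date"], utc=True)
prev_week = df[(dates >= prev_cutoff) & (dates < cutoff)]

this_avg = weekly.groupby("app")["rating"].mean()
prev_avg = prev_week.groupby("app")["rating"].mean()
drops = (prev_avg - this_avg).dropna().round(2)
print(drops[drops > 0.3])  # apps whose average rating fell more than 0.3 stars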
Feed the negative reviews for each app into the LLM theme clusterer, dump the results into a templated Slack message, and you have a digest that would cost $199/month from AppFollow. Running it as an Apify scheduled task pushes the cost under $5/month at this volume.
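For the Slack side, an incoming webhook is the simplest route. A minimal sketch, assuming the webhook URL is stored in a SLACK_WEBHOOK_URL environment variable; the message format is only illustrative:
import os
import requests

# Render one line per app/platform pair from the digest dataframe above
lines = [
    f"{app}/{platform}: {row.n} reviews, avg {row.avg_rating}, {row.neg_pct:.0%} negative"
    for (app, platform), row in digest.iterrows()
]
requests.post(
    os.environ["SLACK_WEBHOOK_URL"],
    json={"text": "Weekly review digest\n" + "\n".join(lines)},
)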
Over six months of running this, a PM on the team spots a pattern: whenever a release contains a meaningful UI change, 1-star reviews spike within 72 hours. They start shipping UI changes behind a feature flag and rolling out to 10% first. Review sentiment during the subsequent staged rollout becomes a release signal. That is the actual point of owning the data stream.
Gotchas
Things we have watched trip people up:
- Apple's RSS cap is 500 reviews per country. You cannot page deeper. For apps with thousands of reviews, you either ingest continuously from day one or accept that historical backfill is partial.
- Google Play auto-translation double-counts. If you query com.whatsapp with languages=["en", "de"], a German review may appear once in German and once translated to English. Dedupe by review ID, not text (see the sketch after this list).
- Version numbers are unreliable. Apple's app_version field reflects the version the user was on when reviewing. Google Play sometimes drops it entirely for older reviews. Do not use this for cohort analysis without spot-checking.
- Spam and incentivized reviews exist. Both stores have anti-fraud teams, but clusters of 5-star reviews with near-identical phrasing show up regularly. Flagging them is a classifier problem: usually low perplexity, short length, and a recent account.
- Localization scoring varies wildly. Japanese users rate 0.5 stars lower than US users on average for the same app; German users 0.3 lower. Cross-country average-rating comparisons are meaningless without normalization.
- Rate limits. The actors handle backoff, but if you self-roll the scrapers: Apple RSS is soft-capped around 10 requests/sec per IP; Google Play's web UI starts serving CAPTCHAs around 30 requests/min per IP.
- Reviews disappear. Apple removes reviews the developer successfully appeals. Google Play removes reviews that violate policy. If you are doing longitudinal analysis, snapshot reviews as you ingest; do not assume they will still be there next quarter.
- Emoji-only reviews. About 2% of reviews are emoji-only. Most sentiment classifiers handle them poorly. Either route them to a separate classifier or drop them.
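The dedupe called out in the auto-translation bullet is a single drop_duplicates call, provided you carry the per-store review ID through normalization (the pull script earlier does not keep it, so add it there first). The review_id field name is an assumption about the actor output:
# Dedupe by review ID, not text; "review_id" is an assumed field name,
# carry whatever ID the actor actually returns through your schema
before = len(df)
df = df.drop_duplicates(subset=["platform", "app", "review_id"], keep="first")
print(f"dropped {before - len(df)} duplicate reviews")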
FAQ
Can I track my competitors legally?
App Store reviews are public data under both Apple's and Google's terms of service for normal users. Automated collection is technically against Apple's Developer Agreement, but the enforcement target has historically been scraping the storefront for listings and pricing, not reviews. Consult your own counsel for anything commercial; for internal research, the risk is low.
How often should I run the scrape?
For a weekly digest: once per week is enough. For crisis monitoring (post-release or incident): hourly for the first 48 hours after a release, then daily. The actors are designed to run idempotently — running twice in an hour will mostly return cached data.
How far back does the history go?
Apple: about 500 most recent reviews per country, no deeper. Google Play: roughly 12 months with cooperative pagination. For older reviews, you need snapshots you collected previously.
What if my app is in 40 countries?
Pull the top 10 by install volume for weekly digests. Run a monthly fuller pull across all countries for trend analysis. Running all 155 Apple storefronts weekly is overkill and expensive.
Does the sentiment classifier handle non-English reviews?
nlptown/bert-base-multilingual-uncased-sentiment handles 6 languages natively (English, Dutch, German, French, Spanish, Italian). For Japanese, Chinese, Korean, Portuguese, Arabic, you want a different model per language, or just use an LLM.
Can I correlate reviews with my install numbers?
Yes, if you pull installs from App Store Connect and Play Console. The Apify scrapers do not pull install data (it is private to the developer), but you can join on date and app_version after the fact. The signal is noisy at weekly granularity and usable at monthly.
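A minimal join sketch, assuming you export a daily installs CSV with date, app_version, and installs columns (the file name and columns describe your own export, not anything the scrapers produce):
# Join daily review counts with an installs export from the developer consoles
installs = pd.read_csv("installs.csv", parse_dates=["date"])
reviews_daily = (
    df.assign(day=pd.to_datetime(df["date"], utc=True).dt.tz_localize(None).dt.floor("D"))
      .groupby(["day", "version"]).size().rename("reviews").reset_index()
)
joined = reviews_daily.merge(
    installs, left_on=["day", "version"], right_on=["date", "app_version"], how="left"
)
print(joined[["reviews", "installs"]].corr())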
What about in-app review prompts skewing the data?
The SKStoreReviewController and Google's in-app review API skew ratings positive because they are shown after success events. If you run these, expect your ratings to trend 0.5-1.0 stars higher than organic, which makes competitor comparison harder. Note it in your dashboards.
How do I handle developer responses?
Both scrapers return the developer reply text if present. Treat replies as a separate signal — response rate, median response time, and reply length all correlate with rating trends. Teams that reply within 24 hours see measurably lower churn on 2-3 star reviews.
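If you want those reply metrics as numbers, a minimal sketch; the developer_reply and developer_reply_at field names are assumptions about the actor output and have to be carried through your normalization step:
# Developer-response metrics: response rate, median response time, reply length
replies = df.dropna(subset=["developer_reply"]).copy()
replies["response_hours"] = (
    pd.to_datetime(replies["developer_reply_at"], utc=True)
    - pd.to_datetime(replies["date"], utc=True)
).dt.total_seconds() / 3600

print("response rate:", round(len(replies) / len(df), 2))
print("median response time (h):", round(replies["response_hours"].median(), 1))
print("median reply length (chars):", int(replies["developer_reply"].str.len().median()))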
Conclusion
Mobile review monitoring does not require a $199/month subscription or a dedicated analyst. Two Apify actors, a normalized schema, a small sentiment classifier, and a weekly cron give you most of what AppFollow and Appfigures sell, plus the freedom to do custom analysis neither tool supports.
The bigger win is not cost savings — it is that owning the raw review stream lets you correlate reviews with things AppFollow does not know about: your release cadence, your feature flags, your marketing spikes, your outages. That correlation is where product insight actually lives.
Start with the apple-app-store-reviews-scraper and google-play-reviews-scraper on Apify. Both run pay-per-use, handle pagination and localization, and return a consistent schema you can feed into whatever analytics stack you already own.