Last week I noticed something annoying: the engagement on my Week 1 and Week 2 posts dropped, even though the content was objectively good. So I asked Perplexity when developers actually read dev.to and the answer was basically: please stop posting on Saturdays. No one is there.
So from now on, Wykra updates move to Monday morning. Let's see if the stats agree.
First, I Need Actual People
This week is about taking Wykra from "we can find influencers" to "we can filter and analyze them in depth". In the previous post I explored several ways of discovering influencers, and this week I want to combine a couple of those methods rather than rely on just one. The plan is to mix a targeted Google query through the Bright Data SERP dataset with a Perplexity prompt through OpenRouter (or Bright Data) and see whether using them together leads to a more consistent shortlist. Google is my starting point, but I already noticed that the SERP dataset often responds with "error": "Recaptcha appears", "error_code": "blocked", which makes it clear that having more than one discovery path isn't just a nice-to-have, it's self-defense. Google AI Mode didn't behave much better: the crawler kept returning "error": "Crawler error: waiting for selector \"#aim-chrome-initial-inline-async-container\" failed: timeout 30000ms exceeded", "error_code": "wait_element_timeout".
I spent a while thinking about who I should search for as an example this week, and since I’m currently deep in a sourdough phase, it felt natural to look for people who actually bake sourdough themselves. I wanted actual home bakers, people posting their starter progress, fermentation attempts and sometimes failed loaves. New York seemed like the perfect testing ground, so that became the theme for this round of discovery. The Google query I used:
site:instagram.com ("sourdough" OR "sourdough bread" OR "starter") ("NYC" OR "New York" OR "Brooklyn" OR "Manhattan" OR "Queens" OR "Bronx") ("bio" OR "profile" OR "baker") -restaurant -shop -bakery -menu -delivery
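For context, this is roughly how that query goes into Bright Data's dataset trigger endpoint. It's a sketch: the dataset id is a placeholder and the exact input field names depend on the dataset's schema; the full request-and-polling code is in the notebook linked further down.

```python
import os
import requests

# Sketch: trigger a Bright Data SERP dataset run for the Google query above.
# "DATASET_ID" is a placeholder and the input field names may differ from the
# dataset's actual schema -- the real (polling included) code is in the notebook.
query = (
    'site:instagram.com ("sourdough" OR "sourdough bread" OR "starter") '
    '("NYC" OR "New York" OR "Brooklyn" OR "Manhattan" OR "Queens" OR "Bronx") '
    '("bio" OR "profile" OR "baker") -restaurant -shop -bakery -menu -delivery'
)

resp = requests.post(
    "https://api.brightdata.com/datasets/v3/trigger",
    headers={"Authorization": f"Bearer {os.environ['BRIGHTDATA_API_KEY']}"},
    params={"dataset_id": "DATASET_ID", "format": "json"},
    json=[{
        "keyword": query,
        "language": "en",
        "country": "PT",
        "start_page": 1,
        "end_page": 2,
    }],
    timeout=60,
)
resp.raise_for_status()
snapshot_id = resp.json()["snapshot_id"]
# ...then poll /datasets/v3/snapshot/{snapshot_id} until the results are ready
```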
The "language": "en", "country": "PT", "start_page": 1, "end_page": 2 fields were meant to limit the results, but Google still returned a huge JSON. So I only took the first ten Instagram links it surfaced:
- https://www.instagram.com/reel/DO6K4Pwjf4H/ Sourdough starter success video — making sourdough bread from scratch.
- https://www.instagram.com/reel/DRHqgN6Daec/ Day-12 sourdough starter update; NYC baker documenting the feeding process.
- https://www.instagram.com/reel/DRAmjx3kbR-/ Starting a new sourdough chapter in NYC — another early-stage starter reel.
- https://www.instagram.com/reel/DQKmxHyCY55/ Day-9 sourdough starter update; fermentation, early growth and “Novi” progress.
- https://www.instagram.com/emscakesntreats/reel/DRNDnk0jbrx/ Growing a sourdough starter and feeding “Doby” on day 14; home baker content.
- https://www.instagram.com/bigdoughenergy/ Profile of an NYC home baker and bread artist sharing sourdough loaves and recipes.
- https://www.instagram.com/reel/DQVAa6pjQ64/ Another sourdough “Novi” update — starter progress over days.
- https://www.instagram.com/reel/DQXfFbYCcka/ Day-14 sourdough starter update; patience and fermentation notes.
- https://www.instagram.com/reel/DQuKbYuE0YY/ New York–style multigrain sourdough bagel being boiled then baked.
- https://www.instagram.com/p/DQzLEQQDhBL/ Olive sourdough inclusion loaf; standard sourdough-baking reel with a finished bread photo.
As you can see, Google mostly returned individual posts and reels rather than profile pages. I think that’s normal for Instagram SERP results, since Google indexes post URLs much more consistently than profiles. Google’s results also shift every time, so whether you get anything useful is basically luck. Still, I extracted the profile handles from those post URLs.
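A minimal sketch of that extraction, for the URLs that actually carry a handle in the path (bare /reel/<id>/ links don't, so those still need an extra lookup to find the owner):

```python
import re

# Sketch: recover the profile handle from an Instagram URL when it's part of
# the path (e.g. /emscakesntreats/reel/...). Bare /reel/<id>/ or /p/<id>/ links
# carry no handle, so they return None and need a separate profile lookup.
def handle_from_url(url: str) -> str | None:
    m = re.match(r"https?://www\.instagram\.com/([^/?]+)", url)
    if not m:
        return None
    first = m.group(1)
    return None if first in {"p", "reel", "reels", "stories", "explore"} else first

urls = [
    "https://www.instagram.com/emscakesntreats/reel/DRNDnk0jbrx/",
    "https://www.instagram.com/bigdoughenergy/",
    # ...the rest of the ten links above
]
handles = {h for h in (handle_from_url(u) for u in urls) if h}
```

The handles I ended up with: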
- https://www.instagram.com/biancafrombrooklyn
- https://www.instagram.com/emscakesntreats
- https://www.instagram.com/aya_eats_
- https://www.instagram.com/emscakesntreats
- https://www.instagram.com/emscakesntreats
- https://www.instagram.com/bigdoughenergy/
- https://www.instagram.com/emscakesntreats
- https://www.instagram.com/sorteddelightsby_lini
- https://www.instagram.com/breadology101
The result is fine, but definitely not great. Yes, there’s some actual baking in there, but the list is full of repeats (even though I only took the first 10 results); the same creator keeps resurfacing again and again. And this is still a relatively forgiving query: when I tried the same workflow for pizza bakers in Lisbon, Google basically returned nothing at all. Technically there was one result, but it turned out to be a pizza equipment shop, not a creator.
The Perplexity prompt follows the same idea:
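The prompt itself lives in the notebook linked at the end of this section; the call looks roughly like this (a sketch, assuming OpenRouter's OpenAI-compatible endpoint and one of the Perplexity models it exposes - adjust the model id and wording to taste):

```python
import os
from openai import OpenAI

# Sketch: ask Perplexity (routed through OpenRouter) for NYC home sourdough
# bakers on Instagram. The model id and the prompt wording are assumptions;
# the exact prompt I used is in the notebook.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask_perplexity(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="perplexity/sonar",  # assumption: any Perplexity model on OpenRouter works here
        messages=[
            {"role": "system", "content": "Return only real, public Instagram profile URLs. "
                                          "No brands, no bakeries, no invented handles."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

answer = ask_perplexity(
    "Find home bakers in New York City who post about their own sourdough: "
    "starter progress, fermentation experiments, failed loaves. "
    "List their Instagram profile URLs."
)
print(answer)
```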
This is what I got:
- https://www.instagram.com/theclevercarrot
- https://www.instagram.com/BrooklynSourdough
- https://www.instagram.com/riseandloaf_sourdoughco
- https://www.instagram.com/BlondieandRye
- https://www.instagram.com/Maurizio
- https://www.instagram.com/october_farms
- https://www.instagram.com/the.sourdough.baker
- https://www.instagram.com/bookroad.sourdough.co
- https://www.instagram.com/giasbatch
- https://www.instagram.com/amybakesbread
I ran the code several times and the model returned a different set of accounts each run, so there’s no stable or repeatable result here either. But at least it consistently returns profile URLs right away, which already puts it ahead of Google.
Then I decided to try a different approach, one that had occurred to me earlier but that I only now got around to testing: I started by identifying hashtags and only then moved on to the posts.
The first call returned a set of NYC-specific sourdough hashtags: #nycsourdough, #sourdoughnyc, #artisanbreadnyc, #nycbakers, #brooklynsourdough, #manhattansourdough, #nycbread, #sourdoughcommunitynyc, #breadstagramnyc, #sourdoughnewyork.
Then I passed these into the second prompt, keeping strict rules: only real Instagram profiles, no brands, no bakeries, no invented handles.
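Wiring-wise it's just the two calls chained together, reusing the hypothetical ask_perplexity helper from the sketch above:

```python
# Sketch: hashtag-first discovery, reusing the ask_perplexity() helper above.
hashtags = ask_perplexity(
    "List NYC-specific Instagram hashtags used by home sourdough bakers. "
    "Return only hashtags, one per line."
)

profiles = ask_perplexity(
    "Here are NYC sourdough hashtags:\n"
    f"{hashtags}\n"
    "Find individual home bakers who actually post under these hashtags. "
    "Rules: only real, public Instagram profiles; no brands, no bakeries, "
    "no invented handles. Return profile URLs only."
)
print(profiles)
```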
The final list I got was:
- https://www.instagram.com/brooklynsourdough
- https://www.instagram.com/artisanbryan
- https://www.instagram.com/thebreadahead
- https://www.instagram.com/nyc.breadgirl
- https://www.instagram.com/oliver_the_baker
Only one profile, brooklynsourdough, overlapped with the previous list, which shows that this method surfaces a completely different slice of the NYC sourdough community rather than reinforcing earlier results.
That said, in this case I’m searching in a huge city with a very broad query - not restricting creator size, niche depth or even which part of New York they’re in. The experience was again very different when I tried the same workflow for pizza bakers in Lisbon: Google returned exactly one even remotely relevant result (and that turned out to be a pizza equipment store), while Perplexity, across three runs, confidently produced several profiles that simply do not exist. I tightened the system prompt to explicitly forbid inventing handles, but occasional hallucinations still sneak through. Honestly, Instagram is not an easy platform to automate against, and both methods struggle in places you wouldn’t expect.
If you want to try the same searches yourself, here’s the Jupyter notebook I used - you can open it and play with the prompts: https://github.com/wykra-io/wykra-api-python/blob/main/research/search.ipynb
Looking Inside the Profiles
After the discovery step I had around twenty Instagram handles, but I still did not know who was actually relevant. Some looked like real NYC sourdough people, some were just general baking accounts and some might not be relevant at all. Before going deeper I wanted a quick sanity check that an LLM could at least separate “probably relevant” from “why is this here”.
I pulled the full profile JSONs from Bright Data’s Instagram dataset. Each snapshot included account-level metadata plus a slice of recent posts, which is great for analysis and terrible if you try to send it to a model as-is. Anyway, I wrote a small minimizer in Python (a rough sketch follows the list below). It flattens the raw profiles, skips private accounts, filters out profiles with fewer than 1000 followers and also removes any profiles that haven’t posted in the last six months, then keeps only a short summary:
- basic profile info such as handle, profile name, followers, posts count, bio, category
- a few engagement and account type signals (business, professional, average engagement)
- a sample of recent posts, sorted by datetime, with caption, datetime, likes and comments
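Roughly, the minimizer looks like this. The field names are my guess at the snapshot schema (the real ones are visible in profiles.json below), but the thresholds match what I described:

```python
from datetime import datetime, timedelta, timezone

SIX_MONTHS_AGO = datetime.now(timezone.utc) - timedelta(days=182)

def minimize(profile: dict) -> dict | None:
    """Reduce a raw profile snapshot to the handful of fields the model needs.
    Field names are assumptions about the Bright Data schema."""
    if profile.get("is_private") or profile.get("followers", 0) < 1000:
        return None
    posts = sorted(profile.get("posts", []), key=lambda p: p.get("datetime", ""), reverse=True)
    last = posts[0].get("datetime") if posts else None
    if not last or datetime.fromisoformat(last.replace("Z", "+00:00")) < SIX_MONTHS_AGO:
        return None  # inactive for six months -> drop
    return {
        "handle": profile.get("account"),
        "name": profile.get("profile_name"),
        "followers": profile.get("followers"),
        "posts_count": profile.get("posts_count"),
        "bio": profile.get("biography"),
        "category": profile.get("category_name"),
        "is_business": profile.get("is_business_account"),
        "avg_engagement": profile.get("avg_engagement"),
        "recent_posts": [
            {"caption": p.get("caption"), "datetime": p.get("datetime"),
             "likes": p.get("likes"), "comments": p.get("num_comments")}
            for p in posts[:10]
        ],
    }
```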
If you want to see the actual data rather than the description:
The full profiles JSON is here: https://github.com/wykra-io/wykra-api-python/blob/main/research/profiles.json
The full notebook with the data-collection code is here: https://github.com/wykra-io/wykra-api-python/blob/main/research/analysis.ipynb
The reduced version of the profiles is here: https://github.com/wykra-io/wykra-api-python/blob/main/research/short_profiles.json
For the sanity check I used Claude 3.5 Sonnet through pydantic-ai and OpenRouter. The system prompt tells the model what it is looking at and what to do with it; the user prompt is just the minimized profiles plus the query.
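A sketch of that wiring: pointing pydantic-ai's OpenAI model at OpenRouter is one way to reach Claude, but the exact class and argument names shift between pydantic-ai versions, so treat this as an outline rather than the notebook's code.

```python
import json, os
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

# Sketch of the sanity-check ranking. Note: older pydantic-ai versions call
# output_type "result_type" and result.output "result.data".
class RankedProfile(BaseModel):
    handle: str
    relevance: int  # 0-5
    reason: str

model = OpenAIModel(
    "anthropic/claude-3.5-sonnet",
    provider=OpenAIProvider(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    ),
)

agent = Agent(
    model,
    output_type=list[RankedProfile],
    system_prompt=(
        "You rank Instagram profiles by relevance to a query. "
        "Score each profile 0-5 and explain briefly. Use only the data provided."
    ),
)

short_profiles = json.load(open("short_profiles.json"))  # the minimized profiles
query = "home bakers in NYC who post their own sourdough"
result = agent.run_sync(f"Query: {query}\n\nProfiles:\n{json.dumps(short_profiles)}")
ranking = sorted(result.output, key=lambda r: r.relevance, reverse=True)
```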
After the profiles are reduced to the fields that actually matter, the model has no trouble ranking them. It reads the bios, looks at the recent posts and places the bakers in a reasonable order, finally something in this pipeline that didn’t fight back:
An interesting detail: when I compared Claude’s ranking with what Google SERP and Perplexity returned, the final shortlist contained accounts surfaced by all three methods.
Second Layer Discovery: Exploring Related Accounts
Next I noticed that each profile snapshot comes with a related_accounts list – basically Instagram’s suggestion graph around that creator. So I took the profiles that Claude ranked the highest in the first pass, grabbed all their non-private related accounts, turned them into profile URLs and ran the same pipeline again: fetch snapshots with Bright Data, minimize them and send the compact JSON into Claude with the same ranking prompt.
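Roughly, the hop looks like this - assuming profiles holds the raw snapshots and ranking holds the first-pass scores from the sketch above; the field names ("related_accounts", "is_private", "username") are my reading of the snapshot schema and may differ in your dataset version:

```python
# Sketch: second-hop discovery via the related_accounts field in each snapshot.
top_handles = {r.handle for r in ranking if r.relevance >= 4}

related_urls = set()
for profile in profiles:
    if profile.get("account") not in top_handles:
        continue
    for rel in profile.get("related_accounts", []):
        if not rel.get("is_private"):
            related_urls.add(f"https://www.instagram.com/{rel['username']}")

# then: fetch snapshots for related_urls with Bright Data, minimize them,
# and run the same ranking agent again
```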
On this second hop the model mostly surfaced established NYC bakeries and cafés rather than home bakers. The top result was lanicosia_bakery (a 100-year-old Bronx bakery) with a relevance score of 4, then zeppieribakery and a couple of NYC-based dessert accounts like atoricafe and bitesbybianca with low scores. Most of the remaining related accounts either weren’t in NYC, weren’t about baking at all or had nothing to do with bread or sourdough, so they didn’t make it into the ranked list.
Even though the graph hop felt “smart” on paper (“follow who the good bakers are connected to”), in practice it quickly drifted from “NYC home sourdough bakers” to “general NYC food and bakery accounts”, with only a few partially relevant hits.
Two Speeds, Two Jobs - Fast Discovery vs Deep Analysis
Before this point everything I’ve built assumes a pretty simple goal:
"Give me a shortlist of creators who match my prompt."
For that the flow is fast and relatively efficient: Google + Perplexity → profile snapshots → lightweight relevance scoring → done. It’s the right tool when a user needs quick inspiration, a direction to explore or a starting point for outreach. But that flow collapses the moment the question changes from "Who should I look at?" to "Is this creator actually good?".
A real evaluation - pulling all posts, reels, captions, timestamps and comments for the past 3–6 months, checking formats, identifying sponsored content, measuring post-level engagement, analyzing content topics - is a completely different workload. Running this for ten creators at once would be both slow and unnecessarily expensive. And honestly most users don’t need that for a discovery task.
Which is why Wykra needs two separate modes:
1. Fast Discovery (the default)
You get a shortlist of accounts ranked by relevance. Enough to browse, compare and filter.
2. Deep Dive on Demand
When a user says: “This creator looks promising, analyze them properly.”
That’s when we pull the full dataset. It’s slower, more resource-heavy and it should be opt-in. But it gives an actual, trustworthy picture of a single influencer.
Most importantly, this matches real workflows: sometimes you want a list; sometimes you want the truth.
I took one of the creators Claude ranked highly, aya_eats_ (11,218 followers, avg engagement 0.7181), and pulled their recent posts and reels for the past six months. Instagram essentially has three content types: posts, reels and stories. Reels dominate attention these days, posts still matter for evergreen content, and stories would be valuable for analysis except they can’t be scraped, which is unfortunate because that’s the only thing I personally ever watch. So I just threw all posts and reels into one DataFrame and sent the JSON to Claude to see what kind of basic analysis it would come up with.
To see what this looks like in practice, here’s the raw output it produced:
I checked the same data with a few simple Pandas summaries and the results were almost identical. Reels absolutely dominate: around 3,300 likes and 36K views on average, compared to posts that barely hit 28 likes. The posting rhythm is steady: roughly 1.5–2 posts per week, with activity increasing from September to November and everything lands in the evening hours (20:00–23:00). The hashtag usage matches the themes Claude picked up: baking, Asian recipes and seasonal content. Engagement by theme also tells the same story: Asian-food reels perform an order of magnitude better than anything else. Brand presence shows up through light, organic mentions (@bobsredmill, @vitalfarms, @staub_usa) and there are zero paid partnerships, which supports the “authentic home cooking” impression.
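Those cross-checks are just a few lines of pandas. A rough sketch, assuming the combined posts and reels sit in a DataFrame with columns like content_type, datetime, likes and views (the real column names come from the snapshots in the notebook, and the filename here is hypothetical):

```python
import pandas as pd

# Hypothetical dump of the combined posts + reels for one creator
df = pd.read_json("aya_eats_posts_reels.json")
df["datetime"] = pd.to_datetime(df["datetime"])

# posts vs reels: average likes and views
print(df.groupby("content_type")[["likes", "views"]].mean())

# posting rhythm: items per week
print(df.set_index("datetime").resample("W").size())

# what hour of the day things get posted
print(df["datetime"].dt.hour.value_counts().sort_index())
```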
Testing with a small creator is cute, but the patterns barely move. For something closer to reality - spikes, trends, actual performance curves - I ran the same workflow on a bigger creator, my favorite fashion blogger _liullland (~187k). I pulled 186 posts and reels over the last six months, capped it at 100 items so Claude wouldn’t choke and asked it to summarise what’s going on. The full JSON + analysis is in this gist:
Then I dropped the same data into pandas and ran a few simple charts: likes over time for posts vs reels, top hashtags and a scatter of views vs likes for reels.
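The charts themselves are a handful of matplotlib calls, with the same assumed columns as in the earlier pandas sketch (plus a hashtags column holding lists); again, the filename is hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json("liullland_posts_reels.json")  # hypothetical dump of the 100 items
df["datetime"] = pd.to_datetime(df["datetime"])

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# likes over time, posts vs reels
for kind, group in df.groupby("content_type"):
    axes[0].plot(group["datetime"], group["likes"], marker="o", linestyle="", label=kind)
axes[0].legend()
axes[0].set_title("Likes over time")

# top hashtags (assumes a list-valued `hashtags` column)
df["hashtags"].explode().value_counts().head(15).plot.barh(ax=axes[1], title="Top hashtags")

# views vs likes for reels
reels = df[df["content_type"] == "reel"]
axes[2].scatter(reels["views"], reels["likes"])
axes[2].set_title("Views vs likes (reels)")

plt.tight_layout()
plt.show()
```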
Reels dominate the account - every major spike comes from video, while static posts stay almost flat. There’s also a noticeable spike in the last month: engagement jumps sharply, and I have no idea what caused it.
I’m not a fashion blogger, but even I can see the hashtags repeat a lot, mostly fashion/GRWM variations.
And the views-vs-likes scatterplot shows a strong correlation: no weird dead-view content, plus the occasional viral reel that pushes the whole account upward.
By this point I’m pretty sure nobody is still reading, so it’s a perfect moment to stop here and continue the deeper analysis next week.
All this scraping, ranking, filtering and re-checking also made something obvious: we shouldn’t throw away the results we already spent time and money collecting. If Wykra finds a solid creator, that data should be stored and reused instead of fetched again from scratch. And we should explicitly ask the user whether a suggestion was useful or not - that feedback needs to be saved too.
The data will still have to be refreshed periodically (otherwise we’d just turn into an outdated Instagram directory), but at least future lookups won’t require rebuilding everything from zero.
Next week I’ll continue the analysis and dig a bit deeper into the data we can reliably scrape and interpret.
If you want to support the project, ⭐️ the repo and follow me on Twitter/X - it really helps.
Repo: https://github.com/wykra-io/wykra-api
Twitter/X: https://x.com/ohthatdatagirl