Can Yılmaz

Posted on • Originally published at apify.com

Sample dataset analysis: a 100-row snapshot of DefiLlama Yields

I pulled a 100-row sample of DefiLlama Yields to see whether the dataset is rich enough to support back-testing strategies, monitoring liquidity, building risk dashboards and feeding price-discovery models, or whether it is the kind of feed you have to enrich heavily before it becomes useful. Short answer: richer than I expected. Long answer below.

What is in the sample

The sample comes from the DefiLlama Yields Scraper, which pulls DeFi yield and APY pool data across all chains from DefiLlama. Each record has the following fields:

  • poolId -- unique identifier for the pool, stable across scrapes
  • project -- protocol the pool belongs to (e.g. lido)
  • symbol -- token or pair symbol (e.g. STETH)
  • chain -- blockchain the pool lives on
  • tvlUsd -- total value locked, in USD
  • apy -- total APY, in percent
  • apyBase -- base APY from fees or interest
  • apyReward -- additional APY from reward-token emissions (null when there are none)
  • apyPct1D -- APY change over the last day, in percentage points
  • apyPct7D -- APY change over the last 7 days
  • apyPct30D -- APY change over the last 30 days
  • apyMean30d -- mean APY over the last 30 days
  • stablecoin -- whether the pool is stablecoin-denominated
  • ilRisk -- impermanent-loss risk flag
  • exposure -- single- or multi-asset exposure
  • rewardTokens -- list of reward token identifiers
  • volumeUsd1d -- trading volume over the last day, in USD
  • volumeUsd7d -- trading volume over the last 7 days, in USD
  • poolMeta -- free-form pool metadata (fee tier, lock duration, and so on)
  • url -- link to the pool's page
  • scrapedAt -- timestamp of the scrape

The fields divide into three groups: identifiers (stable across re-scrapes), descriptive content (the actual signal you want), and metadata (timestamps, source URLs, scrape provenance). For most analytical workflows you only really touch the middle group, but the identifiers matter the moment you start joining across runs.
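
To make that concrete, here is a minimal Python sketch -- assuming the sample has been exported to a local file I will call sample.json (the name is mine) -- that loads the rows and keeps the three groups separate:

import pandas as pd

# Load the exported sample (hypothetical local file name).
df = pd.read_json("sample.json")

# The three field groups described above.
IDENTIFIERS = ["poolId", "project", "symbol", "chain"]
DESCRIPTIVE = ["tvlUsd", "apy", "apyBase", "apyReward",
               "apyPct1D", "apyPct7D", "apyPct30D", "apyMean30d",
               "stablecoin", "ilRisk", "exposure", "rewardTokens",
               "volumeUsd1d", "volumeUsd7d", "poolMeta"]
METADATA = ["url", "scrapedAt"]

# Most analysis only needs the descriptive block, keyed by the identifiers.
signal = df[IDENTIFIERS + DESCRIPTIVE]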

Two example records

Here are two rows from the sample, trimmed slightly so they fit:

{
  "poolId": "747c1d2a-c668-4682-b9f9-296708a3dd90",
  "project": "lido",
  "symbol": "STETH",
  "chain": "Ethereum",
  "tvlUsd": 20266411073,
  "apy": 2.47,
  "apyBase": 2.47,
  "apyReward": null,
  "apyPct1D": 0.018,
  "apyPct7D": -0.762
}

{
  "poolId": "80b8bf92-b953-4c20-98ea-c9653ef2bb98",
  "project": "binance-staked-eth",
  "symbol": "WBETH",
  "chain": "Ethereum",
  "tvlUsd": 8042689549,
  "apy": 2.48054,
  "apyBase": 2.48054,
  "apyReward": null,
  "apyPct1D": -0.02446,
  "apyPct7D": -0.0814
}

Even from two rows you can read the shape of the data: both are multi-billion-dollar Ethereum liquid-staking pools earning roughly 2.5% base APY with no reward component, and the apyPct fields already encode short-horizon momentum. Across the full sample the descriptive fields vary widely, which means a 100-row sample is enough to do meaningful exploratory analysis but probably not enough for any production-grade modelling -- you would want at least an order of magnitude more.

What I would do with the data

A non-exhaustive list of analyses this dataset directly supports:

  • Frequency analysis on the categorical columns (chain, project, exposure) to spot dominant clusters and long-tail outliers -- sketched after this list.
  • Time-series breakdowns: a single snapshot only carries scrapedAt, but the apyPct1D/7D/30D and apyMean30d columns bake in short-horizon history, and recurring scrapes turn the feed into a proper time series.
  • Light text analysis on poolMeta, the one free-form field -- extracting fee tiers, lock durations and pool variants rather than full topic modelling.
  • Cross-joins with external reference data -- most of the use cases above (back-testing, liquidity monitoring, risk dashboards, price-discovery models) need a second-source enrichment step -- to produce something more valuable than either input alone.
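
To give a flavour of the first bullet, a minimal pandas sketch (same hypothetical sample.json as above; the APY and TVL thresholds are illustrative, not recommendations):

import pandas as pd

df = pd.read_json("sample.json")

# Dominant clusters: where does the sample's TVL concentrate?
print(df["chain"].value_counts().head(10))
print(df.groupby("project")["tvlUsd"].sum()
        .sort_values(ascending=False).head(10))

# Long-tail outliers: high APY on thin liquidity is a classic red flag.
suspicious = df[(df["apy"] > 50) & (df["tvlUsd"] < 1_000_000)]
print(suspicious[["project", "symbol", "chain", "apy", "tvlUsd"]])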

Quirks I noticed

A few practical observations from poking at the rows (a load-time normaliser that handles all three is sketched after the list):

  • Some optional fields are missing rather than null. Normalise on load.
  • The free-form poolMeta field occasionally contains newlines and the odd unicode quirk; clean before tokenising.
  • Identifier-like fields are strings; do not let your warehouse coerce them to int.
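
Here is the normaliser I mean -- a sketch assuming pandas and the field list from earlier; the function name is mine:

import pandas as pd

EXPECTED = ["poolId", "project", "symbol", "chain", "tvlUsd", "apy",
            "apyBase", "apyReward", "apyPct1D", "apyPct7D", "apyPct30D",
            "apyMean30d", "stablecoin", "ilRisk", "exposure",
            "rewardTokens", "volumeUsd1d", "volumeUsd7d",
            "poolMeta", "url", "scrapedAt"]

def normalise(df: pd.DataFrame) -> pd.DataFrame:
    # Quirk 1: absent optional fields become explicit nulls.
    for col in EXPECTED:
        if col not in df.columns:
            df[col] = pd.NA
    # Quirk 3: identifier-like fields stay strings, never ints.
    for col in ["poolId", "project", "symbol", "chain"]:
        df[col] = df[col].astype("string")
    # Quirk 2: scrub newlines and stray whitespace from the free-form field.
    df["poolMeta"] = (df["poolMeta"].astype("string")
                      .str.replace("\n", " ", regex=False)
                      .str.strip())
    return df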

How I would shape it for downstream use

If I were dropping this dataset into a warehouse the rough plan would be: stage the raw JSON unchanged in a landing zone partitioned by scrape date, then create a curated view that casts the identifier fields to strings, parses the timestamps as native DATE/TIMESTAMP types, splits any compound columns, and trims long-form text. Keeping that two-layer structure means you can replay history without re-scraping, and you can iterate on the curated schema without losing fidelity.
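
A sketch of that two-layer shape in Python -- the paths and file names are placeholders, and parquet for the curated layer is just a convenient choice, not a requirement:

import json
import pathlib
from datetime import date

import pandas as pd

LANDING = pathlib.Path("landing/defillama_yields")   # hypothetical paths
CURATED = pathlib.Path("curated/defillama_yields")

def stage_raw(records: list[dict]) -> pathlib.Path:
    """Layer 1: raw JSON, unchanged, partitioned by scrape date."""
    part = LANDING / f"scrape_date={date.today().isoformat()}"
    part.mkdir(parents=True, exist_ok=True)
    out = part / "snapshot.json"
    out.write_text(json.dumps(records))
    return out

def build_curated(raw_path: pathlib.Path) -> pd.DataFrame:
    """Layer 2: typed view -- string identifiers, native timestamps."""
    df = pd.read_json(raw_path)
    df["poolId"] = df["poolId"].astype("string")
    df["scrapedAt"] = pd.to_datetime(df["scrapedAt"], utc=True, errors="coerce")
    CURATED.mkdir(parents=True, exist_ok=True)
    df.to_parquet(CURATED / f"{date.today().isoformat()}.parquet")
    return df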

For analytical queries the curated view is what you point dashboards and notebooks at. Common patterns I would pre-build as additional models: a daily-rollup view aggregating numeric columns by the most useful categorical breakdown, a recency view filtered to the last N days for "what is new" dashboards, and a delta view that diffs the latest snapshot against yesterday so you can surface additions and removals cheaply.
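
And the delta view, as a sketch -- diffing two curated snapshots on poolId (function and variable names are mine):

import pandas as pd

def snapshot_delta(today: pd.DataFrame, yesterday: pd.DataFrame) -> dict:
    """Diff two snapshots keyed on poolId: additions, removals, APY moves."""
    t = today.set_index("poolId")
    y = yesterday.set_index("poolId")
    common = t.index.intersection(y.index)
    return {
        "added":    t.loc[t.index.difference(y.index)],
        "removed":  y.loc[y.index.difference(t.index)],
        # Largest APY moves among pools present in both snapshots.
        "apy_move": (t.loc[common, "apy"] - y.loc[common, "apy"])
                    .sort_values(key=abs, ascending=False),
    }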

Bottom line

For a sample pull it is more than enough to validate the use-case fit. If the analytical questions you want to answer are reasonable on a 100-row sample, the full dataset will comfortably answer them. The next step is a longer-horizon pull -- a week or two of recurring snapshots -- which lets you stop treating each row as a one-off and start treating the dataset as a feed with its own dynamics.


For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/defillama-yields-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
