I spent a few weeks building three connected data-science projects on the greenhouse-gas footprint of what the world eats. They're deliberately a sequence - each one attacks an assumption the previous one had to make - and together they turned into a tidy case study in a few things I care about as a developer: benchmarking against baselines, respecting system boundaries, filtering data artefacts, and not letting a clean story beat an honest one.
This post is the technical tour: the data, the decisions, the code, and the gotchas. Full notebooks and data are in the repos linked at the end.
Stack: Python · pandas · NumPy · scikit-learn · Matplotlib/Seaborn · Jupyter.
The data (and a word on provenance)
Three public sources:
- OECD–FAO Agricultural Outlook - meat consumption by country/year/type.
- Our World in Data (CC BY 4.0) - national emissions; and the Poore & Nemecek (2018, Science) life-cycle emission factors.
- FAOSTAT (CC BY 4.0) - country-level emission intensities of livestock products.
The repos ship the data with a Data licence & attribution note distinguishing the MIT-licensed code from the data, which keeps its own terms. If you publish someone's dataset, do this.
Project 1 - A footprint metric, diet clustering, and the model that lost
Goal: turn demand into estimated emissions, find diet archetypes, and forecast where consumption is heading.
The metric itself is trivial - and that's the point; the value is in operationalising it cleanly:
# Mean life-cycle GHG, kg CO2e per kg of product (Poore & Nemecek via OWID)
EF = {"beef": 99.5, "sheep": 39.7, "pig": 12.3, "poultry": 9.9}
df["footprint_kt_co2e"] = df["volume_kt"] * df["type"].map(EF)
That alone surfaces the headline: in 2020 beef was ~21% of meat volume but ~68% of the footprint. A minority of volume drives the majority of emissions.
Clustering countries by their meat mix (not absolute volume) separates two levers people conflate - the carbon intensity of the mix vs the quantity eaten:
shares = volumes.div(volumes.sum(axis=1), axis=0) # per-country type shares
X = StandardScaler().fit_transform(shares)
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
# archetypes range ~27 (poultry-led) to ~59 (beef/sheep-led) kg CO2e per kg of meat
The most useful thing I did: benchmark the forecast against a dumb baseline
It's easy to fit a trend, plot it, and declare victory. So I held out 2015–2019 and pitted a linear trend against a naive "next year = this year" baseline:
train = s[s.year <= 2014]
test = s[s.year.between(2015, 2019)]
naive_pred = train.value.iloc[-1] # last observed value
naive_mae = (test.value - naive_pred).abs().mean()
coef = np.polyfit(train.year, train.value, 1) # linear trend
trend_mae = np.abs(test.value.values - np.polyval(coef, test.year)).mean()
print(f"naive MAE={naive_mae:.2f} trend MAE={trend_mae:.2f}")
# total meat -> naive 0.17 vs trend 4.10 | poultry -> naive 0.52 vs trend 1.24
The naive baseline won by a mile. The series had flattened after ~2014, so a trend fit on the earlier rise overshot. Lesson #1: a model that can't beat the naive baseline shouldn't be used - and you report that, you don't bury it.
Project 2 - Opening up the single carbon number
Project 1 used one global factor per food. Project 2 asks what that hides, using the Poore & Nemecek per-product dataset (43 foods, GHG split into seven supply-chain stages plus four other impacts).
First, a free data-integrity check - the stages should sum to the reported total:
stages = ["land_use_change","feed","farm","processing","transport","packaging","retail"]
assert np.allclose(df[stages].sum(axis=1), df["total_ghg"], atol=1e-6) # passes to ~0 error
The "food-miles" myth, in three lines
animal = df[df.category == "animal"]
transport_pkg = animal[["transport","packaging"]].sum().sum()
print(transport_pkg / animal["total_ghg"].sum()) # ~0.047
Transport + packaging is ~4.7% of animal-product emissions; for beef, transport alone is ~0.5%. ~90% is farm + feed + land-use change. "Buy local" is a weak lever next to "change what you eat."
Is carbon a good proxy for everything? Not water.
df[["ghg","land","freshwater","eutrophication"]].corr()
# ghg<->land 0.83 | ghg<->eutrophication 0.76 | ghg<->freshwater 0.33
Carbon tracks land and nutrient pollution well, freshwater poorly (some plants - nuts, rice - are the thirstiest foods). A carbon-only metric can hide a water trade-off. PCA backs this up: PC1 explains ~66% of variance (an overall-impact axis), PC2 ~21% (a distinct water axis).
Z = StandardScaler().fit_transform(df[impact_cols])
print(PCA(n_components=2).fit(Z).explained_variance_ratio_) # [0.66, 0.21]
Re-basing per 100 g of protein (the fair comparison, restricted to genuine protein sources) keeps the animal/plant gap enormous - beef ~50–100× pulses or nuts, with eggs the most efficient animal protein. Lesson #2: measure more than one dimension, or you'll move harm instead of removing it.
Project 3 - The same product, ~70× variation (and the messy data work behind it)
Both earlier projects used global means. Project 3 quantifies what they flattened, using FAOSTAT emission intensities (kg CO₂e/kg) for livestock products across 250 areas, 1961–2023. This is where most of the real engineering lived.
Gotcha 1 - encoding. Read it as latin-1 and Türkiye becomes Türkiye. The file is UTF-8:
df = pd.read_csv(SRC, encoding="utf-8") # not latin-1
Gotcha 2 - aggregates masquerading as countries. FAOSTAT mixes regional aggregates (World, Africa, income groups) into the same Area column. They use Area Code ≥ 5000:
AGG = ["World","Africa","Americas","Asia","Europe","Oceania","income",
"European Union","Least Developed","Sub-Saharan", ...]
is_country = (df["Area Code"] < 5000) & \
~df["Area"].str.contains("|".join(AGG), case=False, na=False)
Gotcha 3 - data artifacts inflating the headline. My first pass gave a beef max/min ratio of 589× - driven by places that barely raise cattle (Hong Kong, Lebanon) reporting unreliable near-zero intensities. The fix is a per-product production floor, which also dropped a log(0) that was crashing the clustering:
MINPROD = {"Beef":50_000, "Cow milk":100_000, "Chicken":50_000,
"Eggs":30_000, "Pork":50_000, "Sheep meat":10_000} # tonnes/yr
def substantial(product):
s = intensity[product]
keep = (production[product].reindex(s.index).fillna(0) >= MINPROD[product]) & (s > 0)
return s[keep].dropna()
With genuine producers only, the spread is real, not noise: ~70×, from ~3.9 kg CO₂e/kg (Israel) to ~270 (Niger).
Gotcha 4 - log before you cluster. Intensities are strongly right-skewed, so Euclidean k-means on raw values is dominated by the tail. Log-transform first:
X = intensity[feats].replace(0, np.nan).dropna()
Z = StandardScaler().fit_transform(np.log(X)) # log: intensities are right-skewed
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(Z)
# three efficiency tiers: median beef ~16 / 50 / 79 kg CO2e/kg
For the time trend, FAOSTAT publishes a World aggregate, so you don't have to re-derive it:
world = df[(df.Area == "World") & (df.Element == "Emissions intensity")]
# world beef -32% since 1961, cow milk -52%, chicken -41% as systems got more productive
The trap that ties it together: system boundaries
Here's the one that'll bite you if you're not careful. Across these projects, beef shows up as ~99.5, ~60, and ~30 kg CO₂e/kg. Same animal, three numbers:
- ~99.5 - OWID's headline life-cycle figure (Project 1).
- ~60 - the same Poore & Nemecek study, summed across its stages (Project 2).
- ~30 - FAOSTAT's farm-gate intensity (Project 3).
None of them are "wrong." They're measured to different system boundaries (full LCA vs production-only) and processed differently. Lesson #3: never compare emission factors across boundaries as if they're the same measurement. Lean on relative rankings (beef ≫ poultry) - those are stable across all three. I called this out explicitly rather than quietly picking whichever number suited the slide.
Engineering takeaways
- Benchmark against a naive baseline. If your model can't beat "same as last value," it adds nothing - and saying so is the rigorous move.
- Global means hide variation. For anything about intervention, the distribution is the unit of analysis, not the average (~70× for beef).
- Mind system boundaries. 99 vs 60 vs 30 for the same animal. Compare like with like; trust relative rankings over absolute numbers.
- Filter artifacts before quoting extremes. A production floor turned a noisy 589× into a real 70×.
- Log-transform right-skewed features before distance-based methods (clustering, PCA).
- Separate aggregates from units in mixed-grain datasets (FAOSTAT Area Code ≥ 5000).
-
Encoding is not optional (
utf-8, or your country names rot). -
Reproducibility hygiene: pinned
requirements.txt, executed notebooks committed, data licensed and attributed.
Repos
- meat-carbon-footprint - demand, the footprint metric, diet clustering, the baseline-beats-model forecast.
- food-environmental-footprint - supply-chain stage decomposition, multi-impact correlations, PCA, per-protein.
- livestock-emission-intensity - FAOSTAT cross-country variation, efficiency tiers, the system-boundary write-up.
Each has the full Jupyter notebook (executed), a runnable analysis.py, pinned requirements, and a detailed write-up.
If you take one thing from this: the interesting question is almost never the average. It's the spread, the boundary, and whether your model actually beats doing nothing. Happy to talk methods in the comments.





Top comments (0)