Oluwagbade Odimayo

Posted on Jun 29

From Global Averages to Producer-Level Variation: 3 Python Data Projects on Food's Climate Impact

#python #datascience #machinelearning #sustainability

I spent a few weeks building three connected data-science projects on the greenhouse-gas footprint of what the world eats. They're deliberately a sequence - each one attacks an assumption the previous one had to make - and together they turned into a tidy case study in a few things I care about as a developer: benchmarking against baselines, respecting system boundaries, filtering data artefacts, and not letting a clean story beat an honest one.

This post is the technical tour: the data, the decisions, the code, and the gotchas. Full notebooks and data are in the repos linked at the end.

Stack: Python · pandas · NumPy · scikit-learn · Matplotlib/Seaborn · Jupyter.

The data (and a word on provenance)

Three public sources:

OECD–FAO Agricultural Outlook - meat consumption by country/year/type.
Our World in Data (CC BY 4.0) - national emissions; and the Poore & Nemecek (2018, Science) life-cycle emission factors.
FAOSTAT (CC BY 4.0) - country-level emission intensities of livestock products.

The repos ship the data with a Data licence & attribution note distinguishing the MIT-licensed code from the data, which keeps its own terms. If you publish someone's dataset, do this.

Project 1 - A footprint metric, diet clustering, and the model that lost

Goal: turn demand into estimated emissions, find diet archetypes, and forecast where consumption is heading.

The metric itself is trivial - and that's the point; the value is in operationalising it cleanly:

# Mean life-cycle GHG, kg CO2e per kg of product (Poore & Nemecek via OWID)
EF = {"beef": 99.5, "sheep": 39.7, "pig": 12.3, "poultry": 9.9}
df["footprint_kt_co2e"] = df["volume_kt"] * df["type"].map(EF)

That alone surfaces the headline: in 2020 beef was ~21% of meat volume but ~68% of the footprint. A minority of volume drives the majority of emissions.

Clustering countries by their meat mix (not absolute volume) separates two levers people conflate - the carbon intensity of the mix vs the quantity eaten:

shares = volumes.div(volumes.sum(axis=1), axis=0)        # per-country type shares
X = StandardScaler().fit_transform(shares)
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
# archetypes range ~27 (poultry-led) to ~59 (beef/sheep-led) kg CO2e per kg of meat

The most useful thing I did: benchmark the forecast against a dumb baseline

It's easy to fit a trend, plot it, and declare victory. So I held out 2015–2019 and pitted a linear trend against a naive "next year = this year" baseline:

train = s[s.year <= 2014]
test  = s[s.year.between(2015, 2019)]

naive_pred = train.value.iloc[-1]                        # last observed value
naive_mae  = (test.value - naive_pred).abs().mean()

coef       = np.polyfit(train.year, train.value, 1)      # linear trend
trend_mae  = np.abs(test.value.values - np.polyval(coef, test.year)).mean()

print(f"naive MAE={naive_mae:.2f}  trend MAE={trend_mae:.2f}")
# total meat -> naive 0.17 vs trend 4.10   |   poultry -> naive 0.52 vs trend 1.24

The naive baseline won by a mile. The series had flattened after ~2014, so a trend fit on the earlier rise overshot. Lesson #1: a model that can't beat the naive baseline shouldn't be used - and you report that, you don't bury it.

Project 2 - Opening up the single carbon number

Project 1 used one global factor per food. Project 2 asks what that hides, using the Poore & Nemecek per-product dataset (43 foods, GHG split into seven supply-chain stages plus four other impacts).

First, a free data-integrity check - the stages should sum to the reported total:

stages = ["land_use_change","feed","farm","processing","transport","packaging","retail"]
assert np.allclose(df[stages].sum(axis=1), df["total_ghg"], atol=1e-6)   # passes to ~0 error

The "food-miles" myth, in three lines

animal = df[df.category == "animal"]
transport_pkg = animal[["transport","packaging"]].sum().sum()
print(transport_pkg / animal["total_ghg"].sum())          # ~0.047

Transport + packaging is ~4.7% of animal-product emissions; for beef, transport alone is ~0.5%. ~90% is farm + feed + land-use change. "Buy local" is a weak lever next to "change what you eat."

Is carbon a good proxy for everything? Not water.

df[["ghg","land","freshwater","eutrophication"]].corr()
# ghg<->land 0.83 | ghg<->eutrophication 0.76 | ghg<->freshwater 0.33

Carbon tracks land and nutrient pollution well, freshwater poorly (some plants - nuts, rice - are the thirstiest foods). A carbon-only metric can hide a water trade-off. PCA backs this up: PC1 explains ~66% of variance (an overall-impact axis), PC2 ~21% (a distinct water axis).

Z = StandardScaler().fit_transform(df[impact_cols])
print(PCA(n_components=2).fit(Z).explained_variance_ratio_)   # [0.66, 0.21]

Re-basing per 100 g of protein (the fair comparison, restricted to genuine protein sources) keeps the animal/plant gap enormous - beef ~50–100× pulses or nuts, with eggs the most efficient animal protein. Lesson #2: measure more than one dimension, or you'll move harm instead of removing it.

Project 3 - The same product, ~70× variation (and the messy data work behind it)

Both earlier projects used global means. Project 3 quantifies what they flattened, using FAOSTAT emission intensities (kg CO₂e/kg) for livestock products across 250 areas, 1961–2023. This is where most of the real engineering lived.

Gotcha 1 - encoding. Read it as latin-1 and Türkiye becomes TÃ¼rkiye. The file is UTF-8:

df = pd.read_csv(SRC, encoding="utf-8")   # not latin-1

Gotcha 2 - aggregates masquerading as countries. FAOSTAT mixes regional aggregates (World, Africa, income groups) into the same Area column. They use Area Code ≥ 5000:

AGG = ["World","Africa","Americas","Asia","Europe","Oceania","income",
       "European Union","Least Developed","Sub-Saharan", ...]
is_country = (df["Area Code"] < 5000) & \
             ~df["Area"].str.contains("|".join(AGG), case=False, na=False)

Gotcha 3 - data artifacts inflating the headline. My first pass gave a beef max/min ratio of 589× - driven by places that barely raise cattle (Hong Kong, Lebanon) reporting unreliable near-zero intensities. The fix is a per-product production floor, which also dropped a log(0) that was crashing the clustering:

MINPROD = {"Beef":50_000, "Cow milk":100_000, "Chicken":50_000,
           "Eggs":30_000, "Pork":50_000, "Sheep meat":10_000}   # tonnes/yr

def substantial(product):
    s = intensity[product]
    keep = (production[product].reindex(s.index).fillna(0) >= MINPROD[product]) & (s > 0)
    return s[keep].dropna()

With genuine producers only, the spread is real, not noise: ~70×, from ~3.9 kg CO₂e/kg (Israel) to ~270 (Niger).

Gotcha 4 - log before you cluster. Intensities are strongly right-skewed, so Euclidean k-means on raw values is dominated by the tail. Log-transform first:

X = intensity[feats].replace(0, np.nan).dropna()
Z = StandardScaler().fit_transform(np.log(X))             # log: intensities are right-skewed
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(Z)
# three efficiency tiers: median beef ~16 / 50 / 79 kg CO2e/kg

For the time trend, FAOSTAT publishes a World aggregate, so you don't have to re-derive it:

world = df[(df.Area == "World") & (df.Element == "Emissions intensity")]
# world beef -32% since 1961, cow milk -52%, chicken -41% as systems got more productive

The trap that ties it together: system boundaries

Here's the one that'll bite you if you're not careful. Across these projects, beef shows up as ~99.5, ~60, and ~30 kg CO₂e/kg. Same animal, three numbers:

~99.5 - OWID's headline life-cycle figure (Project 1).
~60 - the same Poore & Nemecek study, summed across its stages (Project 2).
~30 - FAOSTAT's farm-gate intensity (Project 3).

None of them are "wrong." They're measured to different system boundaries (full LCA vs production-only) and processed differently. Lesson #3: never compare emission factors across boundaries as if they're the same measurement. Lean on relative rankings (beef ≫ poultry) - those are stable across all three. I called this out explicitly rather than quietly picking whichever number suited the slide.

Engineering takeaways

Benchmark against a naive baseline. If your model can't beat "same as last value," it adds nothing - and saying so is the rigorous move.
Global means hide variation. For anything about intervention, the distribution is the unit of analysis, not the average (~70× for beef).
Mind system boundaries. 99 vs 60 vs 30 for the same animal. Compare like with like; trust relative rankings over absolute numbers.
Filter artifacts before quoting extremes. A production floor turned a noisy 589× into a real 70×.
Log-transform right-skewed features before distance-based methods (clustering, PCA).
Separate aggregates from units in mixed-grain datasets (FAOSTAT Area Code ≥ 5000).
Encoding is not optional (utf-8, or your country names rot).
Reproducibility hygiene: pinned requirements.txt, executed notebooks committed, data licensed and attributed.

Repos

meat-carbon-footprint - demand, the footprint metric, diet clustering, the baseline-beats-model forecast.
food-environmental-footprint - supply-chain stage decomposition, multi-impact correlations, PCA, per-protein.
livestock-emission-intensity - FAOSTAT cross-country variation, efficiency tiers, the system-boundary write-up.

Each has the full Jupyter notebook (executed), a runnable analysis.py, pinned requirements, and a detailed write-up.

If you take one thing from this: the interesting question is almost never the average. It's the spread, the boundary, and whether your model actually beats doing nothing. Happy to talk methods in the comments.

DEV Community