Roberto B.

Posted on Mar 7

GPX Runner's data decoded with PHP

#gpx #php #statistics #tutorial

Every runner with a sports watch carries a small data recorder on their wrist. After every run, it produces a GPX file: a detailed log of GPS positions, timestamps, elevation, and often heart rate. Most runners glance at the summary on their phone (distance, time, average pace) and move on.

But that GPX file holds much more. With the hi-folks/statistics PHP package, you can extract per-kilometer splits and apply real statistical analysis to your running performance, from pacing consistency to elevation impact, cardiac drift to long-term improvement trends.

In this article, we will parse a GPX file, build per-kilometer splits, and analyze a 10K run step by step using descriptive statistics, correlation, regression, outlier detection, and more.

Installing the package

composer require hi-folks/statistics

The package requires PHP 8.2+ and is available on GitHub: https://github.com/Hi-Folks/statistics.

What is inside a GPX file?

A GPX (GPS Exchange Format) file is XML. Most sports watches (Garmin, Polar, Suunto, Apple Watch, Coros) can export one. Each file contains a sequence of trackpoints recorded every 1–5 seconds during your run. A trackpoint looks like this:

<trkpt lat="45.4642" lon="9.1900">
  <ele>122.4</ele>
  <time>2025-03-15T07:30:45Z</time>
  <extensions>
    <gpxtpx:TrackPointExtension>
      <gpxtpx:hr>152</gpxtpx:hr>
    </gpxtpx:TrackPointExtension>
  </extensions>
</trkpt>

From these raw points you can derive:

Distance between consecutive points (using the Haversine formula)
Pace per kilometer (time elapsed over each 1 km segment)
Elevation gain and loss per kilometer
Heart rate averages per kilometer (from the Garmin extension)

Parsing the GPX file

PHP's built-in SimpleXML makes parsing straightforward. Here are the helper functions we use:

function parseGpx(string $filePath): array
{
    $xml = simplexml_load_file($filePath);
    if ($xml === false) {
        throw new RuntimeException("Cannot parse GPX file: {$filePath}");
    }

    $namespaces = $xml->getNamespaces(true);

    $points = [];
    foreach ($xml->trk->trkseg->trkpt as $trkpt) {
        $point = [
            'lat' => (float) $trkpt['lat'],
            'lon' => (float) $trkpt['lon'],
            'ele' => isset($trkpt->ele) ? (float) $trkpt->ele : 0.0,
            'time' => isset($trkpt->time) ? strtotime((string) $trkpt->time) : 0,
            'hr' => null,
        ];

        // Extract heart rate from Garmin TrackPointExtension
        if (isset($namespaces['gpxtpx'])) {
            $extensions = $trkpt->extensions;
            if ($extensions) {
                $gpxtpx = $extensions->children($namespaces['gpxtpx']);
                if (isset($gpxtpx->TrackPointExtension->hr)) {
                    $point['hr'] = (int) $gpxtpx->TrackPointExtension->hr;
                }
            }
        }

        $points[] = $point;
    }

    return $points;
}

To calculate the distance between two GPS coordinates, we use the Haversine formula, the standard method for computing great-circle distance on a sphere:

function haversineDistance(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $R = 6371000; // Earth radius in meters
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a = sin($dLat / 2) ** 2
       + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * sin($dLon / 2) ** 2;

    return $R * 2 * atan2(sqrt($a), sqrt(1 - $a));
}
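
A quick way to sanity-check the formula is to feed it a case with a known answer: on a sphere of radius 6,371 km, one degree of latitude along a meridian is R × π / 180, about 111.19 km. The sketch below repeats the function under the hypothetical name haversineMeters() so that it runs on its own; the coordinates are made up for illustration.

```php
<?php
// Standalone copy of the Haversine helper (renamed so this snippet runs on its own)
function haversineMeters(float $lat1, float $lon1, float $lat2, float $lon2): float
{
    $R = 6371000; // mean Earth radius in meters
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a = sin($dLat / 2) ** 2
       + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * sin($dLon / 2) ** 2;

    return $R * 2 * atan2(sqrt($a), sqrt(1 - $a));
}

// One degree of latitude at constant longitude
$oneDegree = haversineMeters(45.0, 9.19, 46.0, 9.19);
$expected = 6371000 * M_PI / 180;
echo "One degree of latitude: " . round($oneDegree, 1) . " m"
    . " (expected " . round($expected, 1) . " m)" . PHP_EOL;

// Two consecutive trackpoints a few meters apart (illustrative values)
$step = haversineMeters(45.46420, 9.19000, 45.46423, 9.19000);
echo "Typical step between trackpoints: " . round($step, 2) . " m" . PHP_EOL;
```

The second call shows the scale the formula works at during a run: consecutive trackpoints are only a few meters apart, so per-kilometer totals are sums of hundreds of tiny segments.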

Then we walk through the trackpoints, accumulating distance until we hit each kilometer mark, and record the split. A trailing partial kilometer at the end of the run is not recorded:

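// Requires: use HiFolks\Statistics\Stat; (for the per-km heart rate average)
function buildKmSplits(array $trackpoints): array
{
    // Fewer than two points means there is no segment to measure
    if (count($trackpoints) < 2) {
        return [];
    }

    $splits = [];
    $currentKm = 1;
    $kmDistance = 0;
    $kmStartTime = $trackpoints[0]['time'];
    $kmEleGain = 0;
    $kmEleLoss = 0;
    $kmHrValues = [];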

    for ($i = 1; $i < count($trackpoints); $i++) {
        $prev = $trackpoints[$i - 1];
        $curr = $trackpoints[$i];

        $segDist = haversineDistance($prev['lat'], $prev['lon'], $curr['lat'], $curr['lon']);
        $kmDistance += $segDist;

        $eleDiff = $curr['ele'] - $prev['ele'];
        if ($eleDiff > 0) {
            $kmEleGain += $eleDiff;
        } else {
            $kmEleLoss += abs($eleDiff);
        }

        if ($curr['hr'] !== null) {
            $kmHrValues[] = $curr['hr'];
        }

        if ($kmDistance >= 1000) {
            $kmTime = $curr['time'] - $kmStartTime;
            $splits[] = [
                'km' => $currentKm,
                'time' => $kmTime,
                'pace' => $kmTime, // seconds per km (each split spans one kilometer)
                'eleGain' => round($kmEleGain, 1),
                'eleLoss' => round($kmEleLoss, 1),
                'avgHr' => count($kmHrValues) > 0
                    ? (int) round(Stat::mean($kmHrValues))
                    : null,
            ];

            $currentKm++;
            $kmDistance -= 1000;
            $kmStartTime = $curr['time'];
            $kmEleGain = 0;
            $kmEleLoss = 0;
            $kmHrValues = [];
        }
    }

    return $splits;
}
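
One detail worth knowing: the function closes a split at the first trackpoint past the 1,000 m mark, so the few meters of overshoot are timed as part of the finished kilometer. With points every 1–5 seconds the effect is small. A possible refinement, sketched here as an alternative rather than the approach used in this article, is to interpolate the moment the kilometer mark was crossed. The helper name interpolatedKmSplits() and its inputs (cumulative distances and timestamps) are hypothetical.

```php
<?php
// Hypothetical helper: split times with the km crossing linearly interpolated.
// $cumDist[i] is the cumulative distance in meters at trackpoint i,
// $times[i] is the matching timestamp in seconds.
function interpolatedKmSplits(array $cumDist, array $times): array
{
    $splits = [];
    $nextMark = 1000.0;
    $lastCrossing = (float) $times[0];

    for ($i = 1; $i < count($cumDist); $i++) {
        // A single segment may cross more than one mark (e.g. after a GPS gap)
        while ($cumDist[$i] >= $nextMark) {
            $segLen = $cumDist[$i] - $cumDist[$i - 1];
            $fraction = $segLen > 0
                ? ($nextMark - $cumDist[$i - 1]) / $segLen
                : 1.0;
            $crossing = $times[$i - 1] + $fraction * ($times[$i] - $times[$i - 1]);

            $splits[] = $crossing - $lastCrossing;
            $lastCrossing = $crossing;
            $nextMark += 1000.0;
        }
    }

    return $splits;
}

// Toy track: four points at a steady 5:00/km
$kmTimes = interpolatedKmSplits([0, 600, 1200, 2000], [0, 180, 360, 600]);
print_r($kmTimes);
```

In the toy track, the first kilometer mark falls two thirds of the way through the second segment, so the interpolated crossing lands at 300 seconds instead of the 360 seconds of the next recorded point.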

The data

For this article, we use a simulated 10K run with realistic characteristics: a hilly middle section, slight positive split tendency, and heart rate drifting upward with fatigue. If you have a real GPX file, just swap in parseGpx() and buildKmSplits().

// === Option 1: Parse a real GPX file ===
// $trackpoints = parseGpx('your-run.gpx');
// $splits = buildKmSplits($trackpoints);

// === Option 2: Simulated 10K run ===
$splits = [
    ['km' => 1,  'time' => 322, 'pace' => 322, 'eleGain' => 5,  'eleLoss' => 2,  'avgHr' => 145],
    ['km' => 2,  'time' => 318, 'pace' => 318, 'eleGain' => 8,  'eleLoss' => 3,  'avgHr' => 150],
    ['km' => 3,  'time' => 335, 'pace' => 335, 'eleGain' => 22, 'eleLoss' => 4,  'avgHr' => 158],
    ['km' => 4,  'time' => 348, 'pace' => 348, 'eleGain' => 28, 'eleLoss' => 5,  'avgHr' => 164],
    ['km' => 5,  'time' => 340, 'pace' => 340, 'eleGain' => 15, 'eleLoss' => 18, 'avgHr' => 162],
    ['km' => 6,  'time' => 312, 'pace' => 312, 'eleGain' => 2,  'eleLoss' => 30, 'avgHr' => 155],
    ['km' => 7,  'time' => 325, 'pace' => 325, 'eleGain' => 3,  'eleLoss' => 8,  'avgHr' => 158],
    ['km' => 8,  'time' => 338, 'pace' => 338, 'eleGain' => 12, 'eleLoss' => 5,  'avgHr' => 165],
    ['km' => 9,  'time' => 352, 'pace' => 352, 'eleGain' => 18, 'eleLoss' => 3,  'avgHr' => 170],
    ['km' => 10, 'time' => 330, 'pace' => 330, 'eleGain' => 4,  'eleLoss' => 15, 'avgHr' => 172],
];

The package includes utility classes for column extraction and time formatting:

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Freq;
use HiFolks\Statistics\Utils\Arr;
use HiFolks\Statistics\Utils\Format;

[$paces, $eleGains, $hrValues, $kmNumbers] = Arr::extract(
    $splits,
    ['pace', 'eleGain', 'avgHr', 'km']
);

The full PHP example with the complete code is here: examples/article-gpx-running-analysis.php

Step 1: Run overview, your run at a glance

Before diving into analysis, let's get the big picture:

$totalTime = array_sum(array_column($splits, 'time'));
$totalEleGain = array_sum(array_column($splits, 'eleGain'));
$totalEleLoss = array_sum(array_column($splits, 'eleLoss'));

echo "Distance:        " . count($splits) . " km" . PHP_EOL;
echo "Total time:      " . Format::secondsToTime($totalTime) . PHP_EOL;
echo "Average pace:    " . Format::secondsToTime(Stat::mean($paces)) . "/km" . PHP_EOL;
echo "Elevation gain:  +" . $totalEleGain . " m" . PHP_EOL;
echo "Elevation loss:  -" . $totalEleLoss . " m" . PHP_EOL;
echo "Average HR:      " . round(Stat::mean($hrValues)) . " bpm" . PHP_EOL;

Output:

Distance:        10 km
Total time:      0:55:20
Average pace:    0:05:32/km
Elevation gain:  +117 m
Elevation loss:  -93 m
Average HR:      160 bpm

This is the summary your watch shows you. Now let's look at what the numbers actually reveal.

Step 2: Pace descriptive statistics, how consistent were you?

The average pace is useful, but it hides the variation. Did you hold a steady 5:32/km throughout, or did you yo-yo between 5:12 and 5:52?

$meanPace = Stat::mean($paces);
$medianPace = Stat::median($paces);
$stdevPace = Stat::stdev($paces);
$quartiles = Stat::quantiles($paces);

echo "Mean pace:       " . Format::secondsToTime($meanPace) . "/km" . PHP_EOL;
echo "Median pace:     " . Format::secondsToTime($medianPace) . "/km" . PHP_EOL;
echo "Std deviation:   " . round($stdevPace, 1) . " sec" . PHP_EOL;
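// Look up which km produced the fastest and slowest pace
$fastestKm = $kmNumbers[array_search(min($paces), $paces)];
$slowestKm = $kmNumbers[array_search(max($paces), $paces)];
echo "Fastest km:      " . Format::secondsToTime(min($paces)) . "/km (km " . $fastestKm . ")" . PHP_EOL;
echo "Slowest km:      " . Format::secondsToTime(max($paces)) . "/km (km " . $slowestKm . ")" . PHP_EOL;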
echo "Quartiles:       Q1=" . Format::secondsToTime($quartiles[0]) . "/km"
    . "  Q2=" . Format::secondsToTime($quartiles[1]) . "/km"
    . "  Q3=" . Format::secondsToTime($quartiles[2]) . "/km" . PHP_EOL;

Output:

Mean pace:       0:05:32/km
Median pace:     0:05:33/km
Std deviation:   13 sec
Fastest km:      0:05:12/km (km 6)
Slowest km:      0:05:52/km (km 9)
Quartiles:       Q1=0:05:21/km  Q2=0:05:33/km  Q3=0:05:42/km

How to interpret the results:

A standard deviation of 13 seconds means a typical km deviated from the average by about 13 seconds; here, 6 of the 10 km fall within that band. That's moderate consistency for a hilly course.
If mean and median are close (5:32 vs 5:33), your pacing was roughly symmetric — no extreme skew toward fast or slow km.
The range (5:12 to 5:52 = 40 seconds) tells you the spread from your best to worst km. Compare this with the IQR (Q1 to Q3 = 21 seconds) — the core of your pacing was much tighter than the extremes suggest.
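
To see where the 13 seconds comes from, here is the standard deviation computed by hand in plain PHP, without the package. It assumes Stat::stdev() is the sample standard deviation (dividing by n − 1), which is consistent with the value printed above.

```php
<?php
// Pace per km in seconds, copied from the simulated 10K above
$paces = [322, 318, 335, 348, 340, 312, 325, 338, 352, 330];

$n = count($paces);
$mean = array_sum($paces) / $n;

// Sum of squared deviations from the mean
$sumSq = 0.0;
foreach ($paces as $p) {
    $sumSq += ($p - $mean) ** 2;
}

// Sample standard deviation: divide by n - 1, not n
$stdev = sqrt($sumSq / ($n - 1));
$range = max($paces) - min($paces);

echo "Mean:  " . $mean . " sec" . PHP_EOL;
echo "Stdev: " . round($stdev, 2) . " sec" . PHP_EOL;
echo "Range: " . $range . " sec" . PHP_EOL;
```

Dividing by n instead of n − 1 would give about 12.3 seconds. The difference matters little for 10 splits, but it shrinks further as the run gets longer.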

Step 3: Pacing consistency, did you positive-split or negative-split?

Every coach talks about pacing strategy. A positive split means you slowed down in the second half; a negative split means you got faster. The Coefficient of Variation (CV) puts a single number on your consistency:

$cv = Stat::coefficientOfVariation($paces, 2);

$halfPoint = intdiv(count($splits), 2);
$firstHalfPaces = array_slice($paces, 0, $halfPoint);
$secondHalfPaces = array_slice($paces, $halfPoint);
$meanFirst = Stat::mean($firstHalfPaces);
$meanSecond = Stat::mean($secondHalfPaces);
$splitDiff = $meanSecond - $meanFirst;
$splitPct = round(($splitDiff / $meanFirst) * 100, 1);

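echo "Coefficient of Variation: " . $cv . "%" . PHP_EOL;
echo "First half avg pace:  " . Format::secondsToTime($meanFirst)
    . "/km (km 1-" . $halfPoint . ")" . PHP_EOL;
echo "Second half avg pace: " . Format::secondsToTime($meanSecond)
    . "/km (km " . ($halfPoint + 1) . "-" . count($splits) . ")" . PHP_EOL;

// A negative difference means the second half was faster
$label = $splitDiff < 0 ? "Negative split" : "Positive split";
$direction = $splitDiff < 0 ? "faster" : "slower";
$change = $splitDiff < 0 ? "improvement" : "slowdown";
echo $label . ": " . round(abs($splitDiff), 1) . " sec/km " . $direction
    . " (" . abs($splitPct) . "% " . $change . ")" . PHP_EOL;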

Output:

Coefficient of Variation: 3.91%
First half avg pace:  0:05:33/km (km 1-5)
Second half avg pace: 0:05:31/km (km 6-10)
Negative split: 1.2 sec/km faster (0.4% improvement)

How to interpret the results:

As a rough rule of thumb, a CV below 5% indicates good pacing on a hilly course; elite runners often stay under 2% on flat courses.
Our runner managed a slight negative split: the second half was marginally faster. This often signals disciplined pacing, holding back on the uphills in km 3–5 and then capitalizing on the downhill at km 6.
Compare your CV across different runs to track whether your pacing discipline is improving over time.

Step 4: Elevation impact, how much do hills slow you down?

This is the question every trail runner wants answered. We have per-km elevation gain and per-km pace. Let's see if hills measurably affect your speed:

$corrEle = Stat::correlation($eleGains, $paces);
$regEle = Stat::linearRegression($eleGains, $paces);
$r2Ele = Stat::rSquared($eleGains, $paces, false, 4);

echo "Correlation (elevation gain vs pace): " . round($corrEle, 4) . PHP_EOL;
echo "Linear regression: pace = " . round($regEle[0], 2)
    . " x eleGain + " . round($regEle[1], 1) . PHP_EOL;
echo "R-squared: " . $r2Ele . PHP_EOL;

Output:

Correlation (elevation gain vs pace): 0.8053
Linear regression: pace = 1.18 x eleGain + 318.2
R-squared: 0.6485

How to interpret the results:

A Pearson correlation of 0.81 is strong: more climbing clearly goes with a slower pace.
The slope (1.18) tells you that each additional meter of elevation gain within a kilometer is associated with roughly 1.2 seconds slower pace. On a km with 28 m of climbing (km 4), the model predicts you'd run ~33 seconds slower than on a flat km.
R-squared of 0.65 means elevation gain explains about 65% of the variation in your pace. The remaining 35% comes from other factors such as fatigue, wind, terrain surface, and mental state.
Track this slope over multiple runs. As your hill fitness improves, this number should decrease: hills will slow you down less.
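
If you want to see what is behind the slope, intercept, and correlation, the following plain-PHP sketch computes them with ordinary least squares. The helper name leastSquares() is hypothetical and not part of the package; it is only meant to show the arithmetic.

```php
<?php
$eleGains = [5, 8, 22, 28, 15, 2, 3, 12, 18, 4];
$paces    = [322, 318, 335, 348, 340, 312, 325, 338, 352, 330];

// Hypothetical helper: ordinary least squares for y = slope * x + intercept
function leastSquares(array $x, array $y): array
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = $sxx = $syy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $dy = $y[$i] - $meanY;
        $sxy += $dx * $dy; // co-variation of x and y
        $sxx += $dx * $dx; // variation of x
        $syy += $dy * $dy; // variation of y
    }

    $slope = $sxy / $sxx;

    return [
        'slope' => $slope,
        'intercept' => $meanY - $slope * $meanX,
        'r' => $sxy / sqrt($sxx * $syy),
    ];
}

$fit = leastSquares($eleGains, $paces);
$rSquared = $fit['r'] ** 2;
$costOfKm4 = $fit['slope'] * 28; // extra seconds predicted for 28 m of climbing

echo "Slope:     " . round($fit['slope'], 2) . " sec per meter" . PHP_EOL;
echo "Intercept: " . round($fit['intercept'], 1) . " sec (predicted flat pace)" . PHP_EOL;
echo "r:         " . round($fit['r'], 4) . PHP_EOL;
echo "R-squared: " . round($rSquared, 4) . PHP_EOL;
echo "Km 4 cost: " . round($costOfKm4) . " sec" . PHP_EOL;
```

The intercept is the pace the model predicts for a kilometer with zero climbing, which makes it a handy estimate of your flat pace on the day.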

Step 5: Heart rate analysis, detecting cardiac drift

Heart rate tells a story that pace alone cannot. Even if your pace stays constant, a rising heart rate signals that your body is working harder. This is cardiac drift, caused by dehydration, heat, and accumulated fatigue.

$meanHr = Stat::mean($hrValues);
$stdevHr = Stat::stdev($hrValues);

// Cardiac drift: HR vs km number
$corrHrKm = Stat::correlation($kmNumbers, $hrValues);
$regHrKm = Stat::linearRegression($kmNumbers, $hrValues);
$r2HrKm = Stat::rSquared($kmNumbers, $hrValues, false, 4);

// HR vs pace
$corrHrPace = Stat::correlation($hrValues, $paces);

Output:

Mean HR:    160 bpm
Median HR:  160 bpm
Std dev:    8.5 bpm
Min HR:     145 bpm | Max HR: 172 bpm

Cardiac drift (HR vs km):
  Correlation:      0.8506
  Regression:       HR = 2.38 x km + 146.8
  R-squared:        0.7235
  HR drift per km:  +2.4 bpm/km

HR vs pace correlation: 0.6912

How to interpret the results:

A correlation of 0.85 between km number and heart rate points to substantial cardiac drift: your heart worked progressively harder as the run continued.
The regression slope (+2.4 bpm/km) means your heart rate rose by about 2.4 beats per minute each kilometer. Between km 1 and km 10 the fitted line climbs about 21 bpm (9 × 2.38); the recorded values rose from 145 to 172 bpm.
The HR vs pace correlation (0.69) is moderate — heart rate is influenced by pace, but also by elevation, fatigue, and heat. A perfect correlation would mean pace alone determines HR, which is never true in real-world conditions.

Heart rate zone distribution

Using Freq::frequencyTableBySize(), we can see how many kilometers were spent in each heart rate zone:

$hrZones = Freq::frequencyTableBySize($hrValues, 10);
foreach ($hrZones as $range => $count) {
    echo "  " . $range . " bpm: "
        . str_repeat("#", $count) . " (" . $count . " km)" . PHP_EOL;
}

Output:

Heart Rate Zone Distribution:
  145 bpm: ## (2 km)
  155 bpm: ##### (5 km)
  165 bpm: ### (3 km)

Each label is the lower bound of a 10 bpm bin. This tells you that 5 of your 10 km were in the 155–164 bpm zone, your aerobic sweet spot. Only 3 km pushed into the 165+ range (threshold/anaerobic), primarily on the uphill and late-fatigue segments.
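
The binning itself is simple enough to reproduce by hand. The sketch below assumes fixed-width bins that start at the minimum value; that assumption is not taken from the package documentation, but it reproduces the 145 / 155 / 165 labels in the output above.

```php
<?php
$hrValues = [145, 150, 158, 164, 162, 155, 158, 165, 170, 172];

// Fixed-width bins starting at the minimum value (an assumption, see above)
$binSize = 10;
$start = min($hrValues);

$zones = [];
foreach ($hrValues as $hr) {
    // Lower bound of the bin this value falls into
    $lower = $start + intdiv($hr - $start, $binSize) * $binSize;
    $zones[$lower] = ($zones[$lower] ?? 0) + 1;
}
ksort($zones);

foreach ($zones as $lower => $count) {
    $upper = $lower + $binSize - 1;
    echo "  {$lower}-{$upper} bpm: " . str_repeat("#", $count)
        . " ({$count} km)" . PHP_EOL;
}
```

Printing both bounds makes the table easier to read. For training purposes you would normally replace the fixed 10 bpm bins with your own zone boundaries, derived from your maximum or threshold heart rate.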

Step 6: Outlier detection, which km was your best and worst?

Not every kilometer is created equal. Some are unusually fast (downhill? adrenaline?) or slow (steep hill? red light? cramp?). The z-score tells you exactly how unusual each km was:

$zscores = Stat::zscores($paces, 2);

foreach ($splits as $i => $split) {
    echo "  km " . $split['km'] . ": " . Format::secondsToTime($split['pace']) . "/km"
        . "  z=" . sprintf("%+.2f", $zscores[$i]) . PHP_EOL;
}

Output:

  km  1: 0:05:22/km  z=-0.77
  km  2: 0:05:18/km  z=-1.08
  km  3: 0:05:35/km  z=+0.23
  km  4: 0:05:48/km  z=+1.23
  km  5: 0:05:40/km  z=+0.62
  km  6: 0:05:12/km  z=-1.54
  km  7: 0:05:25/km  z=-0.54
  km  8: 0:05:38/km  z=+0.46
  km  9: 0:05:52/km  z=+1.54
  km 10: 0:05:30/km  z=-0.15

How to interpret the results:

Negative z-scores mean faster than average; positive means slower. The further from zero, the more unusual.
Km 6 (z = -1.54) was the standout fast km — 20 seconds faster than average. Looking at the data, it had only 2m of elevation gain but 30m of loss. Gravity did the work.
Km 9 (z = +1.54) was the slowest — 18m of climbing plus accumulated fatigue in the late stages.
No km exceeded |z| > 2.0, so there are no statistical outliers. This is confirmed by both Stat::outliers() and Stat::iqrOutliers():

$zOutliers = Stat::outliers($paces, 2.0);   // []
$iqrOutliers = Stat::iqrOutliers($paces);    // []

If you had stopped at a traffic light or taken a water break, the affected km would show up as an outlier — and you'd know to exclude it from your pace analysis.
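
To illustrate, the sketch below takes the same simulated run and adds a hypothetical two-minute stop to km 7, then computes the z-scores by hand in plain PHP using the sample standard deviation.

```php
<?php
// Same simulated run, but km 7 now includes a hypothetical 2-minute stop
$paces = [322, 318, 335, 348, 340, 312, 445, 338, 352, 330];

$n = count($paces);
$mean = array_sum($paces) / $n;

$sumSq = 0.0;
foreach ($paces as $p) {
    $sumSq += ($p - $mean) ** 2;
}
$stdev = sqrt($sumSq / ($n - 1));

$zscores = [];
$flagged = [];
foreach ($paces as $i => $p) {
    $z = ($p - $mean) / $stdev;
    $zscores[] = $z;
    $isOutlier = abs($z) > 2.0;
    if ($isOutlier) {
        $flagged[] = $i + 1; // km numbers are 1-based
    }
    printf("km %2d: %3d sec  z=%+.2f%s" . PHP_EOL, $i + 1, $p, $z, $isOutlier ? "  <-- outlier" : "");
}

echo "Flagged km: " . implode(", ", $flagged) . PHP_EOL;
```

Note that the stop also inflates the mean and the standard deviation, which pulls every other z-score toward zero. Once you have identified and removed such a km, recompute the statistics on the remaining splits.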

Step 7: Percentile benchmarks, your pace distribution

Percentiles tell you what your pace range actually looks like across this run:

$percentiles = [10, 25, 50, 75, 90];
foreach ($percentiles as $p) {
    echo "  P" . $p . ": "
        . Format::secondsToTime(Stat::percentile($paces, $p, 0)) . "/km" . PHP_EOL;
}

Output:

  P10: 0:05:13/km
  P25: 0:05:21/km
  P50: 0:05:33/km
  P75: 0:05:42/km
  P90: 0:05:52/km

How to interpret the results:

P10 (5:13/km) is the pace you only sustain on your fastest 10% of km — your peak speed on this run.
P50 (5:33/km) is your median pace — the truest single number for "how fast did I run?"
P90 (5:52/km) is your slowest 10% — your weakest km, usually hills or the final push.
The gap between P25 and P75 (21 seconds) is your interquartile range — the "core band" of your pacing. A narrower band means more consistent running.
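
With only 10 values, a percentile almost never lands exactly on a data point, so it has to be interpolated. The sketch below uses linear interpolation at rank p × (n + 1), often called the "exclusive" method. That the package uses this convention is an assumption; it is chosen here because it reproduces the figures printed above. The helper name percentileExclusive() is hypothetical.

```php
<?php
$paces = [322, 318, 335, 348, 340, 312, 325, 338, 352, 330];

// Hypothetical helper: percentile by linear interpolation at rank p * (n + 1)
function percentileExclusive(array $values, float $p): float
{
    sort($values);
    $n = count($values);

    $rank = $p / 100 * ($n + 1);              // 1-based fractional rank
    $rank = max(1.0, min((float) $n, $rank)); // clamp to the data range

    $lo = (int) floor($rank);
    $hi = min($lo + 1, $n);
    $frac = $rank - $lo;

    // Interpolate between the two neighboring sorted values
    return $values[$lo - 1] + $frac * ($values[$hi - 1] - $values[$lo - 1]);
}

$result = [];
foreach ([10, 25, 50, 75, 90] as $p) {
    $result[$p] = percentileExclusive($paces, $p);
    echo "P{$p}: " . round($result[$p], 1) . " sec" . PHP_EOL;
}
```

For example, P25 sits at rank 2.75, three quarters of the way between the second and third fastest splits (318 and 322 seconds), which gives 321 seconds or 5:21/km.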

Step 8: Distribution shape, are your km skewed?

skewness() and kurtosis() reveal the shape of your pace distribution:

$skewness = Stat::skewness($paces, 4);
$kurtosis = Stat::kurtosis($paces, 4);

echo "Skewness: " . $skewness . PHP_EOL;
echo "Kurtosis: " . $kurtosis . PHP_EOL;

Output:

Skewness: 0.0481
Kurtosis: -0.9316

How to interpret the results:

Skewness near zero (0.05) means your pace distribution is approximately symmetric. You did not have a long tail of slow km or fast km — the variation was balanced.
Negative kurtosis (-0.93) means your pace values are more uniformly spread out than a normal distribution: fewer km cluster tightly around the mean, and the extremes are not very extreme. This is typical for a hilly course where terrain forces variation.
If your skewness were strongly positive (> 0.5), it would mean a tail of slow km, possibly from steep climbs or late-run fatigue. A negative skewness would mean a tail of fast km, perhaps starting too fast.
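
Both numbers are built from powers of the deviations from the mean: cubes for skewness (so the sign of each deviation survives) and fourth powers for kurtosis (so extreme values dominate). The sketch below uses the standard bias-adjusted sample formulas. Which formulas the package uses internally is an assumption; these are shown because they reproduce the values printed above.

```php
<?php
$paces = [322, 318, 335, 348, 340, 312, 325, 338, 352, 330];

$n = count($paces);
$mean = array_sum($paces) / $n;

// Sums of squared, cubed, and fourth-power deviations
$m2 = $m3 = $m4 = 0.0;
foreach ($paces as $p) {
    $d = $p - $mean;
    $m2 += $d ** 2;
    $m3 += $d ** 3;
    $m4 += $d ** 4;
}
$s = sqrt($m2 / ($n - 1)); // sample standard deviation

// Adjusted Fisher-Pearson sample skewness
$skewness = $n / (($n - 1) * ($n - 2)) * $m3 / $s ** 3;

// Sample excess kurtosis (0 for a normal distribution)
$kurtosis = $n * ($n + 1) / (($n - 1) * ($n - 2) * ($n - 3)) * $m4 / $s ** 4
          - 3 * ($n - 1) ** 2 / (($n - 2) * ($n - 3));

echo "Skewness: " . round($skewness, 4) . PHP_EOL;
echo "Kurtosis: " . round($kurtosis, 4) . PHP_EOL;
```

With only 10 splits, both statistics are noisy: a single unusual kilometer can move them noticeably, so read them as a rough description rather than a precise measurement.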

Step 9: Confidence interval, what is your true pace?

Your 10 km give you an average pace, but with more data (more km), that average would stabilize. The confidence interval tells you the range where your "true" comfortable pace likely falls:

$ci = Stat::confidenceInterval($paces, 0.95, 0);
$sem = Stat::sem($paces, 1);

echo "95% CI: " . Format::secondsToTime($ci[0]) . "/km to "
    . Format::secondsToTime($ci[1]) . "/km" . PHP_EOL;
echo "Standard Error of the Mean: " . $sem . " sec" . PHP_EOL;

Output:

95% CI for your true pace: 0:05:24/km to 0:05:40/km
Standard Error of the Mean: 4.1 sec

How to interpret the results:

We are 95% confident that your true comfortable pace for this effort level and course profile is between 5:24/km and 5:40/km.
The SEM of 4.1 seconds is the engine behind this interval. With only 10 km, there's meaningful uncertainty. Assuming the same 13-second standard deviation, on a half-marathon (21 km) the SEM would shrink to about 2.8 seconds, and on a marathon (42 km) to about 2 seconds, so your confidence interval would become very tight.
This is useful for race planning: instead of saying "I run 5:32/km pace", you can say "my pace is 5:24–5:40/km on this type of terrain" — a more honest and useful estimate.
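
The interval is the mean plus or minus a critical value times the SEM. The sketch below assumes the normal critical value of 1.96, which reproduces the interval printed above; whether the package uses exactly this is an assumption. With only 10 splits, a Student's t value (2.262 for 9 degrees of freedom) would give a somewhat wider interval.

```php
<?php
$paces = [322, 318, 335, 348, 340, 312, 325, 338, 352, 330];

$n = count($paces);
$mean = array_sum($paces) / $n;

$sumSq = 0.0;
foreach ($paces as $p) {
    $sumSq += ($p - $mean) ** 2;
}
$stdev = sqrt($sumSq / ($n - 1));

// Standard error of the mean: stdev shrinks with the square root of n
$sem = $stdev / sqrt($n);

// 97.5th percentile of the standard normal (assumed critical value)
$z = 1.959964;
$lower = $mean - $z * $sem;
$upper = $mean + $z * $sem;

echo "SEM:    " . round($sem, 1) . " sec" . PHP_EOL;
echo "95% CI: " . gmdate('i:s', (int) round($lower)) . "/km to "
    . gmdate('i:s', (int) round($upper)) . "/km" . PHP_EOL;

// How the SEM would shrink on longer runs with the same per-km stdev
$semByDistance = [];
foreach ([10, 21, 42] as $km) {
    $semByDistance[$km] = $stdev / sqrt($km);
    echo "SEM over {$km} km: " . round($semByDistance[$km], 1) . " sec" . PHP_EOL;
}
```

Because of the square root, quadrupling the number of kilometers only halves the uncertainty.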

Step 10: Multi-run trend analysis, are you getting faster?

The most powerful analysis comes from loading multiple GPX files across weeks or months. Each run gives you an average pace, and over time, you can see the trend.

Here we simulate 8 weeks of training data:

$weeks = [1, 2, 3, 4, 5, 6, 7, 8];
$weeklyPaces = [350, 342, 337, 333, 330, 328, 326, 325];

$trendReg = Stat::linearRegression($weeks, $weeklyPaces);
$trendR2 = Stat::rSquared($weeks, $weeklyPaces, false, 4);
$trendCorr = Stat::correlation($weeks, $weeklyPaces);

echo "Trend regression: pace = " . round($trendReg[0], 2)
    . " x week + " . round($trendReg[1], 1) . PHP_EOL;
echo "R-squared: " . $trendR2 . PHP_EOL;
echo "Correlation: " . round($trendCorr, 4) . PHP_EOL;

Output:

Trend regression: pace = -3.39 x week + 349.1
R-squared:        0.9176
Correlation:      -0.9579
Improvement rate:  3.4 seconds/km per week

Predicted pace at week 12: 0:05:08/km
(Extrapolation — use with caution!)

How to interpret the results:

The slope (-3.39) means you're improving by about 3.4 seconds per km per week on average. That's meaningful and measurable progress.
R-squared of 0.92 means the linear model explains most of the variance, but not all of it. The remaining 8% hints that the improvement pattern isn't perfectly linear — there's curvature in the data.
The negative correlation (-0.96) confirms weeks going up while pace goes down — exactly what improvement looks like.
The prediction for week 12 (5:08/km) is an extrapolation. Linear trends don't continue forever — you won't reach 0:00/km eventually. But for short-term planning (next 2–3 weeks), the projection can be a reasonable target.

Why linear regression can be misleading for athletic improvement

The linear model predicts a constant improvement of 3.4 seconds per week, forever. Taken to the extreme, it would eventually predict a pace of 0:00/km, which is obviously impossible. The real issue is more subtle: athletic improvement follows a diminishing returns curve. Early gains come fast (beginner effect, neuromuscular adaptation), but as you get fitter, each additional second of improvement requires more training volume and specificity.

Look at the data closely: the improvements become progressively smaller. From week 1 to week 2 the gain is 8 seconds, but from week 7 to week 8 it is only 1 second. The rate is clearly slowing down, a pattern that linear regression cannot capture because it assumes a constant slope. This is why the linear R² (0.9176) leaves room for improvement.
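
A few lines of plain PHP make the slowdown visible by listing the gain from each week to the next:

```php
<?php
$weeklyPaces = [350, 342, 337, 333, 330, 328, 326, 325];

// Seconds gained from each week to the next (positive = faster)
$gains = [];
for ($i = 1; $i < count($weeklyPaces); $i++) {
    $gains[] = $weeklyPaces[$i - 1] - $weeklyPaces[$i];
}

$averageGain = array_sum($gains) / count($gains);

echo "Week-over-week gains: " . implode(", ", $gains) . " sec" . PHP_EOL;
echo "Average gain: " . round($averageGain, 1) . " sec/week" . PHP_EOL;
```

A straight line has to describe all seven gains with a single number, so it overstates the recent gains and understates the early ones.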

Logarithmic regression, modeling the plateau

The logarithmicRegression() method fits the model y = a × ln(x) + b, which naturally produces fast initial improvement that gradually flattens:

// Logarithmic model: pace = a * ln(week) + b
$logReg = Stat::logarithmicRegression($weeks, $weeklyPaces);
$logWeeks = array_map(fn($v) => log($v), $weeks);
$logR2 = Stat::rSquared($logWeeks, $weeklyPaces, false, 4);

echo "Logarithmic regression: pace = " . round($logReg[0], 2)
    . " x ln(week) + " . round($logReg[1], 1) . PHP_EOL;
echo "R-squared: " . $logR2 . PHP_EOL;

Output:

Logarithmic regression: pace = -12.33 x ln(week) + 350.2
R-squared:              0.9987

The logarithmic model has a much higher R² (0.9987 vs 0.9176), indicating that it fits the observed data substantially better than the linear model for this dataset.

This suggests that the relationship is nonlinear: improvement is rapid initially and then slows over time, a pattern consistent with diminishing returns. The linear model already provides a strong fit (R² ≈ 0.92), but accounting for curvature captures the structure of the data more accurately.

A caveat about this simple example: with only 8 smooth, simulated points, an extremely high R² is easy to obtain, so treat the comparison as an illustration rather than proof.
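
Under the hood, a logarithmic fit is ordinary least squares on a transformed x-axis: replace each week by ln(week) and fit a straight line. The plain-PHP sketch below shows that idea without the package. It is an illustration of the technique, not a copy of the package's implementation.

```php
<?php
$weeks = [1, 2, 3, 4, 5, 6, 7, 8];
$weeklyPaces = [350, 342, 337, 333, 330, 328, 326, 325];

// Transform x, then run ordinary least squares on (ln(week), pace)
$x = array_map(fn($w) => log($w), $weeks);
$n = count($x);
$meanX = array_sum($x) / $n;
$meanY = array_sum($weeklyPaces) / $n;

$sxy = $sxx = $syy = 0.0;
foreach ($x as $i => $xi) {
    $dx = $xi - $meanX;
    $dy = $weeklyPaces[$i] - $meanY;
    $sxy += $dx * $dy;
    $sxx += $dx * $dx;
    $syy += $dy * $dy;
}

$a = $sxy / $sxx;            // seconds gained per unit of ln(week)
$b = $meanY - $a * $meanX;   // predicted pace at week 1, since ln(1) = 0
$r2 = $sxy ** 2 / ($sxx * $syy);

echo "pace = " . round($a, 2) . " x ln(week) + " . round($b, 1) . PHP_EOL;
echo "R-squared: " . round($r2, 4) . PHP_EOL;

// Project forward
$predictions = [];
foreach ([12, 20, 52] as $week) {
    $predictions[$week] = $a * log($week) + $b;
    echo "Week {$week}: " . gmdate('i:s', (int) round($predictions[$week])) . "/km" . PHP_EOL;
}
```

Because ln(week) grows ever more slowly, each additional week contributes a smaller gain: going from week 1 to week 2 is worth as much as going from week 26 to week 52.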

Comparing the predictions

The difference becomes clear when you project forward:

$linearPrediction = $trendReg[0] * 12 + $trendReg[1];
$logPrediction = $logReg[0] * log(12) + $logReg[1];

echo "Linear prediction week 12:      " . Format::secondsToTime($linearPrediction) . "/km" . PHP_EOL;
echo "Logarithmic prediction week 12:  " . Format::secondsToTime($logPrediction) . "/km" . PHP_EOL;

Output:

Linear prediction week 12:      0:05:08/km
Logarithmic prediction week 12: 0:05:20/km

The logarithmic model predicts 5:20/km, 12 seconds more conservative than the linear model's 5:08/km. At week 20 the gap widens further: the linear model would predict 4:41/km (unlikely for most recreational runners), while the logarithmic model predicts 5:13/km, a more realistic plateau.

Which model should you use?

Aspect	Linear	Logarithmic
Model	`pace = a × week + b`	`pace = a × ln(week) + b`
Assumes	Constant improvement forever	Fast early gains, gradual plateau
Short-term (4 weeks)	Good approximation	Good approximation
Long-term (12+ weeks)	Over-optimistic	More realistic
Best for	Short training blocks, beginners with stable gains	Multi-month planning, experienced runners
R² on this data	0.9176	0.9987

Recommendation: compare R² values for both models on your own data. If the logarithmic R² is higher, your improvement is already following a curve and the logarithmic model will give more trustworthy projections. If R² values are similar, you're still in the early "linear" phase of improvement — but use the logarithmic model for any prediction beyond 4–6 weeks.

All four models compared — letting the numbers decide

Rather than assuming which model fits best, let's run all four regression types on the same data and compare them objectively. The package provides logarithmicRegression(), powerRegression(), and exponentialRegression() alongside linearRegression() — each fits a different curve shape:

// Linear: pace = a * week + b
[$aLin, $bLin] = Stat::linearRegression($weeks, $weeklyPaces);
$r2Lin = Stat::rSquared($weeks, $weeklyPaces, false, 4);

// Logarithmic: pace = a * ln(week) + b
[$aLog, $bLog] = Stat::logarithmicRegression($weeks, $weeklyPaces);
$logWeeks = array_map(fn($v) => log($v), $weeks);
$r2Log = Stat::rSquared($logWeeks, $weeklyPaces, false, 4);

// Power: pace = a * week^b
[$aPow, $bPow] = Stat::powerRegression($weeks, $weeklyPaces);
$logPaces = array_map(fn($v) => log($v), $weeklyPaces);
$r2Pow = Stat::rSquared($logWeeks, $logPaces, false, 4);

// Exponential: pace = a * e^(b * week)
[$aExp, $bExp] = Stat::exponentialRegression($weeks, $weeklyPaces);
$r2Exp = Stat::rSquared($weeks, $logPaces, false, 4);

Output:

Model             R²       Week 12    Week 20    Week 52
─────────────────────────────────────────────────────────
Linear            0.9176   0:05:08    0:04:41    0:02:53
Logarithmic       0.9987   0:05:20    0:05:13    0:05:02
Power             0.9985   0:05:20    0:05:14    0:05:03
Exponential       0.9232   0:05:09    0:04:45    0:03:27

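The R² column gives a clear ranking. Note that, as computed in the code above, the power and exponential values measure fit on log-transformed paces, so they are close to, but not strictly comparable with, the linear and logarithmic values: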

Logarithmic (R² = 0.9987) and Power (R² = 0.9985) are virtually tied and both fit the data near-perfectly. They capture the curvature that the other two models miss.
Linear (R² = 0.9176) and Exponential (R² = 0.9232) leave about 8% of the variance unexplained — they force a shape that doesn't match the data's natural curve.

But R² only measures how well a model fits past data. The prediction columns reveal which models are trustworthy for the future:

At week 20, linear predicts 4:41/km and exponential predicts 4:45/km: both assume improvement continues at nearly the same rate. For a recreational runner who started at 5:50/km, breaking 5:00/km in just 20 weeks is ambitious; breaking 4:45 is unrealistic.
At week 52 (one year), linear predicts 2:53/km, within a few seconds of world-record marathon pace (about 2:51/km for Kelvin Kiptum's 2:00:35). Exponential predicts 3:27/km. Both are absurd for the same runner.
Logarithmic and Power predict 5:02/km and 5:03/km at week 52 — a realistic plateau where the runner has improved by about 48 seconds over a year and further gains require significantly more effort.

The logarithmic and power models converge on nearly identical predictions because they both model the same fundamental pattern: fast early gains that gradually flatten. Neither curve has a true floor, but the decline becomes very slow. For running pace data, either is a sound choice. Logarithmic is slightly simpler to interpret (the coefficient a directly tells you "seconds of improvement per unit of ln(week)"), which is why we recommend it as the default for trend analysis.
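
The power and exponential models can both be fitted with the same straight-line machinery after taking logarithms: the power model is a line in (ln(week), ln(pace)) space, and the exponential model is a line in (week, ln(pace)) space. The sketch below shows this in plain PHP with a hypothetical fitLine() helper. That the package fits these models the same way is an assumption, although the rSquared() calls on log-transformed data above point in that direction.

```php
<?php
$weeks = [1, 2, 3, 4, 5, 6, 7, 8];
$weeklyPaces = [350, 342, 337, 333, 330, 328, 326, 325];

// Hypothetical helper: least-squares line, returns [slope, intercept]
function fitLine(array $x, array $y): array
{
    $n = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = $sxx = 0.0;
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) ** 2;
    }
    $slope = $sxy / $sxx;

    return [$slope, $meanY - $slope * $meanX];
}

$lnWeeks = array_map(fn($w) => log($w), $weeks);
$lnPaces = array_map(fn($p) => log($p), $weeklyPaces);

// Power: ln(pace) = ln(a) + b * ln(week)  =>  pace = a * week^b
[$bPow, $lnAPow] = fitLine($lnWeeks, $lnPaces);
$aPow = exp($lnAPow);

// Exponential: ln(pace) = ln(a) + b * week  =>  pace = a * e^(b * week)
[$bExp, $lnAExp] = fitLine($weeks, $lnPaces);
$aExp = exp($lnAExp);

$powWeek52 = $aPow * 52 ** $bPow;
$expWeek52 = $aExp * exp($bExp * 52);

echo "Power:       pace = " . round($aPow, 1) . " x week^" . round($bPow, 4) . PHP_EOL;
echo "Exponential: pace = " . round($aExp, 1) . " x e^(" . round($bExp, 4) . " x week)" . PHP_EOL;
echo "Week 52 (power):       " . gmdate('i:s', (int) round($powWeek52)) . "/km" . PHP_EOL;
echo "Week 52 (exponential): " . gmdate('i:s', (int) round($expWeek52)) . "/km" . PHP_EOL;
```

The exponential model shrinks the pace by a fixed percentage every week, which is why it keeps diving, while the power model shrinks it by a fixed percentage every time the week number doubles.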

Visualizing the models: why the curve matters more than the fit

These charts plot each model's forecast beyond the training data, so you can see at a glance where the predictions stay realistic and where they drift into fantasy.

The chart below shows all four models fitted to the same training data. The actual pace values (blue dots) end at week 8 — everything beyond is a prediction. Notice how the linear and exponential lines keep diving, while the logarithmic and power curves flatten into a realistic plateau.

Linear

The straight line fits the training period reasonably well, but projects an impossible pace of 2:53/km at week 52 — a reminder that constant improvement is a mathematical fiction.

Logarithmic

The curve mirrors how runners actually improve: rapid early gains that gradually flatten, predicting a realistic 5:02/km plateau after one year of training.

Power

Nearly indistinguishable from the logarithmic model, the power curve confirms the diminishing-returns pattern: two different equations arriving at the same truth.

Exponential

Slightly better than linear but still too optimistic, the exponential model bends just enough to look plausible in the short term while still predicting an unrealistic 3:27/km at week 52.

Linear vs. logarithmic

When we isolate the simplest model (linear) and the best-fitting one (logarithmic), the difference is subtle over the training period but significant in the projection zone. Both track the actual data closely through week 8, but the linear model keeps promising improvement that will never come.

Loading real GPX files

To implement this with real GPX files, load each file, compute the average pace, and build your arrays:

$gpxFiles = glob('runs/2025-*.gpx');
$weeks = [];
$weeklyPaces = [];

foreach ($gpxFiles as $i => $file) {
    $trackpoints = parseGpx($file);
    $splits = buildKmSplits($trackpoints);
    [$paces] = Arr::extract($splits, ['pace']);
    $weeks[] = $i + 1;
    $weeklyPaces[] = Stat::mean($paces);
}

// Compare both models
$linear = Stat::linearRegression($weeks, $weeklyPaces);
$logarithmic = Stat::logarithmicRegression($weeks, $weeklyPaces);
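
The loop above uses the file index as the week number, which works when you have exactly one file per week and the file names sort chronologically. If you skip a week, the time axis gets compressed and the trend looks steeper than it is. A possible alternative is to derive the week number from each run's date, as in this sketch. The dates are hypothetical; in practice you could take them from the first trackpoint of each file.

```php
<?php
// Hypothetical run dates (note the missing week between the 2nd and 3rd run)
$runDates = ['2025-01-06', '2025-01-13', '2025-01-27', '2025-02-03'];

// Parse as UTC midnight so the result does not depend on the server time zone
$timestamps = array_map(fn($d) => strtotime($d . ' 00:00:00 UTC'), $runDates);
sort($timestamps);

$first = $timestamps[0];
$secondsPerWeek = 7 * 24 * 3600;

// Week 1 is the week of the first run; gaps in training leave gaps in $weeks
$weeks = array_map(
    fn($ts) => intdiv($ts - $first, $secondsPerWeek) + 1,
    $timestamps
);

echo "Weeks: " . implode(", ", $weeks) . PHP_EOL;
```

The resulting $weeks array can be passed to the regression functions in place of the index-based one.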

What to look for when you analyze your own runs

Run the example and then try it with your own GPX files. Here's what to watch for:

CV below 5%: your pacing is disciplined. Above 8%: investigate what's causing the variation (hills? starting too fast?).
Elevation slope: track this number over months. As your hill strength improves, each meter of climb should cost fewer seconds.
Cardiac drift slope: a lower bpm/km slope means better aerobic fitness and hydration. Compare this across similar runs.
Z-scores: any km with |z| > 2 deserves investigation — was it a genuine outlier (stoppage, cramp) or a terrain feature?
Week-over-week trend: a negative slope means improvement. Plateaus are normal; a positive slope (getting slower) may signal overtraining or insufficient recovery.

The complete example

The full script with all helper functions and simulated data is available in the repository. You can run it with:

php examples/article-gpx-running-analysis.php

Source: examples/article-gpx-running-analysis.php

Summary of the package features used

Class	Method	What it does
`Stat`	`mean()`, `median()`, `stdev()`	Basic descriptive statistics on pace, HR, elevation
`Stat`	`quantiles()`, `percentile()`	Pace distribution — where do your km fall?
`Stat`	`coefficientOfVariation()`	Single number for pacing consistency
`Stat`	`correlation()`	Elevation vs pace, HR vs pace, HR vs time
`Stat`	`linearRegression()`	Quantify hill cost, cardiac drift rate, improvement trend
`Stat`	`logarithmicRegression()`	Model diminishing returns (pace improvement plateau)
`Stat`	`powerRegression()`, `exponentialRegression()`	Alternative non-linear trend models
`Stat`	`rSquared()`	How well does elevation/time explain your pace?
`Stat`	`zscores()`	Flag unusual km segments
`Stat`	`outliers()`, `iqrOutliers()`	Detect anomalous km (stops, sprints)
`Stat`	`skewness()`, `kurtosis()`	Distribution shape of your pace
`Stat`	`confidenceInterval()`, `sem()`	Estimate your true pace range
`Freq`	`frequencyTableBySize()`	Heart rate zone distribution
`Arr`	`extract()`	Extract columns from split data
`Format`	`secondsToTime()`	Human-readable pace and time formatting

Install it and start exploring your own runs:

composer require hi-folks/statistics
