Note: I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.
Motivation
When I stood alone in protest at stations along the Yamanote Line, I watched people walk past me. Some glanced at my sign and looked away with a frown. Others gave a small nod. But I had no way to measure how many people actually reacted to my message — or how.
That question is what started this project. I wanted to capture not just whether people reacted, but how their movement changed: did they slow down, step aside, or adjust their path? To do that, I needed to track pedestrian trajectories from video and place them on a real map. I filmed at several stations in Tokyo using a smartphone, and built the pipeline using open-source tools — YOLOX for detection, ByteTrack for tracking, and MapLibre for visualization.
Introduction
In the previous article, I extracted pedestrian trajectories from street video as structured JSON. The coordinates in that output are pixel positions in image space: they mean something relative to the camera frame, but nothing about the real world.
In this article, I transform those image-space trajectories into geographic coordinates (WGS84) using homography, then render them on an interactive map with MapLibre GL JS. The result is a browser-based viewer where you can watch each pedestrian's path play back in real time over an aerial orthophoto. The data used in this article was filmed at Shinbashi Station.
The Core Problem: Two Different Coordinate Systems
Video footage and maps live in fundamentally different spaces. A pixel at (x, y) in a camera frame corresponds to some location (lon, lat) on the ground — but the mapping is not a simple linear scale. Because the camera is tilted at an angle rather than looking straight down, perspective distortion means that objects farther from the camera appear compressed. A projective transformation, called a homography, is the correct model for this mapping when the scene is approximately planar (which a ground surface is).
A homography H is a 3×3 matrix that maps points in image space to points in world space via:
[lon, lat, 1]^T ∝ H · [x, y, 1]^T
To compute H, I need at least four Ground Control Points (GCPs): pairs of corresponding locations, one in image coordinates and one in geographic coordinates.
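As a sanity check on the math, here is a minimal NumPy sketch (independent of the project's code) that estimates H from four synthetic point pairs via the direct linear transform and verifies it reproduces the mapping. The point values are illustrative, and the example maps into a local metric frame rather than raw degrees for readability; the same math applies to (lon, lat):

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform: solve for the 3x3 H mapping src -> dst."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # H is the right singular vector for the smallest singular value of A
    _, _, Vt = np.linalg.svd(np.array(A, dtype=np.float64))
    return Vt[-1].reshape(3, 3)

def apply_homography(H, pt):
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[0] / p[2], p[1] / p[2]  # divide out the homogeneous coordinate

# Four synthetic GCP pairs: image pixels -> local ground coordinates (meters).
# Real GCPs would map to (lon, lat) instead.
src = [(100, 800), (1800, 820), (300, 300), (1600, 310)]
dst = [(0.0, 0.0), (30.0, 0.0), (5.0, 40.0), (25.0, 40.0)]

H = estimate_homography(src, dst)
print(apply_homography(H, (1800, 820)))  # close to (30.0, 0.0)
```

With exactly four generic pairs the system is exactly determined (up to scale), so the recovered H reproduces each control point; with more pairs, the same SVD gives the least-squares fit.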
Collecting Ground Control Points
The GCP workflow requires two tools running in sequence.
Step 1: Selecting Map Coordinates
The first tool (gcp_selector_map.py) opens a browser-based map where I click on identifiable features — a road marking, a corner of a crosswalk, a utility cover — and record their geographic coordinates.
The tool uses Folium with multiple tile layer options: OpenStreetMap, Esri World Imagery, and the Geospatial Information Authority of Japan (GSI) aerial photos. For urban scenes in Japan, the GSI orthophoto is particularly useful because it is regularly updated and offers clear ground-level features.
# Aerial orthophoto from Japan's Geospatial Information Authority
folium.TileLayer(
    tiles='https://cyberjapandata.gsi.go.jp/xyz/seamlessphoto/{z}/{x}/{y}.jpg',
    attr='国土地理院',
    name='航空写真 (国土地理院)',
).add_to(m)
Clicking on the map adds a numbered marker and records the WGS84 coordinates. Once four or more points are selected, the tool exports a gcp_config.json:
{
  "gcp_wgs84": [
    {"lat": 35.68423, "lon": 139.76531},
    {"lat": 35.68418, "lon": 139.76547}
  ],
  "gcp_image": []
}
Step 2: Selecting Video Frame Coordinates
The second tool (gcp_selector_video.py) opens a frame from the same video and displays it in an OpenCV window. I click on the same physical features in the same order as I did on the map.
cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame)
ret, frame = cap.read()
Using a stable, early frame from the video (default: 10 seconds in) avoids motion blur and ensures that the scene features are visible. Right-click removes the last point, Enter confirms. The tool writes the pixel coordinates back into gcp_config.json:
{
  "gcp_wgs84": [
    {"lat": 35.68423, "lon": 139.76531},
    ...
  ],
  "gcp_image": [
    {"x": 412, "y": 631},
    ...
  ]
}
The matching order matters: point 1 in gcp_wgs84 must correspond to point 1 in gcp_image.
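A small sanity check before computing H catches the most common mistakes: mismatched list lengths or too few pairs. This is an illustrative helper, not one of the project's tools:

```python
import json

def validate_gcp_config(config: dict) -> None:
    """Check that the two GCP lists pair up and meet the minimum count."""
    wgs84 = config.get("gcp_wgs84", [])
    image = config.get("gcp_image", [])
    if len(wgs84) != len(image):
        raise ValueError(
            f"GCP count mismatch: {len(wgs84)} map points vs {len(image)} image points"
        )
    if len(wgs84) < 4:
        raise ValueError("At least four GCP pairs are required for a homography")

config = json.loads("""
{
  "gcp_wgs84": [
    {"lat": 35.68423, "lon": 139.76531}, {"lat": 35.68418, "lon": 139.76547},
    {"lat": 35.68430, "lon": 139.76550}, {"lat": 35.68428, "lon": 139.76528}
  ],
  "gcp_image": [
    {"x": 412, "y": 631}, {"x": 1280, "y": 655},
    {"x": 1310, "y": 402}, {"x": 380, "y": 415}
  ]
}
""")
validate_gcp_config(config)  # passes silently for a well-formed config
```

Note that a length check cannot catch a wrong pairing order; only careful clicking (or inspecting the reprojection error later) can.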
Computing the Homography Matrix
With the GCP pairs collected, project_v2wgs84.py computes the homography using OpenCV's findHomography with RANSAC:
self.H, _ = cv2.findHomography(
    self.src_points,   # image coordinates (N, 2)
    self.dst_points,   # WGS84 coordinates (N, 2) as [lon, lat]
    cv2.RANSAC,
    ransacReprojThreshold=3.0
)
RANSAC is used here because a few GCPs may be slightly misclicked. The algorithm fits the best homography to the inlier set and discards outliers, making the result more robust than a direct least-squares fit.
Note that with exactly four GCPs — the minimum required to determine H — RANSAC has no effect: all four points are consumed by the initial sample, leaving none to evaluate as inliers or outliers. To benefit from outlier rejection, provide at least five GCPs, and preferably eight or more.
After computing H, the reprojection error is calculated by transforming each source GCP back through H and measuring the distance to the target:
projected = cv2.perspectiveTransform(src_reshaped, self.H)
errors = np.sqrt(np.sum((projected - self.dst_points) ** 2, axis=1))
self.reprojection_error = float(np.mean(errors))
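One subtlety: because dst_points are in degrees, the mean error comes out in degrees, which is hard to interpret directly. A rough conversion to meters helps (illustrative helper, not part of the project's code; the ~111,320 m per degree of latitude is an approximation, and longitude degrees shrink by the cosine of the latitude):

```python
import math

def deg_to_m(dlon: float, dlat: float, lat_deg: float) -> float:
    """Convert a (dlon, dlat) error in degrees to meters at a given latitude."""
    m_per_deg_lat = 111_320.0  # approximate meters per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat_deg))
    return math.hypot(dlon * m_per_deg_lon, dlat * m_per_deg_lat)

# Example: an error of ~5e-6 degrees in each axis near Shinbashi (lat ~35.67)
# corresponds to well under a meter on the ground
print(round(deg_to_m(5e-6, 5e-6, 35.67), 2))
```

As a rule of thumb, a mean error under a meter at this latitude means the GCPs were clicked consistently; a multi-meter error usually indicates a misclicked or mis-ordered pair.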
Transforming Trajectories
The transform_trajectory method applies H to each point in a track's pixel-space trajectory:
def transform_points(self, points: np.ndarray) -> np.ndarray:
    pts = points.reshape(-1, 1, 2).astype(np.float64)
    transformed = cv2.perspectiveTransform(pts, self.H)
    return transformed.reshape(-1, 2)
Before transformation, points outside the convex hull of the GCPs are discarded. The homography is only well-defined within the region spanned by the control points; extrapolation beyond the convex hull produces increasingly distorted results.
def is_in_valid_region(self, x: float, y: float) -> bool:
    hull = cv2.convexHull(self.src_points.astype(np.float32))
    return cv2.pointPolygonTest(hull, (float(x), float(y)), False) >= 0
Trajectory Simplification
The raw trajectory from the tracker contains a point for every frame — typically 30 points per second. For visualization and downstream analysis, this is far more detail than needed. When drawing in MapLibre, dense point sequences slow down rendering without adding visual clarity. I simplify by keeping only points where something meaningful changes: a direction shift or a speed change.
# Keep a point if:
# - direction changed by more than angle_threshold_deg (default: 10°)
# - speed changed by more than speed_ratio_threshold (default: 30%)
if angle > angle_threshold_deg or speed_change > speed_ratio_threshold:
    result.append(curr)
This preserves turning points, stops, and accelerations while dropping the redundant intermediate frames during straight, constant-speed walking. A 60-second trajectory of ~1,800 frames typically reduces to 20–80 representative points.
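The logic above can be sketched as a standalone function. This is a minimal reimplementation of the idea, not the project's exact code; the (x, y, t) input format and the handling of the endpoints are assumptions:

```python
import math

def simplify(points, angle_threshold_deg=10.0, speed_ratio_threshold=0.3):
    """Keep only points where heading or speed changes meaningfully.

    `points` is a list of (x, y, t) tuples; the first and last points
    are always kept. Thresholds mirror the defaults described above.
    """
    if len(points) <= 2:
        return list(points)
    result = [points[0]]
    for i in range(1, len(points) - 1):
        (x0, y0, t0), (x1, y1, t1), (x2, y2, t2) = points[i - 1], points[i], points[i + 1]
        vin, vout = (x1 - x0, y1 - y0), (x2 - x1, y2 - y1)
        s_in = math.hypot(*vin) / max(t1 - t0, 1e-9)
        s_out = math.hypot(*vout) / max(t2 - t1, 1e-9)
        # Turning angle between incoming and outgoing segments, wrapped to ±180°
        turn = math.atan2(vout[1], vout[0]) - math.atan2(vin[1], vin[0])
        angle = abs(math.degrees((turn + math.pi) % (2 * math.pi) - math.pi))
        speed_change = abs(s_out - s_in) / max(s_in, 1e-9)
        if angle > angle_threshold_deg or speed_change > speed_ratio_threshold:
            result.append(points[i])
    result.append(points[-1])
    return result

# A constant-speed walk at 30 fps with one 90° turn: only the endpoints
# and the points around the corner survive
walk = [(i, 0.0, i / 30) for i in range(30)] + [(29, j, 1.0 + j / 30) for j in range(1, 30)]
print(len(walk), "->", len(simplify(walk)))
```

Everything on the straight, constant-speed stretches is discarded, which is exactly the behavior the thresholds are tuned for.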
The output for each track gains a trajectory_wgs84 field alongside the original pixel-space trajectory:
{
  "id": 7,
  "trajectory_wgs84": [
    {"lon": 139.76531, "lat": 35.68423, "frame": 120, "time_sec": 4.0},
    {"lon": 139.76545, "lat": 35.68419, "frame": 155, "time_sec": 5.17}
  ],
  "geometry_wgs84": {
    "type": "LineString",
    "coordinates": [[139.76531, 35.68423], [139.76545, 35.68419]]
  }
}
Rendering in MapLibre GL JS
The final step is generate_viewer.py, which reads the WGS84 JSON and generates a self-contained HTML file. The viewer uses MapLibre GL JS with an OpenStreetMap base layer.
Map Setup
const map = new maplibregl.Map({
  container: 'map',
  style: {
    version: 8,
    sources: {
      osm: {
        type: 'raster',
        tiles: ['https://tile.openstreetmap.org/{z}/{x}/{y}.png'],
        tileSize: 256,
      }
    },
    layers: [{ id: 'osm', type: 'raster', source: 'osm' }]
  },
  center: [center_lon, center_lat],
  zoom: 19
});
Two-Layer Rendering Strategy
The viewer maintains two GeoJSON sources:
- Background (bg): Full path of each active track, drawn at low opacity (0.15). Updated only when tracks enter or leave the scene.
- Trail (trail): A sliding window of the recent path for each active track, updated every animation frame.
// Background: full path at low opacity
map.addLayer({
  id: 'bg', type: 'line', source: 'bg',
  paint: { 'line-color': ['get', 'color'], 'line-width': 2, 'line-opacity': 0.15 }
});

// Trail glow + line
map.addLayer({
  id: 'trail-glow', type: 'line', source: 'trail',
  paint: { 'line-color': ['get', 'color'], 'line-width': 8, 'line-blur': 6, 'line-opacity': 0.3 }
});
map.addLayer({
  id: 'trail-line', type: 'line', source: 'trail',
  paint: { 'line-color': ['get', 'color'], 'line-width': 3 }
});
Each track gets a unique color computed with the golden-angle hue distribution, which maximizes perceptual separation between adjacent track indices.
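The viewer computes these colors in JavaScript; the same scheme can be sketched in Python. The saturation and value settings here are illustrative choices, not the viewer's exact parameters:

```python
import colorsys

# 360° × (1 − 1/φ): successive hues land maximally far apart on the wheel
GOLDEN_ANGLE_DEG = 137.50776

def track_color(index: int) -> str:
    """Map a track index to a hex color via golden-angle hue stepping."""
    hue = (index * GOLDEN_ANGLE_DEG % 360.0) / 360.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.8, 0.95)
    return "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))

# Adjacent indices get visually distinct colors
print([track_color(i) for i in range(4)])
```

Because the golden angle is irrational relative to the full circle, the hues never repeat or cluster no matter how many tracks appear.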
Animation Loop
The animation loop advances a currentTime variable and slices each track's coordinates to the active window:
function animate(ts) {
  currentTime += (ts - lastTs) / 1000 * playbackSpeed;
  lastTs = ts;
  const trailStart = currentTime - trailSec;
  const trailFeatures = [];
  for (let i = 0; i < TRACKS.length; i++) {
    const track = TRACKS[i];
    const isActive = currentTime >= track.start && trailStart <= track.end;
    if (!isActive) continue;
    // Binary search for the relevant coordinate slice
    const coords = track.coords;
    const headIdx = upperIdx(coords, currentTime);
    const tailIdx = lowerIdx(coords, trailStart);
    const slice = coords.slice(tailIdx, headIdx + 1).map(([lon, lat]) => [lon, lat]);
    if (slice.length >= 2) {
      trailFeatures.push({ type: 'Feature', ... });
    }
  }
  map.getSource('trail').setData({ type: 'FeatureCollection', features: trailFeatures });
  requestAnimationFrame(animate);
}
Binary search on the time axis avoids scanning all coordinates each frame, keeping the animation smooth even with hundreds of simultaneous tracks.
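The viewer implements upperIdx/lowerIdx in JavaScript; the same idea is standard library territory in Python with `bisect` (the timestamps and window length below are illustrative):

```python
from bisect import bisect_left, bisect_right

# Timestamps of a track's simplified points (seconds, sorted ascending)
times = [0.0, 1.2, 2.5, 4.0, 5.17, 7.3, 9.0]

def trail_slice(times, current_time, trail_sec):
    """Indices (tail, head) covering the last `trail_sec` seconds."""
    head = bisect_right(times, current_time) - 1          # last point at or before now
    tail = bisect_left(times, current_time - trail_sec)   # first point in the window
    return tail, head

tail, head = trail_slice(times, 5.0, 3.0)
print(times[tail:head + 1])  # the points between t=2.0 and t=5.0
```

Each lookup is O(log n) instead of O(n), which is what keeps per-frame cost low when many tracks are active.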
The viewer includes a timeline scrubber, adjustable playback speed (1×–120×), and a configurable trail window length (1–30 seconds). The generated HTML is fully self-contained — no server required, just open in a browser.
Results
Running the full pipeline on the Shinbashi station video produces a self-contained HTML viewer. GCP collection takes around 10 minutes per video, split between map selection and video frame annotation.
The geographic alignment looks accurate at zoom level 19 — pedestrians appear to walk along the correct sidewalk, following the actual street geometry. The animation runs smoothly in the browser even with multiple tracks active simultaneously.
Trajectory simplification reduces the point count significantly, making the viewer responsive without any noticeable loss of visual detail.
Interpretation
Plotting the first video revealed an immediate artifact: trajectories near the camera were badly distorted, jittering up and down and swinging left and right rather than tracing a clean path. This is a known limitation of ground-level homography — people walking close to the camera lens experience the most perspective distortion, and the foot-point approximation breaks down when the viewing angle is steep.
The rest of the scene told a clearer story. Almost all trajectories moved in one direction: left to right across the frame, from the street toward the station entrance. This was an evening shoot, and the pattern reflects the commuter flow — people heading home from work. There were no trajectories crossing perpendicular to this flow.
To reduce the near-camera distortion in future shoots, I plan to mount the camera higher on a tripod, which flattens the viewing angle and extends the reliable region closer to the lens.
Discussion
Limitations of Homography
A homography assumes the scene is a flat plane. For a ground-level camera looking across a street, this is a reasonable approximation for people walking on flat pavement. It breaks down for:
- Tall objects: A person's head and feet are at different heights, so their bounding box center (used as the tracking point) introduces a height-dependent offset. Using the foot point of the bounding box instead of the centroid reduces this effect.
- Sloped terrain: Hills or stairs invalidate the planar assumption.
- Extrapolation beyond the GCPs: Reprojection error grows at points far from the GCP convex hull. The is_in_valid_region filter mitigates this by discarding trajectory points outside the reliable area.
For this project, I chose the foot point of the bounding box as the tracking coordinate, which approximates the actual ground contact point and keeps the homography error small for typical pedestrian heights.
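The foot point is simply the bottom-center of the bounding box. A minimal sketch, assuming boxes come as (x1, y1, x2, y2) with the image y-axis pointing down:

```python
def foot_point(box):
    """Bottom-center of a bounding box (x1, y1, x2, y2), image y-axis down.

    Approximates the pedestrian's ground contact point, which is the only
    point a planar homography can map correctly.
    """
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)

print(foot_point((400, 250, 460, 630)))  # (430.0, 630)
```

Using the box centroid instead would place a ~1.7 m tall pedestrian roughly half their height above the ground plane, which the homography would then project to the wrong spot on the map.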
GCP Selection Quality
The quality of the homography depends heavily on GCP selection. Good practices:
- Use features clearly visible in both the orthophoto and the video frame
- Distribute points across the full area of interest (not clustered in one corner)
- Avoid features that might differ between the photo and video (temporary markings, vehicles)
- Use more than four points and let RANSAC handle any mismatches
In practice, six to eight well-distributed GCPs produce significantly better results than the minimum four.
Simplification Trade-offs
The angle/speed-based simplification trades off detail for compactness. A threshold of 10° for direction changes and 30% for speed changes preserves the essential shape of each trajectory. Setting thresholds too high risks losing turns; too low retains excessive redundancy. For the information-geometry analysis in later articles, the full pixel-space trajectory (not the simplified WGS84 version) is used to preserve accurate timing information.
Conclusion
By combining GCP-based homography calibration with MapLibre GL JS rendering, I can place pedestrian trajectories extracted from street video onto an interactive geographic map. The pipeline — collect GCPs, compute H, transform coordinates, generate viewer — takes a few minutes per video and produces a self-contained HTML file that runs in any browser.
Bridge
Watching the trajectories animate on the map gives an intuitive picture of pedestrian flow — but comparing across videos or stations requires something more rigorous than visual inspection.
The next step is to construct a distribution from the trajectories in each video, representing the collective movement pattern as a probability distribution over features such as speed, direction, and acceleration. Each video then becomes a single point in a statistical manifold, where the geometry encodes how different two movement patterns are from each other. In the next article, I extract those features and begin constructing that manifold.

