DEV Community: Toki Hirose

5. Comparing Pedestrian Manifolds with OSMnx Spatial Manifolds

Toki Hirose — Thu, 16 Apr 2026 14:42:09 +0000

Note:I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.

Introduction

In previous articles, I detected pedestrian trajectories from video at multiple observation points and extracted features — speed, acceleration, speed skewness, stop ratio, path straightness, and dwell time — to construct statistical distributions. I then converted each station's distributions to natural parameters θ and their duals, the expectation parameters η, forming the statistical manifold M_U. This gave manifold coordinates for each observation point (Shinjuku, Shinbashi, Ginza1, Ginza2, Kamata), and the KL divergence between their e-geodesics and m-geodesics revealed the curvature of M_U.

In this article, I construct a second manifold — M_C — from the street network geometry surrounding each intersection. Where M_U represents the manifold of pedestrian behavior, M_C represents the spatial geometric structure through which pedestrians move. The main analytical question is: what kind of discrepancy arises between these two manifolds?

The Stack as Adjoint Functors

The hypothesis in this article is grounded in Bratton's The Stack: treating the city as a computational platform, does the urban structure ground the dynamic patterns of its users? Specifically, does the edge length, network centrality, and node count of an area determine pedestrian behavior — how fast people walk, their acceleration, their stop ratio?

In information-geometric terms:

M_C is the manifold constructed from the distributional geometry of the street network within 420m (5-minute walk) of each observation point — distributions of edge lengths, circuity, node connectivity, betweenness centrality, and bearing
M_U is the statistical manifold from Article 4, where each station is a point defined by the fitted trajectory feature distributions

The verification method is to check whether the positions of each observation point on M_C and M_U align in their pairwise distance ordering after normalization. If they do, the User Layer is directly shaped by the City Layer. If not, either M_C is under-expressive, or the user behavior reflects influences from layers beyond the City Layer — the Interface Layer or Address Layer in Bratton's terms. To investigate this variation, a Pythagorean decomposition through the Interface Layer intervention is also planned.

The adjoint functor hypothesis: F: M_C → M_U (free functor, predictive) and G: M_U → M_C (forgetful functor, projection). An adjoint relationship means M_C and M_U are not independent — changes in one systematically correspond to changes in the other. Where the correspondence breaks down is precisely where something beyond the City Layer is at work.

Constructing M_C: Street Network Features via OSMnx

Feature Extraction

For each station, I extract the street network within a 420m walking-distance radius (5-minute walk) using OSMnx. Each network edge (street segment) is treated as one observation, giving a distribution of network geometry at each station — parallel to how M_U was built from pedestrian trajectories.

Five features are extracted at the edge/node level:

Feature	Distribution	Parameters
`edge_length`	Log-normal	(s, loc=0, scale)
`circuity_mapped`	Beta	(α, β)
`node_degree`	Gamma	(a, loc=0, scale)
`betweenness_centrality`	Beta	(α, β)
`bearing_rad`	Von-Mises	(κ, μ, scale=1)

In addition, one scalar network-level statistic is retained as a fixed coordinate (not a fitted distribution):

Feature	Meaning
`edge_density_km`	Street km per km² — network density

The circuity of an edge is the ratio of actual edge length to straight-line distance between its endpoints. A value of 1.0 means perfectly straight; higher values indicate detours. This captures how convoluted the street geometry is, independent of segment length. The bearing_rad encodes the dominant orientation of each street segment, fitted to a Von-Mises distribution to capture the angular structure of the street grid.

One preprocessing step is required for bearing_rad. OSMnx represents streets as a directed graph, so an east–west road appears as two edges at approximately 90° and 270°. If all edge bearings are used directly in the range [−π, π], they cancel out and the distribution collapses. To avoid this, a double-angle trick is applied: bearings are first mapped to [0°, 180°) via bearing % 180 to treat the street as undirected, then doubled to [0°, 360°) to restore the circular structure before fitting the Von-Mises distribution.

The dual coordinate conversion follows the same procedure as M_U:

Distribution	θ	η
Log-normal(s, scale)	(m/s², −1/2s²) where m=log(scale)	(m, m²+s²)
Beta(α, β)	(α−1, β−1)	(ψ(α)−ψ(α+β), ψ(β)−ψ(α+β))
Gamma(a, scale)	(a−1, −1/scale)	(ψ(a)+log(scale), a·scale)
Von-Mises(κ, μ)	(κ cos μ, κ sin μ)	(A(κ) cos μ, A(κ) sin μ) where A(κ)=I₁(κ)/I₀(κ)

Total θ/η dimension: 2 (Log-normal) + 2 (Beta) + 2 (Gamma) + 2 (Beta) + 2 (Von-Mises) = 10 dimensions

Plus 1 scalar coordinate: edge_density_km → 11-dimensional M_C point

The manifold point for station s on M_C is:

c_s = (s_len, scale_len, α_circ, β_circ, a_deg, scale_deg, α_btwn, β_btwn, κ·cosμ, κ·sinμ, ρ_edge) ∈ M_C

Scalar features have no distributional interpretation and are excluded from the dually flat structure.

M_C Coordinates

θ-space coordinates (10 distributional dims + 1 scalar):

station	len:m/s²	len:−1/2s²	circ:α−1	circ:β−1	deg:a−1	deg:−1/sc	btwn:α−1	btwn:β−1	bear:κcosμ	bear:κsinμ	edge_density_km
Ginza1	2.1281	−0.4001	−0.7925	1.4501	8.6317	−1.5537	−0.1772	40.8849	−0.0623	−0.0035	177.89
Ginza2	2.1385	−0.4016	−0.7925	1.4401	8.3871	−1.5175	−0.1789	40.9422	−0.0615	−0.0032	178.12
Shinjuku	2.5291	−0.4738	−0.7959	1.3335	4.8897	−1.0181	−0.3197	43.3613	0.1130	0.0216	196.20
Kamata	2.9538	−0.4843	−0.7795	0.9660	5.3845	−1.0664	−0.3520	21.5502	−0.0228	−0.0567	109.69
Shinbashi	2.5988	−0.4740	−0.7842	1.3007	5.9866	−1.1638	−0.3110	30.6136	0.0119	0.0035	157.19

M_C Geometry: Geodesics and Curvature

KL divergence matrix (M_C, asymmetric):

	Ginza1	Ginza2	Shinjuku	Kamata	Shinbashi
Ginza1	0.000	0.000	0.110	0.198	0.054
Ginza2	0.000	0.000	0.103	0.192	0.050
Shinjuku	0.149	0.137	0.000	0.193	0.055
Kamata	0.232	0.224	0.262	0.000	0.084
Shinbashi	0.064	0.058	0.066	0.077	0.000

The dual flat structure is the same as M_U. M_C has 10-dimensional θ/η space (distributional features only):

Log-normal(edge_length): 2 dims
Beta(circuity_mapped): 2 dims
Gamma(node_degree): 2 dims
Beta(betweenness_centrality): 2 dims
Von-Mises(bearing_rad): 2 dims

edge_density_km (1 scalar) is not part of the exponential family structure and is excluded from the geodesic/curvature analysis.

e-geodesic: straight line in θ-space, θ(t) = (1−t)·θ_A + t·θ_B

m-geodesic: straight line in η-space, η(t) = (1−t)·η_A + t·η_B

The KL divergence between e-geodesic and m-geodesic at each t measures the curvature of M_C. The pedestrian manifold M_U had a geodesic gap of approximately 0.047; the spatial manifold M_C is considerably flatter — consistent with the expectation that street network geometry is more uniform across stations than human behavior.

PCA and KL Divergence

The figure shows two aspects of M_C: a PCA projection of the 11-dimensional manifold coordinates, and a graph where edge thickness encodes symmetric KL divergence — thinner edges indicate more similar network geometry.

Several features are immediately visible. Ginza1 and Ginza2 occupy nearly the same position in both views, consistent with their physical proximity: they are adjacent observation points on the same street network. Shinbashi sits near the center of both representations, close in network geometry to all other stations. Kamata is clearly separated, reflecting its lower density and less-connected street structure.

At first glance, the PCA and KL divergence orderings look consistent. However, a difference appears when examining the Ginza–Shinjuku pair. In the PCA projection, these two stations appear fairly distant. But their symmetric KL divergence is not large relative to other pairs — and KL divergence is the geometrically meaningful distance on the information manifold. This illustrates a general limitation: PCA collapses the high-dimensional manifold structure into two dimensions and can distort relative distances. The KL divergence, computed directly in θ-space, preserves the information-geometric distances that PCA may misrepresent.

M_C vs M_U: Pairwise Distance Comparison

To compare M_C and M_U, I compute the symmetric KL divergence between all station pairs in each manifold, then define the adjoint gap per pair:

gap(A, B) = Sym-KL_U(A,B)_norm − Sym-KL_C(A,B)_norm

where both distances are normalized to [0,1] within their respective manifolds. A positive gap means the pair is behaviorally more distinct than spatial structure predicts; a negative gap means the street geometry separates them more than their pedestrian behavior does.

Pair	Sym-KL (M_C)	Sym-KL (M_U)	M_C (norm)	M_U (norm)	adj_gap
Kamata vs Shinbashi	0.1632	2.5508	0.354	1.000	+0.646
Shinjuku vs Shinbashi	0.1241	1.2454	0.269	0.412	+0.143
Ginza1 vs Shinbashi	0.1197	1.1134	0.259	0.352	+0.093
Ginza2 vs Shinbashi	0.1088	1.0101	0.236	0.306	+0.070
Ginza1 vs Ginza2	0.0004	0.3523	0.000	0.009	+0.009
Ginza2 vs Shinjuku	0.2478	0.7865	0.538	0.205	−0.333
Ginza1 vs Shinjuku	0.2666	0.3834	0.579	0.023	−0.556
Ginza2 vs Kamata	0.4167	0.7804	0.905	0.202	−0.703
Shinjuku vs Kamata	0.4605	0.8774	1.000	0.246	−0.754
Ginza1 vs Kamata	0.4311	0.3320	0.936	0.000	−0.936

If the adjoint hypothesis held, the M_C and M_U orderings would correlate. The Spearman rank correlation between the two Sym-KL sequences is ρ = −0.297 (p = 0.405) — not only non-significant, but negative. Pairs that are more spatially distant in M_C tend to be less behaviorally distinct in M_U, not more. This is not a null result; it is a directional finding: spatial separation and behavioral separation are misaligned, and misaligned in a consistent direction across the five stations.

Where the Correspondence Holds — and Where It Breaks

What M_C tells us

On the spatial manifold M_C, Kamata is the most isolated station: its normalized M_C distance to other stations spans 0.354 (vs Shinbashi) to 1.000 (vs Shinjuku). This reflects a structural property of the observation point: unlike the other stations, where the street network crosses the railway easily on multiple sides, the Kamata observation point requires walking approximately 200–300 meters along the tracks before reaching a grade crossing. As a result, the local network is less centrally connected — lower betweenness centrality and lower edge density — than the other stations, all of which are in more uniformly accessible urban cores.

Ginza1 and Ginza2 are the closest pair on M_C (Sym-KL ≈ 0.0004, norm = 0.000). As adjacent observation points on the same street grid, this is expected and serves as a consistency check: the feature extraction and fitting procedure correctly identifies near-identical network environments as near-identical manifold points.

What M_U tells us — and where M_C fails to predict it

On the behavioral manifold M_U, the most isolated station is Shinbashi, not Kamata. The Kamata–Shinbashi pair has the largest symmetric KL divergence in M_U (2.5508, norm = 1.000), and every pair involving Shinbashi sits near the top of the M_U distance ranking. Spatially, Shinbashi is a moderate-density area, geometrically similar to Shinjuku and Ginza — yet behaviorally its pedestrian distributions are the most distinct of all five stations.

Kamata's position on M_U is equally surprising. The Ginza1–Kamata pair has the smallest symmetric KL divergence on M_U (0.3320, norm = 0.000) — meaning that despite being the most spatially isolated station in M_C, Kamata's pedestrian behavior is closest to Ginza1's. The City Layer predicts Kamata to be an outlier; the User Layer places it near the center of the behavioral space.

Ginza1 and Ginza2 are also close in M_U (Sym-KL = 0.3523, norm = 0.009), as expected from their spatial proximity. However, they are not the closest pair in M_U — that is Ginza1 vs Kamata — so spatial adjacency does not fully determine behavioral similarity even for near-identical network environments.

Interpretation

These divergences identify where the adjoint hypothesis breaks down. A large positive adjoint gap (M_U norm − M_C norm >> 0) means the pair is behaviorally more distinct than spatial structure predicts — the City Layer under-explains the difference. The top cases are Kamata–Shinbashi (+0.646), Shinjuku–Shinbashi (+0.143), and the Ginza–Shinbashi pairs. In each case, the street network geometry does not account for the behavioral separation.

Negative gaps (Ginza–Kamata: −0.936; Shinjuku–Kamata: −0.754) mean the City Layer over-predicts the behavioral difference: stations that look geometrically distinct turn out to have similar pedestrian distributions. This suggests that some behavioral features are insensitive to the specific street network differences between central Tokyo and lower-density areas, or that common attractor effects — major transit hubs, commercial flows — produce convergent behavior across dissimilar spatial environments.

Taken together, both directions of mismatch point to the same conclusion: the City Layer alone cannot account for how pedestrians actually move. Physical space sets constraints, but behavior emerges from the full stack of layers that operate within it. In Bratton's framework, the Interface Layer — the signs, messages, and social signals that users encounter — and the Address Layer — the identities and affordances assigned to locations — each contribute independently to behavioral outcomes. The adjoint gap is a geometric measure of how much explanatory work those layers are doing, above and beyond what the street network provides.

One limitation specific to this dataset must be acknowledged here. All M_U measurements in this study were collected during solo standing demonstrations. This means the behavioral distributions captured on M_U already reflect the presence of an Interface Layer perturbation. To isolate the pure City Layer baseline — the M_U point a station would occupy without any intervention — measurements under neutral conditions are needed. Obtaining that baseline is a prerequisite for Series 2: it is the reference point p from which the intervention-induced shift will be measured. The data collection for this is ongoing.

Discussion: Structural Untranslatability

In this article, M_C and M_U have been compared through normalized pairwise distances alone. The two manifolds live in different spaces — M_C is defined over street network feature distributions, M_U over pedestrian trajectory distributions — and no formal mapping between them has been constructed here. The adjoint gap captures the discrepancy between their distance structures, but it does not yet decompose that discrepancy into interpretable geometric terms.

That decomposition is the task of Series 2. There, for each station, a correspondence point q on M_U will be defined as the "spatially predicted" behavioral distribution — the M_U point that the City Layer, via the adjoint functor F: M_C → M_U, would predict from M_C alone. Once that correspondence is established, the Pythagorean identity on the dually flat manifold M_U gives:

D(p ‖ r) = D(p ‖ q) + D(q ‖ r) + ⟨θ_q − θ_r, η_p − η_q⟩

where p is the observed baseline distribution, r is the distribution under intervention, and q is the spatially predicted point. The inner product term vanishes when the path p → q → r is m-orthogonal — that is, when the intervention effect lies entirely within the submanifold predicted by M_C. When it does not vanish, the residual is the component of the behavioral shift that the City Layer cannot account for — the measurable footprint of the Interface Layer.

What this article establishes is the prerequisite: that the adjoint gap is nonzero and structured in a non-trivial way. If M_C perfectly predicted M_U, there would be nothing left for the Interface Layer to explain. The discrepancies documented here — Shinbashi's behavioral isolation, Kamata's unexpected proximity to Ginza in M_U — are precisely the locations where the Series 2 decomposition will be most informative.

The adjoint functor F: M_C → M_U, when eventually operationalized in Series 2, is more naturally interpreted as an m-projection — M_C constrains the expectation parameters η of M_U (observable averages such as mean speed and stop ratio) rather than the natural parameters θ. The choice of projection type will determine how the correspondence point q is computed and, consequently, how the Interface Layer residual is decomposed.

Conclusion

By comparing M_C and M_U, we can see where the correspondence between urban spatial structure and pedestrian behavior holds — and where it breaks down. The dual flat structure of both manifolds makes this comparison geometrically tractable: distances, projections, and residuals all have precise meanings in terms of statistical distinguishability.

The next steps are twofold. First, more observation points will be added to strengthen the comparison. Second, measurements will be conducted both with and without the standing demonstration message, making it possible to observe how the presence of an intervention shifts each station's position on M_U. Since all data in this study was collected while a demonstration was in progress, the baseline — pedestrian behavior without any intervention — has not yet been captured.

So far, this series has examined only the relationship between the City Layer and the User Layer. The next step is to investigate the influence of the Interface Layer directly. The Address Layer is also a candidate influence, but how to operationalize and measure it remains an open question.

Citation

Boeing, G. (2025). Modeling and Analyzing Urban Networks and Amenities with OSMnx. Geographical Analysis 57 (4), 567-577. doi:10.1111/gean.70009
Bratton, B. H. (2015). The Stack: On Software and Sovereignty. MIT Press. ISBN: 9780262029575

4. Constructing a Station-Level Statistical Manifold with Dual Flat Structure from Pedestrian Trajectories

Toki Hirose — Sat, 04 Apr 2026 23:31:15 +0000

Note:I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.

Introduction

In this article, I extend the pedestrian trajectory feature distributions measured in Article 3 to analyze pedestrian trajectory distributions across multiple urban locations. Rather than applying PCA, I construct a manifold where each location's distribution becomes a single coordinate point, and compute KL divergences based on dual flat structure for comparison between stations. By embedding each observation point's information onto the manifold, we can connect how differences in pedestrian behavioral dynamics are influenced by differences in urban spatial structure — a connection developed further in the next article.

The key idea is to treat individual pedestrian trajectory observations not as isolated events, but as distributions where each observation point becomes a single point on the information-geometric manifold. Comparing stations on geodesics with dual flat structure — separating natural parameters from expectation parameters of the pedestrian trajectory distributions — allows us to observe non-linear behavioral differences in a geometrically precise way.

Motivation

At the end of last year, I read Spivak's "Can the Subaltern Speak?" I felt that the method of searching for where silence lies is similar to acoustic reflection surveying in geophysics. Acoustic reflection surveying sends sound into the ground; where strata boundaries exist, reflection intensity changes. Graphing the vertical intensity changes reveals stratum surfaces as regions of high reflection. Where liquid exists underground, reflections become chaotic rather than coherent.

When ideology is treated as its environment, well-discussed places show clear strata surfaces, while places that go undiscussed become regions that cannot be measured by that observation method. Like acoustic reflection surveying, the purpose of this series is to capture reflection surfaces by applying some action and identify where invisible places are.

Around the same time, I encountered information geometry. KL divergence measures the asymmetric difference between two distributions. I thought that invisible differences between places — the kind that symmetric metrics erase — might emerge precisely in that asymmetry.

Methodological Foundation

Why Information Geometry Over PCA?

PCA's essential operation is dimensionality reduction — discarding data. It treats observations as points in Euclidean space and projects onto directions of maximum variance, focusing on the observations themselves and summarizing inter-indicator relationships linearly. Euclidean distance between parameters does not correspond to "statistical distinguishability."

Information geometry addresses this through its dual flat structure:

Distributions as manifold points: Entire distributions are the unit of analysis, not individual observations
Fisher metric: The unique invariant metric on statistical manifolds, measuring distance as statistical distinguishability — how easily two distributions can be separated from data
Dual structure (e-connection / m-connection): Naturally separates observational indicators from distributional parameters:
- e-connection: captures changes in natural parameters (generative mechanisms)
- m-connection: captures changes in expectation parameters (observable statistics)
- The same observed change may reflect different magnitudes of change in the generative mechanism depending on the location on the manifold — a non-symmetry PCA cannot capture in principle

Station as a Point on the Manifold

Data was collected at five locations: Ginza 1-chome, Ginza 2-chome, Shinjuku, Kamata, and Shinbashi.

point_name	tracks	lat	lon	point attribution
Ginza1	618	35.67380	139.76772	Shopping / Tourism
Ginza2	517	35.67385	139.76775	Shopping / Tourism
Shinjuku	989	35.69183	139.70259	Large-scale commercial / Transit hub
Kamata	745	35.56262	139.71545	Commercial / Transit hub
Shinbashi	1106	35.66575	139.75797	Business

The point attribution labels indicate the urban function and regional character of each location. "Shopping/Tourism" suggests pedestrian influx is primarily for sightseeing and shopping; "Business" indicates a high proportion of commuting and work-related use.

Shinjuku, as a major terminal station, is a complex point where commercial and transit functions overlap. Even sharing the "Shopping/Tourism" label with Ginza, Shinjuku is expected to show greater diversity in travel purpose and speed distribution, with more pronounced mixing of stop and through-traffic behavior. Shinjuku's manifold point likely reflects broader distributions and more mixed traffic behavior compared to Ginza. Each station's point on the manifold is determined by its trajectory feature distribution parameters; point attribution functions as an "urban context" tag attached to that point.

Note: Ginza 1 and 2 are at adjacent intersections and overlap at this map scale.

Each station is represented by fitting the same 8-feature distribution schema established in Article 3:

Feature	Distribution	Parameters
`real_speed_mean`	Normal	(μ, σ)
`real_speed_cv`	Log-normal	(s, loc, scale)
`real_accel_abs_mean`	Half-normal	(loc, σ)
`stop_ratio`	Beta	(α, β)
`real_straightness`	Beta	(α, β)
`speed_skew`	Gamma	(a, loc, scale)
`decel_ratio`	Beta	(α, β)
`duration_sec`	Log-normal	(s, loc, scale)

The manifold point for station s is the concatenated vector of all fitted parameters:

p_s = (μ_speed, σ_speed, s_cv, …, s_dur, scale_dur) ∈ M_U

As a concrete example, the manifold coordinates for Shinjuku are:

Feature	Distribution	Fitted Parameters
`real_speed_mean`	Normal	μ=1.4625, σ=0.5319
`real_speed_cv`	Log-normal	s=0.3845, loc=0, scale=0.8025
`real_accel_abs_mean`	Half-normal	loc=0, σ=28.2603
`stop_ratio`	Beta	α=0.7074, β=7.8578
`real_straightness`	Beta	α=0.8339, β=0.7307
`speed_skew`	Gamma	a=1.861, loc=0, scale=1.1513
`decel_ratio`	Beta	α=31.5501, β=31.492
`duration_sec`	Log-normal	s=0.8659, loc=0, scale=2.1172

This parameter vector defines Shinjuku's single point on M_U. The same procedure is applied to all five stations to populate the manifold.

Dual Flat Structure

M_U is a dually flat manifold because each feature distribution belongs to an exponential family. The product of independent exponential families inherits this structure with:

Natural parameters θ: Canonical exponential family parametrization
Expectation parameters η: E[T(X)] where T(X) are sufficient statistics
Related by Legendre transform: η = ∇_θ ψ(θ) where ψ is the log-partition function

For each distribution family:

Normal(μ,σ): θ = (μ/σ², −1/2σ²), η = (μ, μ²+σ²)
Log-normal(s,scale): θ = (m/s², −1/2s²) where m=log(scale), η = (m, m²+s²)
Half-normal(σ): θ = −1/2σ², η = σ²
Gamma(a,scale): θ = (a−1, −1/scale), η = (ψ(a)−log(1/scale), a·scale)
Beta(α,β): θ = (α−1, β−1), η = (ψ(α)−ψ(α+β), ψ(β)−ψ(α+β))

For example, applying real_speed_mean (Normal distribution) to the dual flat structure:

Natural parameters θ
- θ₁ = μ/σ²: the mean weighted by precision. If pedestrian speeds are highly variable (large σ²), this value is small — a weak signal. If everyone walks at nearly the same speed, it is large — a strong signal.
- θ₂ = −1/2σ²: encodes the precision of the distribution. A large spread yields a small (more negative) value; a broad distribution yields a value close to zero.
Expectation parameters η
- η₁ = μ: simply the mean walking speed — directly readable from the data.
- η₂ = μ² + σ²: the second moment, encoding both the mean and the spread of speeds.

Applying these conversions to Shinjuku's fitted parameters yields a 15-dimensional coordinate vector in each system (θ-dim = η-dim = 15, one dimension per sufficient statistic across all 8 features):

θ (Shinjuku): [ 5.1697e+00 -1.7674e+00 -1.4882e+00 -3.3827e+00 -6.0000e-04
               -2.9260e-01  6.8578e+00 -1.6610e-01 -2.6930e-01  8.6100e-01
               -8.6860e-01  3.0550e+01  3.0492e+01  1.0005e+00 -6.6690e-01]

η (Shinjuku): [ 1.4625e+00  2.4218e+00 -2.2000e-01  1.9620e-01  7.9864e+02
               -3.2875e+00 -9.1700e-02 -9.8470e-01 -1.2312e+00  4.6990e-01
                2.1425e+00 -7.0020e-01 -7.0210e-01  7.5010e-01  1.3124e+00]

These two vectors are the dual coordinates of Shinjuku's point on M_U. The θ-coordinates encode the generative mechanism (natural parameters), while the η-coordinates encode the observable statistics (expectation parameters). Their relationship via the Legendre transform is what makes geodesic and divergence calculations tractable.

Fisher Information Metric

The Fisher information matrix G for M_U is block diagonal due to feature independence:
G = diag(G₁, G₂, …, G₈)

Each block G_i is computed analytically for the corresponding distribution family.

Results

Manifold Construction

Applied the schema to 5 JRE Line stations with available trajectory data. Each station's manifold coordinates were computed by fitting distributions to trajectory features extracted as in Article 3.

KL Divergence Between Stations

Using the block diagonal structure, KL divergence between stations p and q is:
D_KL(p ‖ q) = Σᵢ D_KL(pᵢ ‖ qᵢ)

For example, for a Normal-distributed feature:

D_KL(𝒩(μ₁,σ₁) ‖ 𝒩(μ₂,σ₂)) = log(σ₂/σ₁) + (σ₁² + (μ₁−μ₂)²) / (2σ₂²) − 1/2

Each distribution family (Log-normal, Half-normal, Gamma, Beta) has its own closed-form expression, computed analytically in the same way.

The divergence matrix shows clear behavioral differences between stations, with some pairs showing much higher divergence than others.

	Ginza1	Ginza2	Shinjuku	Kamata	Shinbashi
Ginza1	0.000	0.200	0.175	0.165	0.665
Ginza2	0.152	0.000	0.319	0.341	0.521
Shinjuku	0.208	0.467	0.000	0.453	0.740
Kamata	0.167	0.439	0.424	0.000	1.566
Shinbashi	0.448	0.490	0.505	0.984	0.000

Note that D(p‖q) ≠ D(q‖p) — KL divergence is asymmetric. For example, D(Kamata‖Shinbashi) = 1.566 while D(Shinbashi‖Kamata) = 0.984. Shinbashi shows the highest divergence from all other stations, suggesting it occupies a behaviorally distinct region of M_U.

This figure shows how much KL divergence arises per feature for each station pair. Row 1: D(p‖q) — forward direction. Row 2: D(q‖p) — reverse direction. Row 3: D(p‖q) − D(q‖p) — asymmetry (with zero line). The third row in particular reveals which features are driving the KL asymmetry.

e-Geodesics and m-Geodesics

The dual flat structure enables two types of geodesics between stations:

e-geodesic: Straight line in θ-space (natural parameter space)
m-geodesic: Straight line in η-space (expectation parameter space)

The asymmetry between these paths reveals the curvature of the behavioral manifold. For pairs with high KL divergence, the midpoint of the e-geodesic and m-geodesic can differ significantly, indicating nonlinear relationships between generative mechanisms and observed statistics.

The figure below shows the e-geodesic (left) and m-geodesic (right) between two observation points, along with the KL divergence computed at intermediate points along each path (center). t=0 represents the start point and t=1 the end point. At t=0 and t=1 the KL divergence is zero by definition; it reaches its maximum at t=0.5, the midpoint.

Pair	Sym-KL	Max Div	Mean Div	Nonlinearity
Kamata↔Shinbashi	1.27542	0.04743	0.02420	High
Shinjuku↔Shinbashi	0.62270	0.01401	0.00695	Medium
Ginza1↔Shinbashi	0.55669	0.01352	0.00672	Medium
Ginza2↔Shinjuku	0.39324	0.00896	0.00446	Medium
Ginza2↔Shinbashi	0.50506	0.00890	0.00442	Medium
Ginza2↔Kamata	0.39019	0.00546	0.00269	Low
Shinjuku↔Kamata	0.43869	0.00372	0.00182	Low
Ginza1↔Ginza2	0.17615	0.00202	0.00099	Low
Ginza1↔Shinjuku	0.19169	0.00178	0.00087	Low
Ginza1↔Kamata	0.16602	0.00081	0.00040	Low

Max Div: max geodesic gap
Mean Div: mean geodesic gap

The maximum geodesic gap is 0.047, which is small — indicating that this manifold is relatively flat. The e-geodesic and m-geodesic paths can be considered approximately identical.

PCA Distance vs Sym-KL Divergence

Image 5 illustrates the discrepancy between PCA and KL divergence. Each point represents an observation station projected onto two PCA components from the 15-dimensional θ feature space. The edges between points encode Sym-KL divergence: thicker edges indicate greater distributional difference. Notably, pairs such as Ginza1 and Shinjuku appear far apart in PCA space despite having one of the smaller KL divergences — a clear demonstration of how the Euclidean and Fisher metrics can produce conflicting orderings.

Pair	PCA dist	Sym-KL	PCA dist (norm)	Sym-KL (norm)	diff (KL - PCA)
Kamata vs Shinbashi	7.272	1.275	0.899	1.000	0.101
Shinjuku vs Shinbashi	7.907	0.623	1.000	0.412	-0.588
Ginza1 vs Shinbashi	6.019	0.557	0.699	0.352	-0.347
Ginza2 vs Shinbashi	3.914	0.505	0.364	0.306	-0.059
Shinjuku vs Kamata	6.168	0.439	0.723	0.246	-0.477
Ginza2 vs Shinjuku	6.910	0.393	0.841	0.205	-0.636
Ginza2 vs Kamata	3.574	0.390	0.310	0.202	-0.108
Ginza1 vs Shinjuku	4.954	0.192	0.530	0.023	-0.507
Ginza1 vs Ginza2	2.811	0.176	0.189	0.009	-0.180
Ginza1 vs Kamata	1.624	0.166	0.000	0.000	0.000

Discussion

PCA Distance vs KL Divergence

Although PCA was used for visualization, the actual analysis uses 15-dimensional θ/η vectors. Comparing normalized PCA distance and Sym-KL on a 0–1 scale reveals different orderings — pairs involving Shinjuku show particularly large discrepancies. This arises from a fundamental difference in what each metric measures: PCA reduces to 2 components and computes Euclidean distance; KL divergence applies the Fisher metric across all 15 components. A large KL divergence means the two distributions are statistically distinguishable with fewer samples — it captures how different the distributions are, not just how far apart their parameter vectors sit in Euclidean space.

The manifold itself is relatively flat (maximum geodesic gap = 0.047), meaning e-geodesic and m-geodesic paths are nearly identical. The discrepancy with PCA is therefore not a curvature effect but a metric effect: information geometry preserves the statistical structure of distributions, while PCA discards it. The dual structure further separates generative parameters from observable statistics — a distinction PCA cannot make. Unlike regression approaches that treat residuals as noise, the adjoint functor framework (developed in Article 5) will interpret residuals as structural untranslatability between urban form and pedestrian behavior.

Asymmetry

D(Kamata‖Shinbashi) > D(Shinbashi‖Kamata) means there are behaviors among Kamata pedestrians that rarely occur at Shinbashi. Features that produce asymmetry are those where one station has behavior patterns that simply don't appear at the other. For example, if stop_ratio asymmetry is large, Kamata pedestrians have a broader (or narrower) tail in their stopping-rate distribution than Shinbashi pedestrians. This asymmetry can represent functionally meaningful urban differences — a property that symmetric distance measures like PCA cannot capture.

Conclusion

By constructing M_U as a statistical manifold, we can quantify pedestrian behavioral differences between stations with geometric precision. The dual flat structure characterizes the geometry of space and enables geometrically meaningful comparison of station distributions. This approach moves beyond simple correlation analysis to detect structural differences in how urban environments shape human movement patterns. The dual structure of M_U will also serve as the foundation for decomposing intervention effects in Series 2.

In the next article, I'll compare this pedestrian manifold with a spatial manifold constructed from OSMnx street network features, establishing the adjoint relationship between urban structure and pedestrian behavior.

Toki Hirose

Apr 16

5. Comparing Pedestrian Manifolds with OSMnx Spatial Manifolds

#analytics #datascience #machinelearning #python

Comments

11 min read

Toki Hirose

Mar 29

3. Exploring Feature Distributions from Pedestrian Trajectories

#statistics #computervision #python #datascience

Comments

10 min read

3. Exploring Feature Distributions from Pedestrian Trajectories

Toki Hirose — Sun, 29 Mar 2026 05:07:50 +0000

Motivation

In this article, I explore the statistical distributions of features extracted from pedestrian trajectories. Although the tracking accuracy still has room for improvement, analyzing the distributions of speed and related features allows me to characterize average pedestrian behavior from real video data. I have recently been studying information geometry, and I want to investigate how pedestrian behavior changes across locations by modeling it as probability distributions. By constructing a statistical manifold from these distributions across multiple locations, I aim to capture differences between pedestrian populations that conventional clustering methods such as k-means — which rely on Euclidean distance — cannot detect.

Note

I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.

Introduction

In Article 1, I detected pedestrians in video footage and visualized their trajectories. In Article 2, I projected those trajectories onto a map using homography. Known limitations remain — detection accuracy and calibration point selection for the homography transform — but these are out of scope for this article and will be addressed in a future iteration. In this article, I analyze the speed and behavioral distributions of the detected pedestrians.

The data used in this analysis was collected at the Ginza pedestrian zone (歩行者天国) in Tokyo.

Feature Extraction

Each pedestrian trajectory stores a timestamp, pixel coordinates within the video frame, and the height of the detected bounding box. From a fixed camera, this is sufficient to measure speed, acceleration, dwell time, and several behavioral indicators.

Pixel to Meter Conversion

All trajectory measurements are initially in pixel units. To convert to real-world distances, I used the bounding box height as a depth proxy. For each track at each frame, the average bounding box height between consecutive frames is computed and used to derive a scale factor based on the assumed average human height of 1.7 m:

scale [m/px] = H_REAL / avg_bbox_height_px
real_dist [m] = pixel_dist [px] × scale

This approach effectively performs implicit monocular depth estimation using the known height of a pedestrian as a reference object.

Speed

Per-step real-world speed was computed by dividing the depth-normalized displacement between frames by the elapsed time. For each track, I extracted:

real_speed_mean: mean walking speed
real_speed_cv: coefficient of variation (std / mean), which normalizes variability relative to the mean and indicates how much a single trajectory fluctuates in pace

Acceleration

Acceleration was computed as the finite difference of per-step speed divided by elapsed time. I extracted the mean of signed acceleration (real_accel_mean). Since acceleration can be positive or negative, the signed mean captures whether there is a net directional bias across the trajectory.

I also computed decel_ratio — the fraction of acceleration steps where the value is negative (i.e., the pedestrian is decelerating). A value near 0.5 indicates balanced acceleration and deceleration (typical steady walking); values above 0.5 indicate dominant braking behavior. While real_accel_abs_mean captures the intensity of speed change, decel_ratio captures its frequency, making them complementary indicators.

Speed Skewness

For each trajectory, I computed the skewness of the per-step speed distribution (speed_skew). A value near zero indicates a symmetric speed profile — steady walking. A negative value indicates the presence of a slow phase within the trajectory (hesitation or stopping). A positive value indicates a fast phase (e.g., hurrying after a pause). This captures the temporal asymmetry of the speed profile in a single scalar, which mean and variance alone cannot express.

Stop Ratio

I computed stop_ratio as the fraction of steps where real-world speed falls below 0.3 m/s. A higher stop ratio indicates that the pedestrian spent more time engaging with the environment — reading a sign, looking at a shop window, or reacting to a stimulus.

Path Straightness

real_straightness is the ratio of the depth-normalized displacement from start to end point to the total real path length (range: 0–1). Values near 1 indicate a nearly straight trajectory; values near 0 indicate winding, reversal, or wandering. Both displacement and total path length are depth-normalized using the bounding box height at the respective trajectory points, correcting for the variation in pixel scale with camera distance.

Dwell Time

duration_sec is the observation duration of each track, computed as the number of frames from first to last detection multiplied by the frame interval.

Distribution Analysis

I plotted the frequency distribution of each feature across all tracks and fitted candidate probability distributions using the Kolmogorov-Smirnov (KS) test.

The KS statistic D measures the maximum difference between the empirical CDF of the data and the theoretical CDF of a candidate distribution. A smaller D indicates a better fit. The corresponding p-value represents the probability of observing a deviation as large as D by chance under the null hypothesis; a higher p-value indicates a better fit. When comparing multiple candidate distributions, I selected the one with the smallest KS statistic (breaking ties by p-value).

Candidates were assigned based on the support of each feature:

Positive-valued: Normal, Log-normal, Half-normal, Gamma, Exponential
Bounded [0, 1]: Beta, Uniform, Normal, Log-normal
Full real line: Normal, Laplace, t(df=5), Cauchy

Feature	Unit	Mean	Std	Best Distribution	KS	p-value	Accept (p>0.05)
`real_speed_mean`	m/s	1.360	0.420	Gamma	0.037	0.350	✓
`real_speed_cv`	—	0.894	0.383	Log-normal	0.061	0.019	✗
`real_accel_mean`	m/s²	0.026	5.362	Cauchy	0.034	0.474	✓
`stop_ratio`	—	0.085	0.089	Beta	0.118	0.000	✗
`speed_skew`	—	2.460	1.866	Gamma	0.031	0.591	✓
`real_straightness`	—	0.484	0.268	Beta	0.031	0.466	✓
`decel_ratio`	—	0.499	0.054	Normal	0.067	0.008	✗
`duration_sec`	s	3.910	3.076	Gamma	0.045	0.161	✓

Results

The eight features selected for downstream analysis and their associated distribution families are as follows:

Feature	Best Distribution	Accept (p>0.05)	Rationale
`real_speed_mean`	Gamma	✓	Best fit for walking speed distribution
`real_speed_cv`	Log-normal	✗	Lowest KS among all candidates; captures individual variation in pace
`real_accel_mean`	Cauchy	✓	Lowest KS; heavy tails reflect rare but strong braking and surging events
`stop_ratio`	Beta	✗	Lowest KS; bounded in [0,1]; sensitive indicator for behavioral intervention
`speed_skew`	Gamma	✓	Lowest KS; captures temporal asymmetry in the speed profile
`real_straightness`	Beta	✓	Lowest KS; bounded in [0,1]; captures path linearity
`decel_ratio`	Normal	✗	Lowest KS; captures deceleration frequency
`duration_sec`	Gamma	✓	Best fit for observation duration

For features where the KS test was rejected (p < 0.05), I adopted the distribution with the lowest KS statistic as the working approximation. The reasoning is described in the Discussion.

Interpretation

real_speed_mean

Why this distribution?
Speed is non-negative and right-skewed: most pedestrians walk at a typical pace, with a tail of faster walkers. The Gamma distribution naturally accommodates a lower bound at zero and a right tail, making it a better fit than the Normal distribution, which would assign probability mass to negative values.

What the data reveals
The analysis reveals that the mean walking speed is 1.36 m/s, consistent with typical pedestrian speeds in a busy shopping district like Ginza. Speeds above 2.5 m/s are present but rare, representing pedestrians in a hurry.

real_speed_cv

Why this distribution?
The coefficient of variation is a positive quantity representing normalized speed variability within a trajectory. In a dense urban environment like Ginza, pedestrians frequently adjust pace due to crowds, direction changes, and window shopping. Tracking ID switches — where the system reassigns an ID between two different individuals — can also cause artificial speed jumps. Both effects contribute to a right-skewed distribution with a heavy tail, which the Log-normal distribution captures well.

What the data reveals
The mean CV of 0.89 indicates that most pedestrians show substantial within-trajectory speed variation. This is expected given the high foot traffic and the current limitations of the CentroidTracker-based tracking system.

real_accel_mean

Why this distribution?
Signed mean acceleration can be positive or negative, so a bell-shaped distribution centered near zero is expected. In a pedestrian zone without traffic signals, pedestrians are not forced to stop — sharp decelerations more likely reflect voluntary behavior such as entering a shop or looking at a sign. The Cauchy distribution captures the same bell shape as the Normal but with much heavier tails, consistent with the occasional extreme acceleration events observed.

What the data reveals
The distribution is centered very close to zero (mean = 0.026 m/s²), confirming that pedestrians neither systematically accelerate nor decelerate over their trajectories. The heavy Cauchy tails reflect rare but strong braking or surging events.

stop_ratio

Why this distribution?
stop_ratio is bounded in [0, 1] and represents the fraction of steps below a speed threshold within a trajectory. The Beta distribution is the natural choice for proportions on this interval.

What the data reveals
The distribution peaks near zero and 0.05, indicating that most pedestrians rarely drop below 0.3 m/s. However, the long right tail — reaching up to approximately 0.4 — confirms that a meaningful minority paused significantly, likely engaging with shop displays or other environmental stimuli.

real_straightness

Why this distribution?
real_straightness is bounded in [0, 1], making the Beta distribution the natural candidate. In a shopping district, pedestrians follow moderately direct paths but with meaningful deviation — neither hugging a straight line nor wandering randomly. This intermediate behavior, away from both boundary extremes, is well captured by a Beta distribution with parameters that place probability mass in the interior of [0, 1].

What the data reveals
The distribution is spread across the interior of [0, 1] with a mean of 0.48, indicating that pedestrians follow paths that are neither fully straight nor fully winding. This intermediate behavior is consistent with movement through a busy shopping district, where pedestrians maintain a general direction but frequently deviate in response to storefronts, crowds, and signage.

speed_skew

Why this distribution?
Per-step speed is non-negative, so the within-trajectory speed distribution has a natural floor at zero. This floor creates an inherent right skew in the speed profile, which means speed_skew is almost always positive (98.4% of tracks in this dataset). The Gamma distribution, which is defined for positive values and flexible in shape, fits this behavior well.

What the data reveals
The mean speed_skew is 2.47 (std = 2.03), indicating that most pedestrians have a few brief high-speed moments within an otherwise slower trajectory. This is consistent with occasional surges in pace — stepping around another person, crossing a gap in the crowd — embedded in otherwise steady walking.

decel_ratio

Why this distribution?
decel_ratio measures the fraction of steps where acceleration is negative. For steady walking, this should be close to 0.5 — alternating acceleration and deceleration in roughly equal measure. The Normal distribution centered near 0.5 captures this well.

What the data reveals
The mean is 0.499, confirming that pedestrians in this dataset decelerate and accelerate in nearly equal proportions. There is no systematic braking bias in baseline behavior — an important reference for detecting shifts caused by an intervention in Article 12 and 13.

duration_sec

Why this distribution?
Observation duration is positive and right-skewed: most pedestrians cross the field of view in a few seconds, but some linger, pause, or follow longer paths. The Gamma distribution accommodates the peak at short durations and the heavy right tail.

What the data reveals
The median duration is approximately 2 seconds, consistent with pedestrians walking through the frame. The right tail extends to around 14 seconds. One limitation of this approach is that the maximum observed duration is relatively short — it is likely that some pedestrians who stopped for longer had their track IDs reassigned by the tracker, truncating their trajectories.

Discussion

KS Test Sensitivity at Large Sample Sizes

The Kolmogorov-Smirnov test becomes increasingly sensitive as sample size grows. With n ≈ 600, even small deviations from a theoretical distribution are detectable, leading to rejection even when the visual fit is clearly reasonable. The KS statistic D measures the maximum gap between empirical and theoretical CDFs; a small D is evidence of good fit regardless of whether the test is formally rejected. For this reason, I adopted the distribution with the smallest D as the working approximation for all features, including those where the null hypothesis was rejected.

Feature Correlation and Implications for Manifold Construction

The correlation matrix reveals that speed_skew and real_speed_cv are strongly correlated (r = 0.77). Highly correlated features introduce redundant dimensions into the statistical manifold, which may affect the geometry of the parameter space in Article 4. Both features are retained for now, and their effect on the manifold structure will be examined in that article.

The Cauchy Distribution and Fisher Information

The Cauchy distribution for real_accel_mean is a physically meaningful result: it reflects the occasional extreme acceleration events — sudden stops at a shop entrance, abrupt lane changes — that occur against a background of smooth walking. However, the Cauchy distribution has no defined mean or variance. Its Fisher information metric is defined only through the scale parameter γ, so in Article 4, the treatment of the location parameter μ in the manifold construction will require careful consideration.

Limitations of the Data Collection Method

One limitation of this approach is that the CentroidTracker used for pedestrian tracking is prone to ID switching when pedestrians cross paths. This produces artificial velocity jumps in some trajectories and likely inflates the values of real_speed_cv and speed_skew. Improving tracking accuracy — for example by switching to a re-identification-based tracker — would reduce this artifact.

A second limitation is that the depth normalization via bbox_height assumes full-body visibility. For occluded or partially cropped pedestrians, the scale factor may be unreliable.

Toward Detecting Behavioral Change

The eight distribution families established here define the coordinate system of the statistical manifold in Article 4. In future articles (Articles 12 and 13), I will investigate how the parameters of these distributions shift when a standing demonstration is present. A systematic shift in Beta parameters for stop_ratio, or a change in the Cauchy scale parameter for real_accel_mean, would provide a quantitative signature of behavioral disruption.

Conclusion

In this article, I extracted eight features from pedestrian trajectories obtained from video footage and fitted probability distributions to each feature using the KS test. The analysis reveals that mean walking speed, speed skewness, and dwell time follow Gamma distributions; speed variability follows a Log-normal distribution; signed mean acceleration follows a Cauchy distribution; stop ratio and path straightness follow Beta distributions; and deceleration ratio follows a Normal distribution. These eight features and their associated distribution families form the coordinate parameterization of the statistical manifold to be constructed in Article 4.

In the next article

In the next article, I will apply this parameterization schema to pedestrian trajectory data collected at multiple stations along the Yamanote Line. Each location will be represented as a point on a statistical manifold, where coordinates are the estimated parameters of the eight distributions identified here. Using the Fisher information metric and e/m geodesics, I will compute distances between locations in a way that reflects genuine differences in pedestrian behavior — differences that Euclidean distance in raw feature space cannot capture.

Code

https://github.com/TOKIHISA/people_trajectory_analysis/blob/main/analysis/explore_distributions_pixel.ipynb

Toki Hirose

Apr 4

4. Constructing a Station-Level Statistical Manifold with Dual Flat Structure from Pedestrian Trajectories

#learning #datascience #analytics

Comments

10 min read

Toki Hirose

Mar 18

2. Visualizing Pedestrian Trajectories on a Map with MapLibre

#showdev #ai #javascript #machinelearning

Comments

9 min read

2. Visualizing Pedestrian Trajectories on a Map with MapLibre

Toki Hirose — Wed, 18 Mar 2026 15:11:01 +0000

Note:I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.

Motivation

When I stood alone in protest at stations along the Yamanote Line, I watched people walk past me. Some glanced at my sign and looked away with a frown. Others gave a small nod. But I had no way to measure how many people actually reacted to my message — or how.

That question is what started this project. I wanted to capture not just whether people reacted, but how their movement changed: did they slow down, step aside, or adjust their path? To do that, I needed to track pedestrian trajectories from video and place them on a real map. I filmed at several stations in Tokyo using a smartphone, and built the pipeline using open-source tools — YOLOX for detection, ByteTrack for tracking, and MapLibre for visualization.　

Introduction

In the previous article, I extracted pedestrian trajectories from street video as structured JSON. The coordinates in that output are pixel positions in image space means something relative to the camera frame, but nothing about the real world.

In this article, I transform those image-space trajectories into geographic coordinates (WGS84) using homography, then render them on an interactive map with MapLibre GL JS. The result is a browser-based viewer where you can watch each pedestrian's path play back in real time over an aerial orthophoto. The data used inthis article was filmed at Shnbashi station.

The Core Problem: Two Different Coordinate Systems

Video footage and maps live in fundamentally different spaces. A pixel at (x, y) in a camera frame corresponds to some location (lon, lat) on the ground — but the mapping is not a simple linear scale. Because the camera is tilted at an angle rather than looking straight down, perspective distortion means that objects farther from the camera appear compressed. A projective transformation, called a homography, is the correct model for this mapping when the scene is approximately planar (which a ground surface is).

A homography H is a 3×3 matrix that maps points in image space to points in world space via:

[lon, lat, 1]^T ∝ H · [x, y, 1]^T

To compute H, I need at least four Ground Control Points (GCPs): pairs of corresponding locations, one in image coordinates and one in geographic coordinates.

Collecting Ground Control Points

The GCP workflow requires two tools running in sequence.

Step 1: Selecting Map Coordinates

The first tool (gcp_selector_map.py) opens a browser-based map where I click on identifiable features — a road marking, a corner of a crosswalk, a utility cover — and record their geographic coordinates.

The tool uses Folium with multiple tile layer options: OpenStreetMap, Esri World Imagery, and the Geospatial Information Authority of Japan (GSI) aerial photos. For urban scenes in Japan, the GSI orthophoto is particularly useful because it is regularly updated and offers clear ground-level features.

# Aerial orthophoto from Japan's Geospatial Information Authority
folium.TileLayer(
    tiles='https://cyberjapandata.gsi.go.jp/xyz/seamlessphoto/{z}/{x}/{y}.jpg',
    attr='国土地理院',
    name='航空写真 (国土地理院)',
).add_to(m)

Clicking on the map adds a numbered marker and records the WGS84 coordinates. Once four or more points are selected, the tool exports a gcp_config.json:

{
  "gcp_wgs84": [
    {"lat": 35.68423, "lon": 139.76531},
    {"lat": 35.68418, "lon": 139.76547}
  ],
  "gcp_image": []
}

Step 2: Selecting Video Frame Coordinates

The second tool (gcp_selector_video.py) opens a frame from the same video and displays it in an OpenCV window. I click on the same physical features in the same order as I did on the map.

cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame)
ret, frame = cap.read()

Using a stable, early frame from the video (default: 10 seconds in) avoids motion blur and ensures that the scene features are visible. Right-click removes the last point, Enter confirms. The tool writes the pixel coordinates back into gcp_config.json:

{
  "gcp_wgs84": [
    {"lat": 35.68423, "lon": 139.76531},
    ...
  ],
  "gcp_image": [
    {"x": 412, "y": 631},
    ...
  ]
}

The matching order matters: point 1 in gcp_wgs84 must correspond to point 1 in gcp_image.

Computing the Homography Matrix

With the GCP pairs collected, project_v2wgs84.py computes the homography using OpenCV's findHomography with RANSAC:

self.H, _ = cv2.findHomography(
    self.src_points,  # image coordinates (N, 2)
    self.dst_points,  # WGS84 coordinates (N, 2) as [lon, lat]
    cv2.RANSAC,
    ransacReprojThreshold=3.0
)

RANSAC is used here because a few GCPs may be slightly misclicked. The algorithm fits the best homography to the inlier set and discards outliers, making the result more robust than a direct least-squares fit.

Note that with exactly four GCPs — the minimum required to determine H — RANSAC has no effect: all four points are consumed by the initial sample, leaving none to evaluate as inliers or outliers. To benefit from outlier rejection, provide at least five GCPs, and preferably eight or more.

After computing H, the reprojection error is calculated by transforming each source GCP back through H and measuring the distance to the target:

projected = cv2.perspectiveTransform(src_reshaped, self.H)
errors = np.sqrt(np.sum((projected - self.dst_points) ** 2, axis=1))
self.reprojection_error = float(np.mean(errors))

Transforming Trajectories

The transform_trajectory method applies H to each point in a track's pixel-space trajectory:

def transform_points(self, points: np.ndarray) -> np.ndarray:
    pts = points.reshape(-1, 1, 2).astype(np.float64)
    transformed = cv2.perspectiveTransform(pts, self.H)
    return transformed.reshape(-1, 2)

Before transformation, points outside the convex hull of the GCPs are discarded. The homography is only well-defined within the region spanned by the control points; extrapolation beyond the convex hull produces increasingly distorted results.

def is_in_valid_region(self, x: float, y: float) -> bool:
    hull = cv2.convexHull(self.src_points.astype(np.float32))
    return cv2.pointPolygonTest(hull, (float(x), float(y)), False) >= 0

Trajectory Simplification

The raw trajectory from the tracker contains a point for every frame — typically 30 points per second. For visualization and downstream analysis, this is far more detail than needed. When drawing in Maplibre, dense point sequences slow down rendering without adding visual clarity. I simplify by keeping only points where something meaningful changes: a direction shift or a speed change.

# Keep a point if:
# - direction changed by more than angle_threshold_deg (default: 10°)
# - speed changed by more than speed_ratio_threshold (default: 30%)
if angle > angle_threshold_deg or speed_change > speed_ratio_threshold:
    result.append(curr)

This preserves turning points, stops, and accelerations while dropping the redundant intermediate frames during straight, constant-speed walking. A 60-second trajectory of ~1,800 frames typically reduces to 20–80 representative points.

The output for each track gains a trajectory_wgs84 field alongside the original pixel-space trajectory:

{
  "id": 7,
  "trajectory_wgs84": [
    {"lon": 139.76531, "lat": 35.68423, "frame": 120, "time_sec": 4.0},
    {"lon": 139.76545, "lat": 35.68419, "frame": 155, "time_sec": 5.17}
  ],
  "geometry_wgs84": {
    "type": "LineString",
    "coordinates": [[139.76531, 35.68423], [139.76545, 35.68419]]
  }
}

Rendering in MapLibre GL JS

The final step is generate_viewer.py, which reads the WGS84 JSON and generates a self-contained HTML file. The viewer uses MapLibre GL JS with an OpenStreetMap base layer.

Map Setup

const map = new maplibregl.Map({
  container: 'map',
  style: {
    version: 8,
    sources: {
      osm: {
        type: 'raster',
        tiles: ['https://tile.openstreetmap.org/{z}/{x}/{y}.png'],
        tileSize: 256,
      }
    },
    layers: [{ id: 'osm', type: 'raster', source: 'osm' }]
  },
  center: [center_lon, center_lat],
  zoom: 19
});

Two-Layer Rendering Strategy

The viewer maintains two GeoJSON sources:

Background (bg): Full path of each active track, drawn at low opacity (0.15). Updated only when tracks enter or leave the scene.
Trail (trail): A sliding window of the recent path for each active track, updated every animation frame.

// Background: full path at low opacity
map.addLayer({
  id: 'bg', type: 'line', source: 'bg',
  paint: { 'line-color': ['get', 'color'], 'line-width': 2, 'line-opacity': 0.15 }
});

// Trail glow + line
map.addLayer({
  id: 'trail-glow', type: 'line', source: 'trail',
  paint: { 'line-color': ['get', 'color'], 'line-width': 8, 'line-blur': 6, 'line-opacity': 0.3 }
});
map.addLayer({
  id: 'trail-line', type: 'line', source: 'trail',
  paint: { 'line-color': ['get', 'color'], 'line-width': 3 }
});

Each track gets a unique color computed with the golden-angle hue distribution, which maximizes perceptual separation between adjacent track indices.

Animation Loop

The animation loop advances a currentTime variable and slices each track's coordinates to the active window:

function animate(ts) {
  currentTime += (ts - lastTs) / 1000 * playbackSpeed;

  const trailFeatures = [];
  for (let i = 0; i < TRACKS.length; i++) {
    const track = TRACKS[i];
    const isActive = currentTime >= track.start && trailStart <= track.end;
    if (!isActive) continue;

    // Binary search for the relevant coordinate slice
    const headIdx = upperIdx(coords, currentTime);
    const tailIdx = lowerIdx(coords, currentTime - trailSec);
    const slice = coords.slice(tailIdx, headIdx + 1).map(([lon, lat]) => [lon, lat]);

    if (slice.length >= 2) {
      trailFeatures.push({ type: 'Feature', ... });
    }
  }
  map.getSource('trail').setData({ type: 'FeatureCollection', features: trailFeatures });
  requestAnimationFrame(animate);
}

Binary search on the time axis avoids scanning all coordinates each frame, keeping the animation smooth even with hundreds of simultaneous tracks.

The viewer includes a timeline scrubber, adjustable playback speed (1×–120×), and a configurable trail window length (1–30 seconds). The generated HTML is fully self-contained — no server required, just open in a browser.

Results

Running the full pipeline on the Shinbashi station video produces a self-contained HTML viewer. GCP collection takes around 10 minutes per video, split between map selection and video frame annotation.

The geographic alignment looks accurate at zoom level 19 — pedestrians appear to walk along the correct sidewalk, following the actual street geometry. The animation runs smoothly in the browser even with multiple tracks active simultaneously.

Trajectory simplification reduces the point count significantly, making the viewer responsive without any noticeable loss of visual detail.

Interpretation

Plotting the first video revealed an immediate artifact: trajectories near the camera showed erratic up-and-down distortion, diverging left and right rather than tracing a clean path. This is a known limitation of ground-level homography — people walking close to the camera lens experience the most perspective distortion, and the foot-point approximation breaks down when the viewing angle is steep.

The rest of the scene told a clearer story. Almost all trajectories moved in one direction: left to right across the frame, from the street toward the station entrance. This was an evening shoot, and the pattern reflects the commuter flow — people heading home from work. There were no trajectories crossing perpendicular to this flow.

To reduce the near-camera distortion in future shoots, I plan to mount the camera higher on a tripod, which flattens the viewing angle and extends the reliable region closer to the lens.

Discussion

Limitations of Homography

A homography assumes the scene is a flat plane. For a ground-level camera looking across a street, this is a reasonable approximation for people walking on flat pavement. It breaks down for:

Tall objects: A person's head and feet are at different heights, so their bounding box center (used as the tracking point) introduces a height-dependent offset. Using the foot point of the bounding box instead of the centroid reduces this effect.
Sloped terrain: Hills or stairs invalidate the planar assumption.
Wide GCP coverage: Reprojection error increases at points far from the GCP convex hull. The is_in_valid_region filter mitigates this by discarding trajectory points outside the reliable area.

For this project, I chose the foot point of the bounding box as the tracking coordinate, which approximates the actual ground contact point and keeps the homography error small for typical pedestrian heights.

GCP Selection Quality

The quality of the homography depends heavily on GCP selection. Good practices:

Use features clearly visible in both the orthophoto and the video frame
Distribute points across the full area of interest (not clustered in one corner)
Avoid features that might differ between the photo and video (temporary markings, vehicles)
Use more than four points and let RANSAC handle any mismatches

In practice, six to eight well-distributed GCPs produce significantly better results than the minimum four.

Simplification Trade-offs

The angle/speed-based simplification trades off detail for compactness. A threshold of 10° for direction changes and 30% for speed changes preserves the essential shape of each trajectory. Setting thresholds too high risks losing turns; too low retains excessive redundancy. For the information-geometry analysis in later articles, the full pixel-space trajectory (not the simplified WGS84 version) is used to preserve accurate timing information.

Conclusion

By combining GCP-based homography calibration with MapLibre GL JS rendering, I can place pedestrian trajectories extracted from street video onto an interactive geographic map. The pipeline — collect GCPs, compute H, transform coordinates, generate viewer — takes a few minutes per video and produces a self-contained HTML file that runs in any browser.

Bridge

Watching the trajectories animate on the map gives an intuitive picture of pedestrian flow — but comparing across videos or stations requires something more rigorous than visual inspection.

The next step is to construct a distribution from the trajectories in each video, representing the collective movement pattern as a probability distribution over features such as speed, direction, and acceleration. Each video then becomes a single point in a statistical manifold, where the geometry encodes how different two movement patterns are from each other. In the next article, I extract those features and begin constructing that manifold.

Toki Hirose

Mar 29

3. Exploring Feature Distributions from Pedestrian Trajectories

#statistics #computervision #python #datascience

Comments

10 min read

Toki Hirose

Mar 15

1. Extracting Pedestrian Trajectories from Street Video as JSON

#data #datascience #machinelearning #science

Comments 5

7 min read

1. Extracting Pedestrian Trajectories from Street Video as JSON

Toki Hirose — Sun, 15 Mar 2026 01:43:41 +0000

Note: I use AI assistance to draft and polish the English, but the analysis, interpretation, and core ideas are my own. Learning to write technical English is itself part of this project.

Motivation

Why extract pedestrian trajectories from smartphone video footage? This approach serves multiple purposes in my research on urban social movements:

GIS-ready data: JSON output integrates seamlessly with geographic information systems and mapping tools
Cost-effective data collection: Eliminates the need for expensive GPS trackers or surveillance infrastructure
Understanding pedestrian behavior: Reveals how people move and interact in urban environments
Measuring protest reactions: Quantifies how standing demonstrations affect surrounding pedestrian flow

This project emphasizes rapid deployment for protest monitoring. The entire setup requires only a smartphone and tripod, enabling quick response to emerging events.

Introduction

In urban planning and transportation studies, understanding pedestrian movement patterns is crucial for designing safer and more efficient public spaces. Previous methods like manual observation or GPS tracking have limitations in coverage and cost. Computer vision offers a scalable alternative through video analysis.

This article demonstrates how to extract pedestrian trajectories from street video footage using open-source tools. I'll use YOLOX-Tiny for real-time person detection and implement a custom centroid-based tracker to generate structured JSON trajectory data. The sample videos used in this project were captured on a smartphone, which keeps the setup lightweight and easy to deploy.

Methodology

Person Detection with YOLOX-Tiny

YOLOX-Tiny is a lightweight object detection model optimized for real-time inference. I use the ONNX export for cross-platform compatibility with OpenCV and ONNX Runtime.

The detection pipeline:

Preprocessing: Letterbox resizing to maintain aspect ratio
Inference: YOLOX model processes the frame
Postprocessing: Convert detections to bounding boxes
Filtering: Confidence thresholding and non-maximum suppression

Centroid-Based Tracking

For tracking detected persons across frames, I implement a simple but effective centroid tracker:

Each detection's bounding box center becomes a centroid
Tracks are maintained by matching centroids between frames
New tracks are registered for unmatched detections
Lost tracks are deregistered after a maximum disappearance threshold
For visualize, footage tracks are also detected

Trajectory Analysis

For each complete trajectory, I extract:

Duration: Total time the person was tracked
Distance: Total pixels traveled
Direction: Movement angle in degrees
Start/End positions: Entry and exit points
Screen exit detection: Whether the person left the frame

Implementation

Setup and Dependencies

# Required packages
pip install opencv-python numpy onnxruntime

# Download YOLOX-Tiny ONNX model
# From: https://github.com/Megvii-BaseDetection/YOLOX

Core Detection Function

def detect_persons(frame, session):
    # Preprocess frame
    blob, ratio = preprocess_yolox(frame, 416, 416)

    # Run inference
    output = session.run(None, {session.get_inputs()[0].name: blob})[0]

    # Postprocess detections
    # ... (filter by confidence, apply NMS)

    return boxes, confidences

Tracking Implementation

Full CentroidTracker Implementation ```python from collections import defaultdict import numpy as np class CentroidTracker: """ centroid-based tracking algorithm for associating detected bounding boxes across frames. In addition to tracking centroids, it also maintains trajectories based on the foot point of the bounding box (the point where the person touches the ground), which is more stable for movement analysis. """ def __init__(self, max_disappeared=50): self.next_object_id = 0 self.objects = {} # ID: (centroid_x, centroid_y) self.disappeared = {} # ID: disappeared_frame_count self.trajectories = defaultdict(list) # ID: [(x, y, frame), ...] self.first_seen = {} # ID: first frame detected self.last_seen = {} # ID: last frame detected self.max_disappeared = max_disappeared def register(self, centroid, foot_point, frame_num): """register a new object with a unique ID""" self.objects[self.next_object_id] = centroid self.disappeared[self.next_object_id] = 0 self.trajectories[self.next_object_id].append( (foot_point[0], foot_point[1], frame_num) ) self.first_seen[self.next_object_id] = frame_num self.last_seen[self.next_object_id] = frame_num self.next_object_id += 1 def deregister(self, object_id): """deregister an object and remove it from tracking""" del self.objects[object_id] del self.disappeared[object_id] def update(self, rects, frame_num): """ update the tracker with new bounding box detections Args: rects: the list of detected bounding boxes [(x1, y1, x2, y2), ...] frame_num: the current frame number Returns: objects: a dictionary mapping object IDs to their current centroids {(cx, cy)} """ # when no detections are present, mark existing objects as disappeared if len(rects) == 0: for object_id in list(self.disappeared.keys()): self.disappeared[object_id] += 1 if self.disappeared[object_id] > self.max_disappeared: self.deregister(object_id) return self.objects # conpute centroids and foot points for the current detections input_centroids = np.zeros((len(rects), 2), dtype="int") input_feet = np.zeros((len(rects), 2), dtype="int") for i, (x1, y1, x2, y2) in enumerate(rects): cx = int((x1 + x2) / 2.0) input_centroids[i] = (cx, int((y1 + y2) / 2.0)) input_feet[i] = (cx, y2) # foot point is the bottom center of the bounding box # if no existing objects, register all input centroids if len(self.objects) == 0: for i in range(len(input_centroids)): self.register(input_centroids[i], input_feet[i], frame_num) # existing objects are present, match input centroids to existing object centroids else: object_ids = list(self.objects.keys()) object_centroids = list(self.objects.values()) # conpute distance matrix between existing object centroids and input centroids D = np.zeros((len(object_centroids), len(input_centroids))) for i, oc in enumerate(object_centroids): for j, ic in enumerate(input_centroids): D[i, j] = np.linalg.norm(oc - ic) # find the smallest distance pairs (existing object to input centroid) rows = D.min(axis=1).argsort() cols = D.argmin(axis=1)[rows] used_rows = set() used_cols = set() for (row, col) in zip(rows, cols): if row in used_rows or col in used_cols: continue # when distance is lower than a threshold, consider it a match if D[row, col] > 100: # if the distance is too large, ignore the match (this threshold can be tuned) continue object_id = object_ids[row] self.objects[object_id] = input_centroids[col] # using the centroid for tracking self.disappeared[object_id] = 0 self.trajectories[object_id].append( (input_feet[col][0], input_feet[col][1], frame_num) # using the foot point for trajectory analysis ) self.last_seen[object_id] = frame_num used_rows.add(row) used_cols.add(col) # not matched existing objects unused_rows = set(range(D.shape[0])) - used_rows for row in unused_rows: object_id = object_ids[row] self.disappeared[object_id] += 1 if self.disappeared[object_id] > self.max_disappeared: self.deregister(object_id) # not matched input centroids unused_cols = set(range(D.shape[1])) - used_cols for col in unused_cols: self.register(input_centroids[col], input_feet[col], frame_num) return self.objects ```

JSON Output Structure

The trajectory data is saved as structured JSON:

{
  "video_name": "street_footage.mp4",
  "fps": 30,
  "resolution": "1920x1080",
  "tracks": [
    {
      "id": 1,
      "duration": 12.5,
      "total_distance": 320.4,
      "trajectory": [
        {"x": 100, "y": 200, "frame": 10, "time_sec": 0.333},
        {"x": 105, "y": 202, "frame": 11, "time_sec": 0.367}
      ],
      "geometry": {
        "type": "LineString",
        "coordinates": [[100, 200], [105, 202]]
      }
    }
  ]
}

Results

Processing a 5min video at 30 FPS typically yields:

The JSON output provides rich data for further analysis:

Spatial patterns of movement
Temporal distribution of pedestrian activity
Flow direction analysis

Discussion

Advantages of This Approach

Cost-effective: Uses commodity hardware and free software
Scalable: Can process hours of footage automatically
Structured output: JSON format integrates with GIS and analysis tools
Real-time capable: YOLOX-Tiny enables live processing

The sample videos in this project were captured on a smartphone, but the same pipeline can be applied to fixed surveillance cameras for longer-term monitoring.

Interpretation

Processing demonstration videos from Shinbashi station revealed insights about centroid tracking performance and pedestrian behavior during protests:

Commuter indifference: In Japan, individual protests are uncommon, so commuters typically ignore demonstrators. Additionally, most people are busy office workers who tend to focus on their commute rather than noticing activities around them.
Camera height issues: Using a smartphone camera with a low tripod created unreliable detections. People near the camera appeared with unnatural up-and-down trajectories due to the low-angle perspective.
ID swapping during interactions: When pedestrians crossed paths or interacted closely, their tracking IDs would swap, creating fragmented trajectories for the same individuals.

Overall, the system successfully captured general movement patterns. Future improvements could include filtering trajectories with sudden angle changes after intersections or removing outliers based on historical movement differences.

Limitations and Future Improvements

Occlusion handling: Simple centroid tracking fails in crowds
Camera motion: Assumes static camera position
Identity persistence: No re-identification across camera cuts
Stopping behavior: People who stop moving in videos sometimes lose their tracking ID due to centroid distance thresholds, leading to fragmented trajectories (e.g., ID 5 → 110 → 430 as the same person gets re-detected with new IDs)

For crowded scenes, more sophisticated trackers like DeepSORT or ByteTrack would improve performance. Camera motion compensation using optical flow could extend applicability to moving platforms.

In this project, I prioritized spending time on analysis and visualization rather than implementing the most advanced tracking pipeline; that tradeoff made it easier to iterate quickly with real data.

Applications

This trajectory data serves as input for:

Urban planning: Identifying pedestrian flow bottlenecks
Safety analysis: Detecting high-risk crossing patterns
Traffic engineering: Optimizing signal timing
Accessibility studies: Understanding mobility patterns

The structured JSON format makes it easy to integrate with mapping libraries like MapLibre GL JS for visualization, as I'll explore in the next article.

Conclusion

By combining YOLOX-Tiny detection with centroid tracking, I can extract meaningful pedestrian trajectory data from video footage. The resulting JSON structure provides a foundation for spatial analysis of urban movement patterns. While the current implementation works well for moderate-density scenarios, future enhancements could address occlusion and camera motion challenges.

In the next article, I'll visualize these trajectories on an interactive map using MapLibre GL JS.

reference

Download YOLOX-Tiny ONNX model
- From: https://github.com/Megvii-BaseDetection/YOLOX
Centroid tracker
- https://pyimagesearch.com/2025/07/14/people-tracker-with-yolov12-and-centroid-tracker/
my github project
- https://github.com/TOKIHISA/people_trajectory_analysis/blob/main/src/detects_people.py

Toki Hirose

Mar 18

2. Visualizing Pedestrian Trajectories on a Map with MapLibre

#showdev #ai #javascript #machinelearning

Comments

9 min read

DEV Community: Toki Hirose

5. Comparing Pedestrian Manifolds with OSMnx Spatial Manifolds

Introduction

The Stack as Adjoint Functors

Constructing M_C: Street Network Features via OSMnx

Feature Extraction

M_C Coordinates

M_C Geometry: Geodesics and Curvature

PCA and KL Divergence

M_C vs M_U: Pairwise Distance Comparison

Where the Correspondence Holds — and Where It Breaks

What M_C tells us

What M_U tells us — and where M_C fails to predict it

Interpretation

Discussion: Structural Untranslatability

Conclusion

Citation

4. Constructing a Station-Level Statistical Manifold with Dual Flat Structure from Pedestrian Trajectories

Introduction

Motivation

Methodological Foundation

Why Information Geometry Over PCA?

Station as a Point on the Manifold

Dual Flat Structure

Fisher Information Metric

Results

Manifold Construction

KL Divergence Between Stations

e-Geodesics and m-Geodesics

PCA Distance vs Sym-KL Divergence

Discussion

PCA Distance vs KL Divergence

Asymmetry

Conclusion

Next Article

5. Comparing Pedestrian Manifolds with OSMnx Spatial Manifolds

Previous Article

3. Exploring Feature Distributions from Pedestrian Trajectories

3. Exploring Feature Distributions from Pedestrian Trajectories

Motivation

Note

Introduction

Feature Extraction

Pixel to Meter Conversion

Speed

Acceleration

Speed Skewness

Stop Ratio

Path Straightness

Dwell Time

Distribution Analysis

Results

Interpretation

real_speed_mean

real_speed_cv

real_accel_mean

stop_ratio

real_straightness

speed_skew

decel_ratio

duration_sec

Discussion

KS Test Sensitivity at Large Sample Sizes

Feature Correlation and Implications for Manifold Construction

The Cauchy Distribution and Fisher Information

Limitations of the Data Collection Method

Toward Detecting Behavioral Change

Conclusion

In the next article

Code

Next Article

4. Constructing a Station-Level Statistical Manifold with Dual Flat Structure from Pedestrian Trajectories

Previous Article

2. Visualizing Pedestrian Trajectories on a Map with MapLibre

2. Visualizing Pedestrian Trajectories on a Map with MapLibre

Motivation

Introduction

The Core Problem: Two Different Coordinate Systems

Collecting Ground Control Points

Step 1: Selecting Map Coordinates