Acknowledgments
All authors were partially supported by OSD contract number N0002413D6400 via JHUAPL subcontract number 131753. (Distribution Statement A: Approved for public release; distribution is unlimited.) Bendich and Harer were also partially supported by NSF award BIGDATA 1444791. Harer was also partially supported by the DARPA MoDyL program, number HR00111620033. Pieloch was supported by the National Science Foundation Graduate Student Fellowship Program through grant DGE 1644869. We thank Christopher J. Tralie for many helpful insights about cover trees, and we thank David Porter and Michael Williams at JHUAPL for many motivational discussions.
Our code is available at https://github.com/geomdata/gdapublic/ with documentation at https://geomdata.github.io/gdapublic/
1. Overview and Assumptions
We propose a new supervised machine-learning method for classification, where each object to be classified is a weighted pointcloud, and the classification task is to learn which of a finite set of labels should be applied to the pointcloud. The method is fast, data-driven, multiscale, and requires no tuning. Additionally, its details are transparently geometric; it does not suffer from the "black box" interpretation difficulties that arise in many machine-learning methods. We call it Cover-tree Differencing via Entropy Reduction (CDER, pronounced "cedar"). A high-level sketch of CDER is as follows. We construct a partial cover tree [7] on the union of all labeled pointclouds in a given training set, and we search the cover tree for convex regions that are likely to be local minima of entropy [3]. For each such region, we build distributional coordinates from the dominant labels' data. An ordered sequence of these distributional coordinates allows us to determine the likelihood of each label for an unlabeled test pointcloud.
Section 3 explains the notion of distributional coordinates defined on pointclouds as a front-end for supervised learning. Section 4 details the cover-tree algorithm and our enhancements to it. Section 5 explains our simple approach to entropy minimization in the labeled-pointcloud context. Section 6 gives a formal description of CDER. Section 7 illustrates CDER with several synthetic examples. The remainder of this section establishes context and notation for this problem.
Fix a Euclidean space $\mathbb{R}^d$. (See the Discussion section for further thoughts on this requirement.) A pointcloud is a finite set $X \subset \mathbb{R}^d$; in this document, $|X|$ indicates the cardinality of a finite set $X$, whereas $\|\cdot\|$ indicates a norm. We allow weighted pointclouds, where each $x \in X$ has a positive weight $w_x > 0$; the utility of weights is explained in Section 2. A cloud collection is a finite set of pointclouds $\mathcal{X} = \{X_1, \ldots, X_N\}$
with each $X_i \subset \mathbb{R}^d$. Let $X^*$ denote the union of all pointclouds in a cloud collection, which is itself a pointcloud, with the weights and labels of each point inherited from the respective $X_i$.
For our supervised learning task, the training data is a cloud collection $\mathcal{X}$, where each pointcloud $X_i$ is labeled from a finite set of labels $\Lambda$. For the sake of discussion and visualization, we usually interpret $\Lambda$ as a set of colors, but for mathematical purposes $\Lambda = \{0, 1, \ldots, |\Lambda| - 1\}$. Let $\ell(\cdot)$ denote the label function, and we also use $\lambda$ to indicate a generic label in $\Lambda$.
It is acceptable that the pointclouds $X_i$ have unequal sizes across $\mathcal{X}$. It is also acceptable that the labeled subcollections have unequal sizes across labels $\lambda$. Pointwise weights can be assigned to compensate for these sizes, as in Section 2. Our only structural hypothesis is that, for each labeled subcollection, the pointclouds are sampled from an underlying density function, or from several density functions as chosen by a random process, on $\mathbb{R}^d$. For each label $\lambda$, let $f_\lambda$ denote this density function.
We aim to efficiently find regions in $\mathbb{R}^d$ where one label has greater density than the other labels. That is, we seek convex regions $\Omega$ that are "characteristic" of a particular label $\lambda$ because $f_\lambda$ is unusually prominent there, while ignoring regions where the various density functions are similar. See Figure 1.
We analyze the relative density of labels over regions using the information-theoretic notion of entropy (see Section 5), locating subregions of low entropy. Because of the use of cover trees, our method emphasizes regions that are very large, very dense, or small-yet-remote (see Section 4). For each of these regions $\Omega$, we construct a function $g$ that approximates $f_\lambda$ near $\Omega$; such a $g$ is an example of what we call a distributional coordinate (see Section 3).
2. Weights
Typically, pointclouds are experimental samples that do not come with preassigned weights, and we assign weights as follows. We have the prior assumption that each label is equally likely among the test data. (See the Discussion section for more general thoughts about this assumption.) We give the training set total weight 1. That is, each label is allotted total weight $1/|\Lambda|$. Moreover, we make the assumption that each sample is equally representative of its underlying density function $f_\lambda$, regardless of size. That is, each pointcloud $X_i$ with label $\lambda$ is allotted total weight $1/(|\Lambda|\, N_\lambda)$, where $N_\lambda$ is the cardinality of the labeled subcollection, the number of training pointclouds with label $\lambda$. Finally, we assume all points of $X_i$ have equal weight, so $w_x = 1/(|\Lambda|\, N_\lambda\, |X_i|)$ for each $x \in X_i$. Let $X^*$ denote the labeled and weighted pointcloud given by the union of all $X_i$ with these labels and weights.
For example, suppose that the training set consists of three pointclouds $X_1$, $X_2$, and $X_3$ with labels red, blue, and red, respectively. Suppose we want $X^*$ to have unit weight. Then each $x \in X_1$ has weight $\frac{1}{4|X_1|}$, each $x \in X_2$ has weight $\frac{1}{2|X_2|}$, and each $x \in X_3$ has weight $\frac{1}{4|X_3|}$.
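As a concrete sketch, this weighting scheme can be computed as follows. The helper name `assign_weights`, its argument format, and the cloud sizes are our illustrative choices, not the paper's:

```python
from fractions import Fraction

def assign_weights(clouds):
    """Per-point weights: each label gets total weight 1/|Labels|, split
    equally among that label's pointclouds, then among each cloud's points.
    `clouds` is a list of (label, cloud_size) pairs, one per pointcloud."""
    distinct = sorted({lab for lab, _ in clouds})
    n_labels = len(distinct)
    clouds_per_label = {lab: sum(1 for l, _ in clouds if l == lab)
                        for lab in distinct}
    # weight of each point of cloud i: 1 / (|Labels| * N_label * |X_i|)
    return [Fraction(1, n_labels * clouds_per_label[lab] * size)
            for lab, size in clouds]

# Hypothetical sizes for the clouds X1 (red), X2 (blue), X3 (red):
w = assign_weights([("red", 10), ("blue", 5), ("red", 20)])
```

With these sizes, the three clouds carry total weights 1/4, 1/2, and 1/4, so the union has unit weight.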
These weights are chosen to model relative likelihoods in the following way: if each label $\lambda$ has an underlying density function $f_\lambda$, and each $X_i$ with label $\lambda$ is a sample from $f_\lambda$, then the total weight of $X_i \cap \Omega$ should be proportional to $\int_\Omega f_\lambda$ for any measurable set $\Omega$.
3. Distributional Coordinates
Given a cloud collection as above, we would like to map each pointcloud $X_i$ to a vector $v_i \in \mathbb{R}^D$ for some fixed dimension $D$. While this is an interesting topic in its own right, we are mainly concerned with viewing this as a feature-extraction method: that is, the vectors $v_i$ and corresponding labels will be used as input for any supervised learning algorithm that requires collections of vectors in a common Euclidean space, along with class labels. One could imagine a supervised learning approach to labeled pointclouds that does not perform feature extraction; for example, by first defining a (non-Euclidean) metric between pointclouds, and then applying a nearest-neighbor or clustering method. However, most such metrics are either unstable to noise (such as the Hausdorff distance) or slower than linear to compute (such as variants of the Kullback-Leibler divergence [10] or the Wasserstein metric [9]). This section frames the problem of feature extraction in the common language of distributional coordinates. We start with a definition of this intuitive concept, and then describe the advantages and disadvantages of some standard examples. The approach we advocate comes in the next section.
Let $f$ be an integrable function on Euclidean space. For any pointcloud $X$ with pointwise weights $w_x$, define
$$f(X) = \sum_{x \in X} w_x f(x).$$
We refer to $f$ as a distributional coordinate. Any ordered set of distributional coordinates $f_1, \ldots, f_D$ transforms $X$ into a vector $(f_1(X), \ldots, f_D(X)) \in \mathbb{R}^D$. It is computationally important to keep the dimension $D$ (or at least the intrinsic dimension of the resulting set of vectors) as small as possible, as we hope to minimize the negative effect of the curse of dimensionality on the supervised learning problem.
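Evaluating a distributional coordinate is just a weighted sum of function values. In the sketch below, the helper `eval_coordinate`, the Gaussian bump, and the toy cloud are our illustrative choices:

```python
import math

def eval_coordinate(f, cloud):
    """Evaluate a distributional coordinate: f(X) = sum_x w_x * f(x).
    `cloud` is a list of (point, weight) pairs; `f` maps a point to a float."""
    return sum(w * f(x) for x, w in cloud)

# Example: a 1-D Gaussian bump as the integrable function f.
bump = lambda x: math.exp(-x[0] ** 2 / 2.0)
cloud = [((0.0,), 0.5), ((1.0,), 0.25), ((-1.0,), 0.25)]
value = eval_coordinate(bump, cloud)  # 0.5 * 1 + 0.5 * exp(-0.5)
```

Each additional coordinate appends one more entry to the feature vector, so the cost of the map is linear in $D$ and in the cloud size.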
We use the cloud collection shown at the bottom of Figure 2 as a notional working example in the Euclidean plane. There are two class labels, "magenta" and "green." We imagine that each magenta pointcloud (a typical example is at the top-left of the figure) contains a large number of points sampled from a large central blob and a few points sampled from much smaller blobs along the horizontal axis. Each green pointcloud (a typical example is at the top-right of the figure) contains a large number of points sampled from the same large central blob and a few points sampled from smaller, different blobs along the horizontal axis.
In a typical supervised learning context, the cloud collection would be split (say 80:20) into training and testing sets, and we would build a set of distributional coordinates from the labeled training set, which could then infer the correct color labels on the testing set.
3.1. Binning
A simple approach would be to let each $f_j$ be the indicator function on some region $\Omega_j$; that is, $f_j$ takes the value 1 on $\Omega_j$ and 0 elsewhere. Then $f_j(X)$ counts the total weight of the points of $X$ in $\Omega_j$. Assuming the regions cover the domain with overlap only on boundaries (for example, the $\Omega_j$ could be the squares shown in Figure 2), this is what is commonly referred to as binning, and it has been used successfully (for example, [4]) for supervised learning on labeled pointclouds. On the other hand, binning is obviously unstable: consider a point that lies very close to the boundary between two adjacent bins. It can also suffer from the curse of dimensionality: a larger number of bins than are shown in Figure 2 would be needed to distinguish the class labels.
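A minimal one-dimensional sketch of binning as distributional coordinates (the grid and helper name are ours). Note how a point near a bin boundary, say at 0.249 versus 0.251, lands in different bins under a tiny perturbation, which is the instability noted above:

```python
def bin_features(cloud, lo, hi, n_bins):
    """Feature j is the total weight of points falling in bin j of a
    uniform grid on [lo, hi); `cloud` is a list of (x, weight) pairs."""
    width = (hi - lo) / n_bins
    feats = [0.0] * n_bins
    for x, w in cloud:
        j = min(int((x - lo) / width), n_bins - 1)  # clamp the right edge
        feats[j] += w
    return feats

cloud = [(0.05, 0.25), (0.10, 0.25), (0.95, 0.5)]
v = bin_features(cloud, 0.0, 1.0, 4)  # -> [0.5, 0.0, 0.0, 0.5]
```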
3.2. Gaussian Mixtures
To address instability, one could define each $f_j$ to be a Gaussian with mean $\mu_j$ and covariance matrix $\Sigma_j$. The obvious questions are: how many Gaussians (modes) should be used, what are their means, and what are their covariances? A data-driven approach should be used to fit a Gaussian mixture model ([8], VI.8) to $X^*$. It is apparent to the human eye that nine Gaussians (1 large shared by green and magenta, 4 small magenta, and 4 small green) model the example in Figure 2. That is, we could "by hand" define a good set of distributional coordinates to establish a feature vector. Moreover, for the purposes of classification, the central "large" distributional coordinate is useless. And perhaps the two innermost "small" Gaussians are less useful, as the remaining six distributional coordinates are sufficient for accurate classification. Our goal is to replicate this observation: we construct Gaussian modes in a data-driven manner from $X^*$ that are appropriate for use as labeled distributional coordinates for the label classification problem. We now proceed to the computational details.
4. Cover Trees
Cover trees were originally conceived [7] as a way to accelerate the nearest-neighbors problem. More recently, they have been used in dictionary learning [2] and for speeding up the analysis of the topology of pointclouds [13], to pick just a few applications. This section gives basic definitions for cover trees as well as a fast algorithm for their construction; our account is far more detailed than usually appears in the literature and is meant to accompany our publicly available code. Sections 5 and 6 use cover trees to define a set of label-driven distributional coordinates on a cloud collection with class labels.
For any finite pointcloud $X$ (for the moment, we forget that $X^*$ is a union of pointclouds and simply treat it as a single pointcloud to be studied), consider a filtration of subsets $A_0 \subseteq A_1 \subseteq \cdots \subseteq X$. An element $a \in A_\ell$ is called an adult at level $\ell$. (To reduce confusion, hopefully, by maintaining a consistent metaphor, we diverge from prior works' notation in two places: our adult is called a center elsewhere, and our guardian is usually called a parent. We avoid the word parent because there are many distinct tree relations embedded in this algorithm.) The set $A_\ell \setminus A_{\ell-1}$ is the cohort at $\ell$.
A cover tree builds this filtration by covering $X$ with balls of smaller and smaller radius centered at the points of $A_\ell$. See Figure 3.
Specifically, a cover tree is a filtration of $X$ with the following additional properties:

$A_0 = \{a_0\}$ consists of a single point;

There is a radius $r_0$ such that $X \subset \overline{B}(a_0, r_0)$;

There is a real number $0 < \theta < 1$, called the shrinkage ratio, such that, for every level $\ell$, each ball has radius $r_\ell = r_0\, \theta^\ell$;

For each $\ell$, if $a, b \in A_\ell$ with $a \neq b$, then $\|a - b\| > r_\ell$;

For each $\ell$, each point $x \in X$ is assigned to a guardian $a \in A_\ell$ such that $x$ lies in the ball $\overline{B}(a, r_\ell)$; we say $x$ is a child of $a$ at level $\ell$. Each adult is its own guardian and its own child;

There is a tree structure on the (level, adult) pairs of the filtration, where $(\ell, a) \to (\ell+1, b)$ is a tree relation if $b$ was a child of $a$ at level $\ell$. We say $(\ell+1, b)$ is a successor of $(\ell, a)$, and $(\ell, a)$ is a predecessor of $(\ell+1, b)$. Note that $(\ell, a) \to (\ell+1, a)$ for all $a \in A_\ell$.
4.1. Vocabulary
Cover trees can be generated using a fast algorithm relying on the notion of friends, first introduced in [7]. The rough algorithm is familiar to many practitioners, but it is not widely published. We enhance this algorithm by giving each point a weight and a label. We describe our method here because the details of the construction are crucial to the supervised-learning task of this article.
Extending the maturation/reproduction metaphor of adults, children, and guardians above, a child $x$ with guardian $a$ at level $\ell+1$ is called a teen if $\tfrac{1}{2} r_{\ell+1} < \|x - a\| \le r_{\ell+1}$, and it is called a youngin if $\|x - a\| \le \tfrac{1}{2} r_{\ell+1}$. See Figure 4. When changing from level $\ell$ to level $\ell+1$, the radius of each ball shrinks to $r_{\ell+1} = \theta\, r_\ell$. Children farther than $r_{\ell+1}$ from their guardians become orphans. We must decide whether these orphans should be adopted by other adults at level $\ell+1$, or if the orphans should be emancipated as new adults at level $\ell+1$. The newly emancipated adults comprise the cohort at level $\ell+1$.
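In code, the youngin/teen/orphan partition at a level change might look like the following sketch. The half-radius teen cutoff reflects our reading of the separation property (a youngin can never have a closer adult), so treat it as an assumption:

```python
import math

def classify_children(guardian, children, r_next):
    """Partition a guardian's children when the ball shrinks to radius r_next:
    youngins (<= r_next / 2) can never be closer to another adult, teens may
    be exchanged, and orphans (> r_next) must be adopted or emancipated.
    Points are coordinate tuples; distances are Euclidean."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    youngins, teens, orphans = [], [], []
    for x in children:
        d = dist(x, guardian)
        if d > r_next:
            orphans.append(x)
        elif d > r_next / 2:
            teens.append(x)
        else:
            youngins.append(x)
    return youngins, teens, orphans

youngins, teens, orphans = classify_children(
    (0.0, 0.0), [(0.1, 0.0), (0.4, 0.0), (0.9, 0.0)], 0.6)
```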
Our algorithm uses several notions of friends, set by distance bounds $T_1(\ell) \le T_2(\ell) \le T_3(\ell)$ at level $\ell$:
(1) 
For each $i \in \{1, 2, 3\}$, we say that adults $a$ and $b$ at level $\ell$ are type-$i$ friends if $\|a - b\| \le T_i(\ell)$. It is easy to verify the following recursion relations:
(2) 
Here are the essential facts about friends, which are immediate consequences of the triangle inequality:
Theorem 1.

If $x$ is a child of $a$ at level $\ell$ that is orphaned at level $\ell+1$, then $x$ can be adopted only by either the type-1 friends of $a$ at level $\ell+1$ or the newly emancipated children at level $\ell+1$ of those type-1 friends. See Figure 5.

If we choose to reassign children to the nearest adult, then a teen $x$ with guardian $a$ can be adopted by an adult $b$ at level $\ell+1$, where the level-$\ell$ predecessors of $a$ and $b$ are type-2 friends. See Figure 6.

If $a$ and $b$ are type-3 friends at level $\ell+1$, then their predecessors $a'$ and $b'$ are type-3 friends at level $\ell$. See Figure 7.
Theorem 2.
Suppose $b$ is an adult at level $\ell+1$ with predecessor $a$. If $x$ has guardian $b$ at level $\ell+1$ and guardian $c$ at level $\ell$, then $c$ is a type-1 friend of $a$. See Figure 8.
Based on this scenario, we say $c$ is an elder of $b$. Moreover, if $b$ is its own predecessor, then $b$ is its only elder, so the notion of elder is nontrivial only if $b$ belongs to cohort $\ell+1$.
Theorem 3.
For the purpose of computing elders, there are three particularly good choices of the shrinkage ratio $\theta$ which require no additional computation. In these cases, the elders of an adult are the friends (of the respective type) that have cohort $\ell$ or less. See Figure 4.
These facts allow the following algorithm.
4.2. Algorithm
At level $\ell = 0$, the cover tree has a single adult, $a_0$. All points of $X$ are its children. Its only friend is itself. Every $x \in X$ has some weight and some label, which we assume as given input. For each label $\lambda$, compute the total weight and weighted mean of all children of $a_0$ with that label.
Suppose that at level $\ell$, the friends, children, per-label weights, and per-label means are known for all adults $a \in A_\ell$. We compute these for level $\ell+1$ as follows:

Advance. Shrink the radius by a factor of $\theta$ to $r_{\ell+1} = \theta\, r_\ell$. Every adult $a \in A_\ell$ remains an adult at level $\ell+1$. Mark $a$ as its own elder.

Orphan. For each $a \in A_\ell$, find its orphaned children: those $x$ such that $\|x - a\| > r_{\ell+1}$. Sort the orphans so that the first orphan is the one nearest the per-label mean of the label of greatest weight, the second orphan is the one nearest the per-label mean of the label of second-greatest weight, and so on, via a two-dimensional argsort without repetition. The reason for this sorting will become apparent in Section 6.

Adopt or emancipate. For each $a \in A_\ell$, consider its orphans in order. For each $x$ orphaned by $a$, check whether $x$ can be adopted by another adult using the type-1 friends of $a$:

If so, the adoptive guardian will be an adult $b$ such that $\|x - b\| \le r_{\ell+1}$; $b$ or its predecessor is a type-1 friend of $a$.

If not, then $x$ is a new adult at level $\ell+1$. Its only child is itself. Its friends are initialized to be itself and its predecessor (but these will be updated in the Befriending stage). Note that the set $A_{\ell+1}$ is ordered by cohort, within a cohort by predecessor, and within a predecessor block by proximity to the predecessor's per-label means.


Exchange Teens. (This step is optional for a general cover tree, but it is particularly useful for the computation of elders, which are needed in Section 6.) If $x$ has guardian $a$, then check whether it is closer to some other guardian using type-2 friends. That is, find the adult $b$ minimizing $\|x - b\|$. Note that $\|x - a\| \le \tfrac{1}{2} r_{\ell+1}$ implies that no other adult can be closer to $x$ than $a$, meaning that only teens (not youngins) can be exchanged.

Befriend: Now, all adults and their children are known. We need to find new friends using the recursion relations and the predecessors.

A pair of adults, $a$ and $b$, are type-3 friends at level $\ell+1$ if their predecessors $a'$ and $b'$ are type-3 friends at level $\ell$ and $\|a - b\| \le T_3(\ell+1)$.

A pair of adults, $a$ and $b$, are type-2 friends if they are type-3 friends and $\|a - b\| \le T_2(\ell+1)$.

A pair of adults, $a$ and $b$, are type-1 friends if they are type-2 friends and $\|a - b\| \le T_1(\ell+1)$.

For each newly emancipated adult of cohort $\ell+1$, find its elders by searching its friends for elements of cohort $\ell$ or less.


Weigh: For each adult at level $\ell+1$, compute the per-label weights and per-label weighted means of its children. Sort the labels by their weights, from greatest to least.
At level $\ell+1$, the friends, children, per-label weights, and per-label means are all known.
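The following is a deliberately simplified sketch of one level of refinement. It performs the Advance, Orphan, and Adopt-or-emancipate steps, but scans all adults instead of using friend lists (and omits weights, labels, teen exchange, and elders), so it shows the logic rather than the fast algorithm:

```python
import math

def refine_level(adults, children_of, r_next):
    """One simplified cover-tree refinement step. `adults` is a list of
    points; `children_of[a]` lists the children of a at the current level.
    Returns the adult list and child assignments at the next level."""
    new_adults = list(adults)
    # Children still within the shrunken radius keep their guardian.
    new_children = {a: [c for c in children_of[a] if math.dist(c, a) <= r_next]
                    for a in adults}
    for a in adults:
        for x in (c for c in children_of[a] if math.dist(c, a) > r_next):
            # Adoption: any adult (including newly emancipated ones) within
            # r_next may take the orphan.
            adopter = next((b for b in new_adults
                            if math.dist(x, b) <= r_next), None)
            if adopter is not None:
                new_children[adopter].append(x)
            else:
                # Emancipation: the orphan joins the new cohort.
                new_adults.append(x)
                new_children[x] = [x]
    return new_adults, new_children

adults = [(0.0, 0.0)]
kids = {(0.0, 0.0): [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0)]}
next_adults, next_children = refine_level(adults, kids, 0.5)
```

The friend lists in the full algorithm exist precisely so that the inner adoption search scans a constant-size set rather than all of `new_adults`.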
4.3. Algorithmic Efficiency
Suppose $|X| = n$, and the cover-tree filtration stops at level $L$. We are interested in the storage and computational complexity of this algorithm in terms of $n$, for a fixed ambient dimension $d$.
The desired output is the successor/predecessor tree and the children/guardian relationships at each level. Thus, storage of friend lists is the only issue that might lead us to another algorithm. The main question is: given an adult $a$ at level $\ell$, how many other adults are within a fixed multiple of $r_\ell$ of $a$?
Since no two adults at level $\ell$ are within $r_\ell$ of one another, this is related to the idea of "doubling dimension." Define $D_d$ to be the maximum number of points, including the center, that can be placed in a ball of radius 2 so that no two points are within distance 1 of each other. This number grows with the ambient dimension $d$. In Euclidean space, it is independent of scale; that is, the same number works for a ball of radius $2r$ using points of pairwise distance at least $r$. The number can be bounded by the number of simplices with edge length at least 1 that can be put into a ball of radius 2. Since these simplices have a minimum volume, we get a bound like:
Setting the scale appropriately, we get an upper bound on the expected size of the friends list, which is a constant independent of the level $\ell$. Thus, the total storage is linear in $n$ up to this constant, where a factor of 3 accounts for the lists of type-1, type-2, and type-3 friends.
Setting , we get an upper bound on the expected number of successors of an adult from level to , which is a constant independent of the level . The size of is . The maximum expected size of is . The maximum expected size of is . Generally, the maximum expected size of is . Moreover, the larger the size of each cohort , the shallower the cover tree. That is, is expected to be of the order .
For each adult , we must:

Sum and sort the weights of its children, by label. The expected number of children is . Sorting the labels is for labels.

Compute the distance (an operation) to each of its roughly children. These distances yield the orphans and teens. The expected number of orphans is , and the expected number of teens is .

For each orphan, determine whether it should be adopted or liberated. Compute the distance (an operation) to the friends or successors of the friends of . This set has expected size , which is constant in .

For each teen, determine whether it should be exchanged. Compute the distance (an operation) to the friends or successors of the friends of . This set has expected size , which is constant in .

For each successor—a list of size —update its friends by computing the distance (an operation) to the successors of the friends of . This is expected to be distance computations.
For each adult $a$ at level $\ell$, this is roughly a constant number of operations, albeit with a penalty that grows with the ambient dimension. Level $\ell$ has at most $|A_\ell|$ adults, in which case the total number of computations is roughly
It is interesting that the rapid growth from the ambient dimension is counteracted by the logarithm term. In our experience with applications, this is effectively linear in $n$, as the sum terminates earlier than this estimate. This is because the estimate of the maximum level by $\log(n)$ is typically irrelevant; the maximum level of the cover tree is actually controlled by the ratio of the maximum to the minimum distances in the pointcloud. In applications, these are typically within several orders of magnitude, so we expect the depth $L$ to be small. Moreover, in our supervised-learning application in Section 6, we abort the cover tree much earlier, once the low-entropy regions have been found.

5. Entropy Reduction
For each label $\lambda$, let $X^*_\lambda$ denote the (weighted) points in $X^*$ that have label $\lambda$. For any convex, compact $\Omega \subset \mathbb{R}^d$, let $w_\lambda(\Omega)$ denote the total weight in $\Omega$ of the points of $X^*_\lambda$, which by Section 2 corresponds to the relative likelihood of a point with label $\lambda$ being sampled from $\Omega$. Let $w(\Omega) = \sum_{\lambda \in \Lambda} w_\lambda(\Omega)$.
To quantify the relative prominence of the various labels, we use the information-theoretic notion of entropy. Entropy is characterized by several nice properties, detailed in [3]. The formula is

(3)  $H(\Omega) = -\sum_{\lambda \in \Lambda} \frac{w_\lambda(\Omega)}{w(\Omega)} \log_{|\Lambda|}\!\left(\frac{w_\lambda(\Omega)}{w(\Omega)}\right)$

If all labels have roughly equal weight, then the entropy is 1. If one label has high weight and the others have weight 0, then the entropy is 0. Generally, entropy decreases as a smaller subset of labels becomes more likely. Thus, low entropy is a good indicator that the region $\Omega$ is particularly prominent for a small subset of labels. The question remains: how can we locate low-entropy regions?
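The normalized entropy of equation (3) can be sketched as follows; the dictionary-based helper is our convention (note the base-$|\Lambda|$ logarithm, so a uniform mix gives 1 and a pure label gives 0):

```python
import math

def entropy(label_weights):
    """Normalized entropy of a region: H = -sum p * log(p), with log taken
    base |Labels|. `label_weights` maps each label to its total weight
    w_lambda(Omega) in the region."""
    total = sum(label_weights.values())
    n = len(label_weights)
    if n < 2:
        return 0.0  # a single label carries no uncertainty
    h = 0.0
    for w in label_weights.values():
        if w > 0:
            p = w / total
            h -= p * math.log(p, n)
    return h
```

For example, `entropy({"magenta": 1.0, "green": 1.0})` is 1, while a region dominated by one label scores close to 0.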
Our search for low-entropy regions is motivated by the following elementary theorems; however, we do not insist that these strict regularity hypotheses hold in any given application.
Theorem 4.
Fix a function $f$ that is continuous on a compact, convex set $\Omega$. Fix a radius $r$. For $p \in \Omega$, let $c(p)$ denote the centroid of $f$ on the compact, convex set $\overline{B}(p, r) \cap \Omega$ (where $\overline{B}$ is the closed ball). There exists a point $p^*$ such that $c(p^*) = p^*$.
Proof: Note that $c(p)$ varies continuously with $p$, and apply Brouwer's fixed-point theorem ([11], II.22).
Theorem 5.
Suppose that $f$ is an analytic function with a local max at $p_0$. For any $\epsilon > 0$, there exists a radius $r$ such that the centroid $c$ of $f$ on $\overline{B}(p_0, r)$ satisfies $\|c - p_0\| < \epsilon$.
Proof: Expand $f$ as a power series around $p_0$, and approximate $f$ by truncating at the second-order term. The level sets of this truncation are concentric ellipsoids about $p_0$, as given by a positive-definite symmetric matrix.
Theorem 6.
Suppose $f$ and $g$ are positive, continuous functions defined on a compact, convex region $\Omega$. Suppose that $f$ achieves a local max at $p \in \Omega$ and that $f(p) > g(p)$. Then there is a radius $r > 0$ such that the average value of $f$ on $\overline{B}(p, r')$ is greater than the average value of $g$ on $\overline{B}(p, r')$ for all $r' \le r$.
Proof: The mean value theorem for integrals.
Putting these together, we try to find a region and a ball of radius in that region whose center is near the mean of a particular label. We ask whether the entropy of that ball is nondecreasing with radius; that is, does the entropy become lower (or remain the same) when the ball shrinks? If true, this is consistent with the hypothesis that a local maximum of a particular label occurs near the center of that ball, while other labels remain roughly constant. If false, then we subdivide the region and search around the boundary of the original ball for other smaller regions of low entropy.
Specifically, the compact, convex regions correspond to the children of an adult $a$ in the cover tree, and the sub-balls correspond to the children of the successors of $a$ at level $\ell+1$ in the cover tree. This is the heart of the CDER algorithm in Section 6.
6. Covertree Differencing via Entropy Reduction
The input data is a shrinkage ratio $\theta$ and a training set of pointclouds $X_1, \ldots, X_N$ with labels $\ell(X_1), \ldots, \ell(X_N)$, respectively. Set weights as in Section 2.
6.1. Region Selection
Here we describe the heart of the CDER algorithm, wherein we search the adults of the cover tree, in a data-driven way, to locate regions on which we want to build distributional coordinates. The construction and weighting of those distributional coordinates occurs in Section 6.2.
For each level $\ell$ of the cover tree, define a subset $V_\ell \subseteq A_\ell$ of adults that are still potential candidates for distributional coordinates. Set $V_0 = A_0$. Once $V_\ell = \emptyset$, we break.
The cover tree of $X^*$ is constructed level-by-level as in Section 4.2. At each level $\ell$ of the cover tree (until break), we perform the following computation:
For each adult $a \in V_\ell$, let $C_\ell(a)$ denote the set of children of $a$ at level $\ell$. Let $C_{\ell+1}(a)$ denote the set of children of $a$ at level $\ell+1$. Let $E(a)$ denote the union of the children (at level $\ell$) of the elders of $a$. (If $a$ belongs to cohort $\ell-1$ or less, then $E(a) = C_\ell(a)$.) Note that $C_{\ell+1}(a) \subseteq C_\ell(a) \subseteq E(a)$; see Figure 9. Then, we decide what to do with $a$:

if $V_\ell = \emptyset$, then break. Nothing more is needed.

for each adult $a \in V_\ell$:

if $a$ has only one child (itself) at level $\ell$, then pass. This region is useless.

elif $a$ has only one child (itself) at level $\ell+1$, then pass. This region is useless.

elif …, then build a distributional coordinate using the children of $a$ of each of the dominant labels. (A label is dominant if its points represent a sufficiently large fraction of the total weight of the children; for a two-label task, this is simply the majority label.)

elif …, then append $a$ to $V_{\ell+1}$. The smaller ball may contain a lower-entropy region.

elif …, then append the successors of $a$ (other than $a$ itself!) to $V_{\ell+1}$. The annulus may contain a lower-entropy region.

elif …, then append the successors of $a$ (other than $a$ itself!) to $V_{\ell+1}$. The annulus may contain a lower-entropy region.

elif …, then pass. The region is useless.

elif …, then pass. The region is useless.

else, append the successors of $a$ to $V_{\ell+1}$. There is too little information to decide whether $a$ contains a low-entropy region, so reanalyze everything at the next level.
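The decision chain above can be sketched as a single function. The entropy thresholds and the comparison order here are illustrative assumptions (our source elides the exact tests), so treat this as a schematic rather than the paper's rule:

```python
def select_action(n_children_now, n_children_next, h_elder, h_now, h_next,
                  low=0.1, high=0.9):
    """Schematic CDER region-selection decision for one adult. `h_elder`,
    `h_now`, and `h_next` are entropies of the elder region, the current
    ball, and the shrunken ball; `low`/`high` are assumed thresholds.
    Returns 'pass', 'build', 'append_self', or 'append_successors'."""
    if n_children_now <= 1 or n_children_next <= 1:
        return "pass"                # region is useless
    if h_next <= h_now <= low:
        return "build"               # low entropy, stable under shrinking
    if h_next < h_now:
        return "append_self"         # the smaller ball may reach low entropy
    if h_now >= high:
        return "pass"                # hopelessly mixed region
    return "append_successors"       # too little information; recurse

action = select_action(12, 9, 0.8, 0.06, 0.05)  # -> "build"
```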

Recall that our method in Section 4.2 produces the list $A_\ell$, which is ordered by cohort, within a cohort by predecessor, and within a predecessor block by proximity to the predecessor's per-label means. Therefore, if $a$ precedes $b$ in this ordering, then $a$ tends to be nearer the mean of the most prominent label of a larger region than $b$. Hence, the distributional coordinates are sorted by "granularity," from coarse to fine.
In some scenarios, detailed analysis of clusters is more important than efficient classification. Therefore, one can implement a “nonparsimonious” version by replacing all pass operations with append. For machine learning applications, the “parsimonious” version is far faster, as the cover tree can be stopped much earlier than the limit suggested in Section 4.3.
6.2. Build Distributional Coordinates
For each $a$ selected to build, we construct a distributional coordinate in the following way.
For each of the dominant labels among the children of $a$, use PCA/SVD to build a Gaussian model $g$ of mass 1. Let $H$ denote the entropy difference on the region caused by erasing the non-dominant labels. We amplify (or attenuate) each Gaussian model by a coefficient combining the label's weight, the entropy term, and the relative volume of the cover-tree ball.
The distributional coordinate is the amplified Gaussian.
One can experiment with different weighting systems; we have found these to be generally effective, and we justify this choice for the following geometric reasons:

The weight represents the relative likelihood that a point of the given label was selected from the region on which this coordinate was detected. Perhaps, by chance in our selection process, a single region was detected at a later level in the form of several smaller regions. The weights of the smaller regions will sum to the weight of the larger region.

The entropy term is the "certainty" with which this region was selected. The entropy $H$ is the information that is lost if the other labels are removed from the region, so $1 - H$ is the information that remains. In other words, this term penalizes regions according to the impurity of their labels.

The volume term accounts for the relative remoteness of the sample region. All else being equal, remote regions are particularly distinctive. Because of the cover-tree construction, remoteness can be measured by the size of the cover-tree ball. For example, suppose that two different regions are detected by CDER, each with the same entropy, weight, and configuration of points; the Gaussian coordinates are translates of one another. Suppose that the first region is very distant from the rest of the pointcloud, so it is detected at an early level $\ell_1$. Suppose that the second region is surrounded closely by other data, so it is not detected by the cover tree until a later level $\ell_2 > \ell_1$. The first region's ball, and hence its relative volume, is larger, so its coordinate receives greater amplitude.
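A sketch of an amplification coefficient combining the three ingredients above: the label's weight, the information retained ($1 - H$), and the relative volume of the detection ball (radius $r_0\theta^\ell$, so volume proportional to $\theta^{\ell d}$). The paper's exact formula is elided in our source, so this particular combination is an assumption for illustration:

```python
def gaussian_amplitude(label_weight, region_entropy, level, theta, dim):
    """Illustrative amplification coefficient for a Gaussian coordinate:
    weight * (1 - H) * (relative ball volume at the detection level).
    This combination is an assumption, not the paper's exact formula."""
    volume_factor = theta ** (level * dim)  # ball volume up to a constant
    return label_weight * (1.0 - region_entropy) * volume_factor

early = gaussian_amplitude(0.25, 0.1, 2, 0.5, 2)  # remote region, found early
late = gaussian_amplitude(0.25, 0.1, 5, 0.5, 2)   # crowded region, found late
```

Since $\theta < 1$, a region detected at an earlier level receives a larger volume factor, matching the remoteness discussion above.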
7. Examples
This section illustrates the promise of CDER on several synthetic examples of varying levels of complexity that are designed to stress certain features of the algorithm.
7.1. Blobs
This example shows that CDER ignores highdensity regions unless they have low entropy.
Consider two labels: 0/magenta and 1/green. (Labels are indicated by colors, and we have intentionally chosen colors that seem to show up non-catastrophically in black-and-white printouts. Nonetheless, we recommend printing this paper in color, or at least reading this section on a computer screen!) The cloud collection has 25 pointclouds of each label. Each magenta pointcloud consists of:

100 points sampled from a standard normal distribution

2 points sampled from a normal distribution with and mean

2 points sampled from a normal distribution with and mean

2 points sampled from a normal distribution with and mean

2 points sampled from a normal distribution with and mean
Each green pointcloud consists of

100 points sampled from a standard normal distribution

2 points sampled from a normal distribution with and mean

2 points sampled from a normal distribution with and mean

2 points sampled from a normal distribution with and mean

2 points sampled from a normal distribution with and mean
This produces the cloud collection in Figure 2. The output of CDER appears in Figure 10.
In a typical run, the algorithm terminates at cover-tree level 8, even though the complete cover tree of $X^*$ ends at level 14. A total of 26 distributional coordinates are produced, and their weights vary by about 3 orders of magnitude. Table 1 shows how conservative the entropy-reduction process is at selecting regions to generate distributional coordinates.
level $\ell$ | adults $|A_\ell|$ | candidates $|V_\ell|$ | new Gaussians
0 | 1 | 1 | +0
1 | 7 | 7 | +0
2 | 29 | 29 | +4
3 | 119 | 37 | +7
4 | 446 | 36 | +7
5 | 1379 | 40 | +6
6 | 2827 | 23 | +1
7 | 3903 | 3 | +1
To judge accuracy on testing data in a supervised-learning scenario, we need a method to label a test pointcloud using the distributional coordinates. Each test pointcloud is mapped to a point in $\mathbb{R}^D$. Many sophisticated methods are possible, but for simplicity, we simply ask: for a given pointcloud $X$, which is bigger, the Euclidean norm of the magenta Gaussian coordinates evaluated on $X$, or the Euclidean norm of the green Gaussian coordinates evaluated on $X$?
With this simple comparison, the algorithm achieves 100% accuracy in a 5-fold cross-validation of this cloud collection with an 80/20 training/testing split. More precisely, for each run of the cross-validation, we take 80 percent of the pointclouds, turn these into a cloud collection, build distributional coordinates using this cloud collection, and then test the results on the remaining 20 percent of the pointclouds. This entire procedure is then repeated five times.
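For concreteness, the norm comparison can be sketched as follows. The helper names and the toy one-dimensional Gaussian coordinates are ours, standing in for the coordinates CDER actually builds:

```python
import math

def evaluate(coords, cloud):
    """Map a weighted pointcloud to the vector of its coordinate values,
    where each coordinate f contributes sum_x w_x * f(x)."""
    return [sum(w * f(x) for x, w in cloud) for f in coords]

def classify(cloud, magenta_coords, green_coords):
    """Compare the Euclidean norms of the two labels' coordinate vectors."""
    m = math.sqrt(sum(v * v for v in evaluate(magenta_coords, cloud)))
    g = math.sqrt(sum(v * v for v in evaluate(green_coords, cloud)))
    return "magenta" if m > g else "green"

# Toy coordinates: one Gaussian bump per label, centered at -2 and +2.
bump = lambda c: (lambda x: math.exp(-(x[0] - c) ** 2))
cloud = [((-2.1,), 0.5), ((-1.9,), 0.5)]
label = classify(cloud, [bump(-2.0)], [bump(2.0)])
```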
Moreover, the relative masses of the 28 distributional coordinates vary over four orders of magnitude, so for this sort of comparison, one could dispose of many of them in the interest of speed while preserving accuracy. Note that these distributions are contrived to have the same mass (0th moment), mean (1st moment), and variance. Elementary statistical tests would not distinguish them; higher-moment or skewness tests would be necessary.
7.2. Blocks
This example shows that CDER is fairly robust against background noise that prevents strict separation. It also demonstrates that smoothness of the underlying distributions is not necessary for good results.
Consider two labels: 0/magenta and 1/green. The cloud collection consists of 100 magenta pointclouds and 100 green pointclouds. Each magenta pointcloud is generated by sampling 30 points uniformly in the unit square, as well as 2 extra points sampled uniformly in each of two small squares. Each green pointcloud is generated by sampling 30 points uniformly in the unit square, as well as 2 extra points sampled uniformly in each of two other small squares. See Figures 1 and 11.
Using the same simple comparison as in Section 7.1, the algorithm achieves 88% accuracy despite the high background noise.
7.3. Deep Field
In this example, we demonstrate that Gaussian mixtures are (nearly) "fixed points" of CDER. The example also demonstrates that the algorithm can handle unequal sample sizes via the weighting system in Section 2.
Consider two labels: 0/magenta and 1/green. The underlying distribution is built from 50 Gaussian distributions. For each of these, we choose a random label, a mean point uniformly on a square, a pair of orthogonal covariance axes uniformly from a fixed range, a random rotation angle, and an amplification factor (that is, a sample size) uniformly between 50 and 5000. For a particular random seed, we arrive at a combined density function for each label, as in Figure 14. Then, for each label, we produce between 20 and 40 point clouds, each with between 50 and 500 points. For a particular random seed, we arrive at the cloud collection in Figure 12.
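A sketch of this sampling scheme follows. The square size and covariance range below are hypothetical placeholders; only the sample-size range of 50 to 5000 and the count of 50 components are from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gaussian_component(square=10.0, sig_lo=0.1, sig_hi=1.0):
    """One random Gaussian component: label, mean, rotated covariance, and
    sample size.  The square size and sigma range are hypothetical."""
    label = rng.integers(0, 2)                    # 0/magenta or 1/green
    mean = rng.uniform(-square, square, size=2)
    sigmas = rng.uniform(sig_lo, sig_hi, size=2)  # axis scales
    theta = rng.uniform(0.0, np.pi)               # random rotation angle
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    cov = rot @ np.diag(sigmas ** 2) @ rot.T
    size = rng.integers(50, 5001)                 # amplification factor
    return label, mean, cov, size

components = [random_gaussian_component() for _ in range(50)]
samples = [rng.multivariate_normal(m, c, size=n) for _, m, c, n in components]
```

Summing the component densities by label yields the two combined density functions; sampling each component with its amplification factor yields the unequal sample sizes that the weighting system must absorb.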
The CDER algorithm builds a Gaussian mixture model focused on regions of low entropy (Figure 13), so it should not be surprising that it builds Gaussians near the original Gaussians, as in Figure 14.
7.4. Three Labels
For simplicity, the previous examples have involved only two labels. In a two-label system, a low-entropy region has exactly one dominant label. However, the algorithm is sensible for any number of labels, and it may be that the low-entropy regions are each dominated by several labels. Hence, an ensemble of regions may be necessary to distinguish point clouds.
Consider three labels: 0/magenta, 1/green, and 2/orange. The underlying distribution for each label is a triple of standard normal distributions, with the centers chosen so that some centers are shared between pairs of labels. See Figure 15.
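A sketch of a three-label collection in this spirit follows. The centers below are hypothetical, chosen so that several are shared between pairs of labels (so no single low-entropy region determines a label on its own):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical centers; each label is a triple of standard normals, and
# some centers are shared between two labels
centers = {
    0: [(-5.0, 0.0), (0.0, 0.0), (5.0, 0.0)],   # magenta
    1: [(-5.0, 0.0), (0.0, 5.0), (5.0, 5.0)],   # green
    2: [(0.0, 0.0), (0.0, 5.0), (-5.0, 5.0)],   # orange
}

def sample_cloud(label, n_per_center=20):
    """One point cloud for a label: equal samples from its three normals."""
    return np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_center, 2))
                      for c in centers[label]])

clouds = {lab: [sample_cloud(lab) for _ in range(100)] for lab in centers}
```

With such sharing, any one region is dominated by two labels at once, so classification must combine evidence from an ensemble of regions.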
The CDER algorithm detects these shared regions perfectly, achieving 100% accuracy in a 5-fold cross-validation of this cloud collection with an 80/20 training/testing split, using the same simple comparison method as in Section 7.1.
8. Stability
Ideally, one would like to prove that CDER is stable or backward-stable, in the sense of [14, Lecture 14]. To formulate stability or backward-stability, CDER must be expressed as a numerical approximation of a formal function between normed vector spaces.
Let $V$ denote the vector space of piecewise-continuous functions on our compact domain. Define a norm on $V \times V$ by $\|(f,g)\| = \|f\| + \|g\|$, where each term is the sup norm on our (compact) domain. Given two "input" functions $(f,g)$, we consider them as density functions for each of two labels. Consider the pointwise entropy
$$E_{f,g}(x) = -p(x)\log p(x) - \big(1 - p(x)\big)\log\big(1 - p(x)\big), \qquad p(x) = \frac{f(x)}{f(x)+g(x)}.$$
Let $T$ denote the formal function $T(f,g) = \big(T_1(f,g),\, T_2(f,g)\big)$, where
$$T_1(f,g) = (1 - E_{f,g})\, f \tag{4}$$
and
$$T_2(f,g) = (1 - E_{f,g})\, g. \tag{5}$$
A numerical algorithm for $T$ can be obtained roughly as: "sample equally from $f$ and $g$, and apply CDER to generate weighted distributional coordinates. The weighted sum of the distributional coordinates for label 1 is $T_1$, and the weighted sum of the distributional coordinates for label 2 is $T_2$." We have not yet succeeded in proving either stability statement for CDER; such a result is a subject of future work, but the high cross-validation accuracy of the examples above is promising.
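The pointwise entropy driving this construction can be sketched numerically. The function below is an illustration, not the paper's implementation: it computes the Shannon entropy of the two-label proportions at each point, which is 0 where one label completely dominates and maximal where the labels tie.

```python
import numpy as np

def pointwise_entropy(f, g):
    """Shannon entropy (in bits) of the label proportions (f, g)/(f+g) at
    each point: 0 where one density dominates completely, 1 at a tie."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    total = f + g
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(total > 0, f / total, 0.5)
        ent = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return np.nan_to_num(ent)  # adopt the convention 0 * log(0) = 0
```

Weighting each density by one minus this entropy suppresses high-entropy (ambiguous) regions and retains low-entropy (label-dominated) ones, which is the behavior CDER's region search approximates.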
9. Discussion
This paper introduced CDER, a data-driven, label-driven feature extraction method for collections of labeled point clouds. CDER is fast, both in terms of theoretical complexity and in initial tests on examples. It does not require the user to choose any tuning parameters, and the geometric meaning of its output features is transparently clear. This section outlines some future directions and makes a few generalizing remarks.
The attentive reader of Section 4 may notice that cover trees can be defined in an arbitrary metric space, and indeed this is how they were originally defined [7]. We use them to construct distributional coordinates, and our algorithm for doing so (Section 6) demands that we be able to quickly compute means of sets of points. While the Fréchet mean ([5], IX) can be defined in any metric space, there are not always fast algorithms for its computation (it also need not exist in all cases). So while more generality could be achieved in the definition of CDER, we make our complexity statements only for cloud collections in a common Euclidean space.
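To make the computational point concrete: in Euclidean space the Fréchet mean (the minimizer of summed squared distances) is simply the coordinate-wise average, computable in linear time, whereas in a general metric space one may have to search for it. A brute-force sketch over a hypothetical candidate set:

```python
import numpy as np

def frechet_mean(points, metric, candidates):
    """Brute-force Fréchet mean over a candidate set: the candidate that
    minimizes the sum of squared metric distances to the points."""
    cost = lambda c: sum(metric(c, p) ** 2 for p in points)
    return min(candidates, key=cost)

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
euclid = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
# in Euclidean space the minimizer is just the coordinate-wise mean, (1, 1)
```

For an arbitrary metric, no closed form like the coordinate-wise mean is available, which is why CDER's fast-mean requirement restricts the complexity statements to Euclidean cloud collections.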
All examples in Section 7 were artificial, and simply intended to help visualize the novel and fairly technical CDER algorithm and emphasize its key properties. Future work will involve applications of CDER to real data from a variety of fields, including large administrative datasets of interest to social scientists and to vehicle tracking [4].
We also hope that CDER will prove useful as a feature-extraction method in topological data analysis [6], since persistence diagrams can be thought of as point clouds in the plane. Future work will compare the performance of CDER against other such feature-extraction methods (for example, [12] and [1]).
Finally, we recall the weighting discussion in Section 2, where we used the simplifying assumptions that each color/label was equally likely and that each point within a single point cloud should be given equal weight. We note that CDER can be easily adapted to accommodate other prior assumptions about relative likelihoods of labels, or even prior assumptions about outlier status of certain points in a cloud, say as part of a Bayesian learning process.
References
 [1] A. Adcock, E. Carlsson, and G. Carlsson. The ring of algebraic functions on persistence bar codes. ArXiv e-prints, 2013.
 [2] William K. Allard, Guangliang Chen, and Mauro Maggioni. Multiscale geometric methods for data sets II: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 32(3):435–462, 2012.
 [3] John C. Baez, Tobias Fritz, and Tom Leinster. A characterization of entropy in terms of information loss. Entropy, 13(11):1945–1957, 2011.
 [4] Paul Bendich, Sang Chin, Jesse Clarke, John deSena, John Harer, Elizabeth Munch, Andrew Newman, David Porter, David Rouse, Nate Strawn, and Adam Watkins. Topological and statistical behavior classifiers for tracking applications. IEEE Trans. on Aero. and Elec. Sys., 52(6):2644–2661, 2016.
 [5] D. Burago, I.U.D. Burago, and S. Ivanov. A Course in Metric Geometry. Crm Proceedings & Lecture Notes. American Mathematical Society, 2001.
 [6] Herbert Edelsbrunner and John Harer. Persistent homology: a survey. Contemporary Mathematics, 453:257–282, 2008.
 [7] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing, 35(5):1148–1184, 2006.
 [8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
 [9] Michael Kerber, Dmitriy Morozov, and Arnur Nigmetov. Geometry helps to compare persistence diagrams. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), pages 103–112, 2016.
 [10] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79–86, 03 1951.
 [11] James R. Munkres. Elements of Algebraic Topology. Addison Wesley, 1993.
 [12] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multiscale kernel for topological machine learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4741–4748, 2015.
 [13] Donald R. Sheehy. Linear-size approximations to the Vietoris–Rips filtration. Discrete & Computational Geometry, 49(4):778–796, 2013.
 [14] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.