Riemannian and sub-Riemannian methods for dimension reduction

Research output: Book/Report › Ph.D. thesis › Research

In this thesis, we propose new methods, based on differential geometry, for dimension reduction: finding a representation of a set of observations in a space of lower dimension than the original data space. Methods for dimension reduction form a cornerstone of statistics and thus have a very wide range of applications. For instance, a lower-dimensional representation of a data set allows visualization and is often necessary for subsequent statistical analyses. In ordinary Euclidean statistics, the data belong to a vector space, and the lower-dimensional space might be a linear subspace or a non-linear submanifold approximating the observations. The study of such smooth manifolds, differential geometry, naturally plays an important role in this last case, or when the data space is itself a known manifold. Methods for analysing this type of data form the field of geometric statistics. In this setting, the approximating space found by dimension reduction is naturally a submanifold of the given manifold. The starting point of this thesis is geometric statistics for observations belonging to a known Riemannian manifold, but parts of our work form a contribution even in the case of data belonging to Euclidean space, R^d.

An important example of manifold-valued data is shapes, in our case discrete curves or surfaces. In evolutionary biology, researchers are interested in studying the reasons for and implications of morphological differences between species, and shape is one way to formalize morphology. This application motivates the first main contribution of the thesis. We generalize a dimension reduction method used in evolutionary biology, phylogenetic principal component analysis (P-PCA), to work for data on a Riemannian manifold, so that it can be applied to shape data. P-PCA is a version of PCA for observations that are assumed to be leaf nodes of a phylogenetic tree. From a statistical point of view, the important property of such data is that the observations (leaf node values) are not necessarily independent. We define and estimate intrinsic weighted means and covariances on a manifold that take the dependency of the observations into account. We then define phylogenetic PCA on a manifold to be the eigendecomposition of the weighted covariance in the tangent space of the weighted mean. We show that the mean estimator currently used in evolutionary biology for studying morphology corresponds to taking only a single step of our Riemannian gradient descent algorithm for the intrinsic mean, when the observations are represented in Kendall's shape space.
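The two ingredients above, a weighted intrinsic mean computed by Riemannian gradient descent and an eigendecomposition of the weighted covariance in the tangent space at that mean, can be illustrated with a minimal sketch. The sketch below uses the unit sphere S^2 rather than Kendall's shape space, and the weights are arbitrary placeholders for weights that would be derived from the phylogenetic tree covariance; it is an illustration of the general recipe, not the thesis's implementation.

```python
import numpy as np

def log_map(p, q):
    """Riemannian log map on the unit sphere: tangent vector at p pointing to q."""
    cos_t = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-10:
        return np.zeros_like(p)
    v = q - cos_t * p                      # component of q orthogonal to p
    return theta * v / np.linalg.norm(v)   # rescaled to geodesic distance

def exp_map(p, v):
    """Riemannian exp map on the unit sphere: follow the geodesic from p along v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-10:
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v

def weighted_frechet_mean(X, w, n_iter=100, step=1.0):
    """Weighted intrinsic mean via Riemannian gradient descent."""
    w = w / w.sum()
    mu = X[0]
    for _ in range(n_iter):
        grad = sum(wi * log_map(mu, xi) for wi, xi in zip(w, X))
        mu = exp_map(mu, step * grad)
    return mu

def weighted_tangent_pca(X, w, mu):
    """Eigendecomposition of the weighted covariance in the tangent space at mu."""
    w = w / w.sum()
    V = np.stack([log_map(mu, xi) for xi in X])  # tangent representations
    C = (w[:, None] * V).T @ V                   # weighted covariance matrix
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Toy data: noisy points around the north pole of the sphere.
rng = np.random.default_rng(0)
pole = np.array([0.0, 0.0, 1.0])
X = np.stack([exp_map(pole, 0.3 * rng.standard_normal(3) * [1, 1, 0])
              for _ in range(20)])
w = rng.uniform(0.5, 1.5, size=20)  # placeholder weights (not tree-derived)

mu = weighted_frechet_mean(X, w)
evals, evecs = weighted_tangent_pca(X, w, mu)
```

Taking only one iteration of `weighted_frechet_mean` (with `n_iter=1`) corresponds to the single-step estimator the thesis relates to current practice in evolutionary biology.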

Our second main contribution is a non-parametric method for dimension reduction that approximates a set of observations by a very flexible class of submanifolds. This method is novel even in the case of Euclidean data. The method works by constructing a subbundle of the tangent bundle on the data manifold M via local PCA; we call this subbundle the principal subbundle. We then observe that this subbundle induces a sub-Riemannian structure on M, and we show that the resulting sub-Riemannian geodesics with respect to this structure stay close to the set of observations. Moreover, we show that sub-Riemannian geodesics starting from a given point locally generate a submanifold which is radially aligned with the estimated subbundle, even for non-integrable subbundles. Non-integrability is likely to occur when the subbundle is estimated from noisy data, and our method demonstrates that sub-Riemannian geometry is a natural framework for dealing with such problems. Numerical experiments illustrate the power of our framework by showing that we achieve reconstructions over impressively large ranges, even in the presence of high levels of noise.
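The local-PCA step that produces the principal subbundle can be sketched as follows: at each base point, the top principal directions of the nearest-neighbour cloud give an orthonormal frame spanning the estimated subspace. This is a minimal Euclidean-ambient sketch under assumed neighbourhood and rank parameters; the sub-Riemannian geodesic computation (e.g. by Hamiltonian shooting) used in the thesis is omitted.

```python
import numpy as np

def local_frame(X, x0, k_neighbors=15, rank=2):
    """Estimate the principal subspace at x0 by PCA on its k nearest neighbors.
    Repeating this over base points yields a rank-`rank` subbundle estimate."""
    d2 = np.sum((X - x0) ** 2, axis=1)
    nbrs = X[np.argsort(d2)[:k_neighbors]]
    centered = nbrs - nbrs.mean(axis=0)
    # Principal directions are the top right singular vectors of the local cloud.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:rank]  # (rank, ambient_dim) orthonormal frame

# Toy data: noisy samples from a 2D paraboloid embedded in R^3.
rng = np.random.default_rng(1)
uv = rng.uniform(-1, 1, size=(300, 2))
X = np.column_stack([uv, 0.5 * (uv ** 2).sum(axis=1)])
X += 0.01 * rng.standard_normal(X.shape)

# Near the origin the paraboloid's tangent plane is the xy-plane,
# so the estimated frame should have small z-components.
frame = local_frame(X, np.zeros(3), rank=2)
```

When the frames estimated at nearby points fail to fit together into coordinate surfaces (the non-integrable case, typical for noisy data), the sub-Riemannian machinery described above is precisely what still produces well-defined geodesics and radially aligned submanifolds.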
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Number of pages: 113
Publication status: Published - 2024
