\documentclass[a4paper,11pt,openany,extrafontsizes]{memoir}

\input{preamble}
\usepackage[firstpage]{draftwatermark}

\begin{document}

\pagestyle{plain}
\tightlists%

\begin{titlingpage}
  \begin{center}
    \vspace{1cm}
    \textsf{\Huge{University of Oxford}}\\
    \vspace{1cm}
    \includegraphics[scale=.8]{Stats_Logo.png}\\
    \vspace{2cm}
    \Huge{\thetitle}\\
    \vspace{2cm}
    \large{by\\[14pt]\theauthor\\[8pt]St Catherine's College}\\
    % \vspace{2.2cm}
    \vfill
    \large{A dissertation submitted in partial fulfilment of the degree of Master of Science in Applied Statistics}\\
    \vspace{.5cm}
    \large{\emph{Department of Statistics, 24--29 St Giles,\\Oxford, OX1 3LB}}\\
    \vspace{1cm}
    \large{\thedate}
  \end{center}
\end{titlingpage}

%\chapterstyle{hangnum}
%\chapterstyle{ell}
%\chapterstyle{southall}
\chapterstyle{wilsondob}

\frontmatter
\cleardoublepage%

\chapter*{Declaration of authorship}
\emph{This is my own work (except where otherwise indicated).}\\[2cm]
\begin{center}
  Date \hspace{.5\linewidth} Signature
\end{center}
\cleardoublepage%

\begin{abstract}
  Abstract here
\end{abstract}
\cleardoublepage%

\chapter*{Acknowledgements}%
\label{cha:acknowledgements}

Thank you!
\cleardoublepage%

\tableofcontents*
\listoffigures*
\listoftables*

\clearpage
\mainmatter%

\chapter{Introduction}%
\label{cha:introduction}

\chapter{Topological Data Analysis and Persistent Homology}%
\label{cha:tda-ph}

\section{Homology}%
\label{sec:homology}

Our goal is to understand the topological structure of a metric space. For this, we can use \emph{homology}, which associates to a metric space $X$ and a dimension $i$ a vector space $H_i(X)$. The dimension of $H_i(X)$ counts the $i$-dimensional topological features of $X$: the dimension of $H_0(X)$ is the number of path-connected components of $X$, the dimension of $H_1(X)$ is the number of holes in $X$, and the dimension of $H_2(X)$ is the number of voids.
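For instance (standard values, quoted here as an illustration rather than derived, with homology taken over a field), the circle $S^1$, the sphere $S^2$, and the torus $T^2$ give
\[
  \begin{array}{l|ccc}
    X & \dim H_0(X) & \dim H_1(X) & \dim H_2(X) \\ \hline
    S^1 & 1 & 1 & 0 \\
    S^2 & 1 & 0 & 1 \\
    T^2 & 1 & 2 & 1
  \end{array}
\]
Each of these spaces is connected. The circle has a single hole and no void, the sphere encloses one void but has no hole, and the torus has two independent holes around one void.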
Crucially, these vector spaces are robust to continuous deformation of the underlying metric space (they are \emph{homotopy invariant}). However, computing the homology of an arbitrary metric space can be extremely difficult. It is necessary to approximate the space by a structure that is both combinatorial and topological in nature.

\section{Simplicial Complexes}%
\label{sec:simplicial-complexes}

To understand the topological structure of a metric space, we need a way to decompose it into smaller pieces which, when assembled, preserve the overall organisation of the space. For this, we use a structure called a \emph{simplicial complex}, which is a higher-dimensional generalization of a graph.

The building blocks of this representation are \emph{simplices}, which are convex hulls of finite sets of affinely independent points. Examples of simplices include single points, segments, triangles, and tetrahedra (in dimensions 0, 1, 2, and 3 respectively).

\begin{defn}[Simplex]
  The \emph{$k$-dimensional simplex} $\sigma = [x_0,\ldots,x_k]$ is the convex hull of the set $\{x_0,\ldots,x_k\} \subset \mathbb{R}^d$, where $x_0,\ldots,x_k$ are affinely independent. The points $x_0,\ldots,x_k$ are called the \emph{vertices} of $\sigma$, and the simplices defined by the subsets of $\{x_0,\ldots,x_k\}$ are called the \emph{faces} of $\sigma$.
\end{defn}

We then need a way to combine these basic building blocks meaningfully, so that the resulting object adequately reflects the topological structure of the metric space.

\begin{defn}[Simplicial complex]
  A \emph{simplicial complex} is a collection $K$ of simplices such that:
  \begin{itemize}
  \item any face of a simplex of $K$ is a simplex of $K$,
  \item the intersection of two simplices of $K$ is either empty or a common face of both.
  \end{itemize}
\end{defn}

%% TODO figure with examples of simplicial complexes

Using these definitions, we can define homology on simplicial complexes.
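
As a small illustration (the vertex names $a$, $b$, $c$ are hypothetical and stand for three affinely independent points of $\mathbb{R}^2$), consider the hollow triangle
\[
  K = \{[a], [b], [c], [a,b], [b,c], [a,c]\}.
\]
The faces of each edge are its two endpoints, which belong to $K$, and two distinct edges intersect in exactly one shared vertex, so both conditions of the definition hold. Adding the simplex $[a,b,c]$ to $K$ yields another simplicial complex, the filled triangle. By contrast, the collection $\{[a,b,c]\}$ alone is not a simplicial complex, since the edges and vertices of the triangle are missing from it. In terms of homology, the hollow triangle has one connected component and one hole ($\dim H_0 = 1$, $\dim H_1 = 1$), while the filled triangle has no hole ($\dim H_1 = 0$).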
%% TODO add reference for more details/do it myself?

\section{Filtrations}%
\label{sec:filtrations}

If we consider that a simplicial complex is a kind of ``discretization'' of a metric space, we realise that there must be an issue of \emph{scale}. For our analysis to be invariant under small perturbations in the data, we would need to find a scale parameter that captures the relevant topological structure, neither picking up small perturbations nor ignoring important smaller features.

%% TODO rewrite using the Cech filtration as an example?

The ideal solution to this problem is to consider all scales at once: this is the objective of \emph{filtered simplicial complexes}.

\begin{defn}[Filtration]
  A \emph{filtered simplicial complex}, or simply a \emph{filtration}, $K$ is a sequence ${(K_i)}_{i\in I}$ of simplicial complexes such that:
  \begin{itemize}
  \item for any $i, j \in I$, if $i < j$ then $K_i \subseteq K_j$,
  \item $\bigcup_{i\in I} K_i = K$.
  \end{itemize}
\end{defn}

\section{Persistent Homology}%
\label{sec:persistent-homology}

We can now compute the homology of each step in a filtration. This leads to the notion of \emph{persistent homology}, which gives us all the information necessary to establish the topological structure of the metric space at multiple scales.

\begin{defn}[Persistent homology]
  The \emph{$p$-th persistent homology} of a filtered simplicial complex $K = {(K_i)}_{i\in I}$ is the pair $(\{H_p(K_i)\}_{i\in I}, \{f_{i,j}\}_{i,j\in I, i\leq j})$, where for all $i\leq j$, $f_{i,j} : H_p(K_i) \to H_p(K_j)$ is induced by the inclusion map $K_i \hookrightarrow K_j$.
\end{defn}

The maps $f_{i,j}$ allow us to link generators across successive homology spaces in the filtration. Since each generator corresponds to a topological feature (connected component, hole, void, etc., depending on the dimension $p$), we can determine whether it survives in the next step of the filtration.
We can now determine when each feature is born and when it dies (if it dies at all), and record its lifetime as a half-open interval from birth to death. This representation depends on the choice of basis for each homology space $H_p(K_i)$. However, by the Fundamental Theorem of Persistent Homology, we can choose basis vectors in each homology space such that the collection of half-open intervals is well-defined and unique. This collection is called a \emph{barcode}.

%% TODO references for the Fundamental Theorem

\section{Topological summaries: barcodes and persistence diagrams}%
\label{sec:topol-summ}

In order to interpret the results of the persistent homology computation, we need to compare the output for a particular data set to a suitable null model. For this, we need a similarity measure between barcodes and a way to evaluate the statistical significance of the results.

One possible approach is to define a space into which barcodes can be mapped and in which their geometric properties can be studied. The space of \emph{persistence diagrams} is one such example.

\begin{defn}[Persistence diagrams]
  A \emph{persistence diagram} is the union of a finite multiset of points in $\bar{\mathbb{R}}^2$ with the diagonal $\Delta = \{(x,x) \;|\; x\in\mathbb{R}\}$, where every point of $\Delta$ has infinite multiplicity.
\end{defn}

The diagonal $\Delta$ is added to facilitate comparisons between diagrams, as points near the diagonal correspond to short-lived topological features, which are likely to be caused by small perturbations in the data.

We can now define several distances on the space of persistence diagrams.
\begin{defn}[Wasserstein distance]
  The \emph{$p$-th Wasserstein distance} between two diagrams $X$ and $Y$ is
  \[
    W_p[d](X, Y) = \inf_{\phi:X\to Y} {\left[\sum_{x\in X} {d\left(x, \phi(x)\right)}^p\right]}^{1/p}
  \]
  for $p\in [1,\infty)$, and
  \[
    W_\infty[d](X, Y) = \inf_{\phi:X\to Y} \sup_{x\in X} d\left(x, \phi(x)\right)
  \]
  for $p = \infty$, where $d$ is a distance on $\mathbb{R}^2$ and $\phi$ ranges over all bijections from $X$ to $Y$.
\end{defn}

\begin{defn}[Bottleneck distance]
  The \emph{bottleneck distance} is defined as the infinite Wasserstein distance with $d$ the uniform norm: $d_B = W_\infty[L_\infty]$.
\end{defn}

Since the bottleneck distance is by far the most commonly used, we will focus on it in what follows. It is symmetric, non-negative, and satisfies the triangle inequality. However, it is not a true distance, as it is fairly straightforward to come up with two distinct diagrams at bottleneck distance zero, even on multisets not touching the diagonal $\Delta$.

\section{Stability}%
\label{sec:stability}

\chapter{Temporal Networks}%
\label{cha:temporal-networks}

\section{Definition and basic properties}%
\label{sec:defin-basic-prop}

In this section, we introduce the notion of temporal networks, or temporal graphs. This is a complex notion, with many competing definitions and interpretations.

First, we restate the standard definition of a non-temporal, static graph.

\begin{defn}[Graph]
  A \emph{graph} is a pair $G = (V, E)$, where $V$ is a finite set of \emph{nodes} (or \emph{vertices}), and $E \subseteq V\times V$ is a set of \emph{edges}. A \emph{weighted graph} is defined by $G = (V, E, w)$, where $w : E\to \mathbb{R}_+$ is called the \emph{weight function}.
\end{defn}

We also define some basic concepts that will be needed later on to build simplicial complexes on graphs.

\begin{defn}[Clique]
  A \emph{clique} is a set of nodes where each pair is connected.
  That is, a clique $C$ of a graph $G = (V,E)$ is a subset of $V$ such that $\forall i,j\in C, i \neq j \implies (i,j)\in E$. A clique is said to be \emph{maximal} if it is not contained in any larger clique.
\end{defn}

Temporal networks can be defined in the more general framework of \emph{multilayer networks}. However, this definition is much too general for our simple applications, and we restrict ourselves to edge-centric time-varying graphs. In this model, the set of nodes is fixed and does not change over time, whereas edges can appear or disappear at different timestamps.

\begin{defn}[Temporal network]
  A \emph{temporal network} (or graph) is a tuple $G = (V, E, \mathcal{T}, \rho)$, where:
  \begin{itemize}
  \item $V$ is a finite set of nodes,
  \item $E\subseteq V\times V$ is a set of edges,
  \item $\mathbb{T}$ is the \emph{temporal domain} (often taken as $\mathbb{N}$ or $\mathbb{R}_+$), and $\mathcal{T}\subseteq\mathbb{T}$ is the \emph{lifetime} of the network,
  \item $\rho: E\times\mathcal{T}\to\{0,1\}$ is the \emph{presence function}, which determines whether an edge is present in the network at each timestamp.
  \end{itemize}
  The \emph{available dates} of an edge $e$ are the set $\mathcal{I}(e) = \{t\in\mathcal{T}: \rho(e,t)=1\}$.
\end{defn}

Temporal networks can also have weighted edges. In this case, it is possible to have constant weights (edges can only appear or disappear over time, and always have the same weight), or time-varying weights. In the latter case, we can set the codomain of the presence function to be $\mathbb{R}_+$ instead of $\{0,1\}$, where by convention a zero weight corresponds to an absent edge.

\begin{defn}[Additive temporal network]
  A temporal network is said to be \emph{additive} if for all $e\in E$ and $t\in\mathcal{T}$, if $\rho(e,t)=1$, then $\forall t'>t, \rho(e, t') = 1$. In other words, edges can only be added to the network, never removed.
\end{defn}

\section{Network partitioning}%
\label{sec:network-partitioning}

\section{Persistent homology for networks}%
\label{sec:pers-homol-netw}

We now consider the problem of applying persistent homology to network data. An undirected network is already a simplicial complex of dimension 1. However, this is not sufficient to capture enough topological information: we need to introduce higher-dimensional simplices.

The first possible method is to embed the network in a metric space, thus transforming the network data into point cloud data. For this, we need to compute the distance between each pair of nodes in the network (via the shortest-path distance, for instance). This also requires the network to be connected.

Another usual method for weighted networks is called the \emph{weight rank clique filtration} (WRCF), which filters the network based on weights. The procedure works as follows:
\begin{enumerate}
\item Set the set of all nodes, without any edge, as filtration step~0.
\item Rank all edge weights in decreasing order, $w_1 \geq \ldots \geq w_n$.
\item At filtration step $t$, keep only the edges whose weights are at least $w_t$, thus creating an unweighted graph.
\item Define the maximal cliques of the resulting graph, together with all their faces, to be simplices.
\end{enumerate}

At each step of the filtration, we construct a simplicial complex based on cliques: this is called a \emph{clique complex}. This is always a valid simplicial complex, since any subset of a clique is itself a clique, and so is the intersection of two cliques.

This leads to a first possibility for applying persistent homology to temporal networks. It is possible to segment the lifetime of the network into sliding windows, creating a static graph on each window by retaining only the edges available during the time interval. We can then apply WRCF to each static graph in the sequence, obtaining a filtered complex for each window, to which we can then apply persistent homology.
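
To make the WRCF steps above concrete, here is a minimal worked example on a hypothetical weighted graph (the nodes, weights, and resulting intervals are illustrative and not taken from any data set). It assumes that step $t$ retains the edges whose weight is at least $w_t$, so that successive graphs are nested. Take $V = \{a,b,c,d\}$ with weights $w(a,b)=5$, $w(b,c)=4$, $w(c,d)=3$, $w(a,d)=2$, and $w(a,c)=1$.
\begin{center}
  \begin{tabular}{clcc}
    Step & Maximal cliques & $\dim H_0$ & $\dim H_1$ \\ \hline
    0 & $\{a\}$, $\{b\}$, $\{c\}$, $\{d\}$ & 4 & 0 \\
    1 & $\{a,b\}$, $\{c\}$, $\{d\}$ & 3 & 0 \\
    2 & $\{a,b\}$, $\{b,c\}$, $\{d\}$ & 2 & 0 \\
    3 & $\{a,b\}$, $\{b,c\}$, $\{c,d\}$ & 1 & 0 \\
    4 & $\{a,b\}$, $\{b,c\}$, $\{c,d\}$, $\{a,d\}$ & 1 & 1 \\
    5 & $\{a,b,c\}$, $\{a,c,d\}$ & 1 & 0
  \end{tabular}
\end{center}
The cycle through $a$, $b$, $c$, $d$ appears at step~4 and is filled at step~5, when the edge $(a,c)$ turns both triangles into cliques. In dimension 1, this gives the single interval $[4,5)$; in dimension 0, three components die at steps 1, 2, and 3, and one persists indefinitely.
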
The sliding-window method is sensitive to the choice of windows on the time scale. The width and overlap of the windows can completely change the networks created and their topological features. If the window is too small, the network becomes too small to have any significant topological properties; if it is too large, we lose important information about the evolution of the network over time.

\section{Zigzag persistence}%
\label{sec:zigzag-persistence}

\backmatter%
\nocite{*}
\bibliographystyle{plain}
\bibliography{}%
\label{cha:bibliography}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: