Table of Contents
- Pseudotime trajectory inference
- monocle The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells
- TSCAN TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis
- slingshot Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
- monocle3 The single-cell transcriptional landscape of mammalian organogenesis
- Downstream analysis to statistically test "dynamic" gene programs
TL;DR: The goal is to assign a "pseudotime" to each cell. Use a minimum spanning tree (MST) to estimate "independent" lineages, then sort the cells within each lineage to assign "time" information.
Pseudotime trajectory inference
monocle
The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells
monocle deals with pairs of cells directly:
We developed Monocle to informatically order the cells by their progress through differentiation rather than by the time they were collected, maximizing the transcriptional similarity between successive pairs of cells.
- Independent Component Analysis to reduce dimensions:
It reduces the dimensionality of this space using independent component analysis [17]. Dimensionality reduction transforms the cell data from a high-dimensional space into a low-dimensional one that preserves essential relationships between cell populations but is much easier to visualize and interpret.
- Fit a minimum spanning tree on the cells and identify the longest path:
Monocle constructs a minimum spanning tree (MST) on the cells [...] The algorithm finds the longest path through the MST, corresponding to the longest sequence of transcriptionally similar cells.
- Assign "pseudo time" to each cell along the path:
Finally, Monocle uses this sequence to produce a 'trajectory' of an individual cell's progress through differentiation.
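The MST and longest-path steps above can be sketched in a few lines. This is only a minimal illustration, not Monocle's implementation: it assumes the cells are already in a reduced space (the ICA step is skipped), uses Euclidean distance, and measures the "longest path" by summed edge length. The function names (`minimum_spanning_tree`, `farthest`, `longest_path_pseudotime`) are made up for this note.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over the cells.
    Returns an adjacency map {cell: [(neighbor, edge_length), ...]}."""
    n = len(points)
    adj = defaultdict(list)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each cell to the tree
    parent = [-1] * n
    best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append((parent[u], best[u]))
            adj[parent[u]].append((u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return adj

def farthest(adj, start):
    """Traverse the tree from `start`; return the farthest cell,
    all tree distances from `start`, and back-pointers."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v], prev[v] = dist[u] + w, u
                stack.append(v)
    return max(dist, key=dist.get), dist, prev

def longest_path_pseudotime(points):
    """Longest path through the MST (two traversals), with pseudotime
    taken as the distance travelled along that path from one end."""
    adj = minimum_spanning_tree(points)
    a, _, _ = farthest(adj, 0)        # one end of the longest path
    b, dist, prev = farthest(adj, a)  # the other end, distances from a
    path, node = [], b
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()                    # now runs from a to b
    return path, {i: dist[i] for i in path}

# Four cells on a line plus one cell hanging off the side.
cells = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (1.0, 0.5)]
path, pseudotime = longest_path_pseudotime(cells)
print(path, pseudotime)
```

Which end counts as time zero is arbitrary here, and cells off the longest path (the side cell in the toy data) get no pseudotime in this sketch.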
TSCAN
TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis
- Gene clustering to reduce "drop out" genes:
Before pseudo-time reconstruction, [...] in order to alleviate the effect of drop-out events on the subsequent analyses, genes with similar expression patterns are grouped into clusters by hierarchical clustering (using Euclidean distance and complete linkage). [...] For each cluster and each cell, the expression measurements of all genes in the cluster are averaged to produce a cluster-level expression which will be used for subsequent MST construction.
- Then, they performed PCA on the clustered matrix.
After gene clustering, single-cell transcriptome for cell $i$ becomes a $H$ dimensional vector $E_{i}$. Here, $H$ is the number of gene clusters. $E_{i}$ still has high dimension, and many components in this vector are still correlated. The dimensionality makes visualization and statistical modeling difficult. For this reason, TSCAN further reduces the dimension of $E_{i}$ using principal component analysis (PCA). [...] After PCA, the $H$ dimensional vector is mapped to a lower dimensional space and becomes a $K$ dimensional vector $\tilde{E}_{i}$. Here, $K$ is much smaller than $H$.
- Cell clustering by fitting a mixture of multivariate Gaussian
The clustering is performed using the mclust (22) package in R which fits a mixture of multivariate normal distributions to the data $\tilde{E}_{i}$.
- Minimum spanning tree
Next, TSCAN constructs a minimum spanning tree to connect all cluster centers. In a connected and undirected graph, a spanning tree is a subgraph that is a tree and connects all the vertices (or 'nodes'). [...] Unlike the MST approach used by Monocle where the tree is constructed to connect individual cells, the MST in TSCAN is constructed to connect clusters of cells.
- Identify "main" branch and pick the root randomly
A tree may have multiple branches. By default, we define the main path of the tree (solid lines in Figure 1D) as the path with the largest number of clusters. If more than one path has the same largest number of clusters, the path with the largest number of cells becomes the main path. The main path has two ends. Without other information, one end will be randomly picked up as the origin of the path.
- Cell ordering and pseudo-time calculation: for each pair of adjacent clusters $i$ and $j$ on the path, take the direction of the edge between their centers and project each cell in cluster $i$ (the preceding one) onto that direction:
All cells in these clusters will be ordered along the path as follows. Let $C_{i}$ ($i = 1, 2, \ldots, M$) indicate the ordered clusters, where $M$ is the number of clusters on the ordered path.
- Sort all the cells by their projected values:
Cell orderings are determined in three steps. First, for cells which are in the same cluster and are projected onto the same edge, their order is determined by the projected values on the edge. Second, within each cluster, the order of cells projected onto different edges is determined by the order of edges, which is given by the cluster-level ordering. Third, the order of cells in different clusters is determined by the order of clusters. In this way, all cells can be placed in order.
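The three ordering rules quoted above amount to sorting by the key (cluster, edge, projected value). Here is a rough sketch. It assumes the cluster centers are already ordered along the main path, and that a cell in an intermediate cluster is projected onto whichever of its two edges leads to the nearer neighboring center. That edge-assignment rule is my reading of the paper, not something quoted above. `order_cells` and `projected_value` are hypothetical names, not the TSCAN package API.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def projected_value(cell, start, end):
    """Scalar projection of `cell` onto the edge start -> end,
    measured from `start` along the edge direction."""
    direction = [e - s for s, e in zip(start, end)]
    length = math.sqrt(sum(d * d for d in direction))
    return sum((c - s) * d for c, s, d in zip(cell, start, direction)) / length

def order_cells(centers, cells):
    """centers: cluster centers in path order (at least two).
    cells: list of (cluster_index, coordinates).
    Returns the indices of `cells` in pseudo-temporal order."""
    m = len(centers)
    if m < 2:
        raise ValueError("need at least two clusters on the path")
    keys = []
    for idx, (c, x) in enumerate(cells):
        if c == 0:
            edge = 0                  # first cluster: only the outgoing edge
        elif c == m - 1:
            edge = m - 2              # last cluster: only the incoming edge
        else:
            # Assumption: pick the edge toward the nearer neighboring center.
            to_prev = squared_distance(x, centers[c - 1])
            to_next = squared_distance(x, centers[c + 1])
            edge = c - 1 if to_prev < to_next else c
        value = projected_value(x, centers[edge], centers[edge + 1])
        # Sort by cluster order, then edge order, then projected value.
        keys.append((c, edge, value, idx))
    keys.sort()
    return [idx for _, _, _, idx in keys]

centers = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
cells = [(1, (2.5, 0.1)), (0, (0.5, 0.2)), (2, (3.8, -0.1)),
         (1, (1.6, 0.0)), (0, (-0.3, 0.1))]
order = order_cells(centers, cells)
# Treat each cell's rank in the ordering as its pseudotime.
pseudotime = {cell: rank + 1 for rank, cell in enumerate(order)}
print(order, pseudotime)
```

Because the root end is picked at random when nothing else is known, reversing `centers` gives the equally valid opposite ordering.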
slingshot
Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
- Estimate cluster-level MST
Slingshot identifies lineages by treating clusters of cells as nodes in a graph and drawing a minimum spanning tree (MST) between the nodes, [...] We have found that a Mahalanobis-like distance, i.e., a covariance-scaled Euclidean distance, that accounts for cluster shape, works well in practice, but users have the option of specifying any type of distance measure (e.g., Euclidean, Manhattan).
- Estimate pseudotime by principal curves
The second stage of Slingshot is concerned with assigning pseudotimes to individual cells. For this purpose, we make use of principal curves to draw a path through the gene expression space of each lineage
- Project all data points onto the curve and calculate the arc length from the beginning of the curve to each point’s projection. Setting the lowest value to zero, this produces pseudotimes.
- For each dimension $j$, $j \in \{1, \ldots, J'\}$, use the cells' pseudotimes to predict their coordinates, typically with a smoothing spline. This produces a set of $J'$ functions which collectively map pseudotime values, thereby defining a smooth curve in $J'$ dimensions.
- Repeat this process until convergence. We use the sum of squared distances between cells’ actual coordinates and their projections on the curves to determine convergence.
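The project / refit / repeat loop above can be sketched for a single lineage as follows. This is not Slingshot's code. Two simplifications are assumptions of mine: the curve is stored as a polyline, and a Gaussian kernel smoother stands in for the smoothing spline. The parameters `initial_curve`, `n_knots`, `bandwidth_frac`, `tol` and `max_iter` are hypothetical names for this note.

```python
import math

def project_to_polyline(point, curve):
    """Return (arc_length, squared_distance) for the closest point on the polyline."""
    best = (0.0, math.inf)
    travelled = 0.0
    for a, b in zip(curve, curve[1:]):
        seg = [q - p for p, q in zip(a, b)]
        seg_len2 = sum(s * s for s in seg)
        seg_len = math.sqrt(seg_len2)
        t = 0.0
        if seg_len2 > 0:
            t = sum((x - p) * s for x, p, s in zip(point, a, seg)) / seg_len2
            t = min(1.0, max(0.0, t))  # clamp to the segment
        closest = [p + t * s for p, s in zip(a, seg)]
        d2 = sum((x - c) ** 2 for x, c in zip(point, closest))
        if d2 < best[1]:
            best = (travelled + t * seg_len, d2)
        travelled += seg_len
    return best

def project_all(cells, curve):
    """Step 1: arc-length pseudotimes (lowest set to zero) and total squared error."""
    projections = [project_to_polyline(c, curve) for c in cells]
    lowest = min(s for s, _ in projections)
    return [s - lowest for s, _ in projections], sum(d2 for _, d2 in projections)

def kernel_smooth(times, values, grid, bandwidth):
    """Predict one coordinate from pseudotime (stand-in for a smoothing spline)."""
    out = []
    for g in grid:
        weights = [math.exp(-0.5 * ((t - g) / bandwidth) ** 2) for t in times]
        out.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return out

def principal_curve(cells, initial_curve, n_knots=20, bandwidth_frac=0.15,
                    tol=1e-6, max_iter=50):
    curve = [list(p) for p in initial_curve]
    dims = len(cells[0])
    previous_error = math.inf
    for _ in range(max_iter):
        pseudotime, error = project_all(cells, curve)
        # Step 3: stop once the sum of squared distances stops changing.
        if abs(previous_error - error) < tol:
            break
        previous_error = error
        # Step 2: refit every coordinate as a smooth function of pseudotime.
        span = max(pseudotime) or 1.0
        grid = [span * k / (n_knots - 1) for k in range(n_knots)]
        smoothed = [kernel_smooth(pseudotime, [c[j] for c in cells], grid,
                                  bandwidth_frac * span) for j in range(dims)]
        curve = [[smoothed[j][k] for j in range(dims)] for k in range(n_knots)]
    pseudotime, _ = project_all(cells, curve)  # pseudotimes on the final curve
    return pseudotime, curve

# Toy lineage: cells on a parabola, starting from a straight-line guess.
cells = [(x / 20, (x / 20) ** 2) for x in range(21)]
pseudotime, curve = principal_curve(cells, [(0.0, 0.0), (1.0, 1.0)])
print([round(t, 3) for t in pseudotime])
```

One caveat of the kernel smoother: it pulls the curve ends inward, so several cells near each end can project onto the same endpoint and tie in pseudotime.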
monocle3
The single-cell transcriptional landscape of mammalian organogenesis
- I think monocle3 made a wrong decision. Okay, the original monocle might have resulted in a noisy MST backbone, or the method is infeasible for a million cells. However, there is no rationale behind "projecting data onto UMAP and performing clustering," and the trajectories monocle3 reports never look like trajectories (because UMAP internally optimizes for cluster-like patterns).
Monocle3 first projects cells onto a low-dimensional space encoding transcriptional state using UMAP. It then groups mutually similar cells using the Louvain community detection algorithm, and merges adjacent groups into ‘supergroups’. Finally, it resolves the paths or trajectories that individual cells can take during development, identifying the locations of branches and convergences within each supergroup.