Cluster Analysis of HighDimensional scRNA Sequencing Data
Abstract
With ongoing developments and innovations in singlecell RNA sequencing methods, advancements in sequencing performance could empower significant discoveries as well as new emerging possibilities to address biological and medical investigations. In the study, we will be using the dataset collected by the authors of Systematic comparative analysis of single cell RNAsequencing methods. The dataset consists of singlecell and/or single nucleus profiling from three types of samples – cell lines, peripheral blood mononuclear cells, and brain tissue, which offers 36 libraries in six separate experiments in a single center. Our quantitative comparison aims to identify unique characteristics associated with different singlecell sequencing methods, especially among lowthroughput sequencing methods and highthroughput sequencing methods. Our procedures also incorporate evaluations of every method’s capacity for recovering known biological information in the samples through clustering analysis.
1 Introduction
With the provided dataset of aggregated count matrices of the three samples, we will preprocess each aggregated count matrix into sub count matrices that correspond to separate replicates and sequencing methods. After preprocessing the scRNA sequencing data, we will quantitatively compare and evaluate various methods used on each sample replicate by assessing metrics such as dropout rate and sensitivity measures. Lastly, we will perform various dimensionality techniques along with a number of clustering methods on PBMC and cortex samples to investigate each utilized method’s capacity for recovering known cell types in the samples.
2 Data Understanding and Preprocessing
The dataset provides three data files for each sample, which includes the aggregated read count matrix for both replicates and all methods, as well as the gene annotation and cell annotation for the count matrix. The first stage of the study involved dividing the aggregated count matrix into a number of sub count matrices which correspond to different replicates, and sequencing methods and annotating rows with cell names and columns with gene names.
Among the whole dataset, seven scRNA sequencing methods are present. There are five highthroughput methods, 10xChromiumv2, 10xChromiumv3, inDrops, Dropseq, and sciRNAseq, alongside two lowthroughput methods, Smartseq2 and CELSeq2. By observing the dimensions of the count matrices, lowthroughput methods such as Smartseq2 and CELSeq2 tend to be associated with a relatively low number of cells. Figure 1 displays a summary of the count matrices generated after the division of the count matrix for each of the three samples.
3 Comparative Evaluations
3.1 Dropout Rate
A dropout event refers to the occurrence when a transcript is expressed in a cell but is undetected in its mRNA profile. The reason behind the occurrence of dropout events is the low amounts of mRNA in individual cells. The frequency of dropout events depends on scRNAseq protocols. There exists a general tradeoff between the number of cells and the frequency of dropout events, whereas scRNAseq protocols that generate more cells tend to have a higher frequency of dropout events. Figure 2 demonstrates a graphic illustration of dropout events.
The dropout rate of a sequencing method indicates the level of the sparsity of a scRNA sequencing method. For each of the count matrix that we have obtained from data preprocessing, we calculated the dropout rate by dividing the number of cells with zero entries by the total number of cells. The outcomes are demonstrated in figure 3 to figure 5 below.
According to the dropout rates table,

lowthroughput platebased methods such as Smartseq and Celseq are associated with the lowest dropout rates among all methods.

highthroughput methods such as Dropseq, Seqwell, and inDrops have significantly higher dropout rates.
3.2 Sensitivity
The sensitivity of a scRNASeq method refers to the likelihood to capture and convert a particular mRNA transcript present in a single cell into a cDNA molecule present in the library.
To assess the sensitivity of each scRNA sequencing method, we have generated visualizations of two metrics, gene detection per cell and cumulative gene detection. Please refer to figure 6 to figure 9 for the visualisations.

Gene detection per cell (left): The number of genes detected (counts 1) per cell. Each dot represents a cell and each box represents the median and first and third quartiles per replicate and method.

Cumulative gene detection (right): The cumulative number of genes detected as more cells are added. The xaxis represents the number of cells accounted while the yaxis represents the number of genes detected.
It is noticeable that lowthroughput methods appear to have high sensitivity, in which case they tend to have a higher number of genes detected per cell and they detected the most genes in an efficient manner. In contrast, Dropseq is consistently associated with low sensitivity across all samples, displaying relatively low gene detection efficiency. Other highthroughput also displayed relatively lackluster performance in terms of sensitivity.
Based on the graphs which display the cumulative number of genes detected with additional cells added, highthroughput methods tend to converge to a relatively low number of detected genes despite increments in the number of cells, implying that there is a group of genes that are potentially difficult to capture with highthroughput methods.
4 Cluster Analysis
Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many singlecell RNAseq data sets as it helps with identifying underlying gene types. In this section, we have conducted different clustering methods on the data after applying filtering, normalization, and dimensionality reductions.
4.1 Filtering
Although RNAseq technology has improved the dynamic range of gene expression quantification, lowexpression genes may be indistinguishable from sampling noise. The presence of noisy, lowexpression genes can lead to poor analysis results. Thus, it is beneficial to remove the lowexpression genes in the data before applying clustering techniques. Please refer to figure 1 for dimensions of count matrices after filtering.
We have taken the following two procedures to filter the data.

Row sparsity filtering: Remove genes with more than 80% zero counts.

CV filtering: Remove out the bottom 15% of the remaining genes which have the lowest coefficient of variation.
4.2 Normalization
Normalization is an essential step in an RNASeq analysis, in which the read count matrix is transformed to allow for meaningful comparisons of counts across samples. In this case, we have applied Quantile normalization. Quantile normalization is a technique for making distributions identical in statistical properties. Each column is converted to rank values before sorting and averaging the ranks to replace the previous ranks.
4.3 Dimensionality Reduction
To achieve better and more reasonable clustering visualization results on a twodimensional space, we needed to apply dimensionality techniques to implement feature engineering.
Pca
Principal component analysis is a linear dimensionality reduction technique, and it performs a linear mapping of the data to a lowerdimensional space in such a way that the variance of the data in the lowdimensional representation is maximized. The major procedure involves calculating the eigenvectors from the covariance matrix, where the eigenvectors that correspond to the largest eigenvalues are used to reconstruct a significant fraction of the variance of the original data. However, in the case of high dimensions, eigenvalues and eigenvectors of the sample covariance matrix are not consistent, which may affect the performance of PCA.
tSNE
tSNE is a nonlinear technique for dimensionality reduction technique. It calculates the probability of similarity of points in highdimensional space and calculating the probability of similarity of points in the corresponding lowdimensional space, and it minimizes the difference between these conditional probabilities (or similarities) in higherdimensional and lowerdimensional space for a representation of data points in lowerdimensional space. However, tSNE is computationally complex and nondeterministic, which makes the technique less robust.
*Projection Pursuit
Projection Pursuit is a component transform technique that looks for a component whose projection vector points to a direction of the quality of interest in data space which can be determined by a Projection Index (PI). In general, projections that go astray more from a normal distribution are viewed as more interesting. As each projection is found, the data are reduced by removing the component along that projection, and the process is repeated to find new projections. The idea of projection pursuit is to locate the projection or projections from highdimensional space to lowdimensional space that reveal the most insights regarding the structure of the data set. However, one potential issues associated with the technique is that it generally uses random initial conditions to produce projection index components. As a result, when the same projection pursuit is performed on different occasions or by different users, the resulting projection index components are generally not the equivalent. This approach has yet to be implemented in our study.
4.4 Clustering Methods
Kmeans Clustering
KMeans clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. KMeans clustering minimizes the squared error function with a given k.
Hierarchical Clustering
Hierarchical clustering works by grouping the data one by one on the basis of the nearest distance measure of all the pairwise distance between the data point. Distance could be defined differently and we have used euclidean distance and 1  correlation to experiment with different distance measurements. Distance calculation methods also vary, which includes single linkage, complete linkage, average linkage, and ward’s method (sum of squared euclidean distance is minimized), which is our choice of method in this case. The results from hierarchical clustering will not be displayed due to its relatively poor performance on the data.
From the Single Cell Comparison studies on Single Cell Portal, we acquired the true number of cell types for Cortex and PBMC, respectively.

Cortex: 9 cell types

PBMC: 9 cell types along with an unidentified cell type
Lastly, We used silhouette as a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from to , where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Below are the visualizations of the clustering results.
References
 Jiarui Ding, Xian Adiconis, Sean K. Simmons. Systematic comparative analysis of single cell RNAsequencing methods. https://www.biorxiv.org/content/10.1101/632216v1.full.
 Wei Vivian Li and Jingyi Jessica Li. Accurate and Robust Imputation for Singlecell RNAseq data. http://jsb.ucla.edu/sites/default/files/Dahshu_scImpute_043018.pdf.
 F. William Townes and Rafael A. Irizarryi. Quantile normalization of singlecell RNAseq read counts without unique molecular. https://www.biorxiv.org/content/biorxiv/early/2019/10/24/817031.full.pdf.
 Christoph Ziegenhain, Beate Vieth, Swati Parekh. Comparative Analysis of SingleCell RNA Sequencing Methods. https://www.sciencedirect.com/science/article/pii/S1097276517300497
 Study: Single Cell Comparison: PBMC data. https://singlecell.broadinstitute.org/single_cell/study/SCP424/singlecellcomparisonpbmcdatastudyvisualize.
 Study: Single Cell Comparison: Cortex data. https://singlecell.broadinstitute.org/single_cell/study/SCP425/singlecellcomparisoncortexdatastudyvisualize
 Soledad Espezuaa, Edwin Villanuevaa. Carlos D. Macielb, André Carvalhoa. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers. J. H. Friedman and J. W. Tukey