Simon Coetzee
02/27/2024
“ A Sprint Through Single Cell RNA-Seq” is licensed under CC BY by Simon Coetzee.
It is also possible to only sequence the nuclei of the cells.
Single-cell profiling does not always provide an unbiased view on cell types for specific tissues or organs, such as, for example, the brain (or skeletal muscle or adipose).
Single-cell requires fresh tissue.
Ding, J., Adiconis, X., Simmons, S.K. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 38, 737–746 (2020). https://doi.org/10.1038/s41587-020-0465-8
cost of not (necessarily) covering the full gene length, making it difficult to unambiguously align reads to a transcript and distinguishing between different isoforms
However, we can then use unique molecular identifiers - UMIs to resolve biases in amplification.
The usage of UMIs further allows for normalization of gene counts to be performed without a loss of accuracy.
Before having a complete count matrix, two particular quality control steps must be performed:
Before having a complete count matrix, two particular quality control steps must be performed:
Before having a complete count matrix, two particular quality control steps must be performed:
The steps of the workflow are:
Regardless of the analysis being done, conclusions about a population based on a single sample per condition are not trustworthy. BIOLOGICAL REPLICATES ARE STILL NEEDED! That is, if you want to make conclusions that correspond to the population and not just the single sample.
scRNA-seq is able to capture expression at the cellular level, however
The data complexity involves:
Expression data from scRNA-seq represents tens or hundreds of thousands of reads for thousands of cells.
This output is much larger than normal bulk RNA-seq, and requires higher amounts of memory to analyze, larger storage requirements, and more time to run the analysis.
For the droplet-based methods of scRNA-seq, the depth of sequencing is shallow, often detecting only 10-50% of the transcriptome per cell.
This results in cells showing zero counts for many of the genes. However, in a particular cell, a zero count for a gene could either mean that the gene was not being expressed or the transcripts were just not detected.
Jiang, R., Sun, T., Song, D. et al. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol 23, 31 (2022). https://doi.org/10.1186/s13059-022-02601-5
Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38, 147–150 (2020). https://doi.org/10.1038/s41587-019-0379-5
Uninteresting sources of biological variation can result in gene expression between cells being more similar/different than the actual biological cell types/states, which can obscure the cell type identities. Uninteresting sources of biological variation (unless part of the experiment’s study) include:
Technical sources of variation can result in gene expression between cells being more similar/different based on technical sources instead of biological cell types/states, which can obscure the cell type identities. Technical sources of variation include:
How to know whether you have batches?
Any of these indicate that your data contains batch effects.
How to not have them:
To filter the data to only include true cells that are of high quality, so that when we cluster our cells it is easier to identify distinct cell type populations.
To identify any failed samples and either try to salvage the data or remove from analysis, in addition to, trying to understand why the sample failed
Delineating cells that are poor quality from less complex cells
Choosing appropriate thresholds for filtering, so as to keep high quality cells without removing biologically relevant cell types
Now that we have generated the various metrics to assess, we can explore them with visualizations. We will assess various metrics and then decide on which cells are low quality and should be removed from the analysis:
The cell counts are determined by the number of unique cellular bar codes detected. For this experiment, between 12,000 -13,000 cells are expected.
In an ideal world, you would expect the number of unique cellular bar codes to correspond to the number of cells you loaded. However, this is not the case as capture rates of cells are only a proportion of what is loaded.
10X capture efficiency is between 50-60%.
Occasionally a hydrogel can have more than one cellular barcode. Similarly, with the 10X protocol there is a chance of obtaining only a barcoded bead in the emulsion droplet (GEM) and no actual cell. Both of these, in addition to the presence of dying cells can lead to a higher number of cellular barcodes than cells.
The UMI counts per cell should generally be above 500, that is the low end of what we expect. If UMI counts are between 500-1000 counts, it is usable but the cells probably should have been sequenced more deeply.
For high quality data, the proportional histogram should contain a single large peak that represents cells that were encapsulated.
There can be a small shoulder to the left of the major peak (not present in our data), or a bimodal distribution of the cells.
Two metrics that are often evaluated together are the number of UMIs and the number of genes detected per cell.
Cells that are poor quality are likely to have low genes and UMIs per cell.
Mitochondrial read fractions are only high in particularly low count cells with few detected genes (darker colored data points). This could be indicative of damaged/dying cells whose cytoplasmic mRNA has leaked out through a broken membrane, and thus, only mRNA located in the mitochondria is still conserved.
This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor quality samples for mitochondrial counts as cells which surpass the 0.2 mitochondrial ratio mark, unless of course you are expecting this in your sample.
Can't count on this absolutely - kidney cells often have high mitochondrial RNA levels - the tissue is associated with mitochondrial function.
Novelty score is computed by taking the ratio of nGenes over nUMI.
If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again.
These low complexity (low novelty) cells could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to some other strange artifact or contamination.
Generally, we expect the novelty score to be above 0.80 for good quality cells.
Consider the joint effects of these metrics when setting thresholds and set them to be as permissive as possible to avoid filtering out viable cell populations unintentionally.
Often the recommendations are a rough guideline, and the specific experiment needs to inform the exact thresholds chosen. We will use the following thresholds:
A more general recommendation from recent literature is to be quite liberal with allowing cells through to exclude poor-quality cells while avoiding bias against some sub-populations by excluding cells that are outliers according to at least two of the following thresholds:
Germain, PL., Sonrel, A. & Robinson, M.D. pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biol 21, 227 (2020). https://doi.org/10.1186/s13059-020-02136-7
Regress out number of UMIs (default using sctransform), mitochondrial content, and cell cycle, if needed and appropriate for experiment, so not to drive clustering downstream.
Correction for biological covariates serves to single out particular biological signals of interest, while correcting for technical covariates may be crucial to uncovering the underlying biological signal.
The most common biological data correction is to remove the effects of the cell cycle on the transcriptome.
The raw counts are not comparable between cells and we can’t use them as is for our exploratory analysis.
To assign each cell a score based on its expression of G2/M and S phase markers, we can use the Seurat function CellCycleScoring()
. This function calculates cell cycle phase scores based on canonical markers that required as input.
After scoring the cells for cell cycle, we would like to determine whether cell cycle is a major source of variation in our dataset using PCA. To perform PCA, we need to first choose the most variable features, then scale the data.
Scaling data is required so that our “highly variable genes” don't just reflect high expression.
To align same cell types across conditions.
Aligning cells of similar cell types so that we do not have clustering downstream due to differences between samples, conditions, modalities, or batches
Go through the analysis without integration first to determine whether integration is necessary
A form of PCA, in that it identifies the greatest sources of variation in the data, but only if it is shared or conserved across the conditions/groups
The difference in expression values between cells in an MNN pair provides an estimate of the batch effect, which is made more precise by averaging across many such pairs. A correction vector is obtained and applied to the expression values to perform batch correction.” [Stuart and Bulter et al. (2018)].
Assess the similarity between anchor pairs by the overlap in their local neighborhoods
Use anchors and corresponding scores to transform the cell expression values, allowing for the integration of the conditions/datasets (different samples, conditions, datasets, modalities)
Measure of similarity: Euclidean distance
Quality function: Within cluster distance
Seurat iteratively group cells together with the goal of optimizing the standard modularity function with graph-based clustering.
The resolution is an important argument that sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment.
Cell Type | Marker |
---|---|
CD14+ monocytes | CD14, LYZ |
FCGR3A+ monocytes | FCGR3A, MS4A7 |
Conventional dendritic cells | FCER1A, CST3 |
Plasmacytoid dendritic cells | IL3RA, GZMB, SERPINF1, ITM2C |
Macrophages | MARCO, ITGAM, ADGRE1 |
B cells | CD79A, MS4A1 |
T cells | CD3D |
CD4+ T cells | CD3D, IL7R, CCR7 |
CD8+ T cells | CD3D, CD8A |
NK cells | GNLY, NKG7 |
Megakaryocytes | PPBP |
Erythrocytes | HBB, HBA2 |
Do the clusters corresponding to the same cell types have biologically meaningful differences? Are there sub-populations of these cell types?
Can we acquire higher confidence in these cell type identities by identifying other marker genes for these clusters?
When we look at the entire list, we see clusters 0 and 6 have some overlapping genes, like CCR7 and SELL which correspond to markers of memory T cells. It is possible that these two clusters are more similar to one another and could be merged together as naive T cells. On the other hand, with cluster 2 we observe CREM as one of our top genes; a marker gene of activation. This suggests that perhaps cluster 2 represents activated T cells.
Cell State Marker
Naive T cells CCR7, SELL
Activated T cells CREM, CD69
For cluster 4, we see a lot of heat shock and DNA damage genes appear in the top gene list. Based on these markers, it is likely that these are stressed or dying cells. However, if we explore the quality metrics for these cells in more detail (i.e. mitoRatio and nUMI overlayed on the cluster) we don’t really support for this argument. There is a breadth of research supporting the association of heat shock proteins with reactive T cells in the induction of anti‐inflammatory cytokines in chronic inflammation. This is a cluster for which we would need a deeper understanding of immune cells to really tease apart the results and make a final conclusion.
Sini Junttila and others, Benchmarking methods for detecting differential states between conditions from multi-subject single-cell RNA-seq data, Briefings in Bioinformatics, Volume 23, Issue 5, September 2022, bbac286, https://doi.org/10.1093/bib/bbac286