Simon Coetzee
04/07/2022
“A Sprint Through Single Cell RNA-Seq” is licensed under CC BY by Simon Coetzee.
The steps of the workflow are:
Regardless of the analysis being done, conclusions about a population based on a single sample per condition are not trustworthy. BIOLOGICAL REPLICATES ARE STILL NEEDED! That is, if you want to make conclusions that correspond to the population and not just the single sample.
scRNA-seq is able to capture expression at the cellular level; however, the data are more complex and more challenging to analyze than bulk RNA-seq data.
The data complexity involves:
Expression data from scRNA-seq represents tens or hundreds of thousands of reads for thousands of cells.
This output is much larger than typical bulk RNA-seq and requires more memory to analyze, more storage, and more time to run the analysis.
For the droplet-based methods of scRNA-seq, the depth of sequencing is shallow, often detecting only 10-50% of the transcriptome per cell.
This results in cells showing zero counts for many of the genes. However, in a particular cell, a zero count for a gene could either mean that the gene was not being expressed or the transcripts were just not detected.
Jiang, R., Sun, T., Song, D. et al. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol 23, 31 (2022). https://doi.org/10.1186/s13059-022-02601-5
Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38, 147–150 (2020). https://doi.org/10.1038/s41587-019-0379-5
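As a quick sanity check on this sparsity, one can compute the fraction of zero entries in the count matrix. A minimal sketch in R, assuming a Seurat object (here called seurat_obj, an assumed name) already holds the raw counts:

```r
library(Seurat)
library(Matrix)

# Pull the raw counts (stored as a sparse matrix); "seurat_obj" is an assumed object name
counts <- GetAssayData(seurat_obj, assay = "RNA", slot = "counts")

# Fraction of gene-by-cell entries that are exactly zero
frac_zero <- 1 - Matrix::nnzero(counts) / (as.numeric(nrow(counts)) * ncol(counts))
frac_zero
```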
Uninteresting sources of biological variation can make gene expression between cells appear more similar or more different than the underlying biological cell types/states really are, which can obscure cell type identities. Uninteresting sources of biological variation (unless they are part of the experiment’s study design) include:
Technical sources of variation can make gene expression between cells appear more similar or more different based on technical factors rather than biological cell types/states, which can obscure cell type identities. Technical sources of variation include:
How do you know whether you have batches?
Any of these indicate that your data contains batch effects.
How to avoid batch effects:
To filter the data to only include true cells that are of high quality, so that when we cluster our cells it is easier to identify distinct cell type populations.
To identify any failed samples and either try to salvage the data or remove them from the analysis, while also trying to understand why the sample failed
Delineating cells that are poor quality from less complex cells
Choosing appropriate thresholds for filtering, so as to keep high quality cells without removing biologically relevant cell types
Now that we have generated the various metrics, we can explore them with visualizations and then decide which cells are of low quality and should be removed from the analysis. The sketch below shows one way these metrics can be generated, before we walk through each of them:
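A sketch of how these metrics might be added to the object with Seurat (the object name seurat_obj and the sample column used in the plots below are assumptions about how the data was set up):

```r
library(Seurat)

# nCount_RNA (UMIs per cell) and nFeature_RNA (genes per cell) are created
# automatically by Seurat; here we add the remaining metrics discussed below.

# Mitochondrial ratio (human gene symbols assumed; adjust the "^MT-" pattern for other species)
seurat_obj$mitoRatio <- PercentageFeatureSet(seurat_obj, pattern = "^MT-") / 100

# Novelty score: log10 genes detected per log10 UMI
seurat_obj$log10GenesPerUMI <- log10(seurat_obj$nFeature_RNA) / log10(seurat_obj$nCount_RNA)

# Pull the per-cell metadata into a data frame for plotting with ggplot2
metadata <- seurat_obj@meta.data
```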
The cell counts are determined by the number of unique cellular barcodes detected. For this experiment, between 12,000 and 13,000 cells are expected.
In an ideal world, you would expect the number of unique cellular barcodes to correspond to the number of cells you loaded. However, this is not the case, as capture rates of cells are only a proportion of what is loaded.
10X capture efficiency is between 50 and 60%.
Occasionally a hydrogel can have more than one cellular barcode. Similarly, with the 10X protocol there is a chance of obtaining only a barcoded bead in the emulsion droplet (GEM) and no actual cell. Both of these, in addition to the presence of dying cells, can lead to a higher number of cellular barcodes than cells.
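To see how many cell barcodes were recovered per sample, a simple bar plot of the metadata is enough; a sketch (the sample column is an assumption about how the metadata is labelled):

```r
library(ggplot2)

# Number of cell barcodes detected per sample
ggplot(metadata, aes(x = sample, fill = sample)) +
  geom_bar() +
  theme_classic() +
  ggtitle("Number of cells per sample")
```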
The UMI counts per cell should generally be above 500; that is the low end of what we expect. If UMI counts are between 500 and 1,000, the data is usable, but the cells probably should have been sequenced more deeply.
For high quality data, the proportional histogram should contain a single large peak that represents cells that were encapsulated.
There can be a small shoulder to the left of the major peak (not present in our data), or a bimodal distribution of the cells.
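A sketch of the UMI-count density described above, using the metadata data frame from earlier and marking the 500-UMI guideline:

```r
library(ggplot2)

# Density of UMI counts per cell on a log10 axis; 500 UMIs is the rough lower guideline
ggplot(metadata, aes(x = nCount_RNA, colour = sample, fill = sample)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  geom_vline(xintercept = 500, linetype = "dashed") +
  theme_classic() +
  xlab("UMIs per cell")
```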
Two metrics that are often evaluated together are the number of UMIs and the number of genes detected per cell.
Cells that are poor quality are likely to have low numbers of genes and UMIs per cell.
Mitochondrial read fractions are only high in particularly low count cells with few detected genes (darker colored data points). This could be indicative of damaged/dying cells whose cytoplasmic mRNA has leaked out through a broken membrane, and thus, only mRNA located in the mitochondria is still conserved.
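A sketch of this joint view (genes vs. UMIs per cell, coloured by mitochondrial ratio), again using the assumed metadata columns from above:

```r
library(ggplot2)

# Genes vs UMIs per cell, coloured by mitochondrial read fraction;
# poor-quality cells sit in the lower left with darker colouring
ggplot(metadata, aes(x = nCount_RNA, y = nFeature_RNA, colour = mitoRatio)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  scale_colour_gradient(low = "grey80", high = "darkred") +
  geom_vline(xintercept = 500, linetype = "dashed") +
  facet_wrap(~ sample) +
  theme_classic()
```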
This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. In terms of mitochondrial counts, we define poor quality cells as those which surpass the 0.2 mitochondrial ratio mark, unless of course you are expecting this in your sample.
This threshold can't be counted on absolutely: kidney cells, for example, often have high mitochondrial RNA levels, as the tissue is associated with mitochondrial function.
The novelty score is computed by taking the ratio of log10(nGenes) over log10(nUMI).
If there are many captured transcripts (high nUMI) but a low number of genes detected in a cell, this likely means that only a small number of genes were captured, and transcripts from that small set of genes were simply sequenced over and over again.
These low-complexity (low-novelty) cells could represent a specific cell type (e.g. red blood cells, which lack a typical transcriptome), or could be due to some other strange artifact or contamination.
Generally, we expect the novelty score to be above 0.80 for good quality cells.
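A sketch of the novelty-score distribution, with the 0.80 guideline marked:

```r
library(ggplot2)

# Complexity (novelty) score per cell; 0.80 is the usual guideline for good-quality cells
ggplot(metadata, aes(x = log10GenesPerUMI, colour = sample, fill = sample)) +
  geom_density(alpha = 0.2) +
  geom_vline(xintercept = 0.8, linetype = "dashed") +
  theme_classic()
```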
Consider the joint effects of these metrics when setting thresholds and set them to be as permissive as possible to avoid filtering out viable cell populations unintentionally.
Often the recommendations are a rough guideline, and the specific experiment needs to inform the exact thresholds chosen. We will use the following thresholds:
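As a sketch, applying the guidelines discussed above (at least 500 UMIs, novelty score above 0.80, mitochondrial ratio below 0.20) might look like the following; the exact cut-offs for any given experiment are a judgement call, and the object and column names are the ones assumed earlier:

```r
library(Seurat)

# Keep only cells passing the joint quality guidelines discussed above
filtered_seurat <- subset(seurat_obj,
                          subset = nCount_RNA >= 500 &
                                   log10GenesPerUMI > 0.80 &
                                   mitoRatio < 0.20)
```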
Regress out the number of UMIs (handled by default when using sctransform), mitochondrial content, and cell cycle, if needed and appropriate for the experiment, so that they do not drive clustering downstream.
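A minimal sketch of normalization with sctransform, additionally regressing out mitochondrial content (sequencing depth is modelled by sctransform itself); whether to also regress out cell cycle depends on the check described below:

```r
library(Seurat)

# SCTransform models the UMI counts directly, so sequencing depth is handled by default;
# other unwanted variation can be removed via vars.to.regress
# (cell cycle scores could be added here if the evaluation below shows they matter)
filtered_seurat <- SCTransform(filtered_seurat, vars.to.regress = "mitoRatio")
```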
Correction for biological covariates serves to single out particular biological signals of interest, while correcting for technical covariates may be crucial to uncovering the underlying biological signal.
The most common biological data correction is to remove the effects of the cell cycle on the transcriptome.
The raw counts are not comparable between cells and we can’t use them as is for our exploratory analysis.
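For the exploratory checks that follow (such as the cell cycle evaluation), a rough log-normalization is enough; a minimal sketch, with seurat_phase as an assumed variable name:

```r
library(Seurat)

# Log-normalize the raw counts so cells are roughly comparable for exploration
seurat_phase <- NormalizeData(filtered_seurat)
```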
To assign each cell a score based on its expression of G2/M and S phase markers, we can use the Seurat function CellCycleScoring(). This function calculates cell cycle phase scores based on canonical markers that are required as input.
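A sketch of scoring the cells with Seurat's bundled cell cycle marker lists (human gene symbols; other species need their own marker sets):

```r
library(Seurat)

# Seurat ships S-phase and G2/M-phase marker gene lists for human
s_genes   <- cc.genes.updated.2019$s.genes
g2m_genes <- cc.genes.updated.2019$g2m.genes

# Assign each cell an S score, a G2M score, and a Phase call
seurat_phase <- CellCycleScoring(seurat_phase,
                                 s.features   = s_genes,
                                 g2m.features = g2m_genes)
```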
After scoring the cells for cell cycle, we would like to determine whether cell cycle is a major source of variation in our dataset using PCA. To perform PCA, we need to first choose the most variable features, then scale the data.
Scaling the data is required so that our “highly variable genes” don't just reflect high expression.
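A sketch of that check: find the variable features, scale, run PCA, and colour the cells by phase:

```r
library(Seurat)

# Identify the most variable genes, scale the data, and run PCA
seurat_phase <- FindVariableFeatures(seurat_phase,
                                     selection.method = "vst",
                                     nfeatures = 2000)
seurat_phase <- ScaleData(seurat_phase)
seurat_phase <- RunPCA(seurat_phase)

# If cells separate by Phase on the first PCs, cell cycle is a major source of
# variation and is worth regressing out during normalization
DimPlot(seurat_phase, reduction = "pca", group.by = "Phase", split.by = "Phase")
```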
To align the same cell types across conditions.
Aligning cells of similar cell types so that downstream clustering does not separate cells simply because of differences between samples, conditions, modalities, or batches
Go through the analysis without integration first to determine whether integration is necessary
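If the unintegrated analysis shows cells grouping by sample or condition rather than by cell type, integration can be layered on; a sketch of the SCT-based Seurat integration workflow (the sample column is again an assumption):

```r
library(Seurat)

# Split by sample and normalize each piece with SCTransform
split_seurat <- SplitObject(filtered_seurat, split.by = "sample")
split_seurat <- lapply(split_seurat, SCTransform, vars.to.regress = "mitoRatio")

# Select shared variable features and find anchors between the samples
features     <- SelectIntegrationFeatures(split_seurat, nfeatures = 3000)
split_seurat <- PrepSCTIntegration(split_seurat, anchor.features = features)
anchors      <- FindIntegrationAnchors(split_seurat,
                                       normalization.method = "SCT",
                                       anchor.features = features)

# Integrate so the same cell types align across samples
seurat_integrated <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
```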