Lecture 8: A Sprint Through Single Cell RNA-Seq

Simon Coetzee
02/27/2024

“ A Sprint Through Single Cell RNA-Seq” is licensed under CC BY by Simon Coetzee.

Why?

  • Across and within tissues there is a great diversity of cell types, states, and interactions.

Why?

  • Across and within tissues there is a great diversity of cell types, states, and interactions.
  • Single Cell RNA-seq offers a glimpse of what genes are being expressed at the level of individual cells.

Why?

  • Across and within tissues there is a great diversity of cell types, states, and interactions.
  • Single Cell RNA-seq offers a glimpse of what genes are being expressed at the level of individual cells.

What can we use it to do?

  • Explore which cell types are present in a tissue.
  • Identify unknown/rare cell types or states.
  • Examine the changes in gene expression during differentiation processes, or across time, or across states.
  • Identify genes that are differentially expressed in particular cell types between conditions (treatment, disease, etc.)
  • Explore these changes while incorporating spatial, regulatory, or protein information.

What can we use it to do?

Compared to Bulk RNA-seq

  • compares averages of cellular expression.
  • is a good choice for quantifying expression signatures for disease studies.
  • potential for the discovery of disease biomarkers if you're not expecting or concerned about cellular heterogeneity.

Trapnell, C. Defining cell types and states with single-cell genomics, Genome Research 2015 (doi: https://dx.doi.org/10.1101/gr.190595.115)

A sprint:

A sprint:

A quick detour - Single Cell or Single Nucleus:

It is also possible to only sequence the nuclei of the cells.

  • Single-cell profiling does not always provide an unbiased view on cell types for specific tissues or organs, such as, for example, the brain (or skeletal muscle or adipose).

    • during tissue dissociation, some cell-types are more vulnerable and more difficult to capture.
    • non-neuronal cells survive dissociation better than neurons and are over represented in single-cell.
  • Single-cell requires fresh tissue.

    • nuclei are more resistant to mechanical force, and can be safely isolated from frozen tissue.

Resulting biases from these two methods are not fully known. However it has been shown that nuclei accurately reflect all transcriptional patterns of cells

Ding, J., Adiconis, X., Simmons, S.K. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 38, 737–746 (2020). https://doi.org/10.1038/s41587-020-0465-8

A sprint:

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

A sprint:

In order to get count matrices we need to quantify gene expression.

  • 10x sequencing, as we saw biases strongly to the 3' end of the gene.
  • cost of not (necessarily) covering the full gene length, making it difficult to unambiguously align reads to a transcript and distinguishing between different isoforms

  • However, we can then use unique molecular identifiers - UMIs to resolve biases in amplification.

  • The usage of UMIs further allows for normalization of gene counts to be performed without a loss of accuracy.

A sprint:

A sprint - quantification:

Before having a complete count matrix, two particular quality control steps must be performed:

  • Empty droplet detection
  • Doublet detection

A sprint - quantification:

Before having a complete count matrix, two particular quality control steps must be performed:

  • Empty droplet detection

A sprint - quantification:

Before having a complete count matrix, two particular quality control steps must be performed:

  • Doublet detection

A sprint:

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

A sprint:

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

A sprint:

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

A sprint:

Luecken, MD and Theis, FJ. Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol 2019 (doi: https://doi.org/10.15252/msb.20188746)

A sprint:

The steps of the workflow are:

  • Generation of the count matrix (method-specific steps): formatting reads, demultiplexing samples, mapping and quantification
  • Quality control of the raw counts: filtering of poor quality cells
  • Clustering of filtered counts: clustering cells based on similarities in transcriptional activity (cell types = different clusters)
  • Marker identification and cluster annotation: identifying gene markers for each cluster and annotating known cell type clusters
  • Optional downstream steps

Regardless of the analysis being done, conclusions about a population based on a single sample per condition are not trustworthy. BIOLOGICAL REPLICATES ARE STILL NEEDED! That is, if you want to make conclusions that correspond to the population and not just the single sample.

scRNA-seq challenges

scRNA-seq is able to capture expression at the cellular level, however

  • sample generation and library prep is more expensive.
  • the analysis is much more complicated and more difficult to interpret.

The data complexity involves:

  • Large volumes of data
  • Low depth of sequencing per cell
  • Technical variability across cells/samples
  • Biological variability across cells/samples

Large volumes of data.

Expression data from scRNA-seq represents tens or hundreds of thousands of reads for thousands of cells.

This output is much larger than normal bulk RNA-seq, and requires higher amounts of memory to analyze, larger storage requirements, and more time to run the analysis.

Low depth of sequencing per cell

For the droplet-based methods of scRNA-seq, the depth of sequencing is shallow, often detecting only 10-50% of the transcriptome per cell.

This results in cells showing zero counts for many of the genes. However, in a particular cell, a zero count for a gene could either mean that the gene was not being expressed or the transcripts were just not detected.

Jiang, R., Sun, T., Song, D. et al. Statistics or biology: the zero-inflation controversy about scRNA-seq data. Genome Biol 23, 31 (2022). https://doi.org/10.1186/s13059-022-02601-5

Low depth of sequencing per cell

Droplet (10x) data is not zero inflated

Svensson, V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 38, 147–150 (2020). https://doi.org/10.1038/s41587-019-0379-5

Low depth of sequencing per cell

Biological variability across cells/samples

Uninteresting sources of biological variation can result in gene expression between cells being more similar/different than the actual biological cell types/states, which can obscure the cell type identities. Uninteresting sources of biological variation (unless part of the experiment’s study) include:

  • Transcriptional bursting: Gene transcription is not turned on all of the time for all genes. Time of harvest will determine whether gene is on or off in each cell.
  • Varying rates of RNA processing: Different RNAs are processed at different rates.
  • Continuous or discrete cell identities (e.g. the pro-inflammatory potential of each individual T cell): Continuous phenotypes are by definition variable in gene expression, and separating the continuous from the discrete can sometimes be difficult.
  • Environmental stimuli: The local environment of the cell can influence the gene expression depending on spatial position, signaling molecules, etc.
  • Temporal changes: Fundamental fluctuating cellular processes, such as cell cycle, can affect the gene expression profiles of individual cells.

Biological variability across cells/samples

Technical variability across cells/samples

Technical sources of variation can result in gene expression between cells being more similar/different based on technical sources instead of biological cell types/states, which can obscure the cell type identities. Technical sources of variation include:

  • Cell-specific capture efficiency: Different cells will have differing numbers of transcripts captured resulting in differences in sequencing depth (e.g. 10-50% of transcriptome).
  • Library quality: Degraded RNA, low viability/dying cells, lots of free floating RNA, poorly dissociated cells, and inaccurate quantification of cells can result in low quality metrics
  • Amplification bias: During the amplification step of library preparation, not all transcripts are amplified to the same level.
  • Batch effects: Batch effects are a significant issue for scRNA-Seq analyses, since you can see significant differences in expression due solely to the batch effect.

Batch Effects

Stephanie C Hicks, F William Townes, Mingxiang Teng, Rafael A Irizarry, Missing data and technical variability in single-cell RNA-sequencing experiments, Biostatistics, Volume 19, Issue 4, October 2018, Pages 562–578, https://doi.org/10.1093/biostatistics/kxx053

Batch Effects

How to know whether you have batches?

  • Were all RNA isolations performed on the same day?
  • Were all library preparations performed on the same day?
  • Did the same person perform the RNA isolation/library preparation for all samples?
  • Did you use the same reagents for all samples?
  • Did you perform the RNA isolation/library preparation in the same location?

Any of these indicate that your data contains batch effects.

How to not have them:

  • Design the experiment in a way to avoid batches, if possible.
  • If unable to avoid batches:
    • Do NOT confound your experiment by batch
    • DO split replicates of the different sample groups across batches. The more replicates the better (definitely more than 2), if doing DE across conditions or making conclusions at the population level.
    • DO include batch information in your experimental metadata. During the analysis, we can regress out variation due to batch or integrate across batches, so it doesn’t affect our results if we have that information.

Single Cell Quality Control

Goals

  • To filter the data to only include true cells that are of high quality, so that when we cluster our cells it is easier to identify distinct cell type populations.

  • To identify any failed samples and either try to salvage the data or remove from analysis, in addition to, trying to understand why the sample failed

Challenges

  • Delineating cells that are poor quality from less complex cells

  • Choosing appropriate thresholds for filtering, so as to keep high quality cells without removing biologically relevant cell types

Single Cell Quality Control

There are three columns of information:

  • orig.ident: this column will contain the sample identity if known. It will default to the value we provided for the project argument when loading in the data
  • nCount_RNA: this column represents the number of UMIs per cell
  • nFeature_RNA: this column represents the number of genes detected per cell

In order to create the appropriate plots for the quality control analysis, we need to calculate some additional metrics. These include:

  • number of genes detected per UMI: this metric with give us an idea of the complexity of our dataset (more genes detected per UMI, more complex our data)
  • mitochondrial ratio: this metric will give us a percentage of cell reads originating from the mitochondrial genes

Single Cell Quality Control

Now that we have generated the various metrics to assess, we can explore them with visualizations. We will assess various metrics and then decide on which cells are low quality and should be removed from the analysis:

  • Cell counts
  • UMI counts per cell
  • Genes detected per cell
  • UMIs vs. genes detected
  • Mitochondrial counts ratio
  • Novelty

Cell Counts

The cell counts are determined by the number of unique cellular bar codes detected. For this experiment, between 12,000 -13,000 cells are expected.

In an ideal world, you would expect the number of unique cellular bar codes to correspond to the number of cells you loaded. However, this is not the case as capture rates of cells are only a proportion of what is loaded.

10X capture efficiency is between 50-60%.

Occasionally a hydrogel can have more than one cellular barcode. Similarly, with the 10X protocol there is a chance of obtaining only a barcoded bead in the emulsion droplet (GEM) and no actual cell. Both of these, in addition to the presence of dying cells can lead to a higher number of cellular barcodes than cells.

UMI counts (transcripts) per cell

The UMI counts per cell should generally be above 500, that is the low end of what we expect. If UMI counts are between 500-1000 counts, it is usable but the cells probably should have been sequenced more deeply.

Genes detected per cell

For high quality data, the proportional histogram should contain a single large peak that represents cells that were encapsulated.

There can be a small shoulder to the left of the major peak (not present in our data), or a bimodal distribution of the cells.

  • some cells that failed?
  • biologically different types of cells
    • quiescent cell populations
    • less complex cells
    • cells that are much smaller

UMIs vs. genes detected

Two metrics that are often evaluated together are the number of UMIs and the number of genes detected per cell.

Cells that are poor quality are likely to have low genes and UMIs per cell.

Mitochondrial read fractions are only high in particularly low count cells with few detected genes (darker colored data points). This could be indicative of damaged/dying cells whose cytoplasmic mRNA has leaked out through a broken membrane, and thus, only mRNA located in the mitochondria is still conserved.

Mitochondrial counts ratio

This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor quality samples for mitochondrial counts as cells which surpass the 0.2 mitochondrial ratio mark, unless of course you are expecting this in your sample.

Can't count on this absolutely - kidney cells often have high mitochondrial RNA levels - the tissue is associated with mitochondrial function.

Complexity

Novelty score is computed by taking the ratio of nGenes over nUMI.

If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again.

These low complexity (low novelty) cells could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to some other strange artifact or contamination.

Generally, we expect the novelty score to be above 0.80 for good quality cells.

Filtering

Consider the joint effects of these metrics when setting thresholds and set them to be as permissive as possible to avoid filtering out viable cell populations unintentionally.

Often the recommendations are a rough guideline, and the specific experiment needs to inform the exact thresholds chosen. We will use the following thresholds:

  • nUMI > 500
  • nGene > 250
  • log10GenesPerUMI > 0.8
  • mitoRatio < 0.2

A more general recommendation from recent literature is to be quite liberal with allowing cells through to exclude poor-quality cells while avoiding bias against some sub-populations by excluding cells that are outliers according to at least two of the following thresholds:

  • nUMI >2.5 MADs or <5 MADs
  • nGene >2.5 MADs or <5 MADs,
  • pct_counts_in_top_20_features > or <5 MADs,
  • featcount_dist (distance to expected ratio of log10 counts and features) > or <5 MADs,
  • pct_counts_Mt >2.5 MADs and >0.08.

Germain, PL., Sonrel, A. & Robinson, M.D. pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biol 21, 227 (2020). https://doi.org/10.1186/s13059-020-02136-7

Normalization

Goals:

  • To accurately normalize and scale the gene expression values to account for differences in sequencing depth and overdispersed count values.
  • To identify the most variant genes likely to be indicative of the different cell types present.

Challenges:

  • Checking and removing unwanted variation so that we do not have cells clustering by artifacts downstream

Recommend:

Regress out number of UMIs (default using sctransform), mitochondrial content, and cell cycle, if needed and appropriate for experiment, so not to drive clustering downstream.

Explore sources of unwanted variation

Correction for biological covariates serves to single out particular biological signals of interest, while correcting for technical covariates may be crucial to uncovering the underlying biological signal.

The most common biological data correction is to remove the effects of the cell cycle on the transcriptome.

The raw counts are not comparable between cells and we can’t use them as is for our exploratory analysis.

Evaluating effects of cell cycle

To assign each cell a score based on its expression of G2/M and S phase markers, we can use the Seurat function CellCycleScoring(). This function calculates cell cycle phase scores based on canonical markers that required as input.

After scoring the cells for cell cycle, we would like to determine whether cell cycle is a major source of variation in our dataset using PCA. To perform PCA, we need to first choose the most variable features, then scale the data.

Scaling data is required so that our “highly variable genes” don't just reflect high expression.

  • adjusting the expression of each gene to give a mean expression across cells to be 0
  • scaling expression of each gene to give a variance across cells to be 1

Integration

Goals:

To align same cell types across conditions.

Challenges:

Aligning cells of similar cell types so that we do not have clustering downstream due to differences between samples, conditions, modalities, or batches

Recommendations:

Go through the analysis without integration first to determine whether integration is necessary

Integration

Integration

1. Perform canonical correlation analysis (CCA):

2. Identify anchors or mutual nearest neighbors (MNNs) across datasets (sometimes incorrect anchors are identified)

3. Filter anchors to remove incorrect anchors:

4. Integrate the conditions/datasets:

Integration

1. Perform canonical correlation analysis (CCA):

A form of PCA, in that it identifies the greatest sources of variation in the data, but only if it is shared or conserved across the conditions/groups

2. Identify anchors or mutual nearest neighbors (MNNs) across datasets (sometimes incorrect anchors are identified)

The difference in expression values between cells in an MNN pair provides an estimate of the batch effect, which is made more precise by averaging across many such pairs. A correction vector is obtained and applied to the expression values to perform batch correction.” [Stuart and Bulter et al. (2018)].

3. Filter anchors to remove incorrect anchors:

Assess the similarity between anchor pairs by the overlap in their local neighborhoods

4. Integrate the conditions/datasets:

Use anchors and corresponding scores to transform the cell expression values, allowing for the integration of the conditions/datasets (different samples, conditions, datasets, modalities)

Clustering

1. Construct a K-nearest neighbor (KNN) graph

Measure of similarity: Euclidean distance

Quality function: Within cluster distance

Edges are drawn between cells with similar features expression patterns.

Edge weights are refined between any two cells based on shared overlap in their local neighborhoods.

Clustering

2. Find Clusters

Seurat iteratively group cells together with the goal of optimizing the standard modularity function with graph-based clustering.

The resolution is an important argument that sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment.

3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering

higher resolution, more clusters.

Exploring Cell Types:

Cell Type Marker
CD14+ monocytes CD14, LYZ
FCGR3A+ monocytes FCGR3A, MS4A7
Conventional dendritic cells FCER1A, CST3
Plasmacytoid dendritic cells IL3RA, GZMB, SERPINF1, ITM2C
Macrophages MARCO, ITGAM, ADGRE1
B cells CD79A, MS4A1
T cells CD3D
CD4+ T cells CD3D, IL7R, CCR7
CD8+ T cells CD3D, CD8A
NK cells GNLY, NKG7
Megakaryocytes PPBP
Erythrocytes HBB, HBA2

Marker Identification

Do the clusters corresponding to the same cell types have biologically meaningful differences? Are there sub-populations of these cell types?

Can we acquire higher confidence in these cell type identities by identifying other marker genes for these clusters?

Three Different ways:

Identification of all markers for each cluster

Identification of conserved markers for each cluster

Marker identification between specific clusters

Exploring T-cells

When we look at the entire list, we see clusters 0 and 6 have some overlapping genes, like CCR7 and SELL which correspond to markers of memory T cells. It is possible that these two clusters are more similar to one another and could be merged together as naive T cells. On the other hand, with cluster 2 we observe CREM as one of our top genes; a marker gene of activation. This suggests that perhaps cluster 2 represents activated T cells.

Cell State Marker

Naive T cells CCR7, SELL

Activated T cells CREM, CD69

For cluster 4, we see a lot of heat shock and DNA damage genes appear in the top gene list. Based on these markers, it is likely that these are stressed or dying cells. However, if we explore the quality metrics for these cells in more detail (i.e. mitoRatio and nUMI overlayed on the cluster) we don’t really support for this argument. There is a breadth of research supporting the association of heat shock proteins with reactive T cells in the induction of anti‐inflammatory cytokines in chronic inflammation. This is a cluster for which we would need a deeper understanding of immune cells to really tease apart the results and make a final conclusion.

Differential Expression

Differential Expression

Our results suggest that the pseudobulk methods and the mixed models that model subjects as a random effect were superior compared to the naïve single-cell methods that do not model the subjects in any way. We also recommend not to perform DS analysis using Seurat’s statistical tests so that the subjects are modeled as a latent variable. Overall, the pseudobulk methods outperformed the mixed models.

Sini Junttila and others, Benchmarking methods for detecting differential states between conditions from multi-subject single-cell RNA-seq data, Briefings in Bioinformatics, Volume 23, Issue 5, September 2022, bbac286, https://doi.org/10.1093/bib/bbac286

Differential Expression