Differential Expression Analysis Examples

Differential Expression

In this document we will be looking at differential expression done with DESeq2.

Much of this can also apply to edgeR and limma-voom

We begin by loading the data

We have some example data from the airway package from Bioconductor.

This package provides a RangedSummarizedExperiment object of read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. Details on the gene model and read counting procedure are provided in the package vignette. The citation for the experiment is: Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. ‘RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.’ PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

BiocManager::install(c("airway", "DESeq2", "tximeta", "apeglm", "RColorBrewer", "pheatmap", "hexbin"))

dir <- system.file("extdata", package="airway", mustWork=TRUE)
list.files(dir)

##  [1] "GSE52778_series_matrix.txt"        "Homo_sapiens.GRCh37.75_subset.gtf"
##  [3] "quants"                            "sample_table.csv"                 
##  [5] "SraRunInfo_SRP033351.csv"          "SRR1039508_subset.bam"            
##  [7] "SRR1039509_subset.bam"             "SRR1039512_subset.bam"            
##  [9] "SRR1039513_subset.bam"             "SRR1039516_subset.bam"            
## [11] "SRR1039517_subset.bam"             "SRR1039520_subset.bam"            
## [13] "SRR1039521_subset.bam"

list.files(file.path(dir, "quants"))

## [1] "SRR1039508" "SRR1039509"

We have here listed data files and directories from the airway package. We will focus on the quants directory as that contains the the salmon quantification of the transcript expression.

The other piece of data we need is the table with detailed information for each of our samples that links samples to the associated FASTQ and Salmon directories.

csvfile <- file.path(dir, "sample_table.csv")
coldata <- read.csv(csvfile, row.names=1, stringsAsFactors=FALSE)
coldata

##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677

As a demonstration of loading salmon data into R we will use just two of the samples that are included as salmon quantifications with the airway package.

coldata <- coldata[1:2,]
coldata$names <- coldata$Run
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
file.exists(coldata$files)

## [1] TRUE TRUE

library("tximeta")

## Warning: package 'tximeta' was built under R version 4.2.2

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 2 
## found matching transcriptome:
## [ GENCODE - Homo sapiens - release 29 ]
## loading existing TxDb created: 2022-04-05 18:57:53
## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## loading existing transcript ranges created: 2022-04-05 18:57:54

If the reference transcriptome checksum was recognized by tximeta (details on this in the tximeta vignette), and if we have a working internet connection, tximeta will locate and download the relevant annotation data from various sources.

Here we can see at least that we have the correct shape of data, and note that the data is imported at the transcript level.

dim(se)

## [1] 205870      2

head(rownames(se))

## [1] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
## [4] "ENST00000619216.1" "ENST00000473358.1" "ENST00000469289.1"

We will be focusing on the gene level analysis for this walk through so we will combine transcripts down into genes.

gse <- summarizeToGene(se)

## loading existing TxDb created: 2022-04-05 18:57:53

## obtaining transcript-to-gene mapping from database

## loading existing gene ranges created: 2022-04-05 19:00:20

## summarizing abundance

## summarizing counts

## summarizing length

dim(gse)

## [1] 58294     2

head(rownames(gse))

## [1] "ENSG00000000003.14" "ENSG00000000005.5"  "ENSG00000000419.12"
## [4] "ENSG00000000457.13" "ENSG00000000460.16" "ENSG00000000938.12"

Examining the Data

We now have an understanding of how to load the into R, we will use a pre-compiled data set with all the samples, not just the two we were using as an example.

library(airway)

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## The following object is masked from 'package:Biobase':
## 
##     rowMedians

data(gse)
gse

## class: RangedSummarizedExperiment 
## dim: 58294 8 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000285993.1 ENSG00000285994.1
## rowData names(1): gene_id
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names donor condition

We have three interseting bits to look at in this data structure. the assays, the rowRanges, and the colData

assayNames(gse)

## [1] "counts"    "abundance" "length"

head(assay(gse), 3)

##                    SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003.14    708.164    467.962    900.992    424.368   1188.295
## ENSG00000000005.5       0.000      0.000      0.000      0.000      0.000
## ENSG00000000419.12    455.000    510.000    604.000    352.000    583.000
##                    SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003.14   1090.668    805.929    599.337
## ENSG00000000005.5       0.000      0.000      0.000
## ENSG00000000419.12    773.999    409.999    499.000

colSums(assay(gse))

## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
##   21100805   19298584   26145537   15688246   25268618   31891456   19683767 
## SRR1039521 
##   21813903

rowRanges(gse)

## GRanges object with 58294 ranges and 1 metadata column:
##                      seqnames              ranges strand |            gene_id
##                         <Rle>           <IRanges>  <Rle> |        <character>
##   ENSG00000000003.14     chrX 100627109-100639991      - | ENSG00000000003.14
##    ENSG00000000005.5     chrX 100584802-100599885      + |  ENSG00000000005.5
##   ENSG00000000419.12    chr20   50934867-50958555      - | ENSG00000000419.12
##   ENSG00000000457.13     chr1 169849631-169894267      - | ENSG00000000457.13
##   ENSG00000000460.16     chr1 169662007-169854080      + | ENSG00000000460.16
##                  ...      ...                 ...    ... .                ...
##    ENSG00000285990.1    chr14   19244904-19269380      - |  ENSG00000285990.1
##    ENSG00000285991.1     chr6 149817937-149896011      - |  ENSG00000285991.1
##    ENSG00000285992.1     chr8   47129262-47132628      + |  ENSG00000285992.1
##    ENSG00000285993.1    chr18   46409197-46410645      - |  ENSG00000285993.1
##    ENSG00000285994.1    chr10   12563151-12567351      + |  ENSG00000285994.1
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

colData(gse)

## DataFrame with 8 rows and 3 columns
##                 names    donor     condition
##              <factor> <factor>      <factor>
## SRR1039508 SRR1039508  N61311  Untreated    
## SRR1039509 SRR1039509  N61311  Dexamethasone
## SRR1039512 SRR1039512  N052611 Untreated    
## SRR1039513 SRR1039513  N052611 Dexamethasone
## SRR1039516 SRR1039516  N080611 Untreated    
## SRR1039517 SRR1039517  N080611 Dexamethasone
## SRR1039520 SRR1039520  N061011 Untreated    
## SRR1039521 SRR1039521  N061011 Dexamethasone

Quick DESeq2 Example

We start by taking a look at our conditions to make sure we understand them.

gse$donor

## [1] N61311  N61311  N052611 N052611 N080611 N080611 N061011 N061011
## Levels: N052611 N061011 N080611 N61311

gse$condition

## [1] Untreated     Dexamethasone Untreated     Dexamethasone Untreated    
## [6] Dexamethasone Untreated     Dexamethasone
## Levels: Untreated Dexamethasone

We can rename our variables if we want. Let’s use cell to denote the donor cell line, and dex to denote the treatment condition.

gse$cell <- gse$donor
gse$dex <- gse$condition

We can also change the names of the levels. It is critical when one renames levels to not change the order. Here we will rename “Untreated” as “untrt” and “Dexamethasone” as “trt”:

levels(gse$dex)

## [1] "Untreated"     "Dexamethasone"

# when renaming levels, the order must be preserved!
levels(gse$dex) <- c("untrt", "trt")

The simplest design formula for differential expression would be ~ condition, where condition is a column in colData(dds) that specifies which of two (or more groups) the samples belong to. For the airway experiment, we will specify ~ cell + dex meaning that we want to test for the effect of dexamethasone (dex) controlling for the effect of different cell line (cell).

Note: it is prefered in R that the first level of a factor be the reference level (e.g. control, or untreated samples). In this case, when the colData table was assembled the untreated samples were already set as the reference, but if this were not the case we could use relevel as shown below. While levels(...) <- above was simply for renaming the character strings associated with levels, relevel is a very different function, which decides how the variables will be coded, and how contrasts will be computed. For a two-group comparison, the use of relevel to change the reference level would flip the sign of a coefficient associated with a contrast between the two groups.

relevel(gse$dex, "untrt")

## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: untrt trt

gse$dex <- relevel(gse$dex, "untrt")

Okay finally we can start looking at the data itself, starting with checking the different numbers of reads mapped to genes in the data itself.

round( colSums(assay(gse)) / 1e6, 1 )

## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 
##       21.1       19.3       26.1       15.7       25.3       31.9       19.7 
## SRR1039521 
##       21.8

Once we have our fully annotated SummarizedExperiment object, we can construct a DESeqDataSet object from it that will then form the starting point of the analysis. We add an appropriate design for the analysis:

library("DESeq2")

## Warning: package 'DESeq2' was built under R version 4.2.2

dds <- DESeqDataSet(gse, design = ~ cell + dex)

## using counts and average transcript lengths from tximeta

In order to reduce the size of the object, and to increase the speed of our functions, we can remove the rows that have no or nearly no information about the amount of gene expression. Here we apply the most minimal filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples. Additional weighting/filtering to improve power is applied at a later step in the workflow.

nrow(dds)

## [1] 58294

keep <- rowSums(counts(dds)) > 1
dds <- dds[keep,]
nrow(dds)

## [1] 31604

Exloratory analysis

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. So for visualizing distances between samples in genomic space we use the rlog function to correct for the fact that for RNA-seq counts, the expected variance grows with the mean

rld <- rlog(dds, blind = FALSE)

## using 'avgTxLength' from assays(dds), correcting for library size

head(assay(rld), 3)

##                    SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003.14   9.482613   9.172197   9.558383   9.346001   9.851349
## ENSG00000000419.12   8.860186   9.150196   9.000042   8.995902   8.951327
## ENSG00000000457.13   8.354790   8.166700   8.236582   8.366693   8.121781
##                    SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003.14   9.587602   9.727248   9.357876
## ENSG00000000419.12   9.091075   8.848782   9.054384
## ENSG00000000457.13   8.282307   8.392384   8.391023

colData(rld)

## DataFrame with 8 rows and 5 columns
##                 names    donor     condition     cell      dex
##              <factor> <factor>      <factor> <factor> <factor>
## SRR1039508 SRR1039508  N61311  Untreated      N61311     untrt
## SRR1039509 SRR1039509  N61311  Dexamethasone  N61311     trt  
## SRR1039512 SRR1039512  N052611 Untreated      N052611    untrt
## SRR1039513 SRR1039513  N052611 Dexamethasone  N052611    trt  
## SRR1039516 SRR1039516  N080611 Untreated      N080611    untrt
## SRR1039517 SRR1039517  N080611 Dexamethasone  N080611    trt  
## SRR1039520 SRR1039520  N061011 Untreated      N061011    untrt
## SRR1039521 SRR1039521  N061011 Dexamethasone  N061011    trt

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:matrixStats':
## 
##     count

## The following object is masked from 'package:AnnotationDbi':
## 
##     select

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union

## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("ggplot2")

dds <- estimateSizeFactors(dds)

## using 'avgTxLength' from assays(dds), correcting for library size

df <- bind_rows(
  as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
         mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))

## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
##   tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

colnames(df)[1:2] <- c("x", "y")  

lvls <- c("log2(x + 1)", "rlog")
df$transformation <- factor(df$transformation, levels=lvls)

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)

sampleDists <- dist(t(assay(rld)))
sampleDists

##            SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## SRR1039509   45.18375                                                       
## SRR1039512   37.99283   52.78136                                            
## SRR1039513   59.45032   42.60807   47.70712                                 
## SRR1039516   42.74926   56.62874   41.73956   61.38650                      
## SRR1039517   61.34501   49.73744   56.24779   48.43201   46.02443           
## SRR1039520   38.33193   55.27510   35.94851   56.03591   44.74166   60.48153
## SRR1039521   60.59154   43.96644   55.45531   36.52665   62.55424   50.12503
##            SRR1039520
## SRR1039509           
## SRR1039512           
## SRR1039513           
## SRR1039516           
## SRR1039517           
## SRR1039520           
## SRR1039521   48.98247

library("pheatmap")
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

pcaData <- plotPCA(rld, intgroup = c( "dex", "cell"), returnData = TRUE)
pcaData

##                   PC1        PC2         group   dex    cell       name
## SRR1039508 -16.886308  -4.568896  untrt:N61311 untrt  N61311 SRR1039508
## SRR1039509   9.053108  -1.556848    trt:N61311   trt  N61311 SRR1039509
## SRR1039512 -10.398210  -5.099252 untrt:N052611 untrt N052611 SRR1039512
## SRR1039513  16.866083  -4.192501   trt:N052611   trt N052611 SRR1039513
## SRR1039516 -14.848605  14.773014 untrt:N080611 untrt N080611 SRR1039516
## SRR1039517   9.920424  20.130443   trt:N080611   trt N080611 SRR1039517
## SRR1039520 -11.476023 -10.960401 untrt:N061011 untrt N061011 SRR1039520
## SRR1039521  17.769532  -8.525559   trt:N061011   trt N061011 SRR1039521

percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
  geom_point(size =3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed() +
  ggtitle("PCA with RLOG data")

Running Differential Expression

At this point running the differential expression is easy.

dds <- DESeq(dds)

## using pre-existing normalization factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

Calling results without any arguments will extract the estimated log2 fold changes and p values for the last variable in the design formula. If there are more than 2 levels for this variable, results will extract the results table for a comparison of the last level over the first level. The comparison is printed at the top of the output: dex trt vs untrt

res <- results(dds)
res

## log2 fold change (MLE): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 31604 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat      pvalue
##                     <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000000003.14 739.940717     -0.3611537  0.106869 -3.379419 0.000726392
## ENSG00000000419.12 511.735722      0.2063147  0.128665  1.603509 0.108822318
## ENSG00000000457.13 314.194855      0.0378308  0.158633  0.238479 0.811509461
## ENSG00000000460.16  79.793622     -0.1152590  0.314991 -0.365912 0.714430444
## ENSG00000000938.12   0.307267     -1.3691185  3.503764 -0.390757 0.695977205
## ...                       ...            ...       ...       ...         ...
## ENSG00000285979.1   38.353886      0.3423657  0.359511  0.952310    0.340940
## ENSG00000285987.1    1.562508      0.7064145  1.547295  0.456548    0.647996
## ENSG00000285990.1    0.642315      0.3647333  3.433276  0.106235    0.915396
## ENSG00000285991.1   11.276284     -0.1165515  0.748601 -0.155692    0.876275
## ENSG00000285994.1    3.651041     -0.0960094  1.068660 -0.089841    0.928414
##                          padj
##                     <numeric>
## ENSG00000000003.14 0.00531137
## ENSG00000000419.12 0.29318870
## ENSG00000000457.13 0.92255697
## ENSG00000000460.16 0.87298038
## ENSG00000000938.12         NA
## ...                       ...
## ENSG00000285979.1    0.600750
## ENSG00000285987.1          NA
## ENSG00000285990.1          NA
## ENSG00000285991.1    0.952921
## ENSG00000285994.1          NA

The following command is the explicit version of the above command

res <- results(dds, contrast=c("dex","trt","untrt"))
res

## log2 fold change (MLE): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 31604 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat      pvalue
##                     <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000000003.14 739.940717     -0.3611537  0.106869 -3.379419 0.000726392
## ENSG00000000419.12 511.735722      0.2063147  0.128665  1.603509 0.108822318
## ENSG00000000457.13 314.194855      0.0378308  0.158633  0.238479 0.811509461
## ENSG00000000460.16  79.793622     -0.1152590  0.314991 -0.365912 0.714430444
## ENSG00000000938.12   0.307267     -1.3691185  3.503764 -0.390757 0.695977205
## ...                       ...            ...       ...       ...         ...
## ENSG00000285979.1   38.353886      0.3423657  0.359511  0.952310    0.340940
## ENSG00000285987.1    1.562508      0.7064145  1.547295  0.456548    0.647996
## ENSG00000285990.1    0.642315      0.3647333  3.433276  0.106235    0.915396
## ENSG00000285991.1   11.276284     -0.1165515  0.748601 -0.155692    0.876275
## ENSG00000285994.1    3.651041     -0.0960094  1.068660 -0.089841    0.928414
##                          padj
##                     <numeric>
## ENSG00000000003.14 0.00531137
## ENSG00000000419.12 0.29318870
## ENSG00000000457.13 0.92255697
## ENSG00000000460.16 0.87298038
## ENSG00000000938.12         NA
## ...                       ...
## ENSG00000285979.1    0.600750
## ENSG00000285987.1          NA
## ENSG00000285990.1          NA
## ENSG00000285991.1    0.952921
## ENSG00000285994.1          NA

We can understand the meanings of the columns by checking the object itself.

mcols(res, use.names = TRUE)

## DataFrame with 6 rows and 2 columns
##                        type            description
##                 <character>            <character>
## baseMean       intermediate mean of normalized c..
## log2FoldChange      results log2 fold change (ML..
## lfcSE               results standard error: dex ..
## stat                results Wald statistic: dex ..
## pvalue              results Wald test p-value: d..
## padj                results   BH adjusted p-values

summary(res)

## 
## out of 31604 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2373, 7.5%
## LFC < 0 (down)     : 1949, 6.2%
## outliers [1]       : 0, 0%
## low counts [2]     : 14706, 47%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

We can change the signficance threshold with the alpha parameter

res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)

## 
## FALSE  TRUE 
## 13357  3541

summary(res.05)

## 
## out of 31604 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1994, 6.3%
## LFC < 0 (down)     : 1547, 4.9%
## outliers [1]       : 0, 0%
## low counts [2]     : 14706, 47%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

And the log Fold Change threshold with the lfcThreshold=1

resLFC1 <- results(dds, lfcThreshold=1)
table(resLFC1$padj < 0.1)

## 
## FALSE  TRUE 
## 16687   211

summary(resLFC1)

## 
## out of 31604 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 1.00 (up)    : 138, 0.44%
## LFC < -1.00 (down) : 73, 0.23%
## outliers [1]       : 0, 0%
## low counts [2]     : 14706, 47%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Plotting an MA-plot

An MA-plot (Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”.

The DESeq2 package uses a Bayesian procedure to moderate (or “shrink”) log2 fold changes from genes with very low counts and highly variable counts

res.noshr <- results(dds, name="dex_trt_vs_untrt")
plotMA(res.noshr, ylim = c(-5, 5))

Here we specify the apeglm method for shrinking coefficients, which is good for shrinking the noisy LFC estimates while giving low bias LFC estimates for true large differences

library("apeglm")
resultsNames(dds)

## [1] "Intercept"               "cell_N061011_vs_N052611"
## [3] "cell_N080611_vs_N052611" "cell_N61311_vs_N052611" 
## [5] "dex_trt_vs_untrt"

res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

plotMA(res, ylim = c(-5, 5))

plotMA(res, ylim = c(-5, 5), alpha = 0.01)

Here we examine the difference between determining p-value as significance over a threshold, vs filtering a lfc threshold after calculating p-value. We observe that there are many values close to the threshold that are not adequately differentially expressed to be confident that the true lfc is greater than 1.

### calculate p-value relative to a biologically meaningful threshold
filter.first <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm", lfcThreshold = 1)

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

## computing FSOS 'false sign or small' s-values (T=1)

filter.first <- plotMA(filter.first, ylim = c(-5,5), alpha = 0.01, returnData = TRUE)

### calculate p-value for any change
filter.after <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm", lfcThreshold = 0)

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

filter.after <- plotMA(filter.after, ylim = c(-5,5), alpha = 0.01, returnData = TRUE)
filter.after <- filter.after %>% mutate(isDE = c(abs(lfc) > 1 & isDE))

### collect the data for plotting
plot.data <- filter.first
plot.data$filterAfterIsDE <- filter.after$isDE
plot.data <- plot.data %>% 
  mutate(color = case_when(isDE & !filterAfterIsDE ~ "firebrick",
                           !isDE & filterAfterIsDE ~ "darkorange1",
                           isDE & filterAfterIsDE ~ "gold2",
                           !isDE & !filterAfterIsDE ~ "plum4"
  ))

### plot
ggplot(plot.data, aes(x = mean, y = lfc, color = color)) +
  geom_point(shape = 20) +
  scale_colour_identity(labels = c("filter before only", "filter after only", "significant in both", "N.S."), 
                        guide = "legend") + 
  theme_minimal() + 
  scale_x_log10() + 
  coord_cartesian(ylim = c(-5,5)) + 
  geom_hline(yintercept = c(-1,1))

Differential Expression Analysis Examples

2023-06-30