In this document we will be looking at differential expression done with DESeq2.
Much of this can also apply to edgeR and limma-voom
We have some example data from the airway package from Bioconductor.
This package provides a RangedSummarizedExperiment object of read counts in genes for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. Details on the gene model and read counting procedure are provided in the package vignette. The citation for the experiment is: Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. ‘RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.’ PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.
dir <- system.file("extdata", package="airway", mustWork=TRUE)
list.files(dir)
## [1] "GSE52778_series_matrix.txt" "Homo_sapiens.GRCh37.75_subset.gtf"
## [3] "quants" "sample_table.csv"
## [5] "SraRunInfo_SRP033351.csv" "SRR1039508_subset.bam"
## [7] "SRR1039509_subset.bam" "SRR1039512_subset.bam"
## [9] "SRR1039513_subset.bam" "SRR1039516_subset.bam"
## [11] "SRR1039517_subset.bam" "SRR1039520_subset.bam"
## [13] "SRR1039521_subset.bam"
list.files(file.path(dir, "quants"))
## [1] "SRR1039508" "SRR1039509"
We have here listed data files and directories from the airway package. We will focus on the quants directory as that contains the the salmon quantification of the transcript expression.
The other piece of data we need is the table with detailed information for each of our samples that links samples to the associated FASTQ and Salmon directories.
csvfile <- file.path(dir, "sample_table.csv")
coldata <- read.csv(csvfile, row.names=1, stringsAsFactors=FALSE)
coldata
## SampleName cell dex albut Run avgLength Experiment
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358
## Sample BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
As a demonstration of loading salmon data into R we will use just two of the samples that are included as salmon quantifications with the airway package.
coldata <- coldata[1:2,]
coldata$names <- coldata$Run
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
file.exists(coldata$files)
## [1] TRUE TRUE
library("tximeta")
se <- tximeta(coldata)
## importing quantifications
## reading in files with read_tsv
## 1 2
## found matching transcriptome:
## [ GENCODE - Homo sapiens - release 29 ]
## loading existing TxDb created: 2022-04-05 18:57:53
## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
##
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## loading existing transcript ranges created: 2022-04-05 18:57:54
If the reference transcriptome checksum was recognized by tximeta (details on this in the tximeta vignette), and if we have a working internet connection, tximeta will locate and download the relevant annotation data from various sources.
Here we can see at least that we have the correct shape of data, and note that the data is imported at the transcript level.
dim(se)
## [1] 205870 2
head(rownames(se))
## [1] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
## [4] "ENST00000619216.1" "ENST00000473358.1" "ENST00000469289.1"
We will be focusing on the gene level analysis for this walk through so we will combine transcripts down into genes.
gse <- summarizeToGene(se)
## loading existing TxDb created: 2022-04-05 18:57:53
## obtaining transcript-to-gene mapping from database
## loading existing gene ranges created: 2022-04-05 19:00:20
## summarizing abundance
## summarizing counts
## summarizing length
dim(gse)
## [1] 58294 2
head(rownames(gse))
## [1] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12"
## [4] "ENSG00000000457.13" "ENSG00000000460.16" "ENSG00000000938.12"
We now have an understanding of how to load the into R, we will use a pre-compiled data set with all the samples, not just the two we were using as an example.
library(airway)
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
data(gse)
gse
## class: RangedSummarizedExperiment
## dim: 58294 8
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
## ENSG00000285993.1 ENSG00000285994.1
## rowData names(1): gene_id
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names donor condition
We have three interseting bits to look at in this data structure. the
assays
, the rowRanges
, and the
colData
assayNames(gse)
## [1] "counts" "abundance" "length"
head(assay(gse), 3)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003.14 708.164 467.962 900.992 424.368 1188.295
## ENSG00000000005.5 0.000 0.000 0.000 0.000 0.000
## ENSG00000000419.12 455.000 510.000 604.000 352.000 583.000
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003.14 1090.668 805.929 599.337
## ENSG00000000005.5 0.000 0.000 0.000
## ENSG00000000419.12 773.999 409.999 499.000
colSums(assay(gse))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## 21100805 19298584 26145537 15688246 25268618 31891456 19683767
## SRR1039521
## 21813903
rowRanges(gse)
## GRanges object with 58294 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000000003.14 chrX 100627109-100639991 - | ENSG00000000003.14
## ENSG00000000005.5 chrX 100584802-100599885 + | ENSG00000000005.5
## ENSG00000000419.12 chr20 50934867-50958555 - | ENSG00000000419.12
## ENSG00000000457.13 chr1 169849631-169894267 - | ENSG00000000457.13
## ENSG00000000460.16 chr1 169662007-169854080 + | ENSG00000000460.16
## ... ... ... ... . ...
## ENSG00000285990.1 chr14 19244904-19269380 - | ENSG00000285990.1
## ENSG00000285991.1 chr6 149817937-149896011 - | ENSG00000285991.1
## ENSG00000285992.1 chr8 47129262-47132628 + | ENSG00000285992.1
## ENSG00000285993.1 chr18 46409197-46410645 - | ENSG00000285993.1
## ENSG00000285994.1 chr10 12563151-12567351 + | ENSG00000285994.1
## -------
## seqinfo: 25 sequences (1 circular) from hg38 genome
colData(gse)
## DataFrame with 8 rows and 3 columns
## names donor condition
## <factor> <factor> <factor>
## SRR1039508 SRR1039508 N61311 Untreated
## SRR1039509 SRR1039509 N61311 Dexamethasone
## SRR1039512 SRR1039512 N052611 Untreated
## SRR1039513 SRR1039513 N052611 Dexamethasone
## SRR1039516 SRR1039516 N080611 Untreated
## SRR1039517 SRR1039517 N080611 Dexamethasone
## SRR1039520 SRR1039520 N061011 Untreated
## SRR1039521 SRR1039521 N061011 Dexamethasone
We start by taking a look at our conditions to make sure we understand them.
gse$donor
## [1] N61311 N61311 N052611 N052611 N080611 N080611 N061011 N061011
## Levels: N052611 N061011 N080611 N61311
gse$condition
## [1] Untreated Dexamethasone Untreated Dexamethasone Untreated
## [6] Dexamethasone Untreated Dexamethasone
## Levels: Untreated Dexamethasone
We can rename our variables if we want. Let’s use cell to denote the donor cell line, and dex to denote the treatment condition.
gse$cell <- gse$donor
gse$dex <- gse$condition
We can also change the names of the levels. It is critical when one renames levels to not change the order. Here we will rename “Untreated” as “untrt” and “Dexamethasone” as “trt”:
levels(gse$dex)
## [1] "Untreated" "Dexamethasone"
# when renaming levels, the order must be preserved!
levels(gse$dex) <- c("untrt", "trt")
The simplest design formula for differential expression would be
~ condition
, where condition is a column in
colData(dds)
that specifies which of two (or more groups)
the samples belong to. For the airway experiment, we will specify
~ cell + dex
meaning that we want to test for the effect of
dexamethasone (dex
) controlling for the effect of different
cell line (cell
).
Note: it is prefered in R that the first level of a factor be the
reference level (e.g. control, or untreated samples). In this case, when
the colData table was assembled the untreated samples were already set
as the reference, but if this were not the case we could use relevel as
shown below. While levels(...) <-
above was simply for
renaming the character strings associated with levels, relevel is a very
different function, which decides how the variables will be coded, and
how contrasts will be computed. For a two-group comparison, the use of
relevel
to change the reference level would flip the sign
of a coefficient associated with a contrast between the two groups.
relevel(gse$dex, "untrt")
## [1] untrt trt untrt trt untrt trt untrt trt
## Levels: untrt trt
gse$dex <- relevel(gse$dex, "untrt")
Okay finally we can start looking at the data itself, starting with checking the different numbers of reads mapped to genes in the data itself.
round( colSums(assay(gse)) / 1e6, 1 )
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## 21.1 19.3 26.1 15.7 25.3 31.9 19.7
## SRR1039521
## 21.8
Once we have our fully annotated SummarizedExperiment object, we can construct a DESeqDataSet object from it that will then form the starting point of the analysis. We add an appropriate design for the analysis:
library("DESeq2")
dds <- DESeqDataSet(gse, design = ~ cell + dex)
## using counts and average transcript lengths from tximeta
In order to reduce the size of the object, and to increase the speed of our functions, we can remove the rows that have no or nearly no information about the amount of gene expression. Here we apply the most minimal filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples. Additional weighting/filtering to improve power is applied at a later step in the workflow.
nrow(dds)
## [1] 58294
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep,]
nrow(dds)
## [1] 31604
Many common statistical methods for exploratory analysis of
multidimensional data, for example clustering and principal components
analysis (PCA), work best for data that generally has the same range of
variance at different ranges of the mean values. So for visualizing
distances between samples in genomic space we use the rlog
function to correct for the fact that for RNA-seq counts, the expected
variance grows with the mean
rld <- rlog(dds, blind = FALSE)
## using 'avgTxLength' from assays(dds), correcting for library size
head(assay(rld), 3)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003.14 9.482613 9.172197 9.558383 9.346001 9.851349
## ENSG00000000419.12 8.860186 9.150196 9.000042 8.995902 8.951327
## ENSG00000000457.13 8.354790 8.166700 8.236582 8.366693 8.121781
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003.14 9.587602 9.727248 9.357876
## ENSG00000000419.12 9.091075 8.848782 9.054384
## ENSG00000000457.13 8.282307 8.392384 8.391023
colData(rld)
## DataFrame with 8 rows and 5 columns
## names donor condition cell dex
## <factor> <factor> <factor> <factor> <factor>
## SRR1039508 SRR1039508 N61311 Untreated N61311 untrt
## SRR1039509 SRR1039509 N61311 Dexamethasone N61311 trt
## SRR1039512 SRR1039512 N052611 Untreated N052611 untrt
## SRR1039513 SRR1039513 N052611 Dexamethasone N052611 trt
## SRR1039516 SRR1039516 N080611 Untreated N080611 untrt
## SRR1039517 SRR1039517 N080611 Dexamethasone N080611 trt
## SRR1039520 SRR1039520 N061011 Untreated N061011 untrt
## SRR1039521 SRR1039521 N061011 Dexamethasone N061011 trt
library("dplyr")
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:matrixStats':
##
## count
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
dds <- estimateSizeFactors(dds)
## using 'avgTxLength' from assays(dds), correcting for library size
df <- bind_rows(
as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
mutate(transformation = "log2(x + 1)"),
as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
colnames(df)[1:2] <- c("x", "y")
lvls <- c("log2(x + 1)", "rlog")
df$transformation <- factor(df$transformation, levels=lvls)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
sampleDists <- dist(t(assay(rld)))
sampleDists
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## SRR1039509 45.18375
## SRR1039512 37.99283 52.78136
## SRR1039513 59.45032 42.60807 47.70712
## SRR1039516 42.74926 56.62874 41.73956 61.38650
## SRR1039517 61.34501 49.73744 56.24779 48.43201 46.02443
## SRR1039520 38.33193 55.27510 35.94851 56.03591 44.74166 60.48153
## SRR1039521 60.59154 43.96644 55.45531 36.52665 62.55424 50.12503
## SRR1039520
## SRR1039509
## SRR1039512
## SRR1039513
## SRR1039516
## SRR1039517
## SRR1039520
## SRR1039521 48.98247
library("pheatmap")
library("RColorBrewer")
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( rld$dex, rld$cell, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)
pcaData <- plotPCA(rld, intgroup = c( "dex", "cell"), returnData = TRUE)
pcaData
## PC1 PC2 group dex cell name
## SRR1039508 -16.886308 -4.568896 untrt:N61311 untrt N61311 SRR1039508
## SRR1039509 9.053108 -1.556848 trt:N61311 trt N61311 SRR1039509
## SRR1039512 -10.398210 -5.099252 untrt:N052611 untrt N052611 SRR1039512
## SRR1039513 16.866083 -4.192501 trt:N052611 trt N052611 SRR1039513
## SRR1039516 -14.848605 14.773014 untrt:N080611 untrt N080611 SRR1039516
## SRR1039517 9.920424 20.130443 trt:N080611 trt N080611 SRR1039517
## SRR1039520 -11.476023 -10.960401 untrt:N061011 untrt N061011 SRR1039520
## SRR1039521 17.769532 -8.525559 trt:N061011 trt N061011 SRR1039521
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
geom_point(size =3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
ggtitle("PCA with RLOG data")
At this point running the differential expression is easy.
dds <- DESeq(dds)
## using pre-existing normalization factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
Calling results without any arguments will extract the estimated log2
fold changes and p values for the last variable in the design formula.
If there are more than 2 levels for this variable, results will extract
the results table for a comparison of the last level over the first
level. The comparison is printed at the top of the output:
dex trt vs untrt
res <- results(dds)
res
## log2 fold change (MLE): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 31604 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.14 739.940717 -0.3611537 0.106869 -3.379419 0.000726392
## ENSG00000000419.12 511.735722 0.2063147 0.128665 1.603509 0.108822318
## ENSG00000000457.13 314.194855 0.0378308 0.158633 0.238479 0.811509461
## ENSG00000000460.16 79.793622 -0.1152590 0.314991 -0.365912 0.714430444
## ENSG00000000938.12 0.307267 -1.3691185 3.503764 -0.390757 0.695977205
## ... ... ... ... ... ...
## ENSG00000285979.1 38.353886 0.3423657 0.359511 0.952310 0.340940
## ENSG00000285987.1 1.562508 0.7064145 1.547295 0.456548 0.647996
## ENSG00000285990.1 0.642315 0.3647333 3.433276 0.106235 0.915396
## ENSG00000285991.1 11.276284 -0.1165515 0.748601 -0.155692 0.876275
## ENSG00000285994.1 3.651041 -0.0960094 1.068660 -0.089841 0.928414
## padj
## <numeric>
## ENSG00000000003.14 0.00531137
## ENSG00000000419.12 0.29318870
## ENSG00000000457.13 0.92255697
## ENSG00000000460.16 0.87298038
## ENSG00000000938.12 NA
## ... ...
## ENSG00000285979.1 0.600750
## ENSG00000285987.1 NA
## ENSG00000285990.1 NA
## ENSG00000285991.1 0.952921
## ENSG00000285994.1 NA
The following command is the explicit version of the above command
res <- results(dds, contrast=c("dex","trt","untrt"))
res
## log2 fold change (MLE): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 31604 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.14 739.940717 -0.3611537 0.106869 -3.379419 0.000726392
## ENSG00000000419.12 511.735722 0.2063147 0.128665 1.603509 0.108822318
## ENSG00000000457.13 314.194855 0.0378308 0.158633 0.238479 0.811509461
## ENSG00000000460.16 79.793622 -0.1152590 0.314991 -0.365912 0.714430444
## ENSG00000000938.12 0.307267 -1.3691185 3.503764 -0.390757 0.695977205
## ... ... ... ... ... ...
## ENSG00000285979.1 38.353886 0.3423657 0.359511 0.952310 0.340940
## ENSG00000285987.1 1.562508 0.7064145 1.547295 0.456548 0.647996
## ENSG00000285990.1 0.642315 0.3647333 3.433276 0.106235 0.915396
## ENSG00000285991.1 11.276284 -0.1165515 0.748601 -0.155692 0.876275
## ENSG00000285994.1 3.651041 -0.0960094 1.068660 -0.089841 0.928414
## padj
## <numeric>
## ENSG00000000003.14 0.00531137
## ENSG00000000419.12 0.29318870
## ENSG00000000457.13 0.92255697
## ENSG00000000460.16 0.87298038
## ENSG00000000938.12 NA
## ... ...
## ENSG00000285979.1 0.600750
## ENSG00000285987.1 NA
## ENSG00000285990.1 NA
## ENSG00000285991.1 0.952921
## ENSG00000285994.1 NA
We can understand the meanings of the columns by checking the object itself.
mcols(res, use.names = TRUE)
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized c..
## log2FoldChange results log2 fold change (ML..
## lfcSE results standard error: dex ..
## stat results Wald statistic: dex ..
## pvalue results Wald test p-value: d..
## padj results BH adjusted p-values
summary(res)
##
## out of 31604 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2373, 7.5%
## LFC < 0 (down) : 1949, 6.2%
## outliers [1] : 0, 0%
## low counts [2] : 14706, 47%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
We can change the signficance threshold with the alpha
parameter
res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)
##
## FALSE TRUE
## 13357 3541
summary(res)
##
## out of 31604 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2373, 7.5%
## LFC < 0 (down) : 1949, 6.2%
## outliers [1] : 0, 0%
## low counts [2] : 14706, 47%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
And the log Fold Change threshold with the lfcThreshold=1
resLFC1 <- results(dds, lfcThreshold=1)
table(resLFC1$padj < 0.1)
##
## FALSE TRUE
## 16687 211
summary(res)
##
## out of 31604 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2373, 7.5%
## LFC < 0 (down) : 1949, 6.2%
## outliers [1] : 0, 0%
## low counts [2] : 14706, 47%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
An MA-plot (Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”.
The DESeq2 package uses a Bayesian procedure to moderate (or “shrink”) log2 fold changes from genes with very low counts and highly variable counts
res.noshr <- results(dds, name="dex_trt_vs_untrt")
plotMA(res.noshr, ylim = c(-5, 5))
Here we specify the apeglm
method for shrinking
coefficients, which is good for shrinking the noisy LFC estimates while
giving low bias LFC estimates for true large differences
library("apeglm")
resultsNames(dds)
## [1] "Intercept" "cell_N061011_vs_N052611"
## [3] "cell_N080611_vs_N052611" "cell_N61311_vs_N052611"
## [5] "dex_trt_vs_untrt"
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
plotMA(res, ylim = c(-5, 5))
plotMA(res, ylim = c(-5, 5), alpha = 0.01)
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm", lfcThreshold=1)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
## computing FSOS 'false sign or small' s-values (T=1)
plotMA(res, ylim = c(-5, 5))
## thresholding s-values on alpha=0.005 to color points