Dennis Hazelett
3/31/2022
This work, “Analysis and Design of Computational Biology Experiments”, is licensed under CC BY 4.0 by Dennis Hazelett.
Bioinformatics: standardized skills and tools for analysis of (typically) large high-throughput experiments
When you think about it, caring for patients is 99 percent information and 1 percent intervention, so it's clear that with or without genomics, the paradigm is shifting. Bioinformatics brings a cutting edge capacity to healthcare.
Computational Biology: The application of mathematical models involving (formerly) prohibitive computational infrastructure as a general approach to drawing integrated inferences about biological questions
source: Randall Munroe's XKCD comic strip
Systems biology is an approach in biomedical research to understanding the larger picture—be it at the level of the organism, tissue, or cell—by putting its pieces together. It’s in stark contrast to decades of reductionist biology, which involves taking the pieces apart.
“The next modern synthesis in biology will be driven by the absorption of mathematical, statistical, and computational methods into mainstream biological training.”
From: Predicting the molecular complexity of sequencing libraries, Daley & Smith; Nature Methods 2013 PMID: 23435259
why sequencing depth matters
from “The Cancer Genome Atlas Pan-Cancer Analysis Project”, Nature Genetics 2013
Don't reinvent the wheel!
from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022 and “The Far Side” by Gary Larson
Document EVERYTHING
from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022
Automate your workflows
from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022 and the Norman Rockwell Museum
Continuously measure Performance
Monitor Execution
…is a ubiquitous concept of biology, molecular biology, genomics & especially bioinformatics.
more things than expected due to random chance
If we select a single marble, the probabilities change
m = 15, n = 45
draw1: p_m = 15 / (15 + 45)
[1] 0.25
draw2: p_m = 14 / (14 + 45)
[1] 0.237
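The two draws above can be reproduced in a few lines of base R. This is a minimal sketch: the variable names and the simulation check are assumptions added for illustration, not part of the original slides.

```r
# Bag from the slides: m = 15 marbles of interest, n = 45 other marbles.
m <- 15
n <- 45

# Probability that the first draw is a marble of interest.
p_draw1 <- m / (m + n)

# Once one such marble has been removed, 14 of interest and 45 others remain.
p_draw2 <- (m - 1) / ((m - 1) + n)

print(round(c(draw1 = p_draw1, draw2 = p_draw2), 3))

# Simulation check (assumed setup): draw two marbles without replacement many
# times and estimate P(second is of interest | first is of interest).
set.seed(1)
bag <- rep(c("m", "n"), times = c(m, n))
draws <- replicate(20000, sample(bag, 2))
first_is_m <- draws[1, ] == "m"
p_draw2_sim <- mean(draws[2, first_is_m] == "m")
print(round(p_draw2_sim, 3))
```

The simulated value should land close to 14 / 59, which is the point of the slide: each draw without replacement changes the odds for the next one.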
If we select multiple marbles, the probabilities are described by the hypergeometric distribution
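In the notation used on these slides (m marbles of interest, n other marbles, k marbles drawn without replacement), the standard hypergeometric probability mass function for drawing exactly x marbles of interest can be written as follows. The symbol x is an assumption chosen to match the arguments of R's dhyper.

```latex
P(X = x) = \frac{\binom{m}{x}\,\binom{n}{k - x}}{\binom{m + n}{k}},
\qquad \max(0,\, k - n) \le x \le \min(k,\, m)
```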
The hypergeometric distribution is related to the binomial distribution (sampling without replacement vs. with replacement)
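One way to see the relationship is to keep the fraction of marbles of interest fixed and let the bag grow: removing a marble then barely changes the odds, and the hypergeometric probabilities approach the binomial ones. The sketch below assumes k = 10 draws and a fixed fraction of 0.25 (15 / 60, as in the marble example); the helper name max_pmf_gap is hypothetical.

```r
k <- 10      # number of marbles drawn (assumed for illustration)
x <- 0:k     # possible numbers of successes
p <- 0.25    # fraction of marbles of interest, 15 / 60 in the marble example

# Largest absolute difference between the two PMFs for a bag of a given size.
max_pmf_gap <- function(bag_size) {
  m <- round(p * bag_size)   # marbles of interest
  n <- bag_size - m          # other marbles
  max(abs(dhyper(x, m, n, k) - dbinom(x, k, m / bag_size)))
}

bag_sizes <- c(60, 600, 6000, 60000)
gaps <- sapply(bag_sizes, max_pmf_gap)
print(data.frame(bag_size = bag_sizes, max_gap = signif(gaps, 2)))
```

The gap shrinks as the bag grows, which is why the binomial is often used as an approximation when the population is much larger than the sample.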
Function phyper
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
q
vector of quantiles representing the number of white marbles drawn without replacement from a bag which contains both black and white marbles.
m
the number of white marbles in the bag
n
the number of black marbles in the bag
k
the number of marbles drawn
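A short usage sketch for the arguments listed above. It reuses the bag from the marble example (m = 15, n = 45); the choice of k = 10 draws and the variable names are assumptions for illustration.

```r
m <- 15   # marbles of interest in the bag
n <- 45   # other marbles in the bag
k <- 10   # marbles drawn (assumed value)

# P[X <= 3]: at most 3 marbles of interest among the k drawn.
p_at_most_3 <- phyper(3, m, n, k)

# The same value obtained by summing the probability mass function.
p_from_pmf <- sum(dhyper(0:3, m, n, k))

# q may be a vector, so the whole cumulative distribution comes from one call.
cdf <- phyper(0:k, m, n, k)

print(c(phyper = p_at_most_3, summed_pmf = p_from_pmf))
print(round(cdf, 4))
```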
(PMF = “Probability Mass Function”)
library("dplyr")
library("ggplot2")
library("foreach")
library("RColorBrewer")
# x: number of successes (marbles of interest drawn); k: number of marbles drawn
x = 0:15
k = 0:60
# one block of rows per value of x, holding dhyper(x, m = 15, n = 45, k) for every k
pmfprobs <- foreach(i = x, .combine = 'rbind') %do% data.frame(x=rep(i, length(k)), k, p = dhyper(i, 15, 45, k))
# plot the PMF for a handful of sample sizes
ggplot(pmfprobs[pmfprobs$k %in% c(1, 10, 20, 30, 40, 50, 59),]) +
geom_point(aes(x = x, y = p, colour = factor(k))) +
geom_line(aes(x = x, y = p, colour = factor(k))) +
scale_color_brewer(palette="Dark2", name = "number of trials") +
ylab("probability") +
xlab("successes") +
theme_minimal() +
theme(text = element_text(size=24)) +
ggtitle("Probability Mass Function (m=15, n=45)")
phyper(q, m, n, k, lower.tail = FALSE, log.p = FALSE)
k = 28 draws (m=15, n=45)
phyper(2, 20, 80, 10, lower.tail = FALSE, log.p = FALSE)
[1] 0.3187799
lower.tail
logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x]
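Because the upper tail is the strict inequality P[X > x], asking for "at least x successes" means passing q = x - 1. The sketch below checks this against the phyper(2, 20, 80, 10, lower.tail = FALSE) call shown above; the helper p_at_least is a hypothetical name introduced for illustration.

```r
m <- 20   # successes in the population
n <- 80   # failures in the population
k <- 10   # number of draws

# P[X > 2], as in the slide.
upper_gt2 <- phyper(2, m, n, k, lower.tail = FALSE)

# The same probability written two other ways.
via_pmf <- sum(dhyper(3:k, m, n, k))          # P[X >= 3]
via_complement <- 1 - phyper(2, m, n, k)      # 1 - P[X <= 2]

# Hypothetical helper: probability of at least x successes.
p_at_least <- function(x, m, n, k) {
  phyper(x - 1, m, n, k, lower.tail = FALSE)
}

print(c(upper_gt2, via_pmf, via_complement, p_at_least(3, m, n, k)))
```

All four values agree, so an enrichment p-value for an observed overlap of x should be computed with x - 1 as the quantile.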
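# point probabilities (PMF) for 0 to 10 successes in k = 10 draws
q = 0:10
probability = dhyper(x=q, m=20, n=80, k=10)
plot(q, probability, xlab = "number of successes", ylab = "probability", pch=16)
lines(q, probability)
# red dashed line: 0.05 reference level
abline(h = 0.05, lty = 2, col = 'red')
# blue line: q = 3, the first value counted in the right tail P[X > 2]
abline(v = 3, col = 'blue')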
[1] 0.209
suppose we have 2 sets of observations:
What is the probability of finding a read in a given gene (random draw), given the data
set.seed(4)
credible_expression <- rbeta(20, 4, 6)
plot(density(credible_expression), xlim=c(0,1))
abline(v=0.5, lty=3, col="red")
[1] "mean 5.8e-06"
What is the probability of finding a read in a given gene (random draw), given the data
credible_expression <- rbeta(10000, 4, 6)
plot(density(credible_expression), xlim=c(0,1))
abline(v=0.5, lty=3, col="red")
[1] "mean 0.4"
What is the probability of finding a read in a given gene (random draw), given the data
credible_expression <- rbeta(1e5, 234, 4e7)
cpm <- credible_expression * 1e6
plot(density(cpm), xlim=c(0,20))
[1] "5.8 cpm"
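The sampled density above can be summarised without simulation, since the mean and quantiles of a beta distribution are available directly. This sketch reuses the shape parameters from the rbeta call (234 and 4e7); the variable names and the choice of a 95% interval are assumptions.

```r
a <- 234   # first shape parameter from the slide
b <- 4e7   # second shape parameter from the slide

# Mean of Beta(a, b), rescaled to counts per million.
post_mean_cpm <- a / (a + b) * 1e6

# Central 95% interval from the beta quantile function, also in cpm.
ci_cpm <- qbeta(c(0.025, 0.975), a, b) * 1e6

print(round(post_mean_cpm, 2))
print(round(ci_cpm, 2))
```

The interval width reflects how much data stands behind the estimate: larger shape parameters give a narrower density, as the three plots in this section illustrate.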
Splicing: splice forms A and B
Controls: A:B = 48:186
# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")
[1] 0.205
New Condition: observe 24 A, 47 B
# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")
lines(density(rbeta(1e5, 24, 47)))
lines(density(rbeta(1e5, 24, 47)-rbeta(1e5, 48, 186)), lty=2, col='red')
“Null” hypothesis test: rejection!
nsamples <- 1e6
treatment <- rbeta(nsamples, 24, 47)
control <- rbeta(nsamples, 48, 186)
p_value <- sum(treatment - control <= 0) / nsamples
print(p_value)
[1] 0.012607
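Alongside the tail probability, the same samples give an interval for the size of the shift in splice form A. This is a sketch under the same beta assumptions as the code above; the seed, the smaller sample count, and the 95% level are choices made here for illustration.

```r
set.seed(42)
nsamples <- 1e5

# Beta draws for the proportion of form A in each condition.
treatment <- rbeta(nsamples, 24, 47)    # new condition: 24 A, 47 B
control   <- rbeta(nsamples, 48, 186)   # controls: 48 A, 186 B

# Distribution of the difference in proportions.
delta <- treatment - control

# Central 95% interval and the fraction of draws at or below zero.
ci_delta <- quantile(delta, c(0.025, 0.975))
p_le_zero <- mean(delta <= 0)

print(round(ci_delta, 3))
print(p_le_zero)
```

Reporting the interval together with the probability shows both whether the proportion changed and by roughly how much.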
competitive and self-contained methods
Over Representation Analysis (ORA) – “competitive”
Gene Set Enrichment Analysis (GSEA) – “competitive”
“self-contained” methods test whether there are any active features in the set of interest
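A minimal sketch of an over-representation test, the simplest competitive method, using the hypergeometric tail and the equivalent one-sided Fisher's exact test. All counts below are hypothetical and chosen only for illustration.

```r
# Hypothetical counts.
background_size <- 20000   # genes tested in the experiment
set_size        <- 200     # background genes annotated to the gene set
hits            <- 500     # genes called differentially expressed
overlap         <- 15      # differentially expressed genes in the gene set

# P[X >= overlap] under random sampling without replacement.
p_ora <- phyper(overlap - 1, set_size, background_size - set_size, hits,
                lower.tail = FALSE)

# The same test as a 2 x 2 table: rows = in set / not in set,
# columns = hit / not hit.
contingency <- matrix(
  c(overlap,            hits - overlap,
    set_size - overlap, background_size - set_size - hits + overlap),
  nrow = 2
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

print(c(phyper = p_ora, fisher = p_fisher))
```

With these numbers about 5 overlapping genes would be expected by chance, so an overlap of 15 gives a small p-value.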
shamelessly stolen from: Hector Corrada Bravo
“In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject.”
(from Wikipedia)
Molecular Function
the tasks performed by individual gene products (e.g. “adenylate cyclase activity”)
Cellular Component
subcellular structures, locations, and macromolecular complexes (e.g. “ribosome”)
Biological Process
broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions (e.g. “DNA repair”)
Reactome is an expert-authored, peer-reviewed knowledgebase of reactions and pathways. (now version 79)
Navigating Reactome
Molecular Signatures Database (MSigDB)
Common issues
lack of methodological detail and errors in statistical analysis were widespread, which undermines … reliability and reproducibility
from Wijesooriya et al., 2022, “Urgent need for consistent standards in functional enrichment analysis”
Define the gene set and version!
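One lightweight way to follow this advice is to build a small record of the gene set source and software versions and save it next to the results. The sketch below is only an illustration: the field names are placeholders, and the Reactome version is the one quoted earlier in these slides.

```r
# Record which gene set collection and software versions a run used.
# Field names are placeholders; adapt them to the analysis at hand.
analysis_record <- list(
  gene_set_source  = "Reactome",
  gene_set_version = "79",
  r_version        = R.version.string,
  analysis_date    = format(Sys.Date())
)

# One "key: value" line per field, ready to be written alongside the results.
record_lines <- paste0(names(analysis_record), ": ", unlist(analysis_record))
print(record_lines)
```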
Perform FDR correction
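In base R, false discovery rate correction is a single call to p.adjust. The sketch below uses the Benjamini-Hochberg method on a hypothetical vector of raw p-values, one per tested gene set.

```r
# Hypothetical raw p-values, one per gene set tested.
p_raw <- c(0.0001, 0.004, 0.019, 0.03, 0.041, 0.2, 0.5, 0.8)

# Benjamini-Hochberg adjusted p-values.
p_bh <- p.adjust(p_raw, method = "BH")

print(data.frame(p_raw, p_bh = signif(p_bh, 3)))

# Gene sets passing a 0.05 threshold before and after correction.
n_raw_sig <- sum(p_raw < 0.05)
n_bh_sig  <- sum(p_bh < 0.05)
print(c(raw = n_raw_sig, adjusted = n_bh_sig))
```

Fewer sets pass the threshold after adjustment, which is the behaviour an uncorrected enrichment table hides.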
Specify your background list
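The background matters because it sets the expected overlap. In the sketch below the same observed overlap is tested against two backgrounds: the whole genome, and only the genes detected in the experiment. All counts and the helper name ora_p are hypothetical, and the example assumes every gene in the set is present in both backgrounds.

```r
# Hypothetical helper: hypergeometric over-representation p-value, P[X >= overlap].
ora_p <- function(overlap, set_in_bg, hits, bg_size) {
  phyper(overlap - 1, set_in_bg, bg_size - set_in_bg, hits, lower.tail = FALSE)
}

overlap <- 15    # hits that fall in the gene set
set_size <- 200  # gene set members present in the background
hits <- 500      # genes called significant

# Background 1: all annotated genes (hypothetical size).
p_whole_genome <- ora_p(overlap, set_size, hits, 20000)

# Background 2: only genes detected in this experiment (hypothetical size).
p_detected <- ora_p(overlap, set_size, hits, 8000)

print(signif(c(whole_genome = p_whole_genome, detected = p_detected), 3))
```

With the smaller background the expected overlap rises from 5 to 12.5 genes, and the apparent enrichment largely disappears.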
Make your code available to the community
Figure 7 from Smillie et al., 2019
“Intra- and Inter-cellular Rewiring of the Human Colon during Ulcerative Colitis”