Lecture 6 Computational Biology Experiments

Dennis Hazelett
3/31/2022

Efficient Design Principles for Analysis and Interpretation

This work, “Analysis and Design of Computational Biology Experiments”, is licensed under CC BY 4.0 by Dennis Hazelett.

Bioinformatics, Computational, & Systems Biology

Bioinformatics: standardized skills and tools for analysis of (typically) large high-throughput experiments

When you think about it, caring for patients is 99 percent information and 1 percent intervention, so it's clear that with or without genomics, the paradigm is shifting. Bioinformatics brings a cutting edge capacity to healthcare.

–Christopher G. Chute MD, PhD (Johns Hopkins)

Bioinformatics, Computational, & Systems Biology

Computational Biology: The application of mathematical models involving (formerly) prohibitive computational infrastructure as a general approach to drawing integrated inferences about biological questions

source: Randall Munroe's XKCD comic strip

Bioinformatics, Computational, & Systems Biology

Systems biology is an approach in biomedical research to understanding the larger picture—be it at the level of the organism, tissue, or cell—by putting its pieces together. It’s in stark contrast to decades of reductionist biology, which involves taking the pieces apart.

–Christopher Wanjek (NIH website)

Bioinformatics, Computational, & Systems Biology

“The next modern synthesis in biology will be driven by the absorption of mathematical, statistical, and computational methods into mainstream biological training.”

source: Markowetz 2017

Overview

Experimental Design
Management of Big Data
Biological Enrichment
Use (and Misuse) of Ontologies and Their Significance

Experimental Design is the Key to Success!

choice of cell lines or living models

Experimental Design is the Key to Success!

choice of cell lines or living models
choice of control conditions, genotypes, vehicle etc.

Experimental Design is the Key to Success!

choice of cell lines or living models
choice of control conditions, genotypes, vehicle etc.
care not to conflate variables

Experimental Design is the Key

choice of cell lines or living models
choice of control conditions, genotypes, vehicle etc.
care not to conflate variables
statistical power vs. design
- “no free lunch”
- emphasize quality over quantity

ENCODE CRISPR knockout of TFs in K562 cells

ENCODE CRISPRi knockdown of TFs in K562 cells

Experimental Design is the Key

choice of cell lines or living models
choice of control conditions, genotypes, vehicle etc.
care not to conflate variables
statistical power vs. design
- “no free lunch”
- emphasize quality over quantity
biological replication vs technical
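
As a rough illustration of the power trade-off, base R's `power.t.test` can estimate how many biological replicates per group a simple two-group comparison would need. The effect size (one standard deviation) and the 80% power target are assumptions chosen for illustration, not values from any dataset in this lecture.

```r
# Sketch: replicates per group needed to detect a 1-SD difference
# with a two-sample t-test (delta, sd and power are illustrative assumptions)
design <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)

# Round up: you cannot run a fraction of a replicate
n_per_group <- ceiling(design$n)
print(n_per_group)
```

Halving the detectable effect roughly quadruples the required replicates, which is one concrete form of “no free lunch”.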

Experimental Design is the Key

choice of cell lines or living models
choice of control conditions, genotypes, vehicle etc.
care not to conflate variables
statistical power vs. design
- “no free lunch”
- emphasize quality over quantity
biological replication vs technical
sequencing depth (complexity)

Library Complexity in experimental design

From: Predicting the molecular complexity of sequencing libraries, Daley & Smith; Nature Methods 2013 PMID: 23435259

Library Complexity in experimental design

why sequencing depth matters
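
One simplified way to see why depth interacts with complexity: if a library holds C distinct molecules and reads sample them uniformly with replacement, the expected number of distinct molecules seen after N reads is approximately C(1 - exp(-N/C)). This uniform-sampling model is an assumption made for illustration; it is not the estimator of Daley & Smith, and the library size below is hypothetical.

```r
# Expected number of distinct molecules observed after n_reads,
# assuming uniform sampling with replacement (a simplifying assumption)
expected_distinct <- function(n_reads, library_size) {
  library_size * (1 - exp(-n_reads / library_size))
}

library_size <- 1e7                      # hypothetical library complexity
depths <- c(1e6, 5e6, 1e7, 5e7, 1e8)     # sequencing depths to compare

saturation <- data.frame(
  reads = depths,
  distinct = round(expected_distinct(depths, library_size)),
  duplicate_fraction = round(1 - expected_distinct(depths, library_size) / depths, 3)
)
print(saturation)
```

Under this model, sequencing far beyond the library size mostly buys duplicates rather than new information.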

Management of Big Data Projects

from “The Cancer Genome Atlas Pan-Cancer Analysis Project”, Nature Genetics 2013

Management of Big Data Projects

Don't reinvent the wheel!

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022 and “The Far Side” by Gary Larson

Management of Big Data Projects

Document EVERYTHING

BIT.AI Blog

Management of Big Data Projects

Document EVERYTHING

use GitHub! (Issues)

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022

Management of Big Data Projects

Document EVERYTHING

use GitHub!
comment your code extensively

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022

Management of Big Data Projects

Document EVERYTHING

use GitHub!
comment your code extensively
log decisions (-> README.md)

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022

Management of Big Data Projects

Automate your workflows

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022 and the Norman Rockwell Museum

Management of Big Data Projects

Continuously measure Performance

use profiling
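
A minimal sketch of timing two implementations of the same computation with base R's `system.time`. The toy workload (row sums of a simulated count matrix) and the helper name `loop_sums` are assumptions chosen only to make the comparison concrete; `Rprof()` gives finer-grained profiles for real pipelines.

```r
set.seed(1)
counts <- matrix(rpois(1e6, lambda = 5), nrow = 1e4)   # toy count matrix

# Hypothetical slow version: explicit loop over rows
loop_sums <- function(m) {
  out <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) out[i] <- sum(m[i, ])
  out
}

t_loop <- system.time(s1 <- loop_sums(counts))
t_vec  <- system.time(s2 <- rowSums(counts))

# Elapsed wall-clock seconds for each version
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))

# A faster version is only useful if it gives the same answer
stopifnot(isTRUE(all.equal(s1, s2)))
```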

Management of Big Data Projects

Monitor Execution

Sanity Checks!!
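
Sanity checks can be as simple as a few assertions run before every analysis step. The sketch below uses base R's `stopifnot`; the toy inputs and the helper name `check_inputs` are hypothetical, and the named conditions (custom error messages) assume R 4.0 or later.

```r
# Hypothetical toy inputs standing in for real pipeline outputs
counts <- matrix(c(10, 0, 5, 3, 8, 2), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("ctrl_1", "trt_1")))
samples <- data.frame(id = c("ctrl_1", "trt_1"),
                      condition = c("control", "treatment"))

# Fail loudly if the count matrix and sample sheet disagree
check_inputs <- function(counts, samples) {
  stopifnot(
    "sample IDs must match count columns" =
      identical(colnames(counts), as.character(samples$id)),
    "counts must be non-negative" = all(counts >= 0),
    "no missing values allowed" = !anyNA(counts),
    "gene names must be unique" = !anyDuplicated(rownames(counts))
  )
  invisible(TRUE)
}

check_inputs(counts, samples)
```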

Biological Enrichment

…is a ubiquitous concept of biology, molecular biology, genomics & especially bioinformatics.

Why enrichment?

Enrichment is evidence for organized activity

What is enrichment?

more things than expected due to random chance

What is enrichment?

more things than expected due to random chance

what do you expect?

Calculating enrichment

finite number of marbles

Calculating enrichment

finite number of marbles
known number of blacks & whites

Calculating enrichment

finite number of marbles
known number of blacks & whites
therefore probabilities are known

Calculating enrichment

finite number of marbles
known number of blacks & whites
therefore probabilities are known p(white | m), p(black | n)

Calculating enrichment

If we select a single marble, the probabilities change

m = 15, n = 45

Calculating enrichment

If we select a single marble, the probabilities change

m = 15, n = 45

draw1: p_m = 15 / (15 + 45)

[1] 0.25

Calculating enrichment

If we select a single marble, the probabilities change

m = 15, n = 45

draw1: p_m = 15 / (15 + 45)

[1] 0.25

draw2 (after a white marble on draw1): p_m = 14 / (14 + 45)

[1] 0.237
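
The two draws can be chained: the chance that both of the first two marbles are white is the product of the per-draw probabilities, and it should agree with `dhyper`. A small check using the same m = 15, n = 45:

```r
m <- 15   # white marbles
n <- 45   # black marbles

p_draw1 <- m / (m + n)
p_draw2 <- (m - 1) / (m - 1 + n)   # given that the first marble was white

# Probability that both of the first two marbles are white
p_both <- p_draw1 * p_draw2

# Same quantity from the hypergeometric PMF: 2 whites in 2 draws
print(c(sequential = p_both, dhyper = dhyper(2, m, n, 2)))
```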

Calculating enrichment

If we select multiple marbles, the probabilities are described by

Hypergeometric distribution

Calculating enrichment

If we select multiple marbles, the probabilities are described by

Hypergeometric distribution

Hypergeometric is related to the binomial distribution, but with:

finite population
sampling without replacement
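
To make the relationship concrete, the sketch below compares `dhyper` with `dbinom` for the same proportion of white marbles (20%). The population sizes are illustrative assumptions; as the population grows relative to the number of draws, removing a marble barely changes the odds, and the hypergeometric probability approaches the binomial one.

```r
q <- 3    # white marbles observed
k <- 10   # marbles drawn

p_binom <- dbinom(q, size = k, prob = 0.2)           # with replacement
p_small <- dhyper(q, m = 20, n = 80, k = k)          # 100 marbles, no replacement
p_large <- dhyper(q, m = 20000, n = 80000, k = k)    # 100,000 marbles, no replacement

print(c(binomial = p_binom, hyper_small = p_small, hyper_large = p_large))
```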

Calculating enrichment in R

Function phyper

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

q: vector of quantiles representing the number of white marbles drawn without replacement from a bag which contains both black and white marbles
m: the number of white marbles in the bag
n: the number of black marbles in the bag
k: the number of marbles drawn
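
In the notation of `phyper`, the probability of drawing exactly \( x \) white marbles in \( k \) draws, and the cumulative probability returned with the default `lower.tail = TRUE`, can be written as follows (this is the standard hypergeometric form, stated here for reference):

\[ P(X = x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}, \qquad P(X \le q) = \sum_{x=0}^{q} P(X = x) \]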

Hypergeometric distribution

(PMF = “Probability Mass Function”)

plot of chunk pmf

Hypergeometric distribution

(PMF = “Probability Mass Function”)

library("dplyr")
library("ggplot2")
library("foreach")
library("RColorBrewer")
## PMF of the number of white marbles (x) for several numbers of draws (k)
x = 0:15
k = 0:60
pmfprobs <- foreach(i = x, .combine = 'rbind') %do% data.frame(x=rep(i, length(k)), k, p = dhyper(i, 15, 45, k))
ggplot(pmfprobs[pmfprobs$k %in% c(1, 10, 20, 30, 40, 50, 59),]) + 
  geom_point(aes(x = x, y = p, colour = factor(k))) +
  geom_line(aes(x = x, y = p, colour = factor(k))) +
  scale_color_brewer(palette="Dark2", name = "number of trials") +
  ylab("probability") +
  xlab("successes") +
  theme_minimal() +
  theme(text = element_text(size=24)) +
  ggtitle("Probability Mass Function (m=15, n=45)")

Hypergeometric distribution: code

#                       \/       \/
#                       \/       \/
#                       \/       \/
phyper(q, m, n, k, lower.tail = FALSE, log.p = FALSE)

Hypergeometric distribution:

k = 28 draws (m=15, n=45)

plot of chunk hyper-graph

Hypergeometric distribution:

k = 28 draws (m=15, n=45)

plot of chunk hyper-graph-tail

Hypergeo example

100 marbles
20 are white
Question: draw 10 (k), obtain 3 (q); how likely is \( \geq 3 \)?

Hypergeo example

100 marbles
20 are white
Question: draw 10 (k), obtain 3 (q); how likely is \( \geq 3 \)?

phyper(2, 20, 80, 10, lower.tail = FALSE, log.p = FALSE)

[1] 0.3187799

lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x]
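
Because `lower.tail = FALSE` returns P[X > q], asking for “3 or more” means passing q = 2. The sketch below checks that three ways of writing the same tail probability agree for the 100-marble example:

```r
# P[X >= 3] for 20 white, 80 black, 10 draws

upper <- phyper(2, 20, 80, 10, lower.tail = FALSE)   # P[X > 2]
complement <- 1 - phyper(2, 20, 80, 10)              # 1 - P[X <= 2]
summed <- sum(dhyper(3:10, 20, 80, 10))              # add up the PMF over 3..10

print(c(upper = upper, complement = complement, summed = summed))
```

Passing q = 3 instead would silently compute P[X > 3], a common off-by-one mistake.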

Hypergeo example

100 marbles
20 are white
Question: draw 10 (k), obtain 3 (q); how likely is \( \geq 3 \)?

plot of chunk example-one-run-graph

Hypergeo example

100 marbles
20 are white
draw 10 (k), obtain 3 (q); how likely is exactly 3?

q = 0:10
probability = dhyper(x=q, m=20, n=80, k=10)
plot(q, probability, xlab = "number of successes", ylab = "probability", pch=16)
lines(q, probability)
abline(h = 0.05, lty = 2, col = 'red')
abline(v = 3, col = 'blue')

Hypergeo example

100 marbles
20 are white
draw 10 (k), obtain 3 (q); how likely is exactly 3?

plot of chunk example-one-run-graph-density-graph

[1] 0.209

Hypergeometric distribution 1-tailed

Fisher's exact test
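
The one-tailed hypergeometric probability is what a one-sided Fisher's exact test reports. As a sketch, the marble example (100 marbles, 20 white, 10 drawn, 3 white observed) can be laid out as a 2x2 table and passed to base R's `fisher.test`; the table layout and object names are choices made here for illustration.

```r
# Rows: drawn / not drawn; columns: white / black
marbles <- matrix(c(3, 17, 7, 73), nrow = 2,
                  dimnames = list(c("drawn", "not_drawn"),
                                  c("white", "black")))

# One-sided test: are white marbles over-represented among the drawn?
fisher_p <- fisher.test(marbles, alternative = "greater")$p.value

# Same tail probability straight from the hypergeometric distribution
hyper_p <- phyper(2, 20, 80, 10, lower.tail = FALSE)

print(c(fisher = fisher_p, phyper = hyper_p))
```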

What about general enrichment problems?

large populations (N >> k), \( p \) very small
background available

What about general enrichment problems?

use math to estimate uncertainty
aside: if probabilities known: use \( \chi ^2 \) test!

What about general enrichment problems?

use math to estimate uncertainty
aside: if probabilities known: use \( \chi ^2 \) test!
true probability not known: Bayes to the rescue
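
For the aside on known probabilities, a goodness-of-fit test with base R's `chisq.test` might look like the sketch below. The observed counts are hypothetical, and the known probability (20% white) is an assumption chosen for illustration.

```r
observed <- c(white = 30, black = 70)     # hypothetical counts from 100 draws
known_p  <- c(white = 0.2, black = 0.8)   # probabilities assumed known in advance

# Goodness-of-fit: do the observed counts match the known probabilities?
fit <- chisq.test(observed, p = known_p)

print(fit$statistic)
print(fit$p.value)
```

The statistic is the familiar sum of (observed - expected)^2 / expected over categories, here with expected counts of 20 and 80.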

What about general enrichment problems?

suppose we have 2 sets of observations:

one is control condition
one is treatment condition
each observation is a “draw” as in hypergeo, but now

What about general enrichment problems?

suppose we have 2 sets of observations:

one is control condition
one is treatment condition
each observation is a “draw” as in hypergeo, but now

sample with replacement

population unknown in both cases

General Enrichment Calculation

What is the probability of finding a read in a given gene (random draw), given the data?

set.seed(4)
credible_expression <- rbeta(20, 4, 6)
plot(density(credible_expression), xlim=c(0,1))
abline(v=0.5, lty=3, col="red")

plot of chunk coin-flip

[1] "mean 5.8e-06"

General Enrichment Calculation

What is the probability of finding a read in a given gene (random draw), given the data?

credible_expression <- rbeta(10000, 4, 6)
plot(density(credible_expression), xlim=c(0,1))
abline(v=0.5, lty=3, col="red")

plot of chunk coin-flip2

[1] "mean 0.4"

General Enrichment Calculation

What is the probability of finding a read in a given gene (random draw), given the data?

credible_expression <- rbeta(1e5, 234, 4e7)
cpm <- credible_expression * 1e6
plot(density(cpm), xlim=c(0,20))

plot of chunk beta-rna-seq-cpm

[1] "5.8 cpm"
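
The same summary can be read off the beta distribution directly, without sampling. The sketch below assumes the two shape parameters are interpreted as reads in the gene (234) and reads elsewhere (4e7); that reading, and the variable names, are assumptions added here.

```r
reads_in_gene   <- 234
reads_elsewhere <- 4e7

# Mean of Beta(a, b) is a / (a + b); scale to counts per million
mean_cpm <- reads_in_gene / (reads_in_gene + reads_elsewhere) * 1e6

# 95% credible interval from the beta quantile function
interval_cpm <- qbeta(c(0.025, 0.975), reads_in_gene, reads_elsewhere) * 1e6

print(round(mean_cpm, 2))
print(round(interval_cpm, 2))
```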

General Enrichment Calculation

Splicing: splice forms A and B

General Enrichment Calculation

Splicing: splice forms A and B

Controls: A:B = 48:186

# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")

plot of chunk beta-splicing

[1] 0.205

General Enrichment Calculation

New Condition: observe 24 A, 47 B

# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")
# probability of observing form A in the new condition
lines(density(rbeta(1e5, 24, 47)))

plot of chunk beta-splicing-observe

General Enrichment Calculation

New Condition: observe 24 A, 47 B

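# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")
# probability of observing form A in the new condition
lines(density(rbeta(1e5, 24, 47)))
# difference between the two conditions (new condition minus control)
lines(density(rbeta(1e5, 24, 47)-rbeta(1e5, 48, 186)), lty=2, col='red')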

plot of chunk beta-splicing-diff

General Enrichment Calculation

“Null” hypothesis test: rejection!

nsamples <- 1e6
treatment <- rbeta(nsamples, 24, 47)
control <- rbeta(nsamples, 48, 186)
p_value <- sum(treatment - control <= 0) / nsamples
print(p_value)

[1] 0.012607
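
Since the sampled value changes slightly from run to run, a deterministic cross-check can be reassuring. One option, sketched here under the assumption that the two beta distributions are independent, is to integrate the control density against the treatment CDF with base R's `integrate`:

```r
# P(treatment <= control) = integral over x of f_control(x) * F_treatment(x)
integrand <- function(x) dbeta(x, 48, 186) * pbeta(x, 24, 47)

exact <- integrate(integrand, lower = 0, upper = 1)
print(exact$value)
```

The result should land close to the Monte Carlo estimate, with differences shrinking as `nsamples` grows.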

General Enrichment Calculation: Applications

splicing

General Enrichment Calculation: Applications

splicing
enrichment of SNPs in epigenomics data

General Enrichment Calculation: Applications

splicing
enrichment of SNPs in epigenomics data
allele specific expression (ASE)

General Enrichment Calculation: Applications

Any problem involving count data where the underlying probability is not known but a suitable “background” condition is available for comparison
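
The calculation generalizes to any pair of count tables. A minimal sketch of a reusable helper follows; the function name `beta_enrichment` and its arguments are hypothetical, and it is demonstrated on the splicing counts from earlier rather than on new data.

```r
# Compare the rate of "hits" in a treatment vs. a background condition
# by sampling from two beta distributions (hypothetical helper)
beta_enrichment <- function(treat_hits, treat_misses,
                            ctrl_hits, ctrl_misses, nsamples = 1e5) {
  treatment <- rbeta(nsamples, treat_hits, treat_misses)
  control   <- rbeta(nsamples, ctrl_hits, ctrl_misses)
  delta <- treatment - control
  list(
    p_not_enriched = mean(delta <= 0),                  # share of samples with no increase
    interval_95 = quantile(delta, c(0.025, 0.975))      # credible interval for the difference
  )
}

set.seed(1)
# Splicing example: new condition 24 A / 47 B vs. controls 48 A / 186 B
result <- beta_enrichment(24, 47, 48, 186)
print(result)
```

Reporting the interval alongside the tail probability shows how large the difference plausibly is, not just whether it is nonzero.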

Gene Set Enrichment

competitive and self-contained methods

Gene Set Enrichment

competitive and self-contained methods
- competitive H0: “the genes in my feature set are no more active than the background”

Gene Set Enrichment

competitive and self-contained methods
- competitive H0: “the genes in my feature set are no more active than the background”
- self-contained: “genes/annotations of my feature set are not active in this list”

Gene Set Enrichment

competitive and self-contained methods
- competitive H0: “the genes in my feature set are no more active than the background”
- self-contained: “genes/annotations of my feature set are not active in this list”
Over Representation Analysis (ORA) – “competitive”
- DAVID, clusterProfiler, LEGO

Gene Set Enrichment

competitive and self-contained methods
- competitive H0: “the genes in my feature set are no more active than the background”
- self-contained: “genes/annotations of my feature set are not active in this list”
Over Representation Analysis (ORA) – “competitive”
- DAVID, clusterProfiler, LEGO
- any of the count based analysis methods we've reviewed

Gene Set Enrichment

competitive and self-contained methods
- competitive H0: “the genes in my list are no more active than the background”
- self-contained: “genes/annotations of my feature set are not active in this list”
Over Representation Analysis (ORA) – “competitive”
- DAVID, clusterProfiler, LEGO
- any of the count based analysis methods we've reviewed
Gene Set Enrichment Analysis (GSEA) – “competitive”

Gene Set Enrichment

competitive and self-contained methods
- competitive H0: “the genes in my list are no more active than the background”
- self-contained: “genes/annotations of my feature set are not active in this list”
Over Representation Analysis (ORA) – “competitive”
- any of the count based analysis methods we've reviewed
- even t-tests have been used (e.g. “DAVID”)
Gene Set Enrichment Analysis (GSEA) – “competitive”
“self-contained” methods test whether there are any active features in the set of interest
- global test, GlobalANCOVA, FORGE
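
A count-based ORA is the marble problem with genes: the gene set plays the role of the white marbles and the gene list is the draw. The sketch below uses `phyper` with entirely hypothetical counts; real tools differ in how they define the background and correct for multiple testing.

```r
universe  <- 20000   # background genes (hypothetical)
set_size  <- 200     # genes annotated to the set ("white marbles")
list_size <- 500     # genes in my list ("draws")
overlap   <- 15      # list genes that belong to the set

# Overlap expected by chance alone
expected <- list_size * set_size / universe

# P[X >= overlap]: note the "- 1" because lower.tail = FALSE gives P[X > q]
p_ora <- phyper(overlap - 1, set_size, universe - set_size, list_size,
                lower.tail = FALSE)

print(c(expected = expected, observed = overlap, p = p_ora))
```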

Enrichment in ranked lists

Online methods

How GSEA Works

shamelessly stolen from: Hector Corrada Bravo
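
The core of GSEA is a running sum over a ranked gene list: step up at each gene in the set, step down otherwise, and take the largest deviation from zero as the enrichment score. The sketch below is a simplified, unweighted version (every hit counts equally, whereas GSEA proper can weight hits by the ranking metric); the function name and toy genes are hypothetical, and the permutation step used to assess significance is not shown.

```r
# Unweighted running-sum enrichment score (simplified sketch)
enrichment_score <- function(ranked_genes, gene_set) {
  hits <- ranked_genes %in% gene_set
  n_hits <- sum(hits)
  n_miss <- length(ranked_genes) - n_hits

  # Step up on hits and down on misses so the walk ends at zero
  steps <- ifelse(hits, 1 / n_hits, -1 / n_miss)
  running <- cumsum(steps)

  # Enrichment score: the running-sum value farthest from zero
  running[which.max(abs(running))]
}

ranked <- paste0("g", 1:10)   # toy list, most to least differentially expressed

es_top    <- enrichment_score(ranked, c("g1", "g2", "g3"))    # set at the top
es_bottom <- enrichment_score(ranked, c("g8", "g9", "g10"))   # set at the bottom
print(c(top = es_top, bottom = es_bottom))
```

A set concentrated at the top of the ranking scores near +1, one concentrated at the bottom scores near -1, and a set scattered evenly stays close to zero.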

Ontologies, their uses and misuses

“In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject.”

(from Wikipedia)

Ontologies

GeneOntology.org

Gene Ontology is a curated graph of terms

Molecular Function: the tasks performed by individual gene products (e.g. “adenylate cyclase activity”)
Cellular Component: subcellular structures, locations, and macromolecular complexes (e.g. “ribosome”)
Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions (e.g. “DNA repair”)

Gene Ontology is a curated graph of terms

Molecular Function: the tasks performed by individual gene products (e.g. “adenylate cyclase activity”)
Cellular Component: subcellular structures, locations, and macromolecular complexes (e.g. “ribosome”)
Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions (e.g. “DNA repair”)
- each gene annotated to a node on all three GOs

Other Useful Ontologies

Kyoto Encyclopedia of Genes and Genomes (KEGG)
Reactome
MSigDB
SynGO
Panther
WikiPathways

Other Useful Ontologies

Reactome is an expert-authored, peer-reviewed knowledgebase of reactions and pathways. (now version 79)

Manually curated human pathways with experimental evidence (regarded highest quality)
Computationally inferred pathways for other organisms (e.g. Gallus gallus, Mus musculus)