Lecture 6 Computational Biology Experiments

Dennis Hazelett
3/31/2022

Efficient Design Principles for Analysis and Interpretation

This work, “Analysis and Design of Computational Biology Experiments”, is licensed under CC BY 4.0 by Dennis Hazelett.

Bioinformatics, Computational, & Systems Biology

Bioinformatics: standardized skills and tools for analysis of (typically) large high-throughput experiments

When you think about it, caring for patients is 99 percent information and 1 percent intervention, so it's clear that with or without genomics, the paradigm is shifting. Bioinformatics brings a cutting edge capacity to healthcare.

–Christopher G. Chute MD, PhD (Johns Hopkins)

Bioinformatics, Computational, & Systems Biology

Computational Biology: The application of mathematical models involving (formerly) prohibitive computational infrastructure as a general approach to drawing integrated inferences about biological questions

source: Randall Munroe's XKCD comic strip (link)

Bioinformatics, Computational, & Systems Biology

Systems biology is an approach in biomedical research to understanding the larger picture—be it at the level of the organism, tissue, or cell—by putting its pieces together. It’s in stark contrast to decades of reductionist biology, which involves taking the pieces apart.

–Christopher Wanjek (NIH website)

Bioinformatics, Computational, & Systems Biology

“The next modern synthesis in biology will be driven by the absorption of mathematical, statistical, and computational methods into mainstream biological training.”

source: Markowetz 2017

Overview

  • Experimental Design
  • Management of Big Data
  • Biological Enrichment
  • Use (and Misuse) of Ontologies and Their Significance

Experimental Design is the Key to Success!

  • choice of cell lines or living models

Experimental Design is the Key

  • choice of cell lines or living models
  • choice of control conditions, genotypes, vehicle etc.
  • care not to conflate variables
  • statistical power vs. design
    • “no free lunch”
    • emphasize quality over quantity

ENCODE CRISPR knockout of TFs in K562 cells

ENCODE CRISPRi knockdown of TFs in K562 cells

Experimental Design is the Key

  • choice of cell lines or living models
  • choice of control conditions, genotypes, vehicle etc.
  • care not to conflate variables
  • statistical power vs. design
    • “no free lunch”
    • emphasize quality over quantity
  • biological replication vs technical
  • sequencing depth (complexity)
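
The “no free lunch” trade-off between replicates and power can be made concrete with a power calculation. The sketch below uses base R's `power.t.test`; the effect size (`delta = 1`) and standard deviation (`sd = 1`) are illustrative assumptions, not values from any experiment discussed here.

```r
# Replicates per group needed to detect a difference of one standard
# deviation with a two-sample t-test (assumed values, illustration only)
design <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(design$n)
print(n_per_group)

# Power achieved with only 3 replicates per group, same assumptions
low_n <- power.t.test(n = 3, delta = 1, sd = 1, sig.level = 0.05)
print(round(low_n$power, 2))
```

A real RNA-seq design would need a count-based power model; the t-test here only illustrates how quickly power drops when replicates are scarce.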

Library Complexity in experimental design

From: Predicting the molecular complexity of sequencing libraries, Daley & Smith; Nature Methods 2013 PMID: 23435259

Library Complexity in experimental design

why sequencing depth matters

Management of Big Data Projects

from “The Cancer Genome Atlas Pan-Cancer Analysis Project”, Nature Genetics 2013

Management of Big Data Projects

Don't reinvent the wheel!

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022 and “The Far Side” by Gary Larson

Management of Big Data Projects

Document EVERYTHING

BIT.AI Blog

Management of Big Data Projects

Document EVERYTHING

  • use github! (ISSUES)

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022

Management of Big Data Projects

Document EVERYTHING

  • use github!
  • comment your code extensively
  • log decisions (in README.md)

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022

Management of Big Data Projects

Automate your workflows

from: “Ten Simple Rules for Large Scale Data Processing” Fungtammasan et al., 2022 and the Norman Rockwell Museum

Management of Big Data Projects

Continuously measure Performance

  • use profiling
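
As a small illustration of measuring performance, the sketch below times two ways of computing row sums with base R's `system.time`. The matrix size is arbitrary and the data are simulated; for a line-by-line breakdown of a longer script, base R's `Rprof()` is one option.

```r
# Two ways to compute row sums of a simulated count matrix
counts <- matrix(rpois(2e5, lambda = 5), nrow = 2000)

# Explicit loop over rows
loop_time <- system.time({
  slow <- numeric(nrow(counts))
  for (i in seq_len(nrow(counts))) slow[i] <- sum(counts[i, ])
})

# Vectorized built-in
vec_time <- system.time(fast <- rowSums(counts))

print(loop_time["elapsed"])
print(vec_time["elapsed"])
```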

Management of Big Data Projects

Monitor Execution

  • Sanity Checks!!
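
A sanity check can be as simple as a few assertions run before any analysis step. The sketch below is one possible shape, assuming a count matrix with genes in rows and a sample sheet with `id` and `condition` columns; the function name `check_inputs` and all data are hypothetical. Named `stopifnot` messages need R 4.0 or later.

```r
# Toy count matrix: genes in rows, samples in columns (made-up data)
set.seed(1)
counts <- matrix(rpois(40, lambda = 20), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), paste0("s", 1:4)))
samples <- data.frame(id = paste0("s", 1:4),
                      condition = c("ctrl", "ctrl", "trt", "trt"))

# Stop with an informative message if any basic expectation is violated
check_inputs <- function(counts, samples) {
  stopifnot(
    "counts must be non-negative" = all(counts >= 0),
    "no missing values allowed" = !anyNA(counts),
    "sample sheet must match matrix columns" =
      identical(colnames(counts), as.character(samples$id)),
    "each condition needs replicates" = all(table(samples$condition) >= 2)
  )
  invisible(TRUE)
}

check_inputs(counts, samples)
```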

Biological Enrichment

…is a ubiquitous concept of biology, molecular biology, genomics & especially bioinformatics.

Why enrichment?

ENRICHMENT IS evidence for organized activity

What is enrichment?

more things than expected due to random chance

  • what do you expect?

Calculating enrichment

  • finite number of marbles
  • known number of blacks & whites
  • therefore probabilities are known: p(white) = m / (m + n), p(black) = n / (m + n)

Calculating enrichment

If we select a single marble, the probabilities change

m = 15, n = 45

draw1: p_m = 15 / (15 + 45)

[1] 0.25

draw2 (after one white marble has been removed): p_m = 14 / (14 + 45)

[1] 0.237
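
The two draws can be reproduced in a few lines of R. This is a minimal sketch that assumes, as in `phyper`, that m counts white marbles and n counts black marbles; the variable names are illustrative.

```r
m <- 15  # white marbles
n <- 45  # black marbles

# Probability that the first draw is white
p_draw1 <- m / (m + n)

# Probability that the second draw is white, given the first was white
p_draw2 <- (m - 1) / ((m - 1) + n)

# Probability that both are white: the product of the two steps,
# which matches the hypergeometric probability of 2 whites in 2 draws
p_both <- p_draw1 * p_draw2
print(c(p_draw1, p_draw2, p_both))
print(dhyper(2, m, n, 2))
```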

Calculating enrichment

If we select multiple marbles, the probabilities are described by

Hypergeometric distribution

The hypergeometric distribution is related to the binomial distribution, but assumes:

  • finite population
  • sampling without replacement
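
One way to see the relationship is to compare the two distributions as the bag grows. In the sketch below the bag sizes are arbitrary choices; with the same proportion of white marbles, drawing without replacement from a very large bag behaves almost like drawing with replacement.

```r
x <- 0:10

# Small bag: 15 white + 45 black, 10 draws without replacement
small_hyper <- dhyper(x, m = 15, n = 45, k = 10)

# Same proportions in a bag 1000 times larger
large_hyper <- dhyper(x, m = 15000, n = 45000, k = 10)

# Binomial: 10 draws with replacement, p(white) = 0.25
binom <- dbinom(x, size = 10, prob = 0.25)

# Largest absolute difference from the binomial in each case
print(max(abs(small_hyper - binom)))
print(max(abs(large_hyper - binom)))
```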

Calculating enrichment in R

Function phyper

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
  • q vector of quantiles representing the number of white marbles drawn without replacement from a bag which contains both black and white marbles.

  • m the number of white marbles in the bag

  • n the number of black marbles in the bag

  • k the number of marbles drawn

Hypergeometric distribution

(PMF = “Probability Mass Function”)

plot of chunk pmf

library("dplyr")
library("ggplot2")
library("foreach")
library("RColorBrewer")
## x: number of white marbles drawn (successes); k: number of draws (trials)
x = 0:15
k = 0:60
## one row per (x, k) pair with its hypergeometric probability (m = 15, n = 45)
pmfprobs <- foreach(i = x, .combine = 'rbind') %do% data.frame(x=rep(i, length(k)), k, p = dhyper(i, 15, 45, k))
## plot the PMF for a few selected numbers of draws
ggplot(pmfprobs[pmfprobs$k %in% c(1, 10, 20, 30, 40, 50, 59),]) + 
  geom_point(aes(x = x, y = p, colour = factor(k))) +
  geom_line(aes(x = x, y = p, colour = factor(k))) +
  scale_color_brewer(palette="Dark2", name = "number of trials") +
  ylab("probability") +
  xlab("successes") +
  theme_minimal() +
  theme(text = element_text(size=24)) +
  ggtitle("Probability Mass Function (m=15, n=45)")

Hypergeometric distribution: code

#                       \/       \/
#                       \/       \/
#                       \/       \/
phyper(q, m, n, k, lower.tail = FALSE, log.p = FALSE)

Hypergeometric distribution:

k = 28 draws (m=15, n=45)

plot of chunk hyper-graph

Hypergeometric distribution:

k = 28 draws (m=15, n=45)

plot of chunk hyper-graph-tail

Hypergeo example

  • 100 marbles
  • 20 are white
  • Question: draw 10 (k), obtain 3 (q); how likely is \( \geq 3 \)?
phyper(2, 20, 80, 10, lower.tail = FALSE, log.p = FALSE)
[1] 0.3187799

lower.tail logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x]
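
Because `lower.tail = FALSE` gives P[X > q] rather than P[X ≥ q], the call above passes q = 2 to ask about 3 or more. The sketch below checks this by writing the same probability three ways; the variable names are illustrative.

```r
# P[X >= 3] written three equivalent ways (100 marbles, 20 white, 10 drawn)
upper_tail <- phyper(2, 20, 80, 10, lower.tail = FALSE)  # P[X > 2]
complement <- 1 - phyper(2, 20, 80, 10)                   # 1 - P[X <= 2]
summed     <- sum(dhyper(3:10, 20, 80, 10))               # P[X = 3] + ... + P[X = 10]
print(c(upper_tail, complement, summed))

# Passing q = 3 instead would give P[X > 3], which leaves out the observed value
too_small <- phyper(3, 20, 80, 10, lower.tail = FALSE)
print(too_small)
```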

Hypergeo example

  • 100 marbles
  • 20 are white
  • Question: draw 10 (k), obtain 3 (q); how likely is \( \geq 3 \)?

plot of chunk example-one-run-graph

Hypergeo example

  • 100 marbles
  • 20 are white
  • draw 10 (k), obtain 3 (q); how likely is exactly 3?
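# probability of drawing exactly q white marbles, for q = 0..10
q = 0:10
probability = dhyper(x=q, m=20, n=80, k=10)
plot(q, probability, xlab = "number of successes", ylab = "probability", pch=16)
lines(q, probability)
# reference lines: p = 0.05 (red, dashed) and the observed q = 3 (blue)
abline(h = 0.05, lty = 2, col = 'red')
abline(v = 3, col = 'blue')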

Hypergeo example

  • 100 marbles
  • 20 are white
  • draw 10 (k), obtain 3 (q); how likely is exactly 3?

plot of chunk example-one-run-graph-density-graph

[1] 0.209

Hypergeometric distribution 1-tailed

What about general enrichment problems?

  • population much larger than k, \( p \) very small
  • background available

What about general enrichment problems?

  • use math to estimate uncertainty
  • aside: if probabilities known: use \( \chi ^2 \) test!
  • true probability not known: Bayes to the rescue
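
For the aside about known probabilities, base R's `chisq.test` accepts the expected probabilities through its `p` argument. The counts and probabilities below are hypothetical, chosen only to show the call.

```r
observed   <- c(white = 30, black = 70)    # hypothetical counts from 100 draws
expected_p <- c(white = 0.2, black = 0.8)  # probabilities assumed to be known

# Goodness-of-fit test of the observed counts against the known probabilities
gof <- chisq.test(observed, p = expected_p)
print(gof$statistic)
print(gof$p.value)
```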

What about general enrichment problems?

suppose we have 2 sets of observations:

  • one is control condition
  • one is treatment condition
  • each observation is a “draw” as in hypergeo, but now:
    • sample with replacement
    • population unknown in both cases

General Enrichment Calculation

What is the probability of finding a read in a given gene (random draw), given the data?

set.seed(4)
credible_expression <- rbeta(20, 4, 6)
plot(density(credible_expression), xlim=c(0,1))
abline(v=0.5, lty=3, col="red")

plot of chunk coin-flip

[1] "mean 5.8e-06"

General Enrichment Calculation

What is the probability of finding a read in a given gene (random draw), given the data?

credible_expression <- rbeta(10000, 4, 6)
plot(density(credible_expression), xlim=c(0,1))
abline(v=0.5, lty=3, col="red")

plot of chunk coin-flip2

[1] "mean 0.4"

General Enrichment Calculation

What is the probability of finding a read in a given gene (random draw), given the data?

credible_expression <- rbeta(1e5, 234, 4e7)
cpm <- credible_expression * 1e6
plot(density(cpm), xlim=c(0,20))

plot of chunk beta-rna-seq-cpm

[1] "5.8 cpm"
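
The same summary can be obtained without simulation from the Beta distribution itself. This sketch assumes the two shape parameters in the `rbeta()` call stand for reads falling in the gene and reads falling elsewhere; that reading, and the variable names, are assumptions.

```r
reads_in_gene   <- 234  # shape1 in the rbeta() call
reads_elsewhere <- 4e7  # shape2 in the rbeta() call

# Analytical mean of Beta(a, b) is a / (a + b); scale to counts per million
mean_cpm <- reads_in_gene / (reads_in_gene + reads_elsewhere) * 1e6

# 95% credible interval from the Beta quantile function
ci_cpm <- qbeta(c(0.025, 0.975), reads_in_gene, reads_elsewhere) * 1e6

print(round(mean_cpm, 2))
print(round(ci_cpm, 2))
```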

General Enrichment Calculation

Splicing: splice forms A and B

Controls: A:B = 48:186

# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")

plot of chunk beta-splicing

[1] 0.205

General Enrichment Calculation

New Condition: observe 24 A, 47 B

# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")
# probability of observing form A in the new condition
lines(density(rbeta(1e5, 24, 47)))

plot of chunk beta-splicing-observe

General Enrichment Calculation

New Condition: observe 24 A, 47 B

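# probability of observing form A in controls
plot(density(rbeta(1e5, 48, 186)), xlim=c(0,1), main="splicing example")
# probability of observing form A in the new condition
lines(density(rbeta(1e5, 24, 47)))
# difference between the two (new condition minus controls)
lines(density(rbeta(1e5, 24, 47)-rbeta(1e5, 48, 186)), lty=2, col='red')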

plot of chunk beta-splicing-diff

General Enrichment Calculation

“Null” hypothesis test: rejection!

nsamples <- 1e6
treatment <- rbeta(nsamples, 24, 47)
control <- rbeta(nsamples, 48, 186)
p_value <- sum(treatment - control <= 0) / nsamples
print(p_value)
[1] 0.012607
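
The simulated difference can also be summarized as an interval. The sketch below repeats the comparison with a fixed seed (an arbitrary choice, so the numbers are repeatable) and adds a central 95% interval for the difference in the proportion of form A; the names `difference`, `p_leq_zero` and `interval` are illustrative.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
nsamples <- 1e5
treatment <- rbeta(nsamples, 24, 47)   # new condition: 24 A, 47 B
control   <- rbeta(nsamples, 48, 186)  # controls: 48 A, 186 B
difference <- treatment - control

# Proportion of simulated differences at or below zero
p_leq_zero <- mean(difference <= 0)

# Central 95% interval for the difference in the proportion of form A
interval <- quantile(difference, c(0.025, 0.975))
print(p_leq_zero)
print(interval)
```

If the interval excludes zero, the two conditions are unlikely to share the same underlying proportion, which is the same conclusion the tail proportion expresses.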

General Enrichment Calculation: Applications

  • splicing
  • enrichment of SNPs in epigenomics data
  • allele specific expression (ASE)

General Enrichment Calculation: Applications

Any problem involving count data where the underlying probability is not known but a suitable “background” condition is available for comparison

Gene Set Enrichment

  • competitive and self-contained methods
    • competitive H0: “the genes in my list are no more active than the background”
    • self-contained: “genes/annotations of my feature set are not active in this list”
  • Over Representation Analysis (ORA) – “competitive”
    • DAVID, clusterProfiler, LEGO
    • any of the count based analysis methods we've reviewed
  • Gene Set Enrichment Analysis (GSEA) – “competitive”

Gene Set Enrichment

  • competitive and self-contained methods
    • competitive H0: “the genes in my list are no more active than the background”
    • self-contained H0: “genes/annotations of my feature set are not active in this list”
  • Over Representation Analysis (ORA) – “competitive”
    • any of the count based analysis methods we've reviewed
    • even t-tests have been used (e.g. “DAVID”)
  • Gene Set Enrichment Analysis (GSEA) – “competitive”
  • “self-contained” methods test whether there are any active features in the set of interest
    • global test, GlobalANCOVA, FORGE
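
Over-representation analysis with a count-based method reduces to the marble problem: the background genes are the bag, the gene set members are the white marbles, and the gene list is the draw. The sketch below uses made-up numbers and is not the procedure of any particular tool; it shows that the hypergeometric tail and a one-sided Fisher's exact test give the same answer.

```r
# Hypothetical numbers, for illustration only
universe  <- 20000  # background genes tested
in_set    <- 200    # background genes annotated to the gene set
list_size <- 500    # genes in my list
overlap   <- 15     # genes in my list that are also in the gene set

# P[X >= overlap] under the hypergeometric null
p_hyper <- phyper(overlap - 1, in_set, universe - in_set, list_size,
                  lower.tail = FALSE)

# Same test as a one-sided Fisher's exact test on the 2x2 table
# (rows: in list / not in list; columns: in set / not in set)
tab <- matrix(c(overlap, in_set - overlap,
                list_size - overlap, universe - in_set - list_size + overlap),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
print(c(p_hyper, p_fisher))
```

The result depends directly on `universe`, which is why the choice of background list has to be reported.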

Enrichment in ranked lists

Online methods

How GSEA Works

shamelessly stolen from: Hector Corrada Bravo
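
A minimal sketch of the running-sum idea behind rank-based enrichment, assuming the unweighted (Kolmogorov-Smirnov-like) form of the statistic and a null built by shuffling gene labels. The published GSEA method weights each step by the gene's correlation with the phenotype and permutes sample labels, so this is a simplification; the function name `running_es` and the gene names are made up.

```r
# Walk down a ranked gene list: step up at set members, down otherwise.
# The enrichment score is the largest deviation of the running sum from zero.
running_es <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  step <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(step)
  running[which.max(abs(running))]
}

ranked  <- paste0("g", 1:20)          # genes ordered from most to least changed
top_set <- c("g1", "g2", "g3", "g5")  # a set concentrated at the top of the list
es_top  <- running_es(ranked, top_set)
print(es_top)

# Simplified null: shuffle the ranking and recompute the score
set.seed(1)
null_es <- replicate(1000, running_es(sample(ranked), top_set))
p_perm  <- mean(null_es >= es_top)
print(p_perm)
```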

Ontologies, their uses and misuses

“In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject.”

(from Wikipedia)

Ontologies

Gene Ontology is a curated graph of terms

  • Molecular Function: the tasks performed by individual gene products (e.g. “adenylate cyclase activity”)
  • Cellular Component: subcellular structures, locations, and macromolecular complexes (e.g. “ribosome”)
  • Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions (e.g. “DNA repair”)
    • each gene is annotated to a node in all three ontologies

Other Useful Ontologies

Reactome is an expert-authored, peer-reviewed knowledgebase of reactions and pathways. (now version 79)

  • Manually curated human pathways with experimental evidence (regarded highest quality)
  • Computationally inferred pathways for other organisms (e.g. Gallus gallus, Mus musculus)

Other Useful Ontologies

Navigating Reactome

  • Webpage provides an easy way to access, browse, analyze and download pathway data

  • Pathway browser

  • Pathway Structure

Other Useful Ontologies

Molecular Signatures Database (MSigDB)

  • Hallmark genesets
  • Canonical pathways
  • Regulatory Target genesets
  • disease genesets
  • many cancer sets
  • Gene Ontology

Other Useful Ontologies

Finally: On the misuse of Ontologies in the Biomedical Literature

Common issues

lack of methodological detail and errors in statistical analysis were widespread, which undermines … reliability and reproducibility

from Wijesooriya et al., 2022 Urgent need for consistent standards in functional enrichment analysis

First some general stats

Misuse of Ontologies in the Biomedical Literature

Define the gene set and version!

  • ENSEMBL? ENTREZ? …

from Wijesooriya et al., 2022 Urgent need for consistent standards in functional enrichment analysis

Misuse of Ontologies in the Biomedical Literature

Perform FDR correction

from Wijesooriya et al., 2022 Urgent need for consistent standards in functional enrichment analysis
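
In R, one common way to do this is the Benjamini-Hochberg adjustment in base R's `p.adjust`. The p-values below are hypothetical, one per gene set tested.

```r
# Hypothetical raw p-values, one per gene set
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.6)

# Benjamini-Hochberg false discovery rate adjustment
p_bh <- p.adjust(p_raw, method = "BH")
print(p_bh)

# Gene sets passing 0.05 before and after correction
print(sum(p_raw < 0.05))
print(sum(p_bh < 0.05))
```

The adjustment depends on how many tests were run, so all tested gene sets should be included, not only the ones reported.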

Misuse of Ontologies in the Biomedical Literature

Specify your background list

from Wijesooriya et al., 2022 Urgent need for consistent standards in functional enrichment analysis

Misuse of Ontologies in the Biomedical Literature

Make your code available to the community

from Wijesooriya et al., 2022 Urgent need for consistent standards in functional enrichment analysis

The Good, the Bad, and the Ugly

Example of effective use of GO enrichment

The End