Simon Coetzee
06/30/2023
“Quality Control & Alignment Basics” is licensed under CC BY by Simon Coetzee.
Understanding computational context of RNA-Seq
Learning to judge data for quality metrics of RNA-Seq
Requires a genome .fasta file & preferably a .gtf file containing a comprehensive set of genes
Requires reads in the form of .fastq
Outputs a .bam/.sam file indicating where reads are mapped in the genome.
Likely to be followed by a quantification step.
Counts represent the number of reads produced by the RNA-Seq experiment that can be assigned to a genomic feature.
They are the measurement that we can use to estimate differential expression.
They can be converted to many different abundance measurements.
Counts to abundance:
A lot more detail here
CPM = Counts Per Million
CPM = (count * 1e6) / total_reads
RPKM = Reads Per Kilobase per Million mapped reads or FPKM = Fragments Per Kilobase per Million mapped reads
RPKM (or FPKM) = (count * 1e3 * 1e6) / (total_reads * gene_length_in_bp)
TPM = Transcripts Per Million
TPM = A * (1 / sum(A)) * 1e6
A = count * 1e3 / gene_length_in_bp
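As a minimal sketch, the three conversions can be written as plain Python functions. The function names and the toy counts and lengths are illustrative assumptions, not values from any experiment; CPM is scaled by one million (1e6) as its name implies.

```python
def cpm(counts):
    """Counts Per Million: scale each count by the library size."""
    total_reads = sum(counts)
    return [c * 1e6 / total_reads for c in counts]


def rpkm(counts, lengths_bp):
    """Reads Per Kilobase per Million: normalise by library size and gene length."""
    total_reads = sum(counts)
    return [c * 1e3 * 1e6 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalise first, then rescale to sum to 1e6."""
    a = [c * 1e3 / length for c, length in zip(counts, lengths_bp)]
    total_a = sum(a)
    return [x / total_a * 1e6 for x in a]


# Hypothetical toy data: three genes with made-up counts and lengths.
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]

print(cpm(counts))
print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

Because TPM normalises by length before rescaling, the TPM values of a sample always sum to one million, which is not true of RPKM/FPKM.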
Considerations:
Work Around:
Tiny cluster for doing mapping
~ 15 million reads on one core in 11 minutes
Get read assignment to genes/transcripts for free
There is no separate quantification step
The main tools are salmon and kallisto
Requires a transcriptome .fasta file & preferably a .gtf file containing a comprehensive set of genes
Requires reads in the form of .fastq
Outputs a table with Transcript IDs, Counts, Transcript Lengths, and TPM values
Name | Length | EffectiveLength | TPM | NumReads |
---|---|---|---|---|
ENST00000456328.2 | 1657 | 1869.277 | 0.000000 | 0.000 |
ENST00000488147.1 | 1351 | 1646.916 | 7.718260 | 255.879 |
The effective length is akin to the regular length of each transcript type, except that it accounts for the fact that not every transcript in the population can produce a fragment of every length starting at every position. More precisely, a transcript has an effective length with respect to each possible fragment that maps to it.
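To make the role of the effective length concrete, here is a small sketch that parses a table with the columns shown above and recomputes TPM from NumReads and EffectiveLength. It assumes the table is tab-separated and that TPM is proportional to NumReads / EffectiveLength; the inline text and the function names are illustrative. With only the two example rows the result will not match the TPM column, since the real normalisation runs over every transcript in the file.

```python
import csv
import io

# Assumption: the quantification table is tab-separated with this header.
EXAMPLE_TABLE = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "ENST00000456328.2\t1657\t1869.277\t0.000000\t0.000\n"
    "ENST00000488147.1\t1351\t1646.916\t7.718260\t255.879\n"
)


def read_quant(handle):
    """Parse the table into a list of dicts with numeric fields converted."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": row["Name"],
            "Length": int(row["Length"]),
            "EffectiveLength": float(row["EffectiveLength"]),
            "TPM": float(row["TPM"]),
            "NumReads": float(row["NumReads"]),
        })
    return rows


def tpm_from_effective_length(rows):
    """Recompute TPM as reads per effective base, rescaled to sum to 1e6."""
    rates = [r["NumReads"] / r["EffectiveLength"] for r in rows]
    total = sum(rates)
    return [rate / total * 1e6 if total else 0.0 for rate in rates]


rows = read_quant(io.StringIO(EXAMPLE_TABLE))
print(tpm_from_effective_length(rows))
```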
Should be done both before and after alignment
That is, on the FASTQ files directly and on the resulting BAM files
A lot of quality control becomes clearer in the context of other experiments.
MultiQC facilitates this process with plugins to collate 92 different QC tools:
Preseq | featureCounts | Picard | STAR
What can we look for specifically?
Mapping rate guidelines:
Poor quality reads, contaminating sequences
Inappropriate alignment parameters, or reference genome/transcriptome
Low quality reference genome/transcriptome (less relevant for mouse/human)
Can’t really produce a perfectly uniform coverage across entire transcripts.
Introduced by library construction, especially if starting RNA is degraded.
3’ bias common when subjected to polyA enrichment.
5’ bias common when subjected to rRNA depletion.
Low fraction (<50%) of reads aligning to exons.
High fraction (>30%) of reads in introns or intergenic regions.
High fraction (>2%) of reads in rRNA.
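These thresholds can be turned into a simple automated check. The sketch below is one possible reading: the function name is hypothetical, the inputs are fractions between 0 and 1, and the intronic and intergenic fractions are assumed to be summed before comparison with the 30% threshold.

```python
def read_distribution_flags(exonic, intronic, intergenic, rrna):
    """Return warning strings for a sample, given read fractions in [0, 1]."""
    flags = []
    if exonic < 0.50:
        flags.append("low exonic fraction (<50%)")
    # Assumption: introns and intergenic regions are judged together.
    if intronic + intergenic > 0.30:
        flags.append("high intronic/intergenic fraction (>30%)")
    if rrna > 0.02:
        flags.append("high rRNA fraction (>2%)")
    return flags


# Hypothetical samples: one clean, one problematic.
print(read_distribution_flags(0.80, 0.10, 0.05, 0.01))
print(read_distribution_flags(0.40, 0.30, 0.25, 0.05))
```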
The goal of most RNA-seq studies is to interrogate functional mRNA. However, structural RNAs such as ribosomal RNA (rRNA) or tRNAs can constitute > 50% of total RNA in the cell, soaking up reads and lowering effective depth. These should be depleted during library preparation.
Duplicates are two or more reads assumed to be derived from the same nucleotide fragment, and so not representing independent transcriptome information from the sample. Typically, for paired-end read data (single-end data is also handled), duplicate-marking algorithms find the 5′ coordinates and mapping orientations of each read pair, taking into account all clipping that has taken place as well as any gaps or jumps in the alignment. All read pairs sharing identical 5′ coordinates and orientations are marked as duplicates except the “best” pair.
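The grouping logic can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular tool: the `ReadPair` record is hypothetical, its 5′ coordinates are assumed to be already corrected for clipping, both mates are assumed to map to the same chromosome, and the "best" pair is assumed to be the one with the highest summed base quality.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    five_prime_1: int      # clipping-corrected 5' coordinate of read 1
    strand_1: str          # "+" or "-"
    five_prime_2: int      # clipping-corrected 5' coordinate of read 2
    strand_2: str
    base_quality_sum: int  # assumed score used to pick the "best" pair


def mark_duplicates(pairs):
    """Return the names of pairs that would be marked as duplicates."""
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair.chrom, pair.five_prime_1, pair.strand_1,
               pair.five_prime_2, pair.strand_2)
        groups[key].append(pair)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.base_quality_sum)
        # Every pair in the group except the best one is a duplicate.
        duplicates.update(p.name for p in group if p is not best)
    return duplicates


# Hypothetical pairs: "a" and "b" share coordinates and orientations.
pairs = [
    ReadPair("a", "chr1", 100, "+", 350, "-", 2000),
    ReadPair("b", "chr1", 100, "+", 350, "-", 2400),
    ReadPair("c", "chr1", 120, "+", 350, "-", 2100),
]
print(mark_duplicates(pairs))
```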
There is a concern that duplicates may correspond to biased PCR amplification of particular fragments. However, for highly expressed or short genes, duplicates are expected even if there is no amplification bias, and removing them will reduce the dynamic range of expression estimates. Duplicates should therefore generally not be removed in RNA-seq analysis.
Preseq is a tool to help you design and optimize sequencing experiments by using population sampling models to infer properties of the population or the behavior under deeper sampling based upon a small initial sequencing experiment.
The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
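To give a feel for what library complexity means, the sketch below builds an empirical complexity curve by counting distinct fragments at increasing subsample depths. This is plain subsampling of observed data, not Preseq's extrapolation model; the function name, the fragment keys, and the toy library are all illustrative assumptions.

```python
import random


def complexity_curve(fragment_keys, steps=5, seed=0):
    """Return (depth, distinct_fragments) pairs for growing random subsamples."""
    shuffled = list(fragment_keys)
    if not shuffled:
        return []
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    checkpoints = sorted({max(1, round(n * i / steps)) for i in range(1, steps + 1)})

    curve = []
    seen = set()
    start = 0
    for depth in checkpoints:
        # Add only the reads between the previous checkpoint and this one.
        seen.update(shuffled[start:depth])
        start = depth
        curve.append((depth, len(seen)))
    return curve


# Hypothetical library: 50 copies of one fragment plus 50 unique fragments.
keys = ["dup"] * 50 + [f"frag{i}" for i in range(50)]
for depth, distinct in complexity_curve(keys):
    print(depth, distinct)
```

A curve that flattens early suggests a low complexity library, where further sequencing would mostly return fragments already seen.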