Simon Coetzee
02/20/2024
“NGS & File Formats” is licensed under CC BY by Simon Coetzee.
The poster, drawn by Gary Overacre in 1986, features a white-bearded wizard surrounded by UNIX-related objects: a spool of thread, a boot, a fork, pipes, and a bunch of containers labeled troff, awk, diff, uucp, make, and null. There’s even a C container and a cracked B container.
Understanding NGS file formats
Understanding NGS quality assessment
FASTA format reports a sequence
Can contain protein sequences or nucleic acid sequences
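To illustrate how simple the format is, here is a minimal sketch of a FASTA reader in Python. It assumes the usual convention that each record starts with a `>` header line followed by one or more sequence lines. The function name `read_fasta` and the two example records are made up for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header text without the leading '>'
            records[name] = ""
        elif name is not None:
            records[name] += line  # a sequence may wrap over several lines
    return records


# Hypothetical records: one nucleic acid sequence, one protein sequence.
example = """>seq1 hypothetical nucleotide record
ACGTACGT
ACGT
>seq2 hypothetical protein record
MKVLA
""".splitlines()

records = read_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

For real files a dedicated library is usually the better choice; this sketch only shows the structure of the format.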
Common applications include
Nearly everything works with this format. Some common examples are:
Header:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
entry | description |
---|---|
EAS139 | the unique instrument name |
136 | the run id |
FC706VJ | the flowcell id |
2 | flowcell lane |
2104 | tile number within the flowcell lane |
15343 | 'x'-coordinate of the cluster within the tile |
197393 | 'y'-coordinate of the cluster within the tile |
1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
Y | Y if the read is filtered (did not pass), N otherwise |
18 | 0 when none of the control bits are on, otherwise it is an even number |
ATCACG | index sequence |
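The table above maps directly onto a small parser. The following Python sketch splits the example header into named fields; the function name and dictionary keys are illustrative choices, and it assumes the header has exactly the colon-separated layout shown above.

```python
from typing import Dict


def parse_fastq_header(header: str) -> Dict[str, str]:
    """Split an Illumina-style FASTQ header into the fields from the table."""
    # The part before the space describes where the read came from,
    # the part after it describes the read itself.
    left, right = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    pair_member, is_filtered, control_bits, index = right.split(":")
    return {
        "instrument": instrument,
        "run_id": run_id,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "pair_member": pair_member,
        "is_filtered": is_filtered,
        "control_bits": control_bits,
        "index": index,
    }


fields = parse_fastq_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["flowcell"], fields["lane"], fields["index"])
```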
Using the table below, we can translate a quality symbol to a hexadecimal value, and then convert that to a decimal value in our Linux terminal.
Hexadecimal is just base 16:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, ..., 19, 1A, 1B, ...
We can see that + is equivalent to 2B in hexadecimal. To convert it to decimal and subtract the offset of 33, we enter the following command, in which 16#2B indicates that 2B is a hexadecimal number, and 33 is subtracted from it.
$ echo $(( 16#2B - 33 ))
10
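The same conversion can be sketched in Python, where `ord` returns the decimal character code directly so the hexadecimal step is not needed. The function name `phred33_scores` is an illustrative choice, and the offset of 33 is the one used above.

```python
from typing import List


def phred33_scores(quality: str) -> List[int]:
    """Convert a quality string to Phred scores using an offset of 33."""
    return [ord(symbol) - 33 for symbol in quality]


print(int("2B", 16) - 33)      # same arithmetic as the shell command: 10
print(phred33_scores("+"))     # [10]
print(phred33_scores("+5?I"))  # [10, 20, 30, 40]
```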
A common question: “How many reads have an average Phred quality score > 30?”
Phred quality score (Q) | Probability of incorrect call (P) | Base call accuracy |
---|---|---|
10 | 1 in 10 | 90% |
20 | 1 in 100 | 99% |
30 | 1 in 1000 | 99.9% |
40 | 1 in 10000 | 99.99% |
50 | 1 in 100000 | 99.999% |
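The table follows from the relationship \( P = 10^{-Q/10} \). The sketch below computes that probability and counts reads whose average score is above 30. It assumes an offset of 33 and takes "average" to mean the arithmetic mean of the per-base scores; the function names and the three quality strings are made up for illustration.

```python
from typing import Iterable


def error_probability(q: float) -> float:
    """Probability of an incorrect call for a Phred score q."""
    return 10 ** (-q / 10)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores encoded in a quality string."""
    scores = [ord(symbol) - offset for symbol in quality]
    return sum(scores) / len(scores)


def count_high_quality(qualities: Iterable[str], threshold: float = 30.0) -> int:
    """Count reads whose mean Phred score is strictly above the threshold."""
    return sum(1 for quality in qualities if mean_quality(quality) > threshold)


# Hypothetical quality strings with mean scores of 40, 10 and 35.
reads = ["IIII", "++++", "??II"]
print(count_high_quality(reads))  # 2
```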
Raw Data QC:
Things to check for: (problems with sequencing)
Poor Quality across the sequence.
Drop in quality unexpectedly (e.g. in the middle)
Large Percentage of sequence with low quality scores.
Things to check for:
Biased sequence composition
High level of sequence duplications
Over-represented sequences
Can we ID degraded RNA-Seq samples at this stage?
No. Degraded samples usually just have shorter RNA molecules; the quality of the sequenced nucleotides will be fine.
Sequence Alignment Map (SAM)
BAM - compressed searchable binary SAM
CRAM - even smaller compressed searchable binary SAM
Fully Described in a specification
Complex header with many optional fields. Some include:
Each header line begins with the character @ followed by one of the two-letter header record type codes.
In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format TAG:VALUE where TAG is a two-character string that defines the format and content of VALUE.
Tags marked with * are required.
@HD The header line. The first line if present.
  VN* Format version. Accepted format: /^[0-9]+\.[0-9]+$/.
@SQ Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
  SN* Reference sequence name. The SN tags and all individual AN names in all @SQ lines must be distinct. The value of this field is used in the alignment records in RNAME and RNEXT fields. Regular expression: [:rname:^*=][:rname:]*
  LN* Reference sequence length. Range: \([1, 2^{31} - 1]\)
@RG Read group. Unordered multiple @RG lines are allowed.
  ID* Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
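The TAB-delimited TAG:VALUE layout described above is easy to take apart. Here is a minimal Python sketch; the function name is an illustrative choice, and the header lines (version, chromosome length, sample names) are invented example values rather than output from a real file.

```python
from typing import Dict, Tuple


def parse_sam_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Return the record type and a {TAG: VALUE} dictionary for one header line."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # @CO lines hold free text and do not follow TAG:VALUE.
        return record_type, {"comment": "\t".join(fields[1:])}
    # Split on the first ':' only, because a VALUE may itself contain ':'.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


# Invented example header lines.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:sample1\tSM:sample1",
    "@CO\tfree text comment",
]

for line in header:
    print(parse_sam_header_line(line))
```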
Where is it used?
11 mandatory fields
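As a sketch of what those 11 fields look like in practice, the Python below splits one alignment line into the field names given in the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The function name and the alignment line itself are invented for illustration.

```python
from typing import Dict, Union

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment(line: str) -> Dict[str, Union[str, int]]:
    """Map the first 11 TAB-separated columns onto the mandatory field names."""
    values = line.rstrip("\n").split("\t")
    record: Dict[str, Union[str, int]] = {}
    for name, value in zip(SAM_FIELDS, values[:11]):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    return record


# Invented alignment line; any columns after the 11th would be optional tags.
line = "read1\t99\tchr1\t100\t60\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII"
record = parse_alignment(line)
print(record["QNAME"], record["RNAME"], record["POS"], record["CIGAR"])
```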
Flags can tell you about each read, and allow for summaries on the file, and filtering.
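The FLAG field is a sum of bits, so it can be decoded with bitwise tests. The sketch below uses the bit values from the SAM specification; the short labels and the function name are illustrative choices, not official names.

```python
from typing import List

# Bit values from the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits that are set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'read1']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse', 'read2']

# Filtering example: keep only flags of mapped reads (bit 0x4 not set).
flags = [99, 147, 4, 77]
mapped = [flag for flag in flags if not flag & 0x4]
print(mapped)  # [99, 147]
```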
MAPQ can encode the Mapping Quality
\( -10 \log_{10} \Pr\{\text{mapping position is wrong}\} \)
255 indicates no mapping quality is available
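The MAPQ formula can be sketched in Python in both directions. Rounding to the nearest integer is an assumption about how a value would be stored, the function names are made up, and real aligners each estimate the underlying probability in their own way.

```python
import math
from typing import Optional


def mapq_from_probability(p_wrong: float) -> int:
    """MAPQ for a given probability that the mapping position is wrong."""
    return round(-10 * math.log10(p_wrong))


def probability_from_mapq(mapq: int) -> Optional[float]:
    """Probability that the mapping position is wrong, or None if unavailable."""
    if mapq == 255:
        return None  # 255 means no mapping quality is available
    return 10 ** (-mapq / 10)


print(mapq_from_probability(0.001))  # 30
print(probability_from_mapq(255))    # None
```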
CIGAR can encode the alignment
Compact Idiosyncratic Gapped Alignment Report
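A CIGAR string is a series of length and operation pairs. The following Python sketch splits one into its parts and works out how many reference and read bases it covers, using the operation letters and their "consumes reference" and "consumes query" rules from the SAM specification. The function names and the example string are illustrative.

```python
import re
from typing import List, Tuple

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


# Hypothetical read: 5 soft-clipped bases, 10 aligned, a 2-base deletion, 3 aligned.
cigar = "5S10M2D3M"
print(parse_cigar(cigar))       # [(5, 'S'), (10, 'M'), (2, 'D'), (3, 'M')]
print(reference_length(cigar))  # 15
print(query_length(cigar))      # 18
```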
BED (Browser Extensible Data)
Simple format to describe intervals on the genome
Basic form is 3 columns
start is 0-based
end is 1-based (so the interval length is simply end minus start)
the first 100 bases on chromosome 1 would be represented with
chr1 0 100
and the next 100 bases
chr1 100 200
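The coordinate convention is easiest to see with a little arithmetic. This Python sketch parses the two intervals above, computes their lengths, and converts them to fully 1-based coordinates; the function names are illustrative choices.

```python
from typing import Tuple


def parse_bed3(line: str) -> Tuple[str, int, int]:
    """Read the three required BED columns: chromosome, start, end."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def interval_length(start: int, end: int) -> int:
    """With a 0-based start and 1-based end, the length is end - start."""
    return end - start


def to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a BED interval to 1-based, inclusive coordinates."""
    return start + 1, end


for line in ["chr1 0 100", "chr1 100 200"]:
    chrom, start, end = parse_bed3(line)
    print(chrom, interval_length(start, end), to_one_based(start, end))
# chr1 100 (1, 100)
# chr1 100 (101, 200)
```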
Optional Fields:
Name, Score, Strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
What software uses BED files?
GTF (Gene Transfer Format)
Mostly used to describe Genes.
First 8 fields are required:
¯\_(ツ)_/¯
9th column
Required:
gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1";
Optional:
gene_type "unprocessed_pseudogene"; gene_name "WASH7P";
transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001";
exon_number 11; exon_id "ENSE00001843071.1";
level 2; transcript_support_level "NA";
ont "PGO:0000005"; tag "basic";
havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
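The 9th column is a list of key and value pairs separated by semicolons, which can be turned into a dictionary. The Python sketch below assumes values are either wrapped in plain straight double quotes, as in real GTF files, or unquoted numbers. The function name is made up, and keys that repeat on one line (such as `tag` in some annotations) would overwrite each other in this simple version.

```python
import re
from typing import Dict

# A key, whitespace, then either a quoted value or a bare token, then ';'.
ATTRIBUTE = re.compile(r'(\S+)\s+("[^"]*"|[^;\s]+)\s*;')


def parse_gtf_attributes(column: str) -> Dict[str, str]:
    """Turn the GTF attribute column into a {key: value} dictionary."""
    return {key: value.strip('"') for key, value in ATTRIBUTE.findall(column)}


column = ('gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; '
          'gene_name "WASH7P"; exon_number 11; level 2;')
attributes = parse_gtf_attributes(column)
print(attributes["gene_id"], attributes["gene_name"], attributes["exon_number"])
```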
VCF (Variant Call Format)
Describes SNVs and INDELs
Another complex format, but has an official specification
8 required Fields
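As a rough sketch, the Python below reads the 8 fixed columns named in the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from one data line and makes a simple SNV versus INDEL distinction based on allele lengths. The function names and the two variant lines are invented for illustration, and the length-based check ignores symbolic alleles.

```python
from typing import Any, Dict

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Map the first 8 TAB-separated columns of a VCF data line to their names."""
    values = line.rstrip("\n").split("\t")
    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, values[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    info: Dict[str, Any] = {}
    if record["INFO"] != ".":
        for entry in record["INFO"].split(";"):
            key, _, value = entry.partition("=")
            info[key] = value if value else True  # keys without '=' are flags
    record["INFO"] = info
    return record


def is_snv(record: Dict[str, Any]) -> bool:
    """True when REF and every ALT allele are a single base."""
    return len(record["REF"]) == 1 and all(len(alt) == 1 for alt in record["ALT"])


# Invented variant lines.
snv = parse_vcf_line("chr1\t12345\t.\tA\tG\t50\tPASS\tDP=100;DB")
indel = parse_vcf_line("chr1\t20000\t.\tAT\tA\t40\tPASS\t.")
print(snv["POS"], snv["INFO"], is_snv(snv))
print(indel["POS"], indel["INFO"], is_snv(indel))
```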