
Module: IO Alignments

smehringer edited this page Nov 26, 2018 · 7 revisions

Notes

  • TODO: The REF_ID field for the SAM format is currently a string, but it should be a number, as in BAM files. That way the SAM and BAM formats handle REF_ID consistently.

  • Alignment class update: There is no alignment class anymore, since we just have a pair of aligned sequences and a class wrapper would introduce unnecessary overhead.

    • replace stream_alignment function from PR #233 when the concept of a "seqan pretty debug stream" is implemented (#121)
    • close PR #233 if code is really not needed anymore
    • Add get_cigar_vector function (introduced in PR #414)
  • SAM tag dictionary (#415)

    • PR #415 introduced a constexpr string literal in file sam_tag_dictionary.hpp that triggers a warning on clang. If we support clang and require C++20 on GCC, we can replace this with a constexpr string. This bullet point is also in meta issue #404.
  • cigar_op and cigar alphabet class (#414)

Specific Alignment Formats

  • Sam Format
  • Bam Format
  • Blast Format (for reasons of time and simplicity, only the tabular format is considered for now, and the Blast format will only be available as an output format)

Parsing

  • Read binary data
  • Write binary data
  • Endianness
  • Formatted Numbers
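
As an illustration of the "read/write binary data" and "endianness" items, here is a minimal sketch of reading and writing a 32-bit little-endian integer (the byte order BAM uses) independently of the host byte order. The helper names `decode_le_int32`, `read_le_int32` and `write_le_int32` are hypothetical and not part of SeqAn3.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: assemble 4 bytes stored least-significant-byte first.
inline std::int32_t decode_le_int32(unsigned char const * bytes)
{
    assert(bytes != nullptr);
    std::uint32_t const value = static_cast<std::uint32_t>(bytes[0])
                              | (static_cast<std::uint32_t>(bytes[1]) << 8)
                              | (static_cast<std::uint32_t>(bytes[2]) << 16)
                              | (static_cast<std::uint32_t>(bytes[3]) << 24);
    return static_cast<std::int32_t>(value);
}

// Hypothetical helper: read one little-endian int32 from a binary stream.
inline std::int32_t read_le_int32(std::istream & in)
{
    std::array<unsigned char, 4> buffer{};
    in.read(reinterpret_cast<char *>(buffer.data()), 4);
    if (in.gcount() != 4)
        throw std::runtime_error{"unexpected end of binary stream"};
    return decode_le_int32(buffer.data());
}

// Hypothetical helper: write one int32 in little-endian byte order.
inline void write_le_int32(std::ostream & out, std::int32_t value)
{
    auto const u = static_cast<std::uint32_t>(value);
    char const bytes[4] = {static_cast<char>(u & 0xFFu),
                           static_cast<char>((u >> 8) & 0xFFu),
                           static_cast<char>((u >> 16) & 0xFFu),
                           static_cast<char>((u >> 24) & 0xFFu)};
    out.write(bytes, 4);
}
```

Shifting individual bytes instead of casting the buffer to an integer keeps the code correct on both little- and big-endian hosts, at the cost of a few extra operations per value.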

Dependencies

  • BGZF input stream (#317)
  • BGZF output stream (#317)

Summary Alignment Record Discussion

The key field of the alignment record will be seqan3::field::alignment, which holds a seqan3::alignment object with two sequences. Alignment IO therefore only considers pairwise alignments, since multiple alignments are a different use case. The two sequences in the alignment object will preferably be of the gap_decorator type, which stores only the gap information and a reference to the seqan3::field::SEQ/seqan3::field::REF_SEQ objects.
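
The idea of storing only gap information next to a reference to the ungapped sequence could look roughly like the following sketch. It is not the actual gap_decorator interface; the class name `gapped_view` and its members are hypothetical and only illustrate the storage scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch: a gapped sequence that never copies the underlying
// sequence. It keeps a pointer to it plus a map "ungapped position -> number
// of gaps inserted in front of that position".
class gapped_view
{
public:
    // The caller must keep `seq` alive for as long as the view is used.
    explicit gapped_view(std::string const & seq) : seq_{&seq} {}

    // Insert `count` gaps in front of ungapped position `pos`;
    // pos == seq.size() appends trailing gaps.
    void insert_gap(std::size_t pos, std::size_t count)
    {
        assert(pos <= seq_->size());
        gaps_[pos] += count;
    }

    // Length of the gapped sequence.
    std::size_t size() const
    {
        std::size_t total = seq_->size();
        for (auto const & gap : gaps_)
            total += gap.second;
        return total;
    }

    // Materialise the gapped sequence, e.g. for debugging output.
    std::string to_string() const
    {
        std::string out;
        for (std::size_t i = 0; i <= seq_->size(); ++i)
        {
            if (auto it = gaps_.find(i); it != gaps_.end())
                out.append(it->second, '-');
            if (i < seq_->size())
                out.push_back((*seq_)[i]);
        }
        return out;
    }

private:
    std::string const * seq_;
    std::map<std::size_t, std::size_t> gaps_;
};
```

For the sequence "ACGT", calling insert_gap(2, 2) and insert_gap(4, 1) yields the gapped sequence "AC--GT-" while the original string stays untouched.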

For SAM/BAM files the alignment is usually incomplete because the reference sequence is unknown. This will be handled by introducing a dummy/proxy sequence that consists of N's. If the reference characters are queried, an exception is thrown to warn the user that no reference information is available. The alignment file (at least for the BAM/SAM formats) should also be constructible with a second file, the reference sequence file, in which case the reference is loaded and the alignments are properly instantiated. There is also the idea of "lazy loading": the provided reference sequence file is not opened until an alignment is queried for reference-related information, at which point the file is accessed (via the index?) and the corresponding information is extracted.
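
A minimal sketch of how the dummy sequence and the lazy loading idea might be combined, assuming the reference can be obtained through some callable. The class `lazy_reference`, its `loader_type` and the choice of `std::logic_error` are assumptions made for illustration; the actual design may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch of a reference proxy.
class lazy_reference
{
public:
    using loader_type = std::function<std::string()>;

    // Without a loader the object acts as the dummy sequence:
    // it knows its length but has no characters to hand out.
    explicit lazy_reference(std::size_t length, loader_type loader = {})
        : length_{length}, loader_{std::move(loader)}
    {}

    std::size_t size() const noexcept { return length_; }
    bool loaded() const noexcept { return cache_.has_value(); }

    // Querying a character either throws (no reference available) or
    // triggers the one-time load of the reference sequence.
    char at(std::size_t pos)
    {
        if (!loader_)
            throw std::logic_error{"no reference sequence information available"};
        if (!cache_)
        {
            cache_ = loader_(); // e.g. open the reference file only now
            assert(cache_->size() == length_);
        }
        return cache_->at(pos);
    }

private:
    std::size_t length_;
    loader_type loader_;
    std::optional<std::string> cache_;
};
```

Constructing `lazy_reference{5}` gives the dummy behaviour (every call to `at` throws), while passing a loader defers reading the reference until the first character is requested and then reuses the cached result.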

Mapping of the required SAM columns (names as in the SAM specification) to fields:

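| #  | SAM name | FIELD name                    |
|----|----------|-------------------------------|
| 1  | QNAME    | ID                            |
| 2  | FLAG     | FLAG                          |
| 3  | RNAME    | REF_ID                        |
| 4  | POS      | REF_OFFSET                    |
| 5  | MAPQ     | MAPQ                          |
| 6  | CIGAR    | implicitly stored in ALIGNMENT |
| 7  | RNEXT    | MATE (tuple pos 0)            |
| 8  | PNEXT    | MATE (tuple pos 1)            |
| 9  | TLEN     | MATE (tuple pos 2)            |
| 10 | SEQ      | SEQ                           |
| 11 | QUAL     | QUAL                          |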

The REF_SEQ field will be required to store the alignment information correctly. The (sequence/query) OFFSET will be required to store the soft clipping information at the read start (end clipping will be visible by the length of the alignment that holds the infix of SEQ).

Note: Because the alignment is represented as an infix of only the aligned parts of the two sequences, clipping and local BLAST alignments can easily be visualized via the alignment and the (REF_)OFFSET values with view_to_position/position_to_view/view_to_clipped_position etc. The only downside is that hard clipping information in the SAM file will be lost and cannot be recovered. Since hard clipping is usually a quality cut-off and the respective sequence part is not present in the SAM file, we expect this information to be irrelevant in 99% of cases. If requested, the hard clipping information could be stored in a separate field later on.
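
To make the role of OFFSET and the treatment of clipping concrete, the following sketch derives the leading soft clip (OFFSET), the aligned infix of SEQ and the covered reference length from a CIGAR string, and simply drops hard clips. The names `cigar_summary`, `summarise_cigar` and `aligned_infix` are hypothetical; the operation semantics follow the SAM specification (M/=/X consume both sequences, I/S consume the query, D/N consume the reference).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical summary of what a CIGAR string implies for the record fields.
struct cigar_summary
{
    std::size_t offset{};       // leading soft clip, i.e. the OFFSET field
    std::size_t query_length{}; // SEQ characters covered by the alignment
    std::size_t ref_length{};   // reference characters covered by the alignment
    std::size_t end_clip{};     // trailing soft clip
};

inline cigar_summary summarise_cigar(std::string_view cigar)
{
    cigar_summary result{};
    std::size_t count = 0;
    bool have_count = false;

    for (char const c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            count = count * 10 + static_cast<std::size_t>(c - '0');
            have_count = true;
            continue;
        }
        if (!have_count)
            throw std::invalid_argument{"CIGAR operation without a count"};

        switch (c)
        {
            case 'M': case '=': case 'X':
                result.query_length += count;
                result.ref_length += count;
                break;
            case 'I': result.query_length += count; break;
            case 'D': case 'N': result.ref_length += count; break;
            case 'S': // soft clip: before the alignment -> OFFSET, after -> end clip
                if (result.query_length == 0 && result.ref_length == 0)
                    result.offset += count;
                else
                    result.end_clip += count;
                break;
            case 'H': case 'P': break; // hard clipping (and padding) is not kept
            default: throw std::invalid_argument{"unknown CIGAR operation"};
        }
        count = 0;
        have_count = false;
    }
    if (have_count)
        throw std::invalid_argument{"CIGAR ends with a count but no operation"};
    return result;
}

// The part of SEQ that takes part in the alignment.
inline std::string_view aligned_infix(std::string_view seq, cigar_summary const & s)
{
    assert(s.offset + s.query_length + s.end_clip == seq.size());
    return seq.substr(s.offset, s.query_length);
}
```

For the CIGAR "2H3S5M1I4M2D1M2S" this gives an OFFSET of 3, an aligned query infix of length 11, a reference span of 12 and an end clip of 2; the two hard-clipped bases leave no trace, which is the information loss described above.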

BLAST tabular format default columns to fields:

Column names as in the NCBI Handbook

| #  | BLAST name       | FIELD name                                     |
|----|------------------|------------------------------------------------|
| 1  | Query id         | ID                                             |
| 2  | Subject id       | REF_ID                                         |
| 3  | % identity       | implicitly stored in ALIGNMENT                 |
| 4  | alignment length | implicitly stored in ALIGNMENT                 |
| 5  | mismatches       | implicitly stored in ALIGNMENT                 |
| 6  | gap openings     | implicitly stored in ALIGNMENT                 |
| 7  | q. start         | OFFSET                                         |
| 8  | q. end           | implicitly stored in OFFSET and ALIGNMENT      |
| 9  | s. start         | REF_OFFSET                                     |
| 10 | s. end           | implicitly stored in REF_OFFSET and ALIGNMENT  |
| 11 | e-value          | EVALUE                                         |
| 12 | bit score        | BIT_SCORE                                      |
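
The columns marked as implicitly stored can be recomputed from the two gapped sequences and the offsets. The sketch below shows one way to do this, assuming gaps are written as '-', OFFSET and REF_OFFSET are 0-based, and BLAST positions are reported 1-based and inclusive. The struct `blast_columns` and the function `derive_blast_columns` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical container for the BLAST tabular columns 3 to 10.
struct blast_columns
{
    double      identity{};     // % identity
    std::size_t length{};       // alignment length (columns incl. gaps)
    std::size_t mismatches{};
    std::size_t gap_openings{}; // number of gap runs in either sequence
    std::size_t q_start{}, q_end{};
    std::size_t s_start{}, s_end{};
};

inline blast_columns derive_blast_columns(std::string_view query,
                                          std::string_view subject,
                                          std::size_t offset,
                                          std::size_t ref_offset)
{
    assert(query.size() == subject.size()); // both rows of one alignment

    blast_columns cols{};
    cols.length = query.size();

    std::size_t matches = 0, q_residues = 0, s_residues = 0;
    bool q_in_gap = false, s_in_gap = false;

    for (std::size_t i = 0; i < query.size(); ++i)
    {
        bool const q_gap = query[i] == '-';
        bool const s_gap = subject[i] == '-';

        // A gap opening is the first column of a run of gaps.
        if (q_gap && !q_in_gap) ++cols.gap_openings;
        if (s_gap && !s_in_gap) ++cols.gap_openings;
        q_in_gap = q_gap;
        s_in_gap = s_gap;

        if (!q_gap) ++q_residues;
        if (!s_gap) ++s_residues;
        if (!q_gap && !s_gap)
        {
            if (query[i] == subject[i]) ++matches;
            else                        ++cols.mismatches;
        }
    }

    cols.identity = cols.length == 0
                  ? 0.0
                  : 100.0 * static_cast<double>(matches) / static_cast<double>(cols.length);
    cols.q_start = offset + 1;
    cols.q_end   = offset + q_residues;
    cols.s_start = ref_offset + 1;
    cols.s_end   = ref_offset + s_residues;
    return cols;
}
```

For the rows "ACGT-ACG" and "AC-TTAGG" with OFFSET 2 and REF_OFFSET 10, this yields an alignment length of 8, 62.5 % identity, 1 mismatch, 2 gap openings, a query range of 3 to 9 and a subject range of 11 to 17.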

In bold are the fields that occur in both BLAST and SAM. The BLAST output format will always print the 12 default columns, so there will be no extra fields for the user to specify.

Default fields for the alignment file in

ID, SEQ, REF_ID, REF_SEQ, ALIGNMENT, OFFSET, REF_OFFSET
