SeqAn3
seqan3::alignment_file_header Struct Reference

Stores the header information of alignment files.

#include <seqan3/io/alignment_file/header.hpp>

Classes

struct  program_info_t
 Stores information about the program/tool that was used to create the file.
 

Public Attributes

std::vector< std::string > comments
 The list of comments.
 
std::string format_version
 The file format version. Note: this is overwritten by our formats on output.
 
std::string grouping {"none"}
 The grouping state of the file. SAM: [none, query, reference].
 
std::vector< program_info_t > program_infos
 The list of program information.
 
std::vector< std::pair< std::string, std::string > > read_groups
 The Read Group Dictionary. (used by the SAM/BAM format)
 
std::unordered_map< std::string, std::tuple< uint32_t, std::string > > ref_dict
 The Reference Dictionary. (used by the SAM/BAM format)
 
std::string sorting {"unknown"}
 The sorting state of the file. SAM: [unknown, unsorted, queryname, coordinate].
 

Detailed Description

Stores the header information of alignment files.

Member Data Documentation

◆ read_groups

std::vector<std::pair<std::string, std::string> > seqan3::alignment_file_header::read_groups

The Read Group Dictionary. (used by the SAM/BAM format)

The read group dictionary stores the group ID and additional information for each read group in the file. A record may carry an RG tag that references one of the stored IDs. The ID is required if the header is provided.

The additional information (second pair entry) for the SAM format must follow these formatting rules: the information is given in a tab-separated TAG:VALUE format, where TAG must be one of [BC, CN, DS, DT, FO, KS, LB, PG, PI, PL, PM, PU, SM]. The following information and rules apply to each tag (taken from the SAM specs):

TAG  Description and Rules

BC  Barcode sequence identifying the sample or library. This value is the expected barcode bases as read by the sequencing machine in the absence of errors. If there are several barcodes for the sample/library (e.g., one on each end of the template), the recommended implementation concatenates all the barcodes, separating them with hyphens ('-').

CN  Name of sequencing center producing the read.

DS  Description. UTF-8 encoding may be used.

DT  Date the run was produced (ISO8601 date or date/time).

FO  Flow order. The array of nucleotide bases that correspond to the nucleotides used for each flow of each read. Multi-base flows are encoded in IUPAC format, and non-nucleotide flows by various other characters. Format: /\*|[ACMGRSVTWYHKDBN]+/

KS  The array of nucleotide bases that correspond to the key sequence of each read.

LB  Library.

PG  Programs used for processing the read group.

PI  Predicted median insert size.

PL  Platform/technology used to produce the reads. Valid values: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO.

PM  Platform model. Free-form text providing further details of the platform/technology used.

PU  Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.

SM  Sample. Use pool name where a pool is being sequenced.
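As an illustration of the formatting rules above, here is a minimal sketch that builds a read group entry. It uses only the standard library and a type alias that mirrors the declared type of read_groups, so it compiles without SeqAn3. The helper names make_read_group_info and example_read_groups, and the values "group1" and "sample1", are hypothetical and not part of the library.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in with the same type as seqan3::alignment_file_header::read_groups.
using read_group_list = std::vector<std::pair<std::string, std::string>>;

// Hypothetical helper (not part of SeqAn3): joins (TAG, VALUE) fields into
// the tab-separated TAG:VALUE string expected as the second pair entry.
std::string make_read_group_info(std::vector<std::pair<std::string, std::string>> const & fields)
{
    std::string info;
    for (auto const & field : fields)
    {
        if (!info.empty())
            info += '\t';                          // fields are tab separated
        info += field.first + ':' + field.second;  // TAG:VALUE
    }
    return info;
}

// Builds a list with a single read group; the ID and values are invented.
read_group_list example_read_groups()
{
    read_group_list read_groups;
    read_groups.emplace_back("group1",
                             make_read_group_info({{"PL", "ILLUMINA"}, {"SM", "sample1"}}));
    return read_groups;
}
```

The resulting list could then be assigned to the read_groups member of a header object. The helper does not check that each tag is one of those listed in the table.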

◆ ref_dict

std::unordered_map<std::string, std::tuple<uint32_t, std::string> > seqan3::alignment_file_header::ref_dict

The Reference Dictionary. (used by the SAM/BAM format)

The reference dictionary stores the reference name, its length, and additional information for each reference sequence in the file. A record may (SAM) or must (BAM) then store only the index of the reference. The name and length are required if the header is provided. Each reference sequence that is referred to in any of the records must be present in the dictionary; otherwise a seqan3::format_error is thrown upon reading or writing a file.

The additional information (second tuple entry) for the SAM format must follow these formatting rules: the information is given in a tab-separated TAG:VALUE format, where TAG must be one of [AH, AN, AS, M5, SP, UR]. The following information and rules apply to each tag (taken from the SAM specs):

TAG  Description and Rules

AH  Indicates that this sequence is an alternate locus. The value is the locus in the primary assembly for which this sequence is an alternative, in the format 'chr:start-end', 'chr' (if known), or '*' (if unknown), where 'chr' is a sequence in the primary assembly. Must not be present on sequences in the primary assembly.

AN  Alternative reference sequence names. A comma-separated list of alternative names that tools may use when referring to this reference sequence. These alternative names are not used elsewhere within the SAM file; in particular, they must not appear in alignment records' RNAME or RNEXT fields. Regular expression: name(,name)* where name is [0-9A-Za-z][0-9A-Za-z*+.@_|-]*

AS  Genome assembly identifier.

M5  MD5 checksum of the sequence. See Section 1.3.1 of the SAM specs.

SP  Species.

UR  URI of the sequence. This value may start with one of the standard protocols, e.g. http: or ftp:. If it does not start with one of these protocols, it is assumed to be a file-system path.
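The layout of the reference dictionary can be sketched as follows. The alias mirrors the declared type of ref_dict and relies only on the standard library, so it compiles without SeqAn3. The helper names example_ref_dict and reference_length_or, the reference name "ref1", its length, and the tag values are assumptions made up for this example.

```cpp
#include <cstdint>
#include <string>
#include <tuple>
#include <unordered_map>

// Stand-in with the same type as seqan3::alignment_file_header::ref_dict:
// reference name -> (length, tab-separated TAG:VALUE information).
using reference_dictionary =
    std::unordered_map<std::string, std::tuple<std::uint32_t, std::string>>;

// Builds a dictionary with one invented reference sequence.
reference_dictionary example_ref_dict()
{
    reference_dictionary ref_dict;
    ref_dict["ref1"] = std::make_tuple(std::uint32_t{1000},
                                       std::string{"AS:example_assembly\tSP:example species"});
    return ref_dict;
}

// Hypothetical helper (not part of SeqAn3): returns the stored length of
// `name`, or `fallback` if the reference is not in the dictionary.
std::uint32_t reference_length_or(reference_dictionary const & ref_dict,
                                  std::string const & name,
                                  std::uint32_t fallback)
{
    auto it = ref_dict.find(name);
    if (it == ref_dict.end())
        return fallback;
    return std::get<0>(it->second);  // first tuple entry is the length
}
```

A lookup of this kind shows why every reference named in a record has to be present in the dictionary: without an entry there is no length to report. The sketch returns a fallback value in that case rather than throwing.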


The documentation for this struct was generated from the following file: