SeqAn3
Alphabet
Collaboration diagram for Alphabet:

Modules

 Adaptation
 Contains alphabet adaptations of some standard char and uint types.
 
 Aminoacid
 Contains the amino acid alphabets and functionality for translation from nucleotide.
 
 Composition
 Provides data structures joining multiple alphabets into a single alphabet.
 
 Gap
 Contains the gap alphabet and functionality to make an alphabet a gapped alphabet.
 
 Mask
 Contains the mask alphabet and functionality for creating masked compositions.
 
 Nucleotide
 Contains the different DNA and RNA alphabet types.
 
 Quality
 Contains the various quality score types.
 
 Structure
 The structure module contains alphabets for RNA and protein structure.
 

Classes

class  seqan3::alphabet_base< derived_type, size, char_t >
 A CRTP-base that makes defining a custom alphabet easier.
 
interface  seqan3::alphabet_concept
 The generic alphabet concept that covers most data types used in ranges. This is the core alphabet concept that many other alphabet concepts refine.
 
struct  seqan3::alphabet_size< alphabet_type >
 The size of the alphabet. [value metafunction base template].
 
struct  std::hash< alphabet_t >
 Struct for hashing a character.
 
struct  std::hash< urng_t >
 Struct for hashing a range of characters.
 
struct  seqan3::max_pseudoknot_depth< alphabet_type >
 Metafunction that indicates to what extent an alphabet can handle pseudoknots. [value metafunction base template].
 
interface  seqan3::semi_alphabet_concept
 The basis for seqan3::alphabet_concept, but requires only the rank interface (not the char interface).
 
struct  seqan3::underlying_char< alphabet_type >
 The char_type of the alphabet. [type metafunction base template].
 
struct  seqan3::underlying_rank< semi_alphabet_type >
 The rank_type of the semi_alphabet. [type metafunction base template].
 

Functions

size_t std::hash< alphabet_t >::operator() (alphabet_t const character) const noexcept
 Compute the hash for a character.
 
size_t std::hash< urng_t >::operator() (urng_t const &range) const noexcept
 Compute the hash for a range of characters.
 

Requirements for seqan3::semi_alphabet_concept

You can expect these functions on all types that implement seqan3::semi_alphabet_concept.

template<typename semi_alphabet_type >
using underlying_rank_t = typename underlying_rank< semi_alphabet_type >::type
 The rank_type of the semi_alphabet. [type metafunction shortcut].
 
template<typename alphabet_type >
constexpr auto alphabet_size_v = alphabet_size<alphabet_type>::value
 The size of the alphabet. [value metafunction shortcut].
 
rank_type to_rank (semi_alphabet_concept const alph)
 Returns the alphabet letter's value in rank representation.
 
semi_alphabet_concept && assign_rank (semi_alphabet_concept &&alph, rank_type const rank)
 Assigns a rank to an alphabet letter.
 

Requirements for seqan3::alphabet_concept

You can expect these functions on all types that implement seqan3::alphabet_concept.

template<typename alphabet_type >
using underlying_char_t = typename underlying_char< alphabet_type >::type
 The char_type of the alphabet. [type metafunction shortcut].
 
char_type to_char (alphabet_concept const alph)
 Returns the alphabet letter's value in character representation.
 
alphabet_concept && assign_char (alphabet_concept &&alph, char_type const chr)
 Assigns a character to an alphabet letter.
 

Requirements for seqan3::rna_structure_concept

You can expect these functions on all types that implement seqan3::rna_structure_concept.

template<typename alphabet_type >
constexpr uint8_t max_pseudoknot_depth_v = max_pseudoknot_depth<alphabet_type>::value
 The pseudoknot ability of the alphabet. [value metafunction shortcut].
 

Detailed Description

Introduction

Alphabets are a core component in SeqAn. They enable us to represent the smallest unit of biological sequence data, e.g. a nucleotide or an amino acid.

In theory, these could just be represented as a char, and this is how many people perceive them, but in almost all cases it makes sense to use a smaller, stricter and well-defined alphabet, for reasons such as the space efficiency and the unambiguous handling of values discussed below.

In SeqAn there are alphabet types for typical sequence alphabets like DNA and amino acid, but also for qualities, RNA structures and alignment gaps. In addition there are templates for combining alphabet types into new alphabets, and wrappers for existing data types like the canonical char.

To be included in the alphabet module, an alphabet must satisfy the generic seqan3::alphabet_concept documented below. While this only encompasses a minimum set of requirements, many of our alphabets provide more features and there are more refined concepts. The inheritance diagram of seqan3::alphabet_concept gives a detailed overview. A more basic overview of this module and its submodules is available in the collaboration diagram at the top of this page.

The alphabet concept

The seqan3::alphabet_concept defines the requirements a type needs to meet to be considered an alphabet by SeqAn, or in other words: you can expect certain properties and functions to be defined on all data types we call an alphabet.

Alphabet size

All alphabets in SeqAn have a fixed size. It can be queried via the seqan3::alphabet_size metafunction and optionally also the value_size static member of the alphabet (see below for "members VS free/global functions").

In some areas we provide alphabet types with different sizes for the same purpose, e.g. seqan3::dna4 ('A', 'C', 'G', 'T'), seqan3::dna5 (plus 'N') and seqan3::dna15 (plus ambiguous characters defined by IUPAC). By convention most of our alphabets carry their size in their name (seqan3::dna4 has size 4, and so on).

A main reason for choosing a smaller alphabet over a bigger one is the possibility of optimising for space efficiency. Note, however, that a single letter by itself can never be smaller than a byte for architectural reasons. Actual space improvements are realised via secondary structures, e.g. when using a seqan3::bitcompressed_vector<seqan3::dna4> instead of std::vector<seqan3::dna4>. Also the single letter quality composition seqan3::qualified<seqan3::dna4, seqan3::phred42> fits into one byte, because the product of the alphabet sizes (4 * 42) is smaller than 256; whereas the same composition with seqan3::dna15 requires two bytes per letter (15 * 42 > 256).
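The byte arithmetic above can be checked with a small sketch. `bytes_per_letter` is a hypothetical helper written for this illustration and is not part of SeqAn; the sizes 4, 15 and 42 are taken from the text.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a SeqAn function): number of bytes needed to store
// one letter of an alphabet that has `alphabet_size` distinct values.
constexpr std::size_t bytes_per_letter(std::uint64_t alphabet_size)
{
    std::size_t bytes = 1;
    std::uint64_t capacity = 256; // number of values one byte can represent
    while (bytes < 8 && capacity < alphabet_size)
    {
        capacity *= 256;
        ++bytes;
    }
    return bytes;
}

// dna4 combined with phred42: 4 * 42 = 168 values fit into a single byte.
static_assert(bytes_per_letter(4 * 42) == 1, "168 values fit into one byte");
// dna15 combined with phred42: 15 * 42 = 630 values need a second byte.
static_assert(bytes_per_letter(15 * 42) == 2, "630 values need two bytes");
```

The same reasoning explains why a single seqan3::dna4 letter still occupies a full byte: one byte is the smallest unit the helper can return.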

Assigning and retrieving values

As mentioned above, we typically think of alphabets in their character representation, but as programmers we also require them in "rank representation". In C and C++ it is quite difficult to cleanly differentiate between these, because the char type is considered an integral type and can be used to index an array (e.g. my_array['A'] translates to my_array[65] on ASCII-based platforms). Moreover, the signedness of char is implementation-defined, and on many platforms the smallest integer types int8_t and uint8_t are literally the same types as signed char and unsigned char respectively.

This leads to ambiguity when assigning and retrieving values:

// does not work:
// dna4 my_letter{0}; // we want to set the default, an A
// dna4 my_letter{'A'}; // we also want to set an A, but we are setting value 65
// debug_stream << my_letter; // you expect 'A', but how would you access the number?

To solve this problem, every alphabet defines two interfaces:

  1. a character based interface with
    1. the underlying character type able to represent this alphabet visually (almost always char, but could be char16_t or char32_t, as well)
    2. a seqan3::to_char function to produce the visual representation
    3. a seqan3::assign_char function to assign from the visual representation
  2. a rank based interface with
    1. the underlying rank type able to represent this alphabet numerically; this type must be able to represent the numbers from 0 to alphabet size - 1 (often uint8_t, but sometimes a larger unsigned integral type)
    2. a seqan3::to_rank function to produce the numerical representation
    3. a seqan3::assign_rank function to assign from the numerical representation

To prevent the aforementioned ambiguity, you can neither assign from rank or char representation via operator=, nor can you cast the alphabet to either of its representation forms. Instead, you need to use the interfaces explicitly:

dna4 my_letter;
assign_rank(my_letter, 0); // assign an A via rank interface
assign_char(my_letter, 'A'); // assign an A via char interface
my_letter = 'A'_dna4; // some alphabets (BUT NOT ALL!) also provide an enum-like interface
debug_stream << to_char(my_letter); // prints 'A'
debug_stream << (unsigned)to_rank(my_letter); // prints 0
// we have to add the cast here, because uint8_t is also treated as a char type by default :(
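The cast in the last line matters because stream operators pick the character overload for uint8_t wherever it is an alias of unsigned char. The following sketch uses only the standard library to show the difference; `stream_raw` and `stream_as_number` are hypothetical helper names, and the "A" result assumes an ASCII-based platform.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Streams the value as-is. Where std::uint8_t is unsigned char, the stream
// treats it as a character, so 65 yields "A" on ASCII-based platforms.
std::string stream_raw(std::uint8_t value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// Streams the value after widening it to unsigned, so 65 yields "65".
std::string stream_as_number(std::uint8_t value)
{
    std::ostringstream out;
    out << static_cast<unsigned>(value);
    return out.str();
}
```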

For efficiency, the representation saved internally is normally the rank representation, and the character representation is generated via conversion tables. This is, however, not required as long as both interfaces are provided and all functions operate in constant time.
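As a rough illustration of this design, the sketch below stores only the rank and derives the character through lookup tables. `toy_dna4` and its tables are hypothetical and do not reproduce the implementation of seqan3::dna4; mapping unknown characters to rank 0 is an assumption made for this example.

```cpp
#include <array>
#include <cstdint>

// Hypothetical toy alphabet; the only stored state is the rank (0 == 'A').
struct toy_dna4
{
    std::uint8_t rank = 0;
};

// Conversion table: rank -> char.
constexpr char toy_rank_to_char[4] = {'A', 'C', 'G', 'T'};

// Rank interface.
constexpr std::uint8_t to_rank(toy_dna4 const l) noexcept { return l.rank; }

inline toy_dna4 & assign_rank(toy_dna4 & l, std::uint8_t const r) noexcept
{
    l.rank = r; // precondition: r < 4
    return l;
}

// Conversion table: char -> rank. Unknown characters map to rank 0 (assumption).
inline std::array<std::uint8_t, 256> make_char_to_rank()
{
    std::array<std::uint8_t, 256> table{}; // zero-initialised
    table['C'] = 1;
    table['G'] = 2;
    table['T'] = 3;
    return table;
}

// Char interface; both directions are single table lookups (constant time).
constexpr char to_char(toy_dna4 const l) noexcept { return toy_rank_to_char[l.rank]; }

inline toy_dna4 & assign_char(toy_dna4 & l, char const c) noexcept
{
    static std::array<std::uint8_t, 256> const table = make_char_to_rank();
    l.rank = table[static_cast<unsigned char>(c)];
    return l;
}
```

Keeping the rank as the stored value means that indexing and comparison need no conversion, while the character is only produced on demand.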

In the documentation you will also encounter seqan3::semi_alphabet_concept. It describes "one half" of an alphabet and only defines the rank interface as a type requirement. It is mainly used internally and not relevant to most users of SeqAn.

Members VS free/global functions

The alphabet concept (like most concepts in SeqAn) looks for free/global functions, i.e. you need to be able to call seqan3::to_rank(my_letter). Most alphabets, however, also provide a member function, i.e. my_letter.to_rank(). The same is true for the metafunction seqan3::alphabet_size vs the static data member value_size.

Members are provided for convenience and if you are an application developer who works with a single concrete alphabet type you are fine with using the member functions. If you, however, implement a generic function that accepts different alphabet types, you need to use the free function / metafunction interface, because it is the only interface guaranteed to exist (member functions are not required/enforced by the concept).
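The difference can be sketched with two hypothetical letter types, one with a convenience member and one without. The namespace `toy` and all names in it are made up for illustration; in real code you would call the seqan3:: free functions instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

namespace toy // hypothetical namespace, not part of SeqAn
{
// A letter type that offers a member function for convenience.
struct with_member
{
    std::uint8_t value = 0;
    std::uint8_t to_rank() const noexcept { return value; }
};
inline std::uint8_t to_rank(with_member const l) noexcept { return l.to_rank(); }

// A letter type that offers only the free function.
struct free_only
{
    std::uint8_t value = 0;
};
inline std::uint8_t to_rank(free_only const l) noexcept { return l.value; }
} // namespace toy

// Generic code relies on the free function only, so it works for both types.
// Writing l.to_rank() here would fail to compile for toy::free_only.
template <typename letter_t>
std::size_t rank_sum(std::vector<letter_t> const & letters)
{
    std::size_t sum = 0;
    for (letter_t const & l : letters)
        sum += to_rank(l); // unqualified call, found via argument-dependent lookup
    return sum;
}
```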

Containers over alphabets

In SeqAn3 it is recommended that you use the STL container classes like std::vector for storing sequence data, but you can use other class templates if they satisfy the respective seqan3::container_concept, e.g. std::deque or folly::fbvector or even Qt's QVector.

std::basic_string is also supported, however, we recommend against using it, because it is not safe (and not useful) to call certain members like .c_str() if our alphabets are used as value type.
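When printable text is needed, one option is to keep the sequence in a std::vector and convert it explicitly through the char interface. The sketch below assumes a hypothetical `toy_letter` type and a hypothetical helper `sequence_to_text`; neither is part of SeqAn.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical four-letter alphabet that stores only its rank.
struct toy_letter
{
    std::uint8_t rank;
};

inline char to_char(toy_letter const l) noexcept
{
    return "ACGT"[l.rank & 3]; // mask keeps the index inside the table
}

// Builds a std::string by converting every letter via to_char.
template <typename letter_t>
std::string sequence_to_text(std::vector<letter_t> const & sequence)
{
    std::string text;
    text.reserve(sequence.size());
    for (letter_t const & l : sequence)
        text.push_back(to_char(l));
    return text;
}
```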

We provide specialised containers with certain properties in the Range module.

Typedef Documentation

◆ underlying_char_t

template<typename alphabet_type >
using underlying_char_t = typename underlying_char<alphabet_type>::type
related

The char_type of the alphabet. [type metafunction shortcut].

Attention
Do not specialise this shortcut, instead specialise seqan3::underlying_char.

◆ underlying_rank_t

template<typename semi_alphabet_type >
using underlying_rank_t = typename underlying_rank<semi_alphabet_type>::type
related

The rank_type of the semi_alphabet. [type metafunction shortcut].

Attention
Do not specialise this shortcut, instead specialise seqan3::underlying_rank.
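The pattern behind this advice can be sketched as follows. In C++ an alias template cannot be specialised at all, so a custom type has to hook into the struct template. The namespace `toy` and the type `my_letter` are hypothetical, and the empty base template is an assumption about the design rather than SeqAn's actual definition.

```cpp
#include <cstdint>
#include <type_traits>

namespace toy // hypothetical namespace mirroring the names used above
{
// Base template of the type metafunction (left empty in this sketch).
template <typename semi_alphabet_type>
struct underlying_rank
{};

// Shortcut: an alias that only forwards to the metafunction.
template <typename semi_alphabet_type>
using underlying_rank_t = typename underlying_rank<semi_alphabet_type>::type;

// A custom letter type with a 16-bit rank.
struct my_letter
{
    std::uint16_t value;
};

// Specialise the metafunction; the shortcut picks this up automatically.
template <>
struct underlying_rank<my_letter>
{
    using type = std::uint16_t;
};
} // namespace toy

static_assert(std::is_same<toy::underlying_rank_t<toy::my_letter>, std::uint16_t>::value,
              "the shortcut resolves through the specialised metafunction");
```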

Function Documentation

◆ assign_char()

alphabet_concept && assign_char ( alphabet_concept &&  alph,
char_type const  chr 
)
related

Assigns a character to an alphabet letter.

Parameters
alph: The alphabet letter that you wish to assign to.
chr: The char you wish to assign.
Returns
A reference to alph or a temporary if alph was a temporary.
Attention
This is a concept requirement, not an actual function (however types satisfying this concept will provide an implementation).

◆ assign_rank()

semi_alphabet_concept && assign_rank ( semi_alphabet_concept &&  alph,
rank_type const  rank 
)
related

Assigns a rank to an alphabet letter.

Parameters
alph: The alphabet letter that you wish to assign to.
rank: The rank you wish to assign.
Returns
A reference to alph or a temporary if alph was a temporary.
Attention
This is a concept requirement, not an actual function (however types satisfying this concept will provide an implementation).
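One way to obtain the documented return behaviour ("a reference to alph or a temporary if alph was a temporary") is a pair of overloads. This is only a sketch with a hypothetical `toy_letter` type; the signature above uses a single forwarding-style declaration, and the actual implementations in SeqAn may differ.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical letter type that stores only its rank.
struct toy_letter
{
    std::uint8_t rank;
};

// Lvalue overload: modifies the object in place and returns a reference to it.
inline toy_letter & assign_rank(toy_letter & alph, std::uint8_t const rank) noexcept
{
    alph.rank = rank;
    return alph;
}

// Rvalue overload: modifies the temporary and hands it back by value.
inline toy_letter assign_rank(toy_letter && alph, std::uint8_t const rank) noexcept
{
    alph.rank = rank;
    return std::move(alph);
}
```

The rvalue overload makes expressions such as `assign_rank(toy_letter{0}, 3)` usable as initialisers without leaving a dangling reference.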

◆ operator()() [1/2]

template<seqan3::semi_alphabet_concept alphabet_t>
size_t std::hash< alphabet_t >::operator() ( alphabet_t const  character) const
inline noexcept

Compute the hash for a character.

Parameters
[in] character: The character to process. Must model seqan3::semi_alphabet_concept.
Returns
size_t.
See also
seqan3::to_rank.

◆ operator()() [2/2]

template<ranges::InputRange urng_t>
size_t std::hash< urng_t >::operator() ( urng_t const &  range) const
inline noexcept

Compute the hash for a range of characters.

Parameters
[in] range: The input range to process. Must model std::ranges::InputRange and the reference type of the range must model seqan3::semi_alphabet_concept.
Returns
size_t.
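This page does not state how the two hash overloads compute their result beyond the reference to seqan3::to_rank. The sketch below is therefore an assumption: it hashes a letter to its rank and folds a range into a base-4 number. `toy_dna4` and `hash_range` are hypothetical names, not SeqAn API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical four-letter alphabet that stores only its rank.
struct toy_dna4
{
    std::uint8_t rank;
};

inline std::uint8_t to_rank(toy_dna4 const l) noexcept { return l.rank; }

namespace std
{
// Hash of a single letter: simply its rank (assumption for this sketch).
template <>
struct hash<toy_dna4>
{
    size_t operator()(toy_dna4 const character) const noexcept
    {
        return to_rank(character);
    }
};
} // namespace std

// Hash of a range: interpret the ranks as digits of a base-4 number.
// Long ranges wrap around on overflow, which is acceptable for a hash value.
inline std::size_t hash_range(std::vector<toy_dna4> const & range) noexcept
{
    std::hash<toy_dna4> const letter_hash{};
    std::size_t result = 0;
    for (toy_dna4 const l : range)
        result = result * 4 + letter_hash(l);
    return result;
}
```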

◆ to_char()

char_type to_char ( alphabet_concept const  alph)
related

Returns the alphabet letter's value in character representation.

Parameters
alph: The alphabet letter that you wish to convert to char.
Returns
The letter's value in the alphabet's char type (seqan3::underlying_char).
Attention
This is a concept requirement, not an actual function (however types satisfying this concept will provide an implementation).

◆ to_rank()

rank_type to_rank ( semi_alphabet_concept const  alph)
related

Returns the alphabet letter's value in rank representation.

Parameters
alph: The alphabet letter that you wish to convert to rank.
Returns
The letter's value in the alphabet's rank type (seqan3::underlying_rank).
Attention
This is a concept requirement, not an actual function (however types satisfying this concept will provide an implementation).

Variable Documentation

◆ alphabet_size_v

template<typename alphabet_type >
constexpr auto alphabet_size_v = alphabet_size<alphabet_type>::value
related

The size of the alphabet. [value metafunction shortcut].

Template Parameters
alphabet_type: Must satisfy seqan3::semi_alphabet_concept.
Attention
Do not specialise this shortcut, instead specialise seqan3::alphabet_size.

◆ max_pseudoknot_depth_v

template<typename alphabet_type >
constexpr uint8_t max_pseudoknot_depth_v = max_pseudoknot_depth<alphabet_type>::value
related

The pseudoknot ability of the alphabet. [value metafunction shortcut].

Attention
Do not specialise this shortcut, instead specialise seqan3::max_pseudoknot_depth.