SEGUID checksums for linear, circular, single- and double-stranded sequences

Functions for calculating sequence checksums for linear, circular, single- and double-stranded sequences based on either the original SEGUID (SEGUID v1) algorithm (Babnigg & Giometti, 2006) or the SEGUID v2 algorithm (Pereira et al., 2024).

Usage

seguid(seq, alphabet = "{DNA}", form = c("long", "short", "both"))

lsseguid(seq, alphabet = "{DNA}", form = c("long", "short", "both"))

csseguid(seq, alphabet = "{DNA}", form = c("long", "short", "both"))

ldseguid(watson, crick, alphabet = "{DNA}", form = c("long", "short", "both"))

cdseguid(watson, crick, alphabet = "{DNA}", form = c("long", "short", "both"))

Arguments

seq: (character string) The sequence for which the checksum should be calculated. The sequence may only comprise symbols in the alphabet specified by the alphabet argument.
alphabet: (character string) The type of sequence used. If "{DNA}" (default), then the input is a DNA sequence. If "{RNA}", then the input is an RNA sequence. If "{protein}", then the input is an amino-acid sequence. If "{DNA-extended}" or "{RNA-extended}", then the input is a DNA or RNA sequence specified with an extended set of symbols, including IUPAC symbols (4). If "{protein-extended}", then the input is an amino-acid sequence with an extended set of symbols, including IUPAC symbols (5). A custom alphabet may also be used. A non-complementary alphabet is specified as a comma-separated set of single symbols, e.g. "X,Y,Z". A complementary alphabet is specified as a comma-separated set of paired symbols, e.g. "AT,CG". It is also possible to extend a pre-defined alphabet, e.g. "{DNA},XY".
form: (character string) How the checksum is presented. If "long" (default), the full-length checksum is returned. If "short", the short, six-character checksum is returned. If "both", both the short and the long checksums are returned.
watson, crick: (character strings) Two reverse-complementary DNA sequences. Both sequences should be specified in the 5'-to-3' direction.
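
As an illustration of what "reverse-complementary" and "5'-to-3'" mean for the watson and crick arguments, the following base-R sketch derives the crick strand of a blunt DNA duplex from its watson strand. The helper revcomp() is hypothetical and not part of the documented API; it assumes a plain DNA string over the symbols A, C, G and T.

```r
# Hypothetical helper (not part of the documented API): reverse-complement
# a plain DNA string over the symbols A, C, G and T.
revcomp <- function(x) {
  complemented <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

watson <- "GATTACA"
crick <- revcomp(watson)
print(crick)
#> [1] "TGTAATC"
```

This is the same watson/crick pair that the blunt double-stranded examples on this page pass to ldseguid() and cdseguid().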

Value

The SEGUID functions return a single character string if form is either "long" or "short". If form is "both", then a character vector of length two is returned, where the first component holds the "short" checksum and the second the "long" checksum. The long checksum, without the prefix, is a string of 27 characters. The short checksum, without the prefix, is the first six characters of the long checksum. All checksums are prefixed with a label indicating which SEGUID method was used. Except for seguid(), which uses base64 encoding, all functions produce checksums using the base64url encoding ("Base 64 Encoding with URL and Filename Safe Alphabet").
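
The relationship between the prefix, the long checksum, and the short checksum can be checked with base R alone. The sketch below takes the long checksum shown for lsseguid("GATTACA") in the Examples and splits it by hand; the variable names are illustrative, and no SEGUID function is called.

```r
# Long checksum copied from the lsseguid("GATTACA") example on this page
long <- "lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8"

# Split at the first "=" into the method label and the checksum proper
prefix <- sub("=.*$", "", long)
checksum <- sub("^[^=]*=", "", long)

print(prefix)
#> [1] "lsseguid"
print(nchar(checksum))
#> [1] 27
print(substr(checksum, 1, 6))
#> [1] "tp2jze"
```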

seguid() calculates the SEGUID v1 checksum for a linear, single-stranded sequence.

lsseguid() calculates the SEGUID v2 checksum for a linear, single-stranded sequence.

csseguid() calculates the SEGUID v2 checksum for a circular, single-stranded sequence.

ldseguid() calculates the SEGUID v2 checksum for a linear, double-stranded sequence.

cdseguid() calculates the SEGUID v2 checksum for a circular, double-stranded sequence.
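
For the circular variants, the Examples treat every rotation of a sequence as the same circular molecule (GATTACA = ATTACAG = ... = AGATTAC). The following base-R sketch only enumerates those rotations; it does not show how csseguid() or cdseguid() is implemented. The helper rotations() is hypothetical.

```r
# Hypothetical helper: all rotations of a string, starting with the
# string itself and shifting the start position one symbol at a time.
rotations <- function(x) {
  n <- nchar(x)
  vapply(seq_len(n), function(i) {
    # substr() returns "" when the stop index is below the start index
    paste0(substr(x, i, n), substr(x, 1, i - 1))
  }, character(1))
}

print(rotations("GATTACA"))
#> [1] "GATTACA" "ATTACAG" "TTACAGA" "TACAGAT" "ACAGATT" "CAGATTA" "AGATTAC"
```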

Base64 and Base64url encodings

The base64url encoding is the base64 encoding with non-URL-safe characters substituted with URL-safe ones. Specifically, the plus symbol (+) is replaced by the minus symbol (-), and the forward slash (/) is replaced by the underscore symbol (_).
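
The substitution described above is a plain character-for-character translation, which base R's chartr() can express directly. The helper name to_base64url() is made up for this sketch. The input is the checksum part of the seguid("GATTACA") example, and the result matches the checksum part of the lsseguid("GATTACA") example on this page.

```r
# Hypothetical helper: map the two non-URL-safe base64 symbols to their
# base64url counterparts ("+" becomes "-", "/" becomes "_").
to_base64url <- function(x) chartr("+/", "-_", x)

v1 <- "tp2jzeCM2e3W4yxtrrx09CMKa/8"
print(to_base64url(v1))
#> [1] "tp2jzeCM2e3W4yxtrrx09CMKa_8"
```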

The Base64 checksum, which is used for the original SEGUID checksum, is not guaranteed to comprise only symbols that can safely be used as-is in a Uniform Resource Locator (URL). Specifically, it may contain forward slashes (/) and plus symbols (+), which are characters that carry special meaning in a URL. For the same reason, a Base64 checksum cannot safely be used as a file or directory name, because it may contain a forward slash.

The long checksum, without the prefix, is always 27 characters long. This is because the SHA-1 hash (6) is 160 bits long (20 bytes), so its Base64 representation always ends with a single padding character (=) that brings the length to 28 characters, a multiple of four. We relax this requirement by dropping the padding character.
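
The length arithmetic can be spelled out in a few lines of base R. Base64 encodes each group of three bytes as four characters, six bits per character, and pads the last group with "=". The variable names below are illustrative only.

```r
hash_bits <- 160
hash_bytes <- hash_bits / 8

# Padded Base64 length: four characters per (possibly partial) 3-byte group
padded_chars <- 4 * ceiling(hash_bytes / 3)

# Characters that actually carry data: one per started 6-bit group
data_chars <- ceiling(hash_bits / 6)

print(c(bytes = hash_bytes, padded = padded_chars, unpadded = data_chars))
#>    bytes   padded unpadded 
#>       20       28       27 
```

The difference of one character is the single "=" that is dropped.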

References

1. G Babnigg & CS Giometti, A database of unique protein sequence identifiers for proteome studies. Proteomics. 2006 Aug;6(16):4514-22, doi:10.1002/pmic.200600032.
2. H Pereira, PC Silva, WM Davis, L Abraham, G Babnigg, H Bengtsson & B Johansson, SEGUID v2: Extending SEGUID Checksums for Circular, Linear, Single- and Double-Stranded Biological Sequences, bioRxiv, doi:10.1101/2024.02.28.582384.
3. S Josefsson, The Base16, Base32, and Base64 Data Encodings, RFC 4648, October 2006, doi:10.17487/RFC4648.
4. Wikipedia article 'Nucleic acid notation', February 2024, https://en.wikipedia.org/wiki/Nucleic_acid_notation.
5. Wikipedia article 'Amino acids', February 2024, https://en.wikipedia.org/wiki/Amino_acid.
6. Wikipedia article 'SHA-1' (Secure Hash Algorithm 1), December 2023, https://en.wikipedia.org/wiki/SHA-1.

Examples

## SEGUID v1 on linear single-stranded DNA
seguid("GATTACA")
#> [1] "seguid=tp2jzeCM2e3W4yxtrrx09CMKa/8"
#> seguid=tp2jzeCM2e3W4yxtrrx09CMKa/8

## SEGUID v2 on linear single-stranded DNA
lsseguid("GATTACA")
#> [1] "lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8"
#> lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8

## SEGUID v2 on circular single-stranded DNA
## GATTACA = ATTACAG = ... = AGATTAC
csseguid("GATTACA")
#> [1] "csseguid=mtrvbtuwr6_MoBxvtm4BEpv-jKQ"
#> csseguid=mtrvbtuwr6_MoBxvtm4BEpv-jKQ

## SEGUID v2 on blunt, linear double-stranded DNA
##   GATTACA
##   CTAATGT
ldseguid("GATTACA", "TGTAATC")
#> [1] "ldseguid=zDq4dp6G3AIsldhPDDL8S5A0BKk"
#> ldseguid=AcRsEcNFrui5wCxI7xxo6wnDYPY

## SEGUID v2 on staggered, linear double-stranded DNA
##   -ATTACA
##   CTAAT--
ldseguid("-ATTACA", "--TAATC")
#> [1] "ldseguid=kkTcyGa7Z4DmjB49IzmJ2yMXeIQ"
#> ldseguid=98Klwxd3ZQPGHqnH3BheIuZVHQQ

## SEGUID v2 on circular double-stranded DNA
## GATTACA = ATTACAG = ... = AGATTAC
## CTAATGT = TAATGTC = ... = TCTAATG
cdseguid("GATTACA", "TGTAATC")
#> [1] "cdseguid=z7GBDOjQuqwVpDiiC_CEJkmOKZo"
#> cdseguid=zCuq031K3_-40pArbl-Y4N9RLnA

## SEGUID v2 on linear single-stranded expanded
## epigenetic sequence (Viner et al., 2024)
viner_DNA <- "{DNA},m1,h2,f3,c4"
lsseguid("AmT2C", alphabet = viner_DNA)
#> [1] "lsseguid=MW4Rh3lGY2mhwteaSKh1-Kn2fGA"
#> lsseguid=MW4Rh3lGY2mhwteaSKh1-Kn2fGA

## SEGUID v2 on linear double-stranded expanded
## epigenetic sequence (Viner et al., 2024)
ldseguid("AmT2C", "GhA1T", alphabet = viner_DNA)
#> [1] "ldseguid=bFZedILTms4ORUi3SSMfU0FUl7Q"
#> ldseguid=rsPDjP4SWr3-ploCeXTdTA80u0Y