Library for the Development and Use of Phylogenetic Network Methods
Multiple Sequence Alignment parsing, storage, grouping, and distance computation.
An individual sequence record. A sequence record is defined by (1) the data sequence, (2) a name or string identifier, and (3) optionally a group ID number in [0, inf).
Initialize a Sequence Record
| Parameter | Type | Description |
|---|---|---|
| sequence | list | a sequence of some type of biological data |
| name | str | some name or label |
| gid | int, optional | a group id number. Defaults to -1. This signals that the sequence does not belong to any group. |
Get the name of the sequence.
This getter returns the sequence, as parsed from any sort of file. The likely type is a list[str] or a list of characters/strings.
This getter returns the sequence as a list[int]. If the sequence is not already a list[int], each character is translated into an integer. Any character that cannot be mapped as a hexadecimal digit is skipped.
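The translation described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `to_numeric` is hypothetical, and it assumes that "mappable in hexadecimal" means the character parses as a base-16 digit.

```python
def to_numeric(sequence):
    """Translate each character to an int via base 16, skipping unmappable ones.

    Hypothetical helper for illustration only.
    """
    numeric = []
    for item in sequence:
        if isinstance(item, int):
            # Already numeric: keep as-is.
            numeric.append(item)
            continue
        try:
            numeric.append(int(item, 16))
        except ValueError:
            # Not a hexadecimal digit (for example '?' or '-'): skip it.
            continue
    return numeric


print(to_numeric(list("01a?2")))
```

Note that skipping characters shortens the numeric sequence relative to the original, so positions no longer line up one-to-one when unmappable characters are present.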
Get the group id for this sequence
Set the ploidy of a data sequence. Only applicable to Bimarker data; setting this value for other data types has no consequence.
| Parameter | Type | Description |
|---|---|---|
| ploidyness | int | the level of ploidy for a data sequence |
Get the ploidy for this sequence. Only relevant for Bimarker data.
Define the length of a DataSequence to be the length of the sequence
Calculate the distance between two DataSequence objects. The distance is the number of positions at which the two sequences differ. If the sequences have different lengths, the distance is the number of differing positions over the length of the shorter sequence, plus the difference in length.
| Parameter | Type | Description |
|---|---|---|
| seq2 | DataSequence | The second sequence to compare. |
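The distance rule above can be sketched on plain Python sequences. The function name `sequence_distance` is hypothetical and operates on strings or lists rather than DataSequence objects; it follows the description in this section, not necessarily the library's exact code.

```python
def sequence_distance(seq1, seq2):
    """Count mismatches over the shared prefix, plus the difference in length."""
    # zip stops at the end of the shorter sequence.
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches + abs(len(seq1) - len(seq2))


print(sequence_distance("ACGT", "ACGA"))  # one differing position
print(sequence_distance("ACGT", "AGG"))   # one mismatch plus one extra position
```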
Class that provides all packaging and functionality services to do with Multiple Sequence Alignments. This class stores all data and metadata about a sequence alignment, and can handle file I/O from nexus files that contain a matrix data block. If there is a grouping that applies to a set of sequences, it can be defined here.
Initialize an MSA object. Either a filename or a list of DataSequence objects can be provided. If a filename is provided, the MSA will be parsed from the file. If a list of DataSequence objects is provided, the MSA will be constructed from those objects. If grouping is provided, the sequences will be grouped accordingly.
| Parameter | Type | Description |
|---|---|---|
| filename | Union[str, None], optional | A filename to a nexus file that contains a matrix data block. Defaults to None. |
| data | Union[list[DataSequence], None], optional | A list of DataSequence objects. Defaults to None. |
| grouping | Union[dict[DataSequence, int], None], optional | A grouping map from DataSequence objects to group IDs. Defaults to None. |
| grouping_auto_detect | bool, optional | If set to True, the MSA will attempt to group sequences based on sequence name similarity. Defaults to False. |
Add a DataSequence object to the MSA. If the DataSequence object has a group ID, it will be added to the appropriate group. If the DataSequence object does not have a group ID, it will be added to the MSA as a new group.
| Parameter | Type | Description |
|---|---|---|
| data_seq | DataSequence | A DataSequence object to add to the MSA |
Take all DataSequence objects placed inside the MSA and, for each one whose group ID is -1 (indicating that it does not belong to any grouping), attempt to use autogrouping to pair it with other similar data. After calling this function, no DataSequence object will have a gid of -1. If autogrouping is not possible, the DataSequence will be placed in a group of its own.
Retrieve all sequences that are in this alignment.
Take a filename and grab the sequences and put them into DataSequence objects. If a grouping is defined (in the case of SNPs), group IDs will be assigned to each DataSequence for ease of counting red alleles.
Get the number of groups in the MSA.
Get the set of DataSequences that belong to a given group ID.
| Parameter | Type | Description |
|---|---|---|
| gid | int | group id |
Get the group ID for a given sequence name. If the sequence name is not found in the grouping map, raise a KeyError.
| Parameter | Type | Description |
|---|---|---|
| name | str | The name of the sequence to get the group ID for. |
KeyError: If the sequence name is not found in the grouping map.
If no grouping of sequences is provided, but a grouping is still desired, group the sequences by name/label string "likeness". Note: this is not guaranteed to group sequences properly if the labels do not follow some general pattern. For example, human1 human2 human3 gorilla1 gorilla2 chimp1 is groupable; xh1 jp0 an2 am3 is less groupable.
Retrieves the sequence that belongs to this MSA that has a given name
| Parameter | Type | Description |
|---|---|---|
| name | str | The taxa/label name of the sequence. Must match exactly (same case, spacing, etc) |
For each record, sum the ploidy values to obtain the total number of allele samples. If a record's ploidy is not set (-1), it is treated as 1 sample.
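As a rough sketch of this accumulation, assuming each record's ploidy is available as a plain integer (the function name `total_samples` and the constant `UNSET_PLOIDY` are hypothetical):

```python
UNSET_PLOIDY = -1  # sentinel described above for "ploidy not set"


def total_samples(ploidies):
    """Sum ploidy values, counting an unset ploidy as a single sample."""
    return sum(1 if p == UNSET_PLOIDY else p for p in ploidies)


# Four records: diploid, unset, haploid, diploid.
print(total_samples([2, -1, 1, 2]))
```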
Return the number of samples within a given group.
| Parameter | Type | Description |
|---|---|---|
| gid | int | group id |
Sets the ploidy of each group of sequences in the MSA. If sequence_ploidy is provided, it should be a list of numbers >= 1 in which each index corresponds to a group ID. For example, [1,2,1] indicates that group 0 has ploidy 1, group 1 has ploidy 2, and group 2 has ploidy 1. If sequence_ploidy is not given, the ploidy is set to the maximum SNP data point found in the sequence. For a SNP sequence of 010120022202, the ploidy is 2. NOTE: If sequence_ploidy is not given, the ploidy values of all records within a group are assumed to be identical!
| Parameter | Type | Description |
|---|---|---|
| sequence_ploidy | list[int], optional | implicit mapping from group ids (index) to the ploidy of that sequence, or set of sequences. Defaults to None. |
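Both ways of determining ploidy can be illustrated on plain data. The helper names `infer_ploidy` and `ploidy_for_group` are hypothetical, and the sketch assumes a SNP sequence is a string (or list) of single-digit characters.

```python
def infer_ploidy(snp_sequence):
    """Infer ploidy as the maximum SNP data point in the sequence."""
    return max(int(ch) for ch in snp_sequence)


def ploidy_for_group(gid, sequence_ploidy):
    """Look up ploidy by group ID; the list index is the group ID."""
    return sequence_ploidy[gid]


print(infer_ploidy("010120022202"))      # example sequence from the text
print(ploidy_for_group(1, [1, 2, 1]))    # group 1 in the example mapping
```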
Return the dimensions of the MSA. The number of rows (first index) equals the number of DataSequence objects, and the number of columns (second index) equals the length of each DataSequence (all sequences should have the same length).
Using the distance helper, calculates pairwise distances for each pair of (different) DataSequences in this MSA.
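A minimal sketch of the pairwise computation, assuming sequences are held in a dict from name to sequence and that results are keyed by name pairs. Both the container and the return shape are assumptions made for illustration; the library's actual return type is not stated here.

```python
from itertools import combinations


def sequence_distance(seq1, seq2):
    """Mismatches over the shared prefix, plus the difference in length."""
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches + abs(len(seq1) - len(seq2))


def pairwise_distances(named_sequences):
    """Map each unordered pair of distinct names to the distance between them."""
    return {
        (n1, n2): sequence_distance(named_sequences[n1], named_sequences[n2])
        for n1, n2 in combinations(named_sequences, 2)
    }


distances = pairwise_distances({"a": "ACGT", "b": "ACGA", "c": "AC"})
for pair, dist in distances.items():
    print(pair, dist)
```

Using `combinations` visits each unordered pair once, so a sequence is never compared with itself and no pair is computed twice.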
Iterate over the DataSequence objects in this MSA.
Compute the similarity ratio between two strings.
| Parameter | Type | Description |
|---|---|---|
| a | str | first string |
| b | str | second string |
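One common way to compute a similarity ratio between two strings is `difflib.SequenceMatcher` from the Python standard library. Whether this module uses it is an assumption; the sketch below only shows what such a ratio looks like, with the hypothetical name `similarity_ratio`.

```python
from difflib import SequenceMatcher


def similarity_ratio(a, b):
    """Return a value in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


print(similarity_ratio("human1", "human2"))
print(similarity_ratio("human1", "gorilla1"))
```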
Group a list of strings by similarity.
| Parameter | Type | Description |
|---|---|---|
| data | list[str] | a list of strings |
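Grouping by name similarity can be sketched with a simple greedy pass: each string joins the first group whose first member is similar enough, or starts a new group otherwise. The greedy strategy, the `threshold` value of 0.8, the use of `difflib`, and the name `group_by_similarity` are all assumptions for illustration and may differ from the library's approach.

```python
from difflib import SequenceMatcher


def group_by_similarity(data, threshold=0.8):
    """Greedily group strings whose ratio to a group's first member meets threshold."""
    groups = []
    for item in data:
        for group in groups:
            # Compare against the group's first member as its representative.
            if SequenceMatcher(None, group[0], item).ratio() >= threshold:
                group.append(item)
                break
        else:
            # No sufficiently similar group found: start a new one.
            groups.append([item])
    return groups


labels = ["human1", "human2", "human3", "gorilla1", "gorilla2", "chimp1"]
for group in group_by_similarity(labels):
    print(group)
```

Labels that share a long common prefix cluster together under this scheme, while labels with no common pattern each end up in a group of their own.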