PhyNetPy Documentation

Library for the Development and Use of Phylogenetic Network Methods

MSA Module v1.0.0

Multiple Sequence Alignment parsing, storage, grouping, and distance computation.

Author:
Mark Kessler
Last Edit:
3/11/25
Source:
MSA.py

DataSequence

class DataSequence

An individual sequence record. A record consists of (1) the data sequence, (2) a name (string identifier), and (3) optionally a group ID, an integer in [0, inf).

Constructor

__init__(sequence: list, name: str, gid: int = -1) -> None

Initialize a Sequence Record

Parameter Type Description
sequence list a sequence of some type of biological data
name str some name or label
gid int, optional a group id number. Defaults to -1. This signals that the sequence does not belong to any group.

Methods

get_name -> str

Get the name of the sequence.

Returns: str: sequence label
get_seq -> list[object]

Get the sequence as parsed from the input file. The elements are most likely strings (single characters), i.e. a list[str].

Returns: list[object]: A list of data (of some type, commonly a string)
get_numerical_seq -> list[int]

Get the sequence as a list of integers. If the sequence is not already a list[int], each character is translated into an integer. Any character that cannot be mapped in hexadecimal is skipped.

Returns: list[int]: an integer data sequence, in hexadecimal.
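
The translation described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name to_numerical is hypothetical, and it assumes that "mappable in hexadecimal" means the symbol parses as a base-16 number.

```python
def to_numerical(seq: list) -> list[int]:
    """Translate a sequence into integers, reading each symbol as hexadecimal.

    Hypothetical helper; assumes a symbol is "mappable" when int(symbol, 16)
    succeeds.
    """
    numeric = []
    for item in seq:
        if isinstance(item, int):
            numeric.append(item)  # already numeric, keep as-is
            continue
        try:
            numeric.append(int(str(item), 16))
        except ValueError:
            continue  # not a hexadecimal symbol: skip it
    return numeric


example = to_numerical(["0", "1", "a", "F", "?", "2"])
```

Here "?" is dropped, so the result is shorter than the input. Callers that rely on column positions may want to check the length afterwards.
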
get_gid -> int

Get the group id for this sequence

Returns: int: group id.
set_ploidy(ploidyness: int) -> None

Set the ploidy of a data sequence. Only applicable to bimarker data, but setting this value for other data is harmless.

Parameter Type Description
ploidyness int the level of ploidy for a data sequence
ploidy -> int

Get the ploidy for this sequence. Only relevant for bimarker data.

Returns: int: ploidy value.
__len__ -> int

Define the length of a DataSequence to be the length of the sequence

Returns: int: the length of the underlying sequence.
distance(seq2: DataSequence) -> float

Calculate the distance between two DataSequence objects as the number of positions at which they differ. If the sequences have different lengths, positions are compared over the length of the shorter sequence, and the difference in length is added to the count.

Parameter Type Description
seq2 DataSequence The second sequence to compare.
Returns: float: The distance between the two sequences.
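
As a rough sketch of the rule described above, the following standalone function operates on plain lists rather than DataSequence objects. The name sequence_distance is hypothetical and the code is an illustration of the documented behavior, not the library source.

```python
def sequence_distance(seq1: list, seq2: list) -> float:
    """Count mismatches over the shared prefix, then add the length difference."""
    # zip stops at the end of the shorter sequence
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    length_gap = abs(len(seq1) - len(seq2))
    return float(mismatches + length_gap)


same_length = sequence_distance(list("ACGT"), list("ACCT"))
different_length = sequence_distance(list("ACGT"), list("AG"))
```

In the second call the shared prefix "AC" versus "AG" contributes one mismatch and the two extra positions contribute two more.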

MSA

class MSA(Iterable[DataSequence])

Packages a multiple sequence alignment together with the operations on it. This class stores all data and metadata about a sequence alignment, and can handle file I/O from nexus files that contain a matrix data block. A grouping that applies to a set of sequences can also be defined here.

Constructor

__init__(filename: Union[str, None] = None, data: Union[list[DataSequence], None] = None, grouping: Union[dict[DataSequence, int], None] = None, grouping_auto_detect: bool = False) -> None

Initialize an MSA object. Either a filename or a list of DataSequence objects can be provided. If a filename is provided, the MSA will be parsed from the file. If a list of DataSequence objects is provided, the MSA will be constructed from those objects. If grouping is provided, the sequences will be grouped accordingly.

Parameter Type Description
filename Union[str, None], optional A filename to a nexus file that contains a matrix data block. Defaults to None.
data Union[list[DataSequence], None], optional A list of DataSequence objects. Defaults to None.
grouping Union[dict[DataSequence, int], None], optional A grouping map from DataSequence objects to group IDs. Defaults to None.
grouping_auto_detect bool, optional If set to True, the MSA will attempt to group sequences based on sequence name similarity. Defaults to False.

Methods

add_data(data_seq: DataSequence) -> None

Add a DataSequence object to the MSA. If the DataSequence object has a group ID, it will be added to the appropriate group. If the DataSequence object does not have a group ID, it will be added to the MSA as a new group.

Parameter Type Description
data_seq DataSequence A DataSequence object to add to the MSA
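
The grouping behavior of add_data can be pictured with a small sketch over a plain dictionary. Everything here is an assumption for illustration: the helper add_to_groups is hypothetical, sequences are represented by their names, and a new group is given the next unused integer ID, which the documentation does not specify.

```python
def add_to_groups(groups: dict[int, list[str]], name: str, gid: int = -1) -> int:
    """Place a sequence name into its group, creating a new group when gid is -1."""
    if gid == -1:
        # Assumption: a fresh group gets the next unused integer ID
        gid = max(groups, default=-1) + 1
    groups.setdefault(gid, []).append(name)
    return gid


groups: dict[int, list[str]] = {}
add_to_groups(groups, "human1", gid=0)
add_to_groups(groups, "human2", gid=0)
new_gid = add_to_groups(groups, "chimp1")  # no group ID given
```
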
retroactive_group -> None

For every DataSequence in the MSA whose group ID is -1 (indicating that it does not belong to any grouping), attempt to use autogrouping to pair it with similar data. If autogrouping is not possible, the DataSequence is placed in a group of its own. After this function is called, no DataSequence object has a gid of -1.

get_records -> list[DataSequence]

Retrieve all sequences that are in this alignment.

Returns: list[DataSequence]: list of all sequence records.
parse -> list[DataSequence]

Read the sequences from the file and store them as DataSequence objects. If a grouping is defined (as in the case of SNPs), a group ID is assigned to each DataSequence for ease of counting red alleles.

Returns: list[DataSequence]: a list of DataSequence objects
num_groups -> int

Get the number of groups in the MSA.

Returns: int: the number of groups in the MSA.
group_given_id(gid) -> list[DataSequence]

Get the set of DataSequences that belong to a given group ID.

Parameter Type Description
gid int group id
Returns: list[DataSequence]: the set (as a list) of DataSequences that have a given gid
get_category(name: str) -> int

Get the group ID for a given sequence name. If the sequence name is not found in the grouping map, raise a KeyError.

Parameter Type Description
name str The name of the sequence to get the group ID for.
Returns: int: The group ID for the sequence.
Raises: KeyError: If the sequence name is not found in the grouping map.
group_auto_detect -> dict[int, str]

If no grouping of sequences is provided but a grouping is still desired, group the sequences by the "likeness" of their name/label strings. Note: grouping is not guaranteed to be correct if the sequence labels do not follow some general pattern. For example, human1 human2 human3 gorilla1 gorilla2 chimp1 is group-able; xh1 jp0 an2 am3 is less group-able.

Returns: dict[int, str]: a grouping map from group IDs to sequence names
seq_by_name(name: str) -> DataSequence

Retrieve the sequence in this MSA that has the given name.

Parameter Type Description
name str The taxa/label name of the sequence. Must match exactly (same case, spacing, etc)
Returns: DataSequence: the sequence with the label 'name'
total_samples -> int

Sum the ploidy over all records to get the total number of allele samples. A record whose ploidy is not set (-1) is counted as 1 sample.

Returns: int: the total number of samples
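
The counting rule can be sketched on a plain list of ploidy values. The function name count_samples is hypothetical, and the list stands in for the ploidy of each record in the alignment.

```python
def count_samples(ploidies: list[int]) -> int:
    """Sum ploidy values, counting an unset ploidy (-1) as a single sample."""
    return sum(1 if p == -1 else p for p in ploidies)


# Two diploid records, one record with unset ploidy, one haploid record
total = count_samples([2, 2, -1, 1])
```
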
samples_given_group(gid: int) -> int

Return the number of samples within a given group.

Parameter Type Description
gid int group id
Returns: int: total samples within the group defined by 'gid'
set_sequence_ploidy(sequence_ploidy: list[int] = None) -> None

Set the ploidy of each group of sequences in the MSA. If sequence_ploidy is provided, it should be a list of integers >= 1 in which each index corresponds to a group ID. For example, [1,2,1] indicates that group 0 has ploidy 1, group 1 has ploidy 2, and group 2 has ploidy 1. If sequence_ploidy is not given, the ploidy is set to the maximum SNP data point found in the sequence; for the SNP sequence 010120022202, the ploidy is 2. NOTE: When sequence_ploidy is not given, the ploidy values of all records within a group are assumed to be identical.

Parameter Type Description
sequence_ploidy list[int], optional implicit mapping from group ids (index) to the ploidy of that sequence, or set of sequences. Defaults to None.
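
Both ways of determining ploidy can be illustrated with a short sketch. The helper infer_ploidy is hypothetical and assumes the SNP sequence is given as a string of digits; the explicit list is the example from the description above.

```python
def infer_ploidy(snp_seq: str) -> int:
    """Take the ploidy to be the largest SNP data point in the sequence."""
    return max(int(ch) for ch in snp_seq)


# Case 1: no sequence_ploidy given, so the ploidy is read off the data
inferred = infer_ploidy("010120022202")

# Case 2: explicit list, where the index is the group ID
sequence_ploidy = [1, 2, 1]
ploidy_by_group = dict(enumerate(sequence_ploidy))
```
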
dim -> tuple[int]

Return the dimensions of the MSA. The number of rows (first index) is equal to the number of DataSequence objects, and the number of columns (second index), is equal to the length of each DataSequence (they should all be the same).

Returns: tuple[int]: row, col tuple that describes the dimensions of the MSA.
distance_matrix -> dict[tuple[DataSequence, DataSequence], float]

Using the distance helper, calculates pairwise distances for each pair of (different) DataSequences in this MSA.

Returns: dict[tuple[DataSequence, DataSequence], float]: Map from DataSequence pairs to the distance between them.
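
A pairwise distance map of this shape can be sketched as below. This is an illustration under stated assumptions: sequences are keyed by name rather than by DataSequence object, each unordered pair is stored once, and the names pairwise_distances and sequence_distance are hypothetical.

```python
from itertools import combinations


def sequence_distance(seq1: list, seq2: list) -> float:
    """Mismatches over the shared prefix plus the difference in length."""
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return float(mismatches + abs(len(seq1) - len(seq2)))


def pairwise_distances(records: dict[str, list]) -> dict[tuple[str, str], float]:
    """Map each pair of distinct record names to the distance between them."""
    return {
        (name_a, name_b): sequence_distance(records[name_a], records[name_b])
        for name_a, name_b in combinations(records, 2)
    }


records = {"a": list("ACGT"), "b": list("ACCT"), "c": list("AC")}
matrix = pairwise_distances(records)
```

Whether the library also stores the reversed pair for each entry is not stated in this page, so lookups against the real return value may need to try both orderings.
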
__iter__ -> Iterable[DataSequence]

Iterate over the DataSequence objects in this MSA.

Returns: Iterable[DataSequence]: an iterable of DataSequence objects.

Module Functions

ratio(a: str, b: str) -> float

Compute the similarity ratio between two strings.

Parameter Type Description
a str first string
b str second string
Returns: float: similarity ratio between a and b
group_some_strings(data: list[str]) -> list[list[str]]

Group a list of strings by similarity.

Parameter Type Description
data list[str] a list of strings
Returns: list[list[str]]: a list of groups of strings
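
One plausible way to realize a similarity ratio and a similarity-based grouping is sketched below with the standard library's difflib. The page does not say how ratio or group_some_strings are implemented, so the use of SequenceMatcher, the greedy strategy, the 0.8 threshold, and the names similarity and group_by_similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def group_by_similarity(names: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily add each name to the first group whose first member is similar enough."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # nothing similar enough: start a new group
    return groups


labels = ["human1", "human2", "human3", "gorilla1", "gorilla2", "chimp1"]
grouped = group_by_similarity(labels)
```

With labels that share a long common prefix the groups fall out cleanly. Short, unrelated labels such as xh1 jp0 an2 am3 give low ratios and would each end up alone under this threshold.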
