MSA Module v1.0.0

Multiple Sequence Alignment parsing, storage, grouping, and distance computation.

Author:: Mark Kessler
Last Edit:: 3/11/25
Source:: MSA.py

DataSequence

class DataSequence

An individual sequence record. A sequence record is defined by (1) the data sequence, (2) a name or string identifier, and (3) optionally a group ID, which is a non-negative integer.

Constructor

__init__(sequence: list, name: str, gid: int = -1) -> None

Initialize a sequence record.

Parameter	Type	Description
sequence	list	a sequence of some type of biological data
name	str	some name or label
gid	int, optional	a group id number. Defaults to -1. This signals that the sequence does not belong to any group.

Methods

get_name -> str

Get the name of the sequence.

Returns: str: sequence label

get_seq -> list[object]

Get the sequence as it was parsed from the input file. The usual type is list[str], that is, a list of characters or strings.

Returns: list[object]: A list of data (of some type, commonly a string)

get_numerical_seq -> list[int]

Get the sequence as a list of integers. If the sequence is not already a list[int], each character is translated into an integer by its hexadecimal value. Characters that cannot be mapped in hexadecimal are skipped.

Returns: list[int]: an integer data sequence, in hexadecimal.
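The translation described for get_numerical_seq can be sketched as follows. This is a minimal stand-alone illustration, not the code in MSA.py: the function name to_numerical is hypothetical, and it assumes that "mappable in hexadecimal" means Python's int(item, 16) accepts the item.

```python
def to_numerical(seq: list) -> list[int]:
    """Translate each item into an int by reading it as hexadecimal.

    Hypothetical helper for illustration; items that cannot be read
    as hexadecimal are skipped, as the description above states.
    """
    numeric = []
    for item in seq:
        if isinstance(item, int):
            numeric.append(item)  # already an integer, keep unchanged
            continue
        try:
            numeric.append(int(item, 16))
        except (ValueError, TypeError):
            continue  # e.g. "?" or "-": not hexadecimal, so skip it
    return numeric


print(to_numerical(["0", "1", "2", "?", "A", "-"]))  # prints [0, 1, 2, 10]
```

Note that skipping characters shortens the result, so the numerical sequence can be shorter than the original sequence.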

get_gid -> int

Get the group id for this sequence

Returns: int: group id.

set_ploidy(ploidyness: int) -> None

Set the ploidy of a data sequence. Only applicable to bimarker data, but setting this value for other data has no consequence.

Parameter	Type	Description
ploidyness	int	the level of ploidy for a data sequence

ploidy -> int

Get the ploidy for this sequence. Only relevant for bimarker data.

Returns: int: ploidy value.

__len__ -> int

The length of a DataSequence is defined as the length of its sequence.

Returns: int: the length of the sequence.

distance(seq2: DataSequence) -> float

Calculate the distance between two DataSequence objects, defined as the number of positions at which the two sequences differ. If the sequences have different lengths, the distance is the number of differences found over the length of the shorter sequence, plus the difference in length.

Parameter	Type	Description
seq2	DataSequence	The second sequence to compare.

Returns: float: The distance between the two sequences.
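The distance rule above can be sketched on plain lists. The function name sequence_distance is hypothetical and works on raw lists rather than DataSequence objects; it only illustrates the arithmetic described here.

```python
def sequence_distance(a: list, b: list) -> float:
    """Count mismatches over the shorter length, plus the length difference."""
    # zip stops at the end of the shorter sequence
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return float(mismatches + abs(len(a) - len(b)))


print(sequence_distance(list("ACGT"), list("ACGA")))  # prints 1.0
print(sequence_distance(list("ACGT"), list("AG")))    # prints 3.0
```

In the second call, the overlap "AC" versus "AG" contributes one mismatch, and the two unmatched trailing positions contribute two more.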

MSA

class MSA(Iterable[DataSequence])

Class that packages the data and functionality for Multiple Sequence Alignments. This class stores all data and metadata about a sequence alignment, and can handle file I/O for nexus files that contain a matrix data block. If a grouping applies to a set of sequences, it can be defined here.

Constructor

__init__(filename: Union[str, None] = None, data: Union[list[DataSequence], None] = None, grouping: Union[dict[DataSequence, int], None] = None, grouping_auto_detect: bool = False) -> None

Initialize an MSA object. Either a filename or a list of DataSequence objects can be provided. If a filename is provided, the MSA will be parsed from the file. If a list of DataSequence objects is provided, the MSA will be constructed from those objects. If grouping is provided, the sequences will be grouped accordingly.

Parameter	Type	Description
filename	Union[str, None], optional	A filename to a nexus file that contains a matrix data block. Defaults to None.
data	Union[list[DataSequence], None], optional	A list of DataSequence objects. Defaults to None.
grouping	Union[dict[DataSequence, int], None], optional	A grouping map from DataSequence objects to group IDs. Defaults to None.
grouping_auto_detect	bool, optional	If set to True, the MSA will attempt to group sequences based on sequence name similarity. Defaults to False.

Methods

add_data(data_seq: DataSequence) -> None

Add a DataSequence object to the MSA. If the DataSequence object has a group ID, it will be added to the appropriate group. If the DataSequence object does not have a group ID, it will be added to the MSA as a new group.

Parameter	Type	Description
data_seq	DataSequence	A DataSequence object to add to the MSA

retroactive_group -> None

For every DataSequence object in the MSA whose group ID is -1 (indicating that it does not belong to any group), attempt to use autogrouping to pair it with similar data. If autogrouping is not possible, the DataSequence is placed in a group of its own. After calling this function, no DataSequence object will have a gid of -1.

get_records -> list[DataSequence]

Retrieve all sequences that are in this alignment.

Returns: list[DataSequence]: list of all sequence records.

parse -> list[DataSequence]

Read the sequences from the file and place them into DataSequence objects. If a grouping is defined (in the case of SNPs), group IDs will be assigned to each DataSequence for ease of counting red alleles.

Returns: list[DataSequence]: A list of DataSequence objects

num_groups -> int

Get the number of groups in the MSA.

Returns: int: the number of groups in the MSA.

group_given_id(gid) -> list[DataSequence]

Get the set of DataSequences that belong to a given group ID.

Parameter	Type	Description
gid	int	group id

Returns: list[DataSequence]: the set (as a list) of DataSequences that have a given gid

get_category(name: str) -> int

Get the group ID for a given sequence name. If the sequence name is not found in the grouping map, raise a KeyError.

Parameter	Type	Description
name	str	The name of the sequence to get the group ID for.

Returns: int: The group ID for the sequence.

Raises: KeyError: If the sequence name is not found in the grouping map.

group_auto_detect -> dict[int, str]

If no grouping of sequences is provided, but a grouping is still desired, group the sequences by the similarity of their name/label strings. Note: sequences are not guaranteed to be grouped properly if their labels do not follow some general pattern. For example, human1 human2 human3 gorilla1 gorilla2 chimp1 is group-able, while xh1 jp0 an2 am3 is less group-able.

Returns: dict[int, str]: a grouping map from group IDs to sequence names

seq_by_name(name: str) -> DataSequence

Retrieves the sequence that belongs to this MSA that has a given name

Parameter	Type	Description
name	str	The taxa/label name of the sequence. Must match exactly (same case, spacing, etc)

Returns: DataSequence: the sequence with the label 'name'

total_samples -> int

Sum the ploidy of every record to get the total number of allele samples. A record whose ploidy is not set (-1) is treated as 1 sample.

Returns: int: the total number of samples
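The accumulation rule for total_samples can be sketched on a plain list of ploidy values. The function name count_samples is hypothetical, and the list stands in for the ploidy of each record in the MSA.

```python
def count_samples(ploidies: list[int]) -> int:
    """Sum ploidy values, counting an unset ploidy (-1) as one sample."""
    return sum(1 if p == -1 else p for p in ploidies)


# Two diploid records, one record with unset ploidy, one haploid record.
print(count_samples([2, 2, -1, 1]))  # prints 6
```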

samples_given_group(gid: int) -> int

Return the number of samples within a given group.

Parameter	Type	Description
gid	int	group id

Returns: int: total samples within the group defined by 'gid'

set_sequence_ploidy(sequence_ploidy: list[int] = None) -> None

Set the ploidy of each group of sequences in the MSA.

If sequence_ploidy is provided, it should be a list of numbers >= 1 in which each index corresponds to a group ID. For example, [1,2,1] indicates that group 0 has ploidy 1, group 1 has ploidy 2, and group 2 has ploidy 1.

If sequence_ploidy is not given, the ploidy is set to the maximum SNP data point found in the sequence. For a SNP sequence of 010120022202, the ploidy is 2.

NOTE: If sequence_ploidy is not given, it is assumed that the ploidy values of all records within a group are identical.

Parameter	Type	Description
sequence_ploidy	list[int], optional	implicit mapping from group ids (index) to the ploidy of that sequence, or set of sequences. Defaults to None.
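Both ways of setting ploidy can be sketched on plain data. The function name assign_ploidy and the groups dictionary (group ID to a list of SNP sequences) are hypothetical stand-ins for the MSA's internal state, and the inferred case takes the maximum over the whole group, which matches the per-record maximum only under the assumption stated in the note above.

```python
from typing import Optional


def assign_ploidy(groups: dict[int, list[list[int]]],
                  sequence_ploidy: Optional[list[int]] = None) -> dict[int, int]:
    """Return a map from group ID to ploidy."""
    ploidy = {}
    for gid, seqs in groups.items():
        if sequence_ploidy is not None:
            # Explicit case: the list index is the group ID.
            ploidy[gid] = sequence_ploidy[gid]
        else:
            # Inferred case: the largest SNP data point seen in the group.
            ploidy[gid] = max(max(seq) for seq in seqs)
    return ploidy


snp = [int(ch) for ch in "010120022202"]
groups = {0: [[0, 1, 0, 1]], 1: [snp], 2: [[1, 0, 0, 1]]}

print(assign_ploidy(groups, [1, 2, 1]))  # prints {0: 1, 1: 2, 2: 1}
print(assign_ploidy(groups))             # prints {0: 1, 1: 2, 2: 1}
```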

dim -> tuple[int]

Return the dimensions of the MSA. The number of rows (first index) is the number of DataSequence objects, and the number of columns (second index) is the length of each DataSequence (all sequences should have the same length).

Returns: tuple[int]: row, col tuple that describes the dimensions of the MSA.

distance_matrix -> dict[tuple[DataSequence, DataSequence], float]

Calculate the pairwise distance for each pair of distinct DataSequence objects in this MSA, using the distance helper.

Returns: dict[tuple[DataSequence, DataSequence], float]: Map from DataSequence pairs to the distance between them.
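A pairwise distance map of this shape can be sketched with itertools.combinations. This sketch makes several assumptions: sequences are plain lists keyed by name rather than DataSequence objects, each unordered pair is stored once, and the hypothetical pair_distance helper follows the rule documented for DataSequence.distance.

```python
from itertools import combinations


def pair_distance(a: list, b: list) -> float:
    """Mismatches over the shorter length, plus the length difference."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return float(mismatches + abs(len(a) - len(b)))


def pairwise_distances(seqs: dict[str, list]) -> dict[tuple[str, str], float]:
    """Map each unordered pair of distinct sequence names to a distance."""
    return {
        (n1, n2): pair_distance(seqs[n1], seqs[n2])
        for n1, n2 in combinations(seqs, 2)
    }


seqs = {"t1": list("ACGT"), "t2": list("ACGA"), "t3": list("AC")}
for pair, dist in pairwise_distances(seqs).items():
    print(pair, dist)
```

For n sequences the map holds n(n-1)/2 entries, so three sequences yield three pairs.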

__iter__ -> Iterable[DataSequence]

Iterate over the DataSequence objects in this MSA.

Returns: Iterable[DataSequence]: an iterable of DataSequence objects.

Module Functions

ratio(a: str, b: str) -> float

Compute the similarity ratio between two strings.

Parameter	Type	Description
a	str	first string
b	str	second string

Returns: float: similarity ratio between a and b
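One common way to compute a string similarity ratio in Python is difflib.SequenceMatcher from the standard library. Whether the module's ratio function uses this metric is an assumption; the sketch below (with the hypothetical name similarity) only shows what such a ratio looks like in practice.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


print(similarity("human1", "human1"))    # prints 1.0
print(similarity("human1", "human2"))    # 5 of 6 characters match
print(similarity("human1", "gorilla1"))  # a much lower score
```

SequenceMatcher computes 2M/T, where M is the number of matched characters and T is the combined length of both strings, so "human1" and "human2" score 10/12.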

group_some_strings(data: list[str]) -> list[list[str]]

Group a list of strings by similarity.

Parameter	Type	Description
data	list[str]	a list of strings

Returns: list[list[str]]: a list of groups of strings
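Similarity-based grouping can be sketched with a greedy pass over the labels. This is an illustration under stated assumptions, not the module's algorithm: the function name group_by_similarity, the difflib-based score, the 0.7 threshold, and the choice to compare against the first member of each group are all hypothetical.

```python
from difflib import SequenceMatcher


def group_by_similarity(data: list[str], threshold: float = 0.7) -> list[list[str]]:
    """Greedily place each string in the first group it resembles."""
    groups: list[list[str]] = []
    for label in data:
        for group in groups:
            # Compare against the group's first member as a representative.
            score = SequenceMatcher(None, label, group[0]).ratio()
            if score >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])  # no similar group found: start a new one
    return groups


labels = ["human1", "human2", "human3", "gorilla1", "gorilla2", "chimp1"]
print(group_by_similarity(labels))
# prints [['human1', 'human2', 'human3'], ['gorilla1', 'gorilla2'], ['chimp1']]
```

Labels that share a long common prefix score well above the threshold, while short unrelated labels such as xh1 and jp0 tend to end up in singleton groups.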

PhyNetPy Documentation
