PhyNetPy Documentation

Library for the Development and Use of Phylogenetic Network Methods

IO Module v1.1.0

Central I/O hub for reading and writing phylogenetic file formats (FASTA, VCF, Newick, Nexus).

Author: Mark Kessler
Last Edit: 2/6/26
Source: IO.py

Constants

FASTA_LINE_WIDTH : int = 80

Exceptions

exception IOError(Exception)

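Exception raised when file I/O operations fail within PhyNetPy. This class derives from Exception and shares its name with Python's built-in IOError (an alias of OSError), so which one an `except IOError` clause catches depends on whether the name was imported from this module.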

Module Functions

read_fasta_records(filepath: str) -> List[DataSequence]

Read a FASTA file and return a list of DataSequence objects. This is the lower-level reader: it returns raw DataSequence objects without wrapping them in an MSA, which is useful for attaching sequences directly to Node objects in an existing Network (via Node.set_seq()).

A FASTA file looks like:

    >sequence_name_1
    ATCGATCGATCG...
    >sequence_name_2
    GCTAGCTAGCTA...

Each record becomes a DataSequence where:

- name = the FASTA header (sequence ID)
- seq = list of characters from the sequence string

Parameter Type Description
filepath str Path to a FASTA file (.fasta, .fas, .fa, .fna, .ffn, .faa).
Returns: list[DataSequence]: A list of DataSequence objects, one per FASTA record.
Raises: FileNotFoundError: If the file does not exist. IOError: If BioPython cannot parse the file or it contains no valid sequences.
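The record structure that read_fasta_records describes can be illustrated with a small standalone parser. This is only a sketch under stated assumptions, not the library's code (the documentation indicates PhyNetPy relies on BioPython): parse_fasta_text is a hypothetical name, plain dictionaries stand in for DataSequence objects, and the sequence ID is assumed to be the first whitespace-delimited token of the header line.

```python
from typing import Dict, List, Optional


def parse_fasta_text(text: str) -> List[Dict[str, object]]:
    """Parse FASTA-formatted text into {'name', 'seq'} records (illustrative only)."""
    records: List[Dict[str, object]] = []
    name: Optional[str] = None
    chars: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append({"name": name, "seq": chars})
            header = line[1:].strip()
            # Assumption: the ID is the first token of the header.
            name = header.split()[0] if header else ""
            chars = []
        elif name is not None:
            # Sequence lines are stored as a list of single characters.
            chars.extend(line)
    if name is not None:
        records.append({"name": name, "seq": chars})
    return records


example = ">sequence_name_1\nATCG\nATCG\n>sequence_name_2\nGCTA\n"
records = parse_fasta_text(example)
```

Here a sequence split over two lines is joined into one record, so the first entry of records carries eight characters.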
read_fasta(filepath: str, grouping: Optional[Dict[str, list]] = None, grouping_auto_detect: bool = False) -> MSA

Read a FASTA file and return an MSA object containing all sequences. This function parses a FASTA file, converts each record into a DataSequence, and wraps them in an MSA for downstream phylogenetic analyses such as distance calculations, alignment inspection, or model-based inference.

Parameter Type Description
filepath str Path to a FASTA file (.fasta, .fas, .fa, .fna, .ffn, .faa).
grouping dict[str, list], optional A mapping from group names to lists of sequence names that belong to that group. If provided, sequences will be assigned group IDs accordingly. Defaults to None.
grouping_auto_detect bool, optional If True, attempt to automatically group sequences by name similarity. Defaults to False.
Returns: MSA: A Multiple Sequence Alignment object containing all parsed sequences.
Raises: FileNotFoundError: If the file does not exist. IOError: If the file cannot be parsed or contains no valid sequences.
write_fasta(msa: MSA, filepath: str, line_width: int = FASTA_LINE_WIDTH) -> None

Write an MSA object to a FASTA file. Each DataSequence in the MSA is written as a FASTA record, with the sequence wrapped at line_width characters:

    >sequence_name
    ATCGATCG...

Parameter Type Description
msa MSA The Multiple Sequence Alignment to write.
filepath str The output file path. Will be created or overwritten.
line_width int, optional Number of characters per sequence line. Standard FASTA convention is 80. Defaults to 80.
Returns: None
Raises: IOError: If the MSA has no records to write, or if the file cannot be written. ValueError: If line_width is less than 1.
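The wrapping behaviour controlled by line_width can be sketched without PhyNetPy. The helper below is hypothetical (format_fasta is not part of the module) and takes plain (name, sequence) pairs instead of an MSA. It mirrors the documented ValueError for widths below 1 and, as an assumption, returns an empty string for empty input rather than raising.

```python
from typing import Iterable, List, Tuple

FASTA_LINE_WIDTH = 80  # same value as the module constant documented above


def format_fasta(records: Iterable[Tuple[str, str]],
                 line_width: int = FASTA_LINE_WIDTH) -> str:
    """Render (name, sequence) pairs as FASTA text, wrapping each sequence."""
    if line_width < 1:
        raise ValueError("line_width must be at least 1")
    lines: List[str] = []
    for name, seq in records:
        lines.append(f">{name}")
        # Slice the sequence into chunks of at most line_width characters.
        for start in range(0, len(seq), line_width):
            lines.append(seq[start:start + line_width])
    return "".join(line + "\n" for line in lines)


wrapped = format_fasta([("taxonA", "ATCGATCGAT")], line_width=4)
```

With a width of 4, the ten-character sequence is written as three lines (4, 4, and 2 characters) under its header.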
write_fasta_from_network(network: Network, filepath: str, line_width: int = FASTA_LINE_WIDTH) -> None

Extract sequences from the leaf nodes of a Network and write them to a FASTA file. Only leaf nodes that have an associated DataSequence (set via Node.set_seq()) will be written. The node label becomes the FASTA header, and the attached sequence becomes the FASTA sequence. This is useful when a Network has been annotated with molecular data and the user wants to export just the sequence data.

Parameter Type Description
network Network A phylogenetic network whose leaf nodes may carry DataSequence objects.
filepath str The output FASTA file path.
line_width int, optional Characters per line for sequence wrapping. Defaults to 80.
Returns: None
Raises: IOError: If no leaf nodes in the network have sequence data attached, or if the file cannot be written. ValueError: If line_width is less than 1.
read_vcf(filepath: str, grouping: Optional[Dict[str, list]] = None, missing_value: str = '?') -> MSA

Read a VCF (Variant Call Format) file and return an MSA object. Each sample in the VCF becomes a DataSequence whose sequence is the vector of ALT allele counts across all variant sites. This maps directly to the SNP/BiMarkers pipeline used in PhyNetPy.

A typical VCF file looks like:

    ##fileformat=VCFv4.1
    ##INFO=<...>
    #CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  Samp1  Samp2
    chr1    100  .   A    T    30    PASS    .     GT      0/0    0/1
    chr1    200  .   G    C    50    PASS    .     GT      1/1    0/1

Genotype encoding:

- 0/0 -> 0 (homozygous reference, 0 copies of ALT allele)
- 0/1 -> 1 (heterozygous, 1 copy of ALT allele)
- 1/1 -> 2 (homozygous alternate, 2 copies of ALT allele)
- ./. -> missing_value (missing genotype)

Parameter Type Description
filepath str Path to a VCF file (.vcf).
grouping dict[str, list], optional A mapping from group/species names to lists of sample names that belong to that group. Used for the BiMarkers pipeline where multiple individuals map to a single species. Defaults to None.
missing_value str, optional The character to use for missing genotype data (./.). Defaults to "?".
Returns: MSA: A Multiple Sequence Alignment where each DataSequence represents one sample's genotype vector across all sites.
Raises: FileNotFoundError: If the file does not exist. IOError: If the file cannot be parsed or contains no variant data.
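The genotype encoding described for read_vcf can be sketched in a few lines of standalone Python. This is a hypothetical helper, not PhyNetPy's implementation: the name encode_genotype is made up, counts are returned as strings on the assumption that sequence elements are characters, phased genotypes (0|1) are assumed to be treated like unphased ones, and any non-zero allele index is assumed to count as one ALT copy.

```python
from typing import Dict, List


def encode_genotype(gt: str, missing_value: str = "?") -> str:
    """Map a VCF sample field such as '0/1' to its ALT allele count."""
    # Keep only the GT sub-field if the column carries more (e.g. "0/1:35").
    gt = gt.split(":")[0]
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return missing_value
    # Assumption: every non-reference allele counts as one ALT copy.
    return str(sum(1 for a in alleles if a != "0"))


# The two data rows from the example file, tab-separated as in a real VCF.
rows = [
    "chr1\t100\t.\tA\tT\t30\tPASS\t.\tGT\t0/0\t0/1",
    "chr1\t200\t.\tG\tC\t50\tPASS\t.\tGT\t1/1\t0/1",
]
samples = ["Samp1", "Samp2"]
vectors: Dict[str, List[str]] = {s: [] for s in samples}
for row in rows:
    fields = row.split("\t")
    # Sample columns start at index 9, after the FORMAT column.
    for sample, gt in zip(samples, fields[9:]):
        vectors[sample].append(encode_genotype(gt))
```

Each entry of vectors is one sample's genotype vector across the two sites, which is the shape the returned MSA is described as holding.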
read_vcf_metadata(filepath: str) -> Dict[str, Any]

Read only the metadata and header from a VCF file without loading all variant data. Useful for inspecting what samples and fields are available before a full parse.

Parameter Type Description
filepath str Path to a VCF file.
Returns: dict[str, Any]: A dictionary containing:

- "fileformat": The VCF version string
- "metadata_lines": List of all ## header lines
- "sample_names": List of sample column names
- "info_fields": List of INFO field IDs
- "format_fields": List of FORMAT field IDs
- "filter_fields": List of FILTER field IDs
- "contig_fields": List of contig IDs
Raises: FileNotFoundError: If the file does not exist. IOError: If the file cannot be read.
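A header-only pass of the kind read_vcf_metadata describes might look like the sketch below. It is an illustration, not the module's code: parse_vcf_header is a hypothetical name, it works on an iterable of lines rather than a file path, and it assumes each structured header line begins with ID= inside the angle brackets.

```python
import re
from typing import Any, Dict, Iterable

# Matches e.g. ##INFO=<ID=DP,...> and captures the kind and the ID.
_ID_PATTERN = re.compile(r"^##(INFO|FORMAT|FILTER|contig)=<ID=([^,>]+)")
_KEY_FOR_KIND = {
    "INFO": "info_fields",
    "FORMAT": "format_fields",
    "FILTER": "filter_fields",
    "contig": "contig_fields",
}


def parse_vcf_header(lines: Iterable[str]) -> Dict[str, Any]:
    """Collect VCF header metadata and stop at the first data row."""
    meta: Dict[str, Any] = {
        "fileformat": "", "metadata_lines": [], "sample_names": [],
        "info_fields": [], "format_fields": [], "filter_fields": [],
        "contig_fields": [],
    }
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            meta["metadata_lines"].append(line)
            if line.startswith("##fileformat="):
                meta["fileformat"] = line.split("=", 1)[1]
            match = _ID_PATTERN.match(line)
            if match:
                meta[_KEY_FOR_KIND[match.group(1)]].append(match.group(2))
        elif line.startswith("#CHROM"):
            # Sample names follow the nine fixed columns.
            meta["sample_names"] = line.split("\t")[9:]
            break
        elif line.strip():
            break  # reached variant data without a column header
    return meta


header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "##contig=<ID=chr1>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSamp1\tSamp2",
    "chr1\t100\t.\tA\tT\t30\tPASS\t.\tGT\t0/0\t0/1",
]
meta = parse_vcf_header(header)
```

Because the loop breaks at the #CHROM line, variant rows are never read, which is the point of a metadata-only pass.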
write_vcf(msa: MSA, filepath: str, chrom: str = 'chr1', start_pos: int = 1, ref_allele: str = 'A', alt_allele: str = 'T') -> None

Write an MSA of SNP/allele-count data to a simplified VCF file. This produces a minimal VCF where each site in the MSA becomes a variant record, and each DataSequence becomes a sample column. The allele count values (0, 1, 2, ...) are converted back to VCF genotype notation (e.g., 0/0, 0/1, 1/1).

Note: Because the MSA does not store chromosome position, reference alleles, or other VCF-specific metadata, this output is a simplified reconstruction. It is suitable for round-tripping SNP data or creating test files, but will not preserve full VCF metadata from an original file.

Parameter Type Description
msa MSA The MSA containing allele count data (values like 0, 1, 2 per site).
filepath str The output VCF file path.
chrom str, optional Chromosome name for all records. Defaults to "chr1".
start_pos int, optional Starting position for the first variant. Each subsequent variant increments by 1. Defaults to 1.
ref_allele str, optional Reference allele character. Defaults to "A".
alt_allele str, optional Alternate allele character. Defaults to "T".
Returns: None
Raises: IOError: If the MSA has no records, or the file cannot be written.
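The reverse mapping that write_vcf describes, from allele counts back to genotype strings, can be sketched as follows. The names count_to_genotype and vcf_rows are hypothetical, a plain dictionary of per-sample count lists stands in for the MSA, only diploid counts 0 to 2 are handled, and the QUAL, FILTER, and INFO placeholders are assumptions rather than what PhyNetPy actually writes.

```python
from typing import Dict, List

_GENOTYPES = {"0": "0/0", "1": "0/1", "2": "1/1"}


def count_to_genotype(value: object) -> str:
    """Convert an ALT allele count to diploid GT notation ('./.' if unknown)."""
    return _GENOTYPES.get(str(value), "./.")


def vcf_rows(samples: Dict[str, List[str]], chrom: str = "chr1",
             start_pos: int = 1, ref_allele: str = "A",
             alt_allele: str = "T") -> List[str]:
    """Build the column header and one tab-separated record per site."""
    names = list(samples)
    n_sites = len(next(iter(samples.values()))) if samples else 0
    fixed = ["#CHROM", "POS", "ID", "REF", "ALT",
             "QUAL", "FILTER", "INFO", "FORMAT"]
    rows = ["\t".join(fixed + names)]
    for site in range(n_sites):
        genotypes = [count_to_genotype(samples[n][site]) for n in names]
        # Positions increase by 1 per site, starting at start_pos.
        rows.append("\t".join(
            [chrom, str(start_pos + site), ".", ref_allele, alt_allele,
             ".", "PASS", ".", "GT"] + genotypes))
    return rows


rows_out = vcf_rows({"Samp1": ["0", "2"], "Samp2": ["1", "1"]})
```

Since every record reuses the same chrom, ref_allele, and alt_allele, the result carries the genotype matrix only, in line with the note that the output is a simplified reconstruction.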
read_newick(newick_str: str) -> Network

Parse a single newick/extended-newick string into a PhyNetPy Network. Supports standard newick features (branch lengths, internal node names) as well as the extended newick format for phylogenetic networks (reticulation nodes prefixed with '#', gamma inheritance comments).

Examples of accepted strings:

    ((A:0.1,B:0.2):0.3,C:0.4);
    ((A:0.1,(B:0.2)#H1:0.3):0.4,(#H1:0.5,C:0.6):0.7);

Parameter Type Description
newick_str str A newick or extended-newick string. Trailing semicolons are handled automatically.
Returns: Network: A PhyNetPy Network object with the same topology, names, and branch lengths as described in the newick string.
Raises: IOError: If the string cannot be parsed.
read_newick_file(filepath: str, return_type: Literal['networks', 'genetrees'] = 'networks', species_gene_mapping: Optional[Dict[str, List[str]]] = None, naming_rule: Optional[Callable[..., Any]] = None) -> Union[List[Network], GeneTrees]

Read a file containing one or more newick strings (one per line) and parse each into a PhyNetPy Network. Blank lines and lines starting with '#' are skipped.

Parameter Type Description
filepath str Path to a file containing newick strings.
return_type str ``"networks"`` (default) returns a list of Network objects. ``"genetrees"`` validates each network as a rooted binary tree and wraps them in a GeneTrees object.
species_gene_mapping dict, optional Explicit species -> gene label mapping. Only used when *return_type* is ``"genetrees"``.
naming_rule Callable, optional Gene-label-to-species callable. Only used when *return_type* is ``"genetrees"`` and no explicit mapping is given.
Returns: list[Network] | GeneTrees: Parsed phylogenetic data.
Raises: FileNotFoundError: If the file does not exist. IOError: If no valid newick strings are found, or parsing fails.
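The line filtering that read_newick_file describes (one newick string per line, with blank lines and '#' lines skipped) can be sketched like this. The function name iter_newick_strings is hypothetical and the sketch only selects lines; turning each string into a Network is left to the library.

```python
from typing import List


def iter_newick_strings(text: str) -> List[str]:
    """Return the candidate newick strings in text, one per kept line."""
    kept: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' are skipped.
        if not line or line.startswith("#"):
            continue
        kept.append(line)
    return kept


sample = (
    "# gene trees for locus 1\n"
    "((A:0.1,B:0.2):0.3,C:0.4);\n"
    "\n"
    "((B:0.1,C:0.2):0.3,A:0.4);\n"
)
strings = iter_newick_strings(sample)
```

Extended-newick strings are unaffected by the '#' rule in this sketch because they begin with '(' and only contain '#' further in, on reticulation nodes.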
write_newick(network: Network) -> str

Convert a PhyNetPy Network into a newick string. Delegates to the Network's built-in ``newick()`` method, which produces extended-newick notation for networks with reticulation nodes.

Parameter Type Description
network Network A PhyNetPy Network object.
Returns: str: The newick representation of the network, ending with ';'.
write_newick_file(networks: List[Network], filepath: str) -> None

Write one or more Networks to a file as newick strings, one per line.

Parameter Type Description
networks list[Network] Networks to write.
filepath str Output file path. Will be created or overwritten.
Returns: None
Raises: IOError: If the list is empty or the file cannot be written.
read_nexus(filepath: str, validate_input: bool = False, print_validation_summary: bool = False, return_type: Literal['networks', 'genetrees'] = 'networks', species_gene_mapping: Optional[Dict[str, List[str]]] = None, naming_rule: Optional[Callable[..., Any]] = None) -> Union[List[Network], GeneTrees]

Read a nexus file and parse all trees/networks in the TREES block into PhyNetPy Network objects. This replicates the core functionality of ``NetworkParser`` as a standalone function, making it easy to call without instantiating a class.

A typical nexus file looks like:

    #NEXUS
    BEGIN TAXA;
        DIMENSIONS NTAX=3;
        TAXALABELS A B C;
    END;
    BEGIN TREES;
        Tree t1 = ((A:0.1,B:0.2):0.3,C:0.4);
        Tree t2 = ((B:0.1,C:0.2):0.3,A:0.4);
    END;

Parameter Type Description
filepath str Path to a nexus file (.nex, .nexus).
validate_input bool, optional If True, run NexusValidator on the file before parsing. Defaults to False.
print_validation_summary bool, optional If True and validate_input is True, print the validation summary. Defaults to False.
return_type str ``"networks"`` (default) returns a list of Network objects. ``"genetrees"`` validates each network as a rooted binary tree and wraps them in a GeneTrees object.
species_gene_mapping dict, optional Explicit species -> gene label mapping. Only used when *return_type* is ``"genetrees"``.
naming_rule Callable, optional Gene-label-to-species callable. Only used when *return_type* is ``"genetrees"`` and no explicit mapping is given.
Returns: list[Network] | GeneTrees: Parsed phylogenetic data.
Raises: FileNotFoundError: If the file does not exist. IOError: If the file cannot be parsed or contains no trees.
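To show what "all trees/networks in the TREES block" amounts to, here is a minimal standalone extraction of the tree lines from nexus text. It is a sketch, not the parser PhyNetPy uses: extract_trees is a hypothetical name, and it assumes each tree statement sits on a single line of the form `Tree <name> = <newick>;`.

```python
import re
from typing import Dict

_TREE_LINE = re.compile(r"^tree\s+(\S+)\s*=\s*(.+;)$", re.IGNORECASE)


def extract_trees(nexus_text: str) -> Dict[str, str]:
    """Map each tree name in the TREES block to its newick string."""
    trees: Dict[str, str] = {}
    in_trees = False
    for raw in nexus_text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("begin trees"):
            in_trees = True
        elif in_trees and lowered.startswith("end;"):
            in_trees = False
        elif in_trees:
            match = _TREE_LINE.match(line)
            if match:
                trees[match.group(1)] = match.group(2)
    return trees


nexus_text = """#NEXUS
BEGIN TAXA;
    DIMENSIONS NTAX=3;
    TAXALABELS A B C;
END;
BEGIN TREES;
    Tree t1 = ((A:0.1,B:0.2):0.3,C:0.4);
    Tree t2 = ((B:0.1,C:0.2):0.3,A:0.4);
END;
"""
trees = extract_trees(nexus_text)
```

Each extracted string could then be parsed on its own with read_newick.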
read_nexus_msa(filepath: str) -> MSA

Read the sequence data (DATA/CHARACTERS block) from a nexus file and return it as an MSA object. This is a convenience wrapper around the MSA constructor's built-in nexus parsing. Use this when you want the alignment data rather than the tree topology.

Parameter Type Description
filepath str Path to a nexus file containing a DATA or CHARACTERS block.
Returns: MSA: The parsed Multiple Sequence Alignment.
Raises: FileNotFoundError: If the file does not exist. IOError: If no sequence data is found.
write_nexus(networks: List[Network], filepath: str, taxa: Optional[Set[str]] = None, tree_prefix: str = 'net', overwrite: bool = True, phylonet_cmds: Optional[List[str]] = None) -> None

Write one or more Networks to a nexus file with TAXA and TREES blocks. This replicates the functionality of the ``NexusTemplate`` class as a standalone function.

The generated file follows the standard nexus format:

    #NEXUS
    BEGIN TAXA;
        DIMENSIONS NTAX=3;
        TAXALABELS A B C ;
    END;
    BEGIN TREES;
        Tree net1 = ((A:0.1,B:0.2):0.3,C:0.4);
        Tree net2 = ...;
    END;

Parameter Type Description
networks list[Network] The networks to write.
filepath str Output file path.
taxa set[str], optional An explicit set of taxa labels. If None, taxa are inferred from the newick strings. Defaults to None.
tree_prefix str, optional Label prefix for each tree line. Defaults to "net".
overwrite bool, optional If False, raises IOError if the file already exists. Defaults to True.
phylonet_cmds list[str], optional A list of PhyloNet commands to include in a PHYLONET block. Defaults to None.
Returns: None
Raises: IOError: If the list is empty, or the file cannot be written, or the file already exists and overwrite is False.
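The file layout that write_nexus produces, including taxa inferred from the newick strings when none are given, can be approximated with the sketch below. It is illustrative only: build_nexus is a hypothetical name, it takes newick strings rather than Network objects, it raises ValueError instead of the module's IOError, and the regular expression used to find leaf labels is an assumption that ignores quoted labels and comments containing commas.

```python
import re
from typing import List, Optional, Set

# A leaf label is a token directly after '(' or ','; '#' is excluded so that
# reticulation references such as #H1 are not counted as taxa.
_LEAF = re.compile(r"(?<=[(,])[^():,;#\[\]\s]+")


def build_nexus(newicks: List[str], taxa: Optional[Set[str]] = None,
                tree_prefix: str = "net") -> str:
    """Assemble nexus text with a TAXA block and a TREES block."""
    if not newicks:
        raise ValueError("at least one newick string is required")
    if taxa is None:
        taxa = set()
        for nwk in newicks:
            taxa.update(_LEAF.findall(nwk))
    labels = sorted(taxa)
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"    DIMENSIONS NTAX={len(labels)};",
        "    TAXALABELS " + " ".join(labels) + " ;",
        "END;",
        "BEGIN TREES;",
    ]
    for index, nwk in enumerate(newicks, start=1):
        body = nwk.strip()
        if not body.endswith(";"):
            body += ";"
        lines.append(f"    Tree {tree_prefix}{index} = {body}")
    lines.append("END;")
    return "\n".join(lines) + "\n"


nexus_out = build_nexus([
    "((A:0.1,B:0.2):0.3,C:0.4);",
    "((A:0.1,(B:0.2)#H1:0.3):0.4,(#H1:0.5,C:0.6):0.7);",
])
```

Sorting the labels keeps the TAXALABELS line deterministic, which a set alone would not guarantee.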
detect_newick_standard(newick_str: str) -> str

Auto-detect which newick convention a string uses based on its formatting. The detection heuristic is:

1. If the string contains ``#Name:len::gamma`` double-colon notation on a reticulation node → **Phylonet**
2. If the string starts with ``[&R]`` or ``[&U]`` → **Beast**
3. If the string contains ``[&...gamma=...]`` → **PhyNetPy**
4. Otherwise (plain newick) → **PhyNetPy** (default)

Parameter Type Description
newick_str str A newick or extended-newick string.
Returns: str: One of ``"Phylonet"``, ``"Beast"``, or ``"PhyNetPy"``.
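The detection order documented for detect_newick_standard can be expressed as a short standalone function. This is a sketch of the documented heuristic, not the module's source: the name detect_standard and both regular expressions are assumptions about how the patterns might be matched.

```python
import re

# '#Name:len::' on a reticulation node (Rich Newick double-colon form).
_DOUBLE_COLON = re.compile(r"#[^:,()\[\];]+:[^:,()\[\];]*::")
# A bracket comment that carries a gamma value, e.g. [&gamma=.7].
_BRACKET_GAMMA = re.compile(r"\[&[^\]]*gamma=[^\]]*\]")


def detect_standard(newick_str: str) -> str:
    """Apply the four documented checks in order and return the standard."""
    text = newick_str.strip()
    if _DOUBLE_COLON.search(text):
        return "Phylonet"
    if text.startswith("[&R]") or text.startswith("[&U]"):
        return "Beast"
    if _BRACKET_GAMMA.search(text):
        return "PhyNetPy"
    return "PhyNetPy"  # plain newick falls back to the default


phynetpy_str = "((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;"
phylonet_str = "((C:.1,(B:.05)#H0:.05::.7)I1:.1,(A:.1,#H0:.05)I2:.1)I3;"
beast_str = "[&R] " + phynetpy_str
detected = [detect_standard(s) for s in (phynetpy_str, phylonet_str, beast_str)]
```

The order matters: the Beast check must come before the bracket-gamma check, since a Beast string also contains a gamma comment.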
convert_newick(newick_str: str, standard: str = 'PhyNetPy') -> str

Convert a newick/extended-newick string between different software conventions. The three supported standards differ primarily in how they encode inheritance probabilities (gamma) on reticulation edges.

**PhyNetPy** uses BioPython-style bracket comments:

    ((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;

**Phylonet** uses Rich Newick double-colon notation:

    ((C:.1,(B:.05)#H0:.05::.7)I1:.1,(A:.1,#H0:.05)I2:.1)I3;

**Beast** uses the same annotation as PhyNetPy but prefixes the string with ``[&R]`` for rooted trees (or ``[&U]`` for unrooted):

    [&R] ((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;

The function auto-detects the input convention and converts to the target. Non-gamma metadata (e.g. ``[&posterior=0.95]``) on non-reticulation nodes is preserved in all conversions.

Parameter Type Description
newick_str str A newick or extended-newick string in any of the three conventions.
standard str, optional Target convention. One of ``"PhyNetPy"`` (default), ``"Phylonet"``, or ``"Beast"``.
Returns: str: The newick string reformatted for the target software.
Raises: ValueError: If ``standard`` is not one of the three valid options. IOError: If the input string is empty.
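The difference between the bracket and double-colon gamma encodings can be made concrete with two regular-expression rewrites. This is a simplified sketch, not convert_newick itself: the function names are hypothetical, only reticulation nodes that carry both a branch length and a single gamma value are handled, and other bracket metadata is left untouched.

```python
import re

# '#H0[&gamma=.7]:.05' -> name, gamma, branch length
_BRACKET_FORM = re.compile(
    r"(#[^:,()\[\];]+)\[&gamma=([^\],]+)\]:([^:,()\[\];]+)")
# '#H0:.05::.7' -> name, branch length, gamma
_DOUBLE_COLON_FORM = re.compile(
    r"(#[^:,()\[\];]+):([^:,()\[\];]+)::([^:,()\[\];]+)")


def bracket_to_double_colon(newick_str: str) -> str:
    """Rewrite PhyNetPy-style gamma comments as #Name:len::gamma."""
    return _BRACKET_FORM.sub(r"\1:\3::\2", newick_str)


def double_colon_to_bracket(newick_str: str) -> str:
    """Rewrite #Name:len::gamma as a PhyNetPy-style gamma comment."""
    return _DOUBLE_COLON_FORM.sub(r"\1[&gamma=\3]:\2", newick_str)


phynetpy_form = "((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;"
phylonet_form = bracket_to_double_colon(phynetpy_form)
round_trip = double_colon_to_bracket(phylonet_form)
```

Applied to the PhyNetPy example string from the description, the first rewrite yields the Phylonet example string, and the second rewrite restores the original. The second occurrence of #H0, which has no gamma, is unchanged in both directions.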
