IO Module v1.1.0

Central I/O hub for reading and writing phylogenetic file formats (FASTA, VCF, Newick, Nexus).

Author:: Mark Kessler
Last Edit:: 2/6/26
Source:: IO.py

Constants

FASTA_LINE_WIDTH : int = 80

Exceptions

exception IOError(Exception)

Exception raised when file I/O operations fail within PhyNetPy.

Module Functions

read_fasta_records(filepath: str) -> List[DataSequence]

Read a FASTA file and return a list of DataSequence objects. This is the lower-level reader: it returns raw DataSequence objects without wrapping them in an MSA, which is useful for attaching sequences directly to Node objects in an existing Network (via Node.set_seq()).

A FASTA file looks like:

    >sequence_name_1
    ATCGATCGATCG...
    >sequence_name_2
    GCTAGCTAGCTA...

Each record becomes a DataSequence where:

- name = the FASTA header (sequence ID)
- seq = list of characters from the sequence string

Parameter	Type	Description
filepath	str	Path to a FASTA file (.fasta, .fas, .fa, .fna, .ffn, .faa).

Returns: list[DataSequence]: A list of DataSequence objects, one per FASTA record.

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If BioPython cannot parse the file or it contains no valid sequences.
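The record structure described for read_fasta_records can be illustrated without PhyNetPy installed. The following is a minimal pure-Python sketch, not the module's implementation (which relies on BioPython). `parse_fasta_text` is a hypothetical helper, and plain tuples stand in for DataSequence objects.

```python
from typing import List, Optional, Tuple


def parse_fasta_text(text: str) -> List[Tuple[str, List[str]]]:
    """Hypothetical helper: split FASTA text into (name, seq) pairs.

    Plain tuples stand in for DataSequence objects; seq is a list of
    characters, mirroring the description above.
    """
    records: List[Tuple[str, List[str]]] = []
    name: Optional[str] = None
    chars: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, chars))  # close the previous record
            tokens = line[1:].split()
            # Assumption: the first whitespace-delimited token is the ID.
            name = tokens[0] if tokens else ""
            chars = []
        elif name is not None:
            chars.extend(line)  # a sequence may span several lines
    if name is not None:
        records.append((name, chars))
    return records


example = ">sequence_name_1\nATCG\nATCG\n>sequence_name_2\nGCTA\n"
parsed = parse_fasta_text(example)
```

Here `parsed` holds one pair per header line, with multi-line sequences joined into a single character list.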

read_fasta(filepath: str, grouping: Optional[Dict[str, list]] = None, grouping_auto_detect: bool = False) -> MSA

Read a FASTA file and return an MSA object containing all sequences. This function parses a FASTA file, converts each record into a DataSequence, and wraps them in an MSA for downstream phylogenetic analyses such as distance calculations, alignment inspection, or model-based inference.

Parameter	Type	Description
filepath	str	Path to a FASTA file (.fasta, .fas, .fa, .fna, .ffn, .faa).
grouping	dict[str, list], optional	A mapping from group names to lists of sequence names that belong to that group. If provided, sequences will be assigned group IDs accordingly. Defaults to None.
grouping_auto_detect	bool, optional	If True, attempt to automatically group sequences by name similarity. Defaults to False.

Returns: MSA: A Multiple Sequence Alignment object containing all parsed sequences.

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If the file cannot be parsed or contains no valid sequences.
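To make the shape of the grouping argument concrete, the sketch below inverts a group-to-sequence-names mapping into a per-sequence lookup. It is only an illustration: `invert_grouping` is a hypothetical helper, the group name is used as the label (the actual form of PhyNetPy's group IDs is not specified here), and rejecting a sequence listed in two groups is an assumption.

```python
from typing import Dict, List


def invert_grouping(grouping: Dict[str, List[str]]) -> Dict[str, str]:
    """Hypothetical helper: map each sequence name to its group name."""
    seq_to_group: Dict[str, str] = {}
    for group, names in grouping.items():
        for name in names:
            if name in seq_to_group:
                # Assumption: a sequence may belong to only one group.
                raise ValueError(f"{name!r} is listed in more than one group")
            seq_to_group[name] = group
    return seq_to_group


# Same shape as the grouping parameter: group name -> sequence names.
grouping = {"speciesA": ["a_1", "a_2"], "speciesB": ["b_1"]}
lookup = invert_grouping(grouping)
```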

write_fasta(msa: MSA, filepath: str, line_width: int = FASTA_LINE_WIDTH) -> None

Write an MSA object to a FASTA file. Each DataSequence in the MSA is written as a FASTA record:

    >sequence_name
    ATCGATCG...

Sequence lines are wrapped at line_width characters.

Parameter	Type	Description
msa	MSA	The Multiple Sequence Alignment to write.
filepath	str	The output file path. Will be created or overwritten.
line_width	int, optional	Number of characters per sequence line. Standard FASTA convention is 80. Defaults to 80.

Returns: None

Raises:
- IOError: If the MSA has no records to write, or if the file cannot be written.
- ValueError: If line_width is less than 1.
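The wrapping behaviour controlled by line_width can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: `format_fasta_record` is a hypothetical helper that works on a plain name and string instead of a DataSequence.

```python
def format_fasta_record(name: str, seq: str, line_width: int = 80) -> str:
    """Hypothetical helper: render one FASTA record with wrapped lines."""
    if line_width < 1:
        # Mirrors the documented ValueError for invalid widths.
        raise ValueError("line_width must be at least 1")
    lines = [f">{name}"]
    # Slice the sequence into chunks of at most line_width characters.
    for start in range(0, len(seq), line_width):
        lines.append(seq[start:start + line_width])
    return "\n".join(lines) + "\n"


record = format_fasta_record("sequence_name", "ATCGATCGAT", line_width=4)
```

With a width of 4, the ten-character sequence is written as two full lines followed by a two-character remainder.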

write_fasta_from_network(network: Network, filepath: str, line_width: int = FASTA_LINE_WIDTH) -> None

Extract sequences from the leaf nodes of a Network and write them to a FASTA file. Only leaf nodes that have an associated DataSequence (set via Node.set_seq()) will be written. The node label becomes the FASTA header, and the attached sequence becomes the FASTA sequence. This is useful when a Network has been annotated with molecular data and the user wants to export just the sequence data.

Parameter	Type	Description
network	Network	A phylogenetic network whose leaf nodes may carry DataSequence objects.
filepath	str	The output FASTA file path.
line_width	int, optional	Characters per line for sequence wrapping. Defaults to 80.

Returns: None

Raises:
- IOError: If no leaf nodes in the network have sequence data attached, or if the file cannot be written.
- ValueError: If line_width is less than 1.

read_vcf(filepath: str, grouping: Optional[Dict[str, list]] = None, missing_value: str = '?') -> MSA

Read a VCF (Variant Call Format) file and return an MSA object. Each sample in the VCF becomes a DataSequence whose sequence is the vector of ALT allele counts across all variant sites. This maps directly to the SNP/BiMarkers pipeline used in PhyNetPy.

A typical VCF file looks like::

    ##fileformat=VCFv4.1
    ##INFO=<...>
    #CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  Samp1  Samp2
    chr1    100  .   A    T    30    PASS    .     GT      0/0    0/1
    chr1    200  .   G    C    50    PASS    .     GT      1/1    0/1

Genotype encoding:

- 0/0 -> 0 (homozygous reference, 0 copies of ALT allele)
- 0/1 -> 1 (heterozygous, 1 copy of ALT allele)
- 1/1 -> 2 (homozygous alternate, 2 copies of ALT allele)
- ./. -> missing_value (missing genotype)

Parameter	Type	Description
filepath	str	Path to a VCF file (.vcf).
grouping	dict[str, list], optional	A mapping from group/species names to lists of sample names that belong to that group. Used for the BiMarkers pipeline where multiple individuals map to a single species. Defaults to None.
missing_value	str, optional	The character to use for missing genotype data (./.). Defaults to "?".

Returns: MSA: A Multiple Sequence Alignment where each DataSequence represents one sample's genotype vector across all sites.

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If the file cannot be parsed or contains no variant data.
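The genotype encoding for read_vcf can be sketched in a few lines of plain Python. This is not PhyNetPy's parser: `genotype_to_alt_count` is a hypothetical helper, and treating phased genotypes (`0|1`) like unphased ones, counting any non-zero allele as ALT, and returning counts as strings are all assumptions.

```python
import re
from typing import Dict, List


def genotype_to_alt_count(gt: str, missing_value: str = "?") -> str:
    """Hypothetical helper: convert a GT field to an ALT allele count."""
    gt = gt.split(":")[0]  # keep only GT if other FORMAT values follow
    alleles = re.split(r"[/|]", gt)
    if any(allele == "." for allele in alleles):
        return missing_value  # e.g. "./."
    # Assumption: every non-zero allele index counts as one ALT copy.
    return str(sum(1 for allele in alleles if allele != "0"))


# The two variant rows from the example file above, tab-separated.
vcf_rows = [
    "chr1\t100\t.\tA\tT\t30\tPASS\t.\tGT\t0/0\t0/1",
    "chr1\t200\t.\tG\tC\t50\tPASS\t.\tGT\t1/1\t0/1",
]
samples = ["Samp1", "Samp2"]
vectors: Dict[str, List[str]] = {sample: [] for sample in samples}
for row in vcf_rows:
    fields = row.split("\t")
    # Sample genotype columns start at index 9 in a VCF record.
    for sample, gt in zip(samples, fields[9:]):
        vectors[sample].append(genotype_to_alt_count(gt))
```

Each entry of `vectors` is one sample's genotype vector across the two sites, which is the per-sample sequence the MSA description refers to.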

read_vcf_metadata(filepath: str) -> Dict[str, Any]

Read only the metadata and header from a VCF file without loading all variant data. Useful for inspecting what samples and fields are available before a full parse.

Parameter	Type	Description
filepath	str	Path to a VCF file.

Returns: dict[str, Any]: A dictionary containing:

- "fileformat": The VCF version string
- "metadata_lines": List of all ## header lines
- "sample_names": List of sample column names
- "info_fields": List of INFO field IDs
- "format_fields": List of FORMAT field IDs
- "filter_fields": List of FILTER field IDs
- "contig_fields": List of contig IDs

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If the file cannot be read.
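A header-only scan of the kind read_vcf_metadata describes might look like the sketch below. It is an assumption-laden illustration, not the module's code: `parse_vcf_header` is a hypothetical helper operating on an iterable of lines, and the exact values PhyNetPy stores (for example, whether "fileformat" keeps the `VCFv` prefix) may differ.

```python
import re
from typing import Any, Dict, Iterable

# Header keys whose ID attribute is collected, mapped to the result key.
_ID_KEYS = {
    "INFO": "info_fields",
    "FORMAT": "format_fields",
    "FILTER": "filter_fields",
    "contig": "contig_fields",
}


def parse_vcf_header(lines: Iterable[str]) -> Dict[str, Any]:
    """Hypothetical helper: collect VCF header metadata, skipping variants."""
    meta: Dict[str, Any] = {
        "fileformat": None,
        "metadata_lines": [],
        "sample_names": [],
        "info_fields": [],
        "format_fields": [],
        "filter_fields": [],
        "contig_fields": [],
    }
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            meta["metadata_lines"].append(line)
            if line.startswith("##fileformat="):
                meta["fileformat"] = line.split("=", 1)[1]
            match = re.match(r"##(\w+)=<ID=([^,>]+)", line)
            if match and match.group(1) in _ID_KEYS:
                meta[_ID_KEYS[match.group(1)]].append(match.group(2))
        elif line.startswith("#CHROM"):
            # Sample names follow the nine fixed VCF columns.
            meta["sample_names"] = line.split("\t")[9:]
            break  # stop before any variant record is read
        else:
            break
    return meta


header_lines = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "##contig=<ID=chr1,length=1000>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSamp1\tSamp2",
    "chr1\t100\t.\tA\tT\t30\tPASS\t.\tGT\t0/0\t0/1",
]
meta = parse_vcf_header(header_lines)
```

Because the loop stops at the `#CHROM` line, the cost of the scan depends on the header size rather than on the number of variant records.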

write_vcf(msa: MSA, filepath: str, chrom: str = 'chr1', start_pos: int = 1, ref_allele: str = 'A', alt_allele: str = 'T') -> None

Write an MSA of SNP/allele-count data to a simplified VCF file. This produces a minimal VCF where each site in the MSA becomes a variant record, and each DataSequence becomes a sample column. The allele count values (0, 1, 2, ...) are converted back to VCF genotype notation (e.g., 0/0, 0/1, 1/1).

Note: Because the MSA does not store chromosome position, reference alleles, or other VCF-specific metadata, this output is a simplified reconstruction. It is suitable for round-tripping SNP data or creating test files, but will not preserve full VCF metadata from an original file.

Parameter	Type	Description
msa	MSA	The MSA containing allele count data (values like 0, 1, 2 per site).
filepath	str	The output VCF file path.
chrom	str, optional	Chromosome name for all records. Defaults to "chr1".
start_pos	int, optional	Starting position for the first variant. Each subsequent variant increments by 1. Defaults to 1.
ref_allele	str, optional	Reference allele character. Defaults to "A".
alt_allele	str, optional	Alternate allele character. Defaults to "T".

Returns: None

Raises: IOError: If the MSA has no records, or the file cannot be written.
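The reverse mapping used by write_vcf, from allele counts back to genotype notation, can be sketched as follows. This is illustrative only: `alt_count_to_genotype` is a hypothetical helper, diploid samples are assumed, counts above 2 are rejected (how PhyNetPy handles them is not stated here), and the `.` placeholders for ID, QUAL, FILTER, and INFO are assumptions about the simplified output.

```python
from typing import Dict, List


def alt_count_to_genotype(value: str, missing_value: str = "?") -> str:
    """Hypothetical helper: convert an ALT allele count to a diploid GT."""
    if value == missing_value:
        return "./."
    count = int(value)
    if not 0 <= count <= 2:
        # Assumption: only diploid counts are supported in this sketch.
        raise ValueError(f"unsupported allele count: {count}")
    return "/".join(["0"] * (2 - count) + ["1"] * count)


# Plain dict standing in for an MSA: sample name -> per-site counts.
samples: Dict[str, List[str]] = {"Samp1": ["0", "2"], "Samp2": ["1", "?"]}
chrom, start_pos, ref_allele, alt_allele = "chr1", 1, "A", "T"

records: List[str] = []
n_sites = 2
for site in range(n_sites):
    genotypes = [alt_count_to_genotype(seq[site]) for seq in samples.values()]
    # Each site becomes one record; the position increments by 1 per site.
    fixed = [chrom, str(start_pos + site), ".", ref_allele, alt_allele,
             ".", ".", ".", "GT"]
    records.append("\t".join(fixed + genotypes))
```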

read_newick(newick_str: str) -> Network

Parse a single newick/extended-newick string into a PhyNetPy Network. Supports standard newick features (branch lengths, internal node names) as well as the extended newick format for phylogenetic networks (reticulation nodes prefixed with '#', gamma inheritance comments).

Examples of accepted strings::

    ((A:0.1,B:0.2):0.3,C:0.4);
    ((A:0.1,(B:0.2)#H1:0.3):0.4,(#H1:0.5,C:0.6):0.7);

Parameter	Type	Description
newick_str	str	A newick or extended-newick string. Trailing semicolons are handled automatically.

Returns: Network: A PhyNetPy Network object with the same topology, names, and branch lengths as described in the newick string.

Raises: IOError: If the string cannot be parsed.

read_newick_file(filepath: str, return_type: Literal['networks', 'genetrees'] = 'networks', species_gene_mapping: Optional[Dict[str, List[str]]] = None, naming_rule: Optional[Callable[..., Any]] = None) -> Union[List[Network], GeneTrees]

Read a file containing one or more newick strings (one per line) and parse each into a PhyNetPy Network. Blank lines and lines starting with '#' are skipped.

Parameter	Type	Description
filepath	str	Path to a file containing newick strings.
return_type	str	``"networks"`` (default) returns a list of Network objects. ``"genetrees"`` validates each network as a rooted binary tree and wraps them in a GeneTrees object.
species_gene_mapping	dict, optional	Explicit species -> gene label mapping. Only used when return_type is ``"genetrees"``.
naming_rule	Callable, optional	Gene-label-to-species callable. Only used when return_type is ``"genetrees"`` and no explicit mapping is given.

Returns: list[Network] | GeneTrees: Parsed phylogenetic data.

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If no valid newick strings are found, or parsing fails.
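The line-filtering rule for read_newick_file (skip blank lines and lines starting with '#') can be sketched as below. `candidate_newick_lines` is a hypothetical helper, and stripping surrounding whitespace before the '#' check is an assumption; the resulting strings would then be handed to a parser such as read_newick.

```python
from typing import List


def candidate_newick_lines(text: str) -> List[str]:
    """Hypothetical helper: keep only lines that should be parsed."""
    kept: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment line
        kept.append(stripped)
    return kept


file_text = (
    "# gene trees for locus 1\n"
    "((A:0.1,B:0.2):0.3,C:0.4);\n"
    "\n"
    "((B:0.1,C:0.2):0.3,A:0.4);\n"
)
newicks = candidate_newick_lines(file_text)
```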

write_newick(network: Network) -> str

Convert a PhyNetPy Network into a newick string. Delegates to the Network's built-in ``newick()`` method, which produces extended-newick notation for networks with reticulation nodes.

Parameter	Type	Description
network	Network	A PhyNetPy Network object.

Returns: str: The newick representation of the network, ending with ';'.

write_newick_file(networks: List[Network], filepath: str) -> None

Write one or more Networks to a file as newick strings, one per line.

Parameter	Type	Description
networks	list[Network]	Networks to write.
filepath	str	Output file path. Will be created or overwritten.

Returns: None

Raises: IOError: If the list is empty or the file cannot be written.

read_nexus(filepath: str, validate_input: bool = False, print_validation_summary: bool = False, return_type: Literal['networks', 'genetrees'] = 'networks', species_gene_mapping: Optional[Dict[str, List[str]]] = None, naming_rule: Optional[Callable[..., Any]] = None) -> Union[List[Network], GeneTrees]

Read a nexus file and parse all trees/networks in the TREES block into PhyNetPy Network objects. This replicates the core functionality of ``NetworkParser`` as a standalone function, making it easy to call without instantiating a class.

A typical nexus file looks like::

    #NEXUS
    BEGIN TAXA;
        DIMENSIONS NTAX=3;
        TAXALABELS A B C;
    END;
    BEGIN TREES;
        Tree t1 = ((A:0.1,B:0.2):0.3,C:0.4);
        Tree t2 = ((B:0.1,C:0.2):0.3,A:0.4);
    END;

Parameter	Type	Description
filepath	str	Path to a nexus file (.nex, .nexus).
validate_input	bool, optional	If True, run NexusValidator on the file before parsing. Defaults to False.
print_validation_summary	bool, optional	If True and validate_input is True, print the validation summary. Defaults to False.
return_type	str	``"networks"`` (default) returns a list of Network objects. ``"genetrees"`` validates each network as a rooted binary tree and wraps them in a GeneTrees object.
species_gene_mapping	dict, optional	Explicit species -> gene label mapping. Only used when return_type is ``"genetrees"``.
naming_rule	Callable, optional	Gene-label-to-species callable. Only used when return_type is ``"genetrees"`` and no explicit mapping is given.

Returns: list[Network] | GeneTrees: Parsed phylogenetic data.

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If the file cannot be parsed or contains no trees.

read_nexus_msa(filepath: str) -> MSA

Read the sequence data (DATA/CHARACTERS block) from a nexus file and return it as an MSA object. This is a convenience wrapper around the MSA constructor's built-in nexus parsing. Use this when you want the alignment data rather than the tree topology.

Parameter	Type	Description
filepath	str	Path to a nexus file containing a DATA or CHARACTERS block.

Returns: MSA: The parsed Multiple Sequence Alignment.

Raises:
- FileNotFoundError: If the file does not exist.
- IOError: If no sequence data is found.

write_nexus(networks: List[Network], filepath: str, taxa: Optional[Set[str]] = None, tree_prefix: str = 'net', overwrite: bool = True, phylonet_cmds: Optional[List[str]] = None) -> None

Write one or more Networks to a nexus file with TAXA and TREES blocks. This replicates the functionality of the ``NexusTemplate`` class as a standalone function.

The generated file follows the standard nexus format::

    #NEXUS
    BEGIN TAXA;
        DIMENSIONS NTAX=3;
        TAXALABELS A B C ;
    END;
    BEGIN TREES;
        Tree net1 = ((A:0.1,B:0.2):0.3,C:0.4);
        Tree net2 = ...;
    END;

Parameter	Type	Description
networks	list[Network]	The networks to write.
filepath	str	Output file path.
taxa	set[str], optional	An explicit set of taxa labels. If None, taxa are inferred from the newick strings. Defaults to None.
tree_prefix	str, optional	Label prefix for each tree line. Defaults to "net".
overwrite	bool, optional	If False, raises IOError if the file already exists. Defaults to True.
phylonet_cmds	list[str], optional	A list of PhyloNet commands to include in a PHYLONET block. Defaults to None.

Returns: None

Raises: IOError: If the list is empty, or the file cannot be written, or the file already exists and overwrite is False.
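The TAXA and TREES layout that write_nexus produces can be assembled from plain strings, as in the sketch below. It is not the module's writer: `build_nexus_text` is a hypothetical helper that takes newick strings instead of Network objects, always requires an explicit taxa set, sorts the labels (an assumption), and omits the optional PHYLONET block.

```python
from typing import Iterable, List


def build_nexus_text(newicks: List[str], taxa: Iterable[str],
                     tree_prefix: str = "net") -> str:
    """Hypothetical helper: render TAXA and TREES blocks as nexus text."""
    labels = sorted(taxa)  # assumption: deterministic, sorted label order
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"    DIMENSIONS NTAX={len(labels)};",
        "    TAXALABELS " + " ".join(labels) + ";",
        "END;",
        "BEGIN TREES;",
    ]
    # Trees are numbered from 1: net1, net2, ...
    for index, newick in enumerate(newicks, start=1):
        body = newick.strip().rstrip(";")  # avoid a doubled semicolon
        lines.append(f"    Tree {tree_prefix}{index} = {body};")
    lines.append("END;")
    return "\n".join(lines) + "\n"


nexus_text = build_nexus_text(
    ["((A:0.1,B:0.2):0.3,C:0.4);", "((B:0.1,C:0.2):0.3,A:0.4);"],
    {"A", "B", "C"},
)
```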

detect_newick_standard(newick_str: str) -> str

Auto-detect which newick convention a string uses based on its formatting. The detection heuristic is:

1. If the string contains ``#Name:len::gamma`` double-colon notation on a reticulation node → **Phylonet**
2. If the string starts with ``[&R]`` or ``[&U]`` → **Beast**
3. If the string contains ``[&...gamma=...]`` → **PhyNetPy**
4. Otherwise (plain newick) → **PhyNetPy** (default)

Parameter	Type	Description
newick_str	str	A newick or extended-newick string.

Returns: str: One of ``"Phylonet"``, ``"Beast"``, or ``"PhyNetPy"``.
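The four-step heuristic for detect_newick_standard can be approximated with one regular expression and one prefix test. The sketch below is an independent illustration, not the module's code: `guess_newick_standard` and the exact pattern are assumptions, and because steps 3 and 4 both yield "PhyNetPy" they collapse into a single fallback.

```python
import re

# "#Name:len::gamma": a reticulation label, a branch length, then "::".
_PHYLONET_RETIC = re.compile(r"#\w+:[^:,()\[\];]*::")


def guess_newick_standard(newick_str: str) -> str:
    """Hypothetical helper: mirror the documented detection order."""
    text = newick_str.strip()
    if _PHYLONET_RETIC.search(text):
        return "Phylonet"  # step 1: double-colon gamma notation
    if text.startswith(("[&R]", "[&U]")):
        return "Beast"  # step 2: rooted/unrooted prefix
    # Steps 3 and 4: gamma comments and plain newick are both PhyNetPy.
    return "PhyNetPy"


phynetpy_str = "((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;"
phylonet_str = "((C:.1,(B:.05)#H0:.05::.7)I1:.1,(A:.1,#H0:.05)I2:.1)I3;"
beast_str = "[&R] " + phynetpy_str
detected = [guess_newick_standard(s)
            for s in (phynetpy_str, phylonet_str, beast_str)]
```

The order of the checks matters: a string carrying both a `[&R]` prefix and double-colon notation would be classified as Phylonet, following step 1 of the documented heuristic.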

convert_newick(newick_str: str, standard: str = 'PhyNetPy') -> str

Convert a newick/extended-newick string between different software conventions. The three supported standards differ primarily in how they encode inheritance probabilities (gamma) on reticulation edges:

**PhyNetPy** uses BioPython-style bracket comments::

    ((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;

**Phylonet** uses Rich Newick double-colon notation::

    ((C:.1,(B:.05)#H0:.05::.7)I1:.1,(A:.1,#H0:.05)I2:.1)I3;

**Beast** uses the same annotation as PhyNetPy but prefixes the string with ``[&R]`` for rooted trees (or ``[&U]`` for unrooted)::

    [&R] ((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;

The function auto-detects the input convention and converts to the target. Non-gamma metadata (e.g. ``[&posterior=0.95]``) on non-reticulation nodes is preserved in all conversions.

Parameter	Type	Description
newick_str	str	A newick or extended-newick string in any of the three conventions.
standard	str, optional	Target convention. One of ``"PhyNetPy"`` (default), ``"Phylonet"``, or ``"Beast"``.

Returns: str: The newick string reformatted for the target software.

Raises:
- ValueError: If ``standard`` is not one of the three valid options.
- IOError: If the input string is empty.
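The difference between the bracket-comment and double-colon gamma notations comes down to a textual rewrite on reticulation nodes. The sketch below shows that rewrite in both directions with regular expressions. It is a simplified illustration, not convert_newick itself: both function names are hypothetical, comments are assumed to contain only `gamma=...`, every annotated reticulation is assumed to carry a branch length, and the Beast prefix is not handled.

```python
import re

# "#H0[&gamma=.7]:.05" -> groups: label, gamma, ":length"
_BRACKET_GAMMA = re.compile(r"(#\w+)\[&gamma=([^\]]+)\](:[^:,()\[\];]+)")
# "#H0:.05::.7" -> groups: label, ":length", gamma
_COLON_GAMMA = re.compile(r"(#\w+)(:[^:,()\[\];]*)::([^:,()\[\];]+)")


def bracket_to_double_colon(newick_str: str) -> str:
    """Hypothetical helper: PhyNetPy-style gamma to Phylonet-style."""
    return _BRACKET_GAMMA.sub(r"\1\3::\2", newick_str)


def double_colon_to_bracket(newick_str: str) -> str:
    """Hypothetical helper: Phylonet-style gamma to PhyNetPy-style."""
    return _COLON_GAMMA.sub(r"\1[&gamma=\3]\2", newick_str)


phynetpy_str = "((C:.1,(B:.05)#H0[&gamma=.7]:.05)I1:.1,(A:.1,#H0:.05)I2:.1)I3;"
phylonet_str = bracket_to_double_colon(phynetpy_str)
round_trip = double_colon_to_bracket(phylonet_str)
```

Only the annotated occurrence of `#H0` is rewritten; the second, unannotated `#H0:.05` matches neither pattern and passes through unchanged.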
