PhyNetPy Documentation

Library for the Development and Use of Phylogenetic Network Methods

ModelSelection Module v0.3.2

Information-criterion-based reticulation-count selection (AIC, BIC, AICc) and reticulation sweep helpers.

Author: Mark Kessler
Last Edit: 4/24/26
Source: ModelSelection.py

SweepRow

class SweepRow

One row of the reticulation sweep: results for a single ``k``.

Attributes:
    k: Reticulation count this row was produced with.
    best_log_lik: Best log-likelihood observed across seeds at this ``k`` (higher is better for MPL).
    all_log_liks: Per-seed log-likelihoods, in the order the seeds were evaluated. Useful for diagnosing multimodality.
    n_params: Effective parameter count used for AIC/BIC.
    aic: ``2 * n_params - 2 * best_log_lik``.
    bic: ``n_params * ln(data_size) - 2 * best_log_lik``.
    elapsed_s: Wall-clock time spent on this ``k`` (all seeds).
    delta_log_lik: Marginal gain over the previous ``k`` row; set by :func:`reticulation_sweep` after all rows are built. ``None`` for the first row.
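
The ``aic`` and ``bic`` formulas above can be checked by hand. The following is a minimal sketch, not the module's implementation: the helper name ``information_criteria`` and the input numbers are hypothetical and chosen only for illustration.

```python
import math


def information_criteria(best_log_lik: float, n_params: int, data_size: int) -> tuple[float, float]:
    """Return (AIC, BIC) from the formulas listed in the SweepRow attributes.

    Hypothetical helper for illustration; it is not part of ModelSelection.py.
    """
    aic = 2 * n_params - 2 * best_log_lik
    bic = n_params * math.log(data_size) - 2 * best_log_lik
    return aic, bic


# Made-up inputs: 6 parameters, 100 data points, best log-likelihood of -1200.
aic, bic = information_criteria(best_log_lik=-1200.0, n_params=6, data_size=100)
print(aic, bic)
```

Because ``ln(data_size)`` exceeds 2 once ``data_size`` is 8 or more, BIC penalises each extra parameter more heavily than AIC for all but the smallest data sets.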

SweepResult

class SweepResult

Container for reticulation-sweep rows with selection/plotting helpers.

Built by :func:`reticulation_sweep`. Exposes :meth:`best_by` for criterion-based ``k`` recommendation, :meth:`print_summary` for a console table, :meth:`save_csv` for a machine-readable dump, and :meth:`plot` for a visual report.

Attributes:
    rows: Ordered list of :class:`SweepRow`, one per ``k``.
    data_size: ``n`` used in the BIC formula.
    params_per_reticulation: Parameters added by each reticulation.
    base_params: Backbone (k=0) parameter count.
    log_lik_label: Human-readable name for the y-axis / summary column (e.g. ``"log-pseudo-likelihood"``).

Methods

best_by(criterion: Criterion) -> int

Return the recommended ``k`` under the given criterion.

Parameter Type Description
criterion One of:
    * ``"logL"``: ``argmax_k logL(k)`` (ignores parsimony; always picks the largest ``k`` in a non-overfit regime).
    * ``"aic"``: ``argmin_k AIC(k)``.
    * ``"bic"``: ``argmin_k BIC(k)``.
    * ``"elbow"``: smallest ``k`` at which the next-step gain in log-likelihood falls below ``elbow_tol_frac`` of the maximum gain across the sweep. This matches the classic "knee plot" heuristic.
Returns: Recommended reticulation count.
Raises: ValueError: If the sweep is empty or ``criterion`` is unrecognised.
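
The four criteria can be sketched on plain lists to show how they may disagree. This is an illustrative reimplementation under stated assumptions, not the code behind ``best_by``: the function ``recommend_k``, its argument layout, the ``elbow_tol_frac`` value of ``0.1``, and the fallback to the last ``k`` when no gain drops below the threshold are all assumptions.

```python
import math
from typing import Sequence


def recommend_k(
    ks: Sequence[int],
    log_liks: Sequence[float],
    n_params: Sequence[int],
    data_size: int,
    criterion: str,
    elbow_tol_frac: float = 0.1,  # assumed value; the real default is not documented here
) -> int:
    """Hypothetical stand-in for best_by, operating on parallel lists."""
    if not ks:
        raise ValueError("sweep is empty")
    idx = range(len(ks))
    if criterion == "logL":
        return ks[max(idx, key=lambda i: log_liks[i])]
    if criterion == "aic":
        return ks[min(idx, key=lambda i: 2 * n_params[i] - 2 * log_liks[i])]
    if criterion == "bic":
        return ks[min(idx, key=lambda i: n_params[i] * math.log(data_size) - 2 * log_liks[i])]
    if criterion == "elbow":
        # gains[i] is the improvement from moving from ks[i] to ks[i + 1].
        gains = [log_liks[i + 1] - log_liks[i] for i in range(len(ks) - 1)]
        if not gains:
            return ks[0]
        threshold = elbow_tol_frac * max(gains)
        for i, gain in enumerate(gains):
            if gain < threshold:
                return ks[i]
        return ks[-1]  # assumed fallback: no step fell below the threshold
    raise ValueError(f"unrecognised criterion: {criterion!r}")


# Made-up sweep: large gain at k=1, then diminishing returns.
ks = [0, 1, 2, 3]
log_liks = [-1300.0, -1200.0, -1195.0, -1194.0]
n_params = [0, 3, 6, 9]  # base_params=0, 3 parameters per reticulation

picks = {c: recommend_k(ks, log_liks, n_params, 100, c) for c in ("logL", "aic", "bic", "elbow")}
print(picks)
```

On this made-up data the raw log-likelihood picks the largest ``k``, AIC stops one step earlier, and BIC and the elbow rule both stop at ``k = 1``.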
print_summary(file = None) -> None

Pretty-print the sweep table, deltas, and recommendations.

Parameter Type Description
file Optional text stream to print to. Defaults to ``sys.stdout``.
save_csv(path: str | Path) -> None

Write the sweep rows to ``path`` as CSV, one row per ``k``. Parent directories are created on demand. Per-seed scores are joined into a single ``;``-separated field so the CSV stays flat.

Parameter Type Description
path Output CSV path.
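
The flat layout described for ``save_csv`` can be approximated with the standard library. This is a sketch only: the function ``write_sweep_csv`` and the column names are hypothetical, and the real method may write more columns or use different headers.

```python
import csv
import tempfile
from pathlib import Path


def write_sweep_csv(path, rows):
    """Write (k, best_log_lik, all_log_liks) tuples as one CSV row per k.

    Hypothetical helper; column names are assumptions, not the module's.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create parent directories on demand
    with path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["k", "best_log_lik", "all_log_liks"])
        for k, best, per_seed in rows:
            # Join per-seed scores with ';' so they occupy a single CSV field.
            writer.writerow([k, best, ";".join(str(x) for x in per_seed)])


# Made-up rows: two values of k, two seeds each.
demo_rows = [
    (0, -1300.0, [-1300.0, -1310.5]),
    (1, -1200.0, [-1205.25, -1200.0]),
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "nested" / "sweep.csv"
    write_sweep_csv(out, demo_rows)
    lines = out.read_text().splitlines()

print(lines)
```

Using ``;`` inside the field avoids clashing with the comma delimiter, so the per-seed values can be recovered later with a simple ``split(";")``.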
plot(path: str | Path | None = None, show: bool = False, title: str | None = None) -> None

Render the log-likelihood / AIC / BIC curves versus ``k``. Writes a PNG to ``path`` (if provided) and/or opens an interactive window (``show=True``). Each recommended ``k`` is marked with a vertical dashed line colour-matched to its criterion.

Parameter Type Description
path Optional PNG output path.
show When True, also display an interactive window.
title Optional figure title.

Module Functions

reticulation_sweep(search_fn: Callable[[int, int], float], k_values: Sequence[int], seeds: Sequence[int] = (0,), data_size: int = 1, params_per_reticulation: int = 3, base_params: int = 0, log_lik_label: str = 'log-likelihood', progress: bool = True) -> SweepResult

Run ``search_fn`` over each ``k in k_values`` and summarize.

Parameter Type Description
search_fn Callable ``f(k, seed) -> float`` that performs one search with ``max_reticulations == k`` and returns the best log-likelihood found. The caller is responsible for constructing a fresh search object / starting network for each invocation if that's appropriate for the method.
k_values Reticulation counts to sweep over (e.g. ``range(0, 4)``).
seeds RNG seeds to run at each ``k``. If multiple seeds are provided, the best (highest) log-likelihood across seeds is taken as the representative score for that ``k``.
data_size ``n`` used in BIC (``p ln(n) - 2 logL``). For MPL this is typically ``len(gene_trees.trees)``.
params_per_reticulation Number of free parameters each additional reticulation contributes. For MPL, 3 (one gamma + two new branch lengths) is a sensible default; 1 (gamma only) is the most conservative choice.
base_params Parameters attributable to the backbone tree (commonly ``2 * n_taxa - 3`` for an unrooted binary tree). Pass ``0`` to ignore it: only differences across ``k`` matter for the AIC/BIC comparison.
log_lik_label Y-axis label, e.g. "log-pseudo-likelihood".
progress Print a one-line progress update per search.
Returns: A :class:`SweepResult` with the best score and derived stats at each ``k``.
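
The shape of a ``search_fn`` callback and the per-``k`` bookkeeping described above can be sketched without the library. Everything here is a stand-in: ``toy_search`` replaces a real network search with a closed-form score, ``sketch_sweep`` is a hypothetical imitation of the sweep loop that returns plain dictionaries rather than a :class:`SweepResult`, and the rule ``n_params = base_params + params_per_reticulation * k`` is an assumption inferred from the parameter descriptions.

```python
import math
from typing import Callable, Sequence


def toy_search(k: int, seed: int) -> float:
    """Stand-in for f(k, seed) -> float: gains saturate as k grows."""
    return -1300.0 + 100.0 * (1.0 - 0.5 ** k) - 0.5 * seed


def sketch_sweep(
    search_fn: Callable[[int, int], float],
    k_values: Sequence[int],
    seeds: Sequence[int] = (0,),
    data_size: int = 1,
    params_per_reticulation: int = 3,
    base_params: int = 0,
) -> list[dict]:
    """Hypothetical imitation of the sweep loop; not the library function."""
    rows = []
    for k in k_values:
        all_log_liks = [search_fn(k, seed) for seed in seeds]
        best = max(all_log_liks)  # best (highest) score across seeds
        n_params = base_params + params_per_reticulation * k  # assumed rule
        rows.append({
            "k": k,
            "best_log_lik": best,
            "all_log_liks": all_log_liks,
            "n_params": n_params,
            "aic": 2 * n_params - 2 * best,
            "bic": n_params * math.log(data_size) - 2 * best,
            "delta_log_lik": None,
        })
    # Marginal gains are filled in once every row exists.
    for prev, cur in zip(rows, rows[1:]):
        cur["delta_log_lik"] = cur["best_log_lik"] - prev["best_log_lik"]
    return rows


rows = sketch_sweep(toy_search, range(0, 4), seeds=(0, 1, 2), data_size=100)
for row in rows:
    print(row["k"], row["best_log_lik"], row["delta_log_lik"])
```

A callback with the same ``(k, seed) -> float`` signature as ``toy_search`` is what ``reticulation_sweep`` expects for ``search_fn``; in real use it would build a fresh search at each call and return the best log-likelihood found.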
