Python API Reference

NOTE: The MicroHapulator Python API is under Semantic Versioning. In brief, this means that every stable version of the MicroHapulator software is assigned a version number, and that any changes to the software's behavior or interface require the software version number to be updated in prescribed and predictable ways.

Haplotype calling

microhapulator.api.type(bamfile, markertsv, minbasequal=10, max_depth=1000000.0)

Perform haplotype calling

Parameters
  • bamfile (str) -- path of a BAM file containing NGS reads aligned to marker reference sequences and sorted

  • markertsv (str) -- path of a TSV file containing marker metadata, specifically the offset of each SNP for every marker in the panel

  • minbasequal (int) -- minimum base quality (PHRED score) to be considered reliable for haplotype calling; default is 10, corresponding to Q10, i.e., 90% probability that the base call is correct

  • max_depth (float) -- maximum permitted read depth

Returns

an unfiltered catalog of haplotype counts for each marker (a typing result)

Return type

microhapulator.profile.TypingResult

microhapulator.profile.TypingResult.filter(self, static=None, dynamic=None, config=None)

Apply static and/or dynamic thresholds to distinguish true and false haplotypes

Thresholds are applied to the haplotype read counts of a raw typing result. Static integer thresholds are commonly used as detection thresholds, below which any haplotype count is considered noise. Dynamic thresholds are commonly used as analytical thresholds and represent a percentage of the total read count at the marker, after any haplotypes failing a static threshold are discarded.

Parameters
  • static (int) -- global fixed read count threshold

  • dynamic (float) -- global percentage of total read count; e.g. use dynamic=0.02 to apply a 2% analytical threshold

  • config (pandas.DataFrame) -- tabular data structure specifying marker-specific thresholds to override global thresholds; three required columns: Marker for the marker name; Static and Dynamic for marker-specific thresholds

Analysis and interpretation

microhapulator.api.balance(result, include_discarded=True)

Compute interlocus balance

Parameters
  • result (microhapulator.profile.TypingResult) -- a typing result including haplotype counts

  • included_discarded (bool) -- flag indicating whether to include in each marker's total read count reads that are successfully aligned but discarded because they do not span all SNPs at the marker

Returns

total read counts for each marker in a two-column tabular data structure

Return type

pandas.DataFrame

microhapulator.api.contrib(profile)

Estimate the minimum number of DNA contributors to a suspected mixture

Let \(N_{\text{al}}\) represent the largest number of haplotypes observed at any marker. We then estimate the minimum number of DNA contributors as follows.

\(C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil\)

Parameters

profile (microhapulator.profile.Profile) -- a typing result or a simulated genotype

Returns

a tuple (E, N, P), where E is the estimate for the minimum number of DNA contributors, N is the number of markers supporting the estimate, and P is the percentage of markers supporting the estimate

Return type

tuple(int, int, float)

microhapulator.api.prob(frequencies, prof1, prof2=None, erate=0.001)

Compute a profile random match probability (RMP) or an RMP-based likelihood ratio (LR) test

The LR test, when performed, assesses the relative weight of two competing propositions.

  • \(H_p\): the genetic profiles prof1 and prof2 originated from the same individuals

  • \(H_d\): prof1 and prof2 originated from two unrelated individuals in the population

The test statistic is computed as follows.

\[LR = \frac{P(H_p)}{P(H_d)}\]

The probability \(P(H_p) = \epsilon^R\), where \(\epsilon\) is the per-marker rate of genotyping error (erate) and \(R\) is the number of markers with discordant genotypes between profiles. The probability \(P(H_d)\) is the RMP of prof1. Note that when there is a perfect match between prof1 and prof2, \(P(H_p) = 1\) and the LR statistic is simply the reciprocal of the RMP.

Note that the LR test as formulated assumes that prof1 and prof2 are close matches. The error rate accommodates a small amount of allelic drop-out or drop-in but is not designed to accommodate profiles with substantial differences.

Parameters
  • frequencies (pandas.DataFrame) -- table of population haplotype frequencies

  • prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype

  • prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype; optional

  • erate (float) -- rate of genotyping error

Returns

the RMP of prof1 if prof2 is undefined, or the LR test statistic if prof2 is defined

Return type

float

microhapulator.api.diff(prof1, prof2)

Compare two profiles and determine the markers at which their genotypes differ

Note: this is a generator function.

Parameters
  • prof1 (microhapulator.profile.Profile) -- typing result or simulated profile

  • prof2 (microhapulator.profile.Profile) -- typing result or simulated profile

Yields

for each discordant marker, a tuple (M, X, Y), where M is the marker name, X is the set of haplotypes unique to prof1, and Y is the set of haplotypes unique to prof2

microhapulator.api.dist(prof1, prof2)

Compute a simple Hamming distance between two profiles

Parameters
  • prof1 (microhapulator.profile.Profile) -- typing result or simulated profile

  • prof2 (microhapulator.profile.Profile) -- typing result or simulated profile

Returns

the number of markers with a discordant genotype between the two profiles

Return type

int

microhapulator.api.contain(prof1, prof2)

Perform a simple containment test

Calculate a simple containment statistic C, representing the number of markers whose haplotypes in prof1 are a subset of the haplotypes in prof2. Dividing by the total number of markers provides a rudimentary measure of containment, the proportion of prof1 "contained" in prof2. Note that this statistic does not accommodate allelic drop-in or drop-out.

Parameters
  • prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype

  • prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype

Returns

a tuple (C, T), where C is the containment statistic and T is the total number of markers

Simulation

microhapulator.api.sim(frequencies, seed=None)

Simulate a diploid genotype from the specified microhaplotype frequencies

Parameters
  • frequencies (pandas.DataFrame) -- population haplotype frequencies

  • seed (int) -- seed for random number generator

Returns

a simulated genotype profile for all markers specified in the haplotype frequencies

Return type

microhapulator.profile.SimulatedProfile

microhapulator.profile.SimulatedProfile.merge(profiles)

Combine simulated profiles into a mock DNA mixture

Parameters

profiles (list) -- list of simulated profiles

Returns

a combined profile

Return type

microhapulator.profile.SimulatedProfile

microhapulator.profile.Profile.unite(mom, dad)

Simulate the creation of a new profile from a mother and father

Parameters
  • mom (microhapulator.profile.Profile) -- typing result or simulated profile

  • dad (microhapulator.profile.Profile) -- typing result or simulated profile

Returns

a simulated offspring

Return type

microhapulator.profile.SimulatedProfile

microhapulator.api.seq(profiles, markers, refrseqs, seeds=None, threads=1, totalreads=500000, proportions=None, sig=None)

Simulate paired-end Illumina MiSeq sequencing of the given profile(s)

This generator function accepts any combination of simple (single-source) or complex (multi-source mixture) profiles as input. Each profile is "sequenced" separately, and then all reads are aggregated.

Parameters
  • profiles (list) -- list of mock profiles

  • markers (pandas.DataFrame) -- marker definitions, provided as a table of SNP offsets, one row per SNP; required columns: Marker and Offset, representing the distance of the SNP from the first nucleotide in the reference sequence

  • refrseqs (dict) -- a dictionary with marker names as the keys and marker reference sequences as the values

  • seeds (list) -- optional list of random seeds, one per profile

  • threads (int) -- number of threads to use for each sequencing task; note that optimal performance is typically achieved with a single thread

  • totalreads (int) -- total number of reads to generate across all profiles

  • proportions (list) -- optional list of relative proportions, equal to the number of profiles; by default, each profile contributes an equal number of reads

  • sig (str) -- an optional signature (prefix) to apply to all simulated reads

Yields

for each read pair, a tuple (I, X, Y) where I is a serial index, X is the forward read (R1), and Y is the reverse read (R2)