Python API Reference¶

NOTE: The MicroHapulator Python API is under Semantic Versioning. In brief, this means that every stable version of the MicroHapulator software is assigned a version number, and that any changes to the software's behavior or interface require the software version number to be updated in prescribed and predictable ways.

Haplotype calling¶

microhapulator.api.type(bamfile, markertsv, minbasequal=10, max_depth=1000000.0)¶

Perform haplotype calling

Parameters

bamfile (str) -- path of a BAM file containing NGS reads aligned to marker reference sequences and sorted
markertsv (str) -- path of a TSV file containing marker metadata, specifically the offset of each SNP for every marker in the panel
minbasequal (int) -- minimum base quality (PHRED score) to be considered reliable for haplotype calling; default is 10, corresponding to Q10, i.e., 90% probability that the base call is correct
max_depth (float) -- maximum permitted read depth

Returns

an unfiltered catalog of haplotype counts for each marker (a typing result)

Return type

microhapulator.profile.TypingResult

microhapulator.profile.TypingResult.filter(self, static=None, dynamic=None, config=None)¶

Apply static and/or dynamic thresholds to distinguish true and false haplotypes

Thresholds are applied to the haplotype read counts of a raw typing result. Static integer thresholds are commonly used as detection thresholds, below which any haplotype count is considered noise. Dynamic thresholds are commonly used as analytical thresholds and represent a percentage of the total read count at the marker, after any haplotypes failing a static threshold are discarded.

Parameters

static (int) -- global fixed read count threshold
dynamic (float) -- global percentage of total read count; e.g. use dynamic=0.02 to apply a 2% analytical threshold
config (pandas.DataFrame) -- tabular data structure specifying marker-specific thresholds to override global thresholds; three required columns: Marker for the marker name; Static and Dynamic for marker-specific thresholds

Analysis and interpretation¶

microhapulator.api.balance(result, include_discarded=True)¶

Compute interlocus balance

Parameters

result (microhapulator.profile.TypingResult) -- a typing result including haplotype counts
included_discarded (bool) -- flag indicating whether to include in each marker's total read count reads that are successfully aligned but discarded because they do not span all SNPs at the marker

Returns

total read counts for each marker in a two-column tabular data structure

Return type

pandas.DataFrame

microhapulator.api.contrib(profile)¶

Estimate the minimum number of DNA contributors to a suspected mixture

Let \(N_{\text{al}}\) represent the largest number of haplotypes observed at any marker. We then estimate the minimum number of DNA contributors as follows.

\(C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil\)

Parameters: profile (microhapulator.profile.Profile) -- a typing result or a simulated genotype
Returns: a tuple (E, N, P), where E is the estimate for the minimum number of DNA contributors, N is the number of markers supporting the estimate, and P is the percentage of markers supporting the estimate
Return type: tuple(int, int, float)

microhapulator.api.prob(frequencies, prof1, prof2=None, erate=0.001)¶

Compute a profile random match probability (RMP) or an RMP-based likelihood ratio (LR) test

The LR test, when performed, assesses the relative weight of two competing propositions.

\(H_p\): the genetic profiles prof1 and prof2 originated from the same individuals
\(H_d\): prof1 and prof2 originated from two unrelated individuals in the population

The test statistic is computed as follows.

\[LR = \frac{P(H_p)}{P(H_d)}\]

The probability \(P(H_p) = \epsilon^R\), where \(\epsilon\) is the per-marker rate of genotyping error (erate) and \(R\) is the number of markers with discordant genotypes between profiles. The probability \(P(H_d)\) is the RMP of prof1. Note that when there is a perfect match between prof1 and prof2, \(P(H_p) = 1\) and the LR statistic is simply the reciprocal of the RMP.

Note that the LR test as formulated assumes that prof1 and prof2 are close matches. The error rate accommodates a small amount of allelic drop-out or drop-in but is not designed to accommodate profiles with substantial differences.

Parameters

frequencies (pandas.DataFrame) -- table of population haplotype frequencies
prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype
prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype; optional
erate (float) -- rate of genotyping error

Returns

the RMP of prof1 if prof2 is undefined, or the LR test statistic if prof2 is defined

Return type

float

microhapulator.api.diff(prof1, prof2)¶

Compare two profiles and determine the markers at which their genotypes differ

Note: this is a generator function.

Parameters

prof1 (microhapulator.profile.Profile) -- typing result or simulated profile
prof2 (microhapulator.profile.Profile) -- typing result or simulated profile

Yields

for each discordant marker, a tuple (M, X, Y), where M is the marker name, X is the set of haplotypes unique to prof1, and Y is the set of haplotypes unique to prof2

microhapulator.api.dist(prof1, prof2)¶

Compute a simple Hamming distance between two profiles

Parameters

prof1 (microhapulator.profile.Profile) -- typing result or simulated profile
prof2 (microhapulator.profile.Profile) -- typing result or simulated profile

Returns

the number of markers with a discordant genotype between the two profiles

Return type

int

microhapulator.api.contain(prof1, prof2)¶

Perform a simple containment test

Calculate a simple containment statistic C, representing the number of markers whose haplotypes in prof1 are a subset of the haplotypes in prof2. Dividing by the total number of markers provides a rudimentary measure of containment, the proportion of prof1 "contained" in prof2. Note that this statistic does not accommodate allelic drop-in or drop-out.

Parameters

prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype
prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype

Returns

a tuple (C, T), where C is the containment statistic and T is the total number of markers

Simulation¶

microhapulator.api.sim(frequencies, seed=None)¶

Simulate a diploid genotype from the specified microhaplotype frequencies

Parameters

frequencies (pandas.DataFrame) -- population haplotype frequencies
seed (int) -- seed for random number generator

Returns

a simulated genotype profile for all markers specified in the haplotype frequencies

Return type

microhapulator.profile.SimulatedProfile

microhapulator.profile.SimulatedProfile.merge(profiles)¶

Combine simulated profiles into a mock DNA mixture

Parameters: profiles (list) -- list of simulated profiles
Returns: a combined profile
Return type: microhapulator.profile.SimulatedProfile

microhapulator.profile.Profile.unite(mom, dad)¶

Simulate the creation of a new profile from a mother and father

Parameters

mom (microhapulator.profile.Profile) -- typing result or simulated profile
dad (microhapulator.profile.Profile) -- typing result or simulated profile

Returns

a simulated offspring

Return type

microhapulator.profile.SimulatedProfile

microhapulator.api.seq(profiles, markers, refrseqs, seeds=None, threads=1, totalreads=500000, proportions=None, sig=None)¶

Simulate paired-end Illumina MiSeq sequencing of the given profile(s)

This generator function accepts any combination of simple (single-source) or complex (multi-source mixture) profiles as input. Each profile is "sequenced" separately, and then all reads are aggregated.

Parameters

profiles (list) -- list of mock profiles
markers (pandas.DataFrame) -- marker definitions, provided as a table of SNP offsets, one row per SNP; required columns: Marker and Offset, representing the distance of the SNP from the first nucleotide in the reference sequence
refrseqs (dict) -- a dictionary with marker names as the keys and marker reference sequences as the values
seeds (list) -- optional list of random seeds, one per profile
threads (int) -- number of threads to use for each sequencing task; note that optimal performance is typically achieved with a single thread
totalreads (int) -- total number of reads to generate across all profiles
proportions (list) -- optional list of relative proportions, equal to the number of profiles; by default, each profile contributes an equal number of reads
sig (str) -- an optional signature (prefix) to apply to all simulated reads

Yields

for each read pair, a tuple (I, X, Y) where I is a serial index, X is the forward read (R1), and Y is the reverse read (R2)

Python API Reference¶

Haplotype calling¶

Analysis and interpretation¶

Simulation¶

Table of Contents

Previous topic

Next topic

This Page