Python API Reference¶
NOTE: The MicroHapulator Python API is under Semantic Versioning. In brief, this means that every stable version of the MicroHapulator software is assigned a version number, and that any changes to the software's behavior or interface require the software version number to be updated in prescribed and predictable ways.
Haplotype calling¶
- microhapulator.api.type(bamfile, markertsv, minbasequal=10, max_depth=1000000.0)¶
Perform haplotype calling
- Parameters
bamfile (str) -- path of a BAM file containing NGS reads aligned to marker reference sequences and sorted
markertsv (str) -- path of a TSV file containing marker metadata, specifically the offset of each SNP for every marker in the panel
minbasequal (int) -- minimum base quality (PHRED score) to be considered reliable for haplotype calling; default is 10, corresponding to Q10, i.e., 90% probability that the base call is correct
max_depth (float) -- maximum permitted read depth
- Returns
an unfiltered catalog of haplotype counts for each marker (a typing result)
- Return type
microhapulator.profile.TypingResult
- microhapulator.profile.TypingResult.filter(self, static=None, dynamic=None, config=None)¶
Apply static and/or dynamic thresholds to distinguish true and false haplotypes
Thresholds are applied to the haplotype read counts of a raw typing result. Static integer thresholds are commonly used as detection thresholds, below which any haplotype count is considered noise. Dynamic thresholds are commonly used as analytical thresholds and represent a percentage of the total read count at the marker, after any haplotypes failing a static threshold are discarded.
- Parameters
static (int) -- global fixed read count threshold
dynamic (float) -- global percentage of total read count; e.g. use dynamic=0.02 to apply a 2% analytical threshold
config (pandas.DataFrame) -- tabular data structure specifying marker-specific thresholds to override global thresholds; three required columns: Marker for the marker name; Static and Dynamic for marker-specific thresholds
Analysis and interpretation¶
- microhapulator.api.balance(result, include_discarded=True)¶
Compute interlocus balance
- Parameters
result (microhapulator.profile.TypingResult) -- a typing result including haplotype counts
included_discarded (bool) -- flag indicating whether to include in each marker's total read count reads that are successfully aligned but discarded because they do not span all SNPs at the marker
- Returns
total read counts for each marker in a two-column tabular data structure
- Return type
pandas.DataFrame
- microhapulator.api.contrib(profile)¶
Estimate the minimum number of DNA contributors to a suspected mixture
Let \(N_{\text{al}}\) represent the largest number of haplotypes observed at any marker. We then estimate the minimum number of DNA contributors as follows.
\(C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil\)
- Parameters
profile (microhapulator.profile.Profile) -- a typing result or a simulated genotype
- Returns
a tuple (E, N, P), where E is the estimate for the minimum number of DNA contributors, N is the number of markers supporting the estimate, and P is the percentage of markers supporting the estimate
- Return type
tuple(int, int, float)
- microhapulator.api.prob(frequencies, prof1, prof2=None, erate=0.001)¶
Compute a profile random match probability (RMP) or an RMP-based likelihood ratio (LR) test
The LR test, when performed, assesses the relative weight of two competing propositions.
\(H_p\): the genetic profiles prof1 and prof2 originated from the same individuals
\(H_d\): prof1 and prof2 originated from two unrelated individuals in the population
The test statistic is computed as follows.
\[LR = \frac{P(H_p)}{P(H_d)}\]The probability \(P(H_p) = \epsilon^R\), where \(\epsilon\) is the per-marker rate of genotyping error (erate) and \(R\) is the number of markers with discordant genotypes between profiles. The probability \(P(H_d)\) is the RMP of prof1. Note that when there is a perfect match between prof1 and prof2, \(P(H_p) = 1\) and the LR statistic is simply the reciprocal of the RMP.
Note that the LR test as formulated assumes that prof1 and prof2 are close matches. The error rate accommodates a small amount of allelic drop-out or drop-in but is not designed to accommodate profiles with substantial differences.
- Parameters
frequencies (pandas.DataFrame) -- table of population haplotype frequencies
prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype
prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype; optional
erate (float) -- rate of genotyping error
- Returns
the RMP of prof1 if prof2 is undefined, or the LR test statistic if prof2 is defined
- Return type
float
- microhapulator.api.diff(prof1, prof2)¶
Compare two profiles and determine the markers at which their genotypes differ
Note: this is a generator function.
- Parameters
prof1 (microhapulator.profile.Profile) -- typing result or simulated profile
prof2 (microhapulator.profile.Profile) -- typing result or simulated profile
- Yields
for each discordant marker, a tuple (M, X, Y), where M is the marker name, X is the set of haplotypes unique to prof1, and Y is the set of haplotypes unique to prof2
- microhapulator.api.dist(prof1, prof2)¶
Compute a simple Hamming distance between two profiles
- Parameters
prof1 (microhapulator.profile.Profile) -- typing result or simulated profile
prof2 (microhapulator.profile.Profile) -- typing result or simulated profile
- Returns
the number of markers with a discordant genotype between the two profiles
- Return type
int
- microhapulator.api.contain(prof1, prof2)¶
Perform a simple containment test
Calculate a simple containment statistic C, representing the number of markers whose haplotypes in prof1 are a subset of the haplotypes in prof2. Dividing by the total number of markers provides a rudimentary measure of containment, the proportion of prof1 "contained" in prof2. Note that this statistic does not accommodate allelic drop-in or drop-out.
- Parameters
prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype
prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype
- Returns
a tuple (C, T), where C is the containment statistic and T is the total number of markers
Simulation¶
- microhapulator.api.sim(frequencies, seed=None)¶
Simulate a diploid genotype from the specified microhaplotype frequencies
- Parameters
frequencies (pandas.DataFrame) -- population haplotype frequencies
seed (int) -- seed for random number generator
- Returns
a simulated genotype profile for all markers specified in the haplotype frequencies
- Return type
microhapulator.profile.SimulatedProfile
- microhapulator.profile.SimulatedProfile.merge(profiles)¶
Combine simulated profiles into a mock DNA mixture
- Parameters
profiles (list) -- list of simulated profiles
- Returns
a combined profile
- Return type
microhapulator.profile.SimulatedProfile
- microhapulator.profile.Profile.unite(mom, dad)¶
Simulate the creation of a new profile from a mother and father
- Parameters
mom (microhapulator.profile.Profile) -- typing result or simulated profile
dad (microhapulator.profile.Profile) -- typing result or simulated profile
- Returns
a simulated offspring
- Return type
microhapulator.profile.SimulatedProfile
- microhapulator.api.seq(profiles, markers, refrseqs, seeds=None, threads=1, totalreads=500000, proportions=None, sig=None)¶
Simulate paired-end Illumina MiSeq sequencing of the given profile(s)
This generator function accepts any combination of simple (single-source) or complex (multi-source mixture) profiles as input. Each profile is "sequenced" separately, and then all reads are aggregated.
- Parameters
profiles (list) -- list of mock profiles
markers (pandas.DataFrame) -- marker definitions, provided as a table of SNP offsets, one row per SNP; required columns: Marker and Offset, representing the distance of the SNP from the first nucleotide in the reference sequence
refrseqs (dict) -- a dictionary with marker names as the keys and marker reference sequences as the values
seeds (list) -- optional list of random seeds, one per profile
threads (int) -- number of threads to use for each sequencing task; note that optimal performance is typically achieved with a single thread
totalreads (int) -- total number of reads to generate across all profiles
proportions (list) -- optional list of relative proportions, equal to the number of profiles; by default, each profile contributes an equal number of reads
sig (str) -- an optional signature (prefix) to apply to all simulated reads
- Yields
for each read pair, a tuple (I, X, Y) where I is a serial index, X is the forward read (R1), and Y is the reverse read (R2)