Python API Reference¶
NOTE: The MicroHapulator Python API is under Semantic Versioning. In brief, this means that every stable version of the MicroHapulator software is assigned a version number, and that any changes to the software's behavior or interface require the software version number to be updated in prescribed and predictable ways.
Haplotype calling¶
- microhapulator.api.type(bamfile, markertsv, minbasequal=10, max_depth=1000000.0)¶
Perform haplotype calling
- Parameters:
bamfile (str) -- path of a BAM file containing NGS reads aligned to marker reference sequences and sorted
markertsv (str) -- path of a TSV file containing marker metadata, specifically the offset of each SNP for every marker in the panel
minbasequal (int) -- minimum base quality (PHRED score) to be considered reliable for haplotype calling; default is 10, corresponding to Q10, i.e., 90% probability that the base call is correct
max_depth (float) -- maximum permitted read depth
- Returns:
an unfiltered catalog of haplotype counts for each marker (a typing result)
- Return type:
microhapulator.profile.TypingResult
- microhapulator.profile.TypingResult.filter(self, static=None, dynamic=None, config=None)¶
Apply static and/or dynamic thresholds to distinguish true and false haplotypes
Thresholds are applied to the haplotype read counts of a raw typing result. Static integer thresholds are commonly used as detection thresholds, below which any haplotype count is considered noise. Dynamic thresholds are commonly used as analytical thresholds and represent a percentage of the total read count at the marker, after any haplotypes failing a static threshold are discarded.
- Parameters:
static (int) -- global fixed read count threshold
dynamic (float) -- global percentage of total read count; e.g. use dynamic=0.02 to apply a 2% analytical threshold
config (pandas.DataFrame) -- tabular data structure specifying marker-specific thresholds to override global thresholds; three required columns: Marker for the marker name; Static and Dynamic for marker-specific thresholds
Analysis, QA/QC, and interpretation¶
- microhapulator.api.read_length_dist(fastq, outfile, xlabel='Read Length (bp)', xlim=None, scale=1000, title=None)¶
Plot distribution of read lengths
- Parameters:
fastq (str) -- path of a FASTQ file containing NGS reads
outfile (str) -- path of a graphic file to create
xlabel (str) -- label for the X axis
xlim (tuple) -- a 2-tuple of numbers (x1, x2) representing the start and end points of the portion of the X axis to be displayed; by default this is determined automatically
scale (float) -- scaling factor for the Y axis
title (str) -- title for the plot
- microhapulator.api.interlocus_balance(result, include_discarded=True, terminal=True, tofile=None, title=None, figsize=(6, 4), dpi=200)¶
Compute interlocus balance
Plot interlocus balance in the terminal and/or a high-resolution graphic. Also normalize read counts and perform a chi-square goodness-of-fit test assuming uniform read coverage across markers. The reported chi-square statistic measures the extent of imbalance, and can be compared among samples sequenced using the same panel: the minimum value of 0 represents perfectly uniform coverage, while the maximum value of D occurs when all reads map to a single marker (D represents the degrees of freedom, or the number of markers minus 1).
- Parameters:
result (microhapulator.profile.TypingResult) -- a typing result including haplotype counts
included_discarded (bool) -- flag indicating whether to include in each marker's total read count reads that are successfully aligned but discarded because they do not span all SNPs at the marker
terminal (bool) -- flag indicating whether to print the interlocus balance histogram to standard output; enabled by default
tofile (str) -- name of image file to which the interlocus balance histogram will be written using Matplotlib; image format is inferred from file extension; by default, no image file is generated
title (str) -- add a title (such as a sample name) to the histogram plot
figsize (tuple) -- a 2-tuple of integers indicating the dimensions of the image file to be generated
dpi (int) -- resolution (in dots per inch) of the image file to be generated
- Returns:
a tuple (S, C) where S is the chi-square statistic, and C is a table of total read counts for each marker
- Return type:
tuple(float, pandas.DataFrame)
- microhapulator.api.heterozygote_balance(result, tofile=None, title=None, figsize=None, dpi=200, dolabels=False, absolute=False)¶
Compute heterozygote balance
Compute absolute and relative abundance of major and minor alleles at heterozygote loci and plot abundances in a high-resolution graphic. Also perform a one-sided paired t-test and report the t-statistic as a measure of heterozygote imbalance.
Graphics implementation adapted from https://www.pythoncharts.com/matplotlib/grouped-bar-charts-matplotlib/.
- Parameters:
result (microhapulator.profile.TypingResult) -- a filtered typing result including haplotype counts and genotype calls
tofile (str) -- name of image file to which the interlocus balance histogram will be written using Matplotlib; image format is inferred from file extension; by default, no image file is generated
title (str) -- add a title (such as a sample name) to the histogram plot
figsize (tuple) -- a 2-tuple of integers indicating the dimensions of the image file to be generated
dpi (int) -- resolution (in dots per inch) of the image file to be generated
dolabels (bool) -- flag indicating whether marker labels and read counts should be added to the figure
absolute (bool) -- plot absolute read counts rather than relative abundances
- Returns:
a tuple (T, C) where T is the t-statistic, and C is the table of read counts and percentages for each heterozygous marker
- Return type:
pandas.DataFrame
- microhapulator.api.contrib(profile)¶
Estimate the minimum number of DNA contributors to a suspected mixture
Let \(N_{\text{al}}\) represent the largest number of haplotypes observed at any marker. We then estimate the minimum number of DNA contributors as follows.
\(C_{\text{min}} = \left\lceil\frac{N_{\text{al}}}{2}\right\rceil\)
- Parameters:
profile (microhapulator.profile.Profile) -- a typing result or a simulated genotype
- Returns:
a tuple (E, N, P), where E is the estimate for the minimum number of DNA contributors, N is the number of markers supporting the estimate, and P is the percentage of markers supporting the estimate
- Return type:
tuple(int, int, float)
- microhapulator.api.prob(frequencies, prof1, prof2=None, erate=0.01)¶
Compute a profile random match probability (RMP) or an RMP-based likelihood ratio (LR) test
The LR test, when performed, assesses the relative weight of two competing propositions.
\(H_1\): the genetic profiles prof1 and prof2 originated from the same individuals
\(H_2\): prof1 and prof2 originated from two unrelated individuals in the population
The test statistic is computed as follows.
\[LR = \frac{P(H_1)}{P(H_2)}\]The probability \(P(H_1) = \epsilon^R\), where \(\epsilon\) is the per-marker rate of genotyping error (erate) and \(R\) is the number of markers with discordant genotypes between profiles. The probability \(P(H_2)\) is the RMP of prof1. Note that when there is a perfect match between prof1 and prof2, \(P(H_1) = 1\) and the LR statistic is simply the reciprocal of the RMP.
Note that the LR test as formulated assumes that prof1 and prof2 are close matches. The error rate accommodates a small amount of allelic drop-out or drop-in but is not designed to accommodate profiles with substantial differences.
- Parameters:
frequencies (pandas.DataFrame) -- table of population haplotype frequencies
prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype
prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype; optional
erate (float) -- rate of genotyping error
- Returns:
the RMP of prof1 if prof2 is undefined, or the LR test statistic if prof2 is defined
- Return type:
float
- microhapulator.api.diff(prof1, prof2)¶
Compare two profiles and determine the markers at which their genotypes differ
Note: this is a generator function.
- Parameters:
prof1 (microhapulator.profile.Profile) -- typing result or simulated profile
prof2 (microhapulator.profile.Profile) -- typing result or simulated profile
- Yields:
for each discordant marker, a tuple (M, X, Y), where M is the marker name, X is the set of haplotypes unique to prof1, and Y is the set of haplotypes unique to prof2
- microhapulator.api.dist(prof1, prof2)¶
Compute a simple Hamming distance between two profiles
- Parameters:
prof1 (microhapulator.profile.Profile) -- typing result or simulated profile
prof2 (microhapulator.profile.Profile) -- typing result or simulated profile
- Returns:
the number of markers with a discordant genotype between the two profiles
- Return type:
int
- microhapulator.api.contain(prof1, prof2)¶
Perform a simple containment test
Calculate a simple containment statistic C, representing the number of markers whose haplotypes in prof1 are a subset of the haplotypes in prof2. Dividing by the total number of markers provides a rudimentary measure of containment, the proportion of prof1 "contained" in prof2. Note that this statistic does not accommodate allelic drop-in or drop-out.
- Parameters:
prof1 (microhapulator.profile.Profile) -- a typing result or simulated genotype
prof2 (microhapulator.profile.Profile) -- a typing result or simulated genotype
- Returns:
a tuple (C, T), where C is the containment statistic and T is the total number of markers
- Return type:
tuple(float, int)
Simulation¶
- microhapulator.api.sim(frequencies, seed=None)¶
Simulate a diploid genotype from the specified microhaplotype frequencies
- Parameters:
frequencies (pandas.DataFrame) -- population haplotype frequencies
seed (int) -- seed for random number generator
- Returns:
a simulated genotype profile for all markers specified in the haplotype frequencies
- Return type:
microhapulator.profile.SimulatedProfile
- microhapulator.profile.SimulatedProfile.merge(profiles)¶
Combine simulated profiles into a mock DNA mixture
- Parameters:
profiles (list) -- list of simulated profiles
- Returns:
a combined profile
- Return type:
microhapulator.profile.SimulatedProfile
- microhapulator.profile.Profile.unite(mom, dad)¶
Simulate the creation of a new profile from a mother and father
- Parameters:
mom (microhapulator.profile.Profile) -- typing result or simulated profile
dad (microhapulator.profile.Profile) -- typing result or simulated profile
- Returns:
a simulated offspring
- Return type:
microhapulator.profile.SimulatedProfile
- microhapulator.api.seq(profiles, markers, refrseqs, seeds=None, threads=1, totalreads=500000, proportions=None, sig=None)¶
Simulate paired-end Illumina MiSeq sequencing of the given profile(s)
This generator function accepts any combination of simple (single-source) or complex (multi-source mixture) profiles as input. Each profile is "sequenced" separately, and then all reads are aggregated.
- Parameters:
profiles (list) -- list of mock profiles
markers (pandas.DataFrame) -- marker definitions, provided as a table of SNP offsets, one row per SNP; required columns: Marker and Offset, representing the distance of the SNP from the first nucleotide in the reference sequence
refrseqs (dict) -- a dictionary with marker names as the keys and marker reference sequences as the values
seeds (list) -- optional list of random seeds, one per profile
threads (int) -- number of threads to use for each sequencing task; note that optimal performance is typically achieved with a single thread
totalreads (int) -- total number of reads to generate across all profiles
proportions (list) -- optional list of relative proportions, equal to the number of profiles; by default, each profile contributes an equal number of reads
sig (str) -- an optional signature (prefix) to apply to all simulated reads
- Yields:
for each read pair, a tuple (I, X, Y) where I is a serial index, X is the forward read (R1), and Y is the reverse read (R2)