# Configuration MicroHapulator's core operations depend on three sets of metadata describing the panel used for microhaplotype sequencing. More information for each is provided below. Not every piece of metadata is used for each operation, but it is recommended that all three be prepared at the same time, prior to any analysis or interpretation. - microhaplotype marker reference sequences - microhaplotype marker definitions - population haplotype frequencies This document will demonstrate how to prepare these files using the [MicroHapDB](https://github.com/bioforensics/MicroHapDB) database, which contains a comprehensive collection of published microhaplotypes. A trivially small panel composed of three arbitrary markers (identifiers: mh01KK-205, mh03USC-3qC, mh18CP-005) will be used as an example to show how the data is formatted. When preparing a full panel composed of dozens of markers, it is recommended that marker identifiers be placed in a plain text file, one identifier per line, and that the configuration data be written to directly to files rather than printed to the terminal screen. This will also be demonstrated. ## Reference sequences A reference sequence must be provided for each marker, and aggregated into a single FASTA file. At a minimum, each reference sequence must contain the SNPs associated with the microhaplotype, but additional flanking sequence can also be included. The following command demonstrates how to use MicroHapDB to retrieve these reference sequences. ``` $ microhapdb marker --format=fasta --delta=25 --min-length=200 mh01KK-205 mh03USC-3qC mh18CP-005 >mh01KK-205 PermID=MHDBM-1f7eaca2 GRCh38:chr1:18396197-18396351 variants=25,46,134,179 Xref=SI664550A CACCAGTTCTCATGAATCTGAGGAATTCTTCCTCCTAGCTACTTCCTTCCTTTTCCCTCATTACATCCCTGCCAAGGACA AATTCTGCCATTTGCATGGCAGGACTCCTCCAAAAAGGGGCTTCCTCCCTTTCCGTTAGTAAAGGAAGAGGTTACCTGAG ACTTGACTTAACCTCCTTGGGAGGGAACATGCTTTCACTGTTGCG >mh03USC-3qC PermID=MHDBM-eacabfd9 GRCh38:chr3:196653025-196653121 variants=52,61,71,111,148 TAGCATTGAAATGATGCCTTGTAATTTACTAAATCTGCAACTATGCAGCCTTATTTCATGGCGGGCAGTGGTGGTGATCC CAGGTTTCAGGGGCGGGGAAGGGTGCTGGGGGGATCCTGAGGTCAGGAACCCGTACACCTCTGCTTCTGCCCTCTCTTCC CTGTGCCGGCCACAAGGCAATGACTCCTGTGTGGGTGCAGA >mh18CP-005 PermID=MHDBM-a85754d3 GRCh38:chr18:8892864-8892907 variants=78,107,110,121 Xref=SI664898P GAGATTCTGTCTCAAAAAATAAAAAATTAAAAAAAATTTTTTTAAACCCAAAATATTACTGCAGATGTCCTTATACGCAG TGGTGTTAGTTTTAGAAACTGATTCTACGGGTATGCTTGCTCGTGTGTAAAATTATTCATATACAAATTATTTATGACAG TATTGTTTCTAGTAGTAAAATATCGGAAATATTCTAAATG ``` ## Marker definitions MicroHapulator needs to know the location of every SNP of interest in the corresponding reference sequence. The list of SNP positions for a microhaplotype is its *marker definition*, and this information is provided in a tab-separated tabular plain text (TSV) file. The **Marker** column contains the identifier (name, label, or designator) of a microhaplotype in the panel, and the **Offset** column contains the distance of one SNP from the beginning of the reference sequence. For example, if a SNP of interest is the very first nucleotide in the reference, it has a distance of 0 from the beginning of the sequence and thus its offset is `0`. If a SNP is the 10th nucleotide, its offset is `9`. The **Chrom** and **OffsetHg38** columns indicate the position of each SNP in the GRCh38 reference human genome assembly. **While these two columns are not strictly required, certain quality control checks in the end-to-end MH analysis pipeline will be disabled if this data is absent.** The following command shows how to use MicroHapDB (version 0.8 or greater) to prepare a marker definition file. Note that it is identical to the previous command, except that the `--format=fasta` setting was changed to `--format=offsets`. ``` $ microhapdb marker --format=offsets --delta=25 --min-length=200 mh01KK-205 mh03USC-3qC mh18CP-005 Marker Offset Chrom OffsetHg38 mh01KK-205 25 chr1 18396197 mh01KK-205 46 chr1 18396218 mh01KK-205 134 chr1 18396306 mh01KK-205 179 chr1 18396351 mh03USC-3qC 52 chr3 196653025 mh03USC-3qC 61 chr3 196653034 mh03USC-3qC 71 chr3 196653044 mh03USC-3qC 111 chr3 196653084 mh03USC-3qC 148 chr3 196653121 mh18CP-005 78 chr18 8892864 mh18CP-005 107 chr18 8892893 mh18CP-005 110 chr18 8892896 mh18CP-005 121 chr18 8892907 ``` ## Population frequencies Performing forensic interpretation or simulating mock profiles depends on reliable estimates of population microhaplotype frequencies. These must also be provided to MicroHapulator as a tab-separated tabular plain text (TSV) file. The **Marker** column contains the name/label/designator of a microhaplotype in the panel, the **Haplotype** column contains a comma-separated list of SNP alleles, and the **Frequency** column contains the relative prevalance of that haplotype in the population of interest. MicroHapDB contains population frequency estimates from 26 global populations in the [1000 Genomes Project](https://www.internationalgenome.org/) for most of its markers. MicroHapDB (version 0.7 or greater) can format this frequency data for use with MicroHapulator. Using the correct population identifier (running `microhapdb population` beforehand if needed), haplotype frequencies can be retrieved and formatted as follows. (The "PUR" population, "Puerto Ricans from Puerto Rico", is used for this example.) ``` $ microhapdb frequency --format mhpl8r --population PUR --marker mh01KK-205 mh03USC-3qC mh18CP-005 Marker Haplotype Frequency mh01KK-205 C,C,A,G 0.149 mh01KK-205 T,C,A,G 0.361 mh01KK-205 T,T,A,A 0.183 mh01KK-205 T,T,A,G 0.13 mh01KK-205 T,T,G,G 0.178 mh03USC-3qC A,C,C,A,G 0.043 mh03USC-3qC A,C,C,G,G 0.12 mh03USC-3qC A,C,C,G,T 0.034 mh03USC-3qC A,C,T,G,G 0.269 mh03USC-3qC A,C,T,G,T 0.212 mh03USC-3qC G,C,C,A,G 0.005 mh03USC-3qC G,C,C,G,G 0.014 mh03USC-3qC G,C,T,G,T 0.005 mh03USC-3qC G,T,C,A,G 0.226 mh03USC-3qC G,T,C,G,G 0.067 mh03USC-3qC G,T,C,G,T 0.005 mh18CP-005 A,C,A,T 0.38 mh18CP-005 A,C,G,C 0.255 mh18CP-005 A,T,A,C 0.255 mh18CP-005 A,T,G,C 0.043 mh18CP-005 G,C,A,C 0.01 mh18CP-005 G,T,A,C 0.058 ``` ## Summary As mentioned at the beginning of this document, you're much better off writing the config data directly to files rather than printing it to your screen. If you have the marker identifiers for your panel (one identifier per line) in a plain text file named, say, `mypanel.txt`, you would create your config files like so. (Replace "PUR" with the appropriate population.) ``` $ microhapdb marker --format=fasta --delta=25 --min-length=200 --panel=mypanel.txt > mypanel-refr.fasta $ microhapdb marker --format=offsets --delta=25 --min-length=200 --panel=mypanel.txt > mypanel-defn.tsv $ microhapdb frequency --format=mhpl8r --population=PUR --marker --panel=mypanel.txt > mypanel-freq-pr.tsv ``` These commands will create three files: `mypanel-refr.fasta` with the reference sequences, `mypanel-defn.tsv` with the marker definitions, and `mypanel-freq-pr.tsv` with the haplotype frequencies. ## Example configuration files The [`microhapulator/data/configs/`](https://github.com/bioforensics/MicroHapulator/tree/master/microhapulator/data/configs/) directory in the MicroHapulator source code distribution contains example configuration files for a published panel. ## What if my data isn't in MicroHapDB? If you have marker and/or frequency data that you would like to submit to MicroHapDB, that is always welcome! See [this page](https://github.com/bioforensics/MicroHapDB#adding-markers-to-microhapdb) for details. In any case, the examples above show how the reference sequences, marker definitions, and haplotype frequencies should be formatted. So if your data is not included in MicroHapDB, you should still be able to configure MicroHapulator correctly, it will just take some extra time to prepare the files manually.