# Configuration

MicroHapulator's core operations depend on three sets of metadata describing the panel used for microhaplotype sequencing.
More information for each is provided below.
Not every piece of metadata is used for each operation, but it is recommended that all three be prepared at the same time, prior to any analysis or interpretation.

- microhaplotype marker reference sequences
- microhaplotype marker definitions
- population haplotype frequencies

This document will demonstrate how to prepare these files using the [MicroHapDB](https://github.com/bioforensics/MicroHapDB) database, which contains a comprehensive collection of published microhaplotypes.
A trivially small panel composed of three arbitrary markers (identifiers: mh01KK-205, mh03USC-3qC, mh18CP-005) will be used as an example to show how the data is formatted.
When preparing a full panel composed of dozens of markers, it is recommended that marker identifiers be placed in a plain text file, one identifier per line, and that the configuration data be written directly to files rather than printed to the terminal screen.
This will also be demonstrated.

## Reference sequences

A reference sequence must be provided for each marker, and all reference sequences must be aggregated into a single FASTA file.
At a minimum, each reference sequence must contain the SNPs associated with the microhaplotype, but additional flanking sequence can also be included.
The following command demonstrates how to use MicroHapDB to retrieve these reference sequences.

```
$ microhapdb marker --format=fasta --delta=25 --min-length=200 mh01KK-205 mh03USC-3qC mh18CP-005
>mh01KK-205 PermID=MHDBM-1f7eaca2 GRCh38:chr1:18396197-18396351 variants=25,46,134,179 Xref=SI664550A
CACCAGTTCTCATGAATCTGAGGAATTCTTCCTCCTAGCTACTTCCTTCCTTTTCCCTCATTACATCCCTGCCAAGGACA
AATTCTGCCATTTGCATGGCAGGACTCCTCCAAAAAGGGGCTTCCTCCCTTTCCGTTAGTAAAGGAAGAGGTTACCTGAG
ACTTGACTTAACCTCCTTGGGAGGGAACATGCTTTCACTGTTGCG
>mh03USC-3qC PermID=MHDBM-eacabfd9 GRCh38:chr3:196653025-196653121 variants=52,61,71,111,148
TAGCATTGAAATGATGCCTTGTAATTTACTAAATCTGCAACTATGCAGCCTTATTTCATGGCGGGCAGTGGTGGTGATCC
CAGGTTTCAGGGGCGGGGAAGGGTGCTGGGGGGATCCTGAGGTCAGGAACCCGTACACCTCTGCTTCTGCCCTCTCTTCC
CTGTGCCGGCCACAAGGCAATGACTCCTGTGTGGGTGCAGA
>mh18CP-005 PermID=MHDBM-a85754d3 GRCh38:chr18:8892864-8892907 variants=78,107,110,121 Xref=SI664898P
GAGATTCTGTCTCAAAAAATAAAAAATTAAAAAAAATTTTTTTAAACCCAAAATATTACTGCAGATGTCCTTATACGCAG
TGGTGTTAGTTTTAGAAACTGATTCTACGGGTATGCTTGCTCGTGTGTAAAATTATTCATATACAAATTATTTATGACAG
TATTGTTTCTAGTAGTAAAATATCGGAAATATTCTAAATG
```

## Marker definitions

MicroHapulator needs to know the location of every SNP of interest in the corresponding reference sequence.
The list of SNP positions for a microhaplotype is its *marker definition*, and this information is provided in a tab-separated tabular plain text (TSV) file.
The **Marker** column contains the identifier (name, label, or designator) of a microhaplotype in the panel, and the **Offset** column contains the distance of one SNP from the beginning of the reference sequence.
For example, if a SNP of interest is the very first nucleotide in the reference, it has a distance of 0 from the beginning of the sequence and thus its offset is `0`.
If a SNP is the 10th nucleotide, its offset is `9`.

The following command shows how to use MicroHapDB (version 0.7 or greater) to prepare a marker definition file.
Note that it is identical to the previous command, except that the `--format=fasta` setting was changed to `--format=offsets`.

```
$ microhapdb marker --format=offsets --delta=25 --min-length=200 mh01KK-205 mh03USC-3qC mh18CP-005
Marker	Offset
mh01KK-205	25
mh01KK-205	46
mh01KK-205	134
mh01KK-205	179
mh03USC-3qC	52
mh03USC-3qC	61
mh03USC-3qC	71
mh03USC-3qC	111
mh03USC-3qC	148
mh18CP-005	78
mh18CP-005	107
mh18CP-005	110
mh18CP-005	121
```

## Population frequencies

Performing forensic interpretation or simulating mock profiles depends on reliable estimates of population microhaplotype frequencies.
These must also be provided to MicroHapulator as a tab-separated tabular plain text (TSV) file.
The **Marker** column contains the name/label/designator of a microhaplotype in the panel, the **Haplotype** column contains a comma-separated list of SNP alleles, and the **Frequency** column contains the relative prevalence of that haplotype in the population of interest.

MicroHapDB contains population frequency estimates from 26 global populations in the [1000 Genomes Project](https://www.internationalgenome.org/) for most of its markers.
MicroHapDB (version 0.7 or greater) can format this frequency data for use with MicroHapulator.
Using the correct population identifier (running `microhapdb population` beforehand if needed), haplotype frequencies can be retrieved and formatted as follows.
(The "PUR" population, "Puerto Ricans from Puerto Rico", is used for this example.)

```
$ microhapdb frequency --format mhpl8r --population PUR --marker mh01KK-205 mh03USC-3qC mh18CP-005
Marker	Haplotype	Frequency
mh01KK-205	C,C,A,G	0.149
mh01KK-205	T,C,A,G	0.361
mh01KK-205	T,T,A,A	0.183
mh01KK-205	T,T,A,G	0.13
mh01KK-205	T,T,G,G	0.178
mh03USC-3qC	A,C,C,A,G	0.043
mh03USC-3qC	A,C,C,G,G	0.12
mh03USC-3qC	A,C,C,G,T	0.034
mh03USC-3qC	A,C,T,G,G	0.269
mh03USC-3qC	A,C,T,G,T	0.212
mh03USC-3qC	G,C,C,A,G	0.005
mh03USC-3qC	G,C,C,G,G	0.014
mh03USC-3qC	G,C,T,G,T	0.005
mh03USC-3qC	G,T,C,A,G	0.226
mh03USC-3qC	G,T,C,G,G	0.067
mh03USC-3qC	G,T,C,G,T	0.005
mh18CP-005	A,C,A,T	0.38
mh18CP-005	A,C,G,C	0.255
mh18CP-005	A,T,A,C	0.255
mh18CP-005	A,T,G,C	0.043
mh18CP-005	G,C,A,C	0.01
mh18CP-005	G,T,A,C	0.058
```

## Summary

As mentioned at the beginning of this document, you're much better off writing the config data directly to files rather than printing it to your screen.
If you have the marker identifiers for your panel (one identifier per line) in a plain text file named, say, `mypanel.txt`, you would create your config files like so.
(Replace "PUR" with the appropriate population.)

```
$ microhapdb marker --format=fasta --delta=25 --min-length=200 --panel=mypanel.txt > mypanel-refr.fasta
$ microhapdb marker --format=offsets --delta=25 --min-length=200 --panel=mypanel.txt > mypanel-defn.tsv
$ microhapdb frequency --format=mhpl8r --population=PUR --marker --panel=mypanel.txt > mypanel-freq-pr.tsv
```

These commands will create three files: `mypanel-refr.fasta` with the reference sequences, `mypanel-defn.tsv` with the marker definitions, and `mypanel-freq-pr.tsv` with the haplotype frequencies.

## Example configuration files

The [`microhapulator/data/configs/`](https://github.com/bioforensics/MicroHapulator/tree/master/microhapulator/data/configs/) directory in the MicroHapulator source code distribution contains example configuration files for a published panel.

## What if my data isn't in MicroHapDB?

If you have marker and/or frequency data that you would like to submit to MicroHapDB, that is always welcome!
See [this page](https://github.com/bioforensics/MicroHapDB#adding-markers-to-microhapdb) for details.

In any case, the examples above show how the reference sequences, marker definitions, and haplotype frequencies should be formatted.
So if your data is not included in MicroHapDB, you should still be able to configure MicroHapulator correctly; it will just take some extra time to prepare the files manually.
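
When the files are prepared by hand, a quick consistency check can catch formatting mistakes before any analysis is run. The following is a minimal sketch and not part of MicroHapulator or MicroHapDB: the function names (`read_fasta`, `read_tsv`, `check_config`) and the marker `mhDEMO` are hypothetical, and the checks are only those implied by the format descriptions above. The expectation that each marker's frequencies sum to roughly 1 (within a rounding tolerance) is an assumption; it holds for the PUR example shown earlier but may not hold for every data source.

```python
import csv
import io
from collections import defaultdict


def read_fasta(stream):
    """Return {marker: sequence}; the first word of each header is the marker name."""
    seqs, name = {}, None
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def read_tsv(stream):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(stream, delimiter="\t"))


def check_config(fasta, defn, freq, tolerance=0.01):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    seqs = read_fasta(fasta)

    # Every offset is 0-based and must fall inside its marker's reference sequence.
    offsets = defaultdict(list)
    for row in read_tsv(defn):
        offsets[row["Marker"]].append(int(row["Offset"]))
    for marker, values in offsets.items():
        if marker not in seqs:
            problems.append(f"{marker}: no reference sequence")
            continue
        for offset in values:
            if not 0 <= offset < len(seqs[marker]):
                problems.append(f"{marker}: offset {offset} outside reference")

    # Each haplotype needs one allele per SNP in the marker definition.
    totals = defaultdict(float)
    for row in read_tsv(freq):
        marker = row["Marker"]
        alleles = row["Haplotype"].split(",")
        if len(alleles) != len(offsets.get(marker, [])):
            problems.append(f"{marker}: haplotype {row['Haplotype']} does not match SNP count")
        totals[marker] += float(row["Frequency"])

    # Assumption: frequencies for a marker sum to about 1, allowing for rounding.
    for marker, total in totals.items():
        if abs(total - 1.0) > tolerance:
            problems.append(f"{marker}: frequencies sum to {total:.3f}")
    return problems


# Toy data for a hypothetical marker, in the three formats described above.
fasta = io.StringIO(">mhDEMO\nACGTACGTAC\n")
defn = io.StringIO("Marker\tOffset\nmhDEMO\t2\nmhDEMO\t7\n")
freq = io.StringIO("Marker\tHaplotype\tFrequency\nmhDEMO\tG,T\t0.6\nmhDEMO\tA,T\t0.4\n")
print(check_config(fasta, defn, freq))  # prints []
```

To check real files, pass open file handles instead of the in-memory strings, for example `check_config(open("mypanel-refr.fasta"), open("mypanel-defn.tsv"), open("mypanel-freq-pr.tsv"))`.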