Merlin (Multipoint Engine for Rapid Likelihood INference) is a software package that uses sparse inheritance trees for pedigree analysis. It performs rapid haplotyping, genotype error detection, IBD and kinship estimation, genotype simulation by gene-dropping, single-point and multipoint non-parametric linkage analysis, QTL and variance components analysis, and affected-pair linkage analysis, and it can handle more markers than other pedigree analysis packages. Merlin runs tens of times faster than most other existing software packages, and its data structure allows analysis of complex pedigrees and dense marker maps using less memory than most other analysis programs. Currently Merlin does not support parametric analyses, but these may be included in upcoming releases. Merlin is accompanied by several related utility and analysis programs (Pedstats, Pedwipe, Merlin-regress, Merlin-X and QTDT) that use the same input files.
In addition to Merlin and QTDT format input files, Merlin also accepts LINKAGE and MENDEL format input files. The names of the input files are not constrained, since they are given to Merlin as command line parameters. QTDT format is very flexible and consists of three plain ASCII text input files (created with Notepad or some other text editor). Note that if you have a pre-makeped pedigree file, a Mega2 format map file and a phenotype file as described on these pages under the title "File Formats", you can use PedConvert to make QTDT format Merlin input files. See "Linkage utils mainpage" for documentation.
The pedigree file contains the family relationships, genotypes and phenotypes of individuals. Only the first five columns (family, individual, father, mother and sex) are mandatory; the others are defined by the accompanying data file. Genotypes and phenotypes may be in any order, since their identity is determined in the data file. Marker allele labels are not constrained either: for example, they can be labelled 'A' and 'G' or 212 and 216. Affection status for a discrete trait is coded as U or 1 for unaffected, A or 2 for affected and X or 0 for a missing phenotype. For genotypes the missing values are either 0 or X. For quantitative traits the default missing values are -99.999 and X, but any character, string or value can be set as the missing value code. In example.ped there is a nuclear pedigree with three children. The father and son are affected for some discrete phenotype; all family members except the father are genotyped for three markers and have age and height information.
<example.ped>
100 100 0 0 1 2 0 0 0 0 0 0 X X 1
100 101 0 0 2 1 100 500 A T 6 5 63.2 163.0 2
100 102 100 101 1 2 200 100 A G 4 6 21.8 183.5 1
100 103 100 101 2 1 400 100 A C 4 6 18.6 170.0 2
100 104 100 101 2 1 200 500 T G 1 6 14.5 168.5 2
<example.ped>
The data file describes the contents of the pedigree file from the sixth column onwards. The data file includes one row per data item in the pedigree file, indicating the data type (encoded as M - marker, A - affection status, T - Quantitative Trait and C - Covariate) and providing a one-word label for each item. When the data type is M, Merlin will read two columns from the pedigree file corresponding to the two marker alleles.
<example.dat>
A TRAIT
M STR1
M SNP1
M STR2
C AGE
T HEIGHT
C SEX
<example.dat>
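The relationship between the data file and the pedigree file can be checked mechanically before running any analysis. The following is a minimal sketch, not part of Merlin: it assumes only the four data types listed above (M, A, T, C), whitespace-separated columns, and no header lines. The function names (`parse_dat`, `expected_columns`, `check_ped`) are hypothetical helpers introduced for illustration.

```python
# Sketch: verify that every pedigree row has the number of columns
# implied by the data file (5 fixed columns, 2 per marker, 1 otherwise).
# Assumption: only the item types M, A, T and C occur in the data file.

DAT_TEXT = """\
A TRAIT
M STR1
M SNP1
M STR2
C AGE
T HEIGHT
C SEX
"""

PED_TEXT = """\
100 100 0 0 1 2 0 0 0 0 0 0 X X 1
100 101 0 0 2 1 100 500 A T 6 5 63.2 163.0 2
100 102 100 101 1 2 200 100 A G 4 6 21.8 183.5 1
100 103 100 101 2 1 400 100 A C 4 6 18.6 170.0 2
100 104 100 101 2 1 200 500 T G 1 6 14.5 168.5 2
"""

FIXED_COLUMNS = 5  # family, individual, father, mother, sex
COLUMNS_PER_TYPE = {"M": 2, "A": 1, "T": 1, "C": 1}


def parse_dat(text):
    """Return a list of (item_type, label) pairs, one per data file row."""
    items = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            items.append((fields[0], fields[1]))
    return items


def expected_columns(items):
    """Number of columns each pedigree row should have."""
    return FIXED_COLUMNS + sum(COLUMNS_PER_TYPE[t] for t, _ in items)


def check_ped(ped_text, items):
    """Return (line_number, columns_found) for every row of the wrong width."""
    want = expected_columns(items)
    bad = []
    for number, line in enumerate(ped_text.splitlines(), start=1):
        fields = line.split()
        if fields and len(fields) != want:
            bad.append((number, len(fields)))
    return bad


items = parse_dat(DAT_TEXT)
print(expected_columns(items))    # 15
print(check_ped(PED_TEXT, items)) # []
```

An empty list means every row has the expected width. The check does not validate the values themselves, so it complements rather than replaces a Pedstats run.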
The map file contains the chromosomes, marker names and genetic locations of the markers. The markers in the map file may be in any order and order does not need to correspond to the marker order in the pedigree or data files. The map file may contain more markers than individual data files.
<example.map>
1 STR2 23.123
2 SNP1 2.423
1 STR1 1.399
<example.map>
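Because the map file may list markers in any order and may contain extra markers, a quick cross-check against the data file can catch misspelled marker names early. The sketch below is an illustration only and assumes a three-column, whitespace-separated map file (chromosome, marker name, position). The helper names are hypothetical, and rows whose position is not numeric are skipped on the assumption that they are header lines.

```python
# Sketch: cross-check marker names from the data file against the map file
# and list the mapped markers in chromosome/position order.

MAP_TEXT = """\
1 STR2 23.123
2 SNP1 2.423
1 STR1 1.399
"""

DAT_MARKERS = ["STR1", "SNP1", "STR2"]  # labels of the M rows in example.dat


def parse_map(text):
    """Return {marker_name: (chromosome, position)}."""
    markers = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        chromosome, name, position = fields
        try:
            markers[name] = (chromosome, float(position))
        except ValueError:
            continue  # assumed to be a header row
    return markers


def missing_from_map(dat_markers, map_markers):
    """Markers named in the data file that have no map position."""
    return [m for m in dat_markers if m not in map_markers]


def chromosome_key(chromosome):
    """Sort numeric chromosomes numerically, others (e.g. X) after them."""
    if chromosome.isdigit():
        return (0, int(chromosome), "")
    return (1, 0, chromosome)


def by_position(map_markers):
    """Marker names ordered by chromosome, then by position."""
    return sorted(
        map_markers,
        key=lambda m: (chromosome_key(map_markers[m][0]), map_markers[m][1]),
    )


map_markers = parse_map(MAP_TEXT)
print(missing_from_map(DAT_MARKERS, map_markers))  # []
print(by_position(map_markers))                    # ['STR1', 'STR2', 'SNP1']
```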
After the input files are created, it is imperative to verify that they are interpreted correctly by Merlin. Pedstats can be used for this check and is also highly useful for producing summary information and graphical output of pedigree, genotype and phenotype data. Pedstats also performs rapid Hardy-Weinberg checking from genotype data. For example, summary data for both sexes separately can be produced with the following command:
$ pedstats -p example.ped -d example.dat --bySex
Note that, as with all Merlin-affiliated software, the order of the options does not matter as long as each input file follows its corresponding option. Output is written to standard output, so it must be redirected to a file, for example with the command:
$ pedstats -p example.ped -d example.dat --bySex > example_pedstats_bysex.out
Unlikely genotypes can be detected with Merlin's --error option:
$ merlin -p example.ped -d example.dat -m example.map --error
A summary file of unlikely genotypes, merlin.err, is generated as output. The authors also provide a utility program called Pedwipe that uses this merlin.err file to erase the unlikely genotypes from the pedigree file.
$ pedwipe -p example.ped -d example.dat
Pedwipe produces a pedigree file named wiped.ped and a data file named wiped.dat as output that can be used in all subsequent Merlin analyses. Note, however, that the output is produced in strict Merlin format and cannot be analyzed with other software packages.
Identity-by-descent (IBD) estimation is used in allele-sharing based (non-parametric) linkage analyses and is needed as input for some programs. Merlin can estimate the number of alleles shared identical-by-descent among relatives in a pedigree, and summarize this information either as the probabilities that a given pair shares 0, 1 or 2 alleles IBD or as the kinship coefficient between each pair at a particular locus. By default Merlin uses information on all markers for these analyses, but each marker position can also be considered individually using the --singlepoint option. For example, the command for producing the IBD matrices for all relative pairs, using information at all marker loci simultaneously and using marker names instead of their positions, is:
$ merlin -p example.ped -d example.dat -m example.map --ibd --markernames
The command for producing the kinship coefficient matrix for all relative pairs for each marker position separately is:
$ merlin -p example.ped -d example.dat -m example.map --kin --singlepoint
Merlin produces output files containing the matrices named merlin.ibd and merlin.kin, respectively. Some programs require IBD estimates as input for their analysis. For example, QTDT tests for association using all phenotypes from related individuals and requires IBD matrices to distinguish between linkage and association.
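The two summaries mentioned above, IBD sharing probabilities and the kinship coefficient, are related by a simple formula. The sketch below illustrates that relationship under the assumption that neither individual is inbred. It is not Merlin's own code and does not read merlin.ibd or merlin.kin; the function name is hypothetical.

```python
# Sketch: kinship coefficient at a locus from the probabilities of sharing
# 0, 1 or 2 alleles IBD. Assumption: neither individual is inbred, so a pair
# sharing one allele IBD contributes 1/4 and a pair sharing two contributes 1/2.

def kinship_from_ibd(p0, p1, p2, tolerance=1e-6):
    """Return 0.25 * P(IBD=1) + 0.5 * P(IBD=2) after a sanity check."""
    if min(p0, p1, p2) < 0:
        raise ValueError("IBD probabilities must be non-negative")
    if abs(p0 + p1 + p2 - 1.0) > tolerance:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.25 * p1 + 0.5 * p2


# Prior expectations for some common relationships:
print(kinship_from_ibd(0.25, 0.5, 0.25))  # full siblings: 0.25
print(kinship_from_ibd(0.0, 1.0, 0.0))    # parent and offspring: 0.25
print(kinship_from_ibd(1.0, 0.0, 0.0))    # unrelated pair: 0.0
```

At a specific locus the estimated probabilities deviate from these prior values depending on the genotype data, which is what makes them informative for linkage.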
Haplotypes are useful, for example, for increasing the statistical power of association analyses. Information about gene flow in a pedigree can be used to reconstruct likely haplotypes for families and individuals. Merlin has three haplotype estimation modes. It can provide haplotypes corresponding to the most likely pattern of gene flow (--best command line option), sample gene flow patterns according to their likelihood (--sample) or provide all non-recombinant haplotypes (--zero --all). For example, the command for producing the haplotypes based on the most likely pattern of gene flow, written to a file named 'example.chr', is:
$ merlin -p example.ped -d example.dat -m example.map --best --prefix example
Output is produced in a file named example.chr.
Merlin has many general options that apply to all linkage analysis methods, concerning the number of calculation points (e.g. options --steps and --grid), computational limits (e.g. options --bits and --minutes) and resource usage (e.g. options --megabytes and --swap). Please see the Merlin reference for a full list and descriptions of all options. Output for linkage analyses is written to standard output, so it must be redirected to a file. By default Merlin uses multipoint linkage analysis for all statistics; the --singlepoint option requests single-point linkage analyses instead. The order of command line options is not constrained. For most analyses the LOD score for each individual pedigree can be output using the --perfamily option, which can be useful in detecting the families that contribute most to the linkage signal.
Merlin implements algorithms for calculating the Whittemore and Halpern NPL all and NPL pairs statistics, and it also calculates a LOD score using the Kong and Cox linear model. The NPL all statistic is requested with the --npl option and the NPL pairs statistic with the --pairs option. For example, the command for calculating both non-parametric statistics for our trait, with output redirected to a file named 'example.npl', is:
$ merlin -p example.ped -d example.dat -m example.map --npl --pairs > example.npl
Variance components analysis is a powerful method for localizing loci for normally distributed, unselected quantitative traits. Variance components analyses can also incorporate user-specified covariates. For example the command for variance components analysis for height using age and sex as covariates is:
$ merlin -p example.ped -d example.dat -m example.map --vc --usecovariates
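Merlin-regress, the accompanying regression-based analysis program, takes the mean, variance and heritability of the trait as command line parameters:
$ merlin-regress -p example.ped -d example.dat -m example.map --mean 0.00 --variance 1.00 --heritability 0.85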
For non-normally distributed quantitative traits it is advisable to use statistics other than variance components or Haseman-Elston regression. Merlin implements two such statistics described by Whittemore and Halpern (1994) and Kong and Cox (1997). Note that covariates cannot be included in the analysis; adjustment of the quantitative trait for covariates must be performed prior to the analysis step. These statistics are accessed using the --qtl and --deviates options, where the former is suitable for unselected and the latter for selected samples. When the --qtl option is selected, Merlin uses the sample mean to estimate the population mean. For example, the command for QTL analysis for height using the sample mean is:
$ merlin -p example.ped -d example.dat -m example.map --qtl
When the --deviates option is selected, Merlin fixes the population mean at zero. This option is suitable for the analysis of selected samples if the sample mean is subtracted from individual phenotypes prior to analysis. For example, the command for QTL analysis for height using the population mean deviates is:
$ merlin -p example.ped -d example.dat -m example.map --deviates
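The centering step required before using --deviates can be done with a few lines of scripting. The following is a minimal sketch under stated assumptions: the trait values are held as strings, the missing value codes are the defaults named earlier (X and -99.999), and `center_trait` is a hypothetical helper, not a Merlin utility. A known mean can be passed in place of the mean of the observed values.

```python
# Sketch: subtract a mean from a quantitative trait column before analysis
# with --deviates, leaving missing value codes untouched.

MISSING_CODES = {"X", "-99.999"}  # default missing codes for quantitative traits


def center_trait(values, mean=None, missing=MISSING_CODES):
    """Return the values as strings with the mean subtracted.

    If no mean is given, the mean of the non-missing values is used.
    """
    observed = [float(v) for v in values if v not in missing]
    if mean is None:
        if not observed:
            raise ValueError("no observed values to compute a mean from")
        mean = sum(observed) / len(observed)
    return [v if v in missing else f"{float(v) - mean:.3f}" for v in values]


# HEIGHT column of example.ped (the father has no measurement).
heights = ["X", "163.0", "183.5", "170.0", "168.5"]
print(center_trait(heights))
# ['X', '-8.250', '12.250', '-1.250', '-2.750']
```

The centered values would then replace the HEIGHT column in the pedigree file used with --deviates.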
In most instances it is useful to obtain an empirical distribution of the test statistic (e.g. LOD score) for a given dataset instead of relying solely on asymptotic significance levels. Simulating datasets conditional on the properties of the observed data can yield a reliable estimate of the false positive rate of declaring linkage in the observed data, provided that enough (up to 10000) replicates are simulated and analyzed. With the --simulate option, Merlin can generate random datasets that look like the original data in terms of marker informativeness, spacing and missing data patterns. In these datasets, marker data are simulated under the null hypothesis of no linkage or association to observed phenotypes. Phenotypic measurements, including covariates, quantitative traits and affection status, are preserved. For example, the command for simulating a dataset comparable to our own with the output file prefix 'example' is:
$ merlin -p example.ped -d example.dat -m example.map --simulate -r36548 --prefix example --save
The output, produced in files example-replicate.dat, example-replicate.freq, example-replicate.map and example-replicate.ped, is in strict Merlin format and cannot be analyzed with other software. If you simulate multiple datasets, bear in mind that you must use a unique random seed for each replicate (option -r, for example -r99999).
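When many replicates are needed, the command lines can be generated by a script so that no seed is reused. The sketch below only builds the argument lists and does not run Merlin. The per-replicate prefixes (example1, example2, ...) and the seed range are assumptions made for illustration, and the empirical p-value uses the common (r + 1) / (n + 1) estimator, which is a standard choice rather than something prescribed by Merlin.

```python
# Sketch: build one --simulate command per replicate with unique random seeds,
# and estimate an empirical p-value from the replicates' maximum LOD scores.
import random


def replicate_commands(n, master_seed=36548, ped="example.ped",
                       dat="example.dat", map_file="example.map"):
    """Return n argument lists, each with its own seed and output prefix."""
    rng = random.Random(master_seed)
    # sample() draws without replacement, so the seeds are guaranteed unique.
    # The upper bound on the seed is an assumption, not a documented limit.
    seeds = rng.sample(range(1, 1_000_000), n)
    commands = []
    for index, seed in enumerate(seeds, start=1):
        commands.append([
            "merlin", "-p", ped, "-d", dat, "-m", map_file,
            "--simulate", f"-r{seed}",
            "--prefix", f"example{index}",  # hypothetical naming scheme
            "--save",
        ])
    return commands


def empirical_p_value(observed, simulated):
    """Share of replicates at least as extreme as the observed statistic."""
    exceeding = sum(1 for value in simulated if value >= observed)
    return (exceeding + 1) / (len(simulated) + 1)


for command in replicate_commands(3):
    print(" ".join(command))

# Toy numbers, not real results: one of three replicates reaches LOD 3.0.
print(empirical_p_value(3.0, [1.0, 2.0, 3.5]))  # 0.5
```

Each argument list could be passed to a job scheduler or to `subprocess.run`; the maximum LOD score of each analyzed replicate then feeds `empirical_p_value`.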
Merlin produces different types of result files depending on the analysis method used. Please see the Merlin documentation for assistance in their interpretation.
Gonçalo Abecasis
Abecasis GR, Cherny SS, Cookson WO and Cardon LR. Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet (2002) 30:97-101
The authors of the program provide an excellent web tutorial on Merlin and Pedstats which every user should read thoroughly.
Merlin Tutorial
Merlin Reference
Pedstats Tutorial
Pedstats Reference