Setting DNA substitution models in BEAST

[Figure: the JC model of sequence evolution, from TreeThinkers.]

BEAST (Bayesian Evolutionary Analysis Sampling Trees) provides a growing set of clear options for specifying evolutionary models of DNA substitution for the alignments loaded into the program. However, the set of models that are readily available and “spelled out” in the drop-down menus of the BEAUti (Bayesian Evolutionary Analysis Utility) GUI is much smaller than the standard set of 11 substitution schemes considered by other model selection programs (which give rise to a total of 88 models in jModelTest, or 56 models in PartitionFinder). This poses a problem that I have heard several BEAST users bring up in online discussion groups, and that several people have asked me about personally. In this post, I shed some light on this problem and clear up some of the confusion by detailing how users can easily specify different substitution models for their BEAST runs.

Background: Substitution models in BEAST

I will assume that readers are familiar with DNA substitution models. There are also a number of details about how BEAST handles these models, which you can learn from the BEAST2 book or the manual, that I will not cover here; for example, by default, BEAST normalizes the rate matrix so that the mean rate is 1.0. Moving on to the issues…
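
As a point of reference, that normalization constrains the rate matrix \(Q = (q_{ij})\), with equilibrium base frequencies \(\pi_i\), so that branch lengths are measured in expected substitutions per site:

\[ -\sum_{i} \pi_i \, q_{ii} \;=\; 1 \]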

It seems that many people struggle with the fact that it’s not a simple “one-click” procedure to specify the diverse variety of substitution models that are frequently selected during model selection analyses conducted prior to setting up BEAST runs. This is because BEAST1 only gives obvious options for three of the most popular models of DNA substitution (TrN, HKY, and GTR), while BEAST2 gives obvious options for four (JC69, TrN, HKY, and GTR). Other models, such as TIM, K80, and F81, can thus seem cryptic to some users, who will often exclaim in dismay, “Where is the option for model X???” or “I don’t see how to specify model Y!!!” and then post a question about it (that may languish, never to be directly answered, who knows).

Solving the problem: Learning BEAST functionalities “under your nose”

The real solution to the above problems is to realize what options are already available to you: BEAUti provides several important settings in the Site Model panel that every user should learn in order to tailor the model of evolution for each alignment/partition. These options sit right “under your nose,” and Tables 1 and 2 below show where they are and how to use them:

Table 1. Details for setting substitution models in BEAST1 (e.g. v1.8.3).

Best-fit substitution model   Model to select in BEAUti   Additional specifications in BEAUti
JC69                          JC69                        None
TrN                           TN93                        None
TrNef                         TN93                        Set base Frequencies to “All Equal”
K80 (K2P)                     HKY                         Set base Frequencies to “All Equal”
F81                           GTR                         Turn off the operators for all rate parameters
HKY                           HKY                         None
SYM                           GTR                         Set base Frequencies to “All Equal”
TIM                           GTR                         Edit the XML file by hand so that all other rates are equal to the AG rate
TVM                           GTR                         Uncheck the AG rate operator
TVMef                         GTR                         Uncheck the AG rate operator and set base Frequencies to “All Equal”
GTR                           GTR                         None
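
To make the third-column entries more concrete, take the TVM row. In BEAST1’s gtrModel the CT rate is the implicit reference rate, fixed at 1.0 (notice there is no rateCT element in the BEAST1 XML block shown later in this post), so leaving the AG rate at its starting value of 1.0 enforces AG = CT, i.e. the equal-transition-rates constraint of TVM. “Unchecking the AG rate operator” amounts to removing or commenting out the scale operator that BEAUti writes for that parameter. Here is a minimal sketch of the relevant operators block, assuming hypothetical BEAUti-generated parameter ids (gtr.ac, gtr.ag, etc.) and default operator settings; yours may differ:

<operators id="operators" optimizationSchedule="default">
	<scaleOperator scaleFactor="0.75" weight="0.1">
		<parameter idref="gtr.ac"/>
	</scaleOperator>
	<!-- Operator on the AG rate commented out, so AG stays fixed at 1.0
	     and equals the reference CT rate, giving the TVM model:
	<scaleOperator scaleFactor="0.75" weight="0.1">
		<parameter idref="gtr.ag"/>
	</scaleOperator>
	-->
	<!-- ...the remaining operators for gtr.at, gtr.cg, and gtr.gt, plus all
	     other operators generated by BEAUti, stay as they are... -->
</operators>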

Table 2. Details for setting substitution models in BEAST2 (e.g. 2.4.0+).

Best-fit substitution model   Model to select in BEAUti 2   Additional specifications in BEAUti 2
JC69                          JC69                          None
TrN                           TN93                          None
TrNef                         TN93                          Set base Frequencies to “All Equal”
K80 (K2P)                     HKY                           Set base Frequencies to “All Equal”
F81                           GTR                           Fix all rate parameters to 1.0 (uncheck each “estimate” box)
HKY                           HKY                           None
SYM                           GTR                           Set base Frequencies to “All Equal”
TIM                           GTR                           Fix the CT and AG rate parameters to 1.0 (uncheck the “estimate” boxes)
TVM                           GTR                           Fix the AG rate parameter to 1.0 (uncheck the “estimate” box)
TVMef                         GTR                           Fix the AG rate parameter to 1.0 (uncheck the “estimate” box) and set base Frequencies to “All Equal”
GTR                           GTR                           None

I constructed these tables quickly, based on the options available in the different BEAST versions, my knowledge of the substitution models, and the list of model code for different substitution models on the BEAST1 website (XML format information for BEAST1 and earlier versions). How do you use the tables to set a desired site model? Easy. Look up the model you want in the first column of Table 1 or 2; the second column tells you which model to select in BEAUti, and the third column lists the modifications, if any, you need to make to that model or to the XML file itself to get the model you want.

Another way…

If you would like more detail, you can also draw on other resources available online to do this properly. For example, if you wanted to set the F81 model for a partition of your dataset, you could search the substitution model code page mentioned above for “F81”. Doing so would turn up the following block for BEAST1:

<!-- *** SUBSTITUTION MODEL FOR PARTITION Gene2 *** -->
<!-- The F81 substitution model (Felsenstein, 1981) -->
<gtrModel id="F81_Gene2">
	<frequencies>
		<frequencyModel dataType="nucleotide">
			<frequencies>
				<parameter id="F81_Gene2.frequencies" value="0.25 0.25 0.25 0.25"/>
			</frequencies>
		</frequencyModel>
	</frequencies>
	<rateAC>
		<parameter id="F81_Gene2.ac" value="1.0"/>
	</rateAC>
	<rateAG>
		<parameter id="F81_Gene2.ag" value="1.0"/>
	</rateAG>
	<rateAT>
		<parameter id="F81_Gene2.at" value="1.0"/>
	</rateAT>
	<rateCG>
		<parameter id="F81_Gene2.cg" value="1.0"/>
	</rateCG>
	<rateGT>
		<parameter id="F81_Gene2.gt" value="1.0"/>
	</rateGT>
</gtrModel>

We can infer from this that the F81 model can be specified in any BEAST version by first selecting the GTR model, then fixing all rates to 1.0 and setting the base frequencies to “All Equal” (i.e. each equal to 0.25), all in the Site Model panel in BEAUti. You also need to make sure that priors are not placed on these parameters, since they are now fixed; do that by commenting out the rate priors in the XML file. Likewise, delete the rate parameters from the log, since there is no point logging parameters you are not estimating.
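
For concreteness, here is a minimal sketch of those last two edits, continuing the F81_Gene2 example above. The exact prior and log elements BEAUti writes (and the prior distribution it uses for the rates) may differ from the ones assumed here:

<!-- In the prior section, comment out the prior on each now-fixed rate
     parameter (shown for the AC rate; repeat for ag, at, cg, and gt):
<gammaPrior shape="0.05" scale="10.0" offset="0.0">
	<parameter idref="F81_Gene2.ac"/>
</gammaPrior>
-->

<!-- In the file log, delete the entries for the fixed rate parameters,
     since they are no longer being estimated: -->
<log id="fileLog" logEvery="10000" fileName="Gene2.log">
	<posterior idref="posterior"/>
	<!-- <parameter idref="F81_Gene2.ac"/> deleted, along with the entries
	     for .ag, .at, .cg, and .gt -->
</log>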

~J

Additional Reading

Substitution model basics from Wikipedia: substitution model; models of DNA evolution

BEAST users Google group threads

BEAST2 website FAQ

PartitionFinder website

Latest *BEAST trace: Most beautiful thing I’ve seen today

My new *BEAST (starbeast) species tree runs are converging wonderfully in BEAST v1.8.3, so I am off to a great start this morning. Here is a screenshot of the results from one of these runs, which I ran for 200 million generations using 3 normally distributed calibration points, visualized in Tracer. Note the good posterior trace, the very high ESS scores for all parameter estimates, and the good behavior of the burn-in (the default 10% value works quite well in this case).

This is the most beautiful thing I’ve seen in my research today, or even this week!

[Screenshot: 38_STARBEAST_PSmle_post_trace_Mar10]

I am now checking and processing the results from multiple such runs that use different species delimitation schemes (trait mapping schemes/files). They look like they are going to give me a nice estimate of the species tree for this dataset.

Currently, the second step of these runs is still underway, using path sampling/stepping-stone sampling to calculate marginal likelihood estimates (MLEs) for each model. When that is completed, I will calculate Bayes factors comparing the models and use them for Bayes factor species delimitation (BFD) sensu Grummer et al. (2014).
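
For anyone following along, the Bayes factor comparing two competing models falls straight out of their marginal likelihood estimates; by convention it is reported on the 2 ln scale:

\[ 2\ln \mathrm{BF}_{12} \;=\; 2\left[\,\ln \mathrm{ML}(M_{1}) - \ln \mathrm{ML}(M_{2})\,\right] \]

On the usual Kass and Raftery (1995) scale, 2 ln BF above about 10 is read as very strong support for the first model, which is the kind of criterion applied in the BFD framework of Grummer et al. (2014).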

Is anyone else working on BFD analyses like this right now? Would love to hear from you.

~J

References

Grummer JA, Bryson RW Jr., Reeder TW. 2014. Species delimitation using Bayes factors: simulations and application to the Sceloporus scalaris species group (Squamata: Phrynosomatidae). Systematic Biology 63(2): 119-133.

Migrate-N and phylogeography: Bayesian priors and run settings

In a previous post I discussed using Migrate-N for phylogeography (e.g. migration model testing, run settings, general tips). I mentioned that Migrate-N implements both maximum likelihood (ML) and Bayesian algorithms, and I discussed them a little, including how to run under the different estimation modes. While Bayesian analysis is likely to be preferred for several reasons, a common critique of Bayesian methods in statistical population genetics and elsewhere is that it is difficult to set the prior information for each run. I pointed out that using a priori information from ecology, for example, to set boundaries on migration or theta, or using broad/previously-published rate calibrations, can increase the objectivity of this step. However, I also suggested that when information is essentially totally lacking, one approach might be an empirical, step-wise one using ML estimates of the run parameters and their 95% CIs to set the priors for subsequent Bayesian runs. Although this might produce good results and has been published before, I did not stress enough that this form of analysis, while empirically based, actually runs counter to Bayesian theory and hypothesis-testing methods.

The reason for this is that in Bayesian analysis we draw our priors from information external to the data. The goal is then to let the Markov chain Monte Carlo (MCMC) simulations find the best parameter estimates (the peaks of the marginal posterior densities), without the starting values (priors) themselves being the focus. In other words, we use the priors to define distributions of values that the program draws from during the analysis. As I mentioned before, this means an ML-then-Bayes step-wise approach requires “looking at the data twice,” because we would have used the data to generate the priors. This is contrary to Bayesian inference and adds an unnecessary step.
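
To make the “looking at the data twice” point concrete, recall Bayes’ theorem for parameters \(\theta\) and data \(D\):

\[ P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \]

The prior \(P(\theta)\) is assumed to be specified independently of \(D\); if it is instead built from ML estimates obtained from the same data, then \(D\) effectively enters the right-hand side twice, and the resulting posterior will be artificially overconfident.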

After reading through the Migrate-N Google group, it has become obvious to me, mainly from Peter Beerli’s comments and criticisms, that the best way to start Bayesian runs in Migrate-N with little a priori information is to use reasonable uniform priors for theta and M and allow the program to do its job. For mtDNA, Beerli has suggested using a uniform theta prior between min = 0.0 and max = 0.1, and setting the prior for M around means of 100 or 1000 with ranges from 0 to 1000 or 10000. These may be broad, but they seem highly realistic, i.e. likely to capture the values of these parameters in natural populations, and that is very desirable. Personally, I feel that a theta-max of 0.1 is a bit high, since I have never seen such a theta actually reported, from either a censused or an indirect (e.g. popgen) estimate of this parameter, in any published study. That doesn’t mean it has never happened, of course, and we should remember that “population” can be defined in multiple ways given different assumptions about the systems under study. To illustrate why I think this is a high value, let’s imagine a breeding population of vertebrates with (mtDNA-derived) theta = 0.1, a known generation time (2 years), and a known mutation parameter (mu = 1.0 x 10^-9). We can convert theta to Ne simply by dividing theta by mu, which gives us Ne = 10^8! This is scarcely a believable Ne value for any vertebrate at any level of biological organization that we would like to consider our population! It even strains credulity for the most abundant vertebrates on the planet, including humans, with a current census size of around 7 billion but an estimated Ne that is rather small, as well as a historical Ne known to have been much smaller. Lower, more realistic values would be expected for regional groupings (up to something like 0.001) or local communities (more like 0.0001).
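
Spelling that back-of-the-envelope calculation out (and ignoring the constant inheritance scalar relating theta to Ne, which changes the answer by a small factor but not the order of magnitude):

\[ N_e \;\approx\; \frac{\theta}{\mu} \;=\; \frac{0.1}{1.0 \times 10^{-9}} \;=\; 1 \times 10^{8} \]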

Another important point is that the priors of our Bayesian runs should not influence the results. If your prior “drives” your results towards a particular value (e.g. if your prior mean is simply returned as your posterior mean), then the data can be considered uninformative. There will usually still be some correlation between prior and posterior, but of course there should be essentially no relationship between the prior settings and the likelihood, as viewed in the joint-marginal distributions.

~J