Here’s the Gist…
BEAST (Bayesian Evolutionary Analysis Sampling Trees) software provides a growing set of clear options for specifying evolutionary models of DNA substitution for alignments loaded into the program. However, the set of models that are readily available and “spelled out” in drop-down menus in the BEAUti (Bayesian Evolutionary Analysis Utility) GUI is much smaller compared to the standard set of 11 substitution schemes considered by other software programs for model selection (giving rise to a total of 88 models in jModelTest, or 56 models in PartitionFinder). This poses a problem that I have heard several BEAST users bring up in online discussion groups, and that several people have asked me about personally. In this post, I shed some light on this problem and clear up some of the confusion by giving details on how users can easily specify different substitution models for their BEAST runs.
I will assume that readers are familiar with DNA substitution models. Also, there are a number of details about how BEAST handles DNA substitution models, which you can learn from the BEAST2 book or the manual, but that I will not cover. For example, by default, BEAST normalizes the rate matrix so that the mean rate is 1.0, and so on. Moving on to the issues…
It seems that many people are having issues with the fact that it’s not a simple “one-click” procedure to specify the diverse variety of substitution models that are frequently selected during model selection inference conducted prior to setting up BEAST runs. This is because BEAST1 only gives the obvious options to specify three of the most popular models of DNA substitution: TrN, HKY, and GTR, while BEAST2 versions give users obvious options to specify one of four models: JC69, TrN, HKY, and GTR. Thus other models such as TIM, K80, and F81 can seem cryptic to some users, who will often exclaim in dismay, “Where is the option for model X???” or “I don’t see how to specify model Y!!!” and then post a question about it (that may languish, never to even be directly answered, who knows).
The real solution to the above problems is to realize the options available to you, especially that BEAUti actually gives several important options in the Site Model panel that each user must learn to use to tailor the models of evolution for each alignment/partition. These options are located right “under your nose,” but I’ll show you where they are and how to use them in Tables 1 and 2 below, as follows:
Table 1. Details for setting substitution models in BEAST1 (e.g. v1.8.3).
|Best-fit substitution model||(Base) Model to select in BEAUti||Additional specifications in BEAUti|
|TrNef||TN93||base Frequencies set to “All Equal”|
|K80 (K2P)||HKY||base Frequencies set to “All Equal”|
|F81||GTR||turn off operators for all rate
|SYM||GTR||base Frequencies set to “All Equal”|
edit XML file by hand so that all other rates are equal to the AG rate
|TVM||GTR||unchecking AG rate operator|
|TVMef||GTR||unchecking AG rate operator and setting base Frequencies to “All Equal”|
Table 2. Details for setting substitution models in BEAST2 (e.g. 2.4.0+).
|Best-fit substitution model||(Base) Model to select in BEAUti 2||Additional specifications in BEAUti 2|
|TrNef||TN93||base Frequencies set to “All Equal”|
|K80 (K2P)||HKY||base Frequencies set to “All Equal”|
|F81||GTR||fix all rate parameters to 1.0 (uncheck the “estimate” box)|
|SYM||GTR||base Frequencies set to “All Equal”|
|TIM||GTR||fix CT and AG rate parameters to 1.0 (uncheck the “estimate” box)|
|TVM||GTR||fix the AG rate parameter to 1.0 (uncheck the “estimate” box)|
|TVMef||GTR||fix the AG rate parameter to 1.0 (uncheck the “estimate” box), and also set base Frequencies to “All Equal”|
I have quickly constructed these tables based on different options available in BEAST versions, my knowledge of the substitution models, and the list of model code for different substitution models on the BEAST1 website (information for XML format for BEAST1 / earlier versions than present). How can you use the above tables to set a desired site model? Easy. You simply check Tables 1 and 2 and for a given model you might want to set (first column), the tables tell you which model to select in BEAUti (second column) and the modifications, if any, you will need to make to that model or the XML file itself to get the model you want (third column).
If you would like more detail, you can also draw on other resources available online to do this properly. For example, if you wanted to set the F81 model for a partition of your dataset, you could also search the substitution model code page mentioned above for “F81”. If you did this you would find the following block for BEAST1:
<!-- *** SUBSTITUTION MODEL FOR PARTITION Gene2 *** --> <!-- The F81 substitution model (Felsenstein, 1981) --> <gtrModel id="F81_Gene2"> <frequencies> <frequencyModel dataType="nucleotide"> <frequencies> <parameter id="F81_Gene2.frequencies" value="0.25 0.25 0.25 0.25"/> </frequencies> </frequencyModel> </frequencies> <rateAC> <parameter id="F81_Gene2.ac" value="1.0"/> </rateAC> <rateAG> <parameter id="F81_Gene2.ag" value="1.0"/> </rateAG> <rateAT> <parameter id="F81_Gene2.at" value="1.0"/> </rateAT> <rateCG> <parameter id="F81_Gene2.cg" value="1.0"/> </rateCG> <rateGT> <parameter id="F81_Gene2.gt" value="1.0"/> </rateGT> </gtrModel>
We can infer from this that the F81 model can be specified in any BEAST version by first specifying the GTR model, and then setting all rates to 1.0 and also setting all base frequencies to be equal “All Equal” (i.e. each equal to 0.25). And you would specify this information in the Site Model panel in BEAUti. You would also need to make sure that the priors were not included for these parameters, since they will be fixed; so you do that by commenting out the rate priors within the XML file. Accordingly, you would also need to delete the rate parameters from the log, because you won’t need to log those parameters since you won’t be estimating them.
Beast users google group threads:
BEAST2 website FAQ: link
PartitionFinder website: PartitionFinder
Hi guys, OK, in case you haven’t heard, there is a new version of BEAST2, so make sure that you go get that update on GitHub. According to Remco Bouckaert, the new version, v2.3.2, improves the ability of the user to interact with data in BEAUti, and it also improves the efficiency of some BEAST functions, particularly the MRCAPrior. Moreover, LogCombiner has been fixed to ensure logcombiner burn in editing finished properly, and LogAnalyser now has one line per file mode. Supposedly, BEAST now has “more sensible error messages” in several packages.
However, I find that BEAST is still plagued with problematic error reporting in some cases. Take for example this infamous error: “Parsing error – the input file is not a valid XML file”. The program developers have made good suggestions for dealing with such vague errors. But hopefully the guys will continue to improve error reporting and related issues much more in future versions.
Hi BEAST users subscribed to justinbagley.org. As what is for many of you an academic summer starts winding down its backslope, this is a little reminder that you should try and finish all the BEAST analyses you’ve been putting off, and in case you haven’t done so already you should consider updating to the latest version of BEAST2, v2.3.0, which was released in June. Make sure you also update to the most up-to-date java version too; don’t worry, after the update BEAST v2.3.0 will still run.
However, I will caution that I have run into several problems with BEAST2 recently, and because of this I have had to revert back to earlier versions of BEAST or the “other” version (v1), as doing so was either a) necessary, or b) easier, to complete a set of analyses. Thus, I am backsliding on my otherwise spotless recent adherence to using BEAST2, and loving it.
In fact, my recent experiences with BEAST v2 led me to ask, “Do you need it?” There are analyses that I could not even figure out how to run in BEAST2 that input files can relatively easily be constructed for an ran in BEAST1. Other than resorting back to legacy versions of BEAST v1, such as the excellent oldy-but-goody v1.5.4, I have enjoyed updating to the latest version of BEAST v1.8.2, which was released in March 2015. One great thing about this was that I could create an xml file using BEAUti v1.5.4 and still open it and run it in BEAST v1.8.2. Also, Andrew Rambaut and friends have a nice new website for BEAST1 with new organization and new tutorials that are nice and helpful. More than likely, you are already aware of this though, as the BEAST2 website has struggled a bit in terms of tutorial availability, and so you’ve probably already ran across lots of older and newer tutorial content from the BEAST1 team while racking your brain trying to figure out how to do something in BEAST2.
Given I’ve recently had success running 1) coalescent historical demographic modeling analyses, 2) standard multilocus BEAST analyses, and 3) total-evidence “tip-dating” analyses including molecular and morphological data for extant and extinct taxa as tips and calibrating with fossils (sensu Pyron, 2011; Ronquist et al., 2012), all using BEAST v1.75 or v1.8.2, I’m starting to feel that I only prefer to use BEAST2 for certain other analyses now, such as *BEAST runs. But I still use BEAST2 for those. So, in conclusion, I’ll argue, “You need both. For now, at least. Just make sure you have the latest versions of the software and tutorials.”
Pyron, R.A., 2011. Divergence time estimation using fossils as terminal taxa and the origins of Lissamphibia. Syst. Biol. 60(4), 466-481.
Ronquist, F., Klopfstein, S., Vilhelmsen, L., Schulmeister, S., Murray, D.L., Rasnitsyn, A.P., 2012. A total-evidence approach to dating with fossils, applied to the early radiation of the Hymenoptera. Syst. Biol. 61, 973-999.
Yet again, a new version of an important piece of software for molecular ecologists, evolutionary biologists, and biogeographers has just become available. Mary Kuhner and colleagues have just released a new version of the software program Lamarc, a program that originally only implemented likelihood analysis but now implements ML and Bayesian algorithms for estimating population migration rates (gene flow), population sizes, exponential growth rates and recombination rates. You can download the most up-to-date version of Lamarc, Lamarc v2.1.9, from the corresponding Lamarc download website, which can be found here. I have found that some people have switched to Peter Beerli’s software program Migrate-n to estimate gene flow levels between populations and species, and to test different migration models (e.g. biogeographic hypotheses) in a Bayesian framework, using Bayes factors. However, Lamarc does several things well, and has recently (in the literature I read, at least) maintained its popularity for estimating population growth rates, g, which provides a nice complement to other analyses for inferring the historical demography of populations. For example, a positive g estimate can support an inference of a recent (on-going, or postglacial) population expansion from a group of molecular DNA sequences from present-day samples.
OK, heads up everyone using BEAST, or interested in summarizing and analyzing the results of Bayesian analyses in other software programs! There is a new version of the program Tracer available at Andrew Rambaut’s website. You can obtain the new version of Tracer, Tracer v1.6 by downloading the appropriate version here. The new version supports several new features, and here is a list of the new stuff they’ve added to the program, taken from the website:
If you haven’t downloaded a new version of Tracer in awhile, you may also notice several other new features… and obviously there have been some bug fixes. So get the new version!
Cheers ~ J
Andrew Rambaut has released the first version of a new software program, BEASTGen, that transforms phylogenetic data between the standard NEXUS, FASTA and BEAST XML formats. Could this mean that we are one step closer to calming many graduate students’ (and other people’s) woes over this issue?
The new software can be downloaded from the BEAST Google code website, at https://code.google.com/p/beast-mcmc/downloads/detail?name=BEASTGen_v1.0.tgz.
Andrew has also written a brief set of instructions that can be found on the BEAST website, at http://beast.bio.ed.ac.uk/BEASTGen.
Here is my comprehensive list of phylogenetics packages for R, with links to their websites (for some, links to support pages or more detailed information are provided in subsequent sections). The requirement for inclusion in the list is at least one function related to manipulation or analysis of phylogenetic data. Please E-mail me if you know of another program to add, or if you are developing a new package for phylogenetics in R. The packages, listed in alphabetical order, are as follows:
Broad overview: a brief summary of what the packages do[More coming soon.]
At the time of this writing, I am fortunate to be doing my PhD at BYU, which is a great environment to work on biogeography, systematics and evolution mainly because of the great people with similar interests that seem to always be attracted to the place. However, I am also very fortunate that at my institution we have excellent infrastructure and internal resources for conducting the highest quality research in molecular evolution and ecology. In particular, students and faculty have access to a cutting edge DNA Sequencing Center (accessed here) as well as high-power supercomputers for running computationally burdensome algorithms and software during statistical analyses (think you of phylogenetic and population genetics software, where inference is increasingly based on explicit modeling), through BYU’s Fulton Supercomputing Lab (for more information visit their webpage at http://marylou.byu.edu/).
While the supercomputer is a great resource, I have found that life interacting with the supercomputer can be wonderful at times or seemingly-eternally frustrating at others (e.g. if things run slow, download/exchange capabilities are limited for some reason, the supercomputer is broken, or things don’t run at all)! And there have been plenty of “if I only knew then what I know now” moments as I have mastered the basics of putting these resources to work for me. Nevertheless, an integral part of successfully using supercomputing resources for evolutionary analyses is having the most flexible means of communicating securely and rapidly with the available computing resources. In this regard, among your greatest friends in data management and analysis become platform-format SSH clients (e.g. software programs that use secure shell protocols) and command-line knowledge. That’s what I want to discuss today, especially what I’ve found playing around with different SSH platforms and command line tricks. The supercomputer we use is Linux based and so I will also get into Linux-specific issues that I have learned a little about.
SFTP (Secure File Transfer Protocol) software programs are necessary to conduct file transfers between computers, networks, etc., by sending data through secure information tunnels known as secure shells (SSH). The (terminal emulator or terminal, or other) software for this purpose and for related functions will contain files, menus or buttons (or perhaps even names) with the word “protocol” in them. In other words, as you learn about SSH, this word will keep popping up. Uninitiated users will probably note this as strange. However, “protocol” in this sense simply refers to a set of message formats and rules for exchanging digital message information. SFTP software programs have a range of capabilities, but all the main ones are true OpenSSH software or are open source multi-clients that specialize in cryptographic communication sessions between devices/people.
Although many SSH client software exist, a common and excellent choice for Windows machines is WinSCP. I have now made the switch from Windows to Mac-almost-full-time. And during this change I came across the problem of needing to (a) find capability to run WinSCP on my Mac (which didn’t really interest me) or (b) find the best SSH client software available for mac users. In making this change, I first went online to mac forums and discovered the program Fugu. I quickly downloaded it and it ran fine, but was a nightmare on two important counts: Fugu does not have the capability of transferring directories! and Fugu also can be slow at best. After several days using Fugu, I broke down and realized I needed a new software. I went back online and searched for other software, and I finally found what I had been wanting: Cyberduck. This program is hands-down the best SSH client software for mac that I have found and I recommend that you use it also (although Filezilla is a close second).
Later, I will discuss other SFTP software to avoid, and the many features that Cyberduck has that will help you interact with your supercomputer with ease. Also I will review the command line interface and how to use other SSH client software to run interact with supercomputers to run programs during analyses.
BEAST is a powerful computer program for evolutionary analysis developed by Andrew Rambaut and Alexei Drummond, as well as Marc Suchard, Simon Ho, Joseph Heled, and others. BEAST was originally designed to provide a Bayesian Markov chain Monte Carlo (MCMC) method for simultaneously co-estimating phylogenies and divergence times (times to the most recent common ancestor, TMRCAs) from DNA sequence data (or amino acids) for one or multiple loci. However, it has now been extended to trait data and provides means of testing for correlated trait evolution. It currently implements a wide variety of evolutionary models. For those interested in studying phylogenetics, phylogeography and demography in a Bayesian framework, BEAST is a choice option for analysis, being among the premier tools in each of these fields.
Before reading the rest of this post, make sure you have downloaded BEAST. You can get the latest version at the BEAST website, which can be accessed by clicking here [Updated May 2016: or you can download BEAST2 here]. I’d also point out that BEAST developers have provided a very useful set of tutorials as well as a frequently-visited google group for discussing and dealing with issues related to BEAST analysis. Like other programs common to molecular ecology and evolutionary genetic analysis, e.g. migrate, first learning to use BEAST can be challenging. Fortunately, the program has grown quite a bit since earlier versions, with more flexibility (options) and a more user-friendly interface in BEAUTi for generating input files (less direct XML file editing).
Molecular clocks play a key role in all of the models implemented in BEAST. Despite the good documentation and support mentioned above, however, a frequently asked question regarding setting up BEAST runs is, “I have an idea of what rate I would like to use in my analysis, but how do I input it into the program?” In response to such questions, the purpose of this post is to practically introduce readers to the way strict molecular clocks are used in BEAST. Although BEAST implements three different molecular clock models (strict, relaxed, and random), I focus on strict clock analyses where users will want to set mutation rate priors. Assuming a strict or “global” molecular clock will often be a poor assumption for interspecific molecular data, because rates vary among lineages; however, in programs such as BEAST that are based on the coalescent (which is usually assumed to tick at a strict pace), when users analyze large intraspecific datasets, relaxed clocks are almost always inappropriate. Thus for intraspecific phylogeography of a single mtDNA locus, for example, a strict clock will usually be preferred.
Clock setting in BEAST
To set up a strict clock run in BEAST, the user must specify a strict molecular clock in their input (.xml) file. This is accomplished when creating the input file in the utility program BEAUti, which is part of the BEAST installation. After importing your NEXUS-formatted DNA seq data into BEAUti, select the desired taxon sets and site model(s) for the analysis. Next, click on the “Clocks” tab (this applies to BEAUti 1.7.4; older versions of this software may have a tab named “Clock Models”). The strict clock model is the default setting. Importantly, clock rates if constant can be spread across a range of values (e.g. uniform prior); however, a constant mutation rate is not necessary, and mutation can be modeled using several distributions. For simplicity, let’s set the clock first; we will discuss these distributions a little later.
In BEAST, we input the mutation rate using the ‘clock.rate’ parameter, which is in units of substitutions/site/unit time (default). The default value for this parameter is 1.0. If you leave the clock.rate parameter set to 1.0 and do not check “estimate,” then you are going to get a resulting phylogeny or demographic model with units of substitutions/site, the same as if you were just reconstructing a phylogeny using standard maximum likelihood or Bayesian inference analysis methods. However, when analyzing multilocus datasets using strict molecular clocks on each locus in *BEAST, you can leave the “estimate” box unchecked for one locus (e.g. imagine the first locus you sampled/entered was a chunk of mtDNA sequence) leave the clock.rate parameter set to 1.0, while checking “estimate” for the others. In this case, BEAST will estimate rates of every other locus relative to the one whose rate was set to 1.0. For the present tutorial, leave the strict clock model selected.
Now that you have specified the strict clock model, you will need to place a prior on the clock.rate parameter. You now need to set up code in the input file that will supply the program with your rate or rates. How you do this will depend on what your rate/data is and your desired outcome. Here are three potential situations you might be in:
(1) “I have a pairwise rate. How do I go from pairwise to per-lineage rates and input them into BEAST?”
Usually, when we discuss evolutionary rates, or mutation rates, it is understood that we are talking about ‘per-lineage’ rates. This is not surprising given evolution occurs along individual branches/lineages in the tree of life. Thus, BEAST works with per-lineage rates. However, pairwise rates of evolution are commonly reported and thus of interest as candidate priors. When we talk about divergence as a percentage, x% pairwise divergence between two lineages is equivalent to x%/2 divergence per lineage (along one branch of a two-tip phylogenetic tree).
Imagine you are interested in the standard 2% rate for vertebrate mtDNA evolution. This is a pairwise rate (2%/million years, Myr). If you wanted to set the program (i.e. fix it) to exactly this rate (taking it as a mean rate across the phylogeny of your samples)… then the rate must be converted to a per-lineage rate in units of substitutions/site before being input. To obtain a per-lineage rate from a pairwise rate, you simply divide the pairwise rate by 2. The above rate becomes 1% when expressed as a percentage, and we divide this value by 100 to get 0.01 expressed in substitutions/site/Myr. Once you have this value, you need to simply leave the “estimate” box unchecked, check the “Fix mean substitution rate” box, and enter your rate to set the mean clock.rate in your xml file. Here, we are assuming that you want your units (hence the scaling of your output) to be in millions of years (Myr).
But there are other options… and we can imagine other cases.
For example, you may want the x-axis of your BEAST output files (e.g. for your time tree or Bayesian skyline plot) to be in units of years. To accomplish this, you need to format the desired rate value appropriately. In the case of a pairwise rate such as the 2%/Myr rate in the example above, you would divide 0.01 by 10^6 (one million), leaving you with 1.0 x 10^-8 or 0.00000001. This is the value you would input into BEAUTi to achieve the result desired here. This assumes generation time = 1.0 yr (thus you could think of it as the output actually being in generations).
Note: if you had a fossil calibration point dated in Ma and you checked “estimate” for your rate under a model specifying a strict clock and input your fossil calibration, the resulting output would also be in Myr. So, if your fossil is 30 Ma +/- 2 Ma for a given node, you would input that into the program as a point estimate of 30.0 (fixed; also you could use a variety of calibration transformations/distributions; see below). And you would get output in units of Ma. BEAST would also estimate the substitution rate in subs/site/Myr for you.
(2) “I have a per-lineage rate(s).”
If you have a per-lineage rate already… put it in the program as discussed above.
(3) “I don’t have rates… (Or, what is a molecular clock anyway?)”
If this is your situation, then wow, maybe I should have started this post by explaining this!! If rates are not available (i.e. published) for your taxa or nodes of interest, then you may want to roughly estimate a rate to use in your analysis. For this, you would need a justifiable calibration point (e.g. fossil data, or information on a biogeographical or paleogeographical event, etc.) and an in-hand estimate of d, pairwise genetic divergence (i.e. DNA sequence divergence). Care must be taken in doing this and in setting the priors on the rate distribution (e.g. your outcome will depend on your choice and quality of d calculation and calibration point selection), but if you need to do this then you must first understand the basic molecular clock model.
To understand the molecular clock, let’s imagine a two-taxon phylogenetic tree representing the relationship between two sister lineages. You have sequence data from these lineages and you calculate the pairwise divergence, dxy (= mean # differences, i.e. mean # nucleotide substitutions/site), between the two sequences. This value is a product of mutation rate (mu) in substitutions/site/Myr and the amount of time since two unique lineages diverged (originated). Given this information, the simplest calculation of the time (T) of divergence in millions of years ago (Ma) for a given lineage is calculated by taking dxy and dividing by twice the substitution rate (mu; the mutation rate). In other words T = dxy/2mu [this is commonly also written as T = (dxy/mu)/2, as in Nei (1987)]. Since you need to calculate a rate, you need prior information on the splitting time/event leading to the level of observed divergence. So you need to re-arrange this equation. To obtain a per-lineage rate (what we want!), you see that mu = dxy/2T [which is the same as (dxy/T)/2]. If you have an estimate of 4% divergence between two lineages and you know that this d is at least as old as a certain number, say 5 Ma (based on the geological date of a mountain chain uplifting rapidly or some other vicariate event), you could calculate a per-lineage rate as 4/(5*2) = 0.4%/My, assuming a global clock. This would be input into BEAST as 0.004, in units of substitutions/site/Myr. But, again, you could input this number into BEAST in other ways. As discussed in (1) above, if you want the results to be in years, take your 0.4%/Myr rate and divide it by a million, and this leaves you with 4 x 10^-9, or “0.000000004”, to enter into the program (assuming generation time = 1 yr).
Prior distributions on clock.rate
The most simple way of inputting an evolutionary rate is to leave clock.rate set to a point estimate; however, this is hardly realistic, and you will probably want to factor in some uncertainty in this estimate. When doing so, the most simple prior distribution that can be placed on the clock.rate parameter in BEAST is a uniform distribution. More coming soon…
OK, that’s all for now on this particular topic. Hope this helps. ~ J
Nei, M., 1987. Molecular Evolutionary Genetics. Columbia University Press, New York.