Latest Posts

Analyzing ddRADseq Data in SNAPP

Oh, SNAPP…

Over the last year, as I’ve become increasingly comfortable with GitHub, I’ve naturally shifted all of my note writing to Markdown. In the process, I’ve started frequently jotting down useful notes related to analysis, code, or scientific computing as “Gists” on GitHub. In contrast to regular repositories, Gists are small snippets of code or notes, and they are posted at gist.github.com rather than github.com. To date, I have published 8 Gists on topics as varied as Pandoc, ∂a∂i, SNAPP, and Markdown.

Recently, I published an updated public version of an earlier Gist containing a broad description of the steps for analyzing SNP (single nucleotide polymorphism) data from ddRADseq loci in the software program SNAPP (Bryant et al. 2012; implemented in BEAST v2.4+), in order to infer species trees with divergence times. The new Gist lists 8 steps for doing this, which take advantage of several available software programs as well as my own shell scripts. In the spirit of open science, my motivation for writing this note was not to provide yet another full-blown, in-depth tutorial (e.g. you can read this one), but rather to share the set of steps that I used to prepare and analyze SNP data in SNAPP for recent projects.

Here is a link to the Gist.

Here, I am sharing this Gist with you as embedded Markdown, thanks to the oEmbed Gist plugin:

Cheers, ~ J

 

Converting scientific manuscripts from Word to LaTeX PDF on Mac

LaTeX is a very powerful document preparation and typesetting system that provides an accessible means for the programming-minded user to prepare beautiful scientific documents of all kinds, including manuscripts, software documentation, theses and dissertations, and so on. Because documents typeset in LaTeX exhibit a high degree of professionalism in their design, while also rendering computer- and math-related characters extremely well, LaTeX has become widely used in many fields of science, and in academia in general. By using a markup language that is similar to HTML and is non-WYSIWYG (What You See Is What You Get), LaTeX provides the user with tremendous control and flexibility in document design during the writing process. A commonly touted benefit is to “…leave document design to document designers, and to let authors get on with writing documents.”

I got interested in LaTeX as a way of creating more professional documents for preprint and final manuscript submissions to servers/journals. This post describes some of my recent experiences while teaching myself LaTeX over the weekend.

Let me start by saying that my intuition at the outset was that I was *not* ready to switch from writing manuscripts in Word to writing a document from scratch (i.e. generating the content of the document itself) within a LaTeX editor like TeXShop or Texmaker. Writing from scratch in a LaTeX editor simply does not hold the same appeal for me as doing so in a WYSIWYG processor like Word, and I just wasn’t ready to make that big of a jump. So, my idea was to start from a scientific manuscript prepared in Word and convert it into LaTeX, with the goal of learning to improve the design of my final manuscript in LaTeX, i.e. learning to present my previously generated content in LaTeX with appropriate formatting before final rendering to PDF. (“But perhaps I’ll like it and write only in LaTeX in the future,” I thought.)

Dependencies

Software this post is based on: Pandoc v1.19.2.1, TeXShop v3.85, TextWrangler v4.5.12

STEP #1: Word to LaTeX

Since I already had Pandoc installed locally, I started from a .docx file containing the final submission draft of one of my recent papers. I opened Terminal (my preferred command line interface on Mac) and there I used Pandoc to convert this file from .docx to LaTeX (.tex) format using a basic/default command, with no special options, as follows:

pandoc -s manuscript.docx -o manuscript.tex

After doing this, I opened the resulting .tex document in TextWrangler and resaved it with UTF-8 encoding for UNIX; here, I changed the filename from “manuscript.tex” to “manuscript1.tex” and kept both versions in the working directory.
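If you prefer to do that re-save step at the command line instead of in TextWrangler, a rough equivalent is the sketch below; it assumes the only change needed is normalizing line endings, since Pandoc already writes UTF-8:

## Normalize CR/CRLF line endings to UNIX (LF), saving a copy as manuscript1.tex
perl -pe 's/\r\n?/\n/g' manuscript.tex > manuscript1.tex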

STEPS #2–n: Editing and Typesetting in TeXShop

Next, I opened the .tex doc in TeXShop. I tried to typeset to PDF in TeXShop, but of course things failed “out of the box” several times. The keyboard shortcut for typesetting your .tex doc in TeXShop is Command-T; I pressed this several times during my initial attempts. Typesetting failed once due to an erroneous (apparently non-printing) character that Pandoc somehow added at one point in the manuscript, and a second time due to a UTF-8 command passed to a package called ‘inputenc’. I fixed the latter error by simply commenting out the problematic line from the .tex file. The line I commented out was within the first 10 lines of the document and looked like this:

  \usepackage[utf8]{inputenc}

After these changes, I was able to render a first, awkward PDF file, and I immediately realized that a number of issues cropped up when going with the defaults: (1) wrong font size; (2) numerous wrongly displayed symbol characters; (3) missing line numbers; (4) missing page breaks; (5) improperly formatted tables and table/figure captions; and on and on… It became clear that many additional edits would need to be made, in a thoughtful manner, to make the file even approximate my desired end result.

To make a long story short (I couldn’t cover all the changes I made, or everything I learned about LaTeX during this exercise, in a single post): I experimented with adding several packages to the preamble of my .tex file, read documentation on code for basic document formatting and for correctly calling mathematical characters and symbols, proofread the text and fixed errors, remade the tables, fixed page size and rotation, and gave my bibliography section an appropriate hanging-indent format.

Preamble-Title Page Template

Through many additional edits to the document, I came up with the following preamble-title page template, which I will probably continue to modify, but which you should feel free to use as a starting point when making a .tex file for any double-spaced, 12 pt font manuscript:

\documentclass[12pt]{article}
\usepackage{graphicx}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage[letterpaper, portrait, margin=1in]{geometry}
\usepackage{setspace}
\doublespacing
\usepackage{lineno}
%%%\usepackage[utf8]{inputenc}
\def\linenumberfont{\normalfont\small\sffamily}
\linenumbers
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[T1]{fontenc}
\else % if luatex or xelatex
  \ifxetex
    \usepackage{mathspec}
  \else
    \usepackage{fontspec}
  \fi
  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\PassOptionsToPackage{hyphens}{url} % url is loaded by hyperref
\usepackage[unicode=true]{hyperref}
\hypersetup{
            pdfborder={0 0 0},
            breaklinks=true}
\urlstyle{same}  % don't use monospace font for urls
\usepackage{longtable,booktabs}
% Fix footnotes in tables (requires footnote package)
\IfFileExists{footnote.sty}{\usepackage{footnote}\makesavenoteenv{longtable}}{}
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{6pt}
\setlength{\parskip}{10pt plus 2pt minus 1pt}
}
\setlength{\emergencystretch}{3em}  % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{0}
% Redefines (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi

% set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother

\usepackage[document]{ragged2e}
\usepackage{indentfirst}
%%%\usepackage{lscape}
\usepackage{pdflscape}

\date{}

\begin{document}

\begin{center}
Original article formatted for \emph{JOURNAL\_NAME}

\vspace{5mm} %5mm vertical space

\large\textbf{MANUSCRIPT\_TITLE}
\normalsize

\textsc{FIRST\_AUTHOR\_NAME\textsuperscript{1,2,*}} and \textsc{SECOND\_AUTHOR\_NAME\textsuperscript{1} }

\textsuperscript{1}\emph{FIRST\_INSTITUTIONAL\_ADDRESS}

\textsuperscript{2}\emph{SECOND\_INSTITUTIONAL\_ADDRESS}

\textsuperscript{*}\emph{Corresponding author.} E-mail:
\href{mailto:FIRST\_AUTHOR\_EMAIL}{\nolinkurl{FIRST\_AUTHOR\_EMAIL}}.

\emph{Running Head:} RUNNING\_HEAD (NUM\_CHAR\_RUNNING\_HEAD
characters w/spaces)

\emph{Word Count:} PAPER\_WORD\_COUNT words (Abstract, main body, plus table and
figure captions); Abstract only, ABSTRACT\_WORD\_COUNT words.
\end{center}

\vspace{5mm} %5mm vertical space
\vspace{5mm} %5mm vertical space

\end{document}

This code contains the preamble and title page for the manuscript (“\end{document}” is added only to round out the document structure, so that this code can be copied and pasted into TeXShop and rendered). Aside from double spacing and an appropriate font size, this setup also causes pages to be numbered and gives the manuscript continuous line numbering throughout, consistent with the required or preferred format of many journals. To use it, all you need to do is copy and paste it in place of the top/preamble section of your manuscript .tex document, then change the uppercase placeholder labels (e.g. “FIRST\_AUTHOR\_EMAIL”) to the relevant content for your paper. Here, backslashes are only present in the labels to escape the underscore characters. You can see the “inputenc” line that I commented out at line 10 of the template. Right after the title page, you could add a page break (“\newpage”) and then enter your Abstract, and so on, as sketched below.
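For example, here is a minimal sketch of how the document body might continue right after the title page; the ABSTRACT\_TEXT and KEYWORD\_LIST placeholders are mine, following the same labeling convention as above:

\newpage

\section*{Abstract}

ABSTRACT\_TEXT

\vspace{5mm} %5mm vertical space

\emph{Keywords:} KEYWORD\_LIST

\newpage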

Here is a screenshot of the rendered Preamble-Title Page Template:

Example Table Format Using “longtable”

I discovered that, when you convert a Word document to LaTeX using Pandoc, the default table format it emits uses the “longtable” environment. There are other ways of making tables in LaTeX, e.g. the “tabular” or “tabularx” environments, that seem better suited for some things, and are perhaps more popular. However, longtable is probably used as the default because, when tables span multiple pages (i.e. are vertically long), it reprints the column headings on each page, which can be nice. You can find other examples online, but here is an example showing a partial table from my paper, using longtable:

\singlespacing
\footnotesize%%%%%%%%%%%  smaller font size %%%%%%%%
\begin{longtable}[]{@{}llll@{}}
\toprule
\textbf{Subset} & \textbf{No. subset partitions} & \textbf{Total length
(bp)} & \textbf{Best-fit model}\tabularnewline
\hline
\textbf{`\emph{balfouriana}' group}\tabularnewline
1 & 16 & 7579 & HKY+$\Gamma$\tabularnewline
2 & 26 & 11791 & HKY+$\Gamma$\tabularnewline
3 & 4 & 1724 & GTR+$\Gamma$+\emph{I}\tabularnewline
4 & 21 & 9077 & TrN+$\Gamma$\tabularnewline
5 & 9 & 3263 & K80+$\Gamma$\tabularnewline
6 & 7 & 3395 & HKY+$\Gamma$\tabularnewline
7 & 23 & 10129 & HKY+$\Gamma$\tabularnewline
8 & 13 & 6810 & HKY+$\Gamma$\tabularnewline
9 & 2 & 499 & SYM+$\Gamma$+\emph{I}\tabularnewline
10 & 6 & 2237 & SYM+$\Gamma$\tabularnewline
11 & 8 & 3015 & HKY+$\Gamma$+\emph{I}\tabularnewline
12 & 2 & 914 & TrN+$\Gamma$\tabularnewline
13 & 10 & 3688 & K80+$\Gamma$+\emph{I}\tabularnewline
14 & 13 & 4872 & K80+$\Gamma$\tabularnewline
15 & 5 & 1995 & HKY+$\Gamma$\tabularnewline
16 & 1 & 443 & K80+$\Gamma$+\emph{I}\tabularnewline
Total: & 166 & 71,431 & --\tabularnewline
\bottomrule
\end{longtable}
\normalsize
\doublespacing
bp, nucleotide base pairs.
%%%\end{landscape}

Here is a screenshot of the rendered table:

That’s all for now! More later when I have time!

 

~J

Running AMOVA (analysis of molecular variance) on biallelic SNP data in R

AMOVA in R

Resources for conducting population genetic and phylogenetic analyses in the R computing environment are continually improving, and to date several packages have provided functions for estimating Phi-statistics and hierarchical patterns of population variance partitioning using AMOVA (analysis of molecular variance; Excoffier et al. 1992). The paper describing the AMOVA method, penned by population geneticists Laurent Excoffier, Peter Smouse, and Joe Quattro, now has >12,000 citations, making it one of the most used and cited methods of all time in evolutionary genetic analysis; many people continue using this method, including me. What is interesting about recent packages is that they allow us to do AMOVA rather easily on SNP data, and today I’m going to show you some code for conducting AMOVA in R on biallelic SNPs.

Photo Credit: P. Aquino

Hypostomus sp. 2

I’ll be using an example of SNP sites from ddRAD-seq analyses of Hypostomus suckermouth armored catfishes (sp. 2, pictured at left) that are hierarchically and spatially distributed as individuals, within sites (populations), within three drainage basins in central Brazil. We are going to start from SNPs, get the SNPs into R using the packages adegenet (Jombart 2008; Jombart and Ahmed 2011; Jombart and Collins 2015) and poppr (Kamvar et al. 2014, 2015), do some data massaging/prep work, and then conduct the final AMOVA analyses using the same packages. The analyses will experiment with the effects on AMOVA results of varying the missing data threshold in poppr’s AMOVA function (i.e. the “cutoff” option of the ‘poppr.amova’ function).

Practical

SETUP: Load R packages

First, let’s open R (I’m using R version 3.3.2 (2016-10-31), “Sincere Pumpkin Patch”) and set up the workspace by loading adegenet and poppr; let’s also load the ape library just in case we need it. If you don’t have R, then download the latest version for Windows here, or the latest version for Mac OS X here. If you have R but don’t have the packages, then install them from within R by typing install.packages("name"), where name is the name of the package.

library(adegenet)
library(poppr) ## Ignore 'OMP parallel support' info...
library(ape)

SETUP: Read in the data

Next, we’ll read in the data and process it (see the next code section below). We will start by reading my SNP data into R from a file with a “.str” extension in the current working directory, in which the data are stored in STRUCTURE format. This data file contains a moderate number of SNPs (representing 1 SNP per RAD tag) from central Brazilian Hypostomus suckermouth armored catfish populations, and it was originally output by pyRAD following data assembly and SNP calling. In my case, the data were produced by NGS sequencing of a ddRAD-seq genomic library that included outgroup samples, in order to facilitate downstream phylogenomic analyses. However, we need to make sure that we only do AMOVA on the ingroup data, because we are interested in patterns of hierarchical genetic structure in the focal species, not in the focal species + outgroup; including outgroup individuals would in fact bias the results. So, we must start by reading in a data file that contains only ingroup Hypostomus individuals, and *no outgroup individuals* (hence the “noout” code in the file name).

I will do this below using the following four steps:

  • Step #1: read in the data as a genind object
  • Step #2: convert the data from a genind object to a genclone object using the ‘as.genclone’ function of poppr (note: as.genclone is available as of poppr v2.2.1; check the documentation for your installed version)
  • Step #3: read in a hierarchy file
  • Step #4: recreate the genclone object after specifying the hierarchy to the genind object as the strata.

Here is the code…

Step #1

###### READ STRUCTURE DATA INTO R AS GENIND OBJECT
Hyp.genind.noout = import2genind('hypostomus_noout.str', onerowperind=F, n.ind=40, n.loc=226, row.marknames=NULL, col.lab=1, ask=F)

Step #2

## CONVERT GENIND OBJECT TO A GENCLONE OBJECT
Hyp.genclone.noout = as.genclone(Hyp.genind.noout)

Step #3

The name of my hierarchy file is “Hyp_3hier_ind-pop-drain_assignments.txt”, and the first lines of this file look as follows (note that the header row has one fewer field than the data rows, so read.table will treat the first column as row names, per the standard R convention):

ind pop drain
1 1 6 3
2 2 1 1
3 3 7 1
4 4 7 1
5 5 1 1
6 6 9 3
7 7 10 1
8 8 10 1
9 9 1 1
10 10 10 1

Here is the code for reading in the hierarchy file:

## READ IN HIERARCHY FILE
##--Input a hierarchy file that specifies individual number, population number, and factor/group
##--number. The name of my hierarchy file is "Hyp_3hier_ind-pop-drain_assignments.txt"; I read it in using
##--the 'read.table' function. 
Hyp.hier.noout = read.table('Hyp_3hier_ind-pop-drain_assignments.txt', header=T)

Step #4

Now we can add the hierarchy information in one of two ways:

## Example #1:
Hyp.gc.strat.noout = as.genclone(Hyp.genind.noout, strata=Hyp.hier.noout)

## Example #2
strata(Hyp.genind.noout) = Hyp.hier.noout[-1]
pop(Hyp.genind.noout) = c(6, 1, 7, 7, 1, 9, 10, 10, 1, 10, 10, 10, 11, 11, 11, 11, 1, 1, 1, 12, 1, 8, 9, 9, 14, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
Hyp.gc.strat.noout = as.genclone(Hyp.genind.noout)

The purpose of showing these steps in this order is to demonstrate that we can easily convert from genind to genclone; however, for AMOVA it is better (in fact, necessary) to do this while adding information about the hierarchical relationships (population hierarchy) of the dataset. In the last commands, under Step #4 “Example #2”, the individual information is already present in the genind object; we just needed to add the population and hierarchy information.
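Before running AMOVA, it can be worth confirming that the strata were registered as intended. Here is a minimal sketch (the column names assume the hierarchy file above):

## Sanity check: inspect the strata stored in the genclone object
head(strata(Hyp.gc.strat.noout))           ## should list the hierarchy columns (pop and drain, plus ind if included)
table(strata(Hyp.gc.strat.noout, ~drain))  ## counts of individuals per drainage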

ANALYSIS: AMOVAs

OK, that’s it for processing! Now we’re ready to conduct AMOVAs and get some results for our Hypostomus populations. As noted above, we will run separate AMOVA analyses using different cutoff values for missing data. This “cutoff” value sets the level of missing data above which loci are removed prior to the calculations. It’s somewhat counterintuitive, but a low value specifies a strict missing data threshold and thus identifies and removes more sites with missing data, potentially leaving many fewer sites for analysis. By comparison, a high cutoff value leaves more sites for analysis, depending on the distribution of missing data in the data frame. In my case, as you’ll see below, a 95% cutoff value removed 0 loci. But let’s see if changing the cutoff value from 30% (analyzing fewer sites) to 95% (analyzing more sites) changes the outcome for the Hypostomus headwater fish system.

Here is the code for doing the AMOVAs and significance tests (to get p-values):

## AMOVA TESTS (not including outgroup individuals)

## Do Hyp AMOVA with 0.3 (30%) missing data cutoff (leaves fewer sites):
Hyp.drain.amova.res = poppr.amova(Hyp.gc.strat.noout, ~drain/pop, cutoff=0.3)
Hyp.drain.amova.res
Hyp.drain.amova.test = randtest(Hyp.drain.amova.res)
str(Hyp.drain.amova.test)
plot(Hyp.drain.amova.test)
Hyp.drain.amova.test

## Do Hyp AMOVA with all the data, by using an 0.95 (95%) missing data cutoff (leaves more sites):
Hyp.drain.amova.res.95 = poppr.amova(Hyp.gc.strat.noout, ~drain/pop, cutoff=0.95)
Hyp.drain.amova.res.95
Hyp.drain.amova.test.95 = randtest(Hyp.drain.amova.res.95)
str(Hyp.drain.amova.test.95)
plot(Hyp.drain.amova.test.95)
Hyp.drain.amova.test.95

When I ran the above code and had R print the basic AMOVA results to screen for both models (by typing “Hyp.drain.amova.res” or “Hyp.drain.amova.res.95” and then pressing Enter), the basic pattern of change in the number of sites was as follows: 338 of the 3,829 total SNPs in the dataset were removed for having 30% missing data (or more, though this is very unlikely; see below) during the 30% cutoff analysis, and 0 SNPs were removed due to missing data in the 95% cutoff analysis (results not shown).
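If you want to see which loci would be dropped at a given cutoff before running the full analysis, poppr’s missingno() function applies the same per-locus missing data filter that poppr.amova() uses internally. A small sketch:

## Sketch: count loci that would be dropped at the 30% cutoff
Hyp.filtered = missingno(Hyp.gc.strat.noout, type = "loci", cutoff = 0.3)
nLoc(Hyp.gc.strat.noout) - nLoc(Hyp.filtered)  ## number of loci removed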

RESULTS!

Now, let’s look at the results of variance partitioning, as well as the significance of the variation within and between groups, as they would be pictured in the R console (GUI) window:

30% cutoff AMOVA results

## 30% cutoff covariance and Phi-statistics results:
...
$componentsofcovariance
                                             Sigma          %
Variations  Between drain               5.10788445  27.509734
Variations  Between pop Within drain    3.42570578  18.449958
Variations  Between samples Within pop  9.93798617  53.523403
Variations  Within samples              0.09597662   0.516905
Total variations                       18.56755302 100.000000

$statphi
                        Phi
Phi-samples-total 0.9948309
Phi-samples-pop   0.9904348
Phi-pop-drain     0.2545164
Phi-drain-total   0.2750973

## 30% cutoff significance test results:
                        Test        Obs    Std.Obs   Alter Pvalue
1  Variations within samples 0.09597662 -15.327787    less   0.01
2 Variations between samples 9.93798617   6.716293 greater   0.01
3     Variations between pop 3.42570578   2.566409 greater   0.02
4   Variations between drain 5.10788445   4.266484    less   1.00

Results plot: Hypostomus drainage AMOVA with 30% missing data cutoff.

95% cutoff AMOVA results

## 95% cutoff covariance and Phi-statistics results:
...
$componentsofcovariance
                                             Sigma          %
Variations  Between drain               5.10788445  27.509734
Variations  Between pop Within drain    3.42570578  18.449958
Variations  Between samples Within pop  9.93798617  53.523403
Variations  Within samples              0.09597662   0.516905
Total variations                       18.56755302 100.000000

$statphi
                        Phi
Phi-samples-total 0.9948309
Phi-samples-pop   0.9904348
Phi-pop-drain     0.2545164
Phi-drain-total   0.2750973

## 95% cutoff significance test results:
                        Test        Obs    Std.Obs   Alter Pvalue
1  Variations within samples 0.09597662 -16.617535    less   0.01
2 Variations between samples 9.93798617   7.826314 greater   0.01
3     Variations between pop 3.42570578   3.095911 greater   0.01
4   Variations between drain 5.10788445   3.725359    less   1.00

Results plot: Hypostomus drainage AMOVA with 95% missing data cutoff.

 

Conclusions

Overall, these results showcase two main findings.

First, we can reject the null hypothesis of panmixia in favor of significant genetic structuring within and between individuals, as well as at the level of populations within drainages (p < 0.05), with ~99.5% of the variance in the data being present at these three levels. However, there is support for random gene flow between populations in different drainages, and in large part this is because the genetic populations (e.g. admixture groups from fastSTRUCTURE) or “clades” (inferred from phylogenomic analyses using maximum likelihood) in the sample occur in multiple drainage basins.

Second, the data are sufficient in number and variability that AMOVAs on these data are robust to missing data levels of up to 30% (and above) per locus. This makes sense given that, in pyRAD, these data were stringently filtered so that each RAD tag would contain no more than 30% missing data.

I hope that you have enjoyed doing AMOVA with me! Also, I hope this helps you conduct AMOVA on your phylogeographic or population genomic dataset, in the R environment. Cheers,

~J

 

References

Excoffier L, Smouse PE, Quattro JM (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131(2), 479-491.

Jombart T (2008). adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403-1405.

Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics, 27(21), 3070-3071.

Jombart T, Collins C (2015) A tutorial for discriminant analysis of principal components (DAPC) using adegenet 2.0.0. Imp Coll London-MRC Cent Outbreak Anal Model, 43.

Kamvar ZN, Tabima JF, Grünwald NJ (2014) Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ, 2, e281. doi:10.7717/peerj.281

Kamvar ZN, Brooks JC, Grünwald NJ (2015) Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Frontiers in Genetics, 6, 208. doi:10.3389/fgene.2015.00208

Getting BEAST2 Path Sampler analyses running on Windows 10

I am a relatively big Mac (UNIX) and Linux fan, as many of you know. So, my standard approach to running BEAST or BEAST2 is to use my Mac for local setup and testing, and then employ shell scripts (effectively, wrappers) that I’ve developed to automate final run setup and queuing on a Linux supercomputing cluster. However, there are times, especially when running path sampling (PS) analyses (Lartillot and Philippe 2006; Baele et al. 2012; Baele 2016 lecture) to estimate log-marginal likelihoods of BEAST models, when I tend to run into issues on the supercomputer. These are often memory problems (e.g. step memory limits), input XML reading errors, and other issues, and oddly they can happen when attempting to run PS XML files that pass local run tests. At such times (though only with small numbers of models, to date), I turn to running PS analyses locally using the Path Sampler GUI version (image at upper left), and I use as many computers as I have at my disposal!

Today, I wanted to speed up parallel PS runs by adding a 2.2 GHz Lenovo Windows 10 laptop that became available, but which did not have BEAST or BEAGLE installed! So, I eased back into the waters of (1) Windows BEAST setup and, more generally, (2) setting up Windows for analyzing scientific data; these are two things that I had essentially not done since the early/mid part of my PhD, before I switched to Mac.

This post briefly explains how I got BEAST Path Sampler working with the BEAGLE API/scientific computing library on Windows 10.

Steps

The 7 steps I followed to set up BEAST2 and BEAGLE on Windows:

  1. Install/update Oracle Java 1.7 and 1.8 runtime environments.
  2. Install BEAGLE v2.1 using binary installer (.msi file, image/link at left) from the beagle-dev GitHub account.
  3. Install BEAST2 v2.4.7 for Windows by downloading and saving .zip file here.
  4. Install latest Java SE Development Kit (jdk8) for 64bit Windows.
  5. Set global and user Path environment variables to include libhmsbeagle and the path to an appropriate Java bin folder (recommended starting point: the JRE bin; see below).
  6. Open BEAUti v2.4.7 and manage packages (adding most available packages), then close and reopen BEAUti to check.
  7. Test the installs by conducting runs with BEAST and Path Sampler!

Details

Steps #1-4

Steps #1–4 were accomplished using standard methods, for example following installer prompts to designate a drive, agree to the license, etc. For a BEAST + BEAGLE install alone, step #4 might not be considered obligatory by others, and in any event we subsequently set the path to the latest JRE. However, there are indications from online forum discussions that using the JDK has helped some users overcome issues, and in my experience it is always better to install the JDKs as well, so that you can point to them if needed. Thus, I consider step #4 recommended, just to be on the safe side.
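As a quick sanity check after the Java installs, you can confirm which Java runtime is visible from the command line by opening a Command Prompt and running:

:: Print the version of the Java runtime currently found on the PATH
java -version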

Step #5

Windows 10 Control Panel

The only part of this procedure that really required any extra thinking and work at all was step #5. Here, all you have to remember is that much can be done on Windows through Control Panel! You simply

  1. Go to Advanced system settings by navigating to Control Panel > System and Security > System > Advanced system settings (in the left pane)
  2. Click on “Environment Variables” at the bottom of the pop-up window that is displayed.
  3. Set the User and System “Path” variables in the list so that libhmsbeagle (BEAGLE directory; where BEAGLE was installed) is part of the path.
    1. Do this by selecting Path, then adding “;C:\path\to\libhmsbeagle” to the end of the current/default Path (that is, separate the original path from the libhmsbeagle path you are adding using a semicolon).
    2. Do this under User variables and under System variables.

On the machine that I used, the BEAGLE path was

C:\Program Files (x86)\Common Files\libhmsbeagle\ 

and the Java bin path was

C:\Program Files (x86)\Java\jre1.8.0_144\bin 

(corresponding to the very latest JRE available; follow the link given under Step #1 above). If you follow my steps now and do a standard install of this software, then (check me on this) you should be able to use the same path settings as these.
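For reference, here is a rough Command Prompt sketch of inspecting and appending to the user Path; note that setx truncates stored values longer than 1024 characters, so the Control Panel dialog above is safer for long paths, and the two directories shown assume the default install locations given above:

:: Inspect the current PATH
echo %PATH%
:: Append the BEAGLE and Java directories to the user Path (sketch)
setx PATH "%PATH%;C:\Program Files (x86)\Common Files\libhmsbeagle;C:\Program Files (x86)\Java\jre1.8.0_144\bin"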

Steps #6 and #7

The final two steps are quite easy. For step #6, you can easily accomplish this from within the BEAUti v2.4.7 GUI by

  1. Opening the GUI,
  2. Clicking on File > Manage Packages,
  3. Choosing and installing packages of interest,
  4. Then closing and reopening (look back in the package manager) to check that everything was in fact installed or updated.

Step #7 also is quite easy, because it just involves playing with example files and making sure everything runs fine. However, you want to make sure to do two things here: check that the BEAST software and utilities open and run fine, and also make sure that BEAGLE is found and utilized by BEAST and its packages.
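One quick way to check the BEAGLE part from a Command Prompt is BEAST’s -beagle_info option, which prints the BEAGLE resources it can detect; the jar path below is hypothetical, so adjust it to wherever you unzipped BEAST2:

:: List BEAGLE resources detected by BEAST (sketch)
java -jar "C:\path\to\BEAST\lib\beast.jar" -beagle_info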

Troubleshooting

If you get an error that says, “Failed to load BEAGLE library: no hmsbeagle-jni in java.library.path,” then you are potentially about to go down the rabbit hole to fix that issue!! Your best bet is to follow my general procedure above, update Java, and make sure you set a path variable for Java in addition to BEAGLE lib. If this doesn’t work, uninstall and reinstall BEAGLE, then reset the path. In the event of any issue with BEAST, I always recommend consulting recent discussions on the beast-users group discussion boards for additional information. In this case, it may help to see this relevant post* on the beast-users Google Group (*full disclosure: I have not tested the suggestions there). 

I hope that someone finds this useful! Cheers, and good luck with BEAST! Also, thanks to the developers for working hard to provide great BEAST software that works on Mac, Linux, and Windows!

~J

References

Baele G, Lemey P, Vansteelandt S (2013) Make the most of your samples: Bayes factor estimators for high-dimensional models of sequence evolution. BMC Evol. Biol. 14, 85.

Lartillot N, Philippe H (2006) Computing Bayes factors using thermodynamic integration. Syst. Biol. 55, 195-207.

Data mining resources for species distribution modeling and macroecology

Research projects in ecology, evolution, and conservation biology increasingly rely upon ecological modeling and GIS data and tools for manipulating and analyzing spatially linked data across a variety of spatial scales. Of particular importance are species distribution modeling (SDM; such models are also known as ecological niche models, ENMs, or environmental niche models) and niche divergence tools; geospatial data layers representing Earth environments from a variety of perspectives and scales; and GIS and mapping software.

This post provides a list of links to resources (mainly databases and websites) from which researchers can mine data for conducting SDM and macroecology analyses. I intend to curate this list and update it as I learn about or use new data or data mining resources. This is the first draft of the list, and so it will necessarily contain some gaps; however, I have tried to link to the most popular databases and software programs. Websites in the References section provide additional lists, but most of the software they list has been incorporated herein. Species distribution modeling software and GIS methods will not be reviewed here (see manuscripts and online lists of these elsewhere).

Along with my collaborators, I am currently using these data sources and software programs to conduct species distribution modeling analyses of North American conifers and Neotropical freshwater fishes.

Species Occurrence & Range Map Data

Climate Data Layers

NOTE: I don’t have time now, but I plan to add links to download pages for several global circulation models, as well as future climate models, here in the near future. ~J

Other GIS Data Layers & Maps

GIS Software

Note: Check back… more links coming soon! ~J

References

CMEC (2017) Software and GIS data. Center for Macroecology, Evolution and Climate, Natural History Museum of Denmark. Available at: http://macroecology.ku.dk/resources/software/.

WWF (2017) Conservation Science Data and Tools. World Wildlife Fund. Available at: https://www.worldwildlife.org/pages/conservation-science-data-and-tools.

Other Resources

MVZ (Museum of Vertebrate Zoology) GIS Portal, University of California, Berkeley – https://mvzgis.wordpress.com/

Setting DNA substitution models in BEAST

JC model of sequence evolution from TreeThinkers.

BEAST (Bayesian Evolutionary Analysis Sampling Trees) software provides a growing set of clear options for specifying evolutionary models of DNA substitution for alignments loaded into the program. However, the set of models that are readily available and “spelled out” in drop-down menus in the BEAUti (Bayesian Evolutionary Analysis Utility) GUI is much smaller than the standard set of 11 substitution schemes considered by other software programs for model selection (giving rise to a total of 88 models in jModelTest, or 56 models in PartitionFinder). This poses a problem that I have heard several BEAST users bring up in online discussion groups, and that several people have asked me about personally. In this post, I shed some light on this problem and clear up some of the confusion by giving details on how users can easily specify different substitution models for their BEAST runs.

Background: Substitution models in BEAST

I will assume that readers are familiar with DNA substitution models. Also, there are a number of details about how BEAST handles DNA substitution models, which you can learn from the BEAST2 book or the manual, but that I will not cover. For example, by default, BEAST normalizes the rate matrix so that the mean rate is 1.0, and so on. Moving on to the issues…

It seems that many people have issues with the fact that it’s not a simple “one-click” procedure to specify the diverse variety of substitution models that are frequently selected during model selection conducted prior to setting up BEAST runs. This is because BEAST1 only gives obvious options for three of the most popular models of DNA substitution: TrN, HKY, and GTR, while BEAST2 versions give users obvious options for four models: JC69, TrN, HKY, and GTR. Thus, other models such as TIM, K80, and F81 can seem cryptic to some users, who will often exclaim in dismay, “Where is the option for model X???” or “I don’t see how to specify model Y!!!” and then post a question about it (which may languish, never to be directly answered, who knows).

Solving the problem: Learning BEAST functionalities “under your nose”

The real solution to the above problems is to realize the options available to you, especially that BEAUti actually provides several important options in the Site Model panel that each user must learn to use to tailor the models of evolution for each alignment/partition. These options are located right “under your nose,” but I’ll show you where they are and how to use them in Tables 1 and 2 below:

Table 1. Details for setting substitution models in BEAST1 (e.g. v1.8.3).

Best-fit substitution model   (Base) Model to select in BEAUti   Additional specifications in BEAUti
JC69                          JC69                               None
TrN                           TN93                               None
TrNef                         TN93                               base Frequencies set to “All Equal”
K80 (K2P)                     HKY                                base Frequencies set to “All Equal”
F81                           GTR                                turn off operators for all rate parameters
HKY                           HKY                                None
SYM                           GTR                                base Frequencies set to “All Equal”
TIM                           GTR                                edit the XML file by hand so that all other rates are equal to the AG rate
TVM                           GTR                                uncheck the AG rate operator
TVMef                         GTR                                uncheck the AG rate operator and set base Frequencies to “All Equal”
GTR                           GTR                                None

Table 2. Details for setting substitution models in BEAST2 (e.g. 2.4.0+).

Best-fit substitution model   (Base) Model to select in BEAUti 2   Additional specifications in BEAUti 2
JC69                          JC69                                 None
TrN                           TN93                                 None
TrNef                         TN93                                 base Frequencies set to “All Equal”
K80 (K2P)                     HKY                                  base Frequencies set to “All Equal”
F81                           GTR                                  fix all rate parameters to 1.0 (uncheck the “estimate” box)
HKY                           HKY                                  None
SYM                           GTR                                  base Frequencies set to “All Equal”
TIM                           GTR                                  fix the CT and AG rate parameters to 1.0 (uncheck the “estimate” box)
TVM                           GTR                                  fix the AG rate parameter to 1.0 (uncheck the “estimate” box)
TVMef                         GTR                                  fix the AG rate parameter to 1.0 (uncheck the “estimate” box), and also set base Frequencies to “All Equal”
GTR                           GTR                                  None

 

 

I constructed these tables based on the different options available across BEAST versions, my knowledge of the substitution models, and the list of model code for different substitution models on the BEAST1 website (XML format information for BEAST1 / versions earlier than the present one). How can you use the above tables to set a desired site model? Easy. For a given model you might want to set (first column), the tables tell you which model to select in BEAUti (second column) and the modifications, if any, you will need to make to that model or the XML file itself to get the model you want (third column).

Another way…

If you would like more detail, you can also draw on other resources available online to do this properly. For example, if you wanted to set the F81 model for a partition of your dataset, you could also search the substitution model code page mentioned above for “F81”. If you did this you would find the following block for BEAST1:

<!-- *** SUBSTITUTION MODEL FOR PARTITION Gene2 *** -->
<!-- The F81 substitution model (Felsenstein, 1981) -->
<gtrModel id="F81_Gene2">
	<frequencies>
		<frequencyModel dataType="nucleotide">
			<frequencies>
				<parameter id="F81_Gene2.frequencies" value="0.25 0.25 0.25 0.25"/>
			</frequencies>
		</frequencyModel>
	</frequencies>
	<rateAC>
		<parameter id="F81_Gene2.ac" value="1.0"/>
	</rateAC>
	<rateAG>
		<parameter id="F81_Gene2.ag" value="1.0"/>
	</rateAG>
	<rateAT>
		<parameter id="F81_Gene2.at" value="1.0"/>
	</rateAT>
	<rateCG>
		<parameter id="F81_Gene2.cg" value="1.0"/>
	</rateCG>
	<rateGT>
		<parameter id="F81_Gene2.gt" value="1.0"/>
	</rateGT>
</gtrModel>

We can infer from this that the F81 model can be specified in any BEAST version by first specifying the GTR model, and then setting all rates to 1.0 and setting all base frequencies to be equal (“All Equal”, i.e. each equal to 0.25). You would specify this information in the Site Model panel in BEAUti. You would also need to make sure that priors are not included for these parameters, since they will be fixed; you do that by commenting out the rate priors within the XML file. Accordingly, you would also need to delete the rate parameters from the log, because you won’t need to log parameters that you aren’t estimating. A sketch of both edits follows.
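For illustration, here is a hypothetical sketch of those two XML edits for the F81-via-GTR setup above; the gammaPrior element and its shape/scale values are assumptions standing in for whatever prior block BEAUti actually generated for your rate parameters:

<!-- 1) In the priors block, comment out the prior on each fixed rate, e.g.: -->
<!--
<gammaPrior shape="0.05" scale="10.0" offset="0.0">
	<parameter idref="F81_Gene2.ac"/>
</gammaPrior>
-->
<!-- 2) In the log block, delete the corresponding entries, e.g.: -->
<!-- <parameter idref="F81_Gene2.ac"/> -->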

~J

Additional Reading

Substitution model basics from Wikipedia: substitution model, models of DNA evolution  

BEAST users Google Group threads:

BEAST2 website FAQ: link

PartitionFinder website: PartitionFinder

Automatically install or update all phylogenetics packages in R

In case you are unaware of this trick, it is possible to simultaneously install or update all of the phylogenetics packages available in R in one fell swoop.

A number of different CRAN Task Views have been created summarizing R packages and utilities for different fields of analysis or research. Luckily, not only is there a compilation of these located on the CRAN-R website here, but Brian O’Meara (@omearabrian) has also developed a nice CRAN Task View entitled “Phylogenetics, Especially Comparative Methods.” This means that you can view all of the software for phylogenetics and phylogenetic comparative methods in R at one time on the corresponding Phylogenetics task view website. Here is what these websites look like:

[Screenshots: the CRAN Task Views list and the Phylogenetics task view page.]

The most important thing, though, is that you can automatically install the entire Phylogenetics, Especially Comparative Methods view! Here’s how you do it. First, open R and install the ctv package (ctv stands for “CRAN Task Views”), then load the ctv library by typing:

install.packages('ctv')
library(ctv)

 

Next, you can install the Phylogenetics, Especially Comparative Methods view using the ‘install.views’ function, and later you can update this view using the ‘update.views’ function. If this is your first time using R for phylogenetics, you will not have any of the phylogenetics software, so you will do the following to install the view:

install.views('Phylogenetics')

 

If you have already been working with phylogenies in R and have some of the related packages installed, you will want to use update.views as follows:

update.views('Phylogenetics') 

 

This will check through all of the packages in the Phylogenetics, Especially Comparative Methods view, updating the packages you already have installed while also installing, for the first time, any packages that are missing. An important note is that you must (1) be connected to the internet to do this, and (2) realize that installing all of the packages could take a substantial amount of time. So, only do this when you have time!!
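One last aside: if you ever forget the exact name of a view, the ctv package can list all available CRAN Task Views from within R (a quick sketch):

available.views()  ## prints the full list of CRAN Task Views, including 'Phylogenetics'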

~J

How to kill an R process without forcing R to quit on mac

From time to time, we make mistakes in programming or testing a new R script or function, only to find that R “freezes” and appears to be stuck, or is working but gives the impression that it will take an eternity to complete the computation. What could be happening? The process may be based on a maximum-likelihood estimation of a parameter that requires convergence; you could have accidentally (e.g. by default) run a function that needs to visit the total number of models possible for your dataset, or a certain amount of parameter space; there might be an issue with FORTRAN code; or the function you’re using might require a maximum number of iterations to be specified, or else it will use an exhaustive search. R can “hang” for these and many other reasons.

Today, I’m doing a short post to show you how to get out of this situation by killing the R process from outside the R environment. Specifically, I’ll show you how to do this from the Mac Terminal, i.e. the command line.

First, do the following at the command line to obtain a list of processes, including R:

ps aux | grep R


Look through the resulting list of PIDs (process IDs, which are unique identifiers for each process) and find the info for the R process, which will likely be using a great deal of CPU (e.g. %CPU > 90%) and thus will be near the top of the list (which is sorted in decreasing order). Here is an example of the results of running this code in my Terminal against ongoing processes, including an R analysis (this is just an excerpt; the output was much longer):

Last login: Sun Apr 3 23:45:14 on ttys000
Justin-Bagleys-MacBook-Pro-3:~ justinbagley$ ps aux | grep R
USER           PID  %CPU %MEM      VSZ     RSS  TT  STAT STARTED   TIME      COMMAND
justinbagley   334  93.2 12.7  9618560 2122612  ??  R    Tue05PM   317:07.03 /Applications/R.app/Contents/MacOS/R -psn_0_57358
justinbagley   339   7.2  2.1  4949212  355924  ??  R    Tue05PM   47:05.88  /System/Library/CoreServices/Finder.app/Contents/MacOS/Finder
_windowserver  200   1.7  1.4 14600476  231856  ??  Ss   Tue05PM   113:19.80 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/CoreGraphics.framework/Resources/WindowServer -daemon
_mdnsresponder  94   0.2  0.0  2544976    3864  ??  Ss   Tue05PM   0:39.06   /usr/sbin/mDNSResponder
justinbagley   491   0.1  0.2  3713764   33524  ??  S    Tue05PM   11:19.43  /System/Library/CoreServices/Dock.app/Contents/Resources/DashboardClient.app/Contents/MacOS/DashboardClient
justinbagley 23663   0.0  0.0  3358044    6476  ??  S    Fri10AM   0:02.38   /Applications/Adobe Acrobat Reader DC.app/Contents/Frameworks/RdrCEF helper.app/Contents/MacOS/RdrCEF helper --type=renderer --disable-3d-apis --disable-databases --disable-direct-npapi-requests --disable-file-system --disable-notifications --disable-shared-workers --no-sandbox --lang=en-US --lang=en-US --log-severity=disable --product-version=ReaderServices/15.10.20056 Chrome/45.0.2454.85 --enable-delegated-renderer --num-raster-threads=4 --gpu-rasterization-msaa-sample-count=8 --content-image-texture-target=3553 --video-image-texture-target=3553 --channel=23661.1.1456228839


An alternative way of getting this information on Mac is to use the Activity Monitor app, but we’re not covering that here. OK, now armed with this information, you can use the kill command to stop the analysis. In my example above, the process ID for the R process is 334, and this process is consuming 93.2% of the CPU and 12.7% of the computer’s memory, so we can stop this analysis by entering the following command at the command line:

kill -s INT 334


Two great things about this are (1) that it is easily accomplished at the command line, and (2) that you don’t have to quit or alter R, or do a “force quit” (e.g. by pressing option + command + Esc); you only have to stop the offending process. This means that R will not be closed as a result of these actions, but will be “liberated” from the hung process. As a result, your log of commands and output will not be lost.
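For what it’s worth, the lookup and the interrupt can also be combined into a one-liner; this is just a sketch, and it assumes the hung process is named exactly “R” (as with R.app) and is the only such process running:

## Send an interrupt (SIGINT) to the first process named exactly "R"
kill -s INT "$(pgrep -x R | head -n 1)"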

Just as a tip: if you are worried about losing your command history at any time, you can usually save it from R, even if the R environment/GUI hangs on an analytical process, by first clicking the “Show/Hide R command history” button at the top of the R GUI, and then clicking “Save History.”

~J

Latest *BEAST trace: Most beautiful thing I’ve seen today

My new *BEAST (starbeast) species tree runs are converging wonderfully in BEAST v1.8.3, so I am off to a great start this morning. Here, I’m including a screenshot of the results of one of these runs, which I ran for 200 million generations using 3 normally distributed calibration points, as visualized in Tracer. Note the good posterior trace, very high ESS scores for all parameter estimates, and good use of burn-in (default 10% value works quite well in this case).

This is the most beautiful thing I’ve seen in my research today, or even this week!

[Screenshot: *BEAST posterior trace and ESS values, visualized in Tracer.]

I am now checking and processing the results from multiple such runs using different species schemes (trait mapping schemes/files). They seem like they are going to give me a nice estimate of the species tree for this dataset.

Currently, the second steps of these runs are still running, using path sampling/stepping-stone sampling to calculate marginal likelihood estimates (MLEs) for each model. When these are completed, I will calculate Bayes factors for each model and use them for Bayes factor species delimitation (BFD) sensu Grummer et al. (2014).

Is anyone else working on BFD analyses like this right now? Would love to hear from you.

~J

References

Grummer JA, Bryson RW Jr., Reeder TW. 2014. Species delimitation using Bayes factors: simulations and application to the Sceloporus scalaris species group (Squamata: Phrynosomatidae). Systematic Biology 63(2): 119-133.