AWclust: Manual
AWclust is easy to use non-parametric population structure analysis software written for R with a GUI interface. Just point and click and you will be on your way to discovering the important cluster information in your SNP data sets.
The software works well with both small and large SNP datasets, depending on the degree of differences between the subpopulations. For closely related populations (those with a relatively small degree of differences between them), AWclust is capable of quickly processing the large datasets required to separate them.
AWclust is non-parametric, while most of the publicly available population structure programs use parametric approaches, either Bayesian (e.g STRUCTURE and BAPS) or frequentist (e.g. L-POP)
One advantage that a non-parametric method has over parametric methods is that the results from a non-parametric approach do not depend on the validity of assumptions, like Hardy-Weinberg Equilibrium (HWE) and linkage equilibrium (LE). If these assumptions are not valid, then non-parametric methods have more power than their parametric counterparts. Even the latest version of STRUCTURE (version 2.1), which can incorporate linkage information, is still sensitive to high background linkage disequilibrium.
Another advantage that AWclust has over other parametric methods is that it does not require an estimate of the allele frequency. Allele frequency estimates suffer from large variation when the sample size is small, e.g. less than ten individuals, and this variation can hinder the analysis of population structure.
Simply put, AWclust is easy to use and does not require the researcher to justify assumptions about the data to produce useful results. Just point and click and you are on your way to scientific discovery.
For researchers wishing to use parametric approaches, AWclust is a useful first pass analysis for building confidence about assumptions that parametric methods require.
AWclust requires R (version 2.5 or higher) and the following R packages:
> library(package)where
package
is the name of a specific package,
not the word package.
If you do not already have R running on your computer system, or you are missing one or more of the required packages, you can download everything you need from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/
Installing and running AWclust
To install AWclust on Linux/Unix/Macintosh machines:
AWclust_X.Y.tar.gz
file (were X.Y
represents the latest version number), then enter
the command or commands (you need root account to install software): shell$ R CMD INSTALL AWclust_X.Y.tar.gz
To install AWclust on Windows machines:
AWclust_X.Y.zip
file (where X.Y
represents the latest version number), thenInstall pacakge(s) from local
zip files
from the Packages
menu.
To run AWclust type the following on the command line (Windows users may just need to double click on the R icon to get R running):
shell$ RTo get help:
> library(AWclust)
> AWclust()
> help(AWclust)or
> ? AWclustTo remove AWclust from your R distribution, quit R and enter the following command:
shell$ R CMD REMOVE AWclust
After you have installed AWclust, start R and launch the program (Windows users may just need to double click on the R icon to get R running):
shell$ RAt this point, if everything worked correctly, a new window should pop up that looks similar to the window in Figure 1 (the figures in this tutorial may be slightly different from what you see. However, the essential details should be the same.)
> library(AWclust)
> AWclust()
Figure 1. The main AWclust interface.
The next thing to do is to load in a SNP data file. For this tutorial,
click
on the Browse
button and select the file HapMap500.txt
.
This file should be in the AWclust directory,
however, if you can not find it, you can download
it. Also, if you would like to use
your own SNP data, just make sure that it is in a format
that AWclust supports.
The HapMap500 dataset contains data for 500 SNP
loci collected from 209 individuals.
Once you have selected
HapMap500.txt
, click on the Load
button, and
then
click on the Calculate!
button underneath it. Clicking on
the
Calculate!
button generates a matrix of pair-wise allele sharing distances (ASD) between all of the
individuals in your dataset. This ASD matrix is required
for all of the analysis methods that follow.
With the ASD matrix calculated, you can now start to analyze the population structure in your data.
The first thing you want to do is look for any potential outliers in
the dataset. In the case of population structure, an outlier would
create a sub-population with only one member, or sub-clusters with very
limited numbers of individuals.
One way to check for outliers
is with multidimensional scaling (MDS) plots.
AWclust offers two
types of MDS plots, 2 dimensional (2D) and 3 dimensional (3D).
Generating
the plots is as simple as clicking on the Plot
buttons.
The two dimensional plot (see Figure 2) essentially shows orthogonal
projections of the 3D plot (see Figure 3). The advantage of the 2D plot
is that, while the
IDs for the various individuals in the data may look all jumbled at
first, you
can export the plot as a PDF and zoom into it to make the individual
IDs
discernible. The advantage of the 3D plot is that the cube can be
rotated to
make the clustering easier to see. Try moving the sliders in the main
AWclust window within the MDS 3D plot
frame to rotate the
cube
so that you can get a good feel for how the data should cluster.
Figure 2. 2D MDS plot for the HapMap500 dataset.
Figure 3. 3D MDS plot for the HapMap500 dataset.
As can be seen Figures 2 and 3, MDS plots can also give you a general idea of how the data should cluster. For the HapMap500 dataset, it appears that there should be 3 main sub-populations.
The MDS plots do not indicate that there are any major outliers in
the HapMap500 dataset, so we can, without modifying the data, go ahead
and plot
a dendrogram, or tree, that shows how the individuals cluster into
subpopulations (see Figure 4). Do this clicking on the Plot
button in in the Hierarchical plot
frame. This
frame also allows you to draw a red-line across the tree to indicate
what level of sub-division is best suited for your research by clicking
on the +
or -
buttons.
NOTE: At any time, if you see a plot that they would like
to save,
you need only click on the appropriate Save
button to
export
it as a PDF file.
Figure 4. A dendrogram plot for the HapMap500 dataset.
So far, both the MDS and the hierarchical plots have shown that the HapMap500 dataset subdivides into three main clusters. However, this inference has come only from eye-balling the figures. With the gap statistics frame, AWclust will let us objectively determine the optimal number of clusters.
NOTE: Computing the gap statistic can take time, so, unlike drawing the other plots, the results are not likely to appear on your screen instantaneously. Be patient (and make sure your patience is proportional to the size of your dataset, the number assigned to K, and the number of iterations from the null simulations).
Based on our visual inspection of the MDS and hierarchical plots, we are fairly sure that the optimal number of clusters is going to be around 3. If we thought that there might be more clusters (up to 8), we could move the slider on the left side of the frame to increase the range of values K to to determine the optimal number of clusters. However, larger ranges for K take longer to compute.
The slider on the right side of the Gap statistics
allows you
to change the number of Null simulations
. The default
value
is usually pretty good, but you can increase the value if you need
to
reduce the variation in your estimate for the optimal number of
sub-populations. Increasing the number of null simulations
increases the amount of time required to compute the gap statistic.
For example, on a PC with an Intel Core2 2.5 GHz CPU with 2 GB of RAM,
running 30 simulations (K=1 to 6) with this example dataset takes about
30 seconds.
Increasing the number of simulations to 60 requires takes a little longer
than a minute to compute.
After clicking on Calculate & Plot
you should see
something like
what we have in Figure 5. In the graph on the left the red line
represents
the log of the pooled within-cluster sum of squares (Wk) from the
observed
data. The blue line represents the expected log of Wk from a uniform
distribution. The optimal cluster size will maximize the distance
between
the observed and expected lines and this point can usually be
identified by the appearance of an elbow bend in the red line. Here we
see
the elbow at K=3.
The graph on the right shows the actual gap statistic (the difference between the red and blue lines in the graph on the left) and it is clear that K=3 is indeed the best cluster size.
Figure 5. Gap statistics plots for the HapMap500 dataset.
One thing to keep in mind with clustering is that even with strong statistical support for the number of clusters, one can view K as a somewhat flexible number that can fit your needs. For example, if you know that individuals from two of the main clusters respond well to a drug/treatment and that individuals in the third cluster do not, then, for the purposes of deciding which sub-populations should receive the drug/treatment, one can consider K to be two.
Here is a small example taken from the
HapMap500.txt file:
AWclust works only with SNPs that have two alleles.
D = 1/L ∑l=1,2,3..L dl
where
dl = 0, if two individuals
have two alleles in common at the
lth locus; dl = 1
if two individuals have only a single allele in
common at the lth locus; dl = 2
if individuals have no allele in common at the
lth locus. Note: some researchers use 0, 1/2 and 1 (proportion of
allele sharing) instead
of 0, 1 and 2.
Reference: Bowcock et
al. (1994), Mountain J, Cavalli-Sforza L (1997).
Reference: Borg I, Groenen PJF (1997).
AWclust uses Ward's minimum variance algorithm to cluster the individuals
in the ASD matrix.
In the initial step, each
cluster contains one individual. At each subsequent step,
the algorithm merges the
two groups that will result in the smallest increase in the value of
within-cluster variance. The pair is then joined and the number of
clusters reduced by one. The clustering process continues until one
cluster contains all individuals. Therefore, the within-cluster
variance takes the minimal increase at each fusion. The cluster
variance increases nonlinearly as the clustering process builds up,
which clearly indicates where groups separate from each other. Reference: Ward JH (1963), Ward JH, Hook ME (1963).
The gap statistic selects the optimal number of clusters, k,
such that log(Wk), where Wk is the
pooled within-cluster
sum of squares, is farthest below its null reference distribution
curve. The gap statistic is defined as
Gap(k) = E[log(Wk)] - log(Wk),
where E() denotes the expectation from the
null reference distribution. In this case, the null reference distribution is
a a uniform distribution.
Although the default number of null simulations seems to work well, you can
increase the number if you need to reduce the variation in the estimate.
However, increasing the number of null simulations increases the time
required to compute the statistic. See the tutorial for examples.
Reference: Tibshirani R, Walther G, Hastie T (2001). Gao X and Starmer J, AWclust: point-and-click software for
non-parametric population structure analysis,
BMC Bioinformatics 2008, 9:77
Gao X and Starmer J, Human population structure analysis via
multilocus genotype clustering,
BMC Genetics 2007, 8:34
Gao X and Martin ER. Using allele sharing distance for detecting
human population stratification. Human Heredity 2009; 68:182-191.
SNP file format
The first row in the SNP file is contains names or IDs of the individuals
in the dataset separated by white space.
Each subsequent row represents a single SNP and the different
alleles each individual has for that SNP, also separated by white space.
The SNP information is encoded as numeric values (i.e. 0, 1, or 2) to
represent the number of variant SNP alleles in genotypes (i.e. 0 implies that
there are no SNP variants in the genotype, 1 for heterozygotes and 2 for
homozygotes for SNP variants), and -1 is
used to represent missing values.
CEU1 CEU2 CEU3 CEU4 CEU5
1 0 1 0 1
1 1 2 0 2
-1 1 0 1 2
1 1 2 1 1
Detailed descriptions of the various analysis techniques
Allele sharing distance (ASD)
The distance between individuals i and j was defined as:
Multidimensional scaling (MDS)
MDS is a statistical technique for allowing differences and similarities to be
visualized. The differences are represented by distances between points on
a graph. Elements in the ASD matrix that are close together will tend to
cluster together.
Hierarchical clustering
Gap statistic