AWclust: Manual

- Introduction
- Getting started
- Installing and running AWclust
- A short tutorial
- SNP file format
- Detailed descriptions of the various analysis techniques
- Citation information
- Feedback and bug reports

**What is AWclust?**AWclust is easy to use non-parametric population structure analysis software written for R with a GUI interface. Just point and click and you will be on your way to discovering the important cluster information in your SNP data sets.

The software works well with both small and large SNP datasets, depending on the degree of differences between the subpopulations. For closely related populations (those with a relatively small degree of differences between them), AWclust is capable of quickly processing the large datasets required to separate them.

**What makes it different?**AWclust is non-parametric, while most of the publicly available population structure programs use parametric approaches, either Bayesian (e.g STRUCTURE and BAPS) or frequentist (e.g. L-POP)

One advantage that a non-parametric method has over parametric methods is that the results from a non-parametric approach do not depend on the validity of assumptions, like Hardy-Weinberg Equilibrium (HWE) and linkage equilibrium (LE). If these assumptions are not valid, then non-parametric methods have more power than their parametric counterparts. Even the latest version of STRUCTURE (version 2.1), which can incorporate linkage information, is still sensitive to high background linkage disequilibrium.

Another advantage that AWclust has over other parametric methods is that it does not require an estimate of the allele frequency. Allele frequency estimates suffer from large variation when the sample size is small, e.g. less than ten individuals, and this variation can hinder the analysis of population structure.

**Why should researchers use it?**Simply put, AWclust is easy to use and does not require the researcher to justify assumptions about the data to produce useful results. Just point and click and you are on your way to scientific discovery.

For researchers wishing to use parametric approaches, AWclust is a useful first pass analysis for building confidence about assumptions that parametric methods require.

AWclust requires R (version 2.5 or higher) and the following R packages:

- graphics
- stats
- tcltk
- lattice

> library(package)where

`package`

is the name of a specific package,
not the word package.
If you do not already have R running on your computer system, or you are missing one or more of the required packages, you can download everything you need from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/

To install AWclust on Linux/Unix/Macintosh machines:

- Download the
`AWclust_X.Y.tar.gz`

file (were`X.Y`

represents the latest version number), then enter the command or commands (you need root account to install software):

shell$ R CMD INSTALL AWclust_X.Y.tar.gz

To install AWclust on Windows machines:

- Download the
`AWclust_X.Y.zip`

file (where`X.Y`

represents the latest version number), then

- Start the R interface and select
`Install pacakge(s) from local zip files`

from the`Packages`

menu.

To run AWclust type the following on the command line (Windows users may just need to double click on the R icon to get R running):

shell$ RTo get help:

> library(AWclust)

> AWclust()

> help(AWclust)or

> ? AWclustTo remove AWclust from your R distribution, quit R and enter the following command:

shell$ R CMD REMOVE AWclust

After you have installed AWclust, start R and launch the program (Windows users may just need to double click on the R icon to get R running):

shell$ RAt this point, if everything worked correctly, a new window should pop up that looks similar to the window in Figure 1 (the figures in this tutorial may be slightly different from what you see. However, the essential details should be the same.)

> library(AWclust)

> AWclust()

Figure 1. The main AWclust interface.

The next thing to do is to load in a SNP data file. For this tutorial,
click
on the `Browse`

button and select the file `HapMap500.txt`

.
This file should be in the AWclust directory,
however, if you can not find it, you can download
it. Also, if you would like to use
your own SNP data, just make sure that it is in a format
that AWclust supports.
The HapMap500 dataset contains data for 500 SNP
loci collected from 209 individuals.
Once you have selected
`HapMap500.txt`

, click on the `Load`

button, and
then
click on the `Calculate!`

button underneath it. Clicking on
the
`Calculate!`

button generates a matrix of pair-wise allele sharing distances (ASD) between all of the
individuals in your dataset. This ASD matrix is required
for all of the analysis methods that follow.

With the ASD matrix calculated, you can now start to analyze the population structure in your data.

The first thing you want to do is look for any potential outliers in
the dataset. In the case of population structure, an outlier would
create a sub-population with only one member, or sub-clusters with very
limited numbers of individuals.
One way to check for outliers
is with multidimensional scaling (MDS) plots.
AWclust offers two
types of MDS plots, 2 dimensional (2D) and 3 dimensional (3D).
Generating
the plots is as simple as clicking on the `Plot`

buttons.

The two dimensional plot (see Figure 2) essentially shows orthogonal
projections of the 3D plot (see Figure 3). The advantage of the 2D plot
is that, while the
IDs for the various individuals in the data may look all jumbled at
first, you
can export the plot as a PDF and zoom into it to make the individual
IDs
discernible. The advantage of the 3D plot is that the cube can be
rotated to
make the clustering easier to see. Try moving the sliders in the main
AWclust window within the `MDS 3D plot`

frame to rotate the
cube
so that you can get a good feel for how the data should cluster.

Figure 2. 2D MDS plot for the HapMap500 dataset.

Figure 3. 3D MDS plot for the HapMap500 dataset.

As can be seen Figures 2 and 3, MDS plots can also give you a general idea of how the data should cluster. For the HapMap500 dataset, it appears that there should be 3 main sub-populations.

The MDS plots do not indicate that there are any major outliers in
the HapMap500 dataset, so we can, without modifying the data, go ahead
and plot
a dendrogram, or tree, that shows how the individuals cluster into
subpopulations (see Figure 4). Do this clicking on the `Plot`

button in in the `Hierarchical plot`

frame. This
frame also allows you to draw a red-line across the tree to indicate
what level of sub-division is best suited for your research by clicking
on the `+`

or `-`

buttons.

**NOTE:** At any time, if you see a plot that they would like
to save,
you need only click on the appropriate `Save`

button to
export
it as a PDF file.

Figure 4. A dendrogram plot for the HapMap500 dataset.

So far, both the MDS and the hierarchical plots have shown that the HapMap500 dataset subdivides into three main clusters. However, this inference has come only from eye-balling the figures. With the gap statistics frame, AWclust will let us objectively determine the optimal number of clusters.

**NOTE:** Computing the gap statistic can take time, so, unlike
drawing
the other plots, the results are not likely to appear on your screen
instantaneously. Be patient (and make sure your patience is
proportional to the size of your dataset, the number assigned to K, and the
number of iterations from the null simulations).

Based on our visual inspection of the MDS and hierarchical plots, we are fairly sure that the optimal number of clusters is going to be around 3. If we thought that there might be more clusters (up to 8), we could move the slider on the left side of the frame to increase the range of values K to to determine the optimal number of clusters. However, larger ranges for K take longer to compute.

The slider on the right side of the `Gap statistics`

allows you
to change the number of `Null simulations`

. The default
value
is usually pretty good, but you can increase the value if you need
to
reduce the variation in your estimate for the optimal number of
sub-populations. Increasing the number of null simulations
increases the amount of time required to compute the gap statistic.
For example, on a PC with an Intel Core2 2.5 GHz CPU with 2 GB of RAM,
running 30 simulations (K=1 to 6) with this example dataset takes about
30 seconds.
Increasing the number of simulations to 60 requires takes a little longer
than a minute to compute.

After clicking on `Calculate & Plot`

you should see
something like
what we have in Figure 5. In the graph on the left the red line
represents
the log of the pooled within-cluster sum of squares (Wk) from the
observed
data. The blue line represents the expected log of Wk from a uniform
distribution. The optimal cluster size will maximize the distance
between
the observed and expected lines and this point can usually be
identified by the appearance of an elbow bend in the red line. Here we
see
the elbow at K=3.

The graph on the right shows the actual gap statistic (the difference between the red and blue lines in the graph on the left) and it is clear that K=3 is indeed the best cluster size.

Figure 5. Gap statistics plots for the HapMap500 dataset.

One thing to keep in mind with clustering is that even with strong statistical support for the number of clusters, one can view K as a somewhat flexible number that can fit your needs. For example, if you know that individuals from two of the main clusters respond well to a drug/treatment and that individuals in the third cluster do not, then, for the purposes of deciding which sub-populations should receive the drug/treatment, one can consider K to be two.

Here is a small example taken from the HapMap500.txt file:

CEU1 CEU2 CEU3 CEU4 CEU5 1 0 1 0 1 1 1 2 0 2 -1 1 0 1 2 1 1 2 1 1

AWclust works only with SNPs that have two alleles.

D = ^{1}/_{L} ∑_{l=1,2,3..L} d_{l}

where
d_{l} = 0, if two individuals
have two alleles in common at the
*l*th locus; d_{l} = 1
if two individuals have only a single allele in
common at the *l*th locus; d_{l} = 2
if individuals have no allele in common at the
*l*th locus. Note: some researchers use 0, 1/2 and 1 (proportion of
allele sharing) instead
of 0, 1 and 2.

*Reference: *Bowcock et
al. (1994), Mountain J, Cavalli-Sforza L (1997).

*Reference: *Borg I, Groenen PJF (1997).

AWclust uses Ward's minimum variance algorithm to cluster the individuals in the ASD matrix. In the initial step, each cluster contains one individual. At each subsequent step, the algorithm merges the two groups that will result in the smallest increase in the value of within-cluster variance. The pair is then joined and the number of clusters reduced by one. The clustering process continues until one cluster contains all individuals. Therefore, the within-cluster variance takes the minimal increase at each fusion. The cluster variance increases nonlinearly as the clustering process builds up, which clearly indicates where groups separate from each other.

*Reference: * Ward JH (1963), Ward JH, Hook ME (1963).

The gap statistic selects the optimal number of clusters, *k*,
such that log(W_{k}), where W_{k} is the
pooled within-cluster
sum of squares, is farthest below its null reference distribution
curve. The gap statistic is defined as

Gap(*k*) = E[log(W_{k})] - log(W_{k}),

where E() denotes the expectation from the null reference distribution. In this case, the null reference distribution is a a uniform distribution. Although the default number of null simulations seems to work well, you can increase the number if you need to reduce the variation in the estimate. However, increasing the number of null simulations increases the time required to compute the statistic. See the tutorial for examples.

*Reference: * Tibshirani R, Walther G, Hastie T (2001).

Gao X and Starmer J, AWclust: point-and-click software for non-parametric population structure analysis, BMC Bioinformatics 2008, 9:77

Gao X and Starmer J, Human population structure analysis via multilocus genotype clustering, BMC Genetics 2007, 8:34

Gao X and Martin ER. Using allele sharing distance for detecting human population stratification. Human Heredity 2009; 68:182-191.