AWclust: Manual

Welcome!

Welcome and thank you for downloading and using AWclust. Our goal is to make population structure analysis as easy as a simple point-and-click. If you have any questions or suggestions on how we can improve AWclust and make it easier to use, please let us know.

Table of contents

  1. Introduction
  2. Getting started
  3. Installing and running AWclust
  4. A short tutorial
  5. SNP file format
  6. Detailed descriptions of the various analysis techniques
  7. Citation information
  8. Feedback and bug reports

Introduction

Getting started

AWclust requires R (version 2.5 or higher) and the following R packages:

All recent binary distributions of R come with the required R packages pre-bundled so, in theory, you should not have to go out of your way to install them. You can verify that these packages are installed with the command:
> library(package)
where package is the name of a specific package, not the word package.

If you do not already have R running on your computer system, or you are missing one or more of the required packages, you can download everything you need from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/

Installing and running AWclust

To install AWclust on Linux/Unix/Macintosh machines:

shell$ R CMD INSTALL AWclust_X.Y.tar.gz

To install AWclust on Windows machines:

To run AWclust type the following on the command line (Windows users may just need to double click on the R icon to get R running):

shell$ R
> library(AWclust)
> AWclust()
To get help:
> help(AWclust)
or
> ? AWclust
To remove AWclust from your R distribution, quit R and enter the following command:
shell$ R CMD REMOVE AWclust

A short tutorial

After you have installed AWclust, start R and launch the program (Windows users may just need to double click on the R icon to get R running):

shell$ R
> library(AWclust)
> AWclust()
At this point, if everything worked correctly, a new window should pop up that looks similar to the window in Figure 1 (the figures in this tutorial may be slightly different from what you see. However, the essential details should be the same.)

Figure 1. The main AWclust interface.

The next thing to do is to load in a SNP data file. For this tutorial, click on the Browse button and select the file HapMap500.txt. This file should be in the AWclust directory, however, if you can not find it, you can download it. Also, if you would like to use your own SNP data, just make sure that it is in a format that AWclust supports. The HapMap500 dataset contains data for 500 SNP loci collected from 209 individuals. Once you have selected HapMap500.txt, click on the Load button, and then click on the Calculate! button underneath it. Clicking on the Calculate! button generates a matrix of pair-wise allele sharing distances (ASD) between all of the individuals in your dataset. This ASD matrix is required for all of the analysis methods that follow.

With the ASD matrix calculated, you can now start to analyze the population structure in your data.

The first thing you want to do is look for any potential outliers in the dataset. In the case of population structure, an outlier would create a sub-population with only one member, or sub-clusters with very limited numbers of individuals. One way to check for outliers is with multidimensional scaling (MDS) plots. AWclust offers two types of MDS plots, 2 dimensional (2D) and 3 dimensional (3D). Generating the plots is as simple as clicking on the Plot buttons.

The two dimensional plot (see Figure 2) essentially shows orthogonal projections of the 3D plot (see Figure 3). The advantage of the 2D plot is that, while the IDs for the various individuals in the data may look all jumbled at first, you can export the plot as a PDF and zoom into it to make the individual IDs discernible. The advantage of the 3D plot is that the cube can be rotated to make the clustering easier to see. Try moving the sliders in the main AWclust window within the MDS 3D plot frame to rotate the cube so that you can get a good feel for how the data should cluster.

Figure 2. 2D MDS plot for the HapMap500 dataset.

Figure 3. 3D MDS plot for the HapMap500 dataset.

As can be seen Figures 2 and 3, MDS plots can also give you a general idea of how the data should cluster. For the HapMap500 dataset, it appears that there should be 3 main sub-populations.

The MDS plots do not indicate that there are any major outliers in the HapMap500 dataset, so we can, without modifying the data, go ahead and plot a dendrogram, or tree, that shows how the individuals cluster into subpopulations (see Figure 4). Do this clicking on the Plot button in in the Hierarchical plot frame. This frame also allows you to draw a red-line across the tree to indicate what level of sub-division is best suited for your research by clicking on the + or - buttons.

NOTE: At any time, if you see a plot that they would like to save, you need only click on the appropriate Save button to export it as a PDF file.

Figure 4. A dendrogram plot for the HapMap500 dataset.

So far, both the MDS and the hierarchical plots have shown that the HapMap500 dataset subdivides into three main clusters. However, this inference has come only from eye-balling the figures. With the gap statistics frame, AWclust will let us objectively determine the optimal number of clusters.

NOTE: Computing the gap statistic can take time, so, unlike drawing the other plots, the results are not likely to appear on your screen instantaneously. Be patient (and make sure your patience is proportional to the size of your dataset, the number assigned to K, and the number of iterations from the null simulations).

Based on our visual inspection of the MDS and hierarchical plots, we are fairly sure that the optimal number of clusters is going to be around 3. If we thought that there might be more clusters (up to 8), we could move the slider on the left side of the frame to increase the range of values K to to determine the optimal number of clusters. However, larger ranges for K take longer to compute.

The slider on the right side of the Gap statistics allows you to change the number of Null simulations. The default value is usually pretty good, but you can increase the value if you need to reduce the variation in your estimate for the optimal number of sub-populations. Increasing the number of null simulations increases the amount of time required to compute the gap statistic. For example, on a PC with an Intel Core2 2.5 GHz CPU with 2 GB of RAM, running 30 simulations (K=1 to 6) with this example dataset takes about 30 seconds. Increasing the number of simulations to 60 requires takes a little longer than a minute to compute.

After clicking on Calculate & Plot you should see something like what we have in Figure 5. In the graph on the left the red line represents the log of the pooled within-cluster sum of squares (Wk) from the observed data. The blue line represents the expected log of Wk from a uniform distribution. The optimal cluster size will maximize the distance between the observed and expected lines and this point can usually be identified by the appearance of an elbow bend in the red line. Here we see the elbow at K=3.

The graph on the right shows the actual gap statistic (the difference between the red and blue lines in the graph on the left) and it is clear that K=3 is indeed the best cluster size.

Figure 5. Gap statistics plots for the HapMap500 dataset.

One thing to keep in mind with clustering is that even with strong statistical support for the number of clusters, one can view K as a somewhat flexible number that can fit your needs. For example, if you know that individuals from two of the main clusters respond well to a drug/treatment and that individuals in the third cluster do not, then, for the purposes of deciding which sub-populations should receive the drug/treatment, one can consider K to be two.

SNP file format

The first row in the SNP file is contains names or IDs of the individuals in the dataset separated by white space. Each subsequent row represents a single SNP and the different alleles each individual has for that SNP, also separated by white space. The SNP information is encoded as numeric values (i.e. 0, 1, or 2) to represent the number of variant SNP alleles in genotypes (i.e. 0 implies that there are no SNP variants in the genotype, 1 for heterozygotes and 2 for homozygotes for SNP variants), and -1 is used to represent missing values.

Here is a small example taken from the HapMap500.txt file:

CEU1 CEU2 CEU3 CEU4 CEU5
1 0 1 0 1
1 1 2 0 2
-1 1 0 1 2
1 1 2 1 1

AWclust works only with SNPs that have two alleles.

Detailed descriptions of the various analysis techniques

Allele sharing distance (ASD)

The distance between individuals i and j was defined as:

D = 1/Ll=1,2,3..L dl

where dl = 0, if two individuals have two alleles in common at the lth locus; dl = 1 if two individuals have only a single allele in common at the lth locus; dl = 2 if individuals have no allele in common at the lth locus. Note: some researchers use 0, 1/2 and 1 (proportion of allele sharing) instead of 0, 1 and 2.

Reference: Bowcock et al. (1994), Mountain J, Cavalli-Sforza L (1997).

Multidimensional scaling (MDS)

MDS is a statistical technique for allowing differences and similarities to be visualized. The differences are represented by distances between points on a graph. Elements in the ASD matrix that are close together will tend to cluster together.

Reference: Borg I, Groenen PJF (1997).

Hierarchical clustering

AWclust uses Ward's minimum variance algorithm to cluster the individuals in the ASD matrix. In the initial step, each cluster contains one individual. At each subsequent step, the algorithm merges the two groups that will result in the smallest increase in the value of within-cluster variance. The pair is then joined and the number of clusters reduced by one. The clustering process continues until one cluster contains all individuals. Therefore, the within-cluster variance takes the minimal increase at each fusion. The cluster variance increases nonlinearly as the clustering process builds up, which clearly indicates where groups separate from each other.

Reference: Ward JH (1963), Ward JH, Hook ME (1963).

Gap statistic

The gap statistic selects the optimal number of clusters, k, such that log(Wk), where Wk is the pooled within-cluster sum of squares, is farthest below its null reference distribution curve. The gap statistic is defined as

Gap(k) = E[log(Wk)] - log(Wk),

where E() denotes the expectation from the null reference distribution. In this case, the null reference distribution is a a uniform distribution. Although the default number of null simulations seems to work well, you can increase the number if you need to reduce the variation in the estimate. However, increasing the number of null simulations increases the time required to compute the statistic. See the tutorial for examples.

Reference: Tibshirani R, Walther G, Hastie T (2001).

Citation information

Gao X and Starmer J, AWclust: point-and-click software for non-parametric population structure analysis, BMC Bioinformatics 2008, 9:77

Gao X and Starmer J, Human population structure analysis via multilocus genotype clustering, BMC Genetics 2007, 8:34

Gao X and Martin ER. Using allele sharing distance for detecting human population stratification. Human Heredity 2009; 68:182-191.

Feedback and bug reports

Feedback and bug reports can be submitted via the project's Sourceforge page.