Frequently Asked Questions

Clustering and Diversity

How are SLP clusters created?

A combination of ESPRIT, SLP and mothur computes taxonomy-independent clusters (Operational Taxonomic Units, OTUs)
using the total collection of available V6 sequences in VAMPS. The sequences were binned into separate datasets for the
Archaeal or Eukaryal domains, and into Bacterial phylum- or Proteobacterial class-level datasets. For each bin, the unique.seqs
function in mothur eliminated duplicate sequences but retained information about observed frequencies for each unique read.
The kmerdist module of ESPRIT (with default values) identified all sequence pairs within each bin that are predicted to be at
least 90% similar. The needledist module in ESPRIT generated a sparse matrix of pairwise distances by performing a Needleman-Wunsch
alignment on the sequence pairs and calculating pairwise distances using quickdist. The algorithm SLP uses the pairwise distances
to perform a modified single-linkage preclustering at 2% to reduce noise in the sequence data. Initially SLP orders sequences
according to their rank abundance and then steps through the ordered sequences assigning them to clusters. The most abundant
sequence defines the first cluster. Each subsequent sequence is tested against the growing list of clusters using the
single-linkage algorithm. If the sequence has a pairwise distance less than 0.02 (equivalent to a single difference in the V6 region)
to any of the sequences already in the cluster, the new sequence will be added to the cluster and not tested against subsequent clusters.
If the sequence is not within a distance of 0.02 from any read in any of the existing clusters, it will establish a new cluster.
Once all sequences have been assigned to clusters, sequences in the low abundance clusters (< 10 tags) are tested against
the larger clusters and added to those clusters if possible. For each precluster, SLP uses the sequence with the highest
frequency and the count of all tags in the precluster for average linkage clustering by mothur. Taxonomy for each cluster
relies on a two-thirds majority of the taxonomy of its cluster members; CATCHALL estimates the richness.


  Huse, S.M., D. Mark Welch, H.G. Morrison, and M.L. Sogin. (2010)
  Ironing out the wrinkles in the rare biosphere. Environmental Microbiology early view.
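
The preclustering pass described above can be sketched in a few lines of Python. This is only an illustration of the ordering and single-linkage test, not the SLP code itself. The function name, the dictionary layout for the sparse distances, and the handling of ties are assumptions made for the example.

```python
def slp_precluster(counts, dist, max_dist=0.02, min_tags=10):
    """Greedy single-linkage preclustering (illustrative sketch).

    counts: dict mapping sequence id -> number of tags observed.
    dist:   dict mapping frozenset({id_a, id_b}) -> pairwise distance
            (sparse; missing pairs are treated as too distant to link).
    """
    def linked(seq, cluster):
        return any(dist.get(frozenset((seq, member)), 1.0) < max_dist
                   for member in cluster)

    def tags(cluster):
        return sum(counts[s] for s in cluster)

    # Step through sequences from most to least abundant (ties broken by id).
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    clusters = []
    for seq in ordered:
        for cluster in clusters:
            if linked(seq, cluster):
                cluster.append(seq)    # joins the first cluster it links to
                break
        else:
            clusters.append([seq])     # no link found: start a new cluster

    # Re-test members of low-abundance clusters against the larger clusters.
    sizes = [tags(c) for c in clusters]
    large = [c for c, n in zip(clusters, sizes) if n >= min_tags]
    small = [c for c, n in zip(clusters, sizes) if n < min_tags]
    leftovers = []
    for cluster in small:
        remaining = []
        for seq in cluster:
            target = next((big for big in large if linked(seq, big)), None)
            if target is not None:
                target.append(seq)
            else:
                remaining.append(seq)
        if remaining:
            leftovers.append(remaining)
    return large + leftovers
```

The second loop is what lets a rare sequence join a large cluster that only grew close enough to it after the rare sequence had already been processed.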

What is included in the OTU Cluster data package?

  • The .fa file is a fasta file of the unique sequences used to create the clusters.
  • The .names is a text file specifying the additional tags that have the exact same sequence.
  • The .list file specifies the OTU membership - one line for each OTU width.
    The first item in this file is the OTU width (unique or 0.01-0.10 corresponding to 0-10%), the second is the number of OTUs,
    and following this is the list of OTUs. Commas separate tags within the same OTUs and tabs separate one OTU from the next.
  • The .otu file specifies the OTU sizes - one line for each OTU width.
    The first item in this file is the OTU width (unique or 0.01-0.10 corresponding to 0-10%), the second is the number of OTUs,
    and following is the list of OTU sizes separated by tabs.
  • The .tax file shows the taxonomy corresponding to the reads in the .list file for the 0.03, 0.06,
    and 0.10 widths. Clusters are tab-delimited and contain a list of comma-separated taxa and their counts. The lists are in order of
    decreasing abundance, so the most common taxon is first. The format will be
    "0.03 Taxon1:GAST_distance:Count,Taxon2:GAST_distance:Count,...".

NOTE: The .list file, the .tax file and the .otu file may exceed Excel limitations on cell size and number of columns.
Other software may be required to utilize these data.
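
Because these files can be too wide for a spreadsheet, a short script is often the easier route. The sketch below reads one line of a .list file following the layout described above (width, number of OTUs, then tab-separated OTUs whose tags are comma-separated). The function name and the consistency check are assumptions for illustration.

```python
def parse_list_line(line):
    """Split one .list line into (width, list of OTUs), each OTU a list of tags."""
    fields = line.rstrip("\n").split("\t")
    width, n_otus = fields[0], int(fields[1])
    otus = [field.split(",") for field in fields[2:] if field]
    if len(otus) != n_otus:
        raise ValueError("line declares %d OTUs but contains %d" % (n_otus, len(otus)))
    return width, otus
```

A line such as "0.03", "2", "t1,t2", "t3" (tab-separated) would yield the width "0.03" and two OTUs, the first holding two tags.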

OTU Creation and Upload

    OTU Creation Methods:

  • USEARCH: website

    Uses the GreenGenes reference database.

  • CROP:
    From the website:
    By using a Gaussian Mixture model, CROP can automatically determine the best clustering result for 16S rRNA sequences at
    different phylogenetic levels without setting a hard cutoff threshold as hierarchical clustering does.
    Yet, at the same time, it is able to manage large datasets and to overcome sequencing errors.

  • SLP:
    Uses mothur methods of pre-clustering at 2 mismatches and clustering using the average linkage method.

    OTU File formats for uploads:

  • The OTU Table (matrix) file
    This is a simple tab-delimited text file where the first column contains the Cluster (OTU) IDs,
    the subsequent columns represent the datasets, and the optional last column is the taxonomy for
    that cluster. The values in the table cells contain the frequency of the cluster (OTU) in the dataset.
    A header line is REQUIRED and consists of "Cluster_ID", the dataset names, and optionally "Taxonomy"
    as the last word of the header, again separated by tabs. If taxonomy is included, then all rows
    need to have a value for taxonomy. Use "Unknown" for clusters of unknown taxonomy. Dataset names
    and Cluster_IDs must be alphanumeric – only letters, numbers, and underscore. Spaces, hyphens
    and other characters are not allowed.
    Below is a sample matrix file:
    Cluster_ID	DS1	DS2	DS3	DS4	DS5	DS6	DS7	DS8	Taxonomy
    Cluster101	0	0	1	1	0	2	0	1	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae
    Cluster102	5	0	0	14	0	33	41	12	Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Nitrospiraceae;Leptospirillum
    Cluster103	2	0	0	8	0	9	2	0	Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Rickettsiaceae;Rickettsia;endosymbiont
    Cluster105	1	11	0	0	0	0	2	0	Bacteria;Firmicutes;Clostridia;Clostridiales;Unassigned;Sulfobacillus
    Cluster107	0	0	0	1	0	12	18	0	Bacteria;Firmicutes;Clostridia;Clostridiales
    Cluster108	72	74	108	46	1	39	164	41	Bacteria;Actinobacteria;Actinobacteria;Acidimicrobiales;Acidimicrobiaceae
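
    Before uploading, it can help to check a matrix file against the rules above. The following Python sketch is one way to do that. It is not the validator VAMPS runs, and the function name and error messages are assumptions.

```python
import re

NAME_RE = re.compile(r"[A-Za-z0-9_]+")  # letters, numbers and underscore only

def parse_otu_table(text):
    """Return (dataset names, {cluster_id: (counts, taxonomy or None)})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    if header[0] != "Cluster_ID":
        raise ValueError("header must start with Cluster_ID")
    has_taxonomy = header[-1] == "Taxonomy"
    datasets = header[1:-1] if has_taxonomy else header[1:]
    for name in datasets:
        if not NAME_RE.fullmatch(name):
            raise ValueError("bad dataset name: %r" % name)
    table = {}
    for ln in lines[1:]:
        cells = ln.split("\t")
        if len(cells) != len(header):
            raise ValueError("wrong number of columns: %r" % ln)
        if not NAME_RE.fullmatch(cells[0]):
            raise ValueError("bad Cluster_ID: %r" % cells[0])
        counts = [int(c) for c in cells[1:1 + len(datasets)]]
        taxonomy = cells[-1] if has_taxonomy else None
        if has_taxonomy and not taxonomy:
            raise ValueError("missing taxonomy (use Unknown): %r" % cells[0])
        table[cells[0]] = (counts, taxonomy)
    return datasets, table
```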

  • The Biom file
    The Biom file is a sparse matrix file derived from the output of QIIME; see the BIOM format documentation for a full description. It should have taxonomy in the metadata field for each row.
    Below is a sample biom file:
    {
    "rows": [
        {"id": "0", "metadata": {"taxonomy": ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae", "Staphylococcus"]}},
        {"id": "1", "metadata": {"taxonomy": ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales"]}},
        {"id": "2", "metadata": {"taxonomy": ["Bacteria", "Bacteroidetes"]}},
        {"id": "3", "metadata": {"taxonomy": ["Bacteria", "Bacteroidetes"]}},
        {"id": "4", "metadata": {"taxonomy": ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales", "Lachnospiraceae"]}}
    ],
    "format": "Biological Observation Matrix 0.9.1-dev",
    "data": [[0, 0, 16.0], [1, 1, 1.0], [2, 0, 3.0], [3, 2, 0.9], [4, 2, 9.0]],
    "columns": [
        {"id": "PC.636", "metadata": null},
        {"id": "PC.635", "metadata": null},
        {"id": "PC.356", "metadata": null}
    ],
    "generated_by": "QIIME 1.5.0, svn revision 2956",
    "matrix_type": "sparse",
    "shape": [5, 3],
    "date": "2012-05-06T15:59:11.792652",
    "type": "OTU table",
    "id": null,
    "matrix_element_type": "int"
    }

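    In a sparse Biom file each entry of "data" is a [row, column, value] triplet, and "shape" gives the number of rows (OTUs) and columns (samples). The sketch below expands such a file into a dense table using only the Python standard library. The function names and the small embedded example are assumptions for illustration, and the sketch does not replace the official biom-format tools.

```python
import json

def biom_to_dense(biom):
    """Expand sparse [row, col, value] triplets into a list of rows."""
    if biom.get("matrix_type") != "sparse":
        raise ValueError("this sketch only handles sparse BIOM data")
    n_rows, n_cols = biom["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in biom["data"]:
        dense[row][col] = value
    return dense

def counts_by_sample(biom):
    """Map each row (OTU) id to a {sample id: count} dictionary."""
    samples = [column["id"] for column in biom["columns"]]
    return {row["id"]: dict(zip(samples, values))
            for row, values in zip(biom["rows"], biom_to_dense(biom))}

# A tiny hypothetical file with two OTUs and three samples.
EXAMPLE = json.loads("""
{"rows": [{"id": "0", "metadata": {"taxonomy": ["Bacteria", "Firmicutes"]}},
          {"id": "1", "metadata": {"taxonomy": ["Bacteria", "Bacteroidetes"]}}],
 "columns": [{"id": "S1", "metadata": null},
             {"id": "S2", "metadata": null},
             {"id": "S3", "metadata": null}],
 "matrix_type": "sparse",
 "shape": [2, 3],
 "data": [[0, 0, 16], [1, 2, 9]]}
""")
```
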
Sequence Processing

How is the tag sequence trimming done and low-quality reads removed?

For each read from the GS-FLX Sequencer we trim primer bases from the beginning and the end of each read, and remove sequences
likely to be of low quality based on our assessment of pyrosequencing error rates (Huse et al., 2007).

  • A sequencing request is submitted by each researcher outlining the 5-nt run key and amplification primers used for each
    dataset in a GS-FLX run. These data are imported into the database.
  • The GS-FLX sff data files are converted to raw tag sequences.
  • The first five nucleotides of each tag are removed and compared with the list of expected run keys. If the first five
    nucleotides of a tag are not in the expected set, the tag is deleted as low quality.
  • If one or more Ns are located in the tag sequence, the read is flagged as low quality and deleted.
  • The 5-nt run key is used to look up the amplification primers on the researcher submission form. The start of the tag
    sequence is compared with the list of amplification start primers (i.e., 5' end of a forward read). If an exact match
    to an expected proximal primer is not found, the tag is deleted as low quality; otherwise the run key and primer are
    removed from the tag sequence.
  • If an exact match to the distal primer is not located, BLASTN (with non-default parameters -q -1, -W 7, -S 1, -E 2, -F 'F',
    and -G 1), and the EMBOSS program fuzznuc (with non-default parameter -mismatch=3) are used for a fuzzy comparison of expected
    primers with the end of the sequence. If a match is found (either exact or fuzzy) the distal primer is removed from the sequence.
  • If the length of the tag sequence is less than 50 nt after removal of the run key and amplification primers, the tag is
    flagged as low-quality and deleted.
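
The order of these checks can be sketched as follows. This is a simplified illustration, not the production trimming code: it handles only exact primer matches (the BLASTN and fuzznuc fuzzy matching is left out), it assumes the distal primer is supplied in the orientation in which it appears in the read, and the function and argument names are hypothetical.

```python
def trim_read(read, primers_by_key, key_len=5, min_len=50):
    """Return (run_key, trimmed_sequence), or None if the read is deleted.

    primers_by_key: dict mapping run key -> (proximal_primer, distal_primer).
    """
    read = read.upper()
    key, rest = read[:key_len], read[key_len:]
    if key not in primers_by_key:
        return None                      # unexpected run key
    if "N" in rest:
        return None                      # ambiguous base call
    proximal, distal = primers_by_key[key]
    if not rest.startswith(proximal):
        return None                      # no exact proximal primer match
    rest = rest[len(proximal):]
    pos = rest.rfind(distal)             # exact distal match only (assumption)
    if pos != -1:
        rest = rest[:pos]                # drop the distal primer and anything after it
    if len(rest) < min_len:
        return None                      # too short after trimming
    return key, rest
```

A read whose distal primer is not found is kept with its full remaining length, as in the description above, where only a located primer is removed.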

How is taxonomy assigned to the tags through the GAST process?

In Sogin et al. (2006), we proposed a tag mapping methodology, GAST (Global Alignment for Sequence Taxonomy) to assign a
taxonomic classification to environmental V6 tags.

The steps to assigning taxonomy via GAST are as follows:

  • We create a reference database of rRNA genes, RefSSU, based on the SILVA database
    (Pruesse et al. 2007), with taxonomy assigned to each reference sequence primarily
    by the RDP Classifier (Wang et al. 2007).
  • We create reference sets, RefVx, of hypervariable region tags (e.g., RefV6, RefV3, RefV9) by excising the hypervariable
    region from RefSSU using the SILVA alignment.
  • We BLAST each pyrosequencing tag against the RefVx database to generate a set of 100 best local matches to the reference database.
  • Because the top BLAST hit may not have the highest overall similarity to the tag sequence, particularly because edge-effects
    in the short region being compared can be pronounced, we align the tag sequence to the reference tags corresponding to the
    top 100 BLAST hits. We use MUSCLE (with non-default parameters - diags diags1 and -maxiters 2 to reduce processing time).
  • We calculate the global distance from the tag to each of the aligned reference sequences as the number of insertions,
    deletions and mismatches divided by the length of the tag. We consider the reference sequence or sequences with the
    minimum global distance to be the top GAST match(es). The top BLAST hit is generally the best global match; however,
    for 5% to 25% of tags the best global match is to a reference sequence with a lower BLAST score.
  • We identify all of the full-length reference sequences in Ref16S that contain the exact hypervariable sequence of the top GAST
    match or matches. We compile the taxonomic classifications (with RDP bootstrap values >= 80) of all these 16S sequences.
  • If two-thirds or more of the full-length sequences share the same assigned genus, the tag is assigned to that genus.
    If there is no such agreement, we proceed up the tree one level to family. If there is a two-thirds or better consensus
    at the family level, we assign this taxonomy to the tag, and if not, we continue up the tree, until we achieve a
    two-thirds majority. Tags that do not match any reference tag by BLAST are not given a taxonomic assignment.
    Comparison of taxonomic assignments of hypervariable tags via GAST with the taxonomic assignments of known source
    full-length sequences through RDP shows a 98+% correlation.
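
The two calculations at the heart of GAST, the global distance and the two-thirds consensus, can be sketched in Python as below. This is an illustration under stated assumptions rather than the GAST source: the alignment is assumed to be given as two equal-length strings with "-" for gaps, every gap column counts as one insertion or deletion, and a lineage that stops above a given rank counts against the majority at that rank.

```python
from collections import Counter

def global_distance(aligned_tag, aligned_ref):
    """Insertions + deletions + mismatches, divided by the tag length."""
    if len(aligned_tag) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    differences = sum(1 for t, r in zip(aligned_tag, aligned_ref) if t != r)
    return differences / len(aligned_tag.replace("-", ""))

def consensus_taxonomy(lineages):
    """Deepest lineage shared by at least two-thirds of the sequences.

    lineages: list of lists, each ordered from domain down to genus.
    """
    depth = max(len(lineage) for lineage in lineages)
    for level in range(depth, 0, -1):
        votes = Counter(tuple(lineage[:level])
                        for lineage in lineages if len(lineage) >= level)
        if not votes:
            continue
        best, n = votes.most_common(1)[0]
        if 3 * n >= 2 * len(lineages):   # two-thirds majority, integer arithmetic
            return list(best)
    return []
```

For three full-length sequences of which two share a genus, the genus is assigned. If all three differ at genus but agree at family, the assignment stops at family.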

Figure: GAST Process Flowchart

Export Taxonomic Counts

How are the datasets normalized?

Normalized to Largest Dataset - For each particular taxonomic assignment for each dataset, the number of reads in
that taxonomic assignment is multiplied by the ratio of the total number of reads in the largest dataset to the
total number of reads in the current dataset.

  • Normalized Sample Count = (Actual Sample Count) * (Total of Largest Dataset) / (Total of Current Dataset)

Normalized by Percent within Datasets - The frequency of each taxonomic assignment is reported as a percent
(number of reads assigned to a taxonomy over total number of reads in the dataset).

  • Normalized Sample Count = (Actual Sample Count) / (Total of Current Dataset)
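
Both formulas can be written out directly. The Python sketch below follows them term by term. The function names and the nested-dictionary layout (dataset name, then taxonomy, then read count) are assumptions for illustration, and every dataset is assumed to contain at least one read.

```python
def normalize_to_largest(counts):
    """counts: {dataset: {taxonomy: reads}} -> counts scaled to the largest dataset."""
    totals = {ds: sum(taxa.values()) for ds, taxa in counts.items()}
    largest = max(totals.values())
    return {ds: {tax: n * largest / totals[ds] for tax, n in taxa.items()}
            for ds, taxa in counts.items()}

def normalize_within_dataset(counts):
    """counts: {dataset: {taxonomy: reads}} -> reads divided by the dataset total."""
    return {ds: {tax: n / sum(taxa.values()) for tax, n in taxa.items()}
            for ds, taxa in counts.items()}
```

For example, with dataset A holding 40 reads and dataset B holding 10, a taxon seen 5 times in B is scaled to 5 * 40 / 10 = 20 under the first method and reported as 5 / 10 = 0.5 under the second.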

How do I export raw sequence data?

Because of the size of Illumina datasets, we only keep unique sequences on the server (that is, unique PER dataset).
These sequences include a frequency value in their definition lines. If you need the raw fastq files for some purpose
(QIIME, SRA submission), please contact Hilary Morrison (hmorrison1981 [at] and provide a tab-delimited text file of project and dataset names.

How do I download the taxonomic data export files?

The export files are text files, and are not compressed. When you run the query by clicking "Get Taxonomy Data",
three files will be generated. Please allow time for all three files to finish. When ready, select the desired
output file and download.

  • Mac OS X / Firefox. Right-click on the file name link and choose "Save Link As...".
  • Mac OS X / Opera. Right-click on the file name link and choose "Save Linked Content As...".
  • Mac OS X / Safari. Right-click on the file name link and choose "Save Linked File As...".
  • Windows / Firefox. Right-click on the file name link and choose "Save Link As...".
  • Windows / Internet Explorer. Right-click on the file name link and choose "Save Target As...".
  • Windows / Opera. Right-click on the file name link and choose "Save Linked Content As...".

After exporting, the text files may be opened in a text editor. For Windows, the best text editor to use is Wordpad.

How do I import the taxonomic data into a spreadsheet?

  • In MS Excel for Windows, you can import a text file by selecting "Data/Import Data...".
  • In MS Excel for Mac, import a text file by selecting "Data/Get External Data/Import Text File..."
  • From the file dialog box, select the file you wish to import.
  • This brings up the Text Import Wizard. Follow steps 1 through 3:
    • Step 1 - Make sure "Delimited" is selected.
    • Step 2 - Make sure "Tab" is selected as the delimiter.
    • Step 3 - Click Finish.

Import Data

How do I import data into VAMPS so that I can view the data using the VAMPS tools?

Data can be imported into VAMPS in three different ways.

  • Trimmed sequences can be uploaded to the VAMPS database directly using the
    tools on this page. If you use this method you will be asked for a
    project and dataset name as well as domain and which region of the 16S molecule the sequences come from. This metadata will be used
    to select the reference database when you want the taxonomy assigned using the GAST system. You will also be asked to supply an environmental
    source for your data which will be used to filter your dataset(s) later. It is also possible to add a new dataset to one of your
    projects that is already in the VAMPS database, but you are not allowed to add to an already existing dataset. The sequence file must be
    fasta format (plain text, zipped or gzipped).

  • The second type of data that can be uploaded is raw sequences that you want to have
    trimmed using specific primers and keys (short barcode sequences which separate datasets) which you supply
    (see this page to start uploading raw sequences). A quality file can also be included to help elucidate which sequences are to be discarded
    because of low quality. See below for file formats for uploading raw sequences. The sequences
    and quality files must be fasta format (plain text, zipped or gzipped) and the primers and
    runkeys file are tab delimited text files.

  • The remaining method to upload data is to upload a file of taxonomy data that is in one of two
    RDP taxonomy formats: Classifier Text format and Fasta Style format. See here for
    and here to start uploading.

What is the file format for uploading trimmed sequences?

Trimmed sequences can be imported using the NCBI FASTA format. Each read starts with a ‘>’.

Single-Dataset: The unique Read_ID must be between the ‘>’ and the first ‘|’ (a ‘pipe’ symbol) or the first space (' ').
    Counts: [optional] The sequence will be expanded if there is an integer (or '|frequency=<count>') at the end of the defline.
    Otherwise the sequence will be read as unique.
example deflines:
    ">read_id|10" and ">read_id|frequency=10" will be expanded to 10 sequences
    ">read_id|taxonomy or other data" will not be expanded

Multi-Dataset: The dataset name must be between the ‘>’ and the first ‘|’ (a ‘pipe’ symbol) or the first space (' ').
    Then the unique Read_ID must be next.
    Counts: [optional] As with the single dataset version above the sequence will be expanded if there is an integer (or '|frequency=<count>') at the end of the defline.
example deflines:
    ">datasetName|read_id|24" or ">datasetName|read_id|frequency=24" will be expanded to 24 sequences
    ">datasetName|read_id|taxonomy or other data" will not be expanded

The Read_ID and dataset name (if present) cannot contain any special characters other than underscore ‘_’. Each Read_ID must be a unique value (per dataset).
If the sequences represent unique sequences, the sequence count should go after the Read_ID, separated by a pipe (or space).
If there is any other information on the definition line, it must be after the Read_ID and separated by another ‘|’ (or space).
The whole definition line is separated from the sequence data by a return or linefeed.
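
The defline rules above can be checked with a small parser before uploading. The Python sketch below is one reading of those rules, not the importer VAMPS uses: it splits the defline on pipes and spaces and treats a final field that is an integer or "frequency=<count>" as the count. The function name is hypothetical.

```python
import re

COUNT_RE = re.compile(r"(?:frequency=)?(\d+)")

def parse_defline(defline, multi_dataset=False):
    """Return (dataset or None, read_id, count) for one FASTA definition line."""
    fields = [f for f in re.split(r"[| ]", defline.strip().lstrip(">")) if f]
    needed = 2 if multi_dataset else 1
    if len(fields) < needed:
        raise ValueError("defline is missing a dataset name or Read_ID")
    dataset = fields.pop(0) if multi_dataset else None
    read_id = fields.pop(0)
    match = COUNT_RE.fullmatch(fields[-1]) if fields else None
    count = int(match.group(1)) if match else 1   # no count: read as unique
    return dataset, read_id, count
```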

What is the file format for uploading untrimmed sequences?

Untrimmed sequences can be imported using the same fasta file format as the trimmed sequences above.
Untrimmed sequences must be uploaded in parallel with a keys file, primers file and optional quality file.

What is the file format for simple metadata upload?

When uploading this file you must be the owner of the project(s) in the file.
The projects must exist already in VAMPS.
Datasets will be ignored if they don't already exist.
The header line is required.


What is the file format for uploading a matrix file?

A matrix file is a plain text file of columns of sequence counts. The first (header) line is the dataset names
separated by tab characters which identify the columns of numbers.
The very first item in this first row is a tab character or column header name such as 'TAXA'.
After the first row each subsequent row starts with a taxonomy string followed by the sequences counts for each dataset.
The counts are separated from each other by tabs and are separated from the taxonomy string also by a tab.
The counts must be integers, and each row must contain exactly one count per dataset.
The taxonomy string is taxa names separated by semi-colons. There must be between one and eight (inclusive)
names corresponding to Domain(or superkingdom), Phylum, Class, Order, Family, Genus, Species and Strain.
Not all of the names have to be present (at a minimum: the domain) and place holders can be used such as
'Unknown' or 'incertae sedis'. Spaces are ok in the taxonomy string.
The first item (Domain or Superkingdom) must be present and must be one of:
'Archaea', 'Bacteria', 'Eukarya', 'Fungi', 'Organelle', 'Unknown'

Archaea;Crenarchaeota;MGI	0	0	0	4	2	0	0	0	0	0
Archaea;Crenarchaeota;Marine_Group_I	0	0	0	0	0	0	6	0	0	0
Archaea;Crenarchaeota;Thermoprotei	20	0	35	11	382	0	0	0	0	0
Archaea;Crenarchaeota	0	0	0	0	2	0	0	0	0	0
Archaea;Euryarchaeota;Halobacteria;Halobacteriales	0	0	0	0	0	0	0	0	0	1
Archaea;Euryarchaeota;MGII	112	0	43	1645	2062	0	0	0	0	0
Archaea;Euryarchaeota;MGIII	0	0	0	0	1	0	0	0	0	0
Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales	284	49	189	407	531	20	0	0	0	0
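
A matrix file can be checked against these rules with a short script. The Python sketch below is an illustration only. The function name is hypothetical and the checks are limited to the rules stated above (allowed domains, one to eight taxonomic names, one integer count per dataset).

```python
DOMAINS = {"Archaea", "Bacteria", "Eukarya", "Fungi", "Organelle", "Unknown"}

def parse_matrix(text):
    """Return (dataset names, {taxonomy string: list of integer counts})."""
    lines = text.splitlines()
    # The first header cell is either empty or a label such as 'TAXA'.
    datasets = lines[0].split("\t")[1:]
    rows = {}
    for line in lines[1:]:
        if not line.strip():
            continue
        taxonomy, *cells = line.split("\t")
        names = taxonomy.split(";")
        if not 1 <= len(names) <= 8:
            raise ValueError("taxonomy needs 1 to 8 names: %r" % taxonomy)
        if names[0] not in DOMAINS:
            raise ValueError("unknown domain: %r" % names[0])
        if len(cells) != len(datasets):
            raise ValueError("expected %d counts: %r" % (len(datasets), taxonomy))
        rows[taxonomy] = [int(cell) for cell in cells]   # int() rejects non-integers
    return datasets, rows
```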

What is the file format for uploading quality information?

The quality file also follows a format similar to the FASTA format.
Each read starts with a definition line as described above using the same Read_ID
as in the sequence fasta file to match the sequence to the quality data.

What is the file format for importing primers?

The primers can be imported in a plain text csv file (comma separated values).
The format is one line per primer, and fields are separated by a comma. All fields must be present.
The fields have to be in the following order:
primer_name, primer_direction, original sequence

Header: A header line is NOT allowed
Name: Alphanumeric, dash ("-") and underscore ("_") only. The primer name must be unique in the file.
Primer Direction: F = forward; R = reverse.
Sequence: See Primer Notations below.



Primer Notations

The primers are specified using the following standards:

R = A or G
Y = C or T
W = A or T
S = C or G
K = G or T
M = A or C

B = C or G or T (not A)
D = A or G or T (not C)
H = A or C or T (not G)
V = A or C or G (not T)

N = A or C or G or T (any base, same as '.')
* = zero or more of the preceding base
. = any base
[AG] = either one of the enclosed bases
? = zero or one of the preceding base
+ = one or more of the preceding base
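
The notation above maps naturally onto regular expressions: the ambiguity codes become character classes, and the remaining symbols (*, ., [AG], ?, +) already have the same meaning in regular-expression syntax. The Python sketch below shows one way to do the translation and to check a line of the primers file. The function names are hypothetical, and bracketed groups are assumed to contain only plain bases.

```python
import re

IUPAC = {"R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]", "K": "[GT]",
         "M": "[AC]", "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
         "N": "."}
NAME_RE = re.compile(r"[A-Za-z0-9_-]+")

def primer_to_regex(primer):
    """Translate a primer in the notation above into a compiled regex."""
    parts = []
    in_class = False
    for ch in primer.upper():
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        # Inside [...] characters are copied as-is; elsewhere codes are expanded.
        parts.append(ch if in_class else IUPAC.get(ch, ch))
    return re.compile("".join(parts))

def parse_primer_line(line):
    """Parse 'primer_name,primer_direction,sequence' into (name, direction, regex)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3 or not all(fields):
        raise ValueError("expected three non-empty fields")
    name, direction, sequence = fields
    if not NAME_RE.fullmatch(name):
        raise ValueError("bad primer name: %r" % name)
    if direction not in ("F", "R"):
        raise ValueError("direction must be F or R")
    return name, direction, primer_to_regex(sequence)
```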

What is the QIIME metadata mapping file format?

The QIIME file formats are best described here.

What is the metadata file for?

The metadata file is a csv (comma separated values) file. It is required and provides a way to include runkeys
(barcodes) so that the trimmed sequences can be de-multiplexed into separate datasets.
The metadata file contains other required data such as dataset names and sequence direction.
In the table below is an explanation of the different columns. The double quotes around each
item are allowed but not required. No commas allowed other than to separate fields.

Format (header line IS required):

TCATC,dataset_one,F,"my Title","My Description","dataset description",120
ACTCG,dataset_two,F,"my Title","My Description","dataset description",120

Here is a simple template file (headers only) which can be opened in MS Excel or OpenOffice:
Template CSV file
(right-click and save as vamps_csv_template.csv)

Column Explanations:
runkey (barcode) No letters besides A,G,C,T - Length: Minimum 3nt; Maximum 12nt
dataset NO COMMAS - Name of the dataset: ONLY Alphanumeric and underscore '_' (no spaces). Cannot start with a number.
sequence_direction NO COMMAS - Choose one: F, R or B for Forward, Reverse or Both
project_title NO COMMAS - Free form brief title of the project (10 words or less).
project_description NO COMMAS - All on one line, this is more in-depth than the title. Free form description of the project which can be a few sentences long.
dataset_description NO COMMAS - brief description of the dataset.
environmental_source_id A single id number selected from the list below.

Environmental Sample Source IDs:
    ID      Sample Source
    10	    air
    20	    extreme_habitat
    30	    host_associated
    40	    human_associated
    41	    human-skin
    42	    human-oral
    43	    human-gut
    44	    human-vaginal
    45	    human-amniotic_fluid
    46	    human-urine
    47	    human-blood
    50	    microbial_mat/biofilm
    60	    miscellaneous_natural_or_artificial_environment
    70	    plant_associated
    80	    sediment
    90	    soil/sand
    100	    unknown
    110	    wastewater/sludge
    120	    water-freshwater
    130	    water-marine
    140	    indoor
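
The column rules and ID list above lend themselves to a quick pre-upload check. The Python sketch below reads the file with the standard csv module and reports which columns of each row break a rule. It is an illustration, not the check VAMPS performs. The function names are hypothetical, and the first line is assumed to be the required header.

```python
import csv
import io
import re

ENV_SOURCE_IDS = {10, 20, 30, 40, 41, 42, 43, 44, 45, 46, 47,
                  50, 60, 70, 80, 90, 100, 110, 120, 130, 140}

def check_metadata_row(row):
    """Return the names of the columns in one row that break a rule."""
    if len(row) != 7:
        return ["expected 7 fields, found %d" % len(row)]
    runkey, dataset, direction, title, _proj_desc, _ds_desc, env_id = row
    problems = []
    if not re.fullmatch(r"[ACGT]{3,12}", runkey):
        problems.append("runkey")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", dataset):
        problems.append("dataset")
    if direction not in ("F", "R", "B"):
        problems.append("sequence_direction")
    if len(title.split()) > 10:
        problems.append("project_title")
    if not (env_id.isdecimal() and int(env_id) in ENV_SOURCE_IDS):
        problems.append("environmental_source_id")
    return problems

def check_metadata_text(text):
    """Map line number -> problems for every data row (the header is skipped)."""
    rows = list(csv.reader(io.StringIO(text)))
    return {n: check_metadata_row(row)
            for n, row in enumerate(rows[1:], start=2) if row}
```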

Sample Submission Process

How do I fill out the submission form?

The submission form process starts on the project_submit page. You select the primary investigator for the new project you are creating,
then, using the guidelines below, fill out the project name, title and description, as well as selecting the domain and region for your experiment.
Also enter the funding code for the experiment and how many samples will be submitted.

When you press the 'Validate' button it will check your entries for format and ask you to fix any errors before you are able to download it.

  • Primary Investigator (do not leave empty)
    • Select an investigator from our database who will be the Primary Scientist associated with this project.
      If that person is not on this list they should request an account at

  • Project Name (do not leave empty)
    • Format:

    • The project names in VAMPS have three parts. The first part is usually the Primary Investigator's
      initials, but can be some other identifier. It is only 2-3 characters (letters only, no digits).
    • The second part is 3-4 characters (also letters only, no digits) and is usually some abbreviation
      or acronym identifying your project.
    • The third part is determined by the type of analysis requested. It is in the form of Bv6 or Av4v5,
      where the 'B' and 'A' stand for Bacteria and Archaea respectively. Eukarya may also occur if you choose the ITS1 region.
      Currently for HiSeq and MiSeq the dna_regions are v6 and v4v5 (or ITS1). The three parts are joined by an underscore
      to create the VAMPS project name. The submission form will report if the project name is already chosen
      and we reserve the right to change a submitted name.

  • Project Title (do not leave empty)
    • This is a short statement about your project (one line - 60 characters).

  • Project Description (~1000 characters do not leave empty)
    • This is similar to an abstract about your project.

  • Domain and Region (do not leave empty)
    • Here is where you tell us what domain and DNA region should be analysed for this project.

  • Funding Code (do not leave empty)
    • This is the numeric code indicating the funding source for this project.

  • Number of Samples to be Submitted (do not leave empty)
    • Here you enter the number of sample tubes you will be submitting.
      When you download the form an empty row will be created for each sample indicated.
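
The three-part project name format described above can be expressed as a single pattern. The Python sketch below is an assumption-laden illustration, not the form's own validation: in particular, the 'E' prefix for Eukarya and the exact set of region suffixes (v6, v4v5, ITS1) are guesses based on the description, and the example names are made up.

```python
import re

# initials (2-3 letters) _ project acronym (3-4 letters) _ domain letter + region
PROJECT_NAME_RE = re.compile(r"[A-Za-z]{2,3}_[A-Za-z]{3,4}_[ABE](?:v6|v4v5|ITS1)")

def is_valid_project_name(name):
    """True if the name has the three underscore-joined parts described above."""
    return PROJECT_NAME_RE.fullmatch(name) is not None
```

The form itself also checks whether a name is already taken, which a pattern alone cannot do.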

Download and Fill in sample (dataset) information:
When it validates you should download the file and open it on your local machine.
The downloaded file is a Comma Separated Values (csv) file and it should be saved as a csv text file with no formatting.
We recommend OpenOffice or Excel to edit the file.
Once open you need to fill in the fields for your samples (datasets).
When filling out the form be cautious not to change any of the headers as this will cause the validation to fail.

  • Tube Number (do not leave empty)
    • This is filled in already and is just a sequential number starting from one.

  • Dataset Name (do not leave empty)
    • Less than 30 characters (alphanumeric and '_' only and must start with a letter)
    • This is how datasets will be named in VAMPS

  • Dataset Description (do not leave empty)
    • A short (1-2 sentences) description of the sample.

  • Tube Label (do not leave empty)
    • What is written on the sample tube for identification.

  • Sample Concentration (do not leave empty)
    • Numeric only (ng/ul)

  • Quantitation Method (do not leave empty)
    • Can ONLY be one of:
      • nanodrop
      • picogreen
      • qubit
      • other

  • Platform (do not leave empty)
    • Can ONLY be one of:
      • 454
      • ion-torrent
      • hiseq
      • miseq
      • nextseq

  • Adaptor
    • For Illumina: A01 through H24 (okay to leave empty)
    • For v4_hap: A25 through H36 (okay to leave empty)

  • Environmental Source id (do not leave empty)
    • Enter the ID associated with the environmental source from the table below:
    • ID     Sample Source
      20     Extreme Habitat
      30     Host Associated
      40     Human Associated
      41     Human - Skin
      42     Human - Oral
      43     Human - Gut
      44     Human - Vaginal
      45     Human - Amniotic Fluid
      46     Human - Urine
      47     Human - Blood
      50     Microbial Mat or Biofilm
      60     Miscellaneous or Artificial
      70     Plant Associated
      90     Soil or Sand
      110    Wastewater or Sludge
      120    Water - Freshwater
      130    Water - Marine


Example: Census of Deep Life: Construction and sequencing of metagenomic libraries

We use Picogreen (ThermoFisher Scientific, Grand Island NY) to quantitate genomic DNA samples. 5-50 ng DNA is sheared using a Covaris S220 (Covaris, Woburn MA) and libraries are constructed with the Ovation Ultralow Library protocol (Nugen, San Carlos CA) and amplified for 10-18 cycles, depending on the amount of starting material. The amplified product is visualized on a Bioanalyzer DNA 1000 chip (Agilent Technologies, Santa Clara CA) or Caliper HiSens assay (Perkin Elmer, Waltham MA). Libraries are pooled at equimolar concentrations based on these results and size selected using a PippinPrep 1.5% cassette (Sage Science, Beverly MA) to an average insert size of 170bp (280 including adapters). The pool is quantified by qPCR (Kapa Biosystems, Wilmington, MA), then sequenced on the Illumina HiSeq1000 in a paired-end sequencing run (2 x 101 or 2 x 108) using dedicated read indexing. The sample datasets are demultiplexed with CASAVA 1.8.2.

  • Note: To identify which Nugen index was used for a particular library and the exact read length, look at the definition line and sequence in the fastq file.
  • In this example, the index is AAGGGA. The read length is 113 nt.
    @D4ZHLFP1:73:C5PABACXX:7:1101:1376:2313 1:N:0:AAGGGA

Reference Databases

How are the reference databases created?

We create Ref16S, a reference database of aligned full-length sequences based on all available sequences
in SILVA exported using the ARB software. New updates to both SILVA and RDP are incorporated as they become available.

  • We flag as low-quality and delete all sequences with a sequence quality score <= 50, an alignment
    score <= 50 or a pintail (chimera) score <= 40.
  • We flag as redundant and delete all exact copies of the full-length sequence.
  • We classify all bacterial and archaeal sequences directly with the Ribosomal Database Project Classifier
    (RDP). We use only RDP classifications with a bootstrap value of >=80%. If the bootstrap value is <80%,
    the taxonomic assignment is moved to a higher classification level until an 80% or better bootstrap value is
    achieved. For example, if the genus assignment has a bootstrap value of 70%, but the family has a value of 85%,
    that sequence is assigned only as far as family and not to genus. RDP Classifier does not classify
    sequences below the genus level.
  • We incorporate other taxonomy sources, such as Entrez Genome accession numbers or researcher knowledge of
    specific entries, as they become available. These "other sources" are used preferentially over the RDP for
    bacteria and archaea. RDP does not classify eukaryotes. For eukaryota taxonomies, we use the EMBL taxonomy
    from the SILVA database where we do not have other sources.

We create hypervariable region specific databases (RefV6, RefV3, RefV9, etc.).

  • For each hypervariable region we calculate the Ref16S alignment coordinates.
  • We then excise from the Ref16S aligned sequences the section corresponding to the hypervariable region.
  • The gaps are removed from the aligned sequences to create a set of unaligned sequences.
  • Any hypervariable sequences that contain an 'N' are deleted.
  • Any hypervariable sequences shorter than 50 nt are deleted.
  • Any full-length sequences that were not sequenced all the way through the specific hypervariable region are deleted.
  • Two reference IDs are assigned to each reference hypervariable region. A ref16s_id (previously alt_local_gi)
    links the hypervariable sequence with its source full-length sequence, and a second, e.g., refv6_id
    (previously known as local_gi) is used to identify all entries having the exact same sequence of the
    hypervariable region. The taxonomy is carried directly from the full-length source.
  • Unique reference hypervariable region sequences are exported to a blastable database for use in the
    assignment of taxonomy to pyrosequencing reads through GAST.
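
The excision and filtering steps above can be sketched as follows. This Python example is an illustration, not the pipeline code. It assumes the alignment uses "-" for gaps and "." for positions that were not sequenced (so a "." inside the region means the sequence does not span it), and the function name and the dictionary layout are hypothetical.

```python
def excise_region(aligned, start, end, min_len=50):
    """Group full-length reference ids by their unaligned hypervariable sequence.

    aligned: dict mapping ref16s_id -> aligned full-length sequence.
    start, end: alignment coordinates of the hypervariable region.
    """
    by_sequence = {}
    for ref_id, alignment in aligned.items():
        window = alignment[start:end]
        if "." in window:
            continue                       # not sequenced through the region
        region = window.replace("-", "").upper()   # remove alignment gaps
        if "N" in region or len(region) < min_len:
            continue                       # ambiguous or too short
        by_sequence.setdefault(region, []).append(ref_id)
    return by_sequence
```

Each key of the result plays the role of one unique reference hypervariable sequence, and the list of ids under it links back to every full-length source that carries that exact sequence.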

Tag Generation

Conserved sequences that flank the hypervariable V6-V4 region of rRNAs serve as primer sites to generate PCR
amplicons. Each PCR reaction produces products that can be informatically identified using a unique "key" incorporated between
the 454 Life Sciences primer A or B and the 5' flanking rRNA primer. The use of a 5-bp key allows for the synthesis of as many as 81
oligonucleotides that differ by at least two sites. Our multiplexing strategy allows the concurrent collection of 10,000-50,000
tags from each of 8-40 samples in a single nine-hour sequencing run without use of partitioning gaskets that reduce the number
of sequencing wells on the PicoTiterPlate™. Amplicons can be pooled before the emPCR step and each
pool is run on a large region of the plate.
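
Requiring keys to differ at two or more sites means that a single substitution error cannot turn one key into another. A minimal sketch of checking that property and of assigning a read to its sample follows. The three keys are hypothetical examples, not the VAMPS key set, and the sketch assumes the key is the first five bases of the read.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length keys differ."""
    if len(a) != len(b):
        raise ValueError("keys must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(keys):
    """Smallest Hamming distance between any two keys in the set."""
    return min(hamming(a, b) for a, b in combinations(keys, 2))


def assign_key(read, keys, key_len=5):
    """Return the key that exactly matches the start of `read`, else None."""
    prefix = read[:key_len]
    return prefix if prefix in keys else None


# Hypothetical 5-bp keys for illustration only.
keys = ["ACGTA", "AGCTC", "TCAGT"]
assert min_pairwise_distance(keys) >= 2
```

A read whose first five bases match no key exactly is left unassigned (None) in this sketch instead of being guessed.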

454 Amplicon PCR (Christina Holmes/Ekaterina Andreishcheva) for four reactions:

  • 96 ul water
  • 13.4 ul 10X Platinum buffer
  • 9.4 ul 50 mM MgSO4
  • 2.6 ul 10 mM Pure Peak dNTPs
  • 4 ul 10uM Fusion Primer A
  • 4 ul 10uM Fusion Primer B
  • 2.6 ul 2.5 U/ul Platinum HiFi Pol
  • [2 ul template (~5-25 ng)*]
  • 33 ul total volume/reaction

*If template stock is dilute or otherwise resistant to amplification, more template can be added in place of water.
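
As an arithmetic check on the recipe above, the listed master-mix components (excluding template) sum to 132 ul, which is 33 ul for each of four reactions. The sketch below reproduces that sum and the per-reaction volume of each component. The variable names are illustrative, and because the recipe does not say how the bracketed 2 ul of template figures into the stated total, it is left out.

```python
from decimal import Decimal

# Volumes in microliters for the four-reaction mix, copied from the list above.
MASTER_MIX_UL = {
    "water": Decimal("96"),
    "10X Platinum buffer": Decimal("13.4"),
    "50 mM MgSO4": Decimal("9.4"),
    "10 mM dNTPs": Decimal("2.6"),
    "10 uM Fusion Primer A": Decimal("4"),
    "10 uM Fusion Primer B": Decimal("4"),
    "2.5 U/ul Platinum HiFi Pol": Decimal("2.6"),
}
N_REACTIONS = 4

# Total volume of the mix and the share of each component per reaction.
total_ul = sum(MASTER_MIX_UL.values())
per_reaction_ul = {name: vol / N_REACTIONS for name, vol in MASTER_MIX_UL.items()}

for name, vol in per_reaction_ul.items():
    print(f"{name}: {vol} ul per reaction")
print(f"total: {total_ul} ul, {total_ul / N_REACTIONS} ul per reaction")
```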

The 5 reactions are the three replicates of the environmental template, positive control, and negative control.
Template (plasmid pool for positive control; water for negative control) is added as the final step.


  • 94C for 2 min
  • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C
  • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
  • Purify the amplicons using Agencourt AMPure XP beads as described in 454 Sequencing Technical Bulletin 2011-007, and resuspend in 100 uL of Buffer EB.
  • Store purified amplicons at -20C.


  • DNA 1000 Kit, Agilent, 5067-1504, 25 chips, $372
  • Agencourt AMPure XP, 60 mL, A63881, $1050
  • PurePeak DNA Polymerization Mix 10 mM 1 ml, Pierce/ThermoFisher, NU606001, $115
  • Platinum HiFi Taq polymerase plus buffer and MgSO4, Invitrogen.

In 2013 we switched to using Illumina platforms for 16S sequencing. Similar to the strategy for 454, we use fusion primers composed
of the Illumina adaptors, multiplexing identifiers, and domain-specific primers.
For Illumina amplicon PCR, the thermocycling conditions and reaction mixtures differ from those used for 454 sequencing.

For Archaeal V6:

  • 94C for 2 min
  • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C

For Bacterial and Archaeal V4V5:

  • 94C for 2 min
  • 30 cycles of (94C for 30sec, 57C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C

Bacterial V6 is amplified in two steps: the domain-specific primers are used for 25 cycles, then the products are cleaned and used as template in a second, 5-cycle PCR with fusion primers.
We experimented with different types of fusion primers and with two-step vs. one-step PCR for different amplicon targets.
With one-step Bv6, we found that particular abundant taxa were not amplified in their expected proportions (e.g., Arcobacter Bv6).
Conversely, two-step Bv4v5 introduced other biases. We recognize that there is no sure way to avoid biases introduced by primer design or PCR
conditions, even when using mock communities and well-characterized samples.

For Bacterial V6:

  • 94C for 2 min
  • 25 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C
  • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
  • Purify the amplicons using Qiagen MinElute columns.
  • Use product in second PCR:
    • 94C for 2 min
    • 5 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C
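
The two Bacterial V6 rounds together run the same 30 amplification cycles as the one-step programs listed earlier. A rough sketch of the programmed hold times for each program follows. The function and variable names are illustrative, and the totals ignore ramping between temperatures and the final 4C hold, so actual run times will be longer.

```python
def program_seconds(initial_s, cycles, cycle_steps_s, final_s):
    """Programmed hold time in seconds, ignoring ramping and the 4C hold."""
    return initial_s + cycles * sum(cycle_steps_s) + final_s


# Denaturation, annealing, and extension holds per cycle, in seconds.
CYCLE = (30, 45, 60)

one_step_30 = program_seconds(120, 30, CYCLE, 120)  # 30-cycle programs
bv6_first = program_seconds(120, 25, CYCLE, 120)    # Bacterial V6, first PCR
bv6_second = program_seconds(120, 5, CYCLE, 120)    # Bacterial V6, second PCR

for label, seconds in [
    ("one-step, 30 cycles", one_step_30),
    ("Bv6 first PCR", bv6_first),
    ("Bv6 second PCR", bv6_second),
]:
    print(f"{label}: {seconds / 60:.1f} min")
```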

After we produce amplicons, we can clean and/or size-select for the target products using Agencourt AMPure XP beads.
Then we quantitate products with an Invitrogen PicoGreen assay, pool at desired concentrations (e.g., equimolar), and quantitate the final pool with qPCR.
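
For equimolar pooling, a mass concentration from the fluorescence assay has to be converted to a molar concentration before volumes are chosen. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA. The function names, sample names, and values are hypothetical, and this is not necessarily the calculation used in our lab.

```python
def ng_per_ul_to_nM(conc_ng_per_ul, length_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA concentration from ng/ul to nM (assumes 660 g/mol per bp)."""
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * length_bp)


def equimolar_volumes(samples, fmol_each):
    """Volume (ul) of each sample that contains `fmol_each` femtomoles.

    `samples` maps a sample name to (concentration in ng/ul, amplicon length in bp).
    """
    volumes = {}
    for name, (conc, length_bp) in samples.items():
        nanomolar = ng_per_ul_to_nM(conc, length_bp)  # 1 nM equals 1 fmol/ul
        volumes[name] = fmol_each / nanomolar
    return volumes


# Hypothetical samples: same amplicon length, different yields.
samples = {"sample_1": (6.6, 100), "sample_2": (13.2, 100)}
volumes = equimolar_volumes(samples, fmol_each=200)
```

A sample with twice the concentration contributes half the volume, so each sample adds the same number of molecules to the pool.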

What are the primers used?

Please see the information on primers here.

Community Visualization Demo

See a short demonstration video about using the Community Visualization tool. The movie is in QuickTime format.



The project name refers to the overall study or research project to which the data belong.
The project ties multiple samples and sequencing runs together.


The dataset name refers to a set of sequences within the project that are from one sampling
location or individual at a particular date and time. The dataset combines sequences sampled or amplified together.
Sequence and taxonomic data are uploaded on a dataset-by-dataset basis. Multiple datasets may be combined
or compared separately when using the Community Visualization tools.


The FASTA file definition line follows NCBI FASTA format.
If you upload a FASTA file, it will be run through RDP to calculate a sequence-by-sequence taxonomy file.
The file will be filtered for valid file format and data.
If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.
FASTA Definition Line: Each read starts with a ‘>’, and the read ID lies between the ‘>’ and the first ‘|’
(a ‘pipe’ symbol). The read ID cannot contain any special characters other than dash ‘-’ or underscore ‘_’ and must be less than 32 characters long.
If there is any other information on the definition line, it must be after the first ‘|’. The whole definition line is separated from the sequence data by a return or linefeed.
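
The definition-line rules above can be checked before upload with a short script. The following is a sketch, not the validation VAMPS itself performs: the function name is hypothetical, and it assumes that letters and digits are the ordinary (non-special) characters permitted in a read ID alongside dash and underscore.

```python
import re

# Assumption: a read ID is 1-31 letters, digits, dashes, or underscores.
# Anything after the first '|' is treated as free-form extra information.
DEFLINE_RE = re.compile(r"^>([A-Za-z0-9_-]{1,31})(?:\|(.*))?$")


def parse_defline(line):
    """Return (read_id, extra) for a valid definition line, else None.

    `extra` is the text after the first '|', or None if there is no '|'.
    """
    match = DEFLINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical definition lines.
for defline in [">read_001|sample=A", ">read-002", ">bad id|x"]:
    print(defline, "->", parse_defline(defline))
```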

RDP Taxonomy Files

The RDP taxonomy file complies with the RDP format as defined by the
Ribosomal Database Project (RDP).
The file will be filtered for valid file format and data.
If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.
Please note: There should be no blank lines in the header.

Computing Resources


VAMPS has been tested mainly on the following browsers:
  Mac OS X: Safari and Firefox
  Linux: Firefox
  Windows: Firefox


JavaScript and cookies need to be enabled in your browser.

A screen resolution of at least 800x600 pixels is required; larger is recommended.
A minimum of 256 MB RAM is required; 1.0 GB is recommended.
Java is needed and should be enabled in your browser.
