Visualization and Analysis Tools
VAMPS
Frequently Asked Questions
Diversity and Rarefaction
What is included in the Diversity and Rarefaction data package?
The rarefaction files are packaged by project. The rarefaction package includes the following files:
- The .ace file contains ACE values.
- The .chao file contains Chao values.
- The .rarefaction file contains resampling of OTUs for creation of rarefaction curves.
To create a rarefaction curve, import one .rarefaction file into Excel, using the Excel default import parameters.
Create a new X-Y chart using the numsequences column as the X-value. The unique, 0.03, 0.06, or 0.10 columns can be used as Y-values.
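For readers who prefer a script to Excel, the same X and Y values can be pulled out of a .rarefaction file in a few lines. This is a minimal sketch, assuming the file is tab-delimited with a header row that contains numsequences and the chosen OTU-width column; the function name and the sample text are illustrative only and are not real VAMPS output.

```python
import csv
import io

def read_rarefaction(handle, y_column="0.03"):
    """Return (x, y) lists for one rarefaction curve.

    Assumes a tab-delimited table whose header contains 'numsequences'
    and the requested OTU-width column (an assumption about the layout).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    xs, ys = [], []
    for row in reader:
        xs.append(int(row["numsequences"]))
        ys.append(float(row[y_column]))
    return xs, ys

# Made-up example data, not real VAMPS output.
sample = "numsequences\tunique\t0.03\n1\t1\t1\n100\t80\t55.5\n200\t140\t90.25\n"
x, y = read_rarefaction(io.StringIO(sample))
print(list(zip(x, y)))
```

The two lists can then be passed to any plotting tool as the X and Y series.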
What is included in the OTU Cluster data package?
- The .fa file is a fasta file of the unique sequences used to create the clusters.
- The .names file is a text file specifying the additional tags that have the exact same sequence.
- The .list file specifies the OTU membership - one line for each OTU width. The first item in this file is the OTU width (unique or 0.01-0.10 corresponding to 0-10%), the second is the number of OTUs, and following this is the list of OTUs. Commas separate tags within the same OTU and tabs separate one OTU from the next.
- The .otu file specifies the OTU sizes - one line for each OTU width. The first item in this file is the OTU width (unique or 0.01-0.10 corresponding to 0-10%), the second is the number of OTUs, and following this is the list of OTU sizes separated by tabs.
NOTE: The .list file and the .otu file may exceed Excel limitations on cell size and number of columns. Other software may be required to utilize these data.
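When the .list and .otu files are too wide for a spreadsheet, each line can be parsed directly. The sketch below follows the layout described above and additionally assumes that the first two items (OTU width and OTU count) are also tab-separated; the function names and the example lines are hypothetical.

```python
def parse_list_line(line):
    """Parse one .list line into (width, number of OTUs, list of OTUs).

    Each OTU is returned as a list of its member tags.
    """
    fields = line.rstrip("\n").split("\t")
    width, n_otus = fields[0], int(fields[1])
    otus = [otu.split(",") for otu in fields[2:]]
    return width, n_otus, otus

def parse_otu_line(line):
    """Parse one .otu line into (width, number of OTUs, list of OTU sizes)."""
    fields = line.rstrip("\n").split("\t")
    width, n_otus = fields[0], int(fields[1])
    sizes = [int(size) for size in fields[2:]]
    return width, n_otus, sizes

# Made-up example lines: two OTUs at the 0.03 width.
list_line = "0.03\t2\ttagA,tagB\ttagC\n"
otu_line = "0.03\t2\t2\t1\n"

width, n_otus, otus = parse_list_line(list_line)
_, _, sizes = parse_otu_line(otu_line)
print(width, n_otus, otus, sizes)
```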
Sequence Processing
How is the tag sequence trimming done and low-quality reads removed?
For each read from the GS-FLX Sequencer, we trim primer bases from the beginning and the end of the read and remove sequences likely to be of low quality based on our assessment of pyrosequencing error rates (Huse et al., 2007).
- A sequencing request form is submitted by each researcher outlining the 5-nt run key and amplification primers used for each dataset in a GS-FLX run. These data are imported into the database.
- The GS-FLX sff data files are converted to raw tag sequences.
- The first five nucleotides of each tag are removed and compared with the list of expected run keys. If the first five nucleotides of a tag are not in the expected set, the tag is flagged as low-quality and deleted.
- If one or more Ns are located in the tag sequence, the read is flagged as low-quality and deleted.
- The 5-nt run key is used to look up the amplification primers from the researcher submission form. The start of the tag sequence is compared with the list of amplification start primers (i.e., the 5' end of a forward read). If an exact match to an expected proximal primer is not found, the tag is flagged as low-quality and deleted; otherwise the run key and primer are removed from the tag sequence.
- If an exact match to the distal primer is not located, BLASTN (with non-default parameters -q -1, -W 7, -S 1, -E 2, -F 'F', and -G 1) and the EMBOSS program fuzznuc (with non-default parameter -mismatch=3) are used for a fuzzy comparison of expected primers with the end of the sequence. If a match is found (either exact or fuzzy), the distal primer is removed from the sequence.
- If the length of the tag sequence is less than 50 nt after removal of the run key and amplification primers, the tag is flagged as low-quality and deleted.
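The order of the checks above can be sketched in a short function. This is an illustration only, not the VAMPS pipeline code: the function name and the primers_by_key structure are hypothetical, primers are assumed to be given as they appear in the read, and only exact distal-primer matches are handled (the fuzzy BLASTN and fuzznuc step is omitted).

```python
MIN_LENGTH = 50  # minimum tag length after trimming, from the steps above

def trim_read(read, primers_by_key):
    """Return the trimmed tag, or None if the read is deleted as low-quality.

    primers_by_key (hypothetical structure) maps each 5-nt run key to a pair
    (proximal_primers, distal_primers) of primer lists.
    """
    key, tag = read[:5], read[5:]
    if key not in primers_by_key:      # unexpected run key
        return None
    if "N" in tag:                     # ambiguous base anywhere in the tag
        return None
    proximal_primers, distal_primers = primers_by_key[key]
    for primer in proximal_primers:    # an exact proximal match is required
        if primer and tag.startswith(primer):
            tag = tag[len(primer):]
            break
    else:
        return None
    for primer in distal_primers:      # distal primer is removed only if found
        if primer and tag.endswith(primer):
            tag = tag[:-len(primer)]
            break
    if len(tag) < MIN_LENGTH:          # too short after trimming
        return None
    return tag

# Made-up run key, primers and reads for illustration.
primers_by_key = {"ACGTA": (["CCAA"], ["GGTT"])}
insert = "ACGT" * 15  # 60 nt
good = trim_read("ACGTA" + "CCAA" + insert + "GGTT", primers_by_key)
bad_key = trim_read("TTTTT" + "CCAA" + insert + "GGTT", primers_by_key)
with_n = trim_read("ACGTA" + "CCAA" + "N" + insert, primers_by_key)
too_short = trim_read("ACGTA" + "CCAA" + "ACGT" * 5 + "GGTT", primers_by_key)
print(good == insert, bad_key, with_n, too_short)
```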
How is taxonomy assigned to the tags through the GAST process?
In Sogin et al. (2006), we proposed a tag mapping methodology, GAST (Global Alignment for Sequence Taxonomy) to assign a taxonomic classification to environmental V6 tags.
The steps to assigning taxonomy via GAST are as follows:
- We create a reference database of rRNA genes, RefSSU, based on the SILVA database (Pruesse et al. 2007), with taxonomy assigned to each reference sequence primarily by the RDP Classifier (Wang et al. 2007).
- We create reference sets, RefVx, of hypervariable region tags (e.g., RefV6, RefV3, RefV9) by excising the hypervariable region from RefSSU using the SILVA alignment.
- We BLAST each pyrosequencing tag against the RefVx database to generate a set of 100 best local matches to the reference database.
- Because the top BLAST hit may not have the highest overall similarity to the tag sequence, particularly because edge-effects in the short region being compared can be pronounced, we align the tag sequence to the reference tags corresponding to the top 100 BLAST hits. We use MUSCLE (with non-default parameters - diags diags1 and -maxiters 2 to reduce processing time).
- We calculate the global distance from the tag to each of the aligned reference sequences as the number of insertions, deletions and mismatches divided by the length of the tag. We consider the reference sequence or sequences with the minimum global distance to be the top GAST match(es). The top BLAST hit is generally the best global match; however, for 5% to 25% of tags the best global match is to a reference sequence with a lower BLAST score.
- We identify all of the full-length reference sequences in Ref16S that contain the exact hypervariable sequence of the top GAST match or matches. We compile the taxonomic classifications (with RDP bootstrap values >= 80) of all these 16S sequences.
- If two-thirds or more of the full-length sequences share the same assigned genus, the tag is assigned to that genus. If there is no such agreement, we proceed up the tree one level to family. If there is a two-thirds or better consensus at the family level, we assign this taxonomy to the tag, and if not, we continue up the tree until we achieve a two-thirds majority. Tags that do not match any reference tag by BLAST are not given a taxonomic assignment. Comparison of taxonomic assignments of hypervariable tags via GAST with the taxonomic assignments of known source full-length sequences through RDP shows a 98+% correlation.
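Two of the steps above, the global distance and the two-thirds consensus, lend themselves to a small worked sketch. The code below is an illustration under stated assumptions, not the GAST implementation: gaps are assumed to be written as "-", every gapped column counts as one insertion or deletion, lineages are tuples ordered from the highest rank down to genus, and sequences whose classification stops above a given rank still count toward the total at that rank. The function names and the abbreviated example lineages are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def global_distance(aligned_tag, aligned_ref):
    """Insertions, deletions and mismatches divided by the ungapped tag length."""
    # A column counts when the two characters differ; a gap against a base
    # is an insertion or deletion, and two gaps in one column are ignored.
    diffs = sum(1 for t, r in zip(aligned_tag, aligned_ref) if t != r)
    tag_length = len(aligned_tag.replace("-", ""))
    return diffs / tag_length

def consensus_taxonomy(lineages, threshold=Fraction(2, 3)):
    """Return the deepest lineage shared by at least two-thirds of the sequences."""
    if not lineages:
        return ()
    total = len(lineages)
    depth = max(len(lineage) for lineage in lineages)
    for level in range(depth, 0, -1):  # start at genus and move up the tree
        votes = Counter(l[:level] for l in lineages if len(l) >= level)
        if votes:
            lineage, count = votes.most_common(1)[0]
            if Fraction(count, total) >= threshold:
                return lineage
    return ()

# One gap and one mismatch over a 9-nt tag.
distance = global_distance("ACGT-ACGTA", "ACGTTACCTA")

# Abbreviated lineages (domain, family, genus) for illustration.
agree = [("Bacteria", "Vibrionaceae", "Vibrio"),
         ("Bacteria", "Vibrionaceae", "Vibrio"),
         ("Bacteria", "Vibrionaceae", "Photobacterium")]
split = [("Bacteria", "Vibrionaceae", "Vibrio"),
         ("Bacteria", "Vibrionaceae", "Photobacterium"),
         ("Bacteria", "Vibrionaceae", "Aliivibrio")]
print(distance, consensus_taxonomy(agree), consensus_taxonomy(split))
```

In the second example no genus reaches two-thirds, so the assignment stops at the family level.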
Export Taxonomic Counts
How are the datasets normalized?
Normalized to Largest Dataset - For each taxonomic assignment in each dataset, the number of reads in that taxonomic assignment is multiplied by the ratio of the total number of reads in the largest dataset to the total number of reads in the current dataset.
- Normalized Sample Count = (Actual Sample Count) * (Total of Largest Dataset) / (Total of Current Dataset)
Normalized by Percent within Datasets - The frequency of each taxonomic assignment is reported as a percent (number of reads assigned to a taxonomy over total number of reads in the dataset).
- Normalized Sample Count = (Actual Sample Count) / (Total of Current Dataset)
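The two formulas can be checked with a small worked example. This is a sketch with made-up counts; the function names and the nested-dictionary layout are assumptions for illustration, and every dataset is assumed to contain at least one read. The second function returns the ratio exactly as written in the formula; multiply by 100 to express it as a percentage.

```python
def normalize_to_largest(counts_by_dataset):
    """Scale each count by (total of largest dataset) / (total of its dataset)."""
    totals = {name: sum(c.values()) for name, c in counts_by_dataset.items()}
    largest = max(totals.values())
    return {
        name: {taxon: n * largest / totals[name] for taxon, n in counts.items()}
        for name, counts in counts_by_dataset.items()
    }

def normalize_within_dataset(counts_by_dataset):
    """Divide each count by the total number of reads in its own dataset."""
    return {
        name: {taxon: n / sum(counts.values()) for taxon, n in counts.items()}
        for name, counts in counts_by_dataset.items()
    }

# Made-up read counts for two datasets.
counts = {
    "A": {"Vibrio": 30, "Bacillus": 70},   # 100 reads, the largest dataset
    "B": {"Vibrio": 10, "Bacillus": 40},   # 50 reads
}
by_largest = normalize_to_largest(counts)
by_fraction = normalize_within_dataset(counts)
print(by_largest["B"], by_fraction["B"])
```

Here dataset B is scaled by 100 / 50, so its 10 Vibrio reads become 20 and its 40 Bacillus reads become 80, while dataset A is unchanged.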
How do I download the taxonomic data export files?
The export files are text files, compressed with gzip. When you run the query by clicking "Get Taxonomy Data", three files will be generated. Please allow time for all three files to finish. When ready, select the desired output file and click "Download".
- Mac OS X / Safari. Either left- or right-clicking will uncompress and download the file to your desktop and open it. Right-clicking allows you to save the compressed file as well.
- Mac OS X / Firefox. Either left- or right-clicking will uncompress and download the file to your desktop. Right-clicking allows you to save the compressed file as well.
- Windows / Internet Explorer. Either left- or right-clicking allows you to save the compressed file. Do not select "Open". You can then extract the file with WinZip.
- Windows / Firefox. Either left- or right-clicking allows you to save the compressed file. Left-clicking and selecting "Open" allows you to extract the file with WinZip and save the uncompressed file.
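If the compressed file has been saved and no extraction tool is at hand, a gzip file can also be uncompressed with a few lines of Python. This is a minimal sketch using only the standard library; the function name and the file name taxonomy_export.txt.gz are made up for the demonstration and do not correspond to the actual export file names.

```python
import gzip
import os
import shutil
import tempfile

def gunzip(src_path, dest_path):
    """Decompress the gzip file at src_path into dest_path and return dest_path."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path

# Demonstration with a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "taxonomy_export.txt.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
    fh.write("taxonomy\tcount\nBacteria;Proteobacteria\t12\n")

txt_path = gunzip(gz_path, gz_path[:-3])  # drop the ".gz" suffix
with open(txt_path, encoding="utf-8") as fh:
    first_line = fh.readline().rstrip("\n")
print(first_line)
```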
How do I import the taxonomic data into a spreadsheet?
- In MS Excel for Windows, you can import a text file by selecting "Data/Import Data...".
- In MS Excel for Mac, import a text file by selecting "Data/Get External Data/Import Text File...".
- From the file dialog box, select the file you wish to import.
- This brings up the Text Import Wizard. Follow steps 1 through 3:
- Step 1 - Make sure "Delimited" is selected.
- Step 2 - Make sure "Tab" is selected.
- Step 3 - If the TaxByLgi or TaxByTag files are being imported, select "Text" for the Column Format. Otherwise, leave the format as "General". Click Finish.
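The same tab-delimited import can be done outside a spreadsheet. The sketch below reads every value as text, which mirrors the "Text" column format in Step 3 and avoids any reformatting of identifier-like values. The function name, column names and sample rows are hypothetical and are not the real export layout.

```python
import csv
import io

def read_tab_delimited(handle):
    """Read a tab-delimited table with a header row into a list of dicts.

    The csv module returns every value as a string, so nothing is
    converted to a number unless you do so explicitly.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]

# Made-up sample with an identifier that has leading zeros.
sample = "tag_id\ttaxonomy\tcount\n000123\tBacteria;Firmicutes\t7\n"
rows = read_tab_delimited(io.StringIO(sample))
print(rows[0]["tag_id"], int(rows[0]["count"]))
```

Counts can be converted with int() where arithmetic is needed, while identifier columns stay as text.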
Reference Databases
How are the reference databases created?
We create Ref16S, a reference database of aligned full-length sequences based on all available sequences in SILVA exported using the ARB software. New updates to both SILVA and RDP are incorporated as they become available.
- We flag as low-quality and delete all sequences with a sequence quality score <= 50, an alignment score <= 50 or a pintail (chimera) score <= 40.
- We flag as redundant and delete all exact copies of the full-length sequence.
- We classify all bacterial and archaeal sequences directly with the Ribosomal Database Project Classifier (RDP). We use only RDP classifications with a bootstrap value of >=80%. If the bootstrap value is <80%, the taxonomic assignment is moved to a higher classification level until an 80% or better bootstrap value is achieved. For example, if the genus assignment has a bootstrap value of 70%, but the family has a value of 85%, that sequence is assigned only as far as family and not to genus. RDP Classifier does not classify sequences below the genus level.
- We incorporate other taxonomy sources, such as Entrez Genome accession numbers or researcher knowledge of specific entries, as they become available. These "other sources" are used preferentially over the RDP for bacteria and archaea. RDP does not classify eukaryotes. For eukaryotic taxonomies, we use the EMBL taxonomy from the SILVA database where we do not have other sources.
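The bootstrap rule described above amounts to trimming a classification from the genus end until the deepest remaining rank reaches 80%. A minimal sketch, assuming each classification is a list of (name, bootstrap) pairs ordered from domain down to genus; the function name and the example values are illustrative and are not RDP Classifier output.

```python
MIN_BOOTSTRAP = 80  # threshold stated in the text above

def truncate_by_bootstrap(assignments, min_bootstrap=MIN_BOOTSTRAP):
    """Drop ranks from the genus end until the deepest rank meets the threshold.

    assignments is a list of (name, bootstrap) pairs ordered from the
    highest rank (domain) to the lowest (genus).
    """
    kept = list(assignments)
    while kept and kept[-1][1] < min_bootstrap:
        kept.pop()  # move up one classification level
    return [name for name, _ in kept]

# Made-up bootstrap values: genus at 70%, family at 85%.
classification = [
    ("Bacteria", 100),
    ("Proteobacteria", 100),
    ("Gammaproteobacteria", 99),
    ("Vibrionales", 95),
    ("Vibrionaceae", 85),
    ("Vibrio", 70),
]
print(truncate_by_bootstrap(classification))
```

With these values the sequence is assigned only as far as the family Vibrionaceae, matching the example in the text.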
We create hypervariable region specific databases (RefV6, RefV3, RefV9, etc.).
- For each hypervariable region we calculate the Ref16S alignment coordinates.
- We then excise from the Ref16S aligned sequences the section corresponding to the hypervariable region.
- The gaps are removed from the aligned sequences to create a set of unaligned sequences.
- Any hypervariable sequences that contain an 'N' are deleted.
- Any hypervariable sequences shorter than 50 nt are deleted.
- Any full-length sequences that were not sequenced all the way through the specific hypervariable region are deleted.
- Two reference IDs are assigned to each reference hypervariable region. A ref16s_id (previously alt_local_gi) links the hypervariable sequence with its source full-length sequence, and a second, e.g., refv6_id (previously known as local_gi), is used to identify all entries having the exact same sequence of the hypervariable region. The taxonomy is carried directly from the full-length source.
- Unique reference hypervariable region sequences are exported to a blastable database for use in the assignment of taxonomy to pyrosequencing reads through GAST.
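The excision and ID assignment steps can be sketched as follows. This is an illustration under stated assumptions rather than the production code: alignment coordinates are taken as 0-based and end-exclusive, gaps are assumed to be written as "-", the check for sequences that do not span the whole region is omitted, and the function names are hypothetical. The example lowers the minimum length so that short toy sequences can be used.

```python
def excise_region(aligned_seq, start, end, min_length=50):
    """Cut the region out of an aligned sequence and remove its gaps.

    Returns None when the ungapped region contains an N or is too short.
    """
    region = aligned_seq[start:end].replace("-", "").upper()
    if "N" in region or len(region) < min_length:
        return None
    return region

def build_region_reference(aligned_by_id, start, end, min_length=50):
    """Assign one region ID per unique region sequence and link the sources.

    Returns (region_ids, links): region_ids maps each unique sequence to its
    ID, and links lists (full_length_id, region_id) pairs.
    """
    region_ids = {}
    links = []
    for full_id, aligned in aligned_by_id.items():
        region = excise_region(aligned, start, end, min_length)
        if region is None:
            continue
        # Identical region sequences share the same region ID.
        region_id = region_ids.setdefault(region, len(region_ids) + 1)
        links.append((full_id, region_id))
    return region_ids, links

# Toy alignment: the region occupies columns 4 to 11.
aligned = {
    "s1": "AAAA" + "AC-GT-AC" + "TTTT",
    "s2": "CCCC" + "ACGTAC--" + "GGGG",  # same region sequence as s1
    "s3": "AAAA" + "ACNTACGT" + "TTTT",  # contains an N
    "s4": "AAAA" + "A------C" + "TTTT",  # too short once gaps are removed
}
region_ids, links = build_region_reference(aligned, 4, 12, min_length=4)
print(region_ids, links)
```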
Community Visualization Demo
See a short demonstration video about using the Community Visualization tool.
The movie is in QuickTime format.