Frequently Asked Questions
Clustering and Diversity
How are SLP clusters created?
A combination of ESPRIT, SLP and mothur computes taxonomy-independent clusters (Operational Taxonomic Units, OTUs) using the total collection of available V6 sequences in VAMPS. The sequences are binned into separate datasets for the Archaeal and Eukaryal domains, and into Bacterial phylum-level or Proteobacterial class-level datasets. For each bin, the unique.seqs function in mothur eliminates duplicate sequences but retains the observed frequency of each unique read. The kmerdist module of ESPRIT (with default values) identifies all sequence pairs within each bin that are predicted to be at least 90% similar. The needledist module in ESPRIT generates a sparse matrix of pairwise distances by performing a Needleman-Wunsch alignment on the sequence pairs and calculating pairwise distances using quickdist.
SLP uses the pairwise distances to perform a modified single-linkage preclustering at 2% to reduce noise in the sequence data. SLP first orders sequences by rank abundance and then steps through the ordered sequences, assigning them to clusters. The most abundant sequence defines the first cluster. Each subsequent sequence is tested against the growing list of clusters using the single-linkage algorithm. If the sequence has a pairwise distance less than 0.02 (equivalent to a single difference in the V6 region) to any sequence already in a cluster, it is added to that cluster and not tested against subsequent clusters. If the sequence is not within a distance of 0.02 of any read in any existing cluster, it establishes a new cluster. Once all sequences have been assigned to clusters, sequences in the low-abundance clusters (< 10 tags) are tested against the larger clusters and added to those clusters if possible. For each precluster, SLP passes the sequence with the highest frequency and the count of all tags in the precluster to mothur for average-linkage clustering.
Taxonomy for each cluster relies on a two-thirds majority of the taxonomy of the cluster members; CATCHALL estimates richness.
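The preclustering steps described above can be sketched roughly as follows. This is a minimal illustration, not the SLP source code: the function name, the dictionary-based sparse distance lookup, and the treatment of missing pairs as "far apart" are all assumptions made for the example.

```python
def slp_precluster(counts, dist, max_dist=0.02, small=10):
    """Illustrative single-linkage preclustering (not the real SLP code).

    counts: {sequence_id: number_of_tags}
    dist:   sparse {frozenset({id_a, id_b}): pairwise_distance};
            pairs missing from dist are assumed to be far apart.
    """
    def close(a, b):
        return dist.get(frozenset((a, b)), 1.0) < max_dist

    def size(members):
        return sum(counts[m] for m in members)

    # Step through sequences from most to least abundant.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    clusters = []
    for seq in ordered:
        for members in clusters:
            # Single linkage: one close member is enough to join.
            if any(close(seq, m) for m in members):
                members.append(seq)
                break
        else:
            clusters.append([seq])

    # Second pass: retest sequences from low-abundance clusters
    # against the larger clusters.
    large = [c for c in clusters if size(c) >= small]
    small_clusters = [c for c in clusters if size(c) < small]
    leftovers = []
    for cluster in small_clusters:
        remaining = []
        for seq in cluster:
            for target in large:
                if any(close(seq, m) for m in target):
                    target.append(seq)
                    break
            else:
                remaining.append(seq)
        if remaining:
            leftovers.append(remaining)
    return large + leftovers
```

In this sketch the second pass matters because a rare sequence can be processed before the member of a large cluster that it is close to.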
Citation:
Huse, S.M., D. Mark Welch, H.G Morrison, and M.L. Sogin. (2010)
Ironing out the wrinkles in the rare biosphere. Environmental Microbiology early view.
What is included in the OTU Cluster data package?
-
The .fa file is a fasta file of the unique sequences used to create the clusters.
-
The .names is a text file specifying the additional tags that have the exact same sequence.
-
The .list file specifies the OTU membership - one line for each OTU width.
The first item in this file is the OTU width (unique or 0.01-0.10 corresponding to 0-10%), the second is the number of OTUs, and following this is the list of OTUs. Commas separate tags within the same OTUs and tabs separate one OTU from the next.
-
The .otu file specifies the OTU sizes - one line for each OTU width.
The first item in this file is the OTU width (unique or 0.01-0.10 corresponding to 0-10%), the second is the number of OTUs, and following is the list of OTU sizes separated by tabs.
-
The .tax file shows the taxonomy corresponding to the reads in the .list file for the 0.03, 0.06, and 0.10 widths. Clusters are tab-delimited and contain a list of comma-separated taxa and their counts. The lists are in order of decreasing abundance, so the most common taxon is first. The format will be
"0.03 Taxon1:GAST_distance:Count,Taxon2:GAST_distance:Count,...".
NOTE: The .list file, the .tax file and the .otu file may exceed Excel limitations on cell size and number of columns. Other software may be required to utilize these data.
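Because these files can exceed spreadsheet limits, a short script is often a practical alternative. The following is a minimal sketch of reading one line of a .list file, based only on the layout described above; the function name is hypothetical and no VAMPS tool is implied.

```python
def parse_list_line(line):
    """Parse one line of a .list file (illustrative helper).

    Layout assumed from the description: OTU width, number of OTUs,
    then one tab-separated field per OTU with comma-separated tags.
    """
    fields = line.rstrip("\n").split("\t")
    width = fields[0]
    n_otus = int(fields[1])
    otus = [field.split(",") for field in fields[2:]]
    if len(otus) != n_otus:
        raise ValueError(
            f"expected {n_otus} OTUs at width {width}, found {len(otus)}"
        )
    return width, otus
```

The same approach applies to the .otu file, where each field after the first two is a single OTU size rather than a list of tags.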
OTU Creation and Upload
OTU File formats:
-
The Matrix file (for OTU uploads)
The header line is REQUIRED.
The required header line consists of the word: 'Cluster_ID', then the dataset names.
Taxonomy is optional and if it is present it is the last tab field in each row and there must be the word 'Taxonomy' as the
last word in the header. If the word taxonomy is found in the header then all taxonomy needs
to be filled in. If the taxonomy is unknown then 'unknown' should be in the row.
All of the fields in the file are separated by tabs including the header line.
The dataset names should be alphanumeric only (underscores are also allowed) and contain no spaces.
Below is a sample matrix file:
Cluster_ID DS1 DS2 DS3 DS4 DS5 DS6 DS7 DS8 Taxonomy
Cluster101 0 0 1 1 0 2 0 1 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae
Cluster102 5 0 0 14 0 33 41 12 Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Nitrospiraceae;Leptospirillum
Cluster103 2 0 0 8 0 9 2 0 Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Rickettsiaceae;Rickettsia;endosymbiont
Cluster105 1 11 0 0 0 0 2 0 Bacteria;Firmicutes;Clostridia;Clostridiales;Unassigned;Sulfobacillus
Cluster107 0 0 0 1 0 12 18 0 Bacteria;Firmicutes;Clostridia;Clostridiales
Cluster108 72 74 108 46 1 39 164 41 Bacteria;Actinobacteria;Actinobacteria;Acidimicrobiales;Acidimicrobiaceae
-
The Biom file (for OTU uploads)
The Biom file is a sparse matrix file derived from the output of QIIME and is described
here. It should have taxonomy in the metadata field for each row.
Below is a sample biom file:
{
"rows": [
{"id": "0", "metadata": {"taxonomy": ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae", "Staphylococcus"]}},
{"id": "1", "metadata": {"taxonomy": ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales"]}},
{"id": "2", "metadata": {"taxonomy": ["Bacteria", "Bacteroidetes"]}},
{"id": "3", "metadata": {"taxonomy": ["Bacteria", "Bacteroidetes"]}},
{"id": "4", "metadata": {"taxonomy": ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales", "Lachnospiraceae"]}}
],
"format": "Biological Observation Matrix 0.9.1-dev",
"data": [[0, 0, 16.0], [1, 1, 1.0], [2, 0, 3.0], [3, 2, 0.9], [4, 2, 9.0]],
"columns": [
{"id": "PC.636", "metadata": null},
{"id": "PC.635", "metadata": null},
{"id": "PC.356", "metadata": null}
],
"generated_by": "QIIME 1.5.0, svn revision 2956",
"matrix_type": "sparse",
"shape": [5, 3],
"date": "2012-05-06T15:59:11.792652",
"type": "OTU table",
"id": null,
"matrix_element_type": "int"
}
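To show how the two upload formats relate, the sketch below expands the sparse "data" triples of a Biom object into the tab-delimited matrix layout described earlier. It is an illustration only: the function name is hypothetical, and it assumes the column ids already satisfy the dataset naming rule (alphanumeric and underscore only), so ids such as "PC.636" would need renaming first.

```python
import json


def biom_to_matrix_lines(biom):
    """Expand a sparse Biom dict into tab-delimited matrix lines.

    Each entry of biom["data"] is [row_index, column_index, value];
    cells that are not listed are zero.
    """
    n_rows, n_cols = biom["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row_idx, col_idx, value in biom["data"]:
        dense[row_idx][col_idx] = value

    header = ["Cluster_ID"] + [c["id"] for c in biom["columns"]] + ["Taxonomy"]
    lines = ["\t".join(header)]
    for row, values in zip(biom["rows"], dense):
        metadata = row.get("metadata") or {}
        # The matrix format expects 'unknown' when taxonomy is missing.
        taxonomy = ";".join(metadata.get("taxonomy", [])) or "unknown"
        cells = [format(v, "g") for v in values]
        lines.append("\t".join([row["id"]] + cells + [taxonomy]))
    return lines


def load_biom(path):
    """Read a Biom JSON file from disk."""
    with open(path) as handle:
        return json.load(handle)
```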
Sequence Processing
How is the tag sequence trimming done and low-quality reads removed?
For each read from the GS-FLX Sequencer we trim primer bases from the beginning and the end of each read, and remove sequences likely to be of low-quality based on our assessment of pyrosequencing error rates (Huse et al, 2007).
-
A sequencing request is submitted by each researcher outlining the 5-nt run key and amplification primers used for each dataset in a GS-FLX run. These data are imported into the database.
-
The GS-FLX sff data files are converted to raw tag sequences.
-
The first five nucleotides of each tag are removed and compared with the list of expected run keys. If the first five nucleotides of a tag are not in the expected set, the tag is deleted as low-quality.
-
If one or more Ns are located in the tag sequence, the read is flagged as low-quality and deleted.
-
The 5-nt run key is used to look up the amplification primers on the researcher submission form. The start of the tag sequence is compared with the list of amplification start primers (i.e., the 5' end of a forward read). If an exact match to an expected proximal primer is not found, the tag is deleted as low-quality; otherwise the run key and primer are removed from the tag sequence.
-
If an exact match to the distal primer is not located, BLASTN (with non-default parameters -q -1, -W 7, -S 1, -E 2, -F 'F', and -G 1), and the EMBOSS program fuzznuc (with non-default parameter -mismatch=3) are used for a fuzzy comparison of expected primers with the end of the sequence. If a match is found (either exact or fuzzy) the distal primer is removed from the sequence.
-
If the length of the tag sequence is less than 50 nt after removal of the run key and amplification primers, the tag is flagged as low-quality and deleted.
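The trimming steps above can be sketched as a single function. This is a simplified illustration under stated assumptions, not the production pipeline: primers are treated as literal strings without ambiguity codes, the distal primer is matched exactly only (the BLASTN and fuzznuc fallback is omitted), and all names are hypothetical.

```python
def trim_read(read, run_keys, proximal_primers, distal_primers, min_len=50):
    """Return (run_key, trimmed_sequence), or None if the read is rejected."""
    key, rest = read[:5], read[5:]
    if key not in run_keys:
        return None  # unexpected run key
    if "N" in read:
        return None  # ambiguous base call

    # The proximal primer must match exactly at the start of the tag.
    for primer in proximal_primers:
        if rest.startswith(primer):
            rest = rest[len(primer):]
            break
    else:
        return None

    # Remove the distal primer when an exact match is present.
    for primer in distal_primers:
        position = rest.find(primer)
        if position != -1:
            rest = rest[:position]
            break

    if len(rest) < min_len:
        return None  # too short after trimming
    return key, rest
```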
How is taxonomy assigned to the tags through the GAST process?
In Sogin et al. (2006), we proposed a tag mapping methodology, GAST (Global Alignment for Sequence Taxonomy) to assign a taxonomic classification to environmental V6 tags.
The steps to assigning taxonomy via GAST are as follows:
-
We create a reference database of rRNA genes, RefSSU, based on the SILVA database (Pruesse et al. 2007), with the taxonomy of each reference sequence assigned primarily by the RDP Classifier (Wang et al. 2007).
-
We create reference sets, RefVx, of hypervariable region tags (e.g., RefV6, RefV3, RefV9) by excising the hypervariable region from RefSSU using the SILVA alignment.
-
We BLAST each pyrosequencing tag against the RefVx database to generate a set of 100 best local matches to the reference database.
-
Because the top BLAST hit may not have the highest overall similarity to the tag sequence, particularly because edge-effects in the short region being compared can be pronounced, we align the tag sequence to the reference tags corresponding to the top 100 BLAST hits. We use MUSCLE (with non-default parameters - diags diags1 and -maxiters 2 to reduce processing time).
-
We calculate the global distance from the tag to each of the aligned reference sequences as the number of insertions, deletions and mismatches divided by the length of the tag. We consider the reference sequence or sequences with the minimum global distance to be the top GAST match(es). The top BLAST hit is generally the best global match; however, for 5% to 25% of tags the best global match is to a reference sequence with a lower BLAST score.
-
We identify all of the reference long sequences in Ref16S that contain the exact hypervariable sequence of the top GAST match or matches. We compile the taxonomic classifications (with RDP bootstrap values >= 80) of all these 16S sequences.
-
If two-thirds or more of the full-length sequences share the same assigned genus, the tag is assigned to that genus. If there is no such agreement, we proceed up the tree one level to family. If there is a two-thirds or better consensus at the family level, we assign this taxonomy to the tag, and if not, we continue up the tree until we achieve a two-thirds majority. Tags that do not match any reference tag by BLAST are not given a taxonomic assignment. Comparison of taxonomic assignments of hypervariable tags via GAST with the taxonomic assignments of known source full-length sequences through RDP shows a 98+% correlation.
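Two of the steps above, the global distance and the two-thirds consensus, lend themselves to a short sketch. The code below is illustrative rather than the GAST implementation: the function names are hypothetical, aligned sequences are assumed to use "-" for gaps, and each taxonomy is assumed to be a list of rank names from domain downward.

```python
from collections import Counter


def gast_distance(aligned_tag, aligned_ref):
    """Mismatches plus indels, divided by the ungapped tag length."""
    differences = sum(1 for a, b in zip(aligned_tag, aligned_ref) if a != b)
    tag_length = len(aligned_tag.replace("-", ""))
    return differences / tag_length


def consensus_taxonomy(taxonomies):
    """Deepest lineage shared by at least two-thirds of the references.

    Starts at the most specific rank and moves up one level at a time.
    Returns None when no level reaches a two-thirds majority.
    """
    total = len(taxonomies)
    if total == 0:
        return None
    depth = max(len(t) for t in taxonomies)
    for level in range(depth, 0, -1):
        votes = Counter(tuple(t[:level]) for t in taxonomies if len(t) >= level)
        if not votes:
            continue
        lineage, count = votes.most_common(1)[0]
        # Integer form of count / total >= 2/3, avoiding float rounding.
        if count * 3 >= total * 2:
            return list(lineage)
    return None
```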
Export Taxonomic Counts
How are the datasets normalized?
Normalized to Largest Dataset - For each taxonomic assignment in each dataset, the number of reads in that taxonomic assignment is multiplied by the ratio of the total number of reads in the largest dataset to the total number of reads in the current dataset.
-
Normalized Sample Count = (Actual Sample Count) * (Total of Largest Dataset) / (Total of Current Dataset)
Normalized by Percent within Datasets - The frequency of each taxonomic assignment is reported as a percent (number of reads assigned to a taxonomy over total number of reads in the dataset).
-
Normalized Sample Count = (Actual Sample Count) / (Total of Current Dataset)
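As a worked illustration of the two formulas, the sketch below applies them to counts held in nested dictionaries. The function names and the data layout are assumptions for the example. The second function returns the ratio exactly as the formula is written; multiply by 100 to express it as a percentage.

```python
def normalize_to_largest(counts_by_dataset):
    """counts_by_dataset: {dataset: {taxon: read_count}}."""
    totals = {ds: sum(taxa.values()) for ds, taxa in counts_by_dataset.items()}
    largest = max(totals.values())
    return {
        ds: {taxon: n * largest / totals[ds] for taxon, n in taxa.items()}
        for ds, taxa in counts_by_dataset.items()
    }


def normalize_by_fraction(counts_by_dataset):
    """Each count divided by the total reads in its own dataset."""
    return {
        ds: {taxon: n / sum(taxa.values()) for taxon, n in taxa.items()}
        for ds, taxa in counts_by_dataset.items()
    }
```

For example, with dataset A holding 40 reads and dataset B holding 10, a taxon with 5 reads in B is reported as 5 * 40 / 10 = 20 under the first method and as 5 / 10 = 0.5 under the second.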
How do I download the taxonomic data export files?
The export files are text files, and are not compressed. When you run the query by clicking "Get Taxonomy Data", three files will be generated. Please allow time for all three files to finish. When ready, select the desired output file and download.
-
Mac OS X / Firefox. Right-click on the file name link and choose "Save Link As...".
-
Mac OS X / Opera. Right-click on the file name link and choose "Save Linked Content As...".
-
Mac OS X / Safari. Right-click on the file name link and choose "Save Linked File As...".
-
Windows / Firefox. Right-click on the file name link and choose "Save Link As...".
-
Windows / Internet Explorer. Right-click on the file name link and choose "Save Target As...".
-
Windows / Opera. Right-click on the file name link and choose "Save Linked Content As...".
After exporting, the text files may be opened in a text editor. For Windows, the best text editor to use is Wordpad.
How do I import the taxonomic data into a spreadsheet?
-
In MS Excel for Windows, you can import a text file by selecting "Data/Import Data...".
-
In MS Excel for Mac, import a text file by selecting "Data/Get External Data/Import Text File..."
-
From the file dialog box, select the file you wish to import.
-
This brings up the Text Import Wizard. Follow steps 1 through 3:
-
Step 1 - Make sure "Delimited" is selected.
-
Step 2 - Make sure "Tab" is selected as the delimiter.
-
Step 3 - Click Finish.
Import Data
How do I import data into VAMPS so that I can view the data using the VAMPS tools?
Data can be imported into VAMPS in three different ways.
- Trimmed sequences can be uploaded to the VAMPS database directly using the
tools on this page. If you use this method you will be asked for a project and dataset
name as well as domain and which region of the 16S molecule the sequences come from. This metadata will be used to select the reference
database when you want the taxonomy assigned using the GAST system. You will also be asked to supply an environmental source for your
data which will be used to filter your dataset(s) later. It is also possible to add a new dataset to one of your projects that is already
in the VAMPS database, but you are not allowed to add to an already existing dataset. The sequence file must be fasta format
(plain text, zipped or gzipped).
- The second type of data that can be uploaded is raw sequences that you want to have trimmed using specific primers and keys
(short barcode sequences which separate datasets) which you supply (see this page to start uploading
raw sequences).
A quality file can also be included to help elucidate which sequences are to be discarded
because of low quality. See below for file formats for uploading raw sequences. The sequences and quality files must be fasta format
(plain text, zipped or gzipped) and the primers and runkeys files are tab delimited text files.
- The remaining method to upload data is to upload a file of taxonomy data that is in one of two
RDP taxonomy formats: Classifier Text format and Fasta Style format. See here for details and
here to start uploading.
What is the file format for uploading trimmed sequences?
Trimmed sequences can be imported using the NCBI FASTA format.
Each read starts with a ‘>’ and the Read_ID is between the ‘>’ and the first ‘|’ (a ‘pipe’ symbol)
or the first white space ('tab' or 'space') and cannot contain any special characters other than underscore ‘_’.
Each Read_ID must be a unique value.
If there is any other information on the definition line, it must be after the first ‘|’ (or space).
The whole definition line is separated from the sequence data by a return or linefeed.
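The definition-line rules above can be checked before upload with a short script. This is a minimal sketch based only on the rules as stated here; the function names are hypothetical and VAMPS may apply additional checks.

```python
import re


def read_id_from_defline(defline):
    """Extract and validate the Read_ID from a FASTA definition line."""
    if not defline.startswith(">"):
        raise ValueError("definition line must start with '>'")
    # The Read_ID ends at the first pipe or white space.
    read_id = re.split(r"[|\s]", defline[1:], maxsplit=1)[0]
    if not re.fullmatch(r"[A-Za-z0-9_]+", read_id):
        raise ValueError(f"invalid Read_ID: {read_id!r}")
    return read_id


def check_unique_ids(deflines):
    """Return the Read_IDs in order, raising on the first duplicate."""
    seen = set()
    ids = []
    for line in deflines:
        read_id = read_id_from_defline(line)
        if read_id in seen:
            raise ValueError(f"duplicate Read_ID: {read_id}")
        seen.add(read_id)
        ids.append(read_id)
    return ids
```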
What is the file format for uploading untrimmed sequences?
Untrimmed sequences can be imported using the same fasta file format as the trimmed sequences above.
Untrimmed sequences must be uploaded in parallel with a keys file, primers file and optional quality file.
What is the file format for uploading quality information?
The quality file also follows a format similar to the FASTA format.
Each read starts with a definition line as described above using the same Read_ID as in the sequence fasta file to match the
sequence to the quality data.
What is the file format for importing primers?
The primers can be imported in a plain text csv file (comma separated values).
The format is one line per primer, and fields are separated by a comma. All fields must be present.
The fields have to be in the following order:
name, primer_direction, original sequence
Header: The header line is allowed but not required - when present it must be: name,direction,sequence
Name: Alphanumeric, dash ("-") and underscore ("_") only. The primer name must be unique in the file.
Primer Direction: F = forward; R = reverse.
Sequence: See Primer Notations below.
Example (header line is not required):
name,direction,sequence
967F,F,CNACGCGAAGAACCTTANC
967F-UC1,F,CAACGCGAAAA+CCTTACC
967F-UC2,F,CAACGCGCAGAACCTTACC
1046R-AQ1,R,AGGTG.?TGCATGG*CCGTCG
1046R-AQ2,R,AGGTG.?TGCATGG*TCGTCG
Primer Notations
The primers are specified using the following standards:
| R = A or G | Y = C or T | W = A or T | S = C or G | K = G or T | M = A or C |
| B = C or G or T (not A) | D = A or G or T (not C) | H = A or C or T (not G) | V = A or C or G (not T) |
| N = A or C or G or T (any base, same as '.') | . = any base | [AG] = either one of the enclosed bases |
| ? = zero or one of the preceding base | * = zero or more of the preceding base | + = one or more of the preceding base |
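Since the wildcard symbols ('.', '?', '*', '+', and bracketed sets) already behave like regular-expression syntax, one way to interpret a primer is to expand only the ambiguity letters into character classes. The sketch below takes that approach; it is an assumption about how the notation can be applied, not a description of how VAMPS matches primers, and the function name is hypothetical.

```python
import re

# Ambiguity letters from the table above, expanded to regex classes.
AMBIGUITY = {
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Translate primer notation into a regular-expression string.

    Characters inside an existing [..] set are passed through unchanged,
    so ambiguity letters are not expected inside brackets.
    """
    parts = []
    in_set = False
    for char in primer.upper():
        if char == "[":
            in_set = True
        elif char == "]":
            in_set = False
        if not in_set and char in AMBIGUITY:
            parts.append(AMBIGUITY[char])
        else:
            parts.append(char)
    return "".join(parts)


def primer_matches_start(primer, sequence):
    """True if the primer pattern matches at the start of the sequence."""
    return re.match(primer_to_regex(primer), sequence.upper()) is not None
```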
What is the QIIME metadata mapping file format?
The QIIME file formats are best described here.
What is the metadata file for? (Also run_keys)
The metadata file is a csv (comma separated values) file. It is required and provides a way to include runkeys (barcodes) so that the trimmed sequences
can be de-multiplexed into separate datasets.
The metadata file contains other required data such as project and dataset names.
In the table below is an explanation of the different columns. The double quotes around each item are okay but not required.
Format (header line is required):
runkey,project,dataset,sequence_direction,project_title,project_description,dataset_description,environmental_source_id
TCATC,new_test2,two,F,"my Title","My Description","dataset description",120
ACTCG,new_test2,two,F,"my Title","My Description","dataset description",120
Here is a simple template file (headers only) which can be opened in MS Excel or OpenOffice:
Template CSV file
(right-click and save as vamps_csv_template.csv)
Column Explanations:
| runkey (barcode) | No letters besides A,G,C,T - Length: Minimum 3nt; Maximum 12nt |
| project | Name of the project: ONLY Alphanumeric and underscore '_' (no spaces). Cannot start with a number. |
| dataset | Name of the dataset: ONLY Alphanumeric and underscore '_' (no spaces). Cannot start with a number. |
| sequence_direction | NO COMMAS - Choose one: F, R or B for Forward, Reverse or Both |
| project_title | NO COMMAS - Free form brief title of the project (10 words or less). |
| project_description | NO COMMAS - All on one line, this is more in-depth than the title. Free form description of the project which can be a few sentences long. |
| dataset_description | NO COMMAS - brief description of the dataset. |
| environmental_source_id | A single id number selected from the list below. |
Environmental Sample Source IDs:
ID Sample Source
10 air
20 extreme_habitat
30 host_associated
40 human_associated
41 human-skin
42 human-oral
43 human-gut
44 human-vaginal
45 human-amniotic_fluid
46 human-urine
47 human-blood
50 microbial_mat/biofilm
60 miscellaneous_natural_or_artificial_environment
70 plant_associated
80 sediment
90 soil/sand
100 unknown
110 wastewater/sludge
120 water-freshwater
130 water-marine
140 indoor
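A metadata file can be checked against the column rules and the source ID list above before uploading. The following is a minimal sketch; the function name and the error-reporting format are hypothetical, and the server-side validation may differ.

```python
import csv
import io
import re

HEADER = [
    "runkey", "project", "dataset", "sequence_direction", "project_title",
    "project_description", "dataset_description", "environmental_source_id",
]
SOURCE_IDS = {10, 20, 30, 40, 41, 42, 43, 44, 45, 46, 47,
              50, 60, 70, 80, 90, 100, 110, 120, 130, 140}
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # cannot start with a number


def validate_metadata(text):
    """Return a list of (line_number, message) problems found in the CSV."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != HEADER:
        return [(1, "missing or incorrect header line")]
    errors = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(HEADER):
            errors.append((line_no, "wrong number of fields"))
            continue
        record = dict(zip(HEADER, row))
        if not re.fullmatch(r"[ACGT]{3,12}", record["runkey"]):
            errors.append((line_no, "runkey must be 3-12 nt of A, C, G, T"))
        for field in ("project", "dataset"):
            if not NAME.fullmatch(record[field]):
                errors.append((line_no, f"invalid {field} name"))
        if record["sequence_direction"] not in {"F", "R", "B"}:
            errors.append((line_no, "sequence_direction must be F, R or B"))
        source = record["environmental_source_id"]
        if not source.isdigit() or int(source) not in SOURCE_IDS:
            errors.append((line_no, "unknown environmental_source_id"))
    return errors
```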
Reference Databases
How are the reference databases created?
We create Ref16S, a reference database of aligned full-length sequences based on all available sequences in SILVA exported using the ARB software. New updates to both SILVA and RDP are incorporated as they become available.
-
We flag as low-quality and delete all sequences with a sequence quality score <= 50, an alignment score <= 50 or a pintail (chimera) score <= 40.
-
We flag as redundant and delete all exact copies of the full-length sequence.
-
We classify all bacterial and archaeal sequences directly with the Ribosomal Database Project (RDP) Classifier. We use only RDP classifications with a bootstrap value of >=80%. If the bootstrap value is <80%, the taxonomic assignment is moved to a higher classification level until a bootstrap value of 80% or better is reached. For example, if the genus assignment has a bootstrap value of 70% but the family has a value of 85%, that sequence is assigned only as far as family and not to genus. RDP Classifier does not classify sequences below the genus level.
-
We incorporate other taxonomy sources, such as Entrez Genome accession numbers or researcher knowledge of specific entries, as they become available. These "other sources" are used preferentially over the RDP for bacteria and archaea. RDP does not classify eukaryotes. For eukaryota taxonomies, we use the EMBL taxonomy from the SILVA database where we do not have other sources.
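The bootstrap rule described above amounts to trimming a lineage from the most specific rank upward until the remaining deepest rank meets the cutoff. A minimal sketch, assuming each classification is available as a list of (name, bootstrap) pairs ordered from domain to genus (the function name and data layout are hypothetical):

```python
def truncate_by_bootstrap(ranks, min_bootstrap=80):
    """Drop trailing ranks whose bootstrap value is below the cutoff.

    ranks: [(rank_name, bootstrap_percent), ...] from domain to genus.
    Returns the list of rank names that are kept.
    """
    kept = list(ranks)
    while kept and kept[-1][1] < min_bootstrap:
        kept.pop()
    return [name for name, _ in kept]
```

With a genus at 70% and its family at 85%, the function returns the lineage down to family only.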
We create hypervariable region specific databases (RefV6, RefV3, RefV9, etc.).
-
For each hypervariable region we calculate the Ref16S alignment coordinates.
-
We then excise from the Ref16S aligned sequences the section corresponding to the hypervariable region.
-
The gaps are removed from the aligned sequences to create a set of unaligned sequences.
-
Any hypervariable sequences that contain an 'N' are deleted.
-
Any hypervariable sequences shorter than 50 nt are deleted.
-
Any full-length sequences that were not sequenced all the way through the specific hypervariable region are deleted.
-
Two reference IDs are assigned to each reference hypervariable region. A ref16s_id (previously alt_local_gi) links the hypervariable sequence with its source full-length sequence, and a second, e.g., refv6_id (previously known as local_gi) is used to identify all entries having the exact same sequence of the hypervariable region. The taxonomy is carried directly from the full-length source.
-
Unique reference hypervariable region sequences are exported to a blastable database for use in the assignment of taxonomy to pyrosequencing reads through GAST.
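The excision and filtering steps above can be sketched as follows. This is an illustration under stated assumptions: alignment gaps are taken to be "-" characters, coordinates are zero-based alignment columns, and the check for sequences not sequenced through the whole region is omitted because it depends on how the alignment marks missing data. All names are hypothetical.

```python
def excise_region(aligned_seq, start, end, min_len=50):
    """Cut out alignment columns [start, end), degap, and filter.

    Returns the unaligned region, or None if it contains an 'N'
    or is shorter than min_len.
    """
    region = aligned_seq[start:end].replace("-", "").upper()
    if "N" in region or len(region) < min_len:
        return None
    return region


def group_identical(regions):
    """Map each distinct region sequence to the source ids that share it.

    regions: {source_id: region_sequence}
    """
    groups = {}
    for source_id, sequence in regions.items():
        groups.setdefault(sequence, []).append(source_id)
    return groups
```

Grouping identical regions mirrors the idea of one identifier per source sequence and a second identifier shared by all entries with the same hypervariable sequence.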
Tag Generation
Conserved sequences that flank the hypervariable V6 region of rRNAs serve as primer sites to generate PCR amplicons. Each PCR reaction produces products that can be informatically identified using a unique "key" incorporated between the 454 Life Sciences primer A and the 5' flanking rRNA primer. The use of a 5-bp key allows for the synthesis of as many as 81 oligonucleotides that differ by at least two sites. Our multiplexing strategy allows the concurrent collection of 10,000-50,000 tags from each of 8-40 samples in a single four-hour sequencing run without the use of partitioning gaskets, which reduce the number of sequencing wells on the PicoTiterPlate™. Amplicons can be pooled before the emPCR step, and each pool is run on a large region of the plate.
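The requirement that keys differ by at least two sites can be checked for any candidate key set with a pairwise comparison. The sketch below is illustrative only; the function names are hypothetical and it does not reproduce how the published key set was designed.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length keys differ."""
    if len(a) != len(b):
        raise ValueError("keys must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_key_distance(keys):
    """Smallest pairwise difference within a set of run keys."""
    return min(hamming(a, b) for a, b in combinations(keys, 2))
```

A key set is suitable under this rule when min_key_distance returns 2 or more, so that a single sequencing error cannot turn one key into another.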
454 Amplicon PCR (Christina Holmes/Ekaterina Andreishcheva) for five reactions:
- 130 ul water
- 16.7 ul 10X Platinum buffer
- 6.7 ul 50 mM MgSO4
- 3.4 ul 10 mM Pure Peak dNTPs
- 3.4 ul 10 uM Fusion Primer A pool
- 3.4 ul 10 uM Fusion Primer B
- 3.4 ul 2.5 U/ul Platinum HiFi Pol
- [2 ul template (~5-7 ng)*]
- 33 ul total volume/reaction
*If template stock is dilute or otherwise resistant to amplification, more template can be added in place of water.
The 5 reactions are the three replicates of the environmental template, positive control, and negative control. Template (plasmid pool for positive control; water for negative control) is added as final step.
Program:
- 94C for 2 min
- 30 cycles of (94C for 30sec, 50C for 20sec, 72C for 1min)
- 72C for 2 min
- Hold at 4C
- Visualize and quantitate the amplicons in the BioAnalyzer.
- Purify the amplicons using Qiagen MinElute columns according to the instructions, except elute using 30 ul of Buffer EB.
- Store purified amplicons at -20C.
Supplies:
- DNA 1000 Kit, Agilent, 5067-1504, 25 chips, $372
- MinElute PCR Purification Kit, Qiagen, 28004, 50 preps, $92
- PurePeak DNA Polymerization Mix 10 mM 1 ml, Pierce/ThermoFisher, NU606001, $115
What are the primers used?
To capture the full diversity of bacterial rRNA sequences, we use a cocktail of five primers at the 5' end and four primers at the 3' end of the V6 region. The purified amplicon libraries are annealed to oligonucleotides that are complementary to the Life Science A primer and tethered onto micron-size beads. The annealing conditions favor one fragment per bead. The beads are emulsified in a PCR mixture-in-oil, and PCR amplification occurs in micro-reactors, each generating approximately ten million copies of a unique DNA template. After breaking the emulsion, the DNA strands are denatured and beads carrying single-stranded DNA clones are deposited into wells of the PicoTiterPlate™ for pyrosequencing on a Roche Genome Sequencer. A typical sequencing run will generate more than 400,000 tags on a GS-FLX.
5' Primers
gcctccctcgcgccatcagNNNNNCNACGCGAAGAACCTTANC
gcctccctcgcgccatcagNNNNNCAACGCGAAAAACCTTACC
gcctccctcgcgccatcagNNNNNCAACGCGCAGAACCTTACC
gcctccctcgcgccatcagNNNNNATACGCGARGAACCTTACC
gcctccctcgcgccatcagNNNNNCTAACCGANGAACCTYACC
3' Primers
gccttgccagcccgctcagCGACAGCCATGCANCACCT
gccttgccagcccgctcagCGACAACCATGCANCACCT
gccttgccagcccgctcagCGACGGCCATGCANCACCT
gccttgccagcccgctcagCGACGACCATGCANCACCT
We have published our V6-tag sequencing technology in the following:
Sogin, M. L., H. G. Morrison, J. A. Huber, D. Mark Welch, S. M. Huse, P. R.
Neal, J. M. Arrieta, and G. J. Herndl. 2006. Microbial diversity in the deep sea and
the underexplored "rare biosphere". Proc Natl Acad Sci U S A 103:12115-20.
Huber, J. A., D. B. Mark Welch, H. G. Morrison, S. M. Huse, P. R. Neal, D. A.
Butterfield, and M. L. Sogin. 2007. Microbial population structures in the deep
marine biosphere. Science 318:97-100.
Huse, S. M., J. A. Huber, H. G. Morrison, M. L. Sogin, and D. Mark Welch.
2007. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol
8:R143.
ArchaealV6 amplification (same cycling conditions).
5' primer
gcctccctcgcgccatcagNNNNNAATTGGANTCAACGCCGG
3' primers
gccttgccagcccgctcagCGRCGGCCATGCACCWC (major)
gccttgccagcccgctcagCGRCRGCCATGYACCWC (minor)
EukaryalV9 amplification:
Details and primers are unpublished. Contact Linda Amaral Zettler for information.
Community Visualization Demo
See a short demonstration video about using the Community Visualization tool. The movie is in Quicktime format. Download
Definitions
-
Project
-
The project name refers to the overall study or research project to which the data belong. The project ties multiple samples and sequencing runs together.
-
Dataset
-
The dataset name refers to a set of sequences within the project that are from one sampling location or individual at a particular date and time. The dataset combines sequences sampled or amplified together. Sequence and taxonomic data are uploaded on a dataset by dataset basis. Multiple datasets may be combined together or compared separately when using the Community Visualization tools.
-
FASTA Files
-
The FASTA file definition line follows NCBI FASTA format.
If you upload a FASTA file, it will be run through RDP to calculate a sequence by sequence taxonomy file.
The file will be filtered for valid file format and data.
If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.
FASTA Definition Line:
Each read starts with a ‘>’ and the read ID is between the ‘>’ and the first ‘|’ (a ‘pipe’ symbol).
The read ID cannot contain any special characters other than dash ‘-’ or underscore ‘_’
and must be less than 32 characters.
If there is any other information on the definition line, it must be after the first ‘|’.
The whole definition line is separated from the sequence data by a return or linefeed.
-
RDP Taxonomy Files
-
The RDP taxonomy file complies with the RDP format as defined by the
Ribosomal Database Project (RDP).
The file will be filtered for valid file format and data.
If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.
Please note: There should be no blank lines in the header.
Computing Resources
-
Browsers
-
VAMPS has been tested mainly on the following browsers:
Mac OS X: Safari and Firefox
Linux: Firefox
Windows: Firefox
-
Software
-
JavaScript and cookies need to be enabled in your browser.
Screen resolution of at least 800x600 pixels is required and larger is recommended.
Minimum of 256 MB RAM, and 1.0 GB RAM recommended.
Java is needed and should be enabled in your browser.