Using DNA Barcodes to Identify and Classify Living Things:
Bioinformatics

I. Use BLAST to Find DNA Sequences in Databases (Electronic PCR)

  1. Perform a BLAST search as follows:
    • Do an Internet search for "ncbi blast."
    • Click the link for the result BLAST: Basic Local Alignment Search Tool. This will take you to the Internet site of the National Center for Biotechnology Information (NCBI).
    • Under the heading "Web BLAST," click "Nucleotide BLAST."
    • Enter the primer set you used into the “Enter Query Sequence” search window. These are the query sequences.
    • The following primers were used in this experiment:

      Plant rbcL gene

      • rbcLa f 5’- ATGTCACCACAAACAGAGACTAAAGC-3’ (forward primer)
      • rbcLa  rev 5’- GTAAAATCAAGTCCACCRCG-3’ (reverse primer)


      Plant matK gene

      • matk-3F 5’- CGTACAGTACTTTTGTGTTTACGAG-3’ (forward primer)
      • matk-1R 5’- ACCCAGTCCATCTGGAAATCTTGGTTC-3’ (reverse primer)


      Plant ITS region

      • nrITS2-S2F 5’- ATGCGATACTTGGTGTGAAT-3’ (forward primer)
      • nrITS2-S3R 5’-GACGCTTCTCCAGACTACAAT-3’ (reverse primer)


      Plant tufA gene

      • tufA_F 5’- TGAAACAGAAMAWCGTCATTATGC-3’ (forward primer)
      • tufA_R  5’- CCTTCNCGAATMGCRAAWCGC-3’ (reverse primer)


      Vertebrate (non-fish) COI gene

      • VF1_t1 5'-TCTCAACCAACCACAAAGACATTGG-3' (forward primer)
      • VR1d_t1 5'-TAGACTTCTGGGTGGCCRAARAAYCA-3' (reverse primer)


      Vertebrate (fish) COI gene

      • VF2_t1 5'-CAACCAACCACAAAGACATTGGCAC-3' (forward primer)
      • FishR2_t1 5'-ACTTCAGGGTGACCGAAGAATCAGAA-3' (reverse primer)


      Invertebrate COI gene

      • LCO1490_F  5’-GGTCAACAAATCATAAAGATATTGG-3’ (forward primer)
      • HC02198_R  5’-TAAACTTCAGGGTGACCAAAAAATCA-3’ (reverse primer)


      Fungi ITS region

      • ITS1 F  5’-TCCGTAGGTGAACCTGCGG-3’ (forward primer)
      • ITS4 R  5’-TCCTCCGCTTATTGATATGC-3’ (reverse primer)


      Fungi (lichen-specific) ITS region

      • ITS1F_(Gad)  5’-CTTGGTCATTTAGAGGAAGTA-3’ (forward primer)
      • ITS4 R  5’-TCCTCCGCTTATTGATATGC-3’ (reverse primer)
    • Omit any non-nucleotide characters from the window because they will not be recognized by the BLAST algorithm.
    • Under "Choose Search Set," select "Nucleotide collection (nr/nt)" from the pull-down menu.
    • Under "Program Selection," optimize for "Somewhat similar sequences (blastn)."
    • Click "BLAST". This sends your query sequences to a server at the National Center for Biotechnology Information in Bethesda, Maryland. There, the BLAST algorithm will attempt to match the primer sequences to the DNA sequences stored in its database. A temporary page showing the status of your search will be displayed until your results are available. This may take only a few seconds or more than a minute if many other searches are queued at the server.
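
    The clean-up described above (omitting everything except the nucleotide letters) can be sketched in Python. This is only an illustration and is not part of the BLAST site: the function name clean_primer is hypothetical, and the sketch assumes that each primer is written as 5'-SEQUENCE-3' and that the degenerate IUPAC codes found in some of the primers (such as R, Y, M, W, and N) should be kept.

```python
import re

# IUPAC nucleotide codes, including degenerate bases such as R, Y, M, W, and N
IUPAC = "ACGTRYSWKMBDHVN"

def clean_primer(line: str) -> str:
    """Keep only the nucleotide letters between the 5' and 3' markers."""
    # Treat curly and straight apostrophes alike (5’ versus 5')
    text = line.replace("’", "'")
    match = re.search(r"5'\s*-\s*([A-Za-z\s]+?)\s*-\s*3'", text)
    if match is None:
        raise ValueError("no 5'-...-3' sequence found")
    # Remove internal whitespace and normalize to upper case
    seq = re.sub(r"\s+", "", match.group(1)).upper()
    unexpected = set(seq) - set(IUPAC)
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return seq

forward = clean_primer("rbcLa f 5’- ATGTCACCACAAACAGAGACTAAAGC-3’ (forward primer)")
reverse = clean_primer("rbcLa  rev 5’- GTAAAATCAAGTCCACCRCG-3’ (reverse primer)")
print(forward, len(forward))
print(reverse, len(reverse))
```

    The two cleaned strings can then be pasted into the query window, one after the other.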
  2. The results of the BLAST search are displayed in three ways as you scroll down the page:
    • First, a "Graphic Summary" illustrates how significant matches, or "hits," align with the query sequence. Why are some alignments longer than others?
    • This is followed by "Descriptions of sequences producing significant alignments," a table with links to database reports.
      • The accession number is a unique identifier given to a sequence when it is submitted to a database, such as GenBank®. The accession link leads to a detailed report on the sequence.
      • Note the scores in the "E value" column on the right. The Expectation or E value is the number of alignments with the query sequence that would be expected to occur by chance in the database. The lower the E value, the higher the probability that the hit is related to the query. For example, an E value of 1 means that a search with your sequence would be expected to turn up one match by chance.
      • What is the E value of your most significant hit, and what does it mean? What does it mean if there are multiple hits with similar E values?
      • What do the descriptions of significant hits have in common?
    • Next is an "Alignments" section, which provides a detailed view of each primer sequence ("Query") aligned to the nucleotide sequence of the search hit ("Subject"). Notice that hits have matches to one or both of the primers. For example:
                                Forward Primer      Reverse Primer
      Plant                     nucleotides 1-26    nucleotides 27-46
      Vertebrate (non-fish)     nucleotides 1-25    nucleotides 26-53
      Fish                      nucleotides 1-25    nucleotides 26-51
      Fungi                     nucleotides 1-19    nucleotides 20-39
      Invertebrate              nucleotides 3-25    nucleotides 26-51
  3. Predict the length of the product that the primer set would amplify in a PCR reaction (in vitro).
    • In the "Alignments" section, select a hit that matches both primer sequences.
    • Which nucleotide positions do the primers match in the subject sequence?
    • The lowest and highest nucleotide positions in the subject sequence indicate the borders of the amplified sequence. Subtracting one from the other gives the difference between the coordinates.
    • However, the PCR product includes both ends, so add 1 nucleotide to the difference you just calculated to determine the exact length of the fragment amplified by the two primers.
    • What value do you get if you calculate the fragment size for other species that have matches to the forward and reverse primer? Do you get the same number?
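
    The arithmetic in this step can be written out as a short Python sketch. The coordinates below are hypothetical and only illustrate the calculation; substitute the subject positions reported for your own hit. The function name amplicon_length is an assumption made for this example.

```python
def amplicon_length(forward_hit, reverse_hit):
    """Predict the PCR product length from two (start, end) subject coordinate pairs."""
    positions = [*forward_hit, *reverse_hit]
    # Highest minus lowest coordinate gives the difference;
    # adding 1 counts both end nucleotides.
    return max(positions) - min(positions) + 1

# Hypothetical subject coordinates for one hit
forward_hit = (1001, 1026)   # forward primer match
reverse_hit = (1599, 1580)   # reverse primer match; start > end on the opposite strand

print(amplicon_length(forward_hit, reverse_hit))  # 599
```

    Because the function uses only the lowest and highest positions, it does not matter in which order the primer coordinates are listed.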
  4. Determine the type of DNA sequence amplified by the primer set:
    • Click the accession link (beginning with "ref") to open the data sheet for the hit used in Step 3 above. Accession numbers are linked next to “Sequence ID”.
    • The data sheet has three parts:
      • The top section contains basic information about the sequence, including its length in base pairs (bp), database accession number, source, and references to papers in which the sequence is published.
      • The bottom section lists the nucleotide sequence.
      • The middle section contains annotations of gene and regulatory "FEATURES," with their beginning and ending nucleotide positions ("xx..xx"). These features may include genes, coding sequences (cds), regulatory regions, ribosomal RNA (rRNA), and transfer RNA (tRNA).
    • Identify the feature(s) located between the nucleotide positions identified by the primers, as determined in Step 3 above.

II. Determine Sequence Relationships Using the Blue Line

The following directions explain how to use the Blue Line of DNA Subway to analyze novel DNA sequences generated by a DNA sequencing experiment. If you did not sequence your own DNA sample, you can follow these directions to use DNA sequences produced for other students. You can find supplementary instructions by clicking on the "manual" link on the DNA Subway homepage.

DNA Subway is an intuitive interface for analyzing DNA barcodes. Generally, you progress in a stepwise fashion through the button "stops" on each "branch line." An "R" indicates that analysis is available. A blinking "R" indicates an analysis is in process. A "V" means that results are ready to view.

You can analyze relationships between DNA sequences by comparing them to a set of sequences you have compiled yourself, or by comparing your sequences to others that have been published in databases such as GenBank (National Center for Biotechnology Information). Generating a phylogenetic tree from DNA sequences derived from related species can also allow you to draw inferences about how these species may be related. By sequencing variable sections of DNA (barcode regions) you can also use the Blue Line to help you identify an unknown species, or publish a DNA barcode for a species you have identified, but which is not represented in published databases like GenBank (www.ncbi.nlm.nih.gov/genbank).

  1. Create a DNA Subway Project and Upload DNA Sequences
    Note: Only registered users submitting novel, high-quality sequences will be able to submit sequences to GenBank.
    • Log into DNA Subway. If you do not have an account, you will need to register first to save and share your work.
    • Select "Determine Sequence Relationships" (Blue Line) to begin a project.
    • Under “Select a project type” > “Barcoding”, create a project by selecting rbcL (plants), COI (animals), 16S (bacteria), or ITS (fungi). If you are analyzing a barcode region that is not listed, select “DNA” under “Select Project Type” > “Phylogenetics”.
    • "Select Sequence Source" provides several ways to obtain sequences for barcode analysis. Select the most appropriate way to upload your data from the following four options:
      • Upload sequence(s) in ab1 (files ending with .ab1) or FASTA format. Click "Browse" to navigate to a folder on your desktop or drive containing your sequence(s). Select a sequence by clicking on its file name. Select more than one sequence by holding down the ctrl key while clicking file names. Once you have selected the sequences you want, click "Open".
      • Enter a sequence in FASTA format. Below is an example of this format. The ">" symbol demarcates the sequence name. The sequence is started on the next line.
        >sequence name
        atcgccccttaatattgcctt…
      • Import a sequence/trace from the DNALC. If your DNA sample was sequenced by GENEWIZ, your sequence data will be automatically uploaded to this database. Search for your tracking number and click on the linked tracking number. Select one or more files from the list. Click “Add selected files”.
      • Select a sample sequence. If you do not have a file, you may select any of the available sample sequences.
    • Provide a title in the "Name Your Project" section.
    • Write a short description of your project in the "Description" section (optional).
    • Click "Continue" to load the project into DNA Subway.
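
    The FASTA format described above is simple enough to read with a few lines of Python. The sketch below is only an illustration of the format (a ">" line naming the sequence, followed by one or more lines of sequence); it is not how DNA Subway reads files, and the function name parse_fasta and the sample names are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # text after ">" is the sequence name
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}

example = """>sample_1
atcgccccttaatattgcctt
>sample_2
atcgcc
ccttaa
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```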
  2. View and Build Sequences
    There are many plants, animals, and fungi that do not have a documented barcode sequence. For instance, there are an estimated 350,000 species of angiosperms (flowering plants), but as of July 2018 there were only about 270,000 rbcL angiosperm sequences in GenBank. For other species, diversity in the barcode sequences is not well characterized. This means that there are opportunities to submit novel sequences and contribute to the global barcoding effort. Only samples that have high-quality sequence for both the forward and reverse reads ensure a low enough error rate to be published to GenBank, so the sequence quality must be checked. Sequences for which there is only one high-quality read are not considered high enough quality to publish. These sequences, and those with no high-quality sequence, can still be analyzed even though the results are not of publishable quality.
    • On the "Assemble Sequences" branch line, click "Sequence Viewer" to display the sequences you have input in the project creation section. If you did not upload trace files, you can scroll to see the sequence. If you uploaded trace files, click on the file names to view the trace files.
      • The DNA sequencing software measures the fluorescence emitted in each of four channels – A,T,C,G – and records these as a trace, or electropherogram. In a good sequencing reaction, the nucleotide at a given position will be fluorescently labeled far in excess of background (random) labeling of the other three nucleotides, producing a "peak" at that position in the trace. Thus, peaks in the electropherogram correlate to nucleotide positions in the DNA sequence.
      • A software program called Phred analyzes the sequence file and "calls" a nucleotide (A, T, C, G) for each peak. If two or more nucleotides have relatively strong signals at the same position, the software calls an "N" for an undetermined nucleotide.
      • Phred also examines the peaks around each call and assigns a quality score for each nucleotide. Each quality score is logarithmically related to the probability that the nucleotide call is wrong, or, conversely, to the accuracy of the call.
        Phred Score    Error           Accuracy
        10             1 in 10         90%
        20             1 in 100        99%
        30             1 in 1,000      99.9%
        40             1 in 10,000     99.99%
        50             1 in 100,000    99.999%
      • The electropherogram viewer represents each Phred score as a blue bar. The horizontal line equals a Phred score of 20, which is generally the cut-off for high-quality sequence. Thus any bar at or above the line is considered a high-quality read. What is the error rate and accuracy associated with a Phred score of 20?
      • Every sequence "read" begins with nucleotides (A,T,C,G) interspersed with Ns. In "clean" sequences, where experimental conditions were near optimal, the initial Ns will end within the first 25 nucleotides. The remaining sequence will have very few, if any, internal Ns. Then, at the end of the read, the sequence will abruptly change over to Ns.
      • Large numbers of Ns scattered throughout the sequence indicate poor quality sequence. Sequences with average Phred scores below 20 will be flagged with a "Low Quality Score Alert." You will need to be careful when drawing conclusions from analyses made with poor quality sequence. What do you notice about the electropherogram peaks and quality scores at nucleotide positions labeled "N"?
      • Note: The exclamation icon (!) indicates poor quality sequence.
    • Use the “X” and “Y” buttons to adjust the level of zoom. You can undo zooming by pressing the “Reset” button.
    • Examine the quality of the sequence(s). Any sample for which the forward or reverse read has the warning icon indicating a low quality score is not of good enough quality to publish. Any determination of novelty will also be tentative, because sequencing errors can look like novel polymorphisms.
    • Click “Sequence Trimmer” to trim your sequences; this automatically removes Ns from the 5’ and 3’ ends of selected sequences. Click again to view the trimmed sequences. Why is it important to remove excess Ns from the ends of the sequences?
    • If you wish to view trimmed sequences, click on the file name.
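
    The relationship between a Phred score and its error rate, shown in the table above, follows the standard formula P = 10^(-Q/10), where Q is the score and P is the probability that the call is wrong. The Python sketch below reproduces the table; the function names are chosen for this example only.

```python
import math

def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40, 50):
    p = phred_to_error(q)
    print(f"Phred {q}: about 1 error in {round(1 / p):,} calls, accuracy {1 - p:.3%}")
```

    Because the scale is logarithmic, every increase of 10 in the score means the call is ten times less likely to be wrong.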
  3. Pair and Build Consensus for Forward and Reverse Reads
    • Click “Pair Builder” to pair your forward and reverse reads. If you have two reads for a sample, pair the sequences by checking the box to the right of each read for the sample. By default, DNA Subway assumes that all reads are in the forward orientation, and displays an "F" to the right of the sequence. If any sequence is not in that orientation, click the "F" to reverse complement the sequence. The sequence will display an "R" to indicate the change. (Reverse complementing involves reversing the order of the bases in the reverse read and then changing each base to its complement. Afterward, the two reads are in the same orientation and should be identical wherever they overlap.)
    • Check the square boxes next to the reads, and a dialogue box will appear asking if you wish to designate the sequences as a pair. Alternatively, click "Try auto pairing" to pair sequences that have identical sample names with an F or R appended to indicate the sequencing direction.
    • Click "Save" to save your pair assignments.
    • Once you have created sequence pairs, click “Consensus Editor” to make a consensus sequence from both sequences in the selected pairs. To examine the consensus sequence click “Consensus Editor” again, and then click on the link to the pair you wish to examine. How does the consensus sequence optimize the amount of sequence information available for analysis? Why does this occur?
    • If there are any mismatched nucleotides between the first and second sequence, these will be highlighted yellow in the consensus editor window. Do differences tend to occur in certain areas of the sequence? Why?
      • A dash (–) is used to represent a gap in the data. In our consensus editor, the dash is used to “pad” the alignment between the forward and reverse sequences. A dash is a useful feature in an alignment because one of the possible mutations that could differentiate two related sequences is an insertion or deletion. In our case, misalignments between a forward and reverse read from the same sample are due to sequencing error. Since they are sequences from the same sample, they should be identical.
      • One recommendation on trimming at the beginning of the sequence is to trim up to the last position where one sequence has an "N" or a "–" within the first 50 or so bases. Starting from the right, you can also trim the sequence starting at the first "N" or "–" you find 75 or so base pairs from the end of the read. These recommendations are only rules of thumb. You will have to choose how strictly you wish to trim. If a trim on either end is more than 100 bp, you may have to consider the effects of discarding large amounts of sequence. Trimming 200 bp in total represents 1/3 of our approximately 600 bp of sequence.
    • Large numbers of yellow mismatches – especially in long blocks – may indicate that you have incorrectly paired sequences from two different sources (organisms), or that you failed to reverse complement the reverse strand.
      • Return to Pair Builder to check your pairs and reverse complements.
      • Click the red "X" to redo a pairing, and toggle "F" and "R" settings, as needed.
    • A large number of mismatches in properly paired and reverse complemented sequences indicates that one or both sequences are of poor quality. Often, one of the sequencing reactions produces a high quality read that can be used on its own. To determine this:
      • Examine the distribution of Ns to see if they are mainly confined to one of the two sequences.
      • Examine the electropherograms to see if one of the two sequences is of good quality.
      • If one of the sequences seems of good quality, return to Pair Builder, and click the red X to undo the pairing.
    • Few or no internal mismatches indicate good quality sequence from forward and reverse reads. If you like, you can check the consensus sequence at yellow mismatches and override the judgment made by the software:
      • Click a highlighted mismatch to see the electropherograms and Phred scores for each read.
      • Click the desired nucleotide in the black rectangle to change the consensus sequence at that position. You should only change the consensus if you have a strong reason to believe the consensus is wrong.
      • Click the button to Save Change(s).
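
    Reverse complementing and mismatch checking, as described in this step, can be illustrated with a short Python sketch. It is a simplified picture, not the method DNA Subway uses: it assumes the two reads already cover exactly the same region, so no gaps ("–") are needed, and the reads shown are hypothetical.

```python
# Translation table pairing each base with its complement (N stays N)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(a, b):
    """1-based positions at which two equally oriented reads differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]

forward_read = "ATGTCACCACAAACAG"
reverse_read = "CTGTTTGTGGTGACAT"   # hypothetical read of the opposite strand

flipped = reverse_complement(reverse_read)
print(flipped)
print(mismatches(forward_read, flipped))
```

    If the reverse read is not reverse complemented first, nearly every position is reported as a mismatch, which corresponds to the long yellow blocks described above.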
  4. BLAST Your Sequence
    A BLAST search can quickly identify any close matches to your sequence in sequence databases. In this way, you can identify an unknown sample to the genus or species level. It also provides a means to add samples for a phylogenetic analysis.
    • On the Add Sequences branch, click "BLASTN". Then, click the "BLAST" button next to the sequence you want to query against DNA databases.
    • The returned list has information about the 20 most significant alignments (hits):
      • Accession number, a unique identifier given to each sequence submitted to a database. Prefixes indicate the database name – including gb (GenBank), emb (European Molecular Biology Laboratory), and dbj (DNA Databank of Japan).
      • Organism and sequence description or gene name of the hit. Click the genus and species name for a link to an image of the organism, with additional links to detailed descriptions at Wikipedia and Encyclopedia of Life (EOL).
      • Several statistics allow comparison of hits across different searches. The number of mismatches over the length of the alignment gives a rough idea of how closely two sequences match. The bit score formula takes into account gaps in the sequence; the higher the score the better the alignment. The Expectation or E-value is the number of alignments with the query sequence that would be expected to occur by chance in the database. The lower the E-value, the higher the probability that the hit is related to the query. For example, an E-value of 0 means that a search with your sequence would be expected to turn up no matches by chance. Why do the most significant hits typically have E-values of 0? (This is not the case with BLAST searches with primers.) What does it mean when there are multiple BLAST hits with similar E-values?
      • Examine the last column in the report called “Mismatches.” For barcodes, this is an informative column, with the best hits being those with the lowest number of mismatches. Note that hits with low numbers of mismatches can sometimes be lower on the list, as the bit scores are used to arrange the hits in the table. High bit scores can occur when the alignment length is longer, even when there are more mismatches than for other hits.
      • If there are zero mismatches between your sequence and a BLAST result, it is unlikely that your sequence is unique. Instead, the identical sequences probably match because they are in the same taxonomic group as your sample. Check to see if the matching sequences are from species that seem reasonable for your sample. If your best matches include some mismatches, you may have identified a novel barcode. The more mismatches you find, the more likely that your sequence is unique, especially in regions of the sequence with high quality scores. However, sequencing errors could explain the difference, so it will be important to reexamine the trace files at any sites with mismatches to ensure that the consensus at those locations is of high quality.
    • Add BLAST sequence data to your phylogenetic analysis by checking the box(es) above any accession number(s), then clicking on "Add BLAST hits to project" at the bottom of the BLAST results window.
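
    The link between bit score and E-value can be sketched with the usual approximation E = m × n × 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. This is a simplified illustration: BLAST itself applies corrections (such as effective lengths) that are not shown here, and the lengths and scores below are hypothetical.

```python
def e_value(bit_score, query_len, db_len):
    """Approximate number of chance alignments scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)

db_len = 10 ** 11   # hypothetical database size in nucleotides

# A short primer can only reach a low bit score, so chance matches are expected
primer = e_value(bit_score=40, query_len=20, db_len=db_len)

# A long barcode read reaches a very high bit score, so the E-value is vanishingly small
barcode = e_value(bit_score=1000, query_len=600, db_len=db_len)

print(primer)
print(barcode)
```

    Values as small as the second one are, as far as we are aware, shown as 0 in the results table, which is why the best hits for a full barcode sequence typically have E-values of 0 while primer searches do not.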
  5. Add Sequences to Your Analysis
    • Click “Upload Data” to add additional sequence data to your analysis without starting a new project. Use “Upload Sequence(s)” to upload AB1 trace files or FASTA-formatted sequences stored locally on your computer; use “Enter Sequence(s)” to paste or type sequences in FASTA format.
    • If you would like to import sequences from non-local sources you can use “Import Sequence” to search a sequence database using a sequence identifier. For GenBank sequences you can search by Accession number. Search BOLD by species name, or search the DNALC sequence database by tracking number for sequences you processed with GENEWIZ through the DNALC system.
    • If your sequence is high quality and had no hits with zero mismatches, you may use NCBI BLAST to confirm that the sequence is novel. Click on the BLASTN button and then double-click on the sequence (the actual nucleotides) that you identified as possibly novel to select them. Right-click (PC) or command-click (Mac) and then select copy to move the sequence to your clipboard.
      • In a web browser go to http://blast.ncbi.nlm.nih.gov. From this page click on "Nucleotide BLAST."
      • Paste your sequence into the "Enter Query Sequence" window under "Enter accession number(s), gi(s), or FASTA sequence(s)."
      • Under "Program Selection" select "Highly similar sequences (megablast)"; next click "BLAST."
      • On the results page you will get a list of results very similar to what was returned by DNA Subway.
      • Scrolling down the page, you will find alignments of your sequence (Query) to the sequences from the closest matches in GenBank (Sbjct).
      • Analyze the results of the BLAST search, which are displayed in three ways as you scroll down the page:
        • First, a graphical overview illustrates how significant matches (hits) align with the query sequence. Matches of differing lengths are indicated by color-coded bars. For barcoding results, it is likely that most matches will be red, indicating high scores, and cover most of the width of the table, showing matches that span the length of your query sequence.
        • This is followed by a table with "Descriptions of sequences producing significant alignments” much like the table for BLAST results in DNA Subway.
        • Next is an "Alignments" section, which provides a detailed view of your sequence ("Query") aligned to the nucleotide sequence of each search hit ("Sbjct," short for "subject").
        • From the table, identify any matches that are 100% identical or any matches with high identity that appear to represent species or sequences you have not identified previously in DNA Subway. Select these sequences by clicking on the box to the left of each hit. After selecting sequences, click Download, ensure FASTA (complete sequence) is selected, and then click Continue.
        • Open the resulting FASTA file (named seqdump). Double-click the sequences to select them all, then right-click (PC) or command-click (Mac) and select copy to move the sequence to your clipboard. Add these sequences to your project using the Upload Data function, as in step 1.
      • Click on "Sequence Viewer" back on DNA Subway, and view the trace file for the forward read of your query sequence. Locate the position on your table where the query sequence differed from the GenBank match. Determine if the nucleotides you identified as different were of high quality (i.e., not likely to be sequencing errors). Because of sequence trimming, you may have to search for the polymorphic site, as the numbers from the BLAST alignment and in the trace file may not correspond.
    • You may also choose to search for your sequence at the International Barcode of Life (IBOL) database, BOLD (Barcode of Life Online Database); their records are not all in GenBank.
      • Click on the "BLAST" button and then double-click on the nucleotides for the sequence you are analyzing. Right-click (PC) or command-click (Mac) and then select copy to move the sequence to your clipboard.
      • In a web browser go to http://boldsystems.org. Click on the menu in the top right-hand corner of the webpage. Select “Identification”. This will bring you to the “Identification Engine” page.
      • Select the tab that corresponds to the appropriate kingdom for the sample (animal, plant, or fungal).
      • Under the Animal Identification [COI] tab, select “Species Level Barcode Records.” On the Fungal Identification [ITS] tab, select “ITS Sequences.” On the Plant Identification [rbcL & matK] tab, select “Plant Sequences.”
      • Paste the sequence into the search box labeled “Enter sequences in fasta format”; next click “Submit.”
      • Again, a results table is produced. The column labeled “similarity” indicates how similar your sequence is to the records in BOLD, with 100% indicating an exact match. Some records in BOLD are not public, or are not accompanied by species-level identifications. Scrolling down the list of matches you will see a pairwise alignment of your sequence (Query) to the matched sequences (Subj). Once again, identify any new hits that may be identical to your sequence. For published hits, you can download the sequence by clicking the link to the right of “Published,” then clicking “FASTA” and saving the file. This FASTA file can be uploaded, as described above, at step 1.
    • Back in DNA Subway, click “Reference Data” (optional) to include additional sequences. Depending on the project type you have created, you will have access to additional sequence data that may be of interest. For example, if you are doing a DNA barcoding project using the rbcL gene, samples of rbcL sequence from major plant groups (Angiosperms, Gymnosperms, etc.) will be provided. Choose any data set to add it to your analysis; you will be able to include or exclude individual sequences within the set in the next step.
  6. Analyze Sequences: Select and Align
    Unknown samples can potentially be identified to the species level by a BLAST search. In this case, a phylogenetic analysis adds depth to your understanding by showing how your sequence fits into a broader taxonomy of living things. If your BLAST search fails to identify your sequence, phylogenetic analysis can usually identify it to at least the family level.
    • Click “Select Data” to display all the sequences you have brought into your analysis, including “user data,” BLAST hits, or reference data. Check off sequences you wish to include in an alignment. In general, to determine the relationship of your sequence to species with known barcodes, it is best to concentrate on similar sequences. For instance, you should align sequences from samples that you believe are the same species and any close matches from database searches. You may also use the “Select all” feature to include all sequences; to deselect all sequences, click “Select all” twice. You may run new alignments or download different sequences at any time after selecting a new set of sequences.
      • To download selected sequences to a FASTA file click the “Download” button and save the resulting file.
      • Once you have selected the sequences you wish to align, you must click “Save Selections” in the blue dialog box that appears when you make any selections.
    • Click “MUSCLE” to generate the multiple sequence alignment. This software will align all sequences that were included in the “Select Data” step. Click “MUSCLE” again to open the created multiple alignment. MUSCLE is a software tool that takes several DNA sequences and repositions them (adding gaps where necessary) to generate a multiple sequence alignment (the alignment of three or more DNA sequences). Assuming the DNA sequences share a common origin, alignments of DNA sequence can reveal mutations between different sequences, including insertions, deletions, or single nucleotide polymorphisms (SNPs).
      • Click the “Trim Alignment” button to trim the alignment to a region where all the selected DNA sequences overlap. Without this trimming step, those missing regions of sequences would be interpreted by the phylogenetic tree building algorithm as true deletions in the sequence, rather than missing data. In this step, we are trimming again to account for the BLASTN matches we introduced, which may have different sequence lengths from the data we generated. Why is it important to remove sequence gaps and unaligned ends?
      • Scroll through your alignments to see similarities between sequences. "Sequence Conservation" displays a histogram across the displayed sequences. At positions where most nucleotides are the same, the histogram approaches 100%; dips in the histogram are more variable regions. "Sequence Variation" displays the nucleotides that occur at that position relative to the consensus sequence; the colors of the bars (Green = A, Red = T, Black = G, Blue = C) display what alternative nucleotide(s) appear at that position. The "Consensus" bar is light gray in positions where all sequences shown contain the same nucleotide; colors appear to indicate any nucleotide position that is not at 100% consensus with other aligned sequences. Missing sequence is indicated in each row by a dark gray block on either end of the sequence. Light gray spaces indicate agreement with the consensus sequence.
      • Note that the 5’ (leftmost) and 3’ (rightmost) ends of the sequences are usually misaligned, due to gaps (-) or undetermined nucleotides (Ns). What causes these problems?
      • Note any sequence that introduces large, internal gaps (-----) in the alignment. This is either poor quality or unrelated sequence that should be excluded from the analysis. To remove such unrelated sequences, return to Select Data, uncheck that sequence, and save your change. Then click "MUSCLE" to recalculate.
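
    Trimming an alignment to the region where all sequences overlap, and measuring conservation at each position, can be sketched in Python as follows. This is an illustration of the idea only, not the code behind the "Trim Alignment" button or the "Sequence Conservation" histogram. It assumes the sequences are already aligned to equal length with "-" marking gaps; the sequences and names are hypothetical.

```python
from collections import Counter

def trim_to_overlap(aligned):
    """Keep only the columns that lie inside every sequence."""
    # Number of leading gap characters = first covered column of each row
    starts = [len(s) - len(s.lstrip("-")) for s in aligned.values()]
    # Length without trailing gaps = one past the last covered column
    ends = [len(s.rstrip("-")) for s in aligned.values()]
    left, right = max(starts), min(ends)
    return {name: s[left:right] for name, s in aligned.items()}

def conservation(aligned):
    """Fraction of rows sharing the most common character in each column."""
    rows = list(aligned.values())
    return [Counter(col).most_common(1)[0][1] / len(rows) for col in zip(*rows)]

aligned = {
    "sample": "--ATGCCGTA--",
    "hit_1":  "GGATGCCGTATT",
    "hit_2":  "---TGACGTAT-",
}

trimmed = trim_to_overlap(aligned)
for name, seq in trimmed.items():
    print(name, seq)
print(conservation(trimmed))
```

    In the trimmed result, the one column where hit_2 differs from the other two sequences has a conservation of two thirds, while every other column is fully conserved.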
  7. Analyze Sequences: Create a Phylogenetic Tree
    • A phylogenetic tree is a graphical representation of relationships between taxonomic groups. In this experiment, a gene tree is determined by analyzing the similarities and differences in DNA sequence. What assumptions are made when one infers evolutionary relationships from sequence differences?
    • Click "PHYLIP ML" to generate a phylogenetic tree using the maximum likelihood method. A tree will open in a new window.
    • For the "Select Outgroup" function, select the species that is least closely related to the others in your selection. Note: determining the outgroup might require background research. The outgroup will be different depending on the sequences being compared.
    • Look at your tree.
      • Trees consist of branch tips that are labeled with the name of the sequence and/or organism you analyzed. Two branches are connected to each other by a node. A node represents the point at which descendants from an inferred common ancestor diverged into different lineages.
      • The length of each branch is a measure of the evolutionary distance from the ancestral sequence at the node. Species or sequences with short branches from a node are closely related; those with longer branches are more distantly related.
      • A group formed by a common ancestor and its descendants is called a clade. Related clades, in turn, are connected by nodes to make larger, less-closely related clades.
      • Generally, the clades will follow established phylogenetic relationships ascending from genus > family > order > class > phylum. However, gene and phylogenetic trees do disagree on some placements, and much research is focused on "reconciling" these differences. Why do gene and phylogenetic trees sometimes disagree?
    • Find and evaluate your sequence’s position in the tree.
      • If your sequence is closely related to any of the reference or uploaded sequences, it will share a single node with those species.
      • If your sequence is identical to another sequence, the two will diverge directly from the node without branches.
      • If your sequence is distantly related to all of the species in your tree, your sequence will sit on a branch by itself – with the other sequences grouping together as a clade.
      • To identify the smallest clade that includes your sequence, click the node that is directly connected to your sequence. The sequences that are highlighted are the closest relatives of your sequence in the tree.
      • Look at the scientific names of sequences within the most closely associated clade. If all members share the same genus name, you have identified your sequence as belonging to that genus. If different genus names are represented, check and see if they belong to the same family or order.
    • Return to the menu, and click "PHYLIP NJ" to generate a phylogenetic tree using the neighbor joining method. How does it compare to the maximum likelihood tree? What does this tell you?
      • To find the most likely tree and determine the reliability of the branches in this tree, NJ in DNA Subway uses bootstrapping, or resampling of the sequence data. Bootstrapping is a computational technique for assessing the accuracy of a statistical estimate. In bootstrapping, the columns in the sequence alignment are randomly resampled (with replacement) over and over to make many new alignments (100 for NJ in DNA Subway), and these alignments are used to construct NJ trees. The final tree represents the “most likely” tree and shows the confidence of relationships with bootstrap levels. Each bootstrap value is the number of times that particular relationship appears in the 100 resampled trees. The values do not represent the distance between sequences. Instead, a higher value indicates that a branch of the tree is well supported, while low values indicate that the relationships are less certain. In general, bootstrap values above 70 can be considered plausible given the data, and values above 95 can be considered “correct.”
    • If neither tree places your sequence within an identifiable clade – or if that clade is only at order level – you will need to add more sequences that may increase the resolution of your analysis. Return to Step 5, and add more reference sequences or obtain sequences within the order or family clade that contained your sequence. Then repeat Steps 6-7 to select, align, and generate trees from your refined data set.
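The bootstrapping idea described in this step (resample alignment columns, rebuild, count how often a relationship reappears) can be illustrated with a small Python sketch. It is deliberately simplified and is not what DNA Subway or PHYLIP runs: instead of building a neighbor joining tree for each replicate, it only asks which reference is the nearest neighbor of the query by the proportion of differing positions. The function names and the toy sequences are hypothetical.

```python
import random


def p_distance(a, b):
    """Fraction of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest(query, refs, columns):
    """Name of the reference closest to the query over the chosen columns."""
    def pick(seq):
        return [seq[i] for i in columns]

    q = pick(query)
    # Sorting the names makes tie-breaking deterministic (alphabetical).
    return min(sorted(refs), key=lambda name: p_distance(q, pick(refs[name])))


def bootstrap_support(query, refs, replicates=100, seed=0):
    """Count how often each reference is the nearest neighbor across resamples."""
    rng = random.Random(seed)
    length = len(query)
    counts = {name: 0 for name in refs}
    for _ in range(replicates):
        # Resample alignment columns with replacement to make a new alignment.
        columns = [rng.randrange(length) for _ in range(length)]
        counts[nearest(query, refs, columns)] += 1
    return counts


# Hypothetical trimmed alignment: all sequences have the same length.
query = "ATGCCGTAACGTTAGC"
refs = {
    "species_a": "ATGCCGTAACGTTAGT",  # differs from the query at 1 position
    "species_b": "ATGACGTTACGATAGC",  # differs at 3 positions
    "species_c": "TTGACCTTAGGATCGC",  # differs at 7 positions
}
support = bootstrap_support(query, refs)
print(support)
```

With 100 replicates, each count reads like a bootstrap value: the number of resampled alignments in which that reference came out closest to the query. A count near 100 means the relationship holds almost regardless of which columns are sampled; a much lower count means it depends on a few positions.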
  8. Exporting Sequences to GenBank

    If you do not identify any identical hits through searches in DNA Subway, GenBank, and BOLD and you have determined that your sequence is of high quality, you may have a novel sequence.

    Once you have identified a potentially novel sequence there are additional steps that you can take, including publishing your sequence to GenBank through DNA Subway. It is not required that a sequence be novel to publish it to GenBank. However, discretion should be used, and sequences that are already present in GenBank multiple times for a particular species or without vetted metadata (definitive species identification, collection information, etc.) should not be published.

    Note: Only high-quality consensus sequences that were generated by the submitter and have not been previously submitted can be exported to GenBank.

    • Click “Export to GenBank” in the project window.
    • Click “New submission.” If you are working with an animal sample, specify whether it is from a vertebrate, invertebrate, or echinoderm. Then click “Proceed.”
    • If you have already entered information about your sample in the DNALC Barcoding Sample Database, enter the sample’s code number, and its information will be retrieved automatically. If not, you can enter the sample information manually in the next step. Click “Continue.”
    • Verify and fill in the information required in the “Specimen info” window; click “Continue.”
    • Add photos of the sample if you have any available.
    • Verify your submission information, make any appropriate changes if necessary, and finally click “Submit.” You will receive a notification that your sequence has been submitted to NCBI and a specialist there will check it. If your submission passes NCBI’s verification procedure, you will receive a notification that your sequence has been published in GenBank.

Visit the CyVerse DNA Subway Guide for a walk-through of the DNA Subway Blue Line.

Answers