DNA Learning Center Barcoding 101

Barcoding Protocol

Using DNA Barcodes to Identify and Classify Living Things:

Bioinformatics

I. Use BLAST to Find DNA Sequences in Databases (Electronic PCR)

Perform a BLAST search as follows:

Do an Internet search for "ncbi blast."
Click the link for the result BLAST: Basic Local Alignment Search Tool. This will take you to the Internet site of the National Center for Biotechnology Information (NCBI).
Under the heading "Web BLAST," click "Nucleotide BLAST."
Enter the primer set you used into the “Enter Query Sequence” search window. These are the query sequences.

The following primers were used in this experiment:

Plant rbcL gene

rbcLa f 5’- ATGTCACCACAAACAGAGACTAAAGC-3’ (forward primer)
rbcLa rev 5’- GTAAAATCAAGTCCACCRCG-3’ (reverse primer)

Plant matK gene

matk-3F 5’- CGTACAGTACTTTTGTGTTTACGAG-3’ (forward primer)
matk-1R 5’- ACCCAGTCCATCTGGAAATCTTGGTTC-3’ (reverse primer)

Plant ITS region

nrITS2-S2F 5’- ATGCGATACTTGGTGTGAAT-3’ (forward primer)
nrITS2-S3R 5’-GACGCTTCTCCAGACTACAAT-3’ (reverse primer)

Plant tufA gene

tufA_F 5’- TGAAACAGAAMAWCGTCATTATGC-3’ (forward primer)
tufA_R 5’- CCTTCNCGAATMGCRAAWCGC-3’ (reverse primer)

Vertebrate (non-fish) COI gene

VF1_t1 5'-TCTCAACCAACCACAAAGACATTGG-3' (forward primer)
VR1d_t1 5'-TAGACTTCTGGGTGGCCRAARAAYCA-3' (reverse primer)

Vertebrate (fish) COI gene

VF2_t1 5'-CAACCAACCACAAAGACATTGGCAC-3' (forward primer)
FishR2_t1 5'-ACTTCAGGGTGACCGAAGAATCAGAA-3' (reverse primer)

Invertebrate COI gene

LCO1490_F 5’-GGTCAACAAATCATAAAGATATTGG-3’ (forward primer)
HC02198_R 5’-TAAACTTCAGGGTGACCAAAAAATCA-3’ (reverse primer)

Fungi ITS region

ITS1 F 5’-TCCGTAGGTGAACCTGCGG-3’ (forward primer)
ITS4 R 5’-TCCTCCGCTTATTGATATGC-3’ (reverse primer)

Fungi (lichen-specific) ITS region

ITS1F_(Gad) 5’-CTTGGTCATTTAGAGGAAGTA-3’ (forward primer)
ITS4 R 5’-TCCTCCGCTTATTGATATGC-3’ (reverse primer)

Omit any non-nucleotide characters from the window because they will not be recognized by the BLAST algorithm.
Under "Choose Search Set," select "Nucleotide collection (nr/nt)" from the pull-down menu.
Under "Program Selection," optimize for "Somewhat similar sequences (blastn)."
Click "BLAST". This sends your query sequences to a server at the National Center for Biotechnology Information in Bethesda, Maryland. There, the BLAST algorithm will attempt to match the primer sequences to the DNA sequences stored in its database. A temporary page showing the status of your search will be displayed until your results are available. This may take only a few seconds or more than a minute if many other searches are queued at the server.
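The step about omitting non-nucleotide characters can be sketched in a few lines of Python. This is only an illustration: the function name `extract_primer` and the rule "the first run of at least 10 IUPAC nucleotide letters" are assumptions made for this example, not part of BLAST or the protocol.

```python
import re

# IUPAC nucleotide codes, including the degenerate letters (R, Y, M, W, N, ...)
# that appear in some of the primers listed above.
# Assumption: a primer is the first run of 10 or more such uppercase letters.
IUPAC_RUN = re.compile(r"[ACGTRYKMSWBDHVN]{10,}")


def extract_primer(line: str) -> str:
    """Return only the nucleotide string from a primer line.

    Labels, 5'/3' marks, hyphens, spaces, and the "(forward primer)" note
    are all dropped because they are not part of the matched run.
    """
    match = IUPAC_RUN.search(line)
    if match is None:
        raise ValueError(f"no primer sequence found in: {line!r}")
    return match.group(0)


forward = extract_primer("rbcLa f 5’- ATGTCACCACAAACAGAGACTAAAGC-3’ (forward primer)")
reverse = extract_primer("rbcLa rev 5’- GTAAAATCAAGTCCACCRCG-3’ (reverse primer)")

# Paste both cleaned sequences into the query window.
print(forward)
print(reverse)
```

The two cleaned plant rbcL primers are 26 and 20 nucleotides long, which is consistent with the query coordinates 1-26 and 27-46 shown for plants in the "Alignments" example.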

The results of the BLAST search are displayed in three ways as you scroll down the page:

First, a "Graphic Summary" illustrates how significant matches, or "hits," align with the query sequence. Why are some alignments longer than others?
This is followed by "Descriptions of sequences producing significant alignments," a table with links to database reports.

The accession number is a unique identifier given to a sequence when it is submitted to a database, such as GenBank®. The accession link leads to a detailed report on the sequence.
Note the scores in the "e" column on the right. The Expectation or E value is the number of alignments with the query sequence that would be expected to occur by chance in the database. The lower the E value, the higher the probability that the hit is related to the query. For example, an E value of 1 means that a search with your sequence would be expected to turn up one match by chance.
What is the E value of your most significant hit, and what does it mean? What does it mean if there are multiple hits with similar E values?
What do the descriptions of significant hits have in common?
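To get a feel for how E values behave, the sketch below uses the commonly cited approximation E = m × n × 2^(−S'), where m is the query length, n is the total database length, and S' is the bit score. The database size and bit scores here are hypothetical numbers chosen for illustration, and NCBI BLAST applies additional length corrections that are not modeled.

```python
def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical database size (total nucleotides); not the real size of nr/nt.
DATABASE_LENGTH = 10**12

# A short primer-pair query with a modest bit score (hypothetical values).
primer_e = expected_chance_hits(bit_score=40.0, query_length=46, database_length=DATABASE_LENGTH)

# A long, near-perfect match to a full barcode sequence (hypothetical values).
barcode_e = expected_chance_hits(bit_score=1000.0, query_length=600, database_length=DATABASE_LENGTH)

print(f"primer-sized query:  E ~ {primer_e:.3g}")
print(f"barcode-sized query: E ~ {barcode_e:.3g}")
```

Under these assumptions, the short query yields an E value above 1, while the long, high-scoring alignment yields a value so small that a results table would round it to 0.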

Next is an "Alignments" section, which provides a detailed view of each primer sequence ("Query") aligned to the nucleotide sequence of the search hit ("Subject"). Notice that hits have matches to one or both of the primers. For example:

	Forward Primer	Reverse Primer
Plant	nucleotides 1-26	nucleotides 27-46
Vertebrate (non-fish)	nucleotides 1-25	nucleotides 26-53
Fish	nucleotides 1-25	nucleotides 26-51
Fungi	nucleotides 1-19	nucleotides 20-39
Invertebrate	nucleotides 3-25	nucleotides 26-51

Predict the length of the product that the primer set would amplify in a PCR reaction (in vitro).

In the "Alignments" section, select a hit that matches both primer sequences.
Which nucleotide positions do the primers match in the subject sequence?
The lowest and highest nucleotide positions in the subject sequence indicate the borders of the amplified sequence. Subtracting one from the other gives the difference between the coordinates.
However, the PCR product includes both end positions, so add 1 nucleotide to the difference you just calculated to determine the exact length of the fragment amplified by the two primers.
What value do you get if you calculate the fragment size for other species that have matches to the forward and reverse primer? Do you get the same number?
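The fragment-length arithmetic described above can be written as a small function. The subject coordinates in the example are hypothetical; substitute the positions you read from your own "Alignments" section.

```python
def amplicon_length(forward_match: tuple[int, int], reverse_match: tuple[int, int]) -> int:
    """Length of the PCR product implied by two primer matches on one subject.

    Each argument is the (start, end) subject coordinates of one primer match.
    The reverse primer usually aligns to the opposite strand, so its
    coordinates may be listed high-to-low; only the extremes matter.
    """
    positions = [*forward_match, *reverse_match]
    # Highest minus lowest gives the distance between the borders;
    # adding 1 counts both border nucleotides.
    return max(positions) - min(positions) + 1


# Hypothetical coordinates for one hit that matches both primers.
print(amplicon_length((54321, 54346), (54919, 54900)))  # prints 599
```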

Determine the type of DNA sequence amplified by the primer set:

Click the accession link (beginning with "ref") to open the data sheet for the hit used in Question 3 above. Accession Numbers will be linked next to “Sequence ID”.
The data sheet has three parts:

The top section contains basic information about the sequence, including its basepair (bp) length, database accession number, source, and references to papers in which the sequence is published.
The bottom section lists the nucleotide sequence.
The middle section contains annotations of gene and regulatory "FEATURES," with their beginning and ending nucleotide positions ("xx..xx"). These features may include genes, coding sequences (cds), regulatory regions, ribosomal RNA (rRNA), and transfer RNA (tRNA).

Identify the feature(s) located between the nucleotide positions identified by the primers, as determined in 3.b. above.

II. Determine Sequence Relationships Using the Blue Line

The following directions explain how to use the Blue Line of DNA Subway 2.0 to analyze novel DNA sequences generated by a DNA sequencing experiment. If you did not sequence your own DNA sample, you can follow these directions to use DNA sequences produced for other students. You can find supplementary instructions by clicking the “Help” link on the DNA Subway 2.0 navigation bar.

DNA Subway 2.0 is an intuitive interface for analyzing DNA barcodes. Generally, you progress by scrolling through the sections of each “Stop.”

You can analyze relationships between DNA sequences by comparing them to a set of sequences you have compiled yourself, or by comparing your sequences to others that have been published in databases such as GenBank (National Center for Biotechnology Information). Generating a phylogenetic tree from DNA sequences derived from related species can also allow you to draw inferences about how these species may be related. By sequencing variable sections of DNA (barcode regions) you can also use the Blue Line to help you identify an unknown species, or publish a DNA barcode for a species you have identified, but which is not represented in published databases like GenBank (www.ncbi.nlm.nih.gov/genbank).

Create a DNA Subway 2.0 Project and Upload DNA Sequences
Note: Only registered users submitting novel, high-quality sequences will be able to submit sequences to GenBank.

Log into DNA Subway 2.0. If you do not have an account, you will need to register first to save and share your work.
Select "Create Project → Blue Line" to begin a project.
Under "Sequencing Type," select "Sanger."
Provide a title in the "Project Title" section.
Write a short description of your project in the "Description (optional)" section.
Click "Create Project" to load the project into DNA Subway 2.0.
"Import Sequences" provides several ways to obtain sequences for barcode analysis. Select the most appropriate way to upload your data from the four options displayed, or choose one from "More Options":

Select a sample sequence. If you do not have a file, you may select any of the available sample sequences.
Import from "DNALC Trace Files." If your DNA sample was processed with Azenta Life Sciences/GENEWIZ, your sequence data will be automatically uploaded to this database. Search for your tracking number and click on the linked number to see sequence files. Select one or more files from the list. Click “Add to Project.”
Upload sequence(s) in ab1 (files ending with .ab1) or FASTA format. Click “Upload from Device” to navigate to a folder on your desktop or drive containing your sequence(s). Select a sequence by clicking on its file name. Select more than one sequence by holding down the ctrl key while clicking file names. Once you have selected the sequences you want, click "Open".
Select a previously saved sequence from your “Sequence Collection.”
Under “More Options,” choose "Paste Sequence" and enter a sequence in FASTA format. Below is an example of this format. The ">" symbol demarcates the sequence name. The sequence is started on the next line.
>sequence name
atcgccccttaatattgcctt…
If you would like to import sequences from non-local sources, open “More Options” to search a sequence database using a sequence identifier. For GenBank® sequences you can search by Accession number. Search BOLD by Process ID.
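If you want to check a FASTA file before pasting or uploading it, a minimal reader looks like the sketch below. The sample names and sequences are invented for illustration, and the function is not DNA Subway's own parser.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {sequence name: sequence}.

    A line starting with ">" names a new sequence; every following line,
    up to the next ">" line, is appended to that sequence.
    """
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


# Invented example input: two short records, the first split over two lines.
example = """>sample_1_F
atcgcccctt
aatattgcctt
>sample_1_R
ggcaatatta
"""

for seq_name, seq in parse_fasta(example).items():
    print(seq_name, len(seq), seq)
```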

View and Build Sequences
There are many plants, animals, and fungi that do not have a documented barcode sequence. For instance, there are an estimated 350,000 species of angiosperms (flowering plants), but as of July 2018 there were only about 270,000 rbcL angiosperm sequences in GenBank. For other species, diversity in the barcode sequences is not well characterized. This means that there are opportunities to submit novel sequences and contribute to the global barcoding effort. Only samples that have high quality sequence for both the forward and reverse reads are good enough to ensure a low error rate and can be published to GenBank, so the sequence quality must be checked. Sequences for which there is only one high quality read are not considered high enough quality to publish. These sequences, and those with no high quality sequence, can still be analyzed even though the results are not of publishing quality.

View the "Select" stop to display the sequences you have input. You can click “View Sequence” in the kebab menu to see the sequence. If you uploaded trace files, click “ABI View” to view the trace files.

The DNA sequencing software measures the fluorescence emitted in each of four channels – A,T,C,G – and records these as a trace, or electropherogram. In a good sequencing reaction, the nucleotide at a given position will be fluorescently labeled far in excess of background (random) labeling of the other three nucleotides, producing a "peak" at that position in the trace. Thus, peaks in the electropherogram correlate to nucleotide positions in the DNA sequence.
A software program called Phred analyzes the sequence file and "calls" a nucleotide (A, T, C, G) for each peak. If two or more nucleotides have relatively strong signals at the same position, the software calls an "N" for an undetermined nucleotide.

Phred also examines the peaks around each call and assigns a quality score to each nucleotide. The quality score is a logarithmic measure of the probability that the nucleotide call is wrong or, conversely, of the accuracy of the call.

Phred Score	Error	Accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1,000	99.9%
40	1 in 10,000	99.99%
50	1 in 100,000	99.999%
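The table follows the standard Phred relationship Q = −10 × log10(P), where P is the probability that a call is wrong. The sketch below converts in both directions and regenerates the rows of the table; the function names are chosen for this example.

```python
import math


def phred_to_error(quality: float) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-quality / 10)


def error_to_phred(error_probability: float) -> float:
    """Phred score corresponding to a given error probability."""
    return -10 * math.log10(error_probability)


# Rebuild the rows of the table: score, error rate, accuracy.
for quality in (10, 20, 30, 40, 50):
    p = phred_to_error(quality)
    print(quality, f"1 in {round(1 / p):,}", f"{(1 - p) * 100:.5g}%")
```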

The electropherogram viewer represents each Phred score as a blue bar. The horizontal line equals a Phred score of 20 by default, which is generally the cut-off for high-quality sequence. Thus any bar at or above the line is considered a high-quality read. What is the error rate and accuracy associated with a Phred score of 20?
Every sequence "read" begins with nucleotides (A,T,C,G) interspersed with Ns. In "clean" sequences, where experimental conditions were near optimal, the initial Ns will end within the first 25 nucleotides. The remaining sequence will have very few, if any, internal Ns. Then, at the end of the read, the sequence will abruptly change over to Ns.
Large numbers of Ns scattered throughout the sequence indicate poor quality sequence. Sequences with average Phred scores below 20 will be flagged with a "Low Quality Score Alert." You will need to be careful when drawing conclusions from analyses made with poor quality sequence. What do you notice about the electropherogram peaks and quality scores at nucleotide positions labeled "N"?
Note: The exclamation icon (!) indicates poor quality sequence.

Use the “X” and “Y” buttons to adjust the level of zoom. You can undo zooming by pressing the “Reset” button.
Examine the quality of the sequence(s). Any sequence for which the forward or reverse read has the warning icon indicating a low quality score is not of good enough quality to publish, and any determination of novelty will be tentative because sequencing errors could appear to be novel polymorphisms.
Click “Auto Trim All” in the “Trim Sequences” step to trim your sequences; this automatically removes Ns from the 5’ and 3’ ends of selected sequences. Alternatively, you can trim each sequence manually. Why is it important to remove excess Ns from the ends of the sequences?
If you wish to view trimmed sequences, select one from the drop-down menu.
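One simple way to trim N-rich ends is sketched below. It is not DNA Subway's actual "Auto Trim All" algorithm: the rule used here (cut each end back to the first window of `window` bases containing no more than `max_n` Ns) and both parameter values are assumptions chosen for illustration.

```python
def trim_ends(read: str, window: int = 12, max_n: int = 0) -> str:
    """Trim N-rich 5' and 3' ends from a sequence read.

    From each end, scan inward until a window of `window` bases contains
    at most `max_n` Ns, and discard everything outside those two windows.
    Returns an empty string if no such window exists.
    """
    read = read.upper()

    def first_clean_offset(seq: str) -> int:
        for i in range(len(seq) - window + 1):
            if seq[i:i + window].count("N") <= max_n:
                return i
        return len(seq)  # no acceptable window found

    start = first_clean_offset(read)
    if start == len(read):
        return ""
    # Scan the reversed read to find how much to cut from the 3' end.
    end_offset = first_clean_offset(read[::-1])
    return read[start:len(read) - end_offset]


# Invented read: noisy start, clean middle, Ns at the end.
raw = "NNANNTNGN" + "ACGTACGTACGTACGTACGT" + "NNCNNNN"
print(trim_ends(raw))  # prints ACGTACGTACGTACGTACGT
```

Internal Ns are left in place by this sketch; only the ends are cut back.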

Pair and Build Consensus for Forward and Reverse Reads

Go to the “Pair Sequences” step to pair your forward and reverse reads. If you have two reads for a sample, pair the sequences by clicking the F or R symbol to the left of each read for the sample.
If you click a symbol next to two reads, a dialogue box opens asking if you wish to designate the sequences as a pair. To manually pair, click "Confirm" to save the pair assignment.
Alternatively, click "Auto Pair All" to pair sequences that have identical sample names except for an appended F or R indicating the sequencing direction.
Once you have created sequence pairs, go to the “Edit Consensus” step to edit consensus sequences for the selected pairs. Select the pair you wish to examine. How does the consensus sequence optimize the amount of sequence information available for analysis? Why does this occur?
If there are any mismatched nucleotides between the first and second sequence, these will be highlighted yellow in the consensus editor window. Do differences tend to occur in certain areas of the sequence? Why?

A dash (–) is used to represent a gap in the data. In our consensus editor, the dash is used to “pad” the alignment between the forward and reverse sequences. A dash is a useful feature in an alignment because one of the possible mutations that could differentiate two related sequences is an insertion or deletion. In our case, misalignments between a forward and reverse read from the same sample are due to sequencing error. Since they are sequences from the same sample, they should be identical.
One recommendation on trimming at the beginning of the sequence is to trim up to the last position where one sequence has an "N" or a "–" within the first 50 or so bases. Starting from the right, you can also trim the sequence starting at the first "N" or "–" you find 75 or so base pairs from the end of the read. These recommendations are only rules of thumb. You will have to choose how strictly you wish to trim. If a trim on either end is more than 100 bp, you may have to consider the effects of discarding large amounts of sequence. Trimming 200 bp in total represents 1/3 of our approximately 600 bp of sequence.

Large numbers of yellow mismatches – especially in long blocks – may indicate that you have incorrectly paired sequences from two different sources (organisms), or that you failed to reverse complement the reverse strand.
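The effect of a missing reverse complement is easy to reproduce. In the sketch below, the reads are invented and assumed to be the same length and already aligned position by position; a real consensus editor also has to handle gaps and offsets.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(first: str, second: str) -> int:
    """Count positions where two aligned reads disagree, ignoring Ns."""
    return sum(
        1
        for a, b in zip(first.upper(), second.upper())
        if a != b and "N" not in (a, b)
    )


# Invented example: the reverse read covers the same region on the other strand.
forward_read = "ATGGCCATTGTAATGGGCCGC"
reverse_read = reverse_complement(forward_read)

print("reverse read used as-is:   ", count_mismatches(forward_read, reverse_read))
print("after reverse complementing:", count_mismatches(forward_read, reverse_complement(reverse_read)))
```

Comparing the reverse read as-is produces many mismatches, while reverse complementing it first produces none for these error-free example reads.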

Return to the Pair Sequences step to check your pairs and reverse complements.
Click "Undo Pair" in the kebab menu next to a pair to redo a pairing, and click "F" and "R" appropriately, as needed.

A large number of mismatches in properly paired and reverse complemented sequences indicates that one or both sequences are of poor quality. Often, one of the sequencing reactions produces a high quality read that can be used on its own. To determine this:

Examine the distribution of Ns to see if they are mainly confined to one of the two sequences.
Examine the electropherograms to see if one of the two sequences is of good quality.
If one of the sequences seems to be of good quality, return to Pair Sequences, and click "Undo Pair" in the options menu to undo the pairing.

Few or no internal mismatches indicate good quality sequence from forward and reverse reads.

BLAST Your Sequence
A BLAST search can quickly identify any close matches to your sequence in sequence databases. In this way, you can identify an unknown sample to the genus or species level. It also provides a means to add samples for a phylogenetic analysis.

In the “Analyze” stop, go to "BLASTn". Select the sequences you want to query against DNA databases, then click the “BLAST” button.
The returned list has information on up to the 50 most significant alignments (hits), including:

Accession number, a unique identifier given to each sequence submitted to a database. Prefixes indicate the database name – including gb (GenBank), emb (European Molecular Biology Laboratory), and dbj (DNA Data Bank of Japan).
Organism and sequence description or gene name of the hit. Click the genus and species name for a link to an image of the organism, with additional links to detailed descriptions at Wikipedia and Encyclopedia of Life (EOL).
Several statistics allow comparison of hits across different searches. The number of mismatches over the length of the alignment gives a rough idea of how closely two sequences match. The bit score formula takes into account gaps in the sequence; the higher the score the better the alignment. The Expectation or E-value is the number of alignments with the query sequence that would be expected to occur by chance in the database. The lower the E-value, the higher the probability that the hit is related to the query. For example, an E-value of 0 means that a search with your sequence would be expected to turn up no matches by chance. Why do the most significant hits typically have E-values of 0? (This is not the case with BLAST searches with primers.) What does it mean when there are multiple BLAST hits with similar E-values?
Examine the last column in the report called “Mismatches.” For barcodes, this is an informative column, with the best hits being those with the lowest number of mismatches. Note that hits with low numbers of mismatches can sometimes be lower on the list, as the bit scores are used to arrange the hits in the table. High bit scores can occur when the alignment length is longer, even when there are more mismatches than for other hits.
If there are zero mismatches between your sequence and a BLAST result, it is unlikely that your sequence is unique. Instead, the identical sequences probably match because they are in the same taxonomic group as your sample. Check to see if the matching sequences are from species that seem reasonable for your sample. If your best matches include some mismatches, you may have identified a novel barcode. The more mismatches you find, the more likely that your sequence is unique, especially in regions of the sequence with high quality scores. However, sequencing errors could explain the difference, so it will be important to reexamine the trace files at any sites with mismatches to ensure that the consensus at those locations is of high quality.
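The point about bit-score ordering versus mismatches can be illustrated by re-sorting a hit table. The accessions and statistics below are invented; in practice you would copy the values from your own BLAST results.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    bit_score: float
    alignment_length: int
    mismatches: int


# Invented hits: the longest alignment has the top bit score despite 6 mismatches.
hits = [
    Hit("HYPO0001", bit_score=1050.0, alignment_length=600, mismatches=6),
    Hit("HYPO0002", bit_score=1010.0, alignment_length=560, mismatches=0),
    Hit("HYPO0003", bit_score=990.0, alignment_length=560, mismatches=3),
]

# Order used by the results table: highest bit score first.
by_bit_score = sorted(hits, key=lambda h: h.bit_score, reverse=True)

# Alternative order for barcodes: fewest mismatches first,
# breaking ties in favor of the longer alignment.
by_mismatches = sorted(hits, key=lambda h: (h.mismatches, -h.alignment_length))

print([h.accession for h in by_bit_score])
print([h.accession for h in by_mismatches])
```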

Add BLAST sequence data to your phylogenetic analysis by checking the box(es) next to any accession number(s), then clicking on "Add to Project" at the bottom of the BLAST results window.

Add Sequences to Your Analysis

Return to the “Select” stop to add additional sequence data to your analysis without starting a new project.
If your sequence is high quality and had no hits with zero mismatches in the BLASTn step, which searches a local database, you may use NCBI BLAST, which searches a larger database, to confirm that the sequence is novel. In the “Edit Consensus” step, select a pair, click the “View Consensus” button, and then click “Copy to Clipboard” to copy the sequence (the actual nucleotides) that you identified as possibly novel.

In a web browser go to http://blast.ncbi.nlm.nih.gov. From this page click on "Nucleotide BLAST."
Paste your sequence into the "Enter Query Sequence" window under "Enter accession number(s), gi(s), or FASTA sequence(s)."
Under "Program Selection" select "Highly similar sequences (megablast)"; next click "BLAST."
On the results page you will get a list of results very similar to what was returned by DNA Subway 2.0.
Scrolling down the page, you will find alignments of your sequence (Query) to the sequences from the closest matches in GenBank (Sbjct).
Analyze the results of the BLAST search, which are displayed in three ways as you scroll down the page:

First, a graphical overview illustrates how significant matches (hits) align with the query sequence. Matches of differing lengths are indicated by color-coded bars. For barcoding results, it is likely that most matches will be red, indicating high scores, and cover most of the width of the table, showing matches that span the length of your query sequence.
This is followed by a table with "Descriptions of sequences producing significant alignments," much like the table for BLAST results in DNA Subway.
Next is an "Alignments" section, which provides a detailed view of your sequence ("Query") aligned to the nucleotide sequence of the search hit ("Sbjct," "subject").
From the table, identify any matches that are 100% identical or any matches with high identity that appear to represent species or sequences you have not identified previously in DNA Subway 2.0. Select these sequences by clicking on the box to the left of each hit. After selecting sequences, click Download, ensure FASTA (complete sequence) is selected, and then click Continue.
Open the resulting FASTA file (named seqdump). Double-click the sequences to select them all, then right-click (PC) or control-click (Mac) and select “Copy” to copy the sequences to your clipboard. Add these sequences to your project using the Upload Data function, as in step 1.

Go back to the "Select" stop on DNA Subway 2.0, and view the trace file for the forward read of your query sequence. Locate the position in your trace file where the query sequence differed from the GenBank match. Determine whether the nucleotides you identified as different were of high quality (i.e., not sequencing errors). Because of sequence trimming, you may have to search for the polymorphic site, as the position numbers in the BLAST alignment and in the trace file may not correspond.

You may also choose to search for your sequence in BOLD (Barcode of Life Data System), the database of the International Barcode of Life (iBOL) project; not all of its records are in GenBank.

In the “Edit Consensus” step, select a pair, click the “View Consensus” button, and then click “Copy to Clipboard.”
In a web browser go to http://boldsystems.org. Click on the menu in the top right-hand corner of the webpage. Select “Identification”. This will bring you to the “Identification Engine” page.
Select the tab that corresponds to the appropriate kingdom for the sample (animal, plant, or fungal).
Under the Animal Identification [COI] tab, select “Species Level Barcode Records.” On the Fungal Identification [ITS] tab, select “ITS Sequences.” On the Plant Identification [rbcL & matK] tab, select “Plant Sequences.”
Paste the sequence into the search box labeled “Enter sequences in fasta format”; next click “Submit.”
Again, a results table is produced. The column labeled “similarity” indicates how similar your sequence is to the records in BOLD, with 100% indicating an exact match. Some records in BOLD are not public, or are not accompanied by species-level identifications. Scrolling down the list of matches, you will see a pairwise alignment of your sequence (Query) to the matched sequences (Subj). Once again, identify any new hits that may be identical to your sequence. For published hits, you can download the sequence by clicking the link to the right of “Published,” then clicking “FASTA” and saving the file. This FASTA file can be uploaded, as described above, at step 1.

Back in DNA Subway 2.0, click “Add Reference Data” (optional) within the MUSCLE step to include additional sequences. You will have access to additional sequence data that may be of interest. For example, if you are doing a DNA barcoding project using the rbcL gene, you can choose samples of rbcL sequence from major plant groups (Angiosperms, Gymnosperms, etc.). Choose any data set to add it to your analysis; you will be able to include or exclude individual sequences within the set in the next step.

Analyze Sequences: Select and Align
Unknown samples can potentially be identified to the species level by a BLAST search. In this case, a phylogenetic analysis adds depth to your understanding by showing how your sequence fits into a broader taxonomy of living things. If your BLAST search fails to identify your sequence, phylogenetic analysis can usually identify it to at least the family level.

In the MUSCLE step, select from a list of the sequences you have brought into your analysis, including “user data,” BLAST hits, or reference data sequences you wish to include in an alignment. In general, to determine the relationship of your sequence to species with known barcodes, it is best to concentrate on similar sequences. For instance, you should align sequences from samples that you believe are the same species and any close matches from database searches. You may run new alignments at any time by selecting a new set of sequences.
Click “MUSCLE” to generate the multiple sequence alignment. This software will align all sequences that were selected. You can then view the created multiple alignment. MUSCLE is a software tool that takes several DNA sequences and repositions them (adding gaps where necessary) to generate a multiple sequence alignment (the alignment of three or more DNA sequences). Assuming the DNA sequences share a common origin, alignments of DNA sequence can reveal mutations between different sequences, including insertions, deletions, or single nucleotide polymorphisms (SNPs).

Click the “Auto Trim” button to trim the alignment to a region where all the selected DNA sequences overlap. Without this trimming step, those missing regions of sequences would be interpreted by the phylogenetic tree building algorithm as true deletions in the sequence, rather than missing data. In this step, we are trimming again to account for the BLASTN matches we introduced, which may have different sequence lengths from the data we generated. Why is it important to remove sequence gaps and unaligned ends?
Scroll through your alignments to see similarities between sequences. "Sequence Conservation" displays a histogram across the displayed sequences. At positions where most nucleotides are the same, the histogram approaches 100%; dips in the histogram are more variable regions. "Sequence Variation" displays the nucleotides that occur at that position relative to the consensus sequence; the colors of the bars (Green = A, Red = T, Black = G, Blue = C) display what alternative nucleotide(s) appear at that position. The "Consensus" bar is light gray in positions where all sequences shown contain the same nucleotide; colors appear to indicate any nucleotide position that is not at 100% consensus with other aligned sequences. Missing sequence is indicated in each row by a dark gray block on either end of the sequence. Light gray spaces indicate agreement with the consensus sequence.
Note that the 5’ (leftmost) and 3’ (rightmost) ends of the sequences are usually misaligned, due to gaps (-) or undetermined nucleotides (Ns). What causes these problems?
Note any sequence that introduces large, internal gaps (-----) in the alignment. This is either poor quality or unrelated sequence that should be excluded from the analysis. To remove such unrelated sequences, return to Select Data, uncheck that sequence, and save your change. Then click "MUSCLE" to recalculate.
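The trimming and conservation ideas in this section can be sketched as follows. The alignment is invented, the gap character is assumed to be "-", and the functions only approximate what "Auto Trim" and the "Sequence Conservation" histogram show; they are not DNA Subway's own code.

```python
from collections import Counter


def trim_to_overlap(alignment: dict[str, str]) -> dict[str, str]:
    """Keep only the columns where every sequence has started and not yet ended.

    Leading and trailing gap padding marks missing data, so the alignment is
    cut to the region that all sequences span.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    start = max(len(seq) - len(seq.lstrip("-")) for seq in alignment.values())
    end = min(len(seq.rstrip("-")) for seq in alignment.values())
    return {name: seq[start:end] for name, seq in alignment.items()}


def conservation(alignment: dict[str, str]) -> list[float]:
    """Percent of sequences sharing the most common character in each column."""
    columns = zip(*alignment.values())
    return [100.0 * Counter(col).most_common(1)[0][1] / len(col) for col in columns]


# Invented alignment: the sample is shorter than the database hits on both ends.
aligned = {
    "sample": "--ACGTACGTAC--",
    "hit_1":  "TTACGTACGTACGG",
    "hit_2":  "-TACGAACGTACG-",
}

trimmed = trim_to_overlap(aligned)
for name, seq in trimmed.items():
    print(f"{name:7s}{seq}")
print([round(score, 1) for score in conservation(trimmed)])
```

In this example, one column of the trimmed alignment drops to about 67% conservation because hit_2 carries an A where the other two sequences carry a T.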

Analyze Sequences: Create a Phylogenetic Tree

A phylogenetic tree is a graphical representation of relationships between taxonomic groups. In this experiment, a gene tree is determined by analyzing the similarities and differences in DNA sequence. What assumptions are made when one infers evolutionary relationships from sequence differences?
Under “PHYLIP ML,” select as the “outgroup” the species in your selection that is least closely related to the others. Note: determining the outgroup might require background research. The outgroup will be different depending on the sequences being compared. Click “Submit” to generate a phylogenetic tree using the maximum likelihood method. A tree will display below.
Look at your tree.

Trees consist of branch tips that are labeled with the name of the sequence and/or organism you analyzed. Two branches are connected to each other by a node. A node represents the point at which descendants from an inferred common ancestor diverged into different lineages.
The length of each branch is a measure of the evolutionary distance from the ancestral sequence at the node. Species or sequences with short branches from a node are closely related; those with longer branches are more distantly related.
A group formed by a common ancestor and its descendants is called a clade. Related clades, in turn, are connected by nodes to make larger, less-closely related clades.
Generally, the clades will follow established phylogenetic relationships ascending from genus > family > order > class > phylum. However, gene and phylogenetic trees do disagree on some placements, and much research is focused on "reconciling" these differences. Why do gene and phylogenetic trees sometimes disagree?

Find and evaluate your sequence’s position in the tree.

If your sequence is closely related to any of the reference or uploaded sequences, it will share a single node with those species.
If your sequence is identical to another sequence, the two will diverge directly from the node without branches.
If your sequence is distantly related to all of the species in your tree, your sequence will sit on a branch by itself – with the other sequences grouping together as a clade.
To identify the smallest clade that includes your sequence, click the node that is directly connected to your sequence. The sequences that are highlighted are the closest relatives of your sequence in the tree.
Look at the scientific names of sequences within the most closely associated clade. If all members share the same genus name, you have identified your sequence as belonging to that genus. If different genus names are represented, check and see if they belong to the same family or order.

Select an outgroup and click "PHYLIP NJ" to generate a phylogenetic tree using the neighbor joining method. How does it compare to the maximum likelihood tree? What does this tell you?

To find the most likely tree and determine the reliability of the branches in this tree, NJ in DNA Subway uses bootstrapping, or resampling of the sequence data. Bootstrapping is a computational technique for assessing the accuracy of a statistical estimate. In bootstrapping, the columns in the sequence alignment are randomly resampled over and over to make many new alignments – 100 for NJ in DNA Subway – and these alignments are used to construct NJ trees. The final tree represents the “most likely” tree and shows the confidence of relationships with bootstrap levels. Each bootstrap value is the number of times that particular relationship appears in the 100 resampled trees. The values do not represent the distance between sequences. Instead, a higher value indicates that a branch of the tree is well supported, while low values indicate that the relationships are less certain. In general bootstrap values above 70 might be considered as plausible given the data, and above 95 can be considered “correct.”
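The resampling step of bootstrapping can be sketched as follows. Only the column resampling is shown; building an NJ tree from each replicate and counting how often each grouping appears is left out. The alignment, the function names, and the fixed random seed are assumptions made for this example.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Build one bootstrap replicate by resampling alignment columns.

    Columns are drawn at random with replacement until the replicate has as
    many columns as the original; the same columns are used for every sequence.
    """
    names = list(alignment)
    n_columns = len(alignment[names[0]])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def bootstrap_replicates(alignment: dict[str, str], n_replicates: int = 100, seed: int = 0) -> list[dict[str, str]]:
    """Generate n_replicates resampled alignments (100 matches the text above)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Invented trimmed alignment of three sequences.
original = {
    "sample": "ACGTACGTAC",
    "hit_1":  "ACGTACGTAC",
    "hit_2":  "ACGAACGTAC",
}

replicates = bootstrap_replicates(original)
print(len(replicates), "replicates; first one:")
for name, seq in replicates[0].items():
    print(f"{name:7s}{seq}")
```

A tree would then be built from each replicate, and the bootstrap value for a grouping would be the number of replicate trees in which that grouping appears.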

If neither tree places your sequence within an identifiable clade – or if that clade is only at order level – you will need to add more sequences that may increase the resolution of your analysis. Return to Step 5, and add more reference sequences or obtain sequences within the order or family clade that contained your sequence. Then repeat Steps 6-7 to select, align, and generate trees from your refined data set.

Exporting Sequences to GenBank

If you do not identify any identical hits through searches in DNA Subway 2.0, GenBank, and BOLD and you have determined that your sequence is of high quality, you may have a novel sequence.

Once you have identified a potentially novel sequence there are additional steps that you can take, including publishing your sequence to GenBank through DNA Subway. It is not required that a sequence be novel to publish it to GenBank. However, discretion should be used, and sequences that are already present in GenBank multiple times for a particular species or without vetted metadata (definitive species identification, collection information, etc.) should not be published.

Note: Only high quality consensus sequences that have been generated by the submitter and have not been previously submitted can be exported to GenBank. Only users with enhanced permissions may export sequences to GenBank. To gain enhanced permissions, go to the settings page for your DNA Subway 2.0 account and verify your email address. Then return to the settings page, request enhanced permissions, and wait for DNALC to approve the request. Only then can you proceed.
- In the “Select” stop, choose to save the consensus sequence you wish to export to your sequence collection (“Save to Collection”).
- Go to the “Sequences” tab of the “Dashboard.”
- Click the ⓘ icon next to your saved consensus sequence.
- Click the “Export to GenBank” button.
- Verify and fill in the information required; click “Review Submission”.
- Verify your submission information, make any appropriate changes if necessary, and finally click “Submit.” You will receive a notification that your sequence has been submitted to NCBI and a specialist there will check it. If your submission passes NCBI’s verification procedure, you will receive a notification that your sequence has been published in GenBank.

Visit the DNA Subway 2.0 Help for a walk-through of the DNA Subway Blue Line.

Answers