Data from the Tasmanian Devil Genome Project

TABLE OF CONTENTS

Introduction
Example 1: Selecting SNPs for genotyping tumors
Example 2: Widely spaced synonymous coding SNPs
Example 3: Non-synonymous coding SNPs (SAPs) found only in the tumor
References

Introduction

Three websites provide access to the data produced by the Tasmanian devil genome project described in Miller et al. 2011:

Galaxy: usegalaxy.org: The Galaxy server provides tables of data associated with putative SNPs, both genome-wide and restricted to predicted protein-coding regions. Moreover, it supplies general-purpose tools for extracting desired subsets of those SNPs (e.g., SNP predictions with the lowest false-positive rate, or SNPs most likely to be evolving neutrally), as well as special-purpose tools to facilitate the design of particular genotyping assays.
BX Browser: main.genome-browser.bx.psu.edu: A variant of the UCSC Genome Browser that displays alignments of our reads to reference assemblies, as well as visualizing putative amino-acid variants in the context of other genome-wide data sets.
Coding SNPs: tasmaniandevil.psu.edu/coding_snps: This server allows the user to enter a gene name and receive nucleotide and/or amino-acid alignments of putative coding regions, showing differences among individuals in the species under study as well as differences from a related reference species.

The following examples illustrate these capabilities in more detail. We suggest opening the sites in another window to perform the steps as we go along.

Example 1: Selecting SNPs for genotyping tumors

Here we use Galaxy to select a set of putative SNPs and start designing some genotyping experiments.

On the top menu bar, click on "Shared Data", then on "Genome Diversity". Click on the name "Tasmanian devil SNPs" and scroll down to the bottom of the page that is displayed, looking for "Peek" and "Column Assignments". Under "Peek" you will see a few lines from the top of the file, including a line of column headers. The table has 17 columns, which are briefly described under "Column Assignments". Below we will use columns 11 (count of reads from Spirit's tumor containing the first allele), 12 (count containing the second allele), 13 (SNP quality value in the tumor sample) and 15 (distance to the closest adjacent SNP position). Use your web browser's Back button to return to the page of Genome Diversity data sets, then select the checkbox for Tasmanian devil SNPs and click "Go" (telling Galaxy to "Import to histories"). Galaxy will then ask you to select from a list of Destination Histories; select "current history", then click on "Import library datasets".
Select "Analyze Data" from the top menu bar. The right panel should show the new data set. First, we will use the general-purpose Galaxy tools to specify a subset of the SNPs, as follows. In the left panel, click on "Filter and Sort", then on "Filter" (the first option). Enter the condition
```
    c11>0 and c12>0 and c13>50 and c15>200
```
in the box provided and click "Execute". We have thus asked Galaxy to select positions that appear to be variant in the tumor sample, that have particularly high SNP quality, and that are well separated from other putative SNPs. (Separation is important when genotyping to help reduce linkage.) When Galaxy finishes running the command, the new history item in the right panel will turn green. Clicking on the item's name shows that about 130,000 SNPs meet our conditions. (Clicking on the "eye" icon will let you browse them.)
Next, we will use the special-purpose Galaxy tools to start designing genotyping assays. Here we assume the assay consists of using PCR to amplify regions containing the SNP position, then using a restriction enzyme that cuts at one variant but not the other. Start by clicking on "Genome Diversity" in the left panel, then on "Specify a set of restriction enzymes". Our filtered SNPs are already selected as the input data set by default, because that is the most recent history item having the required data format. Select some enzymes (for this example, we selected the first ten), then press "Execute" at the bottom of the panel.
When the command finishes, click on the name of the resulting history item (right panel). In our case, it reported that 4,461 SNPs were chosen. Click on the eye icon to view them, then highlight the scaffold name and SNP position from the first two columns of, say, the second line
```
    scf7180002526194   18022   A   G   8   10   88   0   28   111   ...
```
with your mouse and copy that text to the system clipboard. Open the BX Browser in another window or tab, and click on "Genomes" in the blue bar across the top. Choose "Tasmanian Devil" from the "genome" pull-down selector, then clear the "position or search term" box and paste the scaffold name and position that you copied from Galaxy into it. Edit the position to give some surrounding context, e.g. replacing the "18022" with "18002 18042", then click "submit".
By default the reads from Spirit, Spirit's tumor, and Cedric are displayed, but the browser view is very customizable. Scroll to the bottom of the page, and from the selection box for the "Restr Enzymes" track, choose "pack"; then click "refresh". This adds a track of restriction enzyme recognition sites, which shows the AluI site that caused this SNP to be selected by our Galaxy query, as well as other sites in the region.

From the Galaxy result we know the alleles are A and G, and also the allele counts obtained from Cedric, Spirit, and the tumor; for example 8 A's and 11 G's were observed in the tumor. Here we see the reads that those numbers came from, aligned to the Tasmanian devil consensus sequence given along the top of the panel. The reads are shown as gray bars (connected by thin lines if paired), and mismatches from the consensus are indicated by a nucleotide within the bar. The darker bars signify higher quality reads. Note that the counts given in the Galaxy library tables include only the reads that met the required criteria for quality, etc. and were actually used for calling the variants, whereas additional reads may be shown here (e.g. 14 G's in the tumor).
Now return to "Genome Diversity" in the left panel of your Galaxy window, and select "Extract primers for selected SNPs". Again the input data set is already correct, so just click "Execute". When that command finishes, click on its name in the history and then on the eye icon; the middle panel shows the results. Each line describing a SNP is followed by a line listing the differentially-cutting restriction enzymes (always including at least one of those specified in step 3). Somewhere between the indicated primers is a single "n", showing the SNP position (the possible nucleotides are shown in the first line). Our program for selecting restriction enzymes guarantees that for each of the listed enzymes, the only recognition site between (and including) the primer sites is at the variant position.
Note that for technical reasons, the Genome Diversity tool to "Select a specified number of SNPs" does not work on this one data set ("Tasmanian devil SNPs"), though it does work on "Tasmanian devil protein-coding SNPs" and data sets from other species (this is because the tool requires the reads to be mapped onto a reference sequence). Its role is to find the requested number of SNPs (or a few more) that are as widely separated as possible relative to the reference genome.

Example 2: Widely spaced synonymous coding SNPs

When studying population structure, we want to examine SNPs that are evolving neutrally and are well separated (to reduce linkage). We can use Galaxy to help select good candidates, as follows.

In a fashion similar to step 1 of Example 1, examine and import the data set called "Tasmanian devil protein-coding SNPs" from the Genome Diversity shared data library. This table has far fewer lines than the one from Example 1, since only about 1% of the genome codes for a protein sequence. On the other hand, it has five additional columns, numbered 5-9, giving the two amino acids (identical for a synonymous substitution), plus the chromosome, position, and amino acid in a related, assembled "reference" species, which in the case of the Tasmanian devil is the laboratory opossum Monodelphis domestica. Import this data set to your current history, as before.
Similar to step 2 of Example 1, select Analyze Data > Filter and Sort > Filter. This time enter the condition
```
    c5==c6
```
to select the synonymous SNPs.
Now go to the Genome Diversity section in the left panel and open the tool "Select a specified number of SNPs". Type "100" in the "Number of SNPs" box and click "Execute"; this will find 100 of the synonymous SNPs (or slightly more) that are as widely spaced as possible (according to the opossum genome).

Example 3: Non-synonymous coding SNPs (SAPs) found only in the tumor

For studying genotype-phenotype relationships such as those involved in devil facial tumor disease, we want to discover SNPs that affect function, and non-synonymous substitutions in coding regions are a promising place to look.

If you have not already done so, examine and import the data set called "Tasmanian devil protein-coding SNPs" from the Genome Diversity shared data library into your current Galaxy history (see step 1 of Example 2).
Similar to step 2 of Example 1, select Analyze Data > Filter and Sort > Filter. This time the default input data set might not be "Tasmanian devil protein-coding SNPs" if you have generated some other history items since then (e.g. by following Example 2), so be sure to adjust that if necessary. Use the condition
```
    c5!=c6 and c10==0 and c13==0 and c16!=0 and c18>=50 and c20>=100
```
to find some* of the non-synonymous variants observed only in the tumor, having quality at least 50 there, and not too close to another SNP. (*Quiz: why doesn't this find all of them?)
When the command finishes, examine the output as in step 4 of Example 1. The first line begins something like this:
```
    scf7180002576325   80734   A   T   Q   L   chr1   15494467   Q   ...
```
This time, highlight the chromosome and position in the orthologous opossum reference sequence (columns 7-8) as shown, and copy that text to the clipboard. Go to the BX Browser and click on "Genomes" as before, but choose "Opossum" for the genome. Paste in the position fields, edit them to give a range (e.g. "chr1 15494457 15494477"), and click "submit".

Here the SAPs are represented by green bars; in this region only the one from our query is visible. The Q allele matches the opossum reference and is conserved across all of the mammals shown in the conservation track's alignment, while the L allele is unlike any of the listed species. But interestingly, the columns from our Galaxy output showing the allele counts observed for Cedric, Spirit, and the tumor indicate that it was the conserved Q allele, not the L, that was found only in the tumor.
Click on the red bar in the Ensembl Gene Predictions track to get more information about the gene, including the gene symbol: "CWF19L1".
Then open the Coding SNPs server in another window or tab. Choose "Tasmanian devil" from the "Species" pull-down selector, type or paste the gene symbol from step 4 into the "Gene" box, and click "Get SNP info".

...

The results show alignments for each coding exon (i.e., excluding UTRs) in the requested gene where the opossum and devil sequences align sufficiently well (with at least 75% identity and no insertions or deletions) so that the putative protein sequence for the Tasmanian devil can be determined from the annotated opossum sequence. By default protein alignments are shown, but DNA alignments can be requested instead (or in addition) via checkboxes on the submission form. If the gene has multiple transcripts, additional exons are listed in a separate table for each transcript (but identical exons are not repeated). Any SNPs in these coding regions are also listed; these may be synonymous or non-synonymous (SAPs).
Here we see that the SAP from our Galaxy query falls in exon 14 of transcript ENSMODT00000013487, and that both amino acids Q and L were found in the Tasmanian devil samples. No other coding SNPs are reported in the well-aligned exons of this gene.

References

Miller W, Hayes VM, Ratan A, Petersen DC, Wittekindt NE, Miller J, Walenz B, Knight J, Qi J, Zhao F, Wang Q, Bedoya-Reina OC, Katiyar N, Tomsho LP, Kasson LR, Hardie R-A, Woodbridge P, Tindall EA, Bertelsen MF, Dixon D, Pycroft S, Helgen KM, Lesk AM, Pringle T, Patterson N, Zhang Y, Kreiss A, Woods GM, Jones M, Schuster SC (2011) Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil). Submitted.

April 2011