Tasmanian Devil Genome Project: Sequencing

We wanted to find the locations in the Tasmanian devil genome where Cedric differs from Spirit. We started with long 454 reads, which were used to build a reference sequence for the devil genome with the CABOG assembler [1]. Illumina sequencing delivers large numbers of short reads from an individual's genome at a much lower cost per base than 454, so we used it to sequence Cedric to an average depth of 16.7X and Spirit to an average depth of 32.2X. We also sequenced a tumor taken from Spirit to 19.7X coverage. The reads were 76, 80, or 82 bp long, with short insert lengths of about 300 bp.
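
For readers unfamiliar with the "X" notation, average depth is commonly estimated as the total number of sequenced bases divided by the length of the reference. The sketch below illustrates that arithmetic only. The function name and the numbers in the usage line are hypothetical and are not taken from the project, and the estimate ignores unmapped reads and trimmed bases.

```python
def mean_depth(read_count: int, read_length: int, genome_length: int) -> float:
    """Estimate average fold coverage as total sequenced bases / genome length.

    This is the textbook approximation; it assumes every read has the same
    length and that every base of every read maps to the reference.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_count * read_length / genome_length


# Hypothetical numbers for illustration only (not project data):
# one million 76 bp reads over a 4 Mbp reference.
example_depth = mean_depth(read_count=1_000_000, read_length=76, genome_length=4_000_000)
print(f"{example_depth:.1f}X")
```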

We aligned these short reads to the CABOG assembly with the short-read aligner BWA [2] version 0.5.8a, allowing up to four differences per alignment. The reads were soft-trimmed toward the 3’ end so that low-quality bases were not used in mapping. SNPs were then called using SAMtools [3] version 0.1.12a. SNP calls should not be trusted in regions of the reference that are covered by too few or too many reads. High coverage generally signals an error in the assembly, or may be the signature of a structural variant that needs to be handled separately. For this reason, Cedric SNPs were called only in regions with coverage between 4 and 53, and Spirit SNPs only in regions with coverage between 4 and 72. We also discarded SNPs with a SNP quality below 30. This process yielded 558,270 SNPs for Cedric and 864,664 SNPs for Spirit relative to the assembled reference. We then analyzed those positions to determine which were true variants in the Illumina data, rather than artifacts of differing biases between the sequencing platforms or errors in the CABOG assembly; there were 914,827 locations where at least two distinct nucleotides were identified in Cedric and Spirit. A similar process was employed for the reads from the tumor.
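
The depth and quality filters described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually run: the `SnpCall` record and `filter_snps` function are hypothetical names, the input is assumed to be already parsed from the SNP caller's output, and the coverage bounds are assumed to be inclusive (the text does not say).

```python
from typing import Iterable, List, NamedTuple


class SnpCall(NamedTuple):
    """One candidate SNP (hypothetical record; fields chosen for illustration)."""
    contig: str
    position: int
    depth: int    # number of reads covering the position
    quality: int  # SNP quality reported by the caller


# Coverage windows and quality cutoff as stated in the text.
DEPTH_LIMITS = {"Cedric": (4, 53), "Spirit": (4, 72)}
MIN_SNP_QUALITY = 30


def filter_snps(calls: Iterable[SnpCall], sample: str) -> List[SnpCall]:
    """Keep calls whose depth lies in the sample's window and whose quality is >= 30."""
    low, high = DEPTH_LIMITS[sample]
    return [
        call
        for call in calls
        # Bounds are treated as inclusive; this is an assumption.
        if low <= call.depth <= high and call.quality >= MIN_SNP_QUALITY
    ]


# Made-up calls for illustration only.
candidates = [
    SnpCall("contig1", 10, depth=3, quality=50),   # too shallow for both samples
    SnpCall("contig1", 20, depth=4, quality=30),   # passes for both samples
    SnpCall("contig1", 30, depth=60, quality=40),  # too deep for Cedric, fine for Spirit
    SnpCall("contig1", 40, depth=20, quality=29),  # quality below cutoff
]
cedric_kept = filter_snps(candidates, "Cedric")
spirit_kept = filter_snps(candidates, "Spirit")
```

Keeping the per-sample windows in one table makes it clear that the upper bound scales with each animal's sequencing depth (53 for Cedric at 16.7X, 72 for Spirit at 32.2X) while the lower bound and quality cutoff are shared.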

References

[1] Miller JR, et al. (2008) Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24:2818–2824.

[2] Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25:1754–1760.

[3] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25:2078–2079.