Difference between revisions of "Genome"
Revision as of 16:32, 9 February 2012
Genome
Genome sequencing brought to light the first complete description of a cell's components. The topic is of ever-increasing prominence with the advent of technologies that can sequence entire eukaryotic genomes in less than a week, at a cost of less than a thousand dollars. Beyond the assembly and maintenance of such large amounts of data, active topics include data interpretation via automated annotation algorithms and data access through tools such as genome browsers.
Introductory reading
Bakke et al. (2009) Evaluation of three automated genome annotations for Halorhabdus utahensis. PLoS ONE 4:e6291. (pmid: 19617911)
Genome annotations are accumulating rapidly and depend heavily on automated annotation systems. Many genome centers offer annotation systems but no one has compared their output in a systematic way to determine accuracy and inherent errors. Errors in the annotations are routinely deposited in databases such as NCBI and used to validate subsequent annotation errors. We submitted the genome sequence of halophilic archaeon Halorhabdus utahensis to be analyzed by three genome annotation services. We have examined the output from each service in a variety of ways in order to compare the methodology and effectiveness of the annotations, as well as to explore the genes, pathways, and physiology of the previously unannotated genome. The annotation services differ considerably in gene calls, features, and ease of use. We had to manually identify the origin of replication and the species-specific consensus ribosome-binding site. Additionally, we conducted laboratory experiments to test H. utahensis growth and enzyme activity. Current annotation practices need to improve in order to more accurately reflect a genome's biological potential. We make specific recommendations that could improve the quality of microbial annotation projects.
Petty (2010) Genome annotation: man versus machine. Nat Rev Microbiol 8:762. (pmid: 20948549)
Malone & Oliver (2011) Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 9:34. (pmid: 21627854)
Microarrays first made the analysis of the transcriptome possible, and have produced much important information. Today, however, researchers are increasingly turning to direct high-throughput sequencing -- RNA-Seq -- which has considerable advantages for examining transcriptome fine structure -- for example in the detection of allele-specific expression and splice junctions. In this article, we discuss the relative merits of the two techniques, the inherent biases in each, and whether all of the vast body of array work needs to be revisited using the newer technology. We conclude that microarrays remain useful and accurate tools for measuring expression levels, and RNA-Seq complements and extends microarray measurements.
Contents
- Genome sequencing and assembly
- Human - current: GRCh37 (2009). With the next iteration, genome coordinates will change (again)
- Reference genome
  - Genome Reference Consortium
  - 1000 Genomes Project
  - $1000 Genome: Life Technologies' Ion Torrent and Illumina's HiSeq
- Genome annotation
- Genome browsers working with genome-scale information
- Programmatic access to genome sequences
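As a pointer for the last item, programmatic access can be sketched with NCBI's E-utilities `efetch` endpoint, which returns sequence records as plain FASTA text. This is a minimal sketch and not a method prescribed by this page: the function names `efetch_url`, `parse_fasta` and `fetch_region` are hypothetical helpers introduced here, and the accession in the usage note is an assumption to verify before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, start=None, stop=None):
    """Build an efetch URL for one nucleotide record in FASTA format.

    start/stop (1-based, inclusive) restrict the request to a subrange;
    they are only added when both are given.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    if start is not None and stop is not None:
        params["seq_start"] = start
        params["seq_stop"] = stop
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop after the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks).upper()


def fetch_region(accession, start, stop):
    """Download a subrange of a record (requires network access)."""
    with urlopen(efetch_url(accession, start, stop)) as response:
        return parse_fasta(response.read().decode("ascii"))
```

A call such as `fetch_region("NC_000001.10", 100000, 100060)` would request 61 bases; the accession is assumed here to be the RefSeq record for chromosome 1 of GRCh37. Because the version suffix pins the assembly, the same numeric coordinates on a later assembly's record would generally point to different sequence, which is the coordinate change noted in the list above.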
Exercises
Pevsner (2009) Analysis of genomic DNA with the UCSC genome browser. Methods Mol Biol 537:277-301. (pmid: 19378150)
Genomic DNA is being sequenced and annotated at a rapid rate, with terabases of DNA currently deposited in GenBank and other repositories. Genome browsers provide an essential collection of resources to visualize and analyze chromosomal DNA. The University of California, Santa Cruz (UCSC) Genome Browser provides annotations from the level of single nucleotides to whole chromosomes for four dozen metazoan and other species. The Genome Browser may be used to address a wide range of problems in bioinformatics (e.g., sequence analysis), comparative genomics, and evolution.
Further reading and resources
Harbison et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature 431:99-104. (pmid: 15343339)
DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression. Comparative genomics has recently been used to identify potential cis-regulatory sequences within the yeast genome on the basis of phylogenetic conservation, but this information alone does not reveal if or when transcriptional regulators occupy these binding sites. We have constructed an initial map of yeast's transcriptional regulatory code by identifying the sequence elements that are bound by regulators under various conditions and that are conserved among Saccharomyces species. The organization of regulatory elements in promoters and the environment-dependent use of these elements by regulators are discussed. We find that environment-specific use of regulatory elements predicts mechanistic models for the function of a large population of yeast's transcriptional regulators.
Karolchik et al. (2007) Comparative genomic analysis using the UCSC genome browser. Methods Mol Biol 395:17-34. (pmid: 17993665)
Comparative analysis of DNA sequence from multiple species can provide insights into the function and evolutionary processes that shape genomes. The University of California Santa Cruz (UCSC) Genome Bioinformatics group has developed several tools and methodologies in its study of comparative genomics, many of which have been incorporated into the UCSC Genome Browser (http://genome.ucsc.edu), an easy-to-use online tool for browsing genomic data and aligned annotation "tracks" in a single window. The comparative genomics annotations in the browser include pairwise alignments, which aid in the identification of orthologous regions between species, and conservation tracks that show measures of evolutionary conservation among sets of multiply aligned species, highlighting regions of the genome that may be functionally important. A related tool, the UCSC Table Browser, provides a simple interface for querying, analyzing, and downloading the data underlying the Genome Browser annotation tracks. Here, we describe a procedure for examining a genomic region of interest in the Genome Browser, analyzing characteristics of the region, filtering the data, and downloading data sets for further study.
Pop & Salzberg (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24:142-9. (pmid: 18262676)
New DNA sequencing technologies can sequence up to one billion bases in a single day at low cost, putting large-scale sequencing within the reach of many scientists. Many researchers are forging ahead with projects to sequence a range of species using the new technologies. However, these new technologies produce read lengths as short as 35-40 nucleotides, posing challenges for genome assembly and annotation. Here we review the challenges and describe some of the bioinformatics systems that are being proposed to solve them. We specifically address issues arising from using these technologies in assembly projects, both de novo and for resequencing purposes, as well as efforts to improve genome annotation in the fragmented assemblies produced by short read lengths.
Yang et al. (2010) Annotation confidence score for genome annotation: a genome comparison approach. Bioinformatics 26:22-9. (pmid: 19855104)
MOTIVATION: The massively parallel sequencing technology can be used by small research labs to generate genome sequences of their research interest. However, annotation of genomes still relies on the manual process, which becomes a serious bottleneck to the high-throughput genome projects. Recently, automatic annotation methods are increasingly more accurate, but there are several issues. One important challenge in using automatic annotation methods is to distinguish annotation quality of ORFs or genes. The availability of such annotation quality of genes can reduce the human labor cost dramatically since manual inspection can focus only on genes with low-annotation quality scores. RESULTS: In this article, we propose a novel annotation quality or confidence scoring scheme, called Annotation Confidence Score (ACS), using a genome comparison approach. The scoring scheme is computed by combining sequence and textual annotation similarity using a modified version of a logistic curve. The most important feature of the proposed scoring scheme is to generate a score that reflects the excellence in annotation quality of genes by automatically adjusting the number of genomes used to compute the score and their phylogenetic distance. Extensive experiments with bacterial genomes showed that the proposed scoring scheme generated scores for annotation quality according to the quality of annotation regardless of the number of reference genomes and their phylogenetic distance. AVAILABILITY: http://microbial.informatics.indiana.edu/acs
Picardi & Pesole (2010) Computational methods for ab initio and comparative gene finding. Methods Mol Biol 609:269-84. (pmid: 20221925)
High-throughput DNA sequencing is increasing the amount of public complete genomes even though a precise gene catalogue for each organism is not yet available. In this context, computational gene finders play a key role in producing a first and cost-effective annotation. Nowadays a compilation of gene prediction tools has been made available to the scientific community and, despite the high number, they can be divided into two main categories: (1) ab initio and (2) evidence based. In the following, we will provide an overview of main methodologies to predict correct exon-intron structures of eukaryotic genes falling in such categories. We will take into account also new strategies that commonly refine ab initio predictions employing comparative genomics or other evidence such as expression data. Finally, we will briefly introduce metrics to in house evaluation of gene predictions in terms of sensitivity and specificity at nucleotide, exon, and gene levels as well.
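The nucleotide-level metrics mentioned at the end of this abstract can be illustrated with a short sketch. The abstract itself gives no formulas, so the definitions below are an assumption: sensitivity is taken as TP/(TP+FN) and specificity as TP/(TP+FP), a convention common in the gene-finding literature (elsewhere this second quantity is called precision). The function name `nucleotide_accuracy` and the interval representation are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Nucleotide-level sensitivity and specificity of a gene prediction.

    annotated, predicted: iterables of (start, end) coding intervals,
    1-based and inclusive, on the same sequence.
    Assumed definitions (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)   # gene-finding convention
    """
    def positions(intervals):
        # Expand intervals into the set of covered base positions.
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        return covered

    truth = positions(annotated)
    calls = positions(predicted)
    true_positives = len(truth & calls)

    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(calls) if calls else 0.0
    return sensitivity, specificity


# One annotated exon at 1-10, one predicted exon at 6-15:
# 5 of 10 annotated bases are found, 5 of 10 predicted bases are correct.
print(nucleotide_accuracy([(1, 10)], [(6, 15)]))  # (0.5, 0.5)
```

Expanding intervals into position sets keeps the sketch simple but is only practical for short test sequences; genome-scale comparisons would need interval arithmetic instead.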
Kislyuk et al. (2010) A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics 26:1819-26. (pmid: 20519285)
MOTIVATION: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data. RESULTS: We present a self-contained, automated high-throughput open source genome sequencing and computational genomics pipeline suitable for prokaryotic sequencing projects. The pipeline has been used at the Georgia Institute of Technology and the Centers for Disease Control and Prevention for the analysis of Neisseria meningitidis and Bordetella bronchiseptica genomes. The pipeline is capable of enhanced or manually assisted reference-based assembly using multiple assemblers and modes; gene predictor combining; and functional annotation of genes and gene products. Because every component of the pipeline is executed on a local machine with no need to access resources over the Internet, the pipeline is suitable for projects of a sensitive nature. Annotation of virulence-related features makes the pipeline particularly useful for projects working with pathogenic prokaryotes. AVAILABILITY AND IMPLEMENTATION: The pipeline is licensed under the open-source GNU General Public License and available at the Georgia Tech Neisseria Base (http://nbase.biology.gatech.edu/). The pipeline is implemented with a combination of Perl, Bourne Shell and MySQL and is compatible with Linux and other Unix systems.
Montgomery et al. (2010) Annotating the regulatory genome. Methods Mol Biol 674:313-49. (pmid: 20827601)
Determining the timing and molecular repertoire responsible for gene expression is fundamental to understanding a gene's function. Heritable differences in this character are increasingly regarded as explanatory for complex and common traits. For many known trait-predisposing genes, studies have sought to elucidate the associated logic behind gene regulation. However, there exist many challenges in deciphering these mechanisms. Among them, it is recognized that we have limited understanding of regulatory complexity, the current models of gene regulation have low specificity and any gene's regulatory logic is dependent on biological context. Addressing these limitations and defining the regulatory genome is an ongoing challenge for molecular biology. We discuss current efforts to define and annotate the regulatory genome by focusing on curation and text-mining activities. We further highlight the type of information and curation process for describing regulatory elements within the ORegAnno database (www.oreganno.org) and how the general standards for such information are changing.
Nagarajan & Pop (2010) Sequencing and genome assembly using next-generation technologies. Methods Mol Biol 673:1-17. (pmid: 20835789)
Several sequencing technologies have been introduced in recent years that dramatically outperform the traditional Sanger technology in terms of throughput and cost. The data generated by these technologies are characterized by generally shorter read lengths (as low as 35 bp) and different error characteristics than Sanger data. Existing software tools for assembly and analysis of sequencing data are, therefore, ill-suited to handle the new types of data generated. This paper surveys the recent software packages aimed specifically at analyzing new generation sequencing data.
Han et al. (2011) SnapShot: High-throughput sequencing applications. Cell 146:1044, 1044.e1-2. (pmid: 21925324)
Kenny & Bustamante (2011) SnapShot: Human biomedical genomics. Cell 147:248-248.e1. (pmid: 21962520)
Cancer Genome Atlas Research Network (2011) Integrated genomic analyses of ovarian carcinoma. Nature 474:609-15. (pmid: 21720365)
A catalogue of molecular aberrations that cause ovarian cancer is critical for developing and deploying therapies that will improve patients' lives. The Cancer Genome Atlas project has analysed messenger RNA expression, microRNA expression, promoter methylation and DNA copy number in 489 high-grade serous ovarian adenocarcinomas and the DNA sequences of exons from coding genes in 316 of these tumours. Here we report that high-grade serous ovarian cancer is characterized by TP53 mutations in almost all tumours (96%); low prevalence but statistically recurrent somatic mutations in nine further genes including NF1, BRCA1, BRCA2, RB1 and CDK12; 113 significant focal DNA copy number aberrations; and promoter methylation events involving 168 genes. Analyses delineated four ovarian cancer transcriptional subtypes, three microRNA subtypes, four promoter methylation subtypes and a transcriptional signature associated with survival duration, and shed new light on the impact that tumours with BRCA1/2 (BRCA1 or BRCA2) and CCNE1 aberrations have on survival. Pathway analyses suggested that homologous recombination is defective in about half of the tumours analysed, and that NOTCH and FOXM1 signalling are involved in serous ovarian cancer pathophysiology.
Tran et al. (2012) Cancer genomics: technology, discovery, and translation. J Clin Oncol 30:647-60. (pmid: 22271477)
In recent years, the increasing awareness that somatic mutations and other genetic aberrations drive human malignancies has led us within reach of personalized cancer medicine (PCM). The implementation of PCM is based on the following premises: genetic aberrations exist in human malignancies; a subset of these aberrations drive oncogenesis and tumor biology; these aberrations are actionable (defined as having the potential to affect management recommendations based on diagnostic, prognostic, and/or predictive implications); and there are highly specific anticancer agents available that effectively modulate these targets. This article highlights the technology underlying cancer genomics and examines the early results of genome sequencing and the challenges met in the discovery of new genetic aberrations. Finally, drawing from experiences gained in a feasibility study of somatic mutation genotyping and targeted exome sequencing led by Princess Margaret Hospital-University Health Network and the Ontario Institute for Cancer Research, the processes, challenges, and issues involved in the translation of cancer genomics to the clinic are discussed.