GENOMICS AND BIOINFORMATICS
Introduction
The term "genomics" was coined in 1986 by Thomas Roderick to describe the scientific discipline of mapping, sequencing and analyzing genomes. H. Winkler had coined the term "genome" in 1920 to denote the complete set of chromosomal and extrachromosomal genes of an organism, a cell, an organelle or a virus.
The field of genomics relies upon bioinformatics, which is the management and analysis of biological information stored in databases. From the mid-1980s to the late 1980s, researchers started to use computers as central sequence repositories from which the data could be accessed remotely. In the early 1990s, genomics was transformed from an academic undertaking into a significant commercial endeavor, a course that bioinformatics followed a few years later.
In retrospect, genomics really began with the conception of the Human Genome Project (HGP) in the mid-1980s. In the United States, the Human Genome Project officially started on October 1, 1990, as a 15-year program to map and sequence the complete set of human chromosomes, as well as those of several model organisms. The goal of sequencing an estimated three billion base pairs of the human genome was ambitious, considering that in 1990 few laboratories had sequenced even 100,000 nucleotides. By 1993, the Human Genome Project had become an established international effort. The strategy of this international project was to make a series of maps of each human chromosome at increasingly finer resolutions. In this approach, chromosomes were divided into smaller fragments that could be cloned, and the fragments were then arranged to correspond to their locations on a chromosome. After mapping, each of these ordered fragments would be sequenced.
Progress in stages
J. Craig Venter, a researcher at the National Institutes of Health, and his colleagues devised a new way to find genes in the early 1990s. Rather than following the Human Genome Project strategy of sequencing chromosomal DNA one base at a time, his group isolated messenger RNA molecules, copied these RNA molecules into DNA, and then sequenced a part of these DNA molecules to create expressed sequence tags, or "ESTs." These ESTs could be used as handles to isolate the entire gene. Venter's method, therefore, focused on the "active" portion of the genome, which was producing messenger RNA for protein synthesis. The EST approach has generated enormous databases of nucleotide sequences and facilitated the construction of a preliminary transcript map of the human genome. The development of the EST technique is considered to have demonstrated the feasibility of high-throughput gene discovery (screening of all possible gene candidates from the EST library), and to have provided a key impetus for the growth of the genomics industry. After the success of these projects, Craig Venter moved on to sequencing entire genomes.
Evolving approaches
Venter devised the "whole-genome shotgun strategy," which involves randomly breaking DNA into segments of various sizes and cloning the fragments into vectors. Since the fragments are randomly cleaved from the genome, they tend to overlap, and a genome assembly program is used to join them into contiguous pieces by matching overlapping ends. This method was validated by sequencing the entire genomes of a few selected microorganisms.
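The idea of joining fragments by matching overlapping ends can be sketched in a few lines of Python. This is a toy greedy merger written for illustration only, not the assembly program used in the projects described here; the function names, the `min_len` threshold and the example fragments are all assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap.

    Hypothetical helper: returns the contigs left when no overlaps remain.
    """
    # Remove duplicates and fragments wholly contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the contigs cannot be joined
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Made-up overlapping fragments of the made-up sequence ATGCGTACGTTAGC.
contigs = greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"])
print(contigs)
```

Real assemblers must also cope with sequencing errors, reads from both DNA strands and repeated sequences, none of which this sketch handles.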
This is how the first set of whole-genome sequences, those of Haemophilus influenzae Rd and of Mycoplasma genitalium, which has one of the smallest known genomes, came to be released. To analyze the data, several computer programs had to be adapted to fast computers. Later, several new programs were also written to accomplish the task of sequence assembly. Craig Venter established an organization called "The Institute for Genomic Research (TIGR)," located in Maryland, USA, and soon whole-genome sequences were determined for many other bacteria, including those that live in exotic environments such as high-temperature habitats or deep-sea vents. Several bacteria of medical importance were also sequenced. During this time, several groups in Europe also initiated whole-genome sequencing of bacteria, such as Mycobacterium tuberculosis and Bacillus subtilis at the Pasteur Institute. Generally, in Europe large consortia (groups of organizations in various countries) were formed so that the members could complement each other's strengths.
The exciting commercial era of genomics began with the establishment of Celera Genomics, which was dedicated to sequencing the human and the mouse genomes. Compared to microbial genomes, the human genome is large (about 3 x 10^9 bp) and also contains many repeated sequences. These repeated sequences present difficulties in sequence assembly because doubts arise about their true order of arrangement in the genome. The parts containing the genes were somewhat easier to assemble. The problem a computer faces in assembling repeats, as opposed to unique sequences, is akin to the following example. Suppose you were blindfolded and asked to pick two balls of different sizes from a lot of mostly identical balls. You would make several attempts but fail most of the time, and you might not be able to distinguish one ball from another. However, if you were asked to draw two different balls from a lot of balls of all different sizes and shapes, there is a good chance that you would pick two different balls in far fewer attempts, perhaps on the very first attempt.
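The ambiguity caused by repeats can be illustrated with a small Python sketch. It builds two made-up sequences that contain the same repeat three times but differ in the order of the unique segments between the copies, and uses fixed-length substrings (k-mers) as a simplified stand-in for sequencing reads. All sequences and names here are invented for illustration.

```python
from collections import Counter


def kmers(seq: str, k: int) -> Counter:
    """Count every substring of length k in seq (a stand-in for reads)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


REPEAT = "GATTACAGGCTA"  # hypothetical 12-base repeat
START, END = "AAC", "TTC"  # hypothetical unique flanking segments
SEG_B, SEG_C = "CCGTA", "GGCAT"  # hypothetical unique internal segments

# Same pieces, but the two internal segments are swapped.
genome_1 = START + REPEAT + SEG_B + REPEAT + SEG_C + REPEAT + END
genome_2 = START + REPEAT + SEG_C + REPEAT + SEG_B + REPEAT + END

# Reads no longer than the repeat cannot tell the two arrangements apart.
short_reads_identical = kmers(genome_1, 12) == kmers(genome_2, 12)

# Reads that span a whole repeat plus one base on each side can.
long_reads_identical = kmers(genome_1, 14) == kmers(genome_2, 14)

print(short_reads_identical, long_reads_identical)
```

Under these assumptions the short substrings are the same for both arrangements, so no assembly program could choose between them from that data alone, whereas substrings long enough to bridge a repeat differ between the two.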
But the unraveling of the human genome sequence held a surprise. Initial EST sequencing had led to an estimate of over 100,000 genes in the human genome. The genome sequence, however, revealed that there are only about 30,000 genes. This number is only twice that of the fruit fly Drosophila melanogaster, a simple organism compared to the immensely complex human being. Possessing only twice as many genes as the fruit fly challenges us to search for other explanations of human complexity. It turns out that humans can achieve this complexity through combinations of genes. An example makes this clear.
Suppose a mechanic has a set of 20 or 30 tools, each dedicated to a specific task. The mechanic can then accomplish 20 or 30 tasks. However, if the same set of tools had flexible parts, the same mechanic could generate several 'new combinations' of these tools to carry out hundreds of tasks. Naturally, through combinations, this mechanic will have more business, earn more money and be more sought after than one using the dedicated 'one job specific' tools.
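The arithmetic behind the analogy can be sketched as follows. The model is an assumption made purely for illustration: every non-empty selection of interchangeable parts counts as one distinct tool, which is far cruder than how genes actually combine.

```python
from math import comb


def distinct_combinations(n_parts: int) -> int:
    """Number of non-empty selections from n_parts interchangeable parts.

    Illustrative model only: sums the ways to choose 1, 2, ..., n_parts
    parts, which equals 2**n_parts - 1.
    """
    return sum(comb(n_parts, k) for k in range(1, n_parts + 1))


for n in (5, 10, 20):
    print(n, "parts ->", distinct_combinations(n), "possible combinations")
```

Even under this simple model the number of combinations grows much faster than the number of parts, which is the point of the analogy.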
The sections on genomics below explain the different branches of this exciting new area. With the development of automated sequencing, it has become possible to sequence the genomes of many organisms. According to the list displayed at the NCBI (National Center for Biotechnology Information) site at the time of writing, there are 1409 complete genome sequences of bacteria and archaea, 40 complete genome sequences of eukaryotes and 2537 complete genome sequences of viruses. The sequencing projects have produced several interesting and unexpected findings. The term genomics itself has expanded in the last few years and in the present context also includes genome function. Genomics can be broadly divided into structural genomics and functional genomics.
Structural genomics
Structural genomics primarily involves high-throughput DNA sequencing followed by assembly, organization and management of DNA sequences. It represents an initial phase of genome analysis, which involves the construction of high-resolution genetic, physical or transcript maps of the organism. The ultimate physical map of an organism is its complete DNA sequence. In the last few years, however, with the completion of several genome-sequencing projects, the term structural genomics has itself undergone a transition. Several structural genomics initiatives now encompass the systematic, high-throughput determination of the three-dimensional structures of all proteins. The information and reagents provided by structural genomics are used to design global (genome-wide) experiments to identify the functions of proteins.
Functional genomics
Functional genomics represents a new phase of genome analysis and deals with the reconstruction of the genome to determine the biological function of genes and the interactions between genes. The fundamental strategy in a functional genomics approach is to expand the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic manner. Functional genomics is therefore characterized by high-throughput or large-scale experimental methodologies combined with statistical and computational analysis of the results.