History of Bioinformatics
Bioinformatics has emerged as a scientific discipline that applies computing science and technology to the analysis and management of biological data. It began when Ingram demonstrated that sickle cell haemoglobin and normal haemoglobin are homologous, differing in only a single amino acid. This led to the comparison of other proteins with similar biological functions. As more and more proteins were sequenced, databases became necessary to enable quick comparisons using computational software. With the advent of rapid nucleic acid sequencing techniques, a large number of sequences started accumulating, which again required computing facilities.
In 1962, Zuckerkandl and Pauling proposed a new approach to studying evolutionary relationships using sequence variability. This initiated a new field called 'molecular evolution'. The approach was based on the observation that functionally related or homologous protein sequences were similar. Subsequently, sequence comparisons, analysis of functional relatedness, and inference of evolutionary relationships became possible. Margaret Dayhoff observed that protein sequences undergo variation during evolution according to certain patterns. She noted that:
• amino acids were not replaced at random but were altered with specific preferences. For example, amino acids with similar physico-chemical characteristics tended to replace one another.
• some amino acids, such as tryptophan, were generally not replaced by any other amino acid.
• based on several homologous sequences, a point accepted mutation (PAM) matrix could be developed.
This laid the first foundation for subsequent work on sequence comparisons using quantitative approaches.
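The counting step behind such a matrix can be illustrated with a minimal sketch. It assumes a set of already aligned homologous sequences and simply tallies which amino acid pairs appear in the same column. The function name `count_substitutions` and the toy sequences are hypothetical, and the pairwise tally is a simplification: it is not Dayhoff's actual procedure, which inferred accepted mutations from evolutionary trees of closely related proteins and then normalised the counts.

```python
from collections import Counter
from itertools import combinations


def count_substitutions(aligned_seqs):
    """Tally unordered amino acid pairs that differ within an alignment column.

    Simplified illustration only: every pair of sequences is compared
    directly, and gap characters ('-') are ignored.
    """
    counts = Counter()
    for seq_a, seq_b in combinations(aligned_seqs, 2):
        for a, b in zip(seq_a, seq_b):
            if a == "-" or b == "-" or a == b:
                continue  # skip gaps and unchanged positions
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy alignment (not real protein data).
toy_alignment = ["MKVLW", "MRVLW", "MKILW"]
substitution_counts = count_substitutions(toy_alignment)
for pair, n in sorted(substitution_counts.items()):
    print(pair, n)
```

In this toy alignment only chemically similar residues exchange (lysine with arginine, valine with isoleucine), while tryptophan is conserved, which mirrors the kind of pattern described above.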
The National Biomedical Research Foundation (NBRF) compiled the first comprehensive collection of macromolecular sequences in the "Atlas of Protein Sequence and Structure", published from 1965 to 1978 under the editorship of Margaret O. Dayhoff. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.
In 1980, a data library was established at the European Molecular Biology Laboratory (EMBL) to collect, organize, and distribute nucleotide sequence data and related information. Its successor is the European Bioinformatics Institute (EBI), located at Hinxton, U.K. The National Center for Biotechnology Information (NCBI) was also started in the USA, in the same decade, as a primary information databank and provider. The DNA Data Bank of Japan (DDBJ) was also initiated during this period. The Protein Information Resource (PIR) was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information. Today, all these databanks collaborate closely and exchange data on a regular basis.
As sequence data began to accumulate rapidly, powerful new sequence analysis software was needed. In parallel, a firm mathematical basis was required to develop algorithms. Scientists from the fields of mathematics, biology, and computer science entered the emerging field of bioinformatics.
The databanks, through their wide distribution networks, are very important sources for all researchers interested in asking fundamental questions in biology. Thus, a primary aim of bioinformatics is to disseminate scientifically investigated knowledge for the benefit of the research community. Other aims include the development of software for data analysis.
The word "bioinformatics" is a combination of biology and informatics. Once it became clear that biological polymers, such as nucleic acid molecules and proteins, can be transformed into sequences of digital symbols, informatics approaches could be used for their analysis. Moreover, only a limited set of letters is required to represent the nucleotide and amino acid monomers. It is the digital nature of this data that differentiates genetic data from many other types of biological data, and has allowed bioinformatics to flourish. Another key point is that the use of sequence data relies upon an underlying reductionist approach: sequence implies structure, which in turn implies function. In the subsequent sections we will see the details of these activities.
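The point about a limited alphabet can be made concrete with a small sketch. Because DNA uses only four letters, each nucleotide fits in two bits, so a whole sequence can be packed into a single integer. The names `encode_dna` and `decode_dna` and the particular bit assignment are assumptions chosen for illustration, not a standard or database format.

```python
# Arbitrary 2-bit code per nucleotide (an assumption for this sketch).
NUCLEOTIDE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_NUCLEOTIDE = {code: base for base, code in NUCLEOTIDE_CODES.items()}


def encode_dna(seq):
    """Pack a DNA string into an integer, two bits per nucleotide."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | NUCLEOTIDE_CODES[base]
    return value


def decode_dna(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(CODE_TO_NUCLEOTIDE[value & 3])  # lowest two bits
        value >>= 2
    return "".join(reversed(bases))


packed = encode_dna("GATTACA")
print(bin(packed), decode_dna(packed, 7))
```

The length has to be stored alongside the integer, since leading "A" nucleotides encode as zero bits and would otherwise be lost.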