Biological Databases: Types and Importance

·         One of the hallmarks of modern genomic research is the generation of enormous amounts of raw sequence data.

·         As the volume of genomic data grows, sophisticated computational methodologies are required to manage the data deluge.

·         Thus, the very first challenge in the genomics era is to store and handle the staggering volume of information through the establishment and use of computer databases.

·         A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.

·         A simple database might be a single file containing many records, each of which includes the same set of information.

·         The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.
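The idea of a simple database as a single file of uniform records can be sketched in a few lines of Python. This is only an illustration: the file contents, the field names (accession, organism, length) and the helper functions load_records and find are all made up for the example and do not come from any real database.

```python
import csv
import io

# Hypothetical flat-file "database": one record per line,
# and every record carries the same set of fields.
FLAT_FILE = """accession,organism,length
SEQ0001,Homo sapiens,1542
SEQ0002,Mus musculus,987
SEQ0003,Homo sapiens,2310
"""


def load_records(text):
    """Parse the flat file into a list of dicts, one dict per record."""
    return list(csv.DictReader(io.StringIO(text)))


def find(records, **criteria):
    """Return the records whose fields match every given criterion."""
    return [r for r in records
            if all(r[field] == value for field, value in criteria.items())]


records = load_records(FLAT_FILE)
human = find(records, organism="Homo sapiens")
print([r["accession"] for r in human])  # ['SEQ0001', 'SEQ0003']
```

Because every record has the same structure, one generic lookup function is enough to retrieve entries by any field. Real biological databases apply the same principle at a much larger scale, with dedicated database software in place of a plain file.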

Based on their contents, biological databases can be roughly divided into two categories:

1. Primary databases

·         Primary databases are also called archival databases.

·         They are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.

·         Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.

·         Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.

Examples

·         ENA, GenBank and DDBJ (nucleotide sequence)

·         ArrayExpress Archive and GEO (functional genomics data)

·         Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)

2. Secondary databases

·         Secondary databases comprise data derived from the results of analysing primary data.

·         Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature.

·         They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.

Examples

·         InterPro (protein families, motifs and domains)

·         UniProt Knowledgebase (sequence and functional information on proteins) 

·         Ensembl (variation, function, regulation and more layered onto whole genome sequences)

3. Many data resources, however, have both primary and secondary characteristics. For example, UniProt accepts primary sequences derived from peptide sequencing experiments, but it also infers peptide sequences from genomic information and provides a wealth of additional information, some derived from automated annotation (TrEMBL) and even more from careful manual analysis (Swiss-Prot).
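The Swiss-Prot/TrEMBL split is visible in the sequence files themselves: UniProtKB FASTA headers begin with "sp" for Swiss-Prot entries and "tr" for TrEMBL entries. The sketch below shows one way to read that prefix. The function name review_status is an illustrative choice, the first header is an abbreviated form of a real entry, and the second header is a made-up placeholder.

```python
def review_status(fasta_header):
    """Classify a UniProtKB FASTA header by its database prefix.

    'sp' marks a manually reviewed Swiss-Prot entry and
    'tr' marks an automatically annotated TrEMBL entry.
    """
    prefix = fasta_header.lstrip(">").split("|", 1)[0]
    return {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(prefix, "unknown")


# Abbreviated header of a real Swiss-Prot entry (human hemoglobin subunit alpha).
reviewed = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
# Made-up placeholder header, used only to show the 'tr' prefix.
unreviewed = ">tr|X0X0X0|X0X0X0_EXAMPLE Uncharacterized protein OS=Example organism"

print(review_status(reviewed))    # Swiss-Prot
print(review_status(unreviewed))  # TrEMBL
```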

4. There are also specialized databases that cater to a particular research interest. For example, FlyBase, the HIV Sequence Database, and the Ribosomal Database Project specialize in a particular organism or a particular type of data.

Importance of Databases

·         Databases act as a storehouse of information.

·         Databases are used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.

·         Databases allow knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered. This facilitates the discovery of new biological insights from raw data.

·         Secondary databases have become the molecular biologist’s reference library over the past decade or so, providing a wealth of information on just about any gene or gene product that has been investigated by the research community.

·         Databases handle shared access, so that many users can work with the same data entries.

·         They allow data to be indexed, which speeds up searches.

·         They help to remove redundancy of data.
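Several of these points (retrieval by search criteria, indexing, and removal of redundancy) can be seen in a small sketch using Python's built-in sqlite3 module. The table name gene, its columns, and the accession values are assumptions invented for this example, not the schema of any real biological database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.execute("""
    CREATE TABLE gene (
        accession TEXT PRIMARY KEY,   -- uniqueness prevents duplicate entries
        organism  TEXT NOT NULL,
        symbol    TEXT NOT NULL
    )
""")
# An index lets the database find rows by organism without scanning every row.
conn.execute("CREATE INDEX idx_gene_organism ON gene (organism)")

rows = [
    ("G0001", "Homo sapiens", "TP53"),
    ("G0002", "Mus musculus", "Trp53"),
    ("G0001", "Homo sapiens", "TP53"),  # redundant copy of the first record
]
# INSERT OR IGNORE skips any row whose primary key is already present.
conn.executemany("INSERT OR IGNORE INTO gene VALUES (?, ?, ?)", rows)

total = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
human = conn.execute(
    "SELECT symbol FROM gene WHERE organism = ?", ("Homo sapiens",)
).fetchall()
print(total, human)  # 2 [('TP53',)]
```

The redundant third row is rejected by the primary key, so only two records are stored, and the same table can be queried by accession, organism, or symbol.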