Finding Sequences

In this section, you will learn how to obtain nucleic acid or protein sequence information, in a format called FASTA, that is easy to use as input into bioinformatics tools.

What is the nucleotide sequence of this gene?

Remember that you are looking at information about the gene for the red-sensitive opsin in human vision, and that it is located near the bottom tip of the X chromosome. On the Entrez Gene page for OPN1LW opsin 1, scroll farther down (way down!) to NCBI Reference Sequences (RefSeq). In the first subsection, mRNA and Protein(s), all of the following are available:

· the mRNA sequence (sequence of nucleotide bases in the messenger RNA), here listed as NM_020061.3 (M for mRNA);

· the protein sequence (sequence of this gene's protein product, the red opsin), here listed as NP_064445.1 (P for protein);

· the source sequences (entire sequences of all of the overlapping genome fragments in which this sequence was found, from GenBank).

Note that the two links to mRNA sequence and protein sequence are given as NM_020061.3→NP_064445.1, the arrow implying that the sequence of the NM entry is translated (by protein synthesis) to give the sequence of the NP entry.
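
If you ever handle these entry numbers in a script, the naming convention described above (NM for mRNA, NP for protein, with a version number after the dot) is easy to take apart. Here is a minimal sketch. The function name and the small lookup table are made up for illustration and cover only the two prefixes mentioned in this section; RefSeq uses other prefixes that are not listed here.

```python
# Hypothetical helper, not part of any NCBI tool. The table holds only the
# two prefixes discussed in this tutorial.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
}


def describe_refseq(accession):
    """Split an entry number such as 'NM_020061.3' into its parts."""
    base, _, version = accession.partition(".")   # 'NM_020061', '.', '3'
    prefix = base.split("_", 1)[0]                # 'NM'
    return {
        "accession": base,
        "version": int(version) if version else None,
        "molecule": REFSEQ_PREFIXES.get(prefix, "unknown"),
    }


for entry in ("NM_020061.3", "NP_064445.1"):
    print(entry, "->", describe_refseq(entry))
```

Any prefix missing from the table is reported as "unknown" rather than guessed.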

Click the entry number for the mRNA sequence: NM_020061.3

This is a typical GenBank nucleotide file, and a lot of it is hard to read, but a few things are clear. First note, under references, citations to the publication of this sequence in the scientific literature. To see an abstract of the article in which this gene was described, click the PubMed link (a number) below the first reference and read it.

Scroll to the bottom of this long page. The last thing, labeled ORIGIN, is the sequence of this messenger RNA. You are seeing the actual list of As, Ts, Gs, and Cs that make up the message for synthesis of this opsin. But wait! You know that RNA contains no T. In most nucleotide databases, U from RNA is represented as T, to make for easy comparison of DNA and RNA sequences. This sequence information is not in the form that is most useful for searching in databases, say, searching for related genes. Let's display this entry in a form more useful for searching.
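
If you want to see the message as it would actually read in RNA, swapping every T for a U takes one line. The following is a small sketch; the function name is invented for this example, and the short fragment is made up, not taken from the OPN1LW entry.

```python
def dna_letters_to_rna(sequence):
    """Rewrite a database-style sequence (T for U) with RNA letters."""
    return sequence.upper().replace("T", "U")


# Made-up fragment for illustration only, not the real opsin sequence.
fragment = "atggcccagcagtgg"
print(dna_letters_to_rna(fragment))  # prints AUGGCCCAGCAGUGG
```

GenBank pages show the sequence in lowercase, so the function converts to uppercase before replacing.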

At the top of the page, beside the Display button, pull down the menu that says GenBank (the default display format for each entry), and select FASTA (note that several other display options are available). Now you see one descriptive or "comment" line that begins with ">", followed by the nucleotide sequence. This little bit of text is just what you need to search nucleotide databases for similar sequences.

Keep it for future use, as follows. Click and drag on the web page to select everything from the ">" through the last nucleotides (CCAA). Be careful not to select anything else. From your browser's Edit menu, select Copy to make a copy of this information on your clipboard, for pasting elsewhere. Now start a simple word processor (use TextEdit on Mac or Notepad on Windows, to avoid inadvertent changes in crucial formatting of sequence files), make a new document, and paste. The FASTA comment and sequence should appear. If necessary, select all of the text and change the font to Courier or Monaco; these "typewriter" fonts make it easy to align letters into columns, because all letters are the same width. Save this file, choosing text or plain text as the file type. Call it mrnared.txt (for mRNA sequence of red opsin). Save it to a convenient location for this and other files you'll be making for later searches.
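
To check that a saved file really has the FASTA shape described above (one comment line beginning with ">", then nothing but sequence), you could read it back with a few lines of Python. This is only a sketch under simple assumptions: the file holds a single record, and the function names are invented for this example. The demonstration uses a made-up record so that it runs without any file; to test your own file you would call read_fasta_file("mrnared.txt").

```python
def parse_fasta(text):
    """Return (comment, sequence) from the text of a one-record FASTA file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must begin with a '>' comment line")
    comment = lines[0][1:]                  # drop the leading '>'
    sequence = "".join(lines[1:]).upper()   # join the wrapped sequence lines
    return comment, sequence


def read_fasta_file(path):
    """Read a plain-text FASTA file, e.g. the mrnared.txt saved above."""
    with open(path, encoding="utf-8") as handle:
        return parse_fasta(handle.read())


# Made-up record for illustration only, not the real opsin entry.
demo = ">demo record\nacgtacgt\nccaa\n"
comment, sequence = parse_fasta(demo)
print(comment)        # demo record
print(len(sequence))  # 12
```

If a stray line was copied ahead of the ">", parse_fasta raises an error instead of silently returning a damaged sequence.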

Click your browser's Back button until you return to the Entrez Gene page for this gene.

What is the amino-acid sequence of this gene?

Under NCBI Reference Sequences (RefSeq), click the entry number NP_064445.1 for the protein sequence.

Things look a lot like before, but this is a protein entry (the classical view is that gene products are proteins, but many are not), containing the amino-acid sequence in one-letter abbreviations. Just as with the mRNA entry, turn this into a FASTA display, and copy it into a new word-processor document. Save it in text format as protred.txt (for protein sequence of red opsin). Return to Entrez Gene.

What does the neighborhood of this gene look like?

(Get ready for a surprise. Hint: OPN1LW is a human gene, and humans are eucaryotes. When people began to sequence eucaryotic genes, what big surprise was in store for them?)

Now take a look at the chromosome region that contains the red opsin gene. Scroll back to near the top of the Entrez Gene page for OPN1LW, to the section called Genomic context. The diagram shows you that the red opsin gene lies on the X chromosome, within a segment of base pairs (bp) stretching from position 152,929,151 to position 153,114,725 (a distance of 185,574 bp). [Don't worry if these numbers are not exactly the ones you see; these resources are constantly being updated.] The location of OPN1LW, shown as a red arrow, is about 3/4 of the way down this segment.

Now look at the diagram in the preceding section, Genomic regions, transcripts, and products. This diagram gives a closer look at the OPN1LW segment, representing only positions 153,062,939 to 153,077,701 (14,762 bp). The lower line shows coding regions as red blocks and noncoding regions as red lines. Here is the surprise: You knew, but you might have forgotten, that eucaryotic genes are often interrupted by noncoding regions called intervening sequences or introns. The coding regions are called exons. From this diagram, you can see that the OPN1LW gene consists of 6 exons and 5 introns, and that the introns are far larger than the exons. Of the 14,762 bp in the "gene", only 1095 bp code for protein, which means that less than 8% of the base pairs contain the code. When this gene is expressed in cells in the human retina, an RNA copy of the entire gene is synthesized. Then the intron regions are cut out, and the exon regions are joined together (a process called splicing) to produce the mature mRNA, which will be translated by ribosomes as they make the red opsin protein. In this case, more than 92% of the initial RNA transcript is tossed out, leaving the pure protein code. Seems wasteful, but our understanding of how all this works, while impressive, is still pretty fragmentary.
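
The arithmetic in the last two paragraphs is easy to check for yourself. The sketch below uses the positions quoted in this section (the numbers on the live page may differ, as noted above); the variable names are chosen only for this example.

```python
# Positions quoted in this tutorial; the live NCBI pages may show others.
segment_start, segment_end = 152_929_151, 153_114_725   # Genomic context
gene_start, gene_end = 153_062_939, 153_077_701         # OPN1LW segment
coding_bp = 1095                                        # protein-coding bp

segment_length = segment_end - segment_start
gene_length = gene_end - gene_start
coding_fraction = coding_bp / gene_length

print(f"Genomic context segment: {segment_length:,} bp")    # 185,574 bp
print(f"OPN1LW segment: {gene_length:,} bp")                # 14,762 bp
print(f"Coding fraction: {coding_fraction:.1%}")            # 7.4%
print(f"Noncoding fraction: {1 - coding_fraction:.1%}")     # 92.6%
```

The lengths are computed as simple differences between the end and start positions, the same way the distances are quoted in the text.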