Data is retrieved from snapshots from the RCSB PDB repository.
Each
file is imported into a single table raw_pdb
in our PostgreSQL database using a Perl script. The script
uses transactions of 10,000 inserts per chunk.
The
structure of raw_pdb
is this:
Column |
Type |
Modifiers |
code |
character varying(20) |
not null |
line_num |
integer |
not null |
line_cont |
character varying(80) |
not null |
The import script looks like this:
#!/usr/bin/perl -w
use FindBin
'$Bin';
use DBI;
$dbName =
'bioinf';
$dbLogin =
'ezop';
$dbPass =
'XYZ';
$conn = DBI->
connect(
"DBI:Pg:database=$dbName;host=localhost",
"$dbLogin",
"$dbPass", {
'RaiseError' =>
1,
'AutoCommit' =>
0});
die
"./pdb_lines_unos.pl <table> <file>"
if
not
defined($ARGV[
0]);
$recordCount =
0;
$table = $ARGV[
0];
$fName = $ARGV[
1];
open F,
"zcat $fName|";
while (<F>) {
chomp;
$linija = $_;
$recordCount +=
1;
$sql =
"insert into $table (code, line_num, line_cont) values (?, ?, ?)";
$conn->
do($sql,
undef, $fName, $recordCount, $linija);
$conn->commit()
if ($recordCount%10000 ==
0);
}
close F;
$conn->commit();
1;
After lines are imported, they are parsed using functions we will define below.
From raw_pdb
data, we generate the tables ts
, ps
, proteins
, sources
, sources_organela
, and ss_bond
by parsing the corresponding records.
The ps
table has three important columns: chain
, length
, and sequence
. Protein sequence
is generated using C-alpha atoms for each chain and for each residue ordered by
residue sequence, taking only the first insertion and the first alternate
location. chain
is taken from
the TS.chain
column,
and length
is simply
the precalculated length of the sequence
string. Since this article is meant to analyze only single
chains and intrachain connections, multiple-chain proteins are skipped in our
analysis here.
Within SSBOND
records, all disulfide bonds are stored in
the pdb_ssbond
table, which
inherits from the pdb_ssbond_extended
table. pdb_ssbond
looks like this:
Column |
Type |
Nullable |
Default |
Description |
id |
integer |
not null |
nextval(‘pdb_ssbond_id_seq’::regclass) |
|
code |
character(4) |
|
|
four-letter code |
ser_num |
integer |
|
|
serial number of ssbond |
residue1 |
character(3) |
|
|
first residue in bond |
chain_id1 |
character(1) |
|
|
first chain in bond |
res_seq1 |
integer |
|
|
sequential number of first residue |
i_code1 |
character(1) |
|
|
insertion code of first residue |
residue2 |
character(3) |
|
|
second residue in bond |
chain_id2 |
character(1) |
|
|
second chain in bond |
res_seq2 |
integer |
|
|
sequential number of second residue |
i_code2 |
character(1) |
|
|
insertion code of second residue |
sym1 |
character(6) |
|
|
first symmetry operator |
sym2 |
character(6) |
|
|
second symmetry operator |
dist |
numeric(5,2) |
|
|
distance between atoms |
is_reactive |
boolean |
not null |
false |
mark for reactivity (to be set) |
is_consecutive |
boolean |
not null |
true |
mark for consecutive bonds (to be set) |
rep7 |
boolean |
not null |
false |
mark for set-7 structures (to be set) |
rep40 |
boolean |
not null |
false |
mark for set-40 structures (to be set) |
rep80 |
boolean |
not null |
false |
mark for set-80 structures (to be set) |
is_from_pdb |
boolean |
|
true |
is taken from PDB as source (to be set) |
I also added these indexes:
"pdb_ssbond_pkey" PRIMARY KEY, btree (id)
"ndxcode1" btree (code, chain_id1, res_seq1)
"ndxcode2" btree (code, chain_id2, res_seq2)
It
is assumed that distribution of disulfide bonds prior to the cutoff is Gaussian
(without testing with, e.g., KS-test), so standard deviations were calculated
on each distance between cysteines in same protein to discover the range of
permitted bond lengths and compare them to the cutoff. The cutoff was same as
the calculated mean plus-minus three standard deviations. We have extended the
range to introduce more possible disulfide bonds which weren’t enlisted in PDB
file in SSBOND
rows but
which we have inserted ourselves by calculating distances between ATOM
records. The chosen range for ssbonds
are between 1.6175344456264 and
2.48801947642267 Å, which is the mean (2.05) plus-minus four standard deviations:
select
count(
1)
as cnt
,
stddev(dist)
as std_dev
,
avg(dist)
as avg_val
, -
stddev(dist) +
avg(dist)
as left1
,
stddev(dist) +
avg(dist)
as right1
,
-2 *
stddev(dist) +
avg(dist)
as left2
,
2 *
stddev(dist) +
avg(dist)
as right2
,
-3 *
stddev(dist) +
avg(dist)
as left3
,
3 *
stddev(dist) +
avg(dist)
as right3
,
-4 *
stddev(dist) +
avg(dist)
as left4
,
4 *
stddev(dist) +
avg(dist)
as right4
,
min(dist)
,
max(dist)
from pdb_ssbond_tmp
where dist >
0
and dist <
20;
The TS
table contains the coordinates of all atoms, but only
cysteines will be used, with their sulfur named " SG "
. So another staging table
with " SG "
sulfur atoms
only is created for speeding up the process by reducing the number of records
to search. When selecting sulfurs only, the number of combinations is much less
than in the case of all atoms—194,574 of the former compared with 122,761,100
of the latter. Within this table joined to itself, distances are calculated
using the Euclidean distance, and results are imported into the pdb_ssbond
table but only where the distance is
between the defined lengths calculated earlier. The reason for doing this
speedup is to lessen the amount of time of running the whole process again—for
checking purposes—keeping it within one day.
To clean disulfide bonds, we use the following algorithm:
· Delete when both sides of connection point to same amino acid
· Delete bonds whose length is not between 1.6175344456264 and 2.48801947642267
· Remove insertions
· Remove bonds caused by alternate atom locations, but leaving first location
The
code for this (taking the pdb_ssbond
table as the
first argument) is:
CREATE
OR
REPLACE
FUNCTION pdb_ssbond_clean2(
clean_icodes
boolean,
clean_altloc
boolean,
mark_reactive
boolean,
mark_consecutive
boolean)
RETURNS
void
AS $$
declare _res
integer;
BEGIN
delete
from pdb_ssbond b
where
exists (
select a.id
from pdb_ssbond a
where a.code = b.code
and a.id > b.id
and (
( a.chain_id1 = b.chain_id1
and a.res_seq1 = b.res_seq1
and a.chain_id2 = b.chain_id2
and a.res_seq2 = b.res_seq2)
or
( a.chain_id1 = b.chain_id2
and a.res_seq1 = b.res_seq2
and a.chain_id2 = b.chain_id1
and a.res_seq2 = b.res_seq1)
)
) ;
delete
from pdb_ssbond
where chain_id1 = chain_id2
and res_seq1 = res_seq2
and i_code1 = i_code2;
update pdb_ssbond
set is_consecutive =
true
, is_reactive =
false;
delete
from pdb_ssbond
where
not dist
between
1.6175344456264
and
2.48801947642267;
if clean_icodes then
delete
from pdb_ssbond
where i_code1
not
in (
'',
' ',
'A')
or i_code2
not
in (
'',
' ',
'A') ;
end
if;
if clean_altloc then
delete
from pdb_ssbond a
where
exists (
select
1
from pdb_ssbond b
where b.code = a.code
and b.chain_id1 = a.chain_id1
and b.res_seq1 = a.res_seq1
and b.i_code1 = a.i_code1
and b.ser_num < a.ser_num
) ;
delete
from pdb_ssbond a
where
exists (
select
1
from pdb_ssbond b
where b.code = a.code
and b.chain_id2 = a.chain_id2
and b.res_seq2 = a.res_seq2
and b.i_code2 = a.i_code2
and b.ser_num < a.ser_num
) ;
end
if;
if mark_reactive then
update pdb_ssbond
set is_reactive =
false ;
update pdb_ssbond
set is_reactive =
true
where
exists (
select
1
from pdb_ssbond b
where b.code = pdb_ssbond.code
and b.chain_id1 = pdb_ssbond.chain_id1
and b.res_seq1 = pdb_ssbond.res_seq1
and b.i_code1 = pdb_ssbond.i_code1
and b.ser_num < pdb_ssbond.ser_num
) ;
update pdb_ssbond
set is_reactive =
true
where
exists (
select
1
from pdb_ssbond b
where b.code = pdb_ssbond.code
and b.chain_id2 = pdb_ssbond.chain_id2
and b.res_seq2 = pdb_ssbond.res_seq2
and b.i_code2 = pdb_ssbond.i_code2
and b.ser_num < pdb_ssbond.ser_num
) ;
update pdb_ssbond
set is_reactive =
true
where (code, chain_id1, res_seq1, i_code1)
in (
select code, chain_id1
as c, res_seq1
as r, i_code1
as i
from pdb_ssbond
intersect
select code, chain_id2
as c, res_seq2
as r, i_code2
as i
from pdb_ssbond
) ;
update pdb_ssbond
set is_reactive =
true
where (code, chain_id2, res_seq2, i_code2)
in (
select code, chain_id1
as c, res_seq1
as r, i_code1
as i
from pdb_ssbond
intersect
select code, chain_id2
as c, res_seq2
as r, i_code2
as i
from pdb_ssbond
);
end
if;
if mark_consecutive then
update pdb_ssbond
set is_consecutive =
false;
update pdb_ssbond
set is_consecutive =
true
where
not
exists (
select
1
from pdb_ssbond a
where
a.code = pdb_ssbond.code
and (
(a.chain_id1 = pdb_ssbond.chain_id1
and a.chain_id2 = pdb_ssbond.chain_id2)
or
(a.chain_id1 = pdb_ssbond.chain_id2
and a.chain_id2 = pdb_ssbond.chain_id1)
)
and (
(a.res_seq1
between pdb_ssbond.res_seq1
and pdb_ssbond.res_seq2)
or (a.res_seq2
between pdb_ssbond.res_seq1
and pdb_ssbond.res_seq2)
or (pdb_ssbond.res_seq1
between a.res_seq1
and a.res_seq2)
or (pdb_ssbond.res_seq2
between a.res_seq1
and a.res_seq2)
)
and
not (
a.code = pdb_ssbond.code
and a.chain_id1 = pdb_ssbond.chain_id1
and a.chain_id2 = pdb_ssbond.chain_id2
and a.res_seq1 = pdb_ssbond.res_seq1
and a.res_seq2 = pdb_ssbond.res_seq2
)
);
end
if;
END;
$$ LANGUAGE plpgsql;
With
this, the non-redundant set of proteins is imported to the nr
table which is joined to the ps
and proteins
tables, and
sets are marked (set7
, set40
, and set80
). At the end, according to protein quantity only one set will be
analyzed. Mismatched chains between PDB and NR are removed from analysis.
Proteins
without disulfide bonds are excluded from research, together with proteins that
don’t belong to any set. Data is processed with DSSP, and these files which had
problems with resolution or too many atoms to be processed are also excluded.
Only proteins with single chains are used as result for analysis because
interchain connections were not kept, although they are easily calculated from
the ssbond
table by
counting the number of connections that have different chains.
For the remaining proteins, an update is done for the total number of bonds and the number of overlapping bonds, and this is done for each of the sets.
The
source organism is extracted from SOURCE
records. In cases where it is unknown, synthetic, designed,
artificial, or hybrid, it is discarded from research. Low-resolution proteins
are also excluded only when their side chain is not visible.
SOURCE
records are stored in
the sources
table, which
contains taxonomy rows. In some cases, the taxonomy is missing or incorrect. In
these cases, the manual correction of experts is needed.
From
the source and taxonomy downloaded from NCBI, the class is assigned to each
primary structure. In case a class is unable to be assigned, the protein is
removed from the analysis list. Knowing that biological databases are being
used, an extra check of all source records and NCBI taxonomy classes is
recommended to be performed by a biologist, otherwise there might be problems
with classifications between different domains. To match source cellular
locations with taxonomy IDs, data from the source table is extracted into the
table sources_organela
where all
data is written as code, tag, and value. Its format is shown below:
select *
from sources_organela
where code =
'1rav';
code |
mol_id |
tag |
val |
1rav |
0 |
MOL_ID |
1 |
1rav |
7 |
CELLULAR_LOCATION |
CYTOPLASM (WHITE) |
The
taxonomy archive we want to use is a ZIP file containing seven dump files.
Among these files, two of the most important are names.dmp
and merged.dmp
. Both files are CSV tab-pipe delimited files as detailed in
the documentation:
·
The file merged.dmp
contains a list of previous taxonomy IDs, and the current
taxonomy IDs into which each one was merged.
·
names.dmp
contains these important columns in this order:
o
tax_id
: The taxonomy ID
o
name_txt
: The name of the species, and if applicable, the unique name
(where species can be found with multiple names, this helps disambiguate)
·
division.dmp
contains the names of top-level domains which we will use as
our classes.
·
nodes.dmp
is the table which contains a hierarchical structure of
organisms using taxonomy IDs.
o
It contains a parent taxonomy ID, a
foreign key which can be found in names.dmp
.
o It also contains a division ID which is important for us to correctly store the relevant top domain data.
With
all this data and manual corrections (setting correct domains of life) we were
able to create the structure of the taxonomy_path
table. A sampling of data looks like this:
select *
from taxonomy_path
order
by
length(
path)
limit
20
offset
2000;
tax_id |
path |
is_viral |
is_eukaryote |
is_archaea |
is_other |
is_prokaryote |
142182 |
cellular organisms;Bacteria;Gemmatimonadetes |
f |
f |
f |
f |
t |
136087 |
cellular organisms;Eukaryota;Malawimonadidae |
f |
t |
f |
f |
f |
649454 |
Viruses;unclassified phages;Cyanophage G1168 |
t |
f |
f |
f |
f |
321302 |
Viruses;unclassified viruses;Tellina virus 1 |
t |
f |
f |
f |
f |
649453 |
Viruses;unclassified phages;Cyanophage G1158 |
t |
f |
f |
f |
f |
536461 |
Viruses;unclassified phages;Cyanophage S-SM1 |
t |
f |
f |
f |
f |
536462 |
Viruses;unclassified phages;Cyanophage S-SM2 |
t |
f |
f |
f |
f |
77041 |
Viruses;unclassified viruses;Stealth virus 4 |
t |
f |
f |
f |
f |
77042 |
Viruses;unclassified viruses;Stealth virus 5 |
t |
f |
f |
f |
f |
641835 |
Viruses;unclassified phages;Vibrio phage 512 |
t |
f |
f |
f |
f |
1074427 |
Viruses;unclassified viruses;Mouse Rosavirus |
t |
f |
f |
f |
f |
1074428 |
Viruses;unclassified viruses;Mouse Mosavirus |
t |
f |
f |
f |
f |
480920 |
other sequences;plasmids;IncP-1 plasmid 6-S1 |
f |
f |
f |
t |
f |
2441 |
other sequences;plasmids;Plasmid ColBM-Cl139 |
f |
f |
f |
t |
f |
168317 |
other sequences;plasmids;IncQ plasmid pIE723 |
f |
f |
f |
t |
f |
536472 |
Viruses;unclassified phages;Cyanophage Syn10 |
t |
f |
f |
f |
f |
536474 |
Viruses;unclassified phages;Cyanophage Syn30 |
t |
f |
f |
f |
f |
2407 |
other sequences;transposons;Transposon Tn501 |
f |
f |
f |
t |
f |
227307 |
Viruses;ssDNA viruses;Circoviridae;Gyrovirus |
t |
f |
f |
f |
f |
687247 |
Viruses;unclassified phages;Cyanophage ZQS-7 |
t |
f |
f |
f |
f |
Before any analysis, to avoid biases, sequences have to be checked for their level of identity. Although the NR set contains sequences which are already compared between each other, an extra check is always recommended.
For
each disulfide bond’s prior statistical analysis, data is marked if it is
reactive or overlapping. After marking overlaps, that information automatically
reveals how many consecutive and non-consecutive bonds are inside each protein,
and that data is stored in the proteins
table from which all protein complexes are excluded in final
result.
Each disulfide bond is marked also for its association to sets, by checking both bond sides to see if they belong to the same NR set. Where that is not the case, the disulfide bond is skipped for that set analysis.
To analyze the quantity of bonds by their variance, each length has to be put in a specific class. In this case, only five classes are chosen as written in the function below.
CREATE
OR
REPLACE
FUNCTION len2class(
len
integer)
RETURNS
integer
AS
$
BODY$
BEGIN
return
case
when
len <=
100
then
1
when
len >
100
and
len <=
200
then
2
when
len >
200
and
len <=
300
then
3
when
len >
300
and
len <=
400
then
4
else
5
end;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
Most of the protein sizes are less than 400 amino acids, so length classification is done by splitting lengths into five ranges, each taking 100 amino acids, except the last one which takes the rest. Below you can see how to use this function to extract data for analysis:
SELECT p.code,
p.title,
p.ss_bonds,
p.ssbonds_overlap,
p.intra_count,
p.sci_name_src,
p.len,
p.tax_path,
p.pfam_families,
p.src_class,
(
SELECT s.id
FROM src_classes s
WHERE s.src_class::
text = p.src_class::
text)
AS src_class_id,
p.len_class7,
(
SELECT s.val
FROM sources_organela s
WHERE s.code = p.code::bpchar
AND s.tag::
text =
'EXPRESSION_SYSTEM_CELLULAR_LOCATION'::
text)
AS expression_system_cellular_location,
(
SELECT s.val
FROM sources_organela s
WHERE s.code = p.code::bpchar
AND s.tag::
text =
'CELLULAR_LOCATION'::
text)
AS cellular_location,
ps.sequence,
ps.uniprot_code,
ps.accession_code,
ps.cc,
ps.seq_uniprot,
ps.chain_id
FROM proteins p
JOIN nr n
ON n.code::
text = p.code::
text
AND n.rep7 =
true
JOIN ps ps
ON ps.code::
text = n.code::
text
AND ps.chain_id = n.chain_id
AND ps.het =
false
WHERE p.is_excluded =
false
AND p.chain_cnt =
1
AND p.is_set7 =
true
AND p.reactive_cnt7 =
0
ORDER
BY p.code;
In this work, we showed how to process data, from fetching to analyzing. When working with scientific data, sometimes normalization is needed, and sometimes not. When working with small quantities of data which will not be reused for another analysis, then it is enough to leave it denormalized where processing is fast enough.
The reason why this was done all in one bioinformatics database is that PostgreSQL is able to integrate many languages. This includes R, which can do statistical analysis directly in-database—the subject for a future article on bioinformatics tools.