Filtering and Processing Data

Data is retrieved from snapshots from the RCSB PDB repository.

Each file is imported into a single table raw_pdb in our PostgreSQL database using a Perl script. The script uses transactions of 10,000 inserts per chunk.

The structure of raw_pdb is this:

Column

Type

Modifiers

code

character varying(20)

not null

line_num

integer

not null

line_cont

character varying(80)

not null

The import script looks like this:

#!/usr/bin/perl -w
 
use FindBin '$Bin';
use DBI;
 
$dbName = 'bioinf'; 
$dbLogin = 'ezop'; 
$dbPass = 'XYZ';
 
$conn = DBI->connect("DBI:Pg:database=$dbName;host=localhost", "$dbLogin", "$dbPass", {'RaiseError' => 1, 'AutoCommit' => 0});
 
die "./pdb_lines_unos.pl <table> <file>" if not defined($ARGV[0]);
 
$recordCount = 0;
$table = $ARGV[0];
$fName = $ARGV[1];
open F, "zcat $fName|";
while (<F>) {
    chomp;
    $linija = $_;
    $recordCount += 1;
    $sql = "insert into $table (code, line_num, line_cont) values (?, ?, ?)";
 
    $conn->do($sql, undef, $fName, $recordCount, $linija);
    $conn->commit() if ($recordCount%10000 == 0);
}
close F;
$conn->commit();
 
1;

After lines are imported, they are parsed using functions we will define below.

From raw_pdb data, we generate the tables tspsproteinssourcessources_organela, and ss_bond by parsing the corresponding records.

The ps table has three important columns: chainlength, and sequence. Protein sequence is generated using C-alpha atoms for each chain and for each residue ordered by residue sequence, taking only the first insertion and the first alternate location. chainis taken from the TS.chain column, and length is simply the precalculated length of the sequence string. Since this article is meant to analyze only single chains and intrachain connections, multiple-chain proteins are skipped in our analysis here.

Within SSBOND records, all disulfide bonds are stored in the pdb_ssbond table, which inherits from the pdb_ssbond_extendedtable. pdb_ssbond looks like this:

Column

Type

Nullable

Default

Description

id

integer

not null

nextval(‘pdb_ssbond_id_seq’::regclass)

 

code

character(4)

 

 

four-letter code

ser_num

integer

 

 

serial number of ssbond

residue1

character(3)

 

 

first residue in bond

chain_id1

character(1)

 

 

first chain in bond

res_seq1

integer

 

 

sequential number of first residue

i_code1

character(1)

 

 

insertion code of first residue

residue2

character(3)

 

 

second residue in bond

chain_id2

character(1)

 

 

second chain in bond

res_seq2

integer

 

 

sequential number of second residue

i_code2

character(1)

 

 

insertion code of second residue

sym1

character(6)

 

 

first symmetry operator

sym2

character(6)

 

 

second symmetry operator

dist

numeric(5,2)

 

 

distance between atoms

is_reactive

boolean

not null

false

mark for reactivity (to be set)

is_consecutive

boolean

not null

true

mark for consecutive bonds (to be set)

rep7

boolean

not null

false

mark for set-7 structures (to be set)

rep40

boolean

not null

false

mark for set-40 structures (to be set)

rep80

boolean

not null

false

mark for set-80 structures (to be set)

is_from_pdb

boolean

 

true

is taken from PDB as source (to be set)

I also added these indexes:

"pdb_ssbond_pkey" PRIMARY KEY, btree (id)
"ndxcode1" btree (code, chain_id1, res_seq1)
"ndxcode2" btree (code, chain_id2, res_seq2)

It is assumed that distribution of disulfide bonds prior to the cutoff is Gaussian (without testing with, e.g., KS-test), so standard deviations were calculated on each distance between cysteines in same protein to discover the range of permitted bond lengths and compare them to the cutoff. The cutoff was same as the calculated mean plus-minus three standard deviations. We have extended the range to introduce more possible disulfide bonds which weren’t enlisted in PDB file in SSBOND rows but which we have inserted ourselves by calculating distances between ATOM records. The chosen range for ssbonds are between 1.6175344456264 and 2.48801947642267 , which is the mean (2.05) plus-minus four standard deviations:

select                     count(1) as     cnt
    ,      stddev(dist)             as std_dev  
    ,                     avg(dist) as avg_val
    ,     -stddev(dist) + avg(dist) as   left1
    ,      stddev(dist) + avg(dist) as  right1
    , -2 * stddev(dist) + avg(dist) as   left2
    ,  2 * stddev(dist) + avg(dist) as  right2
    , -3 * stddev(dist) + avg(dist) as   left3
    ,  3 * stddev(dist) + avg(dist) as  right3
    , -4 * stddev(dist) + avg(dist) as   left4
    ,  4 * stddev(dist) + avg(dist) as  right4
    ,         min(dist)
    ,         max(dist)
from pdb_ssbond_tmp
where dist > 0
    and dist < 20;

The TS table contains the coordinates of all atoms, but only cysteines will be used, with their sulfur named " SG ". So another staging table with " SG " sulfur atoms only is created for speeding up the process by reducing the number of records to search. When selecting sulfurs only, the number of combinations is much less than in the case of all atoms—194,574 of the former compared with 122,761,100 of the latter. Within this table joined to itself, distances are calculated using the Euclidean distance, and results are imported into the pdb_ssbond table but only where the distance is between the defined lengths calculated earlier. The reason for doing this speedup is to lessen the amount of time of running the whole process again—for checking purposes—keeping it within one day.

To clean disulfide bonds, we use the following algorithm:

·         Delete when both sides of connection point to same amino acid

·         Delete bonds whose length is not between 1.6175344456264 and 2.48801947642267

·         Remove insertions

·         Remove bonds caused by alternate atom locations, but leaving first location

The code for this (taking the pdb_ssbond table as the first argument) is:

 CREATE OR REPLACE FUNCTION pdb_ssbond_clean2( 
    clean_icodes boolean,
    clean_altloc boolean,
    mark_reactive boolean,
    mark_consecutive boolean)
RETURNS void AS $$
declare _res integer;
BEGIN  
    delete from pdb_ssbond b 
    where exists (
        select a.id
        from pdb_ssbond a
        where a.code = b.code
        and a.id > b.id
        and (
            (   a.chain_id1 = b.chain_id1 and a.res_seq1 = b.res_seq1
            and a.chain_id2 = b.chain_id2 and a.res_seq2 = b.res_seq2)
            or
            (   a.chain_id1 = b.chain_id2 and a.res_seq1 = b.res_seq2
            and a.chain_id2 = b.chain_id1 and a.res_seq2 = b.res_seq1)
        )
    ) ; 
 
    delete
    from pdb_ssbond
    where chain_id1 = chain_id2
    and res_seq1 = res_seq2
    and i_code1 = i_code2;
        
 
         
    update pdb_ssbond 
    set is_consecutive = true
    , is_reactive = false;
        
    delete from pdb_ssbond 
    where not dist between 1.6175344456264 and 2.48801947642267;
     
          
    
    if clean_icodes then  
        delete from pdb_ssbond 
        where  i_code1 not in (''' ', 'A') 
        or     i_code2 not in (''' ', 'A') ;
    end if;
    
    if clean_altloc then  
        delete from pdb_ssbond a
        where exists (
            select 1
            from pdb_ssbond  b
            where   b.code = a.code
                and b.chain_id1 = a.chain_id1
                and b.res_seq1 = a.res_seq1
                and b.i_code1 = a.i_code1
                and b.ser_num < a.ser_num
        ) ;
        
        delete from pdb_ssbond a
        where exists (
            select 1
            from pdb_ssbond  b
            where   b.code = a.code
                and b.chain_id2 = a.chain_id2
                and b.res_seq2 = a.res_seq2
                and b.i_code2 = a.i_code2
                and b.ser_num < a.ser_num
        ) ;
    end if;
 
    if mark_reactive then  
            update pdb_ssbond 
            set is_reactive = false ; 
       
            update pdb_ssbond 
            set is_reactive = true
            where exists (
                select 1
                from pdb_ssbond  b
                where   b.code = pdb_ssbond.code
                    and b.chain_id1 = pdb_ssbond.chain_id1
                    and b.res_seq1 = pdb_ssbond.res_seq1
                    and b.i_code1 = pdb_ssbond.i_code1
                    and b.ser_num < pdb_ssbond.ser_num
            ) ;  
         
       
            update pdb_ssbond 
            set is_reactive = true         
            where exists (
                select 1
                from pdb_ssbond  b
                where   b.code = pdb_ssbond.code
                    and b.chain_id2 = pdb_ssbond.chain_id2
                    and b.res_seq2 = pdb_ssbond.res_seq2
                    and b.i_code2 = pdb_ssbond.i_code2
                    and b.ser_num < pdb_ssbond.ser_num
            ) ; 
       
            update pdb_ssbond 
            set is_reactive = true         
            where (code, chain_id1, res_seq1, i_code1)
            in (
                select code, chain_id1 as c, res_seq1 as r, i_code1 as i
                from pdb_ssbond
                intersect
                select code, chain_id2 as c, res_seq2 as r, i_code2 as i
                from pdb_ssbond
            ) ; 
       
 
            update pdb_ssbond 
            set is_reactive = true         
            where (code, chain_id2, res_seq2, i_code2)
            in (
                select code, chain_id1 as c, res_seq1 as r, i_code1 as i
                from pdb_ssbond
                intersect
                select code, chain_id2 as c, res_seq2 as r, i_code2 as i
                from pdb_ssbond
            );
    end if;
    
    if mark_consecutive then
           update pdb_ssbond
           set is_consecutive = false;
 
            update pdb_ssbond
            set is_consecutive = true
            where not exists (
                select 1
                from pdb_ssbond a
                where
                    a.code = pdb_ssbond.code
                    and (
                        (a.chain_id1 = pdb_ssbond.chain_id1 and a.chain_id2 = pdb_ssbond.chain_id2)
                        or
                        (a.chain_id1 = pdb_ssbond.chain_id2 and a.chain_id2 = pdb_ssbond.chain_id1)
                    )
                    and (
                           (a.res_seq1 between pdb_ssbond.res_seq1 and pdb_ssbond.res_seq2)
                        or (a.res_seq2 between pdb_ssbond.res_seq1 and pdb_ssbond.res_seq2)
                        or (pdb_ssbond.res_seq1 between a.res_seq1 and a.res_seq2)
                        or (pdb_ssbond.res_seq2 between a.res_seq1 and a.res_seq2)
                    )
                    and not (
                            a.code = pdb_ssbond.code
                        and a.chain_id1 = pdb_ssbond.chain_id1
                        and a.chain_id2 = pdb_ssbond.chain_id2 
                        and a.res_seq1 = pdb_ssbond.res_seq1
                        and a.res_seq2 = pdb_ssbond.res_seq2
                    )
            );
    end if; 
END;
$$ LANGUAGE plpgsql;

With this, the non-redundant set of proteins is imported to the nrtable which is joined to the ps and proteins tables, and sets are marked (set7set40, and set80). At the end, according to protein quantity only one set will be analyzed. Mismatched chains between PDB and NR are removed from analysis.

Proteins without disulfide bonds are excluded from research, together with proteins that don’t belong to any set. Data is processed with DSSP, and these files which had problems with resolution or too many atoms to be processed are also excluded. Only proteins with single chains are used as result for analysis because interchain connections were not kept, although they are easily calculated from the ssbond table by counting the number of connections that have different chains.

For the remaining proteins, an update is done for the total number of bonds and the number of overlapping bonds, and this is done for each of the sets.

The source organism is extracted from SOURCE records. In cases where it is unknown, synthetic, designed, artificial, or hybrid, it is discarded from research. Low-resolution proteins are also excluded only when their side chain is not visible.

SOURCE records are stored in the sources table, which contains taxonomy rows. In some cases, the taxonomy is missing or incorrect. In these cases, the manual correction of experts is needed.

From the source and taxonomy downloaded from NCBI, the class is assigned to each primary structure. In case a class is unable to be assigned, the protein is removed from the analysis list. Knowing that biological databases are being used, an extra check of all source records and NCBI taxonomy classes is recommended to be performed by a biologist, otherwise there might be problems with classifications between different domains. To match source cellular locations with taxonomy IDs, data from the source table is extracted into the table sources_organela where all data is written as code, tag, and value. Its format is shown below:

select * from sources_organela where code = '1rav';

code

mol_id

tag

val

1rav

0

MOL_ID

1

1rav

7

CELLULAR_LOCATION

CYTOPLASM (WHITE)

The taxonomy archive we want to use is a ZIP file containing seven dump files. Among these files, two of the most important are names.dmp and merged.dmp. Both files are CSV tab-pipe delimited files as detailed in the documentation:

·         The file merged.dmp contains a list of previous taxonomy IDs, and the current taxonomy IDs into which each one was merged.

·         names.dmp contains these important columns in this order:

o    tax_id: The taxonomy ID

o    name_txt: The name of the species, and if applicable, the unique name (where species can be found with multiple names, this helps disambiguate)

·         division.dmp contains the names of top-level domains which we will use as our classes.

·         nodes.dmp is the table which contains a hierarchical structure of organisms using taxonomy IDs.

o    It contains a parent taxonomy ID, a foreign key which can be found in names.dmp.

o    It also contains a division ID which is important for us to correctly store the relevant top domain data.

With all this data and manual corrections (setting correct domains of life) we were able to create the structure of the taxonomy_pathtable. A sampling of data looks like this:

select * from taxonomy_path order by length(pathlimit 20 offset 2000;

tax_id

path

is_viral

is_eukaryote

is_archaea

is_other

is_prokaryote

142182

cellular organisms;Bacteria;Gemmatimonadetes

f

f

f

f

t

136087

cellular organisms;Eukaryota;Malawimonadidae

f

t

f

f

f

649454

Viruses;unclassified phages;Cyanophage G1168

t

f

f

f

f

321302

Viruses;unclassified viruses;Tellina virus 1

t

f

f

f

f

649453

Viruses;unclassified phages;Cyanophage G1158

t

f

f

f

f

536461

Viruses;unclassified phages;Cyanophage S-SM1

t

f

f

f

f

536462

Viruses;unclassified phages;Cyanophage S-SM2

t

f

f

f

f

77041

Viruses;unclassified viruses;Stealth virus 4

t

f

f

f

f

77042

Viruses;unclassified viruses;Stealth virus 5

t

f

f

f

f

641835

Viruses;unclassified phages;Vibrio phage 512

t

f

f

f

f

1074427

Viruses;unclassified viruses;Mouse Rosavirus

t

f

f

f

f

1074428

Viruses;unclassified viruses;Mouse Mosavirus

t

f

f

f

f

480920

other sequences;plasmids;IncP-1 plasmid 6-S1

f

f

f

t

f

2441

other sequences;plasmids;Plasmid ColBM-Cl139

f

f

f

t

f

168317

other sequences;plasmids;IncQ plasmid pIE723

f

f

f

t

f

536472

Viruses;unclassified phages;Cyanophage Syn10

t

f

f

f

f

536474

Viruses;unclassified phages;Cyanophage Syn30

t

f

f

f

f

2407

other sequences;transposons;Transposon Tn501

f

f

f

t

f

227307

Viruses;ssDNA viruses;Circoviridae;Gyrovirus

t

f

f

f

f

687247

Viruses;unclassified phages;Cyanophage ZQS-7

t

f

f

f

f

Before any analysis, to avoid biases, sequences have to be checked for their level of identity. Although the NR set contains sequences which are already compared between each other, an extra check is always recommended.

For each disulfide bond’s prior statistical analysis, data is marked if it is reactive or overlapping. After marking overlaps, that information automatically reveals how many consecutive and non-consecutive bonds are inside each protein, and that data is stored in the proteins table from which all protein complexes are excluded in final result.

Each disulfide bond is marked also for its association to sets, by checking both bond sides to see if they belong to the same NR set. Where that is not the case, the disulfide bond is skipped for that set analysis.

To analyze the quantity of bonds by their variance, each length has to be put in a specific class. In this case, only five classes are chosen as written in the function below.

CREATE OR REPLACE FUNCTION len2class(len integer)
    RETURNS integer AS
$BODY$
BEGIN
return 
    case 
        when len <= 100                then 1
        when len100 and len <= 200 then 2
        when len200 and len <= 300 then 3
        when len300 and len <= 400 then 4
                                       else 5
    end;
END;
$BODY$
    LANGUAGE plpgsql VOLATILE
    COST 100;

Most of the protein sizes are less than 400 amino acids, so length classification is done by splitting lengths into five ranges, each taking 100 amino acids, except the last one which takes the rest. Below you can see how to use this function to extract data for analysis:

 SELECT p.code,
    p.title,
    p.ss_bonds,
    p.ssbonds_overlap,
    p.intra_count,
    p.sci_name_src,
    p.len,
    p.tax_path,
    p.pfam_families,
    p.src_class,
    ( SELECT s.id
           FROM src_classes s
          WHERE s.src_class::text = p.src_class::text) AS src_class_id,
    p.len_class7,
    ( SELECT s.val
           FROM sources_organela s
          WHERE s.code = p.code::bpchar AND s.tag::text = 'EXPRESSION_SYSTEM_CELLULAR_LOCATION'::text) AS expression_system_cellular_location,
    ( SELECT s.val
           FROM sources_organela s
          WHERE s.code = p.code::bpchar AND s.tag::text = 'CELLULAR_LOCATION'::text) AS cellular_location,
    ps.sequence,
    ps.uniprot_code,
    ps.accession_code,
    ps.cc,
    ps.seq_uniprot,
    ps.chain_id
   FROM proteins p
     JOIN nr n ON n.code::text = p.code::text AND n.rep7 = true
     JOIN ps ps ON ps.code::text = n.code::text AND ps.chain_id = n.chain_id AND ps.het = false
  WHERE p.is_excluded = false AND p.chain_cnt = 1 AND p.is_set7 = true AND p.reactive_cnt7 = 0
  ORDER BY p.code;

 

PostgreSQL as a Processing Intermediary

In this work, we showed how to process data, from fetching to analyzing. When working with scientific data, sometimes normalization is needed, and sometimes not. When working with small quantities of data which will not be reused for another analysis, then it is enough to leave it denormalized where processing is fast enough.

The reason why this was done all in one bioinformatics database is that PostgreSQL is able to integrate many languages. This includes R, which can do statistical analysis directly in-database—the subject for a future article on bioinformatics tools.