NRL_3D PROTEIN SEQUENCE--STRUCTURE DATABASE

                      distributed by the

             Protein Information Resource (PIR)*
              Supported by NIH grant LM-05206
              And by NSF grant DIR-91--07540

                in collaboration with the

                Naval Research Laboratory (NRL)
     Partially supported by the Office of Naval Research
    and U.S. Army Medical Research and Development Command


                    Document  NRL_3D-0992


This database may be copied and redistributed freely, without prior 
consent, provided that the PIR and the NRL are acknowledged as the 
source.  Vendors who redistribute the database are requested to 
identify it prominently.  They should also indicate if the database 
has been reformatted, modified, or enhanced.  We would appreciate 
receiving typical advertising copy for each release and an annual 
statement of (1) from whom and in what form you obtain the database 
and (2) to how many end-users you estimate it is distributed.


We have made every effort to present the data accurately and to 
ensure the proper functioning of the programs. We cannot be 
responsible for the consequences to users of any errors in the 
data or programs.


              Protein Information Resource (PIR)
            National Biomedical Research Foundation
             Georgetown University Medical Center
                   3900 Reservoir Road, N.W.
                  Washington, D.C. 20007  USA


             *PIR is a registered mark of NBRF.


NRL_3D is a sequence--structure database derived from the 3-dimensional 
structures of proteins deposited in the Protein Data Bank (PDB) [1] as of 
July 1992. NRL_3D was conceived, developed, and tested by K. Namboodiri, 
N. Pattabiraman, A. Lowrey, and B. Gaber [2,4] at the Naval Research 
Laboratory, Washington, DC. It was incorporated into the Protein 
Information Resource (PIR) [3] by D. George, W. C. Barker, and J. S. 
Garavelli at the National Biomedical Research Foundation, Washington, 
DC, and M. Kusunoki of the Institute for Protein Research, Osaka 
University, Osaka, Japan. When this database is used in published 
research, please cite references 2, 3, and 4.

The PDB contains atomic coordinates for the 3-dimensional structure of 
biomolecules obtained using X-ray, electron or neutron diffraction, 
nuclear magnetic resonance or molecular modeling methods. One common 
research task is selecting sets of coordinate data according to sequence 
properties. However, the primary sequence information in the PDB is not 
presented in a format compatible with sequence manipulation programs of 
the PIR or other sequence analysis software packages. The NRL_3D 
database provides a link between this software and the PDB and allows 
the sequence portion of the PDB to be searched and analyzed.

Sequence, reference and annotation data from the coordinate sets in the
PDB are extracted and reformatted in NBRF-format (see the PIR Document
Database File Structure and Format Specification for details). These 
data are set up as a separate database fully accessible to all PIR 
programs. The program that performs the extraction and reformatting, 
PDB2PIR, can select coordinate sets on criteria that (1) they correspond 
to well-defined protein sequences, (2) they are determined with 
resolutions below a specified limit, (3) they were not determined by 
molecular mechanics or other theoretical techniques or (4) they contain 
data added or revised after a specified date. When a selected PDB entry
contains more than one polypeptide chain, each chain is represented as 
a separate entry in NRL_3D. In addition, if more than 3.8 Angstroms 
separates the alpha-carbons of adjacent amino acids, the sequence is 
divided into separate fragments and each fragment with more than three 
recognizable amino acids is also represented in a separate entry. Large 
separations between adjacent amino acids typically occur when an 
intervening sequence fragment has not been resolved in the electron 
density map. The existence of large separations is taken to indicate 
that the sequence is not complete or that the chains are not covalently 
connected.
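
The fragment rule described above can be sketched as follows. This is 
an illustration only, not the PDB2PIR source: the function names, the 
input layout (one-letter code plus alpha-carbon coordinates), and the 
use of 'X' to mark an unrecognizable residue are all assumptions made 
for the example.

```python
import math

CA_GAP_LIMIT = 3.8  # Angstroms; the separation limit quoted in the text


def split_into_fragments(residues, gap_limit=CA_GAP_LIMIT):
    """Split one chain wherever adjacent alpha-carbons are too far apart.

    residues: list of (one_letter_code, (x, y, z)) tuples in chain order.
    Returns a list of fragments, each a list of the same tuples.
    """
    fragments = []
    current = []
    for code, ca in residues:
        # Start a new fragment when the gap to the previous CA exceeds the limit.
        if current and math.dist(current[-1][1], ca) > gap_limit:
            fragments.append(current)
            current = []
        current.append((code, ca))
    if current:
        fragments.append(current)
    return fragments


def keep_fragment(fragment, unknown="X"):
    """Keep a fragment only if it has more than three recognizable residues.

    Treating 'X' as "unrecognizable" is an assumption for this sketch.
    """
    recognizable = sum(1 for code, _ in fragment if code != unknown)
    return recognizable > 3


# Hypothetical chain: five residues, an unresolved stretch, then two residues.
chain = [("A", (0, 0, 0)), ("G", (3, 0, 0)), ("S", (6, 0, 0)),
         ("T", (9, 0, 0)), ("L", (12, 0, 0)),
         ("K", (30, 0, 0)), ("V", (33, 0, 0))]
fragments = split_into_fragments(chain)
kept = [f for f in fragments if keep_fragment(f)]
```

In this made-up chain the 18 Angstrom gap produces two fragments, and 
only the first, with five residues, would become an entry.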

The entry identification codes in NRL_3D are derived from the PDB codes 
by an automatic procedure. The first four characters in the code 
correspond to the PDB code. That four-character code is followed first 
by an optional letter distinguishing one of several polypeptide chains 
indicated in the same PDB entry. When the chain identifier of a PDB 
entry is a number rather than a letter, the letter corresponding to the 
number is used whenever possible. At the end of the code is an optional 
number distinguishing one of the several fragments detected in the same 
polypeptide chain. The fragment numbers are assigned in order from the 
amino terminal end, but fragments that do not have more than three 
recognizable amino acids are subsequently eliminated.
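
As a rough illustration of the code layout described above, the sketch 
below assembles an entry code from its three parts. The function names 
are hypothetical, the PDB code "1XYZ" is a placeholder, and the mapping 
of numeric chain identifiers to letters (1 to A, 2 to B, and so on) is 
an assumption; the text says only that the corresponding letter is used 
whenever possible.

```python
import string


def chain_letter(chain_id):
    """Return the letter used for a chain identifier.

    Assumed mapping for numeric identifiers: 1 -> A, 2 -> B, ... 26 -> Z.
    Identifiers with no corresponding letter are returned unchanged.
    """
    if chain_id.isdigit():
        n = int(chain_id)
        if 1 <= n <= 26:
            return string.ascii_uppercase[n - 1]
    return chain_id


def entry_code(pdb_code, chain_id="", fragment=None):
    """Build a code: four-character PDB code, optional chain letter,
    optional fragment number."""
    if len(pdb_code) != 4:
        raise ValueError("a PDB code has exactly four characters")
    code = pdb_code
    if chain_id.strip():
        code += chain_letter(chain_id.strip())
    if fragment is not None:
        code += str(fragment)
    return code


single_chain = entry_code("1XYZ")            # "1XYZ"
second_chain = entry_code("1XYZ", "B")       # "1XYZB"
numeric_chain = entry_code("1XYZ", "1")      # "1XYZA"
chain_fragment = entry_code("1XYZ", "B", 2)  # "1XYZB2"
```

Because fragments with three or fewer recognizable amino acids are 
dropped after numbering, the fragment numbers that survive for a chain 
need not be consecutive.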

The titles for NRL_3D entries are constructed by combining the COMPND 
and SOURCE records of PDB. The PDB resolution and the R-values are 
included when they can be recognized. The list of contributing authors 
is included as a reference with the citation 'Coordinates deposited in 
the Protein Data Bank.' Any additional references appearing in the PDB 
entry are included as comments. In the author index file (.AUX) only 
the contributing authors' names are indexed. Consequently, the AUTHOR 
command of PSQ and Atlas will retrieve on these authors only. The PDB 
HELIX, SHEET, TURN, SSBOND and SITE records as well as some special 
ATOM and HETATM records are carried into appropriate PIR features.

In the NBRF-PIR Protein Sequence Database, sequences are numbered 
sequentially, beginning with 1 at the amino end and continuing to the 
carboxyl end. This numbering system is used by all the PIR software 
to specify subsequences or specific amino acids within the sequence. 
In the PDB, however, the numbering does not necessarily begin with 1, 
may contain negative values, may contain non-numerically coded 
insertions (e.g., 23A, 23B), and may not always follow 
consecutively. The numbering schemes in PDB entries are chosen by the 
depositors of the coordinate sets to highlight the correspondence 
between residues in homologous structures. To reduce the problems 
caused by these differing numbering systems, a transformation table 
has been constructed for each entry in the NRL_3D database.
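
The idea of a transformation table can be illustrated with a small 
sketch. The actual table format used by NRL_3D is not described here, 
so the two-dictionary layout, the function name, and the sample residue 
labels below are assumptions chosen for the example.

```python
def build_transformation_table(pdb_labels):
    """Pair sequential PIR positions (1, 2, 3, ...) with PDB residue labels.

    pdb_labels: PDB residue labels in chain order, kept as strings so that
    negative numbers and insertion codes such as "23A" are preserved.
    Returns (seq_to_pdb, pdb_to_seq) lookup dictionaries.
    """
    seq_to_pdb = {pos: label for pos, label in enumerate(pdb_labels, start=1)}
    pdb_to_seq = {label: pos for pos, label in seq_to_pdb.items()}
    return seq_to_pdb, pdb_to_seq


# Hypothetical PDB numbering: starts below 1, skips 0, has insertions and a gap.
labels = ["-2", "-1", "1", "2", "23", "23A", "23B", "25"]
seq_to_pdb, pdb_to_seq = build_transformation_table(labels)

first_label = seq_to_pdb[1]         # PDB label of sequential residue 1
insertion_pos = pdb_to_seq["23A"]   # sequential position of PDB residue 23A
```

With such a table, a subsequence located by sequential position can be 
translated back to the residue labels used in the coordinate file, and 
the reverse.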

The MATCH and SCAN commands of the Protein Sequence Query (PSQ) 
program and of the ATLAS program interface the sequence data with 
the coordinate data. The PIR Web site has similar searching 
capabilities.

The PIR does not distribute the PDB DATAPRTP files, nor does it 
distribute molecular graphics or modeling programs. These must be 
obtained from other sources.

NOTE: Since 1988, VAX versions of the PDB have used the file-naming 
convention PDB'code'.ENT. Many computer facilities using the data 
had previously adopted a convention with file names of the form 
'code'.PDB. We would appreciate your comments on which convention 
you would prefer. We are sorry if this causes you any inconvenience.

References
----------

1. Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F., and
   Weng, J. (1987) in Crystallographic Databases - Information
   Content, Software Systems, Scientific Applications, eds.,
   Allen, F.H., Bergerhoff, G., and Sievers, R., Data Commission
   of the Int'l Union of Crystallography, Bonn, Cambridge,
   Chester, pp. 107-132.

2. Namboodiri, K., Pattabiraman, N., Lowrey, A., and Gaber, B.P.
   (1988) Automated Protein Structure Data Bank Similarity
   Searches and Their Use in Molecular Modeling with MIDAS, J.
   Mol. Graphics 6, 211-212.

3. George, D.G., Barker, W.C., and Hunt, L.T. (1986) The Protein
   Identification Resource (PIR), Nucl. Acids Res. 14, 11-15.

4. Pattabiraman, N., Namboodiri, K., Lowrey, A., and Gaber, B.P.
   (1990) NRL_3D: a sequence-structure database derived from the
   protein data bank (PDB) and searchable within the PIR
   environment, Protein Sequences & Data Analysis 3, 387-405.