NRL_3D PROTEIN SEQUENCE--STRUCTURE DATABASE

distributed by the
Protein Information Resource (PIR)*
Supported by NIH grant LM-05206
And by NSF grant DIR-91--07540

in collaboration with the
Naval Research Laboratory (NRL)
Partially supported by the Office of Naval Research
and U.S. Army Medical Research and Development Command

Document NRL_3D-0992

This database may be copied and redistributed freely, without prior consent, provided that the PIR and the NRL are acknowledged as the source.

Vendors who redistribute the database are requested to identify it prominently. They should also indicate if the database has been reformatted, modified, or enhanced. We would appreciate receiving typical advertising copy for each release and an annual statement of (1) from whom and in what form you obtain the database and (2) to how many end-users you estimate it is distributed.

We have made every effort to present the data accurately and to ensure the proper functioning of the programs. We cannot be responsible for the consequences to users of any errors in the data or programs.

Protein Information Resource (PIR)
National Biomedical Research Foundation
Georgetown University Medical Center
3900 Reservoir Road, N.W.
Washington, D.C. 20007 USA

*PIR is a registered mark of NBRF.

NRL_3D is a sequence--structure database derived from the 3-dimensional structures of proteins deposited in the Protein Data Bank (PDB) [1] as of July 1992. NRL_3D was conceived, developed, and tested by K. Namboodiri, N. Pattabiraman, A. Lowrey, and B. Gaber [2,4] at the Naval Research Laboratory, Washington, DC. It was incorporated into the Protein Information Resource (PIR) [3] by D. George, W. C. Barker, and J. S. Garavelli at the National Biomedical Research Foundation, Washington, DC, and M. Kusunoki of the Institute for Protein Research, Osaka University, Osaka, Japan. When this database is used in published research please cite references 2, 3, and 4.
The PDB contains atomic coordinates for the 3-dimensional structures of biomolecules obtained using X-ray, electron or neutron diffraction, nuclear magnetic resonance or molecular modeling methods. One common research task is selecting sets of coordinate data according to sequence properties. However, the primary sequence information in the PDB is not presented in a format compatible with the sequence manipulation programs of the PIR or with other sequence analysis software packages. The NRL_3D database provides a link between this software and the PDB and allows the sequence portion of the PDB to be searched and analyzed.

Sequence, reference and annotation data from the coordinate sets in the PDB are extracted and reformatted in NBRF format (see the PIR Document Database File Structure and Format Specification for details). These data are set up as a separate database fully accessible to all PIR programs.

The program that performs the extraction and reformatting, PDB2PIR, can select coordinate sets according to the criteria that (1) they correspond to well-defined protein sequences, (2) they are determined with resolutions below a specified limit, (3) they were not determined by molecular mechanics or other theoretical techniques, or (4) they contain data added or revised after a specified date.

When a selected PDB entry contains more than one polypeptide chain, each chain is represented as a separate entry in NRL_3D. In addition, if more than 3.8 Angstroms separates the alpha-carbons of adjacent amino acids, the sequence is divided into separate fragments, and each fragment with more than three recognizable amino acids is also represented as a separate entry. Large separations between adjacent amino acids typically occur when an intervening sequence fragment has not been resolved in the electron density map. The existence of a large separation is taken to indicate that the sequence is not complete or that the chains are not covalently connected.
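The fragment rule described above can be illustrated with a short Python sketch. This is not the PDB2PIR source, which is not reproduced in this document. The function names, the input layout (one-letter code plus alpha-carbon coordinates per residue), and the use of "X" for an unrecognizable residue are assumptions made for illustration only. The 3.8 Angstrom limit and the "more than three recognizable amino acids" rule are taken from the text.

```python
import math

GAP_LIMIT = 3.8  # Angstroms between adjacent alpha-carbons (from the text)


def split_fragments(residues):
    """Split one chain into fragments at large alpha-carbon separations.

    residues: list of (one_letter_code, (x, y, z)) in chain order, where
    (x, y, z) is the alpha-carbon position. This layout is hypothetical.
    """
    fragments, current = [], []
    prev_xyz = None
    for code, xyz in residues:
        # Start a new fragment when the gap to the previous residue is too large.
        if prev_xyz is not None and math.dist(prev_xyz, xyz) > GAP_LIMIT:
            fragments.append(current)
            current = []
        current.append((code, xyz))
        prev_xyz = xyz
    if current:
        fragments.append(current)
    return fragments


def keep_fragment(fragment, unknown="X"):
    """Keep a fragment only if it has more than three recognizable residues.

    Treating the code "X" as unrecognizable is an assumption of this sketch.
    """
    recognizable = sum(1 for code, _ in fragment if code != unknown)
    return recognizable > 3


# Usage: five residues, an unresolved gap, then two more residues.
chain = [(c, (3.5 * i, 0.0, 0.0)) for i, c in enumerate("MKVLA")]
chain += [("G", (30.0, 0.0, 0.0)), ("S", (33.5, 0.0, 0.0))]
kept = [f for f in split_fragments(chain) if keep_fragment(f)]
```

In this usage the chain is cut once at the gap, and only the five-residue fragment would become an entry.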
The entry identification codes in NRL_3D are derived from the PDB codes by an automatic procedure. The first four characters in the code correspond to the PDB code. That four-character code is followed first by an optional letter distinguishing one of several polypeptide chains indicated in the same PDB entry. When the chain identifier of a PDB entry is a number rather than a letter, the letter corresponding to the number is used whenever possible. At the end of the code is an optional number distinguishing one of the several fragments detected in the same polypeptide chain. The fragment numbers are assigned in order from the amino terminal end, but fragments that do not have more than three recognizable amino acids are subsequently eliminated.

The titles for NRL_3D entries are constructed by combining the COMPND and SOURCE records of the PDB. The PDB resolution and the R-values are included when they can be recognized. The list of contributing authors is included as a reference with the citation 'Coordinates deposited in the Protein Data Bank.' Any additional references appearing in the PDB entry are included as comments. In the author index file (.AUX) only the contributing authors' names are indexed. Consequently the AUTHOR command of PSQ and Atlas will retrieve on these authors only.

The PDB HELIX, SHEET, TURN, SSBOND and SITE records as well as some special ATOM and HETATM records are carried into appropriate PIR features.

In the NBRF-PIR Protein Sequence Database, sequences are numbered sequentially, beginning with 1 at the amino end and continuing to the carboxyl end of the sequence. This numbering system is used by all the PIR software to specify subsequences or specific amino acids within the sequence. In the PDB, however, the numbering does not necessarily begin with 1, may contain negative values, may contain non-numerically coded insertions (e.g., 23A, 23B, etc.), and may not always follow consecutively.
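The entry-code construction and the contrast between sequential PIR numbering and PDB residue labels described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure used to build NRL_3D. The function names are hypothetical, the digit-to-letter correspondence (1 to A, 2 to B, and so on) is an assumption because the text does not spell it out, and the example code "1ABC" and residue labels are invented.

```python
import string


def chain_letter(chain_id):
    """Return the chain letter for a PDB chain identifier.

    Assumption: numeric identifiers map as 1 -> A, 2 -> B, ...; any
    identifier that cannot be mapped this way is returned unchanged.
    """
    chain_id = chain_id.strip()
    if chain_id.isdigit() and 1 <= int(chain_id) <= 26:
        return string.ascii_uppercase[int(chain_id) - 1]
    return chain_id


def entry_code(pdb_code, chain_id="", fragment=None):
    """Four-character PDB code + optional chain letter + optional fragment number."""
    code = pdb_code + chain_letter(chain_id)
    if fragment is not None:
        code += str(fragment)
    return code


def numbering_table(pdb_labels):
    """Map sequential positions (starting at 1) to the PDB residue labels."""
    return {pos: label for pos, label in enumerate(pdb_labels, start=1)}


# Usage with invented values: chain "2" of entry 1ABC, third fragment.
code = entry_code("1ABC", "2", 3)
# PDB labels that start below 1, contain insertions, and skip a number.
table = numbering_table(["-1", "0", "1", "23", "23A", "23B", "25"])
```

A lookup such as `table[5]` then gives the PDB label of the fifth residue in the sequential numbering, which is the kind of correspondence needed to move between a PIR subsequence and the coordinate data.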
The numbering schemes in PDB entries are chosen by the depositors of the coordinate sets to highlight the correspondence between residues in homologous structures. To alleviate to some extent the problems associated with these differing numbering systems, a transformation table has been constructed for each entry in the NRL_3D database. The MATCH and SCAN commands of the Protein Sequence Query (PSQ) program interface the sequence data with the coordinate data (the ATLAS program and the PIR Web site have similar searching capabilities).

The PIR does not distribute the PDB DATAPRTP files, nor does it distribute molecular graphics or modeling programs. These must be obtained from other sources.

NOTE: Since 1988, VAX versions of the PDB have used the file-naming convention PDB'code'.ENT. Many computer facilities using the data had previously adopted a convention with file names of the form 'code'.PDB. We would appreciate your comments on which convention you would prefer. We are sorry if this causes you any inconvenience.

References
----------

1. Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F., and Weng, J. (1987) in Crystallographic Databases - Information Content, Software Systems, Scientific Applications, eds., Allen, F.H., Bergerhoff, G., and Sievers, R., Data Commission of the Int'l Union of Crystallography, Bonn, Cambridge, Chester, pp. 107-132.

2. Namboodiri, K., Pattabiraman, N., Lowrey, A., and Gaber, B.P. (1988) Automated Protein Structure Data Bank Similarity Searches and Their Use in Molecular Modeling with MIDAS, J. Mol. Graphics 6, 211-212.

3. George, D.G., Barker, W.C., and Hunt, L.T. (1986) The Protein Identification Resource (PIR), Nucl. Acids Res. 14, 11-15.

4. Pattabiraman, N., Namboodiri, K., Lowrey, A., and Gaber, B.P. (1990) NRL_3D: a sequence-structure database derived from the protein data bank (PDB) and searchable within the PIR environment, Protein Sequences & Data Analysis 3, 387-405.