Document PSD-CODATA-0703

PIR Installation Document
For the CODATA Format Release of
PIR-International Protein Sequence Database
Release 68.00, March 31, 2001
219241 entries  76174552 residues
and
NRL_3D Sequence-Structure Database
PATCHX non-redundant supplement to PIR-International PSD

The collaborating centers of PIR-International:

  Protein Information Resource (PIR)*
  National Biomedical Research Foundation
  3900 Reservoir Road, NW, Washington, DC 20007, USA

  Japan International Protein Information Database (JIPID)
  Amakubo 1-16-1
  Tsukuba 305-0005, Japan

  Munich Information Center for Protein Sequences (MIPS)
  GSF-Forschungszentrum f. Umwelt und Gesundheit
  am Max-Planck-Institut f. Biochemie
  Am Klopferspitz 18, D-82152 Martinsried, FRG

This database may be redistributed, provided that this notice be given
to each user and that the words "Derived from" shall precede this notice
if the database has been altered by the redistributor.

We have made every effort to ensure proper functioning of the programs
and cannot be held responsible for the consequences to users of any
problems encountered during their operation.

*PIR is a registered mark of NBRF
PIR is partially supported by National Library of Medicine grant LM05798

1.0 CODATA Format
=================

This document describes the quarterly release of the PIR-International
Protein Sequence Database in CODATA format, formerly distributed on
magnetic media for non-VAX/VMS systems in fixed-length 80-byte records.

2.0 In this Release
===================

Release 68.00 of the Protein Sequence Database contains 219,241 entries
and 76,174,552 residues. The Release is separated into four datasets.
Section 1, Fully Classified Entries, contains 20,498 entries and
8,042,606 residues. Section 2, Verified and Classified Entries, contains
198,276 entries and 68,038,310 residues. Section 3, Unverified Entries,
contains 62 entries and 27,267 residues. Section 4, Unencoded or
Untranslated Entries, contains 405 entries and 66,369 residues.
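As a quick consistency check, the four section counts quoted above can be
summed and compared with the release totals. The Python sketch below is
illustrative only and is not part of the distribution; the names SECTIONS,
RELEASE_TOTAL and section_totals are hypothetical, and the figures are
copied from this document.

```python
# Entry and residue counts per section, copied from the Release 68.00 summary.
# The dictionary layout and all names here are illustrative assumptions.
SECTIONS = {
    "PIR1": (20_498, 8_042_606),     # Section 1, Fully Classified Entries
    "PIR2": (198_276, 68_038_310),   # Section 2, Verified and Classified Entries
    "PIR3": (62, 27_267),            # Section 3, Unverified Entries
    "PIR4": (405, 66_369),           # Section 4, Unencoded or Untranslated Entries
}

# Release-wide totals as stated for Release 68.00.
RELEASE_TOTAL = (219_241, 76_174_552)


def section_totals(sections):
    """Return (total entries, total residues) summed over all sections."""
    entries = sum(count for count, _ in sections.values())
    residues = sum(length for _, length in sections.values())
    return entries, residues


if __name__ == "__main__":
    totals = section_totals(SECTIONS)
    print("sum of sections:", totals)
    print("matches release totals:", totals == RELEASE_TOTAL)
```

The same comparison could be repeated against counts obtained from the
distributed PIR1.DAT through PIR4.DAT files after installation.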
A total of 31,598 superfamilies are represented in sections 1 and 2.

The NRL_3D Sequence-Structure Database contains 23,291 entries and
4,527,721 residues corresponding to the March 2000 Release of the
Protein Data Bank.

3.0 Features in this Release
============================

Starting with Release 64.00 of the Protein Sequence Database,
PIR-International is including status information in protein titles,
function and complex records. These new status identifiers are as
follows.

  [validated]  = in a title or function block means that one of the
                 references in the entry contains some experimental
                 evidence for the protein's function.

  [similarity] = in a title or function block means that the name
                 and/or function has been assigned by end-to-end
                 sequence similarity with other entries that have that
                 same name or function.

  [imported]   = in a title means that the name was imported with the
                 sequence from GenBank, EMBL, DDBJ, or another source
                 and has not been verified by PIR.

Complete coverage of the entire database will not be obtained for
several releases. The absence of a status identifier at this time should
NOT be taken as an indication that the information in the title or
function blocks is not correct or has not been evaluated by PIR staff.

!!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!!
IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE NBRF-PIR
FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!! THE *.REF, *.SEQ,
*.NAM, AND *.DAT FLAT FILES WILL CONTINUE. IF YOU WISH US TO CONTINUE
TO PROVIDE ANY OF THE INDEX FILES PLEASE LET US KNOW. CONTACT
PIRMAIL@NBRF.GEORGETOWN.EDU .
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported can be tracked
by the GenBank PID and NID cross references located in accession blocks
of entries in the Protein Sequence Database.
GenBank entries whose references do not appear in the PIR-International
database are also candidates for inclusion, and many of these candidates
will be merged into larger sequence reports.

In 1995, the PIR4 dataset was introduced for sequences that most
researchers would not normally wish to include in searches for
molecular, systematic or evolutionary studies. Since this data is in the
published literature but may not be suitable for all users, a separate
dataset is maintained.

3.2 Database Statistics
=======================

Statistical information about the database is available in the
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short
directory ordered by Superfamily number, the PRINDX file contains
statistics about taxonomic frequency and longest sequence lengths in
various classifications. The SUPFSTAT file contains statistics about
Superfamily classification completeness and largest Superfamily groups.
PADD and PREV files reflect additions/revisions to both the PIR1 and
PIR2 sections of the database.

4.0 Technical Developers Information
====================================

The Technical Developers Bulletin is a document describing current and
future database formats. The Bulletins are available from the PIR WWW
Server at URL

  http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html

This electronic bulletin provides detailed specifications of the
database format and serves as an "early warning system" for software
developers and others who are concerned about changes in the format and
standards for the PIR-International databases. If you are interested in
the technical aspects of these database changes and would like to be
placed on the mailing list for the Technical Bulletin, send a brief
electronic mail note to PIRMAIL@NBRF.Georgetown.Edu.
5.0 File Organization
=====================

CODATA.TXT      contains a copy of this document
PROTEIN.TXT     contains a document describing the CODATA Sequence
                Exchange format
PRINDX.LIS      contains a listing of the Superfamily number and title
                of each entry in Section 1 (PIR1) and Section 2 (PIR2)
                that is classified as well as statistics on taxonomic
                frequency and sequence lengths
PIR1.DAT        contains the PIR-International Protein Sequence Database
                (Section 1. Fully Classified Entries)
PADD.LIS        contains a list of entries added in this update to
                Section 1 (PIR1) and Section 2 (PIR2)
PREV.LIS        contains a list of entries revised or deleted in this
                update of Section 1 (PIR1) and Section 2 (PIR2)
PIR2.DAT        contains the PIR-International Protein Sequence Database
                (Section 2. Verified and Classified Entries)
PIR3.DAT        contains the PIR-International Protein Sequence Database
                (Section 3. Unverified Entries)
PIR4.DAT        contains the PIR-International Protein Sequence Database
                (Section 4. Unencoded or Untranslated Entries)
SUPFAMNUM.LIS   contains a listing of the Superfamily names represented
                in Section 1 (PIR1) and Section 2 (PIR2) organized
                numerically by Superfamily number
SUPFSTAT.LIS    contains statistics about Superfamily classification
                represented in Section 1 (PIR1) and Section 2 (PIR2)
TAXONOMY.LIS    contains a hierarchically ordered list of species names
                found in the data sets
JOURNALS.LIS    contains an alphabetical listing of journal
                abbreviations as found in the PIR
SGC.LIS         contains a listing of eight Special Genetic Code tables
                depicting different codon usage for different
NRL3D.DOC       contains a document describing the NRL3D database
                structure and origin
NRL3D.DAT       contains the NRL3D database corresponding to the
                indicated release of the Protein Data Bank
NRL3D.NUM       contains a table correlating the numbering in PDB and
                NRL_3D for each entry in NRL_3D

6.0 File Descriptions/Formats
=============================

PIR1.DAT represents Section 1 (Fully Classified Entries) of the Protein Sequence
Database.
PIR2.DAT represents Section 2 (Verified and Classified Entries) of the
Protein Sequence Database.
PIR3.DAT represents Section 3 (Unverified Entries) of the Protein
Sequence Database.
PIR4.DAT represents Section 4 (Unencoded or Untranslated Entries) of the
Protein Sequence Database.

Entries in PIR3 contain basic information that has not been reviewed;
only the sequence and reference have been checked and verified. PIR2 and
PIR3 are ordered by species according to the hierarchy represented in
TAXONOMY.LIS.

The PIR-International Protein Sequence Database sets PIR1, PIR2, PIR3
and PIR4 are distributed in the NBRF implementation of the CODATA
Sequence Data Exchange Format.

6.1 Superfamily List Files
==========================

The PRINDX.LIS file is a database listing (PIR1 and PIR2) of classified
sequences ordered by Superfamily number. This file also contains the
frequency of each taxonomy group and the entry with the longest sequence
in each group.

SUPFAMNUM.LIS contains a listing of the Superfamily names found for each
Superfamily number represented in Section 1 (PIR1) and Section 2 (PIR2).

6.2 PIR1 Update Information
===========================

Files PADD.LIS and PREV.LIS list entries that have changed since the
last release. PADD contains a short directory of all entries that are
new in this release (PIR1 and PIR2). PREV contains a short directory of
all entries that had sequence revisions, text revisions, or code
changes, or that were deleted.

6.3 Data Files
==============

In each data file (.LIS) and in the NRL3D.NUM file, any line longer than
78 characters is wrapped onto the next record. In such a case, the first
record is terminated with '##'; any record beginning with '##' is a
continuation of the record immediately preceding it. All blanks or
spaces are preserved. To reassemble the file to larger record lengths,
append multiple records together without including the '##' pairs.
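The reassembly rule above can be sketched in Python as follows. This is a
minimal illustration, not a PIR-supplied utility: the function name
unwrap_records is hypothetical, and the sketch assumes that each record
has already been read as a text line and that no blank padding follows a
terminating '##'.

```python
def unwrap_records(records):
    """Join '##'-continued records back into full-length lines.

    Assumption: a wrapped record ends with '##' and its continuation
    begins with '##'; both markers are dropped when the pieces are joined.
    Blanks inside the records are left untouched.
    """
    lines = []
    for record in records:
        record = record.rstrip("\r\n")  # remove only the line terminator
        if record.startswith("##") and lines:
            previous = lines[-1]
            if previous.endswith("##"):
                previous = previous[:-2]  # drop the terminating '##'
            lines[-1] = previous + record[2:]  # drop the leading '##'
        else:
            lines.append(record)
    return lines


if __name__ == "__main__":
    sample = ["alpha\n", "left-##\n", "##middle-##\n", "##right\n", "omega\n"]
    for line in unwrap_records(sample):
        print(line)
```

If the files are kept in fixed-length 80-byte records with trailing
blanks, the padding would need to be handled before the '##' test; that
step is left out here because the padding convention is not stated in
this document.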
6.3.1 Taxonomic Data Files
==========================

The TAXONOMY.LIS and SGC.LIS files are compiled and maintained by
Andrzej Elzanowski of MIPS at the Max-Planck-Institut Fuer Biochemie in
Martinsried, Germany. These files represent data as it appears in
Section 1 (PIR1) and Section 2 (PIR2) of the Protein Sequence Database.
The PIR2 and PIR3 data sets are ordered according to the hierarchy of
the TAXONOMY.LIS file.

6.3.2 Journal List
==================

The JOURNALS.LIS file is compiled and maintained by the National Library
of Medicine (NLM). Journal names followed by an asterisk are no longer
in use but continue to exist in the database.

6.3.3 NRL3D.NUM File
====================

The NRL3D.NUM file contains residue numbering information allowing
sequence reconstruction, since the numbering system used in entries of
the Protein Data Bank may not correspond to that used in the
PIR-International database. Entry title lines longer than 78 characters
are wrapped as described in section 6.3.