Document PSD-CODATA-0701

PIR Installation Document
For the CODATA Format Release of

PIR-International Protein Sequence Database
Release 63.00, December 30, 1999
168,808 entries
58,629,455 residues

and:

NRL_3D Sequence-Structure Database
PATCHX non-redundant supplement to PIR-International PSD

The collaborating centers of PIR-International:

Protein Information Resource (PIR)*
National Biomedical Research Foundation
3900 Reservoir Road, NW, Washington, DC 20007, USA

Japan International Protein Information Database (JIPID)
Amakubo 1-16-1
Tsukuba 305-0005, Japan

Munich Information Center for Protein Sequences (MIPS)
GSF-Forschungszentrum f. Umwelt und Gesundheit
am Max-Planck-Institut f. Biochemie
Am Klopferspitz 18, D-82152 Martinsried, FRG

This database may be redistributed without prior consent, provided
that this notice be given to each user and that the words "Derived
from" shall precede this notice if the database has been altered by
the redistributor.

We have made every effort to ensure proper functioning of the programs
and cannot be held responsible for the consequences to users of any
problems encountered during their operation.

*PIR is a registered mark of NBRF

PIR is partially supported by National Library of Medicine grant LM05798

The following are the staff members from PIR, MIPS, and JIPID that
contributed to Release 63.00 of the Protein Sequence Database:

Protein Information Resource (PIR)
==================================
Winona C. Barker          Ph.D.   Director, PIR
John S. Garavelli         Ph.D.   Associate Director, PIR
Cathy H. Wu               Ph.D.   Director of Bioinformatics
Bruce C. Orcutt           Ph.D.   Senior Computer Scientist
Lai-Su L. Yeh             Ph.D.   Senior Scientist
Geetha Y. Srinivasarao    Ph.D.   Computer Scientist
Peter B. McGarvey         Ph.D.   Senior Scientist
Hongzhan Huang            Ph.D.   Senior Scientist
Chunlin Xiao              Ph.D.   Bioinformatics Programmer
Joseph F. Janda           B.S.    Technical Services Coordinator
Robert S. Ledley          D.D.S.  Principal Investigator

Munich Information Center for Protein Sequences (MIPS)
======================================================
Werner Mewes              Ph.D.   Director
Friedhelm Pfeiffer        Ph.D.   Head of Annotation Group
Gisela Fobo               Ph.D.   *Senior Annotator
Corinna Keilmann          Ph.D.   *Annotator
Irmtraut Dunger           Eng.    Annotator
Ute Kaemper               Ph.D.   *Annotator
Goar Astvatsatourian      Ph.D.   Annotator
Petr Jordan               Eng.    System Manager

Japanese International Protein Information Database (JIPID)
===========================================================
Akira Tsugita             Ph.D.   Data Bank Chairman, Annotator, Editor
Jinya Otsuka              Ph.D.   Data Bank Vice Chairman, Editor
Takashi Kunisawa          Ph.D.   System Manager, Computer Support, Editor
Hiromi Suzuki             Ph.D.   Research Associate, Computer Support
Kenji Miyazaki            Ph.D.   Editor, Data Entry
Tatsuhiko Yagi            Ph.D.   *Annotator
Hiroko Toda               Ph.D.   *Research Associate, Annotator
Masaharu Kamo             Ph.D.   Research Associate, Annotator, Brain
Yuzo Nozu                 Ph.D.   *Laboratory Chief, Annotator
Fumio Arisaka             Ph.D.   *Associate Professor, Annotator
Ruqun Shen                M.S.    Data Entry
Lin Xu                    -       Data Entry
Takao Kawakami            Ph.D.   Researcher, Annotator, Plant
Kazuo Satake              Ph.D.   *Enzyme Database
Miyuki Tsukahara          -       Secretary

* Part time personnel

1.0 CODATA Format
=================
This document describes the quarterly release of the PIR-International
Protein Sequence Database and the NRL_3D Sequence-Structure Database in
CODATA format, formerly distributed on magnetic media for non-VAX/VMS
systems in fixed-length 80-byte records.

2.0 In this Release
===================
Release 63.00 of the Protein Sequence Database contains 168,808 entries
and 58,629,455 residues. The Release is separated into four datasets.
Section 1, Fully Classified Entries, contains 20,032 entries and
7,820,966 residues. Section 2, Verified and Classified Entries,
contains 147,632 entries and 50,341,994 residues. Section 3, Unverified
Entries, contains 779 entries and 409,111 residues.
Section 4, Unencoded or Untranslated Entries, contains 365 entries and
57,384 residues. A total of 10,307 superfamilies are represented in
sections 1 and 2.

The NRL_3D Sequence-Structure Database contains 14,791 entries and
2,636,724 residues corresponding to the June 1998 Release of the
Protein Data Bank.

3.0 Features in this Release
============================

3.1 Ongoing Special Projects
============================
Entries extracted from GenBank CDS regions and imported can be tracked
by the GenBank PID and NID cross references located in accession blocks
of entries in the Protein Sequence Database. GenBank entries whose
references do not appear in the PIR-International database are also
candidates for inclusion and many of these candidates will be merged
into larger sequence reports.

In 1995, the PIR4 dataset was introduced for sequences that most
researchers would not normally wish to include in searches for
molecular, systematic or evolutionary studies. Since this data is in
the published literature but may not be suitable for all users, a
separate dataset is maintained.

3.2 Database Statistics
=======================
Statistical information about the database is available in the
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short
directory ordered by Superfamily number, the PRINDX file contains
statistics about taxonomic frequency and longest sequence lengths in
various classifications. The SUPFSTAT file contains statistics about
Superfamily classification completeness and largest Superfamily groups.
PADD and PREV files reflect additions/revisions to both the PIR1 and
PIR2 sections of the database.

4.0 Technical Developers Information
====================================
The Technical Developers Bulletin is a document describing current and
future database formats.
The Bulletins are available from the PIR WWW Server at URL

http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html

This electronic bulletin provides detailed specifications of the
database format and serves as an "early warning system" for software
developers and others who are concerned about changes in the format and
standards for the PIR-International databases. If you are interested in
the technical aspects of these database changes and would like to be
placed on the mailing list for the Technical Bulletin, send a brief
electronic mail note to PIRMAIL@NBRF.Georgetown.Edu.

5.0 File Organization
=====================
CODATA.TXT     contains a copy of this document

PROTEIN.TXT    contains a document describing the CODATA Sequence
               Exchange format

PRINDX.LIS     contains a listing of the Superfamily number and title
               of each entry in Section 1 (PIR1) and Section 2 (PIR2)
               that is classified as well as statistics on taxonomic
               frequency and sequence lengths

PIR1.DAT       contains the PIR-International Protein Sequence Database
               (Section 1. Fully Classified Entries)

PADD.LIS       contains a list of entries added in this update to
               Section 1 (PIR1) and Section 2 (PIR2)

PREV.LIS       contains a list of entries revised or deleted in this
               update of Section 1 (PIR1) and Section 2 (PIR2)

PIR2.DAT       contains the PIR-International Protein Sequence Database
               (Section 2. Verified and Classified Entries)

PIR3.DAT       contains the PIR-International Protein Sequence Database
               (Section 3. Unverified Entries)

PIR4.DAT       contains the PIR-International Protein Sequence Database
               (Section 4. Unencoded or Untranslated Entries)

SUPFAMNUM.LIS  contains a listing of the Superfamily names represented
               in Section 1 (PIR1) and Section 2 (PIR2) organized
               numerically by Superfamily number

SUPFSTAT.LIS   contains statistics about Superfamily classification
               represented in Section 1 (PIR1) and Section 2 (PIR2)

TAXONOMY.LIS   contains a hierarchically ordered list of species names
               found in the data sets

JOURNALS.LIS   contains an alphabetical listing of journal
               abbreviations as found in the PIR

SGC.LIS        contains a listing of eight Special Genetic Code tables
               depicting different codon usage for different

NRL3D.DOC      contains a document describing the NRL3D database
               structure and origin

NRL3D.DAT      contains the NRL3D database corresponding to the
               indicated release of the Protein Data Bank

NRL3D.NUM      contains a table correlating the numbering in PDB and
               NRL_3D for each entry in NRL_3D

6.0 File Descriptions/Formats
=============================
PIR1.DAT represents Section 1 (Fully Classified Entries) of the Protein
Sequence Database.

PIR2.DAT represents Section 2 (Verified and Classified Entries) of the
Protein Sequence Database.

PIR3.DAT represents Section 3 (Unverified Entries) of the Protein
Sequence Database.

PIR4.DAT represents Section 4 (Unencoded or Untranslated Entries) of
the Protein Sequence Database.

Entries in PIR3 contain basic information that has not been reviewed;
only the sequence and reference have been checked and verified. PIR2
and PIR3 are ordered by species according to the hierarchy represented
in TAXONOMY.LIS.

The PIR-International Protein Sequence Database sets PIR1, PIR2, PIR3
and PIR4 are distributed in the NBRF implementation of the CODATA
Sequence Data Exchange Format.

6.1 Superfamily List Files
==========================
The PRINDX.LIS file is a database listing (PIR1 and PIR2) of classified
sequences ordered by Superfamily number. This file also contains
frequencies of a given taxonomy group and the entry with the longest
sequence in each group.
SUPFAMNUM.LIS contains a listing of the Superfamily names found for
each Superfamily number represented in Section 1 (PIR1) and Section 2
(PIR2).

6.2 PIR1 Update Information
===========================
Files PADD.LIS and PREV.LIS contain entries that have changed since the
last release. PADD contains a short directory of all entries that are
new in this release (PIR1 and PIR2). PREV contains a short directory of
all entries that had sequence revisions, text revisions, or code
changes, or that were deleted.

6.3 Data Files
==============
The structure of each data file (.LIS) and the NRL3D.NUM file is such
that any line longer than 78 characters is wrapped to the next record.
In such a case, the first record is terminated with '##'; any record
beginning with '##' is a continuation of the record immediately
preceding it. All blanks or spaces are preserved. To reassemble the
file to larger record lengths, append multiple records together without
including the '##' pairs.

6.3.1 Taxonomic Data Files
==========================
The TAXONOMY.LIS and SGC.LIS files are compiled and maintained by
Andrzej Elzanowski of MIPS at the Max-Planck-Institut Fuer Biochemie in
Martinsried, Germany. These files represent data as it appears in
Section 1 (PIR1) and Section 2 (PIR2) of the Protein Sequence Database.
The PIR2 and PIR3 data sets are ordered according to the hierarchy of
the TAXONOMY.LIS file.

6.3.2 Journal List
==================
The JOURNALS.LIS file is compiled and maintained by the National
Library of Medicine (NLM). Journal names followed by an asterisk are no
longer in use but continue to exist in the database.

6.3.3 NRL3D.NUM File
====================
The NRL3D.NUM file contains residue numbering information allowing
sequence reconstruction since the numbering system used in entries of
the Protein Data Bank may not correspond to that used in the
PIR-International database. Entry title lines longer than 78 characters
are wrapped as described.
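
The '##' wrapping convention described in section 6.3 can be undone
with a few lines of code. The following is a minimal sketch in Python,
not a tool supplied with the release. The function name unwrap_records
is hypothetical, and the sketch assumes that each record has already
been read as one text line with no padding after a trailing '##'.

```python
from typing import Iterable, Iterator

MARK = "##"  # continuation marker described in section 6.3


def unwrap_records(records: Iterable[str]) -> Iterator[str]:
    """Join wrapped records back into logical lines.

    Assumption (not stated in the release notes): a record is a wrap
    continuation only when the previous record ended with '##' AND the
    current record begins with '##'. Blanks are preserved as-is.
    """
    parts = None       # pieces of the logical line being assembled
    awaiting = False   # True when the last record ended with '##'
    for raw in records:
        rec = raw.rstrip("\r\n")
        if awaiting and rec.startswith(MARK):
            # Continuation record: drop its leading '##'.
            piece = rec[len(MARK):]
        else:
            # A new logical line starts; emit the one in progress.
            # An unmatched trailing '##' is restored untouched.
            if parts is not None:
                yield "".join(parts) + (MARK if awaiting else "")
            parts = []
            piece = rec
        awaiting = piece.endswith(MARK)
        parts.append(piece[:-len(MARK)] if awaiting else piece)
    if parts is not None:
        yield "".join(parts) + (MARK if awaiting else "")
```

A typical use would be to pass an open file object, for example
"for line in unwrap_records(open('TAXONOMY.LIS')): ...". If a copy of
the files still carries fixed-length padding, the padding would need to
be stripped before the '##' test, which this sketch does not attempt.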