Document NBRF-INSTALL-703

PIR Installation Document
for
NBRF-PIR Format Release
of the
PIR-International Protein Sequence Database
Release 69.00, June 30, 2001
232,624 entries, 80,607,033 residues
and
NRL_3D Sequence-Structure Database
PATCHX non-redundant supplement to PIR-International PSD

The collaborating centers of PIR-International:

Protein Information Resource (PIR)*
National Biomedical Research Foundation
3900 Reservoir Road, NW, Washington, DC 20007, USA

Japan International Protein Information Database (JIPID)
Amakubo 1-16-1
Tsukuba 305-0005, Japan

Munich Information Center for Protein Sequences (MIPS)
GSF-Forschungszentrum f. Umwelt und Gesundheit
am Max-Planck-Institut f. Biochemie
Am Klopferspitz 18, D-82152 Martinsried, FRG

This database may be redistributed without prior consent, provided
that this notice be given to each user and that the words "Derived
from" shall precede this notice if the database has been altered by
the redistributor.

We have made every effort to ensure proper functioning of the programs
and cannot be held responsible for the consequences to users of any
problems encountered during their operation.

Copyright 2000, PIR-International.

*PIR is a registered mark of NBRF

PIR is partially supported by National Library of Medicine grant
LM05798

1.0 NBRF Format
===============

This document describes the quarterly release of the PIR-International
Protein Sequence Database and the NRL_3D Sequence-Structure Database
in NBRF-PIR format, formerly distributed on magnetic media for VAX/VMS
systems.

2.0 In this Release
===================

Release 69.00 of the Protein Sequence Database contains 232,624
entries and 80,607,033 residues. The Release is separated into four
datasets.
Section 1, Fully Classified Entries, contains 20,501 entries and
8,043,553 residues.
Section 2, Verified and Classified Entries, contains 211,695 entries
and 72,497,046 residues.
Section 3, Unverified Entries, contains 23 entries and 69 residues.
Section 4, Unencoded or Untranslated Entries, contains 405 entries and
66,365 residues.

A total of 32,933 superfamilies are represented in sections 1 and 2.

The NRL_3D Sequence-Structure Database contains 23,291 entries and
4,527,721 residues corresponding to the March 2000 Release of the
Protein Data Bank.

3.0 Features in this Release
============================

Starting with Release 64.00 of the Protein Sequence Database,
PIR-International is including status information in protein titles,
function and complex records. These new status identifiers are as
follows.

[validated]   = in a title or function block means that one of the
                references in the entry contains some experimental
                evidence for the protein's function.

[similarity]  = in a title or function block means that the name
                and/or function has been assigned by end-to-end
                sequence similarity with other entries that have that
                same name or function.

[imported]    = in a title means that the name was imported with the
                sequence from GenBank, EMBL, DDBJ, or other source and
                has not been verified by PIR.

Complete coverage of the entire database will not be obtained for
several releases. The absence of a status identifier at this time
should NOT be taken as an indication that the information in the
title or function blocks is not correct or has not been evaluated by
PIR staff.

3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported can be tracked
by the GenBank PID, PIDN (protein_id), and NID cross references
located in accession blocks of entries in the Protein Sequence
Database. GenBank entries whose references do not appear in the
PIR-International database are also candidates for inclusion, and many
of these candidates will be merged into larger sequence reports.

In 1995, the PIR4 dataset was introduced for sequences that most
researchers would not normally wish to include in searches for
molecular, systematic or evolutionary studies.
Since this data is in the published literature but may not be suitable
for all users, a separate dataset is maintained.

3.2 Database Statistics
=======================

Statistical information about the database is available in the
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short
directory ordered by Superfamily number, the PRINDX file contains
statistics about taxonomic frequency and longest sequence lengths in
various classifications. The SUPFSTAT file contains statistics about
Superfamily classification completeness and largest Superfamily
groups. PADD and PREV files reflect additions/revisions to both the
PIR1 and PIR2 sections of the database.

4.0 Technical Developers Information
====================================

The Technical Developers Bulletin is a document describing current and
future database formats. The Bulletins are available from the PIR WWW
Server at URL

    http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html

This electronic bulletin provides detailed specifications of the
database format and serves as an "early warning system" for software
developers and others who are concerned about changes in the format
and standards for the PIR-International databases. If you are
interested in the technical aspects of these database changes and
would like to be placed on the mailing list for the Technical
Bulletin, send a brief electronic mail note to
PIRMAIL@NBRF.Georgetown.Edu.
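
For developers working with the flat files, the status identifiers
described in section 3.0 ([validated], [similarity], [imported]) can be
tallied with a few lines of code. The following is only a minimal
sketch: it assumes the identifier appears verbatim, in square brackets,
somewhere in the title text, and the function names and sample titles
are hypothetical and not taken from the database.

```python
import re
from collections import Counter

# Status identifiers named in section 3.0 of this document.
STATUS_TAGS = ("validated", "similarity", "imported")
_TAG_RE = re.compile(r"\[(%s)\]" % "|".join(STATUS_TAGS))


def title_status(title):
    """Return the first status identifier found in a title, or None."""
    match = _TAG_RE.search(title)
    return match.group(1) if match else None


def tally_status(titles):
    """Count titles per status identifier; untagged titles are counted too."""
    counts = Counter()
    for title in titles:
        counts[title_status(title) or "untagged"] += 1
    return counts


# Hypothetical titles, for illustration only.
sample_titles = [
    "example protein A [validated] - hypothetical organism",
    "example protein B [similarity] - hypothetical organism",
    "example protein C [imported] - hypothetical organism",
    "example protein D - hypothetical organism",
]

if __name__ == "__main__":
    for tag, count in sorted(tally_status(sample_titles).items()):
        print(tag, count)
```

As section 3.0 notes, an untagged title should not be read as unverified; the "untagged" count only reflects that coverage is still incomplete.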
5.0 Protein Sequence Database Files
===================================

5.1 Database Index and Documentation Files
==========================================

0NBRF.TXT       this document
PRINDX.LIS      database listing of Section 1 and 2 (PIR1 and PIR2)
                and statistics on taxonomic frequency and sequence
                lengths
0PROTEIN.TXT    document describing database file format
0NRL_3D.TXT     document describing NRL 3D database

5.2.1 Primary Protein Sequence Database files
=============================================

.SEQ    primary file containing the title and sequence for each entry
        (ASCII).
.REF    primary file containing the title and annotation information
        for each entry (ASCII).
.INX    index file for the SEQ and REF files (Binary). It allows PSQ
        to use the VAX-11 RMS RFA record access mode for random access
        into these files. If either the .SEQ or .REF file is altered
        in any way, the information in the .INX file becomes invalid
        and the database system programs will not operate.

!!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!!
IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE
NBRF-PIR FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!!
THE *.REF, *.SEQ, *.NAM, AND *.DAT FLAT FILES WILL CONTINUE.
IF YOU WISH US TO CONTINUE TO PROVIDE ANY OF THE INDEX FILES
PLEASE LET US KNOW. CONTACT PIRMAIL@NBRF.GEORGETOWN.EDU .
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

5.3 Database files
==================

The Protein Sequence Database is partitioned into four sections, named
PIR1, PIR2, PIR3, PIR4 and is contained in the following files.

PIR1 (Section 1. Fully Classified Entries)

    PIR1.NAM    PIR1.REF    PIR1.SEQ

PIR2 (Section 2. Verified and Classified Entries)

The PIR2 data set continues to be ordered by taxonomic classification
as depicted in the file TAXONOMY.LIS for those entries that are not
classified by Superfamily number.

    PIR2.NAM    PIR2.REF    PIR2.SEQ

PIR3 (Section 3.
Unverified Entries)

Entries in this dataset have not been reviewed; only the sequences and
references have been checked and verified.

    PIR3.NAM    PIR3.REF    PIR3.SEQ

PIR4 (Section 4. Unencoded or Untranslated Entries)

Entries in this dataset fall into one of the following categories:

    conceptual translations of artifactual nucleotide sequences;

    conceptual translations of nucleotide sequences that are not
    transcribed or translated or are abortively translated
    pseudogenes;

    protein sequences or conceptual translations of nucleotide
    sequences that are extensively genetically engineered;

    polypeptide sequences that are not genetically encoded and not
    produced on ribosomes.

    PIR4.NAM    PIR4.REF    PIR4.SEQ

The data set NRL_3D is a protein sequence--structure database derived
from the high resolution X-ray structures of proteins deposited in the
Protein Data Bank (PDB). PSQ is compatible with these files. Please
see the document NRL_3D.TXT for more information. The database
consists of the following files.

    NRL_3D.AUX    NRL_3D.CAX    NRL_3D.CDX    NRL_3D.FTX
    NRL_3D.INX    NRL_3D.JRX    NRL_3D.NAM    NRL_3D.NUM
    NRL_3D.REF    NRL_3D.RNX    NRL_3D.SEQ    NRL_3D.SPX
    NRL_3D.TSC    NRL_3D.TTX    NRL_3D.WOX

The data set PATCHX (produced by MIPS) is a non-redundant database of
protein sequences not yet in the PIR-International. The PATCHX.NAM
file contains a description of the database and method of
construction. The database consists of the following files.

    PATCHX.INX    PATCHX.NAM    PATCHX.REF
    PATCHX.SEQ    PATCHX.TSC    PATCHX.TTX

5.4 Data files
==============

SUPFAMNUM.LIS   Superfamily name listing ordered by Superfamily number
SUPFSTAT.LIS    Superfamily classification statistics file
TAXONOMY.LIS    file containing a hierarchically ordered list of
                species names found in the PIR-International database
JOURNALS.LIS    file containing an alphabetical listing of all Journal
                abbreviations as found in the PIR-International
                databases
SGC.LIS         file containing a listing of Special Genetic Code
                usage tables (SGC1-SGC8)

5.5 Restriction enzyme files (in the PIR ftp directory: /pir/old_files)
============================

LONG.ENZ        file containing all currently known enzymes
SHORT.ENZ       file containing one enzyme for each known enzyme
                specificity
AVAIL.ENZ       file containing all commercially available enzymes
MERGED.ENZ      file merged from SHORT.ENZ and AVAIL.ENZ
NBRF.ENZ        old NBRF restriction enzyme list

Dr. Friedhelm Pfeiffer of the Max Planck Institute for Biochemistry,
Martinsried, Germany, has compiled a set of four restriction enzyme
lists combining the data of Dr. Richard Roberts (Nucl. Acids Res. 13,
165, 1985) and Dr. Kessler from Boehringer, Mannheim, Germany (Gene
1986). These lists (LONG.ENZ, SHORT.ENZ, AVAIL.ENZ, and MERGED.ENZ)
have been donated to PIR-International and are provided. MERGED.ENZ is
accessed by PSQ when the PIR system is initialized using PIR.COM; the
other lists can be accessed by using the SET/ENZYME command to set the
restriction enzyme list to an alternate list. Respond to the "Enzyme
list:" prompt with LONGENZ, SHORTENZ, AVAILENZ, MERGEDENZ, or NBRFENZ.

6.0 Update Information
======================

PADD.LIS        list of new sequence additions to PIR1 and PIR2
                (Section 1. and Section 2.)
PREV.LIS        list of entries revised in PIR1 and PIR2
                (Section 1. and Section 2.)

These files can be used as code files to generate a current list in
PSQ using the PSQ>GET command.
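
Since the index files are being phased out while the .SEQ flat files
continue (see the note in section 5.2.1), users without PSQ may want to
read the .SEQ files directly. The sketch below is one possible way to
do that, not the method used by PIR software. It assumes the usual
NBRF-PIR layout: a header line of the form ">P1;CODE", one title line,
then sequence lines ended by "*". The file 0PROTEIN.TXT is the
authoritative description of the format. The function name read_nbrf
and the two sample entries are hypothetical.

```python
import io


def read_nbrf(handle):
    """Yield (code, title, sequence) tuples from NBRF-PIR formatted text.

    Assumes each entry is: ">XX;CODE" header, one title line, then
    sequence lines terminated by "*". Entries lacking the terminating
    "*" are skipped by this sketch.
    """
    code = None
    title = None
    chunks = []
    expect_title = False
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Header line, e.g. ">P1;CODE": keep the part after ";".
            if ";" in line:
                code = line.split(";", 1)[1].strip()
            else:
                code = line[1:].strip()
            title = None
            chunks = []
            expect_title = True
        elif code is not None and expect_title:
            title = line.strip()
            expect_title = False
        elif code is not None:
            text = line.strip()
            finished = text.endswith("*")
            chunks.append(text.rstrip("*"))
            if finished:
                # Drop any blanks used to group residues.
                sequence = "".join("".join(chunks).split())
                yield code, title, sequence
                code = None


# Hypothetical entries, for illustration only.
SAMPLE = """\
>P1;EXMPL1
example protein A - hypothetical organism
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
>P1;EXMPL2
example protein B - hypothetical organism
ACDEFGHIK*
"""

if __name__ == "__main__":
    entries = list(read_nbrf(io.StringIO(SAMPLE)))
    residues = sum(len(seq) for _, _, seq in entries)
    print(len(entries), "entries,", residues, "residues")
```

Run over PIR1.SEQ through PIR4.SEQ with open() in place of io.StringIO, the same loop could be used to compare entry and residue counts against the per-section totals listed in section 2.0.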