Document NBRF-INSTALL-701

PIR Installation Document for NBRF-PIR Format Release of the

    PIR-International Protein Sequence Database
    Release 63.00, December 30, 1999
    168,808 entries    58,629,455 residues

and:

    NRL_3D Sequence-Structure Database
    PATCHX non-redundant supplement to PIR-International PSD

The collaborating centers of PIR-International:

    Protein Information Resource (PIR)*
    National Biomedical Research Foundation
    3900 Reservoir Road, NW, Washington, DC 20007, USA

    Japan International Protein Information Database (JIPID)
    Amakubo 1-16-1
    Tsukuba 305-0005, Japan

    Munich Information Center for Protein Sequences (MIPS)
    GSF-Forschungszentrum f. Umwelt und Gesundheit
    am Max-Planck-Institut f. Biochemie
    Am Klopferspitz 18, D-82152 Martinsried, FRG

This database may be redistributed without prior consent, provided
that this notice be given to each user and that the words "Derived
from" shall precede this notice if the database has been altered by
the redistributor.

We have made every effort to ensure proper functioning of the programs
and cannot be held responsible for the consequences to users of any
problems encountered during their operation.

*PIR is a registered mark of NBRF

PIR is partially supported by National Library of Medicine grant
LM05798

The following are the staff members from PIR, MIPS, and JIPID that
contributed to Release 63.00 of the Protein Sequence Database:

Protein Information Resource (PIR)
==================================

Winona C. Barker         Ph.D.    Director, PIR
John S. Garavelli        Ph.D.    Associate Director, PIR
Cathy H. Wu              Ph.D.    Director of Bioinformatics
Bruce C. Orcutt          Ph.D.    Senior Computer Scientist
Lai-Su L. Yeh            Ph.D.    Senior Scientist
Geetha Y. Srinivasarao   Ph.D.    Computer Scientist
Peter B. McGarvey        Ph.D.    Senior Scientist
Hongzhan Huang           Ph.D.    Senior Scientist
Chunlin Xiao             Ph.D.    Bioinformatics Programmer
Joseph F. Janda          B.S.     Technical Services Coordinator
Robert S. Ledley         D.D.S.   Principal Investigator

Munich Information Center for Protein Sequences (MIPS)
======================================================

Werner Mewes             Ph.D.    Director
Friedhelm Pfeiffer       Ph.D.    Head of Annotation Group
Gisela Fobo              Ph.D.   *Senior Annotator
Corinna Keilmann         Ph.D.   *Annotator
Irmtraut Dunger          Eng.     Annotator
Ute Kaemper              Ph.D.   *Annotator
Goar Astvatsatourian     Ph.D.    Annotator
Petr Jordan              Eng.     System Manager

Japanese International Protein Information Database (JIPID)
===========================================================

Akira Tsugita            Ph.D.    Data Bank Chairman, Annotator, Editor
Jinya Otsuka             Ph.D.    Data Bank Vice Chairman, Editor
Takashi Kunisawa         Ph.D.    System Manager, Computer Support, Editor
Hiromi Suzuki            Ph.D.    Research Associate, Computer Support
Kenji Miyazaki           Ph.D.    Editor, Data Entry
Tatsuhiko Yagi           Ph.D.   *Annotator
Hiroko Toda              Ph.D.   *Research Associate, Annotator
Masaharu Kamo            Ph.D.    Research Associate, Annotator, Brain
Yuzo Nozu                Ph.D.   *Laboratory Chief, Annotator
Fumio Arisaka            Ph.D.   *Associate Professor, Annotator
Ruqun Shen               M.S.     Data Entry
Lin Xu                   -        Data Entry
Takao Kawakami           Ph.D.    Researcher, Annotator, Plant
Kazuo Satake             Ph.D.   *Enzyme Database
Miyuki Tsukahara         -        Secretary

* Part time personnel

1.0 NBRF Format
===============

This document describes the quarterly release of the PIR-International
Protein Sequence Database and the NRL_3D Sequence-Structure Database
in NBRF-PIR format, formerly distributed on magnetic media for VAX/VMS
systems.

2.0 In this Release
===================

Release 63.00 of the Protein Sequence Database contains 168,808
entries and 58,629,455 residues. The Release is separated into four
datasets. Section 1, Fully Classified Entries, contains 20,034 entries
and 7,820,966 residues. Section 2, Verified and Classified Entries,
contains 147,632 entries and 50,341,994 residues. Section 3,
Unverified Entries, contains 779 entries and 409,111 residues.
Section 4, Unencoded or Untranslated Entries, contains 365 entries and
57,384 residues.
A total of 10,307 superfamilies are represented in sections 1 and 2.

The NRL_3D Sequence-Structure Database contains 14,791 entries and
2,636,724 residues corresponding to the June 1998 Release of the
Protein Data Bank.

3.0 Features in this Release
============================

3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported can be tracked
by the GenBank PID, PIDN (protein_id), and NID cross references
located in accession blocks of entries in the Protein Sequence
Database. GenBank entries whose references do not appear in the
PIR-International database are also candidates for inclusion and many
of these candidates will be merged into larger sequence reports.

In 1995, the PIR4 dataset was introduced for sequences that most
researchers would not normally wish to include in searches for
molecular, systematic or evolutionary studies. Since this data is in
the published literature but may not be suitable for all users, a
separate dataset is maintained.

3.2 Database Statistics
=======================

Statistical information about the database is available in the
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short
directory ordered by Superfamily number, the PRINDX file contains
statistics about taxonomic frequency and longest sequence lengths in
various classifications. The SUPFSTAT file contains statistics about
Superfamily classification completeness and largest Superfamily
groups. PADD and PREV files reflect additions/revisions to both the
PIR1 and PIR2 sections of the database.

4.0 Technical Developers Information
====================================

The Technical Developers Bulletin is a document describing current and
future database formats.
The Bulletins are available from the PIR WWW Server at URL

    http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html

This electronic bulletin provides detailed specifications of the
database format and serves as an "early warning system" for software
developers and others who are concerned about changes in the format
and standards for the PIR-International databases. If you are
interested in the technical aspects of these database changes and
would like to be placed on the mailing list for the Technical
Bulletin, send a brief electronic mail note to
PIRMAIL@NBRF.Georgetown.Edu.

5.0 Protein Sequence Database Files
===================================

5.1 Database Index and Documentation Files
==========================================

0NBRF.TXT      this document
PRINDX.LIS     database listing of Section 1 and 2 (PIR1 and PIR2) and
               statistics on taxonomic frequency and sequence lengths
0PROTEIN.TXT   document describing database file format
0NRL_3D.TXT    document describing NRL_3D database

5.2.1 Primary Protein Sequence Database files
=============================================

.SEQ   primary file containing the title and sequence for each entry
       (ASCII).
.REF   primary file containing the title and annotation information
       for each entry (ASCII).
.INX   index file for the SEQ and REF files (Binary). It allows PSQ to
       use the VAX-11 RMS RFA record access mode for random access
       into these files. If either the .SEQ or .REF file is altered in
       any way, the information in the .INX file becomes invalid and
       the database system programs will not operate.
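
Because the .SEQ files are plain ASCII, they can be checked without PSQ
and without touching the binary .INX index. The following is a minimal
Python sketch, not part of the PIR distribution. It assumes each .SEQ
entry uses the usual NBRF-PIR layout (a header line such as ">P1;CODE",
one title line, then sequence lines ending with "*"); this document does
not spell that layout out, so consult 0PROTEIN.TXT before relying on it.
The names Entry, read_entries, and summarize are illustrative only.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    code: str      # identifier after the ">P1;" style prefix (assumed layout)
    title: str     # free-text title line
    sequence: str  # letters only; whitespace and the "*" terminator removed


def read_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield entries from text lines in the assumed NBRF-PIR layout."""
    code = None
    title = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header: flush any previous entry that lacked a "*".
            if code is not None:
                yield Entry(code, title or "", "".join(chunks))
            code = line.split(";", 1)[1].strip() if ";" in line else line[1:].strip()
            title = None
            chunks = []
        elif code is not None and title is None:
            # The first line after the header is the title.
            title = line.strip()
        elif code is not None:
            # Sequence line: keep letters only, stop at the "*" terminator.
            body, star, _ = line.partition("*")
            chunks.append("".join(ch for ch in body if ch.isalpha()))
            if star:
                yield Entry(code, title, "".join(chunks))
                code = None
                title = None
                chunks = []
    if code is not None:
        yield Entry(code, title or "", "".join(chunks))


def summarize(lines: Iterable[str]) -> tuple:
    """Return (entry count, residue count) for one dataset."""
    entries = 0
    residues = 0
    for entry in read_entries(lines):
        entries += 1
        residues += len(entry.sequence)
    return entries, residues


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python seqcount.py PIR1.SEQ
    with open(sys.argv[1], "r", encoding="ascii", errors="replace") as handle:
        print(summarize(handle))
```

Run against each of PIR1.SEQ through PIR4.SEQ, the counts can be
compared with the per-section figures quoted in section 2.0. A mismatch
may only mean that the letters-only residue count differs from the
convention used for the published totals. The sketch reads the file
sequentially and never modifies it, so the .INX index remains valid.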
5.2.2 Auxiliary/Optional files
==============================

.NAM   a text file containing the database citation
.CAX   file to support the SELECT and REPORT commands
.CDX   file to support the REPORT, SELECT, SUPERFAMILY, and TAXONOMY
       commands
.CHN   file listing new and old versions of codes that have been
       changed between releases; this file may not be complete
.TSC   file to support the SCAN command

5.2.3 Index files
=================

.ACX     file to support the ACCESSION command
.AUX     file to support the AUTHOR command
.CRX *   index file for cross-reference numbers
.FTX     file to support the FEATURE command
.GNX *   index file for gene names
.JRX *   index file for journals appearing in the database
.RNX *   index file for Reference numbers appearing in the database
.SFX     file to support the SUPERFAMILY/NAME command
.SNX *   index file for superfamily numbers in the database
.SPX     file to support the SPECIES command
.TTX *   index file for titles appearing in the database
.WOX     file to support the KEYWORD command

The format of these index files is ASCII with the keyword listed and
ISN numbers following. All files are supported by PSQ except those
marked with a single asterisk. Also, all ISN numbers in the files can
be converted to CODES using the program CONVINDX, listed in section
7.4. To see CODES of files marked with a single asterisk, CONVINDX
must be executed.

5.3 Database files
==================

The Protein Sequence Database is partitioned into four sections, named
PIR1, PIR2, PIR3, and PIR4, and is contained in the following files.

PIR1 (Section 1. Fully Classified Entries)

    PIR1.ACX  PIR1.AUX  PIR1.CAX  PIR1.CDX  PIR1.CHN
    PIR1.CRX  PIR1.FTX  PIR1.GNX  PIR1.INX  PIR1.JRX
    PIR1.NAM  PIR1.REF  PIR1.RNX  PIR1.SEQ  PIR1.SFX
    PIR1.SNX  PIR1.SPX  PIR1.TSC  PIR1.TTX  PIR1.WOX

PIR2 (Section 2. Verified and Classified Entries)

The PIR2 data set continues to be ordered by taxonomic classification
as depicted in the file TAXONOMY.LIS for those entries that are not
classified by Superfamily number.

    PIR2.ACX  PIR2.AUX  PIR2.CAX  PIR2.CDX  PIR2.CHN
    PIR2.CRX  PIR2.FTX  PIR2.GNX  PIR2.INX  PIR2.JRX
    PIR2.NAM  PIR2.REF  PIR2.RNX  PIR2.SEQ  PIR2.SFX
    PIR2.SNX  PIR2.SPX  PIR2.TSC  PIR2.TTX  PIR2.WOX

PIR3 (Section 3. Unverified Entries)

Entries in this dataset have not been reviewed; only the sequences and
references have been checked and verified.

    PIR3.ACX  PIR3.AUX  PIR3.CAX  PIR3.CDX  PIR3.CRX
    PIR3.INX  PIR3.JRX  PIR3.NAM  PIR3.REF  PIR3.RNX
    PIR3.SEQ  PIR3.SPX  PIR3.TSC  PIR3.TTX  PIR3.WOX

PIR4 (Section 4. Unencoded or Untranslated Entries)

Entries in this dataset fall into one of the following categories:

    conceptual translations of artifactual nucleotide sequences;

    conceptual translations of nucleotide sequences that are not
    transcribed or translated or are abortively translated
    pseudogenes;

    protein sequences or conceptual translations of nucleotide
    sequences that are extensively genetically engineered;

    polypeptide sequences that are not genetically encoded and not
    produced on ribosomes.

    PIR4.ACX  PIR4.AUX  PIR4.CAX  PIR4.CDX  PIR4.CRX
    PIR4.FTX  PIR4.GNX  PIR4.INX  PIR4.JRX  PIR4.NAM
    PIR4.REF  PIR4.RNX  PIR4.SEQ  PIR4.SFX  PIR4.SPX
    PIR4.TSC  PIR4.TTX  PIR4.WOX

The data set NRL_3D is a protein sequence-structure database derived
from the high resolution X-ray structures of proteins deposited in the
Protein Data Bank (PDB). PSQ is compatible with these files. Please
see the document 0NRL_3D.TXT for more information. The database
consists of the following files.

    NRL_3D.AUX  NRL_3D.CAX  NRL_3D.CDX  NRL_3D.FTX  NRL_3D.INX
    NRL_3D.JRX  NRL_3D.NAM  NRL_3D.NUM  NRL_3D.REF  NRL_3D.RNX
    NRL_3D.SEQ  NRL_3D.SPX  NRL_3D.TSC  NRL_3D.TTX  NRL_3D.WOX

The data set PATCHX (produced by MIPS) is a non-redundant database of
protein sequences not yet in the PIR-International Protein Sequence
Database. The PATCHX.NAM file contains a description of the database
and method of construction. The database consists of the following
files.

    PATCHX.INX  PATCHX.NAM  PATCHX.REF
    PATCHX.SEQ  PATCHX.TSC  PATCHX.TTX

5.4 Data files
==============

SUPFAMNUM.LIS   Superfamily name listing ordered by Superfamily number
SUPFSTAT.LIS    Superfamily classification statistics file
TAXONOMY.LIS    file containing a hierarchically ordered list of
                species names found in the PIR-International database
JOURNALS.LIS    file containing an alphabetical listing of all Journal
                abbreviations as found in the PIR-International
                databases
SGC.LIS         file containing a listing of Special Genetic Code
                usage tables (SGC1-SGC8)

5.5 Restriction enzyme files (in the PIR ftp directory: /pir/old_files)
============================

LONG.ENZ     file containing all currently known enzymes
SHORT.ENZ    file containing one enzyme for each known enzyme
             specificity
AVAIL.ENZ    file containing all commercially available enzymes
MERGED.ENZ   file merged from SHORT.ENZ and AVAIL.ENZ
NBRF.ENZ     old NBRF restriction enzyme list

Dr. Friedhelm Pfeiffer of the Max Planck Institute for Biochemistry,
Martinsried, Germany, has compiled a set of four restriction enzyme
lists combining the data of Dr. Richard Roberts (Nucl. Acids Res. 13,
165, 1985) and Dr. Kessler from Boehringer, Mannheim, Germany (Gene
1986). These lists (LONG.ENZ, SHORT.ENZ, AVAIL.ENZ, and MERGED.ENZ)
have been donated to PIR-International and are provided.

MERGED.ENZ is accessed by PSQ when the PIR system is initialized using
PIR.COM; the other lists can be accessed by using the SET/ENZYME
command to set the restriction enzyme list to an alternate list.
Respond to the "Enzyme list:" prompt with LONGENZ, SHORTENZ, AVAILENZ,
MERGEDENZ, or NBRFENZ.

6.0 Update Information
======================

PADD.LIS   list of new sequence additions to PIR1 and PIR2
           (Section 1. and Section 2.)
PREV.LIS   list of entries revised in PIR1 and PIR2
           (Section 1. and Section 2.)

These files can be used as code files to generate a current list in
PSQ using the PSQ>GET command.

7.0 PSQ Program Files (in the PIR ftp directory: /program)
=====================

The Protein Sequence Query system (PSQ) is composed of the following
files.

PSQ.EXE   PSQ executable image
PSQ.HLB   run-time PSQ help library

7.1 PSQ Source and Documentation Files
======================================

The source code and documentation for the PSQ program are contained in
the following files.

PSQ.TLB    a VAX/VMS text library that contains FORTRAN source code
           modules for the PSQ program
PSQH.TLB   a VAX/VMS text library that contains FORTRAN INCLUDE
           modules for the PSQ program
PSQ.OLB    a VAX/VMS object module library that contains all the
           compiled code for the PSQ program
PSQM.DOC   the PSQ User's Guide
PSQM.RNO   RUNOFF source file for PSQM.DOC
PSQH.DOC   listing of the PSQ help messages

7.2 Database Definitions File
=============================

PIR.COM   a VAX/VMS DCL procedure for defining the logical and
          symbolic names necessary for the operation of the PSQ
          program.

PIR.COM must be executed prior to using the PSQ retrieval system. A
means of assuring proper use of the PSQ program is to put the line
@PIR in the file LOGIN.COM found in each user's primary directory. See
section 9.0 for more information on PIR.COM.

7.3 Utility Procedures
======================

PSQMAKE.COM   VAX/VMS DCL procedure to recompile the PSQ program
COPYLIB.FOR   FORTRAN program used to extract source code modules from
              a VAX/VMS Text library
COPYLIB.EXE   Executable image for COPYLIB

COPYLIB copies all modules from a TEXT or HELP library to separate
files with the file name equal to the module name. All extracted files
have the same file type, which is specified by the user.

PSQMAKE gives the user various compilation options such as EXTRACTING
all source modules from the Text library, COMPILING all source .FOR
files and LINKING all object .OBJ files. See section 10.0 for the
procedure.

7.4 Database Creating Programs
==============================

7.4.1 Documentation
===================

CREATEDB.TXT   description of software system for sequence databases
CODATA.TXT     document describing CODATA sequence exchange format

7.4.2 Primary database file programs
====================================

CREATEDBS   program to create primary NBRF-PIR database files from
            PIR, CODATA, GENBANK, EMBL formatted input
EXCHANGE    program to convert NBRF-PIR database to CODATA format
CREATEINX   program to create primary database index file

7.4.3 Optional database file programs
=====================================

INDEXER       program to create all optional/auxiliary .TMP ASCII
              index files for NBRF-PIR, GenBank and EMBL formatted
              databases.
SORTTMP       program to sort .TMP index file created by INDEXER to a
              file format compatible with PSQ/NAQ
SORTTMP.COM   VAX/VMS DCL procedure to run SORTTMP on all .TMP files
              created by INDEXER
CREATETSC     program to create tripeptide catalogue file
CONVINDX      program to convert ISN numbers in the index files to
              CODES
CREATETTL     program to create optional .TTL title file
TTL.COM       VAX/VMS DCL procedure to create .TTL/.INX files
TSC.COM       VAX/VMS DCL procedure to create the .TSC tripeptide scan
              file for use with the PSQ program

The .TTL file is not supplied with any dataset; it is not used by the
current version of the PSQ program. The CREATETTL program is included
to allow those sites that are dependent upon user developed software
that utilizes this file to regenerate it. A new .INX file must be
created to allow indexed access to the .TTL file. TTL.COM is a DCL
procedure that executes the CREATETTL and CREATEINX programs to create
the .TTL title file and recreate the .INX index file for each of the
four data sets.

TSC.COM is a DCL procedure that executes the CREATETSC program to
create the .TSC tripeptide catalogue files. These index files are
required for the PSQ SCAN command.

All programs are written in VAX-11 Fortran.
The source code is contained in the .FOR files; the executables are in
the .EXE files.

7.5 Nucleic Acid Sequence retrieval software
============================================

NAQ.DOC   NAQ user manual
NAQ.RNO   RUNOFF source file for NAQ.DOC
NAQ.EXE   NAQ executable image
NAQ.HLB   NAQ program Help Library
NAQ.OLB   NAQ program Object Library
NAQ.TLB   NAQ program Text Library (source code)

8.0 Installing the Database System in a VAX/VMS Environment
===========================================================

1. SET DEFAULT to the directory that will hold the database system.

2. FTP the files to the directory.

3. Uncompress the compressed files.

9.0 Running the PSQ System in a VAX/VMS Environment
===================================================

1. EDIT the command procedure PIR.COM. You must change the device and
   directory specification that PIR.COM assigns to logical name
   PIRSYSTEM so that it specifies the directory at your installation
   that holds the database system; that is, your default set in step 1
   of the Installation procedure.

2. Type @PIR to initialize the system.

3. Type PSQ to run PSQ on Section 1 of the data set; PSQ PIR2 to run
   PSQ on Section 2; PSQ PIR3 to run PSQ on Section 3; and PSQ PIR4 to
   run on Section 4.

10.0 Recompiling the PSQ Program
================================

If you are operating under a version of VAX/VMS prior to version 5.0,
it will be necessary to recompile the PSQ program. DCL procedure
PSQMAKE.COM is provided for this purpose; please use the following
procedure.

1. Create a subdirectory of the current directory containing the files
   copied from the PIR ftp site. An example is

       $ CREATE/DIR [.PSQSOURCE]

2. SET DEFAULT to this subdirectory and copy into it the files
   PSQ.TLB, PSQH.TLB, and COPYLIB.EXE. Note that COPYLIB.FOR should
   not be included in this subdirectory.

3. Execute the procedure PSQMAKE.COM by typing

       $ @[-]PSQMAKE

This procedure extracts all FORTRAN source modules from the Text
libraries, compiles the .FOR FORTRAN program files to produce object
files, and links the .OBJ object files to produce the .EXE executable
image. After recompilation the new .EXE executable image should be
moved to the directory specified in the file PIR.COM. The user may
want to purge older file versions.

11.0 Software Modification Notes
================================

The FORTRAN source code is contained in the LIBRARY files PSQ.TLB and
PSQH.TLB; the corresponding compiled code is found in object library
PSQ.OLB.

The PSQ software is designed in modular form. For convenience in
modifying this program, each subroutine is stored as a separate file
and these files are stored in a subdirectory containing no other
files. The source code files have been collected in the PSQ.TLB
library and are distributed in this form. The utility COPYLIB is
supplied to allow programmers to easily restore PSQ to its native
environment. Set your default to an empty subdirectory and execute
COPYLIB. You would generally specify the file type as FOR.

After modification of the source code, you need only recompile the
modified modules. These compiled modules can be replaced in the object
library and the library relinked. For example, after modification of
module TYPE:

    $ FORTRAN/CONTINUATIONS=90 TYPE.FOR
    $ LIBRARY/REPLACE/LOG/OBJECT PSQ.OLB TYPE.OBJ
    $ LINK/NOTRACE/MAP/CROSS_REF PSQ/INCLUDE=PSQ/LIBRARY

A new object library can be created by compiling all the source code
files (the /CONTINUATIONS=90 qualifier of the FORTRAN command must be
used) and using the /CREATE qualifier of the LIBRARY command, rather
than /REPLACE, i.e.,

    $ LIBRARY/CREATE/LOG/OBJECT PSQ.OLB *.OBJ

Similarly, the help messages for the PSQ program are stored in the
HELP library file PSQ.HLB. They may be copied to separate files of
file type HLP using COPYLIB. A new help library may be created by the
command

    $ LIBRARY/CREATE/LOG/HELP PSQ.HLB *.HLP