Document NBRF-INSTALL-701
                        PIR Installation Document
                  for NBRF-PIR Format Release of the

             PIR-International Protein Sequence Database
                    Release 63.00,  December 30, 1999
                 168,808 entries    58,629,455 residues

                                 and:
                   NRL_3D Sequence-Structure Database
        PATCHX non-redundant supplement to PIR-International PSD


              The collaborating centers of PIR-International:

                  Protein Information Resource (PIR)*
                National Biomedical Research Foundation
                        3900 Reservoir Road, NW,
                      Washington, DC  20007, USA


  Japan International Protein           Munich Information Center for
  Information Database (JIPID)             Protein Sequences (MIPS)
        Amakubo 1-16-1          GSF-Forschungszentrum f. Umwelt und Gesundheit
   Tsukuba 305-0005, Japan           am Max-Planck-Institut f. Biochemie
                                 Am Klopferspitz 18, D-82152 Martinsried, FRG


This database may be redistributed without prior consent, provided that
this notice be given to each user and that the words "Derived from" shall 
precede this notice if the database has been altered by the redistributor.

We have made every effort to ensure proper functioning of the programs 
and cannot be held responsible for the consequences to users of any 
problems encountered during their operation.


                *PIR is a registered mark of NBRF


PIR is partially supported by National Library of Medicine grant LM05798.


The following are the staff members from PIR, MIPS, and JIPID who
contributed to Release 63.00 of the Protein Sequence Database:


Protein Information Resource (PIR)
==================================
Winona C. Barker        Ph.D.  Director, PIR
John S. Garavelli       Ph.D.  Associate Director, PIR
Cathy H. Wu             Ph.D.  Director of Bioinformatics
Bruce C. Orcutt         Ph.D.  Senior Computer Scientist
Lai-Su L. Yeh           Ph.D.  Senior Scientist
Geetha Y. Srinivasarao  Ph.D.  Computer Scientist
Peter B. McGarvey       Ph.D.  Senior Scientist
Hongzhan Huang          Ph.D.  Senior Scientist
Chunlin Xiao            Ph.D.  Bioinformatics Programmer
Joseph F. Janda         B.S.   Technical Services Coordinator     
Robert S. Ledley       D.D.S.  Principal Investigator


Munich Information Center for Protein Sequences (MIPS)
======================================================
Werner Mewes            Ph.D.  Director
Friedhelm Pfeiffer      Ph.D.  Head of Annotation Group
Gisela Fobo             Ph.D. *Senior Annotator
Corinna Keilmann        Ph.D. *Annotator
Irmtraut Dunger          Eng.  Annotator
Ute Kaemper             Ph.D. *Annotator
Goar Astvatsatourian    Ph.D.  Annotator
Petr Jordan              Eng.  System Manager


Japan International Protein Information Database (JIPID)
========================================================
Akira Tsugita           Ph.D.  Data Bank Chairman, Annotator, Editor
Jinya Otsuka            Ph.D.  Data Bank Vice Chairman, Editor
Takashi Kunisawa        Ph.D.  System Manager, Computer Support, Editor
Hiromi Suzuki           Ph.D.  Research Associate, Computer Support
Kenji Miyazaki          Ph.D.  Editor, Data Entry
Tatsuhiko Yagi          Ph.D. *Annotator
Hiroko Toda             Ph.D. *Research Associate, Annotator
Masaharu Kamo           Ph.D.  Research Associate, Annotator, Brain
Yuzo Nozu               Ph.D. *Laboratory Chief, Annotator
Fumio Arisaka           Ph.D. *Associate Professor, Annotator
Ruqun Shen               M.S.  Data Entry
Lin Xu                   -     Data Entry
Takao Kawakami          Ph.D.  Researcher, Annotator, Plant
Kazuo Satake            Ph.D. *Enzyme Database
Miyuki Tsukahara         -     Secretary
  
* Part time personnel


1.0 NBRF Format
===============
This document describes the quarterly release of the PIR-International 
Protein Sequence Database and the NRL_3D Sequence-Structure Database in 
NBRF-PIR format, which was formerly distributed on magnetic media for 
VAX/VMS systems.

2.0 In this Release
===================
Release 63.00 of the Protein Sequence Database contains 168,808 entries
and 58,629,455 residues. The Release is separated into four datasets.
Section 1, Fully Classified Entries, contains 20,034 entries and
7,820,966 residues. Section 2, Verified and Classified Entries, contains 
147,632 entries and 50,341,994 residues. Section 3, Unverified Entries, 
contains 779 entries and 409,111 residues. Section 4, Unencoded or 
Untranslated Entries, contains 365 entries and 57,384 residues. A total 
of 10,307 superfamilies are represented in sections 1 and 2. The NRL_3D 
Sequence-Structure Database contains 14,791 entries and 2,636,724 
residues corresponding to the June 1998 Release of the Protein Data Bank.
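
As a quick consistency check, the per-section figures quoted above can
be summed and compared with the stated release totals. The sketch below
uses only the numbers given in this section; the variable and function
names are illustrative. With these figures the residue counts add up to
the stated total, while the four section entry counts sum to 168,810,
two more than the stated 168,808. This document does not explain the
difference.

```python
# Entry and residue counts as stated in section 2.0 for Release 63.00.
sections = {
    "PIR1": (20_034, 7_820_966),
    "PIR2": (147_632, 50_341_994),
    "PIR3": (779, 409_111),
    "PIR4": (365, 57_384),
}
stated_entries, stated_residues = 168_808, 58_629_455


def totals(counts):
    """Return (entries, residues) summed over all sections."""
    entries = sum(e for e, _ in counts.values())
    residues = sum(r for _, r in counts.values())
    return entries, residues


summed_entries, summed_residues = totals(sections)
print("entries :", summed_entries, "stated:", stated_entries,
      "difference:", summed_entries - stated_entries)
print("residues:", summed_residues, "stated:", stated_residues,
      "difference:", summed_residues - stated_residues)
```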

3.0 Features in this Release
============================

3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported into the Protein 
Sequence Database can be tracked by the GenBank PID, PIDN (protein_id), 
and NID cross-references located in the accession blocks of those 
entries. GenBank entries whose references do not appear in the 
PIR-International database are also candidates for inclusion, and many 
of these candidates will be merged into larger sequence reports. In 
1995, the PIR4 dataset was introduced for sequences that most 
researchers would not normally wish to include in searches for 
molecular, systematic, or evolutionary studies. Because these data 
appear in the published literature but may not be suitable for all 
users, they are maintained as a separate dataset.

3.2 Database Statistics
=======================
Statistical information about the database is available in the 
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short 
directory ordered by Superfamily number, the PRINDX file contains 
statistics about taxonomic frequency and longest sequence lengths in 
various classifications. The SUPFSTAT file contains statistics about 
Superfamily classification completeness and largest Superfamily groups. 
PADD and PREV files reflect additions/revisions to both the PIR1 and 
PIR2 sections of the database.

4.0 Technical Developers Information
====================================
The Technical Developers Bulletin is a document describing current and
future database formats. The Bulletins are available from the PIR WWW 
Server at URL 
http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html
This electronic bulletin provides detailed specifications of the 
database format and serves as an "early warning system" for software 
developers and others who are concerned about changes in the format and 
standards for the PIR-International databases.

If you are interested in the technical aspects of these database changes
and would like to be placed on the mailing list for the Technical 
Bulletin, send a brief electronic mail note to 
PIRMAIL@NBRF.Georgetown.Edu.

5.0 Protein Sequence Database Files
===================================

5.1 Database Index and Documentation Files
==========================================
0NBRF.TXT      this document
PRINDX.LIS     database listing of Section 1 and 2 (PIR1 and PIR2) and
                 statistics on taxonomic frequency and sequence lengths
0PROTEIN.TXT   document describing database file format
0NRL_3D.TXT    document describing the NRL_3D database

5.2.1 Primary Protein Sequence Database files
=============================================
 .SEQ    primary file containing the title and sequence for each entry
            (ASCII).
 .REF    primary file containing the title and annotation information
           for each entry (ASCII).
 .INX    index file for the SEQ and REF files (Binary).  It allows PSQ
           to use the VAX-11 RMS RFA record access mode for random 
           access into these files.  If either the .SEQ or .REF file is 
           altered in any way, the information in the .INX file becomes
           invalid and the database system programs will not operate.
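
For sites that want to read the ASCII .SEQ data outside PSQ, the 
following is a minimal sketch of a read-only parser. It assumes the 
commonly used NBRF-PIR sequence layout (a ">P1;CODE" header line, one 
title line, then sequence lines terminated by "*"); the authoritative 
description is in 0PROTEIN.TXT, and the layout used here should be 
checked against it. The function name, the Entry type, and the sample 
entries are made up for illustration. The sketch only reads the data; 
as noted above, altering a .SEQ or .REF file invalidates the .INX file.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    code: str
    title: str
    sequence: str


def read_nbrf(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield entries from lines in the assumed '>P1;CODE' layout."""
    code = title = None
    residues = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if code is not None:
                yield Entry(code, title or "", "".join(residues))
            # Text after the first ';' is taken as the entry code.
            code = line.partition(";")[2].strip()
            title, residues = None, []
        elif code is not None and title is None:
            # The first line after the header is the title.
            title = line.strip()
        elif code is not None:
            # Drop blanks inside the line and the '*' terminator.
            residues.append("".join(line.split()).rstrip("*"))
    if code is not None:
        yield Entry(code, title or "", "".join(residues))


# Made-up sample data, not taken from the database.
sample = """>P1;EXAMPLE1
hypothetical example protein - made-up organism
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
>P1;EXAMPLE2
second made-up entry
ACDEFGHIK*
"""
entries = list(read_nbrf(sample.splitlines()))
for entry in entries:
    print(entry.code, len(entry.sequence), entry.title)
```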

5.2.2 Auxiliary/Optional files
==============================
 .NAM    a text file containing the database citation
 .CAX    file to support the SELECT and REPORT commands
 .CDX    file to support the REPORT, SELECT, SUPERFAMILY, and TAXONOMY 
           commands
 .CHN    file listing new and old versions of codes that have been
           changed between releases; this file may not be complete
 .TSC    file to support the SCAN command

5.2.3 Index files
=================
 .ACX    file to support the ACCESSION command
 .AUX    file to support the AUTHOR command
 .CRX *  index file for cross-reference numbers
 .FTX    file to support the FEATURE command
 .GNX *  index file for gene names
 .JRX *  index file for journals appearing in the database
 .RNX *  index file for Reference numbers appearing in the database
 .SFX    file to support the SUPERFAMILY/NAME command
 .SNX *  index file for superfamily numbers in the database
 .SPX    file to support the SPECIES command
 .TTX *  index file for titles appearing in the database
 .WOX    file to support the KEYWORD command

The format of these index files is ASCII, with each keyword listed and 
its ISN numbers following. All files are supported by PSQ except those 
marked with a single asterisk. The ISN numbers in any of these files can 
be converted to CODES using the program CONVINDX, listed in section 
7.4.4. To see the CODES in files marked with a single asterisk, CONVINDX 
must be executed.
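
The idea behind converting ISN numbers to CODES can be sketched as 
follows. This is not the CONVINDX program and does not reproduce its 
file handling. Two things are assumed purely for illustration and are 
not confirmed by this document: that each index line consists of a 
keyword followed by whitespace-separated integers, and that an ISN is 
the 1-based position of an entry within its dataset. The function names 
and sample data are hypothetical.

```python
def parse_index_line(line):
    """Split 'KEYWORD isn isn ...' into (keyword, [isn, ...]).

    Assumed layout: trailing all-digit tokens are ISN numbers and
    everything before them is the keyword. A keyword whose last word
    is itself a number would be misread by this simple rule.
    """
    tokens = line.split()
    isns = []
    while tokens and tokens[-1].isdigit():
        isns.append(int(tokens.pop()))
    isns.reverse()
    return " ".join(tokens), isns


def isns_to_codes(isns, codes_in_order):
    """Map 1-based ISN numbers to codes (assumed numbering scheme)."""
    return [codes_in_order[i - 1] for i in isns]


# Made-up codes and index line, for illustration only.
codes = ["CODEA", "CODEB", "CODEC", "CODED"]
keyword, isns = parse_index_line("example keyword 2 4")
print(keyword, isns_to_codes(isns, codes))
```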

5.3 Database files
==================
The Protein Sequence Database is partitioned into four sections, 
named PIR1, PIR2, PIR3, and PIR4, and is contained in the following 
files.

PIR1 (Section 1. Fully Classified Entries)
  PIR1.ACX   PIR1.AUX   PIR1.CAX   PIR1.CDX   PIR1.CHN   PIR1.CRX
  PIR1.FTX   PIR1.GNX   PIR1.INX   PIR1.JRX   PIR1.NAM   PIR1.REF
  PIR1.RNX   PIR1.SEQ   PIR1.SFX   PIR1.SNX   PIR1.SPX   PIR1.TSC
  PIR1.TTX   PIR1.WOX

PIR2 (Section 2. Verified and Classified Entries) The PIR2 data set 
continues to be ordered by taxonomic classification as depicted in 
the file TAXONOMY.LIS for those entries that are not classified by 
Superfamily number.
  PIR2.ACX   PIR2.AUX   PIR2.CAX   PIR2.CDX   PIR2.CHN   PIR2.CRX
  PIR2.FTX   PIR2.GNX   PIR2.INX   PIR2.JRX   PIR2.NAM   PIR2.REF
  PIR2.RNX   PIR2.SEQ   PIR2.SFX   PIR2.SNX   PIR2.SPX   PIR2.TSC
  PIR2.TTX   PIR2.WOX

PIR3 (Section 3. Unverified Entries) Entries in this dataset have not 
been reviewed; only the sequences and references have been checked 
and verified.
  PIR3.ACX   PIR3.AUX   PIR3.CAX   PIR3.CDX   PIR3.CRX   PIR3.INX
  PIR3.JRX   PIR3.NAM   PIR3.REF   PIR3.RNX   PIR3.SEQ   PIR3.SPX
  PIR3.TSC   PIR3.TTX   PIR3.WOX

PIR4 (Section 4. Unencoded or Untranslated Entries) Entries in this 
dataset fall into one of the following categories: conceptual 
translations of artifactual nucleotide sequences; conceptual 
translations of nucleotide sequences that are not transcribed or 
translated or are abortively translated pseudogenes; protein 
sequences or conceptual translations of nucleotide sequences that are 
extensively genetically engineered; polypeptide sequences that are not 
genetically encoded and not produced on ribosomes.
  PIR4.ACX   PIR4.AUX   PIR4.CAX   PIR4.CDX   PIR4.CRX   PIR4.FTX
  PIR4.GNX   PIR4.INX   PIR4.JRX   PIR4.NAM   PIR4.REF   PIR4.RNX
  PIR4.SEQ   PIR4.SFX   PIR4.SPX   PIR4.TSC   PIR4.TTX   PIR4.WOX

The data set NRL_3D is a protein sequence-structure database derived 
from the high resolution X-ray structures of proteins deposited in the 
Protein Data Bank (PDB). PSQ is compatible with these files. Please see 
the document 0NRL_3D.TXT for more information. The database consists of 
the following files.
  NRL_3D.AUX   NRL_3D.CAX   NRL_3D.CDX   NRL_3D.FTX   NRL_3D.INX
  NRL_3D.JRX   NRL_3D.NAM   NRL_3D.NUM   NRL_3D.REF   NRL_3D.RNX
  NRL_3D.SEQ   NRL_3D.SPX   NRL_3D.TSC   NRL_3D.TTX   NRL_3D.WOX

The data set PATCHX (produced by MIPS) is a non-redundant database of 
protein sequences not yet in the PIR-International Protein Sequence 
Database. The PATCHX.NAM file contains a description of the database 
and method of construction. The database consists of the following 
files.
  PATCHX.INX   PATCHX.NAM   PATCHX.REF   PATCHX.SEQ   PATCHX.TSC
  PATCHX.TTX
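
After the files have been transferred, the lists above can serve as a 
checklist. The sketch below builds the expected file names from the 
extensions listed in this section and reports any that are absent from 
a directory. It is an illustration for a host where Python is 
available, not part of the PIR distribution; the function names are 
hypothetical, and the handling of a trailing VMS version suffix 
(";1") is an assumption about how the transfer host stores file names.

```python
from pathlib import Path

# Extensions listed per dataset in section 5.3.
FULL = ("ACX AUX CAX CDX CHN CRX FTX GNX INX JRX "
        "NAM REF RNX SEQ SFX SNX SPX TSC TTX WOX").split()
EXTENSIONS = {
    "PIR1": FULL,
    "PIR2": FULL,
    "PIR3": ("ACX AUX CAX CDX CRX INX JRX NAM REF RNX "
             "SEQ SPX TSC TTX WOX").split(),
    "PIR4": ("ACX AUX CAX CDX CRX FTX GNX INX JRX NAM "
             "REF RNX SEQ SFX SPX TSC TTX WOX").split(),
    "NRL_3D": ("AUX CAX CDX FTX INX JRX NAM NUM REF RNX "
               "SEQ SPX TSC TTX WOX").split(),
    "PATCHX": "INX NAM REF SEQ TSC TTX".split(),
}


def expected_files():
    """Return every DATASET.EXT name listed in section 5.3."""
    return [f"{ds}.{ext}" for ds, exts in EXTENSIONS.items() for ext in exts]


def missing_files(directory):
    """Return expected file names not found in the given directory."""
    present = {
        # Compare case-insensitively; drop a ';version' suffix if any.
        p.name.upper().split(";")[0]
        for p in Path(directory).iterdir()
        if p.is_file()
    }
    return [name for name in expected_files() if name not in present]


print(len(expected_files()), "database files expected")
```

Calling missing_files with the installation directory returns an empty 
list when every listed file is present.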

5.4 Data files
==============
SUPFAMNUM.LIS  Superfamily name listing ordered by Superfamily number
SUPFSTAT.LIS   Superfamily classification statistics file
TAXONOMY.LIS   file containing a hierarchically ordered list of species
                 names found in the PIR-International database
JOURNALS.LIS   file containing an alphabetical listing of all Journal
                 abbreviations as found in the PIR-International 
                 databases 
SGC.LIS        file containing a listing of Special Genetic Code usage
                 tables (SGC1-SGC8)

5.5 Restriction enzyme files (in the PIR ftp directory: /pir/old_files)
============================
LONG.ENZ       file containing all currently known enzymes
SHORT.ENZ      file containing one enzyme for each known enzyme 
                 specificity 
AVAIL.ENZ      file containing all commercially available enzymes
MERGED.ENZ     file merged from SHORT.ENZ and AVAIL.ENZ
NBRF.ENZ       old NBRF restriction enzyme list

Dr. Friedhelm Pfeiffer of the Max Planck Institute for Biochemistry,
Martinsried, Germany, has compiled a set of four restriction enzyme 
lists combining the data of Dr. Richard Roberts (Nucl. Acids Res. 13, 
165, 1985) and Dr. Kessler from Boehringer, Mannheim, Germany (Gene 
1986). These lists (LONG.ENZ, SHORT.ENZ, AVAIL.ENZ, and MERGED.ENZ) have 
been donated to PIR-International and are provided. MERGED.ENZ is 
accessed by PSQ when the PIR system is initialized using PIR.COM; the 
other lists can be accessed by using the SET/ENZYME command to set the 
restriction enzyme list to an alternate list. Respond to the "Enzyme 
list:" prompt with LONGENZ, SHORTENZ, AVAILENZ, MERGEDENZ, or NBRFENZ.

6.0 Update Information
======================
PADD.LIS   list of new sequence additions to PIR1 and PIR2
             (Section 1. and Section 2.)
PREV.LIS   list of entries revised in PIR1 and PIR2
             (Section 1. and Section 2.)
These files can be used as code files to generate a current list in PSQ 
using the PSQ>GET command.

7.0 PSQ Program Files (in the PIR ftp directory: /program)
=====================
The Protein Sequence Query system (PSQ) is composed of the following 
files.
PSQ.EXE    PSQ executable image
PSQ.HLB    run-time PSQ help library

7.1 PSQ Source and Documentation Files
======================================
The source code and documentation for the PSQ program are contained in
the following files.
PSQ.TLB    a VAX/VMS text library that contains FORTRAN source code
             modules for the PSQ program
PSQH.TLB   a VAX/VMS text library that contains FORTRAN INCLUDE
             modules for the PSQ program
PSQ.OLB    a VAX/VMS object module library that contains all the
             compiled code for the PSQ program
PSQM.DOC   the PSQ User's Guide
PSQM.RNO   RUNOFF source file for PSQM.DOC
PSQH.DOC   listing of the PSQ help messages

7.2 Database Definitions File
=============================
PIR.COM    a VAX/VMS DCL procedure for defining the logical and symbolic
             names necessary for the operation of the PSQ program.
PIR.COM must be executed prior to using the PSQ retrieval system. One way
to ensure this is to put the line @PIR in the file LOGIN.COM found in 
each user's primary directory. See section 9.0 for more information on 
PIR.COM.

7.3 Utility Procedures
======================
PSQMAKE.COM  VAX/VMS DCL procedure to recompile the PSQ program
COPYLIB.FOR  FORTRAN program used to extract source code modules from a
               VAX/VMS Text library
COPYLIB.EXE  Executable image for COPYLIB
COPYLIB copies all modules from a TEXT or HELP library to separate files
with the file name equal to the module name. All extracted files have the
same file type, which is specified by the user. PSQMAKE gives the user 
various compilation options such as EXTRACTING all source modules from 
the Text library, COMPILING all source .FOR files and LINKING all object 
.OBJ files. See section 10.0 for the procedure.

7.4 Database Creating Programs
==============================

7.4.1 Documentation
===================
CREATEDB.TXT   description of software system for sequence databases
CODATA.TXT     document describing CODATA sequence exchange format

7.4.2 Primary database file programs
====================================
CREATEDBS      program to create primary NBRF-PIR database files
                 from PIR, CODATA, GENBANK, EMBL formatted input
EXCHANGE       program to convert NBRF-PIR database to CODATA format
CREATEINX      program to create primary database index file

7.4.4 Optional database file programs
=====================================
INDEXER        program to create all optional/auxiliary .TMP ASCII index
                 files for NBRF-PIR, GenBank and EMBL formatted 
                 databases.
SORTTMP        program to sort .TMP index file created by INDEXER to
                 a file format compatible with PSQ/NAQ
SORTTMP.COM    VAX/VMS DCL procedure to run SORTTMP on all .TMP files
                 created by INDEXER
CREATETSC      program to create tripeptide catalogue file
CONVINDX       program to convert ISN numbers in the index files to 
                 CODES
CREATETTL      program to create optional .TTL title file
TTL.COM        VAX/VMS DCL procedure to create .TTL/.INX files
TSC.COM        VAX/VMS DCL procedure to create the .TSC tripeptide
                 scan file for use with the PSQ program
The .TTL file is not supplied with any dataset; it is not used by the
current version of the PSQ program. The CREATETTL program is included to
allow sites that depend on user-developed software that uses this file
to regenerate it. A new .INX file must be created to
allow indexed access to the .TTL file. TTL.COM is a DCL procedure that
executes the CREATETTL and CREATEINX programs to create the .TTL title
file and recreate the .INX index file for each of the four data sets.
TSC.COM is a DCL procedure that executes the CREATETSC program to create
the .TSC tripeptide catalogue files. These index files are required for
the PSQ SCAN command.

All programs are written in VAX-11 Fortran. The source code is contained
in the .FOR files, the executable in the .EXE files.

7.5 Nucleic Acid Sequence retrieval software
============================================
NAQ.DOC        NAQ user manual
NAQ.RNO        RUNOFF source file for NAQ.DOC
NAQ.EXE        NAQ executable image
NAQ.HLB        NAQ program Help Library
NAQ.OLB        NAQ program Object Library
NAQ.TLB        NAQ program Text Library (source code)

8.0 Installing the Database System in a VAX/VMS Environment
===========================================================
1. SET DEFAULT to the directory that will hold the database system.
2. FTP the files to the directory.
3. Uncompress the compressed files.

9.0 Running the PSQ System in a VAX/VMS Environment
===================================================
1. EDIT the command procedure PIR.COM. You must change the
   device and directory specification that PIR.COM assigns to
   logical name PIRSYSTEM so that it specifies the directory at
   your installation that holds the database system; that is,
   the default directory you set in step 1 of the installation
   procedure (section 8.0).
2. Type @PIR to initialize the system.
3. Type PSQ to run PSQ on Section 1 of the data set;
   PSQ PIR2 to run PSQ on Section 2; PSQ PIR3 to run PSQ
   on Section 3; and PSQ PIR4 to run on Section 4.

10.0 Recompiling the PSQ Program
================================
If you are operating under a version of VAX/VMS prior to version 5.0, it
will be necessary to recompile the PSQ program. DCL procedure PSQMAKE.COM 
is provided for this purpose; please use the following procedure.
1. Create a subdirectory of the current directory containing the files
   copied from the PIR ftp site.  An example is 
   $ CREATE/DIR [.PSQSOURCE].
2. SET DEFAULT to this subdirectory and copy into it the files PSQ.TLB,
   PSQH.TLB, and COPYLIB.EXE.  Note that COPYLIB.FOR should not be
   included in this subdirectory.
3. Execute the procedure PSQMAKE.COM by typing $ @[-]PSQMAKE.  This
   procedure extracts all FORTRAN source modules from the Text
   libraries, compiles the .FOR FORTRAN program files to produce
   object files, and links the .OBJ object files to produce the .EXE
   executable image.

After recompilation the new .EXE executable image should be moved to the
directory specified in the file PIR.COM. The user may want to purge older
file versions.

11.0 Software Modification Notes
================================
The FORTRAN source code is contained in the text library files PSQ.TLB 
and PSQH.TLB; the corresponding compiled code is found in object library
PSQ.OLB. The PSQ software is designed in modular form. For convenience
in modifying this program, each subroutine is stored as a separate file 
and these files are stored in a subdirectory containing no other files. 
The source code files have been collected in the PSQ.TLB library and are
distributed in this form. The utility COPYLIB is supplied to allow
programmers to easily restore PSQ to its native environment. Set your
default to an empty subdirectory and execute COPYLIB. You would generally
specify the file type as FOR.

After modification of the source code, you need only recompile the
modified modules. These compiled modules can be replaced in the object
library and the library relinked. For example, after modification of
module TYPE:
      $ FORTRAN/CONTINUATIONS=90      TYPE.FOR
      $ LIBRARY/REPLACE/LOG/OBJECT    PSQ.OLB   TYPE.OBJ
      $ LINK/NOTRACE/MAP/CROSS_REF    PSQ/INCLUDE=PSQ/LIBRARY
A new object library can be created by compiling all the source code 
files (the /CONTINUATIONS=90 qualifier of the FORTRAN command must be 
used) and using the /CREATE qualifier of the LIBRARY command, rather 
than /REPLACE, i.e.,
      $ LIBRARY/CREATE/LOG/OBJECT    PSQ.OLB   *.OBJ
Similarly, the help messages for the PSQ program are stored in the HELP
library file PSQ.HLB. They may be copied to separate files of file type
HLP using COPYLIB. A new help library may be created by the command
      $ LIBRARY/CREATE/LOG/HELP    PSQ.HLB   *.HLP