Document NBRF-INSTALL-703
                        PIR Installation Document
                  for NBRF-PIR Format Release of the

             PIR-International Protein Sequence Database
                    Release 69.00,   June 30, 2001
                 232,624 entries,   80,607,033 residues
                                 and
                   NRL_3D Sequence-Structure Database
        PATCHX non-redundant supplement to PIR-International PSD


              The collaborating centers of PIR-International:

                  Protein Information Resource (PIR)*
                National Biomedical Research Foundation
                        3900 Reservoir Road, NW,
                      Washington, DC  20007, USA


  Japan International Protein           Munich Information Center for
  Information Database (JIPID)             Protein Sequences (MIPS)
        Amakubo 1-16-1          GSF-Forschungszentrum f. Umwelt und Gesundheit
   Tsukuba 305-0005, Japan            am Max-Planck-Institut f. Biochemie
                                 Am Klopferspitz 18, D-82152 Martinsried, FRG


This database may be redistributed without prior consent, provided that
this notice be given to each user and that the words "Derived from" shall 
precede this notice if the database has been altered by the redistributor.

We have made every effort to ensure proper functioning of the programs 
and cannot be held responsible for the consequences to users of any 
problems encountered during their operation.

                Copyright 2000, PIR-International.

                *PIR is a registered mark of NBRF


PIR is partially supported by National Library of Medicine grant LM05798


1.0 NBRF Format
===============
This document describes the quarterly release of the PIR-International 
Protein Sequence Database and the NRL_3D Sequence-Structure Database in 
NBRF-PIR format, formerly distributed on magnetic media for VAX/VMS 
systems.

2.0 In this Release
===================
Release 69.00 of the Protein Sequence Database contains 232,624 entries
and 80,607,033 residues. The Release is separated into four datasets.
Section 1, Fully Classified Entries, contains 20,501 entries and
8,043,553 residues. Section 2, Verified and Classified Entries, contains
211,695 entries and 72,497,046 residues. Section 3, Unverified Entries,
contains 23 entries and 69 residues. Section 4, Unencoded or
Untranslated Entries, contains 405 entries and 66,365 residues. A total
of 32,933 superfamilies are represented in sections 1 and 2. The NRL_3D 
Sequence-Structure Database contains 23,291 entries and 4,527,721
residues corresponding to the March 2000 Release of the Protein Data Bank.
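
The section figures above can be cross-checked against the release totals.
The short Python sketch below is an illustration only and is not part of
the PIR distribution; the names SECTIONS and release_totals are made up
for this example, and the numbers are copied from the paragraph above.

```python
# Entry and residue counts per section, copied from the text above.
SECTIONS = {
    "PIR1": (20_501, 8_043_553),
    "PIR2": (211_695, 72_497_046),
    "PIR3": (23, 69),
    "PIR4": (405, 66_365),
}


def release_totals(sections):
    """Return (total_entries, total_residues) summed over all sections."""
    entries = sum(e for e, _ in sections.values())
    residues = sum(r for _, r in sections.values())
    return entries, residues


if __name__ == "__main__":
    entries, residues = release_totals(SECTIONS)
    print(f"{entries:,} entries, {residues:,} residues")
```

The four sections add up to the 232,624 entries and 80,607,033 residues
quoted for Release 69.00.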

3.0 Features in this Release
============================
Starting with Release 64.00 of the Protein Sequence Database, PIR-International
includes status identifiers in protein titles and in function and complex
records. The status identifiers are as follows.

[validated] = in a title or function block means that one of the references
in the entry contains some experimental evidence for the protein's function.

[similarity] = in a title or function block means that the name and/or 
function has been assigned by end-to-end sequence similarity with other 
entries that have that same name or function.

[imported] = in a title means that the name was imported with the sequence from
GenBank, EMBL, DDBJ, or another source and has not been verified by PIR.

Complete coverage of the entire database will not be obtained for several
releases.  The absence of a status identifier at this time should NOT be taken
as an indication that the information in the title or function blocks is not
correct or has not been evaluated by PIR staff.   
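
Software that reads titles may need to recognize or remove these status
identifiers. The following Python sketch shows one possible way to do so.
It assumes only that an identifier appears literally in square brackets,
as written above; the function names and the example title are
hypothetical and are not taken from the database.

```python
import re

# The three status identifiers described above.
STATUS_TAGS = ("validated", "similarity", "imported")
_TAG_RE = re.compile(r"\[(%s)\]" % "|".join(STATUS_TAGS))


def status_tags(text):
    """Return the status identifiers found in a title or function line."""
    return _TAG_RE.findall(text)


def strip_status_tags(text):
    """Return the text without status identifiers, whitespace tidied."""
    return " ".join(_TAG_RE.sub(" ", text).split())


if __name__ == "__main__":
    title = "cytochrome c [validated] - human"  # made-up example title
    print(status_tags(title))
    print(strip_status_tags(title))
```

An empty result from status_tags means only that no identifier is present.
As noted above, it says nothing about whether the title has been evaluated.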

3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported can be tracked 
by the GenBank PID, PIDN (protein_id), and NID cross references located
in accession blocks of entries in the Protein Sequence Database. GenBank 
entries whose references do not appear in the PIR-International database 
are also candidates for inclusion and many of these candidates will be merged 
into larger sequence reports. In 1995, the PIR4 dataset was introduced 
for sequences that most researchers would not normally wish to include 
in searches for molecular, systematic or evolutionary studies. Since this 
data is in the published literature but may not be suitable for all 
users, a separate dataset is maintained.

3.2 Database Statistics
=======================
Statistical information about the database is available in the 
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short 
directory ordered by Superfamily number, the PRINDX file contains 
statistics about taxonomic frequency and longest sequence lengths in 
various classifications. The SUPFSTAT file contains statistics about 
Superfamily classification completeness and largest Superfamily groups. 
PADD and PREV files reflect additions/revisions to both the PIR1 and 
PIR2 sections of the database.

4.0 Technical Developers Information
====================================
The Technical Developers Bulletin is a document describing current and
future database formats. The Bulletins are available from the PIR WWW 
Server at URL 
http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html
This electronic bulletin provides detailed specifications of the 
database format and serves as an "early warning system" for software 
developers and others who are concerned about changes in the format and 
standards for the PIR-International databases.

If you are interested in the technical aspects of these database changes
and would like to be placed on the mailing list for the Technical 
Bulletin, send a brief electronic mail note to 
PIRMAIL@NBRF.Georgetown.Edu.

5.0 Protein Sequence Database Files
===================================

5.1 Database Index and Documentation Files
==========================================
0NBRF.TXT      this document
PRINDX.LIS     database listing of Sections 1 and 2 (PIR1 and PIR2) and
                statistics on taxonomic frequency and sequence lengths
0PROTEIN.TXT   document describing database file format
0NRL_3D.TXT    document describing NRL_3D database

5.2.1 Primary Protein Sequence Database files
=============================================
 .SEQ    primary file containing the title and sequence for each entry
            (ASCII).
 .REF    primary file containing the title and annotation information
           for each entry (ASCII).
 .INX    index file for the SEQ and REF files (Binary).  It allows PSQ
           to use the VAX-11 RMS RFA record access mode for random 
           access into these files.  If either the .SEQ or .REF file is 
           altered in any way, the information in the .INX file becomes
           invalid and the database system programs will not operate.


!!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!!

IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE
NBRF-PIR FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!!

THE *.REF, *.SEQ, *.NAM, AND *.DAT FLAT FILES WILL CONTINUE.

IF YOU WISH US TO CONTINUE TO PROVIDE ANY OF THE INDEX FILES
PLEASE LET US KNOW. CONTACT PIRMAIL@NBRF.GEORGETOWN.EDU .

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
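
Sites that rely only on the flat files can build their own lookup table
for a .SEQ file. The Python sketch below is a minimal illustration, not a
replacement for PSQ or the .INX files. It ASSUMES the commonly used
NBRF-PIR layout: a header line such as ">P1;CODE", then a title line, then
sequence lines ending with "*". The file 0PROTEIN.TXT is the authoritative
format description. The names build_offset_index, read_entry and SAMPLE
are hypothetical.

```python
import io

# Made-up two-entry sample in the ASSUMED layout (not real database data).
SAMPLE = (
    b">P1;AAAA\ntitle one - test\nMKV LLA\nGG*\n"
    b">P1;BBBB\ntitle two\nAC*\n"
)


def build_offset_index(handle):
    """Map entry identifier -> byte offset of its '>' header line.

    `handle` must be a binary, seekable file object.
    """
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Header looks like b">P1;AAAA"; keep the part after ';'.
            _, _, ident = line[1:].strip().partition(b";")
            index[ident.decode("ascii")] = offset
    return index


def read_entry(handle, offset):
    """Return (identifier, title, sequence) for the entry at `offset`."""
    handle.seek(offset)
    header = handle.readline().decode("ascii").strip()
    title = handle.readline().decode("ascii").strip()
    residues = []
    for raw in handle:
        text = raw.decode("ascii").strip()
        if text.startswith(">"):
            break  # next entry reached without a terminator
        residues.append(text.replace(" ", ""))
        if text.endswith("*"):
            break  # '*' marks the end of the sequence
    ident = header.partition(";")[2]
    return ident, title, "".join(residues).rstrip("*")


if __name__ == "__main__":
    fh = io.BytesIO(SAMPLE)
    idx = build_offset_index(fh)
    print(read_entry(fh, idx["BBBB"]))
```

Because the table is rebuilt by scanning the file, it has to be
regenerated whenever the .SEQ file changes, for the same reason an .INX
file becomes invalid.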


5.3 Database files
==================
The Protein Sequence Database is partitioned into four sections, 
named PIR1, PIR2, PIR3, and PIR4, and is contained in the following files.

PIR1 (Section 1. Fully Classified Entries)
  PIR1.NAM   PIR1.REF  PIR1.SEQ

PIR2 (Section 2. Verified and Classified Entries) The PIR2 data set 
continues to be ordered by taxonomic classification as depicted in 
the file TAXONOMY.LIS for those entries that are not classified by 
Superfamily number.
  PIR2.NAM   PIR2.REF  PIR2.SEQ

PIR3 (Section 3. Unverified Entries) Entries in this dataset have not 
been reviewed; only the sequences and references have been checked 
and verified.
  PIR3.NAM   PIR3.REF  PIR3.SEQ

PIR4 (Section 4. Unencoded or Untranslated Entries) Entries in this 
dataset fall into one of the following categories: conceptual 
translations of artifactual nucleotide sequences; conceptual 
translations of nucleotide sequences that are not transcribed or 
translated or are abortively translated pseudogenes; protein 
sequences or conceptual translations of nucleotide sequences that are 
extensively genetically engineered; polypeptide sequences that are not 
genetically encoded and not produced on ribosomes.
  PIR4.NAM   PIR4.REF  PIR4.SEQ

The data set NRL_3D is a protein sequence-structure database derived 
from the high resolution X-ray structures of proteins deposited in the 
Protein Data Bank (PDB). PSQ is compatible with these files. Please see 
the document 0NRL_3D.TXT for more information. The database consists of 
the following files.
  NRL_3D.AUX   NRL_3D.CAX   NRL_3D.CDX   NRL_3D.FTX   NRL_3D.INX
  NRL_3D.JRX   NRL_3D.NAM   NRL_3D.NUM   NRL_3D.REF   NRL_3D.RNX
  NRL_3D.SEQ   NRL_3D.SPX   NRL_3D.TSC   NRL_3D.TTX   NRL_3D.WOX

The data set PATCHX (produced by MIPS) is a non-redundant database of 
protein sequences not yet in the PIR-International Protein Sequence 
Database. The PATCHX.NAM file contains a description of the database and 
method of construction. The database consists of the following files.
  PATCHX.INX   PATCHX.NAM   PATCHX.REF   PATCHX.SEQ   PATCHX.TSC
  PATCHX.TTX
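
After a transfer it can be useful to confirm that the files listed in this
section arrived. The Python sketch below compares a list of file names
against the names given above. It is an illustration only; EXPECTED and
missing_files are hypothetical names, and the comparison ignores case on
the assumption that some systems store the names in lower case.

```python
# File name extensions per data set, as listed in this section.
EXPECTED = {
    "PIR1": ("NAM", "REF", "SEQ"),
    "PIR2": ("NAM", "REF", "SEQ"),
    "PIR3": ("NAM", "REF", "SEQ"),
    "PIR4": ("NAM", "REF", "SEQ"),
    "NRL_3D": ("AUX", "CAX", "CDX", "FTX", "INX", "JRX", "NAM", "NUM",
               "REF", "RNX", "SEQ", "SPX", "TSC", "TTX", "WOX"),
    "PATCHX": ("INX", "NAM", "REF", "SEQ", "TSC", "TTX"),
}


def missing_files(present_names, expected=EXPECTED):
    """Return the sorted expected file names absent from `present_names`."""
    present = {name.upper() for name in present_names}
    wanted = {f"{stem}.{ext}"
              for stem, exts in expected.items()
              for ext in exts}
    return sorted(wanted - present)


if __name__ == "__main__":
    # Example with a made-up, incomplete listing.
    print(missing_files(["pir1.nam", "PIR1.REF"], {"PIR1": EXPECTED["PIR1"]}))
```

In practice the listing could come from os.listdir() on the directory that
holds the release. Given the notice above about index files, a missing
index file is not necessarily an error.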

5.4 Data files
==============
SUPFAMNUM.LIS  Superfamily name listing ordered by Superfamily number
SUPFSTAT.LIS   Superfamily classification statistics file
TAXONOMY.LIS   file containing a hierarchically ordered list of species
                 names found in the PIR-International database
JOURNALS.LIS   file containing an alphabetical listing of all Journal
                 abbreviations as found in the PIR-International 
                 databases 
SGC.LIS        file containing a listing of Special Genetic Code usage
                 tables (SGC1-SGC8)

5.5 Restriction enzyme files (in the PIR ftp directory: /pir/old_files)
============================
LONG.ENZ       file containing all currently known enzymes
SHORT.ENZ      file containing one enzyme for each known enzyme 
                 specificity 
AVAIL.ENZ      file containing all commercially available enzymes
MERGED.ENZ     file merged from SHORT.ENZ and AVAIL.ENZ
NBRF.ENZ       old NBRF restriction enzyme list

Dr. Friedhelm Pfeiffer of the Max Planck Institute for Biochemistry,
Martinsried, Germany, has compiled a set of four restriction enzyme 
lists combining the data of Dr. Richard Roberts (Nucl. Acids Res. 13, 
165, 1985) and Dr. Kessler from Boehringer, Mannheim, Germany (Gene 
1986). These lists (LONG.ENZ, SHORT.ENZ, AVAIL.ENZ, and MERGED.ENZ) have 
been donated to PIR-International and are provided. MERGED.ENZ is 
accessed by PSQ when the PIR system is initialized using PIR.COM; the 
other lists can be accessed by using the SET/ENZYME command to set the 
restriction enzyme list to an alternate list. Respond to the "Enzyme 
list:" prompt with LONGENZ, SHORTENZ, AVAILENZ, MERGEDENZ, or NBRFENZ.

6.0 Update Information
======================
PADD.LIS   list of new sequence additions to PIR1 and PIR2
             (Section 1. and Section 2.)
PREV.LIS   list of entries revised in PIR1 and PIR2
             (Section 1. and Section 2.)
These files can be used as code files to generate a current list in PSQ 
using the PSQ>GET command.