Document NBRF-INSTALL-703
                        PIR Installation Document
                 For the NBRF-PIR Format Release of the
  
            P R O T E I N  S E Q U E N C E  D A T A B A S E
                           of PIR-International

                      Release 80.00, December 31, 2004
                    283416 sequences, 96216763 residues


                    Protein Information Resource (PIR)*
                  National Biomedical Research Foundation
                   3900 Reservoir Road, N.W., Box 571414
                      Washington, DC 20057-1414, USA
   
  
  Japan International Protein           Munich Information Center for
  Information Database (JIPID)             Protein Sequences (MIPS)
          Amakubo 1-16-1          GSF-Forschungszentrum f. Umwelt und Gesundheit
      Tsukuba 305-0005, Japan          am Max-Planck-Institut f. Biochemie
                                   Am Klopferspitz 18, D-82152 Martinsried, FRG
  
  
  This database may be redistributed without prior consent, provided that
  this notice be given to each user and that the words "Derived from" shall
  precede this notice if the database has been altered by the redistributor.
  
  We have made every effort to ensure proper functioning of the programs
  and cannot be held responsible for the consequences to users of any
  problems encountered during their operation.
  
                   Copyright 2000, PIR-International.
  
                   *PIR is a registered mark of NBRF
  
  
  PIR is partially supported by National Library of Medicine grant LM05798

                       *************************


                       * PIR-PSD Final Release *
                       =========================

   Release 80.00 is the final release for the PIR-International Protein
   Sequence Database (PIR-PSD). In 2002, PIR joined EBI and SIB to form
   the UniProt consortium. PIR-PSD sequences and annotations have been
   integrated into the UniProt Knowledgebase. Bi-directional cross-references
   between UniProt (UniProt Knowledgebase and/or UniParc) and PIR-PSD are
   established to allow easy tracking of former PIR-PSD entries. PIR-PSD
   unique sequences, reference citations, and experimentally-verified data
   can now be found in the relevant UniProt records. This final version of
   the database will be accessible from the PIR web site and downloadable
   from the FTP site.

 
  1.0 NBRF Format
  ===============
  This document describes the quarterly release of the PIR-International
  Protein Sequence Database and the NRL_3D Sequence-Structure Database in
  NBRF-PIR format, formerly distributed on magnetic media for VAX/VMS
  systems.
  
  2.0 In this Release
  ===================
  Release 80.00 of the Protein Sequence Database contains 283,416 entries
  and 96,216,763 residues. The Release is separated into four datasets.
  Section 1, Fully Classified Entries, contains 20,685 entries and
  8,103,841 residues. Section 2, Verified and Classified Entries, contains
  262,300 entries and 88,045,621 residues. Section 3, Unverified Entries,
  contains 24 entries and 74 residues. Section 4, Unencoded or
  Untranslated Entries, contains 407 entries and 67,227 residues. A total
  of 36,403 superfamilies, including 5,700 fully curated ones, are
  represented in sections 1 and 2.
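
  As a quick consistency check, the per-section figures quoted above can
  be summed and compared with the release totals. The sketch below is
  illustrative only; the names SECTIONS, RELEASE_TOTAL, and
  section_totals are not part of the distribution.

```python
# Per-section (entries, residues) counts as stated in section 2.0.
SECTIONS = {
    "PIR1": (20_685, 8_103_841),    # Fully Classified Entries
    "PIR2": (262_300, 88_045_621),  # Verified and Classified Entries
    "PIR3": (24, 74),               # Unverified Entries
    "PIR4": (407, 67_227),          # Unencoded or Untranslated Entries
}

# Release-wide totals as stated in the header of this document.
RELEASE_TOTAL = (283_416, 96_216_763)


def section_totals(sections):
    """Return (total entries, total residues) summed over all sections."""
    entries = sum(e for e, _ in sections.values())
    residues = sum(r for _, r in sections.values())
    return entries, residues


if __name__ == "__main__":
    totals = section_totals(SECTIONS)
    print("sum of sections:", totals)
    print("matches release totals:", totals == RELEASE_TOTAL)
```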
 
  3.0 Features in this Release
  ============================
  Starting with Release 64.00 of the Protein Sequence Database, PIR-International
  has included status information in protein titles and in function and complex
  records. The status identifiers are as follows.
  
  [validated] = in a title or function block means that one of the references
  in the entry contains some experimental evidence for the protein's function.
  
  [similarity] = in a title or function block means that the name and/or
  function has been assigned by end-to-end sequence similarity with other
  entries that have that same name or function.
  
  [imported] = in a title means that the name was imported with the sequence from
  GenBank, EMBL, DDBJ, or another source and has not been verified by PIR.
  
  Complete coverage of the entire database will not be obtained for several
  releases.  The absence of a status identifier at this time should NOT be taken
  as an indication that the information in the title or function blocks is not
  correct or has not been evaluated by PIR staff.
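
  Software that reads titles may want to separate these bracketed status
  identifiers from the rest of the title text. A minimal sketch follows,
  assuming the identifier appears literally as "[validated]",
  "[similarity]", or "[imported]" somewhere in the title string. The
  function name split_status and the sample titles are made up for
  illustration.

```python
import re

# The three status identifiers described in section 3.0.
_TAG_RE = re.compile(r"\[(validated|similarity|imported)\]")


def split_status(title):
    """Return (title without status tags, list of tags found).

    An empty list means no identifier was present, which, as noted
    above, says nothing about whether the title is correct.
    """
    tags = _TAG_RE.findall(title)
    # Remove the tags, then collapse any doubled whitespace left behind.
    cleaned = " ".join(_TAG_RE.sub(" ", title).split())
    return cleaned, tags


if __name__ == "__main__":
    # Hypothetical titles, not taken from the database.
    print(split_status("example protein A [validated] - human"))
    print(split_status("example protein B - mouse"))
```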
  
  3.1 Ongoing Special Projects
  ============================
  
  Entries extracted from GenBank CDS regions and imported can be tracked
  by the GenBank PID, PIDN (protein_id), and NID cross references located
  in accession blocks of entries in the Protein Sequence Database. GenBank
  entries whose references do not appear in the PIR-International database
  are also candidates for inclusion, and many of these candidates will be
  merged into larger sequence reports. In 1995, the PIR4 dataset was
  introduced for sequences that most researchers would not normally wish to
  include in searches for molecular, systematic, or evolutionary studies.
  Because these data are in the published literature but may not be suitable
  for all users, a separate dataset is maintained.
  
  3.2 Database Statistics
  =======================
  Statistical information about the database is available in the
  PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short
  directory ordered by Superfamily number, the PRINDX file contains
  statistics about taxonomic frequency and longest sequence lengths in
  various classifications. The SUPFSTAT file contains statistics about
  Superfamily classification completeness and largest Superfamily groups.
  PADD and PREV files reflect additions/revisions to both the PIR1 and
  PIR2 sections of the database.
  
  4.0 Technical Developers Information
  ====================================
  The Technical Developers Bulletin is a document describing current and
  future database formats. The Bulletins are available from the PIR WWW
  Server at URL
  http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html
  This electronic bulletin provides detailed specifications of the
  database format and serves as an "early warning system" for software
  developers and others who are concerned about changes in the format and
  standards for the PIR-International databases.
  
  If you are interested in the technical aspects of these database changes
  and would like to be placed on the mailing list for the Technical
  Bulletin, send a brief electronic mail note to
  PIRMAIL@NBRF.Georgetown.Edu.
  
  5.0 Protein Sequence Database Files
  ===================================
  
  5.1 Database Index and Documentation Files
  ==========================================
  0NBRF.TXT      this document
  PRINDX.LIS     database listing of Section 1 and 2 (PIR1 and PIR2) and
                  statistics on taxonomic frequency and sequence lengths
  0PROTEIN.TXT   document describing database file format
  0NRL_3D.TXT    document describing NRL_3D database
  
  5.2 Primary Protein Sequence Database files
  =============================================
   .SEQ    primary file containing the title and sequence for each entry
              (ASCII).
   .REF    primary file containing the title and annotation information
             for each entry (ASCII).
   .INX    index file for the SEQ and REF files (Binary).  It allows PSQ
             to use the VAX-11 RMS RFA record access mode for random
             access into these files.  If either the .SEQ or .REF file is
             altered in any way, the information in the .INX file becomes
             invalid and the database system programs will not operate.
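
  Because the .SEQ files are plain ASCII, they can also be read
  sequentially without the .INX index. The sketch below assumes the
  commonly used NBRF-PIR layout: a header line starting with ">" that
  holds a type code, a semicolon, and the entry code; one title line;
  then sequence lines, the last of which ends with "*". This layout is
  an assumption here, and 0PROTEIN.TXT remains the authoritative
  description. The function name read_nbrf and the sample entry are
  hypothetical.

```python
def read_nbrf(lines):
    """Yield (entry code, title, sequence) from NBRF-PIR formatted lines.

    Assumed layout: '>XX;CODE', then a title line, then sequence lines
    terminated by '*'. Whitespace inside the sequence is discarded.
    """
    code = None
    title = None
    seq = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Header line: keep the text after the first semicolon.
            code = line.split(";", 1)[1].strip() if ";" in line else line[1:].strip()
            title = None
            seq = []
        elif code is not None and title is None:
            # The line directly after the header is the title.
            title = line.strip()
        elif code is not None:
            seq.append("".join(line.split()))
            if line.rstrip().endswith("*"):
                # '*' terminates the sequence; emit the finished entry.
                yield code, title, "".join(seq).rstrip("*")
                code = None


if __name__ == "__main__":
    # Made-up entry for illustration; not taken from the database.
    sample = [
        ">P1;XXXX1\n",
        "hypothetical example protein\n",
        "ACDEFGHIKL MNPQRSTVWY\n",
        "ACDEF*\n",
    ]
    for entry in read_nbrf(sample):
        print(entry)
```

  With real data, an open file object can be passed in place of the
  sample list, for example read_nbrf(open("PIR1.SEQ")).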
  
  
  !!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!!
  
  IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE
  NBRF-PIR FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!!
  
  THE *.REF, *.SEQ, *.NAM, AND *.DAT FLAT FILES WILL CONTINUE.
  
  IF YOU WISH US TO CONTINUE TO PROVIDE ANY OF THE INDEX FILES
  PLEASE LET US KNOW. CONTACT PIRMAIL@NBRF.GEORGETOWN.EDU .
  
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  
  
  5.3 Database files
  ==================
  The Protein Sequence Database is partitioned into four sections,
  named PIR1, PIR2, PIR3, and PIR4, and is contained in the following files.
  
  PIR1 (Section 1. Fully Classified Entries)
    PIR1.NAM   PIR1.REF  PIR1.SEQ

  PIR2 (Section 2. Verified and Classified Entries) The PIR2 data set
  continues to be ordered by taxonomic classification as depicted in
  the file TAXONOMY.LIS for those entries that are not classified by
  Superfamily number.
    PIR2.NAM   PIR2.REF  PIR2.SEQ
  
  PIR3 (Section 3. Unverified Entries) Entries in this dataset have not
  been reviewed; only the sequences and references have been checked
  and verified.
    PIR3.NAM   PIR3.REF  PIR3.SEQ
  
  PIR4 (Section 4. Unencoded or Untranslated Entries) Entries in this
  dataset fall into one of the following categories: conceptual
  translations of artifactual nucleotide sequences; conceptual
  translations of nucleotide sequences that are not transcribed or
  translated or are abortively translated pseudogenes; protein
  sequences or conceptual translations of nucleotide sequences that are
  extensively genetically engineered; polypeptide sequences that are not
  genetically encoded and not produced on ribosomes.
    PIR4.NAM   PIR4.REF  PIR4.SEQ
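
  After downloading, it can be useful to confirm that all twelve files
  listed above are present. The following is a minimal sketch, assuming
  the files sit together in one directory; names are compared without
  regard to case, since the case used by a given mirror is not specified
  here. The function names expected_files and missing_files are
  hypothetical.

```python
from pathlib import Path

SECTIONS = ("PIR1", "PIR2", "PIR3", "PIR4")
EXTENSIONS = ("NAM", "REF", "SEQ")


def expected_files():
    """Return the twelve file names listed in section 5.3."""
    return [f"{section}.{ext}" for section in SECTIONS for ext in EXTENSIONS]


def missing_files(directory):
    """Return the expected file names not found in the given directory."""
    present = {p.name.upper() for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in expected_files() if name not in present]


if __name__ == "__main__":
    # Check the current directory and report anything that is absent.
    for name in missing_files("."):
        print("missing:", name)
```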