Document PSD-CODATA-0703
                         PIR Installation Document
                     For the CODATA Format Release of
 
               PIR-International Protein Sequence Database
                    Release 68.00,  March 31, 2001
                 219241 entries    76174552 residues
                                 and
                   NRL_3D Sequence-Structure Database
        PATCHX non-redundant supplement to PIR-International PSD

              The collaborating centers of PIR-International:

                  Protein Information Resource (PIR)*
                National Biomedical Research Foundation
                       3900 Reservoir Road, NW,
                      Washington, DC  20007, USA


  Japan International Protein           Munich Information Center for
  Information Database (JIPID)             Protein Sequences (MIPS)
        Amakubo 1-16-1          GSF-Forschungszentrum f. Umwelt und Gesundheit
   Tsukuba 305-0005, Japan           am Max-Planck-Institut f. Biochemie
                                 Am Klopferspitz 18, D-82152 Martinsried, FRG


This database may be redistributed, provided that this notice be given to 
each user and that the words "Derived from" shall precede this notice if 
the database has been altered by the redistributor.

We have made every effort to ensure proper functioning of the programs 
and cannot be held responsible for the consequences to users of any 
problems encountered during their operation.


                *PIR is a registered mark of NBRF


PIR is partially supported by National Library of Medicine grant LM05798


1.0 CODATA Format
=================
This document describes the quarterly release of the PIR-International
Protein Sequence Database in CODATA format, formerly distributed on magnetic
media for non-VAX/VMS systems in fixed-length 80-byte records.

2.0 In this Release
===================
Release 68.00 of the Protein Sequence Database contains 219,241 entries
and 76,174,552 residues. The Release is separated into four datasets.
Section 1, Fully Classified Entries, contains 20,498 entries and
8,042,606 residues. Section 2, Verified and Classified Entries, contains 
198,276 entries and 68,038,310 residues. Section 3, Unverified Entries, 
contains 62 entries and 27,267 residues. Section 4, Unencoded or 
Untranslated Entries, contains 405 entries and 66,369 residues. A total 
of 31,598 superfamilies are represented in sections 1 and 2. The NRL_3D 
Sequence-Structure Database contains 23,291 entries and 4,527,721
residues corresponding to the March 2000 Release of the Protein Data Bank.
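
The section counts above can be cross-checked against the release totals.
The following is a minimal sketch in Python; the dictionary layout and the
names SECTIONS, RELEASE_TOTAL and section_totals are illustrative assumptions,
and only the figures themselves come from this document.

```python
# Entry and residue counts per section, copied from the text above.
# The dictionary layout and the names used here are illustrative only.
SECTIONS = {
    "PIR1": (20_498, 8_042_606),
    "PIR2": (198_276, 68_038_310),
    "PIR3": (62, 27_267),
    "PIR4": (405, 66_369),
}
RELEASE_TOTAL = (219_241, 76_174_552)


def section_totals(sections):
    """Sum the (entries, residues) pairs over all sections."""
    entries = sum(e for e, _ in sections.values())
    residues = sum(r for _, r in sections.values())
    return entries, residues


if __name__ == "__main__":
    totals = section_totals(SECTIONS)
    print("entries:", totals[0], "residues:", totals[1])
    print("consistent with release totals:", totals == RELEASE_TOTAL)
```

The same comparison may be useful after loading the four .DAT files, by
substituting counts obtained from the installed data.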

3.0 Features in this Release
============================
Starting with Release 64.00 of the Protein Sequence Database, PIR-International
is including status information in protein titles, function and complex records.
These new status identifiers are as follows.

[validated] = in a title or function block means that one of the references
in the entry contains some experimental evidence for the protein's function.

[similarity] = in a title or function block means that the name and/or 
function has been assigned by end-to-end sequence similarity with other 
entries that have that same name or function.

[imported] = in a title means that the name was imported with the sequence from
GenBank, EMBL, DDBJ, or another source and has not been verified by PIR.

Complete coverage of the entire database will not be obtained for several
releases.  The absence of a status identifier at this time should NOT be taken
as an indication that the information in the title or function blocks is not
correct or has not been evaluated by PIR staff.   
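
Software that reads titles or function records may want to pick out these
status identifiers. The sketch below is one possible approach in Python. It
assumes the identifier appears literally in square brackets within the text,
as written above; the names STATUS_PATTERN and find_status and the sample
titles are hypothetical and are not taken from the database.

```python
import re

# The three status identifiers named above. This sketch assumes they appear
# literally, in square brackets, somewhere in the title or function text.
STATUS_PATTERN = re.compile(r"\[(validated|similarity|imported)\]")


def find_status(text):
    """Return the first status identifier found in text, or None if absent."""
    match = STATUS_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical title strings, used only for illustration.
    for title in ("example protein A [validated]", "example protein B"):
        print(title, "->", find_status(title))
```

A result of None only means that no status identifier has been assigned yet;
as noted above, it says nothing about the correctness of the entry.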
                                               
!!!!!!!!!!!!!!NOTE CHANGES IN FILES AVAILABLE !!!!!!!!!!!!!!!!!

IT IS PIR'S INTENTION TO STOP PROVIDING INDEX FILES FOR THE
NBRF-PIR FORMAT AND THE CODATA FORMAT AFTER RELEASE 66.00!!

THE *.REF, *.SEQ, *.NAM, AND *.DAT FLAT FILES WILL CONTINUE.

IF YOU WISH US TO CONTINUE TO PROVIDE ANY OF THE INDEX FILES
PLEASE LET US KNOW. CONTACT PIRMAIL@NBRF.GEORGETOWN.EDU .

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                                                   
3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported can be tracked 
by the GenBank PID and NID cross references located in accession blocks 
of entries in the Protein Sequence Database. GenBank entries whose 
references do not appear in the PIR-International database are also 
candidates for inclusion and many of these candidates will be merged 
into larger sequence reports. In 1995, the PIR4 dataset was introduced 
for sequences that most researchers would not normally wish to include 
in searches for molecular, systematic or evolutionary studies. Since this 
data is in the published literature but may not be suitable for all 
users, a separate dataset is maintained.

3.2 Database Statistics
=======================
Statistical information about the database is available in the 
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short 
directory ordered by Superfamily number, the PRINDX file contains 
statistics about taxonomic frequency and longest sequence lengths in 
various classifications. The SUPFSTAT file contains statistics about 
Superfamily classification completeness and largest Superfamily groups. 
PADD and PREV files reflect additions/revisions to both the PIR1 and 
PIR2 sections of the database.

4.0 Technical Developers Information
====================================
The Technical Developers Bulletin is a document describing current and future 
database formats. The Bulletins are available from the PIR WWW Server at URL 
http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html 
This electronic bulletin provides detailed specifications of the database format 
and serves as an "early warning system" for software developers and others who 
are concerned about changes in the format and standards for the PIR-
International databases.

If you are interested in the technical aspects of these database changes and 
would like to be placed on the mailing list for the Technical Bulletin, send a 
brief electronic mail note to PIRMAIL@NBRF.Georgetown.Edu. 

5.0 File Organization
=====================
CODATA.TXT  contains a copy of this document
PROTEIN.TXT contains a document describing the CODATA Sequence Exchange format 
PRINDX.LIS  contains a listing of the Superfamily number and title of each entry
            in Section 1 (PIR1) and Section 2 (PIR2) that is classified as well 
            as statistics on taxonomic frequency and sequence lengths
PIR1.DAT contains the PIR-International Protein Sequence Database
           (Section 1. Fully Classified Entries)
PADD.LIS contains a list of entries added in this update to
           Section 1 (PIR1) and Section 2 (PIR2)
PREV.LIS contains a list of entries revised or deleted in this update of 
           Section 1 (PIR1) and Section 2 (PIR2)
PIR2.DAT contains the PIR-International Protein Sequence Database
           (Section 2. Verified and Classified Entries) 
PIR3.DAT contains the PIR-International Protein Sequence Database
           (Section 3. Unverified Entries) 
PIR4.DAT contains the PIR-International Protein Sequence Database
           (Section 4. Unencoded or Untranslated Entries) 
SUPFAMNUM.LIS contains a listing of the Superfamily names represented in 
           Section 1 (PIR1) and Section 2 (PIR2) organized numerically by      
           Superfamily number 
SUPFSTAT.LIS contains statistics about Superfamily classification represented in 
           Section 1 (PIR1) and Section 2 (PIR2)
TAXONOMY.LIS contains a hierarchically ordered list of species names found in 
           the data sets
JOURNALS.LIS contains an alphabetical listing of journal abbreviations as found 
           in the PIR 
SGC.LIS contains a listing of eight Special Genetic Code tables
           depicting different codon usage for different 
NRL3D.DOC contains a document describing the NRL3D database structure and origin
NRL3D.DAT contains the NRL3D database corresponding to the indicated release of 
           the Protein Data Bank
NRL3D.NUM contains a table correlating the numbering in PDB and NRL_3D for each
           entry in NRL_3D
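
After copying the release to a local directory, the file list above can be
used as a simple completeness check. The following Python sketch is an
illustration only: the function name missing_files is hypothetical, and the
case-insensitive comparison is an assumption about how file names may appear
on different systems.

```python
from pathlib import Path

# File names as listed above. Names are compared case-insensitively because
# the spelling on a given system may differ (an assumption of this sketch).
EXPECTED_FILES = [
    "CODATA.TXT", "PROTEIN.TXT", "PRINDX.LIS", "PIR1.DAT", "PADD.LIS",
    "PREV.LIS", "PIR2.DAT", "PIR3.DAT", "PIR4.DAT", "SUPFAMNUM.LIS",
    "SUPFSTAT.LIS", "TAXONOMY.LIS", "JOURNALS.LIS", "SGC.LIS",
    "NRL3D.DOC", "NRL3D.DAT", "NRL3D.NUM",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in directory."""
    present = {p.name.upper() for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Check the current directory and report anything that is absent.
    absent = missing_files(".")
    print("missing files:", ", ".join(absent) if absent else "none")
```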

6.0 File Descriptions/Formats
=============================
PIR1.DAT represents Section 1 (Fully Classified Entries) of the Protein Sequence 
Database. PIR2.DAT represents Section 2 (Verified and Classified Entries) of the 
Protein Sequence Database. PIR3.DAT represents Section 3 (Unverified Entries) of 
the Protein Sequence Database. PIR4.DAT represents Section 4 (Unencoded or 
Untranslated Entries) of the Protein Sequence Database. Entries in PIR3 contain 
basic information that has not been reviewed; only the sequence and reference 
have been checked and verified. PIR2 and PIR3 are ordered by species according to 
the hierarchy represented in TAXONOMY.LIS. The PIR-International Protein 
Sequence Database sets PIR1, PIR2, PIR3 and PIR4 are distributed in the NBRF 
implementation of the CODATA Sequence Data Exchange Format.

6.1 Superfamily List Files
==========================
The PRINDX.LIS file is a database listing (PIR1 and PIR2) of classified 
sequences ordered by Superfamily number. This file also contains frequencies of 
a given taxonomy group and the entry with the longest sequence in each group. 
SUPFAMNUM.LIS contains a listing of the Superfamily names found for each 
Superfamily number represented in Section 1 (PIR1) and Section 2 (PIR2).

6.2 PIR1 Update Information
===========================
Files PADD.LIS and PREV.LIS contain entries that have changed since the last 
release. PADD contains a short directory of all entries that are new in this 
release (PIR1 and PIR2). PREV contains a short directory of all entries that had 
sequence revisions, text revisions, or code changes, or that were deleted.

6.3 Data Files
==============
In each data file (.LIS) and in the NRL3D.NUM file, any line longer than 78 
characters is wrapped onto the next record. In such a case, the first record 
is terminated with '##', and any record beginning with '##' is a continuation 
of the record immediately preceding it. All blanks or spaces are preserved. To 
reassemble the file with longer record lengths, append the wrapped records 
together, omitting the '##' pairs.
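
The reassembly rule described above can be sketched in Python as follows.
This is a minimal illustration, not a reference implementation. It assumes
the records are available as newline-delimited text lines and that a wrapped
record carries no padding after its terminating '##'; the names MARKER and
unwrap_records are hypothetical.

```python
MARKER = "##"


def unwrap_records(records):
    """Join records wrapped with '##' markers back into logical lines.

    A record that begins with '##' is treated as a continuation of the
    previous record. The '##' that terminates the previous record and the
    '##' that begins the continuation are both dropped; all other
    characters, including blanks, are kept.
    """
    logical = []
    for raw in records:
        record = raw.rstrip("\r\n")  # remove only the line terminator
        if record.startswith(MARKER) and logical:
            previous = logical[-1]
            if previous.endswith(MARKER):
                previous = previous[:-len(MARKER)]
            logical[-1] = previous + record[len(MARKER):]
        else:
            logical.append(record)
    return logical


if __name__ == "__main__":
    # Synthetic records for illustration; not taken from a real .LIS file.
    sample = ["A" * 78 + "##\n", "##" + "B" * 10 + "\n", "short line\n"]
    for line in unwrap_records(sample):
        print(len(line), line)
```

The terminating '##' is removed only once a continuation record is actually
seen, so a record that merely happens to end in '##' is left unchanged.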

6.3.1 Taxonomic Data Files
==========================
The TAXONOMY.LIS and SGC.LIS files are compiled and maintained by Andrzej
Elzanowski of MIPS at the Max-Planck-Institut Fuer Biochemie in Martinsried, 
Germany. These files represent data as it appears in Section 1 (PIR1) and 
Section 2 (PIR2) of the Protein Sequence Database. The PIR2 and PIR3 data sets 
are ordered according to the hierarchy of the TAXONOMY.LIS file.

6.3.2 Journal List
==================
The JOURNALS.LIS file is compiled and maintained by the National Library of 
Medicine (NLM). Journal names followed by an asterisk are no longer in use but 
continue to exist in the database.

6.3.3 NRL3D.NUM File
==================== 
The NRL3D.NUM file contains residue numbering information allowing sequence 
reconstruction, since the numbering system used in entries of the Protein Data 
Bank may not correspond to that used in the PIR-International database. Entry 
title lines longer than 78 characters are wrapped as described in Section 6.3.