Document PSD-CODATA-0701
                         PIR Installation Document
                     For the CODATA Format Release of
 
             PIR-International Protein Sequence Database
                    Release 63.00,  December 30, 1999
                 168,808 entries    58,629,455 residues

                                 and:
                   NRL_3D Sequence-Structure Database
        PATCHX non-redundant supplement to PIR-International PSD


              The collaborating centers of PIR-International:

                  Protein Information Resource (PIR)*
                National Biomedical Research Foundation
                        3900 Reservoir Road, NW,
                      Washington, DC  20007, USA


  Japan International Protein           Munich Information Center for
  Information Database (JIPID)             Protein Sequences (MIPS)
        Amakubo 1-16-1          GSF-Forschungszentrum f. Umwelt und Gesundheit
   Tsukuba 305-0005, Japan           am Max-Planck-Institut f. Biochemie
                                 Am Klopferspitz 18, D-82152 Martinsried, FRG


This database may be redistributed without prior consent, provided that
this notice be given to each user and that the words "Derived from" shall 
precede this notice if the database has been altered by the redistributor.

We have made every effort to ensure proper functioning of the programs 
and cannot be held responsible for the consequences to users of any 
problems encountered during their operation.


                *PIR is a registered mark of NBRF


PIR is partially supported by National Library of Medicine grant LM05798


The following are the staff members from PIR, MIPS, and JIPID who
contributed to Release 63.00 of the Protein Sequence Database:


Protein Information Resource (PIR)
==================================
Winona C. Barker        Ph.D.  Director, PIR
John S. Garavelli       Ph.D.  Associate Director, PIR
Cathy H. Wu             Ph.D.  Director of Bioinformatics
Bruce C. Orcutt         Ph.D.  Senior Computer Scientist
Lai-Su L. Yeh           Ph.D.  Senior Scientist
Geetha Y. Srinivasarao  Ph.D.  Computer Scientist
Peter B. McGarvey       Ph.D.  Senior Scientist
Hongzhan Huang          Ph.D.  Senior Scientist
Chunlin Xiao            Ph.D.  Bioinformatics Programmer
Joseph F. Janda          B.S.  Technical Services Coordinator     
Robert S. Ledley       D.D.S.  Principal Investigator


Munich Information Center for Protein Sequences (MIPS)
======================================================
Werner Mewes            Ph.D.  Director
Friedhelm Pfeiffer      Ph.D.  Head of Annotation Group
Gisela Fobo             Ph.D. *Senior Annotator
Corinna Keilmann        Ph.D. *Annotator
Irmtraut Dunger          Eng.  Annotator
Ute Kaemper             Ph.D. *Annotator
Goar Astvatsatourian    Ph.D.  Annotator
Petr Jordan              Eng.  System Manager


Japan International Protein Information Database (JIPID)
========================================================
Akira Tsugita           Ph.D.  Data Bank Chairman, Annotator, Editor
Jinya Otsuka            Ph.D.  Data Bank Vice Chairman, Editor
Takashi Kunisawa        Ph.D.  System Manager, Computer Support, Editor
Hiromi Suzuki           Ph.D.  Research Associate, Computer Support
Kenji Miyazaki          Ph.D.  Editor, Data Entry
Tatsuhiko Yagi          Ph.D. *Annotator
Hiroko Toda             Ph.D. *Research Associate, Annotator
Masaharu Kamo           Ph.D.  Research Associate, Annotator, Brain
Yuzo Nozu               Ph.D. *Laboratory Chief, Annotator
Fumio Arisaka           Ph.D. *Associate Professor, Annotator
Ruqun Shen               M.S.  Data Entry
Lin Xu                   -     Data Entry
Takao Kawakami          Ph.D.  Researcher, Annotator, Plant
Kazuo Satake            Ph.D. *Enzyme Database
Miyuki Tsukahara         -     Secretary
  
* Part time personnel


1.0 CODATA Format
=================
This document describes the quarterly release of the PIR-International Protein 
Sequence Database and the NRL_3D Sequence-Structure Database in CODATA format, 
formerly distributed on magnetic media for non-VAX/VMS systems in fixed-length 
80-byte records.

2.0 In this Release
===================
Release 63.00 of the Protein Sequence Database contains 168,808 entries
and 58,629,455 residues. The Release is separated into four datasets.
Section 1, Fully Classified Entries, contains 20,032 entries and
7,820,966 residues. Section 2, Verified and Classified Entries, contains
147,632 entries and 50,341,994 residues. Section 3, Unverified Entries,
contains 779 entries and 409,111 residues. Section 4, Unencoded or
Untranslated Entries, contains 365 entries and 57,384 residues. A total
of 10,307 superfamilies are represented in sections 1 and 2. The NRL_3D
Sequence-Structure Database contains 14,791 entries and 2,636,724
residues corresponding to the June 1998 Release of the Protein Data Bank.
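
As a consistency check, the per-section counts quoted above can be summed and
compared with the release totals. The following is a minimal sketch; the
dictionary keys PIR1 to PIR4 are used only as labels for Sections 1 to 4.

```python
# (entries, residues) for each section, copied from the figures quoted above.
sections = {
    "PIR1": (20_032, 7_820_966),    # Section 1, Fully Classified Entries
    "PIR2": (147_632, 50_341_994),  # Section 2, Verified and Classified Entries
    "PIR3": (779, 409_111),         # Section 3, Unverified Entries
    "PIR4": (365, 57_384),          # Section 4, Unencoded or Untranslated Entries
}

# Sum each column across the four sections.
total_entries = sum(entries for entries, _ in sections.values())
total_residues = sum(residues for _, residues in sections.values())

print(total_entries, total_residues)  # prints: 168808 58629455
```

Both sums agree with the release totals of 168,808 entries and 58,629,455
residues.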

3.0 Features in this Release
============================

3.1 Ongoing Special Projects
============================

Entries extracted from GenBank CDS regions and imported can be tracked 
by the GenBank PID and NID cross references located in accession blocks 
of entries in the Protein Sequence Database. GenBank entries whose 
references do not appear in the PIR-International database are also 
candidates for inclusion and many of these candidates will be merged 
into larger sequence reports. In 1995, the PIR4 dataset was introduced 
for sequences that most researchers would not normally wish to include 
in searches for molecular, systematic or evolutionary studies. Since this 
data is in the published literature but may not be suitable for all 
users, a separate dataset is maintained.

3.2 Database Statistics
=======================
Statistical information about the database is available in the 
PRINDX.LIS and SUPFSTAT.LIS files. In addition to a database short 
directory ordered by Superfamily number, the PRINDX file contains 
statistics about taxonomic frequency and longest sequence lengths in 
various classifications. The SUPFSTAT file contains statistics about 
Superfamily classification completeness and largest Superfamily groups. 
PADD and PREV files reflect additions/revisions to both the PIR1 and 
PIR2 sections of the database.

4.0 Technical Developers Information
====================================
The Technical Developers Bulletin is a document describing current and future 
database formats. The Bulletins are available from the PIR WWW Server at URL 
http://pir.georgetown.edu/pirwww/otherinfo/doc/techbulletin.html 
This electronic bulletin provides detailed specifications of the database format 
and serves as an "early warning system" for software developers and others who 
are concerned about changes in the format and standards for the PIR-
International databases.

If you are interested in the technical aspects of these database changes and 
would like to be placed on the mailing list for the Technical Bulletin, send a 
brief electronic mail note to PIRMAIL@NBRF.Georgetown.Edu. 

5.0 File Organization
=====================
CODATA.TXT  contains a copy of this document
PROTEIN.TXT contains a document describing the CODATA Sequence Exchange format 
PRINDX.LIS  contains a listing of the Superfamily number and title of each entry
            in Section 1 (PIR1) and Section 2 (PIR2) that is classified, as well 
            as statistics on taxonomic frequency and sequence lengths
PIR1.DAT contains the PIR-International Protein Sequence Database
           (Section 1. Fully Classified Entries)
PADD.LIS contains a list of entries added in this update to
           Section 1 (PIR1) and Section 2 (PIR2)
PREV.LIS contains a list of entries revised or deleted in this update of 
           Section 1 (PIR1) and Section 2 (PIR2)
PIR2.DAT contains the PIR-International Protein Sequence Database
           (Section 2. Verified and Classified Entries) 
PIR3.DAT contains the PIR-International Protein Sequence Database
           (Section 3. Unverified Entries) 
PIR4.DAT contains the PIR-International Protein Sequence Database
           (Section 4. Unencoded or Untranslated Entries) 
SUPFAMNUM.LIS contains a listing of the Superfamily names represented in 
           Section 1 (PIR1) and Section 2 (PIR2) organized numerically by      
           Superfamily number 
SUPFSTAT.LIS contains statistics about Superfamily classification represented in 
           Section 1 (PIR1) and Section 2 (PIR2)
TAXONOMY.LIS contains a hierarchically ordered list of species names found in 
           the data sets
JOURNALS.LIS contains an alphabetical listing of journal abbreviations as found 
           in the PIR 
SGC.LIS contains a listing of eight Special Genetic Code tables
           depicting different codon usage for different 
NRL3D.DOC contains a document describing the NRL3D database structure and origin
NRL3D.DAT contains the NRL3D database corresponding to the indicated release of 
           the Protein Data Bank
NRL3D.NUM contains a table correlating the numbering in PDB and NRL_3D for each
           entry in NRL_3D
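
After unpacking a release, it may be useful to confirm that every file listed
above is present. The following is a minimal sketch, not part of the
distribution: the function name missing_files is hypothetical, and the
case-insensitive name comparison is an assumption made for file systems that
store the names in lower case.

```python
from pathlib import Path

# File names taken from the File Organization list above.
EXPECTED_FILES = [
    "CODATA.TXT", "PROTEIN.TXT", "PRINDX.LIS", "PIR1.DAT", "PADD.LIS",
    "PREV.LIS", "PIR2.DAT", "PIR3.DAT", "PIR4.DAT", "SUPFAMNUM.LIS",
    "SUPFSTAT.LIS", "TAXONOMY.LIS", "JOURNALS.LIS", "SGC.LIS",
    "NRL3D.DOC", "NRL3D.DAT", "NRL3D.NUM",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not found in `directory`.

    Names are compared case-insensitively (an assumption, see lead-in).
    """
    present = {p.name.upper() for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in expected if name not in present]
```

Calling missing_files on the directory that holds the release returns an empty
list when all files are present.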

6.0 File Descriptions/Formats
=============================
PIR1.DAT represents Section 1 (Fully Classified Entries) of the Protein Sequence 
Database. PIR2.DAT represents Section 2 (Verified and Classified Entries) of the 
Protein Sequence Database. PIR3.DAT represents Section 3 (Unverified Entries) of 
the Protein Sequence Database. PIR4.DAT represents Section 4 (Unencoded or 
Untranslated Entries) of the Protein Sequence Database. Entries in PIR3 contain 
basic information that has not been reviewed; only the sequence and reference 
have been checked and verified. PIR2 and PIR3 are ordered by species according to 
the hierarchy represented in TAXONOMY.LIS. The PIR-International Protein 
Sequence Database sets PIR1, PIR2, PIR3 and PIR4 are distributed in the NBRF 
implementation of the CODATA Sequence Data Exchange Format.

6.1 Superfamily List Files
==========================
The PRINDX.LIS file is a database listing (PIR1 and PIR2) of classified 
sequences ordered by Superfamily number. This file also contains frequencies of 
a given taxonomy group and the entry with the longest sequence in each group. 
SUPFAMNUM.LIS contains a listing of the Superfamily names found for each 
Superfamily number represented in Section 1 (PIR1) and Section 2 (PIR2).

6.2 PIR1 Update Information
===========================
Files PADD.LIS and PREV.LIS contain entries that have changed since the last 
release. PADD contains a short directory of all entries that are new in this 
release (PIR1 and PIR2). PREV contains a short directory of all entries that had 
sequence revisions, text revisions, or code changes, or that were deleted.

6.3 Data Files
==============
The structure of each data file (.LIS) and the NRL3D.NUM file is such that any 
line longer than 78 characters is wrapped to the next record. In such a case, 
the first record is terminated with '##'; any record beginning with '##' is a 
continuation of the record immediately preceding it. All blanks or 
spaces are preserved. To reassemble the file to larger record lengths, append 
multiple records together without including the '##' pairs.
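
The reassembly rule above can be sketched as follows. This is a minimal
illustration, not a tool supplied with the release. The function name
unwrap_records is hypothetical, and the sketch assumes each record has already
been read as a string with its line terminator removed and no trailing
padding. A trailing '##' is dropped only when the next record begins with
'##', so that a line which merely happens to end in '##' is left intact.

```python
def unwrap_records(records):
    """Join wrapped records into logical lines, dropping the '##' pairs."""
    records = list(records)
    logical = []
    for i, record in enumerate(records):
        # A leading '##' marks a continuation of the preceding record.
        continues_previous = record.startswith("##")
        # The record is wrapped if the record after it is a continuation.
        continued_by_next = (
            i + 1 < len(records) and records[i + 1].startswith("##")
        )
        body = record[2:] if continues_previous else record
        if continued_by_next and body.endswith("##"):
            body = body[:-2]  # drop the terminating '##'
        if continues_previous and logical:
            logical[-1] += body  # blanks are preserved as-is
        else:
            logical.append(body)
    return logical
```

For example, a 78-character record terminated with '##' and followed by a
record beginning with '##' is returned as one logical line that contains
neither marker.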

6.3.1 Taxonomic Data Files
==========================
The TAXONOMY.LIS and SGC.LIS files are compiled and maintained by Andrzej
Elzanowski of MIPS at the Max-Planck-Institut Fuer Biochemie in Martinsried, 
Germany. These files represent data as it appears in Section 1 (PIR1) and 
Section 2 (PIR2) of the Protein Sequence Database. The PIR2 and PIR3 data sets 
are ordered according to the hierarchy of the TAXONOMY.LIS file.

6.3.2 Journal List
==================
The JOURNALS.LIS file is compiled and maintained by the National Library of 
Medicine (NLM). Journal names followed by an asterisk are no longer in use but 
continue to exist in the database.

6.3.3 NRL3D.NUM File
==================== 
The NRL3D.NUM file contains residue numbering information allowing sequence 
reconstruction since the numbering system used in entries of the Protein Data 
Bank may not correspond to that used in the PIR-International database. Entry 
title lines longer than 78 characters are wrapped as described in section 6.3.