P R O T E I N  S E Q U E N C E  D A T A B A S E                    
                                                                                
                                                                                
                                                                                
                                                                                
                         PIR Document CXFSD-1293                                
                   CODATA Exchange Format Specification                         
                     Version 3.0,  December 31, 1993                            
                                                                                
                                                                                
                                                                                
                                                                                
                   Protein Information Resource (PIR)*                          
                 National Biomedical Research Foundation                        
                        3900 Reservoir Road, N.W.,                              
                        Washington, DC  20007, USA                              
                   E-mail: PIRMAIL@NBRF.Georgetown.Edu                          
                                                                                
                                                                                
                          In collaboration with:                                
International Protein Information |      Martinsried Institute for              
    Database in Japan (JIPID)     |       Protein Sequences (MIPS)              
   Science University of Tokyo    | Max Planck Institute for Biochemistry       
 2669 Yamazaki, Noda 278, Japan   | D-8033 Martinsried bei Muenchen, FRG        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
  This database may be redistributed without prior consent, provided            
  that this notice be given to each user and that the words "Derived            
  from" shall precede this notice if the database has been altered              
  by the redistributor.                                                         
                                                                                
                                                                                
                    *PIR is a registered mark of NBRF                           
      PIR is partially supported by the National Library of Medicine            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
________________________                                                        
1.0 Recent Modifications                                                        
                                                                                
In an effort to improve the coverage, accuracy, and completeness of the         
PIR-International Protein Sequence Database, a number of minor changes          
have been introduced into the CODATA formatted version of the Database          
beginning with Release 39.00. These changes were introduced to provide          
room for expansion, such as the inclusion of cross-references to other          
databases, and to make the overall presentation more uniform and more           
computer parsable.                                                              
                                                                                
For the most part, the changes involve modifications to record identifiers      
and subidentifiers and the replacement of comma and back-slash list             
separators with semicolons. In addition, a number of records have been          
expanded and new subfields have been added.                                     
                                                                                
We realize that even minor changes present difficulties to database users       
and have carefully limited such changes accordingly. However, it is not         
possible to make improvements without making changes. Below is an itemized      
list of affected records and corresponding format revisions.                    
                                                                                
                   Enhanced CODATA Record Descriptions                          
                                                                                
( 1) ENTRY                                                                      
     the #type field values are: "complete", "fragment", or "fragments"         
( 2) TITLE                                                                      
     the #EC-number field has been eliminated;                                  
     EC numbers are found within the title in parentheses                       
( 3) ALTERNATE_NAMES                                                            
     this identifier has changed from "ALTERNATE-NAME";                         
     alternate names are separated by semicolons rather than backslashes        
( 4) CONTAINS                                                                   
     this identifier changed from "INCLUDES";                                   
     contains names will be separated by semicolons rather than                 
       backslashes;                                                             
     #EC-number field has been eliminated (see 2 above)                         
                                                                                
( 5) ORGANISM                                                                   
     this identifier changed from "SOURCE";
     in the future, unique identifiers will be assigned for every species       
       appearing in an organism record; these identifiers will follow           
       immediately after the record identifier;                                 
     the scientific organism name field is now explicitly identified by         
       "#formal_name";                                                          
     the hyphen in #common_name identifier is replaced with an underscore       
       and common_name is lowercase;                                            
     two new fields have been added:                                            
         #variety                                                               
         #note                                                                  
     the HOST record has been eliminated; this information is now found
       in the #note field and will be moved to the Taxonomy file at a
       later date
                                                                                
( 6) PLACEMENT                                                                  
     this field has been eliminated; placement information is given in          
       the PIR*.SNX index file; the PRINDX.LIS data file contains a             
       listing of all primary placement numbers in Sections PIR1 and PIR2       
( 7) DATE                                                                       
     the field identifiers "#sequence" and "#text" have been changed to         
       "#sequence_revision" and "#text_change", respectively                    
( 8) ACCESSIONS                                                                 
     this identifier has changed from "ACCESSION";                              
     accession numbers are separated by semicolons rather than backslashes      
( 9) REFERENCE                                                                  
     parenthetical comments following the record identifier have been           
       moved to the #contents field;                                            
     #reference-number  -  this field has been eliminated; reference            
         numbers now follow immediately after the REFERENCE record              
         identifier                                                             
     #authors  -  author names are separated by semicolons (rather than
         commas); each name consists of a surname (one or more words)
         separated from initials by a comma; periods now punctuate the
         author's initials and "Jr." and "Sr.", for example
             Maizel Jr., J.; Smith, E.L.
     #description  -  this identifier may follow the identifier                 
         #submission; the field contains a brief description of the             
         submitted work                                                         
     #cross-references  -  this new field has been added for
         cross-references to MEDLINE with the tag "MUID:"
     #contents and #note  -  these fields have been added; initially            
         the #contents field will contain the information previously            
         found in the parenthetical comments appearing after the                
         REFERENCE identifier                                                   
     #accession  -  this field allows only one accession number                 
     ##molecule_type  -  this field becomes a subfield of #accession;           
         the hyphen is now an underscore                                        
     ##residues  -  this field becomes a subfield of #accession;                
         the `tag' that was delimited by < and > is moved to the                
         ##label field                                                          
     ##cross-references  -  this field becomes a subfield of #accession;        
         it contains cross-references to CAS, DDBJ, EMBL, GB, NCBIN, and        
         NCBIP (primary sequence data from the `backbone')                      
     three new subfields have been added to #accession :                        
         ##status                                                               
         ##genetics  -  contains a pointer to a GENETICS block when
           there is more than one GENETICS block; it is the same as the
           label on the GENETICS block (see below)
         ##note                                                                 
(10) GENETICS                                                                   
     multiple genetics records may occur when multiple strains or multiple      
       genes are represented in the entry; the first field may contain an       
       optional record label when more than one GENETICS record occurs          
     #gene  -  the identifier has been changed from #Name;
         cross-references to gene mapping databases now occur in this
         field, e.g., GDB:AGT1; AGT1 is the gene name as it appears in the
         GDB database
     #cross-references  -  this field is used to link the genetics              
         information to other databases particularly GDB; information           
         in this item refers to entry code or ID                                
     #map_position  -  the hyphen has been changed to an underscore;            
         #Segment-number has been eliminated; this information is now           
         considered to be part of the map position                              
     #genome  -  this new field describes the genetic source
     #genetic_code  -  this identifier changed from #Special-code               
     #start_codon  -  there is no change to this field                          
     #introns  -  intron specifications now have the form: 31/2; 45/3           
     ##status  -  this subfield is new and gives the status of intron           
         assignment (it does not occur in Release 39.00)                        
     #note  -  this new field contains ancillary genetics information           
(11) COMPLEX                                                                    
     this record describes the type of structural complex associated with       
       forms of the molecule                                                    
(12) FUNCTION                                                                   
     this block contains information about the function of the protein          
     #description  -  free text describing the function (required)              
     #note  -  text denoting atypical activity (optional)                       
(13) CLASSIFICATION                                                             
     this identifier has changed from "SUPERFAMILY";
     "#Name" is replaced by "#superfamily"; it contains a list
       (separated by semicolons instead of backslashes) of superfamily
       names; residue specifications, explicitly depicting domains, no
       longer occur in this field
(14) KEYWORDS                                                                   
     keywords are separated by semicolons rather than backslashes               
(15) FEATURE                                                                    
     the first character of Feature descriptors is now lowercase;
       #Thioester-bonds have been incorporated into #cross-links records;
       #Duplication features have been eliminated (the information has
       been moved to #domain or #region features)
     two new fields have been added to FEATURE                                  
         #status                                                                
         #label  -  the label previously contained between `<' and `>'          
           is now in this field                                                 
Other new fields will be implemented in future releases.                        
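
For software written against the earlier format, the record identifier
renames itemized above can be collected in a lookup table. The following
is a minimal Python sketch; the names RENAMED_IDENTIFIERS and
modernize_identifier are illustrative and are not part of this
specification. Only the plain renames are covered; eliminated fields and
restructured subfields would need case-by-case handling.

```python
# Old record identifiers and their replacements, taken from the list above.
# The table and function names are hypothetical, chosen for illustration.
RENAMED_IDENTIFIERS = {
    "ALTERNATE-NAME": "ALTERNATE_NAMES",
    "INCLUDES": "CONTAINS",
    "SOURCE": "ORGANISM",
    "ACCESSION": "ACCESSIONS",
    "SUPERFAMILY": "CLASSIFICATION",
}


def modernize_identifier(identifier: str) -> str:
    """Return the current identifier for an old one (case-insensitive).

    Identifiers that were not renamed are returned unchanged, in upper case.
    """
    key = identifier.upper()
    return RENAMED_IDENTIFIERS.get(key, key)
```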
                                                                                
                                                                                
________________                                                                
2.0 Introduction                                                                
                                                                                
                                                                                
For non-VAX/VMS systems, the PIR-International Protein Sequence Database        
is being distributed in a format consistent with the Sequence Data              
Exchange Format recommended by the CODATA Task Group on the Coordination        
of Protein Sequence Data Banks. The format is a context-independent, free       
format wherein data are not restricted to specific columns, fields, or
records but are identified by a defined set of field descriptors                
(identifiers and subidentifiers). For maximum portability, the data are         
represented in the International ASCII character set restricted to the          
printable ASCII characters (ASCII characters 32 through 126, decimal            
representation). The data are represented in upper- and lower-case              
letters; however, no significance should be attached to the case of a           
letter; upper- and lower-case letters are to be treated equivalently.           
Records are fixed at 80 characters per record padded on the right by space      
characters.                                                                     
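
The character-set and record-length rules above can be checked
mechanically. The following Python sketch is one possible check, not a
required implementation; the function names are hypothetical, and the
record is assumed to have had its line terminator removed already.

```python
RECORD_LENGTH = 80  # records are fixed at 80 characters, space-padded


def is_valid_record(record: str) -> bool:
    """Check one record (without its line terminator) against the rules above."""
    if len(record) != RECORD_LENGTH:
        return False
    # Printable ASCII only: decimal codes 32 through 126.
    return all(32 <= ord(ch) <= 126 for ch in record)


def same_word(a: str, b: str) -> bool:
    """Compare two words without regard to letter case."""
    return a.upper() == b.upper()
```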
                                                                                
The data are organized into entries that contain all the information            
associated with a particular sequence, including the title, the biological      
source, references, associated text, and the sequence itself. Some entries      
contain information from several closely related sequences but only one         
sequence is explicitly represented in each entry. The database is               
contained in a single file; the first several lines of the file contain         
the database header, which identifies the database. The entries follow          
sequentially; they are separated from any other text in the file by             
beginning and ending with a record containing three backslash characters,
'\\\', in the first three columns.
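
As an illustration of the file layout just described, the following Python
sketch separates the database header from the entries. It assumes that
everything before the first delimiter record is the header and that blank
records may be skipped; the name split_entries is hypothetical.

```python
DELIMITER = "\\\\\\"  # three backslash characters in columns 1 through 3


def split_entries(lines):
    """Return (header_lines, entries); each entry is a list of records.

    Assumption: text before the first delimiter record is the header.
    Trailing pad spaces are removed; leading spaces are kept because
    they mark continuation lines.
    """
    chunks = [[]]
    for line in lines:
        if line.startswith(DELIMITER):
            chunks.append([])          # a delimiter closes the current chunk
        elif line.strip():             # skip blank records
            chunks[-1].append(line.rstrip())
    header = chunks[0]
    entries = [chunk for chunk in chunks[1:] if chunk]
    return header, entries
```

Because empty chunks are dropped, the sketch behaves the same whether
adjacent entries share one delimiter record or each carries its own.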
                                                                                
Types of data within an entry are distinguished by dividing them into           
specific data items, e.g., title, reference, feature table, etc. Space          
characters are used as general separators; a variable number of spaces are      
used to separate data, which allows the data to be represented in an            
easily readable tabular form. Although the format may appear to be in a         
fixed field representation, software should be designed to read a free          
field format; we cannot guarantee that the indentation and spacing between      
specific data will not be changed in future releases.                           
                                                                                
Each data item is labeled with an Identifier. Identifiers are single words
or several words connected by hyphens or underscores (they contain no
internal space characters). A data item may extend over more than one
line. The
identifier starts at the first column of the first line of the                  
corresponding data item; continuation lines are distinguished by                
containing at least three space characters at the beginning of the line.        
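
The continuation rule can be sketched in Python as follows. This is an
illustration only; group_data_items is a hypothetical name, and a line
with fewer than three leading spaces is treated here as starting a new
data item, which is an assumption of the sketch.

```python
def group_data_items(lines):
    """Join continuation lines onto the data item they belong to.

    Returns a list of (identifier, text) pairs in file order.
    """
    items = []
    for line in lines:
        if line.startswith("   "):      # continuation: three or more spaces
            if not items:
                raise ValueError("continuation line before any identifier")
            ident, text = items[-1]
            items[-1] = (ident, (text + " " + line.strip()).strip())
        elif line.strip():
            # The identifier is the first word; the rest is the item's data.
            ident, _, rest = line.strip().partition(" ")
            items.append((ident.upper(), rest.strip()))
    return items
```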
                                                                                
Data within data items are divided into subitems. Each subitem consists of      
a subidentifier, which identifies the subitems and separates them,              
followed by the associated data. Subidentifiers are of the same form as         
identifiers but are immediately preceded by a number sign, '#', which           
designates them as subidentifiers; the number sign appears in the database      
only to introduce subidentifiers. Most of the fields within each data item      
are optional; therefore, they are defined specifically by their identifier      
rather than by relative position (the relative position of subitems may         
change in future releases).                                                     
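
Since the number sign occurs only to introduce subidentifiers, the text of
a data item can be divided on it. The Python sketch below is one way to do
this; parse_subitems is a hypothetical name, and the result is kept as an
ordered list so that repeated subidentifiers and their relative order are
preserved.

```python
import re

# One or more number signs followed by a word that may contain
# hyphens or underscores, e.g. #formal_name or ##cross-references.
SUBIDENT = re.compile(r"(#+[A-Za-z][\w-]*)")


def parse_subitems(text):
    """Split data-item text into (leading_text, [(subidentifier, data), ...])."""
    parts = SUBIDENT.split(text)
    leading = parts[0].strip()          # data before the first subidentifier
    subitems = [
        (parts[i].lower(), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]
    return leading, subitems
```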
                                                                                
It is often the case that many items of data are distinct but represent         
equivalent types of data. For example, many proteins are known by a             
variety of names. Although each name is distinct and should be treated as       
such, they are all alternate names. This may be handled in one of two
ways. Each alternate name may be presented as a separate data item              
introduced by the identifier ALTERNATE_NAMES or several alternate names         
may be introduced by a single identifier and separated by a semicolon.          
The semicolon is used as a general data item separator that indicates           
that what follows is a separate data item but of the same type. The             
semicolon is used in the database only to indicate the concatenation of         
data items.                                                                     
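
Because the two representations are equivalent, software may wish to
reduce both to a single list. The following Python sketch shows one way;
the function names are hypothetical, and the input is assumed to be a
sequence of (identifier, text) pairs already extracted from an entry.

```python
def split_list(data):
    """Split concatenated data on semicolons, dropping surrounding spaces."""
    return [part.strip() for part in data.split(";") if part.strip()]


def collect_values(items, identifier):
    """Gather all values for one identifier from (identifier, text) pairs.

    Repeated data items and semicolon-separated lists give the same result.
    """
    values = []
    for ident, text in items:
        if ident.upper() == identifier.upper():
            values.extend(split_list(text))
    return values
```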
                                                                                
An entry contains the following data items; see Appendix E for an entry
template:
                                                                                
                  Beginning-of-entry (Entry)                                    
                  Title                                                         
                  Alternate_names                                               
                  Contains                                                      
                  Organism                                                      
                  Date                                                          
                  Accessions                                                    
                  Reference                                                     
                  Comment                                                       
                  Genetics                                                      
                  Classification                                                
                  Keywords                                                      
                  Feature                                                       
                  Summary                                                       
                  Sequence                                                      
                  End-of-entry                                                  
                                                                                
All entries contain the data items Beginning-of-entry, Title, Organism,         
Date, Reference, Summary, Sequence, and End-of-entry; the other data items      
are optional. Each Entry begins with a Beginning-of-entry data item             
sequentially followed by a Title data item and ends with a Summary data         
item sequentially followed by a Sequence data item and an End-of-entry          
data item; these data items are present only once in each entry; all            
others may be repeated several times within the same entry. The remaining       
data items generally follow in the order indicated above; however, this         
order may not be strictly adhered to. Software should be designed to            
accommodate alterations in the order of data items. New data items and          
subitems may be introduced in the future. We recommend that software be         
designed to ignore or otherwise handle unknown data items and subitems.         
This will ensure that the software will continue to run when new data           
items and subitems are introduced and will give the programmer time to          
make smooth adjustments as new data are introduced.                             
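
The recommendation above might be followed along these lines. The Python
sketch reports missing required data items and sets unknown ones aside
rather than failing. The identifier spellings COMMENT, SUMMARY, and
SEQUENCE are assumed from the data item names in the list above, and
check_entry is a hypothetical name.

```python
# Data items present in every entry (End-of-entry is the delimiter
# record, not an identifier, so it is not listed here).
REQUIRED = ("ENTRY", "TITLE", "ORGANISM", "DATE", "REFERENCE",
            "SUMMARY", "SEQUENCE")

# Optional data items from the list above; spellings are assumptions.
OPTIONAL = ("ALTERNATE_NAMES", "CONTAINS", "ACCESSIONS", "COMMENT",
            "GENETICS", "CLASSIFICATION", "KEYWORDS", "FEATURE")

KNOWN = set(REQUIRED) | set(OPTIONAL)


def check_entry(identifiers):
    """Return (missing_required, unknown) for an entry's item identifiers.

    Unknown identifiers are reported, not rejected, so that new data
    items introduced in later releases do not stop processing.
    """
    seen = [ident.upper() for ident in identifiers]
    missing = [name for name in REQUIRED if name not in seen]
    unknown = [ident for ident in seen if ident not in KNOWN]
    return missing, unknown
```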
                                                                                
                                                                                
                     The Beginning-of-Entry Data Item                           
                                                                                
                                                                                
The beginning-of-entry data item marks the beginning of each entry; it is       
denoted by the identifier 'ENTRY'. The entry identification code                
immediately follows the identifier; it is a four- to six-character word
that is used to identify the entry in the database; it contains only            
alphanumeric characters with no internal spaces. One subidentifier,             
'#type', which denotes the type of molecule, is defined for the                 
beginning-of-entry data item. The '#type' subidentifier is followed by          
either 'complete', 'fragment' or 'fragments', which indicates whether the       
represented sequence is complete or fragmentary.                                
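
A record of this form can be recognized with a regular expression. The
Python sketch below is illustrative; parse_entry_record is a hypothetical
name, and the sketch assumes that the '#type' subitem may be absent.

```python
import re

ENTRY_RE = re.compile(
    r"^ENTRY\s+([A-Za-z0-9]{4,6})"              # four- to six-character code
    r"(?:\s+#type\s+(complete|fragments?))?"    # assumed optional in this sketch
    r"\s*$",
    re.IGNORECASE,
)


def parse_entry_record(text):
    """Return (entry_code, type_or_None) from a beginning-of-entry record."""
    match = ENTRY_RE.match(text)
    if match is None:
        raise ValueError("not a valid ENTRY record: %r" % text)
    code, kind = match.groups()
    return code, (kind.lower() if kind else None)
```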
                                                                                
                                                                                
                           The Title Data Item                                  
                                                                                
                                                                                
The Title data item contains the entry title; it is denoted by the              
identifier 'TITLE'. The entry title gives a brief description of the            
molecule and its biological source. The name of the molecule appears on         
the left and the name of the biological source appears on the right; they       
are separated by ' - '. Enzyme Commission (EC) numbers, as defined by
the Nomenclature Committee of the International Union of Biochemistry, are
given in parentheses and introduced by 'EC '. An EC number represents a
hierarchical classification scheme and consists of four positive integers
separated by periods. With the exception of the first, the integers may be
replaced by a dash to indicate enzymes that are not fully classified.
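
The title layout and the EC number pattern might be handled as in the
following Python sketch. It assumes that the last ' - ' in the title is the
separator between molecule and source and that each EC number appears in
its own parentheses; parse_title is a hypothetical name.

```python
import re

# 'EC ' followed by four period-separated parts; every part except the
# first may be a dash instead of an integer.
EC_RE = re.compile(r"\(EC (\d+(?:\.(?:\d+|-)){3})\)")


def parse_title(text):
    """Return (molecule_name, source, ec_numbers) from TITLE text."""
    name, sep, source = text.rpartition(" - ")
    if not sep:                         # no ' - ' separator present
        name, source = text, ""
    return name.strip(), source.strip(), EC_RE.findall(text)
```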
                                                                                
                                                                                
                      The Alternate_Names Data Item                             
                                                                                
                                                                                
The optional Alternate_names data item contains an alternate title for          
molecules for which a standard nomenclature has not been established and        
that are designated by more than one name. It is denoted by the identifier      
'ALTERNATE_NAMES'. Multiple alternate names may be represented in separate      
alternate name data items or within the same data item separated by             
semicolon characters.                                                           
                                                                                
                                                                                
                          The Contains Data Item                                
                                                                                
                                                                                
The optional Contains data item also contains a title but is distinct from      
the alternate name and title data items in that the Contains data item          
refers to activities present within the molecule rather than those              
associated with the entire molecule. It is denoted by the identifier            
'CONTAINS'. Multiple contains data items may be represented in separate         
data items or within the same data item separated by a single semicolon         
character.                                                                      
                                                                                
                                                                                
                          The Organism Data Item                                
                                                                                
                                                                                
The Organism data item specifies the biological source of the sequence; it      
is denoted by the identifier 'ORGANISM'. The scientific name of the
organism immediately follows the identifier and is denoted by the               
descriptor '#formal_name'. The common name is denoted by the subidentifier      
'#common_name'. Synonymous scientific names and common names are separated      
by commas but follow contiguously in the appropriate subitem. In most           
cases both the scientific and common names of the organism are given; this      
is not always possible, however.                                                
                                                                                
                                                                                
                            The Date Data Item                                  
                                                                                
                                                                                
The Date data item contains the revision dates of the entry. It is denoted      
by the identifier 'DATE'. The date on which the entry was added to the          
database immediately follows the identifier. Two subidentifiers are             
defined: '#sequence_revision', denoting the date the sequence was last
revised, and '#text_change', denoting the date that the text of the entry
was last revised. Not all entries contain all three dates. If the               
'#sequence_revision' or '#text_change' dates are absent, the entry has not      
been revised since its addition to the database; if the added date is           
absent, the entry was added prior to 1979. The dates are specified as a         
two-digit integer representing the day followed by a dash, the first three      
letters of the month, a second dash, and the four-digit integer                 
representation of the year.                                                     
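
The date layout described above can be read with a few lines of Python.
This is a minimal sketch, not part of the specification: the key name
"added", the function names, and the sample dates are assumptions, and
month abbreviations are matched without regard to case.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def parse_pir_date(text: str) -> date:
    """Parse one date written as day-month-year, e.g. '31-Dec-1993'."""
    day, month, year = text.strip().split("-")
    return date(int(year), MONTHS[month.capitalize()], int(day))

def parse_date_item(text: str) -> dict[str, date]:
    """Return the dates that follow the 'DATE' identifier, keyed by role."""
    dates: dict[str, date] = {}
    # re.split with a capturing group yields: head, key, value, key, value, ...
    head, *rest = re.split(r"#(\w+)", text)
    if head.strip():
        dates["added"] = parse_pir_date(head)  # "added" is a name chosen here
    for key, value in zip(rest[0::2], rest[1::2]):
        dates[key] = parse_pir_date(value)
    return dates

# Hypothetical item content, for illustration only.
print(parse_date_item("24-Apr-1984 #sequence_revision 30-Sep-1991"))
```

An entry added before 1979 would simply lack the "added" key in the
returned dictionary.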
                                                                                
                                                                                
                         The Accessions Data Item                               
                                                                                
                                                                                
The Accessions data item is denoted by the identifier 'ACCESSIONS'.             
Accession numbers contain only alphanumeric characters. Multiple accession      
numbers may be represented in separate accession data items or within the       
same data item separated by a single semicolon character.                       
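
Because accession numbers may be spread over several data items or packed
into one, a reader typically merges them into a single list. The sketch
below shows one way to do so; the function name and the sample accession
numbers are assumptions for illustration.

```python
def collect_accessions(items: list[str]) -> list[str]:
    """Merge the contents of one or more 'ACCESSIONS' data items."""
    accessions: list[str] = []
    for item in items:
        for number in item.split(";"):
            number = number.strip()
            if not number:
                continue
            # Accession numbers contain only alphanumeric characters.
            if not number.isalnum():
                raise ValueError(f"not an accession number: {number!r}")
            accessions.append(number)
    return accessions

# Hypothetical accession numbers, for illustration only.
print(collect_accessions(["A00001; B12345", "C67890"]))
```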
                                                                                
These numbers are associated with the sequence as reported in a                 
publication or as submitted to the databases; they constitute unique            
labels that provide an identity to the reported sequence. As these labels       
will remain permanently associated with this information, they may              
continue to serve as tags to locate the information in any future releases      
of the database.                                                                
                                                                                
                                                                                
                         The Reference Data Item                                
                                                                                
                                                                                
The Reference data item contains the literature citation; it is denoted by      
the identifier 'REFERENCE'. A unique six-character alphanumeric 'Reference      
number' may appear immediately after this identifier. The Reference data        
item begins the reference block. The reference block contains the authors
subitem; the journal, book, submission, or citation subitem; an optional
reference title subitem; an optional cross-references subitem; an optional
contents subitem; an optional note subitem; and an optional accession
number subitem, which begins the accession block. The accession block
contains an optional status subsubitem, the molecule_type subsubitem, the
residues subsubitem, an optional cross-references subsubitem, an optional
genetics subsubitem, and an optional note subsubitem. The accession block
and its associated subsubitems may be repeated within the reference block,
and the reference block may be repeated within the entry.
                                                                                
The authors subitem is denoted by the subidentifier '#authors'. The author      
names follow the subidentifier and are separated by semicolons. Each            
author name consists of a surname and initials separated by a comma. The        
surname may be multiple words and may contain extensions such as "Jr.",         
"Sr.", "II", "III" or "IV". The initials are separated by periods.              
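
The author list can be taken apart as sketched below. The function name
and the sample names are assumptions; the sketch splits each name at its
last comma so that a surname carrying an extension stays intact.

```python
def parse_authors(text: str) -> list[tuple[str, str]]:
    """Split an '#authors' value into (surname, initials) pairs."""
    authors: list[tuple[str, str]] = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if "," in entry:
            # Split at the last comma: everything before it is the surname.
            surname, _, initials = entry.rpartition(",")
        else:
            surname, initials = entry, ""
        authors.append((surname.strip(), initials.strip()))
    return authors

# Hypothetical author list, for illustration only.
print(parse_authors("Smith Jr., J.A.; van der Berg, K."))
```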
                                                                                
The journal subitem is denoted by the subidentifier '#journal'. The             
subidentifier is sequentially followed by the journal abbreviation as           
depicted in the PIR-International journals abbreviations file named             
JOURNALS.LIS, a four-digit integer representation of the year enclosed in       
parentheses, an optional volume number, an optional issue number enclosed       
in square brackets, a colon, and the page numbers. The volume number and
issue number contain alphanumeric characters; the page numbers are given as
one alphanumeric string or as two such strings separated by a dash.
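
The order of the journal fields lends itself to a regular expression. The
pattern below is a sketch under stated assumptions: the exact spacing
between fields is not fixed by the text above, so optional white space is
accepted, and the group names and function name are chosen for this
example only.

```python
import re

JOURNAL_RE = re.compile(
    r"^(?P<journal>.+?)\s*"                   # journal abbreviation
    r"\((?P<year>\d{4})\)\s*"                 # four-digit year in parentheses
    r"(?P<volume>[A-Za-z0-9]+)?\s*"           # optional volume number
    r"(?:\[(?P<issue>[A-Za-z0-9]+)\])?\s*"    # optional issue in brackets
    r":\s*(?P<first_page>[A-Za-z0-9]+)"       # colon, then first page
    r"(?:-(?P<last_page>[A-Za-z0-9]+))?\s*$"  # optional dash and last page
)

def parse_journal(text: str) -> dict:
    """Split a '#journal' value into its fields; absent fields are None."""
    match = JOURNAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized journal citation: {text!r}")
    return match.groupdict()

print(parse_journal("J. Biol. Chem. (1968) 243:3557-3559"))
```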
                                                                                
The book subitem is denoted by the subidentifier '#book', the submission
subitem by '#submission', and the citation subitem by '#citation'. The book
subitem contains citations to publications other than scientific journals;
the submission subitem identifies the database and the date of an author
submission; and the citation subitem contains other citations that cannot
easily be represented in the format required for the journal subitem.
                                                                                
The reference title subitem is denoted by the subidentifier '#title' or         
'#description'. It contains the title of the manuscript or a brief              
description of submitted work for submissions.                                  
                                                                                
The contents subitem denoted by '#contents' may list the residue numbers        
reported in the paper, indicate what information is presented in the paper      
(complete sequence, X-ray crystallography, features, etc.), or give the         
biological source when the entry represents sequences from more than one        
organism. The note subitem denoted by '#note' describes any ancillary           
information associated with the citation.                                       
                                                                                
The accession number subitem is denoted by the subidentifier '#accession'.      
It contains a label that uniquely identifies the sequence that was              
reported in the reference. Single accession numbers are required as unique      
pointers to a sequence report. These labels have the same form as the           
entry identification code and are unique over all sections of the
database, with very few exceptions in which a portion of a sequence
overlaps in two entries.
                                                                                
The status subsubitem is denoted by '##status' and indicates a level of         
review of the sequence. A value for this field is "preliminary".                
                                                                                
The molecule_type subsubitem is denoted by the subidentifier                    
'##molecule_type'. It contains a symbol that identifies the type of             
molecule from which the sequence was experimentally determined. The valid       
symbols for this field are given in Appendix D.                                 
                                                                                
The residues subsubitem is denoted by the subidentifier '##residues'. It        
contains a residues specification and a special tag that provides a             
logical address to the sequence as reported in the publication. The tag         
subsubitem is denoted by the subidentifier '##label'. The residues              
specification is a special syntax that allows the sequence originally           
reported to be regenerated from that shown in the entry.                        
                                                                                
The residues specification consists of a set of segment specifications
separated by semicolons or commas. A segment specification may be either
a literal sequence element (a sequence of amino acids represented in the
one-letter amino acid abbreviations and enclosed in single quotes) or a
range of sequence positions. The range is either a single integer or a
pair of integers separated by a dash ('-') that denotes the inclusive
extent of the specified segment with respect to the sequence shown in the
entry (the integers correspond to a sequential and contiguous sequence
numbering system in which the first residue is assigned the number 1). The
sequence is reconstructed by extracting the indicated sequence elements
and appending these together with the literally specified sequence
elements. Segments separated by commas should be 'spliced' together; those
separated by semicolons represent boundaries between adjacent fragments,
that is, physical breaks in the sequence. This syntax, although simple and
concise, is powerful enough to express all known sequence transforms
(e.g., insertions, deletions, substitutions, duplications, and
transpositions) capable of converting one protein sequence into another
without introducing nonstandard amino acids.
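
The reconstruction described above can be sketched in Python as follows.
The function name, the sample sequence, and the sample specification are
assumptions made for illustration; the sketch returns one string per
fragment, that is, per semicolon-delimited group.

```python
def reconstruct(entry_sequence: str, spec: str) -> list[str]:
    """Rebuild the reported sequence fragments from a residues specification."""
    fragments: list[str] = []
    for fragment_spec in spec.split(";"):       # semicolon: physical break
        pieces: list[str] = []
        for segment in fragment_spec.split(","):  # comma: splice together
            segment = segment.strip()
            if not segment:
                continue
            if segment.startswith("'") and segment.endswith("'"):
                pieces.append(segment[1:-1])    # literal sequence element
            else:
                start, _, end = segment.partition("-")
                first = int(start)
                last = int(end) if end else first
                # Positions are 1-based and the range is inclusive.
                pieces.append(entry_sequence[first - 1:last])
        fragments.append("".join(pieces))
    return fragments

# Hypothetical entry sequence and specification, for illustration only.
print(reconstruct("MKTAYIAKQRQISFVK", "1-3,'GG',6;10-12"))
```

In this sample the first fragment splices residues 1-3, the literal 'GG',
and residue 6; the second fragment is residues 10-12 of the entry.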
                                                                                
Following the residues specification is an entry-identification-code            
extension denoted by the subidentifier '##label'. This short 'tag' when         
combined with the entry identification code provides a unique logical           
address or path to the indicated sequence, e.g., CCYC6->AIT.
                                                                                
The cross-references subsubitem is denoted by '##cross-references'. It          
contains information about cross references to other databases such as          
Medline abstracts. The tag "MUID:" indicates a Medline cross reference.         
                                                                                
The genetics subsubitem is denoted by '##genetics'. It is given when the
entry contains more than one genetics block and the blocks need to be
distinguished. The tag is unique and appears in the 'GENETICS' data item.
                                                                                
The note subsubitem is denoted by '##note' and represents ancillary             
information about the sequence report.                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                          The Comment Data Item                                 
                                                                                
                                                                                
The optional Comment data item contains free text; it is denoted by the         
identifier 'COMMENT'.                                                           
                                                                                
                                                                                
                          The Genetics Data Item                                
                                                                                
                                                                                
The optional Genetics data item contains genetic information associated         
with the entry; it is denoted by the identifier 'GENETICS'. This data item
begins the genetics block; the following seven subidentifiers have been         
defined for this data item (all are optional but at least one is present):      
'#gene', denoting the gene name; '#map_position', denoting the genetic map      
position; '#genome', denoting the genome type; '#genetic_code', denoting        
the genetic code used for protein translation and back-translation if           
other than the universal genetic code; '#start_codon', denoting the             
start-of-translation codon if other than AUG; '#introns', denoting the          
intron positions; and '#note', denoting any comments particular to the
specified gene.                                                                 
                                                                                
A complete listing of the special genetic codes is included in the file         
SGC.LIS. For human sequences the gene name and map position are given           
according to the recommendations of the Human Gene Mapping Library (HGML)       
at Yale University. The '#introns' subidentifier is followed by a set of        
intron specifications separated by semicolons; intron specifications            
consist of two integers separated by a slash. The first integer specifies       
the position of the intron in the protein sequence; the second specifies        
the position in the codon immediately preceding the intron and may have         
the value 1, 2, or 3.                                                           
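
A reader for the '#introns' value might look like the following sketch.
The function name, the tuple layout, and the sample value are assumptions
for illustration.

```python
def parse_introns(text: str) -> list[tuple[int, int]]:
    """Return (residue_position, codon_position) pairs from '#introns'."""
    introns: list[tuple[int, int]] = []
    for spec in text.split(";"):
        spec = spec.strip()
        if not spec:
            continue
        residue_text, codon_text = spec.split("/")
        residue_position = int(residue_text)
        codon_position = int(codon_text)
        # The second integer may only have the value 1, 2, or 3.
        if codon_position not in (1, 2, 3):
            raise ValueError(f"invalid codon position in {spec!r}")
        introns.append((residue_position, codon_position))
    return introns

# Hypothetical intron list, for illustration only.
print(parse_introns("30/3; 104/1"))
```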
                                                                                
                                                                                
                       The Classification Data Item                             
                                                                                
                                                                                
The optional Classification data item contains the name of the
superfamily or superfamilies to which the sequence has been assigned; it is
denoted by the identifier 'CLASSIFICATION'. It has been recognized that
many proteins are composed of functional domains of separate evolutionary
origin. There are a number of these multidomain proteins in the database.
If multiple superfamilies are assigned, each is specified as a "homology"
except for the primary superfamily, if there is one. Some domains may be
specified in the FEATURE table described below. Each superfamily name is
listed under the subidentifier '#superfamily'; multiple names are separated
by a single semicolon.
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                          The Keywords Data Item                                
                                                                                
                                                                                
The optional Keywords data item contains a keyword or short key phrase; it      
is denoted by the identifier 'KEYWORDS'. Multiple keywords may be               
represented in separate keywords data items or within the same data item        
separated by a single semicolon.                                                
                                                                                
                                                                                
                          The Feature Data Item                                 
                                                                                
                                                                                
The optional Feature data item marks sites or regions of the sequence that      
are of biological interest; it is denoted by the identifier 'FEATURE'. The      
identifier is sequentially followed by a residue specification, a feature       
descriptor, and a feature title. The residue specification consists of a        
set of fragment specifications separated by commas. A fragment                  
specification consists of an integer or two integers separated by a dash.       
The feature descriptor is a subidentifier that indicates the type of            
feature. The currently used feature descriptors are given in Appendix A.        
When the '#protein', '#peptide', or '#domain' descriptors are used,
three- or four-character code extensions follow the subidentifier '#label'.
These code extensions when added to the entry identification code generate      
a unique code with which to identify the corresponding subsequence in the       
database.                                                                       
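
The residue specification of a feature is simpler than the residues
specification of the reference block, since it holds only positions. A
minimal parsing sketch follows; the function name and the sample value are
assumptions, and how each fragment is interpreted depends on the feature
descriptor (see Appendix A).

```python
def parse_fragments(spec: str) -> list[tuple[int, int]]:
    """Return (first, last) position pairs from a feature residue specification."""
    fragments: list[tuple[int, int]] = []
    for fragment in spec.split(","):
        fragment = fragment.strip()
        if not fragment:
            continue
        start, _, end = fragment.partition("-")
        first = int(start)
        # A single integer is stored as a pair with equal ends.
        last = int(end) if end else first
        fragments.append((first, last))
    return fragments

# Hypothetical specification, for illustration only.
print(parse_fragments("14,17-18"))
```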
                                                                                
                                                                                
                          The Summary Data Item                                 
                                                                                
                                                                                
The Summary data item contains the molecular weight, the sequence length,       
and a sequence checksum; it is denoted by the identifier 'SUMMARY'. The         
molecular weight, sequence length, and sequence checksum are denoted by
the subidentifiers '#molecular-weight', '#length', and '#checksum',
respectively; all are represented as integers. Note that the molecular
weight is not included for fragmentary sequences (#type fragment(s)).
                                                                                
The checksum is computed using the method introduced by Devereux et al.         
(Devereux, J., Marquess, P., Martin, W., Smith, M., Winsborough, W., 1986,      
Sequence Analysis Software Package of the University of Wisconsin Genetics      
Computer Group). It is a simple method that yields checksums with values        
between 0 and 9999. Although the checksums are not unique, they have been       
found to be sensitive enough to changes in the sequence to indicate that        
the sequence has been altered or incorrectly transmitted. The sum is            
compiled by sequentially summing over the entire sequence the product of a      
modified position number times the ASCII decimal representation of the          
character corresponding to the amino acid at that position. The position        
numbers are modified by setting the value to 1 every time it exceeds 57         
and thus restricting the values to 1 through 57. The remainder of this sum      
after division by 10,000 is then taken as the checksum. The following is a      
sample from a section of a FORTRAN 77 program that computes the checksum.       
                                                                                
The function ICHAR returns the decimal equivalent of the sequence               
character stored in the array SEQ. Note that when residues are given as         
lower-case letters they are represented as the ASCII equivalent of the          
corresponding upper-case letter.                                                
                                                                                
C  ---Compute checksum---
      ICHECK=0
      ICOUNT=0
      DO 20 L=1,LENSEQ
        ICOUNT=ICOUNT+1
C       Convert the character to its decimal (ASCII) form
        LCHAR=ICHAR(SEQ(L))
C       Optionally change lower case (97-122) to upper case
        IF (LCHAR.GE.97 .AND. LCHAR.LE.122) LCHAR=LCHAR-32
        ICHECK = ICHECK + ICOUNT*LCHAR
C       Restart the position counter after 57
        IF (ICOUNT.EQ.57) ICOUNT=0
20      CONTINUE
C     Integer division, then compute the remainder
      LSUM = ICHECK/10000
      ICHECK = ICHECK - LSUM*10000
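
For readers who do not use FORTRAN, the same computation can be sketched
in Python. This is an illustrative translation, not part of the
specification; the function name is an assumption, and the input is
assumed to contain only the sequence characters.

```python
def pir_checksum(sequence: str) -> int:
    """Checksum in the range 0..9999, following the steps described above."""
    total = 0
    for index, residue in enumerate(sequence):
        # Modified position number: runs 1..57, then starts again at 1.
        weight = index % 57 + 1
        # Lower-case letters count as the corresponding upper-case letter.
        total += weight * ord(residue.upper())
    return total % 10000

# Short made-up sequence, for illustration only.
print(pir_checksum("ACDE"))
```

Python integers do not overflow, so the running sum needs no special
handling for long sequences.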
                                                                                
                                                                                
                          The Sequence Data Item                                
                                                                                
                                                                                
The Sequence data item contains the sequence in the one-letter amino acid       
abbreviations (Appendix B); it is denoted by the identifier 'SEQUENCE'.         
The sequences are numbered and displayed in an easily readable form; they       
also contain punctuation (see Appendix C). Inasmuch as the display of the       
sequence information may change in future releases, we strongly recommend       
that software developers do not rely on the explicit representation of the      
sequence data but write software that interprets only the one-letter amino      
acid abbreviations as sequence data (this may be accomplished by ignoring       
all nonletter characters following the 'SEQUENCE' identifier). The end of
the sequence is defined by the end-of-entry marker (a line containing           
three slashes as the first three characters).                                   
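
The recommendation above can be followed with a reader such as the one
sketched here. The function name and the sample lines are assumptions; in
particular, the numbered layout of the sample is only an illustration of a
display that the reader is meant to ignore.

```python
def read_sequence(lines: list[str]) -> str:
    """Collect the one-letter codes between 'SEQUENCE' and the '///' marker."""
    residues: list[str] = []
    in_sequence = False
    for line in lines:
        if line.startswith("///"):              # end-of-entry marker
            break
        if not in_sequence:
            if not line.startswith("SEQUENCE"):
                continue
            in_sequence = True
            line = line[len("SEQUENCE"):]       # keep text after the identifier
        # Keep letters only; numbering, spaces, and punctuation are skipped.
        residues.extend(ch for ch in line if ch.isascii() and ch.isalpha())
    return "".join(residues)

# Hypothetical entry fragment, for illustration only.
sample = [
    "SEQUENCE",
    "                5        10",
    "      1 M G D V E K G K K I",
    "     11 F I M",
    "///",
]
print(read_sequence(sample))
```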
                                                                                
                                                                                
                        The End-of-Entry Data Item                              
                                                                                
                                                                                
The end-of-entry data item marks the end of the entry. It is denoted by         
the identifier consisting of three consecutive slashes.                         
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                Appendix A                                      
                          Feature Descriptors of                                
             The PIR International Protein Sequence Database                    
                                                                                
                                                                                
                Descriptors Denoting Single Residue Sites                       
                -----------------------------------------                       
                                                                                
                          #active_site                                          
                          #binding_site                                         
                          #cleavage_site                                        
                          #inhibitory_site                                      
                          #modified_site                                        
                                                                                
These sites may be represented using the full feature residue                   
specification conventions; however, the specification should be                 
interpreted to specify a collection of single sites.                            
                                                                                
        Descriptors Denoting Residues Connected by Covalent Bonds               
        ---------------------------------------------------------               
                                                                                
                          #disulfide_bonds                                      
                          #cross_links                                          
                                                                                
The pairs of residues linked by hyphens should be interpreted as being
connected by bonds. Single residues may be specified if the bond is
between an amino acid in the sequence and another protein chain or
molecule. The #cross_links descriptor specifically denotes bonds linking
the sequence shown to an adjacent protein chain.
                                                                                
                  Descriptors Denoting Sequence Regions                         
                  -------------------------------------                         
                                                                                
                          #domain                                               
                          #peptide                                              
                          #protein                                              
                          #region                                               
                                                                                
Pairs of residues linked by hyphens denote a region extending from the          
first to the second position inclusive. The #domain descriptor indicates        
distinct functional regions that may be of separate evolutionary origin.        
The #peptide and #protein descriptors indicate regions corresponding to         
the mature protein or peptides derived from the sequence shown. The             
#region descriptor is used to indicate all other regions.                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                Appendix B                                      
              One- and Three-letter Amino Acid Abbreviations                    
                                                                                
                    A   Ala   Alanine                                           
                    C   Cys   Cysteine                                          
                    D   Asp   Aspartic acid                                     
                    E   Glu   Glutamic acid                                     
                    F   Phe   Phenylalanine                                     
                    G   Gly   Glycine                                           
                    H   His   Histidine                                         
                    I   Ile   Isoleucine                                        
                    K   Lys   Lysine                                            
                    L   Leu   Leucine                                           
                    M   Met   Methionine                                        
                    N   Asn   Asparagine                                        
                    P   Pro   Proline                                           
                    Q   Gln   Glutamine                                         
                    R   Arg   Arginine                                          
                    S   Ser   Serine                                            
                    T   Thr   Threonine                                         
                    V   Val   Valine                                            
                    W   Trp   Tryptophan                                        
                    Y   Tyr   Tyrosine                                          
                    B   Asx   Asp or Asn,                                       
                              not distinguished                                 
                    Z   Glx   Glu or Gln,                                       
                              not distinguished                                 
                    X    X    Undetermined or atypical                          
                              amino acid                                        
                                                                                
        These abbreviations conform to those suggested by the                   
        IUPAC-IUB Commission on Biochemical Nomenclature, J. Biol.              
        Chem. 243, 3557-3559, 1968.                                             
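For illustration only, the table above can be held as a lookup table. The following Python sketch is not part of this specification; the names AMINO_ACID_CODES and to_three_letter are hypothetical, and the sketch assumes that sequences are written in uppercase one-letter codes.

```python
# Hypothetical lookup table built from the one-letter and three-letter
# codes listed in the table above. Not part of the specification.
AMINO_ACID_CODES = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    "B": "Asx",  # Asp or Asn, not distinguished
    "Z": "Glx",  # Glu or Gln, not distinguished
    "X": "X",    # undetermined or atypical amino acid
}


def to_three_letter(residues):
    """Translate a string of one-letter codes into three-letter codes.

    Assumes uppercase input with punctuation already removed.
    Raises ValueError for any character that is not in the table.
    """
    translated = []
    for position, code in enumerate(residues, start=1):
        if code not in AMINO_ACID_CODES:
            raise ValueError(
                f"unknown residue code {code!r} at position {position}"
            )
        translated.append(AMINO_ACID_CODES[code])
    return translated
```

For example, to_three_letter("MKB") yields the list Met, Lys, Asx.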
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                               Page 16          
                                                                                
                                                                                
                                Appendix C                                      
                     Punctuation in Protein Sequences                           
                                                                                
    Two adjacent amino acids with no punctuation between them are
    connected, as determined experimentally.
()  Encloses a region, the composition but not the complete sequence            
    of which has been determined experimentally, or encloses a single           
    residue that has been tentatively identified.                               
 =  Indicates )(, the juxtaposition of two regions of indeterminate             
    sequence, while preserving proper spacing between amino acids.              
 /  Indicates that the adjacent amino acids are from different                  
    peptides, not necessarily connected. When the amino end of a                
    protein has not been determined, / precedes the first residue.              
    When the carboxyl end has not been determined, / follows the last           
    residue. When )/, /(, or )/( are needed, only / is used.                    
 .  Outside of parentheses, indicates the ends of sequenced fragments.          
    The relative order of these fragments was not determined                    
    experimentally but is clear from homology or other indirect                 
    evidence.                                                                   
 .  Within parentheses, indicates that the amino acid to its left has           
    been placed with at least 90% confidence by homology with known             
    sequences.                                                                  
 ,  Indicates that the amino acid to its left could not be positioned           
    with confidence by homology.                                                
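A minimal Python sketch of how a program might separate residues from the punctuation described above. It is an illustration, not part of this specification: the function names are hypothetical, the sample sequence is invented, and the sketch assumes that every letter is a residue code and that whitespace is layout only. It does not try to decide whether a period lies inside or outside parentheses, because / may stand for )/, /(, or )/( and so hides parentheses.

```python
# Hypothetical summary of the punctuation marks listed in this appendix.
PUNCTUATION_MARKS = {
    "(": "opens a region of known composition, or a tentative residue",
    ")": "closes such a region",
    "=": "stands for )(, two adjoining regions of indeterminate sequence",
    "/": "adjacent residues are from different peptides, or an end "
         "of the protein has not been determined",
    ".": "fragment end (outside parentheses) or residue placed by "
         "homology (within parentheses)",
    ",": "residue to the left could not be positioned with confidence",
}


def strip_punctuation(sequence):
    """Return only the residue letters, dropping every other character."""
    return "".join(ch for ch in sequence if ch.isalpha())


def punctuation_positions(sequence):
    """List each mark with the number of residues that precede it.

    Returns (residues_seen, mark) tuples in order of appearance.
    Raises ValueError for a character that is neither a letter,
    whitespace, nor one of the marks above.
    """
    marks = []
    residues_seen = 0
    for ch in sequence:
        if ch.isspace():
            continue  # assumed to be layout only
        if ch.isalpha():
            residues_seen += 1
        elif ch in PUNCTUATION_MARKS:
            marks.append((residues_seen, ch))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return marks


# Invented example: amino end undetermined, one region of known
# composition, and a fragment boundary.
example = "/AC(D,E)F.G"
print(strip_punctuation(example))      # ACDEFG
print(punctuation_positions(example))
# [(0, '/'), (2, '('), (3, ','), (4, ')'), (5, '.')]
```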
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                               Page 17          
                                                                                
                                                                                
                                Appendix D                                      
                Molecule_type and Cross_references Symbols                      
                                                                                
                                                                                
                          Molecule_type Symbols                                 
                          ---------------------                                 
                                                                                
The following constitute the set of valid symbols used in the
##molecule_type subitem of the reference. They indicate the type of
molecule whose experimental sequence determination was reported.
                                                                                
         DNA                                                                    
         RNA                                                                    
         genomic RNA                                                            
         mRNA                                                                   
         nucleic acid                                                           
         protein                                                                
                                                                                
                                                                                
                         Cross_references Symbols                               
                         ------------------------                               
                                                                                
The following symbols are used in the ##cross-references subitem of the
reference to specify the database.
                                                                                
        GB   - GenBank(R) Genetic Sequence Data Bank                            
        EMBL - EMBL Data Library, Nucleotide Sequence Database                  
        DDBJ - DNA Data Bank of Japan                                           
        PDB  - Brookhaven National Laboratory's Protein Data Bank               
        CAS  - Chemical Abstracts Service registry numbers
        NCBIP - NCBI primary sequence database from protein (backbone)          
        NCBIN - NCBI primary sequence database from nucleic acid                
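
The two symbol lists in this appendix lend themselves to simple validation tables. The Python sketch below is illustrative only: the names MOLECULE_TYPES, CROSS_REFERENCE_DATABASES, is_valid_molecule_type and database_name are hypothetical, and exact, case-sensitive matching is an assumption that this appendix does not state.

```python
# Hypothetical validation tables copied from the two lists in this appendix.
MOLECULE_TYPES = frozenset({
    "DNA",
    "RNA",
    "genomic RNA",
    "mRNA",
    "nucleic acid",
    "protein",
})

CROSS_REFERENCE_DATABASES = {
    "GB": "GenBank Genetic Sequence Data Bank",
    "EMBL": "EMBL Data Library, Nucleotide Sequence Database",
    "DDBJ": "DNA Data Bank of Japan",
    "PDB": "Brookhaven National Laboratory's Protein Data Bank",
    "CAS": "Chemical Abstracts Service registry numbers",
    "NCBIP": "NCBI primary sequence database from protein (backbone)",
    "NCBIN": "NCBI primary sequence database from nucleic acid",
}


def is_valid_molecule_type(symbol):
    """True if the symbol is one of the listed molecule types.

    Surrounding whitespace is ignored; matching is case-sensitive,
    which is an assumption of this sketch.
    """
    return symbol.strip() in MOLECULE_TYPES


def database_name(symbol):
    """Return the database description for a symbol, or None if unknown."""
    return CROSS_REFERENCE_DATABASES.get(symbol.strip())
```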
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                               Page 18          
                                                                                
                                                                                
                                Appendix E                                      
                     CODATA Formatted Entry Template                            
                                                                                
                                                                                
ENTRY                                    Beginning-of-Entry                     
TITLE                                                                           
ALTERNATE_NAMES                                                                 
CONTAINS                                                                        
ORGANISM        #formal_name ... #common_name ...                               
DATE            #sequence_revision ... #text_change ...                         
ACCESSIONS                                                                      
REFERENCE       <Ref_num>                REFERENCE BLOCK (repeated)             
   #authors                                                                     
   #journal | #book | #submission | #citation                                   
   #title | #description                                                        
   #cross-references                                                            
   #contents                                                                    
   #note                                                                        
   #accession                            ACCESSION BLOCK (repeated)             
      ##status                                                                  
      ##molecule_type                                                           
      ##residues                                                                
      ##label                                                                   
      ##cross-references                                                        
      ##genetics                                                                
      ##note                                                                    
COMMENT                                  COMMENTS (repeated)                    
GENETICS                                 GENETICS BLOCK (repeated)              
   #gene                                                                        
   #map_position                                                                
   #genome                                                                      
   #genetic_code                                                                
   #start_codon                                                                 
   #introns                                                                     
   #note                                                                        
CLASSIFICATION  #superfamily                                                    
KEYWORDS                                                                        
FEATURE                                                                         
SUMMARY         #length ... #molecular-weight ... #checksum
SEQUENCE                                                                        
///
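
A minimal Python sketch of how an entry laid out like the template above might be split into its records. It is not part of this specification. It assumes, from the template's layout only, that a record identifier is an uppercase word starting in column 1, that indented lines continue the preceding record, and that /// ends the entry. The name split_records and the sample entry text are hypothetical.

```python
import re

# A record identifier: uppercase letters and underscores starting in
# column 1, optionally followed by whitespace and the record's content.
RECORD_START = re.compile(r"^([A-Z][A-Z_]*)(?:\s+(.*))?$")


def split_records(entry_text):
    """Split one entry into (identifier, content) pairs.

    Indented lines are appended to the preceding record, stripped of
    their indentation. Reading stops at the /// terminator.
    """
    records = []
    for raw_line in entry_text.splitlines():
        line = raw_line.rstrip()
        if line.strip() == "///":
            break  # end of entry
        if not line:
            continue  # blank lines are skipped
        match = RECORD_START.match(line)
        if match:
            records.append((match.group(1), [match.group(2) or ""]))
        elif records:
            records[-1][1].append(line.strip())
        else:
            raise ValueError("continuation line before any record")
    return [
        (name, "\n".join(part for part in parts if part))
        for name, parts in records
    ]


# Invented sample entry, for illustration only.
sample = """ENTRY           XXXX
TITLE           example protein
REFERENCE       A00001
   #authors     Doe, J.
   #accession   A00001
      ##molecule_type protein
SUMMARY         #length 3
SEQUENCE
 M K L
///
"""

for name, content in split_records(sample):
    print(name, "->", repr(content))
```

Keeping subitems such as #authors attached to their parent record leaves later steps free to split on # and ## as needed.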