PIR-International
Technical Development Bulletin
Number 7
19 September 1997

Contents
1. NBRF Flat File Format for PIR-International Release 54.00
2. Format Changes Anticipated for Release 55.00
3. Note on Flat File Format for NRL_3D Release 22.00
4. Available Instructional Material


1. NBRF Flat File Format for PIR-International Release 54.00

There are no major format changes in Release 54.00.  However, there have been a
number of changes in the cross-references to other databases.  These changes
are documented following the Flat File Format Specification.
Release 54.00 is scheduled for 30 September 1997.

NBRF Flat File Format Specification
========================================================================
    { ">P1;" | ">F1;" } entry_identifier [ "#type " entry_type ]
    protein_name " - " source_name [ genetic_code ] [ sequence_note ]
    sequence "*"
    [ "N;Alternate names: " protein_name [ "; " protein_name ... ] ]
    [ "N;Contains: " protein_name [ "; " protein_name ... ] ]
    [ "C;Species: " species_name ]
    [ "A;Taxonomy: " taxonomy_code ]
    [ "A;Variety: " variety_name [ "; " variety_name ] ]
    [ "A;Note: " text [ "; " text ... ] ]
    [ "C;Date: " [ date ] ["#sequence_revision " date] ["#text_change " date] ]
    "C;Accession: " accession_number [ "; " accession_number ... ]

|   "R;" author [ "; " author ... ]
|   { citation | source_statement }
| | [ "A;Authors:" author [ "; " author ... ] ]
|   [ { "A;Title: " title | "A;Description: " submission_description } ]
|   "A;Reference number: " reference_number [ "; " cit_cross_ref ... ]
|   [ "A;Contents: " text [ "; " text ... ] ]
|   [ "A;Note: " text [ "; " text ... ] ]

| | [ "A;Accession: " accession_number ]
| | [ "A;Status: " reference_status ]
| | [ "A;Molecule type: " molecule_type [ "; " molecule_type ... ] ]
| | [ "A;Residues: " residue_specification "<" label ">" ]
| | [ "A;Cross-references: " seq_cross_ref [ "; " seq_cross_ref ... ] ]
| | [ "A;Experimental source: " text ]
| | [ "A;Genetics: " genetics_record_pointer ]
| | [ "A;Note: " text [ "; " text ... ] ]

|   [ "C;Comment: " text [ "#link " link ] [ "<" label ">" ] ]

|   [ "C;Genetics:" [ " <" label ">" ] ]
|   [ "A;Gene: " gene_symbol [ "; " gene_symbol ... ] ]
|   [ "A;Cross-references: " gen_cross_ref [ "; " gen_cross_ref ... ] ]
|   [ "A;Map position: " [ "segment" ] map_text ]
|   [ "A;Genome: " genome_type ]
|   [ "A;Mobile element: " element_text ]
|   [ "A;Gene origin: " origin_text ]
|   [ "A;Genetic code: " genetic_code ]
|   [ "A;Start codon: " codon ]
|   [ "A;Introns: " [ intron_specification ]
        [ "#status " status [ ", " status ... ] ] ]
|   [ "A;Other products: " accession_number [ "; " accession_number ... ] ]
|   [ "A;Note: " text [ "; " text ... ] ]

|   [ "C;Complex: " text ]

|   [ "C;Function:" [ " <" label ">" ] ]
|   [ "A;Description: " text ]
|   [ "A;Pathway: " text [ "; " text ... ] ]
|   [ "A;Note: " text [ "; " text ... ] ]

    [ "C;Superfamily: " sf_name [ "; " sf_name ... ] ]
    [ "A;Group: " group_name [ "; " group_name ... ] ]
    [ "C;Keywords: " keyword [ "; " keyword ... ] ]

|   [ "F;" residue_specification "/" feature_name ":" [ description ]
        [ "#agent " agent ]
        [ "#bond-type " bond_type ]
        [ "#bond-class " bond_class ]
        [ "#ligand " ligand ]
        [ "#residues " res [ ", " res ... ] ]
        [ "#link " residue_text [ ", " residue_text ... ] ]
        [ "#status " feature_status [ ", " feature_status ... ] ]
        [ "#reference " feature_reference ]
        [ "#note " text ]
        [ "<" label ">" ] ]
========================================================================
     genetic_code  = "(SGC" genetic_code_number ")"
     sequence_note = { "(fragment)" | "(fragments)" | "(tentative sequence)" }
     species_name  = taxonomic_name [ ", " taxonomic_name ... ] 
                     [ "(" common_name [ ", " common_name ... ] ")" ]
     cit_cross_ref = cit_db_code ":" reference_code
     cit_db_code   = { "MUID" | "PDB" }
     reference_status = { "preliminary" | 
                          "nucleic acid sequence not shown" |
                          "translation not shown" |
                          "protein sequence not shown" |
                          "significant sequence differences" |
                          "translated from GB/EMBL/DDBJ" |
                          "not compared with conceptual translation" |
                          "unencoded polypeptide" |
                          "conceptual translation of pseudogene" }
     molecule_type = { "DNA" | "mRNA" | "genomic RNA" | "nucleic acid" |
                       "protein" }
     residue_specification = locant [ { "," | ";" } locant ... ]
     locant        = { seq_literal | position [ "-" position ] }
     seq_literal   = "'" seq_string "'"
     seq_string    = seq_character [ seq_character ... ]
     seq_character = one of the IUPAC standard one letter amino acid codes
     position      = a number between "1" and the sequence length inclusive
     seq_cross_ref = seq_db_code ":" reference_code
     seq_db_code   = { "CAS" | "DDBJ" | "EMBL" | "GB" | "NID" |
                       "PDB" | "PID" | "TIGR" | "UWGP" }
     gene_symbol   = [ gene_db_code ":" ] gene_symbol_text
     gene_db_code  = { "FlyBase" | "GDB" | "LISTA" | "MGI" | "OMIM" }
     gen_cross_ref = gene_db_code ":" reference_code
     intron_specification = intron_locant [ "; " intron_locant ... ]
     intron_locant = position "/" { "1" | "2" | "3" }
     feature_status = { "experimental" | "predicted" | "atypical" | "absent" }
     date          = day "-" month "-" year
     day           = a two digit number between "01" and "31" 
     month         = { "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun" |
                       "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec" }
     year          = a four digit number
     label         = three or four upper case letters or numbers
========================================================================
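
The date and residue_specification definitions above can be checked
mechanically. The following is a minimal sketch, not part of the PIR software;
the function names are hypothetical, and the assumption that sequence literals
consist of uppercase letters only is ours.

```python
import re
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def parse_date(text):
    """Parse a date of the form day "-" month "-" year, e.g. 30-Sep-1997."""
    m = re.fullmatch(r"(\d{2})-([A-Za-z]{3})-(\d{4})", text)
    if m is None or m.group(2) not in MONTHS:
        raise ValueError("not a valid date: %r" % text)
    # date() itself rejects impossible days such as 00 or 31-Feb.
    return date(int(m.group(3)), MONTHS.index(m.group(2)) + 1, int(m.group(1)))


def parse_residue_specification(spec, seq_length):
    """Split a residue_specification into locants.

    Returns a list whose items are either a (first, last) tuple of
    positions or a string holding a sequence literal.
    """
    locants = []
    for part in re.split(r"[,;]", spec):
        part = part.strip()
        # Assumption: a sequence literal is uppercase letters in single quotes.
        if re.fullmatch(r"'[A-Z]+'", part):
            locants.append(part[1:-1])
            continue
        m = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
        if m is None:
            raise ValueError("not a valid locant: %r" % part)
        first = int(m.group(1))
        last = int(m.group(2)) if m.group(2) else first
        for position in (first, last):
            if not 1 <= position <= seq_length:
                raise ValueError("position %d out of range" % position)
        locants.append((first, last))
    return locants
```

For example, parse_residue_specification("6-127,30", 150) yields the two
locants (6, 127) and (30, 30).
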
Notes on the NBRF format
( 1) In the "NBRF" Format meta-description,
       double quotation marks enclose explicit character strings,
       unquoted alphabetic strings indicate meta-descriptors,
       braces "{" and "}" enclose alternative elements separated by
         verticals "|",
       brackets "[" and "]" enclose optional elements or records,
       three periods "..." indicate repetition of the preceding elements
         within the brackets,
       verticals in the left margin denote blocks of records repeatable
         in order,
       a blank line separates different repeatable blocks of records.

( 2) The authors in the "R;" and "A;Authors:" reference records appear in the
     following style:
       "surname, initials; ..."
     Names are separated by semicolons.  No " and " appears before the last
     name in a list.  The name qualifiers, Jr., III, and so forth, are
     appended to the end of the surname with no intervening punctuation.
     The optional "A;Authors:" record immediately following the "citation" 
     is used when the complete list of author names cannot fit in a single
     record; it may be repeated as required.
     In NRL_3D, single-name authors, translated double-letter initials, and
     named group authors may appear.
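
As an illustration of the author style in note 2, the sketch below splits an
"R;" or "A;Authors:" record into (surname, initials) pairs. The function name
and the sample names in the usage line are hypothetical.

```python
def parse_authors(record):
    """Split an "R;" or "A;Authors:" record into (surname, initials) pairs."""
    for prefix in ("R;", "A;Authors:"):
        if record.startswith(prefix):
            body = record[len(prefix):]
            break
    else:
        raise ValueError("not an author record: %r" % record)
    authors = []
    for name in body.split(";"):          # names are separated by semicolons
        name = name.strip()
        if not name:
            continue
        # The first comma separates the surname (with any qualifier such as
        # Jr. or III) from the initials; single-name authors have no initials.
        surname, _, initials = name.partition(",")
        authors.append((surname.strip(), initials.strip()))
    return authors
```

For a record such as "R;Smith, J.A.; Jones Jr., R." this returns
[("Smith", "J.A."), ("Jones Jr.", "R.")].
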

( 3) With this release, all but a very small number of free text notes
     pertaining to reference sequence appearance or processing status have
     been converted to appropriate status records using restricted vocabulary.
     The reference status texts "unencoded polypeptide" and "conceptual
     translation of pseudogene" appear only in the PIR4 database segment.

( 4) Because the NCBI backbone is no longer produced, the sequence cross-
     reference codes with "NCBIN" and "NCBIP" have been removed and appear
     only in sequence note records.
     The sequence cross-reference database codes with "NID" and "PID" refer
     to the corresponding entities in the GenBank/EMBL/DDBJ databases.  In
     particular, the appearance of the "PID" code indicates that the sequence
     of the corresponding residues record matches the indicated translation
     except for the possible presence or absence of initiators or other noted
     translation exceptions.
     The reference_code texts appearing in cit_cross_ref, seq_cross_ref and
     gen_cross_ref are defined by the databases referred to and are free text
     without spaces, commas, colons, or semicolons, and with any combination
     of upper- and lowercase letters, numbers, and hyphens.  These are checked
     against the corresponding databases.
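
The constraints on seq_cross_ref in note 4 can be sketched as a syntactic
check. This is an illustration only: the function name and the sample codes in
the usage line are hypothetical, and the check against the target databases
themselves is not attempted here.

```python
import re

# Database codes taken from the seq_db_code definition in the specification.
SEQ_DB_CODES = {"CAS", "DDBJ", "EMBL", "GB", "NID",
                "PDB", "PID", "TIGR", "UWGP"}

# reference_code: letters, numbers and hyphens only.
REFERENCE_CODE = re.compile(r"[A-Za-z0-9-]+")

PREFIX = "A;Cross-references: "


def parse_seq_cross_refs(record):
    """Return (db_code, reference_code) pairs from a sequence cross-reference record."""
    if not record.startswith(PREFIX):
        raise ValueError("not a cross-reference record: %r" % record)
    refs = []
    for item in record[len(PREFIX):].split("; "):
        db_code, sep, code = item.partition(":")
        if not sep or db_code not in SEQ_DB_CODES:
            raise ValueError("unknown database code in %r" % item)
        if REFERENCE_CODE.fullmatch(code) is None:
            raise ValueError("malformed reference code in %r" % item)
        refs.append((db_code, code))
    return refs
```

For example, "A;Cross-references: GB:M12345; PID:g123456" yields
[("GB", "M12345"), ("PID", "g123456")]. The same record name is also used for
gen_cross_ref in the genetics block, where gene_db_code applies instead.
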

( 5) The list of genetics database codes will be extended as circumstances
     permit.
     The gene_symbol texts are defined by authors or by the databases
     referred to.  They are free text without commas, colons, or semicolons,
     and with any combination of upper- and lowercase letters, numbers,
     hyphens, slashes and other non-letters.  These are checked against the
     corresponding databases.
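
A gene_symbol with its optional database code might be split as in the
following sketch. The function name and the sample symbols are hypothetical;
a symbol is treated as qualified only when the text before the first colon is
one of the listed gene_db_code values.

```python
# Database codes taken from the gene_db_code definition in the specification.
GENE_DB_CODES = {"FlyBase", "GDB", "LISTA", "MGI", "OMIM"}


def parse_gene_symbol(symbol):
    """Return (gene_db_code or None, gene_symbol_text)."""
    db_code, sep, text = symbol.partition(":")
    if not (sep and db_code in GENE_DB_CODES):
        db_code, text = None, symbol
    # gene_symbol_text is free text without commas, colons, or semicolons.
    if not text or any(ch in text for ch in ",:;"):
        raise ValueError("malformed gene symbol: %r" % symbol)
    return db_code, text
```

For example, "GDB:HBB" gives ("GDB", "HBB") and an unqualified symbol such as
"recA" gives (None, "recA").
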

( 6) The NBRF format allows for subidentifiers in the feature "F;" record.
     Although only the "#status" and "#link" subidentifiers are used, other
     subidentifiers may appear in later releases.
     Locants separated by semicolons and literal sequences may appear in some
     feature location fields in future releases.
     The "#status" subidentifier appears in all feature records with the
     following feature names:
       Active site
       Binding site
       Cleavage site
       Cross-link
       Disulfide bonds
       Inhibitory site
       Modified site
       Product
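
The rule in note 6 lends itself to a simple consistency check. The sketch
below flags feature records that carry one of the listed feature names but no
"#status" subidentifier. The function name and the sample records in the usage
line are hypothetical.

```python
# Feature names that always carry "#status", as listed in note 6.
STATUS_REQUIRED = {
    "Active site", "Binding site", "Cleavage site", "Cross-link",
    "Disulfide bonds", "Inhibitory site", "Modified site", "Product",
}


def missing_status(record):
    """Return True if an "F;" record should have "#status" but lacks it."""
    if not record.startswith("F;"):
        raise ValueError("not a feature record: %r" % record)
    # Layout: "F;" residue_specification "/" feature_name ":" ...
    _, _, rest = record[2:].partition("/")
    feature_name = rest.partition(":")[0].strip()
    return feature_name in STATUS_REQUIRED and "#status " not in rest
```

For example, missing_status("F;25/Active site: Ser") is True, while the same
record with " #status predicted" appended gives False.
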

2. Format Changes Anticipated for Release 55.00

The following format changes or developments are anticipated for Release
55.00 scheduled for 31 December 1997.
( 1) We anticipate that the format and use of the "R;" reference, citation
     author, and reference number records may be changed. A description of
     the use and format for these records will be documented in the next
     Technical Development Bulletin.


3. Note on Flat File Format for NRL_3D Release 22.00

Beginning with Release 21.00 of NRL_3D, a new "PDB title" record appeared.
This record, except for capitalization, uses the corresponding TITLE record
in the revised PDB format.


4. Available Instructional Material

The following instructional material is available at
ftp://nbrf.georgetown.edu/pir/

filename               contents
ANNOUNCE11.TXT         Announcements of the PIR, 20 June 1997
ANNOUNCE12.TXT         Announcements of the PIR, 6 August 1997
PIRTECH07.TXT          Technical Bulletin, 19 September 1997
FEATURES.TXT           Guide for PIR Features Annotations

Instructions on obtaining this and additional material from the PIR
World Wide Web Site will be published in the next Announcements of the
Protein Information Resource.

Your comments and suggestions are welcomed.  You may direct them to individual
members of the PIR-International staff or to POSTMASTER@NBRF.Georgetown.EDU.
If there are comments you wish to share with others on the PIR-International
Technical Development Bulletin mailing list, please indicate that fact in your
message.  Otherwise, your comments will be circulated only to the
PIR-International staff.
------------------------------------------------------------------------
                                 Dr. John S. Garavelli
                                 Associate Director
                                 Protein Information Resource
                                 National Biomedical Research Foundation
                                 Washington, DC  20007
                                 POSTMASTER@NBRF.GEORGETOWN.EDU