PIR-International Technical Development Bulletin
Number 7
19 September 1997

Contents

1. NBRF Flat File Format for PIR-International Release 54.00
2. Format Changes Anticipated for Release 55.00
3. Note on Flat File Format for NRL_3D Release 22.00
4. Available Instructional Material


1. NBRF Flat File Format in PIR-International Release 54.00

There are no major format changes in Release 54.00. However, there have
been a number of changes in the cross-references to other databases.
These changes are documented in the notes following the Flat File
Format Specification. Release 54.00 is scheduled for 30 September 1997.

NBRF Flat File Format Specification
========================================================================
{ ">P1;" | ">F1;" } entry_identifier [ "#type " entry_type ]
protein_name " - " source_name [ genetic_code ] [ sequence_note ]
sequence "*"
[ "N;Alternate names: " protein_name [ "; " protein_name ... ] ]
[ "N;Contains: " protein_name [ "; " protein_name ... ] ]
[ "C;Species: " species_name ]
[ "A;Taxonomy: " taxonomy_code ]
[ "A;Variety: " variety_name [ "; " variety_name ] ]
[ "A;Note: " text [ "; " text ... ] ]
[ "C;Date: " [ date ] ["#sequence_revision " date] ["#text_change " date] ]
"C;Accession: " accession_number [ "; " accession_number ... ]

| "R;" author [ "; " author ... ]
| { citation | source_statment }
| | [ "A;Authors:" author [ "; " author ... ] ]
| [ { "A;Title: " title | "A;Description: " submission_description } ]
| "A;Reference number: " reference_number [ "; " cit_cross_ref ... ]
| [ "A;Contents: " text [ "; " text ... ] ]
| [ "A;Note: " text [ "; " text ... ] ]
| | [ "A;Accession: " accession_number ]
| | [ "A;Status: " reference_status ]
| | [ "A;Molecule type: " molecule_type [ "; " molecule_type ... ] ]
| | [ "A;Residues: " residue_specification "<" label ">" ]
| | [ "A;Cross-references: " seq_cross_ref [ "; " seq_cross_ref ... ] ]
| | [ "A;Experimental source: " text ]
| | [ "A;Genetics: " genetics_record_pointer ]
| | [ "A;Note: " text [ "; " text ... ] ]

| [ "C;Comment: " text [ "#link " link ] [ "<" label ">" ] ]

| [ "C;Genetics:" [ " <" label ">" ] ]
| [ "A;Gene: " gene_symbol [ "; " gene_symbol ... ] ]
| [ "A;Cross-references: " gen_cross_ref [ "; " gen_cross_ref ... ] ]
| [ "A;Map position: " [ "segment" ] map_text ]
| [ "A;Genome: " genome_type ]
| [ "A;Mobile element: " element_text ]
| [ "A;Gene origin: " origin_text ]
| [ "A;Genetic code: " genetic_code ]
| [ "A;Start codon: " codon ]
| [ "A;Introns: " [ intron_specification ] [ "#status " status [ ", " status ... ] ] ]
| [ "A;Other products: " accession_number [ "; " accession_number ... ] ]
| [ "A;Note: " text [ "; " text ... ] ]

| [ "C;Complex: " text ]

| [ "C;Function:" [ " <" label ">" ] ]
| [ "A;Description: " text ]
| [ "A;Pathway: " text [ "; " text ... ] ]
| [ "A;Note: " text [ "; " text ... ] ]

[ "C;Superfamily: " sf_name [ "; " sf_name ... ] ]
[ "A;Group: " group_name [ "; " group_name ... ] ]
[ "C;Keywords: " keyword [ "; " keyword ... ] ]

| [ "F;" residue_specification "/" feature_name ":" [ description ]
        [ "#agent " agent ]
        [ "#bond-type " bond_type ]
        [ "#bond-class " bond_class ]
        [ "#ligand " ligand ]
        [ "#residues " res [ ", " res ... ] ]
        [ "#link " residue_text [ ", " residue_text ... ] ]
        [ "#status " feature_status [ ", " feature_status ... ] ]
        [ "#reference " feature_reference ]
        [ "#note " text ]
        [ "<" label ">" ] ]
========================================================================
genetic_code = "(SGC" genetic_code_number ")"
sequence_note = { "(fragment)" | "(fragments)" | "(tentative sequence)" }
species_name = taxonomic_name [ ", " taxonomic_name ... ]
               [ "(" common_name [ ", " common_name ... ] ")" ]
cit_cross_ref = cit_db_code ":" reference_code
cit_db_code = { "MUID" | "PDB" }
reference_status = { "preliminary" |
                     "nucleic acid sequence not shown" |
                     "translation not shown" |
                     "protein sequence not shown" |
                     "significant sequence differences" |
                     "translated from GB/EMBL/DDBJ" |
                     "not compared with conceptual translation" |
                     "unencoded polypeptide" |
                     "conceptual translation of pseudogene" }
molecule_type = { "DNA" | "mRNA" | "genomic RNA" | "nucleic acid" | "protein" }
residue_specification = locant [ { "," | ";" } locant ... ]
locant = { seq_literal | position [ "-" position ] }
seq_literal = "'" seq_string "'"
seq_string = seq_character [ seq_character ... ]
seq_character = one of the IUPAC standard one letter amino acid codes
position = a number between "1" and the sequence length inclusive
seq_cross_ref = seq_db_code ":" reference_code
seq_db_code = { "CAS" | "DDBJ" | "EMBL" | "GB" | "NID" | "PDB" | "PID" |
                "TIGR" | "UWGP" }
gene_symbol = [ gene_db_code ":" ] gene_symbol_text
gene_db_code = { "FlyBase" | "GDB" | "LISTA" | "MGI" | "OMIM" }
gen_cross_ref = gene_db_code ":" reference_code
intron_specification = intron_locant [ "; " intron_locant ... ]
intron_locant = position "/" { "1" | "2" | "3" }
feature_status = { "experimental" | "predicted" | "atypical" | "absent" }
date = day "-" month "-" year
day = a two digit number between "01" and "31"
month = { "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun" |
          "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec" }
year = a four digit number
label = three or four upper case letters or numbers
========================================================================

Notes on the NBRF format

( 1) In the "NBRF" Format meta-description, double quotation marks
     enclose explicit character strings, unquoted alphabetic strings
     indicate meta-descriptors, braces "{" and "}" enclose alternative
     elements separated by verticals "|", brackets "[" and "]" enclose
     optional elements or records, three periods "..."
     indicate repetition of the preceding elements within the brackets,
     verticals in the left margin denote blocks of records repeatable
     in order, and a blank line separates different repeatable blocks
     of records.

( 2) The authors in the "R;" and "A;Authors:" reference records appear
     in the following style:

         "surname, initials; ..."

     Names are separated by semicolons. No " and " appears before the
     last name in a list. The name qualifiers, Jr., III, and so forth,
     are appended to the end of the surname with no intervening
     punctuation. The optional "A;Authors:" record immediately
     following the "citation" is used when the complete list of author
     names cannot fit in a single record; it may be repeated as
     required. In NRL_3D, single-name authors, translated double-letter
     initials, and named group authors may appear.

( 3) With this release, all but a very small number of free text notes
     pertaining to reference sequence appearance or processing status
     have been converted to appropriate status records using a
     restricted vocabulary. The reference status texts "unencoded
     polypeptide" and "conceptual translation of pseudogene" appear
     only in the PIR4 database segment.

( 4) Because the NCBI backbone is no longer produced, the sequence
     cross-reference database codes "NCBIN" and "NCBIP" have been
     removed and appear only in sequence note records. The sequence
     cross-reference database codes "NID" and "PID" refer to the
     corresponding entities in the GenBank/EMBL/DDBJ databases. In
     particular, the appearance of the "PID" code indicates that the
     sequence of the corresponding residues record matches the
     indicated translation except for the possible presence or absence
     of initiators or other noted translation exceptions. The
     reference_code texts appearing in cit_cross_ref, seq_cross_ref and
     gen_cross_ref are defined by the databases referred to and are
     free text without spaces, commas, colons, or semicolons, and with
     any combination of upper- and lowercase letters, numbers, and
     hyphens.
     These are checked against the corresponding databases.

( 5) The list of genetics database codes will be extended as
     circumstances permit. The gene_symbol texts are defined by authors
     or by the databases referred to. They are free text without
     commas, colons, or semicolons, and with any combination of upper-
     and lowercase letters, numbers, hyphens, slashes and other
     non-letters. These are checked against the corresponding
     databases.

( 6) The NBRF format allows for subidentifiers in the feature "F;"
     record. Although only the "#status" and "#link" subidentifiers are
     currently used, other subidentifiers may appear in later releases.
     Locants separated by semicolons and literal sequences may appear
     in some feature location fields in future releases. The "#status"
     subidentifier appears in all feature records with the following
     feature names:

         Active site
         Binding site
         Cleavage site
         Cross-link
         Disulfide bonds
         Inhibitory site
         Modified site
         Product


2. Format Changes Anticipated for Release 55.00

The following format changes or developments are anticipated for
Release 55.00, scheduled for 31 December 1997.

( 1) We anticipate that the format and use of the "R;" reference,
     citation author, and reference number records may be changed. The
     use and format of these records will be documented in the next
     Technical Development Bulletin.


3. Note on Flat File Format for NRL_3D Release 22.00

Beginning with Release 21.00 of NRL_3D, a new "PDB title" record
appeared. Except for capitalization, this record uses the corresponding
TITLE record in the revised PDB format.


4. Available Instructional Material

The following instructional material is available at
ftp://nbrf.georgetown.edu/pir/

    filename          contents

    ANNOUNCE11.TXT    Announcements of the PIR, 20 June 1997
    ANNOUNCE12.TXT    Announcements of the PIR, 6 August 1997
    PIRTECH07.TXT     Technical Bulletin, 19 September 1997
    FEATURES.TXT      Guide for PIR Features Annotations

Instructions on obtaining this and additional material from the PIR
World Wide Web Site will be published in the next Announcements of the
Protein Information Resource.

Your comments and suggestions are welcomed. You may direct them to
individual members of the PIR-International staff or to
POSTMASTER@NBRF.Georgetown.EDU. If there are comments you wish to share
with others on the PIR-International Technical Development Bulletin
mailing list, please indicate that fact in your message. Otherwise,
your comments will be circulated only to the PIR-International staff.

------------------------------------------------------------------------
Dr. John S. Garavelli
Associate Director
Protein Information Resource
National Biomedical Research Foundation
Washington, DC 20007
POSTMASTER@NBRF.GEORGETOWN.EDU
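
Illustrative sketch (not part of the PIR software): the "date" and
"residue_specification" definitions in the Flat File Format
Specification above can be checked with a few lines of Python. The
function names parse_pir_date and parse_residue_specification are
hypothetical and chosen only for this example. Two simplifying
assumptions are made and marked in the comments: calendar validity of
dates is enforced, and sequence literals accept any upper case letter
rather than the exact IUPAC code set.

```python
import re
from datetime import date

# Month abbreviations exactly as enumerated in the "month" definition.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def parse_pir_date(text):
    """Parse a field of the form day "-" month "-" year, e.g. 19-Sep-1997."""
    m = re.fullmatch(r"(\d{2})-([A-Za-z]{3})-(\d{4})", text)
    if m is None or m.group(2) not in MONTHS:
        raise ValueError(f"not a valid date field: {text!r}")
    # Assumption: calendar validity is enforced here (31-Feb is rejected),
    # which is stricter than the "01" to "31" range given in the format.
    return date(int(m.group(3)), MONTHS.index(m.group(2)) + 1, int(m.group(1)))


def parse_residue_specification(text, seq_length):
    """Split a residue_specification into locants.

    Returns a list whose items are either (start, end) tuples of
    positions or a string holding the body of a quoted seq_literal.
    """
    locants = []
    # Locants are separated by "," or ";".
    for part in re.split(r"[,;]", text):
        part = part.strip()
        # seq_literal: a run of one-letter codes in single quotes.
        # Assumption: any upper case letter is accepted, not only the
        # IUPAC amino acid codes.
        if re.fullmatch(r"'[A-Z]+'", part):
            locants.append(part[1:-1])
            continue
        # position [ "-" position ]
        m = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
        if m is None:
            raise ValueError(f"not a valid locant: {part!r}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        # Each position must lie between 1 and the sequence length inclusive.
        for pos in (start, end):
            if not 1 <= pos <= seq_length:
                raise ValueError(f"position {pos} outside 1..{seq_length}")
        locants.append((start, end))
    return locants


if __name__ == "__main__":
    print(parse_pir_date("19-Sep-1997"))                      # 1997-09-19
    print(parse_residue_specification("1-20,35;'ACD'", 100))  # [(1, 20), (35, 35), 'ACD']
```

A single position is returned as a (start, end) pair with equal ends so
that callers can treat ranges and single residues uniformly. The sketch
does not require start to be less than or equal to end, because the
specification states no such rule.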