P R O T E I N   S E Q U E N C E   D A T A B A S E

PIR Document CXFSD-1293

CODATA Exchange Format Specification
Version 3.0, December 31, 1993

Protein Information Resource (PIR)*
National Biomedical Research Foundation
3900 Reservoir Road, N.W., Washington, DC 20007, USA
E-mail: PIRMAIL@NBRF.Georgetown.Edu

In collaboration with:

International Protein Information  |  Martinsried Institute for
Database in Japan (JIPID)          |  Protein Sequences (MIPS)
Science University of Tokyo        |  Max Planck Institute for Biochemistry
2669 Yamazaki, Noda 278, Japan     |  D-8033 Martinsried bei Muenchen, FRG

This database may be redistributed without prior consent, provided that this notice be given to each user and that the words "Derived from" shall precede this notice if the database has been altered by the redistributor.

*PIR is a registered mark of NBRF
PIR is partially supported by the National Library of Medicine

1.0 Recent Modifications
________________________

In an effort to improve the coverage, accuracy, and completeness of the PIR-International Protein Sequence Database, a number of minor changes have been introduced into the CODATA formatted version of the Database beginning with Release 39.00. These changes were introduced to provide room for expansion, such as the inclusion of cross-references to other databases, and to make the overall presentation more uniform and more computer parsable. For the most part, the changes involve modifications to record identifiers and subidentifiers and the replacement of comma and back-slash list separators with semicolons. In addition, a number of records have been expanded and new subfields have been added.

We realize that even minor changes present difficulties to database users and have carefully limited such changes accordingly. However, it is not possible to make improvements without making changes. Below is an itemized list of affected records and corresponding format revisions.

Enhanced CODATA Record Descriptions

( 1) ENTRY
       the #type field values are: "complete", "fragment", or "fragments"

( 2) TITLE
       the #EC-number field has been eliminated; EC numbers are found within the title in parentheses

( 3) ALTERNATE_NAMES
       this identifier has changed from "ALTERNATE-NAME"; alternate names are separated by semicolons rather than backslashes

( 4) CONTAINS
       this identifier changed from "INCLUDES"; contains names will be separated by semicolons rather than backslashes; the #EC-number field has been eliminated (see 2 above)

( 5) ORGANISM
       this identifier changed from "SOURCE"; in the future, unique identifiers will be assigned for every species appearing in an organism record; these identifiers will follow immediately after the record identifier;
       the scientific organism name field is now explicitly identified by "#formal_name";
       the hyphen in the #common_name identifier is replaced with an underscore and common_name is lowercase;
       two new fields have been added:
          #variety
          #note
       the HOST record has been eliminated; this information is now found in the #note field and will be moved to the Taxonomy file at a later date

( 6) PLACEMENT
       this field has been eliminated; placement information is given in the PIR*.SNX index file; the PRINDX.LIS data file contains a listing of all primary placement numbers in Sections PIR1 and PIR2

( 7) DATE
       the field identifiers "#sequence" and "#text" have been changed to "#sequence_revision" and "#text_change", respectively

( 8) ACCESSIONS
       this identifier has changed from "ACCESSION"; accession numbers are separated by semicolons rather than backslashes

( 9) REFERENCE
       parenthetical comments following the record identifier have been moved to the #contents field;
       #reference-number - this field has been eliminated; reference numbers now follow immediately after the REFERENCE record identifier
       #authors - author names are separated by semicolons (rather than commas); each name consists of a surname (one or more words) separated from
          initials by a comma; periods now punctuate the author's initials and "Jr." and "Sr.", for example
             Maizel Jr., J.; Smith, E.L.
       #description - this identifier may follow the identifier #submission; the field contains a brief description of the submitted work
       #cross-references - this new field has been added for cross-references to Medline with the tag "MUID:"
       #contents and #note - these fields have been added; initially the #contents field will contain the information previously found in the parenthetical comments appearing after the REFERENCE identifier
       #accession - this field allows only one accession number
       ##molecule_type - this field becomes a subfield of #accession; the hyphen is now an underscore
       ##residues - this field becomes a subfield of #accession; the `tag' that was delimited by < and > is moved to the ##label field
       ##cross-references - this field becomes a subfield of #accession; it contains cross-references to CAS, DDBJ, EMBL, GB, NCBIN, and NCBIP (primary sequence data from the `backbone')
       three new subfields have been added to #accession:
          ##status
          ##genetics - contains a pointer to a GENETICS block when there is more than one GENETICS block; it is the same as the label on the GENETICS block (see below)
          ##note

(10) GENETICS
       multiple genetics records may occur when multiple strains or multiple genes are represented in the entry; the first field may contain an optional record label when more than one GENETICS record occurs
       #gene - the identifier has been changed from #Name; cross-references to gene mapping databases now occur in this field, e.g., GDB:AGT1; AGT1 is the gene name as it appears in the GDB database
       #cross-references - this field is used to link the genetics information to other databases, particularly GDB; information in this item refers to entry code or ID
       #map_position - the hyphen has been changed to an underscore; #Segment-number has been eliminated; this information is now considered to be part of the map position
       #genome -
          this new field describes the genetic source
       #genetic_code - this identifier changed from #Special-code
       #start_codon - there is no change to this field
       #introns - intron specifications now have the form: 31/2; 45/3
       ##status - this subfield is new and gives the status of intron assignment (it does not occur in Release 39.00)
       #note - this new field contains ancillary genetics information

(11) COMPLEX
       this record describes the type of structural complex associated with forms of the molecule

(12) FUNCTION
       this block contains information about the function of the protein
       #description - free text describing the function (required)
       #note - text denoting atypical activity (optional)

(13) CLASSIFICATION
       this identifier has changed from "SUPERFAMILY", and "#Name" is replaced by "#superfamily"; it contains a list (separated by semicolons instead of backslashes) of superfamily names; residue specifications, explicitly depicting domains, no longer occur in this field

(14) KEYWORDS
       keywords are separated by semicolons rather than backslashes

(15) FEATURE
       the first character of Feature descriptors is now lowercase;
       #Thioester-bonds have been incorporated into #cross-links records;
       #Duplication features have been eliminated (the information has been moved to #domain or #region features);
       two new fields have been added to FEATURE:
          #status
          #label - the label previously contained between `<' and `>' is now in this field

Other new fields will be implemented in future releases.

2.0 Introduction
________________

For non-VAX/VMS systems, the PIR-International Protein Sequence Database is being distributed in a format consistent with the Sequence Data Exchange Format recommended by the CODATA Task Group on the Coordination of Protein Sequence Data Banks. The format is a context-independent, free format wherein data are not restricted to specific columns, fields, or records but are identified by a defined set of field descriptors (identifiers and subidentifiers).
For maximum portability, the data are represented in the International ASCII character set restricted to the printable ASCII characters (ASCII characters 32 through 126, decimal representation). The data are represented in upper- and lower-case letters; however, no significance should be attached to the case of a letter; upper- and lower-case letters are to be treated equivalently. Records are fixed at 80 characters per record padded on the right by space characters. The data are organized into entries that contain all the information associated with a particular sequence, including the title, the biological source, references, associated text, and the sequence itself. Some entries contain information from several closely related sequences but only one sequence is explicitly represented in each entry. The database is contained in a single file; the first several lines of the file contain the database header, which identifies the database. The entries follow sequentially; they are separated from any other text in the file by beginning and ending with a record containing three backslash characters, '\', in the first three columns. Types of data within an entry are distinguished by dividing them into specific data items, e.g., title, reference, feature table, etc. Space characters are used as general separators; a variable number of spaces are used to separate data, which allows the data to be represented in an easily readable tabular form. Although the format may appear to be in a fixed field representation, software should be designed to read a free field format; we cannot guarantee that the indentation and spacing between specific data will not be changed in future releases. Each data item is labeled with an Identifier. Identifiers are single words or several words connected by hyphens (they contain no internal space characters). A data item may extend over more than one line. 
The identifier starts at the first column of the first line of the corresponding data item; continuation lines are distinguished by containing at least three space characters at the beginning of the line.

Data within data items are divided into subitems. Each subitem consists of a subidentifier, which identifies the subitems and separates them, followed by the associated data. Subidentifiers are of the same form as identifiers but are immediately preceded by a number sign, '#', which designates them as subidentifiers; the number sign appears in the database only to introduce subidentifiers. Most of the fields within each data item are optional; therefore, they are defined specifically by their identifier rather than by relative position (the relative position of subitems may change in future releases).

It is often the case that many items of data are distinct but represent equivalent types of data. For example, many proteins are known by a variety of names. Although each name is distinct and should be treated as such, they are all alternate names. This may be handled in one of two ways. Each alternate name may be presented as a separate data item introduced by the identifier ALTERNATE_NAMES, or several alternate names may be introduced by a single identifier and separated by a semicolon. The semicolon is used as a general data item separator that indicates that what follows is a separate data item but of the same type. The semicolon is used in the database only to indicate the concatenation of data items.

An entry contains the following data items; see Appendix E for an entry template:

     Beginning-of-entry (Entry)
     Title
     Alternate_names
     Contains
     Organism
     Date
     Accessions
     Reference
     Comment
     Genetics
     Classification
     Keywords
     Feature
     Summary
     Sequence
     End-of-entry

All entries contain the data items Beginning-of-entry, Title, Organism, Date, Reference, Summary, Sequence, and End-of-entry; the other data items are optional.
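
The layout rules above (identifier in the first column, indented continuation lines, '#' subidentifiers, semicolon separators) can be illustrated with a minimal Python sketch. It is not part of the CODATA specification: the function names, the sample lines, and the choice to treat any line that begins with a space as a continuation line are assumptions made for illustration only.

```python
import re

# Illustrative lines only; they are not copied from a database release.
SAMPLE_ENTRY = [
    "ENTRY            CCHU       #type complete",
    "TITLE            cytochrome c - human",
    "ORGANISM         #formal_name Homo sapiens #common_name man",
    "KEYWORDS         electron transfer; heme;",
    "                   mitochondrion",
    "///",
]


def parse_data_items(lines):
    """Group the lines of one entry into (IDENTIFIER, text) pairs.

    A line whose first character is not a space starts a new data item;
    a line that starts with a space continues the previous item
    (assumption: the three-space minimum is not enforced here).
    """
    items = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if not line[0].isspace():
            identifier, _, rest = line.partition(" ")
            # Case is not significant, so identifiers are upper-cased.
            items.append((identifier.upper(), [rest]))
        elif items:
            items[-1][1].append(line)
    # Collapse runs of spaces, since spacing is free-format.
    return [(name, " ".join(" ".join(chunks).split())) for name, chunks in items]


def split_subitems(text):
    """Split item text into leading data and (subidentifier, values) pairs.

    '#' (or '##') introduces a subidentifier; ';' separates values of
    the same type within one subitem.
    """
    parts = re.split(r"(#+[A-Za-z][\w-]*)", text)
    leading = parts[0].strip()
    subitems = []
    for tag, value in zip(parts[1::2], parts[2::2]):
        values = [v.strip() for v in value.split(";") if v.strip()]
        subitems.append((tag.lower(), values))
    return leading, subitems
```

Because unknown identifiers are simply returned as pairs, a caller written this way can skip data items it does not recognize, which is the behavior recommended for software that must survive new data items and subitems.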
Each Entry begins with a Beginning-of-entry data item sequentially followed by a Title data item and ends with a Summary data item sequentially followed by a Sequence data item and an End-of-entry data item; these data items are present only once in each entry; all others may be repeated several times within the same entry.

The remaining data items generally follow in the order indicated above; however, this order may not be strictly adhered to. Software should be designed to accommodate alterations in the order of data items.

New data items and subitems may be introduced in the future. We recommend that software be designed to ignore or otherwise handle unknown data items and subitems. This will ensure that the software will continue to run when new data items and subitems are introduced and will give the programmer time to make smooth adjustments as new data are introduced.

The Beginning-of-Entry Data Item

The beginning-of-entry data item marks the beginning of each entry; it is denoted by the identifier 'ENTRY'. The entry identification code immediately follows the identifier; it is a four- to six-character word that is used to identify the entry in the database; it contains only alphanumeric characters with no internal spaces.

One subidentifier, '#type', which denotes the type of molecule, is defined for the beginning-of-entry data item. The '#type' subidentifier is followed by either 'complete', 'fragment' or 'fragments', which indicates whether the represented sequence is complete or fragmentary.

The Title Data Item

The Title data item contains the entry title; it is denoted by the identifier 'TITLE'. The entry title gives a brief description of the molecule and its biological source. The name of the molecule appears on the left and the name of the biological source appears on the right; they are separated by ' - '.
Enzyme Commission (EC) numbers, as defined by the Nomenclature Committee of the International Union of Biochemistry, are denoted by an 'EC ' number in parentheses. The EC number represents a hierarchical classification scheme and consists of four positive integers separated by periods. With the exception of the first, the integers may be replaced by a dash to indicate enzymes that are not fully classified.

The Alternate_Names Data Item

The optional Alternate_names data item contains an alternate title for molecules for which a standard nomenclature has not been established and that are designated by more than one name. It is denoted by the identifier 'ALTERNATE_NAMES'. Multiple alternate names may be represented in separate alternate name data items or within the same data item separated by semicolon characters.

The Contains Data Item

The optional Contains data item also contains a title but is distinct from the alternate name and title data items in that the Contains data item refers to activities present within the molecule rather than those associated with the entire molecule. It is denoted by the identifier 'CONTAINS'. Multiple contains data items may be represented in separate data items or within the same data item separated by a single semicolon character.

The Organism Data Item

The Organism data item specifies the biological source of the sequence; it is denoted by the identifier 'ORGANISM'. The scientific name of the organism immediately follows the identifier and is denoted by the descriptor '#formal_name'. The common name is denoted by the subidentifier '#common_name'. Synonymous scientific names and common names are separated by commas but follow contiguously in the appropriate subitem. In most cases both the scientific and common names of the organism are given; this is not always possible, however.

The Date Data Item

The Date data item contains the revision dates of the entry. It is denoted by the identifier 'DATE'.
The date on which the entry was added to the database immediately follows the identifier. Two subidentifiers are defined: '#sequence_revision', denoting the date the sequence was last revised, and '#text_change', denoting the date that the text of the entry was last revised. Not all entries contain all three dates. If the '#sequence_revision' or '#text_change' dates are absent, the entry has not been revised since its addition to the database; if the added date is absent, the entry was added prior to 1979. The dates are specified as a two-digit integer representing the day followed by a dash, the first three letters of the month, a second dash, and the four-digit integer representation of the year.

The Accessions Data Item

The Accessions data item is denoted by the identifier 'ACCESSIONS'. Accession numbers contain only alphanumeric characters. Multiple accession numbers may be represented in separate accession data items or within the same data item separated by a single semicolon character. These numbers are associated with the sequence as reported in a publication or as submitted to the databases; they constitute unique labels that provide an identity to the reported sequence. As these labels will remain permanently associated with this information, they may continue to serve as tags to locate the information in any future releases of the database.

The Reference Data Item

The Reference data item contains the literature citation; it is denoted by the identifier 'REFERENCE'. A unique six-character alphanumeric 'Reference number' may appear immediately after this identifier. The Reference data item begins the reference block. Contained in the reference block is the authors subitem; the journal, book, submission or citation subitem; an optional reference title subitem; an optional cross-references subitem; an optional contents subitem; an optional note subitem; and an optional accession number subitem which begins the accession block.
Contained in the accession block is an optional status subsubitem, the molecule_type subsubitem, the residues subsubitem, an optional cross-references subsubitem, an optional genetics subsubitem, and an optional note subsubitem. The accession block and associated subsubitems may be repeated within the reference block, and the reference block may be repeated within the entry.

The authors subitem is denoted by the subidentifier '#authors'. The author names follow the subidentifier and are separated by semicolons. Each author name consists of a surname and initials separated by a comma. The surname may be multiple words and may contain extensions such as "Jr.", "Sr.", "II", "III" or "IV". The initials are separated by periods.

The journal subitem is denoted by the subidentifier '#journal'. The subidentifier is sequentially followed by the journal abbreviation as depicted in the PIR-International journals abbreviations file named JOURNALS.LIS, a four-digit integer representation of the year enclosed in parentheses, an optional volume number, an optional issue number enclosed in square brackets, a colon, and the page numbers. The volume number and issue number contain alphanumeric characters, and the page numbers are one or two alphanumeric strings separated by a dash.

The book subitem is denoted by the subidentifier '#book', the submission subitem is denoted by the subidentifier '#submission', and the citation subitem is denoted by the subidentifier '#citation'. The book subitem contains citations to publications other than scientific journals; the submission subitem gives the database and date of an author submission; and the citation subitem contains other citations that cannot easily be represented in the format required for the journal subitem.

The reference title subitem is denoted by the subidentifier '#title' or '#description'. It contains the title of the manuscript or, for submissions, a brief description of the submitted work.
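
The '#authors' and '#journal' layouts described above are regular enough to be read with a few lines of code. The following Python sketch is one possible reading of that prose, not a reference parser: the function names are hypothetical, splitting each author at the last comma is an assumption, and the journal citation used in the usage note is invented.

```python
import re


def parse_authors(field):
    """Split an '#authors' value into (surname, initials) pairs.

    Names are separated by semicolons; within a name, the surname
    (which may include an extension such as "Jr.") is assumed to be
    everything before the last comma.
    """
    authors = []
    for name in field.split(";"):
        name = name.strip()
        if not name:
            continue
        surname, comma, initials = name.rpartition(",")
        if not comma:  # no comma found: keep the whole string as surname
            surname, initials = name, ""
        authors.append((surname.strip(), initials.strip()))
    return authors


# abbreviation (year) [volume] [[issue]] : pages
JOURNAL_PATTERN = re.compile(
    r"(?P<journal>.+?)\s*"
    r"\((?P<year>\d{4})\)\s*"
    r"(?P<volume>[A-Za-z0-9]+)?\s*"
    r"(?:\[(?P<issue>[A-Za-z0-9]+)\])?\s*"
    r":\s*(?P<pages>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)?)"
)


def parse_journal(field):
    """Return the parts of a '#journal' value as a dict, or None."""
    match = JOURNAL_PATTERN.fullmatch(field.strip())
    return match.groupdict() if match else None
```

For example, parse_authors("Maizel Jr., J.; Smith, E.L.") yields the pairs ("Maizel Jr.", "J.") and ("Smith", "E.L."). A made-up citation such as "J. Example Res. (1990) 12 [3]: 45-67" is split by parse_journal into journal, year, volume, issue, and pages; values that do not match the pattern return None so that the caller can fall back to treating the field as free text.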

The contents subitem, denoted by '#contents', may list the residue numbers reported in the paper, indicate what information is presented in the paper (complete sequence, X-ray crystallography, features, etc.), or give the biological source when the entry represents sequences from more than one organism.

The note subitem, denoted by '#note', describes any ancillary information associated with the citation.

The accession number subitem is denoted by the subidentifier '#accession'. It contains a label that uniquely identifies the sequence that was reported in the reference. Single accession numbers are required as unique pointers to a sequence report. These labels have the same form as the entry identification code and are unique over all sections of the database, with very few exceptions where a portion of a sequence overlaps in two entries.

The status subsubitem is denoted by '##status' and indicates a level of review of the sequence. A value for this field is "preliminary".

The molecule_type subsubitem is denoted by the subidentifier '##molecule_type'. It contains a symbol that identifies the type of molecule from which the sequence was experimentally determined. The valid symbols for this field are given in Appendix D.

The residues subsubitem is denoted by the subidentifier '##residues'. It contains a residues specification and a special tag that provides a logical address to the sequence as reported in the publication. The tag subsubitem is denoted by the subidentifier '##label'.

The residues specification is a special syntax that allows the sequence originally reported to be regenerated from that shown in the entry. The residues specification consists of a set of segment specifications separated by semicolons and commas. The segment specification may be either a literal sequence element (a sequence of amino acids represented in the one-letter amino acid abbreviations and enclosed in single quotes) or a range of sequence positions.
The range is either a single integer or a pair of integers separated by a dash ('-') that denotes the inclusive extent of the specified segment with respect to the sequence shown in the entry (the integers correspond to a sequential and contiguous sequence numbering system in which the first residue is assigned the number 1). The sequence is reconstructed by extracting the indicated sequence elements and appending these together with the literally specified sequence elements. The semicolon is used to denote physical breaks in the sequence. Segments separated by commas should be 'spliced' together; those separated by semicolons represent boundaries between adjacent fragments. This syntax, although simple and concise, is powerful enough to express all known sequence transforms (i.e., insertions, deletions, substitutions, duplications, transpositions, etc.) capable of converting one protein sequence into another without introducing nonstandard amino acids.

Following the residues specification is an entry-identification-code extension denoted by the subidentifier '##label'. This short 'tag', when combined with the entry identification code, provides a unique logical address or path to the indicated sequence, e.g., CCYC6->AIT.

The cross-references subsubitem is denoted by '##cross-references'. It contains information about cross references to other databases such as Medline abstracts. The tag "MUID:" indicates a Medline cross reference.

The genetics subsubitem is denoted by '##genetics'. This information is present when more than one genetics block is present and needs to be distinguished. The tag is unique and appears in the 'GENETICS' data item.

The note subsubitem is denoted by '##note' and represents ancillary information about the sequence report.

The Comment Data Item

The optional Comment data item contains free text; it is denoted by the identifier 'COMMENT'.
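
The residues specification described under the Reference data item (the '##residues' subsubitem) can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the entry sequence is assumed to have been reduced already to one-letter codes with no punctuation, and the sample sequence and specification in the usage note are invented.

```python
def apply_residues_spec(spec, sequence):
    """Rebuild the originally reported sequence fragments.

    spec     -- residues specification, e.g. "1-3,'XX',5;8-10"
    sequence -- entry sequence as one-letter codes, residue 1 first

    Commas splice segments together; semicolons mark boundaries
    between fragments, so one string per fragment is returned.
    """
    fragments = []
    for fragment_spec in spec.split(";"):
        pieces = []
        for segment in fragment_spec.split(","):
            segment = segment.strip()
            if not segment:
                continue
            if len(segment) >= 2 and segment[0] == "'" and segment[-1] == "'":
                # Literal sequence element enclosed in single quotes.
                pieces.append(segment[1:-1])
            else:
                # Single position or inclusive range, numbered from 1.
                start_text, dash, end_text = segment.partition("-")
                first = int(start_text)
                last = int(end_text) if dash else first
                pieces.append(sequence[first - 1:last])
        fragments.append("".join(pieces))
    return fragments
```

With the invented sequence "ACDEFGHIKL", the specification "1-3,'XX',5;8-10" yields two fragments, "ACDXXF" and "IKL": the first splices residues 1 through 3, the literal 'XX', and residue 5; the second is residues 8 through 10.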

The Genetics Data Item

The optional Genetics data item contains genetic information associated with the entry; it is denoted by the identifier 'GENETICS'. This data item begins the genetics block; the following seven subidentifiers have been defined for this data item (all are optional but at least one is present): '#gene', denoting the gene name; '#map_position', denoting the genetic map position; '#genome', denoting the genome type; '#genetic_code', denoting the genetic code used for protein translation and back-translation if other than the universal genetic code; '#start_codon', denoting the start-of-translation codon if other than AUG; '#introns', denoting the intron positions; and '#note', denoting any comments particular to the specified gene. A complete listing of the special genetic codes is included in the file SGC.LIS. For human sequences the gene name and map position are given according to the recommendations of the Human Gene Mapping Library (HGML) at Yale University.

The '#introns' subidentifier is followed by a set of intron specifications separated by semicolons; intron specifications consist of two integers separated by a slash. The first integer specifies the position of the intron in the protein sequence; the second specifies the position in the codon immediately preceding the intron and may have the value 1, 2, or 3.

The Classification Data Item

The optional Classification data item contains the name of the superfamily(s) to which the sequence has been assigned; it is denoted by the identifier 'CLASSIFICATION'. It has been recognized that many proteins are composed of functional domains of separate evolutionary origin. There are a number of these multidomain proteins in the database. If multiple superfamilies are assigned, each is specified as a "homology" except for the primary if there is one. Some domains may be specified in the FEATURE table described below.
Each superfamily name is listed under the subidentifier '#superfamily' and is separated by a single semicolon.

The Keywords Data Item

The optional Keywords data item contains a keyword or short key phrase; it is denoted by the identifier 'KEYWORDS'. Multiple keywords may be represented in separate keywords data items or within the same data item separated by a single semicolon.

The Feature Data Item

The optional Feature data item marks sites or regions of the sequence that are of biological interest; it is denoted by the identifier 'FEATURE'. The identifier is sequentially followed by a residue specification, a feature descriptor, and a feature title. The residue specification consists of a set of fragment specifications separated by commas. A fragment specification consists of an integer or two integers separated by a dash. The feature descriptor is a subidentifier that indicates the type of feature. The currently used feature descriptors are given in Appendix A.

When the '#protein', '#peptide', or '#domain' descriptors are used, three- or four-character code extensions follow the subidentifier '#label'. These code extensions, when added to the entry identification code, generate a unique code with which to identify the corresponding subsequence in the database.

The Summary Data Item

The Summary data item contains the sequence length, the molecular weight, and a sequence checksum; it is denoted by the identifier 'SUMMARY'. The sequence length, molecular weight, and sequence checksum are denoted by the subidentifiers '#length', '#molecular-weight', and '#checksum', respectively; all are represented as integers. Note that the molecular weight is not included on fragmentary sequences, #type fragment(s).

The checksum is computed using the method introduced by Devereux et al. (Devereux, J., Marquess, P., Martin, W., Smith, M., Winsborough, W., 1986, Sequence Analysis Software Package of the University of Wisconsin Genetics Computer Group).
It is a simple method that yields checksums with values between 0 and 9999. Although the checksums are not unique, they have been found to be sensitive enough to changes in the sequence to indicate that the sequence has been altered or incorrectly transmitted. The sum is compiled by sequentially summing over the entire sequence the product of a modified position number times the ASCII decimal representation of the character corresponding to the amino acid at that position. The position numbers are modified by setting the value to 1 every time it exceeds 57 and thus restricting the values to 1 through 57. The remainder of this sum after division by 10,000 is then taken as the checksum.

The following is a sample from a section of a FORTRAN 77 program that computes the checksum. The function ICHAR returns the decimal equivalent of the sequence character stored in the array SEQ. Note that when residues are given as lower-case letters they are represented as the ASCII equivalent of the corresponding upper-case letter.

C     ---Compute checksum---
      ICHECK=0
      ICOUNT=0
      DO 20 L=1,LENSEQ
         ICOUNT=ICOUNT+1
         LCHAR=ICHAR(SEQ(L))                 !Convert to decimal form
C        Optionally change case
         IF (LCHAR.GE.97 .AND. LCHAR.LE.122) LCHAR=LCHAR-32
         ICHECK = ICHECK + ICOUNT*LCHAR
         IF (ICOUNT.EQ.57) ICOUNT=0
   20 CONTINUE
      LSUM = ICHECK/10000                    !Integer division
      ICHECK = ICHECK - LSUM*10000           !Compute remainder

The Sequence Data Item

The Sequence data item contains the sequence in the one-letter amino acid abbreviations (Appendix B); it is denoted by the identifier 'SEQUENCE'. The sequences are numbered and displayed in an easily readable form; they also contain punctuation (see Appendix C).
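
For readers working outside FORTRAN, the checksum procedure described in the Summary data item can be restated as a short Python sketch. This is a transcription of the prose and of the FORTRAN 77 sample, not an independent definition; the function name is hypothetical, and the input is assumed to contain only the one-letter residue codes, with numbering and punctuation already removed.

```python
def pir_checksum(sequence):
    """Checksum of a residue string, following the FORTRAN 77 sample.

    Each residue's ASCII code (upper-cased) is multiplied by a position
    counter that cycles through 1..57, and the products are summed; the
    remainder after division by 10,000 is returned.
    """
    total = 0
    count = 0
    for residue in sequence:
        count += 1
        total += count * ord(residue.upper())
        if count == 57:
            count = 0  # the next residue is weighted 1 again
    return total % 10000
```

As a hand-checkable case, the two-residue string "AC" gives 1*65 + 2*67 = 199, and the lower-case string "ac" gives the same value because case is folded before summing.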
Inasmuch as the display of the sequence information may change in future releases, we strongly recommend that software developers do not rely on the explicit representation of the sequence data but write software that interprets only the one-letter amino acid abbreviations as sequence data (this may be accomplished by ignoring all nonletter characters following the 'SEQUENCE' identifier). The end of the sequence is defined by the end-of-entry marker (a line containing three slashes as the first three characters).

The End-of-Entry Data Item

The end-of-entry data item marks the end of the entry. It is denoted by the identifier consisting of three consecutive slashes.

Appendix A

Feature Descriptors of The PIR International Protein Sequence Database

Descriptors Denoting Single Residue Sites
-----------------------------------------

     #active_site
     #binding_site
     #cleavage_site
     #inhibitory_site
     #modified_site

These sites may be represented using the full feature residue specification conventions; however, the specification should be interpreted to specify a collection of single sites.

Descriptors Denoting Residues Connected by Covalent Bonds
---------------------------------------------------------

     #disulfide_bonds
     #cross_links

The pairs of residues linked by hyphens should be interpreted as being connected by bonds. Single residues may be specified if the bond is between an amino acid in the sequence and another protein chain or molecule. The #cross_links descriptor specifically denotes bonds linking the sequence shown to an adjacent protein chain.

Descriptors Denoting Sequence Regions
-------------------------------------

     #domain
     #peptide
     #protein
     #region

Pairs of residues linked by hyphens denote a region extending from the first to the second position inclusive. The #domain descriptor indicates distinct functional regions that may be of separate evolutionary origin.
The #peptide and #protein descriptors indicate regions corresponding to the mature protein or peptides derived from the sequence shown. The #region descriptor is used to indicate all other regions.

Appendix B

One- and Three-letter Amino Acid Abbreviations

     A    Ala    Alanine
     C    Cys    Cysteine
     D    Asp    Aspartic acid
     E    Glu    Glutamic acid
     F    Phe    Phenylalanine
     G    Gly    Glycine
     H    His    Histidine
     I    Ile    Isoleucine
     K    Lys    Lysine
     L    Leu    Leucine
     M    Met    Methionine
     N    Asn    Asparagine
     P    Pro    Proline
     Q    Gln    Glutamine
     R    Arg    Arginine
     S    Ser    Serine
     T    Thr    Threonine
     V    Val    Valine
     W    Trp    Tryptophan
     Y    Tyr    Tyrosine
     B    Asx    Asp or Asn, not distinguished
     Z    Glx    Glu or Gln, not distinguished
     X    X      Undetermined or atypical amino acid

These abbreviations conform to those suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, J. Biol. Chem. 243, 3557-3559, 1968.

Appendix C

Punctuation in Protein Sequences

          Two adjacent amino acids, with no punctuation between, indicates that they are connected, as determined experimentally.

     ()   Encloses a region, the composition but not the complete sequence of which has been determined experimentally, or encloses a single residue that has been tentatively identified.

     =    Indicates )(, the juxtaposition of two regions of indeterminate sequence, while preserving proper spacing between amino acids.

     /    Indicates that the adjacent amino acids are from different peptides, not necessarily connected. When the amino end of a protein has not been determined, / precedes the first residue. When the carboxyl end has not been determined, / follows the last residue. When )/, /(, or )/( are needed, only / is used.

     .    Outside of parentheses, indicates the ends of sequenced fragments. The relative order of these fragments was not determined experimentally but is clear from homology or other indirect evidence.

     .    Within parentheses, indicates that the amino acid to its left has been placed with at least 90% confidence by homology with known sequences.

     ,    Indicates that the amino acid to its left could not be positioned with confidence by homology.

Appendix D

Molecule_type and Cross_references Symbols

Molecule_type Symbols
---------------------

The following constitute the set of valid symbols used in the molecule_type subitem of the reference. They indicate the type of molecule whose experimental sequence determination was reported.

     DNA
     RNA
     genomic RNA
     mRNA
     nucleic acid
     protein

Cross_references Symbols
------------------------

The following symbols are used in the '##cross-references' subitem of the reference to specify the database.

     GB     - GenBank(R) Genetic Sequence Data Bank
     EMBL   - EMBL Data Library, Nucleotide Sequence Database
     DDBJ   - DNA Data Bank of Japan
     PDB    - Brookhaven National Laboratory's Protein Data Bank
     CAS    - Chemical Abstracts Services registry numbers
     NCBIP  - NCBI primary sequence database from protein (backbone)
     NCBIN  - NCBI primary sequence database from nucleic acid

Appendix E

CODATA Formatted Entry Template

ENTRY                   Beginning-of-Entry
TITLE
ALTERNATE_NAMES
CONTAINS
ORGANISM                #formal_name ... #common_name ...
DATE                    #sequence_revision ... #text_change ...
ACCESSIONS
REFERENCE               REFERENCE BLOCK (repeated)
   #authors
   #journal | #book | #submission | #citation
   #title | #description
   #cross-references
   #contents
   #note
   #accession           ACCESSION BLOCK (repeated)
      ##status
      ##molecule_type
      ##residues
      ##label
      ##cross-references
      ##genetics
      ##note
COMMENT                 COMMENTS (repeated)
GENETICS                GENETICS BLOCK (repeated)
   #gene
   #map_position
   #genome
   #genetic_code
   #start_codon
   #introns
   #note
CLASSIFICATION          #superfamily
KEYWORDS
FEATURE
SUMMARY                 #length ... #molecular-weight ... #checksum
SEQUENCE
///