ATLAS - User's Guide Order Number: ATPD-1295 ATLAS The Atlas Multidatabase Information Retrieval System (ATLAS), developed by the National Biomedical Research Foundation (NBRF), is a retrieval program specifically designed to access macromolecular sequence databases. This version of the program has been configured to provide simultaneous retrieval from all (or a subset) of the databases on this CD-ROM. Full support is provided for VAX/VMS, OpenVMS AXP, OSF/1 AXP, ULTRIX (RISC), SunOS, IRIX, and PC/DOS systems. A preliminary version for the Macintosh is also included. The ATLAS CD-ROM contains the Atlas retrieval system, the FASTA database searching program, the PIR®- International Protein Sequence Database, the MIPS PATCHX Merged Protein Sequence Database, the NRL_3D Sequence-Structure Database, the ALN Protein Sequence Alignment Database, the RESID database of protein structure modifications, the JIPID E. coli Database and indexes to the GenBank Genetic Sequence Databank. Document Date: January 30, 1996 Document Version: 14.0 Operating Systems: VAX/VMS Version 5.5 Alpha OpenVMS & OSF/1 ULTRIX Version 4.3 SunOS Version 4.1.3 SGI IRIX Version V.4 DOS Version 3.0 or higher Macintosh Operating System Version 7.5 ii __________ Copyright ©1995 and 1996 National Biomedical Research Foundation We have made every effort to ensure the proper functioning of the ATLAS program. We cannot be responsible for the consequences to users of any errors resulting from its use. National Biomedical Research Foundation Georgetown University Medical Center 3900 Reservoir Road, N.W. Washington, D.C. 20007 USA Phone: (202) 687-2121 FAX: (202) 687-1662 E-mail addresses: PIRMAIL@NBRF.GEORGETOWN.EDU ________________ ® PIR is a registered trademark of NBRF. _______________________________________________________ Contents _________________________________________________ PREFACE v _______________________________________________________ Part I THE DATABASES _______________________________________________________ CHAPTER 1 DATABASE TERMINOLOGY 1-1 _________________________________________________ 1.1 ENTRY 1-1 _________________________________________________ 1.2 DATABASE 1-4 _________________________________________________ 1.3 IDENTIFYING AN ENTRY 1-5 _________________________________________________ 1.4 THE TERM INDEX 1-6 _________________________________________________ 1.5 ACTIVE DATABASES 1-8 _________________________________________________ 1.6 THE CURRENT LIST 1-9 iii Contents _______________________________________________________ CHAPTER 2 DATABASE DESCRIPTIONS 2-1 _________________________________________________ 2.1 THE PIR-INTERNATIONAL PROTEIN SEQUENCE DATABASE 2-1 _________________________________________________ 2.2 THE PATCHX MERGED SEQUENCE DATABASE 2-5 _________________________________________________ 2.3 THE NRL_3D SEQUENCE-STRUCTURE DATABASE 2-7 _________________________________________________ 2.4 THE PIR-ALN PROTEIN SEQUENCE ALIGNMENT DATABASE 2-8 _________________________________________________ 2.5 THE RESID DATABASE OF AMINO ACIDS RESIDUES 2-11 _________________________________________________ 2.6 THE ECOLI ESCHERICHIA COLI DNA DATABASE 2-15 _________________________________________________ 2.7 THE GENBANK NUCLEIC ACID SEQUENCE DATABANK 2-16 _______________________________________________________ Part II THE ATLAS PROGRAM iv Contents _______________________________________________________ CHAPTER 3 OVERVIEW OF THE ATLAS PROGRAM 3-1 _________________________________________________ 3.1 TEXT SEARCHING COMMANDS 3-2 _________________________________________________ 3.2 SEQUENCE SEARCHING COMMANDS 3-4 _________________________________________________ 3.3 DISPLAY COMMANDS 3-5 _________________________________________________ 3.4 FILE INTERFACE COMMANDS 3-5 _________________________________________________ 3.5 UTILITY COMMANDS 3-5 _________________________________________________ 3.6 COMMAND MODIFIERS 3-6 _________________________________________________ 3.7 SPECIAL CONTROL CHARACTERS 3-9 _________________________________________________ 3.8 THE PC MENU SYSTEM 3-9 _______________________________________________________ CHAPTER 4 THE ATLAS COMMANDS, DETAILED DESCRIPTIONS 4-1 ACCESSION 4-2 AUTHOR 4-6 BASES 4-10 COPY 4-14 CROSS 4-19 DEFINE 4-23 EXTRACT 4-28 FEATURE 4-33 FIND 4-39 v Contents GENE 4-45 GET 4-49 HELP 4-52 JOURNAL 4-54 KEYWORD 4-58 LIST 4-62 MATCH 4-64 MEMBERS 4-70 PRINT 4-73 QUIT 4-74 REFERENCE 4-75 REPORT 4-79 SCAN 4-84 SEARCH 4-87 SELECT 4-91 SET 4-97 SHOW 4-99 SPECIES 4-102 SUPERFAMILY 4-106 SFNUM 4-110 TAXONOMY 4-114 TYPE 4-120 _______________________________________________________ Part III THE FASTA DATABASE SEARCHING PROGRAM _______________________________________________________ CHAPTER 5 OVERVIEW OF THE FASTA DATABASE SEARCHING PROGRAM 5-1 _________________________________________________ 5.1 INTRODUCTION 5-1 _________________________________________________ 5.2 STEPS FASTA USES IN A COMPARISON 5-2 vi Contents _______________________________________________________ CHAPTER 6 RUNNING THE FASTA PROGRAM 6-1 _________________________________________________ 6.1 FASTA OPTIONS 6-1 _________________________________________________ 6.2 THE OUTPUT 6-6 _______________________________________________________ APPENDIX A COMMANDS AND COMMAND MODIFIERS OF ATLAS A-1 _______________________________________________________ APPENDIX B ONE- AND THREE-LETTER AMINO ACID ABBREVIATIONS B-1 _______________________________________________________ APPENDIX C PUNCTUATION IN PROTEIN SEQUENCES C-1 _______________________________________________________ APPENDIX D PIR-INTERNATIONAL PROTEIN SEQUENCE DATABASE ENTRY FORMAT D-1 _________________________________________________ D.1 HEADER D-2 _________________________________________________ D.2 TITLE D-2 _________________________________________________ D.3 SEQUENCE D-3 _________________________________________________ D.4 TEXT D-4 vii Contents D.4.1 Alternate names specification _ D-4 D.4.2 Contains line _________________ D-5 D.4.3 Species line __________________ D-5 D.4.4 Species Note record ___________ D-6 D.4.5 Date record ___________________ D-6 D.4.6 Entry-specific Accession record ________________________ D-6 D.4.7 Author/citation records _______ D-7 D.4.8 Authors record ________________ D-7 D.4.9 Reference Title record ________ D-8 D.4.10 Reference number record _______ D-8 D.4.11 Contents record _______________ D-8 D.4.12 Reference Note record _________ D-9 D.4.13 Reference-specific Accession records _______________________ D-9 D.4.14 Accession Status record _______ D-10 D.4.15 Molecule type record __________ D-10 D.4.16 Residues record _______________ D-10 D.4.17 Cross-references record _______ D-11 D.4.18 Accession Genetics record _____ D-11 D.4.19 Accession Note record _________ D-12 D.4.20 Comment records _______________ D-12 D.4.21 Genetics record _______________ D-12 D.4.22 Gene record ___________________ D-13 D.4.23 Map position record ___________ D-13 D.4.24 Genome record _________________ D-14 D.4.25 Genetic code record ___________ D-14 D.4.26 Start codon record ____________ D-14 D.4.27 Introns record ________________ D-15 D.4.28 Genetics Note record __________ D-15 D.4.29 Function record _______________ D-15 D.4.30 Function Description record ___ D-16 viii Contents D.4.31 Superfamily record ____________ D-16 D.4.32 Keywords record _______________ D-16 D.4.33 Feature record ________________ D-17 _______________________________________________________ INDEX _______________________________________________________ EXAMPLES 1-1 Typical PIR-International Protein Sequence Database Entry _________________________ 1-1 2-1 Sample ALN Entry ______________ 2-9 2-2 Sample RESID Entry ____________ 2-14 D-1 Format Specification of PIR-International Entry: ______ D-18 _______________________________________________________ FIGURES 4-1 Taxonomic Classification Scheme ________________________ 4-119 _______________________________________________________ TABLES 1-1 Databases and Database-Codes __ 1-4 1-2 Indexed Text Fields ___________ 1-7 2-1 Non-PIR Databases used to generate PATCHX _______________ 2-5 3-1 Text Searching Commands _______ 3-3 3-2 Sequence Searching Commands ___ 3-4 3-3 Display Commands ______________ 3-5 3-4 File Interface Commands _______ 3-5 3-5 Utility Commands ______________ 3-6 ix Contents 3-6 Command Modifiers _____________ 3-7 3-7 Main Menu Selections __________ 3-10 3-8 OPTION Submenu ________________ 3-11 4-1 ACCESSION Command Modifiers ___ 4-3 4-2 AUTHOR Command Modifiers ______ 4-7 4-3 BASES Command Modifiers _______ 4-11 4-4 COPY Command Modifiers ________ 4-15 4-5 CROSS Command Modifiers _______ 4-20 4-6 DEFINE Command Modifiers ______ 4-24 4-7 EXTRACT Command Modifiers _____ 4-30 4-8 FEATURE Command Modifiers _____ 4-34 4-9 FIND Command Modifiers ________ 4-40 4-10 GENE Command Modifiers ________ 4-46 4-11 GET Command Modifiers _________ 4-50 4-12 HELP Command Modifiers ________ 4-52 4-13 JOURNAL Command Modifiers _____ 4-55 4-14 KEYWORD Command Modifiers _____ 4-59 4-15 LIST Command Modifiers ________ 4-63 4-16 MATCH Command Modifiers _______ 4-67 4-17 MEMBERS Command Modifiers _____ 4-71 4-18 REFERENCE Command Modifiers ___ 4-76 4-19 Expression Operators __________ 4-80 4-20 Expression Fields _____________ 4-81 4-21 REPORT Command Modifiers ______ 4-83 4-22 SCAN Command Modifiers ________ 4-85 4-23 SEARCH Command Modifiers ______ 4-88 4-24 Expression Operators __________ 4-92 4-25 Expression Fields _____________ 4-93 4-26 SELECT Command Modifiers ______ 4-95 4-27 SET Command Modifiers _________ 4-97 4-28 SHOW Command Modifiers ________ 4-99 4-29 SPECIES Command Modifiers _____ 4-103 4-30 SUPERFAMILY Command Modifiers _ 4-107 4-31 SFNUM Command Modifiers _______ 4-112 x Contents 4-32 SELECT Command Modifiers ______ 4-115 4-33 Taxonomic Class Codes (Viruses) _____________________ 4-116 4-34 Taxonomic Class Codes (Prokaryotes) _________________ 4-117 4-35 Taxonomic Class Codes (Eukaryotes) __________________ 4-118 4-36 TYPE Command Modifiers ________ 4-123 B-1 One- and Three-letter Amino Acid Abbreviations ____________ B-1 xi _______________________________________________________ Preface The ATLAS multidatabase retrieval program is designed to retrieve and manipulate information contained in the PIR-International Protein Sequence Database as well as other databases that have been reformatted to either the NBRF, GenBank, or EMBL formats. The ATLAS program is an outgrowth of two programs that were developed by NBRF: the Protein Sequence Query (PSQ) program and the Nucleic Acid Query (NAQ) program. These programs have been distributed since about 1980 and are in use at a large number of sites around the world. Users of these programs will be familiar with many of the concepts in the ATLAS program. Some of the major new features of the ATLAS program are: o operates on amino acid sequence, nucleotide sequence, and structured text databases; o the identifiable fields in the annotation (e.g., title, author name, species name, keywords) have been indexed; therefore, searching for terms that occur in these fields is much faster than in the PSQ and NAQ programs; o the index of terms mentioned above is constructed from many databases and the search commands search all of these databases simultaneously, thereby eliminating the need to repeat the same query in each database of interest. The ATLAS program was developed by the NBRF with the cooperation of the following: William A. Gilbert of the University of New Hampshire in Durham, New Hampshire; Alex Reisner and Carolyn Bucholtz of Sydney University in Australia; and Jack London of the Thomas Jefferson University in Philadelphia, Pennsylvania. v Preface Program development was partially supported by NLM LM05206, by NSF BIR-9107540, and by Digital Equipment Corporation. The PIR-International Protein Sequence Database is produced cooperatively by the Protein Information Resource (PIR), the Martinsried Institute for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). This database may be referred to as the PIR database or the Protein Sequence Database in the manual. The FASTA program package was supplied by William R. Pearson, Department of Biochemistry, University of Virginia, Charlottesville, Virginia. __________________________________________________________________ Intended Audience This manual is intended for all users of the ATLAS program; however, it is not intended to be a tutorial. __________________________________________________________________ Examples The examples in this document were generated using the database configuration at PIR-NBRF as of the date of this document. The same examples run using other database configurations may not produce the same results. Most examples were run using the PIR1 dataset of the PIR-International Protein Sequence Database, Release 47.xx, as the active database. vi _______________________________________________________ Part I The Databases This part of the manual contains material to familiarize the user with the terminology that relates to the sequence databases and descriptions of each database included on the CD-ROM _______________________________________________________ 1 Database Terminology This chapter defines terminology that relates to the structure of the databases and the information contained therein. __________________________________________________________________ 1.1 Entry An entry is a collection of information pertaining to one particular amino acid or nucleotide sequence. This information is organized into three main sections: o Title - contains the sequence name and the species name o Text - contains the annotation that is associated with the sequence o Sequence - contains the amino acid or nucleotide sequence. In the alignment database this section is absent. In addition, each entry has assigned to it an entry- code. The entry-code is used to uniquely identify an entry within a database. Example 1-1 Typical PIR-International Protein Sequence Database Entry _______________________________________________________ _______________________________________________________ Example 1-1 Cont'd on next page 1-1 Database Terminology Example 1-1 (Cont.) Typical PIR-International Protein Sequence Database Entry _______________________________________________________ 1 PIR1:CCHU 2 cytochrome c - human 3 Species: Homo sapiens (man) Date: #sequence_revision 30-Sep-1991 #text_change 11-Dec-1993 Accession: A31764; A05676; A00001 Evans, M.J.; Scarpulla, R.C. Proc. Natl. Acad. Sci. U.S.A. 85, 9625-9629, 1988 Title: The human somatic cytochrome c gene: two classes of processed pseudogenes demarcate a period of rapid molecular evolution. Reference number: A31764; MUID:89071748 Accession: A31764 Molecule type: DNA Residues: 1-105 Cross-references: GB:M22877 Matsubara, H.; Smith, E.L. J. Biol. Chem. 238, 2732-2753, 1963 Title: Human heart cytochrome c. Chymotryptic peptides, tryptic peptides, and the complete amino acid sequence. Reference number: A05676 Accession: A05676 Molecule type: protein Residues: 2-28;29-46;47-100;101-105 Matsubara, H.; Smith, E.L. J. Biol. Chem. 237, 3575-3576, 1962 Title: The amino acid sequence of human heart cytochrome c. Reference number: A00001 Note: 66-Leu is found in 10% of the molecules in pooled protein. Genetics: Introns: 57/1 _______________________________________________________ Example 1-1 Cont'd on next page 1-2 Database Terminology Example 1-1 (Cont.) Typical PIR-International Protein Sequence Database Entry _______________________________________________________ Superfamily: cytochrome c Keywords: acetylation; electron transfer; heme; mitochondrion; oxidative phosphorylation; polymorphism; respiratory chain Residues Feature 2-105 Protein: cytochrome c #status experimental 2 Modified site: acetylated amino end (Gly) (in mature form) #status experimental 15,18 Binding site: heme (Cys) (covalent) #status experimental 19,81 Binding site: heme iron (His, Met) (axial ligands) #status predicted Composition 6 Ala A 2 Gln Q 6 Leu L 2 Ser S 2 Arg R 8 Glu E 18 Lys K 7 Thr T 5 Asn N 13 Gly G 4 Met M 1 Trp W 3 Asp D 3 His H 3 Phe F 5 Tyr Y 2 Cys C 8 Ile I 4 Pro P 3 Val V Mol. wt. unmod. chain = 11,749 Number of residues = 105 4 5 10 15 20 25 30 1 M G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G 31 P N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I I W 61 G E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E 91 E R A D L I A Y L K K A T N E ------ In this example: 1 PIR1:CCHU is the entry code. 2 is the title section. 3 is the text section containing the annotation. _______________________________________________________ Example 1-1 Cont'd on next page 1-3 Database Terminology Example 1-1 (Cont.) Typical PIR-International Protein Sequence Database Entry _______________________________________________________ 4 is the sequence section containing the amino acid sequence. _______________________________________________________ __________________________________________________________________ 1.2 Database A database is a collection of entries. Each database has assigned to it a database-code. Table 1-1 shows the databases and the database-codes that are currently on the ATLAS CD-ROM. Table_1-1__Databases_and_Database-Codes________________ Database- code[1]________Database________________________________ PIR1 Section 1. Annotated and Classified Entries PIR2 Section 2. Annotated Entries PIR3 Section 3. Unverified Entries PATCHX MIPSX Merged Sequence Database (minus PIR 1+2+3) NRL_3D NRL Sequence Structure Database ALN Database of Protein Family Alignments RESID Residues Database ECOLI Escherichia coli DNA Database GBBCT GenBank Bacterial (Locus and Title) GBEST GenBank EST (Locus and Title) GBINV GenBank Invertebrate (Locus and Title) _______________________________________________________ [1]Database-codes are not fixed and can be changed; therefore, the database-codes in use at your site may differ from those shown. 1-4 Database Terminology Table_1-1_(Cont.)__Databases_and_Database-Codes________ Database- code[1]________Database________________________________ GBMAM GenBank Other Mammalian (Locus and Title) GBPAT GenBank Patent (Locus and Title) GBPHG GenBank Phage (Locus and Title) GBPLN GenBank Plant (Locus and Title) GBPRI GenBank Primate (Locus and Title) GBRNA GenBank Structural RNA (Locus and Title) GBROD GenBank Rodent (Locus and Title) GBSYN GenBank Synthetic and Chimeric (Locus and Title) GBUNA GenBank Unannotated (Locus and Title) GBVRL GenBank Viral (Locus and Title) GBVRT GenBank Other Vertebrate (Locus and Title) GBNEW GenBank NEW Sequences _______________________________________________________ [1]Database-codes are not fixed and can be changed; therefore, the database-codes in use at your site may differ from those shown. _______________________________________________________ __________________________________________________________________ 1.3 Identifying an Entry An entry-identifier specifies the entry to be retrieved for processing using one of the many commands of ATLAS that process individual entries. This information is provided to the program by supplying the entry-code for the sequence entry and the database-code for the database containing the entry. The correct format for specifying an entry in a database is: database-code, colon, entry-code with no intervening blanks. 1-5 Database Terminology The program always maintains a set of default databases, called the active databases (see Section 1.5). If an entry-code is given without a database-code, the program looks for the entry in the active databases. If two, or more, entries in different databases have the specified entry-code, they will all be found. When the database-code is specified, the entry can be retrieved from either an active database or an inactive database (see Section 1.4). The following entry-identifier PIR1:CCHU identifies entry CCHU in database PIR1. The command line to display entry PIR1:CCHU is shown below. ATLAS> TYPE PIR1:CCHU __________________________________________________________________ 1.4 The Term Index In addition to accessing individual entries as described in the previous section, one must be able to search the entire database for entries. The ATLAS program facilitates this via an index containing all retrievable terms. Upon examination of the entry shown in Example 1-1, it will be noticed that the text section is formatted so certain categories of information, called fields, are easily identified. For example, the species name, the accession number, the superfamily name, and the keyword fields are labeled with the corresponding field name. Note that the fields may contain multiple values: each value is indexed separately. 1-6 Database Terminology Because of this special formatting it is possible to create indexes of the terms that occur in these and other fields. The indexes can then be searched to generate a list of entries that contain a particular term. For example, a user could: o Search the author index to generate a list of entries that reference author "Adelman, J P" o Search the superfamily index to generate a list of entries that reference superfamily "proenkephalin" o Search the keyword index to generate a list of entries that reference keyword "cell adhesion" Table 1-2 lists all the fields that are currently indexed for the PIR-International databases. Whenever possible, these same fields are indexed for the non- PIR-International databases. Table_1-2__Indexed_Text_Fields_________________________ Search Field_Name____________Field_Contents______Command______ ACCESSION accession number ACCESSION AUTHOR author name AUTHOR CROSS_REF cross-reference CROSS number FEATURE feature name FEATURE GENE_NAME gene name GENE JOURNAL journal citation JOURNAL KEYWORD keyword KEYWORD MEMBERS alignment member MEMBERS REFERENCE reference number REFERENCE SPECIES species name SPECIES SUPERFAMILY superfamily name SUPERFAMILY SUPFAM_NUM superfamily number SFNUM 1-7 Database Terminology Table_1-2_(Cont.)__Indexed_Text_Fields_________________ Search Field_Name____________Field_Contents______Command______ TITLE entry title FIND[1] _______________________________________________________ [1]FIND in the menu mode of the PC version is the command to search any of the indexed text fields. The fields are listed in a submenu under FIND. _______________________________________________________ The indexes for all the fields[1] and for all databases are combined to form the term index. For each field that occurs in the term index system, there is a corresponding command in ATLAS to search that index. These commands are collectively called the text searching commands. The text searching commands can only search the databases that are included in the term index. __________________________________________________________________ 1.5 Active Databases Any or all of the databases included in the term index can be processed by the text searching commands. Initially all databases are active. To selectively activate databases, use the database-list parameter of the BASES command to specify the list of databases to become active. ________________ [1] Not all the fields shown in Table 1-2 occur in all databases. If a databases does not have a particular field, then it does not occur in the term index. If the database is active when searching this index a message will appear warning the user that not all active databases are included in the index. 1-8 Database Terminology __________________________________________________________________ 1.6 The Current List The current list is a subset of the entries in the active databases that has been selected by the operation of a command or series of commands. The current list can be acted upon by many of the other commands and provides a facility to isolate and manipulate selected portions of the databases. Text and data searching commands accept four modifiers that facilitate logical manipulation of the current list. These are the /CURRENT, /SUBTRACT, /ADD, and /KEEP modifiers (see Section 3.6). The command LIST/RESTORE allows the restoration of the current list that existed before the last command operation. This allows for easy recovery when the user is not satisfied with the result of the command operation or when the current list is truncated as a result of a CTRL-C operation (see Section 3.7). The current list can also be saved to a file. When the entry codes for the current list are saved with the LIST/OUTPUT=file-spec, it can be generated again quickly with the GET command. 1-9 _______________________________________________________ 2 Database Descriptions This chapter provides a description of each of the databases contained on the CD-ROM. __________________________________________________________________ 2.1 The PIR-International Protein Sequence Database The Protein Sequence Database was initiated at the NBRF in the early 1960's by the late Margaret O. Dayhoff as a collection of sequences for the study of evolutionary relationships among proteins. The database is now an international collaboration of three data centers: the NBRF, the Martinsried Institute for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). The three centers cooperate to produce and distribute a single database of `wild-type' protein sequences. Currently the NBRF effort is supported as the Protein Information Resource (PIR) funded in part by the National Library of Medicine. The database contains information concerning all naturally occurring, wild-type proteins whose primary structure (the sequence) is known. A major goal of the database project is to provide comprehensive, nonredundant data uniquely organized by homology and taxonomy. In addition to sequence data, the database contains information (called annotation) concerning: (1) the name and classification of the protein and the organism in which it naturally occurs; (2) references to the primary literature, including information concerning the sequence determination; (3) the function and general characteristics of the protein, including gene expression, post-translational processing, and activation; and (4) sites and regions 2-1 Database Descriptions of biological interest within the sequence. The database is also unique in maintaining consistency of annotation with restricted vocabularies employed for features and keywords. Data are accumulated from the published literature, by submissions to PIR- International, and by translation of nucleic acid sequences submitted to GenBank, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, and the DNA Data Base of Japan (DDBJ). These data include those deposited with the Genome Sequence Data Base of the National Center for Genome Resources. Entries in the database are cross-referenced to these source databases. In addition cross-references are included to the Genome Data Base (GDB), the yeast gene name LISTA database, and MEDLINE. Work is currently underway to cross-reference to the Drosophila genome database (FlyBase), the Brookhaven Protein Data Bank (PDB), and the Complex Carbohydrate Structure Database (CCSD) of the international CarbBank project. Conceptually the database consists of three primary components: the Literature, Source Sequence, and Canonical Sequence Components. The Literature component contains full citation information for all sources of information in the database and is linked to MEDLINE abstracts via MEDLINE MUIDs. Each citation is uniquely identified by PIR-International Reference Number. Each reported sequence is stored in its originally published form in the Source Sequence Component and is assigned a PIR-International Accession Number that uniquely identifies it. These unmerged source sequences are cross-referenced to the corresponding nucleic acid sequence when present in the GenBank/EMBL/DDBJ nucleic acids sequence databases. In the near future, cross-references will be added to link directly to CDS IDs, providing a stable link between the conceptual translation and the corresponding coding region specification. The Canonical Sequence Component (which encompasses the PIR1, PIR2, and PIR3 data sections) is constructed 2-2 Database Descriptions by assembling source sequences that represent the same molecule into merged entries, which display a single canonical sequence and instructions for regeneration of each original source sequence. Hence, all information concerning the various reports of the sequence is stored in a compacted, nonredundant form, while remaining directly accessible to users of the database. This architecture allows the competing goals of completeness and nonredundancy to be addressed seamlessly. Canonical sequences within the database are organized by placement numbers reflecting their similarity to other sequences in the database known to be homologous. Secondarily, the data are organized by species and by protein type (proteins having the same name). Originally, the separation of the database into data sections PIR1, PIR2, and PIR3 reflected the level of data processing and classification. Incompletely processed data are now designated by the status preliminary, listed within the reference portion of the entry, and may occur in any section of the database. In a sense no entry is ever completely processed: if new information becomes available, it is merged into the entry. Further, entries are continually monitored and revised as appropriate to reflect the most current biological understanding of the data. The status preliminary generally indicates that the paper reporting the sequence has not been fully analyzed. The partitioning of the entries into sections PIR1, PIR2, and PIR3 has been retained for ease of physical access to the data in the file distribution form (each file is limited to 64K entries) but this has no other significance. Entry codes are unique across all sections (PIR1-PIR4). For convenience, classified entries, found in PIR1, are ordered by placement number (and by species and protein type within placement classes); nonclassifed entries are ordered by species and taxonomy. These partitionings will be 2-3 Database Descriptions adjusted as appropriate in each release; therefore, section location is not a stable attribute of the entries. The PIR4 Section has been recently introduced. It is not a regular component of the database but has been created to make available sequences that are not naturally occurring and/or naturally expressed. This includes conceptual translations of pseudogenes and other nonexpressed potential genomic coding regions, engineered and chemically synthesized sequences, and sequences of natural polypeptides that are not ribosomally synthesized. Effort will not be made to collect these data; however, they are often accumulated during routine data processing, in which case they will be stored and made available in PIR4. Sequences of these types that occur within PDB entries will be accumulated comprehensively and cross-referenced to PDB. These and all subsequent changes in the database format are described in the PIR Technical Development Bulletin, sent to an E-mail distribution list. More information regarding the format and content of PIR entries can be found in Chapter 1, Database Terminology and in Appendix D, PIR-International Protein Sequence Database Entry Format. Correspondence regarding the PIR database should be directed to: PIR Technical Services Coordinator National Biomedical Research Foundation 3900 Reservoir Road, NW Washington, DC 20007 USA E-mail: pirmail@nbrf.georgetown.edu 2-4 Database Descriptions __________________________________________________________________ 2.2 The PATCHX Merged Sequence Database The PATCHX database is produced by MIPS and includes all protein sequences not identical with or contained in sequences from PIR1, PIR2 and PIR3. It was created to supplement the PIR database in providing a comprehensive dataset for searching and to help in identifying sequences that need to be added to the PIR database. PATCHX is generated by sequentially merging entries in a manner that eliminates an entry whenever there is already an entry containing exactly the same sequence. Table_2-1__Non-PIR_Databases_used_to_generate_PATCHX___ Database______Codes[1]____Description__________________ MIPSOwn Original MIPS preliminary entries (parts 1 and 2) PIRMOD Original MIPS/PIR preliminary entries (parts 1 and 2) MIPSTrn Original MIPS preliminary translations NRL_3D Orignal Brookhaven Data Bank sequences SwissProt Original SwissProt entries EMTrans Original EMBL automatic translations (parts 1, 2, and 3) _______________________________________________________ [1]The original entry codes are used whenever possible. For EMTrans and GBTrans, the first six characters of the entry code correspond to the primary accession number of the EMBL or GenBank entry. Code conflicts, of which a few hundred occur, require the assignment of special codes; these start with ``ZZ_''. For technical reasons, MIPS-assigned entry codes are still used for PSeqiP. 2-5 Database Descriptions Table 2-1 (Cont.) Non-PIR Databases used to generate ___________________PATCHX______________________________ Database______Codes[1]____Description__________________ GBTrans Original GenBank automatic (parts 1, 2, translations and 3) Kabat Original Kabat entries PSeqIP L... NEWAT M... PSD _______________________________________________________ [1]The original entry codes are used whenever possible. For EMTrans and GBTrans, the first six characters of the entry code correspond to the primary accession number of the EMBL or GenBank entry. Code conflicts, of which a few hundred occur, require the assignment of special codes; these start with ``ZZ_''. For technical reasons, MIPS-assigned entry codes are still used for PSeqiP. _______________________________________________________ PATCHX provides a good dataset to supplement the PIR database for sequence searching but, due to the manner in which it is constructed, the user should take note of the following: o All sequences that are IDENTICAL within or between databases are present ONCE. Duplicate sequences and sequences that were completely contained within others (subsequences) have been eliminated according to the priority (top to bottom) in the table above. Because entries with identical sequences have been eliminated, independent reports of the same correct sequences may be eliminated; therefore, this database may not yield a complete bibliography for a given sequence. o A black list of invalid sequences is kept. Black list sequences are removed from PATCHX to reduce noninformative redundancy. The EST sections of EMBL and GenBank are not considered for EMTrans 2-6 Database Descriptions and GBTrans. In addition, sequences with greater than 94% sequence identity to an entry in the PIR- International Protein Sequence Database (PIR1, PIR2, PIR3) (data supplied by K. Heumann, S. Liebl, W. Peng, and H.W. Mewes) have been removed from PATCHX. o The MIPSOwn, PIRMOD, and MIPSTrn databases contain preliminary data that should be used with extreme caution. Correspondence regarding PATCHX should be directed to: Dr. Friedhelm Pfeiffer MIPS at the Max Planck Inst. for Biochemistry 8033 Martinsried, Germany E-mail: pfeiffer@ehpmic.mips.biochem.mpg.de __________________________________________________________________ 2.3 The NRL_3D Sequence-Structure Database The NRL_3D database is produced by PIR from sequence and annotation information extracted from the Brookhaven Protein Databank (PDB) of crystallographic structures. This database makes the sequence information in PDB available for similarity searches and retrieval, and provides cross-reference information for use with the PIR. The titles and biological sources of the entries have been changed from PDB to conform to the nomenclature standards used in the PIR. The bibliographic references are included and some appear with the reference numbers and MEDLINE cross-references they have in corresponding PIR entries. Secondary structure, active site, binding site, and modified site annotations in PDB appear as in the corresponding PIR features. Information on experimental method, resolution and R-factor are included along with keywords. 2-7 Database Descriptions The ATLAS program is used for the retrieval and display of the entries in the NRL_3D database. Please see the AUTHOR, CROSS, FEATURE, FIND, JOURNAL, KEYWORD, LIST, MATCH, REFERENCE, SCAN, SPECIES, and TYPE commands in Chapter 4, The ATLAS Commands, Detailed Descriptions. Correspondence regarding NRL_3D should be directed to: Dr. John S. Garavelli PIR Database Coordinator National Biomedical Research Foundation 3900 Reservoir Road, NW Washington, DC 20007 USA E-mail: Garavelli@NBRF.Georgetown.EDU __________________________________________________________________ 2.4 The PIR-ALN Protein Sequence Alignment Database PIR-ALN is a database of protein alignments produced by L.-S. Yeh and G.Y. Srinivasarao of PIR. Alignments are of sequences in the same family (less than 55% different from each other), or of sequences representing various families within a superfamily, or of sequence segments corresponding to the same homology domain in different proteins. As of September 1995, PIR-ALN contained 257 homology domain alignments. These effectively define the homology domains so that they can be consistently represented in the entry annotations. The information contained in an ALN entry is divided into seven sections. The sections are listed below in the order in which they occur in the entry. 1 Header - Information that marks the first line of an entry 2 Title - The title of the alignment 3 Date - Creation and revision dates 2-8 Database Descriptions 4 Members - The entry codes of sequences used in the alignment 5 Members Titles - Titles of sequences used in the alignment 6 Alignment - The alignment of sequences 7 Matrix - The matrix of percent differences In the alignment, the completely conserved residues are marked by ` * ' and partially conserved residues are marked by ` . ' under the alignment. In the matrix of percent differences, the upper portion of the matrix gives the number of differences between the sequences; the lower portion represents the percent differences. The ATLAS program is used for the retrieval and display of the alignment entries in the ALN database. Please see the FIND, MEMBERS, and TYPE commands in Chapter 4, The ATLAS Commands, Detailed Descriptions. Example 2-1 Sample ALN Entry _______________________________________________________ >TX;FA0003 serum albumin - family 1.0 Date: 08-Feb-1995; Revision: 08-Feb-1995 Members: ABBOS; ABHUS; ABRTS; ABPGS; ABSHS; ABHOS ABBOS serum albumin precursor - bovine ABHUS serum albumin precursor - human ABRTS serum albumin precursor - rat ABPGS serum albumin precursor - pig (fragment) ABSHS serum albumin precursor - sheep ABHOS serum albumin precursor - horse Alignment: _______________________________________________________ Example 2-1 Cont'd on next page 2-9 Database Descriptions Example 2-1 (Cont.) Sample ALN Entry _______________________________________________________ ABBOS MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEQFKGLVLIAFSQYLQQCPF ABHUS MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPF ABRTS MKWVTFLLLLFISGSAFSRGVFRREAHKSEIAHRFKDLGEQHFKGLVLIAFSQYLQKCPY ABPGS --WVTFISLLFLFSSAYSRGVFRRDTYKSEIAHRFKDLGEQYFKGLVLIAFSQHLQQCPY ABSHS MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFNDLGEENFQGLVLIAFSQYLQQCPF ABHOS MKWVTFVSLLFLFSSAYSRGVLRRDTHKSEIAHRFNDLGEKHFKGLVLVAFSQYLQQCPF ****..**....**.****.**...***.****.****..*..***.**.*.**.**. ABBOS DEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEP ABHUS EDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVATLRETYGEMADCCAKQEP ABRTS EEHIKLVQEVTDFAKTCVADENAENCDKSIHTLFGDKLCAIPKLRDNYGELADCCAKQEP ABPGS EEHVKLVREVTEFAKTCVADESAENCDKSIHTLFGDKLCAIPSLREHYGDLADCCEKEEP ABSHS DEHVKLVKELTEFAKTCVADESHAGCDKSLHTLFGDELCKVATLRETYGDMADCCEKQEP ABHOS EDHVKLVNEVTEFAKKCAADESAENCDKSLHTLFGDKLCTVATLRATYGELADCCEKQEP ..*.***.*.*.***.*.***....*.**.******.**....**..**..****.*.** . . . ABBOS PDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLV ABHUS SEKERQIKKQTALVELVKHKPKATKEQLKAVMDDFAAFVEKCCKADDKETCFAEEGKKLV ABRTS PDKEKQIKKQTALAELVKHKPKATEDQLKTVMGDFAQFVDKCCKAADKDNCFATEGPNLV ABPGS PEDEKQIKKQTALVELLKHKPHATEEQLRTVLGNFAAFVQKCCAAPDHEACFAVEGPKFV ABSHS PDTEKQIKKQTALVELLKHKPKATDEQLKTVMENFVAFVDKCCAADDKEGCFVLEGPKLV ABHOS PEDEKQIKKQSALAELVKHKPKATKEQLKTVLGNFSAFVAKCCGREDKEACFAEEGPKLV ...*.*****.**.**.****.**..**..*...*..**.***...*...**..**...* ABBOS VSTQTALA- ABHUS AASQAALGL ABRTS ARSKEALA- ABPGS IEIRGILA- ABSHS ASTQAALA- ABHOS ASSQLALA- ......*. Matrix of percent differences: _______________________________________________________ Example 2-1 Cont'd on next page 2-10 Database Descriptions Example 2-1 (Cont.) Sample ALN Entry _______________________________________________________ Number of differences 1 2 3 4 5 6 1 ABBOS . 146 183 122 47 157 2 ABHUS 24 . 164 147 150 143 3 ABRTS 30 27 . 167 185 168 4 ABPGS 20 24 28 . 131 142 5 ABSHS 8 25 30 22 . 146 6 ABHOS 26 23 28 23 24 . _____________Percent_difference________________________ Correspondence regarding ALN should be directed to: Dr. Lai-Su L. Yeh or Dr. Geetha Y. Srinivasarao Protein Information Resource National Biomedical Research Foundation 3900 Reservoir Road, NW Washington, DC 20007 USA E-mail: yeh@nbrf.georgetown.edu or geetha@nbrf.georgetown.edu __________________________________________________________________ 2.5 The RESID Database of Amino Acids Residues The RESID is a database of protein structure modifications produced by PIR. Due the large and steadily increasing number of protein structure modifications that require standardized annotation in the PIR-International Protein Sequence Database, the RESID database was introduced to assist users and annotators in interpreting features annotations for covalent binding sites, modified sites, and cross- links. The RESID database describes features annotated in the Protein Sequence Database and provides 2-11 Database Descriptions information on systematic chemical names, frequently observed alternate names, Chemical Abstracts Service registry numbers, atomic formulas and weights, and original amino acids that may have the modification. Entries in the RESID database contain this information in the following order. 1 Header - the code number of the entry in the RESID database, these consist of the letters `AA' followed by four digits 2 Title - the name of the amino acid residue 3 Alternate names - alternative names this residue may have in the chemical literature 4 Systematic name - an IUPAC systematic name 5 CAS Registry Number - the Chemical Abstracts Registry Number for the compounds corresponding to the free amino acids (CAS Registry Numbers are copyrighted by the American Chemical Society) 6 Formula - the atomic formula of the residue in the peptide chain 7 Formula weight - the chemical average isotope formula weight and the physical most common isotope formula weight 8 Correction formula - the difference between the residue atomic formula and the atomic formula for the encoded amino acid presented in a protein sequence 9 Correction weight - the differences between the residue chemical average isotope weight and the physical most common isotope weight and those for the encoded amino acid presented in a protein sequence; these correction weight can be used to calculate chemical and mass-spectrographic molecular weights for modified peptides 2-12 Database Descriptions 10 Date - Creation, structure revision and text revision dates for an entry 11 Reference - a reference block as in PIR Protein Sequence Database inluding author names, journal citation, article title, reference number, MEDLINE number and notes on the experimental methods used to detect and identify the residue 12 Comment - notes on the residue including sequence motifs 13 Generating enzyme - the enzyme activities required to produce a post-translationally modified residue 14 Sequence code - the single letter codes for the amino acids which may give rise to the same post- translationally modified residue 15 Conditions - conditions for the occurrence of the residue including whether it is amino-terminal, carboxyl-terminal, secondary or incidental to other modifications, or the number of peptide chains the residue cross-links 16 Abbreviation - the standard IUPAC three-letter code for the encoded amino acids 17 Keywords - the keywords in appearing in the PIR Protein Sequence Database associated with the residue 18 Feature - the feature for this residue as it appears in the PIR Protein Sequence Database The ATLAS program is used for the retrieval and display of the entries in the RESID database. Please see the AUTHOR, CROSS, FEATURE, FIND, JOURNAL, KEYWORD, LIST, REFERENCE, and TYPE commands in Chapter 4, The ATLAS Commands, Detailed Descriptions. 2-13 Database Descriptions Example 2-2 Sample RESID Entry _______________________________________________________ RESID:AA0077 N6-palmitoyl-L-lysine Alternate names: epsilon-palmitoyllysine; N(zeta)-palmitoyllysine; N6-(1-oxohexadecyl)-L-lysine Systematic name: (S)-2-amino-6-(hexadecanoylamino)hexanoic acid Cross-references: CAS:559012-43-0 Formula: C 22 H 42 N 2 O 2 Formula weight: #chem 366.59 #phys 366.3246 Correction formula: C 16 H 30 O 1 Correction weight: #chem 238.42 #phys 238.2297 Date: 31-Mar-1995 #structure_revision 31-Mar-1995 #text_change 31-Mar- 1995 Hackett, M.; Guo, L.; Shabanowitz, J.; Hunt, D.F.; Hewlett, E.L. Science 266, 433-435, 1994 Title: Internal lysine palmitoylation in adenylate cyclase toxin from Bordetella pertussis. Reference number: A55167 Note: mass spectrographic identification Stanley, P.; Packman, L.C.; Koronakis, V.; Hughes, C. Science 266, 1992-1996, 1994 Title: Fatty acylation of two internal lysine residues required for the toxic activity of Escherichia coli hemolysin. Reference number: A55387 Note: radioisotope labeling Generating enzyme: peptidyl-lysine N6-palmitoyltransferase (EC 2.3.1.-) Sequence code: K Conditions: combinable _______________________________________________________ Example 2-2 Cont'd on next page 2-14 Database Descriptions Example 2-2 (Cont.) Sample RESID Entry _______________________________________________________ Keywords: lipoprotein Residues Feature Binding site: palmitate (Lys) (covalent) _______________________________________________________ The RESID database is copyrighted by the National Biomedical Research Foundation and may not be redistributed without prior consent. Correspondence regarding RESID should be directed to: Dr. John S. Garavelli PIR Database Coordinator National Biomedical Research Foundation 3900 Reservoir Road, NW Washington, DC 20007 USA E-mail: Garavelli@NBRF.Georgetown.EDU __________________________________________________________________ 2.6 The ECOLI Escherichia coli DNA Database The ECOLI database was produced at JIPID by T. Kunisawa with the cooperation of L.-S. Yeh of PIR. The database consists of the available genomic nucleic acid sequence data of E. coli K12 compiled from the existing major data collections (GenBank, EMBL, and DDBJ) and from the literature. The sequence data are solely from strain K12, for which genetic map positions are known. The genome of E. coli K12 consists of about 4.7 million bp and the goal of this effort is to merge information obtained from the data collections and literature with data obtained from genome sequencing projects. The number of base pairs in the ECOLI database has doubled since it was first announced (Protein Seq. Data Anal. 3:157-162, 1990) and now corresponds to about 40% of the entire genome. 2-15 Database Descriptions Sequence redundancy in the database is eliminated by merging overlapping sequences. Each entry represents one sequence segment and all the entries in the database are ordered by genetic map position. Unlocalized genes will be included in the database once their chromosomal locations are known. Discrepancies among various reports of the same sequence are described in the ``Residues'' line following each Accession line. A tag, enclosed in angle brackets at the end of each Residues line, serves to identify the sequence reported in the publication. The gene name (and its alternate gene symbols), map position, and strand of the sequence segments, if known, are indicated within each entry. A plus or minus (+ or -) is used to specify on which of the two DNA strands the segment exists; the ``+'' strand is the strand transcribed clockwise in the usual genetic map. All known protein coding regions are annotated in the Feature table, as well as additional features such as promoter region, Shine-Dalgarno sequence, and inverted repeats. Correspondence regarding ECOLI should be directed to: Dr. T. Kunisawa JIPID, Research Institute for Biosciences Science University of Tokyo Noda 278, Japan E-mail: kunisawa@jpnsut31.bitnet __________________________________________________________________ 2.7 The GenBank Nucleic Acid Sequence Databank The GenBank database is a nucleic acid sequence database produced and distributed by the National Center for Biotechnology Information (NCBI). The complete database is not available on the ATLAS CD- ROM; however, retrieval of entry codes is available 2-16 Database Descriptions by GenBank accession number, author name, citation, species, sequence title and keyword indexing. Information available with each hit is limited to the LOCUS and DEFINITION fields of each entry. Complete GenBank entries can be retrieved from the PIR Network Request Server using the following electronic addresses: fileserv@nbrf.georgetown.edu Alternatively, if supplemental CD-ROM readers or several hundred megabytes of disk space are available then the ATLAS software system is capable of retrieving full GenBank entries in the native form as they appear in the multi-volume NCBI-GenBank flat file format CD-ROM package. Follow instructions in the Installation section for details on how to use the GenBank data as distributed on CD-ROM. Contact NCBI at the address below with regard to the NCBI-GenBank CD-ROM package. NCBI-GenBank National Center for Biotechnology Information National Library of Medicine, 38A, 8N805 8600 Rockville Pike Bethesda, MD 20894 USA E-mail: info@ncbi.nlm.nih.gov 2-17 _______________________________________________________ Part II The ATLAS program This part of the manual contains instructions on the use of the commands and command modifiers of the ATLAS program. _______________________________________________________ 3 Overview of the ATLAS Program The program responds to commands and modifiers typed at the ATLAS> prompt; PC users have the option to use a menu system. For convenience, all commands and modifiers may be abbreviated to their shortest unambiguous form. Commands may be typed in using upper- and/or lower-case letters; the program does not discriminate between them. Command operation does not begin until the key is pressed. The command names recognized by the ATLAS program can be modified by the user. A special DEFINE command is included to allow the definition of aliases for specific command and command modifier combinations. The descriptions of commands and command modifiers assume the use of the program in command mode (i.e., issuing commands at the ATLAS> prompt) and that the user has not modified the native command definitions. See Section 3.8 later in this chapter for an introduction to the menu system. Differences in commands between command mode and menu mode are indicated. For command mode, commands should be typed at the ATLAS> prompt in the format: ATLAS> COMMAND/MODIFIER parameter Character strings can be combined on the same command line to allow for searching logical combinations of strings. Strings are separated by space characters. The strings AND, or OR, and NOT are reserved as operators. When no operator is present, the AND operator is assumed. The OR operator instructs ATLAS to search for entries that contain either or both of the strings that immediately precede and follow it. 3-1 Overview of the ATLAS Program The NOT operator is a binary operator. It instructs ATLAS to find entries that contain the first string and do not contain the second. Lines containing more than one operator are evaluated according to the following order of precedence: the OR operator has the highest precedence, followed by the AND operator, and then the NOT operator; the implied AND operator (no operator) has the lowest precedence. The open and closed parenthesis characters specify an alternate evaluation order. Strings enclosed in parentheses are evaluated separately. The double quote characters are used to designate literal search strings. Literal search strings are searched for exactly as they appear. Quotes are used when strings containing the reserved characters, space, and open and closed parentheses or the reserved operators, AND, OR, and NOT, are required. The /SHOW modifier of the text searching commands will display how the search strings are parsed by the ATLAS program. The commands can be grouped into the following categories according to function: o Text Searching Commands o Sequence Searching Commands o Display Commands o File Interface Commands o Utility Commands __________________________________________________________________ 3.1 Text Searching Commands The commands in this group provide the primary retrieval capabilities of the ATLAS program by searching the term indexes. All text searching commands are predefined versions of the SEARCH command. When using command mode, the field name is usually used as the command, e.g. KEYWORD, 3-2 Overview of the ATLAS Program FEATURE, etc. The command KEYWORD is a symbol for the predefined full command, SEARCH/FIELD=KEYWORD. When using the menu, all the text searching commands are listed under the FIND command. Each term in the term index that matches the user- supplied search string is displayed on the screen. In addition, the database-code, entry-code, and title of each entry in which the term is found are displayed. Each word in the search string must be at least three characters in length. The /BRIEF modifier limits display to only the index terms found, omitting the codes and titles of the entries. The commands generate a new current list containing all the entries found during the search. Table_3-1__Text_Searching_Commands_____________________ Command__________Action________________________________ ACCESSION Search the accession index for an accession number AUTHOR Search the author index for an author name CROSS Search the cross-reference index for a cross-reference number FEATURE Search the feature index for a feature name FIND[1] Search the entry titles for a sequence name and/or an organism name GENE Search the gene name index for a gene name JOURNAL Search the journal index for a journal citation KEYWORD Search the keyword index for a keyword _______________________________________________________ [1]The FIND command in the menu system searches an index selected from those listed in the submenu. 3-3 Overview of the ATLAS Program Table_3-1_(Cont.)__Text_Searching_Commands_____________ Command__________Action________________________________ MEMBERS Search the members index of the alignment database (ALN) REFERENCE Search the reference number index for a reference number SPECIES Search the species index for a species SUPERFAMILY Search the superfamily index for a superfamily name SFNUM Search the superfamily index for a _________________superfamily_number____________________ __________________________________________________________________ 3.2 Sequence Searching Commands These commands allow limited sequence searching within the ATLAS program. They search the protein sequence data for the occurrence of short amino acid segments. The search strings are typed directly into the terminal in the one-letter amino acid code (see Appendix B). Table_3-2__Sequence_Searching_Commands_________________ Command__________Action________________________________ SCAN Rapid search for identically matching segments MATCH Search for protein segments allowing _________________mismatches____________________________ 3-4 Overview of the ATLAS Program __________________________________________________________________ 3.3 Display Commands These commands are designed to display database information. Table_3-3__Display_Commands____________________________ Command__________Action________________________________ LIST Display titles of all entries on the current list TYPE Display an entry EXTRACT Constructs and displays a modified _________________sequence______________________________ __________________________________________________________________ 3.4 File Interface Commands These commands provide the interface between the ATLAS program and external files. Table_3-4__File_Interface_Commands_____________________ Command__________Action________________________________ COPY Copy an entry into an external file GET Get current list from an external file PRINT[1] Display the contents of an external file _______________________________________________________ [1]The PRINT command is not available through the menu system. _______________________________________________________ __________________________________________________________________ 3.5 Utility Commands The utility commands set and display program defaults and perform other miscellaneous chores. 3-5 Overview of the ATLAS Program Table_3-5__Utility_Commands____________________________ Command__________Action________________________________ BASES Display or define the list of active databases DEFINE[1] Define abbreviations for commands or databases HELP Obtain help on commands and modifiers QUIT Terminate the program SET Set program operation parameters SHOW Show information about current program operation SYSTEM Transfer control to DOS (PC only) EXIT Return from DOS to ATLAS program (PC only) _______________________________________________________ [1]DEFINE as listed as an action under BASES in the menu system only allows the user to activate databases; command or database abbreviation definitions are not available through the menu. _______________________________________________________ __________________________________________________________________ 3.6 Command Modifiers The primary action of the commands can be modified by the addition of one or more command modifiers. Command modifiers immediately follow the command with no space between and must be preceded by a slash (/). Not all the command modifiers listed below work with every command. See the detailed descriptions of the commands and their command modifiers for more information. 3-6 Overview of the ATLAS Program Table_3-6__Command_Modifiers___________________________ ____________Modifiers_of_the_ATLAS_Commands____________ Alternate Display Formats_______________Modifier_Function________________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term. This number is displayed after all the entry titles have been displayed. Use /COUNTS in conjunction with /BRIEF to display only the index terms and corresponding count. /SHOW Display how search terms are parsed by ATLAS _______________________________________________________ Current List Processing____________Modifier_Function________________ /CURRENT Process entries on current list only. When /CURRENT is specified, the entry-identifier parameter on the command line should be omitted. If present, it is ignored. /ADD Entries found are added to current list /SUBTRACT Entries found are removed from current list /KEEP Command execution does not alter current list /ALL Process all entries in the active databases /RESTORE Restores previous current list (LIST command only) 3-7 Overview of the ATLAS Program Table_3-6_(Cont.)__Command_Modifiers___________________ ____________Modifiers_of_the_ATLAS_Commands____________ Information Category Selection_____________Modifier_Function________________ /TEXT Only text portion of entry is processed /SEQUENCE Only sequence portion of entry is processed _______________________________________________________ Redirecting Screen Output________________Modifier_Function________________ /OUTPUT=file-spec Directs normal output to a file (no screen display). Differs from /PRINTER=file-spec for some commands. /PRINTER Directs screen output to printer. Output is not immediately sent to the printer but accumulates until the QUIT command is issued to leave the program. /PRINTER=file-spec Directs screen output to disk file. The modifier value, file- spec, can be any valid file specification. The specified file is closed when the command terminates; it is not sent to the printer for printing, it is a permanent file that remains until explicitly deleted. 3-8 Overview of the ATLAS Program Table_3-6_(Cont.)__Command_Modifiers___________________ _______________________________________________________ Search_Manipulation___Modifier_Function________________ /ANCHOR Match terms beginning with character string /NOANCHOR Match character string anywhere in term /ENTRY Match each character string to ______________________separate_terms___________________ __________________________________________________________________ 3.7 Special Control Characters The ATLAS program responds to two control characters that interrupt program operation: CTRL-C The CTRL-C character terminates the command that is currently operating. It is generally used to recover from inadvertently issued commands. CTRL-Y The CTRL-Y character terminates the VAX/VMS version of the ATLAS program and returns the user to the system. Note: CTRL-Y is not the proper way to exit the program; the QUIT command should be used for normal program termination. __________________________________________________________________ 3.8 The PC Menu System Upon startup, the PC version of the ATLAS program is in a menu command system. This menu system can be toggled to command mode, in which the commands are the same as those for the VAX/VMS version described in Chapter 4. 3-9 Overview of the ATLAS Program This section describes how to get around in the menu and use some of the commands as seen through the menu system. The following main menu bar appears across the top of the screen: Bases Find List Get Type Copy Extract Scan Option Help Quit Initially "Bases" is highlighted. Move among commands with the left and right arrow keys. As each command is highlighted a message appears under the menu describing the command. The following table summarizes the descriptions of each highlighted command: Table_3-7__Main_Menu_Selections________________________ Bases: Display or define the list of active databases Find: Search the specified text field List: Display or manipulate the current list Get: Get the current list from an external file or from the user Type: Display an entry Copy: Copy an entry to an external file Extract: Construct and display a modified sequence Scan: Search protein sequences Option: Set or show option Help: Help Quit:____________Terminate_the_ATLAS_program___________ To select a command either type the first letter of the command or press the key or the down arrow key when the desired command is highlighted. The key deselects a command and returns the user to the main menu. When a command from the main menu is selected, one or two pop-up submenus appear on the screen. The one on the left shows the actions (commands) available and the one on the right shows the options (command modifiers) available. 3-10 Overview of the ATLAS Program To make a selection in the action submenu, type the appropriate letter or use the up and down arrow keys until the desired selection is highlighted and press to initiate the action. When is pressed and there are no options highlighted, the menu command action is initiated without modification. To make a selection from the option submenu, press the appropriate function key; pressing the key a second time deselects the option. More than one option may be selected if the menu box is divided into sections. One option may be selected from each section. If an attempt is made to select two items in the same section, the program simply toggles from one to the other. Most of the commands work the same through the menu as they do in command mode. The greatest differences occur in the FIND and OPTION selections. The various text searching commands, which all work in the same manner, are listed in a submenu under FIND; the user simply selects the index to be searched. OPTION has several commands that are unique to the PC version and particularly to the menu system. When OPTION has been selected, the following action submenu appears: Table_3-8__OPTION_Submenu______________________________ Ltr.___Action________________Description_______________ F show fields Display the indexed fields available for each database A ATLAS program header Display the citation header for the ATLAS program S gateway to DOS Temporarily leave ATLAS to use DOS 3-11 Overview of the ATLAS Program Table_3-8_(Cont.)__OPTION_Submenu______________________ Ltr.___Action________________Description_______________ C command mode Toggle from the menu system to command mode P toggle pause Switch to continuous scrolling M select colors Change colors for main _____________________________and_submenus______________ The S (system) action allows the user to temporarily leave ATLAS and go to DOS. To return from DOS to ATLAS, type EXIT at the DOS prompt. The C action allows the user to enter the command mode of ATLAS. Typing commands at the ATLAS> prompt is much more flexible than using the menu. To return from command mode to the menu type SET/MENU at the ATLAS> prompt. The P action allows the user to change how information is scrolled on the screen. When the pause is active (default), only one screen full of information is displayed at a time. The user is prompted at the bottom of the screen to decide whether or not to continue viewing with the prompt: More (Y/n): When toggled, the information scrolls continuously on the screen and stops/starts in response to the Scroll Lock key. The M action allows color to be changed in four regions of the screen: the main menu bar, the highlighting for the main menu bar, the submenus, and the highlighting for the submenus. When this action is selected, the lower part of the screen shows 12 numbers that determine the colors displayed. 3-12 Overview of the ATLAS Program The numbers are are divided into triplets; one triplet for each of the four regions. For example, in the triplet 7,1,1, the first represents the foreground color (0 to 7), the second represents the intensity (0 or 1), and the third the background color (0 to 7). The space bar steps through the colors at the cursor and the arrow keys move the cursor. The colors on the screen change as each selection is made and can be saved, until you quit ATLAS, by pressing S. 3-13 _______________________________________________________ 4 The ATLAS Commands, Detailed Descriptions This chapter provides a detailed description of each command, its parameters, and its modifiers. The command descriptions are arranged alphabetically. 4-1 ACCESSION _______________________________________________________ ACCESSION ACCESSION searches the accession index for all occurrences of a user-specified accession number. For each number found, the number, the entry identifier, and the title for each entry in which the number occurs are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command SEARCH/FIELD=accession/ANCHOR. _______________________________________________________ FORMAT ACCESSION [accession-number] _______________________________________________________ PARAMETERS accession-number The accession-number parameter on the command line is the number (or part of the number) to be selected. Accession numbers in PIR consist of 1 letter followed by 5 digits. The character string you enter must contain at least 3 characters. If this parameter is omitted, you will be prompted for it with the prompt: _______________________________________________________ prompts Accession: If you respond to the prompt by pressing the key, all accession numbers will be found. 4-2 ACCESSION _______________________________________________________ DESCRIPTION The accession index is constructed from fields that contain accession numbers in the databases included, e.g., the Accession: lines in PIR databases. An accession number is assigned to each sequence entry when it is entered into one of the PIR datasets. For entries that are merged, the accessions are listed for all individual entries included. Often in the PIR2 and PIR3 datasets the accession number is also used as an entry-code. Normally the accession number search matches only accession numbers that begin with the character string you specified. Use the /NOANCHOR modifier (listed below) to alter this operation. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the ACCESSION command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-1__ACCESSION_Command_Modifiers_________________ Alternate_Display_Format___Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed 4-3 ACCESSION Table_4-1_(Cont.)__ACCESSION_Command_Modifiers_________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /NOANCHOR Match search-string anywhere in term /ANCHOR (default) Match terms beginning with character string /ENTRY Match each character string ___________________________to_separate_terms___________ 4-4 ACCESSION _______________________________________________________ TheMcommand line: 1 ATLAS> ACCESSION A000 will produce the list of entries containing accession numbers that begin with A000. 2 ATLAS> ACC/SUB A0009 This command will remove from the current list (produced by the command in the first example) those entries that contain accession numbers beginning with A0009. Codes and titles of the removed entries are displayed. To see the entries that remain, use the LIST command. 4-5 AUTHOR _______________________________________________________ AUTHOR AUTHOR searches the author index for all occurrences of a user-specified name or partial name. For each name found, the author, the entry-identifier, and the title for each entry referencing the author are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=author/ANCHOR. _______________________________________________________ FORMAT AUTHOR [author-name] _______________________________________________________ PARAMETERS author-name The author-name parameter on the command line is the name (or part of the name) of the author to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Author: _______________________________________________________ DESCRIPTION Normally the author name search matches only author names that begin with the character string you specified. Use the /NOANCHOR modifier (listed below) to alter this operation. 4-6 AUTHOR Displaying the List of Authors If no Author-name is typed in at the prompt and you press , an alphabetical listing of all authors and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used, only the author names will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the AUTHOR command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-2__AUTHOR_Command_Modifiers____________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed _______________________________________________________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 4-7 AUTHOR _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR (default) Match terms beginning with character string /NOANCHOR Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMfollowing command line will produce the list of AUTHORS whose last names begin with Smith and the entry-identifiers and titles of entries containing those authors. 1 4-8 AUTHOR ATLAS> AUTHOR SMITH When specifying an initial for an author, it should be in the format: Lastname,I with no space between the comma and initial. Example: 2 ATLAS> BASES PIR1 ATLAS> AUTHOR ADELMAN,J Adelman,J PIR1:EQHUA proenkephalin precursor - human PIR1:IVHU14 interferon alpha-I-14 precursor - human PIR1:IVHUA9 interferon alpha-9 precursor - human PIR1:IVHUB1 interferon beta-1 precursor - human PIR1:ABHUS serum albumin precursor - human Adelman,J P PIR1:A34702 amphiregulin precursor - human PIR1:RHHUG gonadoliberin precursor - human PIR1:RHRTG gonadoliberin precursor - rat 2 authors found The above example found two authors; one with one initial and the other with two initials. To specify two initials, separate them with a single nonalphanumeric character (other than right or left parentheses). Alternatively, the search string may be enclosed in quotes. For example: 3 ATLAS> AUTHOR ADELMAN,J.P or 4 ATLAS> AUTHOR "ADELMAN,J P" 4-9 BASES _______________________________________________________ BASES The BASES command provides two functions. It can be used to either o display a table showing all the accessible databases, or o change the list of active databases. _______________________________________________________ FORMAT BASES [database-list] _______________________________________________________ PARAMETERS database-list The database-list parameter allows the user to specify the list of databases to become active. The database-list must be either: o a list of database names separated by plus (+) signs or spaces, o an abbreviation for a database-list the user created with the DEFINE command, or o the wildcard character (*) which specifies all databases. The order in which the databases are specified in the database-list is important because many commands that process entries from more than one database go through the active databases in the order specified. 4-10 BASES _______________________________________________________ DESCRIPTION Displaying the Table of Databases If no database-list parameter is specified on the command line, the BASES command displays a table showing all the databases included in the term index system. These are the only databases that are accessible through the text searching commands. The list of the database-list abbreviations defined by the user is shown below the table. Changing the Active Databases If a database-list is specified on the command line, the BASES command changes the list of active databases to the list specified. _______________________________________________________ MODIFIERS The following table lists the modifiers accepted by the BASES command. Table_4-3__BASES_Command_Modifiers_____________________ Database_List_Processing___Modifier_Function___________ /ADD (database_list) Add specified database(s) to current databases /SUB (database_list) Subtract specified database(s) from current databases _______________________________________________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display database list without indicating active databases 4-11 BASES Table_4-3_(Cont.)__BASES_Command_Modifiers_____________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ _______________________________________________________ TheMfollowing examples illustrate the use of the BASES command. The database configuration and database-list abbreviations shown in the first example are assumed in all the subsequent examples. 1 ATLAS> BASES Database Entries Type Release -------- ------- ---- ------- Database Entries Type Rel. Description -------- ------- ---- ---- ----------- * PIR1 12404 PROT 41.00 Section 1. Classified and Annotated Entries * PIR2 35689 PROT 41.00 Section 2. Annotated Entries * PIR3 22755 PROT 41.00 Section 3. Unverified Entries NRL_3D 3911 PROT 15.00 NRL Protein Sequences in Brookhaven PDB PATCHX 31883 PROT 41.00 Protein Seq DB PATCHX (subseq of MIPSX) ECOLI 556 NUCL 2.20 Escherichia coli DNA Database GBBCT 15107 NUCL 82.00 GenBank Bacterial (Locus and Title) GBEST 33727 NUCL 82.00 GenBank EST (Locus and Title) GBINV 11234 NUCL 82.00 GenBank Invertebrate (Locus and Title) GBMAM 5628 NUCL 82.00 GenBank Other Mammalian (Locus and Title) GBPHG 968 NUCL 82.00 GenBank Phage (Locus and Title) GBPAT 5281 NUCL 82.00 GenBank Patent (Locus and Title) GBPLN 16154 NUCL 82.00 GenBank Plant (Locus and Title) GBPRI 31972 NUCL 82.00 GenBank Primate (Locus and Title) GBRNA 3602 NUCL 82.00 GenBank Struct RNA (Locus and Title) GBROD 20581 NUCL 82.00 GenBank Rodent (Locus and Title) GBSYN 1717 NUCL 82.00 GenBank Synthetic (Locus and Title) GBUNA 1490 NUCL 82.00 GenBank Unannotated (Locus and Title) GBVRL 15876 NUCL 82.00 GenBank Viral (Locus and Title) 4-12 BASES GBVRT 6558 NUCL 82.00 GenBank Other Vertebrate (Locus and Title) GBNEW 18570 NUCL 82.06 GenBank(R) NEW (Locus and Title) ALN 1133 TEXT 5.10 Database of family alignments * indicates an active database. Entries: 70848 active, 296796 total. PR*OTEIN = PIR1+PIR2+PIR3+PATCHX+NRL_3D The above example shows the format of the display produced by the BASES command with no parameter. 2 ATLAS> BASES PIR1+PIR2+PIR3 This command changes the list of active databases. After the command executes the new list will consist of three databases, PIR1, PIR2, and PIR3. An alternative way to activate the PIR datasets is shown below where any database code beginning with PIR will be activated: 3 ATLAS> BASES PIR* This command is equivalent to the preceding command. 4 ATLAS> BASES PR Because the abbreviation PR has been defined (see the last line of example 1), this command is equivalent to "ATLAS> BASES PIR1+PIR2+PIR3+PATCHX+NRL_3D." To define abbreviations for the database-list, use the DEFINE command. 5 ATLAS> BASES PR+ALN This command will produce an error message. PR is a valid abbreviation, as shown in example 1, but abbreviations cannot be combined with the plus sign. 4-13 COPY _______________________________________________________ COPY The COPY command copies entries into an output file. The output file is independent of the database files; the information in the file can be modified for any particular use. _______________________________________________________ FORMAT COPY [entry-identifier] _______________________________________________________ PARAMETERS entry-identifier If the entry-identifier parameter is omitted from the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Code: If you press at the code prompt, the program processes the entries on the current list. If there are no entries on the current list, pressing the return key terminates the COPY command. On the first execution of the COPY command, if the file-specification is not specified on the command line the following prompt will appear: _______________________________________________________ prompts Output file: On subsequent executions of COPY, if the file- specification is not given on the command line, 4-14 COPY then the output is added to the last file created with the COPY command. If the file-specification is given on the command line (using the /OUTPUT=file-spec modifier), then a new file is created. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the COPY command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-4__COPY_Command_Modifiers______________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list _______________________________________________________ Information Category Selection__________________Modifier_Function___________ /TEXT Process only text portion of entry /SEQUENCE Process only sequence portion of entry 4-15 COPY _______________________________________________________ Redirecting_Output_________Modifier_Function___________ /OUTPUT=file-spec Specify output file name /OUTPUT=* Append output to previous file /NOFORMAT Copy sequence exactly as it is in database (sequence lines up to 500 characters ___________________________long)_______________________ _______________________________________________________ TheMfollowing example demonstrates how the sequence (with entry-identifier PIR1:CCHU) can be copied into an external file. 1 ATLAS> COPY CCHU Output file: TEST.SEQ The following example shows the contents of the file (TEST.SEQ) just created: 2 4-16 COPY >P1;CCHU cytochrome c - human M G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I I W G E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E R A D L I A Y L K K A T N E * C;Species: Homo sapiens (man) C;Date: #sequence_revision 30-Sep-1991 #text_change 05-Aug-1994 C;Accession: A31764; A05676; A00001 R;Evans, M.J.; Scarpulla, R.C. Proc. Natl. Acad. Sci. U.S.A. 85, 9625-9629, 1988 A;Title: The human somatic cytochrome c gene: two classes of processed... A;Reference number: A31764; MUID:89071748 A;Accession: A31764 A;Molecule type: DNA A;Residues: 1-105 A;Cross-references: GB:M22877 R;Matsubara, H.; Smith, E.L. J. Biol. Chem. 238, 2732-2753, 1963 A;Title: Human heart cytochrome c. Chymotryptic peptides, tryptic pept... A;Reference number: A05676 A;Accession: A05676 A;Molecule type: protein A;Residues: 2-28;29-46;47-100;101-105 R;Matsubara, H.; Smith, E.L. J. Biol. Chem. 237, 3575-3576, 1962 A;Title: The amino acid sequence of human heart cytochrome c. A;Reference number: A00001 A;Contents: annotation A;Note: 66-Leu is found in 10% of the molecules in pooled protein C;Genetics: A;Introns: 57/1 C;Superfamily: cytochrome c; cytochrome c homology C;Keywords: acetylated amino end; electron transfer; heme; mitochondrion;... F;2-105/Protein: cytochrome c #status experimental F;2/Modified site: acetylated amino end (Gly) (in mature form) #status... F;15,18/Binding site: heme (Cys) (covalent) #status experimental F;19,81/Binding site: heme iron (His, Met) (axial ligands) #status pre... Note that the format of the entry looks much different 4-17 COPY from that shown in Example 1-1. Please see Appendix D for further information regarding the format for external user files. 4-18 CROSS _______________________________________________________ CROSS CROSS searches the cross-reference index for all occurrences of a user-specified cross-reference identifier. For each occurrence found, the cross- reference identifier, the entry identifier, and the title for each entry are displayed. The list of these entries becomes the new current list. Currently, the cross-reference index contains information from PIR databases only. This command is an abbreviation for the full command: SEARCH/FIELD=cross_ref. _______________________________________________________ FORMAT CROSS [cross-reference] _______________________________________________________ PARAMETERS cross-reference The cross-reference parameter on the command line is the identifier (or part of the identifier) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted, you will be prompted for it with the prompt: _______________________________________________________ prompts Cross_ref: _______________________________________________________ DESCRIPTION The cross-reference index is constructed from the PIR Cross_reference: lines. A cross-reference identifier enables the user to find the same or similar entry in another database. Each database has its own method 4-19 CROSS of cross-referencing; PIR uses GenBank/EMBL/DDBJ accession numbers as the cross-reference identifiers for the nucleic acid sequence databases. Normally the cross-reference search matches the specified character string anywhere it appears in the term. Use the /ANCHOR modifier (listed below) to alter this operation. Displaying the List of Cross-Reference Identifiers If no cross-reference identifier is typed in at the prompt and you press , a listing of all cross- reference identifiers and the titles of the associated entries will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the CROSS command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-5__CROSS_Command_Modifiers_____________________ Alternate_Display_Format___Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term 4-20 CROSS _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string ___________________________anywhere_in_term____________ _______________________________________________________ TheMfollowing example demonstrates how to use the CROSS command to locate those entries in the PIR- International databases corresponding to accession number M32690 in GenBank entry GBVRL:BIM127. 1 4-21 CROSS ATLAS> B PIR* ATLAS> CROSS M32690 GB:M32690 PIR1:FOLJBT gag polyprotein - bovine immunodeficiency virus (isolate 127) PIR1:GNLJBT pol polyprotein - bovine immunodeficiency virus (isolate 127) PIR1:VCLJBT env polyprotein precursor - bovine immunodeficiency virus (isolate 127) PIR1:ASLJBT vif protein - bovine immunodeficiency virus (isolate 127) PIR1:TNLJBT trans-activating transcriptional regulatory protein - bovine immunodeficiency virus (isolate 127) PIR1:VKLJBT trans-regulatory splicing protein - bovine immunodeficiency virus (isolate 127) PIR1:ASLJBW orf-W protein - bovine immunodeficiency virus (isolate 127) PIR1:ASLJBY orf-Y protein - bovine immunodeficiency virus (isolate 127) 1 cross_ref found 4-22 DEFINE _______________________________________________________ DEFINE The DEFINE command can be used to create abbreviations for: o database lists o ATLAS commands o display layouts _______________________________________________________ FORMAT DEFINE [abbreviation] [text-string] _______________________________________________________ PARAMETERS abbreviation Specifies the abbreviation. text-string Specifies the command or database list that will be referenced via the abbreviation. _______________________________________________________ DESCRIPTION DEFINE allows the user to customize commands and database lists. It can also be used to specify a display layout that is used when an entry is displayed with the TYPE command. Users can have these definitions available each time they use the ATLAS program. Please see the installation document on the CD-ROM for your system. 4-23 DEFINE Defining a Display Layout A display layout is a specification of the text lines to be displayed when an entry is typed. A text line is specified by giving the string that occurs at the beginning of the line (upper and lower case letters are treated as equal). The string usually consists of a tag (the first two characters on the line) optionally followed by a descriptor and is used to determine the kind of information contained on the line. The tags themselves are not visible with the TYPE command, but are described fully in Appendix D. These tags can also be seen when an entry is copied into an external file with the COPY command. The SHOW/DISPLAY command lists all the symbols that have been defined and the display layouts they represent. _______________________________________________________ MODIFIERS The following table lists the modifiers accepted by the DEFINE command. Table_4-6__DEFINE_Command_Modifiers____________________ Modifier___________________Modifier_Function___________ /BASE (default) Define an abbreviation for a database list /COMMAND Define an abbreviation for an ATLAS command /DISPLAY Define a symbol for a ___________________________display_layout______________ _______________________________________________________ TheMfollowing examples illustrate the use of the DEFINE command. 1 4-24 DEFINE ATLAS> DEFINE/COMMAND EX*IT QUIT This example defines EXIT to be equivalent to the QUIT command. In fact, because of the placement of the asterisk, EX, EXI, and EXIT are all equivalent to the QUIT command. Another useful customization of commands is to create abbreviations for the full version of the command. For example, to define a command that will cause both the keyword and title indexes to be searched: 2 ATLAS> DEFINE/COMMAND KT SEARCH/FIELD=KEYWORD+TITLE To define a database-list abbreviation: 3 ATLAS> DEFINE PR PIR1+PIR2+PIR3 A database-list abbreviation can be used as the parameter with the BASES command. In the following example, the symbol definition is substituted for the symbol and the databases specified in the symbol- definition become the new active databases. Example: 4 ATLAS> BASES PR A database-list abbreviation can be used to specify an entry in the form symbol:entry-code. In this case, the databases specified in the symbol-definition are searched for the entry-code. Example: 5 ATLAS> TYPE PR:CCHU 4-25 DEFINE 6 ATLAS> DEFINE/DISPLAY F F; This command defines the symbol F to represent the display layout F;. Then to use the TYPE command with this display layout: 7 ATLAS> TYPE/TEXT=F code Only the lines of text that begin with F; (the feature table) will be displayed. Caution must be used in the specification to be sure all appropriate entries are found as illustrated by the following example. 8 ATLAS> DEFINE/DISPLAY KEY C;KEYWORDS: This command defines the symbol KEY to represent the display layout C;KEYWORDS:, however, as only the lines that match the specification exactly will be displayed, it will not match lines that begin with C;KEYWORD: (missing the s) or C;KEYWORDS (without the :). These lines will be displayed if KEY is defined as follows: 9 ATLAS> DEFINE/DISPLAY KEY C;KEYWORD In this next example, the display layout consists of a single string that contains a blank illustrating that the specification may contain blanks. A display layout can also specify more than one string, which must be separated by &. A text line will be displayed if it matches any of the strings given in the specification. 4-26 DEFINE 10 ATLAS> DEFINE/DISPLAY FN F;&N; Here the symbol FN specifies the feature lines plus any lines that begin with N;. Note that if a blank space is placed before the ampersand it is considered part of the first specification and if placed after the ampersand it is considered part of the second specification. In the following example, the symbol REF specifies three line types: 11 ATLAS> DEFINE/DISPLAY REF R;&A;REFERENCE NUMBER:&A;RESIDUES: The above discussion on display layouts applies to the PIR databases and may or may not apply to other databases. In particular, it does not work very well with the GenBank database. For example, 12 ATLAS> DEFINE/DISPLAY T Title does not display the TITLE lines because the word TITLE begins in column 3 not at the beginning of the line. The following example inserts the two blanks before the word TITLE (the ampersand does not count as a blank space). 13 ATLAS> DEFINE/DISPLAY T & TITLE This correctly displays the TITLE lines but there is another problem. In GenBank, the TITLE line may be continued onto subsequent lines. These lines begin with 12 blanks so they will not be displayed with the above definition. Attempting to include a specification for the continuation lines will cause all continuation lines, not just those for TITLE, to be displayed. 4-27 EXTRACT _______________________________________________________ EXTRACT The EXTRACT command constructs a sequence from the sequence shown in the entry and a set of instructions. The instructions can be entered in response to the prompt for the sequence specification or they can be read from the annotation for the entry. They allow one to insert, delete, and replace residues, and to transpose and/or duplicate sequence elements. The EXTRACT command applies to amino acid sequences only. It is ignored for nucleotide sequences. _______________________________________________________ FORMAT EXTRACT [entry-identifier] _______________________________________________________ PARAMETERS entry-identifier If the entry-identifier parameter is omitted from the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Code: If you press at the code prompt, the program processes the entries on the current list. If there are no entries on the current list, then pressing the return key terminates the EXTRACT command. _______________________________________________________ prompts Sequence specification: 4-28 EXTRACT Enter the instructions for constructing the new sequence. _______________________________________________________ DESCRIPTION In response to the prompt for the specification you can type any of the following. Segment Specifications These are the instructions for constructing the new sequence from the sequence shown in the entry. The segment specifications are separated by commas or semi-colons, for example, 1-7,'SCCF',9-18;35,'T',36-* A comma indicates that the residues are adjacant on the same chain, whereas a semi-colon indicates the residues are on different chains. The valid forms for the segment specifications are: o N - a single number. Residue N from the original sequence is copied into the new sequence. o N1-N2 - a pair of numbers separated by a dash. Residues N1 to N2 from the original sequence are copied into the new sequence. An asterisk (*) can be used in place of the second number, N2, to represent the last residue in the original sequence. o 'xxx' - a string of amino acid symbols enclosed in quotes. The symbols within the quotes are copied into the new sequence. 4-29 EXTRACT A Code Extension The code extensions appear in the annotation for the entry at the end of feature table lines or "Residues:" lines of the references. They are enclosed in <..>. You may type the code extension with or without the enclosing <..>, for example, MAT or ? This displays the annotation lines of the entry which contain a code extension. Just type to terminate processing for this entry. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the EXTRACT command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-7__EXTRACT_Command_Modifiers___________________ Alternate_Display_Formats__Modifier_Function___________ /FULL_NUMBERING Display numbering for each line of sequence /ONE_LETTER (default) Display one-letter amino acid abbreviations /THREE_LETTER Display three-letter amino acid abbreviations 4-30 EXTRACT Table_4-7_(Cont.)__EXTRACT_Command_Modifiers___________ Current_List_Processing____Modifier_Function___________ /CURRENT[1] Process only entries on current list _______________________________________________________ Location of Sequence Specification______________Modifier_Function___________ /TABLE The Specification is read from the entry _______________________________________________________ Output_to_a_file___________Modifier_Function___________ /OUTPUT Output constructed sequence to a file /OUTPUT=file-spec Output to the indicated file /EXTRACT Output without verification _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ [1]When /CURRENT is specified, the entry-identifier parameter on the command line should be omitted. If present, it is ignored _______________________________________________________ _______________________________________________________ TheMKlebsiella pneumoniae entry with identification code ALKBG contains the following item in its feature table. 1 4-31 EXTRACT 31-655 Product: cyclomaltodextrin glucanotransferase #status experimental The following example demonstrates the use of the EXTRACT command to construct the product sequence given in the above item. 2 ATLAS> EXTRACT ALKBG PIR1:ALKBG 655 residues cyclomaltodextrin glucanotransferase (EC 2.4.1.19) precursor - Klebsiella pneumoniae Sequence Specification: MAT ALKBG->MAT Product: cyclomaltodextrin glucanotransferase #status experimental 5 10 15 20 25 30 1 A E P E E T Y L D F R K E T I Y F L F L D R F S D G D P S N 31 N A G F N S A T Y D P N N L K K Y T G G D L R G L I N K L P ... 571 Q Y P Q W S A S L E L P S D L N V E W K C V K R N E T N P T 601 A N V E W Q S G A N N Q F N S N D T Q T T N G S F Residues 31-655 of ALKBG The EXTRACT command was used to extract the product sequence from the sequence shown in the entry. 4-32 FEATURE _______________________________________________________ FEATURE FEATURE searches the feature table index for all occurrences of a user-specified feature name or partial name. For each name found, the name, the entry-identifier, and the title for each entry in which the feature name occurs are displayed. The list of these entries becomes the new current list. Currently, the feature table index includes information from PIR entries only. This command is an abbreviation for the full command SEARCH/FIELD=feature. _______________________________________________________ FORMAT FEATURE [feature-name] _______________________________________________________ PARAMETERS feature-name The feature-name parameter on the command line is the name (or part of the name) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted, you will be prompted for it with the prompt: _______________________________________________________ prompts Feature: 4-33 FEATURE _______________________________________________________ DESCRIPTION The index is constructed from the feature table lines in PIR entries. Some examples of features are binding sites, active sites, modified sites, etc. Normally the search matches the specified string anywhere it appears in the feature. Use the /ANCHOR modifier (listed below) to alter this operation. Displaying the List of Features If no feature name is typed in at the prompt and you press , an alphabetical listing of all features and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used only the feature names will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the FEATURE command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-8__FEATURE_Command_Modifiers___________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed 4-34 FEATURE Table_4-8_(Cont.)__FEATURE_Command_Modifiers___________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ 4-35 FEATURE _______________________________________________________ TheMfollowing example demonstrates the feature command: 1 ATLAS> FEATURE SELENOCYSTEINE Modified site: selenocysteine #status experimental PIR1:OPBOE glutathione peroxidase (EC 1.11.1.9) - bovine PIR1:OPRTE glutathione peroxidase (EC 1.11.1.9) I - rat PIR1:OMRTSP selenoprotein P precursor - rat Modified site: selenocysteine #status predicted PIR1:DEECFS formate dehydrogenase (EC 1.2.1.2) (benzylviologen-linked) selenocysteine-containing protein - Escherichia coli PIR1:OPHUE glutathione peroxidase (EC 1.11.1.9) - human PIR1:OPMSE glutathione peroxidase (EC 1.11.1.9) - mouse PIR1:HQDVLB hydrogenase (EC 1.18.99.1) (NiFeSe) large chain - Desulfovibrio baculatus PIR1:A47327 selenoprotein P precursor - human PIR1:OMRTSP selenoprotein P precursor - rat 2 features found The following example illustrates use of the RESID Database to help find appropriate features: 2 ATLAS> BASES PIR1 ATLAS> FEATURE PYROGLUTAMIC ACID Feature: pyroglutamic acid No features found The user cannot find a feature named ``pyroglutamic acid'' in the Protein Sequence Database. By using the RESID database, ``pyroglutamic acid'' is found to be an alternate name for ``2-pyrrolidone-5-carboxylic acid'' and that the feature appears as ``Modified site: pyrrolidone carboxylic acid'' in the Protein Sequence Database. 4-36 FEATURE ATLAS> BASES RESID ATLAS> TITLE PYROGLUTAMIC ACID pyroglutamic acid RESID:AA0031 2-pyrrolidone-5-carboxylic acid 1 title found ATLAS> TYPE/CURRENT RESID:AA0031 2-pyrrolidone-5-carboxylic acid Alternate names: pyroglutamic acid; 5-oxoproline Systematic name: (S)-5-oxo-2-pyrrolidinecarboxylic acid Cross-references: CAS:98-79-3 Formula: C 5 H 6 N 1 O 2 Formula weight: #chem 112.11 #phys 112.0399 Correction formula: C 0 H -2 O -1 Correction weight: #chem -18.02 #phys -18.0106 Date: 31-Mar-1995 #structure_revision 31-Mar-1995 #text_change 01-Sep- 1995 Podell, D.N.; Abraham, G.N. Biochem. Biophys. Res. Commun. 81, 176-185, 1978 Title: A technique for the removal of pyroglutamic acid from the amino terminus of proteins using calf liver pyroglutamate amino peptidase. Reference number: A44724 Comment: This modification can form non-enzymatically from amino- terminal glutamine. Generating enzyme: glutaminyl-peptide cyclotransferase (EC 2.3.2.5) Sequence code: Q; Z Conditions: amino-terminal Keywords: pyroglutamic acid Residues Feature Modified site: pyrrolidone carboxylic acid (Gln) Modified site: pyrrolidone carboxylic acid (Glx) 4-37 FEATURE ATLAS> BASES PIR1 ATLAS> FEATURE/BRIEF/COUNT PYRROLIDONE CARBOXYLIC ACID 1 Modified site: blocked amino end (Gln) (in mature form) (probably pyrrolidone carboxylic acid) #status experimental 20 Modified site: blocked amino end (Gln) (probably pyrrolidone carboxylic acid) #status experimental 1 Modified site: blocked amino end (Glx) (probably pyrrolidone carboxylic acid) #status experimental 97 Modified site: pyrrolidone carboxylic acid (Gln) #status experimental 2 Modified site: pyrrolidone carboxylic acid (Gln) #status predicted 47 Modified site: pyrrolidone carboxylic acid (Gln) (in mature form) #status experimental 16 Modified site: pyrrolidone carboxylic acid (Gln) (in mature form) #status predicted 2 Modified site: pyrrolidone carboxylic acid (Gln) (in mature form) (partial) #status experimental 1 Modified site: pyrrolidone carboxylic acid (Gln) (partial) #status experimental 9 features found 4-38 FIND _______________________________________________________ FIND FIND searches the title index for all occurrences of a user-specified title or partial title. For each found, the entry-identifier and the title for each entry in which the search-string occurs are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=title. _______________________________________________________ FORMAT FIND [title] _______________________________________________________ PARAMETERS title The title parameter on the command line is the name (or part of the name) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted, you will be prompted for it with the prompt: _______________________________________________________ prompts Title: _______________________________________________________ DESCRIPTION The FIND command is the primary method for locating an entry in the databases by searching through the title index. For PIR entries, the title index is constructed from the title line, the alternate names and the contains fields. 4-39 FIND Displaying the List of All Titles If no title is typed in at the prompt and you press , a listing of all the titles in all the active databases will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the FIND command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-9__FIND_Command_Modifiers______________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed _______________________________________________________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 4-40 FIND _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match anywhere character string appears /ENTRY Match each character string to separate terms /MAIN Search TITLE field only NOT ALTERNATE NAMES or CONTAINS fields /TEXT Perform string search of ___________________________titles_(very_slow)__________ 4-41 FIND _______________________________________________________ WhenPusing the FIND command, type words (or parts of words) that characterize the entry. In general, it is best to type short words (each word must be at least 3 characters in length) that would occur regardless of alternate spellings of a name. The following examples demonstrate typical executions of the FIND command: 1 ATLAS> BASES PIR1 ATLAS> FIND HUMAN INSULIN GROWTH FACTOR insulin-like growth factor IA precursor - human PIR1:IGHU1 insulin-like growth factor IA precursor - human insulin-like growth factor IB precursor - human PIR1:IGHU1B insulin-like growth factor IB precursor - human insulin-like growth factor II precursor - human PIR1:IGHU2 insulin-like growth factor II precursor - human insulin-like growth factor-binding protein 1 precursor - human PIR1:IOHU1 insulin-like growth factor-binding protein 1 precursor - human insulin-like growth factor-binding protein 2 precursor - human PIR1:A41927 insulin-like growth factor-binding protein 2 precursor - human insulin-like growth factor-binding protein 3 precursor - human PIR1:IOHU3 insulin-like growth factor-binding protein 3 precursor - human 6 titles found In the above example, six entries were found in the PIR1 database. The following command sequence will generate a current list of methionine tRNA ligases and histidine tRNA ligases in PIR1 from species other than E. coli. 4-42 FIND 2 ATLAS> BASES PIR1 ATLAS> FIND METHIO TRNA LIG methionine--tRNA ligase (EC 6.1.1.10) - Escherichia coli PIR1:SYECMT methionine--tRNA ligase (EC 6.1.1.10) - Escherichia coli methionine--tRNA ligase (EC 6.1.1.10) - Thermus aquaticus PIR1:SYTWMT methionine--tRNA ligase (EC 6.1.1.10) - Thermus aquaticus methionine--tRNA ligase (EC 6.1.1.10), cytosolic - yeast (Saccharomyces) cerevisiae PIR1:SYBYMT methionine--tRNA ligase (EC 6.1.1.10), cytosolic - yeast (Saccharomyces cerevisiae) methionine--tRNA ligase (EC 6.1.1.10), mitochondrial - yeast (Saccharomyces) cerevisiae PIR1:SYBYMM methionine--tRNA ligase (EC 6.1.1.10), mitochondrial - yeast (Saccharomyces cerevisiae) 4 titles found 3 ATLAS> FIND/ADD HISTIDIN TRNA LIG histidine--tRNA ligase (EC 6.1.1.21) - Chinese hamster PIR1:SYHYHT histidine--tRNA ligase (EC 6.1.1.21) - Chinese hamster histidine--tRNA ligase (EC 6.1.1.21) - Escherichia coli PIR1:SYECH histidine--tRNA ligase (EC 6.1.1.21) - Escherichia coli histidine--tRNA ligase (EC 6.1.1.21) - human PIR1:SYHUHT histidine--tRNA ligase (EC 6.1.1.21) - human histidine--tRNA ligase (EC 6.1.1.21), cytosolic - yeast (Saccharomyces) cerevisiae PIR1:SYBYHC histidine--tRNA ligase (EC 6.1.1.21), cytosolic - yeast (Saccharomyces cerevisiae) histidine--tRNA ligase (EC 6.1.1.21), mitochondrial - yeast (Saccharomyces) cerevisiae PIR1:SYBYHM histidine--tRNA ligase (EC 6.1.1.21), mitochondrial - yeast (Saccharomyces cerevisiae) 5 titles found 4-43 FIND 4 ATLAS> FIND/SUB ESCHER COLI histidine--tRNA ligase (EC 6.1.1.21) - Escherichia coli PIR1:SYECH histidine--tRNA ligase (EC 6.1.1.21) - Escherichia coli methionine--tRNA ligase (EC 6.1.1.10) - Escherichia coli PIR1:SYECMT methionine--tRNA ligase (EC 6.1.1.10) - Escherichia coli 2 titles found 5 ATLAS> LIST 7 entries on the current list PIR1:SYBYMT methionine--tRNA ligase (EC 6.1.1.10), cytosolic - yeast (Saccharomyces cerevisiae) PIR1:SYBYMM methionine--tRNA ligase (EC 6.1.1.10), mitochondrial - yeast (Saccharomyces cerevisiae) PIR1:SYTWMT methionine--tRNA ligase (EC 6.1.1.10) - Thermus aquaticus PIR1:SYHUHT histidine--tRNA ligase (EC 6.1.1.21) - human PIR1:SYHYHT histidine--tRNA ligase (EC 6.1.1.21) - Chinese hamster PIR1:SYBYHM histidine--tRNA ligase (EC 6.1.1.21), mitochondrial - yeast (Saccharomyces cerevisiae) PIR1:SYBYHC histidine--tRNA ligase (EC 6.1.1.21), cytosolic - yeast (Saccharomyces cerevisiae) The FIND METHIO TRNA LIG command produced a current list of methionine tRNA ligases. The FIND/ADD HISTIDIN TRNA LIG command resulted in the addition of histidine tRNA ligases to the list, and the FIND/SUB ESCHER COLI command resulted in the removal of E. coli sequences from this list. The current list generated as a result of these operations is displayed using the LIST command described later in this chapter. 4-44 GENE _______________________________________________________ GENE GENE searches the gene name index for all occurrences of a user-specified gene name. For each name found, the name, the entry identifier, and the title for each entry in which the name occurs are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=gene_name. _______________________________________________________ FORMAT GENE [gene-name] _______________________________________________________ PARAMETERS gene-name The gene-name parameter on the command line is the name (or part of the name) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted, you will be prompted for it with the prompt: _______________________________________________________ prompts Gene: _______________________________________________________ DESCRIPTION The gene index is constructed from the Gene name: lines in PIR entries. 4-45 GENE Normally the gene name search matches the specified character string anywhere it appears within the gene name. Use the /ANCHOR modifier (listed below) to alter this operation. Displaying the List of Gene Names If no gene-name is typed in at the prompt and you press , an alphabetical listing of all gene names and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used, only the gene names will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the GENE command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-10__GENE_Command_Modifiers_____________________ Alternate_Display_Format___Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed 4-46 GENE _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMcommand line: 1 ATLAS> GENE ERA will produce the following output: 4-47 GENE 2 At-ERabp PIR1:S31584 auxin-binding protein precursor - Arabidopsis thaliana cytokeratin VIb PIR1:KRBOVI keratin, 54K type I cytoskeletal - bovine era PIR1:RGECGT transforming protein homolog (ras) - Escherichia coli ERabp PIR1:S16262 auxin-binding protein precursor - maize merA PIR1:RDPSHA mercury(II) reductase (EC 1.16.1.1) - Pseudomonas aeruginosa transposon Tn501 PIR1:RDEBHA mercury(II) reductase (EC 1.16.1.1) - Shigella flexneri plasmid R100 serA PIR1:DEECPG phosphoglycerate dehydrogenase (EC 1.1.1.95) - Escherichia coli 6 gene_names found From the results of the search, it is obvious that the GENE command is unanchored as the default. An anchored search narrows the list as in the following example: 3 ATLAS> GENE/ANCHOR ERA era PIR1:RGECGT Transforming protein homolog (ras) - Escherichia coli ERabp PIR1:S16262 auxin-binding protein precursor - maize 2 gene_names found 4-48 GET _______________________________________________________ GET GET generates or modifies a current list from entry- codes either in an external file or specified by the user. _______________________________________________________ FORMAT GET [file-name] _______________________________________________________ PARAMETERS file-name The file-name parameter on the command line identifies the file that contains the list of entry codes. The default file type is .COD. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts File: _______________________________________________________ DESCRIPTION Although not a text or data searching command, the GET command also accepts the current list modifiers. These modifiers allow the current list to be generated or modified by reading a list of entry-identifiers from an external file or specified by the user. The external file must contain one entry-code per line and the entry-code must begin in column 1 of each line. If column 1 is blank the entire line is ignored (this is useful for inserting comments into the file). 4-49 GET The /USER modifier allows the user to directly specify entry-codes and thereby edit the current list. If the entry-code is preceded by the database-code and if the database is not an active database, then the entry is ignored. If the entry-code is not preceded by the database- code, then the active databases are searched. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the GET command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. Table_4-11__GET_Command_Modifiers______________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /USER Entry-codes are specified ___________________________by_user_____________________ _______________________________________________________ TheMfollowing examples demonstrate how to bring an external file of entry-codes into the ATLAS program to become the new current list. 1 4-50 GET ATLAS> GET TEST.COD In this example, TEST.COD is the file-name for the list of entry-codes. The next example shows the contents of TEST.COD: 2 PIR1:CCHU PIR1:IGHU1 PIR1:IGHU1B PIR1:IGHU2 The list may be typed into a file using an editor or it can be one that is created using the LIST/OUTPUT=file-spec command described in detail later in this chapter. The next example shows an easy way to subtract an unwanted entry from the current list: 3 ATLAS> GET/USER/SUB Type the codes one per line with no spaces. Type a null line to exit. CCHU 1 entry found 4 ATLAS> LIST 3 entries on the current list PIR1:IGHU1 insulin-like growth factor IA precursor - human PIR1:IGHU1B insulin-like growth factor IB precursor - human PIR1:IGHU2 insulin-like growth factor II precursor - human The LIST command shows the current list without entry- code CCHU. 4-51 HELP _______________________________________________________ HELP HELP displays a list of the individual commands and command modifiers. _______________________________________________________ FORMAT HELP [command-name] _______________________________________________________ PARAMETERS command-name The command-name can be any of the commands of the ATLAS program. If no command-name is typed in at the prompt and you press , a listing of all topics on which HELP is available is displayed. _______________________________________________________ MODIFIERS Table_4-12__HELP_Command_Modifiers_____________________ Alternate_Display_Format___Modifier_Function___________ /BRIEF List ATLAS commands and their modifiers 4-52 HELP _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ 4-53 JOURNAL _______________________________________________________ JOURNAL JOURNAL searches the index of literature citations for all occurrences of a user-specified citation or partial citation. For each citation found, the full citation, the entry identifier, and the title for each entry that contains the citation are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=journal. _______________________________________________________ FORMAT JOURNAL [literature-citation] _______________________________________________________ PARAMETERS literature-citation The literature-citation parameter on the command line is the citation (or part of the citation) to be selected. The character string(s) you enter must contain at least 3 characters. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Journal: _______________________________________________________ DESCRIPTION Normally the literature citation search matches a citation whenever the character string specified occurs anywhere within a citation. Use the /ANCHOR modifier to cause the search to match citations only 4-54 JOURNAL when the specified character string occurs at the beginning of the citation. Displaying the List of Literature Citations If no literature-citation is typed in at the prompt and you press , an alphabetical listing of all citations and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used only the citations will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the JOURNAL command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-13__JOURNAL_Command_Modifiers__________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed _______________________________________________________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 4-55 JOURNAL _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match anywhere character string appears /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMcommand line: 1 ATLAS> JOURNAL/BRIEF 1990 will produce the list of all literature citations in the active databases for the year 1990. The list of entries containing such citations becomes the current list. 4-56 JOURNAL Note: Citations in which 1990 occurs in a context other than as the year, such as a page number (or part of a page number), will also be found and will appear on the current list of citations. To find all entries in PIR and GenBank that cite a given article, it is usually sufficient to give three character-strings: one representing part of the journal name abbreviation, one representing the page number(s), and one representing the volume number or year. (These elements need not be in the order in which they appear in the citation.) For example, 2 ATLAS> JOURNAL Journal: acids 1105 1982 Nucleic Acids Res. 10, 1105-1112, 1982 PIR1:KIBPD4 deoxynucleotide monophosphate kinase (EC 2.7.1.-) - Phage T4 PIR1:ZABPT4 gene 57A protein - Phage T4 1 journal found Enclose low page numbers in quotes, including a space on either side as in the following example: 3 ATlAS> jou Acta 446 " 1-9 " Biochim. Biophys. Acta 446, 1-9, 1976 PIR1:H3NJ1C cytotoxin 1 - Cape cobra PIR1:H3NJ3C cytotoxin 3 - Cape cobra PIR1:H3NJ2C cytotoxin 2 - Cape cobra 1 journal found 4-57 KEYWORD _______________________________________________________ KEYWORD KEYWORD searches the keyword index for all occurrences of a user-specified keyword or partial keyword. For each keyword found, the keyword, the entry- identifier, and the title for each entry in which the keyword occurs are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=keyword. _______________________________________________________ FORMAT KEYWORD [keyword] _______________________________________________________ PARAMETERS keyword The keyword parameter on the command line is the keyword (or part of the keyword) to be selected. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Keyword: _______________________________________________________ DESCRIPTION The keyword index is constructed from the Keywords: line of each entry. Normally the keyword search matches a keyword whenever the character string specified occurs anywhere within the keyword. Use the /ANCHOR modifier (listed below) to cause the search 4-58 KEYWORD to match keywords only when the specified character string occurs at the beginning of the keyword. Displaying the List of Keywords If no keyword is typed in at the prompt and you press , an alphabetical listing of all keywords and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used only the keywords will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the KEYWORD command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-14__KEYWORD_Command_Modifiers__________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing the keyword /SHOW Display how search string is parsed _______________________________________________________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 4-59 KEYWORD _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMfollowing examples demonstrate how to generate a list of terms to select from when a single keyword may not be specific enough. Then use the keyword command to retrieve a more specific list. 1 4-60 KEYWORD ATLAS> B PIR* ATLAS> KEYWORD/BRIEF ADHESION cell adhesion membrane adhesion pili adhesion 3 keywords found 2 ATLAS> KEYWORD CELL ADHESION cell adhesion PIR1:BNRT3 myelin-associated glycoprotein precursor, long form - rat PIR1:BNRT3S myelin-associated glycoprotein precursor, short form - rat PIR1:RWHU1B cell surface glycoprotein CD11b precursor - human PIR1:RWHU1C cell surface glycoprotein CD11c precursor - human PIR1:FNHU fibronectin precursor - human PIR1:IJFFTM cadherin-related tumor suppressor precursor - fruit fly (Drosophila melanogaster) PIR1:IJHUCN N-cadherin precursor, neuronal - human PIR1:IJBOCN N-cadherin precursor - bovine (fragment) PIR1:IJMSCN N-cadherin precursor, neuronal - mouse PIR1:IJCHCN N-cadherin precursor, neuronal - chicken PIR1:IJXLC2 N-cadherin 2 precursor - African clawed frog PIR1:IJXLC1 N-cadherin 1 precursor - African clawed frog PIR1:A47543 R-cadherin precursor - mouse PIR1:IJCHCR R-cadherin precursor - chicken PIR1:IJHUCE E-cadherin precursor - human . . . PIR2:S19872 vascular cell adhesion molecule 1 precursor - rat PIR2:JS0675 vascular cell adhesion molecule-1 precursor - rat PIR2:B37057 integrin beta-6 chain - guinea pig (fragment) 1 keyword found 4-61 LIST _______________________________________________________ LIST The LIST command can be used to: o Display the entry identifier and title of every entry on the current list o Display the entry identifier and title of every entry in the active databases (/ALL) o Save the entry-identifiers to a file (/OUTPUT=file- spec) o Save the current list to a file (/PRINTER=file- spec) o Restore the internally saved current list (/RESTORE) _______________________________________________________ FORMAT LIST _______________________________________________________ DESCRIPTION The LIST command displays the entry-identifier and title of each entry on the current list. Before each execution of a command that generates a new current list, the old current list is saved. If execution of a command does not produce the desired new current list, the old current list can be restored with the LIST/RESTORE command. The LIST/OUTPUT=file-spec and LIST/PRINTER=file-spec differ in that the /OUTPUT modifier will save a list of only the database-codes and entry-codes for entries on the current list whereas the /PRINTER=file-spec modifier saves the screen output including both the entry-identifiers AND titles. The current list saved 4-62 LIST with a LIST/OUTPUT=file-spec command can be generated again very quickly with the GET command. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the LIST command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-15__LIST_Command_Modifiers_____________________ Current_List_Processing____Modifier_Function___________ /CURRENT (default) Process only entries on current list /ALL List all entries in the active databases /RESTORE Restore previous current list /SET Active databases become current list _______________________________________________________ Redirecting_Output_________Modifier_Function___________ /OUTPUT=file-spec Direct current list entry- codes to a file /PRINTER Direct current list codes and titles to printer /PRINTER=file-spec Direct current list codes ___________________________and_titles_to_disk_file_____ 4-63 MATCH _______________________________________________________ MATCH The MATCH command searches a sequence for all segments that match a user-specified peptide. A similar command, SCAN, searches a precomputed index of amino acid locations and is suitable when there is a large number of sequences. MATCH goes through the sequence and compares the unknown peptide with the sequence residue by residue; therefore it is suitable when there is a small number of sequences. MATCH can be used to search an entire database but it is quite slow (e.g., it takes several minutes to search just PIR1.) _______________________________________________________ FORMAT MATCH [entry-identifier] _______________________________________________________ PARAMETERS entry-identifier If the entry-identifier parameter is omitted from the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Code: If you press at the code prompt, the program will process the entries on the current list. This is equivalent to using the modifier /CURRENT. _______________________________________________________ prompts Peptide: 4-64 MATCH Enter a string of amino acid symbols (A,C,D,E,F,G,H,I,K,L,M,N,O,P,Q,R,S,T,V,W,Y or ambiguous amino acid symbols B = N or D, Z = E or Q, and X = any amino acid). The MATCH command ignores spaces and hyphens (-) in the search string entered by the user. The use of the special symbols + * ( ) [ ] is described below. The last search string entered can be recalled by typing a period followed by . Just type to exit the MATCH command. The maximum length of the unknown peptide is 30 residues. _______________________________________________________ prompts Number of mismatches allowed (e to exit): Enter a number. The last number entered can be recalled by typing a period followed by . Just typing is equivalent to entering 0. _______________________________________________________ DESCRIPTION For each segment found, the display shows: the number of the first residue in the matching segment, the ten residues preceding the matching segment, the matching segment, the ten residues following the matching segment. In the case of nonexact matches, the display also shows the number of mismatches. Use of + and * in the unknown peptide The symbol + preceding the peptide limits the search to the amino end of the sequence. No other symbols can appear before the +. The symbol * following the peptide limits the search to the carboxyl end of the sequence. No other symbols can appear after the *. 4-65 MATCH Use of parentheses in the unknown peptide One or more regions of the unknown peptide may be enclosed by parentheses. No mismatches are allowed for those regions within the parentheses. For example, if the peptide is KG(HPETL)EK(VQK), then mismatches can occur only in the KG and EK regions. The regions HPETL and VQK must match exactly. The MATCH command does not find insertion and/or deletion events. Use of brackets in the unknown peptide One or more regions of the unknown peptide may be enclosed by brackets. The residues within the brackets denote ambiguous residues at a position in the peptide. For example, if the peptide is KG[HP]EK[VQK], then there are 6 positions in the peptide. For a match, at position 3 either H or P can occur in the sequence and at position 6 either V, Q, or K can occur. Note that positions 4 and 6 will also match Z because Z = E or Q. The ambiguous symbol B = D or N is treated similarly. However, an X in the unknown peptide will match any residue in the sequence, but an X in the sequence will only match an X in the peptide. Both parentheses and brackets may be used in the unknown peptide, but a parenthesis cannot occur within a bracketed region. When ambiguous positions are specified with the [..] notation the display shows a dot at those positions. The /SHOW modifier can be used to show the positions in the peptide and the residues in the sequence that will produce a match. The two modifiers /PEPTIDE=peptide and /MISMATCHES=number inhibit prompting for the search parameters. With /PEPTIDE=peptide the peptide is entered on the command line and it is used in all subsequent searches until the command terminates. The same holds for /MISMATCHES=number. The number of mismatches is read from the command line and used on 4-66 MATCH subsequent searches until the command terminates. When you use either modifier on the command line there is no prompt for the corresponding search parameter. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the MATCH command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. Table_4-16__MATCH_Command_Modifiers____________________ Search_Parameters__________Modifier_Function___________ /PEPTIDE=peptide Specify the peptide /MISMATCHES=number Specify the maximum number of mismatches allowed /SHOW Shows the positions in the peptide and the matching residues _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process entries on the current list /DEFINE Affect the resultant current list /DEFINE/ADD Add entries found to the current list /DEFINE/SUBTRACT Remove entries found form the current list /FULL Report No matches found, normally these messages are suppressed 4-67 MATCH Table_4-16_(Cont.)__MATCH_Command_Modifiers____________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ _______________________________________________________ ToAbeLable to specify multiple possibilities for residue positions is a powerful feature of the MATCH command. In the following example, chymase sequences are searched for possible carbohydrate binding sites (Asn) and a current list is created. 1 ATLAS>B PIR* ATLAS>FIND CHYMASE . . . 10 titles found ATLAS>MATCH/CURRENT/DEFINE Peptide:NX[ST] Number of mismatches allowed (e to exit):0 PIR1:KYHUCM 247 residues chymase (EC 3.4.21.39) precursor - human Matching----------- NX. residue ! 80 RSITVTLGAH NIT EEEDTWQKLE 103 VIKQFRHPKY NTS TLHHDIMLLK PIR1:A46504 246 residues chymase (EC 3.4.21.39) precursor - mouse 102 VEKQIIHKNY NVS FNLYDIMLLK 4-68 MATCH PIR2:A35842 249 residues chymase (EC 3.4.21.39) precursor - dog 121 IMLLKLKEKA NLT LAVGTLPLSP 155 RVAGWGKRQV NGS GSDTLQEVKL PIR2:A41076 247 residues chymase (EC 3.4.21.39) 5 precursor - mouse 80 RSITVLLGAH NKT SKEDTWQKLE PIR2:A46721 244 residues chymase (EC 3.4.21.39) precursor - mouse 44 YMAYLKFTTK NGS KERCGGFLIA 67 PQFVMTAAHC NGS EISVILGAHN PIR2:S23505 177 residues chymase (EC 3.4.21.39) 1 - mouse 33 VEKYILPPNY NVS SKFNDIVLLK 51 IVLLKLEKQA NLT SAVDVVPLPA PIR2:S23504 247 residues chymase (EC 3.4.21.39) 2 - mouse 80 RSITVLLGAH NKT SKEDTWQKLE PIR3:S33247 226 residues chymase (EC 3.4.21.39) - human 59 RSITVTLGAH NIT EEEDTWQKLE 82 VIKQFRHPKY NTS TLHHDIMLLK ATLAS>LIST 8 entries on the current list PIR1:KYHUCM chymase (EC 3.4.21.39) precursor - human PIR1:A46504 chymase (EC 3.4.21.39) precursor - mouse PIR2:A35842 chymase (EC 3.4.21.39) precursor - dog PIR2:A41076 chymase (EC 3.4.21.39) 5 precursor - mouse PIR2:A46721 chymase (EC 3.4.21.39) precursor - mouse PIR2:S23505 chymase (EC 3.4.21.39) 1 - mouse PIR2:S23504 chymase (EC 3.4.21.39) 2 - mouse PIR3:S33247 chymase (EC 3.4.21.39) - human 4-69 MEMBERS _______________________________________________________ MEMBERS MEMBERS searches the members index for all occurrences of a user-specified PIR entry-code. For each code found, the code, the alignment identifier, and the title for each alignment in which the code occurs are displayed. The list of these alignments becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=members. _______________________________________________________ FORMAT MEMBER [entry-code] _______________________________________________________ PARAMETERS entry-code The entry-code parameter on the command line is the code (or part of the code) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted, you will be prompted for it with the prompt: _______________________________________________________ prompts Members: _______________________________________________________ DESCRIPTION The members index is constructed from the Members: line in the ALN alignment database and consists of the PIR entry-codes for the sequences listed in the alignment. 4-70 MEMBERS Normally the members search matches anywhere the character string specified occurs in the entry-code. Use the /ANCHOR modifier (listed below) to alter this operation. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the MEMBERS command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-17__MEMBERS_Command_Modifiers__________________ Alternate_Display_Format___Modifier_Function___________ /BRIEF Only the entry codes are displayed /COUNTS Display number of entries containing search term _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found added to current list /SUBTRACT Remove entries from current list /KEEP Do not alter current list with command execution 4-71 MEMBERS _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match character string at beginning of term /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMfollowing example demonstrates how to find out whether a PIR1 myosin entry with code MORTA1 is in an alignment in the ALN database: 1 ATLAS> MEM MORTA1 MORTA1 ALN:FA0247 Myosin L1 (A1) and L4 (A2) catalytic light chains 1 members found To display the alignment use the TYPE command: 2 ATLAS> TYPE FA0247 4-72 PRINT _______________________________________________________ PRINT PRINT displays the contents of an external file on the screen. _______________________________________________________ FORMAT PRINT [file-spec] _______________________________________________________ PARAMETERS file-spec The file-spec parameter on the command line specifies the file to be displayed. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts File: The command aborts if no file-spec is given. 4-73 QUIT _______________________________________________________ QUIT QUIT terminates the execution of the program and returns control to the system. _______________________________________________________ FORMAT QUIT 4-74 REFERENCE _______________________________________________________ REFERENCE REFERENCE searches the reference number index for all occurrences of a user-specified number or partial number. For each number found, the number, the entry- identifier, and the title for each entry found are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=reference/ANCHOR. _______________________________________________________ FORMAT REFERENCE [reference-number] _______________________________________________________ PARAMETERS reference-number The reference-number parameter on the command line is the reference number (or part of the number) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Reference: _______________________________________________________ DESCRIPTION Each journal article processed for the PIR databases is assigned a unique reference number. 4-75 REFERENCE Normally the reference number search matches only reference numbers that begin with the character string you specified. Use the /NOANCHOR modifier (listed below) to alter this operation. Displaying the List of Reference Numbers If no reference number is typed in at the prompt and you press , an alphabetical listing of all reference numbers and the titles of the entries will be displayed. If the /BRIEF modifier (listed below) is used only the reference numbers will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the REFERENCE command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-18__REFERENCE_Command_Modifiers________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term 4-76 REFERENCE _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR (default) Match terms beginning with character string /NOANCHOR Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMcommand line: 1 ATLAS> REFERENCE A008 will produce the list of reference numbers that begin with A008 and the entry-codes and titles of the associated PIR entries. 4-77 REFERENCE Typing the full number will produce those PIR entries containing that particular reference. For example: 2 ATLAS> REFERENCE A00899 A00899 PIR1:IFBY beta-fructofuranosidase (EC 3.2.1.26) precursor - yeast (Saccharomyces cerevisiae) 1 reference found 4-78 REPORT _______________________________________________________ REPORT REPORT displays the values for user-specified arithmetic expressions or field names. One to five expressions or fields may be specified. If more than one is specified, they must be separated by commas. Statistics are calculated for the first expression only. The list of these entries becomes the new current list. _______________________________________________________ restrictions This command has been implemented for VAX/VMS and OpenVMS systems only! _______________________________________________________ FORMAT REPORT _______________________________________________________ PARAMETERS When the REPORT command has been given and the has been pressed, you will be prompted for the following: _______________________________________________________ prompts Expression: Specify the expressions or field names described below. 4-79 REPORT _______________________________________________________ prompts Sort in ascending or descending order (A/D/N for no): If the /CURRENT modifier is used, the display list may be sorted in the order of the specified expressions. Respond with if no sort is wanted, otherwise respond as directed with either A or D. _______________________________________________________ DESCRIPTION The REPORT command tabulates ancillary data such as sequence length, molecular weight, amino acid composition, etc. Valid expressions may be single field names or arithmetic expressions containing numerical field names and numerical constants as arithmetic elements. Arithmetic expressions are strings of arithmetic elements separated by operators or parentheses. Expressions are evaluated according to the following operator precedence. Table_4-19__Expression_Operators_______________________ operator____function______________precedence___________ * multiplication second / division second + addition third -___________subtraction___________third________________ Operators are evaluated in the order of their precedence. Operators with the same precedence are evaluated from left to right. Subexpressions separated by parentheses are evaluated first irrespective of operator precedence. Numerical constants may be integers or real numbers. 4-80 REPORT Table_4-20__Expression_Fields__________________________ Nonnumerical_fields________Description_________________ CODE Entry code GROUP Taxonomic group SUPERFAMILY Superfamily number FAMILY Family number SUBFAMILY Subfamily number ENTRY Entry number SUBENTRY Subentry number TYPE Sequence type (P=complete, F=fragment) TEXT_UPDATE Date of last text update (YYMMDD) SEQ_UPDATE Date of last sequence update (YYMMDD) ADDED_DATE Date added to database (YYMMDD) 4-81 REPORT _______________________________________________________ Numerical_fields___________Description_________________ MW Molecular weight LEN Sequence length %Ala % Alanine %Arg % Arginine %Asn % Asparagine %Asp % Aspartic acid %Asx % Asp or Asn %Cys % Cysteine %Glu % Glutamic acid %Gln % Glutamine %Glx % Glu or Gln %Gly % Glycine %His % Histidine %Ile % Isoleucine %Leu % Leucine %Lys % Lysine %Met % Methionine %Phe % Phenylalanine %Pro % Proline %Ser % Serine %Thr % Threonine %Trp % Tryptophan %Tyr % Tyrosine %Val % Valine %X_________________________%_Unknown___________________ _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the REPORT command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. 4-82 REPORT Table_4-21__REPORT_Command_Modifiers___________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ _______________________________________________________ EXAMPLES 1 ATLAS> find bov thymosin alpha thymosin alpha-1 - bovine PIR1:TNBOA1 thymosin alpha-1 - bovine 1 title found ATLAS> REPORT Expression: (%ASP+%GLU)*LEN/100 9.00 TNBOA1 thymosin alpha-1 - bovine The above expression gives the number of acidic amino acids in a sequence. For bovine thymosin alpha-1, which has a length of 28, 10.71% Asp, and 21.43% Glu, the expression is equal to 9.0. 4-83 SCAN _______________________________________________________ SCAN SCAN is a fast sequence matching routine that locates exactly matching segments. The entries containing the test segment become the new current list. _______________________________________________________ FORMAT SCAN [peptide] _______________________________________________________ PARAMETERS peptide The peptide parameter on the command line is the peptide, in one-letter amino acid notation, to be located. The peptide must be at least three residues but not greater than 30 residues long. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Peptide: _______________________________________________________ DESCRIPTION The SCAN command searches a tripeptide index for nearly instantaneous matching of identical segments. Even though the string of amino acid residues must match identically, the SCAN command ignores spaces and hyphens (-) in the search string entered by the user. 4-84 SCAN _______________________________________________________ MODIFIERS In this summary the modifiers accepted by the SCAN command are grouped according to their function. Table_4-22__SCAN_Command_Modifiers_____________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Information Category Selection__________________Modifier_Function___________ /TITLE Display only titles of entries found /SEGMENT Display only segments of entries found _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ _______________________________________________________ InAtheEfollowing example, the SCAN command was used to search the PIR1 dataset for the string CIDCGA, which forms part of the active site of certain ferredoxins. 1 4-85 SCAN ATLAS> SCAN CIDCGA 5 matches found Database: PIR1 FECLCE ferredoxin - Clostridium sp. 37 DAVRVIDADK CIDCGA CANTCPVDAI FECLCU ferredoxin - Clostridium acidiurici 37 DSRYVIDADT CIDCGA CAGVCPVDAP FECLCT ferredoxin - Clostridium thermosaccharolyticum 37 TGKYEVDADT CIDCGA CEAVCPTGAV FEME ferredoxin - Megasphaera elsdenii 36 GETKYVVTDS CIDCGA CEAVCPTGAI FETWT ferredoxin - Thermus aquaticus 39 GDQFYIHPEE CIDCGA CVPACPVNAI As indicated here, the peptide string can be entered on the command line; otherwise, a prompt for the string will be issued. The display of the results of the search show the number of matches found, the database searched, and the entry-code and title for each hit. The line underneath the title of each match shows the sequence position of the first matching residue, the ten residues preceding the matching segment, the matching residues, and the ten residues following them. The list of these five entries becomes the new current list, which can then be further manipulated. 4-86 SEARCH _______________________________________________________ SEARCH SEARCH/FIELD=field-name searches the specified indexed field(s) for all occurrences of a user-specified parameter. The fields are those available under the text searching commands. For each parameter found, the parameter, the entry-identifier, and the title for each entry are displayed. The list of these entries becomes the new current list. _______________________________________________________ FORMAT SEARCH/FIELD=field-name[+field-name] [parameter] _______________________________________________________ PARAMETERS parameter The parameter on the command line is the character string to be selected. Each word in the character string must contain at least 3 characters and must be appropriate for the field(s) to be searched. If this parameter is omitted, you will be prompted for it with a prompt for the field (i.e. Keyword:) or, if multiple fields are to be searched, the following prompt appears: _______________________________________________________ prompts Term: 4-87 SEARCH _______________________________________________________ DESCRIPTION The SEARCH/FIELD command is the most flexible means of searching the indexed fields. When the DEFINE command is used to create an abbreviation for SEARCH/FIELD=field-name(s), the user can readily customize the program to meet specialized needs. Note: The field-name may NOT be abbreviated. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SEARCH command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-23__SEARCH_Command_Modifiers___________________ Alternate_Display_Format___Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed 4-88 SEARCH _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /NOANCHOR (default) Matches character string anywhere in term /ANCHOR Matches terms beginning with character string /ENTRY Match each character string to separate terms /MAIN (for /FIELD=title) Search titles only NOT alternate names or contains fields /TEXT (for /FIELD=title) Perform string search on titles (does not use indexes and is slow for ___________________________large_numbers_of_entries)___ 4-89 SEARCH _______________________________________________________ TheMfollowing example demonstrates the SEARCH command: 1 ATLAS> B PIR1 ATLAS> SEARCH/FIELD=KEYWORD/ENTRY KINASE RECEPT PIR1:KIHUCA protein kinase C (EC 2.7.1.-) alpha - human PIR1:KIRTC protein kinase C (EC 2.7.1.-) alpha - rat PIR1:KIMSCA protein kinase C (EC 2.7.1.-) alpha - mouse . . . 47 entries found In this example, the /ENTRY modifier was used to search the index for two separate terms: one containing KINASE and the other containing RECEPT. Without this modifier, the program assumes both character strings are part of the same search term. The entries found in example 1 had a keyword containing KINASE and a keyword containing RECEPT. The following example shows the results of the same search without the /ENTRY modifier: 2 ATLAS> SEARCH/FIELD=KEYWORD KINASE RECEPT Keyword: KINASE RECEPT No keywords found 4-90 SELECT _______________________________________________________ SELECT SELECT searches for entries on the basis of a user- specified arithmetic expressions or field names. One to five expressions or fields may be specified. If more than one is specified, they must be separated by commas. Statistics are calculated for the first expression only. The list of these entries becomes the new current list. _______________________________________________________ restrictions This command has been implemented for VAX/VMS and OpenVMS sytems only! _______________________________________________________ FORMAT SELECT _______________________________________________________ PARAMETERS When the SELECT command has been given and the has been pressed, you will be prompted for the following: _______________________________________________________ prompts Expression: Specify the expression to be searched for. Valid expressions may be arithmetic expressions containing numerical field names and numerical constants or single nonnumerical field names. Respond with a ? for more details. 4-91 SELECT _______________________________________________________ prompts Range: A range or a single number may be specified. Valid range specifications are of the form R1-R2. The terms R1 and R2 are the lower and upper bounds of the inclusive search range. These terms may be character strings if the expression contains only a single nonnumerical field specification. R1-* specifies a range greater than or equal R1. *-R2 specifies a range less than or equal to R2. If a single number is specified, values equal to this number are searched for. _______________________________________________________ DESCRIPTION Table_4-24__Expression_Operators_______________________ operator____function______________precedence___________ * multiplication second / division second + addition third -___________subtraction___________third________________ Operators are evaluated in the order of their precedence. Operators with the same precedence are evaluated from left to right. Subexpressions separated by parentheses are evaluated first irrespective of operator precedence. Numerical constants may be integers or real numbers. 4-92 SELECT Table_4-25__Expression_Fields__________________________ Nonnumerical_fields________Description_________________ CODE Entry code GROUP Taxonomic group SUPERFAMILY Superfamily number FAMILY Family number SUBFAMILY Subfamily number ENTRY Entry number SUBENTRY Subentry number TYPE Sequence type (P=complete, F=fragment) TEXT_UPDATE Date of last text update (YYMMDD) SEQ_UPDATE Date of last sequence update (YYMMDD) ADDED_DATE Date added to database (YYMMDD) 4-93 SELECT _______________________________________________________ Numerical_fields___________Description_________________ MW Molecular weight LEN Sequence length %Ala % Alanine %Arg % Arginine %Asn % Asparagine %Asp % Aspartic acid %Asx % Asp or Asn %Cys % Cysteine %Glu % Glutamic acid %Gln % Glutamine %Glx % Glu or Gln %Gly % Glycine %His % Histidine %Ile % Isoleucine %Leu % Leucine %Lys % Lysine %Met % Methionine %Phe % Phenylalanine %Pro % Proline %Ser % Serine %Thr % Threonine %Trp % Tryptophan %Tyr % Tyrosine %Val % Valine %X_________________________%_Unknown___________________ _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SELECT command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. 4-94 SELECT Table_4-26__SELECT_Command_Modifiers___________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution /EQUALS=entry-identifier Searches for values equal to or within a range of the value of a user-specified sequence /BRIEF Suppresses warning messages that occur when an expression cannot be evaluated (e.g., when a division by zero is attempted) _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ _______________________________________________________ EXAMPLES 1 ATLAS> SELECT Expression: MW Range: 95000-98000 The above example will produce a list of proteins with a molecular weight between 95 and 98 kD. 4-95 SELECT 2 ATLAS> SELECT/EQUALS=kdhop Expression:superfamily This example will produce the list of entries in the same superfamily as the sequence with code KDHOP 4-96 SET _______________________________________________________ SET SET is used to change program operation parameters. _______________________________________________________ SYNTAX SET/MODIFIER _______________________________________________________ DESCRIPTION The SET command changes the default parameters of the program. With the /CLOCK modifier, SET is used to set the command execution timer. The timer displays the CPU time and elapsed time for the execution of each command. The timer is initially off. The /MENU modifier, for the PC version, will change from command mode to the menu system. The /SWITCH modifier allows the user to change the modifier character ` / ' to another specified character. This is particularly useful for UNIX users when the slash character interferes with slashes used in pathnames. _______________________________________________________ MODIFIERS The following table summmarizes the modifiers accepted by the SET command. Table_4-27__SET_Command_Modifiers______________________ Modifier___________________Function____________________ /CLOCK Set internal timer /MENU Change from command mode to menu mode (PC version only) 4-97 SET Table_4-27_(Cont.)__SET_Command_Modifiers______________ Modifier___________________Function____________________ /NOWRAP Turns line wrap off. For subsequent text searching commands which use /PRINTER=filename to create an output file, the lines written are not broken up into multiple lines so that they fit into 80 columns. They are written on one line which may be up to 512 characters long. The action remains into effect until the SET/WRAP command is issued. /SWITCH=switch character Change modifier character from ` / ' /WRAP Turns line wrap on. For subsequent text searching commands which use /PRINTER=filename to create an output file, the lines writtenn are broken into multiple lines so that they fit into 80 columns. The action remains into effect until the SET/NOWRAP ___________________________command_is_issued.__________ 4-98 SHOW _______________________________________________________ SHOW The SHOW command displays information about the program operation. _______________________________________________________ FORMAT SHOW _______________________________________________________ DESCRIPTION Unmodified, or with the default /FIELDS modifier, the SHOW command displays the indexed fields currently available for the databases in the term index system. SHOW with the /COMMANDS modifier will also display the definitions for the commands to search these indexes. The /DISPLAYS modifier will list all the symbols that have been defined and the display layouts they represent. The /DATABASE and /INFORMATION modifiers will display the header information for the active databases and the Atlas program respectively. To display the date and time, use the /TIME modifier. The /TOTALS modifier will display the composition table generated by the previous execution of the TYPE command. See the discussion under the TYPE command. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SHOW command are grouped according to their function. Table 4-28 SHOW Command Modifiers 4-99 SHOW Table_4-28_(Cont.)__SHOW_Command_Modifiers_____________ _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Miscellaneous______________Modifier_Function___________ /COMMANDS Display command definitions for each field /DATABASE Display headers for each active database /DISPLAYS List all the symbols defined for display layouts /FIELDS (default) Display indexed fields for each database /INFORMATION Display Atlas program header /TIME Display current date and time /TOTALS Display the composition table generated by the TYPE ___________________________command_____________________ _______________________________________________________ TheMfollowing example illustrates the show command: 1 4-100 SHOW ATLAS> SHOW/COMMANDS GENE SEARCH/FIELD=GENE_NAME CROSS SEARCH/FIELD=CROSS_REF MEMBERS SEARCH/FIELD=MEMBERS REFERENCE SEARCH/FIELD=REFERENCE/ANCHOR ACCESSION SEARCH/FIELD=ACCESSION/ANCHOR SPECIES SEARCH/FIELD=SPECIES SFNUM SEARCH/FIELD=SUPFAM_NUM SUPERFAMILY SEARCH/FIELD=SUPERFAMILY JOURNAL SEARCH/FIELD=JOURNAL FEATURE SEARCH/FIELD=FEATURE AUTHOR SEARCH/FIELD=AUTHOR/ANCHOR KEYWORD SEARCH/FIELD=KEYWORD FIND SEARCH/FIELD=TITLE KT SEARCH/FIELD=KEYWORD+TITLE The command definition for each of the text searching commands is listed; user-defined commands appear at the bottom of the list. In this case, KT was defined in example 2 of the DEFINE command. 4-101 SPECIES _______________________________________________________ SPECIES SPECIES searches the species index for all occurrences of a user-specified species name or partial name. For each name found, the name, the entry- identifier, and the title for each entry in which the species name occurs are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=species. _______________________________________________________ FORMAT SPECIES [species-name] _______________________________________________________ PARAMETERS name The species name parameter on the command line is the name (or part of the name) to be selected. The character string you enter must contain at least 3 characters. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Species: _______________________________________________________ DESCRIPTION Normally the species name search matches a name whenever the character string you specified occurs anywhere within the name. Use the /ANCHOR modifier to cause the search to match species names only when the 4-102 SPECIES specified character string occurs at the beginning of the name. Displaying the List of Species If no species name is typed in at the prompt and you press , an alphabetical listing of all species names and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used only the names will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SPECIES command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-29__SPECIES_Command_Modifiers__________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed _______________________________________________________ Database_List_Processing___Modifier____________________ /PIR Ignore all databases except PIR1, PIR2, and PIR3 4-103 SPECIES _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMfollowing example demonstrates the SPECIES command: 1 4-104 SPECIES ATLAS> BASES * ATLAS> SPECIES ALCALIG DENIT Warning: not all the requested indexes exist. Alcaligenes denitrificans PIR1:AZALCD azurin precursor - Alcaligenes denitrificans PIR2:PH1228 D-aminoacylase (EC 3.5.1.-) - Alcaligenes denitrificans (fragment) NRL_3D:1AZBA azurin (copper removed), chain A - Alcaligenes denitrificans NRL_3D:1AZBB azurin (copper removed), chain B - Alcaligenes denitrificans NRL_3D:1AZCA azurin (apo), chain A - Alcaligenes denitrificans NRL_3D:1AZCB azurin (apo), chain B - Alcaligenes denitrificans NRL_3D:1AIZA azurin (with cadmium), chain A - Alcaligenes denitrificans NRL_3D:1AIZB azurin (with cadmium), chain B - Alcaligenes denitrificans NRL_3D:2AZAA azurin (oxidized), chain A - Alcaligenes denitrificans NRL_3D:2AZAB azurin (oxidized), chain B - Alcaligenes denitrificans Alcaligenes denitrificans subsp. xylosoxydans PIR2:A61148 cyanidase - Alcaligenes denitrificans subsp. xylosoxydans (strain) DF3 (fragment) PIR3:S17501 glutaminase - Alcaligenes denitrificans subsp. xylosoxydans 2 species found In this example, all the databases were activated by the command BASES *. When the SPECIES command was issued, a message appeared warning that not all the active databases are included in this index. 4-105 SUPERFAMILY _______________________________________________________ SUPERFAMILY SUPERFAMILY searches the index of superfamily names for all occurrences of a user-specified name or partial name. For each name found, the superfamily, the database, the entry identifier, and the title for each entry belonging to the superfamily are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=superfamily. _______________________________________________________ FORMAT SUPERFAMILY [superfamily-name] _______________________________________________________ PARAMETERS superfamily-name The superfamily-name parameter on the command line is the name (or part of the name) of the superfamily to be selected. Each word you enter must contain at least 3 characters. If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Superfamily: Note: Classification of sequences into superfamilies is part of the annotation and verification process; therefore, not all sequences have been assigned to superfamilies. 4-106 SUPERFAMILY _______________________________________________________ DESCRIPTION Superfamily Definition A superfamily is a group of protein domains that share sequence similarity due to common ancestry. Although the protein domain often extends the entire length of protein, those proteins with more than one domain may belong to more than one superfamily. Displaying the List of Superfamilies If no superfamily-name is typed in at the prompt and you press , an alphabetical listing of all superfamilies and the titles of their associated entries will be displayed. If the /BRIEF modifier (listed below) is used only the superfamily names will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SUPERFAMILY command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-30__SUPERFAMILY_Command_Modifiers______________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed 4-107 SUPERFAMILY Table_4-30_(Cont.)__SUPERFAMILY_Command_Modifiers______ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ _______________________________________________________ TheMfollowing command line demonstrates the SUPERFAMILY command: 1 4-108 SUPERFAMILY ATLAS> SUPERFAMILY ENKEPHALIN proenkephalin PIR1:EQHUA enkephalin precursor - human PIR1:EQBOA enkephalin precursor - bovine PIR1:EQRTA enkephalin A precursor - rat PIR1:EQXL enkephalin precursor - African clawed frog PIR1:DFHU beta-neoendorphin / dynorphin precursor - human PIR1:DFPG beta-neoendorphin / dynorphin precursor - pig PIR1:DFRTP beta-neoendorphin / dynorphin precursor - rat (fragment) PIR2:A18864 enkephalin-containing peptide E - guinea pig (fragment) PIR2:JL0067 Met-enkephalin-Arg-Phe - pig PIR2:JT0381 enkephalin precursor - pig (fragment) PIR2:A19448 enkephalin-containing protein - bovine PIR2:B35678 enkephalin precursor - mouse PIR2:A48541 proenkephalin - rat PIR2:A47589 enkephalin precursor - mouse (fragment) PIR2:A19145 dynorphin - pig PIR2:A41395 dynorphin precursor - rat PIR2:A60410 alpha-Neo-endorphin - guinea pig 1 superfamily found The user must be aware that the entries found in a particular superfamily may NOT be all the sequences of that type in the database; only some of the entries in PIR2 and few of the entries in PIR3 have been categorized by superfamily. 4-109 SFNUM _______________________________________________________ SFNUM SFNUM searches the index of superfamily numbers contained in the .CDX database file for all occurrences of a user-specified number or partial number. For each number found, the number, database, entry identifier, and title for each entry belonging to the superfamily are displayed. The list of these entries becomes the new current list. This command is an abbreviation for the full command: SEARCH/FIELD=supfam_num. _______________________________________________________ FORMAT SFNUM [superfamily-number] _______________________________________________________ PARAMETERS superfamily-number The superfamily-number parameter on the command line is the number which groups a set of entries into a superfamily. Each number entered must contain at least 3 characters; if the number contains less than three characters, precede the search string with "-" (dashes). If this parameter is omitted on the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Supfam_num: Note: Classification of sequences into superfamilies is part of the annotation and verification process; therefore, not all sequences have been assigned to superfamilies. 4-110 SFNUM _______________________________________________________ DESCRIPTION Superfamily number Definition A superfamily number denotes a classification described by five numbers although the first is used most commonly. Each number denotes the level of sequence similarity. The first number, referred to as the superfamily number, groups related entries. The second number, referred to as the family number, groups entries of a superfamily into families whose sequences are less that 50% different. The third number, referred to as the subfamily number, groups entries of a family into subfamilies whose sequences are less than 20% different. The fourth number, referred to as the entry number, groups entries of a subfamily into entries whose sequences are less than 10% different. The fifth number allows for unique subentry classification. Displaying the List of Superfamily numbers If no superfamily-number is typed in at the prompt and you press , a numerical listing of all superfamily numbers and associated entries will be displayed. If the /BRIEF modifier (listed below) is used only the superfamily numbers will be displayed. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SFNUM command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. 4-111 SFNUM Table_4-31__SFNUM_Command_Modifiers____________________ Alternate_Display_Formats__Modifier_Function___________ /BRIEF Display only index terms found /COUNTS Display number of entries containing search term /SHOW Display how search string is parsed _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file _______________________________________________________ Search_Manipulation________Modifier_Function___________ /ANCHOR Match terms beginning with character string /NOANCHOR (default) Match character string anywhere in term /ENTRY Match each character string ___________________________to_separate_terms___________ 4-112 SFNUM _______________________________________________________ TheMfollowing command line demonstrates the SFNUM command: 1 ATLAS> SFNUM -59.0 --59.0 1.0 0.0 0.0 0.0 PIR2:A54756 isocitrate dehydrogenase (NADP+) (EC 1.1.1.42), cytosolic - rat PIR2:S33859 isocitrate dehydrogenase (NADP+) (EC 1.1.1.42) - bovine PIR2:S51419 hypothetical protein L9470.12 - yeast (Saccharomyces cerevisiae) --59.0 1.0 1.0 1.0 1.0 PIR1:DCBYIS isocitrate dehydrogenase (NADP+) (EC 1.1.1.42) precursor, mitochondrial - yeast (Saccharomyces cerevisiae) --59.0 1.0 2.0 1.0 1.0 PIR2:A43294 isocitrate dehydrogenase (NADP+) (EC 1.1.1.42), mitochondrial - pig 3 supfam_nums found The user must be aware that the entries found in a particular superfamily number may NOT be all the sequences of that type in the database; only some of the entries in PIR2 and none of the entries in PIR3 have been categorized by superfamily number. 4-113 TAXONOMY _______________________________________________________ TAXONOMY The TAXONOMY command searches for entries on the basis of taxonomic classification. The list of entries found becomes the new current list. _______________________________________________________ restrictions This command has been implemented for VAX/VMS and OpenVMS sytems only! _______________________________________________________ FORMAT TAXONOMY taxonomic-class _______________________________________________________ PARAMETERS taxonomic-class The taxonomic-class parameter specifies the desired taxonomic classification. If you enter ?, the list of valid taxonomic classes and the classification scheme will appear. _______________________________________________________ DESCRIPTION The TAXONOMY command searches for entries on the basis of taxonomic classification. The list of entries found becomes the new current list. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the SELECT command are grouped according to their 4-114 TAXONOMY function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. Table_4-32__SELECT_Command_Modifiers___________________ Current_List_Processing____Modifier_Function___________ /CURRENT Process only entries on current list /ADD Add entries found to current list /SUBTRACT Remove entries found from current list /KEEP Do not alter current list with command execution /EQUALS=entry-identifier Selects sequences with the same taxonomic classification as the specified sequence /MAIN Searches only the main species of the entry; that is, the species of the sequence shown in the entry. Normally the TAXONOMY command searches the main species plus the species of other sequences mentioned in the text but not explicitly shown in the entry _______________________________________________________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to ___________________________disk_file___________________ 4-115 TAXONOMY _______________________________________________________ TAXONOMIC CLASSES Tables 4-33, 4-34, and 4-35 below list the valid taxonomic class codes. Table_4-33__Taxonomic_Class_Codes_(Viruses)____________ Code____Taxonomic_classification_______________________ VIRU Viruses BP . Bacterial viruses VP . Plant viruses VA______.__Animal_viruses______________________________ 4-116 TAXONOMY Table_4-34__Taxonomic_Class_Codes_(Prokaryotes)________ Code____Taxonomic_classification_______________________ PROK Prokaryotes MBA . Miscellaneous bacteria: Propionibacterium; Streptococcus; Thermus; theromophilic bacteria; Staphylococcus; Acinetobacter; Myxobacter CLO . Clostridia; Megasphaera; Peptococcus (nonphotosynthetic anaerobes) CHR . Chromatium; Chlorobium; other photosynthetic anaerobes EC . Escherichia coli; other Enterobacteriaceae; Vibrionaceae BC . Bacillus; Lactobacillus DES . Desulfovibrio; Desulfuromonas ARC . Thermoplasma; Halobacterium; Streptomyces; Micrococcus PS . Pseudomonas; Alcaligenes; Bordetella; Azotobacter; Mycobacterium RHS . Rhodospirillaceae (except Rhodospirillum tenue and Rhodopseudomonas gelatinosa); Paracoccus; Agrobacterium BG______.__Blue-green_algae____________________________ 4-117 TAXONOMY Table_4-35__Taxonomic_Class_Codes_(Eukaryotes)_________ Code____Taxonomic_classification_______________________ EUKA Eukaryotes PRO . Protists, including slime molds FUN . Fungi PLAN . Plants ALG . . Eukaryotic algae PLT . . Higher plants ANIM . Animals INVE . . Invertebrates ART . . . Arthropods INV . . . Other invertebrates VERT . . Vertebrates FISH . . . Fish OF . . . . Jawless fish; cartilaginous fish BF . . . . Bony fishes AM . . . Amphibians REPT . . . Reptiles SNK . . . . Squamata (snakes, etc.) REP . . . . Other reptiles BRD . . . Birds MAMM . . . Mammals MM . . . . Monotremes; marsupials ARTI . . . . Artiodactyls PG . . . . . Artiodactyls; pigs; hippopotamus BO . . . . . Artiodactyls; Bovidae (cattle, etc.) ARD . . . . . Other artiodactyls HO . . . . Perissodactyls WH . . . . Cetaceae RBT . . . . Lagomorphs RAT . . . . Rodents DOG . . . . Carnivores; pinnipeds PRIM . . . . Primates MKY . . . . . Monkeys HU . . . . . Human; great apes MAM . . . . Other mammals, including primitive primates VER18___.__.__.__Other_vertebrates_____________________ TAXONOMY Figure 4-1 Taxonomic Classification Scheme _______________________________________________________ Viruses Prokaryotes +----Eukaryotes---+ ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Fungi ! Animals +--+--+ +---+---+---+--+--+---+--+---+--+ Protists ! Plants ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! BP VP VA MBA CLO CHR EC BC DES ARC PS RHS BG PRO FUN ALG PLT ! ! +---------------------------------+------------------------------+ ! ! ! ! Invertebrates +--------+------Vertebrates--+-------+---------+ ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Amphibians ! Birds Mammals Other ! ! Fish ! Reptiles ! ! Vertebrates ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ART INV OF BF AM SNK REP BRD ! VER ! ! +--------+--------+----+----+----+----+-------+--------+ ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Artiodactyls ! ! ! ! ! Primates ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ____________MM___PG__BO_ARD___HO___WH__RBT__RAT__DOG___MKY HU MAM 4-119 TYPE _______________________________________________________ TYPE The TYPE command displays all, or selected categories, of the information contained in an entry. This is the basic command for displaying a sequence and/or the annotation associated with the sequence. _______________________________________________________ FORMAT TYPE [entry-identifier] _______________________________________________________ PARAMETERS entry-identifier If the entry-identifier parameter is omitted from the command line, you will be prompted for it with the prompt: _______________________________________________________ prompts Code: If you press at the code prompt, the program processes the entries on the current list. If there are no entries on the current list, pressing the return key terminates the TYPE command. _______________________________________________________ DESCRIPTION The display of an entry can be separated into four components: 1 header - entry code and entry title 2 text - annotation associated with the entry 3 composition - composition of the sequence 4-120 TYPE 4 sequence - sequence Displaying a Complete Entry Unmodified, the TYPE command displays all of the information in an entry in the order listed above. The /OLD modifier also displays all of the information, but in the order: header, composition, sequence, and text. Displaying Selected Components The header component of an entry is always displayed. Any combination of the three remaining components can be selected for display with the modifiers: /TEXT, /COMPOSITION, and /SEQUENCE. (See Appendix D for a description of the entry format). Displaying the Sequence with Three-letter Amino Acid Abbreviations An alternate display format is available for the sequence portion of a protein entry, which may be displayed with the one-letter (default) or three- letter amino acid abbreviations. Displaying Selected Text Lines Selected lines of text can be displayed, but first a display layout must be defined with the DEFINE command (see the description of the DEFINE command). The SHOW/DISPLAY command shows all the symbols that have been defined and the display layouts they represent. Computing Total Sequence Composition Each time the TYPE command is executed with the composition component indicated either implicitly, no information category modifier, or explicitly, using the /COMPOSITION modifier, the total composition of all the protein sequences displayed is computed and stored. The SHOW/TOTALS command shows this stored total composition table. The SHOW/TOTALS command can 4-121 TYPE be executed at any time after the total composition table is generated by the TYPE command but before the TYPE command is executed again because it will generate a new total composition table. Printing Entries in PostScript Format (VMS and OpendVMS only) Protein entries can be printed on a PostScript printer using the /POSTSCRIPT modifier (VMS and OpenVMS only). To use this modifier, define a symbol called POSTSCRIPT_QUEUE to be xxx. The files will be sent to queue xxx. The command creates a file in your home directory called PROTEIN.PS for the entry specified (or for each entry on the current list when /CURRENT is also used). The command then sends the file to the defined queue and then deletes it. If you are using /CURRENT and there are 500 entries on the current list, 500 files will be created and printed, each file will have the same name but a different version number. The entries are sent to the printer immediately; you do not need to terminate the program to have the entries printed. The entry must be a protein sequence entry and the entry must be in the NBRF format. The /POSTSCRIPT modifier will work with these other modifiers: /CURRENT, /SEQUENCE, /TEXT (but not /TEXT=display_ layout), /COMPOSITION, /OLD, and /THREE_LETTER. _______________________________________________________ MODIFIERS In the following table, modifiers accepted by the TYPE command are grouped according to their function. Multiple modifiers can be used together; a slash (/) precedes each modifier with no spaces before or after. A modifier designated as a default does not need to be typed in to be used. 4-122 TYPE Table_4-36__TYPE_Command_Modifiers_____________________ Alternate_Display_Formats__Modifier_Function___________ /EXCHANGE Display entry in CODATA format /FULL_NUMBERING Display numbering for each line of sequence /ONE_LETTER (default) Display one-letter amino acid abbreviations /THREE_LETTER Display three-letter amino acid abbreviations /OLD Display the entry in this order: header, composition, sequence, and text _______________________________________________________ Current_List_Processing____Modifier_Function___________ /CURRENT[1] Process only entries on current list _______________________________________________________ Information Category Selection__________________Modifier_Function___________ /COMPOSITION Display the composition of the sequence /COMPOSITION=QUIET Compute the composition but suppress its display /SEQUENCE Display sequence portion of entry /TEXT Display text portion of entry /TEXT=display-layout Display only selected text lines of entry _______________________________________________________ [1]When /CURRENT is specified, the entry-identifier parameter on the command line should be omitted. If present, it is ignored 4-123 TYPE Table_4-36_(Cont.)__TYPE_Command_Modifiers_____________ Redirecting_Screen_Output__Modifier_Function___________ /PRINTER Direct screen output to printer /PRINTER=file-spec Direct screen output to disk file /POSTSCRIPT Create-print-delete a PostScript output file for ___________________________a_protein_entry_____________ _______________________________________________________ TheMfollowing example demonstrates the use of the TYPE command: 1 ATLAS> TYPE/THREE_LETTER FEME PIR1:FEME ferredoxin 2[4Fe-4S] - Megasphaera elsdenii Species: Megasphaera elsdenii Date: #sequence_revision 13-Jul-1981 #text_change 03-Mar-1995 Accession: A00203 Azari, P.; Glantz, M.; Tsunoda, J.; Yasunobu, K.T. unpublished results, cited by Yasunobu, K.T., and Tanaka, M., Syst. Zool. 22, 570-589, 1973 Reference number: A00203 Accession: A00203 Molecule type: protein Residues: 1-54 Superfamily: ferredoxin 2[4Fe-4S]; ferredoxin 2[4Fe-4S] homology Keywords: 4Fe-4S; electron transfer; iron-sulfur protein 4-124 TYPE Residues Feature 1-54 Domain: ferredoxin 2[4Fe-4S] homology 8,11,14,46 Binding site: 4Fe-4S cluster (Cys) (covalent) #status predicted 18,36,39,42 Binding site: 4Fe-4S cluster (Cys) (covalent) #status predicted Composition 7 Ala A 0 Gln Q 0 Leu L 4 Ser S 0 Arg R 6 Glu E 2 Lys K 5 Thr T 0 Asn N 5 Gly G 1 Met M 0 Trp W 3 Asp D 1 His H 0 Phe F 1 Tyr Y 8 Cys C 4 Ile I 2 Pro P 5 Val V Mol. wt. unmod. chain = 5,430 Number of residues = 54 5 10 15 1 Met His Val Ile Ser Asp Glu Cys Val Lys Cys Gly Ala Cys Ala 16 Ser Thr Cys Pro Thr Gly Ala Ile Glu Glu Gly Glu Thr Lys Tyr 31 Val Val Thr Asp Ser Cys Ile Asp Cys Gly Ala Cys Glu Ala Val 46 Cys Pro Thr Gly Ala Ile Ser Ala Glu The TYPE command was used to display the Megasphaera elsdenii ferredoxin entry (identification code FEME) in the three-letter amino acid code. 2 ATLAS> TYPE/CURRENT/COMPOSITION=QUIET PIR1:LUHU annexin I - human PIR1:LUHU36 annexin II - human PIR1:LUHU3 annexin III - human 4-125 TYPE 3 ATLAS> SHOW/TOTALS Composition: 3 protein sequences, 1008 residues 78 Ala A 37 Gln Q 103 Leu L 69 Ser S 59 Arg R 70 Glu E 90 Lys K 60 Thr T 30 Asn N 61 Gly G 24 Met M 4 Trp W 86 Asp D 16 His H 29 Phe F 42 Tyr Y 11 Cys C 68 Ile I 21 Pro P 50 Val V The TYPE command was used to generate the stored total composition table for the three sequences on the current list. The /COMPOSITION=QUIET modifier was used to produce the minimum display for the TYPE command. After the table was generated by the TYPE command, the SHOW/TOTALS command was used to display it. 4-126 _______________________________________________________ Part III The FASTA Database Searching Program This part of the manual contains a description of the FASTA program. As a convenience to the end-user, version 1.6c2 of the FASTA program package was supplied to NBRF for inclusion on the ATLAS CD-ROM by: William R. Pearson Department of Biochemistry Box 440, Jordan Medical Education Building University of Virginia Charlottesville, VA 22908 COPYRIGHT NOTICE The FASTA package is copyrighted 1988, 1991, and 1992 by William R. Pearson and the University of Virginia. All rights reserved. The programs and documentation may not be sold or incorporated into a commercial product, in whole or in part, without written consent of William R. Pearson and the University of Virginia. For additional information, please contact William R. Wilkerson, Assistant Provost for Research, University of Virginia, P.O. Box 9025, Charlottesville, VA 22906- 9025. _______________________________________________________ 5 Overview of the FASTA Database Searching Program This summary of the FASTA program was derived from the following references which offer a complete description of the FASTA program: W.R. Pearson (1990) ``Rapid and Sensitive Sequence Comparison with FASTP and FASTA'', Methods in Enzymology 193:63-98 W.R. Pearson and D.J. Lipman (1988) ``Improved Tools for Biological Sequence Analysis'', PNAS 85:2444-2448 D.J. Lipman and W.R. Pearson (1985) ``Rapid and Sensitive Protein Similarity Searches'', Science 22:1435-1441 FASTA.DOC Release 1.6. The original document supplied with the FASTA program package and provided on this CD-ROM. __________________________________________________________________ 5.1 Introduction FASTA is a universal sequence comparison program that defaults to protein comparisons unless the query sequence is > 85% A+C+G+T, whereupon a DNA sequence is assumed. It compares the query sequence in the first file with all the sequences (there need only be one) in the second or library file, reporting the one best similarity score and alignment for each pairwise comparison. The score is not affected by poorly aligned portions of the sequence outside the best region. 5-1 Overview of the FASTA Database Searching Program Terms to know DIAGONAL - The term diagonal refers to the diagonal line that is seen on a dot matrix plot when a sequence is compared with itself and denotes an alignment between two sequences without gaps. ktup - The ktup parameter determines how many consecutive identities in a match. For proteins a ktup of 1 or 2 is used; ktup=1 is more sensitive but is about 3 times slower than ktup=2. DNA sequences can be searched with a ktup value of 1 to 6. PAM250 MATRIX - The PAM250 matrix was derived from the analysis of the amino acid replacements occurring among related proteins, and it specifies a range of positive scores for replacements that commonly occur among related proteins and negative scores for unlikely replacements. __________________________________________________________________ 5.2 Steps FASTA Uses in a Comparison FASTA uses four steps in calculating similarity between two sequences; the results are given with these three scores: o initn - The initial score, which may include several similar regions, is used to rank the sequences. o init1 - The initial score from the best initial region. o opt - The optimized score allowing gaps in a band 32 residues wide. 5-2 Overview of the FASTA Database Searching Program Step 1. Find the ten best local regions of identities In the first step, the 10 best diagonal regions are found using a simple formula based on the number of ktup matches and the distance between the matches without considering shorter runs of identities, conservative replacements, insertions, or deletions. These alignments are scored by incrementing for every matching pair and decrementing for every mismatching pair. The ktup parameter determines how many consecutive identities are required in a match. FASTA saves the ten best local regions, regardless of whether they are on the same or different diagonals. Step 2. Rescore the ten regions with a scoring matrix After the ten best local regions are found in the first step, they are rescored using a scoring matrix that allows runs of identities shorter than ktup residues and conservative replacements to contribute to the similarity score. The ends of the region are trimmed to include only those residues contributing to the highest score. Each region is a partial alignment without gaps. For each of the best diagonal regions rescanned with the scoring matrix, a subregion with the maximal score is identified. The highest-scoring subregion is reported as the init1 score. Step 3. Join initial regions to form an approximate alignment The third ``joining'' step increases the sensitivity of the search method because it allows for insertions and deletions as well as conservative replacements. The modification does, however, decrease selectivity. The degradation of selectivity is limited by including in the optimization step only those initial regions whose scores are above a threshold. 5-3 Overview of the FASTA Database Searching Program The program checks to see whether several initial regions can be joined together in a single alignment to increase the initial score. Given the locations of the initial regions, their respective scores, and a joining penalty (usually 20) that is analogous to a gap penalty, FASTA calculates an optimal alignment of initial regions as a combination of compatible regions with a maximal score. FASTA uses the resulting score, referred to as the initn score, to rank the library sequences. Step 4. Construct an optimal alignment of the best initial region In this final step of the comparison, an optimal alignment of the query sequence and the library sequence is constructed, considering only those residues that lie in a band 32 residues wide centered on the best initial region found in Step 2. The optimization employs the same scoring matrix used in determining the initial regions. FASTA reports this score as the optimized opt score. Because FASTA calculates an initial similarity score based on an optimization of initial regions during the library search, the initial score is close to the optimized score for many sequences. 5-4 _______________________________________________________ 6 Running the FASTA Program The program requires the name of the query sequence file, a library file, and the ktup parameter. The program can accept arguments on the command line or it will prompt for the file names and ktup value. __________________________________________________________________ 6.1 FASTA Options To start the program, the command line has the form: FASTA options where only the program name FASTA is required. It is possible to specify several options on the command line preceded by a dash; the following options are available: -a on output, show the complete sequence instead of just the overlap of the two aligned sequences. -b # number of sequence scores to be shown on output -c # threshold to be used for optimization in a band around the best initial region. Normally this OPTCUT value is calculated from the length of the sequence and the ktup value (for a 200 residue sequence, it is about 28). Set "-c 1" and "-o" to optimize every sequence in a database. This is the most sensitive option, but slows the program down about 5-fold. -d # number of alignments to be reported by default. (Used in conjunction with -Q). -f identical match score from scoring matrix in the scan for initial regions. (default for protein) (PAMFACT=1) 6-1 Running the FASTA Program -g # Threshold for joining init1 segments to build an initn score (GAPCUT). -k use constant score in scan for initial regions (like old fastp, fastn, default for DNA) (PAMFACT=0) -l file location of library menu file (FASTLIBS) -m # MARKX = # (0, 1, 2) highlight differences between two aligned sequences in one of three ways: -m 0 (default) -m 1 -m 2 MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT ::..:: ::: xx X ..KS..Y... MWKSCGYPYT MWKSCGYPYT : identities x conservative only differences . conservative replacements are shown replacements X nonconservative replacements -n Force the query sequence to be treated as a DNA sequence. This is particularly useful for query sequences that contain a large number of ambiguous residues, e.g. transcription factor binding sites. -o optimize all scores greater than OPTCUT. If '-c' is not specified, OPTCUT will be calculated from the length of the sequence and the ktup setting. -Q quiet - does not prompt for any input. Writes scores and alignments to the terminal or standard output file. -r file save a results summary line for every sequence in the sequence library. The summary line includes the sequence identifier, superfamily number (if available) position in the library, and the similarity scores calculated. This option can be used to evaluate the sensitivity and selectivity of different search strategies (see W. R. Pearson (1991) Genomics 11:635- 650.) 6-2 Running the FASTA Program -s file SMATRIX is read from file. Several SMATRIX files are provided with the standard distribution. For protein sequences: codaa.mat - based on minimum mutation matrix; idnaa.mat - identity matrix; idpaa.mat - identity matrix for mismatches, but identical matches weighted according to the PAM250 matrix; pam250.mat - the PAM250 matrix developed by Dayhoff et al (Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 1978); pam120.mat - a PAM120 matrix. The SMATRIX also specifies the penalties for the first residue in a gap and additional residues in a gap; FASTA, the other alignment programs, and the SMATRIX files use -12 and -4. Currently, to change the -12, -4 gap penalties, the SMATRIX file must be edited. -v (LINEVAL) values used for line styles in plfasta -w # line length (width) = number (<200) -x specifies offsets for the beginning of the query and library sequence. For example, if you are comparing upstream regions for two genes, and the first sequence contains 500 nt of upstream sequence while the second contains 300 nt of upstream sequence, you might try: fasta -x "-500 -300" seq1.nt seq2.nt If the -x option is not used, FASTA assumes numbering starts with 1. This option will not work properly with the translated library sequence with tfasta. (You should double check to be certain the negative numbering works properly.) -1 sort output by init1 score. -3 (TFASTA only) translate only three forward frames For example: fasta -w 80 -a seq1.aa seq2.aa would compare the sequence in seq1.aa to that in seq2.aa and display the results with 80 residues 6-3 Running the FASTA Program on an output line, showing all of the residues in both sequences. Be sure to enter the options before entering the file names, or just enter the options on the command line, and the program will prompt for the file names. The file-name parameter is the name of the file containing the sequence to be used in the search. If the sequence of interest is in a database, use the COPY command of the ATLAS program to copy it into an external file prior to running the FASTA program. The library file parameter is the name of the database file to be searched or the appropriate letter from the displayed list of databases. Multiple databases may be searched by typing a percent sign followed by the letters of the desired databases. The ktup parameter determines how many consecutive identities in a match. For proteins a ktup of 1 or 2 is used; ktup=1 is more sensitive but is about 3 times slower than ktup=2. DNA sequences can be searched with a ktup value of 1 to 6. The title and the number of residues in the query sequence as well as the data library selected to be searched are displayed and the program begins the search procedure. When the search has been completed, a histogram of similarity scores will be displayed. Immediately after the histogram, the following statistics are displayed: o The number of residues and the number of sequences with which the test sequence was compared. o The mean and standard deviation (in parentheses) of the initn and init1 similarity scores obtained in the search. o The number of highest-scoring sequences saved as a result of the search. 6-4 Running the FASTA Program The program optionally writes output to a file created in your directory. You will be prompted by: Enter filename for results: Press the key for no output file. The output contains the histogram and statistics in addition to the information generated at the request of the user in response to the next few prompts. You are next prompted by: How many scores would you like to see? [20] Respond with the number of top-scoring sequences you would like to see. If you press the return key, the value of 20 is assumed. If the number entered is larger than the number of sequences saved, only the sequences saved will be shown. Note: If no sequences are saved, there will be no output. The program lists the titles of the top-scoring sequences, the initn, init1, and opt scores. These results are also copied to the output file. You are then prompted by: More scores? [0] Respond as above. If the return key is pressed, no more sequences are shown and you will be prompted by: Display alignments also? If you respond with yes (Y), the program continues and you are prompted by: Number of alignments [20]? 6-5 Running the FASTA Program Respond with the number of top-scoring alignments you wish to have displayed. The alignments are not displayed on the terminal, but are written to the output file, and then the program terminates. __________________________________________________________________ 6.2 The Output In the output, the region of alignment that was first selected and on which the inti1 similarity score is based is enclosed between the X characters. Identities are marked by double dots and conservative replacements are marked by single dots within the region corresponding to the optimized alignment. 6-6 _______________________________________________________ A Commands and Command Modifiers of ATLAS ACCESSION accession-num - search accession index for accession number /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR (default) - match terms beginning with character string /NOANCHOR - match character string anywhere in term /ENTRY - match each character string to separate terms AUTHOR author-name - search author name index /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR (default) - match terms beginning with character string /NOANCHOR - match character string anywhere in term /ENTRY - match each character string to separate terms BASES database-list - define list of active databases /BRIEF - display list without active database indicators /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-1 Commands and Command Modifiers of ATLAS COPY code - copy entry to an external file /CURRENT - process current list entries /TEXT - copy only text of entry /SEQUENCE - copy sequence only /OUTPUT=file-spec - specify output file name CROSS cross_reference - search the cross_reference index /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms DEFINE symbol-definition - define database-list or command abbreviations /BASES (default) - define abbreviations for database-list /COMMAND - define abbreviations for commands /DISPLAY - define symbol for display layout EXTRACT - construct and display a modified sequence /FULL_NUMBERING - display numbering for each line of sequence /ONE_LETTER (default) - display one-letter amino acid abbreviations /THREE_LETTER - display three-letter amino acid abbreviations /CURRENT - process only entries on current list /TABLE - the Specification is read from the entry /OUTPUT - output constructed sequence to a file /OUTPUT=file-spec - output to the indicated file /EXTRACT - output without verification /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-2 Commands and Command Modifiers of ATLAS FEATURE feature-name - search the feature table index file /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms FIND string-expression - search title index (also includes CONTAINS: field and ALTERNATE NAMES: field for PIR entries) /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms /MAIN - search titles only NOT alternate names or contains fields /TEXT - perform string search on titles A-3 Commands and Command Modifiers of ATLAS GENE gene-name - search gene_name index /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms GET file-name - get current list from an external file /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /USER - entry-codes are specified by user HELP - obtain help from program /BRIEF - list commands and modifiers /PRINTER(=file-spec) - direct screen output to printer (or disk file) JOURNAL - list literature and the number of citations /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms A-4 Commands and Command Modifiers of ATLAS KEYWORD keyword - search the keyword index file /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms LIST - list current list titles /ALL - list titles of all entries in active databases /CURRENT (default) - process current list entries /OUTPUT=file-spec - copy current list codes to an external file /RESTORE - restore previous current list /SET - active databases become current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) MATCH - search for protein segments allowing mismatches /CURRENT - restrict search to current list /DEFINE - affect the resultant current list /DEFINE/ADD - add entries found to current list /DEFINE/SUBTRACT - remove entries found from the current list /FULL - report "No matches found" message /PEPTIDE=peptide - specify the peptide /MISMATCHES=number - specify maximum number of mismatches allowed /SHOW - show positions in peptide and matching residues /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-5 Commands and Command Modifiers of ATLAS MEMBERS entry-code - display list of alignments containing code /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms PRINT file-spec - display contents of an external file QUIT - quit the ATLAS program REFERENCE reference - search reference index /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR (default) - match terms beginning with character string /NOANCHOR - match character string anywhere in term /ENTRY - match each character string to separate terms Report - tabulate ancillary data /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-6 Commands and Command Modifiers of ATLAS SCAN peptide - rapid sequence search for identically matching segments /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /TITLE - display only titles of entries found /SEGMENT - display only segments of entries found /PRINTER(=file-spec) - direct screen output to printer (or disk file) SEARCH/FIELD=field-name - search specified index(es) /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /PIR - restrict search to PIR1, PIR2, and PIR3 databases /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms /MAIN (for /FIELD=title) - search titles only NOT alternate names or contains fields /TEXT (for /FIELD=title) - perform string search on titles SELECT - select entries on basis of ancillary information /BRIEF - suppresses warning messages /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /EQUALS=entry-identifier - searches for value equal to specified sequence /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-7 Commands and Command Modifiers of ATLAS SET - set program operation parameters /CLOCK - set internal timer /MENU - return from command mode to menu mode (PC only) /SWITCH=switch character - change modifier character / to specified character /NOWRAP - turns line wrap off (affects /printer=file-spec output) /WRAP - turns line wrap on (affects /printer=file-spec output) SHOW - show information about current program operation /COMMANDS - show command definitions for each field /DATABASE - show header information for active databases /DISPLAYS - show symbols defined for display layouts /FIELDS (default) - show indexed fields for each database /INFORMATION - show ATLAS program header /PRINTER(=file-spec) - direct screen output to printer (or disk file) /TIME - show time /TOTALS - display composition table generated by TYPE command SPECIES species-name - search the species index file /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms A-8 Commands and Command Modifiers of ATLAS SUPERFAMILY superfamily - search superfamily classifications /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms SFNUM superfamily number - search superfamily numbers /BRIEF - display brief output /COUNTS - display number of entries containing search term /SHOW - display how search string is parsed /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /PRINTER(=file-spec) - direct screen output to printer (or disk file) /ANCHOR - match terms beginning with character string /NOANCHOR (default) - match character string anywhere in term /ENTRY - match each character string to separate terms TAXONOMY - select entries on basis of taxonomic classification /CURRENT - restrict search to current list /ADD - add entries found to current list /SUBTRACT - subtract entries found from current list /KEEP - do not modify current list /EQUALS=entry-identifier - searches for value equal to specified sequence /MAIN - searches only species of sequence shown in entry /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-9 Commands and Command Modifiers of ATLAS TYPE code - display sequence entry /COMPOSITION - display composition only /COMPOSITION=quiet - compute composition but suppress its display /CURRENT - process current list entries /EXCHANGE - display entry in CODATA format /FULL_NUMBERING - display numbering for each line of sequence /OLD - display an entry in old order /ONE_LETTER (default) - display sequence in one-letter code /SEQUENCE - display sequence only /TEXT - display only text of entry /TEXT=display-layout - display only selected text lines of entry /THREE_LETTER - display sequence in three-letter code /PRINTER(=file-spec) - direct screen output to printer (or disk file) A-10 _______________________________________________________ B One- and Three-letter Amino Acid Abbreviations Table B-1 One- and Three-letter Amino Acid ___________Abbreviations_______________________________ A Ala Alanine C Cys Cysteine D Asp Aspartic acid E Glu Glutamic acid F Phe Phenylalanine G Gly Glycine H His Histidine I Ile Isoleucine K Lys Lysine L Leu Leucine M Met Methionine N Asn Asparagine P Pro Proline Q Gln Glutamine B-1 One- and Three-letter Amino Acid Abbreviations Table B-1 (Cont.) One- and Three-letter Amino Acid ___________________Abbreviations_______________________ R Arg Arginine S Ser Serine T Thr Threonine V Val Valine W Trp Tryptophan Y Tyr Tyrosine B Asx Asp or Asn, not distinguished Z Glx Glu or Gln, not distinguished X_____X_________Undetermined_or_atypical_amino_acid____ These abbreviations conform to those suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, J. Biol. Chem. 243, 3557-3559, 1968. B-2 _______________________________________________________ C Punctuation in Protein Sequences No punctuation between two adjacent amino acids indicates that they are connected as determined experimentally () Encloses a region, the composition but not the complete sequence of which has been determined experimentally, or encloses a single residue that has been tentatively identified = Indicates )(, the juxtaposition of two regions of indeterminate sequence, while preserving proper spacing between amino acids / Indicates that the adjacent amino acids are from different peptides, not necessarily connected. When the amino end of a protein has not been determined, / precedes the first residue. When the carboxyl end has not been determined, / follows the last residue. When )/, /(, or )/( are needed, only / is used . Outside of parentheses, indicates the ends of sequenced fragments. The relative order of these fragments was not determined experimentally but is clear from homology or other indirect evidence . Within parentheses, indicates that the amino acid to its left has been placed with at least 90% confidence by homology with known sequences C-1 Punctuation in Protein Sequences , Indicates that the amino acid to its left could not be positioned with confidence by homology C-2 _______________________________________________________ D PIR-International Protein Sequence Database Entry Format Each entry consists of a variable number of consecutive records. The primary information contained in these lines is divided into four sections. The sections are listed below in the order in which they occur in the entry. 1 HEADER (exactly 1 record) Information that marks the line as the first line of an entry and that identifies the sequence contained in the entry. It contains the entry type specifier and the entry- code. 2 TITLE (exactly 1 record) The protein name and one or more species names. 3 SEQUENCE (variable number of records) The amino acid sequence ending with an asterisk. 4 TEXT (variable number of records) Reference citations and text information. To be accessed by the NBRF programs, external sequence files must conform to the same format as the database entries. Only the HEADER, TITLE, and SEQUENCE lines are necessary. Any number of entries may be contained in a single external file provided they all have different identification codes. D-1 PIR-International Protein Sequence Database Entry Format __________________________________________________________________ D.1 HEADER The first line of each entry is known as the HEADER line. The first character on this line is a right arrowhead (>) which indicates the beginning of an entry; it is followed by a two-character entry type, `P1' for a complete protein or `F1' for a fragment. The fourth character on the line is a semicolon; this character separates the protein type from the entry- code that immediately follows. An entry-code can be any combination of four to six alphanumeric characters containing no internal blanks. Header Line Format --------------------- Field Length Contents of field ----- ------ ----------------- `>' 1 marks the line as an entry header TYPE 2 type of sequence in the entry: `P1' protein, complete `F1' protein, fragment `;' 1 field separator CODE 4 to 6 unique retrieval key we have assigned to the entry The CODE contains four to six alphanumeric characters. Header line examples: -------------------- >P1;AZBR >P1;DENCEN >F1;XNECV >P1;LQBP37 __________________________________________________________________ D.2 TITLE The second line of the entry contains the entry title, which consists of a protein name and a species name separated by a dash flanked by two blank spaces (` - '). The species name, which lists the biological D-2 PIR-International Protein Sequence Database Entry Format source from which the protein was obtained, is always on the right of the dash. Title Line Format -------------------- Field Length Contents of field ----- ------ ----------------- SEQNAM variable name of the protein ` - ' 3 field separator ORGNAM variable name of biological source of the protein If the organism or organelle translates nucleic acid to protein by a special genetic code, this fact is noted in the ORGNAM field by the symbol `(SGCn)', where n denotes the special genetic code. Title line examples: -------------------- azurin precursor - Pseudomonas aeruginosa DNA ligase (ATP) (EC 6.5.1.1) - phage T7 cytochrome c - castor bean thyroliberin - pig hemoglobin alpha chain - eastern gray kangaroo epidermal growth factor precursor - human __________________________________________________________________ D.3 SEQUENCE The sequence section of an entry consists of a variable number of records. Each sequence record may be up to 500 characters long. The characters represent the amino acid sequence stored from the amino end to the carboxyl end. The symbols used for the amino acids are the one-letter abbreviations shown in Appendix B. The amino acid sequence is terminated by an asterisk, which is the last character on the last line of the sequence section. In addition, the sequence may contain punctuation symbols to indicate various degrees of reliability of the data (see Appendix C). One punctuation symbol may precede any amino acid symbol or the terminating asterisk. The sequence lines contain no blank characters (if space D-3 PIR-International Protein Sequence Database Entry Format characters are used to separate amino acid symbols, they are ignored by the NBRF programs when they occur within sequences). Sequence line examples: ---------------------- GDVE(G.K.G.I.F=T,M,C.S.Q,C.H.V,E.K.G.G.K.H) FTGPNLHGLFGRK.TGQAVGYSYTAANK.NK.GIIWGDDTLM EYLENPK.RYIPGTK.MVFTGLSK.YRE RTNLIAYLK.EK.TAA* __________________________________________________________________ D.4 TEXT The TEXT section contains annotation information such as species, reference citations, genetic information, superfamily classification and search keywords. The format of the TEXT section consists of a variable number of records. Each TEXT record may be up to 500 characters long. Each TEXT record, with the exception of the citation record, begins with a Tag that indicates the type of information contained on that record. Certain Tags mark the beginning of a block of data. The types of Tags or data items are listed below with examples. Required or optional notations with regard to tags refer to entry standards for PIR- International. A more compact format specification is given in Example D-1. ___________________________ D.4.1 Alternate names specification (optional) An alternate name specification consists of a single text record that is identified by having 'N;Alternate names:' Tag as the first 18 characters. The remainder of the line contains a list of other common names for the protein. Alternate names are separated by semicolons. D-4 PIR-International Protein Sequence Database Entry Format Alternate name line examples: ---------------------- N;Alternate names: soluble cytochrome f; cytochrome c553 N;Alternate names: cusacyanin; plantacyanin N;Alternate names: component II; nitrogenase reductase ___________________________ D.4.2 Contains line (optional) A contains line consists of a single text line that is identified by having 'N;Contains:' as the first 11 characters. This record specifies other activities or functions that are included in the sequence shown in the entry. Multiple "contains" titles are separated by semicolons. Contains line examples: ---------------------- N;Contains: cytochrome b2 core N;Contains: ribonuclease (EC 3.1.-.-) activity N;Contains: Arg-vasopressin; neurophysin 2 N;Contains: intestinal peptide PHM-27 ___________________________ D.4.3 Species line (required) The Species line is a single record identified by 'C;Species:' as the first 10 characters. This record describes the source or organism from which the sequence is derived. Each entry contains this record and it may mark the beginning of a Species data block. D-5 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.4 Species Note record (optional; contained in Species block) The Species Note line is a single record identified by 'A;Note:' as the first 7 characters. This record describes special Species information and is contained in the Species block. All information in the 'C;Host:' records from previous versions of the database has been transferred to this record. Examples of maximal Species block: --------------------------------- C;Species: vaccinia virus A;Note: host Homo sapiens (man) C;Species: equine herpesvirus 1, equine abortion virus A;Note: host Equus caballus (domestic horse) ___________________________ D.4.5 Date record (required) The Date line is a single record identified by 'C;Date:' as the first 7 characters. This record indicates the date the entry was added to the data set, the date the SEQUENCE was modified and the date the TEXT was modified. Each of the dates (add, seq, text) is optional but at least one must appear. Examples: C;Date: 31-Jul-1979 sequence_revision 30-Sep-1992 text_change 14-Oct-1993 C;Date: 31-May-1979 sequence_revision 25-Feb-1985 text_change 14-Oct-1993 C;Date: text_change 30-Jun-1993 C;Date: sequence_revision 30-Sep-1991 text_change 02-Dec-1993 ___________________________ D.4.6 Entry-specific Accession record (required) The Entry-specific Accession line is a single record identified by 'C;Accession:' as the first 12 characters. This record indicates a list of Accession numbers associated with the entry. D-6 PIR-International Protein Sequence Database Entry Format Examples: C;Accession: A92196; A92218; A00169 C;Accession: A90383; A92053; B93774; A92231; A00170 C;Accession: JS0745; A00172; S07960 ___________________________ D.4.7 Author/citation records (required; start of Reference block) The Author line is a single record identified by 'R;' as the first 2 characters. This record indicates a list of authors, separated by semicolons, associated with the Reference block. Immediately following is the citation record that specifies the source of the Reference. This pair of records is required and begins the Reference block. The Reference block may be repeated in the TEXT section. Examples: R;Aigle, M.; Biteau, N.; Crouzet, M. submitted to the Protein Sequence Database, March 1992 R;Cossart, P.; Katinka, M.; Yaniv, M. Nucleic Acids Res. 9, 339-347, 1981 R;Skala, J.; Purnelle, B.; Goffeau, A. Yeast 8, 409-417, 1992 ___________________________ D.4.8 Authors record (optional; repeating; contained in Reference block) The Authors line is an optionally repeating record identified by 'A;Authors:' as the first 10 characters. This record is used as a supplement to the Author list in the 'R;' record. If the 'R;' record exceeds the maximum record length of 500 characters then additional authors are listed on the Authors line. D-7 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.9 Reference Title record (optional; contained in Reference block) The Reference Title line is a single record identified by 'A;Title:' as the first 8 characters or 'A;Description:' as the first 14 characters. This record is the publication title (A;Title) or description of a sequence submission (A;Description:). Either Tag may be present but not both. Examples: A;Title: Amino acid sequence of ragweed allergen Ra3. A;Description: The amino acid sequence of a type I copper protein with an unusual serine- and hydroxyproline-rich C-terminal domain isolated from cucumber peelings. ___________________________ D.4.10 Reference number record (required; contained in Reference block) The Reference number line is a single record identified by 'A;Reference number:' as the first 19 characters. This record contains standardized information relating a reference with a six character alphanumeric string (ref_num). Optionally a Medline reference number may appear in the record identified by the 'MUID:' Tag and separated form the ref_num by a semicolon. Examples: A;Reference number: A94561 A;Reference number: A00100; MUID:82075747 ___________________________ D.4.11 Contents record (optional; contained in Reference block) The Contents line is a single record identified by 'A;Contents:' as the first 11 characters. This record specifies the source of the protein (species and/or strain), the portion of the sequence reported, the method of sequence determination, or the extent of experimental detail reported. The record may indicate that the D-8 PIR-International Protein Sequence Database Entry Format reference is included as a source of ancillary information such as X-ray crystallography or active site identification. Examples: A;Contents: ATCC 16455 A;Contents: annotation; methylation A;Contents: X-ray crystallography, 2.8 angstroms A;Contents: Strain BALB/c ___________________________ D.4.12 Reference Note record (optional; repeating; contained in Reference block) The Reference Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of a Reference block. This record describes reference specific comments. Examples: A;Note: This is the final paper in a series. A;Note: The nucleotide sequence is not given in this paper. ___________________________ D.4.13 Reference-specific Accession records (optional; start of Accession block; contained in Reference block) The Reference-specific Accession line is a single record identified by 'A;Accession:' as the first 12 characters. This record indicates a single unique six character alphanumeric string (acc_ num) associated with the shown sequence according to the sequence specification in the Residues record described below. These unique numbers specifiy a unique sequence. The presence of this Accession record implies the start on an Accession block that may be repeated beneath the Reference block; the Accession block(s) is contained in the Reference block. D-9 PIR-International Protein Sequence Database Entry Format Examples: A;Accession: A00086 A;Accession: JT0008 A;Accession: S13939 ___________________________ D.4.14 Accession Status record (optional; contained in Accession block) The Accession-specific Status line is a single record identified by 'A;Status:' as the first 9 characters. This record indicates the review status of the sequence refered to by the acc_num. Currently "preliminary" is the only value for this information. Example: A;Status: preliminary ___________________________ D.4.15 Molecule type record (required if Accession present; contained in Accession block) The molecule type line is a single record identified by 'A;Molecule type:' as the first 16 characters. This record indicates the type of molecule from which the sequence was determined. Valid values for this data item are: "protein", "DNA", "mRNA", "nucleic acid" and "genomic RNA." Examples: A;Molecule type: protein A;Molecule type: DNA; mRNA ___________________________ D.4.16 Residues record (required if Accession present; contained in Accession block) The Residues line is a single record identified by 'A;Residues:' as the first 11 characters. This record specifies a reported sequence according to the amino acid sequence as depicted in PIRn.SEQ. D-10 PIR-International Protein Sequence Database Entry Format Examples: A;Residues: 1-85,'SK',88-92,'N',94-100,'K',102-103,'A' A;Residues: 2-57,'IV',60-61,'ZZ',64-66,'Z',68-69,'ZB',72-105 A;Residues: 1-107 ___________________________ D.4.17 Cross-references record (optional; contained in Accession block) The Cross- references line is a single record identified by 'A;Cross-references:' as the first 19 characters. This record specifies a list of database/identifier pairs that indicate related sequence information in another database. Current cross referenced databases are: "GB", "EMBL", "PDB", "DDBJ", "CAS", "NCBIP" and "NCBIN." Examples: A;Cross-references: GB:J04618; GB:J04619 A;Cross-references: CAS:124041-95-8 ___________________________ D.4.18 Accession Genetics record (optional; contained in Accession block) The Accession Genetics line is a single record identified by 'A;Genetics:' as the first 11 characters. This record contains a Tag that specifies which 'C;Genetics:' block (defined below) describes the sequence report depicted by the Accession block. This record is present only when more than one 'C;Genetics:' block is present in the entry. Examples: A;Genetics: ST1 A;Genetics: ST2 A;Genetics: HBA1 D-11 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.19 Accession Note record (optional; repeating; conntained in Reference block) The Accession Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of an Accession block. This record describes sequence specific comments. Examples: A;Note: The authors translated the codon CTG for residue 169 as Ile. A;Note: 175-Ala was also found. A;Note: The difference at the carboxyl end is due to a frameshift. ___________________________ D.4.20 Comment records (optional; repeating) The Comment line is a repeating record identified by 'C;Comment:' as the first 10 characters. This record contains general information in a free format, natural language form about the protein sequence entry. Some Comment records can be decomposed and the information move to more appropriate records; this is an ongoing standardization project. Examples: C;Comment: Met preceding 1-Gly is removed after translation. C;Comment: The sequence shown is iso-1-cytochrome c. C;Comment: Euglena is a genus of green algae. ___________________________ D.4.21 Genetics record (optional; start of Genetics block) The Genetics line is a single record identified by 'C;Genetics:' as the first 11 characters. This record contains no information except if more than one Genetics block exists in the entry. In the case of multiple gene information this record will contain a unique Tag that points to an Accession block; this indicates which sequence report is related to the genetic data. Presence of this record implies the start of D-12 PIR-International Protein Sequence Database Entry Format a Genetics block and is required if other genetic information exists such as that defined below. The Genetics block may be repeated within an entry. Examples: C;Genetics: C;Genetics: C;Genetics: C;Genetics: ___________________________ D.4.22 Gene record (optional; contained in Genetics block) The Gene line is a single record identified by 'A;Gene:' as the first 7 characters. This record specifies the gene symbol used to denote the gene. Some symbols may contain "GDB" as a Tag; this indicates a cross reference to the Genome Database. Examples: A;Gene: psbE A;Gene: GDB:CYP2D6 A;Gene: CYP2B1 ___________________________ D.4.23 Map position record (optional; contained in Genetics block) The Map position line is a single record identified by 'A;Map position:' as the first 15 characters. This record specifies a map position on which the gene is located. For viruses this may indicate a segment number. Examples: A;Map position: 71.6-76.2 A;Map position: 85 min A;Map position: 4q21-q23 D-13 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.24 Genome record (optional; contained in the Genetics block) The Genome line is a single record identified by 'A;Genome:' as the first 9 characters. This record specifies which type of genome is described. Current values for this record are: "mitochondrion", "chloroplast", "cyanellle" and "plasmid." Examples: A;Genome: chloroplast A;Genome: cyanelle A;Genome: plasmid A;Genome: mitochondrion ___________________________ D.4.25 Genetic code record (optional; contained in Genetics block) The Genetic code line is a single record identified by 'A;Genetic code:' as the first 15 characters. This record indicates which Special Genetic Code table is used by the specific organism for nucleic acid translation. Current values for this record are "SGC1" - "SGC9." Examples: A;Genetic code: SGC1 ___________________________ D.4.26 Start codon record (optional; contained in Genetics block) The Start codon line is a single record identified by 'A;Start codon:' as the first 14 characters. This record indicates the codon in the nucleic acid sequence where translation is initiated. Examples: A;Start codon: ATT A;Start codon: ATC A;Start codon: GTG D-14 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.27 Introns record (optional; contained in Genetics block) The Introns line is a single record identified by 'A;Introns:' as the first 10 characters. This record specifies the intron segments needed to code for the gene product. Segments are separated by semicolons. Examples: A;Introns: 253/3; 270/3 A;Introns: 139/1; 143/2; 169/2; 253/3; 270/3 ___________________________ D.4.28 Genetics Note record (optional; repeating; contained in Genetics block) The Genetics Note line is a repeating record identified by 'A;Note:' as the first 7 characters beneath the start of a Genetics block. This record describes genetics specific comments. Examples: A;Note: strain D273-10B/A21 A;Note: strain 777-3A ___________________________ D.4.29 Function record (optional; start of Function block) The Function line is a single record identified by 'C;Function:' as the first 11 characters. This record contains no information except if more than one Function block exists in the entry. In the case of multiple function information this record will contain a unique Tag to distinguish each block. Presence of this record implies the start of a Function block and is required if other function information exists such as that defined below. The Function block may be repeated within an entry. D-15 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.30 Function Description record (optional; contained in Function block) The Function Description line is a single record identified by 'A;Description:' as the first 14 characters and immediately following the Function record. This record indicates the type of function the specified protein may have. Example: C;Function: A;Description: This protein functions as a molecular chaperone in the endosymbiont. ___________________________ D.4.31 Superfamily record (optional) The Superfamily line is a single record identified by 'C;Superfamily:' as the first 14 characters. This record indicates which Superfamily(s) has the protein as a member. Individual Superfamily names are separated by semicolons. Examples: C;Superfamily: phosphorylase C;Superfamily: sucrose synthase; sucrose/sucrose-phosphate synthase homology ___________________________ D.4.32 Keywords record (optional) The Keywords line is a single record identified by 'C;Keywords:' as the first 11 characters. This record indicates which keyword(s) are associated with the entry. Terms in the list are separated by semicolons and are used as a retreival key. Example: C;Keywords: homodimer; NAD; oxidoreductase C;Keywords: oxidoreductase; pentose phosphate pathway D-16 PIR-International Protein Sequence Database Entry Format ___________________________ D.4.33 Feature record (optional; repeating) The Feature line is a repeating record identified by 'F;' as the first 2 characters. This record may be one of many comprising a Feature table. Each line of the Feature table has the following format. Positions 3 to the occurence of a '/' character is a range or site specification. Multiple segments corresponding to a single feature are separated by commas. The feature title appear after the '/' and consists of a feature descriptor, a title, and an optional code extension enclosed by the symbols < and >. A feature descriptor is a word or short phrase followed by a colon that defines the type of feature (refer to Table IV for a complete listing and explanation of feature descriptors). The code extension consists of a short character string (alphanumerics only) that when associated with the entry identification code defines a logical address for the subsequence that is unique throughout the database. For example, the path DEECK->DKI uniquely defines the feature with code extension DKI in entry DEECK. Examples: F;316/Active site: Asp F;483/Binding site: carbohydrate F;19-69,28-52,44-65/Disulfide bonds: F;1-249/Domain: aspartokinase I F;1-24/Domain: signal sequence F;50-54/Peptide: Met-enkephalin 1 F;15-72/Protein: basic protease inhibitor Feature Table Descriptors The following descriptors denote single residue sites. These sites may be represented using the full feature residue specification conventions; however, the specification should be interpreted to specify a collection of single sites. D-17 PIR-International Protein Sequence Database Entry Format Active site: Binding site: Cleavage site: Inhibitory site: Modified site: The following descriptors denote residues connected by covalent bonds. The pairs of residues linked by hyphens should be intepreted as being connected by bonds. Single residues may be specified if the bond is between an amino acid in the sequence and another protein chain or molecule. The Cross-links: descriptor specifically denotes bonds linking the sequence shown to an adjacent protein chain. Disulfide bonds: Cross-links: The following descriptors denoting sequence regions. Pairs of residues linked by hyphens denote a region extending from the first to the second position inclusive. The Domain: descriptor indicates distinct functional regions that may be of separate evolutionary origin. The Duplication: descriptor indicates regions that have evolved by sequence duplication. The Peptide: and Protein: descriptors indicate regions corresponding to the mature protein or peptides derived from the sequence shown. The Region: descriptor is used to indicate all other regions. Domain: Peptide: Protein: Region: _______________________________________________________ Example D-1 Cont'd on next page D-18 PIR-International Protein Sequence Database Entry Format Example D-1 (Cont.) Format Specification of PIR- International Entry: _______________________________________________________ Example D-1 Format Specification of PIR-International Entry: _______________________________________________________ >..;code HEADER title - trivial name TITLE X X X X X X X X X X * SEQUENCE N;Alternate names: TEXT N;Contains: C;Species: Species block A;Variety: A;Note: C;Date: C;Accession: R; Reference block (may repeat) citation A;Authors: A;Title: A;Description: A;Reference number: A;Contents: A;Note: A;Accession: Accession block (may repeat) A;Status: A;Molecule type: A;Residues: A;Cross-references: A;Experimental source: A;Genetics: A;Note: C;Comment: Comments (may repeat) C;Genetics: Genetics block (may repeat) A;Gene: A;Map position: _______________________________________________________ Example D-1 Cont'd on next page D-19 PIR-International Protein Sequence Database Entry Format Example D-1 (Cont.) Format Specification of PIR- International Entry: _______________________________________________________ A;Genome: A;Gene origin: A;Genetic code: A;Start codon: A;Introns: A;Other products: A;Note: C;Complex: Complex C;Function: Function block A;Description: A;Pathway: A;Note: C;Superfamily: C;Keywords: F; Features block (may repeat) _______________________________________________________ D-20 _________________________________________________________________ Index _______________________________ A commands (cont'd) _______________________________ see Part II for detailed accessiono1-7, 4-2, A-1 descriptions exampleso4-5 categorieso3-2 modifierso4-3 displayo3-2, 3-5 alignmentso2-9, 2-11 file interfaceo3-2, 3-5 see also ALNo2-8 modifierso3-1 ALNo1-4, 2-8, 4-13, 4-70 see also modifiers amino acid abbreviationso2-11, operatorso3-1 4-65, B-1 punctuation ino3-2 amino end (+)o4-65 sequence searchingo3-2, 3-4 ATLAS programoi, v syntaxo3-1 use, see Chapter 3 text searchingo3-2, 3-3 command modeo3-1, 3-2 utilityo3-2, 3-5 commands composition, amino acido4-120 see also individual see show/totals commands see type see also Chapter 3 and copyo4-14, A-1 Chapter 4 exampleso4-16 menu mode (PC-DOS)o3-1, 3-9 modifierso4-15 main menuo3-10 crosso1-7, 2-8, 2-13, 4-19, option submenuo3-11 A-1 authoro1-7, 2-8, 2-13, 4-6, exampleso4-21 A-1 modifierso4-20 exampleso4-8 CTRL-Co1-9, 3-9 modifierso4-7 CTRL-Yo3-9 current listo1-9, 3-3 _______________________________ _______________________________ B D _______________________________ _______________________________ baseso4-10, A-1 database see also define activeo1-8, 4-11 database-listo4-10 contained on ATLAS CD-ROMo exampleso4-12 4-13 modifierso4-11 database-codeo1-4, 1-5, 3-3 Brookhaveno2-5, 2-7 defininiton ofo1-4 _______________________________ descriptions of eacho2-1 to C 2-17 _______________________________ DDBJo2-15 carboxyl end (*)o4-65 defineo3-1, 4-10, 4-23, 4-101, Chemical Abstracts Registry 4-121, A-1 Numbero2-12 see also BASES CODATA formato4-124 exampleso4-24 commands in menu modeo3-6 see also listing in modifierso4-24 Appendix A Index-1 Index _______________________________ E helpo4-52, A-1 _______________________________ modifierso4-52 ECOLIo1-4, 2-15, 4-13 _______________________________ EMBLo2-5, 2-15 J entry _______________________________ definition ofo1-1 JIPIDovi, 2-1, 2-15 formato4-18 addresso2-16 format (PIR)oD-1 journalo1-7, 2-8, 2-13, 4-54, texto1-1, D-4 A-1 titleo1-1, D-2 exampleso4-56 entry-codeo1-1, 1-5, 3-3, modifierso4-55 4-49, D-2 entry-identifiero1-5, 4-14, _______________________________ 4-49 K extracto4-28 _______________________________ modifierso4-30 Kabato2-5 /tableo4-31 keywordo1-7, 2-8, 2-13, 4-58, _______________________________ A-1 F exampleso4-60 _______________________________ modifierso4-59 FASTAovi ktupo6-1 see FASTAo5-2 use, see Chapter 6 referenceso5-1 _______________________________ featureo1-7, 2-8, 2-13, 4-33, L A-1 _______________________________ exampleso4-36 listo2-8, 2-13, 4-44, 4-62, modifierso4-34 A-1 fieldso1-6, 1-7, 3-3 modifierso4-63 findo1-7, 2-8, 2-9, 2-13, 3-3, /outputo1-9 4-39, A-1 /restoreo1-9 exampleso4-42 _______________________________ in menu modeo3-3 modifierso4-40 M _______________________________ _______________________________ matcho2-8, 4-64, A-1 G exampleso4-68 _______________________________ modifierso4-67 GenBanko1-4, 2-5, 2-15, 2-16, memberso1-7, 2-9, 4-70, A-1 4-13 exampleso4-72 geneo1-7, 4-45, A-1 modifierso4-71 exampleso4-47 MIPSovi, 2-1, 2-5 modifierso4-46 addresso2-7 geto4-49, A-1 modified amino acids exampleso4-50 see RESID modifierso4-50 modifierso1-9, 3-6 _______________________________ see also individual commands H /addo1-9 _______________________________ /briefo3-3 /currento1-9 Index-2 Index modifiers (cont'd) scano2-8, 4-84, A-1 /keepo1-9 exampleso4-85 /subtracto1-9 modifierso4-85 searcho4-87, A-1 _______________________________ exampleso4-90 N modifierso4-88 _______________________________ select NCBIo2-16 exampleso4-95 NRL_3Do1-4, 2-5, 2-7, 4-13 modifierso4-94, 4-115 _______________________________ seto4-97, A-1 P modifiers/nowrapo4-97 _______________________________ modifierso4-97 PATCHXo1-4, 2-5, 2-7, 4-13 /wrapo4-97 PDBo2-7 sfnumo1-7, 4-110, A-1 PIRovi, 1-4, 2-1, 2-5, 2-15, exampleso4-113 4-13 modifierso4-112 addresso2-4 showo4-99, 4-121, 4-122, A-1 post-translational modifica- exampleso4-100 tions modifierso4-99 see RESID /displayo4-24 printo4-73, A-1 /totalso4-100 protein sequence database /totalso4-126 ALNo2-8 specieso1-7, 2-8, 4-102, A-1 NRL_3Do2-7 exampleso4-104 PATCHXo2-5 modifierso4-103 PIRo2-1 superfamilyo1-7, 4-106, A-1 PSeqIPo2-5 exampleso4-108 punctuation in protein modifierso4-107 sequencesoC-1, D-3 SwissProto2-5 _______________________________ _______________________________ Q T _______________________________ _______________________________ quito4-74, A-1 taxonomyo4-114 term indexo1-6, 3-3 _______________________________ text searching commandso1-8 R titleo1-7 _______________________________ typeo2-8, 2-9, 2-13, 4-23, referenceo1-7, 2-8, 2-13, 4-72, 4-120, A-1 4-75, A-1 modifierso4-123 exampleso4-77 /exchangeo4-124 modifierso4-76 reporto4-79, 4-91 exampleso4-83 modifierso4-82 RESIDo2-11 _______________________________ S _______________________________ Index-3