PIR Non-redundant REFerence protein database (PIR-NREF)
*******************************************************

        Protein Information Resource (PIR)
        National Biomedical Research Foundation
        Georgetown University Medical Center
        3900 Reservoir Road, N.W.
        Washington, DC 20057, USA

        Phone:  (202) 687-2121
        Fax:    (202) 687-1662
        E-mail: pirmail@nbrf.georgetown.edu

1. Introduction

   As a major resource of protein information, PIR has as one of its
   primary aims to provide a timely and comprehensive collection of all
   protein sequence data that keeps pace with the genome sequencing
   projects and contains source attribution and minimal redundancy.

   The PIR-NREF (Non-redundant REFerence) protein database includes all
   sequences from PIR-PSD, UniProtKB/Swiss-Prot, UniProtKB/TrEMBL,
   RefSeq, GenPept, and PDB. Each NREF entry represents an identical
   amino acid sequence from the same source organism that is redundantly
   presented in one or more underlying protein databases, and can serve
   as the basic unit for protein annotation. The NCBI taxonomy is used
   as the ontology for matching source organism names at the species or
   strain (if known) level.

   The NREF report provides source attribution (protein IDs, accession
   numbers, taxonomy ID, and protein names from the underlying
   databases), in addition to taxonomy, amino acid sequence, and
   composite bibliography data. The composite protein names, including
   synonyms, alternate names, and even misspellings, can be used to
   assist ontology development on protein names and the identification
   of mis-annotated proteins. Related sequences, including identical
   sequences from different organisms, identical subsequences, and
   highly similar sequences (>=95% sequence identity), are also listed.

2. Major Features

   a) Comprehensiveness and Timeliness: Containing all sequences from
      PIR-PSD, UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, RefSeq, GenPept,
      and PDB, and updated bi-weekly.
   b) Non-Redundancy: Clustered by sequence identity and taxonomy at the
      species level.
   c) Source Attribution: Containing protein IDs and names from
      associated databases (with hypertext links), in addition to
      protein sequence, taxonomy, and bibliography.

3. New Release Features

   Two New Search Options:

   * Text: Retrieve a matching list of entries by searching the protein
     names or the species/organism name.
   * Species-Based Browsing and Searching: Browse or search ~100
     organisms, including over 70 complete genomes.

4. Database Access and Usage

   FTP Downloading: PIR-NREF is available for free downloading and
   redistribution from our FTP site in XML format (data file) and FASTA
   format (sequence file).

   Web Site Access: The Web site supports both text and sequence
   searches for report and list retrieval. Direct report retrieval is
   based on unique sequence identifiers, including IDs and accession
   numbers of the source databases. List retrieval is supported by both
   text and sequence searches. The text search matches protein and
   species names using combinations of text strings (and substrings).
   Sequence searches include full-scale and species-based BLAST searches
   and peptide/pattern match for functional identification of query
   proteins or peptides. The Peptide Match finds an exact match in the
   NREF database to a user-defined peptide sequence. The Pattern Match
   searches a user-defined pattern or ProSite pattern against all NREF
   sequences.

5. Publications

   Wu, C., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z.,
   Ledley, R.S., Lewis, K.C., Mewes, H.-W., Orcutt, B.C., Suzek, B.E.,
   Tsugita, A., Vinayaka, C.R., Yeh, L.-S., Zhang, J. and Barker, W.C.
   (2002). The Protein Information Resource: an integrated public
   resource of functional annotation of proteins. Nucleic Acids
   Research, 30, 35-37.

   Presentations/Posters

   Suzek, B.E., Huang, H., Orcutt, B., Chen, Y., Hu, Z., Zhang, J. and
   Wu, C.H. PIR Non-Redundant Reference Protein Database (PIR-NREF).
   Sixth Annual International Conference on Research in Computational
   Molecular Biology (April 2002).

6. Sponsorship

   PIR is supported by NIH Grant # P41 LM05798.
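
Illustrative sketch: the non-redundancy rule and the Peptide Match
described above can be pictured with a few lines of Python. This is only
a minimal sketch under stated assumptions, not the PIR production
pipeline. The record layout (SourceRecord), the function names, and the
sample IDs, names, and sequences are all invented for illustration;
only the grouping key (identical sequence plus NCBI taxonomy ID) and
the exact-substring peptide search come from the description above.

```python
from collections import defaultdict
from typing import NamedTuple


class SourceRecord(NamedTuple):
    """One protein record from an underlying database (hypothetical layout)."""
    database: str    # source database, e.g. "PIR-PSD" or "RefSeq"
    protein_id: str  # ID or accession number in the source database
    name: str        # protein name as given by the source database
    tax_id: int      # NCBI taxonomy ID of the source organism
    sequence: str    # amino acid sequence in one-letter code


def build_nref_entries(records):
    """Merge records that share an identical sequence and taxonomy ID."""
    groups = defaultdict(list)
    for rec in records:
        # Assumption: sequences are compared case-insensitively.
        groups[(rec.sequence.upper(), rec.tax_id)].append(rec)

    entries = []
    for (sequence, tax_id), members in groups.items():
        entries.append({
            "sequence": sequence,
            "tax_id": tax_id,
            # Source attribution: where each redundant copy came from.
            "sources": [(m.database, m.protein_id) for m in members],
            # Composite names: every distinct spelling is kept.
            "names": sorted({m.name for m in members}),
        })
    return entries


def peptide_match(entries, peptide):
    """Return the entries whose sequence contains the peptide exactly."""
    peptide = peptide.upper()
    return [e for e in entries if peptide in e["sequence"]]


# Invented sample data: same toy sequence, two organisms, three sources.
records = [
    SourceRecord("PIR-PSD", "DEMO0001", "demo protein", 9606, "MGDVEKGKKIF"),
    SourceRecord("RefSeq", "DEMO0002", "Demo Protein", 9606, "MGDVEKGKKIF"),
    SourceRecord("GenPept", "DEMO0003", "demo protein", 10090, "MGDVEKGKKIF"),
]
entries = build_nref_entries(records)

for entry in entries:
    print(entry["tax_id"], entry["sources"], entry["names"])
```

In this sketch the two records with taxonomy ID 9606 collapse into one
entry with two sources and two name spellings, while the identical
sequence with taxonomy ID 10090 stays a separate entry, corresponding to
the "identical sequences from different organisms" listed as related
sequences. Calling peptide_match(entries, "VEKG") returns both entries.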