UNIPROT DATABASE - UNIVERSAL INFORMATION RESOURCE OF PROTEIN SEQUENCES

Authors

A. Kulyyassov

National Center for Biotechnology, 13/5, Korgalzhyn road, Nur-Sultan, 010000, Kazakhstan

Abstract

Protein sequences are stored in public databases such as the UniProt Knowledgebase (UniProtKB), where curators add bioinformatics data, including predictions of the structure and function of biomolecules, as well as experimental results. Protein function can be predicted using sequence similarity searches; an alternative approach is to use protein signatures, which classify proteins into families and domains. The main protein signature databases are accessible through the integrated InterPro database, which provides a classification of UniProtKB sequences. In addition to characterizing proteins through protein families, many researchers are interested in analyzing the complete set of proteins encoded by a genome (i.e., the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with fully sequenced genomes. This article reviews the tools and resources available on the Internet for characterizing individual proteins and for analyzing entire proteomes.
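As an illustration of the signature-based approach mentioned above, the following is a minimal sketch, not the method used by InterPro or any of its member databases. It translates a simple PROSITE-style pattern into a regular expression and scans a sequence for matches. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the function names `prosite_to_regex` and `find_signature` and the example sequence are hypothetical and introduced here only for illustration. Only the basic pattern elements (single residues, `[..]`, `{..}`, `x`, and repeat counts) are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Handles single residues, [ABC] (allowed), {ABC} (forbidden), x (any),
    and repeat counts such as x(2) or x(2,4). Anchors (< and >) are not
    supported in this sketch.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the element from an optional repeat count in parentheses.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            token = core                      # a residue or a [..] set
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(token)
    return "".join(parts)


def find_signature(sequence: str, pattern: str):
    """Return (0-based start, matched text) for every match, overlaps included."""
    # A lookahead lets the scan report overlapping matches as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Hypothetical sequence, made up for this example.
example_sequence = "MKNGSAANPTQNLTV"
hits = find_signature(example_sequence, "N-{P}-[ST]-{P}")
print(hits)
```

A pattern match of this kind only flags a candidate site; the signature databases combine such patterns with profiles and hidden Markov models, and a hit still needs biological interpretation.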

Keywords

Association-Rule-Based Annotator (ARBA), European Bioinformatics Institute (EBI), European Molecular Biology Laboratory (EMBL), DNA Data Bank of Japan (DDBJ), Gene Ontology Annotation (GOA), Global Proteome Machine (GPM), mass spectrometry (MS), proteomics, liquid chromatography-tandem mass spectrometry (LC-MS/MS), multiple reaction monitoring (MRM), National Institutes of Health (NIH), Protein Data Bank (PDB), PRoteomics IDEntifications (PRIDE), Protein Information Resource (PIR), post-translational modification (PTM), Swiss Institute of Bioinformatics (SIB), Universal Protein Resource (UniProt), UniProt Archive (UniParc), UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef)
