This is a hybrid event. Join in person at the LCSB, or tune in via Webex here.
De novo sequencing of MS2s and cryo-EM maps
Determining protein amino acid sequences directly from experimental data is a fundamental challenge in proteomics, particularly when genome-based reference databases are incomplete or missing. De novo sequencing from tandem mass spectrometry (MS2) data allows protein identification without relying on prior sequence information. Unlike database searching, which limits the search space to known sequences, de novo approaches must consider the vast number of possible amino acid combinations, which dramatically increases the problem’s complexity. As a result, accurately interpreting MS2 spectra requires sophisticated models that can reliably infer peptide sequences from sparse and noisy signals. Here, I will provide a brief overview of the current state of de novo sequencing using mass spectrometry, emphasizing how deep learning has transformed this area. I will discuss various strategies for encoding mass spectra into neural network-compatible formats and highlight how architectural decisions affect model performance. In particular, I will present our recent advances using pairwise attention mechanisms, a variant of transformer-style attention, to better capture relationships between fragment ions. This method substantially improves sequence reconstruction accuracy compared to previous models. Furthermore, I will explore the emerging potential of cryo-electron microscopy (cryo-EM) data in de novo sequencing. Although traditionally employed for structural analysis, cryo-EM provides complementary constraints that, when integrated with MS2 data and deep learning approaches, could enable comprehensive sequence determination for novel proteins. By combining these modalities, we move closer to reference-free proteomics.
About the speaker
Lukas Käll is a professor of computational proteomics at the Royal Institute of Technology (KTH) in Sweden and is affiliated with the Science for Life Laboratory. His research focuses on developing machine learning methods to improve the analysis and interpretation of large-scale, mass spectrometry-based proteomics data. Käll is the creator of Percolator, one of the most widely used software tools in proteomics for the post-processing of peptide identifications using semi-supervised machine learning. His research group’s work includes peptide retention time prediction, protein quantification with uncertainty estimation, and applying machine learning to predict various peptide properties. Käll has made significant contributions to establishing best practices for machine learning in proteomics research.
The Causal Analysis of Biomedical Data Lecture Series is supported by the Luxembourg National Research Fund (FNR) RESCOM Program.