Session WOB. There are 4 abstracts in this session.


Integrated Machine Learning of Complex Multi-Omic Data and Clinical Risk Factors to Build Interpretable Predictive Models for Type 1 Diabetes

Bobbie-Jo Webb-Robertson

Type 1 diabetes (T1D) is a chronic autoimmune disease that results from autoimmune destruction of insulin-producing pancreatic beta-cells. T1D progresses through stages, and clinical diabetes is generally preceded by the presentation of diabetes-related autoantibodies (IA) without symptoms. As the cause of the disease remains elusive, multiple diabetes cohorts, such as the Diabetes Autoimmunity Study in the Young (DAISY), have been established to collect information longitudinally and gain insights into the biological mechanisms driving the progression of the disease from a pre-symptomatic IA state to a symptomatic T1D state. These prospective cohort studies have reported potential demographic, immune, genetic, metabolomic, and proteomic markers associated with IA or with the progression from IA to T1D. Each of these studies offers insight into environmental or mechanistic drivers of the disease, but from a single-source perspective. We present an approach to machine learning-based data integration and ensemble-based feature extraction that enables the development of precision diabetes models that can simultaneously utilize and evaluate the large collection of disparate data available. At the global level, ensemble-based learning lets us evaluate the features most important to the model for the diabetes endpoints of interest. These features can be evaluated to identify clinically relevant biomarkers or to understand the biological pathways leading to IA and/or diabetes. At the individual subject level, we can obtain the probability of a specific diabetes endpoint of interest and interrogate the specific features driving the prediction, which yields distinct profiles and a better understanding of where the model works well and where it needs improvement.


FDR Control In Very Large Proteomic Data Sets and Proteome-wide Prediction of Peptide Tandem Mass Spectra by Deep Learning

Bernhard Kuster

Estimating the false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets because limitations inherent to the classic target-decoy strategy lead to an over-representation of decoy identifications. We compared the classic approach to a novel target-decoy-based protein FDR estimation approach using ∼19,000 LC-MS/MS runs available in ProteomicsDB. The "picked" protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and "picks" either the target or the decoy sequence depending on which receives the higher score. This simple and unbiased strategy eliminates a conceptual issue in the "classic" protein FDR approach that causes overprediction of false-positive protein identifications in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software.

The identification of peptides relies on sequence database searching or spectral library matching. The lack of accurate predictive models for fragment ion intensities renders this process imperfect. We extended the ProteomeTools synthetic peptide library to 550,000 tryptic peptides and 21 million tandem mass spectra to train a deep neural network, termed Prosit, resulting in chromatographic retention time and fragment ion intensity predictions that exceed the quality of the experimental data. Integrating Prosit into database search pipelines led to more identifications at >10× lower FDR. We show the general applicability of Prosit by predicting spectra for other proteases, generating spectral libraries for DIA and improving the analysis of metaproteomes. Prosit is integrated into ProteomicsDB, allowing search result re-scoring and custom spectral library generation for any organism on the basis of peptide sequence alone.
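
The "picked" pairing step described in this abstract can be illustrated with a minimal sketch. It assumes each protein already has one target score and one decoy score (higher is better); the function name `picked_protein_fdr`, the input dictionaries, the tie-breaking rule and the q-value calculation are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def picked_protein_fdr(
    target_scores: Dict[str, float],
    decoy_scores: Dict[str, float],
) -> List[Tuple[str, float, bool, float]]:
    """Return (protein, score, is_decoy, q_value) rows, best score first.

    Hypothetical helper: for every protein, only the better-scoring member
    of the target/decoy pair is kept before the FDR is estimated.
    """
    picked = []
    for protein in set(target_scores) | set(decoy_scores):
        t = target_scores.get(protein)
        d = decoy_scores.get(protein)
        # Keep the target if it has no decoy partner or scores at least as well.
        if d is None or (t is not None and t >= d):
            picked.append((protein, t, False))
        else:
            picked.append((protein, d, True))

    # Best score first; on ties, targets are listed before decoys (assumption).
    picked.sort(key=lambda row: (-row[1], row[2], row[0]))

    # Running FDR estimate: picked decoys divided by picked targets so far.
    fdrs = []
    n_target = n_decoy = 0
    for _, _, is_decoy in picked:
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        fdrs.append(min(1.0, n_decoy / max(n_target, 1)))

    # q-value: the lowest FDR at which a row would still be accepted.
    q_values = [0.0] * len(picked)
    lowest = 1.0
    for i in range(len(picked) - 1, -1, -1):
        lowest = min(lowest, fdrs[i])
        q_values[i] = lowest

    return [
        (protein, score, is_decoy, q)
        for (protein, score, is_decoy), q in zip(picked, q_values)
    ]
```

Because each pair contributes at most one entry, a decoy whose target partner scored better never enters the ranked list, which is the pairing behaviour the abstract contrasts with the classic approach.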


Alternative Splice Isoforms in Cardiac Aging Defined by Multi-Omics

Erin Yu Han1, 2; Julianna Wright1, 2; Sara A. Wennersten1, 2; Rushita Bagchi1, 2; Edward Lau3, 4; Maggie PY Lam1, 2
1School of Medicine - University of Colorado AMC, Aurora, CO; 2Consortium for Fibrosis Research & Translation, Aurora, CO; 3Stanford Cardiovascular Institute, Palo Alto, CA; 4Stanford University, Palo Alto, CA

Introduction Aging-associated changes in alternative splicing (AS) have been implicated in the functional deterioration of multiple tissues, including the aged heart. However, how AS changes lead to the expression of downstream alternative protein isoforms in cardiac aging remains incompletely examined to date. In this study, we used a multi-omics workflow developed in-house to investigate alterations in cardiac protein isoforms in a mouse model of cardiac aging.


Method We extracted RNAs and proteins from heart ventricles of young (12 weeks) and old (78 weeks) C57BL/6J mice. Transcript expression was quantified using short-read RNA-seq. Reads were mapped to the mouse genome mm10 using STAR, and splice junctions nominated with rMATS were translated into peptide sequences using an in-house algorithm to generate age-specific protein isoform databases. We then digested the cardiac proteins and performed tandem mass tag (TMT) 2D LC-MS/MS analysis on a Q-Exactive HF mass spectrometer coupled to a nano-UHPLC system. The MS spectra were searched against the UniProt mouse proteome and the custom age-specific cardiac isoform proteome to quantify changes between proteins expressed in the young and old hearts.
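
The abstract does not describe the in-house translation algorithm. As a rough illustration of the general idea only, the sketch below joins the nucleotide sequences flanking a splice junction, translates them in three forward frames with the standard genetic code, and keeps the stop-free stretch that covers the junction. The function names `translate` and `junction_peptides`, the `min_length` parameter and the handling of frames are assumptions; strand handling and annotated reading frames are ignored.

```python
from itertools import product
from typing import Dict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def junction_peptides(
    upstream: str, downstream: str, min_length: int = 1
) -> Dict[int, str]:
    """Map frame (0, 1, 2) to the stop-free peptide covering the junction.

    Hypothetical helper: `upstream` and `downstream` are the exon sequences
    on either side of the junction, both given on the coding strand.
    """
    joined = (upstream + downstream).upper()
    junction = len(upstream)  # index of the first downstream base
    peptides = {}
    for frame in range(3):
        protein = translate(joined[frame:])
        # Residue whose codon contains the first downstream base.
        k = (junction - frame) // 3
        if not 0 <= k < len(protein) or protein[k] == "*":
            continue
        # Extend to the nearest stop codon (or sequence end) on each side.
        start = protein.rfind("*", 0, k) + 1
        end = protein.find("*", k)
        if end == -1:
            end = len(protein)
        segment = protein[start:end]
        if len(segment) >= min_length:
            peptides[frame] = segment
    return peptides
```

Peptides produced this way could be written to a FASTA file and appended to a reference proteome for database searching; that step is omitted here.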


Result (1) A bioinformatics analysis using deep-coverage RNA-seq datasets identified new age-associated changes in the expression of mRNA splice junctions; (2) We present the first age-specific cardiac protein isoform databases, which include novel protein isoforms that are not currently documented in public proteome databases; (3) We discovered 65 protein isoforms differentially expressed (FDR < 0.1) in young vs. aged mouse ventricles. Among these age-associated isoforms, we found a number of promising cardiac proteins that have been known to play vital structural or functional roles in the heart and have been implicated in cardiovascular diseases.


Conclusion Our work uses a proteotranscriptomics approach to reveal widespread proteoform changes between young and aged hearts, and further identifies candidate aging-regulated targets with potential implications in cardiac dysfunction.



Integrative profiling of human plasma proteomes enables precision phenotyping

Cecilia E Thomas; Tea Dodig-Crnkovic; Matilda Dale; Annika Bendes; Claudia Fredolini; Mun-Gwan Hong; Jochen M Schwenk
Science for Life Laboratory, Solna, Sweden

Advances in different plasma proteomics approaches provide ever-growing opportunities to study human biology. These efforts are being fuelled by the growing depth and breadth of complementary multi-omics data as well as extensive clinical information. However, several types of bias have to be taken into account, including technical and pre-analytical variables as well as medication data.

We apply a variety of sensitive, multiplexed affinity proteomics systems to profile >1000 proteins circulating in plasma: Olink’s proximity extension assays, Luminex-based assays, automated microfluidic SimplePlex ELISAs, or ultra-sensitive Quanterix Simoa assays. For exploratory screenings, we previously developed antibody bead arrays and built a comprehensive antibody validation pipeline including (i) paired antibodies, (ii) dual capture assays, (iii) sandwich immunoassays, (iv) immunocapture MS and (v) antibody-free targeted MS assays. We also use statistical and computational tools for data analysis, for example to associate plasma protein levels with genetic variants (pQTLs).

From large-scale and systematic explorations of plasma proteomes, we found that sample-related, technical, personal and health-related parameters affect protein levels. We observed a substantial impact from pre-analytical variables (sampling, needle-to-freezer time, storage time, collection date, study centre) and the need to control for batch effects when profiling hundreds of samples. Longitudinal collections are particularly attractive because resampling patients makes it possible to determine personal baselines and, from there, to monitor each individual's progression. Traits like sex, BMI, age, medication, genetics and clinical disease definitions should also be assessed prior to integrating proteomics with other omics data types. Considering these aspects, and the growing number of proteomics assays as well as population biobanks, profiling plasma will contribute to our understanding of human biology and, if applied in a considered manner, will also deliver sustainable features for patient stratification, health management and prediction of treatment outcomes.
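
The idea of a personal baseline from longitudinal sampling can be made concrete with a small sketch. It assumes that one individual's levels of a single protein are listed in visit order and that the first few visits define the baseline; the function name `deviation_from_baseline`, the `n_baseline` parameter and the z-score formulation are illustrative assumptions rather than the method used in the study.

```python
from statistics import mean, stdev
from typing import List, Sequence


def deviation_from_baseline(
    levels: Sequence[float], n_baseline: int = 3
) -> List[float]:
    """Express follow-up visits as z-scores against a personal baseline.

    Hypothetical helper: `levels` holds one protein's measurements for one
    individual in visit order; the first `n_baseline` visits form the baseline.
    """
    baseline = list(levels[:n_baseline])
    if len(baseline) < 2:
        raise ValueError("at least two baseline visits are needed")
    mu = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation of the baseline visits
    if sd == 0:
        raise ValueError("baseline visits show no variation")
    # Distance of each later visit from the person's own baseline, in SD units.
    return [(value - mu) / sd for value in levels[n_baseline:]]
```

In practice such a comparison would only be meaningful after batch effects and pre-analytical variables such as those listed above have been controlled for.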

