Workshop on Machine Learning Strategies for Disease Prediction for PhD Students

16.-17. May 2018, Technical University of Denmark, Kgs. Lyngby, Denmark
This 2-day workshop focuses on the exploitation of Big Data in disease biology using integrative methodologies and machine learning (ML) techniques. This involves the development of ML models that integrate various biological data types (genomics, metabolomics and metagenomics, along with patient characteristics, disease severity and treatment responses) to predict disease predisposition, progression and treatment outcomes. In addition to rich and diverse input data, the use of contextual information and prior knowledge is important for successful modelling. Greater interdisciplinary cooperation will facilitate translating artificial intelligence into the clinic, ultimately allowing for personalized treatment strategies.
The workshop comprises four sessions, combining keynote talks with presentations of the PhD students’ own work in oral presentations and poster sessions.


Workshop Program


Challenges in Moving from Associations to Predictions
Genomics and other high-throughput analyses have, to a large extent, focused on associations and statistical modelling between a phenotypic outcome and genetic markers [1,2]. These approaches have increased our understanding of factors likely associated with phenotypic outcomes; however, these associations hold at the population level and often have weak predictive value for individual outcomes. Translating the impact of discovered genetic variants into personal risk assessment is a recurrent challenge [3]. Alternative methods and predictive frameworks are emerging [0] that shift the focus from learning about a population towards prediction at the individual or subgroup level. This is not a straightforward change, since the goals are fundamentally different, and it also requires an evolution in accepted methods, statistics and data representation. Furthermore, a strong association does not guarantee a strong predictor: even a very clear population-level association does not necessarily lead to individual predictions or insights that are implementable in the clinic. This session aims to highlight the importance of this shift, the challenges it involves, including differences in evaluation metrics, and emerging solutions and successes.
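A population-level association can be strong and highly significant yet remain a weak predictor for any given individual. The minimal simulation below (illustrative numbers only, not drawn from any cited study) shows a hypothetical risk variant that raises disease risk from 10% to 15%: the rate difference is unmistakable at the population level, yet the marker's AUC for individual prediction stays close to 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
carrier = rng.random(n) < 0.3            # 30% carry a hypothetical risk allele
risk = np.where(carrier, 0.15, 0.10)     # carriers: 15% disease risk, others: 10%
disease = rng.random(n) < risk

# Population level: a clear, reproducible difference in disease rates
rate_carrier = disease[carrier].mean()
rate_noncarrier = disease[~carrier].mean()

# Individual level: AUC of a binary marker = (sensitivity + specificity) / 2
sens = carrier[disease].mean()           # P(carrier | diseased)
spec = (~carrier)[~disease].mean()       # P(non-carrier | healthy)
auc = (sens + spec) / 2
print(f"rates {rate_carrier:.3f} vs {rate_noncarrier:.3f}, AUC {auc:.3f}")
```

The evaluation metric drives the conclusion: a rate ratio or p-value rewards the population-level signal, while the AUC exposes how little the marker says about an individual.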

Incorporating High-dimensional Data in Prediction Models
Genetic and other high-throughput omics data, such as transcriptomics, metabolomics and metagenomics, are increasingly high-dimensional, with far more variables than samples. Few prediction models work well out of the box with this number of variables, and overfitting is a frequent challenge [1]. Various strategies for addressing this are emerging, including intelligent feature selection [1,4], functional groupings of features and multi-omics scaffolding. However, challenges also arise when the selected variables are integrated with other types of variables in a prediction model. This session will focus on feature selection as well as on ways to integrate data of different types.
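One way overfitting creeps in when variables far outnumber samples is selecting features on the full data set before splitting it: information from the test samples leaks into the selection. The sketch below (pure-noise labels and a simple nearest-centroid classifier; all numbers illustrative) contrasts leaky selection with selection restricted to the training split.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100, 5000, 10              # 100 samples, 5000 features, keep the top 10
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)            # labels are pure noise: true accuracy is 50%

def top_k(Xs, ys, k):
    # rank features by absolute correlation with the label
    Xc = Xs - Xs.mean(axis=0)
    yc = ys - ys.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-k:]

def centroid_acc(Xtr, ytr, Xte, yte, idx):
    # nearest-centroid classifier on the selected features
    m0 = Xtr[ytr == 0][:, idx].mean(axis=0)
    m1 = Xtr[ytr == 1][:, idx].mean(axis=0)
    d0 = ((Xte[:, idx] - m0) ** 2).sum(axis=1)
    d1 = ((Xte[:, idx] - m1) ** 2).sum(axis=1)
    return ((d1 < d0).astype(int) == yte).mean()

tr, te = np.arange(60), np.arange(60, 100)
# WRONG: features selected on all samples, data split afterwards -> leakage
leaky = centroid_acc(X[tr], y[tr], X[te], y[te], top_k(X, y, k))
# RIGHT: features selected on the training split only
honest = centroid_acc(X[tr], y[tr], X[te], y[te], top_k(X[tr], y[tr], k))
print(f"leaky accuracy {leaky:.2f}, honest accuracy {honest:.2f}")
```

With 5000 noise features, the leaky pipeline typically reports well above chance accuracy while the honest one hovers around 50%, which is why feature selection must sit inside the cross-validation loop.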

Use of Prior Knowledge in Associations/Predictions
Biology is shaped by multiple interactions in the cellular space. Hence, single-marker genomics is limited, since the only information used is genomic-level linkage. Domain knowledge can be used to improve the predictive power of a model; this may include knowledge from pathways, protein-protein interactions or genetic interaction networks. Studies applying prior knowledge in their models have improved predictions based on genetic markers and clinical biomarkers [2,5]. However, a limitation of including prior knowledge is that models may become biased towards known processes and might therefore miss novel interactions with weaker signals [2]. This session will focus on how to include known cellular pathways, networks and/or protein-protein interactions in prediction models.
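As a minimal sketch of one such strategy, gene-level features can be collapsed into pathway-level scores using a membership map from a resource such as KEGG or Reactome; the gene names and pathway assignments below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
genes = [f"g{i}" for i in range(6)]
# hypothetical pathway membership (in practice taken from KEGG, Reactome, ...)
pathways = {"glycolysis": ["g0", "g1", "g2"], "apoptosis": ["g3", "g4", "g5"]}

X = rng.standard_normal((4, len(genes)))   # 4 samples x 6 gene-level features
col = {g: j for j, g in enumerate(genes)}

# collapse gene-level features into one aggregate score per pathway
P = np.column_stack([X[:, [col[g] for g in members]].mean(axis=1)
                     for members in pathways.values()])
print(P.shape)                             # 4 samples x 2 pathway scores
```

The pathway matrix is lower-dimensional and biologically interpretable, but, as noted above, any signal carried by genes outside the known pathways is lost, which is the bias the session description warns about.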

Operational Data Challenges

Disease-associated data come with a number of challenges that must be addressed before they can be presented to machine learning models. By its nature, observational data contain missing values, which leads to a loss of samples, input features and power. One strategy for dealing with missing data is imputation; however, this may introduce bias and requires suitable assumptions. Another data-quality issue in patient health records is that samples and data are collected at different time points, which makes comparisons across time points challenging. In order to combine heterogeneous data types, data must be scaled and encoded appropriately for the model to learn patterns. This session will focus on how to clean and prepare data for machine learning models through imputation strategies, handling of longitudinal measurements and encoding of input features from different data types.
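The preparation steps above (imputation, scaling and encoding) can be sketched on a toy patient table with invented values:

```python
import numpy as np

# toy patient table: age (numeric, with missing values), smoker (categorical)
age = np.array([54.0, np.nan, 61.0, 47.0, np.nan, 70.0])
smoker = np.array(["yes", "no", "no", "ex", "yes", "no"])

# 1) impute missing ages with the median of the observed values
med = np.nanmedian(age)
age_imp = np.where(np.isnan(age), med, age)

# 2) z-score scaling, so age is on the same footing as other numeric features
age_z = (age_imp - age_imp.mean()) / age_imp.std()

# 3) one-hot encode the categorical feature
cats = np.unique(smoker)                       # ['ex', 'no', 'yes']
onehot = (smoker[:, None] == cats[None, :]).astype(float)

X = np.column_stack([age_z, onehot])           # matrix ready for an ML model
print(X.shape)                                 # 6 patients x 4 encoded features
```

Median imputation is only one option, and the caveat in the session description applies: it assumes values are missing at random, and any imputation choice can bias downstream predictions.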


Marylyn D Ritchie, PhD, University of Pennsylvania, USA
Dr. Ritchie’s research focuses on improving our understanding of the underlying genetic architecture of common diseases and pharmacogenomic traits, among others. Her approaches involve the development and application of new statistical and computational methods for the integration of multiple types of ‘omics data.

Andrea Califano, PhD, Columbia University, USA
Dr. Califano’s interests reside in the assembly and interrogation of gene regulatory models for the elucidation of mechanisms presiding over cell physiology and their dysregulation in disease, with specific applications to cancer, stem cells, and neurodegenerative disease.

Jason Moore, PhD, University of Pennsylvania, USA
Dr. Moore’s research focuses on the development and application of artificial intelligence and machine learning methods for analysis of big biomedical data from research studies aimed at improving our understanding of human health. Recent work has focused on automated machine learning and accessible artificial intelligence.

Chloé-Agathe Azencott, PhD, Centre for Computational Biology (CBIO) of Mines ParisTech, Institut Curie and INSERM, France
Dr. Azencott’s research focuses on the development of methods for efficient multi-locus biomarker discovery. Essentially, the goal is to make sense of data with a small number of samples and a large number of variables, and to find out which of those variables play a role in a particular biological process or pathology. Her work has numerous applications, in particular in precision medicine, where the goal is to develop treatments adapted to the (genetic) specificities of patients, in contrast with a classical one-size-fits-all approach. Dr. Azencott is interested in incorporating additional (structured) information, for example biological networks; in multi-task approaches, where multiple related problems are addressed simultaneously; and in the development of fast but accurate techniques to address these issues. In terms of machine learning, much of her work is linked to structured sparsity.



The workshop is held at:
Technical University of Denmark

Rooms for conference participants are to be booked individually.

DTU Bioinformatics – Department of Bio and Health Informatics at Technical University of Denmark
DTU Compute – Department of Applied Mathematics and Computer Science at Technical University of Denmark

Supported by: Poul V. Andersen’s Foundation

[0] Tien Yin Wong, Neil M. Bressler. Artificial Intelligence With Deep Learning Technology Looks Into Diabetic Retinopathy Screening. JAMA. 2016;316(22):2366-2367.

[1] M. L. Bermingham, R. Pong-Wong, A. Spiliopoulou, C. Hayward, I. Rudan, H. Campbell, A. F. Wright, J. F. Wilson, F. Agakov, P. Navarro, C. S. Haley. Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports 5, Article number: 10312 (2015).

[2] Sebastian Okser, Tapio Pahikkala, and Tero Aittokallio. Genetic variants and their interactions in disease risk prediction – machine learning and network perspectives. BioData Min. 2013; 6: 5.

[3] Nguyen, Tuan V., Eisman, John A. Genetics and the Individualized Prediction of Fracture. Current Osteoporosis Reports. 2012;10(3):236-244.

[4] Saeys, Yvan, Iñaki Inza, and Pedro Larrañaga. "A review of feature selection techniques in bioinformatics." Bioinformatics 23.19 (2007): 2507-2517.

[5] Pedersen, H. K., Gudmundsdottir, V., Pedersen, M. K., Brorsson, C. A., Brunak, S., & Gupta, R. (2016). Ranking factors involved in diabetes remission after bariatric surgery using machine-learning integrating clinical and genomic biomarkers. npj Genomic Medicine, 1, 16035. DOI: 10.1038/npjgenmed.2016.35


Wed 16 May 18 8:30 -
Thu 17 May 18 17:00


DTU Compute
DTU Bioinformatics


16.-17. May 2018, Kemitorvet 202, room 8003, Technical University of Denmark, Kgs. Lyngby, Denmark