Genomic epidemiology

The group uses sequencing together with informatics tools to spot occurrences of highly virulent or resistant pathogens and implement targeted interventions to prevent their spread.

Resistance to current antimicrobials is evolving at an alarming rate. 700,000 people die each year from resistant infections, and it is estimated that this number will rise to 10 million per year by 2050. Resistance may thus develop faster than new drugs can be developed. In a recent report from April 2014, WHO concludes: “A post-antibiotic era – in which common infections and minor injuries can kill – far from being an apocalyptic fantasy, is instead a very real possibility for the 21st Century”. It is therefore imperative to prepare for a world where bacterial infections may be deadly and untreatable. Since pre-antibiotic times, various approaches to hinder the spread of infectious diseases have been used, and these approaches should be revisited and updated in light of the technological development of the last 50 years.

We are using sequencing together with informatics tools to spot occurrences of highly virulent or resistant strains and implement targeted interventions to prevent their spread. This will involve bringing genetic epidemiology together with epidemiological modeling, and working closely with hospitals and health authorities involved in disease control.

The aim of the group is to provide the scientific foundation for future internet-based solutions where a central database will enable simplification of total genome sequence information and comparison to all other sequenced genomes, including spatial-temporal analysis. We will develop algorithms for rapid analysis of whole-genome DNA sequences, tools for analysis and extraction of information from the sequence data, and web interfaces for using the tools in the global scientific and medical community. The activity is being expanded to also include other microorganisms, such as viruses and parasites, as well as metagenomic samples.

Center for Genomic Epidemiology

Over the last 6 years we have worked on developing a system for surveillance and diagnostics of infectious diseases at the Center for Genomic Epidemiology (CGE). The basic aims are to find out what is in a sample (typing), how pathogenic it is, and what its antibiotic resistance profile is (phenotyping). For epidemiological tracing it is furthermore necessary to know how it is evolutionarily related to isolates from other samples.

In addition to genomic epidemiology, the group is engaged in other areas of research. These include protein networks in cancers, development of a recombinant antibody-based treatment of snakebites, and analysis of immune responses in HIV-infected individuals.

Research areas

The group is engaged in the following research areas.


The first tool we developed at CGE was a method for Multi Locus Sequence Typing (MLST) of bacteria using raw reads (or assembled genomes) as input (Larsen et al., 2012). This method is available as a web server. As with other MLST methods, the user must select the species for the method to use the correct MLST scheme. We have therefore recently evaluated a number of methods based on 16S, k-mers, and ribosomal genes to deduce the species from the raw sequences (Larsen et al., submitted). We found that the k-mer-based method was very fast and reliable for species identification. This method is also available online. We have later adapted the above-described methodology to another typing scheme: plasmid MLST (pMLST) (manuscript in preparation).
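
The idea behind k-mer-based species identification can be illustrated with a minimal sketch. This is not the CGE implementation: the function names, the toy reference sequences, and the choice of k=8 are assumptions made for illustration only, and a real tool would also handle reverse complements and sequencing errors.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of species whose reference contains it."""
    index = {}
    for species, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(species)
    return index


def predict_species(reads, index, k):
    """Count read k-mers matching each species and return the best-scoring one."""
    hits = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            for species in index.get(kmer, ()):
                hits[species] += 1
    if not hits:
        return None, hits
    return hits.most_common(1)[0][0], hits


# Hypothetical toy references; real references would be whole genomes.
references = {
    "species_A": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "species_B": "TTGACCGGTTAACCGGATATCGCGAT",
}
index = build_index(references, k=8)

# Toy reads, both drawn from the species_A reference.
reads = ["CGTTAGCCGATAGC", "GATAGCTAGGCT"]
best, hits = predict_species(reads, index, k=8)
print(best, dict(hits))
```

Because the lookup is a dictionary access per k-mer, the cost grows with the amount of read data rather than with the number of reference species, which is one plausible reason such methods are fast.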

Phenotype predictions

Once a pathogen is diagnosed, it is important to know what you can and cannot treat it with. We have therefore developed a method for identification of acquired antimicrobial resistance genes (Zankari et al., 2012b). A major effort was put into compiling a human-curated database based on both public databases and scientific papers. Concerns have been raised that an assigned genotype may not always correspond to a phenotype, for example because mutations outside a gene may affect the expression of the gene product. We therefore conducted a study to compare genotype and phenotype. We found that genotyping using whole-genome sequencing is a realistic alternative to surveillance based on phenotypic antimicrobial susceptibility testing (Zankari et al., 2012a). We found a surprisingly high concordance (99.74%) between phenotypic and predicted antimicrobial susceptibility. This is promising, but it must also be said that the study was conducted in a population with relatively low levels of resistance, and lower levels of concordance may be found in other populations.
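
Concordance between predicted and phenotypic susceptibility is simply the share of isolate/antimicrobial tests on which the two calls agree. A minimal sketch, assuming each test is coded as "R" (resistant) or "S" (susceptible); the function name and the toy calls below are hypothetical and are not data from the study:

```python
def concordance(predicted, observed):
    """Percentage of tests where the predicted call equals the phenotypic call."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one test is required")
    agreements = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * agreements / len(predicted)


# Hypothetical calls for five isolate/antimicrobial combinations.
predicted = ["R", "S", "S", "R", "S"]
observed = ["R", "S", "S", "S", "S"]

percent = concordance(predicted, observed)
discordant = [i for i, (p, o) in enumerate(zip(predicted, observed)) if p != o]
print(percent, discordant)
```

Listing the discordant tests alongside the percentage is useful in practice, since these are the cases where a resistance gene is missed or not expressed.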

Andreatta et al. took a radically different approach and sorted genomes of gamma-proteobacteria into pathogenic and non-pathogenic, and looked for gene families that were statistically associated with being found in either pathogenic or non-pathogenic bacteria (Andreatta et al., 2010). This is, to the best of our knowledge, the first example of using machine learning techniques to determine the phenotype from whole-genome sequences. The method has later been extended to work for all species of bacteria and to use raw sequencing data as input (Cosentino et al., 2013. PMID: 24204795).
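
One simple way to test whether a gene family is associated with pathogenic genomes is a one-sided Fisher exact test on a 2x2 table of presence/absence counts. The sketch below uses that test purely as an illustration; it is not necessarily the statistic used by Andreatta et al., and the counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value for enrichment in pathogenic genomes.

    a: pathogenic genomes with the gene family
    b: pathogenic genomes without it
    c: non-pathogenic genomes with it
    d: non-pathogenic genomes without it
    """
    n_pathogenic = a + b
    n_with_family = a + c
    n_total = a + b + c + d
    upper = min(n_pathogenic, n_with_family)
    # Sum the hypergeometric probabilities of tables at least as extreme as observed.
    numerator = sum(
        comb(n_pathogenic, x) * comb(n_total - n_pathogenic, n_with_family - x)
        for x in range(a, upper + 1)
    )
    return numerator / comb(n_total, n_with_family)


# Hypothetical counts: the family is present in 8 of 10 pathogenic genomes
# and in 1 of 10 non-pathogenic genomes.
p_enriched = fisher_enrichment_p(8, 2, 1, 9)

# Hypothetical counts with no association: present in half of each group.
p_neutral = fisher_enrichment_p(5, 5, 5, 5)
print(p_enriched, p_neutral)
```

When many gene families are tested at once, the resulting p-values would need a multiple-testing correction before any family is called associated.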

Similar methods can also be used at the single-protein level. Jessen et al. developed a method for finding sites associated with biological activity, based on sorting the sequences by the measured activity associated with each sequence and then statistically investigating whether certain amino acids at certain positions were associated with biological activity (Jessen et al., 2013).

Metagenomic samples

Much attention has recently been given to the possibility of diagnosing diseases based on metagenomic samples, since this is faster and simpler than having to isolate the bacteria. Hasman et al. were, to the best of our knowledge, the first to show that metagenomic samples (in this case urine) could be used to diagnose a pathogen without prior knowledge of which species it was. It was found that WGS improved the identification of the cultivated bacteria, and almost complete agreement was observed between phenotypic and predicted antimicrobial susceptibilities (Hasman et al., 2013). For this project a method, ChainMapper, was developed to map all reads against all fully sequenced bacteria and viruses, as well as resistance genes and genomes from the MetaHIT project. This method has since been updated and re-implemented, and is available as a method called MGMapper.


Making phylogenetic trees based on SNPs is the emerging standard for detailed study of evolutionary relationships between isolates in an outbreak. We have developed the first web-based server for SNP tree analysis (Leekitcharoenphon et al., 2012). In SNP tree analysis the details of the method, such as how SNPs are called and filtered, are very important for the reliability of the result. We have therefore initiated a number of studies to evaluate and refine these methods. One method that has shown promise is the NDtree method (Leekitcharoenphon et al. PMID: 24505344; Joensen et al. PMID: 24574290; Kaas et al. PMID: 25110940), which is centered on calling bases rather than distinguishing between SNPs and non-SNPs, since the concept of a reference sequence makes less sense for bacteria, which vary a lot, than it does for human genomics.
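
The base-calling idea can be sketched as follows. This is an illustration of the general principle only, not the NDtree implementation: it assumes each isolate has already been reduced to one call per position, with "N" marking positions where no base could be called confidently, and it counts differences only where both isolates have a call. The isolate names and sequences are hypothetical.

```python
from itertools import combinations


def pairwise_distance(calls_a, calls_b, uncalled="N"):
    """Count positions where both isolates have a called base and the bases differ."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call strings must cover the same positions")
    return sum(
        1
        for x, y in zip(calls_a, calls_b)
        if x != uncalled and y != uncalled and x != y
    )


def distance_matrix(calls):
    """Return a symmetric dict of pairwise distances keyed by (name, name)."""
    names = sorted(calls)
    dist = {(name, name): 0 for name in names}
    for a, b in combinations(names, 2):
        d = pairwise_distance(calls[a], calls[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Hypothetical base calls for three isolates over ten positions.
calls = {
    "isolate_1": "ACGTNACGTA",
    "isolate_2": "ACGTTACGTA",
    "isolate_3": "ACCTTACNTT",
}
dist = distance_matrix(calls)
print(dist[("isolate_1", "isolate_2")], dist[("isolate_1", "isolate_3")])
```

A distance matrix of this kind can then be passed to a standard tree-building method such as neighbor joining or UPGMA.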

Low bandwidth solutions

Uploading NGS data to a server takes approximately an hour per gigabyte of input data, depending on the connection between the client and the server. We are therefore exploring several approaches for settings with limited bandwidth.

The first approach is based on not sending all the data to the server. The raw data contain multiple copies of each part of the sequence: it is common to sequence to a depth of 30x, meaning that each base in the target genome is represented on average 30 times in the input data. Furthermore, the full genome is not always needed, for example to determine the species. Often the species can be determined from the 16S gene, which covers only approximately one per mille of the genome.
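
The potential saving can be made concrete with a small calculation. The 30x depth and the one-per-mille fraction come from the text above; the genome size of 5 million bases is an assumption chosen for illustration, and the function name is hypothetical.

```python
def bases_to_transfer(genome_size, depth, per_mille_of_genome=1000):
    """Approximate number of sequenced bases covering a region at a given depth."""
    region_size = genome_size * per_mille_of_genome // 1000
    return region_size * depth


GENOME_SIZE = 5_000_000  # assumed genome size in bases (illustrative)
DEPTH = 30               # sequencing depth mentioned in the text

all_reads = bases_to_transfer(GENOME_SIZE, DEPTH)        # whole genome at 30x
locus_reads = bases_to_transfer(GENOME_SIZE, DEPTH, 1)   # 16S-sized region at 30x
locus_consensus = bases_to_transfer(GENOME_SIZE, 1, 1)   # 16S-sized region, one copy

print(all_reads, locus_reads, locus_consensus)
print(all_reads // locus_reads, all_reads // locus_consensus)
```

Under these assumptions, sending only reads from the 16S-sized region cuts the data by a factor of 1,000, and sending a single consensus copy of that region cuts it by a factor of 30,000.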

A second method for reducing the amount of data that needs to be transmitted over the internet is based on creating a database of highly discriminative 50-mers from 16S genes (or other loci). By selecting only 50-mers that are highly discriminative (found in one or a few species, but not in all others), it is possible to reduce the size of the database to around 5 megabytes. This database can easily be sent from the server to the client. We have developed a method based on these principles called Reads2Type, which in a few minutes can predict which species a set of reads is derived from.
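
The selection of discriminative k-mers can be sketched as below. This is not the Reads2Type implementation: the function names and toy sequences are hypothetical, and k=5 is used only so that the toy example stays readable, whereas the method described above uses 50-mers.

```python
from collections import Counter


def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(loci, k, max_species=1):
    """Keep only k-mers that occur in at most max_species species."""
    owners = {}
    for species, seq in loci.items():
        for kmer in kmer_set(seq, k):
            owners.setdefault(kmer, set()).add(species)
    return {kmer: sp for kmer, sp in owners.items() if len(sp) <= max_species}


def classify(reads, database, k):
    """Client-side step: count discriminative k-mers per species in the reads."""
    hits = Counter()
    for read in reads:
        for kmer in kmer_set(read, k):
            for species in database.get(kmer, ()):
                hits[species] += 1
    return hits.most_common(1)[0][0] if hits else None


# Hypothetical marker sequences sharing a conserved 14-base prefix.
loci = {
    "species_A": "GGCCTAACACATGCAAGTCGA",
    "species_B": "GGCCTAACACATGCTTGACCA",
}
database = discriminative_kmers(loci, k=5)
prediction = classify(["ACATGCTTGACC"], database, k=5)
print(len(database), prediction)
```

K-mers from the conserved prefix occur in both species and are dropped, which is what keeps the database small enough to send to the client.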

A third approach is to move the databases and computing resources to the client.

Design of Diagnostics

Whole-genome sequencing may also be used to select sequences for use in diagnostics. We have earlier developed a method to find short peptide sequences that are found in all strains that the diagnostic is supposed to cover (in this case Mycobacterium leprae) and not found in any other known bacteria, with special focus on closely related species such as Mycobacterium tuberculosis (Bobosha et al., 2012. PMID: 23283462). We have recently (Martin Thomsen, unpublished project report) developed a similar approach to select promising PCR primers for diagnostics and are planning to evaluate the method experimentally in collaboration with SSI, Copenhagen, Denmark.
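
The selection criterion (present in every target strain, absent from all other organisms) amounts to a set intersection followed by a set difference. A minimal sketch, assuming each proteome is given as a list of protein sequences; the function names, the peptide length of 4, and the toy sequences are hypothetical and far shorter than real data:

```python
def peptides(protein, length):
    """Return the set of all overlapping peptides of a given length."""
    return {protein[i:i + length] for i in range(len(protein) - length + 1)}


def proteome_peptides(proteome, length):
    """Collect the peptides of every protein in one proteome."""
    collected = set()
    for protein in proteome:
        collected |= peptides(protein, length)
    return collected


def diagnostic_peptides(target_proteomes, background_proteomes, length):
    """Peptides found in every target strain and in no background organism."""
    if not target_proteomes:
        raise ValueError("at least one target proteome is required")
    target_sets = [proteome_peptides(p, length) for p in target_proteomes]
    candidates = set.intersection(*target_sets)
    for proteome in background_proteomes:
        candidates -= proteome_peptides(proteome, length)
    return candidates


# Hypothetical toy proteomes: two target strains and one related background species.
targets = [["MKTLLVAGA", "QWERTY"], ["MKTLLVSGA"]]
background = [["AKTLLQ"]]

selected = diagnostic_peptides(targets, background, length=4)
print(sorted(selected))
```

The same pattern would carry over to PCR primer selection by replacing peptides with nucleotide k-mers, although primer design adds further constraints such as melting temperature that are not modeled here.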


Ole Lund
DTU Bioinformatics
+45 45 25 24 25

