Research areas

The first tool we developed at CGE was a method for Multi Locus Sequence Typing (MLST) of bacteria using raw reads (or assembled genomes) as input (Larsen et al., 2012). This method is available as a web server. As with other MLST methods, the user must select the species so that the method uses the correct MLST scheme. We have therefore recently evaluated a number of methods based on 16S, k-mers, and ribosomal genes to deduce the species from the raw sequences (Larsen et al., submitted). We found that the k-mer based method was very fast and reliable for species identification. This method is also available online. We have later adapted the above-described methodology to another typing scheme: plasmid MLST (pMLST) (manuscript in preparation).
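The idea behind k-mer based species identification can be illustrated with a minimal sketch. This is not the published implementation: the function names, the default k-mer length, and the scoring rule (number of distinct shared k-mers) are assumptions chosen for illustration, and reverse-complement handling is left out.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def predict_species(reads, references, k=16):
    """Score each reference by the number of distinct read k-mers it contains.

    reads:      list of read sequences (strings)
    references: dict mapping species name -> reference sequence
    k:          k-mer length (hypothetical default, not taken from the paper)
    Returns (best_species, scores).
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    scores = {
        name: len(read_kmers & kmers(seq, k))
        for name, seq in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

A real tool would precompute the reference k-mers once and store them in an index rather than rebuilding them for every query.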

Phenotype predictions
Once a pathogen has been identified, it is important to know what it can and cannot be treated with. We have therefore developed a method for identification of acquired antimicrobial resistance genes (Zankari et al., 2012b). A major effort was put into compiling a human-curated database based on both public databases and scientific papers. Concerns have been raised that an assigned genotype may not always correspond to a phenotype, for example because mutations outside a gene may affect the expression of the gene product. We therefore conducted a study to compare genotype and phenotype. We found that genotyping using whole-genome sequencing is a realistic alternative to surveillance based on phenotypic antimicrobial susceptibility testing (Zankari et al., 2012a). We found a surprisingly high concordance (99.74%) between phenotypic and predicted antimicrobial susceptibility. This is promising, but it must also be said that the study was conducted in a population with relatively low levels of resistance, and lower levels of concordance may be found in other populations.
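How a concordance figure of this kind is computed can be shown with a small sketch. The data layout (one "R" or "S" call per isolate and antimicrobial) and the function name are assumptions for illustration; this is not the pipeline or the data used in the study.

```python
def concordance(phenotypic, predicted):
    """Fraction of tests where the predicted call matches the phenotype.

    Both arguments map (isolate, antimicrobial) -> "R" or "S".
    Only tests present in both dictionaries are compared.
    """
    shared = phenotypic.keys() & predicted.keys()
    if not shared:
        raise ValueError("no tests in common")
    agree = sum(phenotypic[key] == predicted[key] for key in shared)
    return agree / len(shared)
```

Because most tests in a low-resistance population are susceptible, a high overall concordance can hide a lower agreement among the resistant cases, which is one reason to report the two groups separately.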

Andreatta et al. took a radically different approach: they sorted genomes of gamma-proteobacteria into pathogenic and non-pathogenic, and looked for gene families that were statistically associated with being found in either pathogenic or non-pathogenic bacteria (Andreatta et al., 2010). This is, to the best of our knowledge, the first example of using machine learning techniques to determine the phenotype from whole genome sequences. The method has later been extended to work for all species of bacteria and to use raw sequencing data as input (Cosentino et al., 2013. PMID: 24204795).
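One simple way to test whether a gene family is associated with pathogenicity is a one-sided Fisher's exact test on a 2x2 table of genome counts. The sketch below is an illustration under that assumption; the source does not state which statistic the published method uses, and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P-value for enrichment of a gene family among pathogenic genomes.

    a: pathogenic genomes with the family
    b: pathogenic genomes without the family
    c: non-pathogenic genomes with the family
    d: non-pathogenic genomes without the family
    Sums the hypergeometric probabilities of tables with at least
    `a` pathogenic genomes carrying the family (margins held fixed).
    """
    n = a + b + c + d
    pathogens = a + b
    with_family = a + c
    denom = comb(n, with_family)
    p = 0.0
    for x in range(a, min(pathogens, with_family) + 1):
        p += comb(pathogens, x) * comb(n - pathogens, with_family - x) / denom
    return p
```

When many gene families are tested at once, the resulting p-values would need a multiple-testing correction before any family is called associated.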

Similar methods can also be used at the single-protein level. Jessen et al. developed a method for finding sites associated with biological activity, based on sorting the sequences by the measured activity associated with each sequence and then statistically investigating whether certain amino acids at certain positions were associated with biological activity (Jessen et al., 2013).

Metagenomic samples
Much attention has recently been given to the possibility of diagnosing diseases based on metagenomic samples, since this is faster and simpler than having to isolate the bacteria. Hasman et al. were, to the best of our knowledge, the first to show that metagenomic samples (in this case urine) could be used to diagnose a pathogen without prior knowledge of which species it was. It was found that WGS improved the identification of the cultivated bacteria, and an almost complete agreement was observed between phenotypic and predicted antimicrobial susceptibilities (Hasman et al., 2013). For this project a method, ChainMapper, was developed to map all reads against all fully sequenced bacteria and viruses, as well as resistance genes and genomes from the MetaHIT project. This method has since been updated and re-implemented, and is available as a method called MGMapper.

Making phylogenetic trees based on SNPs is the emerging standard for detailed study of evolutionary relationships between isolates in an outbreak. We have developed the first web-based server for SNP tree analysis (Leekitcharoenphon et al., 2012). In SNP tree analysis, the details of the method, such as how SNPs are called and filtered, are very important for the reliability of the result. We have therefore initiated a number of studies to evaluate and refine these methods. One method that has shown promise is the NDtree method (Leekitcharoenphon et al. PMID: 24505344; Joensen et al. PMID: 24574290; Kaas et al. PMID: 25110940), which is centered on calling bases rather than distinguishing between SNPs and non-SNPs, since the concept of a reference sequence makes less sense for bacteria, which vary a lot, than it does for human genomics.
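The base-calling idea can be sketched as follows: every position gets either a confidently called base or an "uncertain" marker, and two isolates are compared only at positions where both have a called base. This is an illustration in the spirit of the description above, not the NDtree implementation; the depth and majority thresholds and the function names are assumptions.

```python
def call_base(counts, min_depth=10, min_fraction=0.9):
    """Call a base at one position from read counts, or return "N".

    counts: dict mapping base -> number of reads supporting it.
    A base is called only if the depth is sufficient and one base
    clearly dominates (both thresholds are hypothetical values).
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"
    base, top = max(counts.items(), key=lambda item: item[1])
    return base if top / depth >= min_fraction else "N"


def pairwise_distance(calls_a, calls_b):
    """Count positions where both isolates have a called base and differ."""
    return sum(
        1
        for a, b in zip(calls_a, calls_b)
        if a != "N" and b != "N" and a != b
    )
```

The resulting matrix of pairwise distances could then be passed to any distance-based tree-building method.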

Low bandwidth solutions
Uploading NGS data to a server takes approximately an hour per gigabyte of input data, depending on the connection between the client and the server. The challenge is illustrated in figure DataComparisonProblem1. We have worked with three different approaches to deal with the problem.

Fig DataComparisonProblem1: Illustration of the problem of comparing a large data set at the client site with a very large dataset server side over a relatively slow internet connection.

The first approach is based on not sending all the data to the server. The raw data contain multiple copies of each part of the sequence. It is common to sequence to a depth of 30x, meaning that each base in the target genome is represented on average 30 times in the input data. Furthermore, the full genome is not always needed, for example to determine the species. Often the species can be determined from the 16S gene, which covers only approximately one per mille of the genome.
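The potential saving can be estimated with simple arithmetic, using the figures quoted in the text (about one hour per gigabyte, 30x depth, one per mille of the genome). The sketch below is a rough estimate only: it ignores quality scores, file format overhead and compression, and the function names are hypothetical.

```python
def upload_hours(gigabytes, hours_per_gigabyte=1.0):
    """Estimated upload time, assuming a fixed time per gigabyte."""
    return gigabytes * hours_per_gigabyte


def reduced_size(raw_gigabytes, depth=30, fraction_needed=0.001):
    """Approximate data size if only one copy of the needed region is sent.

    Dividing by the depth removes the redundant copies; multiplying by
    fraction_needed keeps only the part of the genome that is required
    (for example the 16S gene).
    """
    return raw_gigabytes / depth * fraction_needed
```

For a 3 gigabyte raw data set, upload_hours gives about three hours, while reduced_size gives roughly a ten-thousandth of a gigabyte, that is, on the order of 100 kilobytes.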

A second method for reducing the amount of data that needs to be transmitted over the internet is based on creating a database of highly discriminative 50-mers from 16S genes (or other loci). By selecting only 50-mers that are highly discriminative (found in one or a few species, but not in the others), it is possible to reduce the size of the database to around 5 megabytes. This database can easily be sent from the server to the client. We have developed a method based on these principles called Reads2Type, which can predict in a few minutes which species a set of reads is derived from.
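The selection of discriminative k-mers described above can be sketched as follows. This is a minimal illustration, not the Reads2Type code: the function names, the voting rule, and the max_species parameter are assumptions, and reverse complements are not handled.

```python
from collections import Counter, defaultdict


def all_kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(loci, k=50, max_species=1):
    """Keep k-mers that occur in at most max_species species.

    loci: dict mapping species name -> locus sequence (for example 16S).
    Returns a dict mapping k-mer -> set of species containing it.
    """
    owners = defaultdict(set)
    for species, seq in loci.items():
        for kmer in all_kmers(seq, k):
            owners[kmer].add(species)
    return {
        kmer: species
        for kmer, species in owners.items()
        if len(species) <= max_species
    }


def predict(reads, database, k=50):
    """Let each discriminative k-mer found in the reads vote for its species."""
    votes = Counter()
    for read in reads:
        for kmer in all_kmers(read, k):
            for species in database.get(kmer, ()):
                votes[species] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Because the database holds only the discriminative k-mers, it stays small enough to send to the client, and the reads never have to leave the client machine.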

A third approach is to move the databases and computing resources to the client. 

Design of Diagnostics
Whole genome sequencing may also be used to select sequences that can be used for diagnostics. We have earlier developed a method to find short peptide sequences that were found in all strains that the diagnostic was supposed to cover (in this case Mycobacterium leprae) and not found in any other known bacteria, with special focus on closely related species such as Mycobacterium tuberculosis (Bobosha et al., 2012. PMID: 23283462). We have recently (Martin Thomsen, unpublished project report) developed a similar approach to select promising PCR primers for diagnostics and are planning to evaluate the method experimentally in collaboration with SSI, Copenhagen, Denmark.
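The selection criterion (present in every target strain, absent from all other organisms) amounts to a set intersection followed by set differences. The sketch below illustrates this under stated assumptions: the peptide length, the function names and the use of exact string matching are hypothetical and are not taken from the published method.

```python
def peptides(protein, length):
    """Return the set of all peptides of the given length in a protein."""
    return {protein[i:i + length] for i in range(len(protein) - length + 1)}


def proteome_peptides(proteome, length):
    """Collect the peptides of every protein in a proteome (list of strings)."""
    found = set()
    for protein in proteome:
        found |= peptides(protein, length)
    return found


def candidate_peptides(target_proteomes, background_proteomes, length=9):
    """Peptides found in all target strains and in no background organism."""
    if not target_proteomes:
        return set()
    shared = set.intersection(
        *(proteome_peptides(p, length) for p in target_proteomes)
    )
    for proteome in background_proteomes:
        shared -= proteome_peptides(proteome, length)
    return shared
```

The same intersection-and-subtraction logic would carry over to nucleotide k-mers when selecting candidate PCR primer sites, although primer design adds constraints (such as melting temperature) that are not modeled here.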