Algorithmic Bioinformatics is a purely computational research group developing new algorithms to solve bioinformatics problems on sequence data. We address challenges such as read alignment, variant detection and genotyping, genome assembly, and whole-genome alignment among others. We implement our algorithms in new software tools and use them in the analysis of sequence data to gain new insights into human genetics and the immune system.

Examples of structural variants visualized in the Integrative Genomics Viewer (IGV).

01 | Complex structural variants

A complex structural variant (cxSV) is a rearrangement of multiple segments of DNA that we assume was originally caused by a single mutational event. We have developed a new method for genotyping cxSVs in the human genome. It uses breakpoint-resolved descriptions of known cxSVs and alignments of short-read sequencing data from a person’s genome to build allele models and calculate expected read-pair probability distributions. Based on these distributions, we predict the person’s cxSV genotypes together with certainty scores. Our method enables accurate genotyping of cxSVs from widely available short-read data sets.
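The idea of turning read-pair distributions into genotypes and certainty scores can be illustrated with a minimal sketch. It is not the group's implementation: it assumes a single alternative allele, normally distributed insert sizes, and a flat prior over genotypes, and the function and parameter names (`genotype_posteriors`, `ref_mean`, `alt_mean`, `sd`) are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution; stands in for an allele's expected insert-size distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def genotype_posteriors(insert_sizes, ref_mean, alt_mean, sd):
    """Return a certainty score per genotype (assumption: flat prior, normal insert sizes)."""
    log_liks = {}
    # Each genotype implies a fraction of read pairs drawn from the alternative allele.
    for genotype, alt_fraction in (("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0)):
        total = 0.0
        for x in insert_sizes:
            p = ((1.0 - alt_fraction) * normal_pdf(x, ref_mean, sd)
                 + alt_fraction * normal_pdf(x, alt_mean, sd))
            total += math.log(max(p, 1e-300))  # guard against log(0)
        log_liks[genotype] = total
    # Normalize in a numerically stable way to obtain scores that sum to 1.
    best = max(log_liks.values())
    weights = {g: math.exp(ll - best) for g, ll in log_liks.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}

scores = genotype_posteriors([400, 410, 390, 700, 690, 710],
                             ref_mean=400, alt_mean=700, sd=30)
print(max(scores, key=scores.get))
```

In this toy input, half of the read pairs fit the reference allele and half fit the alternative allele, so the heterozygous genotype receives the highest score.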

02 | Linked-read and long-read data analysis

Our software tools MoleMap and MoleCall can be chained into a workflow for structural variant detection and genotyping from linked-read sequencing data. Linked-read data provide long-range information through barcode labels on accurate short reads; in other words, all reads labeled with the same barcode originate from a small set of long DNA molecules. MoleMap and MoleCall can also be applied to regular long-read sequencing data.

Bcctools is a toolbox for pre-processing linked-read data. It trims barcodes from the reads, infers a whitelist of barcodes, and implements an efficient index data structure for retrieving corrected barcode sequences in constant time. Pre-processing linked-read data with bcctools is several times faster than with LongRanger.
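One way to obtain constant-time barcode correction is to precompute every sequence within one substitution of a whitelisted barcode and store it in a hash table. The sketch below illustrates that general idea only; it is not the bcctools data structure, and the names `build_correction_index` and `correct` are hypothetical.

```python
def build_correction_index(whitelist):
    """Map each whitelisted barcode and its single-substitution neighbors to the barcode.

    Neighbors that are reachable from more than one whitelisted barcode are
    ambiguous and map to None.
    """
    whitelist_set = set(whitelist)
    index = {barcode: barcode for barcode in whitelist_set}
    for barcode in whitelist_set:
        for pos in range(len(barcode)):
            for base in "ACGT":
                if base == barcode[pos]:
                    continue
                neighbor = barcode[:pos] + base + barcode[pos + 1:]
                if neighbor in whitelist_set:
                    continue  # an exact whitelist entry always corrects to itself
                if neighbor in index and index[neighbor] != barcode:
                    index[neighbor] = None  # ambiguous: two possible corrections
                else:
                    index[neighbor] = barcode
    return index

def correct(index, observed):
    """Expected constant-time lookup; returns None if no unique correction exists."""
    return index.get(observed)

index = build_correction_index(["ACGT", "TTTT"])
print(correct(index, "ACGA"))
```

The table trades memory for speed: each barcode of length n contributes up to 3n additional entries, but every lookup is a single hash access.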

MoleMap efficiently determines the genomic intervals of long DNA molecules. We refer to this as ‘molecule mapping’. The output from MoleMap enables efficient retrieval of reads from genomic regions of interest, without the need to compute a full read alignment. This approach is significantly faster than read alignment. MoleMap uses an open-addressing k-mer index and minimizers to efficiently determine genomic intervals in sets of reads labeled with the same barcode.
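As an illustration of minimizer-based lookup, the following sketch selects the lexicographically smallest k-mer in every window of w consecutive k-mers and uses those minimizers to find a candidate genomic interval for a set of reads. It is a simplification under stated assumptions: a Python dictionary stands in for the open-addressing k-mer index, reverse complements and repeats are ignored, and all function names are hypothetical rather than taken from MoleMap.

```python
from collections import defaultdict

def minimizers(seq, k, w):
    """Return (position, k-mer) pairs of the leftmost smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = []
    for start in range(len(kmers) - w + 1):
        pos = min(range(start, start + w), key=lambda j: kmers[j])
        # Consecutive windows often share a minimizer; record it once.
        if not selected or selected[-1][0] != pos:
            selected.append((pos, kmers[pos]))
    return selected

def build_index(reference, k, w):
    """Index reference minimizers by k-mer (a dict replaces the open-addressing table)."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index

def candidate_interval(reads, index, k, w):
    """Span of all reference positions hit by minimizers of reads sharing one barcode."""
    hits = []
    for read in reads:
        for _, kmer in minimizers(read, k, w):
            hits.extend(index.get(kmer, []))
    if not hits:
        return None
    return (min(hits), max(hits) + k)

reference = "ACGTTGCATGGACCTAGCTAAGTC"
index = build_index(reference, k=4, w=3)
print(candidate_interval([reference[5:15]], index, k=4, w=3))
```

Because only minimizers are stored and queried, the index is much smaller than a full k-mer index, and no base-level alignment is computed.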

Lastly, MoleCall allows for the detection and genotyping of structural variants. Given reads from a region of interest, it traverses a local assembly graph to detect structural variants and uses a statistical model to predict genotype likelihoods.

03 | Interactive exploratory workflows

In collaboration with the Weidlich lab at the Humboldt University in Berlin, we are developing interactive workflows to support the exploration of genomic data. The analysis of scientific data is usually a dynamic process in which various software alternatives and setups are tested over a period of time. We are developing the means to simplify and systematically document this process. By applying these methods, we transform existing program chains into exploratory workflows that allow and track user intervention during execution.

04 | Structural variant detection in tens of thousands of genomes

Our software tool, PopDel, can simultaneously analyze tens of thousands of short-read sequenced genomes to reliably detect and accurately genotype structural variants—differences between genomes that affect at least 50 bp of DNA sequence. The current focus of the software is on deleted sequence, but we have started to extend PopDel to other types of structural variants including inversions, duplications, and translocations.

We have already used PopDel to identify a rare deletion in the LDLR gene which causes extremely low levels of LDL cholesterol in the blood (Björnsson E. et al, Circulation: Genomic and Precision Medicine, 2021). The tool’s superior scalability, high accuracy, fast run time, and ease of use make PopDel an attractive alternative to previous approaches. At the core of PopDel is a space-efficient (binary) read-pair-profile format and a structural variant-detection algorithm that is based on a likelihood-ratio test.
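A likelihood-ratio test on read-pair data can be sketched as follows. This is a toy model, not PopDel's algorithm: it assumes normally distributed insert sizes, a known candidate deletion length, a grid search over the deletion-allele fraction, and the usual chi-square threshold of 3.84 (one degree of freedom, 5% level). The function name `deletion_lrt` and its parameters are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def deletion_lrt(insert_sizes, mean, sd, deletion_length, grid=20):
    """Compare 'no deletion' against 'deletion present in a fraction f of read pairs'.

    Returns (test statistic, best f). Read pairs spanning a deletion appear
    with an insert size enlarged by the deletion length.
    """
    def log_likelihood(f):
        total = 0.0
        for x in insert_sizes:
            p_ref = normal_pdf(x, mean, sd)
            p_del = normal_pdf(x, mean + deletion_length, sd)
            total += math.log(max((1.0 - f) * p_ref + f * p_del, 1e-300))
        return total

    null = log_likelihood(0.0)
    # The grid includes f = 0, so the statistic is never negative.
    best_f, best = max(((i / grid, log_likelihood(i / grid)) for i in range(grid + 1)),
                       key=lambda item: item[1])
    return 2.0 * (best - null), best_f

statistic, fraction = deletion_lrt([400, 410, 390, 900, 910, 890],
                                   mean=400, sd=30, deletion_length=500)
print(statistic > 3.84, fraction)
```

A large statistic indicates that the data are explained much better with the deletion than without it; the estimated fraction hints at the genotype.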

More details in Niehus S. et al, Nature Communications, 2021

05 | Non-reference sequence variants

Our software tool PopIns2, the successor of PopIns, identifies a type of genomic structural variant that involves non-repetitive sequence not found in the reference genome. We call these variants ‘non-reference sequence variants’, or NRS variants for short. We previously showed that the majority of human non-reference sequence is ancestral, rather than newly inserted, and described an association between an NRS variant in the SREBF1 gene and myocardial infarction (Kehr et al, Nature Genetics, 2017).

The detection of NRS variants from short-read data is particularly challenging as it inevitably involves a de novo assembly of the non-reference sequence. We combine data from many individuals simultaneously to ensure reliable NRS assembly. PopIns2 realizes this by representing non-reference sequence data in colored de Bruijn graphs.

More details in Krannich T. et al, Bioinformatics, 2021

Visit the complete list of our Research Group’s publications:

https://kehrlab.github.io/publications.html

Here is a selection of the most important publications from the last few years:

Mirus T, Lohmayer R, Döhring C, Halldórsson BV, Kehr B. GGTyper: genotyping complex structural variants using short-read sequencing data. Bioinformatics 2024.

Lüpken R, Krannich T, Kehr B. Bcmap: fast alignment-free barcode mapping for linked-read sequencing data. Preprint on bioRxiv 2022. doi: 10.1101/2022.06.20.496811

Pinkert J, Boehm H, Trautwein M, Doecke W, Wessel F, Ge Y, Gutierrez EM, Carretero R, Freiberg C, Gritzan U, Luetke-Eversloh M, Golfier S, Von Ahsen O, Volpin V, Sorrentino A, Rathinasamy A, Xydia M, Lohmayer R, Sax J, Nur-Menevse A, Hussein A, Stamova S, Beckmann G, Glueck JM, Schoenfeld D, Weiske J, Zopf D, Offringa R, Kreft B, Beckhove P, Willuda J. T cell-mediated elimination of cancer cells by blocking CEACAM6-CEACAM1 interaction. Oncoimmunology 2022;11(1):2008110. doi: 10.1080/2162402X.2021.2008110. PMID: 35141051

Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2022;38(3):604-611. doi: 10.1093/bioinformatics/btab749. PMID: 34726732

Niehus S, Jónsson H, Schönberger J, Björnsson E, Beyter D, Eggertsson HP, Sulem P, Stefánsson K, Halldórsson BV, Kehr B. PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes. Nature communications 2021;12(1):730. doi: 10.1038/s41467-020-20850-5. PMID: 33526789

Schwarz JM, Lüpken R, Seelow D, Kehr B. Novel sequencing technologies and bioinformatic tools for deciphering the non-coding genome. Medizinische Genetik 2021;33(2):133-145. doi: 10.1515/medgen-2021-2072. PMID: 38836034

Markowski J, Kempfer R, Kukalev A, Irastorza-Azcarate I, Loof G, Kehr B, Pombo A, Rahmann S, Schwarz RF. GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data. Bioinformatics 2021;37(19):3128-3135. doi: 10.1093/bioinformatics/btab238. PMID: 33830196

Bjornsson E, Gunnarsdottir K, Halldorsson GH, Sigurdsson A, Arnadottir GA, Jonsson H, Olafsdottir EF, Niehus S, Kehr B, Sveinbjörnsson G, Gudmundsdottir S, Helgadottir A, Andersen K, Thorleifsson G, Eyjolfsson GI, Olafsson I, Sigurdardottir O, Saemundsdottir J, Jonsdottir I, Magnusson OT, Masson G, Stefansson H, Gudbjartsson DF, Thorgeirsson G, Holm H, Halldorsson BV, Melsted P, Norddahl GL, Sulem P, Thorsteinsdottir U, Stefansson K. Lifelong Reduction in LDL (Low-Density Lipoprotein) Cholesterol due to a Gain-of-Function Mutation in LDLR. Circulation. Genomic and precision medicine 2021;14(1):e003029. doi: 10.1161/CIRCGEN.120.003029. PMID: 33315477

Jónsson H, Sulem P, Kehr B, Kristmundsdottir S, Zink F, Hjartarson E, Hardarson MT, Hjorleifsson KE, Eggertsson HP, Gudjonsson SA, Ward LD, Arnadottir GA, Helgason EA, Helgason H, Gylfason A, Jonasdottir A, Jonasdottir A, Rafnar T, Frigge M, Stacey SN, Th Magnusson O, Thorsteinsdottir U, Masson G, Kong A, Halldorsson BV, Helgason A, Gudbjartsson DF, Stefansson K. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 2017;549(7673):519-522. doi: 10.1038/nature24018. PMID: 28959963

Kehr B, Helgadottir A, Melsted P, Jonsson H, Helgason H, Jonasdottir A, Jonasdottir A, Sigurdsson A, Gylfason A, Halldorsson GH, Kristmundsdottir S, Thorgeirsson G, Olafsson I, Holm H, Thorsteinsdottir U, Sulem P, Helgason A, Gudbjartsson DF, Halldorsson BV, Stefansson K. Diversity in non-repetitive human sequences not found in the reference genome. Nature genetics 2017;49(4):588-593. doi: 10.1038/ng.3801. PMID: 28250455

Kehr B, Melsted P, Halldórsson BV. PopIns: population-scale detection of novel sequence insertions. Bioinformatics 2016;32(7):961-967. doi: 10.1093/bioinformatics/btv273. PMID: 25926346

We would like to thank the funding agencies who support our work:

FOR 2841 Beyond the Exome, Project P3

The Research Unit FOR 2841 aims to identify, analyze, and predict the disease potential of non-coding DNA variants in patients with rare genetic diseases. The aim of Project P3 is to comprehensively identify genomic structural variants (SVs) in linked- and long-read sequencing data of rare disease patients. To this end, we developed a new genome-wide local assembly tool for SV detection during the first funding period. In the second funding period, we are extending the tool to multi-sample variant calling.

https://www.beyond-the-exome.org/P03.html

CRC Transregio 221, GvH/GvL INF project

The Transregional Collaborative Research Center CRC/TRR 221 is investigating innovative immune-modulation strategies to separate graft-versus-host disease from graft-versus-leukemia effects. The goal is to enhance the safety and efficacy of allogeneic hematopoietic stem cell transplantation (HSCT) in the future. Within the center, the INF project is dedicated to data infrastructure. It focuses mainly on data management and also supports the individual projects with adequate software and expert knowledge during the entire data analysis process.

https://www.gvhgvl.de/en/projects-publications/projects/project-section-b

Prof. Birte Kehr

Head of Research Group | Algorithmic Bioinformatics

Tel: +49 941 944–18161
Email: birte.kehr@ukr.de

Katrin Zehenter

Team Assistant

Tel: +49 941 944–38132
Email: katrin.zehenter@ukr.de

Research team

Prof. Birte Kehr

Head of Research Group | Algorithmic Bioinformatics

Dr. Robert Lohmayer

Postdoctoral Scientist

Kedi Cao

PhD Student

Tim Mirus

PhD Student

Richard Lüpken

PhD Student

Laura Grepmair

PhD Student

Lab Life

There is life outside the laboratory: The Leibniz Institute places great value on our scientists developing team spirit both in and out of work. Here are the photos to prove it!