Christian Schaefer

(Dipl.-Bioinf. Univ.)
Rostlab | Computer Science | Technische Universität München

Thesis abstract

Infection with Helicobacter pylori is one of the most frequent bacterial infections in the human population worldwide. Pathogenic H. pylori strains are furthermore known to play a major role in inducing several gastric diseases, ranging from gastric inflammation to gastric and intestinal ulcers in the final stage, and even gastric cancer.

In this diploma thesis we address the question of whether antibody responses against a variety of H. pylori antigens in human sera are related to the status of infection and to the kind of strain involved (pathogenic or apathogenic).

In serology, a common way to handle this kind of classification problem is to define a cutoff value for each antigen, derived from sera of uninfected donors, that sets a fixed boundary between infected and non-infected cases. This approach, however, has limited accuracy and does not consider more complex relationships that might exist between the antibody concentrations for different antigens and the status of the serum.
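The cutoff approach described above can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the rule "mean plus three standard deviations of the uninfected sera", the rule "positive if at least one antigen exceeds its cutoff", the antigen names, and all values are assumptions chosen only to make the idea concrete.

```python
from statistics import mean, stdev


def derive_cutoffs(negative_sera, n_sd=3.0):
    """Per-antigen cutoff from uninfected sera.

    Assumption: cutoff = mean + n_sd * sample standard deviation.
    """
    cutoffs = {}
    for antigen in negative_sera[0]:
        values = [serum[antigen] for serum in negative_sera]
        cutoffs[antigen] = mean(values) + n_sd * stdev(values)
    return cutoffs


def classify(serum, cutoffs, min_reactive=1):
    """Call a serum positive if at least `min_reactive` antigens exceed their cutoff."""
    reactive = sum(serum[antigen] > cutoff for antigen, cutoff in cutoffs.items())
    return reactive >= min_reactive


# Hypothetical antibody concentrations for three uninfected donors.
negatives = [
    {"antigen_A": 100.0, "antigen_B": 50.0},
    {"antigen_A": 120.0, "antigen_B": 70.0},
    {"antigen_A": 80.0, "antigen_B": 60.0},
]
cutoffs = derive_cutoffs(negatives)

high_serum = {"antigen_A": 500.0, "antigen_B": 40.0}
low_serum = {"antigen_A": 90.0, "antigen_B": 65.0}
print(classify(high_serum, cutoffs), classify(low_serum, cutoffs))
```

Because each antigen is judged against its own fixed threshold, a rule of this kind cannot capture interactions between antigens, which is the limitation the machine learning approaches are meant to address.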

We propose four machine learning approaches to examine the discriminative power of Luminex-measured antibody concentrations: CART as a representative of decision tree algorithms and, building on it, the ensemble method bagged CART, as well as Support Vector Machines and Logistic Regression.

In the first part of this work we use three classified datasets of measured antibody concentrations. The classification into negative and positive cases, and their further categorization by the strain involved (pathogenic or apathogenic), were established by different biochemical and histological tests. Using these classified cases and the four machine learning algorithms, we induce classifiers for each of the two problems and evaluate them alongside the cutoff approach to find the most accurate model for each problem.
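A comparison of the four classifier families on one labeled dataset might look like the sketch below. It is only an illustration under stated assumptions: scikit-learn is used as a stand-in toolkit (the abstract does not say which software the thesis used), the data are synthetic rather than real antibody measurements, and 5-fold cross-validated accuracy is an assumed evaluation criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for antibody concentrations (rows: sera, columns: antigens).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    # scikit-learn's decision tree is a CART-style learner.
    "CART": DecisionTreeClassifier(random_state=0),
    "bagged CART": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # SVMs and logistic regression are scale-sensitive, so standardize first.
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same loop could be run once for the infected versus non-infected problem and once for the pathogenic versus apathogenic problem, with the cutoff rule scored on the same folds for a like-for-like comparison.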

With these models at hand, in the second part we predict the status of unclassified sera and look for epidemiological and microbiological patterns that differ between the infected and non-infected populations.

Contact me if you are interested in the full thesis.