Bachelor and Master Theses
If you are interested in one of the topics below, email us at mail@bsc.fu-berlin.de. If you have a proposal for your own research which fits our scope, just let us know.
Available Theses (Master and Bachelor)
Depending on the available time, most topics below can be scaled to either B.Sc. or M.Sc. level.
Estimation of Resolving Power (Resolution) from Mass Spectra
Modern mass spectrometers are capable of recording mass spectra in high resolution. However, lower resolution offers benefits in terms of ion transmission and acquisition speed and thus a tradeoff must usually be made.
For some algorithms, e.g. peptide search engines, such as those implemented in OpenMS (www.openms.de), it is important to know (or at least estimate) the resolution of the underlying data. Since this information is usually not available in non-vendor formats, such as mzML, the resolution needs to be estimated from data.
The goal of this thesis is to find, implement (C++) and benchmark different estimators of mass resolution for multiple instrument classes (TOF, Orbitrap, FT-ICR, ion trap, ...), working on profile data and possibly centroided data. The resolution should be estimated at multiple points in the spectrum and then extrapolated to a reference point (e.g. 400 m/z) based on the resolution behaviour over m/z.
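As a starting point, the common FWHM-based definition R = m/Δm can be estimated directly from profile data. The sketch below (hypothetical helper names, not OpenMS API) interpolates the half-maximum crossings of an isolated peak and extrapolates to a reference m/z under an assumed Orbitrap-like R ∝ 1/√m model:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A single profile data point.
struct Point { double mz, intensity; };

// Estimate the full width at half maximum (FWHM) of one isolated profile
// peak by linear interpolation at half of the apex intensity.
// Points must be sorted by m/z and cover both flanks of the peak.
double estimateFWHM(const std::vector<Point>& p)
{
    std::size_t apex = 0;
    for (std::size_t i = 1; i < p.size(); ++i)
        if (p[i].intensity > p[apex].intensity) apex = i;
    const double half = p[apex].intensity / 2.0;

    // linear interpolation of the m/z where the signal crosses 'half'
    auto cross = [&](std::size_t a, std::size_t b) {
        double t = (half - p[a].intensity) / (p[b].intensity - p[a].intensity);
        return p[a].mz + t * (p[b].mz - p[a].mz);
    };
    double left = p.front().mz, right = p.back().mz;
    for (std::size_t i = apex; i > 0; --i)
        if (p[i - 1].intensity < half) { left = cross(i - 1, i); break; }
    for (std::size_t i = apex; i + 1 < p.size(); ++i)
        if (p[i + 1].intensity < half) { right = cross(i, i + 1); break; }
    return right - left;
}

// Resolving power at the peak apex: R = m / delta_m (FWHM definition).
double resolutionAt(const std::vector<Point>& p)
{
    std::size_t apex = 0;
    for (std::size_t i = 1; i < p.size(); ++i)
        if (p[i].intensity > p[apex].intensity) apex = i;
    return p[apex].mz / estimateFWHM(p);
}

// Extrapolate a local estimate to a reference point, assuming
// Orbitrap-like behaviour R(m) ~ 1/sqrt(m); other instrument classes
// need other models (TOF ~ const, FT-ICR ~ 1/m).
double toReference(double R, double mz, double refMz = 400.0)
{
    return R * std::sqrt(mz / refMz);
}
```

A real estimator would additionally have to reject overlapping peaks and fit the R-vs-m/z trend robustly over many peaks instead of trusting a single one.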
Simulate IonMobility, FineIsotope and other modern features
OpenMS (www.openms.de) has a rather well-rounded simulator for proteomics mass spectrometry data. However, modern instruments produce more complex data: for example, they add another separation dimension, such as ion mobility, on top of the well-known retention time, and they are capable of resolving fine isotopic structures, i.e. mass defects.
The aim of this thesis is to implement these modern features in the simulator using C++. Some models (such as fine isotope structures) are available in general, but not coupled to the simulator. Others, such as ion mobility, require a simulation built from the ground up.
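To give an idea of the ion-mobility part: one could predict a collision cross section (CCS) from peptide mass via an empirical power law and sample a drift time around it. The sketch below is purely illustrative; the coefficients and the CCS-to-drift-time scaling are made-up placeholders, and a real model would be fit to measured CCS data:

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical CCS model: CCS grows sublinearly with mass and with
// charge (coefficients A and b are placeholders, not fitted values).
double predictCCS(double mass, int charge)
{
    const double A = 30.0, b = 0.5;
    return A * std::pow(mass, b) * std::sqrt(static_cast<double>(charge));
}

// Sample a drift time around the CCS-derived mean with Gaussian noise
// (the 1/100 scaling and 2% spread are arbitrary placeholders).
double sampleDriftTime(double mass, int charge, std::mt19937& rng)
{
    double mean_dt = predictCCS(mass, charge) / 100.0;
    std::normal_distribution<double> noise(0.0, 0.02 * mean_dt);
    return mean_dt + noise(rng);
}
```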
Comparison and benchmarking of DeNovo Search Engines
LC-MS-based proteomics typically relies on database search engines, which require a FASTA proteome as input. In contrast, de novo methods attempt to determine the peptide sequence underlying an MS/MS spectrum by purely algorithmic means.
Recently, there have been many exciting publications describing new de novo tools: Casanovo, Spectralis, π-HelixNovo, InstaNovo, GraphNovo.
The aim of this thesis is to compare the strengths and weaknesses of these tools and benchmark them against established de novo methods and traditional database search, using appropriate benchmark datasets and evaluation metrics (incl. runtime, RAM usage, etc.).
A Cache-optimized Approximate String Matching Approach for the Peptide-Protein Mapping Problem (C++)
A common step in a proteomics pipeline is finding all peptides (i.e. short amino acid sequences) identified from mass spectrometry data in a large protein database. One promising index-free method is based on a modified Aho-Corasick algorithm allowing for error-tolerant string search (i.e. allowing for mismatches).
Modern CPUs employ multiple levels of CPU caches (usually L1, L2, L3 on x86) to hide slow access to RAM. To minimize cache misses, cache-oblivious algorithms have been developed.
The goal of this thesis is to engineer and optimize (C++) the modified Aho-Corasick algorithm (trie-based) for maximum CPU cache friendliness and to benchmark the improved versions against the standard implementation.
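One standard cache-oriented step is to replace a pointer-based trie with an index-based layout: all nodes live in one contiguous array and child links are 32-bit indices instead of 64-bit pointers, halving link size and improving locality. A minimal sketch (plain trie only, no failure links, made-up names):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

constexpr int SIGMA = 26;  // amino-acid letters fit into A..Z

// All nodes live in one std::vector; child links are array indices,
// so traversal touches one contiguous allocation instead of chasing
// heap pointers scattered across RAM.
struct Node {
    int32_t child[SIGMA];
    bool terminal = false;
    Node() { for (auto& c : child) c = -1; }
};

struct Trie {
    std::vector<Node> nodes{1};  // node 0 = root

    void insert(const std::string& s) {
        int32_t cur = 0;
        for (char ch : s) {
            int c = ch - 'A';
            if (nodes[cur].child[c] < 0) {
                nodes[cur].child[c] = static_cast<int32_t>(nodes.size());
                nodes.emplace_back();
            }
            cur = nodes[cur].child[c];
        }
        nodes[cur].terminal = true;
    }

    bool contains(const std::string& s) const {
        int32_t cur = 0;
        for (char ch : s) {
            cur = nodes[cur].child[ch - 'A'];
            if (cur < 0) return false;
        }
        return nodes[cur].terminal;
    }
};
```

The thesis would go further: breadth-first node ordering, smaller node encodings, and measuring cache misses with perf/VTune against the pointer-based baseline.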
Requirements: Expertise in C++ (C++11 and later) or a closely related language (Java) is required; interest in hardware/caches and benchmarking (perf tools, Intel VTune) is a big plus.
Basic knowledge of LC-MS is desirable, but can be acquired at a sufficient level during the first days.
Currently Running Topics
NA
Already Finished Topics (incomplete)
Unsupervised detection of peptide modifications (Kilian Malek)
In order to identify a peptide sequence from mass spectrometry data, the user must provide both an appropriate database (e.g. the human proteome) and a list of expected (post-translational) modifications, usually oxidation of M and acetylation of the protein N-terminus. While usually sufficient, this may lead to suboptimal results if other modifications (e.g. deamidation) were introduced during sample handling.
Instead of employing an open modification search for every dataset (very CPU-intensive and not part of a standard analysis), it would be good to have a method which automatically detects whether this is necessary by inspecting the search results for clues. Thus, we are essentially looking for a quality metric, which might be based on differences between identified and unidentified MS spectra and their underlying data (e.g. mass decimals).
The new metric should be able to detect missing modifications using a benchmark dataset (e.g. from https://www.mcponline.org/content/15/8/2791 - dataset PXD002389).
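To illustrate the mass-decimal idea: for tryptic peptides the fractional part of the monoisotopic mass grows roughly linearly with mass (a slope of about 4.95e-4 per Da is a common approximation). Precursors of unidentified spectra that deviate strongly from this line hint at chemical noise or unexpected modifications. A hedged sketch (helper names are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Deviation of a precursor's mass decimal from the expected peptide
// mass-defect line (slope ~4.95e-4 per Da is an approximation).
double massDefectDeviation(double mono_mass)
{
    const double slope = 4.95e-4;
    double expected = slope * mono_mass;                    // expected fractional mass
    double frac = mono_mass - std::floor(mono_mass);
    double d = frac - (expected - std::floor(expected));
    // wrap into [-0.5, 0.5) so that 0.99 vs 0.01 count as close
    if (d >= 0.5) d -= 1.0;
    if (d < -0.5) d += 1.0;
    return d;
}

// Mean absolute deviation over a set of precursor masses; comparing this
// statistic between identified and unidentified spectra is one candidate
// ingredient for the quality metric discussed above.
double meanAbsDeviation(const std::vector<double>& masses)
{
    double s = 0.0;
    for (double m : masses) s += std::abs(massDefectDeviation(m));
    return masses.empty() ? 0.0 : s / masses.size();
}
```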
Implement small-database handling in a proteomics suite (Philipp Wang)
Small protein databases, or the fact that a research question only focuses on a rather small subset of proteins while the sample actually contains many more, demand special methods when computing false-discovery rates (FDR) and statistical confidence - see https://www.youtube.com/watch?v=jIFyqXaN7RI&t=10s. The aim of this thesis is to implement group-FDR and neighbour-subset search strategies into OpenMS (www.openms.org) and to validate and compare their performance against traditional FDR on adequate samples.
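For context, the classic target-decoy FDR that these strategies refine can be sketched as follows: sort PSMs by score and estimate the FDR at each threshold as #decoys / #targets, then convert to q-values. Group-FDR applies the same estimate separately per subset (e.g. proteins of interest vs. background). A minimal sketch, not the OpenMS implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct PSM { double score; bool is_decoy; };

// q-values via target-decoy competition: q[i] is the minimal FDR at
// which the i-th best-scoring PSM would still be accepted.
std::vector<double> qValues(std::vector<PSM> psms)
{
    std::sort(psms.begin(), psms.end(),
              [](const PSM& a, const PSM& b) { return a.score > b.score; });
    std::vector<double> q(psms.size());
    int targets = 0, decoys = 0;
    for (std::size_t i = 0; i < psms.size(); ++i) {
        if (psms[i].is_decoy) ++decoys; else ++targets;
        q[i] = targets ? static_cast<double>(decoys) / targets : 1.0;
    }
    // enforce monotonicity from worst to best score
    for (std::size_t i = psms.size(); i-- > 1; )
        q[i - 1] = std::min(q[i - 1], q[i]);
    return q;  // q[i] belongs to the i-th best-scoring PSM
}
```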
Proteomics Database Suitability (Tom Waschischeck)
Using the correct database (usually in FASTA format) for the annotation of MS/MS spectra in a proteomics experiment is vital to the successful annotation of peptide-spectrum matches (PSMs) and downstream false-discovery-rate filtering. For well-studied organisms, this can easily be achieved by simply downloading the database from public sources, such as UniProt. For other cases, mostly non-model organisms and metaproteomics, the picture is less clear, since it is hard to assess how well the proteome of a related species, or a set of assumed organisms, actually fits the data under study.

Therefore, a database suitability score has recently been developed, which aims to alleviate some of the above issues: it reports a score from 0-100%. Ideally, 100% would indicate that the database is a perfect match, i.e. covers all peptides in the experiment, whereas 0% would indicate completely unrelated sequences. However, the interpretation of the score from the original paper is not entirely intuitive, since it suffers from a non-linearity.

This thesis aims to accomplish this goal by implementing a corrected suitability score, using some theory on hyperbolic functions and additional search runs for extrapolating some function parameters. The results will be validated using multiple search engines (which extends the original Comet-only approach) and datasets. Also, the ability to detect unusual contaminants which are usually not accounted for (e.g. Mycoplasma) will be examined.
Tree-based Alignment of LC-MS data (Julia Thüringer)
Problem:
Liquid chromatography (LC) - mass spectrometry (MS) data is extremely complex, and whole-cell lysates are never fully sequenced using today's state-of-the-art technology.
Comparison of LC-MS data can be significantly enhanced by a so-called 'map alignment', where unidentified features are assigned to a peptide sequence using information acquired in another LC-MS run. Since LC is potentially unstable, a retention time (RT) correction procedure is usually required before IDs can be successfully transferred based on an accurate mass-and-time approach. To reduce the number of user-defined parameters and achieve maximal robustness at the same time, the alignment should not require a single reference run.
Solution:
A guide-tree based multiple alignment should be implemented within the OpenMS C++ software framework (www.openms.de).
The algorithm must feature a robust metric to estimate an initial distance matrix (e.g. the percentage of overlapping IDs, the standard deviation of matching IDs, or a combination thereof).
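One simple candidate for such a metric can be sketched as follows (illustrative names, not the implementation chosen in the thesis): the distance between two runs is one minus the fraction of peptide IDs they share, relative to the smaller ID set, so runs sharing many identifications get aligned first in the guide tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <set>
#include <string>
#include <vector>

using IdSet = std::set<std::string>;  // peptide IDs observed in one run

// distance = 1 - (shared IDs / size of the smaller ID set)
double idDistance(const IdSet& a, const IdSet& b)
{
    std::size_t shared = 0;
    for (const auto& id : a) shared += b.count(id);
    std::size_t denom = std::min(a.size(), b.size());
    return denom ? 1.0 - static_cast<double>(shared) / denom : 1.0;
}

// symmetric all-vs-all distance matrix as input for guide-tree building
std::vector<std::vector<double>> distanceMatrix(const std::vector<IdSet>& runs)
{
    std::size_t n = runs.size();
    std::vector<std::vector<double>> D(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            D[i][j] = D[j][i] = idDistance(runs[i], runs[j]);
    return D;
}
```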
To prove the quality of the implemented solution, a comparison against the current reference-based implementation should be conducted, using benchmark metrics such as 1) the standard deviation of aligned pairs (possibly with cross-validation: align on a subset of IDs, test on the held-out set), 2) the number of transferable IDs, and 3) the number/size distribution of consensus features (fewer but larger clusters should be preferred).
Implementations of high quality will be integrated into the official release of the OpenMS software.
Requirements:
Expertise in C++ or a closely related language (Java) and object-oriented programming is strongly advised.
Basic knowledge of LC-MS is desirable, but can be acquired at a sufficient level during the first days.
A Fast Aho-Corasick implementation using Double-Arrays (Patricia Scheil)
Aho-Corasick (AC) is a prominent method to search for multiple patterns concurrently in a large text. The algorithm is usually implemented using a trie data structure, which wastes a lot of memory (at the potential speed benefit of eliminating the failure function via direct jumps when using a complete trie). There are many ways to compact a trie, but one of the most promising is the double array (DA). This thesis will implement (one of the flavours of) a DA, which should keep the data structure small enough for common problem sizes in proteomics (60k peptides) to still fit into the L3 cache of most modern CPUs (8 MB), thus avoiding costly cache misses. The new DA-based AC will be benchmarked against existing approaches, such as the AC trie and suffix arrays.
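The core DA idea: the trie's transition table is packed into two parallel integer arrays base and check; the child of state s under character code c sits at slot t = base[s] + c, and the slot belongs to s iff check[t] == s. A transition is one array access instead of a pointer chase. A from-scratch sketch (greedy first-fit builder, no failure function, not tied to any particular DA flavour):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct DoubleArray {
    std::vector<int> base{0}, check{0};  // slot 0 = root (marked occupied)
    std::vector<char> terminal{0};

    static int code(char ch) { return ch - 'A' + 1; }  // codes start at 1

    bool contains(const std::string& s) const {
        int cur = 0;
        for (char ch : s) {
            int t = base[cur] + code(ch);
            if (t >= static_cast<int>(check.size()) || check[t] != cur) return false;
            cur = t;
        }
        return terminal[cur] != 0;
    }
};

DoubleArray buildDoubleArray(const std::vector<std::string>& patterns) {
    // 1) build an ordinary map-based trie
    struct Node { std::map<int, int> child; bool term = false; };
    std::vector<Node> trie(1);
    for (const auto& p : patterns) {
        int cur = 0;
        for (char ch : p) {
            int c = DoubleArray::code(ch);
            if (!trie[cur].child.count(c)) {
                trie[cur].child[c] = static_cast<int>(trie.size());
                trie.push_back({});
            }
            cur = trie[cur].child[c];
        }
        trie[cur].term = true;
    }
    // 2) pick a 'base' for every state so that all its children land on
    //    free slots (greedy first fit; production builders are smarter)
    DoubleArray da;
    auto grow = [&da](std::size_t need) {
        if (need >= da.check.size()) {
            da.base.resize(need + 1, 0);
            da.check.resize(need + 1, -1);
            da.terminal.resize(need + 1, 0);
        }
    };
    std::vector<int> slot(trie.size(), -1);  // trie node -> DA state
    slot[0] = 0;
    std::queue<int> bfs;
    bfs.push(0);
    while (!bfs.empty()) {
        int n = bfs.front(); bfs.pop();
        int s = slot[n];
        da.terminal[s] = trie[n].term;
        if (trie[n].child.empty()) continue;
        int b = 1;
        for (;; ++b) {
            grow(b + 27);  // character codes are <= 27
            bool free = true;
            for (const auto& kv : trie[n].child)
                if (da.check[b + kv.first] != -1) { free = false; break; }
            if (free) break;
        }
        da.base[s] = b;
        for (const auto& kv : trie[n].child) {
            slot[kv.second] = b + kv.first;
            da.check[b + kv.first] = s;
            bfs.push(kv.second);
        }
    }
    return da;
}
```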
Implementation and validation of mzQC metrics in PTXQC (Chen Xu)
Experimental data in proteomics and metabolomics is usually acquired using high-throughput instrumentation such as high-performance liquid chromatography - mass spectrometry (HPLC-MS).
Data is then analyzed by automated pipelines, e.g. OpenMS or MaxQuant, without user interaction. Since the amount of data is simply too large for manual validation, automated systems such as PTXQC have been developed to extract known quality metrics from all stages of the pipeline and generate a human-readable quality control report.
For longitudinal studies, summary statistics and machine learning approaches, it is desirable to have the QC data available in a standardized machine-readable format.
For this purpose, mzQC was developed which stores QC metrics in a JSON format using a controlled vocabulary. This enables querying QC data across any number of mzQC files in an automated fashion.
The goal of this thesis is to implement mzQC export functionality within PTXQC, by assigning existing CV terms to PTXQC metrics and writing a JSON file.
Standardized JSON data can then be generated for a variety of LC-MS studies and compared, to gain summary statistics of certain platforms, setups and the biological systems under study.
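For orientation, a heavily simplified mzQC file has roughly this shape: a top-level JSON object with run qualities, each carrying CV-referenced metrics. Treat the accession below as illustrative and consult the official mzQC specification for the authoritative schema (required metadata fields are omitted here):

```json
{
  "mzQC": {
    "version": "1.0.0",
    "runQualities": [
      {
        "metadata": {
          "inputFiles": [
            { "name": "run1", "location": "file:///data/run1.raw" }
          ]
        },
        "qualityMetrics": [
          {
            "accession": "MS:4000059",
            "name": "number of MS1 spectra",
            "value": 12034
          }
        ]
      }
    ]
  }
}
```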
A Webserver for Quality Control Reports (Kristin Koehler)
Quality Control (QC) is an essential step in high-throughput technologies, such as mass-spectrometry-based proteomics. We have developed a framework in R called PTXQC (available on CRAN), which can create such reports. To ease usage and avoid the overhead of maintaining a local R installation for practitioners/users, this thesis aims at developing a Shiny application (a web service based on R components) where users can upload their data and receive a QC report in return. A successful implementation will be installed permanently on FU webservers to serve the research community.
Requirements: Expertise in the R programming language is strongly advised. Interest in web technology and Docker containers is a must. Basic knowledge of LC-MS is desirable, but can be acquired at a sufficient level during the first days.