Exploratory visualizations and statistical analysis of large, heterogeneous epigenetic datasets

Halachev, Konstantin

Please use this identifier to cite or link to this item: doi:10.22028/D291-26578

Title:	Exploratory visualizations and statistical analysis of large, heterogeneous epigenetic datasets
Author(s):	Halachev, Konstantin
Language:	English
Year of Publication:	2014
SWD key words:	bioinformatics; epigenetics; visualization; information retrieval; data integration
Free key words:	bioinformatics; epigenetics; text search; interactive visualization; web server; exploratory search; data integration; information retrieval
DDC notations:	004 Computer science, internet
Publication type:	Dissertation
Abstract:	Epigenetic marks, such as DNA methylation and histone modifications, are important regulatory mechanisms that allow a single genomic sequence to give rise to a complex multicellular organism. When studying mechanisms of epigenetic regulation, the analyses depend on the experimental technologies and the available data. Recent advances in sequencing technologies allow for the efficient generation of genome-wide maps of epigenetic marks. A number of large-scale mapping projects, such as ENCODE and IHEC, are producing data at a high rate for different tissues and cell cultures. The increasing quantity of data highlights a major bottleneck in bioinformatic research, namely the lack of bioinformatic tools for analyzing these data. To date, there are bioinformatics tools for detailed (mostly visual) inspection of single genomic loci, allowing biologists to focus research on regions of interest. Efficient tools for manipulation and analysis of the data have also been published, but they often require computer science skills. Furthermore, the available tools address only biological questions that are already well formulated. What is missing, in our opinion, are tools (or pipelines of tools) to explore the data interactively, in a process that helps a trained biologist recognize interesting aspects and pursue them further until concrete hypotheses are formulated. A possible solution stems from best practices in the fields of information retrieval and exploratory search. In this thesis, I propose EpiExplorer, a paradigm for the integration of state-of-the-art information retrieval methods and indexing structures, applied to offer instant interactive exploration of large epigenetic datasets. The algorithms we use were developed for semi-structured text data, but we apply them to bioinformatic data through a clever textual mapping of biological properties.
We demonstrate the power of EpiExplorer in a series of studies that address interesting biological problems. We also present EpiGRAPH, a bioinformatic software tool that we developed with colleagues. EpiGRAPH helps identify and model significant biological associations among epigenetic and genetic properties for sets of regions. Using EpiExplorer and EpiGRAPH, independently or in a pipeline, provides the bioinformatic community with access to large databases of annotations, allows for exploratory visualizations or statistical analysis, and facilitates reproduction and sharing of results.
Epigenetic signatures such as DNA methylation or post-translational modifications of histone proteins are important regulatory mechanisms. They make it possible for a complex, multicellular organism to arise from a single genomic sequence. Adequate analysis methods depend on the experimental technologies used and on the available data. Recent advances in DNA sequencing technology enable the efficient creation of genome-wide maps of epigenetic information. Such epigenome maps are being produced on a large scale for diverse tissue and cell types by several projects and initiatives, such as ENCODE and IHEC. The lack of efficient bioinformatic software tools is a major bottleneck in analyzing this steadily growing flood of data. Experimental biologists can today inspect individual genomic loci in detail using user-friendly (mostly visual) bioinformatic software. In addition, efficient tools exist for manipulating and analyzing these datasets, but they require a certain degree of computer science expertise and mostly focus on solving biological questions that are already well defined.
In our view, what is missing are tools and software pipelines that let a user who has solid knowledge of the underlying biology, but not necessarily computer science skills, browse the available datasets interactively and develop further hypotheses on that basis. Methods from the fields of information retrieval and exploratory search offer one possible approach. This thesis describes EpiExplorer, a software tool that is based on the paradigm of integrating modern information retrieval methods and index structures and is designed to explore a large number of (epi)genome-wide datasets in real time. The algorithms used were originally developed for searching semi-structured textual datasets. EpiExplorer makes them applicable through a systematic conversion of biological properties into text documents. In addition, this thesis demonstrates EpiExplorer's performance and usefulness through relevant application examples addressing biologically interesting questions. Complementing EpiExplorer, EpiGRAPH was developed in collaboration with colleagues; it can be used to identify and model, on a per-region basis, significant biological associations between genetic and epigenetic properties. EpiExplorer and EpiGRAPH are useful resources, whether used independently or in combination. In a bioinformatic software pipeline, they enable database-backed access to a large number of (epi)genomic datasets, their exploratory visualization or statistical analysis, and the reproducibility and sharing of analysis results.
Link to this record:	urn:nbn:de:bsz:291-scidok-59112 hdl:20.500.11880/26634 http://dx.doi.org/10.22028/D291-26578
Advisor:	Lengauer, Thomas
Date of oral examination:	10-Jun-2014
Date of registration:	27-Oct-2014
Faculty:	MI - Fakultät für Mathematik und Informatik
Department:	MI - Informatik
Collections:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Files for this record:

File	Description	Size	Format
HalachevKonstantin_official_electronic_dissertation.pdf		4,02 MB	Adobe PDF
