Filter algorithms for approximate string matching

Burkhardt, Stefan

Please use this identifier to cite or link to this item: doi:10.22028/D291-25685

Title:	Filter algorithms for approximate string matching
Author(s):	Burkhardt, Stefan
Language:	English
Year of Publication:	2002
DDC notations:	004 Computer science, internet
Publikation type:	Dissertation
Abstract:	In this work we present new results and methods for approximate string matching with filter algorithms. We begin with the presentation of QUASAR, our efficient implementation of an improved version of the filter based on the q-gram lemma. The q-gram lemma provides a method based on matching substrings to quickly detect potential matches to a query in a subject or target. We improved and implemented an algorithm originally introduced in 1991. This resulted in a very efficient program for approximate string matching using an index. It was successfully applied to EST-clustering, a problem from computational biology. The second part of our work introduces a new class of filters based on gapped q-grams. We analyze the potential of this somewhat more complicated approach for use in filters for approximate string matching with an index. We consider two important distance measures in approximate string matching, the Hamming and the edit distance. For both problems we provide all the tools required to solve them using gapped q-grams. This includes threshold computation and the selection of good gapped q-grams using predictions of their speed and filtration effciency. Furthermore we consider the potential of gapped q-grams for use in lossy filters. We support our findings with extensive experiments. Our results prove that gapped q-grams are superior to existing filter approaches with respect to speed, filtration efficiency and their potential for use in lossy filters. In dieser Arbeit beschreiben wir neue Ergebnisse und Verfahren auf dem Gebiet der Filteralgorithmen für Aehnlichkeitssuche in Textdatenbanken. Im ersten Teil stellen wir QUASAR, die Implementierung eines verbesserten Filters basierend auf dem sogenannten q-gram Lemma, vor. Dieses Lemma basiert auf dem Vergleich von kurzen Teilwoerten und ermöglicht die effiziente Erkennung der Teile einer Textdatenbank, die einer bestimmten Anfrage ähneln. Der zweite Teil der Arbeit stellt eine neue Klasse von Filtern die q-grams mit Lücken, sogenannte "gapped q-grams", benutzen vor. Wir untersuchen das Potential dieser komplexeren q-grams für die Nutzung in Filteralgorithmen für Index-basierte Ähnlichkeitssuche in Textdatenbanken.-
Link to this record:	urn:nbn:de:bsz:291-scidok-1712 hdl:20.500.11880/25741 http://dx.doi.org/10.22028/D291-25685
Advisor:	Hans-Peter Lehnhof
Date of oral examination:	1-Jan-2002
Date of registration:	16-Feb-2004
Faculty:	MI - Fakultät für Mathematik und Informatik
Department:	MI - Informatik
Collections:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Files for this record:

File	Description	Size	Format
StefanBurkhardt_ProfDrHansPeterLehnhof.pdf		915,14 kB	Adobe PDF	View/Open

Export: BibTex