Filter algorithms for approximate string matching

Burkhardt, Stefan

Bitte benutzen Sie diese Referenz, um auf diese Ressource zu verweisen: doi:10.22028/D291-25685

Titel:	Filter algorithms for approximate string matching
VerfasserIn:	Burkhardt, Stefan
Sprache:	Englisch
Erscheinungsjahr:	2002
DDC-Sachgruppe:	004 Informatik
Dokumenttyp:	Dissertation
Abstract:	In this work we present new results and methods for approximate string matching with filter algorithms. We begin with the presentation of QUASAR, our efficient implementation of an improved version of the filter based on the q-gram lemma. The q-gram lemma provides a method based on matching substrings to quickly detect potential matches to a query in a subject or target. We improved and implemented an algorithm originally introduced in 1991. This resulted in a very efficient program for approximate string matching using an index. It was successfully applied to EST-clustering, a problem from computational biology. The second part of our work introduces a new class of filters based on gapped q-grams. We analyze the potential of this somewhat more complicated approach for use in filters for approximate string matching with an index. We consider two important distance measures in approximate string matching, the Hamming and the edit distance. For both problems we provide all the tools required to solve them using gapped q-grams. This includes threshold computation and the selection of good gapped q-grams using predictions of their speed and filtration effciency. Furthermore we consider the potential of gapped q-grams for use in lossy filters. We support our findings with extensive experiments. Our results prove that gapped q-grams are superior to existing filter approaches with respect to speed, filtration efficiency and their potential for use in lossy filters. In dieser Arbeit beschreiben wir neue Ergebnisse und Verfahren auf dem Gebiet der Filteralgorithmen für Aehnlichkeitssuche in Textdatenbanken. Im ersten Teil stellen wir QUASAR, die Implementierung eines verbesserten Filters basierend auf dem sogenannten q-gram Lemma, vor. Dieses Lemma basiert auf dem Vergleich von kurzen Teilwoerten und ermöglicht die effiziente Erkennung der Teile einer Textdatenbank, die einer bestimmten Anfrage ähneln. Der zweite Teil der Arbeit stellt eine neue Klasse von Filtern die q-grams mit Lücken, sogenannte "gapped q-grams", benutzen vor. Wir untersuchen das Potential dieser komplexeren q-grams für die Nutzung in Filteralgorithmen für Index-basierte Ähnlichkeitssuche in Textdatenbanken.-
Link zu diesem Datensatz:	urn:nbn:de:bsz:291-scidok-1712 hdl:20.500.11880/25741 http://dx.doi.org/10.22028/D291-25685
Erstgutachter:	Hans-Peter Lehnhof
Tag der mündlichen Prüfung:	1-Jan-2002
Datum des Eintrags:	16-Feb-2004
Fakultät:	MI - Fakultät für Mathematik und Informatik
Fachrichtung:	MI - Informatik
Sammlung:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Dateien zu diesem Datensatz:

Datei	Beschreibung	Größe	Format
StefanBurkhardt_ProfDrHansPeterLehnhof.pdf		915,14 kB	Adobe PDF	Öffnen/Anzeigen

Export: BibTex Statistik anzeigen

Alle Ressourcen in diesem Repository sind urheberrechtlich geschützt.