Please use this identifier to cite or link to this resource:
doi:10.22028/D291-45991
Title: | Swiftly identifying strongly unique k-mers |
Author(s): | Zentgraf, Jens; Rahmann, Sven |
Language: | English |
Journal title: | Algorithms for Molecular Biology |
Volume: | 20 |
Issue: | 1 |
Publisher/Platform: | BMC |
Year of publication: | 2025 |
Free keywords: | k-mer; Hamming distance; Strong uniqueness; Parallelization; Algorithm engineering |
DDC subject group: | 004 Computer science |
Document type: | Journal article |
Abstract: | Motivation: Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, in a single species from a larger set of species, etc.) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: A single basepair change may turn a unique k-mer into a different one that may in fact be present at one or more different locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept is that of strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in all of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers with a Hamming-distance-1 neighbor in the collection from those that do not. Results: We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billions of k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome. Availability: An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers. |
DOI of first publication: | 10.1186/s13015-025-00286-6 |
URL of first publication: | https://doi.org/10.1186/s13015-025-00286-6 |
Link to this record: | urn:nbn:de:bsz:291--ds-459915 hdl:20.500.11880/40361 http://dx.doi.org/10.22028/D291-45991 |
ISSN: | 1748-7188 |
Date of entry: | 8-Aug-2025 |
Faculty: | MI - Faculty of Mathematics and Computer Science |
Department: | MI - Computer Science |
Professorship: | MI - Prof. Dr. Sven Rahmann |
Collection: | SciDok - Der Wissenschaftsserver der Universität des Saarlandes |
Files for this record:
File | Description | Size | Format | |
---|---|---|---|---|
s13015-025-00286-6.pdf | | 3.95 MB | Adobe PDF | View/Open |
This resource was published under the following copyright terms: Creative Commons license
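
The abstract mentions a pairwise bit-parallel Hamming distance test on small buckets of k-mers. The following is a minimal Python sketch of how such a test can work in principle; it is not taken from the authors' implementation. The 2-bit encoding (A=0, C=1, G=2, T=3) and the function names `encode`, `hamming_distance` and `mark_weak` are assumptions chosen for illustration, and the all-pairs loop is a naive stand-in for the bucketing described in the abstract. Canonical k-mers (reverse complements) are not handled here.

```python
# Illustrative sketch only; encoding and function names are assumptions,
# not the API of the strong-k-mers software.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding


def encode(kmer):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def hamming_distance(a, b, k):
    """Count differing bases between two 2-bit encoded k-mers."""
    x = a ^ b                      # non-zero 2-bit group <=> bases differ
    low_bits = int("01" * k, 2)    # mask selecting the low bit of each group
    y = (x | (x >> 1)) & low_bits  # collapse each group to a single bit
    return bin(y).count("1")       # population count


def mark_weak(kmers):
    """Naively mark k-mers that have a Hamming-distance-1 neighbor in the list.

    Quadratic all-pairs comparison; only sensible for small inputs.
    """
    k = len(kmers[0])
    codes = [encode(s) for s in kmers]
    weak = [False] * len(codes)
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            if hamming_distance(codes[i], codes[j], k) == 1:
                weak[i] = weak[j] = True
    return weak


if __name__ == "__main__":
    print(mark_weak(["AAAA", "AAAC", "GGGG"]))
```

In this sketch, a k-mer left unmarked by `mark_weak` has no Hamming-distance-1 neighbor in the given list, which corresponds to the distinction the abstract describes.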