Please use this reference to cite this resource:
doi:10.22028/D291-42300
Title: Learning entity and relation representation for low-resource medical language processing
Author: Amin, Saadullah
Language: English
Year of publication: 2024
Free keywords: relation extraction; named entity recognition; pre-trained language models; knowledge graph embeddings; concept extraction
DDC subject group: 004 Computer science; 400 Language, linguistics; 600 Technology; 610 Medicine, health
Document type: Dissertation
Abstract: Recent advances in natural language processing have led to growing interest in critical domains such as law, finance, and healthcare. In particular, medical language processing has emerged as a field of its own. As the volume of unstructured text and structured ontologies grows, information extraction becomes an essential first step for downstream applications in healthcare. To partially meet these needs, this dissertation studies knowledge acquisition for medical language processing under real-world low-resource conditions: limited-to-no labeled data, multilingualism, domain specificity, and missing knowledge. The focus is on the fundamental building blocks of information extraction, entities and relations, where the proposed methods derive representations from pre-trained language models for unstructured text and from knowledge graph embedding models for structured data.

First, we consider entity-centric learning in the clinical domain, ranging from multilingual and unsupervised concept extraction from text for semantic indexing to named entity transfer for privacy-preserving cross-lingual de-identification. We demonstrate the effectiveness of transfer learning with multilingual and domain-specific language models in supervised, unsupervised, and few-shot settings. In particular, we follow a pre-train-then-fine-tune paradigm that outperforms state-of-the-art neural architectures for concept extraction from multilingual clinical texts. For unsupervised extraction, we propose a hybrid framework, Dense Phrase Matching, which combines embedding-based matching with concept string matching (sketched below) and yields strong improvements on lexically rich texts, with further application to multilingual clinical texts. We then propose a Transformer-based transfer learning framework, T2NER, which bridges the gap between growing research on deep Transformer models, NER transfer, and domain adaptation. We apply T2NER to the task of identifying protected health information, empirically investigating the few-shot cross-lingual transfer ability of multilingual BERT, which has primarily been studied in the zero-shot setting, and propose an adaptation strategy that significantly improves clinical de-identification for code-mixed texts with few samples.

Second, we consider relation-centric learning in the biomedical domain, ranging from distantly supervised relation extraction from text for knowledge base enrichment to multi-relational link prediction for discovering missing facts in a knowledge graph. We showcase the utility of scientific language models for relation extraction and of efficient tensor factorization for knowledge graph completion. We first propose an entity-enriched relation classification BERT for multi-instance learning, introducing a knowledge-sensitive data encoding scheme (sketched below) that significantly reduces noise in distant supervision. We then examine existing broad-coverage biomedical relation extraction benchmarks and identify a notable shortcoming, overlapping training and test relationships, which we address by introducing a more accurate benchmark, MedDistant19. Lastly, we propose an efficient knowledge graph completion model, LowFER (sketched below), that performs on par with or above the state of the art on several datasets in the general and biomedical domains. We show that LowFER is fully expressive, handling arbitrary relation types, and that its low-rank generalization of the Tucker decomposition encapsulates existing models as special cases.
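As a rough illustration of the hybrid idea behind Dense Phrase Matching, the following sketch backs off from exact string matching against a concept lexicon to cosine similarity between phrase embeddings. The toy character-trigram embedder, the UMLS-style concept IDs, and the 0.7 threshold are assumptions made for a self-contained example; a real system would plug in a pre-trained phrase encoder.

```python
# Illustrative hybrid concept matcher: exact string match against a
# lexicon first, then a dense-similarity fallback. All details here are
# assumptions, not the dissertation's actual Dense Phrase Matching code.
import numpy as np

def embed(texts, dim=64):
    """Toy hash-based bag-of-character-trigram embedding (a stand-in
    for a pre-trained phrase encoder), L2-normalized per phrase."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        t = t.lower()
        for j in range(len(t) - 2):
            vecs[i, hash(t[j:j + 3]) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def match_concepts(phrases, lexicon, threshold=0.7):
    """Return (phrase, concept_id, score) triples: exact string match
    first, else nearest lexicon entry by cosine similarity if it clears
    the threshold, else no match."""
    exact = {name.lower(): cid for name, cid in lexicon.items()}
    names = list(lexicon)
    name_vecs = embed(names)
    results = []
    for phrase, vec in zip(phrases, embed(phrases)):
        cid = exact.get(phrase.lower())
        if cid is not None:                      # exact string match
            results.append((phrase, cid, 1.0))
            continue
        sims = name_vecs @ vec                   # cosine (unit vectors)
        best = int(np.argmax(sims))
        match = lexicon[names[best]] if sims[best] >= threshold else None
        results.append((phrase, match, float(sims[best])))
    return results

# Hypothetical two-entry lexicon mapping concept names to UMLS-style IDs.
lexicon = {"myocardial infarction": "C0027051", "hypertension": "C0020538"}
print(match_concepts(["hypertensive", "myocardial infarction"], lexicon))
```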
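The abstract does not spell out the knowledge-sensitive, entity-enriched encoding. One common realization, shown here as an assumption rather than the dissertation's exact scheme, wraps the head and tail entity mentions in special marker tokens before the sequence is fed to BERT, so the model can attend to the argument pair when classifying the relation of a bag of sentences.

```python
# Minimal sketch of entity-marker input encoding for relation
# classification. The [E1]/[E2] marker vocabulary is an illustrative
# convention, not necessarily the encoding used in the dissertation.
def mark_entities(tokens, head_span, tail_span):
    """Insert [E1]/[/E1] and [E2]/[/E2] markers around the head and tail
    mentions; spans are (start, end) token offsets, end exclusive."""
    (h0, h1), (t0, t1) = head_span, tail_span
    out = []
    for i, tok in enumerate(tokens):
        if i == h0: out.append("[E1]")
        if i == t0: out.append("[E2]")
        out.append(tok)
        if i == h1 - 1: out.append("[/E1]")
        if i == t1 - 1: out.append("[/E2]")
    return out

sent = "aspirin reduces the risk of myocardial infarction".split()
print(" ".join(mark_entities(sent, (0, 1), (5, 7))))
# -> [E1] aspirin [/E1] reduces the risk of [E2] myocardial infarction [/E2]
```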
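For LowFER, the sketch below shows a factorized bilinear scoring function of the kind the model is built on: low-rank fusion of the subject and relation embeddings, k-sized sum pooling, and a dot product with the object embedding. The dimensions, random parameters, and exact pooling layout are illustrative assumptions, not the reference implementation.

```python
# Hedged numpy sketch of a LowFER-style score:
#   score(s, r, o) = sum-pool_k((U^T e_s) * (V^T e_r)) . e_o
import numpy as np

rng = np.random.default_rng(0)
d_e, d_r, k = 8, 4, 3                 # entity dim, relation dim, rank k

U = rng.normal(size=(d_e, k * d_e))   # subject projection (low-rank factor)
V = rng.normal(size=(d_r, k * d_e))   # relation projection (low-rank factor)

def lowfer_score(e_s, e_r, e_o):
    fused = (U.T @ e_s) * (V.T @ e_r)           # Hadamard fusion, k*d_e
    pooled = fused.reshape(d_e, k).sum(axis=1)  # non-overlapping k-pooling
    return pooled @ e_o                         # match against object

e_s, e_o = rng.normal(size=d_e), rng.normal(size=d_e)
e_r = rng.normal(size=d_r)
print(lowfer_score(e_s, e_r, e_o))
```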
Link to this record: urn:nbn:de:bsz:291--ds-423004; hdl:20.500.11880/38051; http://dx.doi.org/10.22028/D291-42300
First examiner: Neumann, Günter
Date of the oral examination: 5-Jun-2024
Date of record entry: 16-Jul-2024
Grant number: EC/H2020/777107/EU//Precise4Q
Note: Partial funding: BMBF/01IW17001//DEEPLEE; BMBF/01IW20010//CoRA4NLP
Faculty: P - Faculty of Humanities; SE - Other institutions
Department: P - Language Science and Technology; SE - DFKI (German Research Center for Artificial Intelligence)
Chair: P - Prof. Dr. Josef van Genabith
Collection: SciDok - The Scientific Repository of Saarland University
Files in this record:
File | Description | Size | Format
---|---|---|---
PhD_final_Amin.pdf | | 5.54 MB | Adobe PDF
This resource has been published under the following copyright terms: Creative Commons license