Please use this identifier to cite or link to this item:
doi:10.22028/D291-42300
Title: | Learning entity and relation representation for low-resource medical language processing |
Author(s): | Amin, Saadullah |
Language: | English |
Year of Publication: | 2024 |
Free key words: | relation extraction; named entity recognition; pre-trained language models; knowledge graph embeddings; concept extraction |
DDC notations: | 004 Computer science, internet; 400 Language, linguistics; 600 Technology; 610 Medicine and health |
Publication type: | Dissertation |
Abstract: | Recent advances in natural language processing have led to growing interest in critical domains such as law, finance, and healthcare. In particular, medical language processing has emerged as a field of its own. As the volume of unstructured text and structured ontologies grows, information extraction becomes an essential first step for downstream applications in healthcare. To partially meet these needs, this dissertation studies knowledge acquisition for medical language processing under real-world low-resource conditions: limited-to-no labeled data, multilingualism, domain specificity, and missing knowledge. The focus is on the fundamental building blocks of information extraction, entities and relations, where the proposed methods derive representations from pre-trained language models for unstructured text and from knowledge graph embedding models for structured data.

First, we consider entity-centric learning in the clinical domain, ranging from multilingual and unsupervised concept extraction for semantic indexing to named entity transfer for privacy-preserving cross-lingual de-identification. We demonstrate the effectiveness of transfer learning with multilingual and domain-specific language models in supervised, unsupervised, and few-shot settings. In particular, we follow a pre-train-then-fine-tune paradigm to outperform state-of-the-art neural architectures for concept extraction from multilingual clinical texts. For unsupervised extraction, we propose a hybrid framework, Dense Phrase Matching, which combines embedding-based matching with concept string matching and shows strong improvements on lexically rich texts, with further application to multilingual clinical texts. We then propose a Transformer-based transfer learning framework, T2NER, that bridges the gap between growing research on deep transformer models, NER transfer, and domain adaptation. We use T2NER for the task of identifying protected health information by empirically investigating the few-shot cross-lingual transfer behavior of multilingual BERT, which has primarily been studied in the zero-shot setting, and propose an adaptation strategy that significantly improves clinical de-identification for code-mixed texts with few samples.

Second, we consider relation-centric learning in the biomedical domain, ranging from distantly supervised relation extraction for knowledge base enrichment to multi-relational link prediction for discovering missing facts in a knowledge graph. We showcase the utility of scientific language models for relation extraction and of efficient tensor factorization for knowledge graph completion. We first propose an entity-enriched relation classification BERT for multi-instance learning, introducing a knowledge-sensitive data encoding scheme that significantly reduces noise in distant supervision. We then examine existing broad-coverage biomedical relation extraction benchmarks and identify a notable shortcoming, overlapping training and test relationships, which we address by introducing a more accurate benchmark, MedDistant19. Lastly, we propose an efficient knowledge graph completion model, LowFER, that achieves on-par or state-of-the-art performance on several datasets in the general and biomedical domains. We show that LowFER is fully expressive, capable of representing arbitrary relation types, and that its low-rank generalization of the Tucker decomposition encapsulates existing models as special cases. |
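For illustration, the pre-train-then-fine-tune paradigm for concept extraction can be sketched as token classification over a multilingual encoder. This is a minimal sketch: the model name, the tag set, and the training sentence are illustrative assumptions, not the dissertation's exact configuration.

```python
# Minimal sketch: fine-tune a pre-trained multilingual encoder for
# concept extraction framed as BIO token classification.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-CONCEPT", "I-CONCEPT"]  # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# One toy clinical sentence; in practice a supervised or few-shot corpus.
enc = tokenizer("Patient denies chest pain.", return_tensors="pt")
tags = torch.zeros(enc.input_ids.shape, dtype=torch.long)  # all "O" here

out = model(**enc, labels=tags)  # cross-entropy over per-token tags
out.loss.backward()              # gradient step fine-tunes the encoder
```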
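The hybrid idea behind Dense Phrase Matching, exact concept string matching backed by embedding-based matching for the remaining phrases, might look roughly as follows. The function names, the `embed` encoder, and the 0.8 similarity threshold are hypothetical, not the dissertation's implementation.

```python
# Hypothetical sketch of a hybrid dense-plus-string concept matcher:
# exact lexicon hits are taken first, remaining phrases fall back to
# cosine similarity over phrase embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_concepts(phrases, lexicon, embed, threshold=0.8):
    """lexicon: {concept string -> concept id}; embed: phrase -> vector."""
    lex_vecs = {c: embed(c) for c in lexicon}  # pre-encode the ontology
    results = {}
    for p in phrases:
        if p.lower() in lexicon:               # 1) exact string match
            results[p] = lexicon[p.lower()]
            continue
        v = embed(p)                           # 2) dense fallback
        best, score = max(
            ((cid, cosine(v, lex_vecs[c])) for c, cid in lexicon.items()),
            key=lambda x: x[1],
        )
        if score >= threshold:                 # keep confident matches only
            results[p] = best
    return results
```

The threshold trades recall against precision: lexically rich texts benefit from the dense fallback, while the string pass keeps unambiguous matches cheap and exact.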
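A knowledge-sensitive encoding for multi-instance relation classification can be sketched by wrapping the head and tail entities of each sentence in a bag with marker tokens before encoding. The marker strings and the example bag below are assumptions, not the dissertation's exact scheme.

```python
# Illustrative sketch: entity-enriched encoding for relation
# classification under multi-instance (bag-level) learning.
def mark_entities(sentence: str, head: str, tail: str) -> str:
    s = sentence.replace(head, f"[E1] {head} [/E1]", 1)
    return s.replace(tail, f"[E2] {tail} [/E2]", 1)

# A distantly supervised "bag": all sentences mention the same pair.
bag = [
    "aspirin is commonly used to treat headache .",
    "relief of headache after aspirin was reported .",
]
marked = [mark_entities(s, "aspirin", "headache") for s in bag]

# Multi-instance learning scores the bag, not single sentences: each
# marked sentence is encoded (e.g., with BERT) and the bag prediction
# aggregates the sentence encodings, which dampens distant-supervision noise.
```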
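LowFER's scoring can be sketched as a rank-k factorized bilinear pooling of subject and relation embeddings followed by an inner product with the object embedding. Dimensions, initialization, and the pooling layout here are simplified assumptions; the dissertation gives the exact formulation and its relation to the Tucker decomposition.

```python
# Hedged NumPy sketch of a LowFER-style triple score.
import numpy as np

rng = np.random.default_rng(0)
d_e, d_r, k = 8, 4, 2  # entity dim, relation dim, factorization rank

U = rng.normal(size=(d_e, k * d_e))  # low-rank factors shared across relations
V = rng.normal(size=(d_r, k * d_e))

def score(e_s, e_r, e_o):
    m = (U.T @ e_s) * (V.T @ e_r)            # Hadamard product in R^{k*d_e}
    pooled = m.reshape(d_e, k).sum(axis=1)   # k-size sum pooling back to R^{d_e}
    return pooled @ e_o                      # plausibility of (s, r, o)

e_s, e_r, e_o = rng.normal(size=d_e), rng.normal(size=d_r), rng.normal(size=d_e)
print(score(e_s, e_r, e_o))
```

Larger k increases the effective rank of the implied core tensor, which is how the low-rank family interpolates between cheap bilinear models and a full Tucker-style interaction.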
Link to this record: | urn:nbn:de:bsz:291--ds-423004; hdl:20.500.11880/38051; http://dx.doi.org/10.22028/D291-42300 |
Advisor: | Neumann, Günter |
Date of oral examination: | 5-Jun-2024 |
Date of registration: | 16-Jul-2024 |
Sponsorship ID: | EC/H2020/777107/EU//Precise4Q |
Notes: | Partially funded by: BMBF/01IW17001//DEEPLEE; BMBF/01IW20010//CoRA4NLP |
Faculty: | P - Philosophische Fakultät; SE - Sonstige Einrichtungen |
Department: | P - Sprachwissenschaft und Sprachtechnologie; SE - DFKI Deutsches Forschungszentrum für Künstliche Intelligenz |
Professorship: | P - Prof. Dr. Josef van Genabith |
Collections: | SciDok - Der Wissenschaftsserver der Universität des Saarlandes |
Files for this record:
File | Description | Size | Format
---|---|---|---
PhD_final_Amin.pdf | | 5.54 MB | Adobe PDF
This item is licensed under a Creative Commons License