Semantic Annotation and Ambiguity Resolution for a Multilingual Medical Information System

Dominic Widdows, Stanley Peters, Beate Dorow, Chiu-Ki Chan

CSLI, Stanford University

Language resources for biomedical informatics, such as the Unified Medical Language System (UMLS), can be key to adapting information systems to work with more than one language. This is particularly important for making recent, relevant research quickly available to a busy physician whose first language is not English.

A pilot version of such a system has been built by the MuchMore project, an international collaboration including Stanford's Center for the Study of Language and Information. The information system uses UMLS to match English language documents to German queries, by automatically annotating the medical concepts in documents and queries with the corresponding Concept Unique Identifiers from UMLS (Volk et al., 2002). This enables us to express the medical concepts used in particular documents in a semantic "metalanguage" which is independent of the language in which the document was originally written (much as Latin terms are traditionally used for medical concepts, independently of any one vernacular tongue).

This enables queries written in one language (in this case, German) to be matched to relevant documents in another language (in this case, English). This is particularly useful for a German-speaking physician who has a reasonable grasp of English and can easily understand the importance of an English language article if it is known to be relevant, but would have difficulty formulating an English query in reasonable time.
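The matching idea described above can be illustrated with a minimal sketch. It is not the MuchMore implementation: the two toy lexicons, the function names, and the concept identifiers (invented placeholders, not real UMLS Concept Unique Identifiers) are all assumptions made for the example.

```python
# Toy lexicons mapping surface terms to sets of concept identifiers.
# ASSUMPTION: these identifiers are invented placeholders, not real UMLS CUIs,
# and the lexicons stand in for a real UMLS lookup.
EN_LEXICON = {
    "drugs": {"CUI_MEDICATION", "CUI_NARCOTIC"},  # ambiguous in English
    "heart": {"CUI_HEART"},
    "infarction": {"CUI_INFARCTION"},
}
DE_LEXICON = {
    "medikamente": {"CUI_MEDICATION"},
    "drogen": {"CUI_NARCOTIC"},
    "herz": {"CUI_HEART"},
    "infarkt": {"CUI_INFARCTION"},
}


def annotate(text, lexicon):
    """Return the set of concept identifiers for the terms found in text."""
    concepts = set()
    for token in text.lower().split():
        # Strip simple trailing/leading punctuation before the lookup.
        concepts |= lexicon.get(token.strip(".,;:!?"), set())
    return concepts


def concept_overlap(german_query, english_document):
    """Count the concepts shared by a German query and an English document."""
    query_concepts = annotate(german_query, DE_LEXICON)
    document_concepts = annotate(english_document, EN_LEXICON)
    return len(query_concepts & document_concepts)
```

In this sketch the query "Herz Infarkt" shares two concepts with the document "Heart infarction." even though no surface word is common to both. The query "Drogen" also matches any document containing "drugs", whichever sense the document intends.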

The automatic annotation process suffers because of conceptual ambiguity. For example, the English word "drugs" is used for two distinct concepts, roughly corresponding to the German terms "Drogen" (narcotics) and "Medikamente" (medical drugs). Both of these terms could realistically be used in a German physician's query. Before matching such a query with an English document containing the word "drugs", it is important to know which of these senses is being invoked in the document, to avoid wasting the physician's time with irrelevant information.

The ambiguity-resolution problem was addressed successfully using the part of the UMLS knowledge which records which pairs of concepts have been used together by medical experts to index the same document in the MedLine collection. These pairs can be thought of as "conceptual partners", and for each ambiguous term we chose the sense which had the most conceptual partners in the same document (Lesk 1986). This method resolved over half of the ambiguities in the document collection, with nearly 80% accuracy (compared with the judgement of medical experts).
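The partner-counting idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the function names and the placeholder concept identifiers are invented, the co-indexing pairs are a toy stand-in for the UMLS data, and abstaining when no sense wins outright is an assumption of this sketch.

```python
def count_partners(candidate, context, partner_pairs):
    """Count the context concepts that are conceptual partners of candidate."""
    return sum(
        1 for concept in context
        if frozenset((candidate, concept)) in partner_pairs
    )


def disambiguate(candidates, context, partner_pairs):
    """Pick the candidate sense with the most partners in the document.

    Returns None when no candidate has any partner or when the best score
    is tied (ASSUMPTION: the sketch abstains in these cases).
    """
    scores = {c: count_partners(c, context, partner_pairs) for c in candidates}
    best = max(scores.values())
    winners = [c for c, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


# ASSUMPTION: invented placeholder identifiers and a toy co-indexing table.
PARTNER_PAIRS = {
    frozenset(("CUI_MEDICATION", "CUI_DOSAGE")),
    frozenset(("CUI_MEDICATION", "CUI_PRESCRIPTION")),
    frozenset(("CUI_NARCOTIC", "CUI_ADDICTION")),
}

senses_of_drugs = ["CUI_MEDICATION", "CUI_NARCOTIC"]
document_concepts = {"CUI_DOSAGE", "CUI_PRESCRIPTION"}
chosen = disambiguate(senses_of_drugs, document_concepts, PARTNER_PAIRS)
```

Here `chosen` is the medication sense, because it has two partners among the document's concepts while the narcotic sense has none. With a document whose concepts are partners of neither sense, the function returns None and the ambiguity is left unresolved.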

In retrieval tests, the annotation with UMLS concepts enabled a search engine to find considerably more relevant documents than a baseline search engine which only used the "surface-forms" of the words (Littman et al., 1998). The ambiguity resolution component improved these results slightly further. (These results are shown in the "precision against recall curves" below.) This demonstrates that a rich knowledge-source such as UMLS can be used to greatly enhance multilingual access to relevant, unambiguous information.
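For readers unfamiliar with precision against recall curves, the following minimal sketch shows how the points of such a curve are commonly computed from one ranked result list. It is a generic illustration, not the evaluation code used in these tests; the function name and the document identifiers are hypothetical.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """Return (recall, precision) pairs, one per relevant document retrieved.

    ranked_ids: document identifiers in the order the search engine returned them.
    relevant_ids: identifiers judged relevant to the query.
    """
    relevant = set(relevant_ids)
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            recall = hits / len(relevant)   # share of relevant documents found so far
            precision = hits / rank         # share of retrieved documents that are relevant
            points.append((recall, precision))
    return points


# Hypothetical ranking: relevant documents appear at ranks 1 and 3.
points = precision_recall_points(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

A system whose curve lies above another's returns a higher proportion of relevant documents at the same level of recall.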

This work is described much more fully by Widdows et al. (2003).

References

M. L. Littman, S. T. Dumais, T. K. Landauer. 1998. Automatic cross-language information retrieval using latent semantic indexing, Ch. 4 in Cross-language information retrieval, ed. G. Grefenstette, Kluwer.

M. E. Lesk. 1986. Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone, Proceedings of the SIGDOC conference, ACM.

M. Volk, B. Ripplinger, S. Vintar, P. Buitelaar, D. Raileanu, B. Sacaleanu. 2002. Semantic Annotation for Concept-Based Cross-Language Medical Information Retrieval. International Journal of Medical Informatics, Volume 67:1-3.

D. Widdows, S. Peters, P. Buitelaar, D. Steffen, S. Cederberg, B. Dorow. 2003. A Multilingual Medical Information System using unsupervised Word Sense Disambiguation. Under review for Journal of Computers, Speech and Language.


Last modified: Tue Feb 10 12:44:06 PST 2004