Cross Language Information Retrieval
The goal of Cross Language Information Retrieval, or CLIR, is to find the
information a user needs even if it is written in a different language.
This is achieved by designing a system where a query in one
language can be compared with documents in another.
Information Mapping is particularly effective for CLIR, because our
system is fundamentally based upon determining the similarity
between words. Whereas most Information Retrieval systems treat the
words "dinner" and "meal" as unrelated strings of letters, Infomap
recognizes that they are closely associated with one another.
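As a rough illustration of what "similarity between words" can mean, the sketch below compares words by the cosine of the angle between their vectors. It assumes each word is represented by a vector of co-occurrence counts. The three context dimensions and all the numbers are invented for the example and are not taken from any Infomap model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts with three context words: (eat, table, engine).
vectors = {
    "dinner": [8, 5, 0],
    "meal":   [7, 4, 1],
    "car":    [0, 1, 9],
}

sim_related = cosine(vectors["dinner"], vectors["meal"])
sim_unrelated = cosine(vectors["dinner"], vectors["car"])
print(sim_related > sim_unrelated)
```

A system that compares raw strings has no way to see that "dinner" and "meal" are related, whereas here the two vectors point in nearly the same direction.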
It is logically only one step further to build a model which
associates words from different languages. So if you're looking for
French bread, your query for "bread" will naturally retrieve
documents about "pain". But that's not all: the word "bread" will also
be cross-linguistically similar to words like "baguette" and
"croissant". Even though these are not literal translations of the
English word "bread" (being instead particular kinds of bread), the
information mapping process is correct to associate them with
"bread". This process is ideal for CLIR, because if you want to buy
bread in France these are exactly the sort of words you want to know
about. Information Mapping thus allows a query to retrieve not only
literal but also idiomatic translations.
This can be achieved by using a bilingual training corpus to
build the model. We have recently built a model using a corpus of 10,000
medical abstracts with translations from German to English. You can use
this to model the meanings of medical terms in both languages. In
this way you can translate query statements and retrieve documents.
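The idea of training on a bilingual corpus can be sketched as follows. This is a minimal illustration, assuming that each abstract and its translation are treated as one aligned unit and that a word is represented by its counts across those units. The four toy document pairs and the function names `build_vectors` and `nearest` are invented for the example and do not describe the actual Infomap implementation.

```python
import math
from collections import defaultdict

def build_vectors(pairs):
    """Map (language, word) to a vector of counts, one dimension per aligned pair."""
    vectors = defaultdict(lambda: [0] * len(pairs))
    for i, (de_text, en_text) in enumerate(pairs):
        for w in de_text.lower().split():
            vectors[("de", w)][i] += 1
        for w in en_text.lower().split():
            vectors[("en", w)][i] += 1
    return dict(vectors)  # plain dict, so lookups cannot create new entries

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, lang, target_lang, vectors, k=2):
    """Return the k words of target_lang most similar to (lang, word)."""
    query = vectors.get((lang, word))
    if query is None:
        return []
    scored = [(w, cosine(query, vec))
              for (l, w), vec in vectors.items() if l == target_lang]
    return sorted(scored, key=lambda item: -item[1])[:k]

# Hypothetical aligned German/English "abstracts".
pairs = [
    ("herz infarkt therapie", "heart attack therapy"),
    ("herz chirurgie", "heart surgery"),
    ("lunge infektion therapie", "lung infection therapy"),
    ("lunge chirurgie", "lung surgery"),
]

vectors = build_vectors(pairs)
print(nearest("herz", "de", "en", vectors))
```

Because German and English words share the same dimensions (the aligned pairs), a German query word can be compared directly with English words, and the closest neighbours include related terms as well as the literal translation.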
Translating a query word by word using a dictionary is another method, but
it has serious drawbacks. The dictionary may not contain some of the words
needed, and it may not know all the possible meanings of the words it
refers to. Most importantly, it is impossible to resolve translational
ambiguities just using a dictionary: if two possible translations are given
for your query term, which one should the system choose?
The Infomap translation system addresses these problems. Words are
translated based upon how they are used in practice, and the words the
system leaves out are ones which aren't represented in the documents
you want to search!
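One way to picture how usage can settle an ambiguity that a dictionary leaves open: among the candidate translations, pick the one whose vector lies closest to the rest of the query. The sketch below assumes a shared bilingual vector space. The vectors, the example words, and the helper `pick_translation` are hypothetical and are not the documented Infomap procedure.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_translation(candidates, context_words, vectors):
    """Return the German candidate closest to the summed English context vector."""
    dims = len(next(iter(vectors.values())))
    context = [0.0] * dims
    for w in context_words:
        # Context words missing from the model contribute nothing.
        for i, x in enumerate(vectors.get(("en", w), [0] * dims)):
            context[i] += x
    return max(candidates, key=lambda c: cosine(vectors[("de", c)], context))

# Hypothetical vectors in a shared space (one dimension per aligned document).
vectors = {
    ("en", "river"): [0, 0, 2, 1],
    ("en", "money"): [2, 1, 0, 0],
    ("de", "bank"):  [2, 2, 0, 0],   # financial institution
    ("de", "ufer"):  [0, 0, 1, 2],   # river bank
}

# English "bank" has two dictionary translations; the other query words decide.
print(pick_translation(["bank", "ufer"], ["river"], vectors))
print(pick_translation(["bank", "ufer"], ["money"], vectors))
```

The choice is driven entirely by how the words are used in the training documents, so no hand-written disambiguation rules are needed.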
This search engine has also been used to extend an English / German
bilingual dictionary. This experiment is described in the following paper:
You can also visualize the meanings of words, and how they relate to
other words in both languages, by building a bilingual word spectrum.
We have carried out successful experiments
using an English-Japanese bilingual patent abstract corpus, where each
Japanese patent abstract has an English translation. The resulting
CLIR tests showed a 97.4% success rate in associating the right
Japanese original with a query from the translated English version.
The experiment is described in the following paper:
Other References
- Raymond Flournoy, Hiroshi Masuichi, and Stanley Peters.
  Cross-Language Information Retrieval: Some Methods and Tools. In
  D. Hiemstra, F. de Jong and K. Netter, eds, TWLT 14 Language
  Technology in Multimedia Information Retrieval, pp. 79-83,
  Universiteit Twente: Enschede, The Netherlands, 1998.
- Genichiro Kikui. Term-list Translation Using Monolingual Word
  Co-occurrence Vectors. COLING-ACL '98, Montreal, August 1998.