Cross Language Information Retrieval

The goal of Cross Language Information Retrieval or CLIR is to find the information a user needs even if it's written in a different language.
This is achieved by designing a system where a query in one language can be compared with documents in another.

Information Mapping is particularly effective for CLIR, because our system is fundamentally based upon determining the similarity between words. Whereas most Information Retrieval systems treat the words "dinner" and "meal" as unrelated strings of letters, Infomap recognizes that they are closely associated with one another.
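
To make the idea of word similarity concrete, here is a minimal sketch. It is not the Infomap implementation: it assumes a toy corpus of four sentences, represents each word by counts of the other words it shares a sentence with, and compares words by cosine similarity. The function names and the sample sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(documents):
    """Map each word to a Counter of the words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in documents:
        words = set(doc.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical): two sentences about eating, two about engines.
docs = [
    "we cooked dinner and ate at the table",
    "the meal was cooked and ate at the table",
    "the engine needs oil and a new filter",
    "change the oil filter on the engine",
]

vecs = cooccurrence_vectors(docs)
print("dinner vs meal:  ", round(cosine(vecs["dinner"], vecs["meal"]), 3))
print("dinner vs engine:", round(cosine(vecs["dinner"], vecs["engine"]), 3))
```

"dinner" and "meal" never appear in the same sentence here, yet they score as similar because they keep the same company. A string-matching system would see no connection between them at all.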

It is logically only one step further to build a model which associates words from different languages. So if you're looking for French bread, your query for "bread" will naturally retrieve documents about "pain". But that's not all: the word "bread" will also be cross-linguistically similar to words like "baguette" and "croissant". Even though these are not literal translations of the English word "bread" (being instead particular kinds of bread), the information mapping process is correct to associate them with "bread". This process is ideal for CLIR, because if you want to buy bread in France these are exactly the sort of words you want to know about. Information Mapping thus allows a query to retrieve not only literal translations but also idiomatic ones.

This can be achieved by using a bilingual training corpus to build the model. We have recently built a model using a corpus of 10,000 medical abstracts with translations from German to English. You can use this to model the meanings of medical terms in both languages. In this way you can translate query statements and retrieve documents.
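
As a rough illustration of training on a bilingual corpus, the sketch below treats each abstract and its translation as one aligned unit, so that words from both languages end up in the same vector space. Everything in it is an assumption made for the example: the five tiny sentence pairs, the function names, and the simplification of using raw counts per aligned pair as the vector (the actual Infomap model is not described in that detail here).

```python
import math
from collections import Counter, defaultdict

# Hypothetical aligned corpus: (English text, German translation).
PAIRS = [
    ("heart pumps blood", "herz pumpt blut"),
    ("heart surgery", "herz operation"),
    ("lung surgery", "lunge operation"),
    ("lung absorbs oxygen", "lunge absorbiert sauerstoff"),
    ("blood carries oxygen", "blut transportiert sauerstoff"),
]


def train(pairs):
    """Give every word, in either language, a vector of counts per aligned pair."""
    vectors = defaultdict(Counter)
    for index, (english, german) in enumerate(pairs):
        for word in english.split() + german.split():
            vectors[word][index] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def embed(text, vectors):
    """Represent a query or document as the sum of its known word vectors."""
    total = Counter()
    for word in text.split():
        total.update(vectors.get(word, {}))
    return total


def search(query, documents, vectors):
    """Rank documents by similarity to the query, best match first."""
    q = embed(query, vectors)
    return sorted(((cosine(q, embed(d, vectors)), d) for d in documents), reverse=True)


vectors = train(PAIRS)
german_documents = [german for _, german in PAIRS]
for score, document in search("heart surgery", german_documents, vectors):
    print(f"{score:.2f}  {document}")
```

The English query shares no strings with the German documents, but "heart" and "herz" occur in the same aligned pairs, so their vectors point the same way and the matching German document ranks first. In practice the documents being searched would usually not be the training pairs themselves.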

Translating a query word by word using a dictionary is another method, but it has serious drawbacks. The dictionary may not contain some of the words needed, and it may not know all the possible meanings of the words it does contain. Most importantly, it is impossible to resolve translational ambiguities using a dictionary alone: if two possible translations are given for your query term, which one should the system choose?

The Infomap translation system addresses these problems. Words are translated based upon how they are used in practice, and the only words the system leaves out are ones which aren't represented in the documents you want to search!
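
One way to picture translation by usage is to let the rest of the query decide between dictionary candidates. The sketch below is only an illustration under stated assumptions: the dictionary entry, the two-dimensional vectors and their numbers are all made up, and a real model would learn its vectors from the bilingual corpus rather than have them typed in.

```python
import math

# Hypothetical dictionary entry: one English term with two German candidates.
DICTIONARY = {"discharge": ["entlassung", "ausfluss"]}

# Made-up usage vectors over two invented context dimensions:
# (hospital administration, wounds and symptoms).
VECTORS = {
    "hospital": (1.0, 0.1),
    "patient": (0.8, 0.4),
    "wound": (0.1, 1.0),
    "entlassung": (0.9, 0.1),  # discharge from hospital
    "ausfluss": (0.1, 0.9),    # discharge of fluid
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(term, context_words):
    """Pick the candidate translation closest to the other query words."""
    known = [VECTORS[w] for w in context_words if w in VECTORS]
    context = tuple(sum(dims) for dims in zip(*known)) if known else (0.0, 0.0)
    return max(DICTIONARY[term], key=lambda c: cosine(VECTORS[c], context))


print(disambiguate("discharge", ["patient", "hospital"]))
print(disambiguate("discharge", ["wound"]))
```

A dictionary alone offers both candidates with nothing to choose between them. Here the surrounding query words supply the missing evidence.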

This search engine has also been used to extend an English / German bilingual dictionary. This experiment is described in the following paper:

You can also visualize the meanings of words, and how they relate to other words in both languages, by building a bilingual word spectrum.

We have carried out successful experiments using an English-Japanese bilingual patent abstract corpus, where each Japanese patent abstract has an English translation. The resulting CLIR tests showed a 97.4% success rate in associating the right Japanese original with a query from the translated English version. The experiment is described in the following paper:
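
A success rate of this kind can be computed as follows, under one plausible reading of the test: each English translation is used as a query, and a trial counts as a success when the top-ranked Japanese abstract is that query's own original. The sketch assumes the similarity scores have already been computed; the function name and the small score table are invented for illustration and are not data from the experiment.

```python
def success_rate(scores):
    """Fraction of queries whose best-scoring document is their own original.

    scores[i][j] is the similarity between English query i and Japanese
    abstract j, where abstract i is the original of query i.
    """
    hits = 0
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(scores)


# Made-up scores for three queries; the third query ranks the wrong abstract first.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.5, 0.6, 0.4],
]
print(f"{success_rate(scores):.1%}")
```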

Other References

Raymond Flournoy, Hiroshi Masuichi, and Stanley Peters. Cross-Language Information Retrieval: Some Methods and Tools. In D. Hiemstra, F. de Jong and K. Netter, eds, TWLT 14 Language Technology in Multimedia Information Retrieval, pp. 79-83, Universiteit Twente: Enschede, The Netherlands, 1998.

Genichiro Kikui. Term-list Translation Using Monolingual Word Co-occurrence Vectors. COLING-ACL '98, Montreal, August 1998.
