Information Mapping Project

The goal of the Information Mapping project was to understand the meanings of words based on the way they are used in text. This is important for language understanding and translation, and also for intelligent, concept-based information representation and retrieval. Currently, document retrieval from large document collections, such as the web, library card catalogs, or newspaper archives, is based on keyword search. A query is posed as a list of words, and any entries in the database that contain any or all of those specific words are returned. However, if we treat the query words not as literal strings of letters but as representations of concepts, then we can retrieve relevant documents even if they do not contain the specific words used in the query.

The Infomap project focussed on the development of the following important mathematical models.

Vector Space Models

The Infomap vector space model, or WORDSPACE, was pioneered by Hinrich Schütze. The model maps words to points in a high-dimensional space by recording how often words co-occur in text, for example the number of times two words appear in the same document. The distribution of co-occurrences between a word and some set of content-bearing terms then serves as a profile of the word's usage, and can accurately associate words with similar meanings. A word can thus be described by a spectrum of related words. The user can choose to accept or reject these relations, building an increasingly accurate profile of the meaning they intend. Generalizing this, by comparing the query words' profiles to profiles generated for each document, we can return articles that are conceptually related to the query words, even if the words themselves do not appear in the text. Vector space models for text are described here.
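The idea of a co-occurrence profile can be sketched in a few lines of Python. This is a minimal illustration, not the Infomap software: the toy corpus, the list of content-bearing terms, and all function names are invented for the example, and it assumes that two words co-occur whenever they appear in the same document and that a query or document profile is simply the sum of its words' vectors.

```python
import math
from collections import Counter

# Toy corpus and content-bearing terms, invented for illustration.
DOCUMENTS = [
    "doctor treats patient in hospital",
    "nurse treats patient in hospital",
    "bank lends money to customer",
    "bank holds money for customer",
]
CONTENT_TERMS = ["patient", "hospital", "money", "customer"]

def build_word_vectors(documents, content_terms):
    """Count, for each word, the documents it shares with each content term."""
    vectors = {}
    for doc in documents:
        tokens = set(doc.lower().split())
        present = [t for t in content_terms if t in tokens]
        for word in tokens:
            profile = vectors.setdefault(word, Counter())
            for term in present:
                if term != word:
                    profile[term] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def text_vector(text, vectors):
    """Profile of a query or document: the sum of its words' vectors."""
    total = Counter()
    for token in text.lower().split():
        total.update(vectors.get(token, {}))
    return total

WORD_VECTORS = build_word_vectors(DOCUMENTS, CONTENT_TERMS)

def rank_documents(query, documents=DOCUMENTS, vectors=WORD_VECTORS):
    """Order documents by the similarity of their profile to the query's."""
    q = text_vector(query, vectors)
    return sorted(documents,
                  key=lambda d: cosine(q, text_vector(d, vectors)),
                  reverse=True)
```

In this toy corpus "doctor" and "nurse" never appear together, yet they receive the same profile because both co-occur with "patient" and "hospital". As a result, rank_documents("nurse") places the document about the doctor ahead of the banking documents even though it does not contain the query word.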

Later on in the project, work on the logical properties of WORDSPACE by Dominic Widdows and Stanley Peters demonstrated that WORDSPACE can naturally be navigated using the same logic as quantum mechanics, with powerful and exciting consequences for modelling word-meanings. This aspect of WORDSPACE is described here.
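One operation from this line of work is orthogonal negation, the subject of one of the papers listed in the references: "a NOT b" is taken to be the part of vector a that is orthogonal to vector b. The sketch below shows that projection in Python. The two-dimensional vectors and their names are hypothetical and chosen only to keep the arithmetic visible; real WORDSPACE vectors have many more dimensions.

```python
import math

def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is zero."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def negate(a, b):
    """Return 'a NOT b': a with its projection onto b subtracted."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

# Hypothetical vectors: axis 0 stands for music contexts,
# axis 1 for geology contexts.
ROCK = [0.6, 0.8]
MUSIC = [1.0, 0.0]
STONE = [0.0, 1.0]

ROCK_NOT_MUSIC = negate(ROCK, MUSIC)
```

By construction the result has zero scalar product with the negated vector, so a query for "rock NOT music" in this toy space no longer has any similarity to "music" while remaining similar to "stone".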

As part of our goal of distributing methods and tools to the wider scientific community, the Infomap project also released open-source software that researchers and practitioners can use to build and navigate WORDSPACE models. The software is available from here.
Try the WORDSPACE Demo

Link Analysis Models

Link analysis has proved to be a popular tool in recent years for many applications, including ranking web pages and studying social and professional networks. In this project, the technique also proved to be a powerful method for building clusters of similar words and for recognising ambiguous words that span several clusters. The graph model for word meanings, pioneered by Dominic Widdows and Beate Dorow, is described here.
Try the Link Analysis Demo
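The following Python sketch illustrates how a word graph can expose ambiguity. It is a simplified stand-in rather than the published method: it assumes that links come from simple "X and Y" or "X or Y" patterns, and it approximates clusters by the connected components that remain among a word's neighbours once the word itself is set aside. The sample phrases and function names are invented for the example.

```python
import re
from collections import defaultdict

# Invented sample phrases; a real graph would be built from a large corpus.
PHRASES = [
    "a mouse and keyboard",
    "the mouse or monitor",
    "keyboard and monitor",
    "mouse and rat",
    "mouse or hamster",
    "rat and hamster",
]

def build_graph(phrases):
    """Link two words whenever they are joined by 'and' or 'or'."""
    graph = defaultdict(set)
    pattern = re.compile(r"\b(\w+) (?:and|or) (\w+)\b")
    for phrase in phrases:
        for left, right in pattern.findall(phrase.lower()):
            graph[left].add(right)
            graph[right].add(left)
    return graph

def sense_clusters(graph, word):
    """Group the neighbours of word into connected components,
    ignoring any path that passes through word itself."""
    unvisited = set(graph.get(word, set()))
    clusters = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            # Only follow links to other neighbours of the target word.
            for nxt in graph[node] & unvisited:
                unvisited.discard(nxt)
                component.add(nxt)
                stack.append(nxt)
        clusters.append(component)
    return clusters

GRAPH = build_graph(PHRASES)
```

Here the neighbours of "mouse" fall into two groups with no links between them, one of computer equipment and one of animals, which marks "mouse" as a candidate ambiguous word. The neighbours of "keyboard" stay in a single group.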

Dictionaries, Knowledge Bases, and the Medical Domain

As well as building models directly from free text, we performed groundbreaking research on integrating these automatically derived models with information from traditional dictionaries and knowledge bases. Some of this research is described here.

For example, we were able to use the Unified Medical Language System (UMLS) to build automatic word sense disambiguation tools for use with medical search engines, as described here.

Other Infomap Research included:

  • The adaptation of our work to the Medical Domain was carried out as part of the MuchMore project.
  • Understanding and developing the linguistic, logical and algebraic theory which we use to encode and model the meaning of words.
  • Investigating how different training corpora affect the resulting associations between meanings. You can see some of this for yourself by comparing search results using the different models available on our demo. As you can see, words like "heart" have much more specific associations in the Medical domain than in general newswire.
  • Detecting and resolving ambiguity. Speakers naturally disambiguate words for one another, using simple phrases like "Do you mean fire as in burning or fire as guns?" We have developed algorithms which model this process.
  • Improving options for the user and methods of interaction. Our users can add and remove aspects of a word's meaning from a query, create clusters of word senses, and visualize the meaning of different words and queries. You can learn more about these options and try them out on the demo.
  • Applying these techniques to build concept-based, cross-language information systems for medical documents. You can try out a part of this project on the bilingual Infomap demo.


Infomap techniques proved to be readily adaptable and were successfully applied to German and Japanese with few changes. We have used the concept space created by a bilingual corpus to perform cross-language information retrieval. Ultimately, if we can request information in one language and retrieve articles written in another, information retrieval will be freed from the constraints of a particular language and users will be able to draw from a rich pool of untranslated materials previously unavailable to them.
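One way a bilingual corpus can yield a shared concept space is sketched below in Python. The sketch rests on an assumption that is not spelled out on this page: each document and its translation are treated as a single unit, so that words from both languages are described by the same dimensions. The aligned English and German snippets and the function names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical aligned pairs: (English text, German translation).
PARALLEL_DOCS = [
    ("heart disease patient", "herz krankheit patient"),
    ("heart surgery", "herz operation"),
    ("broken bone", "gebrochener knochen"),
    ("bone fracture", "knochen fraktur"),
]
GERMAN_VOCAB = {tok for _, german in PARALLEL_DOCS for tok in german.split()}

def build_bilingual_vectors(pairs):
    """Give every word, in either language, one dimension per aligned pair."""
    vectors = {}
    for doc_id, (english, german) in enumerate(pairs):
        for token in (english + " " + german).lower().split():
            vectors.setdefault(token, Counter())[doc_id] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def nearest_in(word, candidates, vectors):
    """Return the candidate whose vector is most similar to word's vector."""
    return max(candidates, key=lambda c: cosine(vectors[word], vectors[c]))

BILINGUAL_VECTORS = build_bilingual_vectors(PARALLEL_DOCS)
```

Because "heart" and "herz" occur in the same aligned pairs, they receive the same vector, so an English query word can be matched against German words, and by extension German documents, without a translation dictionary.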

References

The Search for Mr. Goodfile Generates New Online Tools. Research News, Science 276:5318, 6 June 1997.

Raymond Flournoy, Ryan Ginstrom, Kenichi Imai, Stefan Kaufmann, Genichiro Kikui, Stanley Peters, Hinrich Schütze, and Yasuhiro Takayama. Personalization and Users' Semantic Expectations. ACM SIGIR'98 Workshop on Query Input and User Expectations, Melbourne, Australia, August 1998.

Raymond Flournoy, Hiroshi Masuichi, and Stanley Peters. Cross-Language Information Retrieval: Some Methods and Tools. In D. Hiemstra, F. de Jong and K. Netter, eds, TWLT 14 Language Technology in Multimedia Information Retrieval, pp. 79-83, Universiteit Twente: Enschede, The Netherlands, 1998.

Genichiro Kikui. Term-list Translation Using Monolingual Word Co-occurrence Vectors. COLING-ACL '98, Montreal, August 1998.

Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann, and Stanley Peters. Query Translation Method for Cross Language Information Retrieval. In Proceedings of the Workshop on Machine Translation for Cross Language Information Retrieval, MT Summit VII, pp. 30-34, Singapore, September 1999.

Hinrich Schütze. Ambiguity Resolution in Language Learning. CSLI Publications, 1997. CSLI Lecture Notes number 71.

Yasuhiro Takayama, Raymond Flournoy, Stefan Kaufmann, and Stanley Peters. Information Retrieval Based on Domain-Specific Word Associations. In Proceedings of PACLING '99, Waterloo, Ontario, Canada, June 1999.

Dominic Widdows, Beate Dorow, and Chiu-Ki Chan. Using Parallel Corpora to enrich Multilingual Lexical Resources. Third International Conference on Language Resources and Evaluation, Las Palmas, May 2002, pages 240-245. (.ps)

Dominic Widdows and Beate Dorow. A Graph Model for Unsupervised Lexical Acquisition. 19th International Conference on Computational Linguistics, Taipei, August 2002, pages 1093-1099. (.ps)

Dominic Widdows, Scott Cederberg and Beate Dorow. Visualisation Techniques for Analysing Meaning. Fifth International Conference on Text, Speech and Dialogue, Brno, Czech Republic, September 2002, pages 107-115. (.ps)

Beate Dorow and Dominic Widdows. Discovering Corpus-Specific Word Senses. EACL 2003, Budapest, Hungary, Conference Companion (research notes and demos), pages 79-82. (.ps)

Dominic Widdows. Unsupervised methods for developing taxonomies by combining syntactic and statistical information. In Proceedings of HLT/NAACL 2003, Edmonton, Canada, June 2003, pages 276-283. (.ps)

Scott Cederberg and Dominic Widdows. Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction. In Seventh Conference on Computational Natural Language Learning (CoNLL-2003), Edmonton, Canada, June 2003, pages 111-118. (.ps)

Dominic Widdows. A Mathematical Model for Context and Word-Meaning. Fourth International and Interdisciplinary Conference on Modeling and Using Context, Stanford, California, June 23-25, 2003, pages 369-382 (.ps)

Dominic Widdows and Stanley Peters. Word Vectors and Quantum Logic: Experiments with negation and disjunction. Eighth Mathematics of Language Conference, Bloomington, Indiana, June 20-22, 2003, pages 141-154 (.ps)

Dominic Widdows. Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and Document Retrieval. 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 7-12, 2003 pages 136-143. (.ps)

Dominic Widdows, Stanley Peters, Scott Cederberg, Chiu-Ki Chan, Diana Steffen and Paul Buitelaar. Unsupervised Monolingual and Bilingual Word-Sense Disambiguation of Medical Documents using UMLS. Natural Language Processing in Biomedicine ACL 2003 Workshop, ACL workshop, Sapporo, Japan, July 11, 2003, pages 9-16 (.ps)

Dominic Widdows, Stanley Peters, Paul Buitelaar, Diana Steffen, Scott Cederberg and Beate Dorow. A Multilingual Medical Information System using unsupervised Word Sense Disambiguation. Under review for Journal of Computers, Speech and Language (.ps)

Dominic Widdows and Stanley Peters. Quantum Logic of Word Meanings: Concept Lattices in Vector Space Models. Under review for Journal of Logic, Language and Information (.ps)

Partners

Organizations actively collaborating with the Infomap project include