Information Mapping Project
The goal of the Information Mapping project was to understand the
meanings of words based on the way they are used in text. This is
important for language understanding and translation, and also for
intelligent, concept-based information representation and retrieval.
Currently, document retrieval from large document collections--such as
the web, library card catalogs or newspaper archives--is based on
keyword search. A query is posed as a list of words, and any entries
in the database which contain any or all of those specific words are
returned. However, if we treat those query words not as literal
strings of letters, but as representing concepts, then we can
retrieve relevant documents even if they do not contain the specific
words used in the query.
The Infomap project focussed on the development of the following important
mathematical models.
Vector Space Models
The Infomap vector space model, or WORDSPACE, was pioneered by Hinrich
Schütze. The model works by mapping words to points in a
high-dimensional space, by recording the frequency of co-occurrence
between words in the text; for example, the number of times two words
appear in the same document. The distribution of co-occurrences
between a word and some set of content-bearing terms then serves as a
profile of the word's usage, and can accurately associate words with
similar meanings. A word can thus be described by a spectrum of
related words. The user can choose to accept or reject these
relations, thus building an increasingly accurate profile of the
meaning they desire. Generalizing this, by comparing the query words'
profiles to profiles generated for each document, we can return
articles which are conceptually related to the query words, even if
the words themselves do not appear in the text. Vector space models
for text are described here.
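As a rough illustration of the idea above, the following minimal sketch builds co-occurrence profiles from a toy corpus and compares them with cosine similarity. The corpus, the list of content-bearing terms, and the function names are all hypothetical, and the Infomap software itself may work quite differently.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data): each inner list is one "document".
documents = [
    ["doctor", "nurse", "hospital", "patient"],
    ["physician", "nurse", "hospital", "patient"],
    ["guitar", "drum", "concert", "band"],
    ["piano", "drum", "concert", "band"],
]

# Content-bearing terms that define the dimensions of the space (assumed).
content_terms = ["nurse", "hospital", "patient", "drum", "concert", "band"]


def cooccurrence_vector(word, docs, terms):
    """Count how often `word` shares a document with each content term."""
    counts = Counter()
    for doc in docs:
        if word in doc:
            for term in terms:
                if term != word and term in doc:
                    counts[term] += 1
    return [counts[term] for term in terms]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doctor = cooccurrence_vector("doctor", documents, content_terms)
physician = cooccurrence_vector("physician", documents, content_terms)
guitar = cooccurrence_vector("guitar", documents, content_terms)

sim_doctor_physician = cosine(doctor, physician)
sim_doctor_guitar = cosine(doctor, guitar)
```

In this toy corpus "doctor" and "physician" never appear in the same document, yet their profiles coincide because they co-occur with the same content terms, while "doctor" and "guitar" share no terms at all.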
Later on in the project, work on the logical properties of WORDSPACE
by Dominic Widdows and Stanley Peters demonstrated that WORDSPACE can
naturally be navigated using the same logic as quantum mechanics, with
powerful and exciting consequences for modelling word-meanings. This
aspect of WORDSPACE is described here.
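One operation from this line of work is orthogonal negation, where "a NOT b" is taken to be the part of vector a that is orthogonal to vector b. The sketch below shows only that projection step; the three-dimensional vectors and the function names are invented for illustration and are not taken from the Infomap software.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def negate(a, b):
    """Return 'a NOT b': a minus its projection onto b."""
    bb = dot(b, b)
    if bb == 0:
        return list(a)  # nothing to remove when b is the zero vector
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]


# Hypothetical word vectors over three made-up dimensions
# (music, concert, geology).
rock = [2.0, 1.0, 2.0]
band = [1.0, 1.0, 0.0]

rock_not_band = negate(rock, band)
```

The resulting vector has zero scalar product with `band`, so under cosine similarity it is unrelated to the unwanted sense while still keeping the geology component of `rock`.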
As part of our goal to distribute methods and tools to the wider
scientific community, the Infomap project also released open-source
software that researchers and practitioners can use to build and
navigate WORDSPACE models. The software is available from here.
Try the WORDSPACE Demo
Link Analysis Models
Link analysis has proved to be a popular tool in recent years for
many applications, including ranking web pages and studying social and
professional networks. The technique also proved to be a powerful
method for building clusters of similar words, and recognising
ambiguous words spanning several clusters. The graph model for word
meanings, pioneered by Dominic Widdows and Beate Dorow, is described
here.
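To make the notion of an ambiguous word spanning several clusters concrete, here is a deliberately simplified sketch, not the published algorithm. It assumes a graph whose edges link words observed together (the word pairs are hypothetical), and it treats a word as ambiguous when its neighbours split into more than one connected group once the word itself is ignored.

```python
from collections import defaultdict

# Hypothetical word pairs, e.g. nouns seen together in lists in a corpus.
pairs = [
    ("mouse", "rat"), ("rat", "hamster"), ("mouse", "hamster"),
    ("mouse", "keyboard"), ("keyboard", "monitor"), ("mouse", "monitor"),
]


def build_graph(word_pairs):
    """Build an undirected graph as a dict mapping each word to its neighbours."""
    graph = defaultdict(set)
    for a, b in word_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def neighbour_clusters(graph, word):
    """Group the neighbours of `word` into connected components,
    following only edges between neighbours and ignoring `word` itself."""
    unvisited = set(graph[word])
    clusters = []
    while unvisited:
        start = unvisited.pop()
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt != word and nxt in unvisited:
                    unvisited.discard(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters


graph = build_graph(pairs)
mouse_clusters = neighbour_clusters(graph, "mouse")
rat_clusters = neighbour_clusters(graph, "rat")
```

Here the neighbours of "mouse" fall into an animal group and a computer-hardware group, suggesting two senses, whereas the neighbours of "rat" form a single group.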
Try the Link Analysis Demo
Dictionaries, Knowledge Bases, and the Medical Domain
As well as building models directly from free text, we performed
groundbreaking research on integrating these automatically derived
models with information from traditional dictionaries and knowledge
bases. Some of this research is described here.
For example, we were able to use the Unified Medical Language System
(UMLS) to build automatic word sense disambiguation tools for use with
medical search engines, as described here.
Other Infomap research included:
- The adaptation of our work to the medical domain, which was carried
out as part of the MuchMore project.
- Understanding and developing the linguistic, logical and algebraic
theory which we use to encode and model the meaning of words.
- Investigating how different training corpora affect the resulting
associations between meanings. You can see some of this for yourself
by comparing search results using the different models available on
our demo. For example, words like "heart" have much more specific
associations in the medical domain than in general newswire.
- Detecting and resolving ambiguity. Speakers naturally disambiguate
words for one another, using simple phrases like "Do you mean fire
as in burning or fire as in guns?" We have developed algorithms which
model this process.
- Improving options for the user and methods of interaction. Our
users can add and remove aspects of a word's meaning from a query,
create clusters of word senses, and visualize the meaning of
different words and queries. You can learn more about these options
and try them out on the demo.
- Applying these techniques to build concept-based, cross-language
information systems for medical documents. You can try out a part of
this project on the bilingual Infomap demo.
Infomap techniques proved to be naturally adaptable and were
successfully applied to German and Japanese with little modification.
We have used the concept space created by a bilingual corpus to
perform cross-language information retrieval.
Ultimately, if we can request information in one language and retrieve
articles written in another, information retrieval will be freed from
the constraints of a particular language and users will be able to draw
from a rich pool of untranslated materials previously unavailable to
them.
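One way to picture a concept space built from a bilingual corpus is sketched below, under the assumption that the corpus consists of aligned document pairs. Each word, in either language, is represented by the aligned pairs it occurs in, so words from both languages live in the same space. The data, the `en:`/`de:` prefixes, and the function names are hypothetical.

```python
import math

# Hypothetical parallel corpus: (English document, German document) pairs.
parallel_docs = [
    (["heart", "attack", "patient"], ["herz", "infarkt", "patient"]),
    (["heart", "surgery"], ["herz", "operation"]),
    (["bone", "fracture", "patient"], ["knochen", "fraktur", "patient"]),
]


def word_vectors(pairs):
    """Map each language-tagged word to its counts per aligned pair."""
    vectors = {}
    for i, (en_doc, de_doc) in enumerate(pairs):
        for lang, doc in (("en", en_doc), ("de", de_doc)):
            for word in doc:
                vec = vectors.setdefault(f"{lang}:{word}", [0] * len(pairs))
                vec[i] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_in_language(query, lang, vectors):
    """Rank the words of `lang` by similarity to the tagged word `query`."""
    q = vectors[query]
    scored = [(cosine(q, v), w) for w, v in vectors.items()
              if w.startswith(lang + ":")]
    return [w for _, w in sorted(scored, key=lambda p: (-p[0], p[1]))]


vectors = word_vectors(parallel_docs)
german_for_heart = nearest_in_language("en:heart", "de", vectors)
```

With this toy data, the German word ranked closest to the English query "heart" is "herz", because the two words occur in exactly the same aligned pairs.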
References
- "The Search for Mr. Goodfile Generates New Online Tools." Research
News, Science 276:5318, 6 June 1997.
- Raymond Flournoy, Ryan Ginstrom, Kenichi Imai, Stefan Kaufmann,
Genichiro Kikui, Stanley Peters, Hinrich Schütze, and Yasuhiro Takayama.
Personalization and Users' Semantic Expectations.
ACM SIGIR'98 Workshop on Query Input and User Expectations,
Melbourne, Australia, August 1998.
- Raymond Flournoy, Hiroshi Masuichi, and Stanley Peters.
Cross-Language Information Retrieval: Some Methods and Tools. In
D. Hiemstra, F. de Jong and K. Netter, eds, TWLT 14 Language
Technology in Multimedia Information Retrieval, pp. 79-83,
Universiteit Twente: Enschede, The Netherlands, 1998.
- Genichiro Kikui. Term-list Translation Using Monolingual Word
Co-occurrence Vectors. COLING-ACL '98, Montreal, August 1998.
- Hiroshi Masuichi, Raymond Flournoy, Stefan Kaufmann, and Stanley
Peters.
Query Translation Method for Cross Language Information Retrieval.
In Proceedings of the Workshop on Machine Translation for
Cross Language Information Retrieval, MT Summit VII, pp. 30-34,
Singapore, September 1999.
- Hinrich Schütze.
Ambiguity Resolution in Language Learning.
CSLI Publications, 1997.
CSLI Lecture Notes number 71.
- Yasuhiro Takayama, Raymond Flournoy, Stefan Kaufmann,
and Stanley Peters.
Information Retrieval Based on Domain-Specific Word Associations.
In Proceedings of PACLING '99, Waterloo, Ontario, Canada, June 1999.
- Dominic Widdows, Beate Dorow, and Chiu-Ki Chan.
Using Parallel Corpora to enrich Multilingual Lexical Resources.
Third International Conference on Language Resources and Evaluation,
Las Palmas, May 2002, pages 240-245.
- Dominic Widdows and Beate Dorow.
A Graph Model for Unsupervised Lexical Acquisition.
19th International Conference on Computational Linguistics,
Taipei, August 2002, pages 1093-1099.
- Dominic Widdows, Scott Cederberg and Beate Dorow.
Visualisation Techniques for Analysing Meaning.
Fifth International Conference on Text, Speech and Dialogue,
Brno, Czech Republic, September 2002, pages 107-115.
- Beate Dorow and Dominic Widdows.
Discovering Corpus-Specific Word Senses.
EACL 2003, Budapest, Hungary, Conference Companion (research notes
and demos), pages 79-82.
- Dominic Widdows.
Unsupervised methods for developing taxonomies by combining
syntactic and statistical information. In Proceedings of
HLT/NAACL 2003, Edmonton, Canada, June 2003, pages 276-283.
- Scott Cederberg and Dominic Widdows.
Using LSA and Noun Coordination Information to Improve the Precision
and Recall of Automatic Hyponymy Extraction. In Seventh Conference
on Computational Natural Language Learning (CoNLL-2003),
Edmonton, Canada, June 2003, pages 111-118.
- Dominic Widdows.
A Mathematical Model for Context and Word-Meaning.
Fourth International and Interdisciplinary Conference on Modeling and
Using Context, Stanford, California, June 23-25, 2003, pages 369-382.
- Dominic Widdows and Stanley Peters.
Word Vectors and Quantum Logic: Experiments with negation and
disjunction. Eighth Mathematics of Language Conference, Bloomington,
Indiana, June 20-22, 2003, pages 141-154.
- Dominic Widdows.
Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and
Document Retrieval.
41st Annual Meeting of the Association for Computational
Linguistics, Sapporo, Japan, July 7-12, 2003, pages 136-143.
- Dominic Widdows, Stanley Peters, Scott Cederberg, Chiu-Ki Chan,
Diana Steffen and Paul Buitelaar.
Unsupervised Monolingual and Bilingual Word-Sense Disambiguation
of Medical Documents using UMLS. Natural Language Processing in
Biomedicine, ACL 2003 Workshop, Sapporo, Japan, July 11, 2003,
pages 9-16.
- Dominic Widdows, Stanley Peters, Paul Buitelaar,
Diana Steffen, Scott Cederberg and Beate Dorow.
A Multilingual Medical Information
System using unsupervised Word Sense Disambiguation.
Under review for Journal of Computers, Speech and Language.
- Dominic Widdows and Stanley Peters.
Quantum Logic of Word Meanings: Concept Lattices in Vector Space
Models.
Under review for Journal of Logic, Language and Information.
Partners
Organizations actively collaborating with the Infomap project include