Relevant Papers on CLIR and Other Topics
- Technical issues of cross-language information retrieval: a review. Kazuaki Kishida. Information Processing and Management (2005).
This comprehensive article reviews the state of the art in CLIR. Queries and documents are matched either through cognates (no translation), through query or document translation, or through interlingual techniques such as latent semantic indexing (LSI). Translation can rely on machine translation (MT), dictionaries or parallel corpora. Parallel corpora make it possible to estimate translation probabilities for terms and phrases, and such corpora can also be extracted efficiently from the web.
Disambiguation among multiple translation alternatives is an important challenge. Several techniques address it: POS tags, parallel corpora, co-occurrence statistics from the target corpus, and query expansion. POS tags allow terms to be matched according to their parts of speech. Parallel corpora help align similar terms, possibly in a probabilistic setting, and co-occurrence statistics play a similar role. Query expansion techniques such as pre- and post-translation feedback help improve precision and recall. An important generalization is the structured query model, in which the multiple translations of a term are treated as synonyms and ORed together to form a Boolean query. Other techniques include bi-directional translation and phrasal translation.
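The structured query idea described above can be sketched in a few lines. This is only an illustration, not the formulation used in the article: the function name `build_structured_query`, the toy dictionary and the choice to join the synonym groups with AND are all assumptions made for the example.

```python
def build_structured_query(source_terms, translations):
    """Build a Boolean query string from a bilingual dictionary.

    All translations of one source term are treated as synonyms and
    ORed together. Joining the groups with AND is an assumption of
    this sketch, as is passing untranslated terms through unchanged.
    """
    groups = []
    for term in source_terms:
        alternatives = translations.get(term) or [term]
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)


# Hypothetical English-to-French dictionary, for illustration only.
toy_dictionary = {"bank": ["banque", "rive"], "river": ["fleuve"]}
query = build_structured_query(["bank", "river"], toy_dictionary)
```

With the toy dictionary, the ambiguous term "bank" contributes both of its translations to a single synonym group instead of forcing an early choice between them.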
The author also discusses document scoring, tokenization issues, stop word lists, stemming, user-interface design for interactive CLIR, evaluation techniques and multilingual information retrieval strategies.
Finally, an important technique discussed in this paper is the pivot language approach, where queries in one language are translated into another through an intermediate language. Transitive translation and lexical triangulation are the key concepts here.
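A minimal sketch of these two concepts follows, assuming that lexical triangulation means keeping only the target terms on which several pivot routes agree, which is how the technique is commonly described. The function names and the toy dictionaries are hypothetical.

```python
def transitive_translate(term, source_to_pivot, pivot_to_target):
    """Translate a term into the target language through one pivot language."""
    targets = set()
    for pivot_term in source_to_pivot.get(term, []):
        targets.update(pivot_to_target.get(pivot_term, []))
    return targets


def triangulate(term, routes):
    """Keep only the target terms produced by every pivot route.

    `routes` is a list of (source_to_pivot, pivot_to_target) dictionary
    pairs, one pair per pivot language.
    """
    if not routes:
        return set()
    candidate_sets = [transitive_translate(term, a, b) for a, b in routes]
    return set.intersection(*candidate_sets)


# Hypothetical toy dictionaries: English to German via Spanish and via French.
en_es = {"bank": ["banco"]}
es_de = {"banco": ["Bank", "Sitzbank"]}
en_fr = {"bank": ["banque"]}
fr_de = {"banque": ["Bank"]}

via_spanish = transitive_translate("bank", en_es, es_de)
agreed = triangulate("bank", [(en_es, es_de), (en_fr, fr_de)])
```

The Spanish route alone introduces a spurious sense ("Sitzbank", a bench), while intersecting it with the French route leaves only the translation both routes support.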
- Cross-Language Information Retrieval. Daqing He and Jianqiang Wang. Book chapter, Information Retrieval: Searching in the 21st Century (2009). Wiley.
This chapter starts with the question of why CLIR is necessary and then describes the major approaches and challenges in CLIR. CLIR users need more support to match the effectiveness of monolingual IR. Identifying translation units involves tokenization, stemming, phrase identification and stop word removal. Translation should be attempted on phrases first; if that fails, individual words should be translated. Translation knowledge can be obtained from online and printed dictionaries and from parallel corpora. Out-of-vocabulary (OOV) terms may be transliterated, cognate-matched or expanded to include within-vocabulary terms. Document expansion is another important idea.
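The phrase-first strategy can be sketched as follows. The greedy longest-match policy, the `max_phrase_len` parameter and the decision to pass OOV words through unchanged are assumptions of this example, not details taken from the chapter.

```python
def translate_query(tokens, phrase_dict, word_dict, max_phrase_len=3):
    """Translate a tokenized query, trying phrases before single words."""
    output = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_phrase_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_dict:
                output.append(phrase_dict[phrase])
                i += n
                matched = True
                break
        if not matched:
            # Fall back to word translation; OOV words are kept as-is.
            word = tokens[i]
            output.append(word_dict.get(word, word))
            i += 1
    return output


# Hypothetical toy dictionaries, for illustration only.
phrases = {"information retrieval": "recherche d'information"}
words = {"system": "systeme", "information": "information"}
translated = translate_query(["information", "retrieval", "system"], phrases, words)
```

Here "information retrieval" is translated as a unit, so the word-level entry for "information" is never consulted for that span.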
The authors discuss translation disambiguation, weighting of translation alternatives (the weighted Boolean model) and the use of translation probabilities in term weighting. Interactive CLIR is essential for system improvement. The different stages of interactive CLIR are explained, along with user-assisted query translation and document selection. Finally, the Cranfield-based CLIR evaluation framework, the evaluation of interactive CLIR and current CLIR evaluation frameworks are discussed.
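One common way to use translation probabilities in term weighting is to replace the raw term frequency of a query term with a probability-weighted sum over its translations. The sketch below illustrates that general idea only; the function name, the toy probabilities and the toy document are assumptions, and the chapter may describe a different formulation.

```python
from collections import Counter


def expected_term_frequency(doc_tokens, translation_probs):
    """Estimate the frequency of a source term in a target-language document.

    `translation_probs` maps each target-language translation to its
    probability for one source term.
    """
    counts = Counter(doc_tokens)
    return sum(prob * counts[target] for target, prob in translation_probs.items())


# Hypothetical probabilities for the English term "bank".
bank_probs = {"banque": 0.7, "rive": 0.3}
document = ["banque", "rive", "banque"]
weighted_tf = expected_term_frequency(document, bank_probs)
```

The resulting value can then be passed to any standard weighting scheme in place of a plain term frequency.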
- Indian Language Information Retrieval. Prasenjit Majumder and Mandar Mitra. Book chapter, Guide to OCR for Indic Scripts. Springer.
This chapter gives a short outline of text filtering, question answering, event detection and event tracking. In the Indian language information retrieval (ILIR) setting, there are two primary sources of information: online and offline. Online documents are blogs, newspapers and magazines. Offline documents are printed material and CD-ROMs. Research efforts in ILIR include the Surprise Language Exercise (SLE). The authors discuss the MIRACLE system developed under SLE and note that indexing based on overlapping character n-grams may be more useful when stemming tools are not available for a language or when the text is erroneous, as in OCR-ed text.
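Overlapping character n-gram indexing can be illustrated with a short sketch. The value n = 4, the restriction to within-word n-grams and the function name are assumptions of this example; the MIRACLE system may differ in all three. The sketch also slices on Unicode code points, which is a simplification for Indic scripts where one visible character can span several code points.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of each word in `text`.

    Words no longer than n characters are kept whole.
    """
    grams = []
    for token in text.lower().split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


clean = char_ngrams("retrieval")
noisy = char_ngrams("retrieva1")  # simulated OCR error: "l" read as "1"
shared = set(clean) & set(noisy)
```

Although the two words differ, five of their six n-grams coincide, so a query containing the clean word can still match the OCR-damaged one without any stemmer.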
The authors discuss information extraction, named entity identification, question answering and topic detection and tracking (TDT) issues in ILIR. Finally, they discuss the Indian Language Subtrack at CLEF 2007, the CLIA project and the Forum for Information Retrieval Evaluation (FIRE).