What is information retrieval with human language technology?

Information retrieval tools such as search engines use a range of different methods to index and retrieve documents in a document collection.

To improve these methods, one can use human language technology techniques such as stemming (or query expansion) to find all possible inflections of a search term. This has given significantly better relevance of the retrieved documents, particularly for languages with a lot of inflection. In our experiments with information retrieval and stemming for Swedish, we used a document collection of 50 000 Swedish news texts. Stemming improved precision by 15 percent and relative recall by 18 percent.
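The idea behind stemming in a search engine can be sketched in a few lines of Python. This is only a toy illustration, not the SiteSeeker stemmer: the suffix list, the `min_stem` parameter and the function names `stem`, `build_index` and `search` are all assumptions made for the example. Both the indexed words and the search term are reduced to a stem, so that a query for one inflected form also finds documents containing the other forms.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix list for illustration only;
# a real Swedish stemmer needs many more rules and exceptions.
SUFFIXES = ("arna", "erna", "orna", "ar", "er", "or", "en", "et", "a", "e", "n")


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Look up the stem of a single-word query in the index."""
    return index.get(stem(query), set())


docs = {
    1: "bilen är röd",
    2: "bilar kostar pengar",
    3: "bilarna står parkerade",
    4: "huset är stort",
}
index = build_index(docs)
print(search(index, "bil"))  # matches documents containing bilen, bilar and bilarna
```

Without stemming, the query "bil" would match none of the four example documents, since none of them contains that exact word form.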

Another technique for improving information retrieval is spelling support, which gives the user feedback if he or she misspells a search term. If no document is retrieved, the spelling support suggests another search term that is closely related to the original search term and is present in the document collection. Statistics show that one tenth (10 percent) of all search queries are misspelled. The SiteSeeker spell checker corrects 90 percent of these errors and also improves precision and recall by 4 and 11.5 percent, respectively.
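A minimal sketch of such spelling support follows. It assumes the search engine exposes its vocabulary as a dictionary from terms to document ids, and it uses the similarity ratio from Python's standard `difflib` module as a stand-in for whatever similarity measure the SiteSeeker spell checker actually uses. The function name `suggest` and the sample index are hypothetical.

```python
import difflib


def suggest(query, index, n=1, cutoff=0.6):
    """Suggest up to n indexed terms that resemble a query term with no hits.

    Returns an empty list when the query term is already in the index,
    or when no indexed term is similar enough.
    """
    term = query.lower()
    if term in index:
        return []
    # Only terms that occur in the document collection are candidates,
    # so every suggestion is guaranteed to retrieve at least one document.
    return difflib.get_close_matches(term, list(index), n=n, cutoff=cutoff)


# Hypothetical index: term -> ids of the documents containing it.
index = {"sjukvård": {1, 2}, "vaccination": {3}, "recept": {4}}

print(suggest("vacination", index))
print(suggest("recept", index))
```

Restricting the candidates to the index vocabulary mirrors the requirement that the suggested term be present in the document collection.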

Try our search engine SiteSeeker, with Swedish stemming and built-in spelling support, on the web site of Vårdguiden and on the web site of Lund University.

In our current research project, Infomat, we are working on query expansion and clustering.

We have also developed SiteSeeker Voice, in which a speech interface is applied to the SiteSeeker search engine.

To make it possible to search across several languages, for example in a corporate intranet that contains documents in several languages, one needs dictionaries so that the search query can be translated into the other languages. A word alignment tool is therefore needed to align parallel texts and thereby construct the dictionary. Please read about Uplug and about parallel corpora below; see also TvärSök (in Swedish).
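The two steps, building a dictionary from parallel texts and translating a query with it, can be illustrated with a small Python sketch. It is not how Uplug works: as a simple stand-in for a real word alignment tool, it pairs words by the Dice coefficient of their sentence-level co-occurrence. The function names, the threshold value and the four sentence pairs are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def build_dictionary(sentence_pairs, threshold=0.6):
    """Derive a source -> {target} dictionary from aligned sentence pairs.

    A word pair is kept when its Dice coefficient,
    2 * joint / (source_count + target_count), reaches the threshold.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    dictionary = {}
    for (s, t), both in joint.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= threshold:
            dictionary.setdefault(s, set()).add(t)
    return dictionary


def translate_query(query, dictionary):
    """Replace each query word by its translations; keep unknown words as-is."""
    translated = []
    for word in query.lower().split():
        translated.extend(sorted(dictionary.get(word, {word})))
    return translated


# Hypothetical Swedish-English parallel sentences.
pairs = [
    ("bil", "car"),
    ("båt", "boat"),
    ("stor bil", "big car"),
    ("stor båt", "big boat"),
]
dictionary = build_dictionary(pairs)
print(translate_query("stor bil", dictionary))
```

The translated terms can then be sent to the search engine for the target-language documents. With so little data, co-occurrence statistics are fragile; a real parallel corpus and a proper alignment tool are needed for a usable dictionary.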


Read more

Dalianis, H., H. Xing and X. Zhang 2010. Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction, in the Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, May 19-21, 2010, pp 1700-1705, pdf.

Draghoender, A. and M. Kanhov. 2010. Creating a reusable English - Afrikaans parallel corpora for bilingual dictionary construction. In Proceedings of Third Swedish Language Technology Conference (SLTC-2010), Linköping University, October 28-29, pp. 33-34, pdf.

Draghoender, A. and M. Kanhov. 2010. Creating a reusable English - Afrikaans parallel corpora for bilingual dictionary construction, B.Sc thesis. Department of Computer and Systems Sciences, (DSV), Stockholm University, pdf.

Jongejan, B. and H. Dalianis. 2009. Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In the Proceedings of ACL-2009, Joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Singapore, August 2-7, 2009, pdf.

Dalianis, H., M. Rimka and V. Kann. 2009. Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages. In the Proceedings of the 17th Nordic Conference on Computational Linguistics, Nodalida 2009, Odense, May 15-16, 2009, pdf.

Xing, H. and X. Zhang. 2008. Using parallel corpora and Uplug to create a Chinese-English dictionary. Master Thesis, Department of Computer and Systems Sciences, KTH/Stockholm University, pdf.

Velupillai, S. and H. Dalianis 2008. Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages, in the Proceedings of Workshop MMIES-2: Multi-source, Multilingual Information Extraction and Summarization, Held in conjunction with COLING-2008, Manchester, 23 August, 2008 pdf.

Karlgren, J., H. Dalianis and B. Jongejan 2008. Experiments to investigate the connection between case distribution and topical relevance of search terms in an information retrieval setting. In the Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco, May 28-30, 2008, pdf.

Dalianis, H., M. Rimka and V. Kann 2007. Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian. Workshop: The Automatic Treatment of Multilinguality in Retrieval, Search and Lexicography, Copenhagen, April 2007, pdf.

Charitakis, K. 2007, Using parallel corpora to create a Greek-English dictionary with Uplug, in the Proceedings of Nodalida 2007, The 16th Nordic Conference of Computational Linguistics, 25-26 May 2007 in Tartu, Estonia, pdf

Charitakis, K, 2006, Using parallel corpora to create Greek-English dictionary for web site searching, Master Thesis, Department of Computer and Systems Sciences, KTH/Stockholm University, November 2006, pdf

Dalianis, H. and B. Jongejan. 2006. Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST's Lemmatiser. In the Proceedings of the International Conference on Language Resources and Evaluation, LREC 2006, May 24-26, Genoa, Italy, pdf.

Ntais, G. 2006. Development of a Stemmer for the Greek Language, Master Thesis, Department of Computer and Systems Sciences, KTH-Stockholm university, February 2006, pdf, demo of the Greek stemmer.

Dalianis, H. 2005. Improving search engine retrieval using a compound splitter for Swedish, Presented at Nodalida 2005 - 15th Nordic Conference on Computational Linguistics, May 21-22, Joensuu, Finland, html

Dalianis, H. 2005. To Search and Summarize on Internet with Human Language Technology, in Y. Kiyoiki, B.Wangler, H. Jaakola and H. Kangassalo (eds). Information Modelling and Knowledge Bases XVI, Frontiers in Artificial Intelligence and Applications, Volume 121, pp 344-350, IOS Press 2005, pdf.

Dalianis, H. 2004. To search and summarize in Scandinavia. In the proceedings of The First Baltic Conference, Human Language Technologies - the Baltic Perspective, Riga, Latvia, April 21-22, 2004, pp. 93-97. pdf.

Våge, Dalianis and Iselid. 2003. Informationssökning på Internet, Studentlitteratur. (in Swedish)

Dalianis, H., A. Blomberg, R. Lindgren, J. Carlberger and M. Hassel. 2003. SiteSeeker Voice - A speech controlled search engine. Demonstration held at NODALIDA 2003, the 14th Nordic Conference of Computational Linguistics, Reykjavik, May 30-31, 2003.

Sarr, M. 2003. Improving precision and recall using a spell checker in a search engine. In the Proceedings of NODALIDA 2003, the 14th Nordic Conference of Computational Linguistics, Reykjavik, May 30-31, 2003, pdf (Dr. Hercules Dalianis was supervisor).

Dalianis, H. 2002. Evaluating a spelling support in a search engine. Presented at NLDB 2002, The Seventh International Workshop on the Applications of Natural Language to Information Systems, June 27-28, 2002, Stockholm, pdf.

Carlberger, J., H. Dalianis, M. Hassel and O. Knutsson. 2001. Improving Precision in Information Retrieval for Swedish using Stemming. In the Proceedings of NoDaLiDa-01 - 13th Nordic Conference on Computational Linguistics, May 21-22, 2001, Uppsala, Sweden, pdf.

SeaSum-Search and Summarize project (Completed project) Sökmotor



Responsible for this page: Hercules Dalianis <hercules@kth.se>
Latest change February, 2013