During the last three years, the machine learning research at DSV has focused on Inductive Logic Programming (ILP) and Knowledge Discovery in Databases (KDD), covering theory, techniques and applications; the applications have concerned natural language processing (NLP) and medicine. The research funded by TFR has focused on ILP and has primarily been guided by problems that arise when applying existing ILP techniques to problems within NLP and KDD. During this period, effort has also been invested in building up a research group and in initiating cooperation with researchers within the application areas. Both these enterprises have been quite successful: the group has grown from two members to six (three Ph.D. students and one research assistant, funded by the faculty, have been recruited), and the group has developed well-established cooperation with researchers within medicine and NLP.
A brief introduction to ILP is given below, followed by a presentation
of the research group and a description of projects that the group has
taken part in. Finally, the group's national and international
cooperation is described.
1.1 Inductive Logic Programming
Inductive Logic Programming (ILP) is a research area in the
intersection of machine learning and computational logic whose main
goal is the development of theories of and practical algorithms for
inductive learning in first-order logic representation formalisms.
From inductive machine learning, ILP inherits its goal:
to develop tools and techniques for inducing hypotheses from
observations (examples) or to synthesise new knowledge from experience.
By using computational logic as the representation formalism for
hypotheses and observations, ILP
can overcome the two main limitations of classical machine learning
techniques (such as top-down induction of decision trees):
the use of a limited knowledge representation formalism
(essentially propositional logic), and
the difficulty of using substantial background knowledge in
the learning process.
The first limitation is important because many domains
of expertise can only be expressed in first-order logic,
or a variant of first-order logic, and not in propositional logic.
The second limitation is also crucial, because one of the
well-established findings of artificial intelligence (and machine
learning) is that domain knowledge is essential for achieving
intelligent behaviour.
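The difference can be illustrated with a small sketch (a hypothetical family-relations task, not taken from the projects described here): background knowledge is given as a set of parent facts, and the induced hypothesis is the first-order rule grandparent(X,Y) :- parent(X,Z), parent(Z,Y), which quantifies over an intermediate individual Z, something a fixed propositional attribute-value encoding cannot express directly.

```python
# Illustrative only: background knowledge as parent/2 facts, and a
# first-order hypothesis that quantifies over an intermediate Z.
parent = {("ann", "bob"), ("bob", "carl"), ("ann", "dora"), ("dora", "eve")}

def grandparent(x, y):
    # grandparent(X, Y) :- parent(X, Z), parent(Z, Y).
    return any(pz == x and (z, y) in parent for (pz, z) in parent)

positives = [("ann", "carl"), ("ann", "eve")]
negatives = [("bob", "eve"), ("carl", "ann")]

assert all(grandparent(x, y) for x, y in positives)      # covers all positives
assert not any(grandparent(x, y) for x, y in negatives)  # rejects all negatives
```

A propositional learner would need a fixed attribute vector per example; the relational rule instead refers to an unnamed linking individual, which is precisely what the first-order representation buys.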
From computational logic, ILP inherits not only its representational
formalism but also its theoretical orientation and various
well-established techniques. Indeed, in contrast to many other
approaches to inductive learning, ILP is also interested in properties
of inference rules, in the convergence of algorithms (e.g. soundness
and completeness), and in the computational complexity of procedures.
Because of its background, it is no surprise that ILP has a strong
application potential in inductive learning, logic programming and
automatic programming. Strong applications exist in drug design,
protein engineering, medicine, mechanical engineering, etc. The
importance of these applications is clear when considering that, in
the case of drug design and protein engineering, the results were
published in the biological and chemical literature, were obtained
using a general-purpose ILP algorithm, and were transparent to the
experts in the domain.
1.2 Research Group
Henrik Boström, Ph.D., Docent, lecturer at the Dept. of Computer and Systems Sciences (DSV), partly funded by the TFR projects during 1997-1999
Lars Asker, Ph.D., research fellow at DSV, full funding from DSV during 1997-2001
Martin Eineborg, Ph.D. student, full funding by a graduate position at DSV from 1998 until 2002 (on leave from Telia Research AB)
Anette Hulth, Ph.D. student, full funding by a graduate position at DSV from 1999 until 2004
Rickard Cöster, Ph.D. student, full funding by a graduate position at DSV from 1999 until 2004
Tony Lindgren, research assistant, full funding from DSV during 1999-2000
1.3 Projects
Besides the two TFR projects "Inductive Logic Programming for Natural
Language Processing" and "Inductive Logic Programming for Knowledge
Discovery in Databases" the research group has been a partner during
1996-1999 in the EU project Inductive
Logic Programming II (ESPRIT LTR No. 20237) acting as scientific
coordinator for the research area complex hypotheses. The
funding obtained from the EU was 881 KSEK in total and financed a
research assistant at 50%. In contrast to the funding from TFR, the EU
funding could only be used to pay personnel without a permanent
position; hence the TFR funding has been crucial for funding research
performed by permanently employed personnel (Henrik Boström).
Since October 1999, the group has been participating in a NUTEK-funded
national project on data analysis with learning systems (AIS-8), which
involves five academic partners and seven industrial partners (the
group receives 440 KSEK in total for two years).
1.4 National Cooperation
The group has cooperated with Ivan Bretan at Telia Research AB on the
problem of inducing transfer rules, and with Nikolaj Lindberg at
Centre
for Speech Technology (CTT) at KTH and Martin Eineborg at Telia
Research AB on induction of constraint-grammar rules. Since September
1998, Martin Eineborg has been on leave from Telia Research AB,
holding a graduate position at DSV. Since October 1999, Nikolaj
Lindberg has been a guest researcher at DSV.
The group has also cooperated with
Rutger Bentz at Huddinge Hospital on learning rules for diagnosing
blood tests, and with the Biocomplexity group headed by Joakim
Cöster at the Centre for Microbiology and Tumor Biology (MTC)
at Karolinska Institutet.
1.5 International Cooperation
Within the project Inductive
Logic Programming II (ESPRIT LTR No. 20237), the group has
cooperated with the following European universities and institutes:
K.U. Leuven (Belgium), Universities of Oxford, York and Bristol (Great Britain), GMD
(Germany), Jozef Stefan Institute (Slovenia), Universite Paris Sud
(France), Universita di Torino (Italy) and Joszef Attila University (Hungary).
The group has also cooperated with researchers at SRI-Cambridge (Great Britain)
on the problem of inducing transfer rules.
Members of the group also have close contacts with, and have
cooperated or are cooperating with, researchers from several leading
research institutions in the United States. They include NASA Ames Research
Center, the Jet Propulsion Laboratory/California Institute of
Technology, Information Science Institute/University of Southern
California, Institute for the Study of Learning and Expertise, Center
for the Study of Language and Information, University of California at
Irvine, Stanford University and the University of Minnesota.
During 1997-1999, members of the group have served on the programme
committees of the seventh, eighth and ninth international conference
on inductive logic programming (ILP-97, ILP-98, ILP-99), and the
fourteenth and fifteenth national conference on artificial
intelligence (AAAI-97, AAAI-98), the international conference on
computer vision and pattern recognition (CVPR-99) and the first
international workshop on Learning Language in Logic (LLL-99).
Members of the group have also served as reviewers for a number of
additional conferences and international journals, such as
the Machine Learning journal, the Journal of Artificial
Intelligence Research, New Generation Computing
and the Journal of Logic Programming.
2. Inductive Logic Programming for Natural Language Processing
2.1 Induction of Constraint Grammar Rules
The task of a part of speech (POS) tagger is to assign to each word in
a text the contextually correct morphological analysis. POS tagging of
unedited text is an interesting task for several reasons: it is
a well-known problem to which several different techniques have been
applied, it is a hard task, and it has real-world applications (in
text-to-speech conversion, information extraction/retrieval, corpus
linguistics, and so on).
Constraint Grammar (CG) is a successful approach to POS tagging (and
shallow syntactic analysis), developed at Helsinki University.
After a lexicon look-up, which renders the text morphologically
ambiguous, rules discarding contextually impossible readings are
applied until the text is unambiguous or no more rules can be applied.
The rules are hand-coded by experts, and the CG developers report high
accuracy figures for unrestricted English text.
A nice feature of CG is that the rules can be thought of as
independent of each other; new rules can be added, or old ones
changed, without worrying about unexpected behaviour because of
complex dependencies between existing rules or because of rule
ordering.
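The rule-application loop can be sketched as follows (a minimal illustration with a toy lexicon and entirely hypothetical rules, not the actual CG formalism or its rule syntax): lexicon look-up leaves each word with a set of candidate readings, and rules then discard contextually impossible readings until no rule fires.

```python
# Toy sketch of the Constraint Grammar idea: readings are pruned by
# context rules, never below one reading per word.
lexicon = {"the": {"DET"}, "can": {"NOUN", "VERB", "AUX"}, "rusts": {"VERB", "NOUN"}}

# Each rule: (reading to remove, condition on the left neighbour's readings).
# These rules are made up for the toy sentence only.
rules = [
    ("VERB", lambda left: left == {"DET"}),   # no finite verb right after a bare determiner
    ("AUX",  lambda left: left == {"DET"}),   # nor an auxiliary
    ("NOUN", lambda left: left == {"NOUN"}),  # discard a noun reading directly after a noun
]

def disambiguate(words):
    readings = [set(lexicon[w]) for w in words]
    changed = True
    while changed:                            # apply rules until none fires
        changed = False
        for i in range(1, len(readings)):
            for tag, cond in rules:
                # Never remove the last remaining reading of a word.
                if tag in readings[i] and len(readings[i]) > 1 and cond(readings[i - 1]):
                    readings[i].discard(tag)
                    changed = True
    return readings

print(disambiguate(["the", "can", "rusts"]))  # each word reduced to one reading
```

Note how the rules are independent of each other, as described above: each one only removes readings under a local context condition, so adding or changing a rule does not interact with rule ordering.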
The Stockholm-Umeå Corpus (SUC) is a POS tagged, balanced corpus
of 1 000 000 words of Swedish text. The SUC tag set has 146 unique
tags, and the tags consist of a POS tag followed by a (possibly empty)
set of morphological features. There are 24 different POS
categories. The first edition is available on CD-ROM and a second,
corrected edition is on its way.
This task aims at producing a state-of-the-art part-of-speech
tagger for unedited Swedish text. Rules eliminating faulty tags have
been induced using the ILP system Progol on the Stockholm-Umeå corpus. In
previously reported experiments, we used almost no linguistically motivated
background knowledge. Still, the result was rather promising
(recall 97.7%, with a remaining average ambiguity of 1.13
tags/word). Compared to the previous study, a much richer, more
linguistically motivated, background knowledge has been supplied in
this project, consisting of examples of noun phrases, verb chains,
auxiliary verbs, and sets of part of speech categories. The aim has
been to create the background knowledge rapidly, without laborious
hand-coding of linguistic knowledge. In addition to the new background
knowledge, new, more expressive rule types have been induced for two
part of speech categories and compared to the corresponding rules of
the previous baseline experiment. The new rules perform
considerably better, with a recall of 99.4%, compared to 97.6% for the
old rules. Precision was slightly better for the new rules.
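The exact evaluation protocol of these experiments is not reproduced here, but the two reported figures can be read as follows (on made-up toy data): recall is the fraction of words whose remaining readings still contain the correct tag, and the average ambiguity is the mean number of readings left per word after rule application.

```python
# Toy illustration of the two evaluation measures used above.
def evaluate(remaining, gold):
    # recall: share of words whose correct tag survived the pruning
    recall = sum(g in r for r, g in zip(remaining, gold)) / len(gold)
    # ambiguity: mean number of readings still pending per word
    ambiguity = sum(len(r) for r in remaining) / len(remaining)
    return recall, ambiguity

remaining = [{"DET"}, {"NOUN", "VERB"}, {"VERB"}, {"ADV"}]
gold = ["DET", "NOUN", "VERB", "ADJ"]
print(evaluate(remaining, gold))   # → (0.75, 1.25)
```

The trade-off in the text is visible here: discarding more readings lowers the ambiguity figure but risks discarding the correct tag and lowering recall.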
2.2 Induction of Transfer Rules
The Core Language Engine (CLE), developed at SRI Cambridge,
is one of the most comprehensive NLP systems designed to date. When
used as the basis in systems for bilingual communication, two versions
of the system are employed: one for the source language and one for
the target language. Input sentences are analyzed by the source
language version as far as the level of quasi logical form (QLF), and
then undergo transfer, by using so-called transfer rules, into another
QLF having constants and predicates corresponding to word senses in
the target language. The target language CLE then generates an output
sentence from the transferred QLF, using the same linguistic data as
is used for analysis of that language.
A transfer rule specifies a pair of QLF patterns.
The first argument corresponds to a QLF in one language
and the second argument to a QLF in the other. The QLF patterns in these
rules can be QLF (sub)expressions, or such expressions with ``transfer
variables´´ showing the correspondence between the two arguments.
Many transfer rules are only responsible for transferring word senses,
e.g. trans(flight_AirplaneTrip,flygning_Flygresa), while others, the so-called structural transfer rules, are more complex.
Transfer rules are applied recursively, and this process follows the
recursive structure of the source QLF.
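The recursive application can be sketched as follows (QLFs are modelled as nested tuples purely for illustration; only the flight_AirplaneTrip/flygning_Flygresa pair is from the text, the other sense pair and the term shapes are hypothetical):

```python
# Sketch of recursive transfer: word-sense rules map atomic senses,
# and structure is carried over by recursing into the arguments.
word_sense = {
    "flight_AirplaneTrip": "flygning_Flygresa",  # from the text
    "book_Reserve": "boka_Reservera",            # hypothetical pair
}

def transfer(qlf):
    if isinstance(qlf, tuple):          # compound term: recurse on the arguments
        functor, *args = qlf
        return (functor, *[transfer(a) for a in args])
    return word_sense.get(qlf, qlf)     # atomic term: apply a word-sense rule

source = ("event", "book_Reserve", ("obj", "flight_AirplaneTrip"))
print(transfer(source))
# → ('event', 'boka_Reservera', ('obj', 'flygning_Flygresa'))
```

Structural transfer rules would additionally rewrite the shape of the term itself rather than just its atomic senses, which is what makes them harder to induce.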
When developing automatic translators between two languages based on
the CLE, a substantial amount of work goes into the formulation of
transfer rules, which is a tedious and time-consuming task. We have
explored ways of automating this task by using ILP to generate the
transfer rules from pairs of source and target QLFs.
The main features of this problem are: more than one rule may need to
be produced from a single example, only positive examples are
provided, and the produced hypothesis should be recursive.
In this project we have studied the case when the non-recursive
transfer rules are given as input to the learning system. A
preliminary experiment with English-French QLFs obtained from
SRI Cambridge has been conducted using a prototype system
demonstrating that this information is sufficient for the generation
of generally applicable rules that can be used for transfer between
previously unseen source and target QLFs. However, the experiment also
shows that the system suffers from producing overly specific rules,
even when the problem of disallowing the derivation of other target
QLFs than the correct one is not considered. Potential approaches to
this problem have been outlined, but are yet to be explored.
2.3 Learning from Positive Examples Only
Previous bias shift approaches to predicate invention are not
applicable to learning from positive examples only as they need
negative examples to determine whether new predicates should be
invented or not. One approach to this problem has been developed,
MERLIN 2.0, which is a successor of MERLIN 1.0 in which predicate
invention is guided by sequences of input clauses in SLD-refutations
of positive and negative examples w.r.t. an overly general theory. In
contrast to its predecessor which searches for the minimal
finite-state automaton that can generate all positive and no negative
sequences, MERLIN 2.0 uses a technique for inducing Hidden Markov
Models from positive sequences only. This enables the system to invent
new predicates without being triggered by negative examples. Another
advantage of using this induction technique is that it allows for
incremental learning. MERLIN 2.0 is particularly suited for learning
recursive hypotheses. Apart from being able to learn from positive
examples only and invent new predicates, one of the main features of
the system is that it may infer both base clauses and recursive
clauses from a single example. This contrasts with traditional covering
techniques, which produce at most one clause from each example and
furthermore need particular examples from which base clauses, but no
recursive clauses, are to be induced. Experiments have been performed
comparing MERLIN 2.0 with the positive only learning framework of
Progol 4.2 and comparing the original induction technique with a new
version that produces deterministic Hidden Markov Models. The results
show that predicate invention may indeed be both necessary and
possible when learning from positive examples only, and that it can
be beneficial to keep the induced model deterministic.
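MERLIN 2.0's induction algorithm is not reproduced here, but the usual starting point for learning automata from positive sequences only can be sketched: a prefix-tree acceptor built from the positive sequences, which state-merging methods (or, in MERLIN 2.0, Hidden Markov Model induction) would then generalise.

```python
# Prefix-tree acceptor from positive sequences only (a sketch of the
# starting point, not of MERLIN's own generalisation step).
def prefix_tree(sequences):
    trans = {}               # (state, symbol) -> state
    accepting = set()
    next_state = 1           # state 0 is the start state
    for seq in sequences:
        state = 0
        for sym in seq:
            if (state, sym) not in trans:
                trans[(state, sym)] = next_state
                next_state += 1
            state = trans[(state, sym)]
        accepting.add(state)
    return trans, accepting

def accepts(trans, accepting, seq):
    state = 0
    for sym in seq:
        if (state, sym) not in trans:
            return False
        state = trans[(state, sym)]
    return state in accepting

trans, acc = prefix_tree([["a", "b"], ["a", "b", "b"]])
print(accepts(trans, acc, ["a", "b"]))   # True: an observed positive sequence
print(accepts(trans, acc, ["b", "a"]))   # False: never observed
```

The acceptor is deterministic and accepts exactly the positive sequences; the interesting work, as described above, lies in generalising it without negative examples to constrain the merging.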
2.4 Selected Publications
Boström H., ``Predicate Invention and Learning from Positive Examples Only´´, Proc. of the Tenth European Conference on Machine Learning, Springer Verlag (1998) 226-237
Boström H., ``Induction of Recursive Transfer Rules´´, Proc. of the Learning Language in Logic (LLL) Workshop, Bled, Slovenia (1999) 52-62 (to appear in LNAI series, Springer)
Lindberg N. and Eineborg M., ``Improving Part of Speech Disambiguation Rules by Adding Linguistic Knowledge´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 186-197
2.5 Prototype System
MERLIN 2.0 is publicly available and may be used free of charge for academic purposes.
3. Inductive Logic Programming for Knowledge Discovery in Databases
The emerging field of data mining and knowledge discovery in databases has recently
attracted significant attention. It is an area that deals with
automatic or semi-automatic ways to discover significant, valid,
novel, potentially useful, and ultimately understandable patterns in
massive data sets (or in short, data mining attempts to extract
knowledge from data). Data mining builds on techniques from several
disciplines including machine learning, statistics, pattern
recognition, databases, knowledge representation, visualisation and
graphics, optimisation, computational mathematics, and the theory of
algorithms.
The emergence of the field has been triggered by the fact that while
more and more data becomes available in corporate and private
databases and on the internet, our resources for interpreting and
extracting information from the available data remain relatively
constant. Few people or businesses today (at least in the
industrialised part of the world) suffer from lack of access to data,
but rather from an overload of it. It has therefore become essential
to find efficient ways to automatically analyse and extract useful
information from these large data sets.
The ease of understanding as well as the expressive power of the
pattern language is a strong motivation for the use of ILP for
knowledge discovery in databases.
The main goal of the project has been to develop and extend existing ILP algorithms
into efficient tools for knowledge discovery in databases. This
development was primarily guided by the problems that occur when
applying existing algorithms to real world (non-artificial) databases.
One such database that has been used is a database containing several
years of patient records that has been stored at Huddinge Hospital
near Stockholm, Sweden. This dataset contains potentially useful
information regarding the interactions between different medicines,
how multiple diseases affect the diagnosis of patients, etc.
The main problems that need to be addressed when applying ILP
techniques to extract useful knowledge from such databases are the following:
Efficiency - When a large number of examples and background
predicates are available, most current ILP systems have problems with
efficiency. Methods are needed for restricting and efficiently
exploring the search space.
Noise - Most current ILP methods have only limited abilities to
handle noise, and this is a significant problem for knowledge
discovery in real world databases, as it is not to be expected that
useful hypotheses are consistent with all training data.
Continuous-valued attributes - Continuous-valued attributes are
used frequently in many real world databases, including the one at
Huddinge Hospital, which requires that the applied ILP methods
should be able to efficiently and effectively handle such attributes.
3.1 Analysis of Search Strategies
Resolution has been used as a specialisation operator in several
approaches to top-down induction of logic programs. This operator
allows the overly general hypothesis to be used as a declarative bias
that restricts not only what predicate symbols can be used in produced
hypotheses, but also how the predicates can be invoked. The two main
strategies for top-down induction of logic programs,
Separate-and-Conquer (also known as Covering) and
Divide-and-Conquer, have been formalised using resolution as a
specialisation operator, resulting in two strategies for performing
example-guided unfolding. These strategies have been compared both
theoretically and experimentally. It was shown that the computational
cost grows quadratically in the size of the example set for
Separate-and-Conquer, while it grows linearly for
Divide-and-Conquer. This was also demonstrated by experiments, in
which the amount of work performed by Separate-and-Conquer was up to
30 times the amount of work performed by Divide-and-Conquer. The
theoretical analysis shows that the hypothesis space is larger for
Separate-and-Conquer, and thus more compact hypotheses may be found by
this technique than by Divide-and-Conquer. However, it also shows that
for each non-recursive hypothesis that can be produced by
Separate-and-Conquer, there is an equivalent hypothesis (w.r.t. the
background predicates) that can be produced by Divide-and-Conquer. A
major drawback of Divide-and-Conquer, in contrast to
Separate-and-Conquer, is that it is not applicable to learning
recursive definitions.
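The source of the quadratic cost can be seen in a schematic covering loop (a toy learner with a hypothetical majority-label rule, not the actual systems): each induced rule triggers a fresh pass over the remaining examples, so many small rules imply many passes, whereas Divide-and-Conquer partitions each example only once per level.

```python
# Schematic Separate-and-Conquer (covering) loop: induce a rule,
# remove the examples it covers, repeat on what remains.
def separate_and_conquer(examples, learn_rule):
    rules, remaining = [], list(examples)
    while remaining:
        rule = learn_rule(remaining)                        # scans all remaining examples
        rules.append(rule)
        remaining = [e for e in remaining if not rule(e)]   # second scan to remove covered ones
    return rules

def learn_rule(examples):
    # Toy hypothetical learner: cover every example of the most common label.
    labels = [lab for _, lab in examples]
    target = max(set(labels), key=labels.count)
    return lambda ex, t=target: ex[1] == t

data = [(i, "pos" if i % 2 else "neg") for i in range(10)]
rules = separate_and_conquer(data, learn_rule)
print(len(rules))   # one rule per label
```

With many rules each covering few examples, the repeated scans over the shrinking example set add up to cost quadratic in the number of examples, matching the analysis above.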
3.2 A Hybrid Search Strategy
The main advantage of Divide-and-Conquer compared to Separate-and-Conquer
is that it runs in linear time with the number of examples,
while the latter runs in quadratic time. However, Divide-and-Conquer
suffers from difficulties in effectively inducing disjunctive concepts
due to the replication problem. Separate-and-Conquer on the
other hand searches a larger hypothesis space, which in many cases
avoids the replication problem but at the cost of lower
efficiency. During the project, a hybrid strategy called
Reconsider-and-Conquer was developed and incorporated in the SPECTRE
3.0 system; it handles the replication problem more effectively than
Divide-and-Conquer and allows for more efficient induction than
Separate-and-Conquer. Experimental results from propositional,
numerical and relational domains have demonstrated that
Reconsider-and-Conquer significantly reduces the replication problem
from which Divide-and-Conquer suffers and is several times (up to an
order of magnitude) faster than Separate-and-Conquer.
Noise handling was included in SPECTRE 3.0 by upgrading pre- and
post-pruning techniques developed within propositional learning for an
ILP framework. In addition to this, incremental reduced error pruning
has been included in the system, allowing a further significant
speed-up compared to using post-pruning only. Number handling has been
incorporated in SPECTRE 3.0 by adapting a discretisation algorithm
that has previously been used for decision-tree induction. Experiments
on the database with laboratory tests from Huddinge Hospital are
currently under way.
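The discretisation algorithm actually adapted in SPECTRE 3.0 is not given here; the sketch below shows the common decision-tree approach of picking the cut point that minimises class-label entropy, on hypothetical laboratory-test values.

```python
# Entropy-minimising binary discretisation of a continuous attribute
# (a sketch of the standard decision-tree technique, on made-up data).
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (entropy(labels), None)        # no cut unless it improves on the whole set
    for i in range(1, n):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        avg = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if avg < best[0]:
            best = (avg, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

# Hypothetical lab-test values with a class boundary between 3.1 and 4.8:
print(best_cut([1.2, 2.5, 3.1, 4.8, 5.5, 6.0],
               ["normal", "normal", "normal", "high", "high", "high"]))
# prints the midpoint 3.95 of the two boundary values
```

Turning a continuous attribute into such threshold tests is what lets an ILP system treat laboratory measurements with the same rule language as symbolic background predicates.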
3.3 Selected Publications
Boström H. and Asker L., ``Combining Divide-and-Conquer and Separate-and-Conquer for Efficient and Effective Rule Induction´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 33-43
Boström H. and Idestam-Almquist P., ``Induction of Logic Programs by Example-Guided Unfolding´´, Journal of Logic Programming, Vol. 40 (2-3) (1999) 159-183
3.4 Prototype System
SPECTRE 3.0 is publicly available
and may be used free of charge for academic purposes.
4. Impact
The work on producing a state-of-the-art Swedish part-of-speech tagger
by using ILP to learn constraint grammar rules from annotated corpora
has already resulted in a prototype tagger that is currently being used by
several systems at the Centre for Speech Technology (CTT) at
KTH. Researchers from Telia Research AB have also shown great interest
in using the system. The tagger will become publicly available in the near future.
There are also concrete plans to extend the cooperation with researchers at CTT and
apply ILP to learn constraint grammar rules, not only for part-of-speech
tagging, but also for semantic interpretation and
pronunciation. The applicability of this technique to protein secondary
structure prediction is also currently being investigated in cooperation with
researchers from the Centre for Microbiology and Tumor Biology (MTC)
at Karolinska Institutet.
The work on efficient search strategies allows the ILP algorithms to be used
in applications at the cutting edge of today's research in both machine
learning and data mining. The development of these fields is in turn of strategic
importance for Swedish industry and of critical importance for tomorrow's
information access technologies.
5. Publications 1997-1999
Alexin Z., Gyimothy T. and Boström H., ``IMPUT: An Interactive Learning Tool based on Program Specialization´´, Intelligent Data Analysis Vol. 4, Elsevier Holland (1997)
Asker L., Danielson M. and Ekenberg L., ``Intelligent Alarm Handling´´, Proc. of the 12th International FLAIRS Conference, Special Track on Uncertain Reasoning, AAAI/The MIT Press (1999)
Asker L. and Maclin R., ``Feature Engineering and Classifier Selection: A Case Study in Venusian Volcano Detection´´, Proc. of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann (1997)
Asker L. and Maclin R., ``Ensembles as a Sequence of Classifiers´´, Proc. of the Fifteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann (1997)
Boström H., ``Predicate Invention and Learning from Positive Examples Only´´, Proc. of the Tenth European Conference on Machine Learning, Springer Verlag (1998) 226-237
Boström H., ``Induction of Recursive Transfer Rules´´, Proc. of the Learning Language in Logic (LLL) Workshop, Bled, Slovenia (1999) 52-62 (to appear in LNAI series, Springer)
Boström H. and Asker L., ``Combining Divide-and-Conquer and Separate-and-Conquer for Efficient and Effective Rule Induction´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 33-43
Boström H. and Idestam-Almquist P., ``Induction of Logic Programs by Example-Guided Unfolding´´, Journal of Logic Programming, Vol. 40 (2-3) (1999) 159-183
Burl M., Asker L., Smyth P., Fayyad U. M., Perona P., Crumpler L. and Aubele J., ``Learning to Recognize Volcanoes on Venus´´, Machine Learning Journal, Vol. 30, special issue on Applications of Machine Learning and the Knowledge Discovery Process (1998) 134-165
Eineborg M. and Lindberg N., ``Induction of Constraint Grammar-rules using Progol´´, Proc. of the Eighth International Conference on Inductive Logic Programming, LNAI series 1446 (1998) 116-124
Lindberg N. and Eineborg M., ``Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming´´, Proc. of COLING/ACL'98, volume II (1998) 775-779
Lindberg N. and Eineborg M., ``Improving Part of Speech Disambiguation Rules by Adding Linguistic Knowledge´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 186-197