During the last three years, the machine learning research at DSV has focused on Inductive Logic Programming (ILP) and Knowledge Discovery in Databases (KDD), covering theory, techniques and applications; the applications have concerned natural language processing (NLP) and medicine. The research funded by TFR has focused on ILP and has primarily been guided by problems that arise when applying existing ILP techniques to problems within NLP and KDD. During this period, effort has also been invested in building up a research group and in initiating cooperation with researchers within the application areas. Both these enterprises have been quite successful: the group has grown from two members to six (three Ph.D. students and one research assistant, funded by the faculty, have been recruited), and the group has developed well-established cooperation with researchers within medicine and NLP.
A brief introduction to ILP is given below, followed by a presentation
of the research group and a description of projects that the group has
taken part in. Finally, the group's national and international
cooperation is described.
1.1 Inductive Logic Programming
Inductive Logic Programming (ILP) is a research area in the
intersection of machine learning and computational logic whose main
goal is the development of theories of and practical algorithms for
inductive learning in first-order logic representation formalisms.
From inductive machine learning, ILP inherits its goal:
to develop tools and techniques for inducing hypotheses from
observations (examples) or to synthesise new knowledge from experience.
By using computational logic as the representation formalism for
hypotheses and observations, ILP
can overcome the two main limitations of classical machine learning
techniques (such as top-down induction of decision trees):
the use of a limited knowledge representation formalism
(essentially propositional logic), and
the difficulty of using substantial background knowledge in
the learning process.
The first limitation is important because many domains
of expertise can only be expressed in first-order logic,
or a variant of first-order logic, and not in propositional logic.
The second limitation is also crucial, because one of the
well-established findings of artificial intelligence (and machine
learning) is that domain knowledge is essential for achieving
intelligent behaviour.
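The difference can be illustrated with a small sketch (a hypothetical family-relations task, not taken from the projects described here): background knowledge is given as a set of parent facts, and the induced hypothesis is the first-order rule grandparent(X,Y) :- parent(X,Z), parent(Z,Y), which quantifies over an intermediate individual Z, something a fixed propositional attribute-value encoding cannot express directly.

```python
# Illustrative only: background knowledge as parent/2 facts, and a
# first-order hypothesis that quantifies over an intermediate Z.
parent = {("ann", "bob"), ("bob", "carl"), ("ann", "dora"), ("dora", "eve")}

def grandparent(x, y):
    # grandparent(X, Y) :- parent(X, Z), parent(Z, Y).
    return any(pz == x and (z, y) in parent for (pz, z) in parent)

positives = [("ann", "carl"), ("ann", "eve")]
negatives = [("bob", "eve"), ("carl", "ann")]

assert all(grandparent(x, y) for x, y in positives)      # covers all positives
assert not any(grandparent(x, y) for x, y in negatives)  # rejects all negatives
```

A propositional learner would need a fixed attribute vector per example; the relational rule instead refers to an unnamed linking individual, which is precisely what the first-order representation buys.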
From computational logic, ILP inherits not only its representational
formalism but also its theoretical orientation and various
well-established techniques. Indeed, in contrast to many other
approaches to inductive learning, ILP is also interested in properties
of inference rules, in the convergence of algorithms (e.g. soundness
and completeness), and in the computational complexity of procedures.
Because of its background, it is no surprise that ILP has a strong
application potential in inductive learning, logic programming and
automatic programming. Strong applications exist in drug design,
protein engineering, medicine, mechanical engineering, etc. The
importance of these applications is clear when considering that, in
the case of drug design and protein engineering, the results were
published in the biological and chemical literature, were obtained
using a general-purpose ILP algorithm, and were transparent to the
experts in the domain.
1.2 Research Group
Henrik Boström, Ph.D., Docent, lecturer at the Dept. of Computer and Systems Sciences (DSV), partly funded by the TFR projects during 1997-1999
Lars Asker, Ph.D., research fellow at DSV, full funding from DSV during 1997-2001
Martin Eineborg, Ph.D. student, full funding by a graduate position at DSV from 1998 until 2002 (on leave from Telia Research AB)
Anette Hulth, Ph.D. student, full funding by a graduate position at DSV from 1999 until 2004
Rickard Cöster, Ph.D. student, full funding by a graduate position at DSV from 1999 until 2004
Tony Lindgren, research assistant, full funding from DSV during 1999-2000
1.3 Projects
Besides the two TFR projects "Inductive Logic Programming for Natural
Language Processing" and "Inductive Logic Programming for Knowledge
Discovery in Databases" the research group has been a partner during
1996-1999 in the EU project Inductive
Logic Programming II (ESPRIT LTR No. 20237) acting as scientific
coordinator for the research area complex hypotheses. The
funding obtained from the EU was 881 KSEK in total and financed a
research assistant at 50%. In contrast to the funding from TFR, the EU
funding could only be used to pay personnel without a permanent
position; hence the TFR funding has been crucial for funding research
performed by permanently employed personnel (Henrik Boström).
Since October 1999, the group has been participating in a NUTEK-funded
national project on data analysis with learning systems (AIS-8), which
involves five academic partners and seven industrial partners (the
group receives 440 KSEK in total for two years).
1.4 National Cooperation
The group has cooperated with Ivan Bretan at Telia Research AB on the
problem of inducing transfer rules, and with Nikolaj Lindberg at
Centre
for Speech Technology (CTT) at KTH and Martin Eineborg at Telia
Research AB on induction of constraint-grammar rules. Since September
1998, Martin Eineborg has been on leave from Telia Research AB,
holding a graduate position at DSV. Since October 1999, Nikolaj
Lindberg has been a guest researcher at DSV.
The group has also cooperated with
Rutger Bentz at Huddinge Hospital on learning rules for diagnosing
blood tests, and with the Biocomplexity group headed by Joakim
Cöster at the Centre for Microbiology and Tumor Biology (MTC)
at Karolinska Institutet.
1.5 International Cooperation
Within the project Inductive
Logic Programming II (ESPRIT LTR No. 20237), the group has
cooperated with the following European universities and institutes:
K.U. Leuven (Belgium), Universities of Oxford, York and Bristol (Great Britain), GMD
(Germany), Jozef Stefan Institute (Slovenia), Universite Paris Sud
(France), Universita di Torino (Italy) and Joszef Attila University (Hungary).
The group has also cooperated with researchers at SRI-Cambridge (Great Britain)
on the problem of inducing transfer rules.
Members of the group also have close contacts with, and have
cooperated or are cooperating with, researchers from several leading
research institutions in the United States. They include NASA Ames Research
Center, the Jet Propulsion Laboratory/California Institute of
Technology, Information Science Institute/University of Southern
California, Institute for the Study of Learning and Expertise, Center
for the Study of Language and Information, University of California at
Irvine, Stanford University and the University of Minnesota.
During 1997-1999, members of the group have served on the programme
committees of the seventh, eighth and ninth international conference
on inductive logic programming (ILP-97, ILP-98, ILP-99), and the
fourteenth and fifteenth national conference on artificial
intelligence (AAAI-97, AAAI-98), the international conference on
computer vision and pattern recognition (CVPR-99) and the first
international workshop on Learning Language in Logic (LLL-99).
Members of the group have also served as reviewers for a number of
additional conferences and international journals, such as
the Machine Learning journal, the Journal of Artificial
Intelligence Research, New Generation Computing
and the Journal of Logic Programming.
2. Inductive Logic Programming for Natural Language Processing
2.1 Induction of Constraint Grammar Rules
The task of a part of speech (POS) tagger is to assign to each word in
a text the contextually correct morphological analysis. POS tagging of
unedited text is an interesting task for several reasons: it is
a well-known problem to which several different techniques have been
applied, it is a hard task, and it has real-world applications (in
text-to-speech conversion, information extraction/retrieval, corpus
linguistics, and so on).
Constraint Grammar (CG) is a successful approach to POS tagging (and
shallow syntactic analysis), developed at Helsinki University.
After a lexicon look-up, which renders the text morphologically
ambiguous, rules discarding contextually impossible readings are
applied until the text is unambiguous or no more rules can be applied.
The rules are hand-coded by experts, and the CG developers report high
accuracy figures for unrestricted English text.
A nice feature of CG is that the rules can be thought of as
independent of each other; new rules can be added, or old ones
changed, without worrying about unexpected behaviour because of
complex dependencies between existing rules or because of rule
ordering.
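The rule-application loop can be sketched as follows (a minimal illustration with a toy lexicon and entirely hypothetical rules, not the actual CG formalism or its rule syntax): lexicon look-up leaves each word with a set of candidate readings, and rules then discard contextually impossible readings until no rule fires.

```python
# Toy sketch of the Constraint Grammar idea: readings are pruned by
# context rules, never below one reading per word.
lexicon = {"the": {"DET"}, "can": {"NOUN", "VERB", "AUX"}, "rusts": {"VERB", "NOUN"}}

# Each rule: (reading to remove, condition on the left neighbour's readings).
# These rules are made up for the toy sentence only.
rules = [
    ("VERB", lambda left: left == {"DET"}),   # no finite verb right after a bare determiner
    ("AUX",  lambda left: left == {"DET"}),   # nor an auxiliary
    ("NOUN", lambda left: left == {"NOUN"}),  # discard a noun reading directly after a noun
]

def disambiguate(words):
    readings = [set(lexicon[w]) for w in words]
    changed = True
    while changed:                            # apply rules until none fires
        changed = False
        for i in range(1, len(readings)):
            for tag, cond in rules:
                # Never remove the last remaining reading of a word.
                if tag in readings[i] and len(readings[i]) > 1 and cond(readings[i - 1]):
                    readings[i].discard(tag)
                    changed = True
    return readings

print(disambiguate(["the", "can", "rusts"]))  # each word reduced to one reading
```

Note how the rules are independent of each other, as described above: each one only removes readings under a local context condition, so adding or changing a rule does not interact with rule ordering.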
The Stockholm-Umeå Corpus (SUC) is a POS tagged, balanced corpus
of 1 000 000 words of Swedish text. The SUC tag set has 146 unique
tags, and the tags consist of a POS tag followed by a (possibly empty)
set of morphological features. There are 24 different POS
categories. The first edition is available on CD-ROM and a second,
corrected edition is on its way.
This task aims at producing a state-of-the-art part-of-speech
tagger for unedited Swedish text. Rules eliminating faulty tags have
been induced using the ILP system Progol on the Stockholm-Umeå corpus. In
previously reported experiments, we used almost no linguistically motivated
background knowledge. Still, the result was rather promising
(recall 97.7%, with a remaining average ambiguity of 1.13
tags/word). Compared to the previous study, a much richer, more
linguistically motivated, background knowledge has been supplied in
this project, consisting of examples of noun phrases, verb chains,
auxiliary verbs, and sets of part of speech categories. The aim has
been to create the background knowledge rapidly, without laborious
hand-coding of linguistic knowledge. In addition to the new background
knowledge, new, more expressive rule types have been induced for two
part of speech categories and compared to the corresponding rules of
the previous baseline experiment. The new rules perform
considerably better, with a recall of 99.4%, compared to 97.6% for the
old rules. Precision was slightly better for the new rules.
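The exact evaluation protocol of these experiments is not reproduced here, but the two reported figures can be read as follows (on made-up toy data): recall is the fraction of words whose remaining readings still contain the correct tag, and the average ambiguity is the mean number of readings left per word after rule application.

```python
# Toy illustration of the two evaluation measures used above.
def evaluate(remaining, gold):
    # recall: share of words whose correct tag survived the pruning
    recall = sum(g in r for r, g in zip(remaining, gold)) / len(gold)
    # ambiguity: mean number of readings still pending per word
    ambiguity = sum(len(r) for r in remaining) / len(remaining)
    return recall, ambiguity

remaining = [{"DET"}, {"NOUN", "VERB"}, {"VERB"}, {"ADV"}]
gold = ["DET", "NOUN", "VERB", "ADJ"]
print(evaluate(remaining, gold))   # → (0.75, 1.25)
```

The trade-off in the text is visible here: discarding more readings lowers the ambiguity figure but risks discarding the correct tag and lowering recall.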
2.2 Induction of Transfer Rules
The Core Language Engine (CLE), developed at SRI Cambridge,
is one of the most comprehensive NLP systems designed to date. When
used as the basis in systems for bilingual communication, two versions
of the system are employed: one for the source language and one for
the target language. Input sentences are analyzed by the source
language version as far as the level of quasi logical form (QLF), and
then undergo transfer, by using so-called transfer rules, into another
QLF having constants and predicates corresponding to word senses in
the target language. The target language CLE then generates an output
sentence from the transferred QLF, using the same linguistic data as
is used for analysis of that language.
A transfer rule specifies a pair of QLF patterns.
The first argument corresponds to a QLF in one language
and the second argument to a QLF in the other. The QLF patterns in these
rules can be QLF (sub)expressions, or such expressions with ``transfer
variables´´ showing the correspondence between the two arguments.
Many transfer rules are only responsible for transferring word senses,
e.g. trans(flight_AirplaneTrip,flygning_Flygresa), while others, the so-called structural transfer rules, are more complex.
Transfer rules are applied recursively, and this process follows the
recursive structure of the source QLF.
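The recursive application can be sketched as follows (QLFs are modelled as nested tuples purely for illustration; only the flight_AirplaneTrip/flygning_Flygresa pair is from the text, the other sense pair and the term shapes are hypothetical):

```python
# Sketch of recursive transfer: word-sense rules map atomic senses,
# and structure is carried over by recursing into the arguments.
word_sense = {
    "flight_AirplaneTrip": "flygning_Flygresa",  # from the text
    "book_Reserve": "boka_Reservera",            # hypothetical pair
}

def transfer(qlf):
    if isinstance(qlf, tuple):          # compound term: recurse on the arguments
        functor, *args = qlf
        return (functor, *[transfer(a) for a in args])
    return word_sense.get(qlf, qlf)     # atomic term: apply a word-sense rule

source = ("event", "book_Reserve", ("obj", "flight_AirplaneTrip"))
print(transfer(source))
# → ('event', 'boka_Reservera', ('obj', 'flygning_Flygresa'))
```

Structural transfer rules would additionally rewrite the shape of the term itself rather than just its atomic senses, which is what makes them harder to induce.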
When developing automatic translators between two languages based on
the CLE, a substantial amount of work goes into the formulation of
transfer rules, which is a tedious and time-consuming task. We have
explored ways of automating this task by using ILP to generate the
transfer rules from pairs of source and target QLFs.
The main features of this problem are: more than one rule may need to
be produced from a single example, only positive examples are
provided, and the produced hypothesis should be recursive.
In this project we have studied the case when the non-recursive
transfer rules are given as input to the learning system. A
preliminary experiment with English-French QLFs obtained from
SRI Cambridge has been conducted using a prototype system
demonstrating that this information is sufficient for the generation
of generally applicable rules that can be used for transfer between
previously unseen source and target QLFs. However, the experiment also
shows that the system suffers from producing overly specific rules,
even when the problem of disallowing the derivation of other target
QLFs than the correct one is not considered. Potential approaches to
this problem have been outlined, but are yet to be explored.
2.3 Learning from Positive Examples Only
Previous bias shift approaches to predicate invention are not
applicable to learning from positive examples only as they need
negative examples to determine whether new predicates should be
invented or not. One approach to this problem has been developed,
MERLIN 2.0, which is a successor of MERLIN 1.0 in which predicate
invention is guided by sequences of input clauses in SLD-refutations
of positive and negative examples w.r.t. an overly general theory. In
contrast to its predecessor which searches for the minimal
finite-state automaton that can generate all positive and no negative
sequences, MERLIN 2.0 uses a technique for inducing Hidden Markov
Models from positive sequences only. This enables the system to invent
new predicates without being triggered by negative examples. Another
advantage of using this induction technique is that it allows for
incremental learning. MERLIN 2.0 is particularly suited for learning
recursive hypotheses. Apart from being able to learn from positive
examples only and invent new predicates, one of the main features of
the system is that it may infer both base clauses and recursive
clauses from a single example. This contrasts with traditional covering
techniques, which produce at most one clause from each example and
furthermore need particular examples from which base clauses, but no
recursive clauses, are to be induced. Experiments have been performed
comparing MERLIN 2.0 with the positive only learning framework of
Progol 4.2 and comparing the original induction technique with a new
version that produces deterministic Hidden Markov Models. The results
show that predicate invention may indeed be both necessary and
possible when learning from positive examples only, and that it can
be beneficial to keep the induced model deterministic.
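MERLIN 2.0's induction algorithm is not reproduced here, but the usual starting point for learning automata from positive sequences only can be sketched: a prefix-tree acceptor built from the positive sequences, which state-merging methods (or, in MERLIN 2.0, Hidden Markov Model induction) would then generalise.

```python
# Prefix-tree acceptor from positive sequences only (a sketch of the
# starting point, not of MERLIN's own generalisation step).
def prefix_tree(sequences):
    trans = {}               # (state, symbol) -> state
    accepting = set()
    next_state = 1           # state 0 is the start state
    for seq in sequences:
        state = 0
        for sym in seq:
            if (state, sym) not in trans:
                trans[(state, sym)] = next_state
                next_state += 1
            state = trans[(state, sym)]
        accepting.add(state)
    return trans, accepting

def accepts(trans, accepting, seq):
    state = 0
    for sym in seq:
        if (state, sym) not in trans:
            return False
        state = trans[(state, sym)]
    return state in accepting

trans, acc = prefix_tree([["a", "b"], ["a", "b", "b"]])
print(accepts(trans, acc, ["a", "b"]))   # True: an observed positive sequence
print(accepts(trans, acc, ["b", "a"]))   # False: never observed
```

The acceptor is deterministic and accepts exactly the positive sequences; the interesting work, as described above, lies in generalising it without negative examples to constrain the merging.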
2.4 Selected Publications
Boström H., ``Predicate Invention and Learning from Positive Examples Only´´, Proc. of the Tenth European Conference on Machine Learning, Springer Verlag (1998) 226-237
Boström H., ``Induction of Recursive Transfer Rules´´, Proc. of the Learning Language in Logic (LLL) Workshop, Bled, Slovenia (1999) 52-62 (to appear in LNAI series, Springer)
Lindberg N. and Eineborg M., ``Improving Part of Speech Disambiguation Rules by Adding Linguistic Knowledge´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 186-197
2.5 Prototype System
MERLIN 2.0 is publicly available and may be used free of charge for academic purposes.
3. Inductive Logic Programming for Knowledge Discovery in Databases
The emerging field of data mining and knowledge discovery in databases has recently
attracted significant attention. It is an area that deals with
automatic or semi-automatic ways to discover significant, valid,
novel, potentially useful, and ultimately understandable patterns in
massive data sets (or in short, data mining attempts to extract
knowledge from data). Data mining builds on techniques from several
disciplines including machine learning, statistics, pattern
recognition, databases, knowledge representation, visualisation and
graphics, optimisation, computational mathematics, and the theory of
algorithms.
The emergence of the field has been triggered by the fact that while
more and more data becomes available in corporate and private
databases and on the internet, our resources for interpreting and
extracting information from the available data remain relatively
constant. Few people or businesses today (at least in the
industrialised part of the world) suffer from lack of access to data,
but rather from an overload of it. It has therefore become essential
to find efficient ways to automatically analyse and extract useful
information from these large data sets.
The ease of understanding as well as the expressive power of the
pattern language is a strong motivation for the use of ILP for
knowledge discovery in databases.
The main goal of the project has been to develop and extend existing ILP algorithms
into efficient tools for knowledge discovery in databases. This
development was primarily guided by the problems that occur when
applying existing algorithms to real world (non-artificial) databases.
One such database that has been used is a database containing several
years of patient records that has been stored at Huddinge Hospital
near Stockholm, Sweden. This dataset contains potentially useful
information regarding the interactions between different medicines,
how multiple diseases affect the diagnosis of patients, etc.
The main problems that need to be addressed when applying ILP
techniques to extract useful knowledge from such databases are the following:
Efficiency - When a large number of examples and background
predicates are available, most current ILP systems have problems with
efficiency. Methods are needed for restricting and efficiently
exploring the search space.
Noise - Most current ILP methods have only limited abilities to
handle noise, and this is a significant problem for knowledge
discovery in real world databases, as it is not to be expected that
useful hypotheses are consistent with all training data.
Continuous-valued attributes - Continuous-valued attributes are
used frequently in many real world databases, including the one at
Huddinge Hospital, which requires that the applied ILP methods
should be able to efficiently and effectively handle such attributes.
3.1 Analysis of Search Strategies
Resolution has been used as a specialisation operator in several
approaches to top-down induction of logic programs. This operator
allows the overly general hypothesis to be used as a declarative bias
that restricts not only what predicate symbols can be used in produced
hypotheses, but also how the predicates can be invoked. The two main
strategies for top-down induction of logic programs,
Separate-and-Conquer (also known as Covering) and
Divide-and-Conquer, have been formalised using resolution as a
specialisation operator, resulting in two strategies for performing
example-guided unfolding. These strategies have been compared both
theoretically and experimentally. It was shown that the computational
cost grows quadratically in the size of the example set for
Separate-and-Conquer, while it grows linearly for
Divide-and-Conquer. This was also demonstrated by experiments, in
which the amount of work performed by Separate-and-Conquer was up to
30 times the amount of work performed by Divide-and-Conquer. The
theoretical analysis shows that the hypothesis space is larger for
Separate-and-Conquer, and thus more compact hypotheses may be found by
this technique than by Divide-and-Conquer. However, it also shows that
for each non-recursive hypothesis that can be produced by
Separate-and-Conquer, there is an equivalent hypothesis (w.r.t. the
background predicates) that can be produced by Divide-and-Conquer. A
major drawback of Divide-and-Conquer, in contrast to
Separate-and-Conquer, is that it is not applicable to learning
recursive definitions.
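The source of the quadratic cost can be seen in a schematic covering loop (a toy learner with a hypothetical majority-label rule, not the actual systems): each induced rule triggers a fresh pass over the remaining examples, so many small rules imply many passes, whereas Divide-and-Conquer partitions each example only once per level.

```python
# Schematic Separate-and-Conquer (covering) loop: induce a rule,
# remove the examples it covers, repeat on what remains.
def separate_and_conquer(examples, learn_rule):
    rules, remaining = [], list(examples)
    while remaining:
        rule = learn_rule(remaining)                        # scans all remaining examples
        rules.append(rule)
        remaining = [e for e in remaining if not rule(e)]   # second scan to remove covered ones
    return rules

def learn_rule(examples):
    # Toy hypothetical learner: cover every example of the most common label.
    labels = [lab for _, lab in examples]
    target = max(set(labels), key=labels.count)
    return lambda ex, t=target: ex[1] == t

data = [(i, "pos" if i % 2 else "neg") for i in range(10)]
rules = separate_and_conquer(data, learn_rule)
print(len(rules))   # one rule per label
```

With many rules each covering few examples, the repeated scans over the shrinking example set add up to cost quadratic in the number of examples, matching the analysis above.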
3.2 A Hybrid Search Strategy
The main advantage of Divide-and-Conquer compared to Separate-and-Conquer
is that it runs in linear time with the number of examples,
while the latter runs in quadratic time. However, Divide-and-Conquer
suffers from difficulties in effectively inducing disjunctive concepts
due to the replication problem. Separate-and-Conquer on the
other hand searches a larger hypothesis space, which in many cases
avoids the replication problem but at the cost of lower
efficiency. During the project, a hybrid strategy called
Reconsider-and-Conquer was developed and incorporated in the SPECTRE
3.0 system; it handles the replication problem more effectively than
Divide-and-Conquer and allows for more efficient induction than
Separate-and-Conquer. Experimental results from propositional,
numerical and relational domains have demonstrated that
Reconsider-and-Conquer significantly reduces the replication problem
from which Divide-and-Conquer suffers and is several times (up to an
order of magnitude) faster than Separate-and-Conquer.
Noise handling was included in SPECTRE 3.0 by upgrading pre- and
post-pruning techniques developed within propositional learning for an
ILP framework. In addition to this, incremental reduced error pruning
has been included in the system, allowing a further significant
speed-up compared to using post-pruning only. Number handling has been
incorporated in SPECTRE 3.0 by adapting a discretisation algorithm
that has previously been used for decision-tree induction. Experiments
on the database with laboratory tests from Huddinge Hospital are
currently under way.
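The discretisation algorithm actually adapted in SPECTRE 3.0 is not given here; the sketch below shows the common decision-tree approach of picking the cut point that minimises class-label entropy, on hypothetical laboratory-test values.

```python
# Entropy-minimising binary discretisation of a continuous attribute
# (a sketch of the standard decision-tree technique, on made-up data).
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (entropy(labels), None)        # no cut unless it improves on the whole set
    for i in range(1, n):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        avg = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if avg < best[0]:
            best = (avg, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

# Hypothetical lab-test values with a class boundary between 3.1 and 4.8:
print(best_cut([1.2, 2.5, 3.1, 4.8, 5.5, 6.0],
               ["normal", "normal", "normal", "high", "high", "high"]))
# prints the midpoint 3.95 of the two boundary values
```

Turning a continuous attribute into such threshold tests is what lets an ILP system treat laboratory measurements with the same rule language as symbolic background predicates.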
3.3 Selected Publications
Boström H. and Asker L., ``Combining Divide-and-Conquer and Separate-and-Conquer for Efficient and Effective Rule Induction´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 33-43
Boström H. and Idestam-Almquist P., ``Induction of Logic Programs by Example-Guided Unfolding´´, Journal of Logic Programming, Vol. 40 (2-3) (1999) 159-183
3.4 Prototype System
SPECTRE 3.0 is publicly available
and may be used free of charge for academic purposes.
4. Impact
The work on producing a state-of-the-art Swedish part-of-speech tagger
by using ILP to learn constraint grammar rules from annotated corpora
has already resulted in a prototype tagger that is currently being used by
several systems at the Centre for Speech Technology (CTT) at
KTH. Researchers from Telia Research AB have also shown great interest
in using the system. The tagger will become publicly available in the near future.
There are also concrete plans to extend the cooperation with researchers at CTT and
apply ILP to learn constraint grammar rules, not only for part-of-speech
tagging, but also for semantic interpretation and
pronunciation. The applicability of this technique to protein secondary
structure prediction is also currently being investigated in cooperation with
researchers from the Centre for Microbiology and Tumor Biology (MTC)
at Karolinska Institutet.
The work on efficient search strategies allows the ILP algorithms to be used
in applications at the cutting edge of today's research in both machine
learning and data mining. The development of these fields is in turn of strategic
importance for Swedish industry and of critical importance for tomorrow's
information access technologies.
5. Publications 1997-1999
Alexin Z., Gyimothy T. and Boström H., ``IMPUT: An Interactive Learning Tool based on Program Specialization´´, Intelligent Data Analysis Vol. 4, Elsevier Holland (1997)
Asker L., Danielson M. and Ekenberg L., ``Intelligent Alarm Handling´´, Proc. of the 12th International FLAIRS Conference, Special Track on Uncertain Reasoning, AAAI/The MIT Press (1999)
Asker L. and Maclin R., ``Feature Engineering and Classifier Selection: A Case Study in Venusian Volcano Detection´´, Proc. of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann (1997)
Asker L. and Maclin R., ``Ensembles as a Sequence of Classifiers´´, Proc. of the Fifteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann (1997)
Boström H., ``Predicate Invention and Learning from Positive Examples Only´´, Proc. of the Tenth European Conference on Machine Learning, Springer Verlag (1998) 226-237
Boström H., ``Induction of Recursive Transfer Rules´´, Proc. of the Learning Language in Logic (LLL) Workshop, Bled, Slovenia (1999) 52-62 (to appear in LNAI series, Springer)
Boström H. and Asker L., ``Combining Divide-and-Conquer and Separate-and-Conquer for Efficient and Effective Rule Induction´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 33-43
Boström H. and Idestam-Almquist P., ``Induction of Logic Programs by Example-Guided Unfolding´´, Journal of Logic Programming, Vol. 40 (2-3) (1999) 159-183
Burl M., Asker L., Smyth P., Fayyad U. M., Perona P., Crumpler L. and Aubele J., ``Learning to Recognize Volcanoes on Venus´´, Machine Learning Journal, Vol. 30, special issue on Applications of Machine Learning and the Knowledge Discovery Process (1998) 134-165
Eineborg M. and Lindberg N., ``Induction of Constraint Grammar-rules using Progol´´, Proc. of the Eighth International Conference on Inductive Logic Programming, LNAI series 1446 (1998) 116-124
Lindberg N. and Eineborg M., ``Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming´´, Proc. of COLING/ACL'98, volume II (1998) 775-779
Lindberg N. and Eineborg M., ``Improving Part of Speech Disambiguation Rules by Adding Linguistic Knowledge´´, Proc. of the Ninth International Workshop on Inductive Logic Programming, LNAI Series 1634, Springer (1999) 186-197