Automatic writing: Text Generation and Summarization

This research area concerns the creation of a text either from a knowledge representation or from another text.

Automatic Text Generation and Summarization

Automatic text generation and automatic text summarization are two research areas that have a lot in common. Both aim to create a readable and coherent text for a specific user, and many research problems, for example user modeling and discourse structure, are relevant to both.

In automatic text generation, a computer automatically creates natural language text, e.g. in English, Chinese, or Greek, from a computational representation.
In automatic text summarization, a computer automatically creates an abstract or summary from an original, human-written source text.

The process of explaining in natural language is complex and fascinating. The Speaker needs knowledge of the domain to be explained, knowledge about the Hearer's knowledge of the domain, and knowledge of what the Hearer wants or needs to know (so-called user modeling).

When a Speaker wants to generate natural language, after considering the three previous points, she has to perform several processes. She has to determine the content by selecting from her abundant knowledge base, plan and organize the information into a coherent structure, decide on sentence structure and scope (sentence planning, which also includes aggregation), and finally generate the surface form, which involves realizing the syntactic structures and making lexical choices.
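The processes listed above can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions: the knowledge base of (subject, predicate, object) facts, the function names, and the template-based realization are all hypothetical and far simpler than a real generation system. The `already_known` argument stands in for a very crude user model.

```python
# Hypothetical knowledge base: (subject, predicate, object) facts.
KNOWLEDGE_BASE = [
    ("the printer", "supports", "duplex printing"),
    ("the printer", "supports", "colour printing"),
    ("the printer", "is located in", "room 4"),
    ("the scanner", "is located in", "room 2"),
]


def determine_content(facts, topic, already_known):
    """Content determination: keep facts about the topic that the
    Hearer does not already know (a very crude user model)."""
    return [f for f in facts if f[0] == topic and f not in already_known]


def plan_text(facts):
    """Text planning: group facts sharing subject and predicate,
    keeping the order in which each group first appears."""
    groups = {}
    for subj, pred, obj in facts:
        groups.setdefault((subj, pred), []).append(obj)
    return groups


def aggregate(objects):
    """Aggregation: merge several objects into one coordinated phrase."""
    if len(objects) == 1:
        return objects[0]
    return ", ".join(objects[:-1]) + " and " + objects[-1]


def realize(subject, predicate, objects):
    """Surface realization: fill a fixed subject-predicate-object template."""
    sentence = f"{subject} {predicate} {aggregate(objects)}."
    return sentence[0].upper() + sentence[1:]


def generate(facts, topic, already_known=()):
    """Run the whole pipeline and return the generated text."""
    selected = determine_content(facts, topic, already_known)
    plan = plan_text(selected)
    return " ".join(realize(s, p, objs) for (s, p), objs in plan.items())


print(generate(KNOWLEDGE_BASE, "the printer"))
```

Passing a non-empty `already_known` list removes those facts from the output, which shows how the same knowledge base can yield different texts for different Hearers.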

When a computer produces natural language (NL), it needs to perform the same tasks as a human does when creating a text. The knowledge base in a computer is presentation-independent, parallel, and unambiguous. NL, on the other hand, is presentation-dependent, linear, and may be somewhat ambiguous. The mapping between the two is therefore not trivial. For the reader familiar with Natural Language Parsing (NLP): problems in Natural Language Generation (NLG) are considered as hard as problems in NLP.

Automatic text generation can be used for:

Read more about Natural Language Generation (NLG) in Så genererar datorn text (in Swedish).

In automatic text summarization there are two distinct techniques: text extraction and text abstraction. Text extraction means extracting pieces of an original text on a statistical basis or with heuristic methods and putting them together into a new, shorter text with the same information content.
There are three steps in text extraction: first, understanding the topic of the text, so-called topic identification; second, interpreting the text; and finally, generating the text.
In text extraction the basic method is to score each sentence according to its importance and, when creating the summary, to keep the most significant sentences. The scores can be based on high-frequency open-class words, bold or numerical text, proper nouns, citations, position in the text, etc.
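The sentence-scoring idea can be sketched as follows. This is a minimal illustration, not a description of any particular summarizer: the stop-word list (standing in for closed-class words), the naive sentence splitter, and the bonus weights for numerical text and first position are all assumptions chosen for brevity.

```python
import re
from collections import Counter

# Assumed, tiny stop-word list standing in for closed-class words.
STOP_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "are",
    "was", "were", "it", "this", "that", "for", "with", "as", "by", "at",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]


def score_sentences(sentences):
    """Score by average content-word frequency, plus assumed bonuses
    for numerical text and for being the first sentence."""
    freq = Counter(w for s in sentences for w in content_words(s))
    scores = []
    for position, sentence in enumerate(sentences):
        tokens = content_words(sentence)
        score = sum(freq[w] for w in tokens) / (len(tokens) or 1)
        if re.search(r"\d", sentence):
            score += 1.0  # numerical text
        if position == 0:
            score += 1.0  # position in text
        scores.append(score)
    return scores


def summarize(text, n=2):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(top))
```

Calling `summarize(text, n=2)` on a short document returns its two highest-scoring sentences in source order. Other cues mentioned above, such as proper nouns, bold text, or citations, could be added as further bonus terms in `score_sentences`.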
Text abstraction means parsing the original text linguistically, interpreting it, finding new concepts to describe it, and then generating a new, shorter text with the same information content. The latter is very similar to text generation.

Automatic text summarization can be used for:




Responsible for this page: Hercules Dalianis <hercules@kth.se>
Latest change February 14, 2005.