Peter Turney's articles on arXiv

Evolution of Spots and Stripes in Cellular Automata

2025-01-08T12:46:38-05:00

Cellular automata are computers, similar to Turing machines. The main difference is that Turing machines use a one-dimensional tape, whereas cellular automata use a two-dimensional grid. The best-known cellular automaton is the Game of Life, which is a universal computer. It belongs to a family of cellular automata with 262,144 members. Playing the Game of Life generally involves engineering; that is, assembling a device composed of various parts that are combined to achieve a specific intended result. Instead of engineering cellular automata, we propose evolving cellular automata. Evolution applies mutation and selection to a population of organisms. If a mutation increases the fitness of an organism, it may have many descendants, displacing the less fit organisms. Unlike engineering, evolution does not work towards an imagined goal. Evolution works towards increasing fitness, with no expectations about the specific form of the final result. Mutation, selection, and fitness yield structures that appear to be more organic and life-like than engineered structures. In our experiments, the patterns resulting from evolving cellular automata look much like the spots on leopards and the stripes on tigers.

Evolution of Symbiosis in the Game of Life: Three Characteristics of Successful Symbiotes

2022-09-26T20:20:57-04:00

In past work, we developed a computational model of the evolution of symbiotic entities (Model-S), based on Conway's Game of Life. In this article, we examine three trends that biologists have observed in the evolution of symbiotes. (1) Management: If one partner is able to control the symbiotic relation, this control can reduce conflict; thus, evolutionary selection favours symbiotes that have a manager. (2) Mutualism: Although partners in a symbiote often have conflicting needs, evolutionary selection favours symbiotes in which partners are better off together inside the symbiote than they would be as individuals outside of the symbiote. (3) Interaction: Repeated interaction among partners in symbiosis tends to promote increasing fitness due to evolutionary selection. We have added new components to Model-S that allow us to observe these three trends in runs of Model-S. The new components are analogous to the practice of staining cells in biology research, to reveal patterns that are not usually visible. When we measure the fitness of a symbiote by making it compete with other symbiotes, we find that fitter symbiotes have significantly more management, mutualism, and interaction than less fit symbiotes. These results confirm the trends observed in nature by biologists. Model-S allows biologists to study these evolutionary trends and other characteristics of symbiosis in ways that are not tractable with living organisms.

Evolution of Autopoiesis and Multicellularity in the Game of Life

2021-01-11T14:23:58-05:00

Recently we introduced a model of symbiosis, Model-S, based on the evolution of seed patterns in Conway's Game of Life. In the model, the fitness of a seed pattern is measured by one-on-one competitions in the Immigration Game, a two-player variation of the Game of Life. Our previous article showed that Model-S can serve as a highly abstract, simplified model of biological life: (1) The initial seed pattern is analogous to a genome. (2) The changes as the game runs are analogous to the development of the phenome. (3) Tournament selection in Model-S is analogous to natural selection in biology. (4) The Immigration Game in Model-S is analogous to competition in biology. (5) The first three layers in Model-S are analogous to biological reproduction. (6) The fusion of seed patterns in Model-S is analogous to symbiosis. The current article takes this analogy two steps further: (7) Autopoietic structures in the Game of Life (still lifes, oscillators, and spaceships -- collectively known as ashes) are analogous to cells in biology. (8) The seed patterns in the Game of Life give rise to multiple, diverse, cooperating autopoietic structures, analogous to multicellular biological life. We use the apgsearch software (Ash Pattern Generator Search), developed by Adam Goucher for the study of ashes, to analyze autopoiesis and multicellularity in Model-S. We find that the fitness of evolved seed patterns in Model-S is highly correlated with the diversity and quantity of multicellular autopoietic structures.

Measuring Behavioural Similarity of Cellular Automata

2020-12-17T17:19:26-05:00

Conway's Game of Life is the best-known cellular automaton. It is a classic model of emergence and self-organization, it is Turing-complete, and it can simulate a universal constructor. The Game of Life belongs to the set of semi-totalistic cellular automata, a family with 262,144 members. Many of these automata may deserve as much attention as the Game of Life, if not more. The challenge we address here is to provide a structure for organizing this large family, to make it easier to find interesting automata, and to understand the relations between automata. Packard and Wolfram (1985) divided the family into four classes, based on the observed behaviours of the rules. Eppstein (2010) proposed an alternative four-class system, based on the forms of the rules. Instead of a class-based organization, we propose a continuous high-dimensional vector space, where each automaton is represented by a point in the space. The distance between two automata in this space corresponds to the differences in their behavioural characteristics. Nearest neighbours in the space have similar behaviours. This space should make it easier for researchers to see the structure of the family of semi-totalistic rules and to find the hidden gems in the family.

Symbiosis Promotes Fitness Improvements in the Game of Life

2020-06-16T15:06:54-04:00

We present a computational simulation of evolving entities that includes symbiosis with shifting levels of selection. Evolution by natural selection shifts from the level of the original entities to the level of the new symbiotic entity. In the simulation, the fitness of an entity is measured by a series of one-on-one competitions in the Immigration Game, a two-player variation of Conway's Game of Life. Mutation, reproduction, and symbiosis are implemented as operations that are external to the Immigration Game. Because these operations are external to the game, we are able to freely manipulate the operations and observe the effects of the manipulations. The simulation is composed of four layers, each layer building on the previous layer. The first layer implements a simple form of asexual reproduction, the second layer introduces a more sophisticated form of asexual reproduction, the third layer adds sexual reproduction, and the fourth layer adds symbiosis. The experiments show that a small amount of symbiosis, added to the other layers, significantly increases the fitness of the population. We suggest that the model may provide new insights into symbiosis in biological and cultural evolution.

Conditions for Open-Ended Evolution in Immigration Games

2020-04-06T11:00:13-04:00

The Immigration Game (invented by Don Woods in 1971) extends the solitaire Game of Life (invented by John Conway in 1970) to enable two-player competition. The Immigration Game can be used in a model of evolution by natural selection, where fitness is measured with competitions. The rules for the Game of Life belong to the family of semitotalistic rules, a family with 262,144 members. Woods' method for converting the Game of Life into a two-player game generalizes to 8,192 members of the family of semitotalistic rules. In this paper, we call the original Immigration Game the Life Immigration Game and we call the 8,192 generalizations Immigration Games (including the Life Immigration Game). The question we examine here is, what are the conditions for one of the 8,192 Immigration Games to be suitable for modeling open-ended evolution? Our focus here is specifically on conditions for the rules, as opposed to conditions for other aspects of the model of evolution. In previous work, it was conjectured that Turing-completeness of the rules for the Game of Life may have been necessary for the success of evolution using the Life Immigration Game. Here we present evidence that Turing-completeness is a sufficient condition on the rules of Immigration Games, but not a necessary condition. The evidence suggests that a necessary and sufficient condition on the rules of Immigration Games, for open-ended evolution, is that the rules should allow growth.

The Natural Selection of Words: Finding the Features of Fitness

2019-08-19T14:28:59-04:00

We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word's length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.

Conditions for Major Transitions in Biological and Cultural Evolution

2018-06-20T15:41:55-04:00

Evolution by natural selection can be seen an algorithm for generating creative solutions to difficult problems. More precisely, evolution by natural selection is a class of algorithms that share a set of properties. The question we address here is, what are the conditions that define this class of algorithms? There is a standard answer to this question: Briefly, the conditions are variation, heredity, and selection. We agree that these three conditions are sufficient for a limited type of evolution, but they are not sufficient for open-ended evolution. By open-ended evolution, we mean evolution that generates a continuous stream of creative solutions, without stagnating. We propose a set of conditions for open-ended evolution. The new conditions build on the standard conditions by adding fission, fusion, and cooperation. We test the proposed conditions by applying them to major transitions in the evolution of life and culture. We find that the proposed conditions are able to account for the major transitions.

Leveraging Term Banks for Answering Complex Questions: A Case for Sparse Vectors

2017-04-11T17:21:39-04:00

While open-domain question answering (QA) systems have proven effective for answering simple questions, they struggle with more complex questions. Our goal is to answer more complex questions reliably, without incurring a significant cost in knowledge resource construction to support the QA. One readily available knowledge resource is a term bank, enumerating the key concepts in a domain. We have developed an unsupervised learning approach that leverages a term bank to guide a QA system, by representing the terminological knowledge with thousands of specialized vector spaces. In experiments with complex science questions, we show that this approach significantly outperforms several state-of-the-art QA systems, demonstrating that significant leverage can be gained from continuous vector representations of domain terminology.

Semantic Composition and Decomposition: From Recognition to Generation

2014-05-30T12:36:07-04:00

Semantic composition is the task of understanding the meaning of text by composing the meanings of the individual words in the text. Semantic decomposition is the task of understanding the meaning of an individual word by decomposing it into various aspects (factors, constituents, components) that are latent in the meaning of the word. We take a distributional approach to semantics, in which a word is represented by a context vector. Much recent work has considered the problem of recognizing compositions and decompositions, but we tackle the more difficult generation problem. For simplicity, we focus on noun-modifier bigrams and noun unigrams. A test for semantic composition is, given context vectors for the noun and modifier in a noun-modifier bigram ("red salmon"), generate a noun unigram that is synonymous with the given bigram ("sockeye"). A test for semantic decomposition is, given a context vector for a noun unigram ("snifter"), generate a noun-modifier bigram that is synonymous with the given unigram ("brandy glass"). With a vocabulary of about 73,000 unigrams from WordNet, there are 73,000 candidate unigram compositions for a bigram and 5,300,000,000 (73,000 squared) candidate bigram decompositions for a unigram. We generate ranked lists of potential solutions in two passes. A fast unsupervised learning algorithm generates an initial list of candidates and then a slower supervised learning algorithm refines the list. We evaluate the candidate solutions by comparing them to WordNet synonym sets. For decomposition (unigram to bigram), the top 100 most highly ranked bigrams include a WordNet synonym of the given unigram 50.7% of the time. For composition (bigram to unigram), the top 100 most highly ranked unigrams include a WordNet synonym of the given bigram 77.8% of the time.

Experiments with Three Approaches to Recognizing Lexical Entailment

2014-01-31T14:42:19-05:00

Inference in natural language often involves recognizing lexical entailment (RLE); that is, identifying whether one word entails another. For example, "buy" entails "own". Two general strategies for RLE have been proposed: One strategy is to manually construct an asymmetric similarity measure for context vectors (directional similarity) and another is to treat RLE as a problem of learning to recognize semantic relations using supervised machine learning techniques (relation classification). In this paper, we experiment with two recent state-of-the-art representatives of the two general strategies. The first approach is an asymmetric similarity measure (an instance of the directional similarity strategy), designed to capture the degree to which the contexts of a word, a, form a subset of the contexts of another word, b. The second approach (an instance of the relation classification strategy) represents a word pair, a:b, with a feature vector that is the concatenation of the context vectors of a and b, and then applies supervised learning to a training set of labeled feature vectors. Additionally, we introduce a third approach that is a new instance of the relation classification strategy. The third approach represents a word pair, a:b, with a feature vector in which the features are the differences in the similarities of a and b to a set of reference words. All three approaches use vector space models (VSMs) of semantics, based on word-context matrices. We perform an extensive evaluation of the three approaches using three different datasets. The proposed new approach (similarity differences) performs significantly better than the other two approaches on some datasets and there is no dataset for which it is significantly worse. Our results suggest it is beneficial to make connections between the research in lexical entailment and the research in semantic relation classification.

Distributional semantics beyond words: Supervised learning of analogy and paraphrase

2013-10-18T10:50:39-04:00

There have been several efforts to extend distributional semantics beyond individual words, to measure the similarity of word pairs, phrases, and sentences (briefly, tuples; ordered sets of words, contiguous or noncontiguous). One way to extend beyond words is to compare two tuples using a function that combines pairwise similarities between the component words in the tuples. A strength of this approach is that it works with both relational similarity (analogy) and compositional similarity (paraphrase). However, past work required hand-coding the combination function for different tasks. The main contribution of this paper is that combination functions are generated by supervised learning. We achieve state-of-the-art results in measuring relational similarity between word pairs (SAT analogies and SemEval~2012 Task 2) and measuring compositional similarity between noun-modifier phrases and unigrams (multiple-choice paraphrase questions).

Domain and Function: A Dual-Space Model of Semantic Relations and Compositions

2013-09-16T12:51:02-04:00

Given appropriate representations of the semantic relations between carpenter and wood and between mason and stone (for example, vectors in a vector space model), a suitable algorithm should be able to recognize that these relations are highly similar (carpenter is to wood as mason is to stone; the relations are analogous). Likewise, with representations of dog, house, and kennel, an algorithm should be able to recognize that the semantic composition of dog and house, dog house, is highly similar to kennel (dog house and kennel are synonymous). It seems that these two tasks, recognizing relations and compositions, are closely connected. However, up to now, the best models for relations are significantly different from the best models for compositions. In this paper, we introduce a dual-space model that unifies these two tasks. This model matches the performance of the best previous models for relations and compositions. The dual-space model consists of a space for measuring domain similarity and a space for measuring function similarity. Carpenter and wood share the same domain, the domain of carpentry. Mason and stone share the same domain, the domain of masonry. Carpenter and mason share the same function, the function of artisans. Wood and stone share the same function, the function of materials. In the composition dog house, kennel has some domain overlap with both dog and house (the domains of pets and buildings). The function of kennel is similar to the function of house (the function of shelters). By combining domain and function similarities in various ways, we can model relations, compositions, and other aspects of semantics.

Analogy perception applied to seven tests of word comprehension

2011-07-22T12:54:11-04:00

It has been argued that analogy is the core of cognition. In AI research, algorithms for analogy are often limited by the need for hand-coded high-level representations as input. An alternative approach is to use high-level perception, in which high-level representations are automatically generated from raw data. Analogy perception is the process of recognizing analogies using high-level perception. We present PairClass, an algorithm for analogy perception that recognizes lexical proportional analogies using representations that are automatically generated from a large corpus of raw textual data. A proportional analogy is an analogy of the form A:B::C:D, meaning "A is to B as C is to D". A lexical proportional analogy is a proportional analogy with words, such as carpenter:wood::mason:stone. PairClass represents the semantic relations between two words using a high-dimensional feature vector, in which the elements are based on frequencies of patterns in the corpus. PairClass recognizes analogies by applying standard supervised machine learning techniques to the feature vectors. We show how seven different tests of word comprehension can be framed as problems of analogy perception and we then apply PairClass to the seven resulting sets of analogy perception problems. We achieve competitive results on all seven tests. This is the first time a uniform approach has handled such a range of tests of word comprehension.

From Frequency to Meaning: Vector Space Models of Semantics

2010-03-04T16:07:18-05:00

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

The Latent Relation Mapping Engine: Algorithm and Experiments

2008-12-23T15:08:53-05:00

Many AI researchers and cognitive scientists have argued that analogy is the core of cognition. The most influential work on computational modeling of analogy-making is Structure Mapping Theory (SMT) and its implementation in the Structure Mapping Engine (SME). A limitation of SME is the requirement for complex hand-coded representations. We introduce the Latent Relation Mapping Engine (LRME), which combines ideas from SME and Latent Relational Analysis (LRA) in order to remove the requirement for hand-coded representations. LRME builds analogical mappings between lists of words, using a large corpus of raw text to automatically discover the semantic relations among the words. We evaluate LRME on a set of twenty analogical mapping problems, ten based on scientific analogies and ten based on common metaphors. LRME achieves human-level performance on the twenty problems. We compare LRME with a variety of alternative approaches and find that they are not able to reach the same level of performance.

A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations

2008-08-31T10:00:26-04:00

Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, the four tasks have been treated independently, using a wide variety of algorithms. These four semantic classes, however, are a tiny sample of the full range of semantic phenomena, and we cannot afford to create ad hoc algorithms for each semantic phenomenon; we need to seek a unified approach. We propose to subsume a broad range of phenomena under analogies. To limit the scope of this paper, we restrict our attention to the subsumption of synonyms, antonyms, and associations. We introduce a supervised corpus-based machine learning algorithm for classifying analogous word pairs, and we show that it can solve multiple-choice SAT analogy questions, TOEFL synonym questions, ESL synonym-antonym questions, and similar-associated-both questions from cognitive psychology.

Empirical Evaluation of Four Tensor Decomposition Algorithms

2007-11-13T11:28:47-05:00

Higher-order tensor decompositions are analogous to the familiar Singular Value Decomposition (SVD), but they transcend the limitations of matrices (second-order tensors). SVD is a powerful tool that has achieved impressive results in information retrieval, collaborative filtering, computational linguistics, computational vision, and other fields. However, SVD is limited to two-dimensional arrays of data (two modes), and many potential applications have three or more modes, which require higher-order tensor decompositions. This paper evaluates four algorithms for higher-order tensor decomposition: Higher-Order Singular Value Decomposition (HO-SVD), Higher-Order Orthogonal Iteration (HOOI), Slice Projection (SP), and Multislice Projection (MP). We measure the time (elapsed run time), space (RAM and disk space requirements), and fit (tensor reconstruction accuracy) of the four algorithms, under a variety of conditions. We find that standard implementations of HO-SVD and HOOI do not scale up to larger tensors, due to increasing RAM requirements. We recommend HOOI for tensors that are small enough for the available RAM and MP for larger tensors.

Similarity of Semantic Relations

2006-08-25T10:35:11-04:00

There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM.

Expressing Implicit Semantic Relations without Supervision

2006-07-27T14:23:45-04:00

We present an unsupervised learning algorithm that mines large text corpora for patterns that express implicit semantic relations. For a given input word pair X:Y with some unspecified semantic relations, the corresponding output list of patterns is ranked according to how well each pattern Pi expresses the relations between X and Y. For example, given X=ostrich and Y=bird, the two highest ranking output patterns are "X is the largest Y" and "Y such as the X". The output patterns are intended to be useful for finding further pairs with the same relations, to support the construction of lexicons, ontologies, and semantic networks. The patterns are sorted by pertinence, where the pertinence of a pattern Pi for a word pair X:Y is the expected relational similarity between the given pair and typical pairs for Pi. The algorithm is empirically evaluated on two tasks, solving multiple-choice SAT word analogy questions and classifying semantic relations in noun-modifier pairs. On both tasks, the algorithm achieves state-of-the-art results, performing significantly better than several alternative pattern ranking algorithms, based on tf-idf.

Self-Replication and Self-Assembly for Manufacturing

2006-07-27T13:55:16-04:00

It has been argued that a central objective of nanotechnology is to make products inexpensively, and that self-replication is an effective approach to very low-cost manufacturing. The research presented here is intended to be a step towards this vision. We describe a computational simulation of nanoscale machines floating in a virtual liquid. The machines can bond together to form strands (chains) that self-replicate and self-assemble into user-specified meshes. There are four types of machines and the sequence of machine types in a strand determines the shape of the mesh they will build. A strand may be in an unfolded state, in which the bonds are straight, or in a folded state, in which the bond angles depend on the types of machines. By choosing the sequence of machine types in a strand, the user can specify a variety of polygonal shapes. A simulation typically begins with an initial unfolded seed strand in a soup of unbonded machines. The seed strand replicates by bonding with free machines in the soup. The child strands fold into the encoded polygonal shape, and then the polygons drift together and bond to form a mesh. We demonstrate that a variety of polygonal meshes can be manufactured in the simulation, by simply changing the sequence of machine types in the seed.

Corpus-based Learning of Analogies and Semantic Relations

2005-08-23T16:21:56-04:00

We present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the SAT college entrance exam. A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D"; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct; the average college-bound senior high school student answers about 57% correctly). We motivate this research by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as "laser printer", according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for both verbal analogies and noun-modifier relations.

Measuring Semantic Similarity by Latent Relational Analysis

2005-08-10T15:35:57-04:00

This paper introduces Latent Relational Analysis (LRA), a method for measuring semantic similarity. LRA measures similarity in the semantic relations between two pairs of words. When two pairs have a high degree of relational similarity, they are analogous. For example, the pair cat:meow is analogous to the pair dog:bark. There is evidence from cognitive science that relational similarity is fundamental to many cognitive and linguistic tasks (e.g., analogical reasoning). In the Vector Space Model (VSM) approach to measuring relational similarity, the similarity between two pairs is calculated by the cosine of the angle between the vectors that represent the two pairs. The elements in the vectors are based on the frequencies of manually constructed patterns in a large corpus. LRA extends the VSM approach in three ways: (1) patterns are derived automatically from the corpus, (2) Singular Value Decomposition is used to smooth the frequency data, and (3) synonyms are used to reformulate word pairs. This paper describes the LRA algorithm and experimentally compares LRA to VSM on two tasks, answering college-level multiple-choice word analogy questions and classifying semantic relations in noun-modifier expressions. LRA achieves state-of-the-art results, reaching human-level performance on the analogy questions and significantly exceeding VSM performance on both tasks.

Self-Replicating Strands that Self-Assemble into User-Specified Meshes

2005-02-22T11:53:15-05:00

It has been argued that a central objective of nanotechnology is to make products inexpensively, and that self-replication is an effective approach to very low-cost manufacturing. The research presented here is intended to be a step towards this vision. In previous work (JohnnyVon 1.0), we simulated machines that bonded together to form self-replicating strands. There were two types of machines (called types 0 and 1), which enabled strands to encode arbitrary bit strings. However, the information encoded in the strands had no functional role in the simulation. The information was replicated without being interpreted, which was a significant limitation for potential manufacturing applications. In the current work (JohnnyVon 2.0), the information in a strand is interpreted as instructions for assembling a polygonal mesh. There are now four types of machines and the information encoded in a strand determines how it folds. A strand may be in an unfolded state, in which the bonds are straight (although they flex slightly due to virtual forces acting on the machines), or in a folded state, in which the bond angles depend on the types of machines. By choosing the sequence of machine types in a strand, the user can specify a variety of polygonal shapes. A simulation typically begins with an initial unfolded seed strand in a soup of unbonded machines. The seed strand replicates by bonding with free machines in the soup. The child strands fold into the encoded polygonal shape, and then the polygons drift together and bond to form a mesh. We demonstrate that a variety of polygonal meshes can be manufactured in the simulation, by simply changing the sequence of machine types in the seed.

Combining Independent Modules in Lexical Multiple-Choice Problems

2005-01-10T16:03:14-05:00

Existing statistical approaches to natural language problems are very coarse approximations to the true complexity of language processing. As such, no single technique will be best for all problem instances. Many researchers are examining ensemble methods that combine the output of multiple modules to create more accurate solutions. This paper examines three merging rules for combining probability distributions: the familiar mixture rule, the logarithmic rule, and a novel product rule. These rules were applied with state-of-the-art results to two problems used to assess human mastery of lexical semantics -- synonym questions and analogy questions. All three merging rules result in ensembles that are more accurate than any of their component modules. The differences among the three rules are not statistically significant, but it is suggestive that the popular mixture rule is not the best rule for either of the two problems.

Human-Level Performance on Word Analogy Questions by Latent Relational Analysis

2004-12-06T16:50:18-05:00

This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, machine translation, and information retrieval. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason/stone is analogous to the pair carpenter/wood. Past work on semantic similarity measures has mainly been concerned with attributional similarity. Recently the Vector Space Model (VSM) of information retrieval has been adapted to the task of measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus (they are not predefined), (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data (it is also used this way in Latent Semantic Analysis), and (3) automatically generated synonyms are used to explore reformulations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying noun-modifier relations, LRA achieves similar gains over the VSM, while using a smaller corpus.

Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities

2004-07-29T15:46:01-04:00

This paper describes the National Research Council (NRC) Word Sense Disambiguation (WSD) system, as applied to the English Lexical Sample (ELS) task in Senseval-3. The NRC system approaches WSD as a classical supervised machine learning problem, using familiar tools such as the Weka machine learning software and Brill's rule-based part-of-speech tagger. Head words are represented as feature vectors with several hundred features. Approximately half of the features are syntactic and the other half are semantic. The main novelty in the system is the method for generating the semantic features, based on word \hbox{co-occurrence} probabilities. The probabilities are estimated using the Waterloo MultiText System with a corpus of about one terabyte of unlabeled text, collected by a web crawler.

Combining Independent Modules to Solve Multiple-choice Synonym and Analogy Problems

2003-09-19T16:13:07-04:00

Existing statistical approaches to natural language problems are very coarse approximations to the true complexity of language processing. As such, no single technique will be best for all problem instances. Many researchers are examining ensemble methods that combine the output of successful, separately developed modules to create more accurate solutions. This paper examines three merging rules for combining probability distributions: the well known mixture rule, the logarithmic rule, and a novel product rule. These rules were applied with state-of-the-art results to two problems commonly used to assess human mastery of lexical semantics -- synonym questions and analogy questions. All three merging rules result in ensembles that are more accurate than any of their component modules. The differences among the three rules are not statistically significant, but it is suggestive that the popular mixture rule is not the best rule for either of the two problems.

Measuring Praise and Criticism: Inference of Semantic Orientation from Association

2003-09-19T12:30:55-04:00

The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This paper introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words.

Coherent Keyphrase Extraction via Web Mining

2003-08-20T16:42:19-04:00

Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).

Learning Analogies and Semantic Relations

2003-07-24T17:09:43-04:00

We present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the Scholastic Aptitude Test (SAT). A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D"; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct). We motivate this research by relating it to work in cognitive science and linguistics, and by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as "laser printer", according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for these challenging problems.

Self-Replicating Machines in Continuous Space with Virtual Physics

2003-04-15T15:33:45-04:00

JohnnyVon is an implementation of self-replicating machines in continuous two-dimensional space. Two types of particles drift about in a virtual liquid. The particles are automata with discrete internal states but continuous external relationships. Their internal states are governed by finite state machines but their external relationships are governed by a simulated physics that includes Brownian motion, viscosity, and spring-like attractive and repulsive forces. The particles can be assembled into patterns that can encode arbitrary strings of bits. We demonstrate that, if an arbitrary "seed" pattern is put in a "soup" of separate individual particles, the pattern will replicate by assembling the individual particles into copies of itself. We also show that, given sufficient time, a soup of separate individual particles will eventually spontaneously form self-replicating patterns. We discuss the implications of JohnnyVon for research in nanotechnology, theoretical biology, and artificial life.

Increasing Evolvability Considered as a Large-Scale Trend in Evolution

2002-12-12T17:39:39-05:00

Evolvability is the capacity to evolve. This paper introduces a simple computational model of evolvability and demonstrates that, under certain conditions, evolvability can increase indefinitely, even when there is no direct selection for evolvability. The model shows that increasing evolvability implies an accelerating evolutionary pace. It is suggested that the conditions for indefinitely increasing evolvability are satisfied in biological and cultural evolution. We claim that increasing evolvability is a large-scale trend in evolution. This hypothesis leads to testable predictions about biological and cultural evolution.

Exploiting Context When Learning to Classify

2002-12-12T14:40:50-05:00

This paper addresses the problem of classifying observations when features are context-sensitive, specifically when the testing set involves a context that is different from the training set. The paper begins with a precise definition of the problem, then general strategies are presented for enhancing the performance of classification algorithms on this type of problem. These strategies are tested on two domains. The first domain is the diagnosis of gas turbine engines. The problem is to diagnose a faulty engine in one context, such as warm weather, when the fault has previously been seen only in another context, such as cold weather. The second domain is speech recognition. The problem is to recognize words spoken by a new speaker, not represented in the training set. For both domains, exploiting context results in substantially more accurate classification.

Robust Classification with Context-Sensitive Features

2002-12-12T14:26:52-05:00

This paper addresses the problem of classifying observations when features are context-sensitive, especially when the testing set involves a context that is different from the training set. The paper begins with a precise definition of the problem, then general strategies are presented for enhancing the performance of classification algorithms on this type of problem. These strategies are tested on three domains. The first domain is the diagnosis of gas turbine engines. The problem is to diagnose a faulty engine in one context, such as warm weather, when the fault has previously been seen only in another context, such as cold weather. The second domain is speech recognition. The context is given by the identity of the speaker. The problem is to recognize words spoken by a new speaker, not represented in the training set. The third domain is medical prognosis. The problem is to predict whether a patient with hepatitis will live or die. The context is the age of the patient. For all three domains, exploiting context results in substantially more accurate classification.

Data Engineering for the Analysis of Semiconductor Manufacturing Data

2002-12-12T14:11:11-05:00

We have analyzed manufacturing data from several different semiconductor manufacturing plants, using decision tree induction software called Q-YIELD. The software generates rules for predicting when a given product should be rejected. The rules are intended to help the process engineers improve the yield of the product, by helping them to discover the causes of rejection. Experience with Q-YIELD has taught us the importance of data engineering -- preprocessing the data to enable or facilitate decision tree induction. This paper discusses some of the data engineering problems we have encountered with semiconductor manufacturing data. The paper deals with two broad classes of problems: engineering the features in a feature vector representation and engineering the definition of the target concept (the classes). Manufacturing process data present special problems for feature engineering, since the data have multiple levels of granularity (detail, resolution). Engineering the target concept is important, due to our focus on understanding the past, as opposed to the more common focus in machine learning on predicting the future.

Low Size-Complexity Inductive Logic Programming: The East-West Challenge Considered as a Problem in Cost-Sensitive Classification

2002-12-12T13:51:06-05:00

The Inductive Logic Programming community has considered proof-complexity and model-complexity, but, until recently, size-complexity has received little attention. Recently a challenge was issued "to the international computing community" to discover low size-complexity Prolog programs for classifying trains. The challenge was based on a problem first proposed by Ryszard Michalski, 20 years ago. We interpreted the challenge as a problem in cost-sensitive classification and we applied a recently developed cost-sensitive classifier to the competition. Our algorithm was relatively successful (we won a prize). This paper presents our algorithm and analyzes the results of the competition.

The Identification of Context-Sensitive Features: A Formal Definition of Context for Concept Learning

2002-12-12T13:29:02-05:00

A large body of research in machine learning is concerned with supervised learning from examples. The examples are typically represented as vectors in a multi-dimensional feature space (also known as attribute-value descriptions). A teacher partitions a set of training examples into a finite number of classes. The task of the learning algorithm is to induce a concept from the training examples. In this paper, we formally distinguish three types of features: primary, contextual, and irrelevant features. We also formally define what it means for one feature to be context-sensitive to another feature. Context-sensitive features complicate the task of the learner and potentially impair the learner's performance. Our formal definitions make it possible for a learner to automatically identify context-sensitive features. After context-sensitive features have been identified, there are several strategies that the learner can employ for managing the features; however, a discussion of these strategies is outside of the scope of this paper. The formal definitions presented here correct a flaw in previously proposed definitions. We discuss the relationship between our work and a formal definition of relevance.

The Management of Context-Sensitive Features: A Review of Strategies

2002-12-12T13:14:38-05:00

In this paper, we review five heuristic strategies for handling context-sensitive features in supervised machine learning from examples. We discuss two methods for recovering lost (implicit) contextual information. We mention some evidence that hybrid strategies can have a synergetic effect. We then show how the work of several machine learning researchers fits into this framework. While we do not claim that these strategies exhaust the possibilities, it appears that the framework includes all of the techniques that can be found in the published literature on contextsensitive learning.

Myths and Legends of the Baldwin Effect

2002-12-11T16:34:18-05:00

This position paper argues that the Baldwin effect is widely misunderstood by the evolutionary computation community. The misunderstandings appear to fall into two general categories. Firstly, it is commonly believed that the Baldwin effect is concerned with the synergy that results when there is an evolving population of learning individuals. This is only half of the story. The full story is more complicated and more interesting. The Baldwin effect is concerned with the costs and benefits of lifetime learning by individuals in an evolving population. Several researchers have focussed exclusively on the benefits, but there is much to be gained from attention to the costs. This paper explains the two sides of the story and enumerates ten of the costs and benefits of lifetime learning by individuals in an evolving population. Secondly, there is a cluster of misunderstandings about the relationship between the Baldwin effect and Lamarckian inheritance of acquired characteristics. The Baldwin effect is not Lamarckian. A Lamarckian algorithm is not better for most evolutionary computing problems than a Baldwinian algorithm. Finally, Lamarckian inheritance is not a better model of memetic (cultural) evolution than the Baldwin effect.

Types of Cost in Inductive Concept Learning

2002-12-11T14:42:14-05:00

Inductive concept learning is the task of learning to assign cases to a discrete set of classes. In real-world applications of concept learning, there are many different types of cost involved. The majority of the machine learning literature ignores all types of cost (unless accuracy is interpreted as a type of cost measure). A few papers have investigated the cost of misclassification errors. Very few papers have examined the many other types of cost. In this paper, we attempt to create a taxonomy of the different types of cost that are involved in inductive concept learning. This taxonomy may help to organize the literature on cost-sensitive learning. We hope that it will inspire researchers to investigate all types of cost in inductive concept learning in more depth.

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

2002-12-11T14:17:06-05:00

This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).

Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews

2002-12-11T13:57:42-05:00

This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. A phrase has a positive semantic orientation when it has good associations (e.g., "subtle nuances") and a negative semantic orientation when it has bad associations (e.g., "very cavalier"). In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor". A review is classified as recommended if the average semantic orientation of its phrases is positive. The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The accuracy ranges from 84% for automobile reviews to 66% for movie reviews.

Contextual Normalization Applied to Aircraft Gas Turbine Engine Diagnosis

2002-12-11T13:30:59-05:00

Diagnosing faults in aircraft gas turbine engines is a complex problem. It involves several tasks, including rapid and accurate interpretation of patterns in engine sensor data. We have investigated contextual normalization for the development of a software tool to help engine repair technicians with interpretation of sensor data. Contextual normalization is a new strategy for employing machine learning. It handles variation in data that is due to contextual factors, rather than the health of the engine. It does this by normalizing the data in a context-sensitive manner. This learning strategy was developed and tested using 242 observations of an aircraft gas turbine engine in a test cell, where each observation consists of roughly 12,000 numbers, gathered over a 12 second interval. There were eight classes of observations: seven deliberately implanted classes of faults and a healthy class. We compared two approaches to implementing our learning strategy: linear regression and instance-based learning. We have three main results. (1) For the given problem, instance-based learning works better than linear regression. (2) For this problem, contextual normalization works better than other common forms of normalization. (3) The algorithms described here can be the basis for a useful software tool for assisting technicians with the interpretation of sensor data.

Theoretical Analyses of Cross-Validation Error and Voting in Instance-Based Learning

2002-12-11T12:36:00-05:00

This paper begins with a general theory of error in cross-validation testing of algorithms for supervised learning from examples. It is assumed that the examples are described by attribute-value pairs, where the values are symbolic. Cross-validation requires a set of training examples and a set of testing examples. The value of the attribute that is to be predicted is known to the learner in the training set, but unknown in the testing set. The theory demonstrates that cross-validation error has two components: error on the training set (inaccuracy) and sensitivity to noise (instability). This general theory is then applied to voting in instance-based learning. Given an example in the testing set, a typical instance-based learning algorithm predicts the designated attribute by voting among the k nearest neighbors (the k most similar examples) to the testing example in the training set. Voting is intended to increase the stability (resistance to noise) of instance-based learning, but a theoretical analysis shows that there are circumstances in which voting can be destabilizing. The theory suggests ways to minimize cross-validation error, by insuring that voting is stable and does not adversely affect accuracy.

A Theory of Cross-Validation Error

2002-12-11T11:08:36-05:00

This paper presents a theory of error in cross-validation testing of algorithms for predicting real-valued attributes. The theory justifies the claim that predicting real-valued attributes requires balancing the conflicting demands of simplicity and accuracy. Furthermore, the theory indicates precisely how these conflicting demands must be balanced, in order to minimize cross-validation error. A general theory is presented, then it is developed in detail for linear regression and instance-based learning.

Technical Note: Bias and the Quantification of Stability

2002-12-11T10:50:41-05:00

Research on bias in machine learning algorithms has generally been concerned with the impact of bias on predictive accuracy. We believe that there are other factors that should also play a role in the evaluation of bias. One such factor is the stability of the algorithm; in other words, the repeatability of the results. If we obtain two sets of data from the same phenomenon, with the same underlying probability distribution, then we would like our learning algorithm to induce approximately the same concepts from both sets of data. This paper introduces a method for quantifying stability, based on a measure of the agreement between concepts. We also discuss the relationships among stability, predictive accuracy, and bias.

How to Shift Bias: Lessons from the Baldwin Effect

2002-12-10T13:19:54-05:00

An inductive learning algorithm takes a set of data as input and generates a hypothesis as output. A set of data is typically consistent with an infinite number of hypotheses; therefore, there must be factors other than the data that determine the output of the learning algorithm. In machine learning, these other factors are called the bias of the learner. Classical learning algorithms have a fixed bias, implicit in their design. Recently developed learning algorithms dynamically adjust their bias as they search for a hypothesis. Algorithms that shift bias in this manner are not as well understood as classical algorithms. In this paper, we show that the Baldwin effect has implications for the design and analysis of bias shifting algorithms. The Baldwin effect was proposed in 1896, to explain how phenomena that might appear to require Lamarckian evolution (inheritance of acquired characteristics) can arise from purely Darwinian evolution. Hinton and Nowlan presented a computational model of the Baldwin effect in 1987. We explore a variation on their model, which we constructed explicitly to illustrate the lessons that the Baldwin effect has for research in bias shifting algorithms. The main lesson is that it appears that a good strategy for shift of bias in a learning algorithm is to begin with a weak bias and gradually shift to a strong bias.

A Simple Model of Unbounded Evolutionary Versatility as a Largest-Scale Trend in Organismal Evolution

2002-12-10T10:52:43-05:00

The idea that there are any large-scale trends in the evolution of biological organisms is highly controversial. It is commonly believed, for example, that there is a large-scale trend in evolution towards increasing complexity, but empirical and theoretical arguments undermine this belief. Natural selection results in organisms that are well adapted to their local environments, but it is not clear how local adaptation can produce a global trend. In this paper, I present a simple computational model, in which local adaptation to a randomly changing environment results in a global trend towards increasing evolutionary versatility. In this model, for evolutionary versatility to increase without bound, the environment must be highly dynamic. The model also shows that unbounded evolutionary versatility implies an accelerating evolutionary pace. I believe that unbounded increase in evolutionary versatility is a large-scale trend in evolution. I discuss some of the testable predictions about organismal evolution that are suggested by the model.

Learning Algorithms for Keyphrase Extraction

2002-12-10T10:30:56-05:00

Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a generalpurpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by Extractor suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.

Answering Subcognitive Turing Test Questions: A Reply to French

2002-12-09T08:09:10-05:00

Robert French has argued that a disembodied computer is incapable of passing a Turing Test that includes subcognitive questions. Subcognitive questions are designed to probe the network of cultural and perceptual associations that humans naturally develop as we live, embodied and embedded in the world. In this paper, I show how it is possible for a disembodied computer to answer subcognitive questions appropriately, contrary to French's claim. My approach to answering subcognitive questions is to use statistical information extracted from a very large collection of text. In particular, I show how it is possible to answer a sample of subcognitive questions taken from French, by issuing queries to a search engine that indexes about 350 million Web pages. This simple algorithm may shed light on the nature of human (sub-) cognition, but the scope of this paper is limited to demonstrating that French is mistaken: a disembodied computer can answer subcognitive questions.

Extraction of Keyphrases from Text: Evaluation of Four Algorithms

2002-12-08T14:40:42-05:00

This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm's keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft's Word 97, (2) an algorithm based on Eric Brill's part-of-speech tagger, (3) the Summarize feature in Verity's Search 97, and (4) NRC's Extractor algorithm. For all five document collections, NRC's Extractor yields the best match with the manually generated keyphrases.

Learning to Extract Keyphrases from Text

2002-12-08T14:27:56-05:00

Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft's Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity's Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97).

Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus

2002-12-08T14:06:08-05:00

The evaluative character of a word is called its semantic orientation. A positive semantic orientation implies desirability (e.g., "honest", "intrepid") and a negative semantic orientation implies undesirability (e.g., "disturbing", "superfluous"). This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. The method involves issuing queries to a Web search engine and using pointwise mutual information to analyse the results. The algorithm is empirically evaluated using a training corpus of approximately one hundred billion words -- the subset of the Web that is indexed by the chosen search engine. Tested with 3,596 words (1,614 positive and 1,982 negative), the algorithm attains an accuracy of 80%. The 3,596 test words include adjectives, adverbs, nouns, and verbs. The accuracy is comparable with the results achieved by Hatzivassiloglou and McKeown (1997), using a complex four-stage supervised learning algorithm that is restricted to determining the semantic orientation of adjectives.

Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data

2002-12-08T13:52:33-05:00

Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases. I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive.

JohnnyVon: Self-Replicating Automata in Continuous Two-Dimensional Space

2002-12-07T19:26:49-05:00

JohnnyVon is an implementation of self-replicating automata in continuous two-dimensional space. Two types of particles drift about in a virtual liquid. The particles are automata with discrete internal states but continuous external relationships. Their internal states are governed by finite state machines but their external relationships are governed by a simulated physics that includes brownian motion, viscosity, and spring-like attractive and repulsive forces. The particles can be assembled into patterns that can encode arbitrary strings of bits. We demonstrate that, if an arbitrary "seed" pattern is put in a "soup" of separate individual particles, the pattern will replicate by assembling the individual particles into copies of itself. We also show that, given sufficient time, a soup of separate individual particles will eventually spontaneously form self-replicating patterns. We discuss the implications of JohnnyVon for research in nanotechnology, theoretical biology, and artificial life.