We defined the categories on which the network was trained in terms of the properties of the categories' extensions (volume, shape, overlap) and in terms of the presence of form-to-form associations between a linguistic context specifying the question asked of the network and the linguistic outputs that were possible answers to that question. The network, of course, does not have direct access to any of these global properties of the learning task. It simply receives one category example at a time and for each modifies its weights in such a way that it stores a composite record of the instances of each category. The network in no sense stores category boundaries or anything like the representations of category extensions we have used throughout this paper to visualize the differences between nouns and adjectives.
Why, then, do factors such as shape, volume, and overlap matter as they do? Two factors are fundamental to the network's performance: (1) the distance between members of the same category relative to the distance between members of different categories, and (2) the degree of redundancy in the input.
Each input the network receives represents a point in its multi-dimensional input space. Via the weights connecting the input layers and the hidden layer, the network maps this point in input space onto a point in multi-dimensional hidden-layer space. Inputs which are similar---close to each other in input space---will tend to map onto points which are close to each other in hidden-layer space. Points in hidden-layer space in turn are mapped onto points in category space via the weights connecting the hidden layer and the output layer. Before training, these mappings will be random, depending on the randomly generated initial weights. As training progresses, however, the weights in the network take on values which permit regions in input space to be associated roughly with the appropriate regions in category space. This involves some readjustment of the regions in hidden-layer space associated with inputs. In particular, inputs belonging to the same category will tend to map onto relatively compact regions in hidden-layer space [HHL91]. Each time the network is trained on an instance of a category, the weights in the network are adjusted in such a way that the corresponding point in input space tends to be assigned to the region in output space associated with the category. When a test item is presented to the network, where it maps to in category space depends entirely on where it is in input space, in particular, on how far it is from previously trained inputs. The test input is implicitly compared to all of them. Thus the network is an instance of an exemplar-based model of categorization (e.g., Nosofsky, 1986).
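To make the exemplar-based picture concrete, the following sketch implements a stripped-down exemplar model in the spirit of such accounts (this is an illustration, not the network reported here; the Euclidean distance metric, the exponential similarity function, the sensitivity parameter c, and the toy two-category layout are all assumptions of the sketch):
\begin{verbatim}
import numpy as np

def exemplar_category_probs(x, exemplars, labels, c=4.0):
    """Stripped-down exemplar model: an input is implicitly compared
    to every stored exemplar, and the evidence for a category is the
    summed similarity to that category's exemplars."""
    dists = np.linalg.norm(exemplars - x, axis=1)  # distance in input space
    sims = np.exp(-c * dists)                      # similarity falls off with distance
    cats = np.unique(labels)
    evidence = np.array([sims[labels == k].sum() for k in cats])
    return cats, evidence / evidence.sum()

# Hypothetical four-dimensional exemplars of two categories.
rng = np.random.default_rng(0)
cat_a = rng.uniform(0.0, 0.4, size=(20, 4))
cat_b = rng.uniform(0.6, 1.0, size=(20, 4))
exemplars = np.vstack([cat_a, cat_b])
labels = np.array(["a"] * 20 + ["b"] * 20)

print(exemplar_category_probs(np.full(4, 0.3), exemplars, labels))
\end{verbatim}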
In these models, it is the relative distance between an input and previously learned exemplars of the different categories which determines the behavior of the system.
If a given input is likely to be as close to a previously trained member of another category as it is to previously trained members of its own category, error will tend to be high, and learning will take longer, requiring more examples of each category. More examples result in a greater density of within-category examples, which can compensate for the nearness of distracting examples of other categories to a test input.
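The relevant comparison can be made explicit. The sketch below (again illustrative; the Euclidean metric and the particular category regions are assumptions) computes, for a set of stored exemplars, the average distance to the nearest exemplar of the same category and to the nearest exemplar of any other category. As the number of examples per category grows, the first quantity shrinks much faster than the second, which is bounded below by the gap between the category regions:
\begin{verbatim}
import numpy as np

def within_between_stats(points, labels):
    """Mean distance from each exemplar to its nearest same-category
    neighbor and to its nearest other-category neighbor (Euclidean)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # ignore self-distances
    same = labels[:, None] == labels[None, :]
    within = np.where(same, d, np.inf).min(axis=1)
    between = np.where(~same, d, np.inf).min(axis=1)
    return within.mean(), between.mean()

rng = np.random.default_rng(1)
for n in (10, 40, 160):                            # examples per category
    a = rng.uniform(0.0, 0.45, size=(n, 4))        # category "a" region
    b = rng.uniform(0.55, 1.0, size=(n, 4))        # category "b" region
    points = np.vstack([a, b])
    labels = np.array([0] * n + [1] * n)
    print(n, within_between_stats(points, labels))
\end{verbatim}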
Category volume and compactness both relate to this relative distance measure. As category volume increases and the number of examples remains constant, density within categories decreases: the average distance between members of each category increases. At the same time, the boundaries of different categories approach each other, so that for a given example of one category, the nearest distractor becomes nearer. Thus increasing volume leads to greater potential confusion between categories.
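The effect of volume can be seen directly in a small Monte Carlo sketch (the dimensionality, the hypercube-shaped region, and the particular sizes are hypothetical): with the number of examples held constant, scaling up the region they are drawn from scales up the average distance between them in proportion.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 50                                  # examples per category, held constant
for side in (0.5, 1.0, 2.0):            # side length of a cube-shaped category
    points = rng.uniform(0.0, side, size=(n, 4))
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    mean_dist = d[np.triu_indices(n, k=1)].mean()
    print(side, round(mean_dist, 3))    # grows in proportion to the side length
\end{verbatim}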
As category compactness decreases, we also see an increase in the average distance between members of a category. Consider two extreme cases: a set of parallel ``hyperslabs'', which extend across the full range of values on all dimensions but one, and a set of evenly spaced hyperspheres of the same volume as the hyperslabs. The average distance between members of the same category is greater for the hyperslabs because two members may be arbitrarily far apart on all but one dimension. At the same time, the average distance between a member of one category and the nearest distractor in another category is smaller for the parallel hyperslabs, since the boundary of the nearest other category lies just across the narrow hyperslab-shaped gap separating the categories. Thus decreasing compactness, like increasing volume, means greater difficulty because of the potential confusion with examples of competing categories.
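The within-category half of this argument is easy to verify numerically. In the sketch below (the dimensionality, the particular volume, and the sampling scheme are assumptions of the illustration), points drawn from a thin four-dimensional ``hyperslab'' are on average considerably farther apart than the same number of points drawn from a hypersphere of equal volume:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, volume = 400, 0.1

# Hyperslab of volume 0.1: thin on dimension 0, full range on the rest.
slab = rng.uniform(0.0, 1.0, size=(n, 4))
slab[:, 0] *= volume

# Four-dimensional hypersphere of the same volume: V = (pi^2 / 2) * r^4.
r = (2.0 * volume / np.pi**2) ** 0.25
directions = rng.normal(size=(n, 4))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = r * rng.uniform(size=n) ** 0.25      # uniform density inside the ball
ball = directions * radii[:, None]

def mean_pairwise(points):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return d[np.triu_indices(len(points), k=1)].mean()

print("slab:", round(mean_pairwise(slab), 3))   # roughly 0.66
print("ball:", round(mean_pairwise(ball), 3))   # roughly 0.42
\end{verbatim}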
A further factor in category difficulty, though not as important in our results, is the degree of redundancy in the input. If more than one input unit conveys information about the category of an input pattern, then more network resources (weights) will be dedicated to representing the input-to-category mapping than would be the case if only one unit were relevant. In our experiments there is redundancy in all input patterns because of the use of thermometer encoding. On a given sensory dimension, all units to the ``left'' of a unit which is activated are redundant. However, in Experiment 4, some categories (namely, those with lexical dimension input) had the benefit of more redundancy than other categories. Recall that in this experiment, lexical dimensions were not required to categorize inputs, which were unambiguous on the basis of sensory input alone. Thus the redundant linguistic input gave an advantage to those categories for which it was available. Note, however, that while real adjective categories tend to be distinguished in part by lexical dimensions, they also tend to overlap with one another. When there is overlap, the lexical dimension is no longer redundant; rather, in combination with the sensory input, it is necessary for determining the category of the input.
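For concreteness, a thermometer code of the kind used for the sensory dimensions can be sketched as follows (the number of units per dimension and their spacing here are hypothetical, not those of the simulations): a value turns on the unit corresponding to its level together with every unit to its ``left'', so those lower units repeat information that the highest active unit already carries.
\begin{verbatim}
import numpy as np

def thermometer(value, n_units=8, lo=0.0, hi=1.0):
    """Thermometer encoding of a scalar sensory value: the unit whose
    level is reached and every unit to its 'left' are turned on."""
    levels = np.linspace(lo, hi, n_units)
    return (levels <= value).astype(float)

# Nearby values share most of their active units, so the code is redundant.
print(thermometer(0.30))   # [1. 1. 1. 0. 0. 0. 0. 0.]
print(thermometer(0.60))   # [1. 1. 1. 1. 1. 0. 0. 0.]
\end{verbatim}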
In sum, these two factors, (1) relative within- and between-category exemplar distances and (2) input redundancy, account for the results of our experiments. Interestingly, a third potential factor, the extent to which a particular input sensory dimension is relevant for a category, did not play a significant role. In Experiment 3, ``adjective'' categories were defined in such a way that a single dimension mattered much more than the other three. For ``nouns'', on the other hand, each sensory dimension was equally relevant. A learner with a propensity to selectively attend to particular sensory dimensions might find the adjectives easier. For the network, by contrast, the relevance of a single dimension confers a disadvantage rather than an advantage, and this result agrees with what we find for children.