
Linguistic Phonetics: The Standard Theory

AMR takes the traditional linguistic approach to speech perception very seriously. This theory of linguistic phonetics is essentially the one presented in Chomsky and Halle's 1968 Sound Pattern of English (SPE, especially pp. 293-301) - and is derived from earlier feature theories [Jakobson et al., 1952, Hockett, 1955]. Although many specifics of phonology have changed over the years, it is difficult to see much change in the treatment of phonetics from within generative phonology since 1968. Only recently in `gestural phonology' [Browman and Goldstein, 1986, Browman and Goldstein, 1995] and in the movement toward `laboratory phonology' has fundamental change occurred. I will try to summarize this standard theory of phonetics as viewed by many linguists.

Competence vs. Performance.

The whole issue derives from the Chomskyan distinction between Competence and Performance. When it comes down to mechanisms, this is normally interpreted as a contrast between two domains. Competence is the domain of discrete variables (that is, symbols and symbol structures) as they are reconfigured and processed in discrete time: processing time involves discrete jumps between system states when a rule is executed. (The structure of events in real time associated with the pronunciation of words is also discrete, but is encoded as the ordering of static objects like segments, words and other syntactic units.) In opposition to this is Performance, the domain of continuous variables evolving in real (continuous) time; the continuous variables include processes related to motor control, audition and speech perception. So Performance is essentially the physical, while Competence is cognitive or mental and is assumed to work on principles similar to those of logic, mathematical proof and digital computation (see Haugeland, 1985 and van Gelder and Port, 1995). Within linguistics, it is an article of faith that language (and probably everything else that is mental) will be best understood in terms of discrete time and discrete symbols.

But the competence-performance distinction (closely related to the Mind vs. Body distinction) creates problems at both output and input. The first problem is how discrete, static symbols independent of real time (in the mind) can control real-time continuous performance (in the body). Presumably the mind controls the body that produces the speech gestures. But, as pointed out by Fowler et al., 1981 and Turvey, 1990, any model of this process must be quite implausible. The problem is that temporal specifications must now be set for every one of those timeless symbols at output time and then somehow performed. It is not so difficult to postulate rules to specify the durations (so-called `temporal implementation rules'), but it is very difficult to imagine any way these specifications could actually be performed that is not full of arbitrariness. In trying to do it, one is forced to keep discovering new static states (since there is almost no end to the subtle contextual effects that can be found), thus causing the problem to blow up exponentially (see, e.g., Port, Cummins and McAuley, 1995). Then some executive system within performance must assume responsibility for assuring that each minisegment type actually lasts the specified amount of time. The second problem, the input problem, is how the auditory system can become a phoneticizer and translate continuous-time auditory events into discrete static symbols. Mechanisms capable of this can be easily constructed, but how could they be designed so as to exploit all the many kinds of subtle temporal information that human speakers and hearers employ? Here too, new intermediate and context-sensitive states tend to proliferate as soon as one looks closely at any data (see, e.g., Port and Rotunno, 1979).
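To make the output problem concrete, here is a minimal sketch of what a `temporal implementation rule' table might look like. The segments, contextual factors and millisecond values below are invented for illustration only, not drawn from any phonetic data; the point is that every newly discovered contextual effect adds another entry, and some executive system must still make the articulators hit the resulting number in real time.

```python
# Toy "temporal implementation" table: each timeless segment must be
# assigned a duration at output time. All values (ms) and contextual
# factors are invented for illustration.
BASE_DURATION_MS = {"ae": 120, "t": 60, "d": 55}

# Each contextual factor multiplies the base duration.
CONTEXT_FACTORS = {
    "phrase_final": 1.4,       # phrase-final lengthening
    "before_voiceless": 0.85,  # vowel shortening before voiceless stops
    "fast_speech": 0.7,        # tempo effect
}

def implemented_duration(segment, contexts):
    """Compute a duration for one segment given its active contexts.

    The difficulty the text points to: every subtle contextual effect
    discovered in the lab adds another factor here, so the inventory of
    static specifications proliferates without obvious limit.
    """
    duration = BASE_DURATION_MS[segment]
    for c in contexts:
        duration *= CONTEXT_FACTORS[c]
    return round(duration, 1)

# A phrase-final vowel before a voiceless stop in fast speech:
print(implemented_duration("ae", ["phrase_final", "before_voiceless", "fast_speech"]))
```

Three contextual factors already yield eight possible durations for one vowel; with realistically many factors, the combinatorics explode in just the way the text describes.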

Phonetic Theory as Alphabet.

Of course, Chomsky and Halle did not need to address these performance problems in order to do phonology. They only needed to have some description of speech articulation and speech perception that was sufficient to characterize the sound systems of human languages. They assumed, naturally, that this would take the form of a list of symbols, an inventory. So they asserted the existence of an interface alphabet, the list of `the phonetic capabilities of man' as they called it in their ringing but now quaintly old-fashioned turn of phrase. These minimal phonetic objects are atomic symbols as far as Competence is concerned, even though within Performance it is assumed they have both articulatory and auditory aspects involving continuous variables and continuous time - very difficult problems that were left to the phoneticians to deal with.

Thus, on the motor side, these discrete phonetic objects can be thought of as providing a universal alphabet of control configurations - all that is necessary for speech production specification and all that could in principle be controlled by the grammar of a language. Nothing beyond these units (that is, no further articulatory or acoustic detail) is supposed to be controllable - at least, not by the grammar of a language (though apparently people can mimic each other in nonlinguistic ways). On the perception side, it is assumed that audition comes with a `phoneticizer' that exhibits `categorical perception' and translates continuous acoustic events into discrete symbolic descriptions. It is these two devices, the input device and the output device, that are responsible for phonetic discreteness. And it is their performance that is thrown into question by incomplete neutralization (as well as by many other data, of course).
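The `phoneticizer' idea can be caricatured in a few lines. In this sketch the 30 ms voice-onset-time boundary and the feature labels are illustrative assumptions, not values proposed by SPE; the point is only that a continuous measurement is collapsed onto a discrete symbol and all within-category detail is discarded:

```python
def phoneticize_vot(vot_ms):
    """Caricature of categorical perception for a voicing contrast.

    A continuous voice-onset-time measurement (ms) is collapsed onto one
    discrete feature value. The 30 ms boundary is an illustrative
    assumption; what matters is that everything else about the signal
    is thrown away.
    """
    return "[+voice]" if vot_ms < 30.0 else "[-voice]"

# Two physically different tokens become identical symbols:
print(phoneticize_vot(5.0))   # short-lag stop
print(phoneticize_vot(25.0))  # longer, but same category
print(phoneticize_vot(60.0))  # long-lag stop
```

Within Competence, the 5 ms and 25 ms tokens are literally the same object; it is exactly this erasure of within-category detail that incomplete neutralization data call into question.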

The traditional linguistic theory of phonetics and speech perception can be summarized this way:

  1. humans can hear the sounds of human speech (only) in terms of a set of discrete sound categories usually called `phones',
  2. each phone is a simultaneous combination of phonetic features,
  3. each phonetic feature is a static, atomic object with a simple, unitary articulatory and auditory specification and no internal temporal structure,
  4. there is some closed set of such features for human language,
  5. individual languages employ subsets of the universal set for their lexico-phonological systems,
  6. children learning their language employ these units to organize the perception of speech and the phonological grammar of the language.

It follows that two phones may be either identical or distinct. If they have the same phonetic features, then they should be phonetically identical - that is, as far as linguistic control is concerned. (Wouldn't this imply that no human should be able to distinguish them perceptually?) If they are different (that is, have distinct phonetic features), then, at the very least, language-learning children and native speakers should be able to differentiate them easily and reliably. Why? Because if children could not be counted on to make the appropriate distinctions for any language, then how could the various languages be reliably and accurately acquired? This criterion prevents the theoretical linguist from simply enlarging the phonetic alphabet without limit. So on close examination, one discovers that according to the standard theory, the stability of the universal inventory of phonetic units is what explains the universal stability of language acquisition.

Acquisition: Adults vs. Children

It is well known that many sound units of a non-native language are quite difficult for adults to acquire. So one must suppose that adults may fail to distinguish the novel sounds of other languages because they have somehow lost much of their innate ability to recognize the sound distinctions of human speech [Werker and Tees, 1984, Lively et al., 1994]. Apparently, for most lay speaker-hearers, only the phonetic categories used in their native language remain usable for speech perception in adulthood.

So what about phoneticians and linguists? How do they evade this phonetic atrophy? Presumably the loss of general phonetic resolution may be reduced by phonetic training. The alphabet of the International Phonetic Association (IPA) and Chapter 7 of SPE are two examples of scientific attempts to organize and list the full set of controllable aspects of speech perception and production. These are supposed to be all that is potentially under linguistic control - whether for contrasting words or simply for controlling the motor system. Linguists and phoneticians cultivate the distinctiveness of these features so as to produce and perceive them. It would seem that a strict version of the standard phonetic theory should predict that humans could never hear more detail than what is provided by the `phoneticizer' (though little is normally made of this). An implication of the notion of a universal phonetic alphabet is that we may hope that professional linguists and phoneticians should be able to approach the sum of the perceptual and motor skills of native speakers of all languages.

Formal Theories in Linguistics.

The Chomsky-Halle model recognized that beyond Competence there is Performance, and acknowledged that the physical signal supporting speech perception is characterized by continuous change over time (e.g., formant trajectories that result from an articulatory motion) and by continuously variable parameters (such as formant frequencies, intensities, lip positions, etc.). However, C-H are very clear that the only aspects of continuous speech events that could be relevant to linguistic competence are those differences that reflect distinct phonetic transcriptions. The grammars of specific languages can only use some specific universal list of phonetic elements.

The fundamental reason for making this bold assumption is the one pointed out by Haugeland (1985, pp. 52-58): symbolic theories simply must assume a set of positively identifiable symbols. That is, the formal system itself - the grammar - must have symbolic objects that are discrete. They must be discrete in order to be infallibly recognizable. The symbols must also be stable over indefinitely long periods of time: if you put a symbol somewhere in memory, it must still be there when the system comes back later to read it. Formal models depend on these properties in order for their rules to function at all and for data structures to literally hold themselves together. In the execution of a computer program (one familiar example of a formal system), these stabilities are assured due to the engineering of the chip. For human cognition, if it is to be a genuine `competence model' as C-H clearly intend it to be, then these properties must be assumed. Without discreteness, infallible recognition and indefinite time stability, computational models simply will not work. Rules cannot be executed if the system cannot be sure when it is looking at an A rather than a B. In short, formal linguistics as we know it cannot be done without the assumption of discrete phonetic symbols.
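Haugeland's point can be made concrete with a toy formal system. In the sketch below the rule, the alphabet and the function name are invented for illustration; what matters is that the rule fires only on an exact, discrete symbol match, so the system must be able to tell infallibly whether a cell holds `A' or `B':

```python
# A toy formal system: one rewrite rule over a discrete alphabet.
# The rule "A -> C / _ B" (invented for illustration) rewrites A as C
# whenever A immediately precedes B.
def apply_rule(symbols):
    """Apply the rule left to right; requires infallible symbol identity.

    Each comparison below presumes the system can say with certainty
    that a cell holds exactly 'A' or exactly 'B'. With graded or noisy
    tokens there would be no fact of the matter about whether the
    rule fires, and the formal machinery could not run at all.
    """
    out = list(symbols)
    for i in range(len(out) - 1):
        if out[i] == "A" and out[i + 1] == "B":
            out[i] = "C"
    return out

print(apply_rule(["A", "B", "A", "D"]))  # fires only on the first A
```

The stability requirement is equally visible here: the list must still contain the same symbols when the rule comes back to read it, or the computation falls apart.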

So C-H had to propose that there is some phoneticizer that chops messy speech into usable symbols. It is assumed that only the output of the speech perception mechanism is available to any language, and this output must be constrained to provide only atomic and static phonetic features selected from the universal set.

This theory, the standard theory of linguistic phonetics, has remained essentially unchanged in the phonological literature since the mid-1960s, although it is probably fair to say that C-H formalized ideas that had been current from the 1930s on. The most fundamental problem is that within Competence there is a sharp distinction between Symbols (as indefinitely time-stable states of the system) and the instantaneous Transitions between states. This idealization of processing within Competence acts as though time were nonexistent! Real time exists neither in the Symbol (which is static) nor in the Transition (which is instantaneous). But real time moves inexorably and can always be examined over much shorter (or much longer) time scales. Events that appear `instantaneous' to our cognitive intuitions may look very slow at the much faster time scale of neurons (just as global neuron behavior looks slow relative to even faster processes like ion-channel activity). The Competence-Performance distinction thus amounts to an assumption that whatever might be happening at any shorter time scale within Performance cannot be relevant in any way to what happens at the longer Competence time scale. This is a bold yet almost unexamined assumption (see Port and van Gelder, 1995 and Kelso, 1995).

By idealizing Symbols and Transitions in this way, computational models of language make continued scientific progress on a theory of language very difficult. A theory of phonology (and of linguistics as a whole) that can be incorporated into modern cognitive science must begin with a far more sophisticated view of the relation between cognition and the physical aspects of the body and the physical world than the simple mapping of performance symbols onto competence symbols offered by an interface alphabet. Insisting on a mere mapping relation between Competence and Performance makes it impossible to understand how language is situated in a nervous system. A practical approach should not assume that linguistics is the study of the symbolic and formal structures of languages; rather, it should view linguistic structures of all kinds, from phones to words to sentences, as events in time. Different kinds of structures `live' on different time scales (e.g., sentences are longer than phonemes). Of course, in modern times we have the technology to write words down on paper or put them in a computer file. We can even sample sound waves and put them in a file as well. Then we scan these displays in both directions looking for patterns. But such a display cannot be assumed to be available - at least not a priori - to human cognition. If such a spatial display of words and sentences does exist cognitively, then accounting for how it could work is an empirical problem. However, to assume that such representations exist seems theoretically reckless, since there is no direct evidence for it whatever [Port et al., 1995]. As used by human speakers and as experienced by cognitive systems, the true dimensional axis of language is time, not space.

It is the questionable assumption that `language is a formal symbolic system' that forces phonology to insist that linguistic phonetics must provide discrete universal objects: phonology needs something formal to manipulate. Of course, a number of other arguments for discreteness have also appeared along the way. Although they are often put forth as relevant evidence by both phonologists and phoneticians, none of these empirical arguments, in my opinion, has more than tangential relevance to the central issue. Nevertheless, it is worth considering these performance-related arguments that seem to bear on the assumption of discrete phonemes.



Robert Port
Mon Mar 3 21:05:28 EST 1997