The problem of lexical ambiguity in information retrieval offers an opportunity for fruitful collaboration between linguistics, mathematical modeling and information retrieval research. In this talk, I will first take a fresh look at some linguistic questions about lexical ambiguity that are relevant in the context of information retrieval. How do we determine how many senses an ambiguous word has? Why is there so much inter-judge disagreement in disambiguation by humans? Is "co-activation", the simultaneous invocation of several senses, possible or even common? I will then sketch a probabilistic model of lexical ambiguity that addresses some of these questions. The parameters of the model are estimated using the EM algorithm. Finally, I will show how this model can be usefully employed in information retrieval with a significant improvement in performance and accuracy.
In 1988, Rodman made the point that, in the long run, the field of sociolinguistics may make the largest contribution to the field of speech recognition. In 1992, an entire issue of Speech Communication was devoted to the influence of style on speech variation, and in 1998, Language and Speech presented a cross-section of articles on prosodic variation in different discourse settings; see also recent edited volumes (e.g., Biber and Finegan 1994; Biber 1995; Rickford and Eckert in press). The consensus appears to be not only that speech does vary radically in different styles, but that the next generation of text-to-speech systems will have to be more sensitive to style variables. In fact, it is only a matter of time before recognition modules will be adapted to various styles of speech, rather than being limited to man-machine interactive style.
This talk will take this evidence of a growing need for knowledge about style variation and its results as a given, and will discuss some specific factors which influence style variation, using evidence gleaned from a variety of discourse settings. The talk will also analyze the effects of specific sociolinguistic parameters on the realization of negatives in actual discourse. After a review of the literature, and a description of the style variables I have found to be important, I will describe the corpora to be analyzed, present evidence of variation, and draw conclusions. I will show how the factors found to be important to the analysis can be incorporated into both the collection and analysis of linguistic data, and I will suggest specific lines of research which should prove fruitful.
REFERENCES
D. Biber ed. (1995). Dimensions of Register Variation. Cambridge: Cambridge University Press.
D. Biber & E. Finegan, eds. (1994). Perspectives on Register: Situating Register Variation within Sociolinguistics. Oxford: Oxford University Press.
J. Rickford & P. Eckert, eds. (in press). Style and Variation. Cambridge: Cambridge University Press.
J. Bernicot, J. Comeau & H. Feider (1994). Dialogues between French-speaking mothers and daughters in two cultures: France and Quebec. Discourse Processes 18:19-34.
The process of text-to-speech synthesis can be divided conceptually into three tasks: text analysis, prosody generation, and waveform generation. At OGI, our group's focus over the past two years has been on developing data-driven methodologies for the waveform generation component of TTS. In this talk, I will describe our investigation of various techniques and our efforts to deploy them in interesting application testbeds. Automatic speech recognition systems have long used "data-driven" (trained) statistical models to condense the vast amount of variability in the feature space of speech into a manageable computational framework. In speech synthesis, the object is to regenerate this variation in accordance with the expectations of human listeners. We are currently exploring methods of using high-quality, single-speaker databases for speech synthesis. For example, "unit selection" is the process of dynamically selecting synthesizer parameters for various phonetic contexts from data, leveraging some of the computational machinery developed for ASR. In comparison to more standard approaches like concatenating waveforms from tables of diphones, unit selection promises to allow the realization of more natural phonetic and prosodic context variability (e.g., vowel reduction). A broader view of unit selection is as a method of selecting targets for parametric waveform models. Modifying the outputs of a unit selection algorithm offers the opportunity (a) to generalize to unseen contexts and (b) to suppress undesired variability. We are currently exploring various signal representations that allow for this kind of flexible modification of spectral envelope trajectories, voice source characteristics, and speaker identity.
We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might translate a given document into a simple query or summary. In order to assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a document prior. We propose a simple and computationally efficient, yet natural and well-motivated statistical model of the document-query translation process, and show how the parameters of this model can be learned in an unsupervised manner. As we demonstrate through a series of experiments on TREC data, significantly improved performance over traditional vector space methods is achieved using this simple translation model, which only begins to tap the full potential of the approach. This is joint work with Adam Berger at CMU.
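As a rough illustration of the scoring idea in the abstract above, the sketch below ranks documents by the probability that the query was generated as a "translation" of each document, optionally combined with a document prior. The specific model form (a word-to-word translation table mixed with a unigram document model), the function names, the probability floor, and the toy data are all assumptions made for illustration; the abstract does not spell out the actual model.

```python
from math import exp, log

def doc_language_model(doc_words):
    """Maximum-likelihood unigram distribution l(w | d) for one document."""
    counts = {}
    for w in doc_words:
        counts[w] = counts.get(w, 0) + 1
    total = len(doc_words)
    return {w: c / total for w, c in counts.items()}

def query_log_likelihood(query_words, doc_words, translation, floor=1e-9):
    """log p(q | d) = sum_i log sum_w t(q_i | w) * l(w | d).

    translation[w][q] holds the (hypothetical) probability t(q | w) that
    document word w is rendered as query word q. The floor avoids log(0).
    """
    l = doc_language_model(doc_words)
    score = 0.0
    for q in query_words:
        p = sum(translation.get(w, {}).get(q, 0.0) * lw for w, lw in l.items())
        score += log(max(p, floor))
    return score

def rank(query_words, docs, translation, log_prior=None):
    """Sort documents by log p(d) + log p(q | d), best first."""
    scored = []
    for doc_id, words in docs.items():
        score = query_log_likelihood(query_words, words, translation)
        if log_prior is not None:
            score += log_prior.get(doc_id, 0.0)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy, hand-made translation table and documents (illustrative only).
translation = {
    "automobile": {"automobile": 0.6, "car": 0.4},
    "dealer": {"dealer": 1.0},
    "banana": {"banana": 1.0},
    "fruit": {"fruit": 1.0},
}
docs = {
    "d1": ["automobile", "dealer"],
    "d2": ["banana", "fruit"],
}
ranking = rank(["car"], docs, translation)
```

In this toy setup the query word "car" never occurs in d1, yet d1 still ranks first because "automobile" can be translated into "car"; this is the kind of vocabulary mismatch a translation-based model is meant to bridge. In practice the translation table would be learned from data rather than written by hand.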
I have spent a lot of time focusing on users--designing programs for them, testing programs with them, observing them using (and not using) technology, and trying to understand exactly how technology affects their work and interactions with each other. From pilots to air traffic controllers to car dealers there ARE common threads. For the last two years, my focus has been on speech and multi-modal interactions. Observing users I have realized how much more social human computer interaction is than I believed--even without speech or NLU. Meanwhile, the common threads continue. Not to mention that fieldwork in darkened rooms and speech recognition can both wreck a nice beach!
The performance of state-of-the-art speech recognition systems is still far worse than that of humans. This is partly caused by the use of poor statistical models by such systems. In a general statistical pattern classification task, the probabilistic models should represent the statistical structure unique to and distinguishing those objects to be classified. In many cases, however, model families are selected without verification of their ability to represent vital discriminative properties. For example, Hidden Markov Models (HMMs) are frequently used in automatic speech recognition systems even though they possess conditional independence properties that can cause inaccuracies when modeling and classifying speech signals. In this work, a new method for automatic speech recognition is developed where the natural statistical properties of speech are used to determine the probabilistic model. Starting from an HMM, new models are created by adding dependencies only if they are not already well captured by the HMM, and only if they increase the model's ability to distinguish one object from another. Based on conditional mutual information, a new measure is developed and used for dependency selection. If dependencies are selected to maximize this measure, then the class posterior probability is better approximated leading to a lower Bayes classification error. The method can be seen as a general discriminative structure-learning procedure for Bayesian networks. In a large-vocabulary isolated-word speech recognition task, test results have shown appreciable word-error reductions relative to a pure HMM system.
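The dependency-selection measure mentioned above is described only as being based on conditional mutual information, so the following is a minimal sketch of plain conditional mutual information I(X; Y | Z) estimated from discrete samples, not the new measure from the talk. The function name and the toy samples are assumptions for illustration.

```python
from collections import Counter
from math import log2

def conditional_mutual_information(samples):
    """Estimate I(X; Y | Z) in bits from a list of (x, y, z) triples.

    Uses empirical counts: I = sum p(x,y,z) * log2[ p(x,y|z) / (p(x|z) p(y|z)) ],
    which simplifies to log2[ n(x,y,z) * n(z) / (n(x,z) * n(y,z)) ].
    """
    n = len(samples)
    c_xyz = Counter(samples)
    c_xz = Counter((x, z) for x, _, z in samples)
    c_yz = Counter((y, z) for _, y, z in samples)
    c_z = Counter(z for _, _, z in samples)
    total = 0.0
    for (x, y, z), c in c_xyz.items():
        total += (c / n) * log2(c * c_z[z] / (c_xz[(x, z)] * c_yz[(y, z)]))
    return total

# X and Y are identical coin flips regardless of Z: one full bit of dependence.
dependent = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
# X and Y are independent given Z: no dependence to add to the model.
independent = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

cmi_dependent = conditional_mutual_information(dependent)
cmi_independent = conditional_mutual_information(independent)
```

A structure-learning procedure in this spirit would score candidate dependencies between variables and add only those with a high score; how the talk's measure additionally accounts for discriminability between classes is not specified in the abstract.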
Information extraction from speech is a crucial step on the way from speech recognition to speech understanding. A preliminary step toward speech understanding is the detection of topic boundaries, sentence boundaries, and proper names in speech recognizer output. This is important since speech recognizer output lacks the usual textual cues to these entities (such as headers, paragraphs, sentence punctuation, and capitalization). Numerous word-based approaches to these tasks have been developed in the past; in this work we demonstrate the use of prosodic cues, alone and in combination with words, for segmentation and name finding. In experiments on the Broadcast News corpus, we find that prosodic cues alone allow sentence and topic segmentation that is at least as good as word-based methods alone, and that combining both types of cues gives significant wins. Named entity recognition, on the other hand, currently does not seem to benefit from prosodic cues, for several interesting reasons. This is joint work with Andreas Stolcke and Elizabeth Shriberg.
Have you ever wondered why just about every sentence seems like a garden path sentence to parsers? For example, in a sentence like "Pat loves cats and dogs," traditional parsers are likely to examine and then abandon the VP structure VP(V(loves) N(cats)) in favor of VP(V(loves) NP(cats and dogs)). This non-deterministic backtracking search process is ubiquitous in our conception of computational linguistics because of the entrenched role of Boolean logic formalisms in linguistics. In this talk, I will introduce an approach based on convex optimizations (i.e., finding the lowest part of bowl-like shapes), which is computationally efficient, deterministic, and has appealing probabilistic and linguistic interpretations.
In my thesis work on recognizing intonation I wanted to approximate an utterance pitch track with a set of connected broken lines. This problem, like parsing, is usually treated as a search problem. It is a difficult problem in that, for example, for 60 possible break locations, there will be 2^60 possible location combinations, the full examination of which would take over 36 million years at the rate of one thousand combinations per second. Just as in parsing, there are clever algorithms for reducing the search (e.g., equivalents of top-down and dynamic programming). I will visually explain an approach that replaces the difficult search with an easy convex optimization. Continuing the bowl analogy, this is like replacing a surface which is essentially bowl shaped but has many uneven parts, with a similar surface that is exactly bowl shaped. This approach is possible because of the inherent structure in utterance pitch tracks.
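The size of the exhaustive search quoted above can be checked with a few lines of arithmetic. The only assumption added here is the length of a year (365.25 days).

```python
# Each of 60 candidate break locations is either used or not: 2^60 combinations.
combinations = 2 ** 60
rate_per_second = 1000                      # combinations examined per second
seconds_per_year = 365.25 * 24 * 60 * 60    # assumed year length

years = combinations / rate_per_second / seconds_per_year
print(f"{years:,.0f} years")
```

The result is a little over 36 million years, consistent with the figure in the abstract.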
I suggest that natural language understanding problems also have inherent structure which can be exploited. I will re-cast parsing in a convex optimization framework and demonstrate how this approach scales gracefully to context-sensitive languages and to joint intonation and syntax parsing.
------------
**NOTE: This talk does not presume detailed mathematical understanding.
Existing audio tools are inadequate to handle the increasing amounts of audio and multimedia data that are becoming available on the Web and elsewhere. The reason is that they treat audio as a linear monolithic block of data. To address this problem various techniques for extracting content and structure have been proposed.
Traditionally, most approaches have relied on classification with a statistical class model. Another approach is to segment audio, and especially music, without any class model, based on temporal changes. The segmentation is achieved by tracking multiple spectral features in time. The time lines resulting from the segmentation can be used for browsing and annotation. In addition they can be used as a preprocessing step for classification techniques.
In this talk an algorithm for segmenting music will be described. A framework for prototyping audio analysis tools that uses this algorithm together with classification algorithms has been developed and used for evaluation. Future work and possible connections with speech research will also be mentioned.
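To make the idea of class-model-free segmentation concrete, here is a minimal sketch that marks a boundary wherever the per-frame feature vector changes abruptly. It is not the algorithm from the talk: the Euclidean distance, the mean-plus-k-standard-deviations threshold, the function names, and the toy feature values are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def frame_distances(features):
    """Euclidean distance between each pair of successive feature vectors."""
    return [
        sqrt(sum((a - b) ** 2 for a, b in zip(prev, cur)))
        for prev, cur in zip(features, features[1:])
    ]

def segment_boundaries(features, k=2.0):
    """Return frame indices where the features change unusually fast.

    A boundary is placed at frame i + 1 when the distance between frames
    i and i + 1 exceeds mean + k * standard deviation of all distances.
    """
    d = frame_distances(features)
    threshold = mean(d) + k * pstdev(d)
    return [i + 1 for i, dist in enumerate(d) if dist > threshold]

# Toy input: two hypothetical spectral features per frame (for example a
# normalized spectral centroid and a flux-like value), with a change at frame 5.
features = [[0.1, 1.0]] * 5 + [[0.9, 3.0]] * 5
boundaries = segment_boundaries(features)
```

The resulting boundary list can serve as the "time line" for browsing, or as a way to cut the signal into regions before passing each region to a classifier.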
This paper describes a CALL (computer-aided language learning) system for correcting vowel insertions in Japanese-accented English. Epenthesis (specifically anaptyxis and proparalepsis) is a frequent pronunciation error among Japanese learners of English as a second language. Epenthetic speech is incomprehensible to native speakers of English even after considerable exposure to Japanese-accented speech. However, most Japanese teachers of English are unaware of the severe impact epenthesis has on intelligibility because they understand epenthetic speech perfectly.
Our language learning system heightens the awareness of epenthesis and improves English pronounced by Japanese speakers. The system's core is a speech recognizer running in forced alignment mode (i.e., phone labels are obtained given a small set of possibly correct transcriptions of the utterance). The learner's speech is recognized phone by phone using a pronunciation lattice of American English phones including Japanese vowels where epenthesis may occur. We use bilingual acoustic models to accommodate nonnative accents at the phone level.
The system displays English words, phrases, and sentences on the computer screen and instructs the learner to read them aloud. The reading material consists of basic vocabulary items containing many consonant clusters and syllable-final consonants that trigger vowel insertion. Loans from English having fixed Japanese pronunciations are included to illustrate pronunciation differences between the two languages. We created a set of phonological rules that convert pronunciation lattices of English citation forms into pronunciation lattices of Japanese-accented speech. These phonological rules can be used to add new reading material easily. The system alerts the learner whenever epenthetic vowels are recognized, and instructs the learner on how to pronounce the target utterance correctly.
We ran evaluation experiments to verify the performance of the component technology. Three male native speakers of Japanese read twelve isolated English words and one four-word sentence. Error rates for phone-level speech recognition and for detecting epenthetic vowels (including false alarms) were calculated by comparing the system's recognition results with hand-labeled phone sequences. Phone-level recognition error rate was one percent and epenthetic vowel detection error rate was two percent.
We propose a subdomain of automatic speech recognition called native language identification (abbreviated L1ID, where L1 is a term borrowed from foreign language teaching meaning the speaker's native language). The objective of native language identification is determining the speaker's native language X given that he is speaking language Y. Imagine an automated telephony system that accepts rental car reservations in English. Suppose a native speaker of Japanese wants to rent a car. If the person speaks good English the system proceeds with the reservation request. Otherwise the call is transferred to a Japanese-speaking operator.
The key to native language identification is modeling how a person's native language affects language production (for example in pronunciation, choice of words, and syntactic structure). This paper focuses on systematic phonological changes. We used HMMs for American English with phonological rules for American English, Japanese-accented English and Chinese-accented English. Experimental results suggest native language identification might be a useful technology.
We propose a new method of obtaining features from speech signals for robust analysis and recognition -- the Non-uniform Linear Prediction (NLP) cepstrum. The objective is to derive a representation that suppresses speaker-dependent characteristics while preserving the linguistic quality of speech segments.
The analysis is based on two principles. First, Bark-frequency warping is performed on the LP spectrum to emulate the auditory spectrum. While widely used methods such as mel-frequency and PLP analysis use the FFT spectrum as their basis for warping, the NLP analysis uses the LP-based vocal-tract spectrum with the glottal effects removed. Second, all-pole modeling (LP) is used before and after the warping for spectral smoothing. Pre-warp LP is used to first obtain the vocal-tract spectrum, while post-warp LP is performed to obtain an even smoother version of the warped spectrum, especially in the high frequency region.
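For readers unfamiliar with Bark-frequency warping, the sketch below shows one common analytic approximation of the Bark scale (Traunmüller's formula, without its low- and high-end corrections) and how it yields sampling frequencies that are uniform on the Bark axis. The abstract does not say which Bark approximation the NLP analysis uses, so the formula choice and the function names are assumptions.

```python
def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz (Traunmueller's formula)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Inverse of hz_to_bark, obtained by solving the formula for f."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_spaced_frequencies(f_max_hz, n_points):
    """Frequencies in Hz that are equally spaced on the Bark axis from 0 Hz."""
    z_min = hz_to_bark(0.0)
    z_max = hz_to_bark(f_max_hz)
    step = (z_max - z_min) / (n_points - 1)
    return [bark_to_hz(z_min + i * step) for i in range(n_points)]

# Sample points for warping a spectrum that extends to 8 kHz.
freqs = bark_spaced_frequencies(8000.0, 20)
```

Evaluating a spectral envelope at these frequencies gives a warped spectrum with finer resolution at low frequencies and coarser resolution at high frequencies, which is the general effect the warping step is meant to achieve.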
Two pilot experiments were conducted to test the effectiveness of the proposed feature. First, linear discriminant analysis (LDA) was performed on English vowel data obtained from 23 speakers in order to measure the scatter of vowel clusters in the discriminant feature space. Also, the same dataset was used for a frame-based vowel classification experiment with a simple Gaussian model.
The NLP analysis was effective in reducing the inter-speaker variability of each vowel class while preserving the linguistically relevant cues. Recognition experiments involving larger tasks (e.g. WSJ) are currently being conducted.
I will present some preliminary results towards the design and implementation of an Interactive Vocal Information Retrieval System that can be used to access articles of a large newspaper archive using a telephone. I will also address some of the more general issues involved in accessing a large archive of textual documents using speech.
HCI work at RSC is focused on advancing the methods of interaction between man and machine. To this end, we develop new technologies, exploit commercial-off-the-shelf (COTS) technologies, and create testbeds for multi-modal interaction.
In this talk, we will first provide an overview of the technologies being developed in our group. Augmented Reality, a concept for presenting computer-generated information in synchrony with the user's view of the world, is an important new technology being developed in our group. We will describe our approach to develop augmented reality methods for outdoor and indoor scenarios. Next, we will present our approach to enable bimodal speech recognition through visual tracking of the speaker's lips. Then we will talk about our work on using COTS technologies for HCI (eye tracking, 3D audio, head tracking, speech recognition), and conclude with a description of our testbed for multi-modal interaction.
Bios: Sundar, Reinhold, and Michael have been with the HCI department at RSC since 1996. Sundar graduated from NYU with a thesis on image motion, and worked on robotic vision and neural network models before joining RSC; his research interests include virtual and augmented reality, and multi-modal testbeds. Reinhold's thesis was on visual lane tracking for automated highway applications; currently, he is working on registration and sensor fusion methods for augmented reality. Michael's thesis work was on probabilistic models for image reconstruction, and he worked on deformable models for tracking before joining RSC; his research interests are in bimodal speech recognition and real-time visual interfaces.
At the office, at home or on the go, computing devices are being asked to provide interfaces to an increasing number and variety of computerized tasks. Today, most of these applications use 2D Interfaces that can offer usability levels ranging anywhere from mediocre to excellent. Our claim is that in some instances, 3D Interfaces can provide for more intuitive and productive user experiences by enhancing data browsing and manipulation tasks. During this presentation we will explain why we believe so; introduce some of the issues associated with 3D Interfaces in general; show practical uses of such interfaces; discuss the latest advances in 3D desktop acceleration; and finally suggest directions of research in that area. To illustrate some of the concepts discussed in the presentation, the prototype of a real time 3D Interface will be demonstrated.
Bio:
Denis Amselem started as an intern in the Virtual Reality Lab. at SRI International in 1993, where he worked for two years on a Multi-User 3D Virtual Space project. In 1996, he joined 3Dfx Interactive, a 3D Graphics Chip Design start-up that has since become one of the leading 3D Graphics manufacturers. His activities at 3Dfx Interactive involved evangelizing the company's technology by writing 3D demonstration programs that illustrate the features and performance of the Voodoo chip family; making presentations and talks to game developers; and supporting developers with their technical questions.
Probabilistic techniques have brought great performance boosts to many areas of natural language processing, but in recent years computational linguistics conferences have started to look increasingly like machine learning conferences, where people report their small performance gains from the latest hot technique, rather than anyone doing much to improve our understanding of knowledge of language. How much can better linguistics contribute to improved statistical NLP? In this talk I'll present results from some small studies in the areas of part-of-speech tagging, subcategorization and parsing aimed at addressing this question, as well as presenting some of the difficulties.
Chris Manning has a joint appointment in Computer Science and Linguistics at Stanford. He is co-author of the recent book "Foundations of Statistical Natural Language Processing".
This talk will describe a robust system for information extraction from spoken language data. The system extends previous HMM work in information extraction, using a state topology designed for explicit modeling of variable-length phrases and class-based statistical language model smoothing to produce state-of-the-art performance for a wide range of speech error rates. Experiments on broadcast news data show that the system performs well with temporal and source differences in the data. In addition, strategies for integrating word-level confidence estimates into the model are introduced, showing improved performance by using a generic error token for incorrectly recognized words in the training data and low confidence words in the test data. (This is joint work with Mari Ostendorf at UW and John Burger at MITRE.)
For about a decade, our group at ICSI/UC Berkeley has been building models of language learning and use, based on cognitive linguistics. Among the central ideas are: image schemas, action semantics, contextual frames, metaphors and integrated constructions. Recent results have encouraged us to apply this suite of ideas to problems of intermediate scale in HCI and to seek collaborators in this effort.
Noise compensation for Automatic Speech Recognition systems is a problem that has not been satisfactorily solved. The best noise compensation techniques available to date work on the explicit or implicit assumption that the corrupting noise is stationary in nature. In reality, however, corrupting noise is very often transient or non-stationary. A different domain of techniques is required to handle such noise - one that makes no assumption whatsoever about the statistical nature of the noise. One such approach is that of missing feature compensation, wherein noise-corrupted spectrographic regions are identified as "missing" or "corrupted" and are estimated or compensated using various techniques. There is no implicit or explicit assumption about the stationarity of the noise. There are two aspects to this problem - identifying these damaged regions in the spectrogram, and effectively compensating for them. This talk deals with the second aspect.
Almost all currently reported techniques that work within the framework of missing feature compensation do so by explicit modelling of the effect of missing features on the recognizer. This aggravates the issue of computational complexity and complicates the issue of feature usage by the recognizer. At CMU we are developing missing feature inference methods that are independent of the recognizer, generalizable to noise type and computationally efficient. The scope and variability of feature usage by the recognizer remains unaffected. This talk is limited to a discussion of two such techniques. The first uses a cluster-based representation to capture a priori information about the statistical properties of speech. The second uses simple correlation statistics to capture the a priori information. These a priori statistics are then used to reconstruct damaged or missing portions of spectrograms. When handling noise-corrupted speech, additional bound information is implicit in the noisy speech. These bounds are also incorporated into the estimation procedure in such cases. Experimental results show that the presented techniques are highly effective when the corrupt portions of the spectrogram are known. They also offer greater flexibility as compared to existing techniques.
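The correlation-based reconstruction idea can be illustrated with a deliberately small sketch: estimate one missing spectrogram value from one reliable value using statistics gathered from clean speech, then apply a bound. This is not the method from the talk. It assumes a linear (jointly Gaussian) relationship between the two components and assumes that the bound is an upper bound given by the observed noisy value; the function names and toy numbers are hypothetical.

```python
def correlation_stats(pairs):
    """Means, variance, and covariance from clean (reliable, target) pairs."""
    n = len(pairs)
    mu_obs = sum(o for o, _ in pairs) / n
    mu_miss = sum(m for _, m in pairs) / n
    var_obs = sum((o - mu_obs) ** 2 for o, _ in pairs) / n
    cov = sum((o - mu_obs) * (m - mu_miss) for o, m in pairs) / n
    return mu_obs, mu_miss, var_obs, cov

def estimate_missing(x_obs, stats, upper_bound=None):
    """Linear estimate of the missing value from a reliable neighbour.

    estimate = mu_miss + (cov / var_obs) * (x_obs - mu_obs)
    If an upper bound is supplied (assumed here to be the noisy observation),
    the estimate is clipped so that it never exceeds it.
    """
    mu_obs, mu_miss, var_obs, cov = stats
    estimate = mu_miss + (cov / var_obs) * (x_obs - mu_obs)
    if upper_bound is not None:
        estimate = min(estimate, upper_bound)
    return estimate

# Toy clean-speech training pairs: (reliable component, component to recover).
stats = correlation_stats([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])

unbounded = estimate_missing(2.5, stats)
bounded = estimate_missing(2.5, stats, upper_bound=4.5)
```

Because the reconstruction operates on the spectrogram itself, the repaired features can be passed to any recognizer unchanged, which matches the recognizer-independence goal stated in the abstract.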