Starting from the Bayes decision rule as in speech recognition, we derive the architecture of an automatic system for text translation. This approach results in three modeling components: the lexicon model, the alignment model, and the language model of the target language. The generation of the target sentence is accomplished by the search process as required by the Bayes decision rule. In particular, the approach implemented and tested makes use of HMM-like alignment models. In addition, the input sentence is subjected to several preprocessing steps and a word re-ordering procedure so that the source sentence and the target sentence become more similar. As a result, the generation of the target sentence can be performed by using dynamic programming. First experimental results were obtained on the Eutrans task (300-word vocabulary) and on the Verbmobil task (2000-word vocabulary).
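For orientation, the decision rule and the three model components named above are commonly written as follows in statistical translation. The notation is an assumption made here and is not taken from the abstract: f_1^J is the source sentence, e_1^I the target sentence, and a_1^J the alignment. The second line shows one usual form of a first-order HMM alignment model, which may differ in detail from the models used in the talk.

```latex
% Bayes decision rule: language model times translation model
\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I \mid f_1^J)
            = \arg\max_{e_1^I} \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I)

% Translation model split into alignment model and lexicon model
\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J}
    p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j})
```

Here \Pr(e_1^I) is the target language model, p(a_j \mid a_{j-1}, I) the alignment model, and p(f_j \mid e_{a_j}) the lexicon model.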
One of the best speech enhancement algorithms to emerge in recent years is based on autoregressive hidden Markov models (HMMs). This system forms a minimum mean squared error (MMSE) estimator of the speech using trained speech and noise HMMs. The work presented here extends this approach to the case where the noise models are unknown. A maximum likelihood technique is used to reestimate the noise statistics from the utterance to be enhanced. Results are presented for experiments using the "Lynx" noise from the NOISEX database. These verify the efficacy of the algorithm. In addition, the ability of autoregressive HMMs to model speech is discussed.
This talk will try to answer the following question: is there a place for text-to-speech synthesis in a world of cheaper and cheaper digital storage, faster and faster microprocessors, better and better speech coding algorithms, nearly ubiquitous Internet connections and consumers with no patience for Interactive Voice Response systems? After the obligatory mention of Von Kempelen, we will review the different speech synthesis techniques used over the past decades, concluding with the apparent consensus that methods ending in OLA yield the better results, and not only for Spanish synthesis. The traditional blocks of a text-to-speech system will be described (input to text, text to phoneme, prosodic rules, speech parameter generation and signal processing), while noting that state-of-the-art systems display increasingly blurred lines between these different blocks. The presentation will then veer from how it's done to why it's done, i.e. a description of the current applications of text-to-speech synthesis with their relative strengths and weaknesses. Potential future applications will be proposed and discussed. The conclusion will try not to be too pessimistic for text-to-speech researchers.
What does it mean to understand a discourse between family members? At first it seems hopeless to attempt higher-level understanding on a task like CallHome Spanish, where the speech task and the understanding task are each incredibly complicated, let alone both together. The question of understanding can, however, be formulated in a way that allows robust methods to be applied. The initial question is therefore what to detect, and I will report on our efforts on this. In the main part of the talk I will introduce the concept of discriminative distributions. The HMM of n-grams, as used for speech act recognition at the Johns Hopkins Summer Workshop, is a powerful tool, yet it might waste a lot of parameters on modeling non-discriminative features. To showcase this I will open the hood of our current speech act recognizer (and segmenter) and give first evidence on what the discriminative distributions for speech act recognition and segmentation are. Following the same lead of modeling only the discriminative distribution, a different integration of prosodic features into a speech act recognition algorithm will be shown, although no results are available on this yet. Discriminative distributions also play a role in language model adaptation.
This talk presents a new training procedure for speaker verification systems. The procedure extends previous speaker verification work by (1) developing a new discriminative a posteriori-based training algorithm, and (2) extending the algorithm to directly optimize speaker verification performance. The key features of the new training algorithm include leveraging current state-of-the-art technology by initializing the system with Bayesian-adapted Gaussian mixture models. The discriminative training algorithm then adjusts parameters of these models to directly minimize a verification cost function (VCF) representing the expected costs of falsely accepting impostors and falsely rejecting true claimants. Results are presented from the 1997 NIST Speaker Recognition Evaluation corpus indicating that the VCF performance can be improved with this procedure, but at the expense of reduced system performance at other operating points (different false alarm and false rejection costs). Joint work with Yochai Konig.
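As an illustration of what a cost of this kind measures, the sketch below evaluates an expected verification cost at a fixed decision threshold. It is a minimal sketch, not the VCF used in the talk: the function name, the weighted-sum form, and the default costs and prior are assumptions chosen for illustration.

```python
from typing import Sequence


def verification_cost(
    target_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
    c_miss: float = 10.0,    # assumed cost of rejecting a true claimant
    c_fa: float = 1.0,       # assumed cost of accepting an impostor
    p_target: float = 0.01,  # assumed prior probability of a true claimant
) -> float:
    """Expected cost of accept/reject decisions made at `threshold`.

    A trial is accepted when its score is >= threshold.
    """
    # Fraction of true-claimant trials that are falsely rejected.
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    # Fraction of impostor trials that are falsely accepted.
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


if __name__ == "__main__":
    targets = [2.0, 1.0, -0.5, 3.0]
    impostors = [-1.0, -2.0, 0.5, -3.0]
    print(verification_cost(targets, impostors, threshold=0.0))
```

Changing the two costs or the prior moves the threshold at which the cost is lowest, which is why a system tuned for one set of costs can do worse at other operating points.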
Speech recognition systems that are based on hidden Markov modeling (HMM) assume that the mean trajectory feature vector within a state is constant over time. In recent years, segment models that attempt to describe the dynamics of the speech signal within a phonetic unit have been proposed. Some of these models describe the mean trajectory over time as a random process. In this talk we present the concept of a scaled random trajectory segment model, which aims to overcome the modeling problem created by the fact that segment realizations of the same phonetic unit differ in length. The new model is supported by direct experimental evidence. It offers the following advantages over the standard (non-scaled) model. First, it shows improved performance compared to the non-scaled model. This is demonstrated using phone classification experiments. Second, it yields closed-form expressions for the estimated parameters, unlike the previously suggested non-scaled model, which requires more complicated iterative estimation procedures.
American English listeners perceive lexical stress through a complex combination of acoustic correlates such as energy, duration, and fundamental frequency (F0). These three acoustic correlates were used in a study to detect the stress level of the full vowels in the database. Two different approaches, a segment-based approach and a rhythm-unit-based approach, were developed. Energy and duration measurements are segment-bounded measurements. The segment-based approach uses pattern recognition with energy- and duration-based measurements as features to build Bayesian classifiers to detect the stress level of a vowel segment. Feature normalization and classifier design are the main focus of the development of the segment-based lexical stress detection algorithm. Three different pairs of routines were established to remove the vowels' intrinsic differences, such as differences in vowel length and amplitude between different vowels, and also to remove the environmental effects in the utterance, such as speaking rate and end-of-phrase lengthening. Three sets of classifiers were created to study whether the separation of the vowel category has a strong effect on the performance of the classifiers. The vowel category is determined by whether the vowel segment is a monophthong or a diphthong and whether the vowel segment is part of a monosyllabic word or a polysyllabic word.
In the second approach, the F0-based measurement was used as the feature for determining the most prominent vowel in a polysyllabic word. For F0, although the segment-based measures are important (i.e., the mean and the peak F0 values of the vowel segment), the F0 contour pattern of the region around the vowel segment cannot be neglected. Therefore, the F0-based measurement extends beyond the boundaries of a vowel segment (it is suprasegmental). Based on the concept of 'foot' in Metrical Phonology, an intermediate unit between a syllable and a word was defined and named the 'rhythm unit'. A duration-based segmentation routine was developed to break polysyllabic words into rhythm units.
Not available.
Self-study methods for pronunciation learning should tell learners what their mistakes were, whether their speech is intelligible, and what they can do to improve their pronunciation. With these goals in mind, I developed a CALL (computer-aided language learning) system for teaching the pronunciation of Japanese tokushuhaku (long vowels, the mora nasal and mora obstruents) to nonnative speakers of Japanese. Long vowels and short vowels are spectrally almost identical but their phone durations differ significantly. Similar conditions exist between mora nasals and non-mora nasals, and between mora and non-mora obstruents. My CALL system asks the learner to read tokushuhaku minimal pairs. The system uses a speech recognizer to measure the duration of each phone and to tell the learner the percentage of native speakers who will understand the learner's utterance as the learner intended. These percentages are based on perception experiments where native speakers judged the confusability of tokushuhaku minimal pairs containing tokushuhaku with various synthesized durations. The system then instructs the learner to either shorten or lengthen his or her pronunciation.
In addition to teaching phone duration, the system detects mistakes in phone quality by using a speech recognizer incorporating bilingual monophone models of both the learner's native and target languages (in this study, L1 = American English and L2 = Japanese). HMMs for the two languages are trained separately on language-dependent speech data, but are bundled together during recognition so that the closest phone recognized indicates the nonnativeness of the utterance should the recognized phone not be Japanese.
This project, still in its first year, aims to compare statistical language models used in speech recognition (such as bigrams) with finite-state language models obtained by grammatical inference. We use real corpora from an oral dialogue system, within a contract with France-Telecom CNET. Presently, two types of grammars are under experimentation: non-stochastic finite-state grammars (with an improvement of the ECGI inference method) and stochastic automata (with the ALERGIA algorithm).
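To make the statistical side of this comparison concrete, the following minimal sketch estimates a bigram model by relative frequency. It is an illustration only: the function name, the sentence-boundary markers and the toy sentences are assumptions, and no smoothing is applied, so unseen word pairs simply get no entry.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def train_bigram(sentences: Iterable[List[str]]) -> Dict[Tuple[str, str], float]:
    """Return maximum likelihood estimates of P(word | previous word)."""
    history_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for words in sentences:
        # Hypothetical boundary markers so that sentence starts and ends are modeled.
        padded = ["<s>"] + list(words) + ["</s>"]
        history_counts.update(padded[:-1])
        pair_counts.update(zip(padded[:-1], padded[1:]))
    return {
        pair: count / history_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


if __name__ == "__main__":
    corpus = [["je", "veux", "un", "billet"], ["je", "veux", "partir"]]
    model = train_bigram(corpus)
    print(model[("veux", "un")])
```

A bigram model like this assigns a probability to every word pair it has seen, whereas an inferred finite-state grammar accepts or rejects whole word sequences (or, in the stochastic case, scores complete paths through the automaton).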
In subjective perceptual evaluations of audio codec quality, lack of agreement between listeners or groups of listeners is a common problem arising even in the most carefully conducted tests, under tightly controlled conditions. Examples can be found in listening tests on the MPEG-2 Non-Backwards Compatible algorithm (1996), tests of MPEG-2 algorithms (1994), and tests performed for the FCC Advisory Committee on Advanced Television Service (1993). In particular, in MPEG testing, listener groups at different test sites have repeatedly been found to generate sufficiently statistically different ratings that combining their results would obscure the effects of the codecs and musical excerpts being judged. Because listening conditions at the test sites were fairly strictly controlled, rating differences were likely caused by differences among the listeners in their sensitivity and amount of attention paid to various kinds of artifacts. Therefore, to examine listener differences, we designed an evaluation procedure which would yield information both about codec quality and about the strategies used by each listener in making judgements. The analysis used a multivariate statistical technique to build listener-specific models which generate (a generalization of) perceptual quality ratings monotonically related to the ratings given by a listener. Reliable differences between listeners are thus captured as differences in the listener models. A model of the perceptions of an "average listener" or an "expert's expert" can then be created and used to evaluate codec outputs. The models can also be interpreted to give information about the acoustic attributes upon which listeners base their judgements, hence can guide further development of an algorithm. 
In addition, application of this evaluation procedure in an evaluation experiment produced results concerning within-listener stability, cross-listener agreement or divergence, and the influence of the musical excerpt under test; these data will also be discussed.
Since I intend to organize this talk around the motivations that make ME pursue research in the text-to-speech (TTS) synthesis area, chances are it will sound more like an analysis (in the psychological meaning of the term) than like yet-another-technology-update-on-TTS-synthesis. After summarizing the current results and perspectives of the MBROLA project (speech synthesis, freely available for non-commercial applications, in as many languages and voices as possible; see http://tcts.fpms.ac.be/synthesis), I will comment on three aspects of speech synthesis research that I currently find very exciting:
- Unit-selection-based TTS synthesis, which I consider the last major innovation in the speech synthesis arena (comparable to the invention of HMMs for speech recognition). I will explain what the issues are, and what still needs to be solved to arrive at a synthesizer that can really pass the Turing test.
- Software engineering issues. In my forthcoming ICSLP paper, I claim that: 1. Future milestones in speech processing will come from labs with a strong commitment to solid, portable, and extensible code; 2. Speech scientists and software engineers will soon be the same people. I'll try to explain how this led me to "Plug'n Play programming" in C++ and to the EULER TTS synthesis system.
- Aids to persons with disabilities. For years, I have been asked by people from the real world how TTS synthesis could help them interact more easily with their surroundings. For years, I have had to answer that it was only a matter of time before TTS technology was fully integrated into aids for the disabled. It never really was the breakthrough I expected. So I have started working myself on a project aimed at providing freely available TTS synthesis with adequate interfaces for handicapped people.
We developed the Japanese Speech-Aware Multimedia (JAM) system, which controls a World Wide Web (WWW) browser using speech. This system allows the user to browse a linked page by reading the anchor text within a Web page. The user can also control the browser using speech. The system integrates new vocabulary each time a new Web page is read by extracting the anchor text, converting this text to phonetic string notation, creating a new speech recognition grammar and integrating this grammar with the system dynamically. During this anchor text-to-phone conversion process, we employ numerous exception-handling rules to accommodate counters, dates and many other phenomena. Preliminary tests show that the conversion results contain the correct phone sequence over 97% of the time. We also allowed limited English speech recognition since a large percentage of the Japanese Web pages include some English text in anchors. User tests showed that the prototype correctly understands the input speech 91.5% of the time, or 94.1% if we exclude user errors caused by unfamiliarity with the system, including erroneous readings or speech detection errors.
The presentation will summarise a novel approach to speech analysis and recognition research, and how from this philosophy a speech model was developed and tested as a simple, but effective, speech synthesiser. An algorithm for measuring the main frequency components of waveforms as a continuous process was found to be effective as a discriminator for both vowels and fricatives. An analyser developed from this algorithm and the synthesiser were put together as an experimental low bandwidth voice codec system. The synthesiser was also used as a tool for analysis by synthesis. The implications of the results presented will be discussed in the context of a practical and economical speaker-independent speech recognition system.
An important task in designing automatic speech recognition systems is the creation of probability density functions for hypothesized utterances. Hidden Markov Models (HMMs) are the most commonly used statistical model for this purpose, where two conditional independence relationships are assumed. While an HMM can theoretically model a given probability distribution to an arbitrarily high degree of accuracy by increasing the hidden state-space size, HMMs with a fixed hidden state-space size have not been shown to be sufficient for optimal speech recognition systems. Therefore, methods have been developed to extend the power of HMMs such as factorial HMMs, segmental HMMs, hybrid ANN/HMM systems, auto-regressive HMMs, etc. In this work, a new method is introduced that increases the modeling power of an HMM without increasing the number of states -- the method augments an HMM's statistical dependencies between observation vectors in a principled way. Using the natural statistical regularities measured from a corpus of data, an HMM's statistical dependencies are augmented to include only those dependencies to the surrounding observation context that 1) are not well modeled by the baseline HMM and 2) are found to increase discrimination. Discriminative conditional mutual information, the measure used to determine the additional dependencies, is introduced. When viewed as a Bayesian network, the above procedure can be viewed as a structure learning algorithm. The resulting model is called a buried Markov model (BMM) because the underlying Markov chain in an HMM is further hidden (buried) by specific cross-observation dependencies.
To test these models, Gaussian mixture HMMs are extended to incorporate the additional BMM dependencies, and new EM update equations for maximum likelihood parameter estimation are derived. The result has been tested on two isolated-word speech databases. The smaller database (Bellcore digits+) has shown an average 34% word error improvement over an HMM with the same number of states, and a 15% improvement over an HMM with a comparable number of parameters. In preliminary experiments on the large-vocabulary (PHONEBOOK) isolated-word speech database, using a speech recognition system with a single hidden state per monophone, BMMs are able to achieve an 11% improvement in WER with only a 9.5% increase in the number of parameters relative to the baseline HMMs.
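The dependency selection described above rests on a conditional mutual information measure. The sketch below computes only the ordinary conditional mutual information I(X; Y | Q) from discrete samples, as background; it is not the discriminative variant introduced in the talk, whose definition is not given here. The function name and the reading of X as a current feature, Y as a context feature and Q as the hidden state are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple

Sample = Tuple[Hashable, Hashable, Hashable]  # (x, y, q)


def conditional_mutual_information(samples: Sequence[Sample]) -> float:
    """Estimate I(X; Y | Q) in bits from (x, y, q) samples by counting."""
    n = len(samples)
    xyq = Counter(samples)
    xq = Counter((x, q) for x, _, q in samples)
    yq = Counter((y, q) for _, y, q in samples)
    q_only = Counter(q for _, _, q in samples)
    total = 0.0
    for (x, y, q), count in xyq.items():
        # p(x,y,q) * log2( p(q) p(x,y,q) / (p(x,q) p(y,q)) ); the 1/n factors cancel.
        ratio = count * q_only[q] / (xq[(x, q)] * yq[(y, q)])
        total += (count / n) * math.log2(ratio)
    return total


if __name__ == "__main__":
    dependent = [(0, 0, "a"), (1, 1, "a")]
    independent = [(0, 0, "a"), (0, 1, "a"), (1, 0, "a"), (1, 1, "a")]
    print(conditional_mutual_information(dependent))
    print(conditional_mutual_information(independent))
```

A context feature that is already independent of the current feature given the state scores zero, so adding that dependency would not help the model; a large value marks a dependency the baseline HMM leaves unmodeled.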
Speaker-dependent speech recognition systems outperform speaker-independent ones, due to the variability of acoustical properties among speakers. Speaker normalization techniques attempt to reduce this variability by modifying spectral representations of speech waveforms. In our work we evaluate the use of acoustic features that are key to speech perception in speaker normalization algorithms. We use frequency warping for speaker normalization and study the feasibility of using such features, specifically formant frequencies, to improve warping characteristics and recognition accuracy. Minimization of computational complexity vis-a-vis conventional algorithms is an added consideration.
We examine several feature sets, propose a class of warping functions and calibrate their performance against existing algorithms. We also study the generality and fundamental performance bounds of this class of algorithms.
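As a concrete example of frequency warping for speaker normalization, the sketch below implements a piecewise-linear warp of the kind often used for this purpose. It is not the class of warping functions proposed in the talk, which the abstract does not specify; the function name, the warp factor `alpha`, the 8 kHz upper band edge and the knee position are assumptions.

```python
def warp_frequency(
    freq_hz: float,
    alpha: float,
    nyquist_hz: float = 8000.0,    # assumed upper band edge
    knee_fraction: float = 0.85,   # assumed position of the break point
) -> float:
    """Piecewise-linear warp that scales low frequencies by `alpha`.

    Below the knee the frequency is multiplied by alpha; above it a second
    linear segment brings the warped axis back so that nyquist_hz maps to itself.
    """
    # Placing the knee this way keeps alpha * knee below nyquist_hz for any alpha > 0.
    knee = knee_fraction * nyquist_hz / max(alpha, 1.0)
    if freq_hz <= knee:
        return alpha * freq_hz
    slope = (nyquist_hz - alpha * knee) / (nyquist_hz - knee)
    return alpha * knee + slope * (freq_hz - knee)


if __name__ == "__main__":
    for f in (500.0, 1000.0, 4000.0, 8000.0):
        print(f, warp_frequency(f, alpha=1.1))
```

With `alpha` equal to 1.0 the function is the identity, and values slightly above or below 1.0 stretch or compress the spectrum of a given speaker while keeping the band edges fixed.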