AT&T Laboratories Cambridge

DART: Speech Recognition

Speech Recognition

An important aspect of the DART project is the use of Automatic Speech Recognition (ASR) techniques to transcribe an audio soundtrack automatically. In combination with Information Retrieval (IR) techniques, this facilitates indexing and retrieval of any document type that has an audio component. Typical applications may retrieve multimedia documents such as news broadcasts, video lecture series, video mail messages or audio-annotated still pictures. During basic system development we have concentrated on broadcast news retrieval, although we are now looking increasingly at the retrieval of audio-annotated still pictures.

The combination of speech recognition technology with information retrieval techniques was investigated in the VMR project, carried out here and at Cambridge University. This established the feasibility of audio soundtrack information retrieval in a restricted domain, and its phone lattice-based approach had the advantage of supporting unrestricted query terms. However, its computationally intensive nature affects scalability, and its accuracy is limited by the lack of linguistic constraints. Another project, at CMU, uses a large-vocabulary recogniser to produce a rough transcription of the soundtrack. Such a transcription can be searched faster and scales more easily, though the available search terms are limited to the vocabulary of the recogniser. This is the approach taken initially within the DART project, though with a much larger vocabulary of over 60,000 words, which partially alleviates the problem of restricted keywords. In the future we hope to investigate further improvements, perhaps combining large-vocabulary recognition with the phone-lattice approach.
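The trade-off of the word-transcription approach can be illustrated with a minimal sketch. It assumes each audio segment has already been transcribed to plain text and builds a simple inverted index over the words. The names `build_index` and `search`, the segment identifiers and the tiny vocabulary are all hypothetical and chosen for illustration; they are not taken from the DART system. The point shown is that a query term outside the recogniser vocabulary can never appear in a transcript, so it can only be reported, not matched.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each word to the set of segment ids whose transcript contains it."""
    index = defaultdict(set)
    for seg_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(seg_id)
    return index


def search(index, vocabulary, query):
    """Return (segments containing all in-vocabulary terms, out-of-vocabulary terms)."""
    terms = query.lower().split()
    # Terms the recogniser could never have produced cannot be searched for.
    oov = [t for t in terms if t not in vocabulary]
    in_vocab = [t for t in terms if t in vocabulary]
    if not in_vocab:
        return set(), oov
    hits = set.intersection(*(index.get(t, set()) for t in in_vocab))
    return hits, oov


# Hypothetical transcripts and recogniser vocabulary, for illustration only.
transcripts = {
    "seg1": "the prime minister visited cambridge",
    "seg2": "weather in cambridge today",
}
vocabulary = {w for text in transcripts.values() for w in text.split()}
index = build_index(transcripts)
hits, oov = search(index, vocabulary, "cambridge weather")
```

A phone-lattice search avoids the out-of-vocabulary case because it matches query pronunciations rather than words, at the computational cost described above.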

Recent years have seen improvements in the availability of speech resources and in techniques, particularly in acoustic modelling, decoding and adaptation. The resulting improvements in recognition performance on traditional low-noise read speech tasks have shifted the focus of research to more difficult domains such as the transcription of broadcast news. This domain introduces new problems, including unplanned spontaneous speech of variable bandwidth and quality, random background sound and music, and unpredictable lexical and syntactic content. The speech recognition side of the DART project aims to utilise recent advances to give accurate soundtrack transcription in a variety of domains without exceeding reasonable processing times. To achieve this, the core speech recognition is performed by state-of-the-art software developed by ECRL (Entropic Cambridge Research Laboratory) and, previously, by CUED (Cambridge University Engineering Department).

A typical transcription process for a British TV news show is as follows. First, the audio component of the MPEG episode is extracted, decoded and down-sampled. The waveform for the whole show is parameterised into a sequence of feature vectors, which is automatically segmented at speaker boundaries or other points of broad acoustic change. Each segment is classified in terms of, for example, speech presence, music presence, audio quality, language and gender, before being transcribed by the recogniser using appropriately constructed acoustic and language models. If accuracy takes precedence over processing time, the transcription process may be repeated in a second pass after the acoustic models have been adapted using the first-pass transcriptions. The whole process is integrated with the DART database: information about the show, its segments and their transcriptions is added automatically to the database for subsequent retrieval.
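The segmentation step above can be sketched as follows. The text does not say how points of broad acoustic change are detected, so this example substitutes a deliberately simple heuristic: compare the mean feature vector of the frames just before each position with the mean of the frames just after it, and place a boundary where that distance peaks above a threshold. The function names, the `window` and `threshold` parameters and the toy two-dimensional features are assumptions made for illustration, not the DART method.

```python
import math


def window_mean(frames):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def find_boundaries(features, window=3, threshold=1.0):
    """Return frame indices where the acoustic change score is a local peak.

    The score at frame t is the Euclidean distance between the mean of the
    `window` frames before t and the mean of the `window` frames from t on.
    """
    scores = {}
    for t in range(window, len(features) - window + 1):
        left = window_mean(features[t - window:t])
        right = window_mean(features[t:t + window])
        scores[t] = math.dist(left, right)

    boundaries = []
    for t, score in scores.items():
        neighbours = [scores[u] for u in range(t - window, t + window + 1)
                      if u in scores and u != t]
        # Keep only strict local peaks; a perfectly flat plateau yields none.
        if score > threshold and all(score > n for n in neighbours):
            boundaries.append(t)
    return boundaries


def split_segments(features, boundaries):
    """Cut the feature sequence into consecutive segments at the boundaries."""
    edges = [0] + boundaries + [len(features)]
    return [features[a:b] for a, b in zip(edges, edges[1:])]


# Toy input: six frames of one acoustic condition followed by six of another.
features = [[0.0, 0.0]] * 6 + [[5.0, 5.0]] * 6
boundaries = find_boundaries(features)
segments = split_segments(features, boundaries)
```

Each resulting segment would then be passed to the classification and recognition stages. A practical system would typically use a statistical change criterion rather than a fixed threshold on mean distance.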

This work is ongoing, with resources and software being developed for a variety of applications. In addition to the broadcast news retrieval work, a corpus is currently being recorded using an off-the-shelf digital MPEG camera, with a view to automatically transcribing the audio annotation of still pictures or home movies. Speech recognition may additionally be used to supplement text-based entry in the user interface of the retrieval engine.