
AT&T Laboratories
Cambridge


DART: Video Parsing


With the emergence of new digital video transmission and storage technologies (including video-on-demand services, multimedia-capable computer networks and digital cameras), massive archives of video data will soon be available to home, education and business users for instant recall.

One of the challenges of this new technology is the development of software to organize and index all this footage, so that video clips relevant to the users' needs can be retrieved efficiently.

An effective video indexing system must be able to index new documents automatically and in reasonable time, and scale to support terabytes of video information. It must support a wide range of queries based on program information, audio and speech content, image content, action and camera motion. Finally, it must allow users to retrieve and browse individual frames, shots, and entire programs; and to navigate between related programs and multimedia documents.

Video parsing is just one component of DART; it will combine with text, audio and still image indexing to form an integrated multimedia retrieval system.

Tools for video analysis and indexing

We are developing tools to extract meaningful features of digital video, including:

Scene and shot segmentation

[Illustrations: cut, dissolve and transient transitions.]

Most video programs are created by editing together a series of shots. The edits may consist of abrupt cuts or more subtle transitions such as dissolves and fades. These edits must be detected so as to break down the video into chunks suitable for presentation, and to reveal its logical structure.

We currently use simple image statistics and a multi-timescale comparison algorithm to detect and distinguish between cuts and dissolves, and to reject both transients (such as flashbulbs) and continuous motion. For greater efficiency, we plan to use a hybrid approach combining compressed-domain and image-domain statistics.
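
As an illustration only (not the DART algorithm; the histogram comparison, thresholds and 10-frame lag are placeholder choices), the sketch below compares frames at two timescales: a sharp difference at lag 1 together with a sustained difference at the longer lag signals a cut; a change visible only at the longer lag suggests a dissolve; and a lag-1 spike that vanishes at the longer lag is rejected as a transient such as a flashbulb.

    import numpy as np

    def hist_distance(f1, f2, bins=64):
        """L1 distance between normalised grey-level histograms of two frames."""
        h1 = np.histogram(f1, bins=bins, range=(0, 256))[0] / f1.size
        h2 = np.histogram(f2, bins=bins, range=(0, 256))[0] / f2.size
        return np.abs(h1 - h2).sum()

    def classify_boundaries(frames, short_thresh=0.8, long_thresh=0.8, lag=10):
        """Compare each frame with its predecessor (short timescale) and
        with the frame `lag` steps back (long timescale)."""
        events = []
        for i in range(lag, len(frames)):
            d_short = hist_distance(frames[i - 1], frames[i])
            d_long = hist_distance(frames[i - lag], frames[i])
            if d_short > short_thresh and d_long > long_thresh:
                events.append((i, 'cut'))
            elif d_short > short_thresh:
                events.append((i, 'transient'))   # e.g. a flashbulb: content returns
            elif d_long > long_thresh:
                events.append((i, 'dissolve'))    # gradual change, no sharp frame
        # In practice adjacent events would be merged, and further tests
        # would separate true dissolves from continuous camera/object motion.
        return events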

Shots are generally grouped together into scenes. This grouping can be detected by looking for repetition amongst nearby shots, and using other cues such as soundtrack continuity.
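
A minimal sketch of the repetition test, assuming one normalised colour-histogram descriptor per shot (the window size and similarity threshold are illustrative, and soundtrack cues are omitted):

    import numpy as np

    def group_shots_into_scenes(shot_hists, window=6, sim_thresh=0.3):
        """Link shot j back to an earlier shot i when their histograms
        match: everything between them (e.g. a shot/reverse-shot
        dialogue) is pulled into one scene. Union-find keeps the links."""
        n = len(shot_hists)
        parent = list(range(n))

        def find(x):                        # root of x's scene
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for i in range(n):
            for j in range(i + 2, min(i + window, n)):
                if np.abs(shot_hists[i] - shot_hists[j]).sum() < sim_thresh:
                    for k in range(i + 1, j + 1):   # link the whole span to i
                        parent[find(k)] = find(i)
        return [find(k) for k in range(n)]          # scene label per shot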

Camera motion recovery and motion segmentation

An important cue for the retrieval of video is the motion of the camera, with 'zoom' effects being particularly notable. In general we can determine this motion by tracking features or regions moving across the image, and fitting pan, rotation and zoom parameters to their apparent motion. In the compressed domain this can be performed very efficiently, using robust statistics based on the MPEG block vectors.
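
The sketch below shows one way this fit can work: a four-parameter pan/zoom/rotate model is fitted to block positions and motion vectors by least squares, then refitted after trimming the worst residuals as a simple stand-in for the robust statistics mentioned above (the model, trimming fraction and iteration count are placeholder choices, not the DART algorithm):

    import numpy as np

    def fit_camera_motion(pos, vec, iters=5, trim=0.2):
        """Fit u = tx + z*x - r*y, v = ty + r*x + z*y to block motion
        vectors: (tx, ty) is pan, z the zoom rate, r the rotation rate.
        Refitting after trimming the largest residuals (likely moving
        objects) gives a crude robustness."""
        keep = np.ones(len(pos), dtype=bool)
        for _ in range(iters):
            x, y = pos[keep, 0], pos[keep, 1]
            u, v = vec[keep, 0], vec[keep, 1]
            ones, zeros = np.ones_like(x), np.zeros_like(x)
            A = np.vstack([np.column_stack([ones, zeros, x, -y]),    # u rows
                           np.column_stack([zeros, ones, y,  x])])   # v rows
            b = np.concatenate([u, v])
            (tx, ty, z, r), *_ = np.linalg.lstsq(A, b, rcond=None)
            # Residuals over all blocks; drop the worst fraction and refit.
            pu = tx + z * pos[:, 0] - r * pos[:, 1]
            pv = ty + r * pos[:, 0] + z * pos[:, 1]
            res = np.hypot(vec[:, 0] - pu, vec[:, 1] - pv)
            keep = res <= np.quantile(res, 1.0 - trim)
        return (tx, ty, z, r), res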

Many scenes contain several moving objects, and visual tracking can be used to segment the moving objects and recover their individual motions.
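
Carrying on from the camera-motion sketch, blocks whose vectors disagree with the fitted model can be flagged and greedily clustered by the similarity of their own motion, one cluster per moving object; a rough illustration, with thresholds (in pixels) chosen arbitrarily:

    import numpy as np

    def segment_moving_blocks(vec, res, res_thresh=2.0, vec_thresh=1.0):
        """Blocks whose residual against the camera model exceeds
        res_thresh are independently moving; cluster them greedily by
        the similarity of their own motion vectors."""
        clusters = []
        for i in np.flatnonzero(res > res_thresh):
            for c in clusters:
                if np.hypot(*(vec[i] - vec[c[0]])) < vec_thresh:
                    c.append(i)
                    break
            else:
                clusters.append([i])
        return clusters   # each cluster holds the block indices of one object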

Extraction and analysis of key frames

With current technology, detailed image analysis algorithms are too slow to be applied to every frame of a video sequence. Thus a relatively small number of 'key frames' must be extracted from each shot for further processing (key frames can also be used as a storyboard to give users an overview of the video content).

The choice of the number and position of key frames is of great significance to the speed and accuracy of a video indexing system; we are investigating algorithms for keyframing based on shot break detection, motion estimation and other video stream statistics.
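
One simple keyframing rule in sketch form (the threshold is illustrative, and `shot_bounds` would come from the segmentation step above): keep the first frame of each shot, then add another key frame whenever the histogram change accumulated since the last one crosses a threshold, so static shots yield a single key frame and active shots several.

    import numpy as np

    def select_keyframes(frames, shot_bounds, change_thresh=0.5, bins=64):
        """shot_bounds: list of [start, end) frame index pairs.
        The first frame of a shot is always a key frame; another is
        added whenever the histogram change since the last key frame
        exceeds change_thresh."""
        def hist(f):
            return np.histogram(f, bins=bins, range=(0, 256))[0] / f.size

        keyframes = []
        for start, end in shot_bounds:
            last = hist(frames[start])
            keyframes.append(start)
            for i in range(start + 1, end):
                h = hist(frames[i])
                if np.abs(h - last).sum() > change_thresh:
                    keyframes.append(i)
                    last = h
        return keyframes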

In the DART system, keyframes will be analysed in a similar manner to still images: they are divided into regions of similar colour and texture (corresponding roughly to objects in the scene), and the shape, size and motion of each region is described, together with colour distributions of the regions and of the entire image. These descriptions may be used as the basis for a high-level language for describing scene content.
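
A much-simplified sketch of the region description step, using coarse colour quantisation in place of the colour/texture segmentation described above (the number of levels and the minimum region area are placeholder values):

    import numpy as np
    from scipy import ndimage

    def describe_regions(rgb, levels=4, min_area=100):
        """Quantise each colour channel to `levels` values, treat each
        connected patch of uniform quantised colour as a region, and
        describe its size, position and mean colour."""
        quant = (rgb // (256 // levels)).astype(np.int32)
        cells = (quant[..., 0] * levels + quant[..., 1]) * levels + quant[..., 2]
        regions = []
        for colour in np.unique(cells):
            labels, n = ndimage.label(cells == colour)
            for lab in range(1, n + 1):
                ys, xs = np.nonzero(labels == lab)
                if len(ys) < min_area:       # discard tiny fragments
                    continue
                regions.append({
                    'area': len(ys),
                    'centroid': (float(xs.mean()), float(ys.mean())),
                    'bbox': (int(xs.min()), int(ys.min()),
                             int(xs.max()), int(ys.max())),
                    'mean_colour': rgb[ys, xs].mean(axis=0),
                })
        return regions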

Caption detection and OCR

Many video documents feature captions, which are a useful source of keywords by which relevant passages may be retrieved. A caption will typically give, for example, the name of the person speaking.

Conventional OCR systems require well-segmented character images (usually black and white) and fare poorly on images from video: MPEG encoding blurs letters together, and no single global threshold will resolve them reliably. We are experimenting with novel filtering and adaptive thresholding algorithms to detect and preprocess captions. The frequent spelling errors in OCR output also demand new approaches to word indexing and retrieval.
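
As one illustration of adaptive thresholding (not the DART filter), each pixel can be binarised against the mean of its local neighbourhood instead of a single global level; the window size and offset are placeholder values, and light text on a darker background is assumed:

    import numpy as np
    from scipy import ndimage

    def adaptive_threshold(grey, win=15, offset=10):
        """A pixel counts as 'text' if it is clearly brighter than the
        mean of its win x win neighbourhood, so the threshold tracks
        the local background rather than being fixed globally."""
        local_mean = ndimage.uniform_filter(grey.astype(np.float32), size=win)
        return grey > local_mean + offset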

Where the fonts are known in advance, we can decode captions reliably using a correlation-based approach. Letter hypotheses are generated by template matching, and hypotheses lying along a common baseline are selected using dynamic programming to yield the most likely string of characters. The results have been used successfully to index TV news stories.
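
A sketch of the selection step, assuming each letter hypothesis is a tuple of baseline x position, width, character and match score (this representation is an assumption for illustration):

    def best_caption_string(hyps):
        """hyps: list of (x, width, char, score) letter hypotheses on a
        common baseline. Dynamic programming picks the highest-scoring
        sequence of non-overlapping hypotheses, read left to right."""
        if not hyps:
            return ''
        hyps = sorted(hyps, key=lambda h: h[0])
        n = len(hyps)
        best = [h[3] for h in hyps]       # best total score ending at i
        prev = [-1] * n
        for i in range(n):
            for j in range(i):
                # hypothesis j must end before hypothesis i begins
                if hyps[j][0] + hyps[j][1] <= hyps[i][0] and \
                        best[j] + hyps[i][3] > best[i]:
                    best[i] = best[j] + hyps[i][3]
                    prev[i] = j
        i = max(range(n), key=best.__getitem__)   # best-scoring endpoint
        chars = []
        while i != -1:                            # trace the chosen chain back
            chars.append(hyps[i][2])
            i = prev[i]
        return ''.join(reversed(chars))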

For comments, suggestions and further information please contact us.
Copyright © 2001 AT&T Laboratories Cambridge