AT&T Laboratories
Powerful and easy-to-use text retrieval systems have become pervasive and are one of the major driving forces behind the internet. Because so many people are familiar with using simple keyword strings to retrieve documents from vast online collections, it seems natural to extend language-based querying to multimedia data. Content-based image retrieval (CBIR) on the basis of short query sentences is likely to prove more efficient and intuitive than alternative query composition schemes, such as iterative search-by-example and user sketches, which most current systems employ.
However, the comparatively few query languages designed for CBIR have largely failed to attain the standards necessary for general adoption. A major reason is that most language- or text-based image retrieval systems rely on manual annotations, captions, document context, or pre-generated keywords, so flexibility is lost as soon as the annotation and indexing scheme is chosen. Formal query languages such as extensions of SQL are limited in their expressive power and extensibility, and they require a certain level of user experience and sophistication.
To address these issues while keeping user search overheads to a minimum, we have developed the oquel query description language. It provides an extensible language framework based on a context-free grammar and a base vocabulary. Words in the language represent predicates on image features and target content at different semantic levels, and they serve as nouns, adjectives, and prepositions. Sentences are prescriptions of the characteristics that relevant retrieved images should have. They can represent spatial, object compositional, and more abstract relationships between terms and sub-sentences.
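The idea of words acting as predicates can be illustrated with a minimal Python sketch. It is not the oquel implementation: the `Region` fields, the thresholds, and the helper names (`noun`, `bright`, `in_bottom_half`, `larger_than`, `phrase`, `satisfies`) are all assumptions made for illustration. Each word is a function that tests one image region, and a phrase is the conjunction of its words.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Region:
    """Hypothetical summary of one segmented image region."""
    label: str         # recognised visual category, e.g. "tarmac"
    brightness: float  # 0.0 (dark) to 1.0 (bright)
    centre_y: float    # 0.0 (top of image) to 1.0 (bottom of image)
    area: float        # fraction of the image covered by the region


Predicate = Callable[[Region], bool]


def noun(label: str) -> Predicate:
    """A noun tests the region's category label."""
    return lambda region: region.label == label


def bright(region: Region) -> bool:
    """An adjective tests a region property (threshold is an assumption)."""
    return region.brightness > 0.7


def in_bottom_half(region: Region) -> bool:
    """A prepositional phrase tests absolute location."""
    return region.centre_y > 0.5


def larger_than(fraction: float) -> Predicate:
    """A size constraint, expressed as a fraction of the image area."""
    return lambda region: region.area > fraction


def phrase(*predicates: Predicate) -> Predicate:
    """A phrase holds for a region when every one of its words holds."""
    return lambda region: all(p(region) for p in predicates)


def satisfies(regions: Iterable[Region], requirement: Predicate) -> bool:
    """An image satisfies a phrase if at least one region does."""
    return any(requirement(r) for r in regions)


road_scene = [Region("sky", 0.9, 0.2, 0.40), Region("tarmac", 0.3, 0.8, 0.35)]
beach_scene = [Region("sky", 0.9, 0.2, 0.50), Region("sand", 0.8, 0.75, 0.50)]

tarmac_below = phrase(noun("tarmac"), in_bottom_half, larger_than(0.10))
print(satisfies(road_scene, tarmac_below))
print(satisfies(beach_scene, tarmac_below))
```

Treating every word as a function of the same type keeps composition uniform: adding a new adjective or preposition requires no change to `phrase` or `satisfies`.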
The language is portable to other image content representation systems in that the lower-level words and the evaluation functions which act on them can be changed or re-implemented with little or no impact on the conceptually higher language elements. It is also extensible, since new terms can be defined both from existing constructs and from new sources of image knowledge and metadata. This allows the definition of customised ontologies of objects and abstract relations. The process of assessing image relevance can be made dynamic: the way in which elements of a query are evaluated depends on the query as a whole (information flows both up and down) and on any domain-specific information about the ontological makeup of the query that is available at the time it is processed.
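The separation between low-level words and the terms built on top of them can be sketched as follows. This is an illustrative design under our own assumptions, not a description of how oquel is implemented: the `Vocabulary` class, its methods, and the representation of an image as a mapping from labels to area fractions are all hypothetical. The point is that a derived term looks up its parts by name when it is evaluated, so a low-level word can be re-implemented without touching the terms that depend on it.

```python
from typing import Callable, Dict

# Hypothetical image representation: category label -> fraction of image area.
Image = Dict[str, float]
Evaluator = Callable[[Image], float]


class Vocabulary:
    """Maps words to evaluation functions that return a score in [0, 1]."""

    def __init__(self) -> None:
        self._terms: Dict[str, Evaluator] = {}

    def define(self, word: str, evaluator: Evaluator) -> None:
        """Define or re-implement a word with a new evaluation function."""
        self._terms[word] = evaluator

    def define_any_of(self, word: str, *parts: str) -> None:
        """Define a higher-level word as the best score among existing words."""
        if not parts:
            raise ValueError("a derived term needs at least one part")
        # Parts are resolved by name at evaluation time, so re-implementing
        # a part later is picked up automatically.
        self._terms[word] = lambda image: max(
            self.evaluate(part, image) for part in parts
        )

    def evaluate(self, word: str, image: Image) -> float:
        return self._terms[word](image)


vocab = Vocabulary()
vocab.define("grass", lambda image: image.get("grass", 0.0))
vocab.define("trees", lambda image: image.get("trees", 0.0))
vocab.define_any_of("vegetation", "grass", "trees")

meadow = {"grass": 0.6, "sky": 0.4}
before = vocab.evaluate("vegetation", meadow)

# Swap in a different low-level evaluator; "vegetation" is left untouched.
vocab.define("grass", lambda image: 1.0 if image.get("grass", 0.0) > 0.5 else 0.0)
after = vocab.evaluate("vegetation", meadow)

print(before, after)
```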
The primary aim in designing oquel has been to provide both ordinary users and professional image archivists with an intuitive and highly versatile means of expressing their retrieval requirements through the use of familiar natural language words and a straightforward syntax. Ongoing work seeks to extend the language core to provide more advanced programmatic constructs offering capabilities familiar from database query languages and to enable autonomous learning of new concepts.
Oquel queries (sentences) are prescriptive rather than descriptive: the focus is on making it easy to formulate desired image characteristics as concisely as possible. It is therefore neither necessary nor desirable to provide an exhaustive description of the visual features and semantic content of particular images. Instead, a query represents only as much information as is required to discriminate relevant from non-relevant images.
In order to allow users to enter both simple keyword phrases and arbitrarily complex compound queries, the language grammar features constructs such as predicates, relations, conjunctions, and a specification syntax for image content. The specification syntax includes adjectives for image region properties (i.e. shape, colour, and texture) and both relative and absolute object location. Desired image content can be denoted by nouns such as labels for automatically recognised visual categories of stuff ("grass", "cloth", "sky", etc.) and through the use of derived higher-level terms for composite objects and scene description (e.g. "animals", "vegetation", "winter scene"). Nouns distinguish between singular and plural, hence "people" will be evaluated differently from "person".
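How a singular and a plural noun might be evaluated differently can be shown with a small Python sketch. The text does not say how the distinction is made, so the counting rule used here (at least one matching region for the singular, at least two for the plural) is purely an assumption, as are the function names and the representation of an image as a list of region labels.

```python
from collections import Counter
from typing import List


def count_label(region_labels: List[str], label: str) -> int:
    """Number of regions in the image that carry the given label."""
    return Counter(region_labels)[label]


def matches_person(region_labels: List[str]) -> bool:
    """Singular noun: assumed to require at least one matching region."""
    return count_label(region_labels, "person") >= 1


def matches_people(region_labels: List[str]) -> bool:
    """Plural noun: assumed to require at least two matching regions."""
    return count_label(region_labels, "person") >= 2


portrait = ["person", "sky"]
crowd = ["person", "person", "person", "grass"]

print(matches_person(portrait), matches_people(portrait))
print(matches_person(crowd), matches_people(crowd))
```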
Oquel has been implemented in our ICON (image content organisation and navigation) system. A versatile query parser allows oquel sentences to be entered or refined by the user. Users may also manipulate the syntax tree representation directly using a graphical tool. This ensures that sentences are constructed according to the rules of the grammar without requiring the user to understand the full language specification. In addition, ICON provides a forms-based interface which enables users to specify complex queries consisting of visual constraints and content specifications without requiring any knowledge of the underlying language syntax.
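To give a feel for what turning a query sentence into a syntax tree involves, the following Python sketch parses queries of the bracketed form used in the screenshot example on this page. It is not the ICON parser, and the toy grammar is our own assumption rather than the oquel grammar: a query is one or more bracketed groups joined by "and", and each group holds comma-separated phrases that are kept as plain strings.

```python
import re
from typing import List, Optional, Tuple

# Toy grammar (an assumption, not the oquel grammar):
#   query  := group ("and" group)*
#   group  := "[" phrase ("," phrase)* "]"
#   phrase := one or more words, kept together as a single string
TOKEN = re.compile(r"\[|\]|,|[^\s\[\],]+")

Group = Tuple[str, List[str]]
Tree = Tuple[str, List[Group]]


def tokenize(text: str) -> List[str]:
    """Split a query into brackets, commas, and words."""
    return TOKEN.findall(text)


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected: Optional[str] = None) -> str:
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        self.pos += 1
        return token

    def query(self) -> Tree:
        groups = [self.group()]
        while self.peek() == "and":
            self.take("and")
            groups.append(self.group())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return ("and", groups)

    def group(self) -> Group:
        self.take("[")
        phrases = [self.phrase()]
        while self.peek() == ",":
            self.take(",")
            phrases.append(self.phrase())
        self.take("]")
        return ("group", phrases)

    def phrase(self) -> str:
        words = []
        while self.peek() not in (None, "[", "]", ","):
            words.append(self.take())
        if not words:
            raise SyntaxError("empty phrase")
        return " ".join(words)


def parse(text: str) -> Tree:
    return Parser(tokenize(text)).query()


tree = parse("[bright red and stripy] and [tarmac in bottom half, size >10%]")
print(tree)
```

Because malformed input raises `SyntaxError` rather than yielding a partial tree, an editing tool built on such a parser can reject an incomplete sentence before it is evaluated.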
The screenshot below shows search results for the oquel text query "[bright red and stripy] and [tarmac in bottom half, size >10%]":
© AT&T Laboratories Cambridge, 2001