Overview and Motivation

In surveillance and security it is a common goal to locate a subject of interest purely from a semantic description; think of an “offender description form” handed in to a law enforcement agency. These tasks are primarily undertaken by operators on the ground, either by manually searching a premises or by combing through hours of video footage, and the Australian Federal Police has identified this as a significant problem within law enforcement. To date, researchers have focused on person re-identification methodologies to solve this complex problem; however, in circumstances where pre-search subject enrollment images are not available, these techniques fail.

Semantic search is of primary interest as it does not require pre-search subject enrollment, instead searching video footage based on a textually supplied target query. The aim of this challenge is to address this problem through two tasks, each of which aims to locate a subject of interest based on a soft biometric signature.

The Challenges

The main aim of the two tasks in this challenge is to achieve the following outcome: subject localisation using soft biometric traits, without pre-search subject enrollment. Both tasks contain individualised data; however, data from one task may be used to augment the training data of the other (i.e. data from Task 1 can be used in some manner to increase performance in Task 2). One example of this is the creation of comparative labels, where data in Task 1 is used to create labels that are then used for searching in Task 2.

In both tasks, each subject is labelled with a set of soft biometric traits (a different signature is used in each task). Participants may use either the complete list of supplied soft biometrics or any subset of them, depending on the technique they feel best solves the problem.

A general breakdown of the two tasks is as follows:

  1. Person retrieval based on a semantic query, where a ranked output is produced between the query and a gallery of images. This is analogous to a person re-identification task in which probe images are replaced with textual queries.
  2. Subject localisation and retrieval, where a person matching a soft biometric description must be accurately localised in a video clip.

Task 1 – Image Ranking

This task can be considered analogous to re-identification without the initial probe image, in this case using a textual query as the input instead. The training database consists of 520 subject images cropped to varying sizes, with the directory structure listed below. Upon release, the testing data will have a similar structure; however, no parsed information will be supplied.


  [Task 1]
    [Train Data]
      [Originals]      e.g. P_IP70_10086.png
      [Binary Maps]    e.g. A_001_01_001.png
      Subject_MultipleColours.xml

For each RGB image there is a corresponding parsed image, in which each region of interest on the subject is labelled. These regions are: hair, legs, luggage, shoes, arm skin, facial skin, leg skin, and torso. The colour code associated with each region is described in the Readme document contained in the root directory.


Along with these parsed images, binary mask images have been included and can be used as desired. Each binary image is a 3-channel image in which all channels are equivalent. Again, these masks cover hair, legs, luggage, shoes, arm skin, facial skin, leg skin, and torso, with a single example of each provided.
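Since all three channels of each binary map are equivalent, a single channel can be collapsed to a boolean mask before use. The sketch below assumes the mask has been loaded as an H×W×3 uint8 array with foreground values of 255; the function name and threshold are illustrative, not part of the supplied tooling.

```python
import numpy as np

def to_single_channel(mask_rgb: np.ndarray) -> np.ndarray:
    """Collapse a 3-channel binary map (all channels equal) to a boolean mask.

    `mask_rgb` is assumed to be an HxWx3 uint8 array loaded from one of the
    supplied binary images, with identical channels and values of 0 or 255.
    """
    # All channels are stated to be equivalent, so any single one suffices.
    return mask_rgb[..., 0] > 127

# Tiny synthetic example (a real mask would be loaded from a PNG file):
mask = np.zeros((4, 4, 3), dtype=np.uint8)
mask[1:3, 1:3, :] = 255          # a 2x2 foreground region
binary = to_single_channel(mask)  # 4x4 boolean array with 4 True pixels
```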


To create the textual query for search, each subject is annotated with 11 potential soft biometric traits: gender, the type of clothing worn, the colour(s) of that clothing, and its pattern. The numerical values assigned to each trait are explained in the XML documentation; in all cases, a value of -1 indicates that the trait was either too ambiguous to annotate or not present.


The primary goal of this task is to locate the desired subject in the gallery of test images based solely on their soft biometric traits (all of them, or a subset). To achieve this, the annotation of one subject is taken as the query, and the developed technique uses this query to compare against the entire corpus. A ranked output is then required, and the rank at which the annotated subject is returned forms the score. This process is repeated for every unique soft biometric signature in the dataset.
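The ranking protocol above can be sketched as follows. The similarity measure here (a count of matching trait codes, skipping -1 entries in the query) is a deliberately simple placeholder standing in for a participant's own technique; the function name and vector encoding are assumptions, not part of the challenge kit.

```python
import numpy as np

def rank_of_subject(query: np.ndarray, gallery: np.ndarray, target_idx: int) -> int:
    """Return the 1-based rank of `target_idx` when gallery signatures are
    ordered by similarity to `query`.

    Signatures are vectors of categorical trait codes; -1 entries in the
    query (ambiguous / absent traits) are ignored. The similarity used is a
    simple count of matching traits -- a placeholder scoring function only.
    """
    valid = query != -1
    scores = (gallery[:, valid] == query[valid]).sum(axis=1)
    # Higher score = better match; a stable sort keeps ties deterministic.
    order = np.argsort(-scores, kind="stable")
    return int(np.where(order == target_idx)[0][0]) + 1

# Toy example: 3 gallery signatures, query annotated from gallery subject 2.
gallery = np.array([[1, 0, 3, -1],
                    [2, 1, 3,  0],
                    [2, 1, 2,  0]])
query = np.array([2, 1, 2, -1])
rank = rank_of_subject(query, gallery, target_idx=2)  # perfect match: rank 1
```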

The yet to be released test set will include approximately 150 subjects annotated in the same manner as the training set. Only the RGB images will be provided and no subject parsing will be included.

More to follow.

Task 2 – Surveillance Imagery Search

This task more closely represents a wide-ranging surveillance problem. A soft biometric signature is supplied, and the surveillance imagery is then searched for a target matching that description. In this task the footage can contain ambiguous matches to the query (where only heavy use of the signature can supply a direct match), with varying levels of crowd density and crowd flow.

The supplied training set contains 110 subjects with sequence lengths varying between 21 and 290 frames. Note that the subject does not appear in every frame, and that for each subject a minimum of 30 frames is supplied for background model initialisation (or other pre-search requirements). This training database is based on the work of Halstead et al. [2], modified primarily by simplifying the directory structure. For each sequence, the captured frame size is 704×576 pixels. The training set is structured as shown below:

  [Task 2]
    [Train Data]
      [training_subject_000]
        training_subject_000_im_0000.png
        training_subject_000_im_0001.png
        ...
        training_subject_000_im_0164.png
        training_subject_000.xml
      [training_subject_001]
      ...
      [training_subject_109]
      QUTSoftBioSearch.xml
    [Calibration]
      [Tsai Camera]
    [Extra Video Snippets]

Along with the video snippets, two extra calibration directories have been included. The first contains videos of each of the cameras, which can be used to build per-camera models as required. The second contains Tsai camera calibration [1] information for each of the cameras, included to assist in obtaining real-world co-ordinates from image co-ordinates.

The sequences are recorded from one of six cameras located within a security network of a university campus. Subjects are also recorded over various days and time periods, creating variability in lighting conditions, crowd flow, and crowd density. For each subject, once the full signature is considered, every attempt is made to ensure the target subject is distinct from other subjects in the footage, creating a unique match to the query.


The supplementary data contains a collection of image patches from the same camera network that can be used to build colour models (if required). Texture snippets are similarly included in the auxiliary data; however, these are captured from outside this camera network.

For each subject (in a sequence) an XML document is included containing the full annotation for that subject. In total, 16 soft biometric traits are annotated, along with nine body markers used for subject localisation during evaluation. The soft biometric traits include torso and leg clothing colours, clothing texture, and clothing type, along with age, gender, shoe colour, height, and build (the latter two calculated using Tsai camera calibration). Build is an aspect ratio calculated from the height and the two most extreme real-world co-ordinates of the shoulders and waist.
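The build calculation above can be sketched as a ratio of the widest real-world extent across the shoulder and waist markers to the subject's height. The exact formula used by the organisers is not given, so the function below is an assumption consistent with the description; all names and units (metres, ground-plane (x, y) co-ordinates from Tsai calibration) are illustrative.

```python
def compute_build(height_m, shoulder_pts, waist_pts):
    """Aspect-ratio style 'build' estimate, as described for the annotations.

    Assumed formula (the organisers' exact one is not published here):
    build = (widest real-world extent across the shoulder and waist markers)
            / subject height. Points are (x, y) pairs in metres.
    """
    xs = [p[0] for p in shoulder_pts + waist_pts]
    width = max(xs) - min(xs)          # two most extreme co-ordinates
    return width / height_m

# Toy example: a 1.80 m subject with shoulders wider than the waist.
build = compute_build(1.80,
                      shoulder_pts=[(-0.25, 0.0), (0.24, 0.0)],
                      waist_pts=[(-0.18, 0.0), (0.17, 0.0)])
```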

Each of these annotations is supplied in the XML documentation, where a value of -1 identifies a trait that was too difficult to annotate or does not exist on the subject (e.g. a second colour).


The nine body marker key points are: the top of the head, the left and right neck, the left and right shoulder, the left and right waist, and the approximate toe positions of the left and right feet. Note that the neck, shoulder, and waist positions are based on the two most extreme locations at those positions, and not necessarily the shoulder points or hip bones.


These body markers are used to construct the bounding boxes for evaluation: the ‘y’ position of the head and the lowest ‘y’ position of the feet give the height of the bounding box, and the two most extreme pixel locations among the other markers give the width. As with the soft biometrics, a body marker that cannot be annotated is labelled with -1.
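The box construction above can be sketched as follows. The marker names and dictionary layout are illustrative (the XML uses its own naming), and this sketch reads "the other locations" as the neck, shoulder, and waist markers, with unannotated markers stored as None.

```python
def markers_to_bbox(markers):
    """Build the evaluation bounding box from the nine body markers.

    `markers` maps marker names to (x, y) pixel positions, with None for any
    marker labelled -1. Per the description: the head y and the lowest foot y
    give the box height; the extreme x positions of the neck, shoulder, and
    waist markers give the width. Returns (x0, y0, x1, y1).
    """
    y0 = markers["head"][1]
    feet = [markers[k] for k in ("left_foot", "right_foot") if markers[k]]
    y1 = max(p[1] for p in feet)       # image y grows downwards
    side_keys = ("left_neck", "right_neck", "left_shoulder",
                 "right_shoulder", "left_waist", "right_waist")
    xs = [markers[k][0] for k in side_keys if markers[k]]
    return (min(xs), y0, max(xs), y1)

# Toy example with all nine markers annotated:
m = {"head": (50, 10), "left_neck": (45, 25), "right_neck": (55, 25),
     "left_shoulder": (40, 30), "right_shoulder": (60, 30),
     "left_waist": (43, 60), "right_waist": (57, 60),
     "left_foot": (46, 118), "right_foot": (54, 120)}
bbox = markers_to_bbox(m)  # widest points are the shoulders
```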

In each sequence, a minimum of 30 frames is set aside for initialisation requirements. In some instances the subject appears before this threshold; although they may be annotated in these frames, they should be treated as absent, as only frames after this threshold will be evaluated in the test set.

The testing set will be annotated in a similar manner to the training set and will include 40 unseen subjects. It will aim to include varying levels of complexity, including subject ambiguity, crowd flow and density changes, and occlusion.

As previously stated, evaluation depends on the bounding box output of the participant’s technique. These outputs will be compared to boxes generated from the annotation, constructed from the head and feet markers and the extremities of the other markers.

Evaluation will utilise an intersection-over-union (IOU) based approach, where S is the score calculated at each annotated frame, D is the participant detection, and GT is the ground truth box: S = area(D ∩ GT) / area(D ∪ GT). As previously stated, the ground truth bounding box is constructed from the top of the head (y0), the lowest of the feet (y1), and the two most extreme locations of the other annotations (x0, x1).


The score is then averaged over the entire sequence (where annotation exists) to provide an average IOU score. Finally, to evaluate overall performance, the per-sequence average IOU scores are themselves averaged to gauge the participant’s overall system performance. The approach proposed by Denman et al. [3] will be considered the baseline for this task.
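The scoring procedure can be sketched as below: a standard per-frame IOU, averaged over the annotated frames of each sequence, then averaged across sequences. Boxes are assumed to be (x0, y0, x1, y1) tuples; how unannotated frames are excluded is an assumption here (both lists simply omit them).

```python
def iou(d, gt):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(d[0], gt[0]), max(d[1], gt[1])
    ix1, iy1 = min(d[2], gt[2]), min(d[3], gt[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(d) + area(gt) - inter
    return inter / union if union else 0.0

def sequence_score(detections, ground_truths):
    """Average IOU over the annotated frames of one sequence.

    `detections` and `ground_truths` are parallel lists of boxes; frames
    without annotation are assumed to be omitted from both lists.
    """
    scores = [iou(d, gt) for d, gt in zip(detections, ground_truths)]
    return sum(scores) / len(scores)

def overall_score(per_sequence_scores):
    """Overall performance: mean of the per-sequence average IOU scores."""
    return sum(per_sequence_scores) / len(per_sequence_scores)
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` is 1/7: the boxes overlap in a 1×1 region, while their union covers 7 units of area.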

The yet-to-be-released testing data will contain approximately 40 new, annotated sequences. These annotations will follow those used in the training set, with both soft biometric traits and body markers included.

More on this to follow.


  1. R. Y. Tsai, “An efficient and accurate camera calibration technique for 3D machine vision,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 364-374, 1986.
  2. M. Halstead, S. Denman, S. Sridharan, C. Fookes, “Locating people in video from semantic descriptions: A new database and approach,” International Conference on Pattern Recognition, pp. 4501-4506, 2014.
  3. S. Denman, M. Halstead, C. Fookes, S. Sridharan, “Searching for people using semantic soft biometric descriptions,” Pattern Recognition Letters, 68 (Part 2), pp. 306-315, 2015.