Video recording is now ubiquitous in the study of animal behavior, but its analysis on a large scale is limited by the time and resources needed to manually process large volumes of data. We present a deep convolutional neural network (CNN) approach that p…
Video data have become indispensable in the retrospective analysis and monitoring of wild animal species' presence, abundance, distribution, and behavior (1, 2). The accumulation of decades' worth of large video databases and archives has immense potential for answering biological questions that require longitudinal data (3). However, exploiting video data is currently severely limited by the amount of human effort required to manually process it, as well as the training and expertise necessary to accurately code such information. Citizen science platforms have allowed large-scale processing of databases such as camera trap images (4); however, ad hoc volunteer coders working independently typically only tag at the species level and cannot solve tasks such as recognizing individual identities. Here, we provide a fully automated computational approach to data collection from animals using the latest advances in artificial intelligence to detect, track, and recognize individual chimpanzees (Pan troglodytes verus) from a longitudinal archive. Automating the process of individual identification could represent a step change in our use of large image databases from the wild to open up vast amounts of data available for ethologists to analyze behavior for research and conservation in the wildlife sciences.
We developed an automated pipeline that can individually identify and track wild apes in raw video footage and demonstrate its use on a dataset spanning 14 years of a longitudinal video archive of chimpanzees (P. troglodytes verus) from Bossou, Guinea (20). Data used were collected in the Bossou forest, southeastern Guinea, West Africa, a long-term chimpanzee field site established by Kyoto University in 1976 (21). Bossou is home to an outdoor laboratory: a natural forest clearing (7 m by 20 m) located in the core of the Bossou chimpanzees' home range (07°39′N, 008°30′W) where raw materials for tool use (stones and nuts) are provisioned, and the same group has been recorded since 1988. The use of standardized video recording over many field seasons has led to the accumulation of over 30 years of video data, providing unique opportunities to analyze chimpanzee behavior over multiple generations (22). Our framework consists of detection and tracking of individuals through the video (localization in space and time) as well as sex and identity recognition (Fig. 1 and movie S1). Both the detection and tracking stage and the sex and identity recognition stage use a deep CNN model.
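The detection-then-tracking logic can be illustrated with a simplified sketch. The snippet below greedily links per-frame face detections into face tracks by bounding-box overlap; it is a crude stand-in for the actual pipeline, which uses an SSD detector and a KLT tracker, and all function names and thresholds here are illustrative assumptions, not the paper's implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def link_tracks(detections_per_frame, iou_thresh=0.5):
    """Greedily link per-frame detections into face tracks.

    detections_per_frame: list (one entry per frame) of lists of boxes.
    Returns a list of tracks, each a list of (frame_index, box) pairs.
    """
    tracks, active = [], []  # active: (track_index, last_box) pairs
    for frame_idx, boxes in enumerate(detections_per_frame):
        new_active = []
        for box in boxes:
            best, best_iou = None, iou_thresh
            for track_idx, last_box in active:
                overlap = iou(last_box, box)
                if overlap > best_iou:  # extend best-overlapping track
                    best, best_iou = track_idx, overlap
            if best is None:  # no sufficient overlap: start a new track
                best = len(tracks)
                tracks.append([])
            tracks[best].append((frame_idx, box))
            new_active.append((best, box))
        active = new_active
    return tracks
```

A KLT tracker follows facial feature points between frames and so tolerates faster motion than this overlap heuristic, but the grouping step, turning independent per-frame detections into per-individual face tracks, plays the same role.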
Fig. 1. Fully unified pipeline for wild chimpanzee face tracking and recognition from raw video footage. The pipeline consists of the following stages: (A) Frames are extracted from raw video. (B) Detection of faces is performed using a deep CNN single-shot detector (SSD) model. (C) Face tracking, which is implemented using a Kanade-Lucas-Tomasi (KLT) tracker (25) to group detections into face tracks. (D) Facial identity and sex recognition, which are achieved through the training of deep CNN models. (E) The system only requires the raw video as input and produces labeled face tracks and metadata as temporal and spatial information. (F) This output from the pipeline can then be used to support, for example, social network analysis. (Photo credit: Kyoto University, Primate Research Institute)
We applied this pipeline to ca. 50 hours of footage featuring 23 individuals, resulting in a total of 10 million face detections (Figs. 2 and 3) and more than 20,000 face tracks (see Fig. 1A and Materials and Methods). The training set for the face recognition model consisted of 15,274 face tracks taken from four different years (2000, 2004, 2008, and 2012) within the full dataset, belonging to 23 different chimpanzees of the Bossou community, ranging in estimated age from newborn to 57 years (table S1). A proportion of face tracks were held out to test the model's performance in each year, as well as to provide an all-years overall accuracy (table S2). Our chimpanzee face detector achieved an average precision of 81% (fig. S1), and our recognition model performed well on extreme poses and profile faces typical of videos recorded in the wild (Fig. 2B, table S3, and movie S1), achieving an overall recognition accuracy of 92.47% for identity and 96.16% for sex. We tested both frame-level accuracy, wherein our model is applied to detections in every frame to obtain predictions, and track-level accuracy, which averages the predictions for each face track. Using track-level labels compared with frame-level labels provided a large accuracy boost (table S3), demonstrating the superiority of our video-based method over frame-level approaches. We note that these results include faces from all viewpoints (frontal, profile, and extreme profile); if only frontal faces were used, then the identity recognition accuracy improves to 95.07% and the sex recognition accuracy to 97.36% (table S3).
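The track-level aggregation described above can be sketched as averaging the per-frame class probabilities over a face track before taking the argmax. The exact pooling used in the pipeline may differ; this is an illustrative sketch only.

```python
import numpy as np

def track_label(frame_probs):
    """Aggregate per-frame softmax outputs into one track-level prediction.

    frame_probs: (n_frames, n_classes) array of per-frame class
    probabilities for the same face track. Averaging lets confident
    frames outvote ambiguous ones (e.g., extreme profiles).
    Returns (predicted_class, mean_confidence).
    """
    mean_probs = np.asarray(frame_probs, dtype=float).mean(axis=0)
    return int(mean_probs.argmax()), float(mean_probs.max())
```

For example, a track whose first frame favors the wrong identity, `[[0.2, 0.8], [0.9, 0.1], [0.7, 0.3]]`, is still assigned class 0 with mean confidence 0.6, mirroring the within-track correction shown in Fig. 2A.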
Fig. 2. Face recognition results demonstrating the CNN model's robustness to variations in pose, lighting, scale, and age over time. (A) Example of a correctly labeled face track. The first two faces (nonfrontal) were initially labeled incorrectly by the model but were corrected automatically by recognition of the other faces in the track, demonstrating the benefit of our face track aggregation approach. (B) Examples of chimpanzee face detections and recognition results in frames extracted from raw video. Note how the system has achieved invariance to scale and is able to perform identification despite extreme poses and occlusions from vegetation and other individuals. (C) Examples of correctly identified faces for two individuals. The individuals age 12 years from left to right (top row: from 41 to 53 years; bottom row: from 2 to 14 years). Note how the model can recognize extreme profiles, as well as faces with motion blur and lighting variations. (Photo credit: Kyoto University, Primate Research Institute)
Fig. 3. Face detection and recognition results. (A) Histograms of detection numbers for individuals in the training and test years of the dataset (2000, 2004, 2006, 2008, 2012, and 2013). (B) Output of the model for the number of individuals detected in each year and the proportion of individuals in different age categories based on existing estimates of individual ages.
Our model demonstrates the efficacy of using deep neural network architectures for a direct biological application: the detection, tracking, and recognition of individual animals in longitudinal video archives from the wild. Unlike previous automation attempts (17, 18), we operate on a very large scale, processing millions of faces. In turn, the scale of the dataset allows us to use state-of-the-art deep learning, avoiding the use of older, less powerful classifiers. Our approach is also enriched by the use of a video-based, rather than frame-based, method, which improves accuracy by pooling multiple detections of the same individual before coming to a decision. We demonstrate that face recognition is possible on data at least 1 year beyond that supplied during the training phase, opening up the possibility of analyzing years that human coders may not have even seen themselves.
Unlike other primate face recognition work [e.g., (13, 18)], we do not constrain the video data in any way, such as by aligning face poses or selecting for age, resolution, or lighting. We do this to perform the task in the wild and ensure an end-to-end pipeline that will work on raw video with minimal preprocessing. Hence, the performance of our model is highly dependent on numerous factors, such as variation in image quality and pose. For example, model accuracy increases monotonically with image resolution (fig. S4), and testing only on frontal faces increases performance. On unconstrained faces, our model outperformed humans, highlighting the difficulty of the task. Humans' poor performance is likely due to the specificity of the task: Normally, researchers who observe behavior in situ can rely on multiple additional cues, e.g., behavioral context, full body posture and movement, handedness, and proximity to other individuals, while those coding video footage can replay scenes.
While our model was developed using a chimpanzee dataset, the extent of its generalizability to other species is an important question for its immediate research value. We show some preliminary examples of our face detector (with no further modification) applied to other primate species in Fig. 5. Our detector, trained solely on chimpanzee faces, generalized well, and the tracking part of our pipeline is completely agnostic to the species to be tracked (25). Individual recognition will require a corpus annotated with identity labels; however, we release all software open source such that researchers can produce their own training sets using our automated framework. Such a corpus may not need to be as large as the one used in this study; in supervised machine learning, features learned on large datasets are often directly useful in similar tasks, even those that are data poor. For instance, in the visual domain, features learned on ImageNet (26) are routinely used as input representations in other computer vision tasks with smaller datasets (27). Hence, the features learned by our deep model will likely also be useful for other primate-related tasks, even if the datasets are smaller.
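The feature-reuse idea can be illustrated with a toy sketch: a frozen feature extractor (here, a fixed random projection with a ReLU, standing in for a pretrained CNN backbone) combined with a lightweight nearest-centroid classifier fitted on a small labeled set. Everything in this snippet is an illustrative assumption; it is not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a fixed (frozen) random projection.
# In practice this would be the convolutional layers of a network trained
# on a large corpus, reused without retraining.
W = rng.standard_normal((64, 16))

def features(x):
    """Frozen feature extractor: linear projection followed by ReLU."""
    return np.maximum(np.asarray(x) @ W, 0.0)

def fit_centroids(X, y):
    """Fit only a lightweight classifier (nearest centroid in feature
    space) on top of the frozen features -- the small-labeled-corpus
    regime described in the text."""
    F = features(X)
    return {int(c): F[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Assign the class whose feature-space centroid is closest."""
    f = features(x)
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))
```

The point of the sketch is that only the small classifier on top is fitted to the new data; the (frozen) representation does the heavy lifting, which is why a new species' identity corpus can be much smaller than the original training set.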
Fig. 5. Preliminary results from the face detector model tested on other primate species. Top row: P. troglodytes schweinfurthii, Pan paniscus, Gorilla beringei, Pongo pygmaeus, Hylobates muelleri, and Cebus imitator. Bottom row: Papio ursinus (×2), Chlorocebus pygerythrus (×2), Eulemur macaco, and Nycticebus coucang. Image sources: Chimpanzee: www.youtube.com/watch?v=c2u3NKXbGeo; Bonobo: www.youtube.com/watch?v=JF8v_HWvfLc&t=9s; Gorilla: www.youtube.com/watch?v=wDECqJsiGqw&t=28s; Orangutan: www.youtube.com/watch?v=Gj2W5BHu-SI; Gibbon: www.youtube.com/watch?v=C6HucIWKsVc; Capuchin: Lynn Lewis-Bevan (personal data); Baboon: Lucy Baehren (personal data); Vervet monkey: Lucy Baehren (personal data); Loris: www.youtube.com/watch?v=2Syd_BUbl5A&t=2s.
The ultimate goal for using computational frameworks in wildlife science is to move beyond the use of visual images for the monitoring and censusing of populations to automated analyses of behaviors, quantifying social interactions and group dynamics. For example, sampling the sheer quantity of wild animals' complex social interactions for social network analysis typically represents a daunting methodological challenge (28). The use of animal-borne biologgers and passive transponders has automated data collection at high resolution for numerous species (29), but these technologies require capturing subjects, are expensive and labor intensive to install and maintain, their application may be location specific (e.g., depends on animals approaching a receiver in a fixed location), and the data recorded typically lack contextual visual information.
We show that by using our face detector, tracker, and recognition pipeline, we are able to automate the sampling of social networks over multiple years, providing high-resolution output on the spatiotemporal occurrence and co-occurrence of specific group members. This automated pipeline can aid conservation and behavioral analyses, allowing us to retrospectively analyze key events in the history of a wild community, for example, by quantifying how the decrease in population size and loss of key individuals in the community affect the network structure, with a decrease in the connectivity and average degree of the network (Fig. 4 and table S5). Traditional ethology has been reliant on human observation, but adopting a deep learning approach for the automation of individual recognition and tracking will improve the speed and amount of data processed and introduce a set of quantifiable algorithms with the potential to standardize behavioral analysis and, thus, allow for reproducibility across different studies (30, 31).
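As a sketch of how the pipeline's output supports such analyses, the snippet below builds a weighted co-occurrence network from labeled face tracks, linking two individuals whenever their tracks overlap in time. The track tuple format and the overlap rule are illustrative assumptions, not the paper's exact association criterion.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(tracks, max_gap=0):
    """Build an undirected, weighted co-occurrence network.

    tracks: list of (identity, first_frame, last_frame) tuples, i.e., the
    kind of spatiotemporal metadata the pipeline outputs per face track.
    Two individuals are linked whenever their tracks overlap in time
    (within max_gap frames); edge weights count overlapping track pairs.
    """
    edges = Counter()
    for (id_a, s_a, e_a), (id_b, s_b, e_b) in combinations(tracks, 2):
        if id_a == id_b:
            continue  # skip pairs of tracks from the same individual
        if min(e_a, e_b) + max_gap >= max(s_a, s_b):  # temporal overlap
            edges[frozenset((id_a, id_b))] += 1
    return edges
```

With identities "A" (frames 0 to 100), "B" (50 to 150), and "C" (200 to 300), only A and B co-occur, so the network has a single edge of weight 1. Network statistics such as connectivity and average degree can then be computed per year to track changes in community structure.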