Animal Wildlife Population Estimation Using Social Media Images Collections
We are losing biodiversity at an unprecedented scale, and in many cases we do not even have basic data for the species. Traditional methods for wildlife monitoring are inadequate. The development of new computer vision tools enables the use of images as a source of information about wildlife. Social media is a rich source of wildlife images, but these images come with a large bias that thwarts traditional population size estimation approaches. Here, we present a new framework that takes the social media bias into account when using this data source to provide wildlife population size estimates. We show that, surprisingly, this is a learnable and potentially solvable problem.
According to a recent UN report (Díaz et al., 2019), we are losing biodiversity at an unprecedented scale and rate. In many cases, we do not have basic data for the species, such as population sizes. Traditional methods for wildlife monitoring are expensive, unsustainable, and unscalable (Witmer, 2005). Alternative approaches are urgently needed.
The development of new computer vision tools for wildlife, such as Wildbook™ (Berger-Wolf et al., 2017), makes it possible to analyze a huge quantity of animal images and to identify animals at the individual level. This identification process is fundamental to several wildlife estimation techniques, such as capture-mark-recapture (Lancia et al., 2005), and it enables the use of images as a source of information about wildlife. Social media is a rich source of such images, but people upload images according to their tastes and experiences, creating biases which have been proven to affect wildlife population estimates. Therefore, it is necessary to develop new frameworks that take the social media bias into account when using this data source to provide wildlife estimates (Olteanu et al., 2016).
Recently, work has started to leverage social media data for animal wildlife population size estimation (Menon et al., 2016; Menon, 2017). This work has focused on predicting the likelihood of an image being shared on social media. Our goal is to directly estimate the number of animals photographed by a social media user, given the pictures they shared online. These two numbers may differ if the user posts online only a subset of pictures taken. Moreover, we elevate the problem from single images to image collections since we believe that the shareability of an image is a function not only of the image features, but also of the features of the other photos taken at the same event.
2. Problem Statement
Consider the set of images in a single image collection posted on a social media platform, together with the set of images on the one or more SD cards from which the posted images were taken.
First, given an image collection shared on social media by a photographer, we want to estimate the real number of animals present on that photographer’s SD cards. We refer to this problem as the estimate problem. We model it as a regression problem: we map the shared image collection to a set of features, which includes the number of individual animals in the collection, and predict the number of animals on the SD cards from these features. Then, using a set of such estimates, we want to provide an estimate for the number of animals of the given species.
Secondly, given the collection of pictures that were taken by the photographer, in addition to knowing which of them were shared, we define our problem as a binary classification. The labels are “shared” and “not shared”. We refer to this part of the problem as the shareability problem.
GGR1 and GGR2. For our research, we focus on Grevy’s zebras (Equus grevyi) for two reasons. First, its population lives in a delimited area (Rubenstein et al., 2016), which makes it possible to census it. The most recent such censuses were conducted using photographs taken by volunteer citizen scientists over the course of 2 days in 2016 and in 2018: the Great Grevy’s Rally (Berger-Wolf et al., 2016). Wildbook™ was then used to analyze the images and identify individual animals, resulting in two datasets, GGR1 (2016) and GGR2 (2018), which have also been used to provide the ground truth for the population size.
Flickr™ was used as the social media data source. Using the Flickr™ API, we downloaded the images matching the keyword “grevy’s zebra”, together with their albums.
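The text does not say which API methods were used for the download. As a minimal sketch, assuming the public Flickr REST endpoint with `flickr.photos.search` for the keyword query and `flickr.photos.getAllContexts` for looking up the albums a photo belongs to, the request URLs could be built as follows. The helper names and the `YOUR_API_KEY` placeholder are hypothetical and not taken from the paper.

```python
from urllib.parse import urlencode

# Public Flickr REST endpoint.
FLICKR_REST = "https://api.flickr.com/services/rest/"


def search_url(api_key, text, page=1, per_page=100):
    """Build a flickr.photos.search request URL (hypothetical helper)."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": text,            # free-text keyword query
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,     # ask for plain JSON instead of JSONP
    }
    return FLICKR_REST + "?" + urlencode(params)


def contexts_url(api_key, photo_id):
    """Build a flickr.photos.getAllContexts request URL (hypothetical helper)."""
    params = {
        "method": "flickr.photos.getAllContexts",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


# Example: first page of results for the keyword used in the text.
first_page = search_url("YOUR_API_KEY", "grevy's zebra")
```

Only the URLs are constructed here; fetching, paging through results, and respecting rate limits are left out.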
For the estimate problem, we lacked the ground truth of people’s preferences in sharing images. Using the images from GGR1 and GGR2, we created an online survey for each SD card collected during the event. Each survey listed all the photos on the SD card, and for each photo we asked the interviewee whether or not they would share it on social media.
From each image we extracted a set of features describing various aspects of its beauty as well as biological information about the animals in the photo, similar to Menon (2017). We used Wildbook™ to identify each individual animal in all the photos. We also introduced features modeling the structure of the source collection, in order to account for all the pictures that were taken.
Using our labelled pictures from GGR1 and GGR2, we trained a regression model to predict the percentage of animals that the interviewee would share. The inverse of this number is a coefficient, so that the estimated number of animals photographed by the user is this coefficient multiplied by the number of animals in the shared collection. This model was then applied to each image collection retrieved from Flickr™.
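The arithmetic behind the coefficient can be sketched as below. This is an illustration under stated assumptions, not the authors' code: it assumes the shared percentage is computed over distinct individual animals, and the function names and the toy SD card are hypothetical.

```python
def share_fraction(photos):
    """Fraction of distinct animals on an SD card that appear in shared photos.

    `photos` is a list of (animal_ids, shared) pairs, one per photo, where
    `animal_ids` is a set of individual IDs and `shared` is the survey answer.
    Assumption: the fraction is taken over distinct individuals.
    """
    all_animals, shared_animals = set(), set()
    for animal_ids, shared in photos:
        all_animals.update(animal_ids)
        if shared:
            shared_animals.update(animal_ids)
    if not all_animals:
        raise ValueError("no identified animals on this SD card")
    return len(shared_animals) / len(all_animals)


def estimate_photographed(n_shared_animals, predicted_fraction):
    """Scale the shared animal count by the inverse of the predicted fraction."""
    if not 0 < predicted_fraction <= 1:
        raise ValueError("predicted fraction must be in (0, 1]")
    coefficient = 1.0 / predicted_fraction
    return coefficient * n_shared_animals


# Toy SD card: four distinct animals, two of which appear in the shared photo.
card = [({"z1", "z2"}, True), ({"z3"}, False), ({"z2", "z4"}, False)]
fraction = share_fraction(card)                 # 0.5
estimate = estimate_photographed(2, fraction)   # 4.0
```

In the setting described above, `share_fraction` would supply the regression target on the survey data, while for a Flickr™ collection the fraction would come from the trained model rather than from survey answers.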
In order to provide an estimate for the entire population, we relied on the Jolly-Seber method (Jolly, 1965; Seber, 1965), a capture-mark-recapture technique that uses data from multiple years. Given two years, we took all the coefficients coming from the image collections of photos taken in those two years and estimated the overall coefficient as the average of these coefficients. With all the variables for the Jolly-Seber method available, we computed an estimate of the Grevy’s zebra population. Finally, we multiplied this estimate by the computed coefficient to take into account the recaptures that a computer vision tool cannot detect, because the stripe patterns on the two sides of a Grevy’s zebra are not symmetric.
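The Jolly-Seber computation itself is not reproduced in the text. To illustrate the capture-mark-recapture idea with two years of sightings, the sketch below swaps in the simpler two-sample Lincoln-Petersen estimator, which is not the authors' method, and then applies the averaged coefficient as described in the last sentence above. The individual IDs and coefficient values are made up for the example.

```python
from statistics import mean


def lincoln_petersen(first_ids, second_ids):
    """Two-sample capture-recapture estimate: n1 * n2 / m.

    n1 and n2 are the numbers of distinct individuals seen in each year and
    m is the number seen in both. This is a stand-in for Jolly-Seber.
    """
    first, second = set(first_ids), set(second_ids)
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return len(first) * len(second) / recaptured


# Hypothetical individuals identified in the images of two different years.
year_1 = {"z1", "z2", "z3", "z4"}
year_2 = {"z3", "z4", "z5", "z6", "z7", "z8"}

# Hypothetical per-collection coefficients from the regression model.
coefficients = [1.5, 2.0, 2.5]

population = lincoln_petersen(year_1, year_2)   # 4 * 6 / 2 = 12.0
overall_coefficient = mean(coefficients)        # 2.0
scaled_population = overall_coefficient * population  # 24.0
```

Jolly-Seber additionally models births, deaths, and migration between sampling occasions, so a real analysis would need the full method or an established capture-recapture package.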
5. Experimental Setup
For the estimate problem, we chose the mean and the mode as weak baselines, and elastic net, gradient boosting trees, and SVR as strong models. The models were evaluated using 10 repetitions of 10-fold cross-validation with the R2 and MSE metrics.
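As a rough sketch of this evaluation protocol, the following standard-library code runs a repeated k-fold cross-validation of the mean baseline and reports R2 and MSE. The target values are synthetic and the function names are hypothetical; the strong models would be plugged in where the baseline prediction is computed.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train, test) index lists for n_repeats reshuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_splits):
            test = order[k::n_splits]          # every n_splits-th shuffled index
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def r2_score(y_true, y_pred):
    """Coefficient of determination on one held-out fold."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mse(y_true, y_pred):
    """Mean squared error on one held-out fold."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def evaluate_mean_baseline(y, **cv_kwargs):
    """Average R2 and MSE of the training-mean predictor over all folds."""
    r2s, mses = [], []
    for train, test in repeated_kfold(len(y), **cv_kwargs):
        prediction = mean(y[i] for i in train)   # fit: mean of the training fold
        y_test = [y[i] for i in test]
        y_pred = [prediction] * len(test)
        r2s.append(r2_score(y_test, y_pred))
        mses.append(mse(y_test, y_pred))
    return mean(r2s), mean(mses)


# Synthetic shared-fraction targets, for illustration only.
data_rng = random.Random(1)
targets = [data_rng.random() for _ in range(50)]
baseline_r2, baseline_mse = evaluate_mean_baseline(targets)
```

A constant prediction can never beat the held-out fold's own mean, so the baseline R2 is at most zero on every fold. This is the reference level against which the strong models are compared.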
For the shareability problem, we tested the performance of several state-of-the-art models: Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, AdaBoost, and XGB. The models were evaluated using 10-fold cross-validation, and we aimed at maximizing the F-1 score and the Accuracy. GGR1 and GGR2 have different structures, and we anticipated that the former, being more regular, would be easier for our models to learn. For this reason, we trained our models on each dataset separately, on the combination of the two, and on GGR1 as the training set with GGR2 as the test set.
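For reference, the two metrics can be computed from predicted and true labels as in the sketch below. The label strings follow the problem statement, while the function name and the toy predictions are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive="shared"):
    """Return (accuracy, F-1 score) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    denominator = 2 * tp + fp + fn
    # F-1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy survey answers and model predictions for five photos.
truth = ["shared", "not shared", "shared", "not shared", "shared"]
predicted = ["shared", "shared", "not shared", "not shared", "shared"]
accuracy, f1 = accuracy_and_f1(truth, predicted)  # 3/5 and 4/6
```

Reporting both metrics is useful if the two labels are imbalanced, since Accuracy alone can look high for a model that rarely predicts the minority class.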
eXtreme Gradient Boosting (XGB) (Chen and Guestrin, 2016) provided the best results in both problems. For the estimate problem, it achieved an R2 of 0.417 (standard deviation 0.155) and an MSE of 0.070 (standard deviation 0.020), whereas the baselines scored negative R2 values. Results are shown in Table 1. For the shareability problem, this classifier achieved an Accuracy of with an F-1 Score of on the combined dataset, and an Accuracy of (F-1 score of ) on GGR1 alone.
7. Conclusions and Future Work
We proposed a species-independent framework to estimate the size of a wildlife population using social media images. Moreover, we demonstrated that the image shareability problem is learnable and that the structure of the original collection provides useful information when modeling user behavior.
Our approach needs to be tested on other social media platforms and other species. We expect that features accounting for the special diversity that is present in realistic datasets will improve the performance. Collecting more data will also allow for an easier detection of existing patterns in the sharing behavior, particularly in more complex and irregular datasets.
- Berger-Wolf et al. (2016) The Great Grevy’s Rally: the need, methods, findings, implications and next steps. Technical Report, Grevy’s Zebra Trust, Nairobi, Kenya.
- Berger-Wolf et al. (2017) Wildbook: crowdsourcing, computer vision, and data science for conservation. arXiv preprint arXiv:1710.08880.
- Chen and Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- Díaz et al. (2019, May 6) Summary for policymakers of the global assessment report on biodiversity and ecosystem services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services. Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). https://www.ipbes.net/system/tdf/spm_global_unedited_advance.pdf?file=1&type=node&id=35245
- Jolly (1965) Explicit estimates from capture-recapture data with both death and immigration-stochastic model. Biometrika 52 (1/2), pp. 225–247.
- Lancia et al. (2005) Estimating the number of animals in wildlife populations.
- Menon et al. (2016) Animal population estimation using Flickr images.
- Menon (2017) Animal wildlife population estimation using social media images. Ph.D. Thesis.
- Olteanu et al. (2016) Social data: biases, methodological pitfalls, and ethical boundaries.
- Rubenstein et al. (2016) Equus grevyi. The IUCN Red List of Threatened Species 2016: e.T7950A89624491.
- Seber (1965) A note on the multiple-recapture census. Biometrika 52 (1/2), pp. 249–259.
- Witmer (2005) Wildlife population monitoring: some practical considerations. Wildlife Research 32 (3), pp. 259–263.