An Entity Resolution Approach to Isolate Instances of
Human Trafficking Online
Human trafficking is a challenging law enforcement problem, and a large amount of such activity manifests itself on various online forums. Given the large, heterogeneous and noisy structure of this data, building models to predict instances of trafficking is an even more convoluted task. In this paper we propose an entity resolution pipeline using a notion of proxy labels, in order to extract clusters from this data with prior history of human trafficking activity. We apply this pipeline to 5M records from backpage.com and report on the performance of this approach, challenges in terms of scalability, and some significant domain-specific characteristics of our resolved entities.
Chirag Nagpal, Kyle Miller, Benedikt Boecking and Artur Dubrawski
Carnegie Mellon University
1 Introduction

Over the years, human trafficking has grown into a challenging law enforcement problem. The advent of the internet has brought the activity into the public domain, making it an even greater societal concern. Prior studies have applied computational techniques to this data to detect spatio-temporal patterns by utilizing certain features of the ads (Kennedy, 2012). Other studies have utilized machine learning approaches to identify whether ads could possibly be involved in human trafficking activity (Dubrawski et al., 2015). Significant work has also been carried out in building large distributed systems to store and process such data, and to carry out entity resolution to establish ontological relationships between various entities (Szekely et al., 2015).
In this paper we explore the possibility of leveraging this information to identify the sources of these advertisements, isolate such clusters, and identify potential sources of human trafficking from this data using prior domain knowledge.
In ordinary entity resolution schemes, each record is considered to represent a single entity. A popular approach in such scenarios is a 'merge and purge' strategy, wherein records are compared and, when matched, merged into a single more informative record, with the individual records then deleted from the dataset (Benjelloun et al., 2009).
While our problem can be considered a case of entity resolution, escort advertisements are a challenging, noisy and unstructured dataset. A single advertisement may represent one entity or a group of entities, and hence may contain features belonging to more than one individual or group.
The advertisements are associated with multiple features, including text, hyperlinks, images, timestamps and locations. To featurize characteristics from the text we use a regex-based information extractor built on the GATE framework (Cunningham, 2002). This allows us to generate certain domain-specific features from our dataset, including the aliases, cost, location, phone numbers and specific URLs of the entities advertised. We use these features, along with other generic text features and the images, as inputs to our classifier. The high reuse of similar features makes it difficult to rely on an exact match over a single feature to perform entity resolution.
We therefore leverage machine learning to learn a function that can predict whether two advertisements are from the same source. The challenge is that we have no prior knowledge of the source of the advertisements. We thus depend upon a strong feature, in our case phone numbers, which can be used as proxy evidence for the source of an advertisement and can help us generate labels for the training and test data of a classifier. Using such strong evidence to learn another labeling function is a semi-supervised approach described as 'surrogate learning' in (Veeramachaneni and Kondadadi, 2009). Pairwise comparison results in an extremely high number of comparisons over the entire dataset; to reduce this, we use a blocking scheme based on certain features.
The resulting clusters are isolated for human trafficking using prior expert knowledge and featurized. Rule learning is used to establish differences between these and other components. The entire pipeline is represented by Figure 2.
2 Domain and Feature Extraction
Figure 1 is illustrative of the search results of escort advertisements and a page advertising a particular individual. The text is inundated with special characters, emojis, and misspelled words that are specific markers and highly informative to domain experts. The text contains information regarding the escort's area of operation, phone number, any particular client preferences, and the advertised cost. We build regular-expression-based feature extractors to capture this information and store it in a fixed schema, using the popular JAPE tool from the GATE suite of NLP tools. The extractor we built for this domain, AnonymousExtractor, is open source and publicly available at github.com/mille856/CMU_memex.
Table 1 lists the performance of our extraction tool on 1,000 randomly sampled escort advertisements for the various features. Most of the features are self-explanatory (the reader is directed to (Dubrawski et al., 2015) for a complete description of the fields extracted). The noisy nature of the data, along with intentional obfuscation, especially for features like names, results in lower performance compared to the other extracted features.
Apart from the regular-expression-based features, we also extract the hashcodes of the images in the advertisements, the posting date and time, and the location. (These features are present as metadata and do not require hand-engineered regexes.)
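As an illustration of this style of extraction, the following is a minimal Python sketch in the spirit of the JAPE rules described above. The patterns, field names, and sample ad are assumptions for illustration only; they are not the actual AnonymousExtractor rules.

```python
import re

# Illustrative patterns only -- not the real AnonymousExtractor rules.
PATTERNS = {
    # Ten digits with optional separators, e.g. "(412) 555 0123" or "412-555-0123"
    "phone": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    # A dollar amount followed by a duration marker, e.g. "$120/hr"
    "rate": re.compile(r"\$?\d{2,4}\s*/\s*(?:hr|hour|hh)", re.IGNORECASE),
}

def extract_features(ad_text):
    """Return a dict mapping each field to the list of matches found."""
    return {field: pat.findall(ad_text) for field, pat in PATTERNS.items()}

ad = "New in town! Call (412) 555 0123. Specials from $120/hr."
print(extract_features(ad))
```

In practice the real extractor must also handle intentional obfuscations (digits spelled out, emojis inserted mid-token), which simple patterns like these will miss.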
3 Entity Resolution
We approach the problem of extracting connected components from our dataset using pairwise entity resolution. The similarity or connection between two nodes is treated as a learning problem, with training data for the problem generated by using ‘proxy’ labels from existing evidence of connectivity from strong features.
3.1 Problem Formulation

More formally, the problem is to sample all connected components from a graph G = (V, E). Here V, the set of vertices {v_1, v_2, ..., v_n}, is the set of advertisements, and E is the set of edges between individual records, the presence of an edge indicating that the two records represent the same entity.

We need to learn a match function M such that

M(v_i, v_j) = 1 if (v_i, v_j) ∈ E, and 0 otherwise.

The set of strong features present in a given record can be considered the output of a function s. Thus, in our problem, s(v) represents the set of all phone numbers associated with the advertisement v.

Now consider the graph G_s defined on the same set of vertices V, with an edge (v_i, v_j) whenever s(v_i) ∩ s(v_j) ≠ ∅ (more simply, the graph described by the strong features).

Let C = {c_1, c_2, ..., c_k} be the set of connected components of the graph G_s.

The proxy labeling function M_s is then such that, for any pair (v_i, v_j), M_s(v_i, v_j) = 1 if v_i and v_j belong to the same component c ∈ C, and 0 otherwise.
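The proxy-label construction described above can be sketched as follows: a minimal illustration using union-find, where the ads and phone numbers are made up and s(v) is represented as a set of phone numbers per ad.

```python
# Hedged sketch: ads sharing a phone number (the strong feature s(v)) are
# joined in G_s, and pairs inside the same connected component receive a
# positive proxy label. Union-find is an implementation choice; the toy
# data below is invented for illustration.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def components_from_strong_features(ads):
    """ads: dict id -> set of phone numbers. Returns id -> component root."""
    parent = {a: a for a in ads}
    by_phone = {}
    for a, phones in ads.items():
        for p in phones:
            if p in by_phone:  # union ads that share a phone number
                ra, rb = find(parent, a), find(parent, by_phone[p])
                parent[ra] = rb
            else:
                by_phone[p] = a
    return {a: find(parent, a) for a in ads}

ads = {1: {"555-0100"}, 2: {"555-0100", "555-0200"}, 3: {"555-0200"}, 4: {"555-0300"}}
comp = components_from_strong_features(ads)
# Ads 1, 2, 3 are transitively connected through shared numbers; ad 4 stands alone.
print(comp)
```

Note that ads 1 and 3 share no phone number directly, yet receive the same proxy label through ad 2; this transitivity is exactly what makes the components richer than exact phone-number match.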
3.2 Sampling Scheme
For our classifier we need to generate a set of training examples T = T⁺ ∪ T⁻, where T⁺ and T⁻ are the subsets of samples labeled positive and negative, respectively.
In order to ensure that the sampling scheme does not end up sampling near-duplicate pairs, we introduce a sampling bias on the pairs drawn from each connected component.
This reduces the likelihood of sampling near duplicates, as evidenced in Figure 4, which is a histogram of the Jaccard similarity between the sets of unigrams of the text contained in each pair of ads.
We observe that although we do still end up with some near duplicates, we have a high number of non-duplicates, which ensures robust training data for our classifier.
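The near-duplicate check above can be sketched as a Jaccard similarity over unigram sets; the example ad texts below are invented for illustration.

```python
# Hedged sketch of the Jaccard similarity over text unigrams used in
# Figure 4 to gauge whether a sampled pair is a near duplicate.
def jaccard_unigrams(text_a, text_b):
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

pair_dup = ("new in town call now", "new in town call now today")
pair_ok = ("new in town call now", "sweet and discreet visiting this weekend")
print(jaccard_unigrams(*pair_dup))  # high similarity: likely a near duplicate
print(jaccard_unigrams(*pair_ok))   # low similarity: a useful training pair
```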
3.3 Match Function

To train our match function we experiment with various classifiers, including logistic regression, naive Bayes and random forests, using scikit-learn (Pedregosa et al., 2011). Table 2 shows the most informative features learnt by the random forest classifier. It is interesting to note that the most informative features include spatial (location), temporal (time difference, posting date) and linguistic (number of special characters, longest common substring) features. We also find that the domain-specific features extracted using regexes prove to be informative.
Top 10 Features

| Rank | Feature |
|------|---------|
| 2 | Number of Special Characters |
| 3 | Longest Common Substring |
| 4 | Number of Unique Tokens |
| 6 | If Posted on Same Day |
| 7 | Presence of Ethnicity |
| 8 | Presence of Rate |
| 9 | Presence of Restrictions |
| 10 | Presence of Names |
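Two of the pairwise linguistic features in the table above can be sketched as follows; these implementations are plausible assumptions, not the paper's exact feature code.

```python
# Hedged sketches of two pairwise linguistic features: the count of special
# characters in an ad, and the longest common substring between two ads.
def special_char_count(text):
    return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())

def longest_common_substring(a, b):
    # Classic O(len(a) * len(b)) dynamic program, tracking the best run
    # of matching characters ending at each position.
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

ad_a = "!!Sweet&discreet!! in town tonite"
ad_b = "Sweet&discreet girl, new in town"
print(special_char_count(ad_a), longest_common_substring(ad_a, ad_b))
```

A long shared substring such as "Sweet&discreet" is exactly the kind of reused boilerplate that signals a common source even when phone numbers differ.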
The ROC curves for the classifiers we tested with different feature sets are presented in Figure 5. The classifiers perform well, with extremely low false positive rates. Such behavior is desirable for a match function, in order to generate sensible results for the downstream tasks. High false positive rates increase the number of links between records, leading to a 'snowball effect' that results in a breakdown of the downstream entity resolution process, as evidenced in Figure 6.
In order to minimize this breakdown, we heuristically learn an appropriate confidence threshold for our classifier. This is done by carrying out the ER process on 10,000 randomly selected records from our dataset. The size of the largest extracted connected component and the number of connected components isolated are calculated for different confidence values of the classifier. This allows us to arrive at a sensible heuristic for the threshold.
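This threshold sweep can be sketched as follows; the pair scores below are invented, and a real run would use the classifier's scores on the 10,000-record sample.

```python
from collections import Counter

# Hedged sketch: for each candidate confidence value, link every pair scored
# above it and record the size of the largest connected component. A rapidly
# growing largest component signals the "snowball effect" described above.
def largest_component(n, edges):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return max(Counter(find(x) for x in range(n)).values())

# (ad_i, ad_j, classifier score) -- made-up scores for illustration.
scored_pairs = [(0, 1, 0.95), (1, 2, 0.80), (2, 3, 0.55), (3, 4, 0.30)]
for threshold in (0.5, 0.9):
    edges = [(a, b) for a, b, s in scored_pairs if s >= threshold]
    print(threshold, largest_component(5, edges))
```

Lowering the threshold from 0.9 to 0.5 in this toy example doubles the largest component, which is the behavior the heuristic is designed to keep in check.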
3.4 Blocking Scheme
Our dataset consists of over 5 million records, so naive pairwise comparison across the dataset makes the problem computationally intractable. In order to reduce the number of comparisons, we introduce a blocking scheme and perform exhaustive pairwise comparisons only within each block, before resolving the dataset across blocks. We block the dataset on features like rare unigrams, rare bigrams and rare images.
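The blocking idea can be sketched as follows; the corpus, the rare-token cutoff `max_df`, and the unigram-only keys are illustrative assumptions, whereas the actual scheme also blocks on rare bigrams and rare images.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hedged sketch: key each ad by its rare unigrams (tokens at or below an
# illustrative document-frequency cutoff) and only compare pairs that share
# at least one block, instead of all O(n^2) pairs.
def build_blocks(ads, max_df=2):
    df = Counter(tok for text in ads.values() for tok in set(text.split()))
    blocks = defaultdict(set)
    for ad_id, text in ads.items():
        for tok in set(text.split()):
            if df[tok] <= max_df:  # rare token -> useful blocking key
                blocks[tok].add(ad_id)
    return blocks

def candidate_pairs(blocks):
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

ads = {1: "sweet discreet kitty in town", 2: "sweet discreet kitty visiting",
       3: "new girl in town", 4: "petite and fun"}
print(candidate_pairs(build_blocks(ads)))
```

Here only 2 of the 6 possible pairs survive blocking; ads with no rare token in common (such as ad 4) are never compared at all.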
4 Rule Learning
We extract clusters and identify records that are associated with human trafficking using domain knowledge from experts. We featurize the extracted components using features like the size of the cluster, its spatio-temporal characteristics, and its connectivity. For our analysis, we consider only components with more than 300 advertisements. We then train a random forest to predict whether a cluster is linked to human trafficking. In order to establish statistical significance, we compare the ROC results of our classifier under 4-fold cross validation for 100 random connected components versus the positive set. Figure 9 and Table 4 list the performance of the classifier in terms of false positive and true positive rates, while Table 5 lists the most informative features for this classifier.
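The component featurization can be partially sketched as follows; cluster size and a simple connectivity measure (edge density) are shown, while the spatio-temporal features are omitted, and the toy component is invented for illustration.

```python
# Hedged sketch of two cluster-level features: size and edge density
# (fraction of possible edges that are present in the component).
def component_features(n_nodes, edges):
    possible = n_nodes * (n_nodes - 1) / 2
    density = len(edges) / possible if possible else 0.0
    return {"size": n_nodes, "density": density}

# A toy component of 4 ads connected in a chain: 3 of the 6 possible edges.
print(component_features(4, [(0, 1), (1, 2), (2, 3)]))
```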
We then proceed to learn rules from our feature set. Some of the rules, with corresponding ratios and lift, are given in Table 3, and P-N curves for various learnt rules are presented in Figure 10. It can be observed that the features used by rule learning to produce rules with maximum support and ratios correspond to the ones labeled informative by the random forest, which serves as further validation for the use of rule learning.
Top 5 Features

| Rank | Feature |
|------|---------|
| 3 | Std-Dev. of Image Frequency |
| 4 | Norm. No. of Names |
| 5 | Norm. No. of Unique Images |
5 Conclusion

In this paper we approached the problem of isolating sources of human trafficking from online escort advertisements with a pairwise entity resolution approach. We trained a classifier able to predict whether two advertisements are from the same source, using phone numbers as a strong feature and exploiting them as proxy ground truth to generate training data. The resulting classifier proved to be robust, as evidenced by its extremely low false positive rates. Other approaches (Szekely et al., 2015) aim to build similar knowledge graphs using a similarity score for each feature. This has some limitations. First, labelled training data is needed in order to train match functions to detect ontological relations. The challenge is aggravated because that approach considers each feature independently, making the generation of enough labelled training data for multiple match functions an extremely complicated task.
Since we utilise existing features as proxy evidence, our approach can generate a large amount of training data without the need for any human annotation. Our approach requires learning just a single function over the entire feature set; hence our classifier can learn multiple complicated relations between features to predict a match, instead of relying on a naive feature independence assumption.
We then used this classifier to perform entity resolution, with a heuristically learnt value of the classifier score as the match threshold. The resulting connected components were again featurised, and a classifier model was fit before being subjected to rule learning. In comparison with (Dubrawski et al., 2015), the connected component classifier performs a little better, with higher values of the area under the ROC curve and of TPR@FPR=1%, indicating a steeper ROC curve. We hypothesize that the entity resolution process lets us generate a larger, more robust set of training data that is resilient to labelling noise, resulting in a stronger classifier. The learnt rules show high ratios and lift for reasonably high supports, as evidenced in Table 3. Rule learning also adds an element of interpretability to the models we built: compared to more complex ensemble methods like random forests, hard rules as classification models are preferred by domain experts for building evidence for incrimination.
6 Future Work
While our blocking scheme performs well in reducing the number of comparisons, our approach still involves naive pairwise comparisons, so scalability remains a significant challenge. One option is to implement the pipeline in a distributed environment. Another is to use a computationally inexpensive technique to de-duplicate the dataset of near-duplicate ads, which would greatly help with scalability.
In our approach, the ER process depends upon a heuristically learnt match threshold. Lower threshold values can significantly degrade performance, producing extremely large connected components. Treating this threshold as a learning task would help make the approach more generic and less domain specific.
Hashcodes of the images associated with the ads were also utilized as features for the match function. However, simple features like the number of unique and common images did not prove to be very informative. Further research is required in order to make better use of such visual data.
Acknowledgements

The authors would like to thank all staff, faculty and students who made the Robotics Institute Summer Scholars program 2015 at Carnegie Mellon University possible.
References

- Bastian et al. (2009) Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. 2009. Gephi: An open source software for exploring and manipulating networks. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.
- Benjelloun et al. (2009) Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal: The International Journal on Very Large Data Bases 18(1):255–276.
- Cunningham (2002) Hamish Cunningham. 2002. Gate, a general architecture for text engineering. Computers and the Humanities 36(2):223–254.
- Dubrawski et al. (2015) Artur Dubrawski, Kyle Miller, Matthew Barnes, Benedikt Boecking, and Emily Kennedy. 2015. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking 1(1):65–85.
- Kennedy (2012) Emily Kennedy. 2012. Predictive patterns of sex trafficking online.
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12:2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195.
- Szekely et al. (2015) Pedro Szekely, Craig A. Knoblock, Jason Slepicka, Andrew Philpot, Amandeep Singh, Chengye Yin, Dipsy Kapoor, Prem Natarajan, Daniel Marcu, Kevin Knight, David Stallard, Subessware S. Karunamoorthy, Rajagopal Bojanapalli, Steven Minton, Brian Amanatullah, Todd Hughes, Mike Tamayo, David Flynt, Rachel Artiss, Shih-Fu Chang, Tao Chen, Gerald Hiebel, and Lidia Ferreira. 2015. Building and using a knowledge graph to combat human trafficking. In Proceedings of the 14th International Semantic Web Conference (ISWC 2015).
- Veeramachaneni and Kondadadi (2009) Sriharsha Veeramachaneni and Ravi Kumar Kondadadi. 2009. Surrogate learning: from feature independence to semi-supervised classification. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, pages 10–18.