An Entity Resolution Approach to Isolate Instances of Human Trafficking Online

Chirag Nagpal, Kyle Miller, Benedikt Boecking and Artur Dubrawski
chiragn@cs.cmu.edu, mille856@andrew.cmu.edu, bboecking@andrew.cmu.edu, awd@cs.cmu.edu
Carnegie Mellon University
Abstract

Human trafficking is a challenging law enforcement problem, and a large amount of such activity manifests itself on various online forums. Given the large, heterogeneous and noisy structure of this data, building models to predict instances of trafficking is an even more convoluted task. In this paper we propose an entity resolution pipeline using a notion of proxy labels, in order to extract clusters from this data with prior history of human trafficking activity. We apply this pipeline to 5M records from backpage.com and report on the performance of this approach, challenges in terms of scalability, and some significant domain-specific characteristics of our resolved entities.

1 Introduction

Over the years, human trafficking has grown into a challenging law enforcement problem. The advent of the internet has brought the problem into the public domain, making it an ever greater societal concern. Prior studies (Kennedy, 2012) have applied computational techniques to online advertisement data to detect spatio-temporal patterns by utilizing certain features of the ads. Other studies (Dubrawski et al., 2015) have used machine learning approaches to identify whether ads could be involved in human trafficking activity. Significant work has also been carried out in building large distributed systems to store and process such data, and to carry out entity resolution to establish ontological relationships between various entities (Szekely et al., 2015).

In this paper we explore the possibility of leveraging this information to identify sources of these advertisements, isolate such clusters and identify potential sources of human trafficking from this data using prior domain knowledge.

In ordinary Entity Resolution schemes, each record is considered to represent a single entity. A popular approach in such scenarios is a 'merge and purge' strategy, whereby as records are compared and matched, they are merged into a single, more informative record, and the individual records are deleted from the dataset (Benjelloun et al., 2009).

While our problem can be considered a case of Entity Resolution, escort advertisements are a challenging, noisy and unstructured dataset. A single advertisement may represent one entity or a group of entities. The advertisements hence might contain features belonging to more than one individual or group.

The advertisements are also associated with multiple features, including text, hyperlinks, images, timestamps and locations. To extract features from the text, we use a regex-based information extractor built on the GATE framework (Cunningham, 2002). This allows us to generate certain domain-specific features from our dataset, including the aliases, cost, location, phone numbers and specific URLs of the entities advertised. We use these features, along with other generic text and the images, as features for our classifier. The high reuse of similar features makes it difficult to use exact match over a single feature in order to perform entity resolution.

(a) Search Results on backpage.com
(b) Representative escort advertisement
Figure 1: Escort advertisements are a classic source of what can be described as Noisy Text. Notice the excessive use of Emojis, Intentional misspelling and relatively benign colloquialisms to obfuscate a more nefarious intent. Domain experts extract meaningful cues from the spatial and temporal indicators, and other linguistic markers to suspect trafficking activity, which further motivate the leveraging of computational approaches to support such decision making.

We proceed to leverage machine learning approaches to learn a function that can predict whether two advertisements are from the same source. The challenge is that we have no prior knowledge of the source of the advertisements. We thus depend upon a strong feature, in our case phone numbers, which can be used as proxy evidence for the source of the advertisements and can help us generate labels for the training and test data for a classifier. We can therefore use such strong evidence to learn another function, which can help us generate labels for the rest of our dataset. This semi-supervised approach is described as 'surrogate learning' in (Veeramachaneni and Kondadadi, 2009). Pairwise comparisons result in an extremely high number of comparisons over the entire dataset. In order to reduce this number, we use a blocking scheme based on certain features.

The resulting clusters are isolated for human trafficking using prior expert knowledge and featurized. Rule learning is used to establish differences between these and other components. The entire pipeline is represented by Figure 2.

Figure 2: The proposed Entity Resolution pipeline

2 Domain and Feature Extraction

Figure 1 illustrates the search results for escort advertisements and a page advertising a particular individual. The text is inundated with special characters, emojis, and misspelled words that are specific markers and highly informative to domain experts. The text contains information regarding the escort's area of operation, phone number, any particular client preferences, and the advertised cost. We build regular-expression-based feature extractors to extract this information and store it in a fixed schema, using JAPE, which is part of the GATE suite of NLP tools. The extractor we build for this domain, AnonymousExtractor, is open source and publicly available at github.com/mille856/CMU_memex.
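To make the idea of a regex-based extractor concrete, the following is a minimal Python sketch of phone number extraction. It is not the JAPE grammar used by the authors: the pattern PHONE_RE, the function extract_phone_numbers and the sample ad text are all assumptions made for illustration, and a real extractor would need to handle the obfuscations visible in Figure 1.

```python
import re

# Hypothetical pattern: ten digits, each optionally followed by one
# space, dot or dash, and not embedded in a longer run of digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\d[\s.\-]?){9}\d(?!\d)")


def extract_phone_numbers(text):
    """Return the distinct 10-digit phone numbers in `text`, as digit strings."""
    numbers = []
    for match in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # strip separators
        if digits not in numbers:
            numbers.append(digits)
    return numbers


# Made-up ad text; both spellings normalise to the same number.
ad_text = "Call me now 412-555-0123 or 412.555.0123, new in town!"
print(extract_phone_numbers(ad_text))
```

Normalising every match to a plain digit string means that differently formatted copies of the same number map to the same strong feature.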

Feature Precision Recall F1-Score
Age 0.980 0.731 0.838
Cost 0.889 0.966 0.926
E-mail 1.000 1.000 1.000
Ethnicity 0.969 0.876 0.920
Eye Color 1.000 0.962 0.981
Hair Color 0.981 0.959 0.970
Name 0.896 0.801 0.846
Phone Number 0.998 0.995 0.997
Restriction(s) 0.949 0.812 0.875
Skin Color 0.971 0.971 0.971
URL 0.854 0.872 0.863
Height 0.978 0.962 0.970
Measurement 0.919 0.883 0.901
Weight 0.976 0.912 0.943
Table 1: Performance of TJBatchExtractor

Table 1 lists the performance of our extraction tool on 1,000 randomly sampled escort advertisements for the various features. Most of the features are self-explanatory. (The reader is directed to (Dubrawski et al., 2015) for a complete description of the fields extracted.) The noisy nature of the text, along with intentional obfuscation, results in lower performance for features like names than for the other extracted features.
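The final column of Table 1 agrees, to within rounding, with the F1 score, the harmonic mean of precision and recall. The short check below recomputes it for three rows copied from the table; the function name f1_score and the rows dictionary are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported score) for three rows of Table 1.
rows = {
    "Age": (0.980, 0.731, 0.838),
    "Cost": (0.889, 0.966, 0.926),
    "Name": (0.896, 0.801, 0.846),
}

for feature, (p, r, reported) in rows.items():
    print(feature, round(f1_score(p, r), 3), reported)
```

Small differences in the third decimal place are expected because the precision and recall values in the table are themselves rounded.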

Apart from the regular-expression-based features, we also extract the hashcodes of the images in the advertisements, the posting date and time, and the location. (These features are present as metadata and do not require the use of hand-engineered regexes.)

3 Entity Resolution

3.1 Definition

We approach the problem of extracting connected components from our dataset using pairwise entity resolution. The similarity or connection between two nodes is treated as a learning problem, with training data for the problem generated by using ‘proxy’ labels from existing evidence of connectivity from strong features.

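More formally, the problem can be considered to be that of sampling all connected components $H_i(V, E)$ from a graph $G(V, E)$. Here $V = \{v_1, v_2, \ldots, v_n\}$, the set of vertices, is the set of advertisements, and $E = \{(v_i, v_j), (v_j, v_k), \ldots\}$ is the set of edges between individual records, the presence of which indicates they represent the same entity.

We need to learn a function $M(v_i, v_j)$ such that

$$M(v_i, v_j) = \Pr\big((v_i, v_j) \in E(H_k) \text{ for some connected component } H_k \text{ of } G\big).$$

The set of strong features present in a given record can be considered to be the function '$S$'. Thus, in our problem, $S_v$ represents all the phone numbers associated with $v$.

Thus $S = \bigcup_{v_i \in V} S_{v_i}$. Here, $|S| \ll |V|$.

Now, let us further consider the graph $G^*(V, E^*)$ defined on the set of vertices $V$, such that $(v_i, v_j) \in E^*$ if $|S_{v_i} \cap S_{v_j}| > 0$ (more simply, the graph described by strong features).

Let $H^* = \{H^*_1, H^*_2, \ldots, H^*_m\}$ be the set of all the connected components defined on the graph $G^*$.

Now, the function $P$ is such that for any $p \in S$, $P(p) = V(H^*_k) \iff p \in \bigcup_{v_i \in V(H^*_k)} S_{v_i}$.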
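As a minimal sketch of how the graph described by strong features and its connected components might be computed, the following Python snippet links ads that share a phone number using a union-find structure. The function name strong_components, the input format and the toy phone numbers are assumptions made for illustration, not part of the original pipeline.

```python
from collections import defaultdict


def strong_components(ad_phones):
    """Group ads into connected components of the strong-feature graph.

    `ad_phones` maps an ad id to the set of phone numbers extracted from it.
    Two ads are linked when they share at least one phone number.
    """
    parent = {ad: ad for ad in ad_phones}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every ad to the first ad seen with the same phone number.
    first_ad_with = {}
    for ad, phones in ad_phones.items():
        for phone in phones:
            if phone in first_ad_with:
                union(first_ad_with[phone], ad)
            else:
                first_ad_with[phone] = ad

    components = defaultdict(set)
    for ad in ad_phones:
        components[find(ad)].add(ad)
    return sorted(components.values(), key=lambda c: sorted(c))


ads = {
    "ad1": {"4125550101"},
    "ad2": {"4125550101", "4125550102"},
    "ad3": {"4125550102"},
    "ad4": {"4125550199"},
    "ad5": set(),
}
print(strong_components(ads))
```

Here ad1 and ad3 share no phone number directly but end up in the same component through ad2, which is the transitive behaviour that connected components provide.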

Figure 3: On applying our match function, weak links are generated for classifier scores above a certain match threshold. The strong links between nodes are represented by Solid Lines. Dashed lines represent the weak links generated by our classifier.

3.2 Sampling Scheme

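For our classifier we need to generate a set of training examples '$T$', where $T_{pos}$ and $T_{neg}$ are the subsets of samples labeled positive and negative. Writing $F_{v_i, v_j}$ for the feature vector of the pair $(v_i, v_j)$, $S_v$ for the set of phone numbers in advertisement $v$, and $P(p)$ for the set of advertisements in the strong-feature connected component that contains phone number $p$:

$$T_{pos} = \{F_{v_i, v_j} \mid v_i \in P(p),\ v_j \in P(p)\},$$
$$T_{neg} = \{F_{v_i, v_j} \mid v_i \in P(p),\ v_j \notin P(p)\}.$$

In order to ensure that the sampling scheme does not end up sampling near-duplicate pairs, we introduce a sampling bias such that for every feature vector $F_{v_i, v_j} \in T_{pos}$, $S_{v_i} \cap S_{v_j} = \emptyset$.
This reduces the likelihood of sampling near-duplicates, as evidenced in Figure 4, which is a histogram of the Jaccard similarity between the sets of unigrams of the text contained in each pair of ads.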
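One possible reading of this sampling scheme is sketched below in Python. It assumes that positive pairs are drawn from the same strong-feature component but must not share a phone number directly, and that negative pairs are drawn from two different components. That reading, the function name sample_pairs and the toy data are all assumptions made for illustration.

```python
import random
from itertools import combinations


def sample_pairs(components, ad_phones, n_negative, seed=0):
    """Return (positive_pairs, negative_pairs) of ad ids.

    `components` is a list of sets of ad ids (strong-feature components).
    Positive pairs lie in the same component but share no phone number
    directly; negative pairs lie in two different components.
    """
    ordered = [sorted(component) for component in components]

    positives = []
    for members in ordered:
        for a, b in combinations(members, 2):
            if not (ad_phones[a] & ad_phones[b]):
                positives.append((a, b))

    # Enumerate all cross-component pairs; fine for a toy example,
    # too expensive for millions of ads.
    cross = [
        (a, b)
        for c1, c2 in combinations(ordered, 2)
        for a in c1
        for b in c2
    ]
    rng = random.Random(seed)
    negatives = rng.sample(cross, min(n_negative, len(cross)))
    return positives, negatives


ad_phones = {
    "ad1": {"4125550101"},
    "ad2": {"4125550101", "4125550102"},
    "ad3": {"4125550102"},
    "ad4": {"4125550199"},
}
components = [{"ad1", "ad2", "ad3"}, {"ad4"}]
positives, negatives = sample_pairs(components, ad_phones, n_negative=2)
print(positives, negatives)
```

Excluding pairs that share a phone number keeps the classifier from simply learning to recognise reposted copies of the same ad.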


We observe that although we still end up with some near-duplicates, we have a high number of non-duplicates, which ensures robust training data for our classifier.
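The text similarity measure used in Figure 4 can be sketched as follows. Whitespace tokenisation and lowercasing are assumptions, since the exact preprocessing is not described, and the two ad texts are made up.

```python
def unigrams(text):
    """Set of lowercased whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard_similarity(text_a, text_b):
    """Size of the unigram intersection divided by size of the union."""
    a, b = unigrams(text_a), unigrams(text_b)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


ad_a = "sweet and petite new in town call now"
ad_b = "sweet and petite just visiting call now"
print(jaccard_similarity(ad_a, ad_b))
```

A value near 1 indicates a near-duplicate pair, while a value near 0 indicates that the two ads share almost no vocabulary.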

Figure 4: Text similarity for our sampling scheme. We use the Jaccard similarity between the ad unigrams as a measure of text similarity. The histogram shows that the sampling scheme results in both a large number of near-duplicates and a large number of non-duplicates. Such behavior is desired to ensure a robust match function.
(a) Regex
(b) Regex+Temporal
(c) Regex+Temporal+NLP
(d) Regex+Temporal+NLP+Spatial
Figure 5: ROC Curves for our Match Function trained on various feature sets. The ROC curve shows reasonably large True Positive rates for extremely low False Positive rates, which is a desirable behaviour of the match function.

3.3 Training

To train our match function we experiment with various classifiers, such as Logistic Regression, Naive Bayes and Random Forest, using scikit-learn (Pedregosa et al., 2011). Table 2 shows the most informative features learnt by the Random Forest classifier. It is interesting to note that the most informative features include spatial (Location), temporal (Time Difference, Posting Date) and linguistic (Number of Special Characters, Longest Common Substring) features. We also find that the domain-specific features, extracted using regexes, prove to be informative.
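To illustrate what pairwise features of this kind might look like, the sketch below computes a few of the quantities named in Table 2 for a pair of ads. The exact feature definitions used by the authors are not given, so the function names, the dictionary layout and the choice to take the difference in special character counts are assumptions.

```python
from datetime import datetime


def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def count_special_characters(text):
    """Count characters that are neither alphanumeric nor whitespace."""
    return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())


def pair_features(ad_a, ad_b):
    """Feature dict for a pair of ads given as {'text', 'posted', 'state'}."""
    delta = abs(ad_a["posted"] - ad_b["posted"])
    specials_a = count_special_characters(ad_a["text"])
    specials_b = count_special_characters(ad_b["text"])
    return {
        "longest_common_substring": longest_common_substring(
            ad_a["text"], ad_b["text"]
        ),
        "special_chars_diff": abs(specials_a - specials_b),
        "time_difference_hours": delta.total_seconds() / 3600,
        "same_day": ad_a["posted"].date() == ad_b["posted"].date(),
        "same_state": ad_a["state"] == ad_b["state"],
    }


# Made-up ads for illustration.
first = {
    "text": "**Paris** new in town!!",
    "posted": datetime(2015, 6, 1, 22, 0),
    "state": "PA",
}
second = {
    "text": "Paris is back in town",
    "posted": datetime(2015, 6, 2, 1, 0),
    "state": "PA",
}
features = pair_features(first, second)
print(features)
```

A feature vector of this form, computed for each sampled pair, is the kind of input a scikit-learn classifier could be fitted on.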

Top 10 Features
1 Location (State)
2 Number of Special Characters
3 Longest Common Substring
4 Number of Unique Tokens
5 Time Difference
6 If Posted on Same Day
7 Presence of Ethnicity
8 Presence of Rate
9 Presence of Restrictions
10 Presence of Names
Table 2: Most Informative Features
(a) Logistic Regression
(b) Random Forest
Figure 6: The plots show the number of connected components and the size of the largest component versus the match threshold.

The ROC curves for the classifiers we tested with different feature sets are presented in Figure 5. The classifiers perform well, with extremely low false positive rates. Such behavior is desirable for the classifier to act as a match function, in order to generate sensible results for the downstream tasks. High false positive rates increase the number of links between our records, leading to a 'snowball effect' which results in a breakdown of the downstream Entity Resolution process, as evidenced in Figure 6.

Rule | Support | Ratio | Lift
Xminchars<=250, 120000<Xmaximgfrq, 3<Xmnweeks<=3.4, 4<Xmnmonths<=6.5 | 11 | 90.9% | 2.67
Xminchars<=250, 120000<Xmaximgfrq, 4<Xmnmonths<=6.5 | 16 | 81.25% | 2.4
Xstatesnorm<=0.03, 3.6<Xuniqimgsnorm<=5.2, 3.2<Xstdmonths | 17 | 100.0% | 2.5
Xstatesnorm<=0.03, 1.95<Xstdweeks<=2.2, 3.2<Xstdmonths | 19 | 94.74% | 2.37

Table 3: Results Of Rule Learning

In order to minimize this breakdown, we need to heuristically learn an appropriate confidence value for our classifier. This is done by carrying out the ER process on 10,000 randomly selected records from our dataset. The size of the largest extracted connected component and the number of connected components isolated are calculated for different confidence values of our classifier. This allows us to come up with a sensible heuristic for the confidence value.
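The threshold sweep described here can be sketched as follows: for each candidate threshold, keep only the pairs scored at or above it, and record the number of connected components and the size of the largest one. The function names, the toy scores and the candidate thresholds are assumptions made for illustration.

```python
def component_sizes(nodes, edges):
    """Sizes of the connected components of an undirected graph, largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def sweep_thresholds(nodes, scored_pairs, thresholds):
    """Return (threshold, number of components, largest component size) rows."""
    report = []
    for t in thresholds:
        kept = [(a, b) for a, b, score in scored_pairs if score >= t]
        sizes = component_sizes(nodes, kept)
        report.append((t, len(sizes), sizes[0] if sizes else 0))
    return report


# Toy classifier scores for four candidate pairs among five ads.
nodes = ["a", "b", "c", "d", "e"]
scored_pairs = [
    ("a", "b", 0.99),
    ("b", "c", 0.90),
    ("c", "d", 0.60),
    ("d", "e", 0.95),
]
report = sweep_thresholds(nodes, scored_pairs, [0.5, 0.8, 0.97])
for row in report:
    print(row)
```

In this toy data a single low-scoring link between c and d is enough to merge everything into one component at the lowest threshold, which is a small-scale version of the snowball effect.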

(a) Bigrams
(b) Unigrams
(c) Images
Figure 7: Blocking Scheme

3.4 Blocking Scheme

Our dataset consists of over 5 million records. Naive pairwise comparisons across the dataset make this problem computationally intractable. In order to reduce the number of comparisons, we introduce a blocking scheme and perform exhaustive pairwise comparisons only within each block before resolving the dataset across blocks. We block the dataset on features like rare unigrams, rare bigrams and rare images.
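A minimal sketch of blocking on rare unigrams is given below. The definition of 'rare' (a token that appears in at least two and at most max_doc_freq ads), the function names and the toy ads are assumptions; the criteria actually used are not specified.

```python
from collections import Counter, defaultdict
from itertools import combinations


def rare_unigram_blocks(ads, max_doc_freq):
    """Map each rare token to the set of ad ids whose text contains it.

    A token counts as rare when it occurs in at least 2 and at most
    `max_doc_freq` ads.
    """
    doc_freq = Counter()
    tokens_by_ad = {}
    for ad_id, text in ads.items():
        tokens = set(text.lower().split())
        tokens_by_ad[ad_id] = tokens
        doc_freq.update(tokens)

    blocks = defaultdict(set)
    for ad_id, tokens in tokens_by_ad.items():
        for token in tokens:
            if 2 <= doc_freq[token] <= max_doc_freq:
                blocks[token].add(ad_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """All pairs of ads that fall in at least one common block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Made-up ads for illustration.
ads = {
    "ad1": "sierradayna new in town",
    "ad2": "visiting this week sierradayna",
    "ad3": "new in town call now",
    "ad4": "new in town upscale",
}
blocks = rare_unigram_blocks(ads, max_doc_freq=2)
pairs = candidate_pairs(blocks)
print(blocks, pairs)
```

In this toy example only one of the six possible pairs is passed on to the match function, because common tokens such as "new" or "town" do not form blocks.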

(a) This pair of ads has extremely similar textual content, including the use of non-Latin and special characters. The ads also advertise the same individual, as strongly evidenced by the common alias, 'Paris'.

(b) The first ad here does not include any specific names of individuals. However, the strong textual similarity with the second ad and the same advertised cost help to match them and reveal the individuals being advertised as 'Nick' and 'Victoria'.

(c) While this pair is not extremely similar in terms of language, the existence of the rare alias 'SierraDayna' in both advertisements helps the classifier in matching them. This match can also easily be verified by the similar language structure of the pair.

(d) The first advertisement represents the entities 'Black China' and 'Star Quality', while the second advertisement reveals that the pictures used in the first advertisement are not original and belong to the author of the second ad. This example pair shows the robustness of our match function. It also reveals how complicated the relationships between various ads can be.
Figure 8: Representative results of advertisement pairs matched by our classifier. In all the four cases the advertisement pairs had no phone number information (strong feature) in order to detect connections. Note that sensitive elements have been intentionally obfuscated.
Figure 9: ROC for the Connected Component classifier. The Black line is the positive set, while the Red line is the average ROC for 100 randomly guessed predictors.
Figure 10: PN Curve for rule learning. The figure presents PN curves for various values of the Maximum Rules learnt for the classification.

4 Rule Learning

We extract clusters and identify records that are associated with human trafficking using domain knowledge from experts. We featurize the extracted components using features such as the size of the cluster, its spatio-temporal characteristics, and its connectivity. For our analysis, we consider only components with more than 300 advertisements. We then train a random forest to predict whether a cluster is linked to human trafficking. In order to establish statistical significance, we compare the ROC results of our classifier under 4-fold cross-validation for 100 random connected components versus the positive set. Figure 9 and Table 4 report the performance of the classifier in terms of false positive and true positive rates, while Table 5 lists the most informative features for this classifier.
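Operating-point metrics such as TPR@FPR=1% can be read off a set of classifier scores as sketched below. This is a plain-Python illustration on made-up scores, not the evaluation code behind Table 4, and the function name tpr_at_fpr is an assumption.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate achievable with false positive rate <= max_fpr.

    `labels` holds 1 for positive and 0 for negative examples; both
    classes must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tpr = 0.0
    # Try every observed score as the decision threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        if fp / n_neg <= max_fpr:
            best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr


# Made-up scores: four positive and four negative components.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(tpr_at_fpr(scores, labels, max_fpr=0.0))
print(tpr_at_fpr(scores, labels, max_fpr=0.25))
```

Reporting the true positive rate at a very small false positive rate emphasises the steep left-hand part of the ROC curve, which matters when only a few flagged clusters can be reviewed.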

We then proceed to learn rules from our feature set. Some of the rules, with corresponding ratios and lift, are given in Table 3. PN curves corresponding to the various rules learnt are presented in Figure 10. It can be observed that the features used by the rule learner to learn rules with maximum support and ratios correspond to the ones labeled by the random forest as informative. This also serves as validation for the use of rule learning.
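The quantities reported in Table 3 are not defined in the text. The sketch below assumes the common definitions: support is the number of components a rule covers, ratio is the fraction of covered components that are positive, and lift is that ratio divided by the overall fraction of positives. The ratios in Table 3 are consistent with this reading of support and ratio (for example, 10 of 11 is 90.9% and 13 of 16 is 81.25%), but the lift definition, the feature names and the toy records are assumptions.

```python
def rule_metrics(records, rule):
    """Return (support, ratio, lift) of `rule` over labelled records.

    `records` is a list of (features, is_positive) tuples and `rule` is a
    predicate over the feature dict.
    """
    covered = [label for features, label in records if rule(features)]
    support = len(covered)
    if support == 0:
        return 0, 0.0, 0.0
    ratio = sum(covered) / support
    base_rate = sum(label for _, label in records) / len(records)
    lift = ratio / base_rate if base_rate else 0.0
    return support, ratio, lift


# Toy records; the feature names are hypothetical stand-ins.
records = [
    ({"min_chars": 200, "mean_months": 5.0}, True),
    ({"min_chars": 180, "mean_months": 6.0}, True),
    ({"min_chars": 240, "mean_months": 4.5}, False),
    ({"min_chars": 400, "mean_months": 5.5}, True),
    ({"min_chars": 500, "mean_months": 1.0}, False),
    ({"min_chars": 600, "mean_months": 2.0}, False),
    ({"min_chars": 700, "mean_months": 8.0}, False),
    ({"min_chars": 150, "mean_months": 9.0}, False),
]


def rule(features):
    """Conjunction of interval tests, in the style of the rules in Table 3."""
    return features["min_chars"] <= 250 and 4 < features["mean_months"] <= 6.5


support, ratio, lift = rule_metrics(records, rule)
print(support, ratio, lift)
```

A lift above 1 means that components covered by the rule are more likely to be positive than a component picked at random.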

Figure 11: Representative entity isolated by our pipeline, believed to be involved in human trafficking. The nodes represent advertisements, while the edges represent links between advertisements. This entity has 802 nodes and 39,383 edges. This visualization is generated using Gephi (Bastian et al., 2009). This entity operated in cities across states and advertised multiple different individuals along with multiple phone numbers. This suggests a more complicated and organised activity and serves as an example of how complicated certain entities can be in this trade.

AUC TPR@FPR=1% FPR@TPR=50%
90.38% 66.6% 0.6%
Table 4: Metrics for the Connected Component classifier
Top 5 Features
1 Posting Months
2 Posting Weeks
3 Std-Dev. of Image Frequency
4 Norm. No. of Names
5 Norm. No. of Unique Images
Table 5: Most Informative Features

5 Conclusion

In this paper we approached the problem of isolating sources of human trafficking from online escort advertisements with a pairwise Entity Resolution approach. We trained a classifier to predict whether two advertisements are from the same source, using phone numbers as a strong feature and exploiting them as proxy ground truth to generate training data for our classifier. The resultant classifier proved to be robust, as evidenced by extremely low false positive rates. Other approaches (Szekely et al., 2015) aim to build similar knowledge graphs using a similarity score between each feature. This has some limitations. Firstly, labelled training data is needed in order to train match functions to detect ontological relations. The challenge is aggravated since this approach considers each feature independently, making the generation of enough labelled training data for training multiple match functions an extremely complicated task.

Since we utilise existing features as proxy evidence, our approach can generate a large amount of training data without the need for any human annotation. Our approach requires learning just a single function over the entire feature set. Hence our classifier can learn multiple complicated relations between features to predict a match, instead of relying on the naive feature independence assumption.

We then proceeded to use this classifier to perform entity resolution, using a heuristically learned value for the classifier score as the match threshold. The resultant connected components were again featurised, and a classifier model was fit before subjecting them to rule learning. In comparison with (Dubrawski et al., 2015), the connected component classifier performs a little better, with higher values of the area under the ROC curve and of the TPR@FPR=1%, indicating a steeper ROC curve. We hypothesize that, due to the entity resolution process, we are able to generate a larger, more robust set of training data that is less sensitive to noise in labelling and results in a stronger classifier. The learnt rules show high ratios and lift for reasonably high supports, as evidenced in Table 3. Rule learning also adds an element of interpretability to the models we built. Compared to more complex ensemble methods like Random Forests, hard rules as classification models are preferred by domain experts to build evidence for incrimination.

6 Future Work

While our blocking scheme performs well in reducing the number of comparisons, our approach still involves naive pairwise comparisons, so scalability is a significant challenge. One approach could be to design such a pipeline in a distributed environment. Another approach could be to use a computationally inexpensive technique to de-duplicate the near-duplicate ads in the dataset, which would greatly help with scalability.

In our approach, the ER process depends upon the heuristically learnt match threshold. Lower threshold values can significantly degrade the performance, producing extremely large connected components. Treating the choice of this threshold as a learning task would help make this approach more generic and less domain-specific.

Hashcodes of the images associated with the ads were also utilized as a feature for the match function. However, simple features like number of unique and common images etc., did not prove to be very informative. Further research is required in order to make better use of such visual data.

Acknowledgments

The authors would like to thank all staff, faculty and students who made the Robotics Institute Summer Scholars program 2015 at Carnegie Mellon University possible.

References

  • Bastian et al. (2009) Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. 2009. Gephi: An open source software for exploring and manipulating networks. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.
  • Benjelloun et al. (2009) Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal—The International Journal on Very Large Data Bases 18(1):255–276.
  • Cunningham (2002) Hamish Cunningham. 2002. GATE, a general architecture for text engineering. Computers and the Humanities 36(2):223–254.
  • Dubrawski et al. (2015) Artur Dubrawski, Kyle Miller, Matthew Barnes, Benedikt Boecking, and Emily Kennedy. 2015. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking 1(1):65–85.
  • Kennedy (2012) Emily Kennedy. 2012. Predictive patterns of sex trafficking online.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12:2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195.
  • Szekely et al. (2015) Pedro Szekely, Craig A. Knoblock, Jason Slepicka, Andrew Philpot, Amandeep Singh, Chengye Yin, Dipsy Kapoor, Prem Natarajan, Daniel Marcu, Kevin Knight, David Stallard, Subessware S. Karunamoorthy, Rajagopal Bojanapalli, Steven Minton, Brian Amanatullah, Todd Hughes, Mike Tamayo, David Flynt, Rachel Artiss, Shih-Fu Chang, Tao Chen, Gerald Hiebel, and Lidia Ferreira. 2015. Building and using a knowledge graph to combat human trafficking. In Proceedings of the 14th International Semantic Web Conference (ISWC 2015).
  • Veeramachaneni and Kondadadi (2009) Sriharsha Veeramachaneni and Ravi Kumar Kondadadi. 2009. Surrogate learning: from feature independence to semi-supervised classification. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, pages 10–18.