Content-Based Features to Rank Influential Hidden Services of the Tor Darknet
The uneven importance of the criminal activities hosted in the onion domains of the Tor Darknet, together with their different levels of appeal to end-users, makes it difficult to measure their influence. To this end, this paper presents a novel content-based ranking framework to detect the most influential onion domains. Our approach comprises a modeling unit that represents an onion domain using forty features extracted from five different resources: user-visible text, HTML markup, Named Entities, network topology, and visual content. It also comprises a ranking unit that, using the Learning-to-Rank (LtR) approach, automatically learns a ranking function by integrating the previously obtained features. Using a case study based on drugs-related onion domains, we obtained the following results. (1) Among the explored LtR schemes, the listwise approach outperforms the benchmarked methods with an NDCG of 0.95 for the top-10 ranked domains. (2) We showed quantitatively that our framework surpasses the link-based ranking techniques. (3) With the selected features, we observed that the textual content, composed of text, NER, and HTML features, is the most balanced approach in terms of efficiency and score obtained. The proposed framework might support Law Enforcement Agencies in detecting the most influential domains related to possible suspicious activities.
The Onion Router (Tor) network, one of the best-known Darknet networks, gives end-users a high level of privacy and anonymity. The Tor project was proposed in the mid-1990s by US military researchers to secure intelligence communications. A few years later, as part of their strategy for secrecy, they made the Tor project available to the public.
The onion domains proliferated rapidly: according to the latest statistics reported by the onion metrics website (https://metrics.torproject.org/hidserv-dir-onions-seen.html), the number of existing onion domains increased between April 2015 and October 2019.
The Tor community refers to an onion domain hosted in the Tor darknet as a Hidden Service (HS). These services can be accessed via a dedicated web browser called Tor Browser (https://www.torproject.org/projects/torbrowser.html.en) or through a proxy such as Tor2Web (https://tor2web.org/).
There are many legal uses for the Tor network, such as personal blogs, news domains, and discussion forums [6, 22]. However, due to its level of anonymity, Tor Darknet is being exploited by services traders allowing them to promote their products freely, including, but not limited to Child Sexual Abuse (CSA) [34, 30], drugs trading [5, 30, 23, 51], and counterfeit personal identifications [46, 12, 4].
The high level of privacy and anonymity provided by the Tor network prevents the authorities’ monitoring tools from controlling the content or even identifying the IP address of the hosts behind any suspicious service. We collaborate with the Spanish National Cybersecurity Institute (INCIBE, Instituto Nacional de Ciberseguridad de España) to develop tools that could ease the task of monitoring the Tor Darknet and detecting existing or new suspicious content. These tools are designed to support the Spanish Law Enforcement Agencies (LEAs) in their surveillance of the Tor hidden services. Figure 1 gives an overview of two of our current contributions to the Tor monitoring tool.
The first contribution to the monitoring system presented in Figure 1 was a classification module, which detects and isolates the categories of suspicious onion domains that Spanish LEAs are interested in monitoring. For this task, we used the supervised text classifier from our previous work, which categorizes HSs into eight classes: Pornography, Cryptocurrency, Counterfeit Credit Cards, Drugs, Violence, Hacking, Counterfeit Money, and Counterfeit Personal Identification including Driving-License, Identification, and Passport.
The second module, which is also the focus of this work, addresses the problem of ranking the hidden services that were classified as suspicious. Once they are ranked, a police officer can prioritize her/his work by focusing on the most influential HSs, i.e., those in the first positions of the rank. In our previous work, we presented a ranking algorithm, called ToRank, that sorts the onion domains by analyzing the connectivity of their hyperlinks, i.e., a link-based approach. Here, with the same objective, we propose a richer solution that also analyzes the content of the domains.
One of the difficulties we faced was how to define the influence of a given onion domain based on its ability to attract the public. In this work, we assess the attractiveness of an onion domain and assign it a score that reflects its influence among the other domains with similar content. Hence, the more attractive a domain is, the higher the score it receives. Building on our text classifier, the proposed ranking module works at category level and detects the influential HSs in each category. Therefore, this paper aims to answer the following question: What are the most attractive onion domains in a determined area of activities?
The answer to this question could improve the capability of LEAs to keep a close eye on the most influential suspicious Hidden Services by concentrating their monitoring efforts on them rather than on the less influential ones. Moreover, if a Law Enforcement Agency takes a suspicious HS down, the proposed ranking module would recognize the same domain again if it still holds the same content, even if it is hosted under a different or new address. Similarly, when a new domain is introduced to the network for the first time and it hosts suspicious content similar to an HS that was previously identified as influential, the ranking module could capture it before it becomes popular among Tor users. Therefore, LEAs would have the information needed to act against suspicious domains preemptively.
A straightforward strategy to detect the influential onion domains is to sort them by the number of client requests that they receive, i.e., to analyze the traffic of the network. However, the design of the Tor network is oriented to preventing this behavior. Chaabane et al. conducted a deep analysis of the Tor network traffic by establishing six exit nodes distributed over the world with the default exit policy. However, this approach cannot assess the traffic of onion domains that are not reachable through these exit nodes. Furthermore, it could be risky because Tor network users could reach any onion domain, regardless of its legality, through the IP addresses of the machines dedicated to that purpose. Biryukov et al. tried to exploit the concept of entry guard nodes to de-anonymize clients of a Tor hidden service. However, this proposal will not be feasible as soon as the vulnerability is fixed.
Another strategy is to employ a link-based ranking algorithm such as ToRank, PageRank, Hyperlink-Induced Topic Search (HITS), or Katz. We explored it in our previous work, and we concluded that the main drawback of this approach lies in its dependency on the hyperlink connectivity between the onion domains. Hence, if an influential but isolated domain exists in the network, this technique cannot recognize it as an essential item.
In this paper, we present an alternative approach that incorporates several features extracted from the HSs into a Learning to Rank (LtR) schema. Given a list of hidden services, our model ranks onion domains in two key steps: content feature extraction and onion domain ranking. First, we represent each onion domain by a forty-element feature vector extracted from five different resources: 1) the textual content of the domain, 2) the textual Named Entities (NE) in the user-visible text, such as product names and organization names, 3) the HTML markup code, by taking advantage of specific HTML tags, 4) the visual content, such as the images shown in the domain, and 5) the position of the targeted onion domain with respect to the Tor network topology. Second, the extracted features are cleaned and normalized to train a ranking function using the LtR approach, which ranks the domains and proposes the top-$k$ HSs as the most influential.
The ranking problem addressed in this paper is close to the field of Information Retrieval (IR) but with a significant difference. Both retrieve a ranked list of elements, similarly to how search engines work. For example, the Google search engine considers many factors to generate a ranked list of websites concerning a query. However, in the context of our problem, we do not have a search term to order the results accordingly. Instead, our objective is to rank the domains based on a virtual query: What are the most attractive onion domains in a determined area of activities? Hence, this model adopts IR to solve the problem of ranking and detecting the most influential onion domains in the Tor network, but without a search term being available.
Nevertheless, the proposed framework is not only restricted to ranking the onion domains of the Tor network. It could be generalized and be adapted to different areas with slight modifications in the feature vector, as for example, document ranking, web pages of the Surface Web, or users in a social network, among others. Our focus on this field, based on the special sensitivity of the HS contents, motivates our application in the Tor network, and, additionally, we wanted to test the use of IR techniques for single query ranking problems.
The main contributions of this paper can be summarised in the following way.
We propose a novel framework to rank the HSs and to detect the most influential ones. Our strategy exploits five groups of features extracted from Tor HSs via a Hidden Service Modeling Unit (HSMU). The extracted features are used to train the Supervised Learning-to-Rank Unit (SLRU). Our approach outperforms the link-based ranking technique, such as ToRank, PageRank, HITS, and Katz, when tested on samples of onion domains related to the marketing of drugs (Fig. 2).
We evaluated features extracted from five resources: 1) the user-visible text, 2) the textual named entities, 3) the HTML markup code, 4) the visual content, and 5) the topology of the Tor network. In particular, we address the effects on the ranking framework of representing an onion domain by several variations of features. We identify the most efficient combination of features relative to their extraction cost, measured in terms of the prediction time and the resources needed to build the feature extraction model.
To select the best LtR schema, we explore and compare three popular architectures: pointwise, pairwise, and listwise.
Finally, we create a manually ranked dataset, that plays the role of ground truth for testing the models.
The rest of the paper is organized as follows. In Section 2 we summarize the related work. After that, Section 3 introduces the proposed ranking framework, including its main components. Section 4 presents the experimental settings along with the configuration of the framework units. Next, Section 5 addresses a case study to test the effectiveness of the proposed framework in a real case scenario. Finally, Section 6 presents the main conclusions of this work and introduces other approaches that we are planning to explore in the future.
2 Related Work
Many works study the suspicious activities that take place in the Darknet of the Tor network, such as illicit drug markets [13, 9, 26], terrorist activities [60, 20], arms smuggling and violence, or cybercrime [23, 4].
However, only a few of them have addressed the problem of analyzing the Darknet networks to detect the most influential domains. Some of them used approaches that depend on Social Network Analysis (SNA) techniques to mine networks. Chen et al. conducted a comprehensive exploration of terrorist organizations to examine the robustness of their networks against attacks. In particular, these attacks were simulated by removing the items characterized by either high in-degree or high betweenness scores. Moreover, Al Nabki et al. proposed an algorithm, called ToRank, to rank and detect the most influential domains in the Tor network. ToRank represents the Tor network as a directed graph of nodes and edges, and the most influential nodes are those whose removal would reduce the connectivity among the nodes. However, link-based approaches fail to evaluate isolated nodes that have no connections to the rest of the community. Therefore, as we show in the experiments section, this approach cannot surpass any of the benchmarked LtR techniques.
Another different strategy was followed by Anwar et al. , who presented a hybrid algorithm to detect the influential leaders of radical groups in the Darknet forums. Their proposal is based on mining the content of the user’s profiles and their historical posts to extract textual features representing their radicalness. Then, they incorporate the obtained features in a customized link-based ranking algorithm, based on PageRank , to build a ranked list of radically influential users.
A different perspective was followed in the study carried out by Biryukov et al., who exploit the entry guard nodes concept to de-anonymize clients of an onion domain in the Tor network. The popularity of an onion domain in the Tor network is estimated by measuring its incoming traffic; however, this approach will not be feasible when the vulnerability is fixed.
The approach proposed in this work is entirely different and, to the best of our knowledge, is not present in the literature. We adopt a supervised framework that automatically learns how to order items following predefined ranking criteria. Concretely, we employ an LtR model that captures the characteristics of a given ranked list and maps the learned rank onto a new unsorted list of items.
LtR framework has been used widely in the field of IR [47, 37, 49]. Li et al.  proposed an algorithm to help software developers in dealing with unfamiliar Application Programming Interface (API) by offering software documentation recommendations and by training an LtR model with features extracted from four resources. Other examples are Agichtein et al. , who employed the RankNet algorithm to leverage search engine results by incorporating user behavior, or Wang et al. , who presented an LtR-based framework to rank input parameters values of online forms. They used categories of features extracted from user contexts and patterns of user inputs. Moreover, LtR has been employed for mining social networks [44, 27] or to detect and rank critical events in Twitter social network , but none of those aforementioned works addressed the study of Tor network from a content-analysis perspective.
3 Proposed Ranking Framework
In this paper, we present a supervised framework to rank Tor Hidden Services with the purpose of measuring the influence of each domain (Fig. 2). Our approach has two components: 1) Hidden Service Modeling Unit (HSMU), which analyzes and extracts features from a given onion domain, and 2) the Supervised Learning-to-Rank Unit (SLRU) that learns a function to order a collection of domains according to the pattern captured from previously sorted samples, a training set.
3.1 Hidden Service Modeling Unit
Given an onion domain from a set of onion domains, the HSMU analyzes it to extract features belonging to five different categories: text, named entity, HTML, visual content, and network topology. Then, the HSMU encodes those features into numerical values that represent the analyzed HS.
3.1.1 Text Features
The set of text features involves four types of descriptors constructed from the text of the domain that is visible to the user.
Date and Time: A binary feature that indicates whether the domain has been updated recently or not. A domain might be updated by its owner or after receiving reviews from customers concerning the offered service. We parsed the date and time patterns and compared them with a configurable threshold to compute this binary feature. This threshold is a particular date-time point, and the updates dated before it are considered obsolete.
Moreover, by counting these updates, we measured the number of recent changes that the domain has received. We refer to these two features as recently_updated and updates_counts, respectively.
HS Name: The address of a hidden service consists of randomly generated characters. The prefix of an HS address can be customized with tools like Shallot (https://github.com/katmagic/Shallot), so that the onion domain address includes words that are attractive to customers. For example, a drug marketplace could add the words Cocaine or LSD to its HS’s URL. Nevertheless, the customization process is extremely time-consuming; for example, customizing the first seven characters requires one day of machine time, while customizing longer prefixes takes years of processing.
To explore this characteristic, we split the concatenated words using a probabilistic approach based on English Wikipedia unigram frequencies, implemented by the Wordninja tool (https://github.com/keredson/wordninja). We obtained two features: (i) the number of human-readable words and (ii) the number of their letters. We named them address_words_count and address_letters_count, respectively.
Clones Rate: the number of onion domains that host the same content under different addresses. After reviewing the Darknet Usage Text Addresses (DUTA-10K) dataset (http://gvis.unileon.es/dataset/duta-darknet-usage-text-addresses/), we realized that some onion domains have almost the same textual content but are hosted under different addresses. This HS duplication might reflect the concerns of the domains’ owners about their services being taken down by authorities. The clones_rate feature of a domain reflects the frequency of the MD5 hash code of its text.
Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer: we used this text vectorization technique to extract and to weight domain-dependent keywords. It allowed us to draw the following four features: 1) keyword_num: the number of words that are in common between the TF-IDF feature vector and the domain’s words (we consider these words as the keywords of the domain), 2) keyword_TF-IDF_Acc: the accumulated TF-IDF weights of the keywords, 3) keyword_avg_weight: the average weight of the keywords, and 4) keyword_to_total: the number of the domain’s words divided by the number of its keywords.
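As an illustration of the clones rate and the four keyword features described above, the following is a minimal sketch rather than the implementation used in this work. It assumes that clones are detected by exact matches of the MD5 digest of the text, and that the TF-IDF weights of one domain are available as a plain dictionary; the function names and the input format are hypothetical.

```python
import hashlib
from collections import Counter


def clones_rate(texts):
    """Return, for each text, how many texts in the collection share its MD5 digest."""
    digests = [hashlib.md5(t.encode("utf-8")).hexdigest() for t in texts]
    counts = Counter(digests)
    return [counts[d] for d in digests]


def keyword_features(tokens, tfidf_weights):
    """Compute the four keyword features for one domain.

    tokens: list of words in the user-visible text of the domain.
    tfidf_weights: dict mapping a vocabulary word to its TF-IDF weight
                   (hypothetical input format, assumed for this sketch).
    """
    # Keywords are the words shared by the domain and the TF-IDF vocabulary.
    keywords = set(tokens) & set(tfidf_weights)
    accumulated = sum(tfidf_weights[w] for w in keywords)
    n = len(keywords)
    return {
        "keyword_num": n,
        "keyword_TF-IDF_Acc": accumulated,
        "keyword_avg_weight": accumulated / n if n else 0.0,
        "keyword_to_total": len(tokens) / n if n else 0.0,
    }
```

Returning zero when a domain has no keywords is a choice made for this sketch; the text does not state how that case is handled.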
3.1.2 Named Entities Features
A textual Named Entity (NE) refers to a real proper name of an object, including, but not limited to, persons, organizations, or locations. To extract those named entities, our previous work adapted a Named Entity Recognition (NER) model proposed by Aguilar et al. to the Tor Darknet to recognize six categories of entities: persons (PER), locations (LOC), organizations (ORG), products (PRD), creative-work (CRTV), corporation (COR), and groups (GRP). We map the extracted NEs into the following five features:
NE Number: it counts the total number of entities in the domain regardless of their category; we refer to this feature as NE_counter.
NE Popularity: an entity is popular if its frequency of appearance is greater than or equal to a specific threshold, which we set to five, as explained in Section 4.2.2. For every category identified by the NER model, we use a binary representation that captures whether popular entities exist in the domain (1) or not (0). We refer to this feature as popular_NEX, where X is the corresponding NER category.
NE TF-IDF: accumulates the TF-IDF weight of all the NEs detected in the domain. This feature is denoted by NE_TF-IDF.
TF-IDF Popular NE: accumulated TF-IDF weight of the popular NE, and it is named popular_NE_TF-IDF.
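The named entity features above can be sketched as follows. This is an illustrative example under stated assumptions: entity frequency is counted within a single domain, TF-IDF weights are accumulated once per distinct entity, and the input format (a list of text and category pairs) and the function name are hypothetical. The threshold of five is the value stated in the text.

```python
from collections import Counter

POPULARITY_THRESHOLD = 5  # value stated in the text


def ne_features(entities, categories, tfidf_weights):
    """Compute NE features for one domain.

    entities: list of (entity_text, category) pairs detected in the domain.
    categories: NER category labels, e.g. ["PER", "LOC", "PRD"].
    tfidf_weights: dict mapping an entity text to its TF-IDF weight (assumed format).
    """
    freq = Counter(text for text, _ in entities)
    popular = {text for text, count in freq.items() if count >= POPULARITY_THRESHOLD}

    features = {"NE_counter": len(entities)}
    for cat in categories:
        # 1 if the domain contains at least one popular entity of this category.
        has_popular = any(text in popular and c == cat for text, c in entities)
        features["popular_NE" + cat] = int(has_popular)

    features["NE_TF-IDF"] = sum(tfidf_weights.get(t, 0.0) for t in freq)
    features["popular_NE_TF-IDF"] = sum(tfidf_weights.get(t, 0.0) for t in popular)
    return features
```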
3.1.3 HTML Markup Features
Using regular expressions, we parse the HTML markup code of the domain to build the following eight features:
Internal Hyperlinks: number of unique hyperlinks that share the same domain name as the analyzed domain. We denote it by internal_links.
External Hyperlinks: refers to the number of pages referenced by the domain on the Tor network or in the Surface Web. We refer to this feature as external_links.
Image Tag Count: denotes the number of images referenced in the domain. It is calculated by counting the <img> tags in the HTML code of the domain. We denote it by img_count.
Login and Password: a binary feature to indicate whether the domain needs login and password credentials or not. We use a regular expression pattern to parse such inputs. This feature is called needs_credential.
Domain Title: a binary feature that checks whether the HTML <title> tag has a textual value or not. We call it has_title.
Domain Header: a binary feature that checks whether the domain has an <h1> header or not. We named it has_H1.
Title and Header TF-IDF: an accumulation of the TF-IDF weight for the title and header text. It is denoted by TF-IDF_title_H1.
TF-IDF Image Alternatives: TF-IDF weight accumulation of the alternative text. Some websites use an optional attribute called alt inside the image tag to hold a textual description of the image. This text becomes visible to the end-user as a substitute for the image in case it is not loaded properly. It is denoted as TF-IDF_alt.
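The six markup features that do not involve TF-IDF can be sketched with the Python standard library. Note that this example swaps in `html.parser` for the regular expressions mentioned above, purely to keep the sketch self-contained. The class and function names are hypothetical, and two simplifications are assumed: relative links count as internal, and a password input field signals that credentials are needed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class MarkupFeatureParser(HTMLParser):
    """Collects the counts needed for the non-TF-IDF HTML markup features."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.internal, self.external = set(), set()
        self.img_count = 0
        self.needs_credential = 0
        self.text = {"title": "", "h1": ""}
        self._open_tag = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            host = urlparse(attrs["href"]).netloc
            # Relative links carry no host, so they are counted as internal.
            bucket = self.internal if host in ("", self.own_host) else self.external
            bucket.add(attrs["href"])
        elif tag == "img":
            self.img_count += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.needs_credential = 1
        elif tag in self.text:
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        # Accumulate text only while inside a <title> or <h1> element.
        if self._open_tag:
            self.text[self._open_tag] += data


def html_features(html, own_host):
    parser = MarkupFeatureParser(own_host)
    parser.feed(html)
    parser.close()
    return {
        "internal_links": len(parser.internal),
        "external_links": len(parser.external),
        "img_count": parser.img_count,
        "needs_credential": parser.needs_credential,
        "has_title": int(bool(parser.text["title"].strip())),
        "has_H1": int(bool(parser.text["h1"].strip())),
    }
```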
3.1.4 Visual Content Features
The visual content could be an attractive factor for drawing the attention of end-users rather than the text, especially in the Tor HSs when the customer seeks to have a real image of the product before buying it. A suspicious services trader might incorporate real images of products to create an impression of credibility to a potential customer. However, the visual contents that are interesting for LEAs could be mixed up with other noisy contents, such as banners and images of logos. To isolate the interesting ones, we built a supervised image classifier that categorizes the visual content into nine categories, where eight of them are suspicious, and one is not. The definition of these categories is based on our previous works [4, 6], where we created DUTA dataset and its extended version DUTA-10K. The image classification model was built using Transfer Learning (TL) technique  on the top of a pre-trained Inception-ResNet V2 model . The visual content feature vector has six dimensions distributed in the following manner:
Image Count: represents the total number of images in the domain, together with the number of suspicious and of non-suspicious images, regardless of their category. Suspicious stands for images that could contain illicit content. We denote these features by total_count, suspicious_count, and noise_count, respectively.
Average Classification Confidence: represents the averaged confidence score of multiple images per category. These features are named avg_suspicious_conf and avg_normal_conf, respectively.
Majority Class: a binary flag to indicate whether the majority of the images published in the domain are suspicious or not. This flag is denoted by suspicious_majority.
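Given the per-image output of an image classifier, the six visual features above could be assembled as in the following sketch. The input format (a list of label and confidence pairs), the name of the non-suspicious class, and the strict-majority rule are assumptions made for illustration.

```python
def visual_features(predictions, normal_label="normal"):
    """Build the six visual features of one domain.

    predictions: list of (label, confidence) pairs, one per image (assumed format).
    normal_label: name of the single non-suspicious category (hypothetical).
    """
    suspicious = [conf for label, conf in predictions if label != normal_label]
    normal = [conf for label, conf in predictions if label == normal_label]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "total_count": len(predictions),
        "suspicious_count": len(suspicious),
        "noise_count": len(normal),
        "avg_suspicious_conf": mean(suspicious),
        "avg_normal_conf": mean(normal),
        # Strict majority: ties are not flagged as suspicious.
        "suspicious_majority": int(len(suspicious) > len(normal)),
    }
```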
3.1.5 Network Structure Features
Additionally, we modeled the Tor network as a directed graph of nodes and edges. The nodes refer to onion domains, and the edges capture the hyperlinks between domains. This representation allowed us to build the following seven features:
In-degree: refers to the number of onion domains pointing to the analyzed domain. It is called in_degree.
Out-degree: indicates the number of hidden services referenced by the domain, and it is named out_degree.
Centrality Measures: for each domain in the Tor network graph, we evaluated three node centrality measures: closeness, betweenness, and eigenvector. In particular, the closeness measures how short the shortest paths are from the domain to all other domains in the network, and it is named cls. The betweenness measures the extent to which the domain lies on paths between other domains, and it is named btwn. Finally, the eigenvector reflects the importance of the domain for the connectivity of the graph, and it is denoted eigvec.
ToRank Value: ToRank is a link-based ranking algorithm that orders the items of a given network according to their centrality. We applied ToRank to the Tor network to rank the onion domains, and we used the assigned rank as a feature of the node. Moreover, we used a binary flag to indicate whether the domain is in the top-X domains of ToRank or not. We refer to those features as ToRank_rank and ToRank_top_X, respectively.
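The two degree features and the top-X flag can be illustrated with a small standard-library sketch. It assumes the graph is given as a list of (source, target) hyperlink pairs and that self-links and repeated links are ignored; the centrality measures and ToRank itself are not reproduced here, and the function names are hypothetical.

```python
from collections import defaultdict


def degree_features(edges):
    """Compute in_degree and out_degree for every domain in a directed link graph.

    edges: iterable of (source_domain, target_domain) pairs.
    """
    successors, predecessors = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        if src != dst:  # self-links are ignored (an assumption of this sketch)
            successors[src].add(dst)
            predecessors[dst].add(src)
    nodes = set(successors) | set(predecessors)
    return {
        node: {
            "in_degree": len(predecessors.get(node, ())),
            "out_degree": len(successors.get(node, ())),
        }
        for node in nodes
    }


def top_x_flag(rank_position, x):
    """Return 1 if a 1-based rank position falls within the top-x, else 0."""
    return int(rank_position <= x)
```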
After computing the forty features described (Table I), we concatenate them to form a feature vector. However, given the variety of the scales of the features, we normalize them by removing the mean and scaling to unit variance.
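|Feat. Source||# Feat.||Feat. Group||Feat. Name|
|Textual||9||Date and Time||recently_updated, updates_counts|
| || ||HS Name||address_words_count, address_letters_count|
| || ||Clones Rate||clones_rate|
| || ||TF-IDF Vectorizer||keyword_num, keyword_TF-IDF_Acc, keyword_avg_weight, keyword_to_total|
|Named Entities||10||NE Popularity||popular_NE (x)|
| || ||Total NE Number||NE_counter|
| || ||TF-IDF NE||NE_TF-IDF|
| || ||TF-IDF Popular NE||popular_NE_TF-IDF|
| || ||Emerging NE||emerging_NE|
|HTML Markup||8||Internal Hyperlinks||internal_links|
| || ||External Hyperlinks||external_links|
| || ||Image Tag Count||img_count|
| || ||Login and Password||needs_credential|
| || ||Domain Title||has_title|
| || ||Domain Header||has_H1|
| || ||TF-IDF Title and Header||TF-IDF_title_H1|
| || ||TF-IDF Image Alternatives||TF-IDF_alt|
|Visual Content||6||Image Count||total_count, suspicious_count, noise_count|
| || ||Average Classification Confidence||avg_suspicious_conf, avg_normal_conf|
| || ||Majority Class||suspicious_majority|
|Network Structure||7||In-degree||in_degree|
| || ||Out-degree||out_degree|
| || ||Centrality Measures||cls, btwn, eigvec|
| || ||ToRank Value||ToRank_rank, ToRank_top_X|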
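The normalization step applied to the concatenated feature vectors (removing the mean and scaling to unit variance) can be sketched as below. This is a plain-Python illustration that assumes the population standard deviation and maps constant columns to zero; in a cross-validation setting, the statistics would normally be estimated on the training folds only.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Standardize each feature column to zero mean and unit variance.

    rows: list of equally sized feature vectors (one per onion domain).
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # Constant columns have zero deviation, so they are mapped to 0.0.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
```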
3.2 Supervised Learning-to-Rank Unit
To learn the optimal order of the onion domains through their descriptors, we adopt an LtR approach adapted from the IR field. In a traditional IR problem, a training sample is a vector of three components: the relevance judgment, which can be binary or have multiple levels of relevance, the query ID, and the feature vector that describes the ranked instance. However, our ranking system needs to answer a single question: What are the most attractive onion domains in a determined area of activities? In the context of our problem, the relevance judgment of an item is a numerical score that is calculated and assigned per item. These scores are set manually by a group of annotators, and we consider them the ground truth for the ranking; the annotation procedure is described in Section 5.1. Hence, the input for the SLRU is limited to the feature vector of each domain and its relevance judgment score, taken from the ground truth rank. Each sample can thus be modeled as a pair made of the relevance score and the feature values of the domain, where the total number of ranking features is forty.
Our LtR schema aims to learn a function that projects the feature vector of a domain into a rank value. Therefore, the goal of an LtR scheme is to obtain the ranking function that orders the domains in a way as similar as possible to the ground truth ranking. The learning loss function depends on the LtR architecture used, as explained next for the three approaches.
The loss function of the Pointwise approach considers only a single onion domain at a time. It is a supervised classifier/regressor that predicts a relevance score for each query domain independently. The ranking is achieved by sorting the onion domains according to the predicted scores. For this LtR schema, we explore the Multi-layer Perceptron (MLP) regressor.
The Pairwise approach transforms the ranking task into a pairwise classification task. In particular, the loss function takes a pair of items at a time and tries to optimize their relative positions by minimizing the number of inversions compared to the ground truth. In this work, we use the RankNet algorithm, which is one of the most popular pairwise LtR schemes.
The Listwise approach extends the pairwise schema by looking at the entire list of samples at once. One of the most well-known listwise schemes is the ListNet algorithm. Given two ranked lists, the human-labeled scores and the predicted ones, the loss function minimizes the cross-entropy error between their permutation probability distributions.
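To contrast the three families of loss functions, the following sketch shows one representative loss for each. It is illustrative only: mean squared error stands in for the pointwise regressor, the pairwise function follows the cross-entropy form of RankNet for a single pair, and the listwise function uses the top-one probability approximation commonly associated with ListNet rather than full permutation probabilities. Function names are hypothetical.

```python
import math


def pointwise_loss(predicted, target):
    """Mean squared error: each item is scored independently of the others."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def ranknet_pair_loss(score_i, score_j, target_prob):
    """Cross-entropy on the modeled probability that item i ranks above item j.

    target_prob: 1.0 if i should rank above j, 0.0 if below, 0.5 for a tie.
    """
    p = 1.0 / (1.0 + math.exp(-(score_i - score_j)))
    return -(target_prob * math.log(p) + (1.0 - target_prob) * math.log(1.0 - p))


def softmax(values):
    """Numerically stable softmax, used as the top-one probability of each item."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(predicted, target):
    """Cross-entropy between the top-one distributions of two score lists."""
    p_target, p_predicted = softmax(target), softmax(predicted)
    return -sum(t * math.log(q) for t, q in zip(p_target, p_predicted))
```

The pointwise loss never compares two items, the pairwise loss sees one pair at a time, and the listwise loss depends on the whole list at once through the softmax normalization.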
4 Experimental Settings
To test our proposal, we designed an experiment to answer three research questions:
What is the most suitable LtR schema for the task of ranking the onion domains in the Tor network and for detecting the influential ones?
What are the advantages of the two different ranking approaches, content-based and link-based?
And what is the best combination of features for the LtR model performance?
In this section, we discuss the motivation behind these questions, describing in detail the analytical approach that we conducted, and finally, we present our findings.
4.1 Evaluation Measure
The two most popular metrics for ranking Information Retrieval systems are Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) [36, 50]. The main difference between the two is that MAP assumes a binary relevance of an item with respect to a given query, while NDCG allows relevance scores in the form of real numbers. Two characteristics suggest the use of NDCG to evaluate the problem addressed in this paper. First, thanks to the first component of the Tor network monitoring pipeline, the classification component, all the addressed onion domains are already relevant to the query. Second, the ground truth and the predicted rank produced by any of the previously described LtR schemes are real numbers, not binary ones. To obtain the NDCG of a given query, we first calculate the Discounted Cumulative Gain (DCG) with the following formula (Eq. 1):

$$DCG@K = G_1 + \sum_{i=2}^{K} \frac{G_i}{\log_2(i)} \qquad (1)$$

where $G_1$ is the gain score at the first position in the obtained ranked list, $G_i$ is the gain score of the $i$-th item in that list, and $K$ refers to the first $K$ items used to calculate the DCG. To obtain the normalized version, it is necessary to divide $DCG@K$ by $IDCG@K$, which is the ideal $DCG@K$ obtained by sorting the items by their gain scores in descending order (Eq. 2):

$$NDCG@K = \frac{DCG@K}{IDCG@K} \qquad (2)$$
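A minimal sketch of the NDCG computation described above is given next. It assumes the common DCG form in which the gain at the first position is not discounted and the gain at position i (for i of 2 or more) is divided by the base-2 logarithm of i; the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k items of a ranked list of gains."""
    top = gains[:k]
    if not top:
        return 0.0
    # The first gain is undiscounted; position i >= 2 is divided by log2(i).
    return top[0] + sum(g / math.log2(i) for i, g in enumerate(top[1:], start=2))


def ndcg_at_k(gains_in_predicted_order, k):
    """Normalize DCG@k by the DCG@k of the ideal (descending) ordering."""
    ideal = sorted(gains_in_predicted_order, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains_in_predicted_order, k) / idcg if idcg > 0 else 0.0
```

Here `gains_in_predicted_order` holds the ground truth gain of each domain, listed in the order produced by the ranking model, so a perfect prediction yields a value of 1.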
4.2 Modules Configurations
4.2.1 Hardware Configurations
Our experiments were conducted on a PC with a 2.8 GHz Intel i7 CPU and 16 GB of RAM, running Windows 10. We implemented the ranking models using Python 3.
4.2.2 HSMU Configurations
In the TF-IDF text vectorizer, we set the feature vector length and the minimum frequency following our previous work. We used a NER model trained on the WNUT-2017 dataset (https://noisy-text.github.io/2017/emerging-rare-entities.html). To set the popularity threshold of the popular_NEX feature, we examined four values and experimentally set it to five. Additionally, we set the threshold of the recently_updated feature to three months before the dataset scraping date. To extract features from the HTML code, we used the BeautifulSoup library (https://pypi.org/project/beautifulsoup4/). To construct the Tor network graph, we used the NetworkX library (https://networkx.github.io/).
In the image classifier, we used the Transfer Learning (TL) technique with the pre-trained Inception-ResNet V2 model released by Google (http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz). The model was trained and tested on separate sets of samples, equally distributed over nine categories. The motivation behind selecting these categories is to have the same classes as in our text classifier. The dataset was collected from Google Images using a Chrome plugin called Bulk Image Downloader. Table II shows the image classifier performance per category.
|Category Name||F1 Score (%)|
|Counterfeit Credit Cards||92.45|
|Counterfeit Personal Identification||95.16|
4.2.3 SLRU Configurations
We used the dataset described in Section 5.1 to train and test the three LtR models introduced in Section 3.2. Due to the small number of samples in the drug domain, we conducted 5-fold cross-validation, following recommendations from previous works. On each iteration, three folds are used for training the ranking model, one fold for validation, and one fold for testing. For the three LtR models, the number of training iterations is controlled by an early stopping mechanism, which is triggered when there is no remarkable change on the validation set.
The three LtR schemes described in Section 3.2 share the same network structure but use different loss functions. The neural network has two layers. For non-linearity, a Rectified Linear Unit (ReLU) activation function is used. To avoid overfitting, the ReLU layer is followed by a dropout layer.
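The three/one/one use of the five folds can be sketched as follows. The sketch assumes that the validation fold is the one following the test fold in a fixed rotation and that samples are assigned to folds without shuffling; both are choices made for illustration, and the function name is hypothetical.

```python
def five_fold_splits(n_samples, n_folds=5):
    """Yield (train, validation, test) index lists, rotating the test fold.

    With five folds, each iteration uses three folds for training,
    one for validation, and one for testing.
    """
    indices = list(range(n_samples))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test_id in range(n_folds):
        val_id = (test_id + 1) % n_folds  # assumed rotation of the validation fold
        train = [
            idx
            for fold_id in range(n_folds)
            if fold_id not in (test_id, val_id)
            for idx in folds[fold_id]
        ]
        yield train, folds[val_id], folds[test_id]
```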
5 Results and Discussion: Drugs Case Study
5.1 Dataset Construction
Darknet Usage Text Addresses 10K (DUTA-10K) is a publicly available dataset proposed by Al-Nabki et al.  that holds more than onion domains downloaded from the Tor network and distributed in categories. In this case study, we address the ranking of the onion domains that were classified as Drugs in DUTA-10K. This category contains drug-related activities, including the manufacture, cultivation, and marketing of drugs, in addition to drug forums and discussion groups. Out of drug domains in DUTA-10K, we selected only the English-language domains, which makes a total of domains. This ranking approach could be adapted to any category of DUTA-10K, but we selected the drugs-related domains due to their popularity in the Tor network. We also want to stress that our approach is not limited to web domains: it could be extended to other fields, such as document ranking or influencer detection in social networks, when a previously ranked list is available for training.
To create the content-based dataset, thirteen people, including the authors, manually ranked the drugs-related domains to build a dataset that served as ground truth. To keep the ranking criteria consistent among the annotators, we created a unified questionnaire of subjective binary questions (Table III) that the annotators answered for each domain. The ground truth is built in a pointwise manner: an annotator assigns a value to each domain by answering every question with a or , which correspond to Yes or No, respectively.
This process is repeated three times, assigning each annotator a new batch of approximately domains every time, different in each iteration. Thus, each onion domain is judged by three different annotators and, as a result, is represented by three binary vectors of answers. Following the majority voting approach, we unified the answer vectors of every hidden service into a single vector of dimensions, which corresponds to the number of questions. Finally, for a given domain, the vector of answers is aggregated into a single real number, a gain value, by adding up its elements; the resulting sum determines the ground-truth rank of that domain. The higher the gain value, the higher the rank an onion domain obtains.
|Has a satisfactory FAQ?||Has a communication channel?|
|Has a professional design?||Has real images for the products?|
|Has a subjective title?||Sells between 2 and 10 products?|
|Provides safe shipping?||Domain name has a meaning?|
|Offers reward or discount?||Products majority are illegal?|
|Sells more than 10 products?||Still accessible in the Tor network?|
|Shipping worldwide service?||Sells at least one popular product?|
|Reputation content?||Requires login/ registration?|
|Accepts only Cryptocurrency?||Recently updated?|
|Has more than 10 sub-pages?|
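The aggregation of the annotators' answers into a gain value can be sketched as follows. This is a minimal illustration under stated assumptions: Yes is encoded as 1 and No as 0, the example uses only four questions instead of the full questionnaire, and the function names and domain identifiers are hypothetical.

```python
def majority_vote(answer_vectors):
    """Combine the binary answer vectors of several annotators, question by question."""
    n_annotators = len(answer_vectors)
    return [1 if 2 * sum(column) > n_annotators else 0
            for column in zip(*answer_vectors)]


def gain(answer_vectors):
    """Gain value of a domain: number of questions answered Yes by the majority."""
    return sum(majority_vote(answer_vectors))


# Three annotators, four questions per domain (toy data).
annotations = {
    "domain_a": [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]],
    "domain_b": [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]],
}
gains = {name: gain(vectors) for name, vectors in annotations.items()}
# Domains sorted from the most to the least influential according to the gain.
ranking = sorted(gains, key=gains.get, reverse=True)
```

Because three annotators judge every domain, the majority vote never ties, and the gain is bounded by the number of questions.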
5.2 Learning-to-Rank Schema Selection
In Section 3.2, we introduced three well-known LtR schemes, namely pointwise, pairwise, and listwise, and for each one we explored a supervised ranking algorithm: MLP, RankNet, and ListNet, respectively. We wanted to know which LtR schema is the most suitable for ranking the onion domains in the Tor network and detecting the influential ones. Fig. 3 compares the three LtR algorithms for at ten different values of , and given that ranking the head of a list correctly matters more than ranking its tail [16, 58], we illustrate the top- values of individually.
Fig. 3 shows the superiority of the listwise approach when each domain is represented by a vector of features extracted from five different kinds of resources (Section 3.1). The same figure shows that the of the ListNet is equal to one, which means that during the five folds of cross-validation, the algorithm ranked the first domains tested correctly, exactly as in the ground truth. At , the curve starts dropping; however, the lowest value was for equal to . As can also be seen in Fig. 3, the pointwise approach, which is the MLP in our case, obtained the worst performance, which agrees with the conclusion of other researchers .
In addition to comparing the performance in terms of the , we recorded the time required to train and test each LtR model, i.e., starting from the moment the model receives a list of domains encoded by HSMU (Section 3.1) until it produces their rank. On average over the five folds, the ListNet model took seconds for training and seconds for testing. RankNet took seconds for training and seconds for testing. Finally, the fastest one was the MLP model, which took seconds for training and seconds for testing. This comparison shows that the ListNet model is the slowest one, due to the complexity of its loss function compared with those of the RankNet and MLP algorithms.
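Assuming the evaluation metric is the NDCG at a cut-off k, as stated in the abstract, the following sketch shows how a ranked list is scored against the ground-truth gains. The exponential gain `2^g - 1` with a `log2` position discount is one common formulation and is an assumption here; the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k items of a ranked list of gains."""
    return sum((2 ** g - 1) / math.log2(position + 2)
               for position, g in enumerate(gains[:k]))


def ndcg_at_k(gains_in_predicted_order, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(gains_in_predicted_order, reverse=True), k)
    return dcg_at_k(gains_in_predicted_order, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1, 0], k=4)    # predicted order equals the ground truth
reversed_ = ndcg_at_k([0, 1, 2, 3], k=4)  # worst possible order of the same items
```

A value of one therefore means that the first k domains were returned in exactly the ground-truth order, and mistakes near the head of the list are penalized more than mistakes near the tail.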
5.3 Link-based versus Content-based Ranking
Having two distinct ranking strategies raises a question: which is the most suitable ranking approach, content-based or link-based? To answer it, we explore four link-based algorithms, namely, ToRank , PageRank , Hyperlink-Induced Topic Search (HITS) , and Katz . In particular, we compare the best LtR model of our approach, i.e., ListNet, which is a supervised ranking algorithm, with the four link-based algorithms, which are unsupervised. We represent the Tor network by a directed graph that consists of nodes and directed edges, where nodes represent the onion domains, and the hyperlinks between them are captured using the directed edges.
Comparison configuration. Unlike our approach, the link-based algorithms do not require training data. To obtain a fair comparison, we carried out -fold cross-validation with the same random seed for both ranking approaches. Then, we constructed a directed graph out of the testing nodes and applied the link-based algorithms; finally, both approaches were evaluated using the same test set. For the link-based algorithms, we evaluated several configuration parameters and selected the ones that obtained the highest (Table IV).
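As an example of how a link-based algorithm scores the nodes of such a directed graph, the sketch below implements PageRank by power iteration in plain Python. It is an illustrative stand-in, not the implementation used in the experiments: the parameter names `alpha` and `max_iter` mirror the ones commonly exposed by graph libraries, while the tolerance, the toy graph, and the node names are hypothetical.

```python
def pagerank(nodes, edges, alpha=0.85, max_iter=100, tol=1e-9):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += alpha * rank[source] / len(out_links[source])
        converged = sum(abs(new_rank[node] - rank[node]) for node in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy graph: three domains link to "c", and "c" links back to "a".
nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
scores = pagerank(nodes, edges)
link_based_ranking = sorted(scores, key=scores.get, reverse=True)
```

The score of a domain depends only on who links to it, which is the property that the content-based features are meant to complement.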
|Algorithm Name||Parameter||Evaluated values|
|PageRank||alpha||0.50, 0.70, 0.75, 0.80, 0.85, 0.90|
|PageRank||max_iter||10, 100, 1000, 10000|
|ToRank||alpha||0.50, 0.70, 0.80, 0.90, 1.00|
|ToRank||beta||0.1, 0.2, 0.3, 0.4, 0.5, 0.6|
|HITS||max_iter||10, 100, 1000, 10000|
|Katz||alpha||0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9|
|Katz||beta||0.1, 0.3, 0.5, 0.7, 0.9, 1.0|
|Katz||max_iter||10, 100, 1000, 10000|
Fig. 4 shows that ListNet clearly surpasses all the link-based ranking algorithms. We observe that even the weakest LtR approach, i.e., MLP, which obtained a of , outperforms the best link-based ranking algorithm, ToRank, which scored of . This result emphasizes the importance of considering the content of domains rather than only their hyperlink connectivity.
5.4 Feature Selection
In the previous sections, we concluded that ListNet outperforms the content-based RankNet and MLP algorithms and the four link-based algorithms when the proposed forty features represent an onion domain. However, the cost of these features varies. Some of them, such as the visual content, require building a dedicated image classification model, while other features could be extracted merely using a regular expression. The cost is reflected in the time necessary to obtain the features and to build the ranking model, in addition to the inference time. On average, per domain, the prediction of the image classification model was the most expensive and took seconds, followed by the NER model with seconds, and then the text features, which required seconds. Finally, the HTML and the graph features were the fastest ones to extract, taking and seconds, respectively.
In the following, we want to answer the question: which feature or combination of features produces the best performance of the LtR model? To answer this question, we conducted a feature analysis for the best LtR model, ListNet, using features from the five different resources and combining them. To discover the best combination of features for the ranking system performance, we trained and tested five ListNet models for each source of features and compared their performance.
When we analyze features coming from a single source, without combining them, Fig. 5 shows that the features extracted only from text, denoted by text and shown in light green in this figure, achieve the highest of . Very close to them is the model trained using only named entity recognition (NER) features, which obtains a of . After that, using only features extracted from HTML, the ListNet model obtains a of . In contrast, the graph features obtain the lowest of , which indicates their weakness for ranking the onion domains, unlike the features extracted from the text, which have shown a significant and positive impact on the metric. Hence, we conclude that the features extracted from the user-visible text are more representative than the ones coming from the visual characteristics of the domain or the graph.
After the previous analysis, we decided to investigate the effect of combining features from different sources, to measure the impact of those combinations on the model performance. Fig. 5 shows that the performance increases when the top- individual feature sources, i.e., text, NER, and HTML, are fused. Also, those three sources combined obtain a close to the one yielded by combining all the features (All). Consequently, we could remove the graph and the visual features and keep the ranking performance relatively high and very close to the case when all the features are used.
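The feature analysis above amounts to training the same ranking model on different column subsets of the feature matrix. The sketch below shows only that selection step. It is a minimal illustration: the column layout in `FEATURE_GROUPS` is hypothetical (seven columns instead of the forty features), and the training and evaluation of the ranking model for each subset are omitted.

```python
from itertools import combinations

# Hypothetical column layout: which columns of the feature matrix come from each source.
FEATURE_GROUPS = {
    "text": [0, 1],
    "NER": [2],
    "HTML": [3, 4],
    "visual": [5],
    "graph": [6],
}


def select_columns(matrix, sources):
    """Keep only the columns that belong to the given feature sources."""
    columns = [c for source in sources for c in FEATURE_GROUPS[source]]
    return [[row[c] for c in columns] for row in matrix]


def candidate_subsets(sources, sizes):
    """Enumerate the combinations of feature sources of the requested sizes."""
    for size in sizes:
        yield from combinations(sources, size)


# One toy domain described by seven feature values.
features = [[10, 11, 12, 13, 14, 15, 16]]
single_sources = list(candidate_subsets(list(FEATURE_GROUPS), sizes=[1]))
textual_only = select_columns(features, ["text", "NER", "HTML"])
```

Each subset would then be passed to the same training and evaluation loop, so that the scores of the resulting models are directly comparable.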
6 Conclusions and Future Work
The Tor network is rife with suspicious activities that LEAs are interested in monitoring. By ranking the onion domains according to their influence inside the Tor network, LEAs can prioritize which domains to monitor.
In this paper, we created several ranking frameworks using Learning-to-Rank (LtR) to detect the most influential onion domains in the Tor darknet using different sources of features. Our purpose was to determine which approach gives the best performance with the lowest complexity in the model creation. The proposed framework consists of two components: 1) the Hidden Service Modeling Unit (HSMU), which represents an onion domain by features that are extracted from five different resources: the domain user-visible text, the HTML markup of the web page, the named entities in the domain text, the visual content, and the Tor network structure; and 2) the Supervised Learning-to-Rank Unit (SLRU), which learns how to rank the domains using the LtR approach. To train the LtR model, we built a manually sorted dataset of drugs-related onion domains.
We tested the effectiveness of our framework on a manually ranked dataset of onion domains related to drug trading. We explored and evaluated three common LtR algorithms: MLP, RankNet, and ListNet, considering that the method which obtained the highest is the best one. We found that the ListNet algorithm outperforms the rest of the ranking algorithms with a of . Moreover, we contrasted our framework with four link-based ranking algorithms, and we observed that the MLP with a of , which is the worst LtR algorithm, is better than the best link-based one, ToRank, which obtained a of .
Given the superiority of the ListNet algorithm, we analyzed how the different kinds of features impact the ranking performance. We found that using only the features extracted from the user-visible textual content, including the text, the named entities, and the HTML markup code, the model achieves a of , exactly the same as the model that uses all the features. However, at , the performance drops slightly to compared with using the five different sources of features. Considering both the cost of obtaining the features and creating the models and the score obtained, we recommend the text-NER-HTML model because its cost is low, and its score is almost the same as that of the more complex approach that uses all the features.
In the future, we plan to explore additional LtR methods from the listwise approach, such as RankBoost  and LambdaMART . Moreover, we will explore the StarSpace algorithm , which attempts to learn object representations in a common embedding space that could be used for entity ranking and recommendation systems. Finally, to ease the process of building the training dataset, both in terms of time and the number of labeled samples, we plan to explore the Active Learning technique , which selects the most distinct samples to be sorted by an expert.
This work was supported by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addendum 01 and 22.
-  (2019) Note: Accessed: 2019-06-17.
-  (2006) Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 19–26.
-  (2017) A multi-task approach for named entity recognition in social media data. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 148–153.
-  (2017) Classifying illegal activities on tor network based on web textual contents. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1, pp. 35–43.
-  (2017) Detecting emerging products in tor network based on k-shell graph decomposition. III Jornadas Nacionales de Investigación en Ciberseguridad (JNIC) 1 (1), pp. 24–30.
-  (2019) ToRank: identifying the most influential suspicious domains in the tor network. Expert Systems with Applications 123, pp. 212–226.
-  (2019) DarkNER: a platform for named entity recognition in tor darknet. In Jornadas Nacionales de Investigación en Ciberseguridad (JNIC2019), Poster session.
-  (2015) Ranking radically influential web forum users. IEEE Transactions on Information Forensics and Security 10 (6), pp. 1289–1298.
-  (2016) Everything you always wanted to know about drug cryptomarkets*(* but were afraid to ask). International Journal of Drug Policy 35, pp. 1–6.
-  (2017) Exploring and analyzing the tor hidden services graph. ACM Transactions on the Web (TWEB) 11 (4), pp. 24.
-  (2014) Content and popularity analysis of tor hidden services. In Distributed Computing Systems Workshops (ICDCSW), 2014 IEEE 34th International Conference on, pp. 188–193.
-  (2017-12) Recognition of service domains on tor dark net using perceptual hashing and image classification techniques. In 8th International Conference on Imaging for Crime Detection and Prevention (ICDP 2017), pp. 7–12.
-  (2016) Studying illicit drug trafficking on darknet markets: structure and organisation from a canadian perspective. Forensic science international 264, pp. 7–14.
-  (2005) Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pp. 89–96.
-  (2010) From ranknet to lambdarank to lambdamart: an overview. Microsoft Research Technical Report 11 (23-581), pp. 81–99.
-  (2006) Adapting ranking svm to document retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 186–193.
-  (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136.
-  (2007) A model of internet topology using k-shell decomposition. Proceedings of the National Academy of Sciences 104 (27), pp. 11150–11154.
-  (2010) Digging into anonymous traffic: a deep analysis of the tor anonymizing network. In 2010 Fourth International Conference on Network and System Security, pp. 167–174.
-  (2008) Uncovering the dark web: a case study of jihad on the web. Journal of the American Society for Information Science and Technology 59 (8), pp. 1347–1359.
-  (2011) Dark web: exploring and data mining the dark side of the web. Springer Science & Business Media.
-  (2019) The language of legal and illegal activity on the darknet. In Proc. of ACL.
-  (2013) Deepweb and cybercrime. Trend Micro Report 9, pp. 1–22.
-  (2008) Online learning from click data for sponsored search. In Proceedings of the 17th international conference on World Wide Web, pp. 227–236.
-  (2018) Google’s 200 ranking factors: the complete list (2018). Note: Accessed: 2019-04-17.
-  (2015) Evaluating drug trafficking on the tor network: silk road 2, the sequel. International Journal of Drug Policy 26 (11), pp. 1113–1123.
-  (2010) An empirical study on learning to rank of tweets. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 295–303.
-  (2012) Changing of the guards: a framework for understanding and improving entry guard selection in tor. In Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society, pp. 43–54.
-  (2018-09) Boosting image classification through semantic attention filtering strategies. Pattern Recognition Letters 112, pp. 176–183.
-  (2019) Sex, drugs, and bitcoin: how much illegal activity is financed through cryptocurrencies? The Review of Financial Studies 32 (5), pp. 1798–1853.
-  (1978) Centrality in social networks conceptual clarification. Social networks 1 (3), pp. 215–239.
-  (2003-12) An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 4, pp. 933–969.
-  (2001-10) Greedy function approximation: a gradient boosting machine. Ann. Statist. 29 (5), pp. 1189–1232.
-  (2017-12) Pornography and child sexual abuse detection in image and video: a comparative evaluation. In 8th International Conference on Imaging for Crime Detection and Prevention (ICDP 2017), pp. 37–42.
-  (2002) Training products of experts by minimizing contrastive divergence. Neural computation 14 (8), pp. 1771–1800.
-  (2000) IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 41–48.
-  (2016) ROSF: leveraging information retrieval and supervised learning for recommending code snippets. IEEE Transactions on Services Computing, pp. 34–46.
-  (1953) A new status index derived from sociometric analysis. Psychometrika 18 (1), pp. 39–43.
-  (1999) Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46 (5), pp. 604–632.
-  (2013) Sparse learning-to-rank via an efficient primal-dual algorithm. IEEE Transactions on Computers 62 (6), pp. 1221–1233.
-  (2011) A short introduction to learning to rank. IEICE TRANSACTIONS on Information and Systems 94 (10), pp. 1854–1862.
-  (2011) Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies 4 (1), pp. 1–113.
-  (2018) Leveraging official content and social context to recommend software documentation. IEEE Transactions on Services Computing (1), pp. 1–1.
-  (2016) FriendRank: a personalized approach for tweets ranking in social networks. In Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on, pp. 896–900.
-  (2012) Tedas: a twitter-based event detection and analysis system. In Data engineering (icde), 2012 ieee 28th international conference on, pp. 1273–1276.
-  (2015) Torward: discovery, blocking, and traceback of malicious traffic over tor. IEEE Transactions on Information Forensics and Security 10 (12), pp. 2515–2530.
-  (2009) Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3 (3), pp. 225–331.
-  (2015) Transfer learning using computational intelligence: a survey. Knowledge-Based Systems 80, pp. 14–23.
-  (2013-10-01) The whens and hows of learning to rank for web search. Information Retrieval 16 (5), pp. 584–628.
-  (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103.
-  (2018) Offline constraints in online drug marketplaces: an exploratory analysis of a cryptomarket trade network. International Journal of Drug Policy 56, pp. 92–100.
-  (2016) Activist: A new framework for dataset labelling. In Proceedings of the 24th Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2016, Dublin, Ireland, September 20-21, 2016, pp. 140–148.
-  (2016) Ensemble of keyword extraction methods and classifiers in text classification. Expert Systems with Applications 57, pp. 232–247.
-  (1999-11) The pagerank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab.
-  (1992) The md5 message-digest algorithm.
-  (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 4278–4284.
-  (2018) Who uses tor? Note: Accessed: 2019-04-17.
-  (2017-01) Context-aware service input ranking by learning from historical information. IEEE Transactions on Services Computing (01), pp. 1–1.
-  (1994) Social network analysis: methods and applications. Cambridge university press.
-  (2016) Terrorist migration to the dark web. Perspectives on Terrorism 10 (3), pp. 40–44.
-  (2018) StarSpace: embed all the things!. In AAAI Conference on Artificial Intelligence, pp. 5569–5577.
-  (2008) Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th international conference on Machine learning, pp. 1192–1199.
-  (2015) Empirical evaluation of rectified activations in convolutional network. In Deep Learning Workshop, ICML 15.