A Web Page Classifier Library Based on Random Image Content Analysis Using Deep Learning
In this paper we present a methodology and the corresponding Python library
Web classification is a hard but necessary task for several reasons: security, social conventions, and parental control, among others (Vanetti et al., 2011; Ali et al., 2017). It is also important for internet search, because accurate classification can lead to better services. Recently a new application has emerged: automated classification of healthcare information on the web (Cronin et al., 2015; Hirokawa and Ishita, 2014), where the main goal is to help users avoid, filter or rank web content during the treatment of chronic diseases or while improving habits in the treatment of internet addictions (Griffiths, 2001; Coughlin et al., 2017).
Traditional methods such as naive Bayes or k-nearest neighbours have shown good performance for the classification of web content (Qi and Davison, 2009). However, the rapid growth of the web, mainly towards complex and interactive visual content, makes the performance of these methods much poorer. In general, most techniques for the classification of webpages are based on the analysis of on-page features such as the URL name, textual content (Milicka and Burget, 2013) and tags (Qi and Davison, 2009), and more rarely on visual content (de Boer et al., 2011; Burget and Rudolfova, 2009). The rise of deep learning as the method of choice for image classification, with accuracy close to human level, together with the ever more powerful computational resources available in basic workstations, has motivated the study of visual features as a suitable source of information for the classification of internet content.
In this work we classify webpages using their image content by means of a pre-trained neural network. The links to the images of a given webpage are obtained and then randomly shuffled. Then a subset of predefined size is downloaded and classified by our algorithm: a random forest model built upon the features of the pre-trained network. The whole process is carried out in memory. The final outcome is a list of probabilities, one per image. These values relate to the pre-trained classes, which in our case concern weapon content. The presented method is highly accurate and can be easily extended to a large number of classes.
The Website Image-Based Classification (WIBC) library, written in Python 3.6, implements the novel functionality. It retrieves images from a given URL and then calculates, for each image, the probability of belonging to a particular class. The current study focuses on one relevant class, Weapons, for which a large corpus of training and test data is available from the currently running CloSer project, but the methodology is extensible to any number of arbitrary classes. The deep learning back-end runs on an open source
The whole URL content of a webpage is accessed using the requests Python package. Links to all image content are extracted using the lassie Python package
A parallel code implementation is written using a ProcessPool job (from the Pebble python package
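The retrieval steps described above (collect the image links, shuffle them, keep a subset, download in parallel and in memory) can be sketched as follows. This is only an illustration under stated assumptions, not the WIBC code: it uses the standard library `html.parser` in place of lassie and a thread pool in place of the Pebble ProcessPool, and the names `ImageLinkParser`, `sample_image_links`, `download_in_memory` and the `fetch` hook are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag (stand-in for lassie)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.links.append(urljoin(self.base_url, src))


def sample_image_links(html, base_url, max_images, seed=None):
    """Return a random subset of at most max_images distinct image links."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    links = list(dict.fromkeys(parser.links))  # drop duplicates, keep order
    random.Random(seed).shuffle(links)
    return links[:max_images]


def download_in_memory(links, fetch, workers=4):
    """Fetch all links in parallel, keeping the bytes in memory.

    `fetch` is a hypothetical hook: any callable mapping a URL to bytes,
    for example lambda u: requests.get(u, timeout=5).content
    Links whose download raises an exception are skipped.
    """
    def safe_fetch(link):
        try:
            return link, fetch(link)
        except Exception:
            return link, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe_fetch, links))
    return {link: data for link, data in results if data is not None}
```

Passing the download function as an argument keeps the sketch independent of any particular HTTP client and makes the timeout policy a caller decision.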
In the next step, the associated features of each image are extracted using a deep network. Here we used a pre-trained
The 21,841 class predictions of the pre-trained model are used as inputs for our second-stage model. It is trained on our own set of images: a sanitized dump of 120,342 random images, and a set of 16,358 images from weapon websites (with a mix of weapon content and random content). A pure set of 793 manually sorted images of weapon content is used for testing. The class probability predictions are made by a random forest classifier from the Scikit-learn library (Pedregosa et al., 2011). Therefore, in the last step, the matrix of extracted features is turned into weapon class probabilities by means of this model.
The WIBC library works by instantiating a WIBC_weapon object, which loads a deep learning model and a random forest classifier into memory. Then the run(url) function fetches images from a website and returns, for each image, the probability of belonging to the Weapon category. The WIBC_weapon settings include the maximum number of images to download per website, and whether to use GPU acceleration (with Nvidia CUDA 7.5 or 8.0).
2.1. Methodology Under Development
There are three directions of development for the near future. The first one extracts the upper-layer features from a deep network. These features provide a universal and robust image description that is independent of object classes and needs no re-training. The original network computes its classes from them with logistic regression. The proposed solution is to train a regularized linear model from these features towards our classes of interest. Re-training on a new dataset would then involve only re-training the linear model, which is very fast.
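As an illustration of why re-training only the linear model is cheap, consider one possible choice of regularization, a ridge (L2) penalty. This is an assumption, since the text only specifies a "regularized linear model". Let X be the n-by-d matrix of upper-layer features for n training images, Y the n-by-c matrix of one-hot targets for the c classes of interest, and lambda > 0 the regularization strength. The weights then have a closed-form solution:

```latex
\hat{W} = \arg\min_{W} \; \lVert X W - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2
        = \left( X^\top X + \lambda I \right)^{-1} X^\top Y
```

Under this assumption, adding a class or new training images only requires solving a d-by-d linear system again, while the deep network that produces X stays fixed.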
The second approach is to store already observed images from each website (in some robust representation), and to check all new images against them. Web images are highly repetitive across webpages. Knowing that an image was observed on a website of a particular class provides valuable classification information, may be faster than a deep network, and works with any hard-to-classify image content, such as that of a Cults website.
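A minimal sketch of such a lookup is given below. It is a simplification under stated assumptions: the class `ObservedImageStore` and its methods are hypothetical names, and an exact SHA-256 digest of the image bytes stands in for the "robust representation", so a re-encoded or resized copy of an image would not match.

```python
import hashlib


class ObservedImageStore:
    """Remember under which class each previously seen image was observed."""

    def __init__(self):
        self._seen = {}  # digest -> {class label: number of observations}

    @staticmethod
    def _key(image_bytes):
        # Exact-content digest; a perceptual hash would tolerate re-encoding.
        return hashlib.sha256(image_bytes).hexdigest()

    def add(self, image_bytes, label):
        """Record one observation of this image on a website of class label."""
        counts = self._seen.setdefault(self._key(image_bytes), {})
        counts[label] = counts.get(label, 0) + 1

    def lookup(self, image_bytes):
        """Return the most frequently observed label, or None if unseen."""
        counts = self._seen.get(self._key(image_bytes))
        if not counts:
            return None
        return max(counts, key=counts.get)
```

An image that returns None from `lookup` would still have to go through the deep network, so the store acts as a cache in front of the classifier rather than a replacement for it.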
The third approach is to gather more training data from third-party image datasets. This involves running the WIBC library on large image datasets like Flickr1M (1 million images) to find images that are probably relevant to a particular category, and then manually filtering those images to extract the relevant ones. Another way to enlarge the training dataset is to add rotated and mirrored copies of the images.
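The rotation and mirroring mentioned above can be sketched on a toy representation. The example assumes an image stored as a list of pixel rows and restricts rotation to multiples of 90 degrees; both are simplifications, and the function names are hypothetical.

```python
def mirror(image):
    """Flip an image left to right; image is a list of pixel rows."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the four right-angle rotations of the image and of its mirror."""
    variants = []
    for base in (image, mirror(image)):
        current = base
        for _ in range(4):
            variants.append(current)
            current = rotate90(current)
    return variants
```

Each source image yields up to eight variants this way; symmetric images produce duplicates, which would need to be dropped before training.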
Machine learning in health-care applications is a field that has been studied, but that still offers great opportunities for further development. Björk et al. (2016) presented the outline of the Huntington disease prediction project, whereas the actual missing value imputation framework for more accurate prediction of the Huntington disease was presented by Akusok et al. (2017).
3. Experimental Results
We tested our implementation on a curated list of webpages
The results for the list of weapon-related webpages are presented in Fig. 1. The results are plotted in ascending order of the mean value of the probabilities obtained for each webpage (see figure). We analyzed a random subset of at most ten images per webpage. The algorithm assigns each image a probability from 0 to 1 of belonging to the weapons class (from white to blue in the color scale). On average, obtaining the result for a given webpage takes around 10 seconds; however, this time can be tuned through the user configuration (timeout, connection requests) and depends on the speed of the internet connection. The download process can be run in parallel depending on the capabilities of the machine, and a GPU option is included for extracting the features of the images.
After collecting the results for all the webpages, the remaining question is which criterion to use to decide whether a webpage belongs to the weapons class. We propose two methods. The first uses the mean value of the probabilities, assuming that for a given webpage these follow a Gaussian distribution, and decides whether the webpage belongs to the weapons class by comparing this mean against a given threshold value. The second method classifies the webpage if at least a number
Figs. 2 and 3 show the Precision-Recall and ROC curves for the two methods. The curve for the first method is plotted with magenta squares, while for the second method the result for each value of the required number of images is plotted with colored circles.
Following the plots, it is clear that the best method, in terms of both Precision-Recall and false positive rate, is the second proposed method with the required number of images equal to 1. The respective balance point (where Sensitivity = Specificity) is equal to 0.81, and it is obtained when the probability threshold is around 0.41. The second best result is obtained with the second method when the required number of images is 2. The first method is ranked third.
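The two page-level decision rules and the balance point can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for the figures: the second rule is assumed to count the images whose probability exceeds the threshold, the first is assumed to flag a page when the mean exceeds the threshold, and all function names are hypothetical.

```python
def mean_rule(probs, threshold):
    """Method 1: flag the page if the mean image probability exceeds threshold."""
    if not probs:
        return False
    return sum(probs) / len(probs) > threshold


def count_rule(probs, threshold, min_images=1):
    """Method 2 (as assumed here): flag the page if at least min_images
    images have a probability above threshold."""
    return sum(p > threshold for p in probs) >= min_images


def sensitivity_specificity(pages, labels, rule):
    """Sensitivity and specificity of a page-level rule on labelled pages.

    Assumes that both weapon and non-weapon pages are present in labels.
    """
    tp = fn = tn = fp = 0
    for probs, is_weapon in zip(pages, labels):
        predicted = rule(probs)
        if is_weapon:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    return tp / (tp + fn), tn / (tn + fp)


def balance_threshold(pages, labels, min_images=1, steps=100):
    """Sweep the threshold over [0, 1] and return (threshold, sensitivity,
    specificity) where sensitivity and specificity are closest."""
    best = None
    for i in range(steps + 1):
        t = i / steps
        sens, spec = sensitivity_specificity(
            pages, labels, lambda p: count_rule(p, t, min_images))
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, t, sens, spec)
    return best[1:]
```

Here `pages` is a list with one list of image probabilities per webpage and `labels` marks which pages are weapon related. With `min_images=1` a single high-probability image is enough to flag a page, which is the setting reported as best above.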
4. Conclusion and further research
In the classification of webpages, the analysis of image content is the best criterion so far. We propose that a webpage can be assigned to the weapons class when a single randomly selected image from the webpage belongs to this class. The threshold for this probability can be tuned depending on whether precision or a low false positive rate is the priority.
Two highlights of the proposed method are high accuracy in image content analysis and extreme simplicity of implementation. The most complex part, image feature extraction, is performed by a state-of-the-art deep learning model pre-trained on millions of images, which can currently be obtained with a single line of code
Further research includes exploration of the proposed method for the classification of images in the health-care sector. In general, machine learning techniques are very valuable in the health-care domain (e.g. the example of missing value imputation in Björk et al., 2016; Akusok et al., 2017). In this case the possible application areas may be different from the Huntington disease prediction. As discussed, many diagnostic images are processed in the health-care system, for example for brain tumor detection (Termenon et al., 2016). In addition, presumptive health-care processes, such as filtering content for people who are trying to quit smoking, can be of extreme value.
This material is based on work jointly supported by the Tekes project Cloud-assisted Security Services (CloSer). The authors would like to thank their collaborators from F-Secure for sharing the scientific research data.
- copyright: rightsretained
- conference: ACM Woodstock conference; June 26; Corfu, Greece
- journalyear: 2018
- article: 4
- ccs: Computing methodologies Computer vision
- ccs: Computing methodologies Machine learning algorithms
- ccs: Software and its engineering Software libraries and repositories
- ccs: Human-centered computing User interface management systems
- ccs: Applied computing Health care information systems
- Python package https://pypi.python.org/pypi/mxnet,
accelerated version https://pypi.python.org/pypi/mxnet-cu80mkl
- F-Secure, private communication
- The required number of images can be 1, 2, 3, or 10, the last being the maximum number of retrieved images.
- Brute-force missing data extreme learning machine for predicting Huntington’s disease. In Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, PETRA ’17, New York, NY, USA, pp. 189–192.
- A fuzzy ontology and SVM-based web content classification system. IEEE Access 5, pp. 25781–25797.
- A new application of machine learning in health care. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, PETRA ’16, New York, NY, USA, pp. 49:1–49:4.
- Web page element classification based on visual features. In 2009 First Asian Conference on Intelligent Information and Database Systems, pp. 67–72.
- MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274.
- A systematic review of studies of web portals for patients with diabetes mellitus. mHealth 3 (6).
- Automated classification of consumer health information needs in patient portal messages. AMIA Annu Symp Proc 2015, pp. 1861–1870.
- Web page classification using image analysis features. In Web Information Systems and Technologies: 6th International Conference, WEBIST 2010, Valencia, Spain, April 7-10, 2010, Revised Selected Papers, pp. 272–285.
- ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- Sex on the internet: observations and implications for internet sex addiction. The Journal of Sex Research 38 (4), pp. 333–342.
- Non-topical classification of healthcare information on the web. In Smart Digital Futures 2014, Frontiers in Artificial Intelligence and Applications, Vol. 262, pp. 237–247.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456.
- Web document description based on ontologies. In Informatics and Applications (ICIA), 2013 Second International Conference on, pp. 288–293.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
- Web page classification: features and algorithms. ACM Comput. Surv. 41 (2), pp. 12:1–12:31.
- Brain MRI morphological patterns extraction tool based on Extreme Learning Machine and majority vote classification. Neurocomputing 174, Part A, pp. 344–351.
- Content-based filtering in on-line social networks. In Privacy and Security Issues in Data Mining and Machine Learning: International ECML/PKDD Workshop, PSDML 2010, Barcelona, Spain, September 24, 2010. Revised Selected Papers, pp. 127–140.