A Web Page Classifier Library Based on Random Image Content Analysis Using Deep Learning

Abstract.

In this paper we present a methodology and the corresponding Python library1 for the classification of webpages. Our method retrieves a fixed number of images from a given webpage and, based on them, classifies the webpage into a set of established classes with a given probability. The library trains a random forest model built upon the features extracted from the images by a pre-trained deep network. The implementation is tested by recognizing weapon-class webpages in a curated list of 3859 websites. The results show that the best way to classify a webpage into the studied classes is to assign the class when the maximum probability, across all retrieved images, of an image belonging to this (weapon) class is above a threshold. Further research explores the possibilities for the developed methodology to also apply to image classification in healthcare applications.

webpage classification, computer vision, deep learning

1. Introduction

Web classification is a hard but necessary task for several reasons: security, social conventions, and parental control, among others (Vanetti et al., 2011; Ali et al., 2017). It is also important for internet search engines, because accurate classification can lead to better services. In recent times a new application has emerged: automating the classification of healthcare information on the web (Cronin et al., 2015; Hirokawa and Ishita, 2014), where the main goal is to help users avoid, filter or rank web content during the treatment of chronic diseases, or to improve habits during the treatment of internet addictions (Griffiths, 2001; Coughlin et al., 2017).

Traditional methods such as naive Bayes or k-nearest neighbours have shown good performance for the classification of web content (Qi and Davison, 2009). However, the fast growth of the internet, mainly towards complex and interactive visual content, has made the performance of these methods much poorer. In general, most techniques for the classification of webpages are based on the analysis of on-page features such as the url name, textual content (Milicka and Burget, 2013) and tags (Qi and Davison, 2009), and only rarely on visual content (de Boer et al., 2011; Burget and Rudolfova, 2009). The rise of deep learning as the method of choice for image classification, due to its near-human accuracy, together with the ever more powerful computational resources of basic workstations, has motivated the study of visual features as a suitable source of information for the classification of internet content.

In this work we classify webpages from their image content by means of a pre-trained neural network. The links to the images on a given webpage are obtained and then randomly shuffled. A subset of a defined size of all images is then downloaded and classified by our algorithm: a random forest model built upon the features of the pre-trained network. The whole process is carried out in memory. The final outcome is a list of probabilities linked to each image. These values are related to the pre-trained classes, which in our case concern weapon content. The presented method is highly accurate and can be easily extended to a large number of classes.

2. Methodology

The Website Image-Based Classification (WIBC) library, written in Python 3.6, implements the novel functionality. It retrieves images from a given url and then calculates the probabilities of these images belonging to a particular class. The current study focuses on one relevant class, weapons, for which a large corpus of training and test data is available from the currently running CloSer project, but the methodology is extensible to any number of arbitrary classes. The deep learning back-end runs on the open source14 MXNet (Chen et al., 2015) library.

The whole url content of a webpage is accessed using the requests python package. Links to all image content are extracted from the url using the lassie python package15, and optionally from all possible sub-urls. An image formatting function then fetches images from these links and converts them to RGB format, dropping corrupted, inaccessible or tiny images. The search for image objects and their formatting continues iteratively until the required number of valid images is obtained. Such an approach reduces url analysis latency, because not all links are followed. Valid images are returned as python image objects in a list, without any need for disk storage.
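A minimal sketch of this fetching step is given below; the function name fetch_valid_images and the size threshold MIN_SIZE are illustrative assumptions, not the actual WIBC API.

```python
# Hedged sketch of the image-fetching step; fetch_valid_images, MIN_SIZE and
# N_IMAGES are illustrative names, not the actual WIBC API.
import io
import random

import lassie
import requests
from PIL import Image

MIN_SIZE = 32    # assumed threshold for dropping tiny images (icons, spacers)
N_IMAGES = 10    # required number of valid images per webpage

def fetch_valid_images(url, n_images=N_IMAGES, timeout=5):
    """Return up to n_images RGB PIL images found on the given url, kept in memory."""
    links = [img.get('src') for img in lassie.fetch(url, all_images=True)['images']]
    random.shuffle(links)                  # image links are processed in random order
    valid = []
    for link in links:
        if len(valid) >= n_images:
            break                          # stop early: not all links are followed
        if not link:
            continue
        try:
            response = requests.get(link, timeout=timeout)
            image = Image.open(io.BytesIO(response.content)).convert('RGB')
        except Exception:
            continue                       # corrupted or inaccessible image
        if min(image.size) < MIN_SIZE:
            continue                       # tiny image
        valid.append(image)                # python image object, never written to disk
    return valid
```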

A parallel implementation is written using a ProcessPool job (from the Pebble python package16). Distributed jobs are created incrementally as more image links are discovered, and the process pool terminates immediately after obtaining the necessary number of valid images. The parallel implementation does not support sub-url search, because sub-url search is incompatible with the incremental image link discovery that greatly reduces the processing time of a single url.
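A hedged sketch of the parallel variant with a Pebble ProcessPool follows; download_image is assumed to be a per-link helper such as the loop body above, and the worker count is illustrative.

```python
# Hedged sketch of parallel image downloading with a Pebble ProcessPool.
# download_image(link) is an assumed helper returning a PIL image or None.
from pebble import ProcessPool

def fetch_valid_images_parallel(links, n_images=10, workers=4, timeout=5):
    """Download image links in parallel, stopping once enough valid images arrive."""
    valid = []
    pool = ProcessPool(max_workers=workers)
    futures = [pool.schedule(download_image, args=(link,), timeout=timeout)
               for link in links]
    for future in futures:
        try:
            image = future.result()        # blocks until this job finishes
        except Exception:
            continue                       # failed or timed-out download
        if image is not None:
            valid.append(image)
        if len(valid) >= n_images:
            break
    pool.stop()                            # terminate remaining jobs immediately
    pool.join()
    return valid
```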

In the next step, the associated features of each image are extracted using a deep network. Here we used a pre-trained17 deep neural network formally known as the Full ImageNet Network, aka Inception21k. This model is pre-trained on the full ImageNet dataset (Deng et al., 2009) with 14,197,087 images in 21,841 classes. The model is trained with only random crop and mirror augmentation. The network is based on the Inception-BN network (Ioffe and Szegedy, 2015). The final outcome is a numpy array of dimension 21,841 × number of images.
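A hedged sketch of this feature extraction step with MXNet is shown below; the checkpoint prefix and epoch number are assumptions that depend on the files downloaded from the model gallery, and input preprocessing (mean subtraction) is omitted for brevity.

```python
# Hedged sketch of extracting the 21,841 class activations with MXNet.
# The checkpoint prefix 'Inception-21k' and epoch 9 are assumptions; the actual
# files come from the mxnet-model-gallery link given in the footnotes.
import mxnet as mx
import numpy as np

sym, arg_params, aux_params = mx.model.load_checkpoint('Inception-21k', 9)
module = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
module.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
module.set_params(arg_params, aux_params, allow_missing=True)

def extract_features(images):
    """Return an array of shape (n_images, 21841) with one activation row per image."""
    rows = []
    for image in images:
        array = np.asarray(image.resize((224, 224)), dtype=np.float32)
        array = array.transpose(2, 0, 1)[np.newaxis]       # NCHW layout, batch of one
        module.forward(mx.io.DataBatch([mx.nd.array(array)]))
        rows.append(module.get_outputs()[0].asnumpy().ravel())
    return np.vstack(rows)
```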

The 21,841 class predictions of the pre-trained model are used as inputs for our second-stage model. It is trained on our own set of images: a sanitized dump of 120,342 random images, and a set of 16,358 images from weapon websites (with a mix of weapon content and random content). A pure set of 793 manually sorted images of weapon content is used for testing. The class probability predictions are made by the Random Forest Classifier from the Scikit-learn library (Pedregosa et al., 2011). Therefore, in the last step, the matrix of extracted features is turned into weapon class probabilities by means of this model.
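A minimal sketch of this second-stage model with scikit-learn, assuming X_weapon and X_random are feature matrices produced as above; the hyperparameters are illustrative.

```python
# Minimal sketch of the second-stage random forest; X_weapon and X_random are
# assumed feature matrices of shape (n_images, 21841), hyperparameters illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.vstack([X_weapon, X_random])
y = np.concatenate([np.ones(len(X_weapon)), np.zeros(len(X_random))])

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
forest.fit(X, y)

# Per-image weapon class probabilities for a new webpage's feature matrix X_new
probabilities = forest.predict_proba(X_new)[:, 1]
```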

The WIBC library works by instantiating a WIBC_weapon object, which loads a deep learning model and a random forest classifier into its memory. The run(url) function then fetches images from a website and returns the probabilities of them belonging to the Weapon category. The WIBC_weapon settings include the maximum number of images to download per website, and whether to use GPU acceleration (with Nvidia CUDA 7.5 or 8.0).
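Illustrative usage of the library as described; the import path and the keyword argument names (max_images, use_gpu) are assumptions made for this sketch.

```python
# Illustrative usage of the WIBC library; module path and keyword names are assumed.
from wibc import WIBC_weapon

classifier = WIBC_weapon(max_images=10, use_gpu=True)    # loads both models once
probabilities = classifier.run('http://example-weapon-shop.com')
print(probabilities)    # one weapon-class probability per retrieved image
```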

2.1. Methodology Under Development

There are three directions of development for the near future. The first one extracts the upper-layer features from a deep network. These features provide a universal and robust image description that is independent of object classes and needs no re-training. The original network computes its classes from them with logistic regression. The proposed solution is to train a regularized linear model from these features towards our classes of interest. Re-training a model on a new dataset would then involve only re-training the linear model, which is very fast.
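A hedged sketch of this re-training scheme with an L2-regularized logistic regression from scikit-learn; the variable names are illustrative.

```python
# Hedged sketch of the proposed fast re-training: the deep network stays fixed,
# only a regularized linear model is fitted on its upper-layer features.
from sklearn.linear_model import LogisticRegression

# features: (n_images, d) array of upper-layer activations from the fixed network
# labels:   class-of-interest label for each image
linear_model = LogisticRegression(C=1.0, max_iter=1000)   # L2-regularized by default
linear_model.fit(features, labels)

# Re-training on a new dataset only refits this linear model, which is fast.
new_probabilities = linear_model.predict_proba(new_features)
```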

The second approach is to store already observed images from each website (in some robust representation), and check all new images against them. Web images are highly repetitive across webpages. Knowing that an image was observed on a website of a particular class provides valuable classification information, may be faster than a deep network, and works with any hard-to-classify image content, such as a website in the Cults category.
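The paper does not specify the robust representation, so the sketch below uses a simple average hash as one illustrative choice for remembering already observed images.

```python
# One possible "robust representation" for remembering observed images: a simple
# 64-bit average hash of a downscaled grayscale thumbnail. This is an illustrative
# choice only; the actual representation is not specified in the paper.
import numpy as np
from PIL import Image

def average_hash(image, size=8):
    """Hash bits are 1 where the 8x8 grayscale thumbnail exceeds its mean value."""
    small = np.asarray(image.convert('L').resize((size, size)), dtype=np.float32)
    bits = (small > small.mean()).flatten()
    return sum(int(b) << i for i, b in enumerate(bits))

seen = {}   # hash -> class previously observed for that image

def lookup_or_classify(image, classify):
    """Reuse the stored class for a known image, otherwise fall back to the classifier."""
    key = average_hash(image)
    if key not in seen:
        seen[key] = classify(image)   # e.g. the deep network + random forest pipeline
    return seen[key]
```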

The third approach is to gather more training data from third-party image datasets. This involves running the WIBC library on large image datasets such as Flickr1M (1 million images) to find images that are probably relevant to a particular category, and then manually filtering those images to extract the relevant ones. Another way to increase the training dataset is image rotation and mirroring, as sketched below.
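A small sketch of the rotation and mirroring augmentation with PIL; the set of angles is an illustrative choice.

```python
# Simple rotation and mirroring augmentation of a training image with PIL.
from PIL import Image

def augment(image):
    """Return the original image plus a mirrored copy and three rotated variants."""
    variants = [image, image.transpose(Image.FLIP_LEFT_RIGHT)]
    variants += [image.rotate(angle, expand=True) for angle in (90, 180, 270)]
    return variants
```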

Machine learning in health-care applications is a field that has been studied, but one that still offers great opportunities for further development. Björk et al. (2016) presented the outline of the Huntington disease prediction project, whereas the actual missing value imputation framework for more accurate prediction of the Huntington disease was presented by Akusok et al. (2017).

3. Experimental Results

Figure 1. Results for a curated list of webpages with weapons content, in ascending order. We retrieved a maximum of ten images (when possible) and assigned each a probability of belonging to the weapons class. The mean value, assuming a Gaussian distribution of probabilities, is plotted on the right side of the main graph.

We tested our implementation on a curated list of webpages18. The list includes a collection of 1006 websites with content related to weapons, 1010 related to alcohol, 715 to dating, 999 to shopping and 504 to random internet content (such as nasa.gov, google.com, etc.). On average, we could not access 9% of the webpages, mainly due to web crawling protection or non-existing domains.

The results for the list of weapon-related webpages are presented in Fig. 1. The results are plotted in ascending order of the mean value of the obtained probabilities for each webpage (see figure). We analyzed a random subset of at most ten images per webpage. The algorithm assigns a probability from 0 to 1 of belonging to the weapons class (from white to blue in the color scale). On average, the result for a given webpage takes around 10 seconds; however, this time can be tuned depending on the user configuration (timeout, connection requests) and the internet connection speed. The download process can be done in parallel depending on the capabilities of the machine, and a GPU option is included for extracting the image features.

After collecting the results for all the webpages, the question to solve is: what criterion should decide whether a webpage belongs to the weapons class or not? We have proposed two methods. The first uses the mean value of the probabilities, assuming that for a given webpage they follow a Gaussian distribution; the webpage is assigned to the weapons class if this mean exceeds a given threshold. The second method classifies the webpage as weapons if at least a number n19 of the obtained images belong to the weapons class for a given threshold. Both decision rules are sketched below.
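A sketch of the two decision rules applied to the per-image weapon probabilities of one webpage; the function names are illustrative.

```python
# Sketch of the two webpage-level decision rules; names are illustrative.
import numpy as np

def method_mean(probabilities, threshold):
    """Method 1: the webpage is 'weapons' if the mean image probability exceeds the threshold."""
    return np.mean(probabilities) >= threshold

def method_at_least_n(probabilities, threshold, n=1):
    """Method 2: the webpage is 'weapons' if at least n images exceed the threshold."""
    return np.sum(np.asarray(probabilities) >= threshold) >= n
```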

Figs. 2 and 3 show the Precision-Recall and ROC curves for the two methods. The curve for the first method is plotted with magenta squares, while for the second method the result for each value of n is plotted with colored circles.

Figure 2. Precision-Recall curves for the different methods of measuring the presence of weapons in the studied websites. The first method is presented with magenta squares and the second method, for values of n from 1 to 9, with colored circles.
Figure 3. ROC curves for the different methods of measuring the presence of weapons in the studied websites. The first method is presented with magenta squares and the second method, for values of n from 1 to 9, with colored circles.

Following the plots, it is clear that the best method, with the best Precision-Recall trade-off and the lowest false positive rate, is the second proposed method with n = 1. The respective balance point (where Sensitivity = Specificity) is equal to 0.81 and is obtained when the probability threshold is around 0.41. The second best result is obtained with the second method at n = 2. The first method is ranked third.
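A hedged sketch of how such a balance point can be located with scikit-learn, assuming y_true holds the per-webpage labels and scores holds the per-webpage maximum image probability (method 2 with n = 1).

```python
# Hedged sketch of locating the balance point (sensitivity = specificity) on a ROC curve.
# y_true and scores are assumed webpage labels and per-webpage maximum image probabilities.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, scores)
balance_index = np.argmin(np.abs(tpr - (1.0 - fpr)))    # sensitivity closest to specificity
print(thresholds[balance_index], tpr[balance_index])    # threshold and balanced value
```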

4. Conclusion and further research

In the classification of webpages, the analysis of the content of images is the best criterion so far. We propose that it is possible to classify a webpage into the weapons class from a single randomly selected image of the webpage belonging to this class. The threshold for this probability can be balanced depending on the precision or false positive rate criterion.

Two highlights of the proposed method are high accuracy in image content analysis, and an extreme simplicity of implementation. The most complex part, image feature extraction, is performed by a state-of-the-art deep learning model pre-trained on millions of images, which currently can be obtained with a single line of code20. This saves weeks of development time on model structure selection, training parameter tuning and the training itself. Random Forest is a robust classifier that learns in seconds from arbitrary input features. The simplicity of implementation allows for flexible application of the proposed method to any problem domain, and for easy scalability.
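For example, the Keras applications referenced in footnote 20 provide such one-line access to a pre-trained network; the model below differs from the one used in this paper and is shown purely to illustrate the point.

```python
# Example of obtaining a pre-trained deep learning model with a single line of code
# via Keras applications (footnote 20); this is not the model used in the paper.
from keras.applications.inception_v3 import InceptionV3

model = InceptionV3(weights='imagenet')   # downloads ImageNet-pretrained weights
```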

Further research includes the exploration of the proposed method for the classification of images in the health-care sector. In general, machine learning techniques are very valuable in the health-care domain (e.g. the missing value imputation example in (Björk et al., 2016; Akusok et al., 2017)). In this case the possible application areas may differ from Huntington disease prediction. As discussed, many diagnostic images are processed in the health-care system, for example for brain tumor detection (Termenon et al., 2016). In addition, preventive health-care processes, such as filtering content for people who are trying to quit smoking, can be of extreme value.

5. Acknowledgment

This material is based on work jointly supported by the Tekes project Cloud-assisted Security Services (CloSer). The authors would like to thank their collaborators from F-Secure for sharing the scientific research data.

Footnotes

  1. https://bitbucket.org/akusok/wibc
  14. Python package https://pypi.python.org/pypi/mxnet,
    accelerated version https://pypi.python.org/pypi/mxnet-cu80mkl
  15. http://lassie.readthedocs.io
  16. https://pebble.readthedocs.io
  17. https://github.com/dmlc/mxnet-model-gallery/blob/master/imagenet-21k-inception.md
  18. F-Secure, private communication
  19. The number n can be 1, 2, 3, and so on up to 10, the maximum number of retrieved images.
  20. https://keras.io/applications

References

  1. Akusok et al. (2017). Brute-force missing data extreme learning machine for predicting Huntington's disease. In Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, PETRA '17, New York, NY, USA, pp. 189–192.
  2. Ali et al. (2017). A fuzzy ontology and SVM-based web content classification system. IEEE Access 5, pp. 25781–25797.
  3. Björk et al. (2016). A new application of machine learning in health care. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, PETRA '16, New York, NY, USA, pp. 49:1–49:4.
  4. Burget and Rudolfova (2009). Web page element classification based on visual features. In 2009 First Asian Conference on Intelligent Information and Database Systems, pp. 67–72.
  5. Chen et al. (2015). MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274.
  6. Coughlin et al. (2017). A systematic review of studies of web portals for patients with diabetes mellitus. mHealth 3 (6).
  7. Cronin et al. (2015). Automated classification of consumer health information needs in patient portal messages. AMIA Annu Symp Proc 2015, pp. 1861–1870.
  8. de Boer et al. (2011). Web page classification using image analysis features. In Web Information Systems and Technologies: 6th International Conference, WEBIST 2010, Valencia, Spain, April 7-10, 2010, Revised Selected Papers, pp. 272–285.
  9. Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  10. Griffiths (2001). Sex on the internet: observations and implications for internet sex addiction. The Journal of Sex Research 38 (4), pp. 333–342.
  11. Hirokawa and Ishita (2014). Non-topical classification of healthcare information on the web. In Smart Digital Futures 2014, Frontiers in Artificial Intelligence and Applications, Vol. 262, pp. 237–247.
  12. Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456.
  13. Milicka and Burget (2013). Web document description based on ontologies. In Informatics and Applications (ICIA), 2013 Second International Conference on, pp. 288–293.
  14. Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
  15. Qi and Davison (2009). Web page classification: features and algorithms. ACM Comput. Surv. 41 (2), pp. 12:1–12:31.
  16. Termenon et al. (2016). Brain MRI morphological patterns extraction tool based on Extreme Learning Machine and majority vote classification. Neurocomputing 174, Part A, pp. 344–351.
  17. Vanetti et al. (2011). Content-based filtering in on-line social networks. In Privacy and Security Issues in Data Mining and Machine Learning: International ECML/PKDD Workshop, PSDML 2010, Barcelona, Spain, September 24, 2010, Revised Selected Papers, pp. 127–140.