CommentWatcher: An Open Source Web-based platform for analyzing discussions on web forums
We present CommentWatcher, an open source tool aimed at analyzing discussions on web forums. Constructed as a web platform, CommentWatcher features automatic mass fetching of user posts from forums on multiple sites, extracting topics, visualizing the topics as an expression cloud and exploring their temporal evolution. The underlying social network of users is simultaneously constructed using the citation relations between users and visualized as a graph structure. Our platform addresses the diversity and dynamics of the structures of webpages hosting the forums by implementing a parser architecture that is independent of the HTML structure of webpages. This allows new websites to be added easily on the fly. Two types of users are targeted: end users who seek to study the discussed topics and their temporal evolution, and researchers who need to establish a forum benchmark dataset and compare the performance of analysis tools.
ERIC Lab, University Lyon2
Categories and Subject Descriptors H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services ; I.2.7 [Artificial Intelligence]: Natural Language Processing—Language parsing and understanding, Text analysis ; H.3.5 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering, Selection process
Social media analysis, topic extraction, visualization
Web 2.0 has changed the way users discuss with one another. Web forums are among the preferred online discussion environments: users can react, post their opinions, and discuss and debate any kind of subject. Forums are usually thematic (e.g. Java programming forums), and new users have access to past discussions (e.g. solutions posted by other users to a specific problem). Users therefore become full collaborative participants in the information creation process. The subjects of discussion are very dynamic, and the overall sum of reactions gives a snapshot of the general trends that emerge in the user population. At the same time, the way users reply to one another suggests an underlying social network structure. The forum’s “reply-to” structural relations can be used to add links between users. Other types of relations can be added, such as name and textual citations [?]. Furthermore, based on such social networks constructed from web forums, adapted graph measures can be used to detect user social roles [?].
These forum data remain underexplored, even though they represent an important source of knowledge. News article analysis and microblogging (e.g. Twitter) analysis receive a lot of attention from the community. Tools are available that analyze news media [?], but they do not treat the social network aspect. Other tools concentrate on analyzing and visualizing social dynamics [?] or on detecting events [?] based on Twitter data. To the best of our knowledge, there are no publicly available tools that treat forums while inferring a social network structure.
Another limitation concerns forum benchmarks. There is a multitude of general purpose information retrieval datasets (e.g. the ClueWeb12 dataset of the Lemur project, http://lemurproject.org/clueweb12/specs.php) and of Twitter datasets (e.g. the infochimps collections, http://www.infochimps.com/collections/twitter-census). But dedicated web forum benchmark datasets are scarce. Those that exist usually come from a single forum website (e.g. the boards.ie Forums Dataset, http://www.icwsm.org/2012/submitting/datasets/, based on the boards.ie website, or the Ancestry.com Forum Dataset, http://www.cs.cmu.edu/~jelsas/data/ancestry.com/, based on the ancestry.com website). This is due to the diverse and ever-changing structure of the websites hosting the discussions and to copyright problems. Each host website has its own license on the user-produced data, which is not always clearly stipulated. This leads researchers to develop their own in-house parsers and create their own datasets. These datasets are rarely shared with the community, which poses problems when testing new proposals and comparing them to existing approaches.
We address these issues by introducing CommentWatcher, an open source web-based platform for analyzing discussions on web forums. CommentWatcher was designed with two types of users in mind: the forum analyst, who seeks to understand the main topics of discussion and the social interactions between users, and the researcher, who needs a benchmark to test his/her proposed approaches. Using CommentWatcher, the researcher can create forum discussion benchmarks without worrying about copyright issues, since the platform is open source and the text itself is not distributed (each researcher can locally recreate the benchmark dataset).
When building CommentWatcher, we addressed the challenges that arise from retrieving forums from multiple web sources. Not only are these sources profoundly heterogeneous in structure, but they also tend to change often, which renders parsers obsolete. We implement a parser architecture that is independent of the website structure and allows simple on-the-fly adding of new sources and updating of existing ones. CommentWatcher also supports mass fetching of forums from supported sources by using keyword search on the internet, extracting discussion topics, creating the underlying social network structure of users and visualizing it in relation to the extracted topics.
During the demonstration, the participants will be able to interact with CommentWatcher in a normal browser window, through the tool’s web interface. The tool itself will be hosted and executed on its dedicated machine, located at the ERIC laboratory. The tool’s capabilities will be illustrated by showing the participants, live, (a) how multiple discussion forums can be fetched by searching the web using keywords, (b) how topic extraction algorithms are applied and their parameters tweaked, (c) how the extracted topics are visualized as an expression cloud, together with their temporal evolution, and (d) how the social network constructed from the initial forums is visualized.
In this section, we describe the software technologies used in developing CommentWatcher, the general architecture and the different components to highlight their aim and the way they interact.
CommentWatcher is written using Java Servlets for server-side computing and JavaServer Pages for dynamic webpage generation. The support for fetching forum discussions from websites is implemented using the XSL Transformation technology. New websites can be added dynamically, without changing the source code. A MySQL database is used for storing the forum structure, the user characteristics and the text. The visualization is performed client-side in a Java Applet.
The application has three main modules, interconnected as shown in the architecture figure. The fetching module deals with downloading the forums, parsing the web pages and storing the data into the database. Optionally, it can perform a keyword web search to find forums that can be fetched. The topic extraction module performs topic extraction on a selection of forums, using an algorithm implemented as a library. The visualization module has two views: (i) topic visualization as an expression cloud and as a temporal evolution graphic and (ii) social network visualization.
Figure: The design of the fetching module (a) and a screenshot of the keyword mass fetching process (b)
This module deals with downloading, parsing and importing the forum data into the application. The main difficulty when parsing web pages is that the structure of each page is different. What is more, the structure of a given web page tends to change over time. With CommentWatcher we have designed and implemented a meta-parser, which is independent of the website. The actual adaptation of the parser to a specific page is done using an external definition file, implemented in XSLT, a standardized and well documented language. Therefore, adding support for new websites or modifying existing ones boils down to adding or modifying definition files, without any change in the parser’s source code.
The design schema of the fetching module, as well as its interactions with the user interface and the database, are given in panel (a) of the fetching module figure. The download action specifies the URL of a forum to be downloaded. The bulk download follows the same idea, but a keyword web search is performed using the Bing API and all results from supported websites are downloaded. A screenshot of the keyword web search and mass fetching is given in panel (b) of the same figure. The specified page is downloaded in raw HTML format and then undergoes cleaning, XSL transformation and deserialization. Cleaning transforms the HTML document into well-formed XML. In the following step, the XSL transformation is applied to the well-formed XML document using the XSLT definition file of the corresponding supported website. The result of the transformation is an XML document that uses the same XML schema for all supported websites. The required data is then deserialized into Java objects, which can subsequently be stored in and retrieved from the database.
The advantages of implementing such a parsing process are that it is simple, reliable, easy to understand and modify. Furthermore, it does not hard-code the website’s structure and it allows adding new supported websites on-the-fly.
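The XSL transformation step of this pipeline can be sketched with the standard `javax.xml.transform` API. This is a minimal illustration, not CommentWatcher's actual code: the class name, the embedded definition file, the source page markup (`div class="comment"`) and the target schema (`forum`/`post`) are all hypothetical, and the input is assumed to have already been cleaned into well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ForumTransformSketch {

    // Hypothetical per-website definition file: it maps one site's markup
    // onto an assumed common <forum>/<post> schema.
    static final String SITE_DEFINITION =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<forum><xsl:for-each select=\"//div[@class='comment']\">"
      + "<post author=\"{span[@class='user']}\"><xsl:value-of select=\"p\"/></post>"
      + "</xsl:for-each></forum>"
      + "</xsl:template></xsl:stylesheet>";

    // Applies a site-specific XSLT definition to an already cleaned
    // (well-formed) page and returns the document in the common schema.
    static String toCommonXml(String cleanedPage, String xsltDefinition) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsltDefinition)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(cleanedPage)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the output of the cleaning step.
        String cleaned = "<html><body>"
            + "<div class=\"comment\"><span class=\"user\">alice</span><p>First post</p></div>"
            + "<div class=\"comment\"><span class=\"user\">bob</span><p>A reply</p></div>"
            + "</body></html>";
        System.out.println(toCommonXml(cleaned, SITE_DEFINITION));
    }
}
```

In such a design, supporting another website only requires supplying a different definition string or file to `toCommonXml`; the Java code stays unchanged.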
This module allows extracting topics from texts from a selection of forums, already fetched in the database. The design is modular, the extraction itself being performed by external libraries. The text from selected forums is prepared and packaged in the format required by the topic extraction library and then passed to the library. The user interface allows setting the parameters for each library. Once the extraction is finished, the results are saved into an XML document, which has the same format for all topic extraction libraries. The XML document contains the expressions associated to each topic and their scores.
At present, CommentWatcher supports two topic extraction algorithms, provided by two libraries: Topical N-Grams [?], provided by the Mallet Toolkit library [?], and CKP [?], provided by the CKP library. Topical N-Grams is a graphical model algorithm, which models topics as probability distributions over n-grams. CKP uses overlapping textual clustering (one text can belong to multiple clusters) and considers each cluster of the partition as a topic. The expressions stored in the XML result document are either (i) the resulting n-grams (for Topical N-Grams) or (ii) the frequent expressions (for CKP). Their score is (i) the probability with which an n-gram is associated with a topic (for Topical N-Grams) or (ii) a score derived from the normalized distance between the frequent expression and the topic’s centroid (for CKP). Support for new algorithms and libraries can be added easily, but it requires writing adapters for the inputs and outputs.
The visualization module is designed to help the user quickly understand the extracted topics and visualize their temporal evolution. It is the only module that is executed client-side, in a Java Applet. After the XML object resulting from the topic extraction is loaded by the applet, two visualizations are available: the expression cloud and the temporal evolution graphic. The accompanying figure shows a screenshot with the two visualizations. The expression cloud visualization is similar to the word cloud visualization, with the exception that it uses the expressions generated by the topic extraction module and that their sizes are proportional to their scores. The temporal evolution graphic portrays the popularity of each topic over time. Time is discretized into a configurable number of intervals, the user posts associated with each topic in each interval are counted, and graphics are generated for each forum or for each hosting website.
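The counting step behind the temporal evolution graphic can be sketched as follows. This is only an illustration under stated assumptions: the `Post` type, the use of millisecond timestamps and the choice of equal-width intervals between the earliest and latest post are hypothetical, since the text only states that time is discretized into a configurable number of intervals.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopicTimelineSketch {

    // Hypothetical minimal view of a post: when it was written and the
    // topic it was associated with by the extraction step.
    static final class Post {
        final long timestampMillis;
        final String topic;

        Post(long timestampMillis, String topic) {
            this.timestampMillis = timestampMillis;
            this.topic = topic;
        }
    }

    // Splits the period covered by the posts into equal-width intervals
    // (an assumption) and counts, per topic, the posts in each interval.
    static Map<String, int[]> countPerInterval(List<Post> posts, int intervals) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        if (posts.isEmpty() || intervals < 1) {
            return counts;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (Post p : posts) {
            min = Math.min(min, p.timestampMillis);
            max = Math.max(max, p.timestampMillis);
        }
        // The +1 keeps the latest post inside the last interval.
        long span = max - min + 1;
        for (Post p : posts) {
            int bin = (int) ((p.timestampMillis - min) * intervals / span);
            counts.computeIfAbsent(p.topic, k -> new int[intervals])[bin]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Post> posts = Arrays.asList(
            new Post(0L, "java"), new Post(10L, "java"),
            new Post(50L, "xml"), new Post(99L, "java"));
        countPerInterval(posts, 2).forEach(
            (topic, perInterval) ->
                System.out.println(topic + " " + Arrays.toString(perInterval)));
    }
}
```

Each resulting array is one curve of the graphic; running the same count on the posts of a single forum or of a single hosting website would give the per-forum and per-website variants.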
To facilitate the exploration of the interactions between the members of the forum, we compute a visualization of the underlying social network. The network is colored according to the topics on which the users are interacting. We construct the social network as a labeled multidigraph, as shown in [?]. The network nodes correspond to the authors of messages. We add an arc labeled with a topic between two nodes when there is at least one direct reply between the two users belonging to that topic. We further enrich the network with user features, such as the number of posts, the number of topics a user participates in and the number of threads a user participates in. Further measures are calculated on the graph, such as the weighted in- and out-degree, the betweenness centrality and the closeness centrality.
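The construction described above can be illustrated with a small topic-labeled multidigraph. This is a sketch rather than the platform's implementation: the class and method names are hypothetical, and the arc weight is assumed to be the number of direct replies on that topic, which the text does not specify.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ReplyNetworkSketch {

    // arcs.get(from).get(to).get(topic) = number of direct replies from
    // `from` to `to` within `topic` (the weight is an assumption).
    private final Map<String, Map<String, Map<String, Integer>>> arcs = new HashMap<>();

    // Records one direct reply; the first reply on a topic creates the arc,
    // later replies on the same topic only increase its weight.
    void addReply(String from, String to, String topic) {
        arcs.computeIfAbsent(from, k -> new HashMap<>())
            .computeIfAbsent(to, k -> new HashMap<>())
            .merge(topic, 1, Integer::sum);
    }

    // Number of parallel arcs from one user to another, one per topic.
    int parallelArcs(String from, String to) {
        return arcs.getOrDefault(from, Collections.emptyMap())
                   .getOrDefault(to, Collections.emptyMap())
                   .size();
    }

    // Sum of the weights of all arcs leaving the user.
    int weightedOutDegree(String user) {
        int total = 0;
        for (Map<String, Integer> byTopic
                : arcs.getOrDefault(user, Collections.emptyMap()).values()) {
            for (int weight : byTopic.values()) {
                total += weight;
            }
        }
        return total;
    }

    // Sum of the weights of all arcs pointing to the user.
    int weightedInDegree(String user) {
        int total = 0;
        for (Map<String, Map<String, Integer>> targets : arcs.values()) {
            Map<String, Integer> byTopic = targets.get(user);
            if (byTopic != null) {
                for (int weight : byTopic.values()) {
                    total += weight;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ReplyNetworkSketch network = new ReplyNetworkSketch();
        network.addReply("bob", "alice", "java");
        network.addReply("bob", "alice", "java");
        network.addReply("bob", "alice", "xml");
        network.addReply("carol", "alice", "java");
        System.out.println(network.parallelArcs("bob", "alice"));
        System.out.println(network.weightedOutDegree("bob"));
        System.out.println(network.weightedInDegree("alice"));
    }
}
```

Keeping one arc per topic between the same pair of users is what makes the structure a multigraph, and it is also what allows the network to be filtered or colored by topic.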
The social network figure shows how CommentWatcher displays the induced social network. The visualization is created with the Jung Graph Library (http://jung.sourceforge.net) and is interactive, so nodes can be selected in order to see their features. Relations can also be filtered in order to show only the network corresponding to certain topics.
CommentWatcher is released under the open source GNU GPL v3 license (http://www.gnu.org/licenses/). The individual topic extraction and textual clustering software packages are subject to their respective licenses. The present version of CommentWatcher comes with two Natural Language Processing toolkits: the Mallet Toolkit [?] v2.0.7, released under the open source Common Public License, and CKP [?] v0.2, released under the GNU GPL v3. The install files and the source code of CommentWatcher are available through a public Mercurial repository (http://eric.univ-lyon2.fr/~commentwatcher/cgi-bin/CommentWatcher.cgi/CommentWatcher/).
Several tools intended to extract knowledge from on-line discussions have been proposed in recent years.
MAQSA [?] is a system for social analytics on news that allows its users to define their own topic of interest, in order to gather related articles, identify related topics, and extract the time-line and network of comments that show who commented which article and when.
Eddi [?] offers visualizations such as time-lines and tag clouds of topics extracted from tweets using a simple topic detection algorithm that uses a search engine as an external knowledge base.
OpinionCrawl (http://opinioncrawl.com) is an on-line service that crawls various web sources, such as blogs, news, forums and Twitter, searching for a user-defined topic. It then presents key concepts as a tag cloud, provides a visualization of the temporal dynamics of the topic and performs a sentiment analysis.
SONDY [?] is an open-source platform for analyzing on-line social network data. It features a data import and pre-processing service, a topic detection and trends analysis service, as well as a service for the interactive exploration of the corresponding networks (i.e., active authors for the considered topic(s)).
The aforementioned tools are limited for various reasons. Either they are proprietary software and thus cannot be extended for scientific purposes, or they cannot directly crawl web sources and can only analyze formatted datasets provided by the user. CommentWatcher intends to provide researchers with an open-source, extensible tool that can crawl the web and build datasets that suit their needs.
In this paper we have presented CommentWatcher, an open source web-based platform for analyzing discussions on web forums. Our tool is designed for both end users and researchers. End users have at their disposal an easy-to-use, integrated tool that retrieves forum discussions from multiple websites, performs topic extraction to identify the main discussion topics and provides an expression cloud visualization to identify the most important expressions associated with each topic. The temporal popularity of topics can be evaluated using an evolution graphic. CommentWatcher also extracts the underlying social network by using the direct citation links between users. The visualization of the social network is interactive: features of nodes can be visualized and relations can be filtered to show only the network corresponding to a certain topic. For researchers, CommentWatcher tackles the problem of creating multi-source web forum datasets, thanks to its versatile parser, which is independent of the structure of webpages. Support for new websites can be added on the fly. It can also solve the problem of copyright when sharing forum datasets, since no text is distributed and each researcher can easily recreate the dataset. As future work, we intend to add a credential mechanism and transform CommentWatcher into a multiuser tool. We are also considering topic evaluation based on ontologies of concepts and a better plotting of the social network using force-directed graph drawing.
-  S. Amer-Yahia, S. Anjum, A. Ghenai, A. Siddique, S. Abbar, S. Madden, A. Marcus, and M. El-Haddad. Maqsa: a system for social analytics on news. In SIGMOD ’12, pages 653–656. ACM, 2012.
-  N. Anokhin, J. Lanagan, and J. Velcin. Social citation: Finding roles in social networks. an analysis of tv-series web forums. In Workshop on Mining Communities and People Recommenders, pages 49–56, 2012.
-  M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In UIST ’10, page 303, 2010.
-  M. Forestier, J. Velcin, and D. Zighed. Extracting social networks to understand interaction. In ASONAM ’11, pages 213–219. IEEE, 2011.
-  A. Guille, C. Favre, H. Hacid, and D. Zighed. Sondy: An open source platform for social dynamics mining and analysis. In SIGMOD ’13, 2013.
-  A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C. Miller. Twitinfo: aggregating and visualizing microblogs for event exploration. In CHI ’11, pages 227–236, 2011.
-  A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
-  M.-A. Rizoiu, J. Velcin, and J.-H. Chauchat. Regrouper les données textuelles et nommer les groupes à l’aide des classes recouvrantes. In EGC ’10, page 561, 2010.
-  X. Wang, A. McCallum, and X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In ICDM ’07, page 697, 2007.