Comparing the hierarchy of author given tags and repository given tags in a large document archive

Gergely Tibély (tibelyg@hal.elte.hu), Dept. of Biological Physics, Eötvös University, H-1117 Budapest, Hungary
Péter Pollner, MTA-ELTE Statistical and Biological Physics Research Group, Hungarian Academy of Sciences, H-1117 Budapest, Hungary
Gergely Palla, MTA-ELTE Statistical and Biological Physics Research Group, Hungarian Academy of Sciences, H-1117 Budapest, Hungary

Abstract

Folksonomies – large databases arising from collaborative tagging of items by independent users – are becoming an increasingly important way of categorizing information. In these systems users can tag items with free words, resulting in a tripartite item-tag-user network. Although there are no prescribed relations between tags, the way users think about the different categories presumably has some built-in hierarchy, in which more special concepts are descendants of more general categories. Several applications would benefit from the knowledge of this hierarchy. Here we apply a recent method to examine the differences and similarities between hierarchies resulting from tags given by independent individuals and from tags given by a centrally managed repository system. The results show substantial differences between the lower parts of the hierarchies and, in contrast, a relatively high similarity at the top of the hierarchies.

Keywords: tag, hierarchy, ontology reconstruction, folksonomy, knowledge mapping

1 Introduction

The recent appearance of tags in large online datasets represents a significant innovation in categorisation [1, 2, 3]. Tags allow multiple categories for each item, and tagging can be done in a bottom-up approach, in a parallel manner, by several users simultaneously [4, 5, 6]. This feature allows the tagging of huge datasets in a reasonable time. In contrast, traditional hierarchical categorisation typically allows one category per item, and it is done by a few experts, slowing down the process. Also, available categories are restricted in traditional expert-made hierarchies, while user given tags are usually allowed to take any expression deemed relevant by the user.

Although there is no prescribed structure between the tags, it is a reasonable assumption that tags are attached to objects according to hidden hierarchical relations, e.g., “poodle” is usually considered as a special case of “dog”. Consequently, it is an interesting non-trivial task to extract this implicit hierarchy from the co-appearance of tags solely. Indeed, a number of different methods have already been proposed in the literature, such as aggregation of user-defined shallow hierarchies for obtaining a global hierarchy [7, 8], integration of information from as many sources as possible [9], using a probabilistic criterion to define parent-child relations [10], applying pairwise similarities to centrality-ordered tags [11], or building up the hierarchy from bottom up based on the z-score between the tags [12].

Beside the organisation of different keywords or categories describing a given topic, signs of hierarchy are prevalent in a very wide range of systems. Among others, the transcriptional regulatory network of Escherichia coli [13], the dominant-subordinate hierarchy among crayfish [14], the leader-follower network of pigeon flocks [15], the rhesus macaque kingdoms [16], neural networks [17], technological networks [18], social interactions [19, 20, 21], urban planning [22, 23], ecological systems [24, 25], and evolution [26, 27] all show signs of hierarchical organisation. Different approaches were introduced to uncover hierarchy in networks, including the introduction of hierarchy measures [28, 29, 30, 31], statistical inference of hierarchy [32] and construction of hierarchical network models [33].

Here we analyse the hierarchies obtained for the scientific keywords from the Web of Science [34] by applying a recent generalisation of the method given in Ref.[12] presented in [35]. We treat the set of author given tags and the set of repository given tags separately, resulting in two alternative hierarchies. These are compared to each other and also to the 3-level classification of categories given by the Web of Science. The organisation of the paper is the following: in Sect. 2 we introduce the tag hierarchy construction methodology and describe the datasets to which it is applied. The obtained hierarchies are presented in Sect. 3, while the results are discussed in Sect. 4.

2 Materials and Methods

2.1 Tag hierarchy construction

In order to obtain a tag hierarchy, we will follow the method described in [12] and [35], for which a quick overview is provided here.

Given a set of objects, each having a set of tags, the goal is to construct a hierarchy, i.e., a directed acyclic graph (DAG) of the tags, where links are directed from more general concepts to more special ones. Our method constructs a hierarchy in two steps: first the tags are ordered, defining which tag should be placed higher in the hierarchy and which lower, then for each tag an appropriate parent is chosen. Note that in the second step we allow more than one parent to be chosen for a tag; hence the resulting hierarchy can be more complex than a simple tree.

For the reader who is not familiar with the method [12], we briefly summarize the main steps below. First, we rank the tags according to their eigenvector centrality in the tag co-appearance graph. Nodes in the co-appearance graph correspond to the tags, and links represent the co-appearances of the tags on the same object. The weights of the links are given by the number of co-appearances. However, when calculating the eigenvector centrality, links having a z-score below a certain threshold value are neglected. The z-score is calculated as the observed number of objects where the two tags co-appear minus the expected number of co-occurrences when tags are randomly shuffled. The z-score is normalized by the standard deviation of random co-occurrences,

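$$ z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle}{\sigma_{ij}}, \tag{1} $$

where $n_{ij}$ is the number of times tags $i$ and $j$ co-appear, and $\langle n_{ij} \rangle$ and $\sigma_{ij}$ are the expected value and standard deviation, respectively, for randomly reshuffled tags.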
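
The ranking step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the null model is an assumption (random placement of tags that preserves each tag's frequency, which gives a hypergeometric co-occurrence count). The centrality is obtained by a plain power iteration on the thresholded, count-weighted co-appearance graph.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pair_statistics(objects):
    """Return (co-occurrence counts, z-scores) for every co-occurring tag pair.

    `objects` is a list of tag sets. ASSUMPTION: the random reference is a
    hypergeometric model that keeps the frequency of every tag fixed.
    """
    n_objects = len(objects)
    tag_count, pair_count = Counter(), Counter()
    for tags in objects:
        tags = sorted(set(tags))
        tag_count.update(tags)
        pair_count.update(combinations(tags, 2))

    z = {}
    if n_objects < 2:  # the variance below is undefined for a single object
        return pair_count, z
    for (a, b), observed in pair_count.items():
        n_a, n_b = tag_count[a], tag_count[b]
        mean = n_a * n_b / n_objects
        var = (n_a * n_b * (n_objects - n_a) * (n_objects - n_b)
               / (n_objects ** 2 * (n_objects - 1)))
        if var > 0:  # zero when one of the tags sits on every object
            z[(a, b)] = (observed - mean) / sqrt(var)
    return pair_count, z


def eigenvector_centrality(pair_count, z, z_min, n_iter=200):
    """Power iteration on the co-appearance graph, ignoring links with z < z_min."""
    links = {pair: weight for pair, weight in pair_count.items()
             if z.get(pair, float("-inf")) >= z_min}
    if not links:
        return {}
    score = {tag: 1.0 for pair in links for tag in pair}
    for _ in range(n_iter):
        # Iterate with A + I: same eigenvectors as A, but no oscillation.
        new = dict(score)
        for (a, b), weight in links.items():
            new[a] += weight * score[b]
            new[b] += weight * score[a]
        top = max(new.values())
        score = {tag: value / top for tag, value in new.items()}
    return score
```

Sorting the returned dictionary by value gives the tag order used in the second step; the threshold `z_min` is a free parameter of this sketch.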

In the second step the hierarchy is built according to a bottom-up approach, i.e., we look for the parents of each tag in ascending order of eigenvector centrality. We choose a tag $j$ to be the first parent of tag $i$ when $j$ has higher eigenvector centrality than $i$ and has the maximal score among the possible parents. The score here is the sum of the z-scores of the links between the candidate parent and the descendants of $i$, and between the candidate parent and $i$ itself. Note that by aggregating the descendants’ z-scores, we take into account much more information than any pairwise similarity metric can provide. Finally, we allow further parents if they have links to $i$ with at least as high a z-score as the first parent.
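
The bottom-up parent selection can be sketched as follows. This is an illustrative simplification rather than the published algorithm: the function name is hypothetical, only the first parent is chosen (the rule for additional parents is omitted), and requiring a strictly positive score is an assumption of the sketch.

```python
def first_parents(z, centrality):
    """Pick one parent per tag, processing tags from least to most central.

    `z` maps unordered tag pairs (a, b) to z-scores; `centrality` maps each
    tag to its eigenvector centrality. Returns {tag: parent or None}.
    """
    def link(a, b):
        # Missing links count as z = 0 (assumption of this sketch).
        return z.get((a, b), z.get((b, a), 0.0))

    order = sorted(centrality, key=centrality.get)  # least central first
    descendants = {tag: set() for tag in order}
    parent = {}
    for tag in order:
        family = descendants[tag] | {tag}
        best, best_score = None, 0.0
        for candidate in order:
            if centrality[candidate] <= centrality[tag]:
                continue  # parents must be strictly more central
            # Aggregate z-scores towards the tag and all of its descendants.
            score = sum(link(candidate, member) for member in family)
            if score > best_score:
                best, best_score = candidate, score
        parent[tag] = best
        if best is not None:
            descendants[best] |= family
    return parent


z = {("animal", "dog"): 8.0, ("dog", "poodle"): 5.0,
     ("animal", "poodle"): 3.0, ("animal", "cat"): 6.0}
centrality = {"animal": 1.0, "dog": 0.9, "poodle": 0.6, "cat": 0.5}
hierarchy = first_parents(z, centrality)
```

In this toy input "poodle" attaches to "dog" because its direct link to "dog" is stronger, while "dog" attaches to "animal" with a score that also includes the z-score between "animal" and "poodle".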

2.2 Dataset

We study the keywords of scientific papers published between 1975 and 2011, obtained from the Web of Science. The dataset contains 35 371 214 papers, which are tagged by three types of tags. The first type (heading) gives a very broad categorisation of the paper; there are only 5 tags of this type: Arts & Humanities, Life Sciences & Biomedicine, Multidisciplinary Science & Technology, Physical Sciences and Social Sciences. The second type (category) comprises 251 more fine-grained scientific areas such as “Chemistry, Analytical” or “Engineering, Geological”. Tags of the third type are chosen from two sets of specific phrases. One set is composed of the keywords given by the authors of the papers. The other set is given by the Web of Science service and is intended to be complementary to the author-given keywords. We will refer to the former as authorkeywords and to the latter as woskeywords. There is a huge number of third-type tags: the woskeywords set contains 2 245 143 phrases and the authorkeywords set contains 6 891 089, which are very specific, like Zygapophyseal arthritis or H-3 -R-alpha-methylhistamine binding. Although these keywords are intended to be complementary on the level of individual papers, 883 836 of them still appear in both the woskeywords and the authorkeywords sets. Finally, we note that the Web of Science does not define any hierarchical relations between the tags, i.e., the ancestors or descendants of the tags are not given in the dataset; only the categorization into the three major types is provided.

3 Results

The aim here is to apply the methodology of Sec. 2.1 to the data described in Sec. 2.2, in order to study the differences and similarities of hierarchies resulting from tags given by independent individuals and from tags given by a repository. In the first case the input of the hierarchy reconstruction is given by the heading, category and authorkeyword tags, while in the second case by the heading, category and woskeyword tags. Note that the general and intermediately general tag types are common to both datasets, and these tags are given by the repository management system. The difference between the independent tagging and the centrally managed tagging comes from the most numerous third-level tags. We compare the hierarchies of the two taggings below. First we compare the uppermost parts of the reconstructed DAGs. Then the hierarchy level occupation statistics of the DAGs are compared for each tag type. Finally, the horizontal (branching) structures of the DAGs are analysed.
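
As an illustration of how the two inputs differ, the sketch below assembles the per-paper tag sets for both reconstructions. The record layout and the field names (`headings`, `categories`, `authorkeywords`, `woskeywords`) are hypothetical and chosen only for this example; the actual export format of the dataset is not described here.

```python
def tagging_inputs(papers):
    """Build the two per-paper tag-set lists used as reconstruction inputs.

    Each paper is assumed to be a dict with the hypothetical fields
    'headings', 'categories', 'authorkeywords' and 'woskeywords'.
    """
    author_input, wos_input = [], []
    for paper in papers:
        # Heading and category tags are shared by both inputs.
        shared = set(paper["headings"]) | set(paper["categories"])
        author_input.append(shared | set(paper["authorkeywords"]))
        wos_input.append(shared | set(paper["woskeywords"]))
    return author_input, wos_input
```

Both lists have one entry per paper, so any downstream co-occurrence statistics are computed over the same set of objects and differ only through the third-type tags.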

In the reconstructed hierarchies obtained from our method both DAGs had 4 dominant roots at the highest level of the hierarchy, being the ancestors of - of the available tags. The DAGs contain several other non-dominant roots, corresponding to tiny connected components which cover only - of the tags. The four dominant roots coincide with the heading type tags except Multidisciplinary Science & Technology, which appears as a child of Physical Sciences.

Next, we compare the vertical structures of the two DAGs by analysing the hierarchy level distribution of the different tag types. A technical difficulty arises from the fact that a tag may descend from more than one root, and thus can have several level values depending on the root from which the level is counted. Here we assign tags to hierarchy levels according to their closest root, i.e., from the possible level numbers we associate the highest possible level (the smallest level number) to each tag. The resulting level distributions are shown in Fig. 1. They indicate that the position in the DAG correlates strongly with the heading-category-(author/wos)keyword classification, i.e., the reconstruction is consistent with the a priori classification of the tags in this respect. However, it is interesting to note that while tags of different types mostly appear below each other in the expected order, tags of the same type also appear below each other: the reconstruction finds structure within the types.
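
The closest-root level assignment amounts to a breadth-first search started from all roots at once. A minimal sketch, assuming the DAG is stored as a hypothetical `children` dictionary mapping each tag to its direct children, with roots placed at level 1:

```python
from collections import deque


def hierarchy_levels(children, roots):
    """Assign each tag the level given by its closest root (roots are level 1)."""
    level = {root: 1 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in level:  # first visit = shortest path from any root
                level[child] = level[node] + 1
                queue.append(child)
    return level


# Toy DAG: "y" is reachable from root "A" via "x" and directly from root "B".
children = {"A": ["x"], "B": ["y"], "x": ["y", "z"], "y": ["z"], "z": ["w"]}
levels = hierarchy_levels(children, roots=["A", "B"])
```

Here "y" receives level 2 (counted from "B") rather than level 3 (counted from "A"), which is the convention described above. Tags that are not descendants of any of the given roots are simply absent from the result.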

Figure 1: Level-wise ratio of tags, for the 3 tag types. Left panel is for the authorkeyword DAG, right panel for the woskeyword DAG. The distribution is calculated for the tags that are members in at least one of the descendant sets of the 4 dominant roots. Roots are at level 1.

The third aspect is the horizontal similarity of the DAGs. Here we analyse whether common members of the DAGs are in similar horizontal positions, i.e., whether they have similar descendant subgraphs. Since the DAGs are constructed from the same heading and category type tags and the two different sets of keyword tags, we compare the horizontal structure of the two DAGs in two ways: i) first we restrict the analysis to those tags that are common to the two DAGs (heading, category and common keywords); ii) then we restrict the analysis even further, considering only the heading and category type tags, which are common to the two DAGs by construction.

For the first case, where we compare the horizontal position of the common keywords/categories/headings of the two DAGs, we calculated the linearised mutual information-based similarity of [12], which ranges from 0 to 1. The result shows a huge dissimilarity between the two DAGs. A sample of the DAGs is shown in Fig. 2, around “vegetation response”. In both DAGs, related tags appear below the chosen tag; however, according to Fig. 2, the descendants in one DAG differ from the descendants in the other. These results are in accordance with the complementary nature of the authorkeywords and woskeywords.
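
To make the idea of comparing descendant sets concrete, the sketch below computes a much cruder quantity than the linearised mutual information of [12]: the mean Jaccard overlap of the descendant sets of the tags common to both DAGs. It is a stand-in chosen for illustration only; the function names and the `children` dictionary representation are assumptions.

```python
def descendant_sets(children):
    """Map every tag of the DAG to the set of all tags reachable below it."""
    cache = {}

    def collect(node):
        if node not in cache:
            below = set()
            for child in children.get(node, ()):
                below.add(child)
                below |= collect(child)
            cache[node] = below
        return cache[node]

    nodes = set(children) | {c for kids in children.values() for c in kids}
    for node in nodes:
        collect(node)
    return cache


def mean_descendant_jaccard(children_a, children_b):
    """Average Jaccard overlap of descendant sets, restricted to common tags.

    NOT the linearised mutual information of [12]; a simple substitute.
    Returns None when no common tag has descendants in either DAG.
    """
    desc_a, desc_b = descendant_sets(children_a), descendant_sets(children_b)
    common = set(desc_a) & set(desc_b)
    scores = []
    for tag in common:
        below_a, below_b = desc_a[tag] & common, desc_b[tag] & common
        if below_a or below_b:
            scores.append(len(below_a & below_b) / len(below_a | below_b))
    return sum(scores) / len(scores) if scores else None
```

Like the measure used in the text, this quantity lies between 0 and 1 and equals 1 for identical hierarchies, but its numerical values are not comparable to the reported similarities.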

Figure 2: Samples from the reduced DAGs (to heading, category and common keywords) with the woskeywords (top) and authorkeywords (bottom). Node sizes show the number of descendants in the reduced DAGs, on a logarithmic scale.

If we restrict the calculation of the mutual information to the heading and category tags only, the similarity becomes much higher, showing that the relations between general tags are quite robust, even though the hierarchies are built bottom-up and their bottom parts are very different. Samples of these reduced DAGs are visualised in Fig. 3. They display a few branches below Life Sciences & Biomedicine, like Biochemistry & Molecular Biology, Cardiac & Cardiovascular Systems or Plant Sciences. The two sub-figures show that Neurosciences, Plant Sciences, Biophysics and Agronomy have more children in the woskeyword DAG, while Hematology is also connected to Transplantation in the authorkeyword DAG. Note that the reconstruction strongly depends on the descendants of each tag, especially for tags having several descendants; thus the difference between the authorkeywords and woskeywords could have led to a very different structure at the top of the DAG [12]. The very high similarity at the top of the hierarchy, compared to the low similarity for the first case, indicates that the differences between the authorkeywords and the woskeywords result in differences at the low levels of the hierarchy, while these differences do not propagate to the highest levels.

Figure 3: Samples from the reduced DAGs (to heading and category) of the woskeywords (top) and authorkeywords (bottom) based reconstructions. Reduction left only the 256 heading and category tags. Node sizes show the number of descendants in the reduced DAGs, on a logarithmic scale.

4 Discussion

Tag-based categorisation of large online datasets is becoming increasingly widespread. Tags allow free-word tagging, multiple categories for items and user-based processing in a parallel manner instead of centralised expert-based processing. Although the tags have no predefined relations, it is reasonable to assume that users think to an extent in hierarchical relations between tags, i.e., using some tags as special cases of other, more general tags.

Here we applied a recently introduced hierarchy construction method to keywords of scientific papers from the Web of Science. Tags were pre-organised by the Web of Science into 3 types, from the very general to the very special. For the most special type, 2 different sets of keywords were obtained, author-given and repository-given. Accordingly, two different hierarchies were constructed, each time using one of these sets as the special type, accompanied by the more general tags.

First, the structures of the obtained hierarchies were compared to the 3 predefined tag types. Good correspondence was found here. For the most general type, 4 out of the 5 member tags appeared as level 1 roots in the constructed hierarchies (Arts & Humanities, Life Sciences & Biomedicine, Physical Sciences and Social Sciences), the fifth one being an immediate child of one of them (Multidisciplinary Science & Technology). The intermediate type tags populated the next levels in the hierarchies, and members of the most specific type were at the lowest levels. An interesting observation is that the tags were organised into significantly more than three levels, indicating that there is structure within the predefined types.

Second, the two constructed hierarchies, using two different sets of special keywords, were compared to each other. The hierarchies were reduced to the tags common to both of them, in order to make a direct comparison possible. It was found that the organisation of the tags is very different, their similarity scoring low on a [0,1] scale. This is in accordance with their purpose, i.e., for each individual paper woskeywords are intended to be complementary to the authorkeywords [36]. On the other hand, when reducing the hierarchies to the general and intermediately general type tags only, a much higher similarity was obtained, in spite of the fact that the hierarchies were constructed bottom-up, which would allow different lower levels to result in different high levels. Interestingly, while the lower parts of the hierarchies were different, the more general tags were organised in a significantly similar way.

5 Acknowledgments

The research was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.:TAMOP-4.2.2.C-11/1/KONV-2012-0013) and by the Hungarian National Science Fund (OTKA K105447). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors declare no conflict of interest.

References

  • [1] Mika P (2005) Ontologies are us: A unified model of social networks and semantics. In: International Semantic Web Conference 3729: 522–536.
  • [2] Spyns P, Moor AD, Vandenbussche J, Meersman R (2006) From Folksologies to Ontologies: How the Twain Meet. In: Proceedings of OTM Conferences 1: 738–755.
  • [3] Voss J (2007) Tagging, folksonomy & Co - renaissance of manual indexing? ArXiv:cs/0701072v2.
  • [4] Cattuto C, Loreto V, Pietronero L (2007) Semiotic dynamics and collaborative tagging. Proc Natl Acad Sci USA 104: 1461–1464.
  • [5] Lambiotte R, Ausloos M (2006) Collaborative tagging as a tripartite network. Lect Notes in Computer Sci 3993: 1114–1117.
  • [6] Cattuto C, Barrat A, Baldassarri A, Schehr G, Loreto V (2009) Collective dynamics of social annotation. Proc Natl Acad Sci USA 106: 10511–10515.
  • [7] Plangprasopchok A, Lerman K (2009) Constructing folksonomies from user-specified relations on flickr. In: Proceedings of the World Wide Web conference pp. 781–790.
  • [8] Plangprasopchok A, Lerman K, Getoor L (2011) A probabilistic approach for learning folksonomies from structured data. In: Fourth ACM International Conference on Web Search and Data Mining (WSDM) pp. 555–564.
  • [9] Damme CV, Hepp M, Siorpaes K (2007) Folksontology: An integrated approach for turning folksonomies into ontologies. Social Networks 2: 57–70.
  • [10] Schmitz P (2006) Inducing ontology from flickr tags. In: Proc. of Collaborative Web Tagging Workshop at the 15th Int. Conf. on World Wide Web (WWW).
  • [11] Heymann P, Garcia-Molina H (2006) Collaborative creation of communal hierarchical taxonomies in social tagging systems. Technical report, Stanford InfoLab. URL http://ilpubs.stanford.edu:8090/775/.
  • [12] Tibély G, Pollner P, Vicsek T, Palla G (2013) Extracting Tag Hierarchies. PLoS ONE 8, e84133.
  • [13] H. W. Ma, J. Buer and A. P. Zeng (2004) Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach. BMC Bioinformatics 5:199.
  • [14] C. Goessmann, C. Hemelrijk and R. Huber (2000) The formation and maintenance of crayfish hierarchies: behavioral and self-structuring properties. Behavioral Ecology and Sociobiology 48:418-428.
  • [15] M. Nagy, Z. Ákos, D. Biro and T. Vicsek (2010) Hierarchical group dynamics in pigeon flocks. Nature 464:890-893.
  • [16] H. Fushing, M. P. McAssey, B. Beisner and B. McCowan (2011) Ranking network of captive rhesus macaque society: A sophisticated corporative kingdom. PLoS ONE 6:e17817.
  • [17] M. Kaiser, C. C. Hilgetag and R. Kötter (2010) Hierarchy and dynamics of neural networks. Front. Neuroinform. 4:112.
  • [18] D. Pumain (2006) Hierarchy in Natural and Social Sciences. Methodos Series 3 (Springer Netherlands, Dordrecht, The Netherlands).
  • [19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt and A. Arenas (2003) Self-similar community structure in a network of human interactions. Phys. Rev. E 68: 065103.
  • [20] P. Pollner, G. Palla and T. Vicsek (2006) Preferential attachment of communities: The same principle, but a higher level. Europhys. Lett. 73: 478–484.
  • [21] S. Valverde and R. V. Solé (2007) Self-organization versus hierarchy in open-source social networks. Phys. Rev. E 76:046118.
  • [22] P. R. Krugman (1996) Confronting the mystery of urban hierarchy. J. Jpn. Int. Econ. 10: 399-418.
  • [23] M. Batty and P. Longley (1994) Fractal Cities: A Geometry of Form and Function (Academic, San Diego).
  • [24] H. Hirata and R. Ulanowicz (1985) Information theoretical analysis of the aggregation and hierarchical structure of ecological networks. J. Theor. Biol. 116:321-341.
  • [25] J. Wickens and R. Ulanowicz (1988) On quantifying hierarchical connections in ecology. J. Soc. Biol. Struct. 11:369-378.
  • [26] N. Eldredge (1985) Unfinished Synthesis: Biological Hierarchies and Modern Evolutionary Thought. (Oxford Univ. Press, New York).
  • [27] D. W. McShea (2001) The hierarchical structure of organisms. Paleobiology 27:405-423.
  • [28] A. Trusina, S. Maslov, P. Minnhagen, K. Sneppen (2004) Hierarchy measures in complex networks. Phys. Rev. Lett. 92:178702.
  • [29] B. Corominas-Murtra, C. Rodríguez-Caso, J. Goñi and R. Solé (2011) Measuring the hierarchy of feedforward networks. Chaos 21:016108.
  • [30] E. Mones, L. Vicsek and T. Vicsek (2012) Hierarchy Measure for Complex Networks. PLoS ONE 7:e33799.
  • [31] B. Corominas-Murtra, J. Goñi, R. V. Solé and C. Rodríguez-Caso (2013) On the origins of hierarchy in complex networks. Proc. Natl. Acad. Sci. USA 110:13316-13321.
  • [32] A. Clauset, C. Moore and M. E. J. Newman (2008) Hierarchical structure and the prediction of missing links in networks. Nature 453:98–101.
  • [33] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, A.-L. Barabási (2002) Hierarchical Organization of Modularity in Metabolic Networks. Science 297:1551–1555.
  • [34] ISI Web of Knowledge. http://scientific.thomson.com/isi/ (Date of access: 01/01/2012).
  • [35] Palla G, Tibély G, Mones E, Pollner P, Vicsek T (2015) Hierarchical networks of scientific journals. arXiv:1506.05661 To appear in Palgrave Communications.
  • [36] http://interest.science.thomsonreuters.com/content/WOKUserTips-201010-SEA (Date of access: 15/12/2014).