Building a Scientific Concept Hierarchy Database (SCHBase)
Eytan Adar and Srayan Datta
Extracted keyphrases can enhance numerous applications ranging from search to tracking the evolution of scientific discourse. We present SCHBase, a hierarchical database of keyphrases extracted from large collections of scientific literature. SCHBase relies on a tendency of scientists to generate new abbreviations that 'extend' existing forms as a form of signaling novelty. We demonstrate how these keyphrases/concepts can be extracted, and their viability as a database in relation to existing collections. We further show how keyphrases can be placed into a semantically-meaningful "phylogenetic" structure and describe key features of this structure.
Preprint: PDF (2Mb), In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, July 2015
The dataset used in this project is available as a tar.gz file: schbase version 1 (2.3Mb). If you use it, please cite the work using the citation above. We will update the file as we refine our pipeline.
There is one file for every "tree" (note that some nodes will repeat between files). Each file contains one line per "edge." The format is:
child_concept <tab> parent_concept <tab> child_first_year <tab> parent_first_year
Spaces within a concept are replaced with underscores. The year is the first year the concept was detected in the corpus (note that this is generally correct, but can be off by a year or two).
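The edge format described above can be read with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: the names Edge, parse_edge, and parse_edges are hypothetical, the year fields are assumed to be plain integers, and the sample line at the bottom is made up for illustration rather than taken from the data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Edge:
    """One child -> parent edge from a tree file (hypothetical helper type)."""
    child: str
    parent: str
    child_first_year: int
    parent_first_year: int


def parse_edge(line: str) -> Edge:
    """Parse one tab-separated line into an Edge.

    Underscores are turned back into spaces, reversing the substitution
    described in the format notes. Years are assumed to be integers.
    """
    child, parent, child_year, parent_year = line.rstrip("\r\n").split("\t")
    return Edge(
        child=child.replace("_", " "),
        parent=parent.replace("_", " "),
        child_first_year=int(child_year),
        parent_first_year=int(parent_year),
    )


def parse_edges(lines: Iterable[str]) -> List[Edge]:
    """Parse every non-blank line of a tree file."""
    return [parse_edge(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Made-up sample line; the concepts and years are placeholders.
    sample = ["hierarchical_hidden_markov_model\thidden_markov_model\t1998\t1989\n"]
    for edge in parse_edges(sample):
        print(edge)
```

To read a real tree file, pass an open file handle to parse_edges, for example parse_edges(open(path, encoding="utf-8")), where path points at one of the files extracted from the archive. Because the first-year values can be off by a year or two, a child occasionally appearing earlier than its parent should not be treated as a parse error.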