Topic cropping: Leveraging latent topics for the analysis of small corpora

Research output: Chapter in book/report/conference proceeding, Conference contribution, Research, peer review

Authors

  • Nam Khanh Tran
  • Sergej Zerr
  • Kerstin Bischoff
  • Claudia Niederée
  • Ralf Krestel

Research Organisations

External Research Organisations

  • University of California at Irvine

Details

Original language: English
Title of host publication: Research and Advanced Technology for Digital Libraries
Subtitle of host publication: International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings
Pages: 297-308
Number of pages: 12
Publication status: Published - 2013
Event: 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013 - Valletta, Malta
Duration: 22 Sept 2013 to 26 Sept 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8092 LNCS
ISSN (Print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.
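
As a rough illustration of the cropping step described in the abstract, the sketch below selects from a general corpus the documents most similar to a small working corpus. It is not the authors' implementation: the abstract does not say how the Cropping corpus is tailored, so the TF-IDF cosine ranking, the function name crop_corpus, the parameter k, and the toy documents are all assumptions made for this example. Learning a topic model on the cropped documents and inferring topics for the working corpus would follow with a topic-modeling library and are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def crop_corpus(working_docs, general_docs, k=2):
    """Return the k general-corpus documents most similar to the working corpus.

    Assumption (not taken from the paper): similarity is the cosine between
    TF-IDF vectors, with the whole working corpus pooled into one query vector.
    """
    general_tokens = [tokenize(doc) for doc in general_docs]
    n_docs = len(general_docs)

    # Document frequency of each term, counted over the general corpus.
    df = Counter(term for tokens in general_tokens for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}

    def tfidf(tokens):
        # Terms unseen in the general corpus carry no weight and are dropped.
        counts = Counter(tokens)
        return {t: f * idf[t] for t, f in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = tfidf([t for doc in working_docs for t in tokenize(doc)])
    scored = [(cosine(query, tfidf(tokens)), i)
              for i, tokens in enumerate(general_tokens)]
    # Highest similarity first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [general_docs[i] for _, i in scored[:k]]


# Hypothetical toy data: a tiny "working corpus" and a tiny "general corpus".
working = [
    "Interview about job stress and workload at the office.",
    "Participants describe stress from long working hours.",
]
general = [
    "Occupational stress is psychological stress related to a job.",
    "The giraffe is an African mammal with a long neck.",
    "Working hours and workload are linked to burnout and stress.",
    "Volcanoes erupt when magma reaches the surface.",
]
cropped = crop_corpus(working, general, k=2)
```

On this toy input the two work-related documents are selected and the giraffe and volcano documents are left out. A real general corpus such as Wikipedia would need an inverted index or a search engine rather than the exhaustive scoring used here.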

Keywords

    digital humanities, qualitative data, topic modeling

Cite this

Topic cropping: Leveraging latent topics for the analysis of small corpora. / Tran, Nam Khanh; Zerr, Sergej; Bischoff, Kerstin et al.
Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS).

Tran, NK, Zerr, S, Bischoff, K, Niederée, C & Krestel, R 2013, Topic cropping: Leveraging latent topics for the analysis of small corpora. in Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8092 LNCS, pp. 297-308, 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, 22 Sept 2013. https://doi.org/10.1007/978-3-642-40501-3_30
Tran, N. K., Zerr, S., Bischoff, K., Niederée, C., & Krestel, R. (2013). Topic cropping: Leveraging latent topics for the analysis of small corpora. In Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings (pp. 297-308). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS). https://doi.org/10.1007/978-3-642-40501-3_30
Tran NK, Zerr S, Bischoff K, Niederée C, Krestel R. Topic cropping: Leveraging latent topics for the analysis of small corpora. In Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-642-40501-3_30
Tran, Nam Khanh ; Zerr, Sergej ; Bischoff, Kerstin et al. / Topic cropping : Leveraging latent topics for the analysis of small corpora. Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. pp. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{e5af95b7b5284e588922cfd7b09418c7,
title = "Topic cropping: Leveraging latent topics for the analysis of small corpora",
abstract = "Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.",
keywords = "digital humanities, qualitative data, topic modeling",
author = "Tran, {Nam Khanh} and Sergej Zerr and Kerstin Bischoff and Claudia Nieder{\'e}e and Ralf Krestel",
year = "2013",
doi = "10.1007/978-3-642-40501-3_30",
language = "English",
isbn = "9783642405006",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "297--308",
booktitle = "Research and Advanced Technology for Digital Libraries",
note = "17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013 ; Conference date: 22-09-2013 Through 26-09-2013",

}

TY  - GEN
T1  - Topic cropping
T2  - 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013
AU  - Tran, Nam Khanh
AU  - Zerr, Sergej
AU  - Bischoff, Kerstin
AU  - Niederée, Claudia
AU  - Krestel, Ralf
PY  - 2013
Y1  - 2013
AB  - Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.
KW  - digital humanities
KW  - qualitative data
KW  - topic modeling
UR  - http://www.scopus.com/inward/record.url?scp=84884720660&partnerID=8YFLogxK
U2  - 10.1007/978-3-642-40501-3_30
DO  - 10.1007/978-3-642-40501-3_30
M3  - Conference contribution
AN  - SCOPUS:84884720660
SN  - 9783642405006
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 297
EP  - 308
BT  - Research and Advanced Technology for Digital Libraries
Y2  - 22 September 2013 through 26 September 2013
ER  -