Topic cropping: Leveraging latent topics for the analysis of small corpora

Publication: Contribution to book/report/anthology/conference proceedings; conference paper; research; peer-reviewed

Authors

  • Nam Khanh Tran
  • Sergej Zerr
  • Kerstin Bischoff
  • Claudia Niederée
  • Ralf Krestel

Organisational units

External organisations

  • University of California at Irvine

Details

Original language: English
Title of host publication: Research and Advanced Technology for Digital Libraries
Subtitle: International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings
Pages: 297-308
Number of pages: 12
Publication status: Published - 2013
Event: 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013 - Valletta, Malta
Duration: 22 Sept. 2013 to 26 Sept. 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8092 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Abstract

Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections, such as qualitative studies in the digital humanities, that cannot easily benefit from this technology. The limited size of those corpora leads to poor-quality topic models. Higher-quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, which takes considerable effort. To solve this problem, we propose a fully automated, adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real-world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.
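
The three steps named in the abstract (tailor a cropping corpus from a general corpus, learn topics on it, map them to the working corpus by inference) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how documents are selected or which topic model is used, so the term-overlap selection rule, the small collapsed Gibbs sampler for LDA, the fold-in inference step, the toy documents, and all names (`tokenize`, `crop`, `train_lda`, `infer`) are assumptions made for this example.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def crop(working_docs, general_docs, min_overlap=1):
    """Assumed selection rule: keep general-corpus documents that share at
    least `min_overlap` distinct terms with the working corpus."""
    working_vocab = {w for doc in working_docs for w in tokenize(doc)}
    cropped = []
    for doc in general_docs:
        tokens = tokenize(doc)
        if len(working_vocab & set(tokens)) >= min_overlap:
            cropped.append(tokens)
    return cropped


def train_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (an assumed choice of topic model).
    Returns one word distribution (dict word -> probability) per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    ndk = [[0] * num_topics for _ in docs]          # topic counts per document
    nkw = [Counter() for _ in range(num_topics)]    # word counts per topic
    nk = [0] * num_topics                           # token counts per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                         # remove current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + v * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k                         # record the resampled topic
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return [
        {w: (nkw[k][w] + beta) / (nk[k] + v * beta) for w in vocab}
        for k in range(num_topics)
    ]


def infer(tokens, phi, iters=50, alpha=0.1):
    """Fold-in inference: estimate a topic mixture for one working-corpus
    document while keeping the learned topics `phi` fixed.
    Words unseen during training are ignored."""
    num_topics = len(phi)
    words = [w for w in tokens if w in phi[0]]
    theta = [1.0 / num_topics] * num_topics
    for _ in range(iters):
        acc = [alpha] * num_topics
        for w in words:
            resp = [theta[k] * phi[k][w] for k in range(num_topics)]
            total = sum(resp)
            for k in range(num_topics):
                acc[k] += resp[k] / total
        norm = sum(acc)
        theta = [a / norm for a in acc]
    return theta


# Toy data, invented purely for illustration.
working = ["interview about museum archive", "archive digitization interview"]
general = [
    "museum archive collection digitization curator exhibit",
    "archive library catalog metadata collection",
    "football match goal league team",
    "recipe flour sugar oven cake",
    "interview transcript oral history museum",
]

cropping_corpus = crop(working, general)
phi = train_lda(cropping_corpus, num_topics=2)
for doc in working:
    print(doc, [round(p, 3) for p in infer(tokenize(doc), phi)])
```

In this sketch the working corpus is used only to decide which general-corpus documents to keep and, at the end, as the target of inference; the topics themselves are learned from the larger cropped corpus, which is the point the abstract makes about small collections.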

ASJC Scopus subject areas

Cite this

Topic cropping: Leveraging latent topics for the analysis of small corpora. / Tran, Nam Khanh; Zerr, Sergej; Bischoff, Kerstin et al.
Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. pp. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS).

Tran, NK, Zerr, S, Bischoff, K, Niederée, C & Krestel, R 2013, Topic cropping: Leveraging latent topics for the analysis of small corpora. in Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8092 LNCS, pp. 297-308, 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, 22 Sept. 2013. https://doi.org/10.1007/978-3-642-40501-3_30
Tran, N. K., Zerr, S., Bischoff, K., Niederée, C., & Krestel, R. (2013). Topic cropping: Leveraging latent topics for the analysis of small corpora. In Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings (pp. 297-308). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS). https://doi.org/10.1007/978-3-642-40501-3_30
Tran NK, Zerr S, Bischoff K, Niederée C, Krestel R. Topic cropping: Leveraging latent topics for the analysis of small corpora. in Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-642-40501-3_30
Tran, Nam Khanh ; Zerr, Sergej ; Bischoff, Kerstin et al. / Topic cropping : Leveraging latent topics for the analysis of small corpora. Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. pp. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{e5af95b7b5284e588922cfd7b09418c7,
title = "Topic cropping: Leveraging latent topics for the analysis of small corpora",
abstract = "Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.",
keywords = "digital humanities, qualitative data, topic modeling",
author = "Tran, {Nam Khanh} and Sergej Zerr and Kerstin Bischoff and Claudia Nieder{\'e}e and Ralf Krestel",
year = "2013",
doi = "10.1007/978-3-642-40501-3_30",
language = "English",
isbn = "9783642405006",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "297--308",
booktitle = "Research and Advanced Technology for Digital Libraries",
note = "17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013 ; Conference date: 22-09-2013 Through 26-09-2013",

}

TY - GEN

T1 - Topic cropping

T2 - 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013

AU - Tran, Nam Khanh

AU - Zerr, Sergej

AU - Bischoff, Kerstin

AU - Niederée, Claudia

AU - Krestel, Ralf

PY - 2013

Y1 - 2013

AB - Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.

KW - digital humanities

KW - qualitative data

KW - topic modeling

UR - http://www.scopus.com/inward/record.url?scp=84884720660&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-40501-3_30

DO - 10.1007/978-3-642-40501-3_30

M3 - Conference contribution

AN - SCOPUS:84884720660

SN - 9783642405006

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 297

EP - 308

BT - Research and Advanced Technology for Digital Libraries

Y2 - 22 September 2013 through 26 September 2013

ER -