Topic cropping: Leveraging latent topics for the analysis of small corpora

Nam Khanh Tran; Sergej Zerr; Kerstin Bischoff; Claudia Niederée; Ralf Krestel

doi:10.1007/978-3-642-40501-3_30

Details

Original language	English
Title of host publication	Research and Advanced Technology for Digital Libraries
Subtitle	International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings
Pages	297-308
Number of pages	12
Publication status	Published - 2013
Event	17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013 - Valletta, Malta. Duration: 22 Sept 2013 → 26 Sept 2013

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	8092 LNCS
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

Topic modeling has become a popular means of identifying and describing the topical structure of textual documents and whole corpora. However, many document collections, such as qualitative studies in the digital humanities, cannot easily benefit from this technology: the limited size of those corpora leads to poor-quality topic models. Higher-quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, which takes considerable effort. To solve this problem, we propose a fully automated, adaptable process of topic cropping. To learn topics, this process automatically tailors a domain-specific cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. An evaluation with a real-world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. Specifically, we analyzed the learned topics with respect to coherence, diversity, and relevance.
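
The abstract does not say how the cropping corpus is selected from the general corpus, so the following is only a minimal sketch of one plausible selection step, not the authors' method. It assumes that documents from the general corpus are ranked by TF-IDF cosine similarity to the pooled working corpus and that the top k are kept. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `crop_corpus`), the smoothed IDF formula, and the toy documents are all hypothetical choices made for illustration. Learning the topic model on the selected documents and inferring topics for the working corpus would follow as separate steps and are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        # Smoothed IDF; this particular formula is an assumption of the sketch.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def crop_corpus(working_docs, general_docs, k):
    """Return indices of the k general-corpus documents most similar to the working corpus."""
    # Pool the whole working corpus into a single query document.
    query = [tok for doc in working_docs for tok in tokenize(doc)]
    tokenized = [tokenize(doc) for doc in general_docs]
    vectors = tfidf_vectors(tokenized + [query])
    query_vec = vectors.pop()
    ranked = sorted(
        range(len(general_docs)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for a general corpus (e.g. encyclopedia articles) and a small working corpus.
general = [
    "Unemployment is the state of people without a job who seek work",
    "A volcano is a rupture in the crust of a planet",
    "A job interview is a conversation between an applicant and an employer",
    "Photosynthesis converts light energy into chemical energy in plants",
]
working = [
    "I lost my job last year and the interview with the employer went badly",
    "Looking for work is hard when you have no job",
]

selected = crop_corpus(working, general, k=2)
print(sorted(selected))
```

In a fuller pipeline, a topic model would be trained on the documents at the selected indices and then applied to the working corpus via topic inference, as the abstract describes.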

ASJC Scopus subject areas

Mathematics (all)
Theoretical Computer Science
Computer Science (all)
General Computer Science

Cite this

Topic cropping: Leveraging latent topics for the analysis of small corpora. / Tran, Nam Khanh; Zerr, Sergej; Bischoff, Kerstin et al.
Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS).

Publication: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed

Tran, NK, Zerr, S, Bischoff, K, Niederée, C & Krestel, R 2013, Topic cropping: Leveraging latent topics for the analysis of small corpora. in Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8092 LNCS, pp. 297-308, 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, 22 Sept. 2013. https://doi.org/10.1007/978-3-642-40501-3_30

Tran, N. K., Zerr, S., Bischoff, K., Niederée, C., & Krestel, R. (2013). Topic cropping: Leveraging latent topics for the analysis of small corpora. In Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings (pp. 297-308). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS). https://doi.org/10.1007/978-3-642-40501-3_30

Tran NK, Zerr S, Bischoff K, Niederée C, Krestel R. Topic cropping: Leveraging latent topics for the analysis of small corpora. in Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-642-40501-3_30

Tran, Nam Khanh ; Zerr, Sergej ; Bischoff, Kerstin et al. / Topic cropping : Leveraging latent topics for the analysis of small corpora. Research and Advanced Technology for Digital Libraries : International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{e5af95b7b5284e588922cfd7b09418c7,

title = "Topic cropping: Leveraging latent topics for the analysis of small corpora",

abstract = "Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.",

keywords = "digital humanities, qualitative data, topic modeling",

author = "Tran, {Nam Khanh} and Sergej Zerr and Kerstin Bischoff and Claudia Nieder{\'e}e and Ralf Krestel",

year = "2013",

doi = "10.1007/978-3-642-40501-3_30",

language = "English",

isbn = "9783642405006",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

pages = "297--308",

booktitle = "Research and Advanced Technology for Digital Libraries",

note = "17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013 ; Conference date: 22-09-2013 Through 26-09-2013",

}

TY - GEN

T1 - Topic cropping

T2 - 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013

AU - Tran, Nam Khanh

AU - Zerr, Sergej

AU - Bischoff, Kerstin

AU - Niederée, Claudia

AU - Krestel, Ralf

PY - 2013

Y1 - 2013

N2 - Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.

AB - Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.

KW - digital humanities

KW - qualitative data

KW - topic modeling

UR - http://www.scopus.com/inward/record.url?scp=84884720660&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-40501-3_30

DO - 10.1007/978-3-642-40501-3_30

M3 - Conference contribution

AN - SCOPUS:84884720660

SN - 9783642405006

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 297

EP - 308

BT - Research and Advanced Technology for Digital Libraries

Y2 - 22 September 2013 through 26 September 2013

ER -