Details
Original language | English |
---|---|
Title of host publication | Research and Advanced Technology for Digital Libraries |
Subtitle of host publication | International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings |
Pages | 297-308 |
Number of pages | 12 |
Publication status | Published - 2013 |
Event | 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta; Duration: 22 Sept 2013 → 26 Sept 2013 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 8092 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Abstract
Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.
Keywords
- digital humanities, qualitative data, topic modeling
ASJC Scopus subject areas
- Mathematics(all)
- Theoretical Computer Science
- Computer Science(all)
Cite this
Topic cropping. / Tran, Nam Khanh; Zerr, Sergej; Bischoff, Kerstin; Niederée, Claudia; Krestel, Ralf. Research and Advanced Technology for Digital Libraries: International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Proceedings. 2013. p. 297-308 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8092 LNCS).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Topic cropping
T2 - 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013
AU - Tran, Nam Khanh
AU - Zerr, Sergej
AU - Bischoff, Kerstin
AU - Niederée, Claudia
AU - Krestel, Ralf
PY - 2013
Y1 - 2013
N2 - Topic modeling has gained a lot of popularity as a means for identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections such as qualitative studies in the digital humanities that cannot easily benefit from this technology. The limited size of those corpora leads to poor quality topic models. Higher quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, requiring considerable effort. For solving this problem, we propose a fully automated adaptable process of topic cropping. For learning topics, this process automatically tailors a domain-specific Cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. In detail, we analyzed the learned topics with respect to coherence, diversity, and relevance.
KW - digital humanities
KW - qualitative data
KW - topic modeling
UR - http://www.scopus.com/inward/record.url?scp=84884720660&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-40501-3_30
DO - 10.1007/978-3-642-40501-3_30
M3 - Conference contribution
AN - SCOPUS:84884720660
SN - 9783642405006
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 297
EP - 308
BT - Research and Advanced Technology for Digital Libraries
Y2 - 22 September 2013 through 26 September 2013
ER -