Asynchronous Training of Word Embeddings for Large Text Corpora

Avishek Anand; Megha Khosla; Jaspreet Singh; Jan Hendrik Zab; Zijian Zhang

doi:10.48550/arXiv.1812.03825

Details

Original language	English
Title of host publication	WSDM 2019
Subtitle	Proceedings of the 12th ACM International Conference on Web Search and Data Mining
Place of publication	New York
Publisher	Association for Computing Machinery (ACM)
Pages	168-176
Number of pages	9
ISBN (electronic)	9781450359405
Publication status	Published - 30 Jan 2019
Event	12th ACM International Conference on Web Search and Data Mining, WSDM 2019 - Melbourne, Australia. Duration: 11 Feb 2019 → 15 Feb 2019

Abstract

Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is typically sequentially processed and parameters are synchronously updated. Distributed architectures for asynchronous training that have been proposed either focus on scaling vocabulary sizes and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach to train word embeddings by partitioning the input space instead in order to scale to massive text corpora while not sacrificing the performance of the embeddings. Our training procedure does not involve any parameter synchronization except a final sub-model merge phase that typically executes in a few minutes. Our distributed training scales seamlessly to large corpus sizes and we get comparable and sometimes even up to 45% performance improvement in a variety of NLP benchmarks using models trained by our distributed procedure which requires 1/10 of the time taken by the baseline approach. Finally we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations.
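
The input-partitioning idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the round-robin split, the `min_count` threshold, and the function names `partition_corpus`, `shard_vocabulary`, and `missing_words` are all assumptions made for illustration. The sketch only shows how splitting a corpus into independently processed shards leaves each sub-model with an incomplete vocabulary, which is the "missing words" situation the abstract mentions. Training and the final merge phase are not shown.

```python
from collections import Counter


def partition_corpus(sentences, num_partitions):
    """Split tokenized sentences round-robin into shards (assumed strategy)."""
    shards = [[] for _ in range(num_partitions)]
    for i, sentence in enumerate(sentences):
        shards[i % num_partitions].append(sentence)
    return shards


def shard_vocabulary(shard, min_count=1):
    """Words that a sub-model trained only on this shard would retain."""
    counts = Counter(token for sentence in shard for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def missing_words(shards, min_count=1):
    """For each shard, the words present in some other shard but absent here."""
    vocabs = [shard_vocabulary(shard, min_count) for shard in shards]
    full = set().union(*vocabs)  # union of all shard vocabularies
    return [full - vocab for vocab in vocabs]


# Toy corpus standing in for a large text collection.
corpus = [
    ["word", "embeddings", "scale"],
    ["large", "text", "corpora"],
    ["word", "embeddings", "merge"],
    ["large", "corpora", "scale"],
]
shards = partition_corpus(corpus, 2)
gaps = missing_words(shards)
```

Each shard can then be handed to a separate worker with no shared parameters. Any merge step has to cope with the per-shard gaps returned by `missing_words`.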

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Computer Science (all)
Software
Computer Science (all)
Computer Science Applications

Cite

Asynchronous Training of Word Embeddings for Large Text Corpora. / Anand, Avishek; Khosla, Megha; Singh, Jaspreet et al.
WSDM 2019: Proceedings of the 12th ACM International Conference on Web Search and Data Mining. New York: Association for Computing Machinery (ACM), 2019. pp. 168-176.

Publication: Chapter in book/report/conference proceedings › Conference contribution › Research › Peer-reviewed

Anand, A, Khosla, M, Singh, J, Zab, JH & Zhang, Z 2019, Asynchronous Training of Word Embeddings for Large Text Corpora. in WSDM 2019: Proceedings of the 12th ACM International Conference on Web Search and Data Mining. Association for Computing Machinery (ACM), New York, pp. 168-176, 12th ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, Australia, 11 Feb. 2019. https://doi.org/10.48550/arXiv.1812.03825, https://doi.org/10.1145/3289600.3291011

Anand, A., Khosla, M., Singh, J., Zab, J. H., & Zhang, Z. (2019). Asynchronous Training of Word Embeddings for Large Text Corpora. In WSDM 2019: Proceedings of the 12th ACM International Conference on Web Search and Data Mining (pp. 168-176). Association for Computing Machinery (ACM). https://doi.org/10.48550/arXiv.1812.03825, https://doi.org/10.1145/3289600.3291011

Anand A, Khosla M, Singh J, Zab JH, Zhang Z. Asynchronous Training of Word Embeddings for Large Text Corpora. in WSDM 2019: Proceedings of the 12th ACM International Conference on Web Search and Data Mining. New York: Association for Computing Machinery (ACM). 2019. pp. 168-176. doi: 10.48550/arXiv.1812.03825, 10.1145/3289600.3291011

Anand, Avishek ; Khosla, Megha ; Singh, Jaspreet et al. / Asynchronous Training of Word Embeddings for Large Text Corpora. WSDM 2019: Proceedings of the 12th ACM International Conference on Web Search and Data Mining. New York : Association for Computing Machinery (ACM), 2019. pp. 168-176

@inproceedings{181743140e7f4185907d4a10444bee53,

title = "Asynchronous Training of Word Embeddings for Large Text Corpora",

abstract = "Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is typically sequentially processed and parameters are synchronously updated. Distributed architectures for asynchronous training that have been proposed either focus on scaling vocabulary sizes and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach to train word embeddings by partitioning the input space instead in order to scale to massive text corpora while not sacrificing the performance of the embeddings. Our training procedure does not involve any parameter synchronization except a final sub-model merge phase that typically executes in a few minutes. Our distributed training scales seamlessly to large corpus sizes and we get comparable and sometimes even up to 45% performance improvement in a variety of NLP benchmarks using models trained by our distributed procedure which requires 1/10 of the time taken by the baseline approach. Finally we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations.",

author = "Avishek Anand and Megha Khosla and Jaspreet Singh and Zab, {Jan Hendrik} and Zijian Zhang",

note = "Funding information: This work is partially funded by ALEXANDRIA (ERC 339233) and SoBigData (Grant agreement No. 654024).; 12th ACM International Conference on Web Search and Data Mining, WSDM 2019 ; Conference date: 11-02-2019 Through 15-02-2019",

year = "2019",

month = jan,

day = "30",

doi = "10.48550/arXiv.1812.03825",

language = "English",

pages = "168--176",

booktitle = "WSDM 2019",

publisher = "Association for Computing Machinery (ACM)",

address = "United States",

}

TY - GEN

T1 - Asynchronous Training of Word Embeddings for Large Text Corpora

AU - Anand, Avishek

AU - Khosla, Megha

AU - Singh, Jaspreet

AU - Zab, Jan Hendrik

AU - Zhang, Zijian

N1 - Funding information: This work is partially funded by ALEXANDRIA (ERC 339233) and SoBigData (Grant agreement No. 654024).

PY - 2019/1/30

Y1 - 2019/1/30

AB - Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is typically sequentially processed and parameters are synchronously updated. Distributed architectures for asynchronous training that have been proposed either focus on scaling vocabulary sizes and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach to train word embeddings by partitioning the input space instead in order to scale to massive text corpora while not sacrificing the performance of the embeddings. Our training procedure does not involve any parameter synchronization except a final sub-model merge phase that typically executes in a few minutes. Our distributed training scales seamlessly to large corpus sizes and we get comparable and sometimes even up to 45% performance improvement in a variety of NLP benchmarks using models trained by our distributed procedure which requires 1/10 of the time taken by the baseline approach. Finally we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations.

UR - http://www.scopus.com/inward/record.url?scp=85061738257&partnerID=8YFLogxK

U2 - 10.48550/arXiv.1812.03825

DO - 10.48550/arXiv.1812.03825

M3 - Conference contribution

AN - SCOPUS:85061738257

SP - 168

EP - 176

BT - WSDM 2019

PB - Association for Computing Machinery (ACM)

CY - New York

T2 - 12th ACM International Conference on Web Search and Data Mining, WSDM 2019

Y2 - 11 February 2019 through 15 February 2019

ER -
