Distant Supervision in BERT-based Adhoc Document Retrieval

Publication: Contribution to book/report/anthology/conference proceedings, Conference paper, Research, Peer-reviewed

Authors

  • Koustav Rudra
  • Avishek Anand

Details

Original language: English
Title of host publication: CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery (ACM)
Pages: 2197-2200
Number of pages: 4
ISBN (electronic): 9781450368599
Publication status: Published - Oct 2020
Event: 29th ACM International Conference on Information and Knowledge Management - Virtual, Online, Ireland
Duration: 19 Oct 2020 - 23 Oct 2020

Abstract

Recently introduced pre-trained contextualized language models like BERT have shown improvements in document retrieval tasks. One major limitation of current approaches is the way they handle variable-length documents with a fixed-input BERT model. Common approaches either truncate longer documents or split them into small sentences/passages and then label these, using either the original document label or the output of another externally trained model. A second problem is the scarcity of labelled query-document pairs, which directly hampers the performance of modern data-hungry neural models. The problem is further complicated by large, partially labelled datasets of queries derived from query logs (TREC-DL). In this paper, we handle both issues simultaneously and introduce passage-level weak supervision in contrast to standard document-level supervision. We conduct a preliminary study on document-to-passage label transfer and the influence of unlabelled documents on the performance of adhoc document retrieval. We observe that direct transfer of relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness. We propose a weak-supervision-based transfer passage labelling scheme that improves performance and helps gather relevant passages from unlabelled documents.
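
The difference between direct label transfer and a filtered transfer can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window sizes, the `term_overlap` scorer (a stand-in for an externally trained relevance model), the threshold, and all function names are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple


def split_into_passages(text: str, size: int = 100, stride: int = 50) -> List[str]:
    """Split a document into overlapping word windows (sizes are hypothetical)."""
    words = text.split()
    if not words:
        return []
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the document
    return passages


def direct_transfer(passages: List[str], doc_label: int) -> List[Tuple[str, int]]:
    """Baseline: every passage inherits the label of its document."""
    return [(p, doc_label) for p in passages]


def term_overlap(query: str, passage: str) -> float:
    """Toy scorer (assumption): fraction of query terms found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def filtered_transfer(
    query: str,
    passages: List[str],
    doc_label: int,
    scorer: Callable[[str, str], float] = term_overlap,
    threshold: float = 0.5,
) -> List[Tuple[str, int]]:
    """Keep a positive label only for passages the scorer considers on-topic."""
    labelled = []
    for p in passages:
        if doc_label == 1 and scorer(query, p) < threshold:
            continue  # relevant document, but this passage looks off-topic
        labelled.append((p, doc_label))
    return labelled
```

With a relevant document whose later passages drift off-topic, `direct_transfer` marks every passage as relevant, whereas `filtered_transfer` drops the passages that the scorer rates below the threshold. This is the kind of label noise the abstract refers to; the concrete labelling scheme of the paper may differ.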

Cite

Distant Supervision in BERT-based Adhoc Document Retrieval. / Rudra, Koustav; Anand, Avishek.
CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery (ACM), 2020. pp. 2197-2200.

Rudra, K & Anand, A 2020, Distant Supervision in BERT-based Adhoc Document Retrieval. in CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery (ACM), pp. 2197-2200, 29th ACM International Conference on Information and Knowledge Management, Virtual, Online, Ireland, 19 Oct. 2020. https://doi.org/10.1145/3340531.3412124
Rudra, K., & Anand, A. (2020). Distant Supervision in BERT-based Adhoc Document Retrieval. In CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management (pp. 2197-2200). Association for Computing Machinery (ACM). https://doi.org/10.1145/3340531.3412124
Rudra K, Anand A. Distant Supervision in BERT-based Adhoc Document Retrieval. in CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery (ACM). 2020. pp. 2197-2200. doi: 10.1145/3340531.3412124
Rudra, Koustav ; Anand, Avishek. / Distant Supervision in BERT-based Adhoc Document Retrieval. CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery (ACM), 2020. pp. 2197-2200
BibTeX
@inproceedings{19cd50f941134533bd9134e3a53d23af,
title = "Distant Supervision in BERT-based Adhoc Document Retrieval",
abstract = "Recently introduced pre-trained contextualized language models like BERT have shown improvements in document retrieval tasks. One major limitation of current approaches is the way they handle variable-length documents with a fixed-input BERT model. Common approaches either truncate longer documents or split them into small sentences/passages and then label these, using either the original document label or the output of another externally trained model. A second problem is the scarcity of labelled query-document pairs, which directly hampers the performance of modern data-hungry neural models. The problem is further complicated by large, partially labelled datasets of queries derived from query logs (TREC-DL). In this paper, we handle both issues simultaneously and introduce passage-level weak supervision in contrast to standard document-level supervision. We conduct a preliminary study on document-to-passage label transfer and the influence of unlabelled documents on the performance of adhoc document retrieval. We observe that direct transfer of relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness. We propose a weak-supervision-based transfer passage labelling scheme that improves performance and helps gather relevant passages from unlabelled documents.",
keywords = "adhoc retrieval, distant supervision, document ranking",
author = "Koustav Rudra and Avishek Anand",
note = "Funding information: Acknowledgement: Funding for this project was in part provided by the European Union{\textquoteright}s Horizon 2020 research and innovation programme under grant agreement No 832921.; 29th ACM International Conference on Information and Knowledge Management, CIKM 2020 ; Conference date: 19-10-2020 Through 23-10-2020",
year = "2020",
month = oct,
doi = "10.1145/3340531.3412124",
language = "English",
pages = "2197--2200",
booktitle = "CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management",
publisher = "Association for Computing Machinery (ACM)",
address = "United States",

}

RIS

TY - GEN

T1 - Distant Supervision in BERT-based Adhoc Document Retrieval

AU - Rudra, Koustav

AU - Anand, Avishek

N1 - Funding information: Acknowledgement: Funding for this project was in part provided by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 832921.

PY - 2020/10

Y1 - 2020/10

AB - Recently introduced pre-trained contextualized language models like BERT have shown improvements in document retrieval tasks. One major limitation of current approaches is the way they handle variable-length documents with a fixed-input BERT model. Common approaches either truncate longer documents or split them into small sentences/passages and then label these, using either the original document label or the output of another externally trained model. A second problem is the scarcity of labelled query-document pairs, which directly hampers the performance of modern data-hungry neural models. The problem is further complicated by large, partially labelled datasets of queries derived from query logs (TREC-DL). In this paper, we handle both issues simultaneously and introduce passage-level weak supervision in contrast to standard document-level supervision. We conduct a preliminary study on document-to-passage label transfer and the influence of unlabelled documents on the performance of adhoc document retrieval. We observe that direct transfer of relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness. We propose a weak-supervision-based transfer passage labelling scheme that improves performance and helps gather relevant passages from unlabelled documents.

KW - adhoc retrieval

KW - distant supervision

KW - document ranking

UR - http://www.scopus.com/inward/record.url?scp=85095866363&partnerID=8YFLogxK

U2 - 10.1145/3340531.3412124

DO - 10.1145/3340531.3412124

M3 - Conference contribution

AN - SCOPUS:85095866363

SP - 2197

EP - 2200

BT - CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management

PB - Association for Computing Machinery (ACM)

T2 - 29th ACM International Conference on Information and Knowledge Management, CIKM 2020

Y2 - 19 October 2020 through 23 October 2020

ER -