Can we find documents in web archives without knowing their contents?

Research output: Chapter in book/report/conference proceeding; Conference contribution; Research; peer review

Details

Original language: English
Title of host publication: WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference
Pages: 173-182
Number of pages: 10
ISBN (electronic): 9781450342087
Publication status: Published - 22 May 2016
Event: 8th ACM Web Science Conference - Hannover, Germany
Duration: 22 May 2016 to 25 May 2016
Conference number: 8

Publication series

Name: WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference

Abstract

Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.

Keywords

Feature analysis, Temporal ranking, Web archive search

Cite this

Can we find documents in web archives without knowing their contents? / Vo, Khoi Duy; Tran, Tuan; Nguyen, Tu Ngoc et al.
WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. 2016. p. 173-182 (WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference).

Vo, KD, Tran, T, Nguyen, TN, Zhu, X & Nejdl, W 2016, Can we find documents in web archives without knowing their contents? in WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference, pp. 173-182, 8th ACM Web Science Conference, Hannover, Germany, 22 May 2016. https://doi.org/10.1145/2908131.2908165
Vo, K. D., Tran, T., Nguyen, T. N., Zhu, X., & Nejdl, W. (2016). Can we find documents in web archives without knowing their contents? In WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference (pp. 173-182). (WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference). https://doi.org/10.1145/2908131.2908165
Vo KD, Tran T, Nguyen TN, Zhu X, Nejdl W. Can we find documents in web archives without knowing their contents? In WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. 2016. p. 173-182. (WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference). doi: 10.1145/2908131.2908165
Vo, Khoi Duy ; Tran, Tuan ; Nguyen, Tu Ngoc et al. / Can we find documents in web archives without knowing their contents?. WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. 2016. pp. 173-182 (WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference).
@inproceedings{8919f7a51c6a408ca2d61aaa298cf268,
title = "Can we find documents in web archives without knowing their contents?",
abstract = "Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish {"}good{"} from {"}bad{"} search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.",
keywords = "Feature analysis, Temporal ranking, Web archive search",
author = "Vo, {Khoi Duy} and Tuan Tran and Nguyen, {Tu Ngoc} and Xiaofei Zhu and Wolfgang Nejdl",
year = "2016",
month = may,
day = "22",
doi = "10.1145/2908131.2908165",
language = "English",
series = "WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference",
pages = "173--182",
booktitle = "WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference",
note = "8th ACM Web Science Conference, WebSci 2016 ; Conference date: 22-05-2016 Through 25-05-2016",

}

TY - GEN

T1 - Can we find documents in web archives without knowing their contents?

AU - Vo, Khoi Duy

AU - Tran, Tuan

AU - Nguyen, Tu Ngoc

AU - Zhu, Xiaofei

AU - Nejdl, Wolfgang

N1 - Conference code: 8

PY - 2016/5/22

Y1 - 2016/5/22

N2 - Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.

AB - Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.

KW - Feature analysis

KW - Temporal ranking

KW - Web archive search

UR - http://www.scopus.com/inward/record.url?scp=84976358897&partnerID=8YFLogxK

U2 - 10.1145/2908131.2908165

DO - 10.1145/2908131.2908165

M3 - Conference contribution

AN - SCOPUS:84976358897

T3 - WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference

SP - 173

EP - 182

BT - WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference

T2 - 8th ACM Web Science Conference

Y2 - 22 May 2016 through 25 May 2016

ER -
