Details
Original language | English |
---|---|
Title of host publication | WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference |
Pages | 173-182 |
Number of pages | 10 |
ISBN (electronic) | 9781450342087 |
Publication status | Published - 22 May 2016 |
Event | 8th ACM Web Science Conference, Hannover, Germany. Duration: 22 May 2016 → 25 May 2016. Conference number: 8 |
Publication series
Name | WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference |
---|
Abstract
Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.
Keywords
- Feature analysis, Temporal ranking, Web archive search
ASJC Scopus subject areas
- Computer Science(all)
- Computer Networks and Communications
Cite this
WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. 2016. p. 173-182 (WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Can we find documents in web archives without knowing their contents?
AU - Vo, Khoi Duy
AU - Tran, Tuan
AU - Nguyen, Tu Ngoc
AU - Zhu, Xiaofei
AU - Nejdl, Wolfgang
N1 - Conference code: 8
PY - 2016/5/22
Y1 - 2016/5/22
N2 - Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.
AB - Recent advances of preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable to explore the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived. Despite several attempts in Web archive search, facilitating access to Web archive still remains a challenging problem. In this work, we conduct a first analysis on different ranking strategies that exploit evidences from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidences to Web archive search, where the evidences are mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple evidences to distinguish "good" from "bad" search results. We conduct empirical experiments quantitatively as well as qualitatively to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account.
KW - Feature analysis
KW - Temporal ranking
KW - Web archive search
UR - http://www.scopus.com/inward/record.url?scp=84976358897&partnerID=8YFLogxK
U2 - 10.1145/2908131.2908165
DO - 10.1145/2908131.2908165
M3 - Conference contribution
AN - SCOPUS:84976358897
T3 - WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference
SP - 173
EP - 182
BT - WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference
T2 - 8th ACM Web Science Conference
Y2 - 22 May 2016 through 25 May 2016
ER -