Details
Original language | English |
---|---|
Title of host publication | SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval |
Pages | 153-162 |
Number of pages | 10 |
ISBN (electronic) | 9781450336215 |
Publication status | Published - 9 Aug 2015 |
Event | 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015 - Santiago, Chile; Duration: 9 Aug 2015 → 13 Aug 2015 |
Abstract
Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.
Keywords
- Discovery, Frontier ranking, Random walks, Result relevance, URL prioritization, Web crawling, Web frontier, Web search engine
ASJC Scopus subject areas
- Computer Science(all)
- Information Systems
- Software
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking
AU - Tran, Giang
AU - Turk, Ata
AU - Cambazoglu, B. Barla
AU - Nejdl, Wolfgang
N1 - Funding information: This work was supported by the ERC Advanced Grant ALEXANDRIA (339233) and the LEADS project (ICT-318809), funded by the European Community.
PY - 2015/8/9
Y1 - 2015/8/9
KW - Discovery
KW - Frontier ranking
KW - Random walks
KW - Result relevance
KW - URL prioritization
KW - Web crawling
KW - Web frontier
KW - Web search engine
UR - http://www.scopus.com/inward/record.url?scp=84953711427&partnerID=8YFLogxK
U2 - 10.1145/2766462.2767737
DO - 10.1145/2766462.2767737
M3 - Conference contribution
AN - SCOPUS:84953711427
SP - 153
EP - 162
BT - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
T2 - 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015
Y2 - 9 August 2015 through 13 August 2015
ER -