A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

External Research Organisations

  • Boston University (BU)
  • Yahoo Research Labs

Details

Original language: English
Title of host publication: SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 153-162
Number of pages: 10
ISBN (electronic): 9781450336215
Publication status: Published - 9 Aug 2015
Event: 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015 - Santiago, Chile
Duration: 9 Aug 2015 to 13 Aug 2015

Abstract

Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.
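
The abstract does not spell out the random walk model, so the following is only a minimal sketch of the general idea, not the authors' method: a generic random walk with restart over the known link graph, where the restart distribution is biased toward crawled pages with a high search-impact weight, and uncrawled frontier URLs are ranked by the probability mass that reaches them. The function name `rank_frontier`, the `links` and `impact` inputs, and the damping value of 0.85 are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_frontier(links, impact, damping=0.85, iterations=50):
    """Order frontier URLs by an impact-biased random walk (illustrative only).

    links:  dict mapping each crawled page to the list of URLs it links to.
    impact: dict mapping crawled pages to a non-negative weight, a hypothetical
            stand-in for how much a page contributes to search quality.
    """
    crawled = set(links)
    nodes = crawled | {v for outs in links.values() for v in outs}

    total = sum(impact.get(u, 0.0) for u in crawled)
    if total <= 0:
        raise ValueError("at least one crawled page needs positive impact")

    # Restarts land only on crawled pages, in proportion to their impact.
    teleport = {u: impact.get(u, 0.0) / total for u in crawled}
    score = {u: teleport.get(u, 0.0) for u in nodes}

    for _ in range(iterations):
        nxt = defaultdict(float)
        restart = 0.0
        for u, mass in score.items():
            outs = links.get(u)
            if outs:
                # Follow an outlink with probability `damping`, else restart.
                share = damping * mass / len(outs)
                for v in outs:
                    nxt[v] += share
                restart += (1.0 - damping) * mass
            else:
                # Frontier URLs and dead ends have no known outlinks,
                # so the walker always restarts from them.
                restart += mass
        for u, t in teleport.items():
            nxt[u] += restart * t
        score = {u: nxt[u] for u in nodes}

    frontier = nodes - crawled
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(frontier, key=lambda u: (-score[u], u))


links = {"a": ["b", "x"], "b": ["y"]}   # x and y are uncrawled frontier URLs
impact = {"a": 1.0, "b": 9.0}           # made-up impact weights
print(rank_frontier(links, impact))
```

In this toy graph, `y` ranks ahead of `x` because it is the only outlink of the high-impact page `b`. A real crawler would need to derive the impact weights from query logs or similar signals, which this sketch leaves out.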

Keywords

    Discovery, Frontier ranking, Random walks, Result relevance, URL prioritization, Web crawling, Web frontier, Web search engine

Cite this

A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. / Tran, Giang; Turk, Ata; Cambazoglu, B. Barla et al.
SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015. p. 153-162.

Tran, G, Turk, A, Cambazoglu, BB & Nejdl, W 2015, A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. in SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 153-162, 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, Santiago, Chile, 9 Aug 2015. https://doi.org/10.1145/2766462.2767737
Tran, G., Turk, A., Cambazoglu, B. B., & Nejdl, W. (2015). A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. In SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 153-162) https://doi.org/10.1145/2766462.2767737
Tran G, Turk A, Cambazoglu BB, Nejdl W. A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. In SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015. p. 153-162 doi: 10.1145/2766462.2767737
Tran, Giang ; Turk, Ata ; Cambazoglu, B. Barla et al. / A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015. pp. 153-162
@inproceedings{a7464169c6e249ed89c077b6fbe6217c,
title = "A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking",
abstract = "Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.",
keywords = "Discovery, Frontier ranking, Random walks, Result relevance, URL prioritization, Web crawling, Web frontier, Web search engine",
author = "Giang Tran and Ata Turk and Cambazoglu, {B. Barla} and Wolfgang Nejdl",
note = "Funding information: This work was supported by the ERC Advanced Grant ALEXANDRIA (339233) and the LEADS project (ICT-318809), funded by the European Community.; 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015 ; Conference date: 09-08-2015 Through 13-08-2015",
year = "2015",
month = aug,
day = "9",
doi = "10.1145/2766462.2767737",
language = "English",
pages = "153--162",
booktitle = "SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval",

}

TY - GEN
T1 - A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking
AU - Tran, Giang
AU - Turk, Ata
AU - Cambazoglu, B. Barla
AU - Nejdl, Wolfgang
N1 - Funding information: This work was supported by the ERC Advanced Grant ALEXANDRIA (339233) and the LEADS project (ICT-318809), funded by the European Community.
PY - 2015/8/9
Y1 - 2015/8/9
AB - Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.

KW - Discovery
KW - Frontier ranking
KW - Random walks
KW - Result relevance
KW - URL prioritization
KW - Web crawling
KW - Web frontier
KW - Web search engine
UR - http://www.scopus.com/inward/record.url?scp=84953711427&partnerID=8YFLogxK
U2 - 10.1145/2766462.2767737
DO - 10.1145/2766462.2767737
M3 - Conference contribution
AN - SCOPUS:84953711427
SP - 153
EP - 162
BT - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
T2 - 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015
Y2 - 9 August 2015 through 13 August 2015
ER -
