A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking

Giang Tran; Ata Turk; B. Barla Cambazoglu; Wolfgang Nejdl

doi:10.1145/2766462.2767737

Details

Original language	English
Title of host publication	SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages	153-162
Number of pages	10
ISBN (electronic)	9781450336215
Publication status	Published - 9 Aug 2015
Event	38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015 - Santiago, Chile
Duration	9 Aug 2015 → 13 Aug 2015

Abstract

Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.

Keywords

Discovery, Frontier ranking, Random walks, Result relevance, URL prioritization, Web crawling, Web frontier, Web search engine

ASJC Scopus subject areas

Computer Science(all) › Information Systems
Computer Science(all) › Software

Cite this

A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. / Tran, Giang; Turk, Ata; Cambazoglu, B. Barla et al.
SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015. p. 153-162.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Tran, G, Turk, A, Cambazoglu, BB & Nejdl, W 2015, A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. in SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 153-162, 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015, Santiago, Chile, 9 Aug 2015. https://doi.org/10.1145/2766462.2767737

Tran, G., Turk, A., Cambazoglu, B. B., & Nejdl, W. (2015). A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. In SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 153-162). https://doi.org/10.1145/2766462.2767737

Tran G, Turk A, Cambazoglu BB, Nejdl W. A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. In SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015. p. 153-162 doi: 10.1145/2766462.2767737

Tran, Giang ; Turk, Ata ; Cambazoglu, B. Barla et al. / A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking. SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015. pp. 153-162


@inproceedings{a7464169c6e249ed89c077b6fbe6217c,

title = "A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking",

abstract = "Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.",

keywords = "Discovery, Frontier ranking, Random walks, Result relevance, URL prioritization, Web crawling, Web frontier, Web search engine",

author = "Giang Tran and Ata Turk and Cambazoglu, {B. Barla} and Wolfgang Nejdl",

note = "Funding information: This work was supported by the ERC Advanced Grant ALEXANDRIA (339233) and the LEADS project (ICT-318809), funded by the European Community.; 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015 ; Conference date: 09-08-2015 Through 13-08-2015",

year = "2015",

month = aug,

day = "9",

doi = "10.1145/2766462.2767737",

language = "English",

pages = "153--162",

booktitle = "SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval",

}


TY - GEN

T1 - A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking

AU - Tran, Giang

AU - Turk, Ata

AU - Cambazoglu, B. Barla

AU - Nejdl, Wolfgang

N1 - Funding information: This work was supported by the ERC Advanced Grant ALEXANDRIA (339233) and the LEADS project (ICT-318809), funded by the European Community.

PY - 2015/8/9

Y1 - 2015/8/9

N2 - Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.

AB - Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.

KW - Discovery

KW - Frontier ranking

KW - Random walks

KW - Result relevance

KW - URL prioritization

KW - Web crawling

KW - Web frontier

KW - Web search engine

UR - http://www.scopus.com/inward/record.url?scp=84953711427&partnerID=8YFLogxK

U2 - 10.1145/2766462.2767737

DO - 10.1145/2766462.2767737

M3 - Conference contribution

AN - SCOPUS:84953711427

SP - 153

EP - 162

BT - SIGIR 2015 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

T2 - 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2015

Y2 - 9 August 2015 through 13 August 2015

ER -

By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction
