NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives

Ling Chen; Sourav S. Bhowmick; Wolfgang Nejdl

doi:10.14778/1687627.1687757

Details

Original language	English
Pages (from-to)	1150-1161
Number of pages	12
Journal	Proceedings of the VLDB Endowment
Volume	2
Issue number	1
Publication status	Published - 1 Aug 2009

Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages that are located in hierarchical directory structures which are believed to have changed significantly (e.g., a substantial percentage of pages are inserted into/removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of Web directory structures of the original Web site stored in the archive and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that optimally skips the subdirectories (during the next crawl) that are negatively correlated with their ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach improves the efficiency of the archive maintenance process significantly while sacrificing slightly in keeping the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.
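
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' NEAR-Miner or maintenance algorithm: the abstract does not state which correlation measure or thresholds are used, so the phi coefficient, the `threshold` value, the change-history encoding (one 0/1 flag per archived version, 1 meaning "changed significantly"), and all function names here are assumptions made purely for illustration.

```python
def phi(x, y):
    """Phi correlation coefficient between two equal-length 0/1 sequences."""
    n = len(x)
    n11 = sum(1 for a, b in zip(x, y) if a and b)  # versions where both changed
    nx, ny = sum(x), sum(y)
    denom = (nx * (n - nx) * ny * (n - ny)) ** 0.5
    if denom == 0:
        return 0.0  # one sequence is constant: correlation is undefined, treat as 0
    return (n * n11 - nx * ny) / denom


def is_ancestor(ancestor, descendant):
    """True if `ancestor` is a proper path prefix of `descendant`."""
    return descendant != ancestor and descendant.startswith(ancestor.rstrip("/") + "/")


def mine_negative_rules(history, threshold=-0.5):
    """Map each directory to the descendants whose change history is
    negatively correlated with it (phi <= threshold, an assumed cutoff)."""
    rules = {}
    for anc, anc_changes in history.items():
        for desc, desc_changes in history.items():
            if is_ancestor(anc, desc) and phi(anc_changes, desc_changes) <= threshold:
                rules.setdefault(anc, set()).add(desc)
    return rules


def dirs_to_skip(rules, changed_dir):
    """Subdirectories a crawler might skip once `changed_dir` is seen to have changed."""
    return rules.get(changed_dir, set())


# Hypothetical change history over six archived versions.
history = {
    "/news": [1, 0, 1, 0, 1, 0],
    "/news/archive": [0, 1, 0, 1, 0, 1],
    "/news/today": [1, 0, 1, 0, 1, 1],
}
rules = mine_negative_rules(history)
skipped = dirs_to_skip(rules, "/news")
```

In this toy history, `/news/archive` changes exactly when `/news` does not, so it is the only subdirectory selected for skipping; `/news/today` tends to change together with `/news` and is still crawled.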

ASJC Scopus subject areas

Computer Science (all)
Computer Science (miscellaneous)
General Computer Science

Cite this

NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives. / Chen, Ling; Bhowmick, Sourav S.; Nejdl, Wolfgang.
In: Proceedings of the VLDB Endowment, Vol. 2, No. 1, 01.08.2009, p. 1150-1161.

Research output: Contribution to journal › Article › Research › peer review

Chen, L, Bhowmick, SS & Nejdl, W 2009, 'NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives', Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1150-1161. https://doi.org/10.14778/1687627.1687757

Chen, L., Bhowmick, S. S., & Nejdl, W. (2009). NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives. Proceedings of the VLDB Endowment, 2(1), 1150-1161. https://doi.org/10.14778/1687627.1687757

Chen L, Bhowmick SS, Nejdl W. NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives. Proceedings of the VLDB Endowment. 2009 Aug 1;2(1):1150-1161. doi: 10.14778/1687627.1687757

Chen, Ling; Bhowmick, Sourav S.; Nejdl, Wolfgang. / NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives. In: Proceedings of the VLDB Endowment. 2009; Vol. 2, No. 1. pp. 1150-1161.

@article{7915860ac96a4d60b6c3e62f4b58ea76,

title = "{NEAR-Miner}: Mining evolution associations of web site directories for efficient maintenance of web archives",

abstract = "Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages that are located in hierarchical directory structures which are believed to have changed significantly (e.g., a substantial percentage of pages are inserted into/removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of Web directory structures of the original Web site stored in the archive and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that optimally skips the subdirectories (during the next crawl) that are negatively correlated with their ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach improves the efficiency of the archive maintenance process significantly while sacrificing slightly in keeping the {"}freshness{"} of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.",

author = "Chen, Ling and Bhowmick, Sourav S. and Nejdl, Wolfgang",

year = "2009",

month = aug,

day = "1",

doi = "10.14778/1687627.1687757",

journal = "Proceedings of the VLDB Endowment",

language = "English",

volume = "2",

pages = "1150--1161",

number = "1",

}

TY - JOUR

T1 - NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives

AU - Chen, Ling

AU - Bhowmick, Sourav S.

AU - Nejdl, Wolfgang

PY - 2009/8/1

Y1 - 2009/8/1

AB - Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages that are located in hierarchical directory structures which are believed to have changed significantly (e.g., a substantial percentage of pages are inserted into/removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of Web directory structures of the original Web site stored in the archive and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that optimally skips the subdirectories (during the next crawl) that are negatively correlated with their ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach improves the efficiency of the archive maintenance process significantly while sacrificing slightly in keeping the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.

UR - http://www.scopus.com/inward/record.url?scp=79952762492&partnerID=8YFLogxK

U2 - 10.14778/1687627.1687757

DO - 10.14778/1687627.1687757

M3 - Article

AN - SCOPUS:79952762492

VL - 2

SP - 1150

EP - 1161

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 1

ER -
