NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Ling Chen
  • Sourav S. Bhowmick
  • Wolfgang Nejdl

External Research Organisations

  • Nanyang Technological University (NTU)

Details

Original language: English
Pages (from-to): 1150-1161
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 2
Issue number: 1
Publication status: Published - 1 Aug 2009

Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading only those Web pages located in hierarchical directory structures that are believed to have changed significantly (e.g., a substantial percentage of pages have been inserted into or removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of the Web directory structures of the original Web site, as stored in the archive, and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that, during the next crawl, optimally skips those subdirectories whose significant changes are negatively correlated with those of an ancestor directory. Our experimental results with real data show that our approach significantly improves the efficiency of the archive maintenance process at a slight cost to the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.
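As a rough illustration of the two steps described in the abstract, the following Python sketch first mines ancestor-descendant directory pairs that rarely change together and then uses them to prune a crawl. It is not the authors' NEAR-Miner or WARM implementation: the lift-based test for negative correlation, the thresholds min_support and max_lift, the function names mine_negative_rules and plan_crawl, and the example directory names are all assumptions made for this example.

```python
from collections import Counter, deque


def mine_negative_rules(history, parent, min_support=0.3, max_lift=0.5):
    """Return (ancestor, descendant) pairs that rarely change together.

    history: list of sets; each set holds the directories that changed
             significantly between two consecutive archived versions.
    parent:  dict mapping a directory to its parent directory.
    The lift threshold is an illustrative stand-in, not the paper's measure.
    """
    n = len(history)
    if n == 0:
        return set()
    support = Counter(d for changed in history for d in changed)
    rules = set()
    for desc in support:
        anc = parent.get(desc)
        while anc is not None:  # walk up the directory tree
            if anc in support:
                p_a = support[anc] / n
                p_d = support[desc] / n
                p_joint = sum(
                    1 for changed in history if anc in changed and desc in changed
                ) / n
                frequent = p_a >= min_support and p_d >= min_support
                if frequent and p_joint / (p_a * p_d) <= max_lift:
                    rules.add((anc, desc))
            anc = parent.get(anc)
    return rules


def plan_crawl(root, children, changed_significantly, rules):
    """Breadth-first crawl plan that skips subtrees ruled out by `rules`.

    children:              dict mapping a directory to its subdirectories.
    changed_significantly: hypothetical callable deciding, at crawl time,
                           whether a visited directory changed significantly.
    """
    visited, skipped = [], []
    queue = deque([(root, frozenset())])
    while queue:
        directory, changed_ancestors = queue.popleft()
        visited.append(directory)
        if changed_significantly(directory):
            changed_ancestors = changed_ancestors | {directory}
        for child in children.get(directory, []):
            # Skip the child (and its subtree) if a changed ancestor
            # is negatively correlated with it.
            if any((anc, child) in rules for anc in changed_ancestors):
                skipped.append(child)
            else:
                queue.append((child, changed_ancestors))
    return visited, skipped


# Toy usage with made-up directories and change history.
history = [
    {"/news"},
    {"/news/archive"},
    {"/news", "/docs"},
    {"/news/archive", "/docs"},
]
parent = {"/news": "/", "/docs": "/", "/news/archive": "/news"}
children = {"/": ["/news", "/docs"], "/news": ["/news/archive"]}

rules = mine_negative_rules(history, parent)
visited, skipped = plan_crawl("/", children, lambda d: d == "/news", rules)
print(rules, visited, skipped)
```

In this toy history, /news and /news/archive never change in the same version transition, so once /news is seen to have changed, the sketch leaves /news/archive to be served from the archive instead of recrawling it.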

Cite this

NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives. / Chen, Ling; Bhowmick, Sourav S.; Nejdl, Wolfgang.
In: Proceedings of the VLDB Endowment, Vol. 2, No. 1, 01.08.2009, p. 1150-1161.

Chen L, Bhowmick SS, Nejdl W. NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives. Proceedings of the VLDB Endowment. 2009 Aug 1;2(1):1150-1161. doi: 10.14778/1687627.1687757
Chen, Ling ; Bhowmick, Sourav S. ; Nejdl, Wolfgang. / NEAR-Miner : Mining evolution associations of web site directories for efficient maintenance of web archives. In: Proceedings of the VLDB Endowment. 2009 ; Vol. 2, No. 1. pp. 1150-1161.
BibTeX
@article{7915860ac96a4d60b6c3e62f4b58ea76,
title = "NEAR-Miner: Mining evolution associations of web site directories for efficient maintenance of web archives",
author = "Ling Chen and Bhowmick, {Sourav S.} and Wolfgang Nejdl",
year = "2009",
month = aug,
day = "1",
doi = "10.14778/1687627.1687757",
language = "English",
journal = "Proceedings of the VLDB Endowment",
volume = "2",
pages = "1150--1161",
number = "1",

}

RIS

TY - JOUR

T1 - NEAR-Miner

T2 - Mining evolution associations of web site directories for efficient maintenance of web archives

AU - Chen, Ling

AU - Bhowmick, Sourav S.

AU - Nejdl, Wolfgang

PY - 2009/8/1

Y1 - 2009/8/1

UR - http://www.scopus.com/inward/record.url?scp=79952762492&partnerID=8YFLogxK

U2 - 10.14778/1687627.1687757

DO - 10.14778/1687627.1687757

M3 - Article

AN - SCOPUS:79952762492

VL - 2

SP - 1150

EP - 1161

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 1

ER -
