ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

Helge Holzmann; Vinay Goel; Avishek Anand

doi:10.1145/2910896.2910902

Details

Original language	English
Title of host publication	JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
Pages	83-92
Number of pages	10
ISBN (electronic)	9781450342292
Publication status	Published - 2016
Event	16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016 - Newark, United States
Duration	19 Jun 2016 → 23 Jun 2016

Abstract

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.

Keywords

cs.DL, cs.DB, Big Data, Data Extraction, Web Archives

ASJC Scopus subject areas

Engineering(all)
General Engineering

Cite this

ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. / Holzmann, Helge; Goel, Vinay; Anand, Avishek.
JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016. p. 83-92, Article 7559568.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research

Holzmann, H, Goel, V & Anand, A 2016, ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. in JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries., 7559568, pp. 83-92, 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016, Newark, United States, 19 Jun 2016. https://doi.org/10.1145/2910896.2910902

Holzmann, H., Goel, V., & Anand, A. (2016). ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (pp. 83-92). Article 7559568 https://doi.org/10.1145/2910896.2910902

Holzmann H, Goel V, Anand A. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016. p. 83-92. 7559568 doi: 10.1145/2910896.2910902

Holzmann, Helge ; Goel, Vinay ; Anand, Avishek. / ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016. pp. 83-92


@inproceedings{0d42483f8ee84a27aaa5616ff3ab70f6,

title = "ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation",

abstract = "Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.",

keywords = "cs.DL, cs.DB, Big Data, Data Extraction, Web Archives",

author = "Helge Holzmann and Vinay Goel and Avishek Anand",

note = "Publisher Copyright: {\textcopyright} 2016 ACM.; 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016 ; Conference date: 19-06-2016 Through 23-06-2016",

year = "2016",

doi = "10.1145/2910896.2910902",

language = "English",

isbn = "978-1-4503-4229-2",

pages = "83--92",

booktitle = "JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries",

}


TY - GEN

T1 - ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

AU - Holzmann, Helge

AU - Goel, Vinay

AU - Anand, Avishek

PY - 2016

Y1 - 2016

N2 - Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.

AB - Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.

KW - cs.DL

KW - cs.DB

KW - Big Data

KW - Data Extraction

KW - Web Archives

UR - http://www.scopus.com/inward/record.url?scp=84989892199&partnerID=8YFLogxK

U2 - 10.1145/2910896.2910902

DO - 10.1145/2910896.2910902

M3 - Conference contribution

SN - 978-1-4503-4229-2

SP - 83

EP - 92

BT - JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries

T2 - 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016

Y2 - 19 June 2016 through 23 June 2016

ER -

