Details
Field | Value |
---|---|
Original language | English |
Title of host publication | JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries |
Pages | 83-92 |
Number of pages | 10 |
ISBN (electronic) | 9781450342292 |
Publication status | Published - 2016 |
Event | 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016, Newark, United States (19 Jun 2016 → 23 Jun 2016) |
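Abstract
Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.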
Keywords
- cs.DL, cs.DB, Big Data, Data Extraction, Web Archives
ASJC Scopus subject areas
- Engineering(all)
- General Engineering
Cite this
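ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. / Holzmann, Helge; Goel, Vinay; Anand, Avishek. JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016. p. 83-92 7559568.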
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research
TY - GEN
T1 - ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
AU - Holzmann, Helge
AU - Goel, Vinay
AU - Anand, Avishek
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016
Y1 - 2016
N2 - Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.
KW - cs.DL
KW - cs.DB
KW - Big Data
KW - Data Extraction
KW - Web Archives
UR - http://www.scopus.com/inward/record.url?scp=84989892199&partnerID=8YFLogxK
U2 - 10.1145/2910896.2910902
DO - 10.1145/2910896.2910902
M3 - Conference contribution
SN - 978-1-4503-4229-2
SP - 83
EP - 92
BT - JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
T2 - 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016
Y2 - 19 June 2016 through 23 June 2016
ER -