Micro Archives as Rich Digital Object Representations

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

Authors

  • Helge Holzmann
  • Mila Runnwerth

Research Organisations

External Research Organisations

  • German National Library of Science and Technology (TIB)

Details

Original language: English
Title of host publication: WebSci '18
Subtitle of host publication: Proceedings of the 10th ACM Conference on Web Science
Pages: 353-357
Number of pages: 5
ISBN (electronic): 978-1-4503-5563-6
Publication status: Published - 15 May 2018
Event: 10th ACM Conference on Web Science, WebSci 2018 - Amsterdam, Netherlands
Duration: 27 May 2018 to 30 May 2018

Abstract

Digital objects as well as real-world entities are commonly referred to in literature or on the Web by mentioning their name, linking to their website or citing unique identifiers, such as DOI and ORCID, which are backed by a set of meta information. All of these methods have severe disadvantages and are not always suitable, though: they are not very precise, are not guaranteed to be persistent, or mean a big additional effort for the author, who needs to collect the metadata to describe the reference accurately. Especially for complex, evolving entities and objects like software, pre-defined metadata schemas are often not expressive enough to capture their temporal state comprehensively. We found in previous work that a lot of meaningful information about software, such as a description, rich metadata, its documentation and source code, is usually available online. However, all of this needs to be preserved coherently in order to constitute a rich digital representation of the entity. We show that this is currently not the case, as only 10% of the studied blog posts and roughly 30% of the analyzed software websites are archived completely, i.e., all linked resources are captured as well. Therefore, we propose Micro Archives as rich digital object representations, which semantically and logically connect archived resources and ensure a coherent state. With Micrawler we present a modular solution to create, cite and analyze such Micro Archives. In this paper, we show the need for this approach as well as discuss opportunities and implications for various applications also beyond scholarly writing.

Keywords

    Crawling, Data representation, Scientific workflow, Web archives

Cite this

Micro Archives as Rich Digital Object Representations. / Holzmann, Helge; Runnwerth, Mila.
WebSci '18: Proceedings of the 10th ACM Conference on Web Science. 2018. p. 353-357.

Holzmann, H & Runnwerth, M 2018, Micro Archives as Rich Digital Object Representations. in WebSci '18: Proceedings of the 10th ACM Conference on Web Science. pp. 353-357, 10th ACM Conference on Web Science, WebSci 2018, Amsterdam, Netherlands, 27 May 2018. https://doi.org/10.1145/3201064.3201110, https://doi.org/10.15488/3963
Holzmann, H., & Runnwerth, M. (2018). Micro Archives as Rich Digital Object Representations. In WebSci '18: Proceedings of the 10th ACM Conference on Web Science (pp. 353-357). https://doi.org/10.1145/3201064.3201110, https://doi.org/10.15488/3963
Holzmann H, Runnwerth M. Micro Archives as Rich Digital Object Representations. In WebSci '18: Proceedings of the 10th ACM Conference on Web Science. 2018. p. 353-357 doi: 10.1145/3201064.3201110, 10.15488/3963
Holzmann, Helge ; Runnwerth, Mila. / Micro Archives as Rich Digital Object Representations. WebSci '18: Proceedings of the 10th ACM Conference on Web Science. 2018. pp. 353-357
@inproceedings{73821873e0634ff88ddb2d47cfc51445,
title = "Micro Archives as Rich Digital Object Representations",
abstract = "Digital objects as well as real-world entities are commonly referred to in literature or on the Web by mentioning their name, linking to their website or citing unique identifiers, such as DOI and ORCID, which are backed by a set of meta information. All of these methods have severe disadvantages and are not always suitable though: They are not very precise, not guaranteed to be persistent or mean a big additional effort for the author, who needs to collect the metadata to describe the reference accurately. Especially for complex, evolving entities and objects like software, pre-defined metadata schemas are often not expressive enough to capture its temporal state comprehensively. We found in previous work that a lot of meaningful information about software, such as a description, rich metadata, its documentation and source code, is usually available online. However, all of this needs to be preserved coherently in order to constitute a rich digital representation of the entity. We show that this is currently not the case, as only 10% of the studied blog posts and roughly 30% of the analyzed software websites are archived completely, i.e., all linked resources are captured as well. Therefore, we propose Micro Archives as rich digital object representations, which semantically and logically connect archived resources and ensure a coherent state. With Micrawler we present a modular solution to create, cite and analyze such Micro Archives. In this paper, we show the need for this approach as well as discuss opportunities and implications for various applications also beyond scholarly writing.",
keywords = "Crawling, Data representation, Scientific workflow, Web archives",
author = "Helge Holzmann and Mila Runnwerth",
note = "Publisher Copyright: {\textcopyright} 2018 Association for Computing Machinery.; 10th ACM Conference on Web Science, WebSci 2018 ; Conference date: 27-05-2018 Through 30-05-2018",
year = "2018",
month = may,
day = "15",
doi = "10.1145/3201064.3201110",
language = "English",
pages = "353--357",
booktitle = "WebSci '18",

}

TY - GEN

T1 - Micro Archives as Rich Digital Object Representations

AU - Holzmann, Helge

AU - Runnwerth, Mila

N1 - Publisher Copyright: © 2018 Association for Computing Machinery.

PY - 2018/5/15

Y1 - 2018/5/15

AB - Digital objects as well as real-world entities are commonly referred to in literature or on the Web by mentioning their name, linking to their website or citing unique identifiers, such as DOI and ORCID, which are backed by a set of meta information. All of these methods have severe disadvantages and are not always suitable though: They are not very precise, not guaranteed to be persistent or mean a big additional effort for the author, who needs to collect the metadata to describe the reference accurately. Especially for complex, evolving entities and objects like software, pre-defined metadata schemas are often not expressive enough to capture its temporal state comprehensively. We found in previous work that a lot of meaningful information about software, such as a description, rich metadata, its documentation and source code, is usually available online. However, all of this needs to be preserved coherently in order to constitute a rich digital representation of the entity. We show that this is currently not the case, as only 10% of the studied blog posts and roughly 30% of the analyzed software websites are archived completely, i.e., all linked resources are captured as well. Therefore, we propose Micro Archives as rich digital object representations, which semantically and logically connect archived resources and ensure a coherent state. With Micrawler we present a modular solution to create, cite and analyze such Micro Archives. In this paper, we show the need for this approach as well as discuss opportunities and implications for various applications also beyond scholarly writing.

KW - Crawling

KW - Data representation

KW - Scientific workflow

KW - Web archives

UR - http://www.scopus.com/inward/record.url?scp=85049394671&partnerID=8YFLogxK

U2 - 10.1145/3201064.3201110

DO - 10.1145/3201064.3201110

M3 - Conference contribution

AN - SCOPUS:85049394671

SP - 353

EP - 357

BT - WebSci '18

T2 - 10th ACM Conference on Web Science, WebSci 2018

Y2 - 27 May 2018 through 30 May 2018

ER -