Inferring Missing Categorical Information in Noisy and Sparse Web Markup

Nicolas Tempelmeier; Elena Demidova; Stefan Dietze

doi:10.1145/3178876.3186028

Details

Original language	English
Title of host publication	The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018
Pages	1297-1306
Number of pages	10
ISBN (electronic)	9781450356398
Publication status	Published - 10 Apr 2018
Event	27th International World Wide Web Conference, WWW 2018 - Lyon, France
Duration	23 Apr 2018 → 27 Apr 2018

Publication series

Name	The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018

Abstract

Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages already in 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide less than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show a performance of 79% and 83% F1 score correspondingly, significantly outperforming existing baselines.

Keywords

Information inferring, Supervised learning, Web markup

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Software

Cite this

Inferring Missing Categorical Information in Noisy and Sparse Web Markup. / Tempelmeier, Nicolas; Demidova, Elena; Dietze, Stefan.
The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018. 2018. p. 1297-1306 (The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Tempelmeier, N, Demidova, E & Dietze, S 2018, Inferring Missing Categorical Information in Noisy and Sparse Web Markup. in The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018. The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018, pp. 1297-1306, 27th International World Wide Web, WWW 2018, Lyon, France, 23 Apr 2018. https://doi.org/10.1145/3178876.3186028, https://doi.org/10.15488/4771

Tempelmeier, N., Demidova, E., & Dietze, S. (2018). Inferring Missing Categorical Information in Noisy and Sparse Web Markup. In The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018 (pp. 1297-1306). (The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018). https://doi.org/10.1145/3178876.3186028, https://doi.org/10.15488/4771

Tempelmeier N, Demidova E, Dietze S. Inferring Missing Categorical Information in Noisy and Sparse Web Markup. In The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018. 2018. p. 1297-1306. (The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018). doi: 10.1145/3178876.3186028, 10.15488/4771

Tempelmeier, Nicolas ; Demidova, Elena ; Dietze, Stefan. / Inferring Missing Categorical Information in Noisy and Sparse Web Markup. The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018. 2018. pp. 1297-1306 (The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018).

@inproceedings{7deed01423c3416a8d99bba0c7c18986,

title = "Inferring Missing Categorical Information in Noisy and Sparse Web Markup",

abstract = "Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages already in 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide less than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show a performance of 79% and 83% F1 score correspondingly, significantly outperforming existing baselines.",

keywords = "Information inferring, Supervised learning, Web markup",

author = "Nicolas Tempelmeier and Elena Demidova and Stefan Dietze",

note = "Funding Information: This work was partially funded by the European Commission ({"}AFEL{"} project, grant ID 687916) and the BMBF ({"}Data4UrbanMobility{"} project, grant ID 02K15A040). ; 27th International World Wide Web, WWW 2018 ; Conference date: 23-04-2018 Through 27-04-2018",

year = "2018",

month = apr,

day = "10",

doi = "10.1145/3178876.3186028",

language = "English",

series = "The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018",

pages = "1297--1306",

booktitle = "The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018",

}

TY - GEN

T1 - Inferring Missing Categorical Information in Noisy and Sparse Web Markup

AU - Tempelmeier, Nicolas

AU - Demidova, Elena

AU - Dietze, Stefan

N1 - Funding Information: This work was partially funded by the European Commission ("AFEL" project, grant ID 687916) and the BMBF ("Data4UrbanMobility" project, grant ID 02K15A040).

PY - 2018/4/10

Y1 - 2018/4/10

AB - Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages already in 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide less than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show a performance of 79% and 83% F1 score correspondingly, significantly outperforming existing baselines.

KW - Information inferring

KW - Supervised learning

KW - Web markup

UR - http://www.scopus.com/inward/record.url?scp=85075443220&partnerID=8YFLogxK

U2 - 10.1145/3178876.3186028

DO - 10.1145/3178876.3186028

M3 - Conference contribution

AN - SCOPUS:85075443220

T3 - The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018

SP - 1297

EP - 1306

BT - The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018

T2 - 27th International World Wide Web, WWW 2018

Y2 - 23 April 2018 through 27 April 2018

ER -
