STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents

Research output: Chapter in book/report/conference proceeding / Conference contribution / Research / peer review

Authors

  • Nan Zhang
  • Shomir Wilson
  • Prasenjit Mitra

Research Organisations

External Research Organisations

  • Pennsylvania State University

Details

Original language: English
Title of host publication: 2022 Language Resources and Evaluation Conference, LREC 2022
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Pages: 3461-3470
Number of pages: 10
ISBN (electronic): 9791095546726
Publication status: Published - 2022
Event: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: 20 Jun 2022 to 25 Jun 2022

Abstract

Formal documents are often organized into sections of text, each with a title, and extracting this structure remains an under-explored aspect of natural language processing. This iterative title-text structure is valuable data for building models for headline generation and section title generation, but no existing corpus contains web documents annotated with titles and prose texts. We therefore propose the first title-text dataset on web documents, covering a wide variety of domains to facilitate downstream training. We also introduce STAPI (Section Title And Prose text Identifier), a two-step system for labeling section titles and prose text in HTML documents. In the first step, a filter reads HTML documents, discards unrelated content such as document footers, and proposes a set of textual candidates. In the second step, a typographic classifier takes the candidates from the filter and assigns each one to one of three pre-defined classes (title, prose text, or miscellany). We show that STAPI significantly outperforms two baseline models at title-text identification. We release our dataset along with a web application to facilitate supervised and semi-supervised training in this domain.
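
The two-step design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the filter rules or the classifier's features, so the tag sets, the word-count thresholds, and the names `CandidateFilter`, `classify`, and `label_document` below are all assumptions made for illustration. In particular, a hand-written rule stands in for the typographic classifier. Only the overall shape (filter, then three-way labeling into title, prose text, and miscellany) comes from the abstract.

```python
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not list the actual filter rules.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header", "aside"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
TEXT_TAGS = HEADING_TAGS | {"p", "li", "b", "strong"}


class CandidateFilter(HTMLParser):
    """Step 1 (sketch): collect (tag, text) candidates, skipping unrelated regions."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._skip_depth = 0   # > 0 while inside a skipped region such as <footer>
        self._current = None   # outermost text-bearing tag being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag in TEXT_TAGS and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._current:
            # Normalize whitespace before storing the candidate text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.candidates.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._current is not None:
            self._buffer.append(data)


def classify(tag, text):
    """Step 2 (sketch): a rule-based stand-in for the typographic classifier."""
    words = text.split()
    # Hypothetical cues: heading tags, or short bold runs, look like titles.
    if tag in HEADING_TAGS or (tag in {"b", "strong"} and len(words) <= 12):
        return "title"
    # Hypothetical cue: a paragraph with at least five words looks like prose.
    if tag == "p" and len(words) >= 5:
        return "prose"
    return "miscellany"


def label_document(html):
    """Run the filter, then label every candidate it proposes."""
    parser = CandidateFilter()
    parser.feed(html)
    parser.close()
    return [(classify(tag, text), text) for tag, text in parser.candidates]


SAMPLE = """
<html><body>
<nav><p>Home | About</p></nav>
<h2>Data Collection</h2>
<p>We collect information that you provide to us directly.</p>
<p>Top</p>
<footer><p>Copyright 2022 Example Corp. All rights reserved.</p></footer>
</body></html>
"""

for label, text in label_document(SAMPLE):
    print(label, "|", text)
```

On the sample page, the navigation bar and footer never reach the second step, which mirrors the stated purpose of the filter. Keeping the two steps separate also means the rule in `classify` could be swapped for a trained model without touching the candidate extraction.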

Keywords

    Automatic Scraping, Information Extraction, Iterative Title-Text Dataset, Title-Text Identification

Cite this

STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents. / Zhang, Nan; Wilson, Shomir; Mitra, Prasenjit.
2022 Language Resources and Evaluation Conference, LREC 2022. ed. / Nicoletta Calzolari; Frederic Bechet; Philippe Blache; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Helene Mazo; Jan Odijk; Stelios Piperidis. 2022. p. 3461-3470.

Zhang, N, Wilson, S & Mitra, P 2022, STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents. in N Calzolari, F Bechet, P Blache, K Choukri, C Cieri, T Declerck, S Goggi, H Isahara, B Maegaard, J Mariani, H Mazo, J Odijk & S Piperidis (eds), 2022 Language Resources and Evaluation Conference, LREC 2022. pp. 3461-3470, 13th International Conference on Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20 Jun 2022. <https://aclanthology.org/2022.lrec-1.371>
Zhang, N., Wilson, S., & Mitra, P. (2022). STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents. In N. Calzolari, F. Bechet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), 2022 Language Resources and Evaluation Conference, LREC 2022 (pp. 3461-3470) https://aclanthology.org/2022.lrec-1.371
Zhang N, Wilson S, Mitra P. STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents. In Calzolari N, Bechet F, Blache P, Choukri K, Cieri C, Declerck T, Goggi S, Isahara H, Maegaard B, Mariani J, Mazo H, Odijk J, Piperidis S, editors, 2022 Language Resources and Evaluation Conference, LREC 2022. 2022. p. 3461-3470
Zhang, Nan ; Wilson, Shomir ; Mitra, Prasenjit. / STAPI : An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents. 2022 Language Resources and Evaluation Conference, LREC 2022. editor / Nicoletta Calzolari ; Frederic Bechet ; Philippe Blache ; Khalid Choukri ; Christopher Cieri ; Thierry Declerck ; Sara Goggi ; Hitoshi Isahara ; Bente Maegaard ; Joseph Mariani ; Helene Mazo ; Jan Odijk ; Stelios Piperidis. 2022. pp. 3461-3470
@inproceedings{0cffe8ccf7cb49b29bd7d27195908fea,
title = "STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents",
abstract = "Formal documents often are organized into sections of text, each with a title, and extracting this structure remains an under-explored aspect of natural language processing. This iterative title-text structure is valuable data for building models for headline generation and section title generation, but there is no corpus that contains web documents annotated with titles and prose texts. Therefore, we propose the first title-text dataset on web documents that incorporates a wide variety of domains to facilitate downstream training. We also introduce STAPI (Section Title And Prose text Identifier), a two-step system for labeling section titles and prose text in HTML documents. To filter out unrelated content like document footers, its first step involves a filter that reads HTML documents and proposes a set of textual candidates. In the second step, a typographic classifier takes the candidates from the filter and categorizes each one into one of the three pre-defined classes (title, prose text, and miscellany). We show that STAPI significantly outperforms two baseline models in terms of title-text identification. We release our dataset along with a web application to facilitate supervised and semi-supervised training in this domain.",
keywords = "Automatic Scraping, Information Extraction, Iterative Title-Text Dataset, Title-Text Identification",
author = "Nan Zhang and Shomir Wilson and Prasenjit Mitra",
year = "2022",
language = "English",
pages = "3461--3470",
editor = "Nicoletta Calzolari and Frederic Bechet and Philippe Blache and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Helene Mazo and Jan Odijk and Stelios Piperidis",
booktitle = "2022 Language Resources and Evaluation Conference, LREC 2022",
note = "13th International Conference on Language Resources and Evaluation Conference, LREC 2022 ; Conference date: 20-06-2022 Through 25-06-2022",

}

TY - GEN

T1 - STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents

T2 - 13th International Conference on Language Resources and Evaluation Conference, LREC 2022

AU - Zhang, Nan

AU - Wilson, Shomir

AU - Mitra, Prasenjit

PY - 2022

Y1 - 2022

AB - Formal documents often are organized into sections of text, each with a title, and extracting this structure remains an under-explored aspect of natural language processing. This iterative title-text structure is valuable data for building models for headline generation and section title generation, but there is no corpus that contains web documents annotated with titles and prose texts. Therefore, we propose the first title-text dataset on web documents that incorporates a wide variety of domains to facilitate downstream training. We also introduce STAPI (Section Title And Prose text Identifier), a two-step system for labeling section titles and prose text in HTML documents. To filter out unrelated content like document footers, its first step involves a filter that reads HTML documents and proposes a set of textual candidates. In the second step, a typographic classifier takes the candidates from the filter and categorizes each one into one of the three pre-defined classes (title, prose text, and miscellany). We show that STAPI significantly outperforms two baseline models in terms of title-text identification. We release our dataset along with a web application to facilitate supervised and semi-supervised training in this domain.

KW - Automatic Scraping

KW - Information Extraction

KW - Iterative Title-Text Dataset

KW - Title-Text Identification

UR - http://www.scopus.com/inward/record.url?scp=85144373238&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85144373238

SP - 3461

EP - 3470

BT - 2022 Language Resources and Evaluation Conference, LREC 2022

A2 - Calzolari, Nicoletta

A2 - Bechet, Frederic

A2 - Blache, Philippe

A2 - Choukri, Khalid

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Helene

A2 - Odijk, Jan

A2 - Piperidis, Stelios

Y2 - 20 June 2022 through 25 June 2022

ER -