Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task

Research output: Chapter in book/report/conference proceeding, Conference abstract, Research, peer-reviewed

Details

Original language: English
Title of host publication: Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives
Subtitle of host publication: Book of Abstracts
Pages: 57
Number of pages: 1
Publication status: Published - 23 Aug 2024
Event: EARLI SIG 6&7 Biennial Conference 2024: Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives - University of Tübingen, Tübingen, Germany
Duration: 21 Aug 2024 to 23 Aug 2024
https://www.earli.org/sig-6-7-conference-2024

Abstract

Synthetic data generation is one way to mitigate data scarcity. We investigate the generation of synthetic text data by prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses of real student texts from biology education. Prompts were designed to generate positive and negative samples of intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed-methods approach for evaluating the dataset: investigating statistical commonalities and differences between synthetic and real data, and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that the ranges of text lengths and numbers of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns, though we also get false samples. Generating positive samples worked better than generating negative samples. Due to generation errors, the synthetic data must be cleaned before further use as training data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.
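
The surface-level comparison mentioned in the abstract (text length, number of sentences, vocabulary size) could be sketched roughly as follows. This is a minimal illustration only, not the authors' procedure: the `describe` helper, the regex-based tokenisation and sentence splitting, and the toy example sentences are all assumptions introduced here and do not come from the study.

```python
import re
from statistics import mean


def describe(texts):
    """Return simple surface statistics for a list of texts.

    Hypothetical helper: tokens are runs of word characters and
    sentences are split on ., ! and ?, which is a crude approximation.
    """
    token_counts = []
    sentence_counts = []
    vocabulary = set()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        token_counts.append(len(tokens))
        sentence_counts.append(len(sentences))
        vocabulary.update(tokens)
    return {
        "length_range": (min(token_counts), max(token_counts)),
        "sentence_range": (min(sentence_counts), max(sentence_counts)),
        "mean_length": mean(token_counts),
        "vocabulary_size": len(vocabulary),
    }


# Invented toy sentences standing in for real and synthetic student texts.
real_texts = [
    "Whales needed fins to swim. So they grew them.",
    "The whale adapted because it wanted to live in water.",
]
synthetic_texts = [
    "The whales wanted to swim faster. Therefore they developed fins.",
    "Whales changed their bodies in order to survive in the sea.",
]

for name, texts in [("real", real_texts), ("synthetic", synthetic_texts)]:
    print(name, describe(texts))
```

Comparing the two resulting dictionaries side by side gives a first impression of whether the ranges overlap; similarity and lexical complexity measures would need additional tooling not shown here.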

Cite this

Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. / Stanja, Judith; Hoppe, Anett; Dannemann, Sarah et al.
Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57.

Stanja, J, Hoppe, A, Dannemann, S & Krugel, J 2024, Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. in Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. p. 57, EARLI SIG 6&7 Biennial Conference 2024, Tübingen, Baden-Württemberg, Germany, 21 Aug 2024. <https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf>
Stanja, J., Hoppe, A., Dannemann, S., & Krugel, J. (2024). Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. In Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts (p. 57). https://www.earli.org/assets/images/2024SIG6-7Conference_BookAbstract_Corrected.pdf
Stanja J, Hoppe A, Dannemann S, Krugel J. Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. In Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57
Stanja, Judith; Hoppe, Anett; Dannemann, Sarah et al. / Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task. Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives: Book of Abstracts. 2024. p. 57
@inbook{7c976e86ee3c487e9ae6836d957e5346,
title = "Generation and Evaluation of Synthetic Text Data for the Students{\textquoteright} Conceptions Identification Task",
abstract = "Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.",
author = "Judith Stanja and Anett Hoppe and Sarah Dannemann and Johannes Krugel",
year = "2024",
month = aug,
day = "23",
language = "English",
pages = "57",
booktitle = "Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives",
note = "EARLI SIG 6\&7 Biennial Conference 2024; Conference date: 21-08-2024 through 23-08-2024",
url = "https://www.earli.org/sig-6-7-conference-2024",

}

TY - CHAP

T1 - Generation and Evaluation of Synthetic Text Data for the Students’ Conceptions Identification Task

AU - Stanja, Judith

AU - Hoppe, Anett

AU - Dannemann, Sarah

AU - Krugel, Johannes

PY - 2024/8/23

Y1 - 2024/8/23

AB - Synthetic data generation is a solution to mitigate data scarcity. We investigate the generation of synthetic text data via prompting a pre-trained Large Language Model (LLM). The prompt design is based on reconstructive analyses from biology education of real student texts. Prompts were designed for the generation of positive and negative samples for intentional explanation patterns for the evolutionary adaptation of whales. We propose a mixed methods approach for the evaluation of the dataset: investigating statistical commonalities and differences between synthetic and real data and assessing frame-related aspects and correctness via an annotation study. Our preliminary findings show that ranges for text lengths and number of sentences are similar for synthetic and real data. We get mixed results for the similarity and lexical complexity of texts. The range of vocabulary sizes is similar in both datasets. We find that it is possible to generate data with indicators for the intentional patterns though we also get false samples. Generating positive samples worked better than for negative samples. Due to generation errors, further usage as training data requires cleaning of the synthetic data. The inter-annotator agreement in the annotation study was high. The study revealed crucial differences in frame annotations for correct positive and negative samples. We identify open questions and further steps for future research.

M3 - Conference abstract

SP - 57

BT - Instructional Design and Technology Enhanced Learning: Current States and Future Perspectives

T2 - EARLI SIG 6&7 Biennial Conference 2024

Y2 - 21 August 2024 through 23 August 2024

ER -
