Data Augmentation for Supervised Code Translation Learning

Binger Chen; Jacek Golebiowski; Ziawasch Abedjan

doi:10.1145/3643991.3644923

Details

Original language	English
Title of host publication	2024 IEEE/ACM 21st International Conference on Mining Software Repositories
Subtitle	MSR 2024
Pages	444-456
Number of pages	13
ISBN (electronic)	9798400705878
Publication status	Published - 2 Jul 2024
Event	21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024 - Lisbon, Portugal Duration: 15 Apr. 2024 → 16 Apr. 2024

Abstract

Data-driven program translation has recently been the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to tackle code translation due to the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C# benchmarks showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35%.

ASJC Scopus subject areas

Computer Science (all)
Computer Science Applications
Computer Science (all)
Software
Engineering (all)
Safety, Risk, Reliability and Quality

Cite this

Data Augmentation for Supervised Code Translation Learning. / Chen, Binger; Golebiowski, Jacek; Abedjan, Ziawasch.
2024 IEEE/ACM 21st International Conference on Mining Software Repositories: MSR 2024. 2024. pp. 444-456.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed

Chen, B, Golebiowski, J & Abedjan, Z 2024, Data Augmentation for Supervised Code Translation Learning. in 2024 IEEE/ACM 21st International Conference on Mining Software Repositories: MSR 2024. pp. 444-456, 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024, Lisbon, Portugal, 15 Apr. 2024. https://doi.org/10.1145/3643991.3644923

Chen, B., Golebiowski, J., & Abedjan, Z. (2024). Data Augmentation for Supervised Code Translation Learning. In 2024 IEEE/ACM 21st International Conference on Mining Software Repositories: MSR 2024 (pp. 444-456) https://doi.org/10.1145/3643991.3644923

Chen B, Golebiowski J, Abedjan Z. Data Augmentation for Supervised Code Translation Learning. in 2024 IEEE/ACM 21st International Conference on Mining Software Repositories: MSR 2024. 2024. pp. 444-456 doi: 10.1145/3643991.3644923

Chen, Binger ; Golebiowski, Jacek ; Abedjan, Ziawasch. / Data Augmentation for Supervised Code Translation Learning. 2024 IEEE/ACM 21st International Conference on Mining Software Repositories: MSR 2024. 2024. pp. 444-456

@inproceedings{0cb5630ba0ca435c94674d2060e90623,
  title = "Data Augmentation for Supervised Code Translation Learning",
  abstract = "Data-driven program translation has been recently the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to tackle code translation due to the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C#-benchmarks showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35%.",
  author = "Binger Chen and Jacek Golebiowski and Ziawasch Abedjan",
  note = "Publisher Copyright: {\textcopyright} 2024 ACM.; 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024 ; Conference date: 15-04-2024 Through 16-04-2024",
  year = "2024",
  month = jul,
  day = "2",
  doi = "10.1145/3643991.3644923",
  language = "English",
  pages = "444--456",
  booktitle = "2024 IEEE/ACM 21st International Conference on Mining Software Repositories",
}

TY - GEN

T1 - Data Augmentation for Supervised Code Translation Learning

AU - Chen, Binger

AU - Golebiowski, Jacek

AU - Abedjan, Ziawasch

PY - 2024/7/2

Y1 - 2024/7/2

N2 - Data-driven program translation has been recently the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to tackle code translation due to the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C#-benchmarks showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35%.

AB - Data-driven program translation has been recently the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to tackle code translation due to the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C#-benchmarks showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35%.

UR - http://www.scopus.com/inward/record.url?scp=85194843107&partnerID=8YFLogxK

U2 - 10.1145/3643991.3644923

DO - 10.1145/3643991.3644923

M3 - Conference contribution

AN - SCOPUS:85194843107

SP - 444

EP - 456

BT - 2024 IEEE/ACM 21st International Conference on Mining Software Repositories

T2 - 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024

Y2 - 15 April 2024 through 16 April 2024

ER -
