Details
Original language | English |
---|---|
Title of host publication | 2024 IEEE/ACM 21st International Conference on Mining Software Repositories |
Subtitle of host publication | MSR 2024 |
Pages | 444-456 |
Number of pages | 13 |
ISBN (electronic) | 9798400705878 |
Publication status | Published - 2 Jul 2024 |
Event | 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024 - Lisbon, Portugal; Duration: 15 Apr 2024 → 16 Apr 2024 |
Abstract
Data-driven program translation has recently been the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to code translation because of the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C# benchmarks, showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35%.
ASJC Scopus subject areas
- Computer Science(all)
- Computer Science Applications
- Software
- Engineering(all)
- Safety, Risk, Reliability and Quality
Cite this
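Data Augmentation for Supervised Code Translation Learning. / Chen, Binger; Golebiowski, Jacek; Abedjan, Ziawasch. 2024 IEEE/ACM 21st International Conference on Mining Software Repositories: MSR 2024. 2024. p. 444-456.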
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Data Augmentation for Supervised Code Translation Learning
AU - Chen, Binger
AU - Golebiowski, Jacek
AU - Abedjan, Ziawasch
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/7/2
Y1 - 2024/7/2
UR - http://www.scopus.com/inward/record.url?scp=85194843107&partnerID=8YFLogxK
U2 - 10.1145/3643991.3644923
DO - 10.1145/3643991.3644923
M3 - Conference contribution
AN - SCOPUS:85194843107
SP - 444
EP - 456
BT - 2024 IEEE/ACM 21st International Conference on Mining Software Repositories
T2 - 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024
Y2 - 15 April 2024 through 16 April 2024
ER -