Unsupervised Learning for Program Repair and Grammatical Error Correction
Break-It-Fix-It (BIFI) is an unsupervised method for training a model to fix source code errors, purely from unlabeled data (raw code snippets) and a critic (a tool that checks the validity of code, such as a compiler).
Moreover, to extend BIFI to fixing text errors, where no perfect critic exists, we develop LM-Critic, an approximate critic that uses pretrained language models to assess the validity of text input.
BIFI and LM-Critic provide significant performance boosts over existing systems on program repair and grammatical error correction benchmarks.
Motivation
Automating the correction of source code (e.g. fixing compiler errors) and text (e.g. fixing typos and grammatical errors) can greatly enhance the productivity of programming and writing. A major challenge in training machine learning models (fixers) for such repair tasks is obtaining parallel data, e.g. labeled broken-fixed code pairs, because manual labeling is expensive. Existing works create synthetic paired data by applying heuristics to corrupt good code, but such data do not generalize to the errors found in real human-written code or text. How to obtain inexpensive yet realistic training data for repair tasks remains an open research problem.
Method: Break-It-Fix-It
To tackle this challenge, we develop a new approach, Break-It-Fix-It (BIFI), which bootstraps realistic paired data and trains the fixer without manual labeling. The BIFI algorithm takes three inputs:
- Critic: a tool that assesses the validity of input code, such as a compiler or code analyzer.
- Unlabeled data: raw code snippets from the web, e.g. GitHub. Using the critic, we split these into a set of good code and a set of bad code (see the sketch below).
- Initial fixer: for example, one trained on naive synthetic data.
Given these three inputs, BIFI outputs a better fixer.
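As a concrete illustration, here is a minimal sketch of how a compiler-style critic could split unlabeled Python snippets into good and bad sets. The compile-based critic and the example snippets are assumptions for illustration, not the exact implementation used in BIFI.

# Illustrative sketch: use Python's built-in compiler as the critic to split
# unlabeled code snippets into "good" (parses) and "bad" (syntax error).

def critic(code: str) -> bool:
    """Return True if the snippet compiles, i.e. has no syntax errors."""
    try:
        compile(code, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

# Hypothetical unlabeled data, e.g. scraped from GitHub.
unlabeled = [
    "def add(a, b):\n    return a + b\n",   # valid
    "def add(a, b)\n    return a + b\n",    # missing colon -> invalid
]

good_code = [c for c in unlabeled if critic(c)]
bad_code = [c for c in unlabeled if not critic(c)]
print(len(good_code), len(bad_code))  # 1 1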
Specifically, BIFI performs the four steps shown in the figure below to iteratively train the desired fixer, which maps bad code to good code, and a breaker, which maps good code to bad code, while using them in conjunction to generate more realistic training data. The pink annotations ("keep if fixed" and "keep if broken") indicate that we use the critic to select realistic paired data generated in this process.
The intuition is that a better fixer and breaker will generate more realistic paired data, which in turn helps train an even better fixer and breaker. BIFI achieves a significant performance boost over existing systems on code repair tasks such as fixing syntax errors in Python code and fixing compiler errors in C code.
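To make the loop concrete, below is a simplified sketch of one BIFI training round in Python-style pseudocode. The function and method names (critic, generate, train) are placeholders for illustration and abstract away model details such as the seq2seq architecture and decoding; this is not the exact implementation from the paper.

# Simplified sketch of one BIFI round. Assumes: critic(code) -> bool, and that
# fixer/breaker are seq2seq models exposing .generate(code) and .train(pairs).

def bifi_round(fixer, breaker, good_code, bad_code, critic):
    # Step 1: run the fixer on real bad code; keep outputs the critic accepts ("keep if fixed").
    fixed_pairs = []
    for bad in bad_code:
        candidate = fixer.generate(bad)
        if critic(candidate):
            fixed_pairs.append((bad, candidate))   # realistic (bad, good) pair

    # Step 2: train the breaker on these pairs in the reverse direction (good -> bad).
    breaker.train([(good, bad) for bad, good in fixed_pairs])

    # Step 3: run the breaker on real good code; keep outputs the critic rejects ("keep if broken").
    broken_pairs = []
    for good in good_code:
        candidate = breaker.generate(good)
        if not critic(candidate):
            broken_pairs.append((candidate, good))  # realistic (bad, good) pair

    # Step 4: retrain the fixer on the critic-verified pairs.
    fixer.train(fixed_pairs + broken_pairs)
    return fixer, breaker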
We refer interested readers to our paper and blog post for more details and results.
Advanced method: LM-Critic
When applying BIFI to fixing source code errors, we can use a compiler as a perfect critic. However, in domains other than code, such as natural language text, no such oracle critic exists. To generalize BIFI to fixing text errors (grammatical error correction), we develop LM-Critic, an approximate critic that uses pretrained language models to assess the validity of text input. A language model (LM) learns a probability distribution over sentences from text corpora. The idea of LM-Critic is to deem a sentence valid if the pretrained LM assigns it a higher probability than its local perturbations.
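As a rough illustration, the sketch below shows the core LM-Critic check under simplifying assumptions: the LM score is a GPT-2 loss-based proxy for sentence log-probability, and the perturbation function here simply drops one word at a time. The actual perturbation set and scoring details in the paper differ; this is only a sketch of the idea.

# Illustrative sketch of the LM-Critic idea: a sentence is deemed grammatical
# if a pretrained LM scores it at least as high as all of its local perturbations.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def score_lm(sentence: str) -> float:
    """Return a (negative total loss) proxy for the sentence log-probability."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item() * ids.size(1)  # approximate total log-probability

def perturb(sentence: str) -> list:
    """Hypothetical local perturbations: drop one word at a time."""
    words = sentence.split()
    return [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]

def lm_critic(sentence: str) -> bool:
    """Deem the sentence valid if it scores at least as high as every perturbation."""
    base = score_lm(sentence)
    return all(base >= score_lm(p) for p in perturb(sentence))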
We then apply BIFI to grammatical error correction, using LM-Critic as the critic.
In essence, using pretrained LMs, our method (BIFI + LM-Critic) can generate realistic paired data from unlabeled data alone. Our method achieves a performance boost over existing systems across various grammatical error correction benchmarks: CoNLL-2014, BEA-2019, GMEG-wiki, and GMEG-yahoo. We refer interested readers to our paper for more details and results.
Code
Our code is available on GitHub:
Datasets
The datasets used are included in the code repository.
Contributors
The following people contributed to this project:
Michihiro Yasunaga
Jure Leskovec
Percy Liang
References
Break-It-Fix-It: Unsupervised Learning for Program Repair. M. Yasunaga, P. Liang.
International Conference on Machine Learning (ICML), 2021.
LM-Critic: Language Models for Unsupervised Grammatical Error Correction. M. Yasunaga, J. Leskovec, P. Liang.
Empirical Methods in Natural Language Processing (EMNLP), 2021.