An Empirical Study on the Suitability of Test-based Patch Acceptance Criteria

Authors

Zemín L., Godio A., Cornejo C., Degiovanni R., Gutiérrez Brida S., Regis G., Aguirre N., Frias M.F.

Reference

ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, article 57, 2025

Description

In this article, we empirically study the suitability of tests as acceptance criteria for automated program fixes, by checking patches produced by automated repair tools with a bug-finding tool, as opposed to previous works that relied on tests or manual inspection. We develop a number of experiments in which faulty programs from IntroClass, a well-known benchmark for program repair techniques, are fed to the program repair tools GenProg, Angelix, AutoFix, and Nopol, using test suites of varying quality, including those accompanying the benchmark. We then check the produced patches against formal specifications using a bug-finding tool. Our results show that, in the studied scenarios, automated program repair tools are significantly more likely to accept a spurious program fix than to produce an actual one. Using bounded-exhaustive suites larger than the originally provided ones (with about 100 and 1,000 tests), we verify that overfitting is reduced, but (a) few new correct repairs are produced, and (b) some tools perform worse on the larger suites, yielding fewer correct repairs. Finally, by comparing with previous work, we show that overfitting is underestimated in semantics-based tools, and that patches not discarded using held-out tests may be discarded using a bug-finding tool.
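The overfitting phenomenon studied here can be sketched in a few lines. The function, patch, and test cases below are hypothetical illustrations (not taken from the paper or the IntroClass benchmark): a spurious "patch" satisfies the small given test suite, while a bounded-exhaustive check against a formal specification exposes it as incorrect.

```python
def median_patched(a, b, c):
    # Spurious "patch": passes the two given tests below,
    # but does not compute the median in general.
    if a <= b:
        return b
    return a

# The small original test suite accepts the patch.
assert median_patched(1, 2, 3) == 2
assert median_patched(5, 5, 5) == 5

def median_spec(a, b, c):
    # Reference specification: the middle element of the three.
    return sorted([a, b, c])[1]

# A bounded-exhaustive check over a small domain finds counterexamples,
# e.g. (0, 1, 0): the patch returns 1, but the median is 0.
counterexamples = [
    (a, b, c)
    for a in range(3) for b in range(3) for c in range(3)
    if median_patched(a, b, c) != median_spec(a, b, c)
]
print(len(counterexamples) > 0)  # prints True: the patch overfits the tests
```

This mirrors the study's setup at a toy scale: a test-based acceptance criterion admits the patch, whereas exhaustive checking against the specification, within bounds, rejects it.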

Link

doi:10.1145/3702971
