Study of using LLMs for debugging verified programs
We have submitted a new article, "Do AI models help produce verified bug fixes?", for publication. It analyses how and when software engineers benefit (or not) from using LLMs when working with verified software.
The paper is authored by Li Huang, Ilgiz Mustafin, Marco Piccioni, Alessandro Schena, Reto Weber, and Bertrand Meyer. In addition, 25 software engineers contributed their time and effort as experiment participants. I thank all participants for their work!
We asked the participants to fix bugs in 9 programs. They were split into two groups: 10 participants were allowed to use AI during the experiment and 15 were not. Each participant had 2 hours for the tasks and had to record a video of their screen for further analysis. The resulting 50 hours of video recordings were manually analysed by the authors.
Participants used the AutoProof verifier to check the correctness of their fixes, in contrast with the usual practice of running tests or manually inspecting the code.
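To give a flavour of what this means, here is a minimal sketch (my own illustration, not one of the programs from the experiment) of an Eiffel routine with a contract. AutoProof accepts such a routine only if it can statically prove the postcondition from the body; a wrong "fix", such as always returning the first argument, would be rejected without running a single test.

```eiffel
class SIMPLE_MATH

feature

	max (a, b: INTEGER): INTEGER
			-- The larger of `a' and `b'.
		do
			if a >= b then
				Result := a
			else
				Result := b
			end
		ensure
			-- The verifier must prove these postconditions for every input.
			not_below_a: Result >= a
			not_below_b: Result >= b
			one_of_inputs: Result = a or Result = b
		end

end
```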
I think this paper highlights a promising approach: combining very informal vibe coding (vibe debugging here) with the use of formal methods (static verification).
Also, I was happy to introduce some of my friends to program verification tools in general and AutoProof in particular. Thanks again for taking part in the experiment!