Towards Assessing and Repairing LLM-Generated Code via Model Checking and MaxSAT-Based Fault Localisation

May 5, 2026
Pedro Orvalho
Image credit: Schloss Dagstuhl
Abstract
Large Language Models (LLMs) are increasingly used for programming tasks, yet their ability to truly understand program semantics remains unclear. In this talk, we investigate the robustness and reliability of LLMs in reasoning about code. We first present an empirical study evaluating whether LLMs’ output predictions are grounded in sound reasoning or arise from superficial pattern matching. By introducing a set of semantics-preserving code mutations—such as variable renaming, control-flow transformations, and loop modifications—we show that even state-of-the-art models frequently change their predictions, revealing limited robustness to syntactic variations. Motivated by these findings, we explore how to improve the correctness of LLM-generated code by integrating formal methods. We present an approach that combines model checking with MaxSAT-based fault localisation to guide LLMs toward more accurate program repairs. Our approach leverages counterexamples and minimal diagnoses to identify faulty code regions, which are then used to refine LLM prompts. Finally, we introduce a novel approach for verifying Python programs by leveraging LLM-based transpilation to C and bounded model checking. This hybrid pipeline enables precise fault localisation and verification of non-trivial Python programs, bridging the gap between the flexibility of LLMs and the rigour of formal verification tools. Overall, this work highlights fundamental limitations in current code models and demonstrates how neurosymbolic approaches can enhance their reliability in Software Engineering tasks.
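
To make the mutation study concrete, here is a minimal Python sketch of two of the semantics-preserving transformations mentioned above: variable renaming and a for-to-while loop rewrite. The function and test inputs are illustrative assumptions, not taken from the study's benchmark.

```python
# A minimal sketch (not the study's tooling) of two semantics-preserving
# mutations: variable renaming and a for-to-while loop rewrite.

def sum_even(numbers):
    """Original program: sums the even elements of a list."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n
    return total

def sum_even_mutated(xs):
    """Same semantics after renaming variables and rewriting the loop."""
    acc = 0
    i = 0
    while i < len(xs):
        if xs[i] % 2 == 0:
            acc += xs[i]
        i += 1
    return acc

# Both variants agree on every input, so an LLM asked to predict their
# output should answer identically for each; the study measures how often
# model predictions diverge across such pairs.
for test in ([1, 2, 3, 4], [], [7, 7], list(range(10))):
    assert sum_even(test) == sum_even_mutated(test)
print("mutations are semantics-preserving on the sample inputs")
```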
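The MaxSAT-based fault-localisation step can be illustrated with a toy encoding, sketched below using the `python-sat` package. The clause-level encoding here is deliberately simplified and is not the encoding used in the talk: it only captures, per counterexample, which statements cannot all be correct at once.

```python
# A toy MaxSAT formulation of fault localisation (assumes the `python-sat`
# package; an illustrative encoding, not the talk's actual one).
# Boolean variable i means "statement i is healthy". Each counterexample
# from the model checker yields a hard clause saying the statements it
# exercises cannot all be healthy; soft unit clauses prefer blaming as few
# statements as possible, so the falsified softs form a minimal diagnosis.
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

NUM_STMTS = 3
wcnf = WCNF()

# Hypothetical counterexamples: the first exercises statements 1-3,
# the second exercises statements 2-3.
wcnf.append([-1, -2, -3])   # hard: not all of 1, 2, 3 are healthy
wcnf.append([-2, -3])       # hard: not both 2 and 3 are healthy

# Soft clauses: assume each statement is healthy, at unit cost to retract.
for s in range(1, NUM_STMTS + 1):
    wcnf.append([s], weight=1)

with RC2(wcnf) as solver:
    model = solver.compute()
    suspects = [s for s in range(1, NUM_STMTS + 1) if -s in model]
    print(f"minimal diagnosis (cost {solver.cost}): statements {suspects}")
    # These suspect statements are what would be highlighted in the
    # refined repair prompt handed back to the LLM.
```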
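Finally, a rough sketch of the hybrid Python-verification pipeline: an LLM transpiles the Python program to C, and a bounded model checker (CBMC here, as an assumed backend) checks the result within a loop-unwinding bound. The helper names are placeholders, and the talk's actual pipeline and solver options may differ.

```python
# Sketch of the hybrid pipeline: LLM transpilation to C, then bounded
# model checking with CBMC. `transpile_with_llm` is a placeholder for
# whatever model/API is used; `--unwind` and `--trace` are standard
# CBMC options for setting the bound and printing counterexample traces.
import subprocess
import tempfile

def transpile_with_llm(python_src: str) -> str:
    """Placeholder: prompt an LLM to produce semantically equivalent C."""
    raise NotImplementedError("plug in your model/API of choice")

def bounded_check(c_src: str, unwind: int = 10) -> tuple[bool, str]:
    """Run CBMC on the transpiled C; return (verified?, solver output)."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False) as f:
        f.write(c_src)
        path = f.name
    result = subprocess.run(
        ["cbmc", path, "--unwind", str(unwind), "--trace"],
        capture_output=True, text=True,
    )
    # CBMC exits 0 when all assertions hold within the bound; a nonzero
    # exit plus a trace gives the counterexample that drives the
    # MaxSAT-based fault localisation above.
    return result.returncode == 0, result.stdout

# ok, report = bounded_check(transpile_with_llm(python_program))
```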
Date
May 5, 2026 4:30 PM – 5:00 PM
Location

Dagstuhl, Wadern, Saarland, Germany