Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?
In this talk, I will present our evaluation of whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing.
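
To make the title's term concrete, here is a minimal sketch of one possible semantics-preserving mutation: consistent variable renaming. This abstract does not list which mutations the evaluation uses, so the choice of renaming is an assumption, and the names `RenameVariables`, `mutate`, and `run` are hypothetical helpers written for illustration. The sketch relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Hypothetical helper: rename identifiers using a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers both reads and writes of a variable.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return source text with identifiers renamed; behavior should not change."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


ORIGINAL = """
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

# Assumed mapping: replace meaningful names with uninformative ones.
MUTATED = mutate(ORIGINAL, {"values": "x1", "acc": "x2", "v": "x3"})


def run(source, func_name, *args):
    """Execute source in a fresh namespace and call the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)
```

Both versions compute the same result for any input, so a model that reasons about the program's behavior should predict the same output for each, whereas a model leaning on surface cues such as the name `acc` may not. The renamer is deliberately naive: it ignores scoping, builtins, and attribute names, so it is only safe on small self-contained snippets like the one shown.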
