13th International Conference on Artificial Intelligence & Applications (ARIA 2026)

August 22 ~ 23, 2026, Dubai, UAE

Accepted Papers


Introducing Stochastic Stability Index in AI-Generated Code Outputs

Jayanth Ramakrishnan and Parvendan Rangaswamy, Department of Computer Science and Engineering, SRM University-AP, Amaravati, Andhra Pradesh, India

ABSTRACT

Large Language Models (LLMs) have gotten really good at generating code, and we usually measure that with Pass@k. But Pass@k only tells you how often at least one of k attempts is correct, that is, how the model does on its best try. It says nothing about how much the outputs vary or whether they’re structurally consistent across multiple runs. That’s a big problem for real-world use, where you typically call the model just once. So we came up with the Stochastic Stability Index (SSI). It’s a combined metric that looks at three things: whether the code is syntactically valid (Msyn), how similar its structure is to a correct solution using the normalised Zhang–Shasha Tree Edit Distance on Abstract Syntax Trees (Mast), and whether it actually works when run in a safe sandbox (Mfun). SSI directly penalises variation from one sample to the next, giving you a single reliability score that reflects actual deployment risk, not just best-case performance. We tested SSI on HumanEval and MBPP using three top code generation models (Qwen, DeepSeek-Coder-v2, and Yi-Coder-9b) at three temperatures (T ∈ {0.2, 0.5, 0.8}), generating 20 independent samples per problem (that’s n = 96,840 programs total). What we found is a Creativity–Stability Paradox that shows up across all models and both benchmarks: Pass@10 goes up as you raise the temperature, but SSI goes down as outputs become more variable. For Qwen on HumanEval, Pass@10 jumps from 86.8% to 94.0%, while SSI falls from 0.6833 to 0.6563. A one-way ANOVA on structural similarity finds no significant effect of temperature in any model–benchmark combination (p > 0.05 in all cases), which shows that the drop comes almost entirely from functional inconsistency, not structural divergence. That tells us something important about how to design better stability metrics. The bottom line: if you really care about production-ready code generation, you need reliability-aware metrics alongside Pass@k.
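To make the gap between best-case and single-call behaviour concrete, here is a minimal Python sketch. It is not the authors' implementation and it does not compute SSI, whose exact formula is not given in this abstract. It only shows two ingredients: a syntactic validity rate in the spirit of Msyn (assumed here to be the fraction of samples that parse with Python's `ast` module) and the standard unbiased Pass@k estimator. The function names `syntactic_validity` and `pass_at_k` are hypothetical.

```python
import ast
from math import comb


def syntactic_validity(samples):
    """Fraction of code samples that parse as valid Python.

    Assumption: this is one plausible reading of a syntactic validity
    component such as Msyn; the paper's exact definition may differ.
    """
    def parses(source):
        try:
            ast.parse(source)
            return True
        except (SyntaxError, ValueError):
            return False

    return sum(parses(s) for s in samples) / len(samples)


def pass_at_k(n, c, k):
    """Standard unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: samples generated for a problem, c: samples that pass, k: budget.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 20 samples, only 5 of them pass the tests.
    n, c = 20, 5
    print(f"single-call success rate: {c / n:.3f}")
    print(f"Pass@10:                  {pass_at_k(n, c, 10):.3f}")

    # Two illustrative samples; the second has a syntax error.
    print(syntactic_validity(["x = 1", "def f(:"]))
```

With 5 passing samples out of 20, a single call succeeds only 25% of the time, yet Pass@10 comes out at about 0.984. That is the kind of best-case optimism a reliability-aware metric is meant to expose.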

Keywords

Large Language Models, Code Generation, Stochastic Stability Index, Abstract Syntax Trees, Pass@k, Inference Temperature, Benchmark Evaluation, Output Variance