💡 Introduction 💡
LLMs have become the go-to choice for code generation tasks. However, prior benchmarks contain only a very limited set of problems. Further, due to their popularity and age, many benchmarks are prone to data leakage, where example solutions can be readily found on the web. Such limitations inevitably lead us to ask: is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis abilities of LLMs? To address this, we introduce EvoEval – a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities.
Evolving Existing Benchmarks
We transform and evolve existing coding benchmarks (e.g., HumanEval) into problems in targeted domains such as Difficult, Creative, Subtle, Combine, and Tool Use.
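As a rough illustration of this evolution step, the sketch below prompts an LLM to rewrite a seed problem into a target domain. The prompt wording, model name, and helper function here are assumptions for illustration only, not the exact templates or pipeline used by EvoEval.

```python
# Illustrative sketch only: the prompt template and model name are placeholders,
# not the actual ones used by EvoEval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVOLVE_PROMPT = """You are given a seed programming problem:

{seed_problem}

Rewrite it into a new, self-contained problem in the "{domain}" style
(e.g., more difficult, more creative, or requiring the use of provided
auxiliary functions), keeping the same function-signature format."""

def evolve_problem(seed_problem: str, domain: str) -> str:
    """Ask an LLM to transform a seed problem into the target domain."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{
            "role": "user",
            "content": EVOLVE_PROMPT.format(seed_problem=seed_problem,
                                            domain=domain),
        }],
    )
    return response.choices[0].message.content

# Example: evolve a HumanEval-style seed prompt into the "Difficult" domain
seed = ('def has_close_elements(numbers, threshold):\n'
        '    """Check if any two numbers are closer than threshold."""')
print(evolve_problem(seed, "Difficult"))
```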
📈 Result Overview 📉
Our study of 51 LLMs found that:
- Compared with the high performance obtained on standard benchmarks like HumanEval, popular LLMs drop significantly in performance when evaluated on EvoEval (by 39.6% on average; see the sketch after this list for how such a drop can be computed).
- This drop is not uniform across LLMs and ranges from 20.0% to 47.7%, leading to drastic ranking changes among top-performing models.
- Certain LLMs cannot maintain the high performance they achieve on HumanEval when evaluated on more challenging problems or problems from different domains, highlighting the possibility of overfitting to existing benchmarks.
- Moreover, while instruction-following LLMs perform well on self-contained problems, they struggle with tool use, i.e., making use of auxiliary functions that are already provided.
- Additionally, current state-of-the-art LLMs fail to effectively compose multiple general coding concepts to solve more complex variants, or to address subproblems decomposed from previously solved difficult problems.
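For concreteness, here is a minimal sketch of how such a performance drop can be computed, assuming it is measured as the relative decrease in pass@1 from HumanEval to EvoEval; the scores used below are made-up placeholders, not results from the study.

```python
# Minimal sketch, assuming the reported drop is the relative decrease in
# pass@1 from HumanEval to EvoEval (scores below are made-up placeholders).
def relative_drop(humaneval_pass1: float, evoeval_pass1: float) -> float:
    """Relative performance drop, as a percentage."""
    return 100.0 * (humaneval_pass1 - evoeval_pass1) / humaneval_pass1

# e.g., a model scoring 80.0 pass@1 on HumanEval and 48.0 on EvoEval
print(f"{relative_drop(80.0, 48.0):.1f}%")  # -> 40.0%
```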