💡 Introduction 💡

LLMs have become the go-to choice for code generation tasks. However, prior benchmarks contain only a limited set of problems. Furthermore, due to their popularity and age, many benchmarks are prone to data leakage, where example solutions can be readily found on the web. Such limitations inevitably lead us to ask: is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis abilities of LLMs? To address this, we introduce EvoEval – a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities.

Evolving Existing Benchmarks

We transform and evolve existing coding benchmarks (e.g., HumanEval) into problems in targeted domains such as Difficult, Creative, Subtle, Combine, and Tool Use.
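
Conceptually, each evolution pairs a seed problem with a domain-specific transformation instruction and asks an LLM to produce a new problem. The sketch below is only an illustration of this idea, not the EvoEval implementation: the `evolve` helper, the `complete` callable, and the instruction wording are hypothetical placeholders.

```python
# Minimal sketch of the evolution idea (hypothetical helper names, not EvoEval's code).

# Seed problem in HumanEval style (this one is HumanEval/0).
SEED_PROBLEM = '''
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other than the threshold."""
'''

# Example transformation instructions, one per target domain.
EVOLUTION_INSTRUCTIONS = {
    "Difficult": "Add extra constraints and reasoning steps to make the problem harder.",
    "Creative": "Recast the problem as a novel, story-driven task with the same core logic.",
    "Subtle": "Make a small but behavior-changing tweak to the specification.",
    "Combine": "Merge this problem with another seed problem into a single task.",
    "Tool Use": "Wrap the problem so the solution must call provided helper functions.",
}


def evolve(seed: str, domain: str, complete) -> str:
    """Ask an LLM (via the user-supplied `complete` callable) to evolve a seed problem."""
    prompt = (
        f"{EVOLUTION_INSTRUCTIONS[domain]}\n\n"
        f"Original problem:\n{seed}\n\n"
        "Return only the new problem as a Python function signature with a docstring."
    )
    return complete(prompt)


# Usage (with any LLM wrapper of your choice):
#   new_problem = evolve(SEED_PROBLEM, "Difficult", complete=my_llm)
```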

📈 Result Overview 📉

Our study of 51 LLMs found that: