Top model scores may be skewed by Git history leaks in SWE-bench
Link: github.com/SWE-bench/SWE-bench…
Discussion: news.ycombinator.com/item?id=4…
Repo State Loopholes During Agentic Evaluation
We've identified multiple loopholes with SWE Bench Verified where agents may look at future repository state (by querying it directly or through a variety of methods), and cases in which future rep...
jacobkahn (GitHub)