InfoSynth: Information-Guided Benchmark Synthesis for LLMs

University of California, Berkeley

Abstract

Large language models (LLMs) have demonstrated significant advances in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often leak into LLM training data, and this contamination necessitates novel, diverse benchmarks to accurately assess genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions for new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity than their seed datasets. Moreover, the pipeline provides control over the novelty, diversity, and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel, and diverse benchmarks for LLMs.

Novelty and Diversity Metrics

We define benchmark novelty with a k-nearest-neighbor KL-divergence estimator that compares the distribution of newly generated problems to that of the seed set, and benchmark diversity with the Kozachenko–Leonenko differential entropy estimator, which captures global variety. Problems are embedded with sentence transformers and jointly projected with UMAP so that their relative geometry is preserved for a fair comparison. These metrics provide efficient, intuitive quality signals without requiring costly test-taker evaluations.
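For concreteness, the two estimators can be sketched as follows over the UMAP-projected embeddings. This is a minimal sketch assuming Euclidean distances, natural logarithms, and k = 3; the paper's exact constants and neighborhood sizes may differ.

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_kl_divergence(new, seed, k=3):
    """Novelty: k-NN KL-divergence estimate D(new || seed),
    (d/n) * sum_i log(nu_k(i) / rho_k(i)) + log(m / (n - 1)),
    where rho_k(i) is the distance from new problem i to its k-th nearest
    neighbor among the other new problems, and nu_k(i) is its k-th nearest
    neighbor distance in the seed set."""
    n, d = new.shape
    m = seed.shape[0]
    rho = cKDTree(new).query(new, k=k + 1)[0][:, -1]  # k+1 query skips the zero self-distance
    nu = cKDTree(seed).query(new, k=k)[0]
    nu = nu[:, -1] if k > 1 else nu
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

def kl_entropy(points, k=3):
    """Diversity: Kozachenko-Leonenko differential entropy estimate (k-th NN form),
    psi(n) - psi(k) + log(V_d) + (d/n) * sum_i log(r_k(i)),
    where r_k(i) is the k-th nearest neighbor distance of point i and
    V_d is the volume of the d-dimensional unit ball."""
    n, d = points.shape
    r = cKDTree(points).query(points, k=k + 1)[0][:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r))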

Problem Generation Process

Starting from seed datasets (e.g., MBPP, LeetCode), InfoSynth generates multiple variants via mutation (easy/medium/hard) and crossover, selects low-similarity candidates (k-farthest), and verifies each problem with auto-produced solutions and tests in a Python execution environment. Failures trigger iterative feedback with full error history until tests pass; passing problems are deduplicated and lightly post-processed for edge-case clarity.
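One plausible realization of the k-farthest selection step is a greedy max-min search over problem embeddings, keeping candidates that sit farthest from the seed set and from problems already kept. The sketch below assumes this greedy criterion and Euclidean distances; it is not the paper's exact implementation.

import numpy as np

def k_farthest(candidates, reference, k):
    # Greedily pick k candidate embeddings that are farthest from the
    # reference set (seed problems plus candidates already selected).
    cand = np.asarray(candidates, dtype=float)
    ref = np.asarray(reference, dtype=float)
    chosen, remaining = [], list(range(len(cand)))
    for _ in range(min(k, len(remaining))):
        # Distance from each remaining candidate to its nearest reference point.
        dists = np.linalg.norm(cand[remaining][:, None, :] - ref[None, :, :], axis=-1)
        nearest = dists.min(axis=1)
        pick = remaining[int(np.argmax(nearest))]  # the most isolated candidate
        chosen.append(pick)
        ref = np.vstack([ref, cand[pick][None, :]])
        remaining.remove(pick)
    return chosen  # indices into candidates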

Figure: InfoSynth generation pipeline. Mutation & crossover generation, k-farthest filtering, iterative code feedback, and MinHash+LSH deduplication.
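The deduplication step in the caption can be sketched with the datasketch library's MinHash and MinHashLSH classes. The whitespace tokenization and 0.9 Jaccard threshold below are illustrative assumptions, not the paper's settings.

from datasketch import MinHash, MinHashLSH

def dedupe(problems, threshold=0.9, num_perm=128):
    # Drop near-duplicate problem statements using MinHash signatures + LSH lookup.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, text in enumerate(problems):
        mh = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            mh.update(token.encode("utf-8"))
        if lsh.query(mh):  # a sufficiently similar problem was already kept
            continue
        lsh.insert(str(i), mh)
        kept.append(text)
    return kept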

Example Problems

Starting from the seed data, at each iteration we randomly apply either mutation or crossover to generate new coding instructions. Our mutation prompts ask the model to modify an existing problem at three difficulty levels (easier, equally difficult, and more difficult), which encourages diversity. Crossover prompts combine existing questions into new ones.
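A minimal sketch of how such prompts might be phrased follows; the wording and placeholders are hypothetical, not the paper's actual prompt text.

# Hypothetical prompt templates for the mutation and crossover operators.
DIFFICULTIES = ["easier", "equally difficult", "more difficult"]

MUTATION_PROMPT = (
    "Here is a Python coding problem:\n{seed}\n\n"
    "Rewrite it as a new, self-contained problem that is {difficulty} than the original. "
    "Return only the new problem statement."
)

CROSSOVER_PROMPT = (
    "Here are two Python coding problems:\n1. {seed_a}\n2. {seed_b}\n\n"
    "Combine their core ideas into one new, self-contained problem. "
    "Return only the new problem statement."
)

# Example usage: MUTATION_PROMPT.format(seed=seed_question, difficulty=DIFFICULTIES[2])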

Mutation Example:

Seed Question:
Write a Python function to find the sum of an array.

Hard Mutation Variant:
Write a Python function to find the sum of an array, where the array may contain nested lists of integers at any depth.
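For this hard variant, a verified solution-and-test pair might look like the following sketch (written by hand here for illustration, not taken from the pipeline's output).

def nested_sum(arr):
    # Sum all integers in a possibly deeply nested list.
    total = 0
    for item in arr:
        if isinstance(item, list):
            total += nested_sum(item)  # recurse into nested lists
        else:
            total += item
    return total

# Tests in the style the pipeline auto-generates and executes for verification.
assert nested_sum([1, 2, 3]) == 6
assert nested_sum([1, [2, [3, 4]], 5]) == 15
assert nested_sum([]) == 0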

Crossover Example:

Seed Questions:

  1. Write a function to rotate a given list by a specified number of items to the right direction.
  2. Write a function to find the maximum sum that can be formed which has no three consecutive elements present.

Crossover Variant:
Write a function to rotate a list by a specified number of steps to the right, ensuring that the sum of any three consecutive elements in the newly rotated list does not exceed a given threshold.
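An illustrative solution for this crossover variant, under one possible reading in which the function returns the rotated list when the three-element constraint holds and None otherwise; the generated problem's own specification and tests would pin down the behavior on violation.

def rotate_with_threshold(nums, steps, threshold):
    # Rotate nums to the right by `steps`, then check every window of three
    # consecutive elements against the threshold.
    if not nums:
        return []
    steps %= len(nums)
    rotated = nums[-steps:] + nums[:-steps] if steps else nums[:]
    for i in range(len(rotated) - 2):
        if rotated[i] + rotated[i + 1] + rotated[i + 2] > threshold:
            return None
    return rotated

assert rotate_with_threshold([1, 2, 3, 4], 1, 10) == [4, 1, 2, 3]
assert rotate_with_threshold([5, 5, 5], 0, 10) is None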

Experimental Results

InfoSynth produces benchmarks with ~97% problem correctness and ~99–100% test coverage. Across test-taker models, datasets synthesized with k-farthest filtering and post-processing show higher diversity while still yielding robust performance differences between models, enabling controllable difficulty and improved evaluation quality.

Figure: Test-taker performance across datasets and difficulty levels. Summary of model performance on InfoSynth benchmarks.

Controllable Difficulty: Hard mutations reduce model scores by 8–15% compared to original problems, demonstrating effective difficulty control. This comes with a tradeoff in diversity and novelty, as harder problems concentrate around fewer challenging topics. InfoSynth enables careful tuning of mutation difficulty to balance benchmark challenge and variety.

Iterative Code Feedback: The share of solution–test pairs that pass increases by 20% over five feedback iterations, with error rates dropping as the model fixes syntax and runtime issues. Three iterations provide the best balance between quality gains and inference cost. The feedback process acts as a form of chain-of-thought reasoning, leveraging the full error history to refine solutions and tests.
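A minimal sketch of such a feedback loop, assuming a placeholder llm callable that maps a prompt string to a code string; the real pipeline's prompts, sandboxed execution environment, and iteration budget may differ.

def verify_with_feedback(problem, llm, max_iters=3):
    # Iteratively ask the model to repair its solution/tests until every assert passes.
    history = []  # full error history fed back on each round
    code = llm(f"Write a Python solution with assert-based tests for:\n{problem}")
    for _ in range(max_iters):
        try:
            exec(code, {})  # run solution + tests in a scratch namespace
            return code     # all asserts passed: the problem is verified
        except Exception as err:
            history.append(f"{type(err).__name__}: {err}")
            code = llm(
                f"Problem:\n{problem}\n\nPrevious attempt:\n{code}\n\n"
                "Errors so far:\n" + "\n".join(history)
                + "\n\nReturn a corrected solution and tests."
            )
    return None  # could not verify within the iteration budget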

Novelty and Diversity Tradeoffs: The pipeline consistently produces datasets more novel and diverse than seed datasets. K-farthest-neighbor filtering further improves these metrics but generates easier problems, revealing a fundamental tradeoff: the generator struggles with difficult problems unless they align conceptually with seeds. Post-processing and filtering reduce novelty slightly, likely due to model memorization making out-of-distribution problems harder to solve. InfoSynth's modular design allows precise control over the novelty–diversity–difficulty balance.

Topic Composition

InfoSynth increases coverage across core coding topics (e.g., arrays, strings, hash tables, DP, greedy, math), producing more diverse and widely covering problems than seed datasets. This improves benchmark resilience and provides a broader evaluation of reasoning abilities.

Figure: Topic composition across datasets. Distribution of topics covered by InfoSynth-generated problems.

BibTeX

@article{garg2026infosynth,
  title={InfoSynth: Information-Guided Benchmark Synthesis for LLMs},
  author={Garg, Ishir and Kolhe, Neel and Zhao, Xuandong and Song, Dawn},
  journal={arXiv preprint xxx},
  year={2026}
}