Table of Contents

IneqMath Logo

Solving Inequality Proofs with Large Language Models

Jiayi Sheng*\(\textcolor[rgb]{0,0.196,0.384}{\boldsymbol{\beta}}\), Luna Lyu*\(\textcolor[rgb]{0.549,0.082,0.082}{\boldsymbol{\alpha}}\), Jikai Jin\(\textcolor[rgb]{0.549,0.082,0.082}{\boldsymbol{\alpha}}\), Tony Xia\(\textcolor[rgb]{0.549,0.082,0.082}{\boldsymbol{\alpha}}\), Alex Gu\(\textcolor[rgb]{0,0,0}{\boldsymbol{\gamma}}\), James Zou\(\dagger\textcolor[rgb]{0.549,0.082,0.082}{\boldsymbol{\alpha}}\), Pan Lu\(\dagger\textcolor[rgb]{0.549,0.082,0.082}{\boldsymbol{\alpha}}\)
\(\textcolor[rgb]{0.549,0.082,0.082}{\boldsymbol{\alpha}}\) Stanford University \(\textcolor[rgb]{0,0.196,0.384}{\boldsymbol{\beta}}\) UC Berkeley \(\textcolor[rgb]{0,0,0}{\boldsymbol{\gamma}}\) Massachusetts Institute of Technology
* Co-first authors \(\dagger\) Co-senior authors

Final-answer accuracy versus overall accuracy for leading LLMs across different categories on the IneqMath benchmark of Olympiad-level inequality problems. Overall accuracy, measuring both answer correctness and step soundness, is substantially lower than final-answer accuracy for all model types. This highlights a critical gap: while LLMs may find correct final answers to these inequality problems, their reasoning is often unsound. Each model used its optimal maximal tokens.

Introduction

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction.

Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws.

A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.

IneqMath Examples

Below are training and testing examples from IneqMath. Each problem belongs to one of two automatically checkable subtasks: bound estimation or relation prediction. Each training problem includes step-wise solutions, with up to four solutions per problem, and 76.8% (962 problems) are annotated with relevant theorems. The test problems are each crafted and reviewed by IMO-level medalists to ensure both originality and difficulty.

Fine-grained Informal Judges

Traditional evaluation methods fall short in this setting: expert annotation is accurate but prohibitively labor-intensive, while automated techniques such as string matching or value equivalence fail to capture step-by-step correctness—an essential aspect of inequality problem solving. To evaluate the correctness of IneqMath solutions, we propose a fine-grained LLM-as-judge framework, consisting of a final-answer judge for verifying the predicted answer and four specialized step-wise judges targeting common reasoning flaws. A solution is deemed correct overall only if it passes all five judges. As shown in the following table and confusion matrix, these judges achieve strong alignment with human annotations (F1 = 0.93), providing a scalable yet reliable alternative to manual evaluation.

Leaderboard on IneqMath

🏆 The interactive leaderboard for the IneqMath is available here.

Model Search

Model Size

Model Type

Model Source

Model Info Columns

Step Accuracy Columns

Rank Model Size Type Source Date Overall Acc Answer Acc Step Acc (NTC) Step Acc (NLG) Step Acc (NAE) Step Acc (NCE)
Loading leaderboard data...
Results are evaluated using our fine-grained LLM-as-judge framework. Overall accuracy requires passing all five specialized judges. The content in parentheses next to model's name represents reasoning effort and the max tokens respectively, with the default value for max tokens being 10K. Icons Explanation:
Type: 🧠 = Reasoning Model, 📝 = Chat Model, 🔧 = Tool-augmented Model
Source: 🔒 = Proprietary Model, 🌐 = Open-source Model

Step Accuracy Abbreviations:
NTC: No Toy Case - Step accuracy excluding using toy-case for general conclusions
NLG: No Logical Gap - Step accuracy without logical reasoning gaps
NAE: No Approximation Error - Step accuracy excluding approximation errors
NCE: No Calculation Error - Step accuracy excluding all calculation errors

🚨 You can submit your model's results through this link and receive an accuracy score.

🚨 You may also choose to add your score to the leaderboard or remove it at any time by clicking the "Process Query 🔍" button and managing your submission in the same link.

Dataset Overview

The IneqMath dataset comprises 200 test problems for benchmarking, 100 development problems with public ground truth, and 1,252 training problems split evenly between bound estimation and relation prediction tasks, as shown in the table below. The dataset also features 83 named theorems across 29 categories, with their distribution illustrated in the figure below.

The table below is a comparison of datasets for inequalities and theorem proving. IneqMath provides expert-annotated training and test/dev sets, featuring high-quality named theorems and step-wise solutions for model development. Unlike prior datasets using synthesis or autoformalization, IneqMath presents problems in informal language across multiple-choice (MC) and open-ended (Open) formats, and employs LLM-as-judge for evaluation.

Visualization

Data Visualization

Loading default problem (Training ID 0)...

In-depth Study

We conduct an in-depth analysis to understand the performance characteristics and limitations of current models on IneqMath. Our study encompasses three key areas: failure solution analysis to identify common error patterns in LLM-generated proofs, scaling laws in model size to examine how model capacity affects reasoning accuracy, and scaling laws in test-time computation to investigate the impact of extended reasoning chains. These analyses reveal critical insights into the challenges of inequality proving and provide guidance for improving mathematical reasoning capabilities.

Improvement Strategies

We explore four promising strategies to enhance LLM performance on IneqMath: retrieving relevant theorems as hints, self-improvement via critic as feedback, taking annotated theorems as hints, and retrieving training problems as demonstrations.

BibTeX

@article{jiayi2025solving,
    author = {Jiayi, Sheng and Luna, Lyu and Jikai, Jin and Tony, Xia and Alex, Gu and James, Zou and Pan, Lu},
    title = {Solving Inequality Proofs with Large Language Models},
    journal = {arXiv preprint arXiv:2506.07927},
    year = {2025}
}