IneqMath: Solving Inequality Proofs with Large Language Models

Introduction

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction.

Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws.

A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement.

IneqMath Examples

Below are training and testing examples from IneqMath. Each problem belongs to one of two automatically checkable subtasks: bound estimation or relation prediction. Each training problem includes step-wise solutions, with up to four solutions per problem, and 76.8% (962 problems) are annotated with relevant theorems. The test problems are each crafted and reviewed by IMO-level medalists to ensure both originality and difficulty.

Fine-grained Informal Judges

Traditional evaluation methods fall short in this setting: expert annotation is accurate but prohibitively labor-intensive, while automated techniques such as string matching or value equivalence fail to capture step-by-step correctness—an essential aspect of inequality problem solving. To evaluate the correctness of IneqMath solutions, we propose a fine-grained LLM-as-judge framework, consisting of a final-answer judge for verifying the predicted answer and four specialized step-wise judges targeting common reasoning flaws. A solution is deemed correct overall only if it passes all five judges. As shown in the following table and confusion matrix, these judges achieve strong alignment with human annotations (F1 = 0.93), providing a scalable yet reliable alternative to manual evaluation.

Performance metrics of LLM-as-Judge framework on development set.

Confusion matrices for five judges, which exhibit strong agreement with human labels.

Confusion matrix for the judge baseline.

IneqMath Leaderboard

🏆 The evaluation platform for the IneqMath is available here.

Model Info Columns

Size Type Source Date

Step Accuracy Columns

Step Acc (NTC) Step Acc (NLG) Step Acc (NAE) Step Acc (NCE)

Rank	Model	Size	Type	Source	Date	Overall Acc	Answer Acc	Step Acc (NTC)	Step Acc (NLG)	Step Acc (NAE)	Step Acc (NCE)
Loading leaderboard data...

Results are evaluated using our fine-grained LLM-as-judge framework. Overall accuracy requires passing all five specialized judges. The content in parentheses next to model's name represents reasoning effort and the max tokens respectively, with the default value for max tokens being 10K. Icons Explanation:
Type: 🧠 = Reasoning Model, 📝 = Chat Model, 🔧 = Tool-augmented Model
Source: 🔒 = Proprietary Model, 🌐 = Open-source Model

Step Accuracy Abbreviations:
NTC: No Toy Case - Step accuracy excluding using toy-case for general conclusions
NLG: No Logical Gap - Step accuracy without logical reasoning gaps
NAE: No Approximation Error - Step accuracy excluding approximation errors
NCE: No Calculation Error - Step accuracy excluding all calculation errors

🚨 You can submit your model's results to the evaluation platform through this link and receive your own accuracy score.

Dataset Overview

The IneqMath dataset comprises 200 test problems for benchmarking, 100 development problems with public ground truth, and 1,252 training problems split evenly between bound estimation and relation prediction tasks, as shown in the table below. The dataset also features 83 named theorems across 29 categories, with their distribution illustrated in the figure below.

The table below is a comparison of datasets for inequalities and theorem proving. IneqMath provides expert-annotated training and test/dev sets, featuring high-quality named theorems and step-wise solutions for model development. Unlike prior datasets using synthesis or autoformalization, IneqMath presents problems in informal language across multiple-choice (MC) and open-ended (Open) formats, and employs LLM-as-judge for evaluation.

Visualization

Data Visualization

Loading default problem (Training ID 0)...

In-depth Study

We conduct an in-depth analysis to understand the performance characteristics and limitations of current models on IneqMath. Our study encompasses three key areas: failure solution analysis to identify common error patterns in LLM-generated proofs, scaling laws in model size to examine how model capacity affects reasoning accuracy, and scaling laws in test-time computation to investigate the impact of extended reasoning chains. These analyses reveal critical insights into the challenges of inequality proving and provide guidance for improving mathematical reasoning capabilities.

Failure solution analysis: As shown in the table above, the most common step-wise errors in LLM-generated solutions are logical gaps (85.0% average failure rate across models) and improper generalization from toy cases (59.7%). Less frequent, but still significant, are errors from numerical approximations (26.9%) and miscalculations (6.8%).

Scaling laws in model size: This figure shows how final-answer accuracy (which evaluates only the correctness of the final predicted answer) scales with model size for LLMs. As model size increases, we observe a steady improvement in answer accuracy, reflecting an empirical scaling law: larger models are better at inferring correct bounds and inequality relationships.

Scaling laws in model size: However, the trend for answer accuracy does not hold when considering overall accuracy—which requires both a correct answer and valid intermediate reasoning steps—as shown in the figure above. In this case, the scaling curve flattens, indicating that increased model size alone is insufficient to eliminate step-by-step reasoning errors.

Scaling laws in test-time computation: The figure above shows that while models like Gemini 2.5 Pro and o3 initially improve with more tokens, performance gains saturate (e.g., beyond 20K tokens). This indicates that merely increasing computational budget offers diminishing returns for achieving rigorous, step-wise correct proofs, highlighting the need for more than just longer thought processes.

Improvement Strategies

We explore four promising strategies to enhance LLM performance on IneqMath: retrieving relevant theorems as hints, self-improvement via critic as feedback, taking annotated theorems as hints, and retrieving training problems as demonstrations.

Retrieving relevant theorems as hints: As shown in the figure, providing one or two such theorems decreases overall accuracy for weaker models (e.g., Grok 3 mini, o3-mini, o4-mini), likely due to misapplication or distraction by potentially irrelevant information. Conversely, stronger models like Gemini 2.5 Pro benefit from these hints, suggesting advanced reasoning is crucial to effectively use such guidance. These results underscore the potential of theorem-guided reasoning but also highlight the critical need for more sophisticated theorem retrieval mechanisms (e.g., RAG) to reliably enhance LLM performance in inequality proving.

Self-improvement via critic as feedback: As the figure shows, self-critique consistently improves performance—e.g., Gemini 2.5 Pro's overall accuracy rises from 43% to 48%. This upward trend underscores self-critique as a promising, supervision-free method to enhance the logical rigor and solution quality of LLMs in inequality reasoning.

Taking annotated theorems as hints: The figure reveals a consistent uplift in overall accuracy across models, with gains reaching up to 11% (e.g., for o3-mini).

Retrieving training problems as demonstrations: As shown by the overall accuracy in the figure, Grok 3 mini's performance improves from a baseline of 10% (with no demonstration problem) to 13% when provided with one such problem. However, its accuracy drops sharply to 3% when two problems are used as demonstrations. Similarly, Gemini 2.5 Pro peaks at 53% accuracy with one demonstration problem, declining to 45% with two. o4-mini reaches 23% accuracy with one demonstration problem, a 3% increase from its 20% baseline (without demonstrations).

BibTeX

@article{sheng2025solving,
    title={Solving Inequality Proofs with Large Language Models},
    author={Sheng, Jiayi and Lyu, Luna and Jin, Jikai and Xia, Tony and Gu, Alex and Zou, James and Lu, Pan},
    journal={arXiv preprint arXiv:2506.07927},
    year={2025}
}

Table of Contents

Solving Inequality Proofs with Large Language Models