Evaluation on STEM Benchmarks
To test Minerva’s quantitative reasoning abilities, we evaluated the model on STEM benchmarks ranging in difficulty from grade school problems to graduate level coursework.
- MATH: High school math competition level problems
- MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at the high school and college level.
- GSM8k: Grade school level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student.
We also evaluated Minerva on OCWCourses, a collection of college and graduate level problems that we collected from MIT OpenCourseWare, covering a variety of STEM topics such as solid state chemistry, astronomy, differential equations, and special relativity.
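
To give a concrete sense of this kind of benchmark evaluation, here is a minimal sketch (not Minerva's actual evaluation harness) of scoring a model on GSM8k by comparing a generated solution's final answer against the reference answer after the "####" delimiter. The `generate_fn` callable and the exact answer-extraction rule are assumptions for illustration.

```python
# Minimal sketch of final-answer accuracy on GSM8k (not the official harness).
from datasets import load_dataset


def extract_final_answer(text: str) -> str:
    # GSM8k reference solutions end with "#### <answer>"; model outputs are
    # assumed (hypothetically) to follow the same convention here.
    return text.split("####")[-1].strip().replace(",", "")


def evaluate(generate_fn, split="test", limit=100):
    """generate_fn is a hypothetical callable: question string -> solution string."""
    data = load_dataset("gsm8k", "main", split=split).select(range(limit))
    correct = 0
    for example in data:
        prediction = extract_final_answer(generate_fn(example["question"]))
        reference = extract_final_answer(example["answer"])
        correct += prediction == reference
    return correct / len(data)
```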
In all cases, Minerva obtains state-of-the-art results, sometimes by a wide margin.
Reference: https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html