google/deplot · Difficulty in Reproducing Results on PlotQA Test Set

I attempted to reproduce the chart-to-table results on the PlotQA test set using the model checkpoints you provided. I followed the evaluation methodology described in your paper but could not match the reported performance: my results fall short of those in the paper by around 30%.

Based on manual inspection, the model outputs appear to have notable issues, particularly a high error rate in the table column headers (i.e., incorrect or missing axis/category labels).

Could you clarify how the results reported in the paper were computed? Specifically, does the evaluation metric consider only the accuracy of the numerical cells and exclude errors in structural elements such as headers and row/column labels?
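To make the question concrete, here is a minimal sketch of the two kinds of score being contrasted. It is not the metric from the paper. The function names, the 5% relative tolerance, and the assumption that the first row and first column of a table hold labels are all illustrative choices made for this example.

```python
from typing import List, Optional

Table = List[List[str]]  # row-major; row 0 and column 0 are assumed to hold labels


def _to_float(cell: str) -> Optional[float]:
    """Parse a cell as a number, returning None for non-numeric text."""
    try:
        return float(cell.replace(",", "").replace("%", "").strip())
    except ValueError:
        return None


def _cell(table: Table, r: int, c: int) -> Optional[str]:
    """Return the cell at (r, c), or None if the table has no such cell."""
    try:
        return table[r][c]
    except IndexError:
        return None


def numeric_cell_accuracy(pred: Table, gold: Table, rel_tol: float = 0.05) -> float:
    """Fraction of gold numeric body cells matched by pred at the same position.

    Headers and row labels are ignored entirely (hypothetical scoring mode).
    """
    total = hits = 0
    for r in range(1, len(gold)):
        for c in range(1, len(gold[r])):
            g = _to_float(gold[r][c])
            if g is None:
                continue
            total += 1
            raw = _cell(pred, r, c)
            p = _to_float(raw) if raw is not None else None
            if p is not None and abs(p - g) <= rel_tol * max(abs(g), 1e-9):
                hits += 1
    return hits / total if total else 0.0


def label_accuracy(pred: Table, gold: Table) -> float:
    """Fraction of gold header and row-label cells reproduced exactly
    (case-insensitive, whitespace-trimmed) at the same position."""
    if not gold:
        return 0.0
    positions = [(0, c) for c in range(len(gold[0]))]
    positions += [(r, 0) for r in range(1, len(gold)) if gold[r]]
    if not positions:
        return 0.0
    hits = 0
    for r, c in positions:
        p = _cell(pred, r, c)
        if p is not None and p.strip().lower() == gold[r][c].strip().lower():
            hits += 1
    return hits / len(positions)


# Made-up example: the numbers are close, but the column headers are wrong.
gold = [["Year", "Exports", "Imports"], ["2001", "10", "20"], ["2002", "30", "40"]]
pred = [["Year", "Series 1", "Series 2"], ["2001", "10.1", "20"], ["2002", "30", "41"]]

print(numeric_cell_accuracy(pred, gold))  # numbers-only view of the prediction
print(label_accuracy(pred, gold))         # header/label view of the same prediction
```

In this made-up example the prediction looks perfect under the numbers-only score while two of its five labels are wrong, which is the kind of discrepancy I am asking about.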

Thank you for your clarification.