- DVQA: Understanding Data Visualizations via Question Answering Bar charts are an effective way to convey numeric information, but today's algorithms cannot parse them. Existing methods fail when faced with even minor variations in appearance. Here, we present DVQA, a dataset that tests many aspects of bar chart understanding in a question answering framework. Unlike visual question answering (VQA), DVQA requires processing words and answers that are unique to a particular bar chart. State-of-the-art VQA algorithms perform poorly on DVQA, and we propose two strong baselines that perform considerably better. Our work will enable algorithms to automatically extract numeric and semantic information from vast quantities of bar charts found in scientific publications, Internet articles, business reports, and many other areas. 4 authors · Jan 24, 2018
- DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. However, they are prone to hallucinations, generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are small and rely on multiple-choice questions, which are inadequate for evaluating the generative prowess of LLMs. To measure hallucination in LLMs, this paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance and a hidden segment for benchmarking various LLMs. In our experiments, we tested six LLMs-GPT-3.5, LLama 2, LLama 3, Gemini, Mixtral, and Zephyr-revealing that overall factual hallucination ranges from 59% to 82% on the public dataset and 57% to 76% in the hidden benchmark. Prompt misalignment hallucination ranges from 6% to 95% in the public dataset and 17% to 94% in the hidden counterpart. Average consistency ranges from 21% to 61% and 22% to 63%, respectively. Domain-wise analysis shows that LLM performance significantly deteriorates when asked for specific numeric information while performing moderately with person, location, and date queries. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for LLM performance evaluation. Our dataset and LLMs responses are available at https://github.com/ashikiut/DefAn{https://github.com/ashikiut/DefAn}. 4 authors · Jun 13, 2024
- Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods Saliency maps can explain a neural model's predictions by identifying important input features. They are difficult to interpret for laypeople, especially for instances with many features. In order to make them more accessible, we formalize the underexplored task of translating saliency maps into natural language and compare methods that address two key challenges of this approach -- what and how to verbalize. In both automatic and human evaluation setups, using token-level attributions from text classification tasks, we compare two novel methods (search-based and instruction-based verbalizations) against conventional feature importance representations (heatmap visualizations and extractive rationales), measuring simulatability, faithfulness, helpfulness and ease of understanding. Instructing GPT-3.5 to generate saliency map verbalizations yields plausible explanations which include associations, abstractive summarization and commonsense reasoning, achieving by far the highest human ratings, but they are not faithfully capturing numeric information and are inconsistent in their interpretation of the task. In comparison, our search-based, model-free verbalization approach efficiently completes templated verbalizations, is faithful by design, but falls short in helpfulness and simulatability. Our results suggest that saliency map verbalization makes feature attribution explanations more comprehensible and less cognitively challenging to humans than conventional representations. 6 authors · Oct 13, 2022
- Improving Information Extraction on Business Documents with Specific Pre-Training Tasks Transformer-based Language Models are widely used in Natural Language Processing related tasks. Thanks to their pre-training, they have been successfully adapted to Information Extraction in business documents. However, most pre-training tasks proposed in the literature for business documents are too generic and not sufficient to learn more complex structures. In this paper, we use LayoutLM, a language model pre-trained on a collection of business documents, and introduce two new pre-training tasks that further improve its capacity to extract relevant information. The first is aimed at better understanding the complex layout of documents, and the second focuses on numeric values and their order of magnitude. These tasks force the model to learn better-contextualized representations of the scanned documents. We further introduce a new post-processing algorithm to decode BIESO tags in Information Extraction that performs better with complex entities. Our method significantly improves extraction performance on both public (from 93.88 to 95.50 F1 score) and private (from 84.35 to 84.84 F1 score) datasets composed of expense receipts, invoices, and purchase orders. 4 authors · Sep 11, 2023
12 Stylecodes: Encoding Stylistic Information For Image Generation Diffusion models excel in image generation, but controlling them remains a challenge. We focus on the problem of style-conditioned image generation. Although example images work, they are cumbersome: srefs (style-reference codes) from MidJourney solve this issue by expressing a specific image style in a short numeric code. These have seen widespread adoption throughout social media due to both their ease of sharing and the fact they allow using an image for style control, without having to post the source images themselves. However, users are not able to generate srefs from their own images, nor is the underlying training procedure public. We propose StyleCodes: an open-source and open-research style encoder architecture and training procedure to express image style as a 20-symbol base64 code. Our experiments show that our encoding results in minimal loss in quality compared to traditional image-to-style techniques. 1 authors · Nov 19, 2024 2
26 Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. We discover Temporal Heads, specific attention heads primarily responsible for processing temporal knowledge through circuit analysis. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model's ability to recall time-specific knowledge while maintaining its general capabilities without compromising time-invariant and question-answering performances. Moreover, the heads are activated not only numeric conditions ("In 2004") but also textual aliases ("In the year ..."), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads. 5 authors · Feb 19 2
1 The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of the embedding space when answering logical comparison questions (e.g., Was Cristiano born before Messi?). We first identified these subspaces using partial least squares regression, which effectively encodes the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality by intervening in these subspaces to manipulate hidden states, thereby altering the LLM's comparison outcomes. Experimental results show that our findings hold for different numerical attributes, indicating that LLMs utilize the linearly encoded information for numerical reasoning. 5 authors · Oct 16, 2024
- Classification of Geological Borehole Descriptions Using a Domain Adapted Large Language Model Geological borehole descriptions contain detailed textual information about the composition of the subsurface. However, their unstructured format presents significant challenges for extracting relevant features into a structured format. This paper introduces GEOBERTje: a domain adapted large language model trained on geological borehole descriptions from Flanders (Belgium) in the Dutch language. This model effectively extracts relevant information from the borehole descriptions and represents it into a numeric vector space. Showcasing just one potential application of GEOBERTje, we finetune a classifier model on a limited number of manually labeled observations. This classifier categorizes borehole descriptions into a main, second and third lithology class. We show that our classifier outperforms both a rule-based approach and GPT-4 of OpenAI. This study exemplifies how domain adapted large language models enhance the efficiency and accuracy of extracting information from complex, unstructured geological descriptions. This offers new opportunities for geological analysis and modeling using vast amounts of data. 3 authors · Jun 24, 2024
- Do NLP Models Know Numbers? Probing Numeracy in Embeddings The ability to understand and work with numbers (numeracy) is critical for many complex reasoning tasks. Currently, most NLP models treat numbers in text in the same way as other tokens---they embed them as distributed vectors. Is this enough to capture numeracy? We begin by investigating the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset. We find this model excels on questions that require numerical reasoning, i.e., it already captures numeracy. To understand how this capability emerges, we probe token embedding methods (e.g., BERT, GloVe) on synthetic list maximum, number decoding, and addition tasks. A surprising degree of numeracy is naturally present in standard embeddings. For example, GloVe and word2vec accurately encode magnitude for numbers up to 1,000. Furthermore, character-level embeddings are even more precise---ELMo captures numeracy the best for all pre-trained methods---but BERT, which uses sub-word units, is less exact. 5 authors · Sep 17, 2019
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly-available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge. 5 authors · Nov 4, 2021
- FiNCAT: Financial Numeral Claim Analysis Tool While making investment decisions by reading financial documents, investors need to differentiate between in-claim and outof-claim numerals. In this paper, we present a tool which does it automatically. It extracts context embeddings of the numerals using one of the transformer based pre-trained language model called BERT. After this, it uses a Logistic Regression based model to detect whether the numerals is in-claim or out-of-claim. We use FinNum-3 (English) dataset to train our model. After conducting rigorous experiments we achieve a Macro F1 score of 0.8223 on the validation set. We have open-sourced this tool and it can be accessed from https://github.com/sohomghosh/FiNCAT_Financial_Numeral_Claim_Analysis_Tool 2 authors · Jan 26, 2022
- NumHTML: Numeric-Oriented Hierarchical Transformer Model for Multi-task Financial Forecasting Financial forecasting has been an important and active area of machine learning research because of the challenges it presents and the potential rewards that even minor improvements in prediction accuracy or forecasting may entail. Traditionally, financial forecasting has heavily relied on quantitative indicators and metrics derived from structured financial statements. Earnings conference call data, including text and audio, is an important source of unstructured data that has been used for various prediction tasks using deep earning and related approaches. However, current deep learning-based methods are limited in the way that they deal with numeric data; numbers are typically treated as plain-text tokens without taking advantage of their underlying numeric structure. This paper describes a numeric-oriented hierarchical transformer model to predict stock returns, and financial risk using multi-modal aligned earnings calls data by taking advantage of the different categories of numbers (monetary, temporal, percentages etc.) and their magnitude. We present the results of a comprehensive evaluation of NumHTML against several state-of-the-art baselines using a real-world publicly available dataset. The results indicate that NumHTML significantly outperforms the current state-of-the-art across a variety of evaluation metrics and that it has the potential to offer significant financial gains in a practical trading context. 5 authors · Jan 5, 2022
1 Symlink: A New Dataset for Scientific Symbol-Description Linking Mathematical symbols and descriptions appear in various forms across document section boundaries without explicit markup. In this paper, we present a new large-scale dataset that emphasizes extracting symbols and descriptions in scientific documents. Symlink annotates scientific papers of 5 different domains (i.e., computer science, biology, physics, mathematics, and economics). Our experiments on Symlink demonstrate the challenges of the symbol-description linking task for existing models and call for further research effort in this area. We will publicly release Symlink to facilitate future research. 4 authors · Apr 26, 2022
- Long Text and Multi-Table Summarization: Dataset and Method Automatic document summarization aims to produce a concise summary covering the input document's salient information. Within a report document, the salient information can be scattered in the textual and non-textual content. However, existing document summarization datasets and methods usually focus on the text and filter out the non-textual content. Missing tabular data can limit produced summaries' informativeness, especially when summaries require covering quantitative descriptions of critical metrics in tables. Existing datasets and methods cannot meet the requirements of summarizing long text and multiple tables in each report. To deal with the scarcity of available data, we propose FINDSum, the first large-scale dataset for long text and multi-table summarization. Built on 21,125 annual reports from 3,794 companies, it has two subsets for summarizing each company's results of operations and liquidity. To summarize the long text and dozens of tables in each report, we present three types of summarization methods. Besides, we propose a set of evaluation metrics to assess the usage of numerical information in produced summaries. Dataset analyses and experimental results indicate the importance of jointly considering input textual and tabular data when summarizing report documents. 4 authors · Feb 7, 2023
1 FiNER: Financial Numeric Entity Recognition for XBRL Tagging Publicly traded companies are required to submit periodic reports with eXtensive Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT's performance, allowing word-level BILSTMs to perform better. To improve BERT's performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging. 7 authors · Mar 12, 2022
1 A Survey of Quantization Methods for Efficient Neural Network Inference As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area. 6 authors · Mar 25, 2021
1 Dataset and Baseline System for Multi-lingual Extraction and Normalization of Temporal and Numerical Expressions Temporal and numerical expression understanding is of great importance in many downstream Natural Language Processing (NLP) and Information Retrieval (IR) tasks. However, much previous work covers only a few sub-types and focuses only on entity extraction, which severely limits the usability of identified mentions. In order for such entities to be useful in downstream scenarios, coverage and granularity of sub-types are important; and, even more so, providing resolution into concrete values that can be manipulated. Furthermore, most previous work addresses only a handful of languages. Here we describe a multi-lingual evaluation dataset - NTX - covering diverse temporal and numerical expressions across 14 languages and covering extraction, normalization, and resolution. Along with the dataset we provide a robust rule-based system as a strong baseline for comparisons against other models to be evaluated in this dataset. Data and code are available at https://aka.ms/NTX. 3 authors · Mar 31, 2023
1 MathWriting: A Dataset For Handwritten Mathematical Expression Recognition We introduce MathWriting, the largest online handwritten mathematical expression dataset to date. It consists of 230k human-written samples and an additional 400k synthetic ones. MathWriting can also be used for offline HME recognition and is larger than all existing offline HME datasets like IM2LATEX-100K. We introduce a benchmark based on MathWriting data in order to advance research on both online and offline HME recognition. 3 authors · Apr 16, 2024
- NaturalProofs: Mathematical Theorem Proving in Natural Language Understanding and creating mathematics using natural mathematical language - the mixture of symbolic and natural language used by humans - is a challenging and important problem for driving progress in machine learning. As a step in this direction, we develop NaturalProofs, a multi-domain corpus of mathematical statements and their proofs, written in natural mathematical language. NaturalProofs unifies broad coverage, deep coverage, and low-resource mathematical sources, allowing for evaluating both in-distribution and zero-shot generalization. Using NaturalProofs, we benchmark strong neural methods on mathematical reference retrieval and generation tasks which test a system's ability to determine key results that appear in a proof. Large-scale sequence models show promise compared to classical information retrieval methods, yet their performance and out-of-domain generalization leave substantial room for improvement. NaturalProofs opens many avenues for research on challenging mathematical tasks. 6 authors · Mar 23, 2021
5 A Comparative Study on Automatic Coding of Medical Letters with Explainability This study aims to explore the implementation of Natural Language Processing (NLP) and machine learning (ML) techniques to automate the coding of medical letters with visualised explainability and light-weighted local computer settings. Currently in clinical settings, coding is a manual process that involves assigning codes to each condition, procedure, and medication in a patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There are preliminary research on automatic coding in this field using state-of-the-art ML models; however, due to the complexity and size of the models, the real-world deployment is not achieved. To further facilitate the possibility of automatic coding practice, we explore some solutions in a local computer setting; in addition, we explore the function of explainability for transparency of AI models. We used the publicly available MIMIC-III database and the HAN/HLAN network models for ICD code prediction purposes. We also experimented with the mapping between ICD and SNOMED CT knowledge bases. In our experiments, the models provided useful information for 97.98\% of codes. The result of this investigation can shed some light on implementing automatic clinical coding in practice, such as in hospital settings, on the local computers used by clinicians , project page https://github.com/Glenj01/Medical-Coding. 4 authors · Jul 18, 2024 2
- Arbitrary Length Generalization for Addition This paper introduces a novel training methodology that enables a small Transformer model to generalize the addition of two-digit numbers to numbers with unseen lengths of digits. The proposed approach employs an autoregressive generation technique, processing from right to left, which mimics a common manual method for adding large numbers. To the best of my knowledge, this methodology has not been previously explored in the literature. All results are reproducible, and the corresponding R code is available at: https://github.com/AGPatriota/ALGA-R/. 1 authors · May 30, 2024 1
- Open data for Moroccan license plates for OCR applications : data collection, labeling, and model construction Significant number of researches have been developed recently around intelligent system for traffic management, especially, OCR based license plate recognition, as it is considered as a main step for any automatic traffic management system. Good quality data sets are increasingly needed and produced by the research community to improve the performance of those algorithms. Furthermore, a special need of data is noted for countries having special characters on their licence plates, like Morocco, where Arabic Alphabet is used. In this work, we present a labeled open data set of circulation plates taken in Morocco, for different type of vehicles, namely cars, trucks and motorcycles. This data was collected manually and consists of 705 unique and different images. Furthermore this data was labeled for plate segmentation and for matriculation number OCR. Also, As we show in this paper, the data can be enriched using data augmentation techniques to create training sets with few thousands of images for different machine leaning and AI applications. We present and compare a set of models built on this data. Also, we publish this data as an open access data to encourage innovation and applications in the field of OCR and image processing for traffic control and other applications for transportation and heterogeneous vehicle management. 4 authors · Apr 16, 2021
1 Foundations of Vector Retrieval Vectors are universal mathematical objects that can represent text, images, speech, or a mix of these data modalities. That happens regardless of whether data is represented by hand-crafted features or learnt embeddings. Collect a large enough quantity of such vectors and the question of retrieval becomes urgently relevant: Finding vectors that are more similar to a query vector. This monograph is concerned with the question above and covers fundamental concepts along with advanced data structures and algorithms for vector retrieval. In doing so, it recaps this fascinating topic and lowers barriers of entry into this rich area of research. 1 authors · Jan 17, 2024