Question Answering on Scientific Research Papers
It’s hard for researchers to keep up with the ever-increasing number of papers getting published every day.
Is there some way we can make the consumption of scientific content more efficient?
We believe there is.
Imagine the typical process involved in reading a research paper. You first read the title and the abstract, and have a pretty good idea of what the paper is about. You might have some follow-up questions about the ideas introduced in the abstract that you then try to answer by reading the rest of the paper. Let’s look at an example: the title and the abstract of a paper that appeared (and won the best paper award) at NAACL 2004.
Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization
We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for learning content models from un-annotated documents, utilizing a novel adaptation of algorithms for Hidden Markov Models. We then apply our method to two complementary tasks: information ordering and extractive summarization. Our experiments show that incorporating content models in these applications yields substantial improvement over previously-proposed methods.
Some questions you might have after reading this abstract are:
- What domains did they experiment with?
- How do they adapt algorithms for Hidden Markov Models?
- How do they define the information ordering task?
- What previously proposed methods do they compare against?
- Did they experiment with languages other than English?
To answer these questions, you have to scroll through the paper, locate and read the sections that seem relevant, and infer the answers.
What if instead there were an automated system that could reliably point you to the parts of the paper that you’re looking for, or even answer your questions directly? We believe such a question-answering (QA) system could help researchers who are just looking to quickly get targeted pieces of information out of a research paper, and it is with this practical use case in mind that we introduce a new QA task, Question Answering on Scientific Research Papers (or Qasper for short).
You can read about our definition of the task, our new dataset, and the baseline models we built for the task in our NAACL 2021 paper. We discuss some key ideas from the paper below.
A New Dataset:
To build a Qasper system, we need appropriate data to train the underlying model. The NLP community has built many QA datasets in the past decade, but the existing datasets focus on tasks that are not quite like what we are trying to do.
Reading Comprehension tasks like SQuAD, HotpotQA, and DROP focus on verifying whether a model can process specific types of information presented in a document. Hence, these datasets contain questions written by crowd-workers who already knew the answers to the questions they were writing.
Models built for such datasets are not expected to work well for Qasper since the follow-up questions you might have after reading an abstract are information-seeking in nature.
Information-Seeking Question Answering tasks like Natural Questions, TyDiQA, and BioASQ come with datasets containing questions asked by real people who did not know the answers to those questions. However, they were not asked in the context of specific documents and were later linked to potentially relevant documents. We believe that having a good understanding of what is in a document will prompt readers to ask questions that are more strongly grounded in documents, like the ones shown above.
Since existing datasets are not directly applicable to our task, we built a new QA dataset.
We hired graduate students and NLP practitioners for two separate tasks:
Writing Questions: We showed the annotators only the titles and abstracts of papers and asked them to write questions that the abstracts do not answer but that the full papers are expected to answer. To limit the scope of the task to specific questions, we encouraged them to write questions that they thought could be answered in a sentence or two.
Providing Answers and Evidence: To obtain answers, we showed the annotators entire papers and questions. For each question, we asked them to highlight a minimal set of paragraphs, figures, or tables that provide all the information required to answer it. This serves as the evidence for answering the question. After selecting the evidence, they were asked to provide a concise answer that is either a span in the paper, a written-out phrase, “Yes”, or “No”. Questions may also turn out not to be answerable from the paper, in which case we asked the annotators to mark them as such.
By separating the two tasks, we ensured that the question writers did not know the answers to the questions, and thus ensured that the data we collected is more realistic. We also paid the data providers an hourly rate, and not a per-question rate, incentivizing quality over quantity. The process resulted in 5,049 questions over 1,585 NLP papers, with about 44% of the questions having multiple annotations.
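To make the annotation scheme above concrete, here is a minimal sketch of how one question and its annotations could be represented. The class and field names (`AnswerType`, `Annotation`, `Question`) are hypothetical and chosen for illustration; they are not the schema of the released dataset, and the example values are placeholders rather than real annotations.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AnswerType(Enum):
    """The four kinds of answers annotators could give."""
    EXTRACTIVE = "extractive"      # a span selected in the paper
    ABSTRACTIVE = "abstractive"    # a written-out phrase
    BOOLEAN = "boolean"            # "Yes" or "No"
    UNANSWERABLE = "unanswerable"  # the paper does not answer the question


@dataclass
class Annotation:
    """One annotator's answer plus the evidence that supports it."""
    answer_type: AnswerType
    answer_text: str  # empty for unanswerable questions
    # Minimal set of paragraphs, figures, or tables needed to answer.
    evidence: List[str] = field(default_factory=list)


@dataclass
class Question:
    """A question written from the title and abstract only."""
    paper_title: str
    question: str
    annotations: List[Annotation] = field(default_factory=list)

    def has_multiple_annotations(self) -> bool:
        return len(self.annotations) > 1


# Placeholder values only: these are NOT real annotations from the dataset.
example = Question(
    paper_title="Catching the Drift: Probabilistic Content Models, "
                "with Applications to Generation and Summarization",
    question="What domains did they experiment with?",
    annotations=[
        Annotation(AnswerType.EXTRACTIVE, "<span copied from the paper>",
                   ["<text of a supporting paragraph>"]),
        Annotation(AnswerType.ABSTRACTIVE, "<written-out answer>",
                   ["<text of a supporting paragraph>", "<a table>"]),
    ],
)
print(example.has_multiple_annotations())
```

Keeping the answer and its evidence together in one record mirrors the way annotators first selected evidence and then wrote the answer.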
Building a Model:
Qasper requires processing entire documents. For this reason, we chose the Longformer model to encode the long contexts. The dataset has different types of answers: extractive (when the annotators select spans in the paper), abstractive (when the answers are written out), boolean (yes/no), and null (when the questions are not answerable). To handle all these types, we used Longformer in the Encoder-Decoder setting, which is trained to encode entire papers and directly generate answers. For example, for questions that are unanswerable, the model is trained to generate the string “Unanswerable”.
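Because a single generative model handles every answer type, each annotation has to be turned into a target string for training. The sketch below shows one plausible way to do that. The function name `answer_to_target` and its arguments are hypothetical; only the “Unanswerable”, “Yes”, and “No” strings come from the description above.

```python
from typing import Optional


def answer_to_target(answer_type: str, text: str = "",
                     yes: Optional[bool] = None) -> str:
    """Map an annotated answer to the string the decoder should generate.

    `answer_type` is one of "extractive", "abstractive", "boolean",
    or "unanswerable" (hypothetical labels for the four answer types).
    """
    if answer_type == "unanswerable":
        return "Unanswerable"
    if answer_type == "boolean":
        if yes is None:
            raise ValueError("boolean answers need a yes/no value")
        return "Yes" if yes else "No"
    if answer_type in ("extractive", "abstractive"):
        cleaned = text.strip()
        if not cleaned:
            raise ValueError("extractive/abstractive answers need text")
        return cleaned
    raise ValueError(f"unknown answer type: {answer_type!r}")


print(answer_to_target("unanswerable"))       # Unanswerable
print(answer_to_target("boolean", yes=False))  # No
```

Treating all four types as text generation avoids separate output heads for span selection, classification, and answerability.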
In addition to generating the answers, the model is also trained to select a set of paragraphs as the evidence. For this sub-task, we simply train the model to make a binary decision for each paragraph in the paper: should it be included in the evidence or not? We train the model to minimize the answer and evidence losses jointly. Note that the dataset contains questions for which the evidence includes figures and/or tables (12% of the data). The model cannot handle multiple modalities, and hence we ignore evidence that is not text.
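The joint objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: the function names are hypothetical, the answer loss is taken to be the mean negative log-likelihood of the gold answer tokens, the evidence loss is taken to be the mean binary cross-entropy over paragraphs, and the `evidence_weight` parameter is an assumption (the text only says the two losses are minimized jointly).

```python
import math
from typing import List, Sequence

EPS = 1e-12  # keeps log() away from zero


def evidence_labels(paragraphs: Sequence[str],
                    evidence: Sequence[str]) -> List[int]:
    """1 if a paragraph is in the gold evidence, else 0.

    Evidence items that are not paragraphs of the paper (figures,
    tables) never match, so they are ignored.
    """
    gold = set(evidence)
    return [1 if p in gold else 0 for p in paragraphs]


def answer_loss(gold_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the gold answer tokens."""
    return -sum(math.log(max(p, EPS)) for p in gold_token_probs) / len(gold_token_probs)


def evidence_loss(paragraph_probs: Sequence[float],
                  labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over all paragraphs in the paper."""
    total = 0.0
    for p, y in zip(paragraph_probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def joint_loss(gold_token_probs: Sequence[float],
               paragraph_probs: Sequence[float],
               labels: Sequence[int],
               evidence_weight: float = 1.0) -> float:
    """Sum of the two losses; the weight is an assumed knob."""
    return (answer_loss(gold_token_probs)
            + evidence_weight * evidence_loss(paragraph_probs, labels))


paragraphs = ["p0", "p1", "p2"]
labels = evidence_labels(paragraphs, ["p1", "FIGURE 2"])  # figure is dropped
print(labels)  # [0, 1, 0]
print(joint_loss([0.5, 0.5], [0.5, 0.5, 0.5], labels))
```

In a real model the probabilities would come from the decoder's softmax and from a per-paragraph classifier on top of the encoder, and the losses would be computed on tensors rather than Python lists.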
You can read our paper for more details on our experiments and results.
Here are the high-level results:
The Longformer Encoder-Decoder models generally do significantly worse than humans: about 27 F1 points lower at generating answers, and 32 F1 points lower at selecting evidence. When models are trained to answer questions given gold evidence, they do significantly better, gaining up to 24 F1 points on extractive and abstractive answers. This indicates that the difficulty mostly lies in selecting appropriate evidence. Manual error analysis shows that the two most common error classes are the model incorrectly predicting that a question is unanswerable, and generated answers having the wrong entity type, indicating that the model (unsurprisingly) lacks domain knowledge.
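For readers unfamiliar with F1 as a QA metric, the sketch below shows one common way such scores are computed: token overlap for answers and paragraph overlap for evidence. It assumes a simplified normalization (lowercasing and whitespace splitting) and assumes that, when a question has several annotations, the best-matching reference is used. The function names are hypothetical, and this is not the official evaluation script.

```python
from collections import Counter
from typing import Iterable, Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evidence_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 over sets of selected paragraph indices."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_answer_f1(prediction: str, references: Sequence[str]) -> float:
    """Score against every reference answer and keep the best (assumed)."""
    return max(token_f1(prediction, r) for r in references)


print(token_f1("a b", "b c"))                 # 0.5
print(evidence_f1([0, 1], [1, 2]))            # 0.5
print(best_answer_f1("yes", ["No", "Yes"]))   # 1.0
```

Under this kind of metric, a gap of 27 points means the model's answers share far fewer tokens with the reference answers than a second human annotator's answers do.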
You can test the model yourself by playing with our demo.
Now that we have a dataset, we are closer to having a usable QA system over research papers, but we are not fully there yet. If you test the model using our demo, you will see that it does well on some common questions like “What datasets do they use?”, but not on questions that require a deeper understanding of the papers. One such example is the question “How do they adapt algorithms for Hidden Markov Models?” on the paper we cited at the beginning of this article. You can try that question on the paper in the demo.
So what needs to be done to make a usable Qasper system? Here are some potential directions.
Domain-specific pre-training: The Longformer model we used is based on BART, which was pre-trained on web text and publicly available books. These sources are very different from research papers. It is possible that pre-training on research papers, or at least adapting a pre-trained model to the research paper domain, can significantly improve the model’s end task performance.
More task-specific data: Our current Qasper dataset is smaller than general-domain document-level QA datasets. Training on more data is expected to result in better models. Moreover, the dataset currently includes only NLP papers. While models trained on this set may perform well on related fields like machine learning, they are not expected to be directly applicable to, say, biomedicine. Hence, we need more data, including data from other fields of research. However, collecting such data would be very expensive, since we need domain experts to provide it and cannot rely on crowd-workers. We therefore also need more efficient data collection methods.
The models still have a lot of room for improvement, as indicated by the large gap between model and human performance. We invite the NLP community to work on this task.
Allen Institute for AI
Towards Data Science Blog: https://medium.com/ai2-blog/question-answering-on-scientific-research-papers-f6d6da9fd55c
Allen Institute for AI: https://allenai.org/