Tackling Question Answerability with Few-Shot Classifiers

When working with embeddings, question answerability (whether a given database holds enough information to answer a user’s question) is not guaranteed by the two techniques we already knew: prompt engineering and embedding thresholds. That’s why we dove into this challenge, eager to improve the accuracy of our results.

Understanding the challenge

Before we began our research, we knew about two techniques to improve question answering.

Prompt Engineering: A technique to craft efficient prompts or instructions for a model, guiding AI systems to understand context and generate pertinent responses.

Embedding Threshold: Setting a minimum similarity score between embeddings of the question and potential answers to ensure the semantic alignment of responses.

However, these techniques alone are not enough to handle our challenge. With prompt engineering, the model may still hallucinate or ignore the instructions entirely; with an embedding threshold, choosing the right cutoff is far from straightforward.
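To make the embedding threshold concrete, here is a minimal sketch in Python. The vectors are toy stand-ins for real embeddings, and the function names and the 0.75 cutoff are assumptions chosen for illustration, not values from our experiments.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def filter_by_threshold(question_vec, passages, threshold=0.75):
    """Keep passages whose similarity to the question reaches the threshold.

    `passages` is a list of (text, vector) pairs. The result is a list of
    (text, score) pairs sorted from most to least similar.
    """
    scored = [(text, cosine_similarity(question_vec, vec)) for text, vec in passages]
    kept = [(text, score) for text, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (hypothetical values).
question = [1.0, 0.0, 0.0]
passages = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("office hours", [0.0, 1.0, 0.0]),
]

matches = filter_by_threshold(question, passages, threshold=0.75)
# If nothing clears the threshold, the question is treated as unanswerable.
answerable = len(matches) > 0
```

The difficulty mentioned above shows up in the `threshold` argument: a value that is too low lets unrelated passages through, while a value that is too high rejects questions the content could actually answer.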

Exploring the Few-shot Classifier Technique

We can apply few-shot classifiers trained specifically to determine whether a question is "on-topic" or "off-topic" relative to the provided content. We hypothesized that integrating this with our initial techniques could enhance both the workflow and the quality of the language model's responses.
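One simple way to picture a few-shot classifier is a nearest-centroid scheme: average a handful of labeled example embeddings per class, then assign a new question to the class whose centroid is most similar. The sketch below assumes that approach; it is not necessarily how the models discussed here work, and the vectors, labels, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_prototypes(examples):
    """Build one prototype (centroid) per label from a few labeled vectors."""
    return {label: centroid(vectors) for label, vectors in examples.items()}


def classify(vector, prototypes):
    """Return the label whose prototype is most similar to the vector."""
    return max(prototypes, key=lambda label: cosine(vector, prototypes[label]))


# Two toy examples per class, standing in for embeddings of
# (question, content) pairs. Values are made up for illustration.
few_shot_examples = {
    "on-topic": [[1.0, 0.1], [0.9, 0.2]],
    "off-topic": [[0.1, 1.0], [0.2, 0.8]],
}

prototypes = fit_prototypes(few_shot_examples)
label_a = classify([0.8, 0.3], prototypes)
label_b = classify([0.1, 0.9], prototypes)
```

In a pipeline, a question labeled "off-topic" could be rejected before the language model is ever called, which is where the workflow gain would come from.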

Interestingly, while our research found state-of-the-art models to classify questions based on context, there weren't many specialized for this purpose. Yet, some researchers had adapted datasets like SQuAD2 to train models for the challenge.

One standout model we tested was longformer-large-4096-answerable-squad2. It’s trained exclusively for English, and since we were working with content in Spanish, we had to translate that content into English first. Still, the results were promising.

Given this, and given that the model was released in 2020 and its architecture is now relatively dated, it's worth considering the benefits of training our own model, perhaps even one with multilingual capabilities.

More challenges ahead

One big challenge with few-shot classifiers is fine-tuning or training a model good enough to outperform prompt engineering alone. Beyond that, building a classifier that determines question answerability efficiently means facing these challenges:

Model Architecture Decision: Based on our experiments with Longformer, we believe it might not be necessary to use an LLM; a simple transformer model could be enough. Finding a highly optimized architecture that ensures low-latency responses is therefore vital.

Dataset Adaptation and Creation: Identifying suitable datasets for training or fine-tuning this model is crucial. While we have the SQuAD dataset for English, it might need adaptation for our specific problem. The real challenge lies in obtaining a quality dataset for Spanish. It's likely that we'll need to create our own dataset.
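As a rough idea of what adapting SQuAD could look like, the sketch below flattens data in the SQuAD 2.0 JSON layout into (question, context, answerable) rows, using the `is_impossible` flag that SQuAD 2.0 attaches to each question. The function name, the output row format, and the tiny sample are assumptions for illustration; a real run would load the official JSON file instead.

```python
def squad2_to_answerability(squad):
    """Flatten SQuAD 2.0-style JSON into rows for a binary answerability task."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                rows.append(
                    {
                        "question": qa["question"],
                        "context": context,
                        # A missing flag is treated as answerable, which
                        # matches SQuAD 1.1 files that have no such field.
                        "answerable": not qa.get("is_impossible", False),
                    }
                )
    return rows


# Tiny hand-written sample in the SQuAD 2.0 layout (not real dataset content).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The library opens at 9 am on weekdays.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "When does the library open on weekdays?",
                            "answers": [{"text": "9 am", "answer_start": 21}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "Who founded the library?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}

rows = squad2_to_answerability(sample)
```

A Spanish dataset would need the same shape: every context paired with both answerable and unanswerable questions, so the classifier sees each class.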

Conclusions

Question answerability is still an open challenge. We believe using a few-shot classifier can improve performance, so we're focusing on creating a better model and might even develop a Spanish dataset. We think a multilingual model would be valuable for the whole open-source AI community.