Everyone is familiar with the current hype around Large Language Models (LLMs) such as GPT-3 and image generation models such as DALL-E 2 and Stable Diffusion. However, the results of these models come at a price.
This is due to the large number of GPUs required to train and run these models. Beyond the cost, the amount of labelled data required to achieve these results is difficult to source. These two factors have long put models of this calibre out of reach for startups. That changes with Sentence Transformer Fine-tuning (SetFit), a simple and efficient alternative for few-shot text classification unveiled by the teams at Intel Labs, UKP Lab and Hugging Face.
Few-shot classification is an NLP task in which a model aims to classify text into a large number of categories, given only a few training examples per category.
Compared to other few-shot learning methods, SetFit has several unique features: it requires no prompts, it is fast and cheap to train, and it offers multilingual support, as discussed below.
Prompt engineering is when the description of the task is embedded in the input, e.g., as a question, instead of being given implicitly.
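To make the contrast concrete, here is a purely illustrative comparison (the review text and labels are made up) of a prompt-style input versus the plain labelled example SetFit works with:

```python
# Illustrative only: the same review fed to a prompt-based few-shot method
# versus to SetFit, which needs no prompt at all.

# Prompt-based methods embed the task description in the input itself.
prompted_input = (
    "Review: The battery died after two days.\n"
    "Question: Is the sentiment of this review positive or negative?\n"
    "Answer:"
)

# SetFit just needs the raw text plus a label.
setfit_example = {"text": "The battery died after two days.", "label": "negative"}

print(prompted_input)
print(setfit_example)
```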
To start, you have to understand what a sentence transformer is. It is a popular approach for text and image embeddings in which the model encodes an input into a vector representation that captures its semantic signature. The representation is built during contrastive training, a form of self-supervised learning that works with augmented views of the same input. For example, “What time is it?” is semantically the same as “How late is it?”. The aim of contrastive training is to minimise the distance between semantically similar sentences and maximise the distance between sentences that are semantically distant. The Model Hub on Hugging Face contains over 100 pre-trained sentence transformers trained on a variety of datasets.
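As a small sketch of this idea in practice, the snippet below uses the sentence-transformers library with the off-the-shelf all-MiniLM-L6-v2 checkpoint (an arbitrary choice for illustration) to encode sentences and compare their cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained sentence transformer from the Hugging Face Model Hub.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a paraphrase pair and an unrelated sentence into dense vectors.
sentences = ["What time is it?", "How late is it?", "The cat sat on the mat."]
embeddings = model.encode(sentences)

# Cosine similarity is high for the paraphrase pair and much lower otherwise.
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantically similar
print(util.cos_sim(embeddings[0], embeddings[2]))  # semantically distant
```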
The first step of SetFit is choosing a Sentence Transformer (ST) from the Model Hub. The ability to select any ST from the Model Hub is what gives SetFit multilingual support, since there are ST models for over 100 languages. SetFit then fine-tunes the Sentence Transformer on a small number of labelled examples using contrastive learning, where positive pairs are two sentences chosen randomly from the same class and negative pairs are two sentences chosen randomly from different classes. The result is an adapted ST.
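As a minimal sketch of the pair-construction idea (an illustration of the principle, not the library's actual implementation), the snippet below samples positive and negative pairs from a handful of labelled examples:

```python
import random

# A tiny labelled dataset of (text, label) examples.
examples = [
    ("Great battery life", "positive"),
    ("Works exactly as described", "positive"),
    ("Stopped working after a week", "negative"),
    ("Terrible build quality", "negative"),
]

def generate_pairs(examples, num_pairs=8, seed=0):
    """Sample sentence pairs: same class -> positive (1.0), different class -> negative (0.0)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        (text_a, label_a), (text_b, label_b) = rng.sample(examples, 2)
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs

for text_a, text_b, similarity in generate_pairs(examples):
    print(f"{similarity}: '{text_a}'  <->  '{text_b}'")
```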
The sentences in the training data are then encoded with the adapted ST, producing sentence embeddings. These embeddings are used to train a simple logistic regression model as the classification head. At inference time, new data is encoded using the adapted ST and classified with the trained logistic regression model.
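A rough sketch of this two-step pipeline is shown below, using scikit-learn's LogisticRegression as the head; for brevity the encoder is an off-the-shelf checkpoint rather than a contrastively fine-tuned one, and the texts are invented:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# In SetFit this would be the *adapted* ST; here an off-the-shelf one stands in.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = [
    "Great battery life", "Works exactly as described",
    "Stopped working after a week", "Terrible build quality",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Step 1: encode the training sentences into embeddings.
X_train = encoder.encode(train_texts)

# Step 2: fit a simple logistic regression head on those embeddings.
head = LogisticRegression()
head.fit(X_train, train_labels)

# Inference: encode new text with the same encoder, then classify.
X_test = encoder.encode(["The screen cracked on day one"])
print(head.predict(X_test))
```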
RAFT is a few-shot classification benchmark designed to match real-world scenarios by restricting the training data to 50 labelled examples per task and not providing validation sets.
The prominent characteristics of SetFit that impact its practicality are its small data requirement and its low training cost. Because it needs so little data, SetFit is extremely fast to train; Hugging Face reported the following statistics on training:
Training SetFit on an NVIDIA V100 with 8 labelled examples takes just 30 seconds, at a cost of $0.025. By comparison, training T-Few 3B on the same experiment requires an NVIDIA A100 and takes 11 minutes, at a cost of around $0.70, roughly 28 times more. In fact, SetFit can run on a single GPU like the ones found on Google Colab, and you can even train SetFit on a CPU in just a few minutes! As shown in the figure above, SetFit’s speed-up comes with comparable model performance.
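For reference, a full training run with the setfit library itself only takes a few lines. The sketch below follows the SetFitModel / SetFitTrainer API from an early release of the library, so argument names may differ in newer versions, and the tiny dataset is made up:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Eight labelled examples in total, split across two classes.
train_ds = Dataset.from_dict({
    "text": ["Great battery life", "Works exactly as described",
             "Loving it so far", "Best purchase this year",
             "Stopped working after a week", "Terrible build quality",
             "Support never replied", "Arrived broken"],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Start from any Sentence Transformer on the Hugging Face Hub.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # contrastive fine-tuning objective
    num_iterations=20,                # number of text pairs generated per example
    batch_size=16,
)
trainer.train()

# Classify new examples with the fine-tuned model.
print(model.predict(["The screen cracked on day one"]))
```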
SetFit is an extremely effective method for few-shot classification tasks, as shown by its evaluation on RAFT. It allows businesses with only a small amount of data to build powerful text classifiers at low cost, removing the two constraints that have often kept startups from implementing these types of solutions. If you want to learn how to fine-tune your own model, you can go through the short tutorial on Hugging Face here. Otherwise, get in touch with us and we will help you!