Finetuners
If you're interested in how to use these components, you'll probably want to read the finetuning guide first.
FeedForwardTuner
Bases: BaseEstimator, TransformerMixin
Create a feed forward model to finetune the embeddings towards a class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hidden_dim | | The size of the hidden layer | 50 |
n_epochs | | The number of epochs to run the optimiser for | 500 |
learning_rate | | The learning rate of the feed forward model | 0.01 |
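Since the tuner follows the usual scikit-learn fit/transform contract, it can sit between a text encoder and a classifier. Below is a minimal sketch of that idea; the SentenceEncoder, the model name, and the toy data are illustrative assumptions rather than part of this class's documentation.

```python
# A hedged sketch: FeedForwardTuner between a text encoder and a classifier.
# SentenceEncoder, the model name, and the toy data are assumptions for illustration.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.text import SentenceEncoder
from embetter.finetune import FeedForwardTuner

texts = ["i am so happy", "this is terrible", "what a great day", "really awful service"]
labels = [1, 0, 1, 0]

pipe = make_pipeline(
    SentenceEncoder("all-MiniLM-L6-v2"),            # text -> embeddings
    FeedForwardTuner(hidden_dim=50, n_epochs=500),  # nudge embeddings towards the labels
    LogisticRegression(),
)
pipe.fit(texts, labels)
```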
Source code in embetter/finetune/_forward.py
ContrastiveTuner
Bases: BaseEstimator, TransformerMixin
Run a contrastive network to finetune the embeddings towards a class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hidden_dim | | The dimension of the new learned representation | 50 |
n_neg | | The number of negative example pairs to sample per positive item | 3 |
n_epochs | | The number of epochs to use for training | required |
learning_rate | | The learning rate of the contrastive network | 0.001 |
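As with the other tuners, this one exposes fit and transform. A rough sketch of standalone use on precomputed embeddings might look like the following; the random embeddings and labels are placeholders, and n_epochs is set arbitrarily because it has no default.

```python
# A minimal sketch, assuming the scikit-learn style fit/transform described above.
# The embeddings and labels are random placeholders.
import numpy as np
from embetter.finetune import ContrastiveTuner

X = np.random.normal(size=(100, 384))   # pretend these are precomputed text embeddings
y = np.random.randint(0, 2, size=100)   # class labels

tuner = ContrastiveTuner(hidden_dim=50, n_neg=3, n_epochs=20)
tuner.fit(X, y)
X_tuned = tuner.transform(X)            # new 50-dimensional representation
```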
Source code in embetter/finetune/_contrastive_tuner.py
ContrastiveLearner
A learner model that can finetune on pairs of data on top of numeric embeddings.
It's similar to the scikit-learn models that you're used to, but it accepts two inputs X1 and X2 and tries to predict if they are similar.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sent_tfm | | an instance of a | required |
batch_size | int | the batch size during training | 16 |
epochs | int | the number of epochs to use while training | 1 |
warmup_steps | | the number of warmup steps before training | required |
Usage:
from sentence_transformers import SentenceTransformer
from embetter.finetune import ContrastiveLearner
import random

sent_tfm = SentenceTransformer('all-MiniLM-L6-v2')
learner = ContrastiveLearner(sent_tfm)

def sample_generator(examples, n_neg=3):
    # A generator that assumes each example to be a dictionary of the shape
    # {"text": "some text", "cats": {"label_a": True, "label_b": False}}
    # this is typically a function that's very custom to your use-case though
    labels = set()
    for ex in examples:
        for cat in ex['cats'].keys():
            if cat not in labels:
                labels = labels.union([cat])
    for label in labels:
        pos_examples = [ex for ex in examples if label in ex['cats'] and ex['cats'][label] == 1]
        neg_examples = [ex for ex in examples if label in ex['cats'] and ex['cats'][label] == 0]
        for ex in pos_examples:
            sample = random.choice(pos_examples)
            yield (ex['text'], sample['text'], 1.0)
            for n in range(n_neg):
                sample = random.choice(neg_examples)
                yield (ex['text'], sample['text'], 0.0)

# `examples` is your own labelled dataset in the shape described above
learn_examples = sample_generator(examples, n_neg=3)
X1, X2, y = zip(*learn_examples)

# Learn a new representation
learner.fit(X1, X2, y)

# You now have an updated model that can create more "finetuned" embeddings
learner.transform(X1)
learner.transform(X2)
After a learner is done training, it can be used inside of a scikit-learn pipeline as you normally would.
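A rough sketch of that last step, where the fitted learner feeds a downstream classifier; the classifier choice and the texts/labels variables are placeholders, not part of the API above.

```python
# Hedged sketch: a fitted learner acts as a transformer in front of a classifier.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(learner, LogisticRegression())
pipe.fit(texts, labels)      # `texts` and `labels` are your own labelled data
preds = pipe.predict(texts)
```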
Source code in embetter/finetune/_constrastive_learn.py
SbertLearner
A learner model that can finetune on pairs of data that leverages SBERT under the hood.
It's similar to the scikit-learn models that you're used to, but it accepts two inputs X1 and X2 and tries to predict if they are similar.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sent_tfm | SentenceTransformer | an instance of a SentenceTransformer | required |
batch_size | int | the batch size during training | 16 |
epochs | int | the number of epochs to use while training | 1 |
warmup_steps | int | the number of warmup steps before training | 100 |
Usage:
from sentence_transformers import SentenceTransformer
from embetter.finetune import SbertLearner
import random

sent_tfm = SentenceTransformer('all-MiniLM-L6-v2')
learner = SbertLearner(sent_tfm)

def sample_generator(examples, n_neg=3):
    # A generator that assumes each example to be a dictionary of the shape
    # {"text": "some text", "cats": {"label_a": True, "label_b": False}}
    # this is typically a function that's very custom to your use-case though
    labels = set()
    for ex in examples:
        for cat in ex['cats'].keys():
            if cat not in labels:
                labels = labels.union([cat])
    for label in labels:
        pos_examples = [ex for ex in examples if label in ex['cats'] and ex['cats'][label] == 1]
        neg_examples = [ex for ex in examples if label in ex['cats'] and ex['cats'][label] == 0]
        for ex in pos_examples:
            sample = random.choice(pos_examples)
            yield (ex['text'], sample['text'], 1.0)
            for n in range(n_neg):
                sample = random.choice(neg_examples)
                yield (ex['text'], sample['text'], 0.0)

# `examples` is your own labelled dataset in the shape described above
learn_examples = sample_generator(examples, n_neg=3)
X1, X2, y = zip(*learn_examples)

# Learn a new representation
learner.fit(X1, X2, y)

# You now have an updated model that can create more "finetuned" embeddings
learner.transform(X1)
learner.transform(X2)
After a learner is done training, it can be used inside of a scikit-learn pipeline as you normally would.
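For instance, the fitted learner could be dropped in front of any scikit-learn estimator; this is a sketch only, and the estimator choice and the texts/labels names are assumptions.

```python
# Hedged sketch: the fitted SBERT learner used as a transformer inside a pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(learner, LogisticRegression())
pipe.fit(texts, labels)          # `texts`/`labels` are your own labelled data
pipe.predict(["some new text"])
```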
Source code in embetter/finetune/_sbert_learn.py