I was working on a project that tries to detect topics in academic articles found on arXiv. One of the topics I was interested in was "new datasets". If an article presents a new dataset, there's usually something interesting happening, so I wanted to build a classifier for it.
You could build a classifier on the entire text, and that could work, but it takes a lot of effort to annotate because you'd need to read the entire abstract. It's probably hard for an algorithm too: it has to figure out which part of the abstract is important for the topic of interest, and there's a lot of text that might matter.
But what if we choose to approach the problem slightly differently?
Maybe it makes sense to split the text into sentences and run a classifier on each one of those. This might not be perfect for every scenario out there. But it seems like a valid starting point to help you annotate and get started.
If you have sentence-level predictions, you can reuse those to make abstract-level predictions.
And the library is set up in such a way that you can add as many labels as you like. We're even able to do some clever fine-tuning tricks internally.
Note: the finetuning bit is still a work in progress in the library.
This approach isn't state of the art, and there's probably a whole bunch of things we can improve. But it does seem like a pragmatic, and understandable, starting point for a lot of text categorisation projects.
Quickstart
This project is all about making text classification models by predicting properties on a sentence level first.
```python
from sentence_models import SentenceModel
from sklearn.feature_extraction.text import HashingVectorizer

# Learn a new sentence-model using a stateless encoder.
encoder = HashingVectorizer()
smod = SentenceModel(encoder=encoder).learn_from_disk("annotations.jsonl")

# Make a prediction
example = "In this paper we introduce a new dataset for citrus fruit detection. We also contribute a state of the art algorithm."
smod(example)
```
The model makes a prediction for each sentence, so that you can build downstream rules on top of the output. Here's what the predictions might look like, depending on the labels in the annotations.jsonl file.
```python
{
    'text': 'In this paper we introduce a new dataset for citrus fruit detection. We also contribute a state of the art algorithm.',
    'sentences': [
        {
            'sentence': 'In this paper we introduce a new dataset for citrus fruit detection.',
            'cats': {"new-dataset": 0.8654, "llm": 0.212, "benchmark": 0.321}
        },
        {
            'sentence': 'We also contribute a state of the art algorithm.',
            'cats': {"new-dataset": 0.398, "llm": 0.431, "benchmark": 0.967}
        },
    ]
}
```
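As a sketch of such a downstream rule, here is one way to turn the sentence-level scores into an abstract-level prediction. The `flagged` helper below is hypothetical and not part of the library; it simply flags an abstract when any sentence crosses a threshold, reusing `smod` and `example` from the snippet above.

```python
def flagged(prediction: dict, label: str, threshold: float = 0.8) -> bool:
    """Flag an abstract if any sentence scores above `threshold` for `label`."""
    return any(sent["cats"][label] >= threshold for sent in prediction["sentences"])

# With the prediction shown above, this would flag the abstract as
# introducing a new dataset, because the first sentence scores 0.8654.
flagged(smod(example), "new-dataset")
```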
Learning from data
The `SentenceModel` can learn from a `.jsonl` file directly, but it assumes a specific structure when learning. Internally it runs the following Pydantic model to ensure the data is in the right format:
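This is the same `Example` model that appears in the `learn` docstring further down; the imports are added here to make the snippet self-contained.

```python
from typing import Dict
from pydantic import BaseModel

class Example(BaseModel):
    text: str
    target: Dict[str, bool]
```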
{"text":"In this paper we introduce a new dataset for citrus fruit detection","target":{"new-dataset":True,"llm":False,"benchmark":False}}
It is preferable to have `text` values that represent a single sentence. It's not required, but the library will assume sentences when it makes a prediction.
Note that you don't need to have all labels available in each example. That's a feature! When you're annotating, it's typically a lot simpler to annotate one label at a time, and it's perfectly fine to have examples that don't contain all the labels you're interested in.
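For instance, both of the following (made-up) lines would be valid training examples, even though the second one only annotates the "llm" label:

```python
{"text": "We introduce a new dataset for citrus fruit detection", "target": {"new-dataset": True, "llm": False}}
{"text": "We prompt a large language model to generate summaries", "target": {"llm": True}}
```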
Embedding models
You might prefer to use pretrained embedding models as an encoder in this setup. For this, you can use the Embetter library, which supports many embedding techniques and makes sure that they all adhere to the scikit-learn API. If you want to use the popular sentence-transformers library, you can use the following snippet.
```python
from sentence_models import SentenceModel
from embetter.text import SentenceEncoder

# Learn a new sentence-model using a pre-trained encoder from sentence-transformers.
smod = SentenceModel(encoder=SentenceEncoder())
```
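If you want a specific sentence-transformers model, `SentenceEncoder` also accepts a model name, per the embetter docs. The model below is one popular, lightweight choice; swap in whichever suits your task.

```python
from sentence_models import SentenceModel
from embetter.text import SentenceEncoder

# Use a specific sentence-transformers model by name.
smod = SentenceModel(encoder=SentenceEncoder("all-MiniLM-L6-v2"))
```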
API
SentenceModel
This is the main object that you'll interact with.
SentenceModel
This object represents a model that can apply predictions per sentence.
````python
class SentenceModel:
    """
    **SentenceModel**

    This object represents a model that can apply predictions per sentence.

    **Usage:**

    ```python
    from sentence_models import SentenceModel

    smod = SentenceModel()
    ```

    You can customise some of the settings if you like, but it comes with sensible defaults.

    ```python
    from sentence_models import SentenceModel
    from embetter.text import SentenceEncoder
    from sklearn.linear_model import LogisticRegression

    smod = SentenceModel(
        encoder=SentenceEncoder(),
        clf_head=LogisticRegression(class_weight="balanced"),
        spacy_model="en_core_web_sm",
        verbose=False
    )
    ```
    """

    def __init__(self,
                 encoder: TransformerMixin,
                 clf_head: ClassifierMixin = LogisticRegression(class_weight="balanced"),
                 spacy_model: str = "en_core_web_sm",
                 verbose: bool = False,
                 finetuner=None):
        self.encoder = encoder
        self.clf_head = clf_head
        self.spacy_model = (
            spacy_model
            if isinstance(spacy_model, Language)
            else spacy.load(spacy_model, disable=["ner", "lemmatizer", "tagger"])
        )
        self.classifiers = {}
        self.verbose = verbose
        self.finetuner = finetuner
        self.log("SentenceModel initialized.")

    def log(self, msg: str) -> None:
        if self.verbose:
            console.log(msg)

    # TODO: add support for finetuners
    # def _generate_finetune_dataset(self, examples):
    #     if self.verbose:
    #         console.log("Generating pairs for finetuning.")
    #     all_labels = {cat for ex in examples for cat in ex['target'].keys()}
    #     # Calculating embeddings is usually expensive so only run this once
    #     arrays = {}
    #     for label in all_labels:
    #         subset = [ex for ex in examples if label in ex['target'].keys()]
    #         texts = [ex['text'] for ex in subset]
    #         arrays[label] = self.encoder.transform(texts)
    #
    #     def concat_if_exists(main, new):
    #         """This function is only used here, so internal"""
    #         if main is None:
    #             return new
    #         return np.concatenate([main, new])
    #
    #     X1 = None
    #     X2 = None
    #     lab = None
    #     for label in all_labels:
    #         subset = [ex for ex in examples if label in ex['target'].keys()]
    #         labels = [ex['target'][label] for ex in subset]
    #         pairs = generate_pairs_batch(labels)
    #         X = arrays[label]
    #         X1 = concat_if_exists(X1, np.array([X[p.e1] for p in pairs]))
    #         X2 = concat_if_exists(X2, np.array([X[p.e2] for p in pairs]))
    #         lab = concat_if_exists(lab, np.array([p.val for p in pairs], dtype=float))
    #     if self.verbose:
    #         console.log(f"Generated {len(lab)} pairs for contrastive finetuning.")
    #     return X1, X2, lab
    #
    # def _learn_finetuner(self, examples):
    #     X1, X2, lab = self._generate_finetune_dataset(examples)
    #     self.finetuner.construct_models(X1, X2)
    #     self.finetuner.learn(X1, X2, lab)

    def _prepare_stream(self, stream):
        lines = LazyLines(stream).map(lambda d: Example(**d))
        lines_orig, lines_new = lines.tee()
        labels = {lab for ex in lines_orig for lab in ex.target.keys()}
        mapper = {}
        for ex in lines_new:
            if ex.text not in mapper:
                mapper[ex.text] = {}
            for lab in ex.target.keys():
                mapper[ex.text][lab] = ex.target[lab]
        self.log(f"Found {len(mapper)} examples for {len(labels)} labels.")
        return labels, mapper

    def learn(self, examples: List[Dict]) -> "SentenceModel":
        """
        Learn from a generator of examples. Can update a previously loaded model.

        Each example should be a dictionary with a "text" key and a "target" key.
        Internally this method checks via this Pydantic model:

        ```python
        class Example(BaseModel):
            text: str
            target: Dict[str, bool]
        ```

        As long as your generator emits dictionaries in this format, all will go well.

        **Usage:**

        ```python
        from sentence_models import SentenceModel

        smod = SentenceModel().learn(some_generator)
        ```
        """
        labels, mapper = self._prepare_stream(examples)
        # if self.finetuner is not None:
        #     self._learn_finetuner([{"text": k, "target": v} for k, v in mapper.items()])
        self.classifiers = {lab: clone(self.clf_head) for lab in labels}
        for lab, clf in self.classifiers.items():
            texts = [text for text, targets in mapper.items() if lab in targets]
            labels = [mapper[text][lab] for text in texts]
            X = self.encode(texts)
            clf.fit(X, labels)
            self.log(f"Trained classifier head for {lab=}")
        return self

    def learn_from_disk(self, path: Path) -> "SentenceModel":
        """
        Load a JSONL file from disk and learn from it.

        **Usage:**

        ```python
        from sentence_models import SentenceModel

        smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
        ```
        """
        return self.learn(list(read_jsonl(Path(path))))

    def _to_sentences(self, text: str):
        for sent in self.spacy_model(text).sents:
            yield sent.text

    def encode(self, texts: List[str]):
        """
        Encode a list of texts into a matrix of shape (n_texts, n_features)

        **Usage:**

        ```python
        from sentence_models import SentenceModel

        smod = SentenceModel()
        smod.encode(["example text"])
        ```
        """
        if self.finetuner:
            console.log(self.finetuner)
        X = self.encoder.transform(texts)
        if self.finetuner is not None:
            return self.finetuner.encode(X)
        return X

    def __call__(self, text: str):
        """
        Make a prediction for a single text.

        **Usage:**

        ```python
        from sentence_models import SentenceModel

        smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
        smod("Predict this. Per sentence!")
        ```
        """
        result = {"text": text}
        sents = list(self._to_sentences(text))
        result["sentences"] = [{"sentence": sent, "cats": {}} for sent in sents]
        X = self.encode(sents)
        for lab, clf in self.classifiers.items():
            probas = clf.predict_proba(X)[:, 1]
            for i, proba in enumerate(probas):
                result["sentences"][i]["cats"][lab] = float(proba)
        return result

    def pipe(self, texts):
        # Currently undocumented because I want to make it faster
        for ex in texts:
            yield self(ex)

    def to_disk(self, folder: Union[str, Path]) -> None:
        """
        Writes a `SentenceModel` to disk.

        **Usage:**

        ```python
        from sentence_models import SentenceModel

        smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
        smod.to_disk("path/to/model")
        ```
        """
        self.log(f"Storing {self}.")
        folder = Path(folder)
        folder.mkdir(exist_ok=True, parents=True)
        for name, clf in self.classifiers.items():
            self.log(f"Writing to disk {folder}/{name}.skops")
            dump(clf, folder / f"{name}.skops")
        if self.finetuner is not None:
            self.finetuner.to_disk(folder)
        settings = {"encoder_str": str(self.encoder)}
        srsly.write_json(folder / "settings.json", settings)
        self.log(f"Model stored in {folder}.")

    def __repr__(self):
        return f"SentenceModel(encoder={self.encoder}, heads={list(self.classifiers.keys())})"

    @classmethod
    def from_disk(cls, folder: Union[str, Path], encoder, spacy_model: str = "en_core_web_sm", verbose: bool = False) -> "SentenceModel":
        """
        Loads a `SentenceModel` from disk.

        **Usage:**

        ```python
        from sentence_models import SentenceModel
        from embetter.text import SentenceEncoder

        # It's good to be explicit with the encoder. Internally this method will check if
        # the encoder matches what was available during training. The spaCy model is less
        # critical because it merely splits the sentences during inference.
        smod = SentenceModel.from_disk("path/to/model", encoder=SentenceEncoder(), spacy_model="en_core_web_sm")
        smod("Predict this. Per sentence!")
        ```
        """
        folder = Path(folder)
        models = {p.parts[-1].replace(".skops", ""): load(p, trusted=True) for p in folder.glob("*.skops")}
        if len(models) == 0:
            raise ValueError(f"Did not find any `.skops` files in {folder}. Are you sure the folder is correct?")
        settings = srsly.read_json(folder / "settings.json")
        assert str(encoder) == settings["encoder_str"], f"The encoder at time of saving ({settings['encoder_str']}) differs from this one ({encoder})."
        smod = SentenceModel(
            encoder=encoder,
            clf_head=list(models.values())[0],
            spacy_model=spacy_model,
            verbose=verbose,
            finetuner=None,
        )
        smod.classifiers = models
        return smod
````
__call__(text)
Make a prediction for a single text.
Usage:
```python
from sentence_models import SentenceModel

smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
smod("Predict this. Per sentence!")
```
````python
def __call__(self, text: str):
    """
    Make a prediction for a single text.

    **Usage:**

    ```python
    from sentence_models import SentenceModel

    smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
    smod("Predict this. Per sentence!")
    ```
    """
    result = {"text": text}
    sents = list(self._to_sentences(text))
    result["sentences"] = [{"sentence": sent, "cats": {}} for sent in sents]
    X = self.encode(sents)
    for lab, clf in self.classifiers.items():
        probas = clf.predict_proba(X)[:, 1]
        for i, proba in enumerate(probas):
            result["sentences"][i]["cats"][lab] = float(proba)
    return result
````
encode(texts)
Encode a list of texts into a matrix of shape (n_texts, n_features)
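Usage (adapted from the docstring, but passing an encoder explicitly since `SentenceModel` does not define a default one):

```python
from sentence_models import SentenceModel
from sklearn.feature_extraction.text import HashingVectorizer

smod = SentenceModel(encoder=HashingVectorizer())
smod.encode(["example text"])
```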
````python
def encode(self, texts: List[str]):
    """
    Encode a list of texts into a matrix of shape (n_texts, n_features)

    **Usage:**

    ```python
    from sentence_models import SentenceModel

    smod = SentenceModel()
    smod.encode(["example text"])
    ```
    """
    if self.finetuner:
        console.log(self.finetuner)
    X = self.encoder.transform(texts)
    if self.finetuner is not None:
        return self.finetuner.encode(X)
    return X
````
from_disk(folder, encoder, spacy_model='en_core_web_sm', verbose=False)
Loads a `SentenceModel` from disk.
Usage:

```python
from sentence_models import SentenceModel
from embetter.text import SentenceEncoder

# It's good to be explicit with the encoder. Internally this method will check if
# the encoder matches what was available during training. The spaCy model is less
# critical because it merely splits the sentences during inference.
smod = SentenceModel.from_disk("path/to/model", encoder=SentenceEncoder(), spacy_model="en_core_web_sm")
smod("Predict this. Per sentence!")
```
````python
@classmethod
def from_disk(cls, folder: Union[str, Path], encoder, spacy_model: str = "en_core_web_sm", verbose: bool = False) -> "SentenceModel":
    """
    Loads a `SentenceModel` from disk.

    **Usage:**

    ```python
    from sentence_models import SentenceModel
    from embetter.text import SentenceEncoder

    # It's good to be explicit with the encoder. Internally this method will check if
    # the encoder matches what was available during training. The spaCy model is less
    # critical because it merely splits the sentences during inference.
    smod = SentenceModel.from_disk("path/to/model", encoder=SentenceEncoder(), spacy_model="en_core_web_sm")
    smod("Predict this. Per sentence!")
    ```
    """
    folder = Path(folder)
    models = {p.parts[-1].replace(".skops", ""): load(p, trusted=True) for p in folder.glob("*.skops")}
    if len(models) == 0:
        raise ValueError(f"Did not find any `.skops` files in {folder}. Are you sure the folder is correct?")
    settings = srsly.read_json(folder / "settings.json")
    assert str(encoder) == settings["encoder_str"], f"The encoder at time of saving ({settings['encoder_str']}) differs from this one ({encoder})."
    smod = SentenceModel(
        encoder=encoder,
        clf_head=list(models.values())[0],
        spacy_model=spacy_model,
        verbose=verbose,
        finetuner=None,
    )
    smod.classifiers = models
    return smod
````
learn(examples)
Learn from a generator of examples. Can update a previously loaded model.
Each example should be a dictionary with a "text" key and a "target" key.
Internally this method checks each dictionary via the `Example` Pydantic model shown earlier; as long as your generator emits dictionaries in that format, all will go well.
````python
def learn(self, examples: List[Dict]) -> "SentenceModel":
    """
    Learn from a generator of examples. Can update a previously loaded model.

    Each example should be a dictionary with a "text" key and a "target" key.
    Internally this method checks via this Pydantic model:

    ```python
    class Example(BaseModel):
        text: str
        target: Dict[str, bool]
    ```

    As long as your generator emits dictionaries in this format, all will go well.

    **Usage:**

    ```python
    from sentence_models import SentenceModel

    smod = SentenceModel().learn(some_generator)
    ```
    """
    labels, mapper = self._prepare_stream(examples)
    # if self.finetuner is not None:
    #     self._learn_finetuner([{"text": k, "target": v} for k, v in mapper.items()])
    self.classifiers = {lab: clone(self.clf_head) for lab in labels}
    for lab, clf in self.classifiers.items():
        texts = [text for text, targets in mapper.items() if lab in targets]
        labels = [mapper[text][lab] for text in texts]
        X = self.encode(texts)
        clf.fit(X, labels)
        self.log(f"Trained classifier head for {lab=}")
    return self
````
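A minimal sketch of calling `learn` directly, assuming the `HashingVectorizer` encoder from the Quickstart; any iterable of dictionaries in the `Example` format works, and the texts here are made up for illustration:

```python
from sentence_models import SentenceModel
from sklearn.feature_extraction.text import HashingVectorizer

# Each dict carries a "text" and a "target"; not every label needs to appear.
examples = [
    {"text": "We release a new dataset of annotated citrus images.", "target": {"new-dataset": True}},
    {"text": "Our method improves on the previous baseline.", "target": {"new-dataset": False}},
]

smod = SentenceModel(encoder=HashingVectorizer()).learn(examples)
```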
learn_from_disk(path)
Load a JSONL file from disk and learn from it.

````python
def learn_from_disk(self, path: Path) -> "SentenceModel":
    """
    Load a JSONL file from disk and learn from it.

    **Usage:**

    ```python
    from sentence_models import SentenceModel

    smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
    ```
    """
    return self.learn(list(read_jsonl(Path(path))))
````
to_disk(folder)
Writes a `SentenceModel` to disk.

````python
def to_disk(self, folder: Union[str, Path]) -> None:
    """
    Writes a `SentenceModel` to disk.

    **Usage:**

    ```python
    from sentence_models import SentenceModel

    smod = SentenceModel().learn_from_disk("path/to/file.jsonl")
    smod.to_disk("path/to/model")
    ```
    """
    self.log(f"Storing {self}.")
    folder = Path(folder)
    folder.mkdir(exist_ok=True, parents=True)
    for name, clf in self.classifiers.items():
        self.log(f"Writing to disk {folder}/{name}.skops")
        dump(clf, folder / f"{name}.skops")
    if self.finetuner is not None:
        self.finetuner.to_disk(folder)
    settings = {"encoder_str": str(self.encoder)}
    srsly.write_json(folder / "settings.json", settings)
    self.log(f"Model stored in {folder}.")
````
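Putting `to_disk` and `from_disk` together, a save/load round trip might look like the sketch below. Note that `from_disk` asserts that the encoder you pass matches the one used at training time, so the same encoder is constructed on both sides.

```python
from sentence_models import SentenceModel
from sklearn.feature_extraction.text import HashingVectorizer

# Train and store: one .skops file is written per label, plus a settings.json.
smod = SentenceModel(encoder=HashingVectorizer()).learn_from_disk("annotations.jsonl")
smod.to_disk("path/to/model")

# Later: reload with the same encoder and predict again.
reloaded = SentenceModel.from_disk("path/to/model", encoder=HashingVectorizer())
reloaded("Predict this. Per sentence!")
```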