whatlies.language.BytePairLanguage
This object is used to lazily fetch Embeddings or EmbeddingSets from a Byte-Pair Encoding backend. This object is meant for retrieval, not plotting.
This language backend represents token-free, pre-trained subword embeddings, originally created by Benjamin Heinzerling and Michael Strube.
Important
These vectors are automatically downloaded by the BPEmb package. You can also specify "multi" to download multilingual embeddings. A full list of available languages can be found here, and the article that belongs to this work can be found here. Recognition should be given to Benjamin Heinzerling and Michael Strube for making these embeddings available. The available vocabulary sizes and dimensionalities can be verified on the project website; see here for an example link for English. Please credit the original authors if you use their work.
Warning
This class used to be called BytePairLang.
Parameters
Name | Type | Description | Default
---|---|---|---
`lang` | | name of the model to load | required
`vs` | | vocabulary size of the byte pair model | `10000`
`dim` | | the embedding dimensionality | `100`
`cache_dir` | | the folder in which downloaded BPEmb files will be cached | `PosixPath('/home/vincent/.cache/bpemb')`
Typically the vocabulary sizes offered by this backend are 1000, 3000, 5000, 10000, 25000, 50000, 100000 and 200000. The available embedding dimensionalities are typically 25, 50, 100, 200 and 300.
Usage:
```python
from whatlies.language import BytePairLanguage

lang = BytePairLanguage(lang="en")
lang['python']

lang = BytePairLanguage(lang="multi")
lang[['hund', 'hond', 'dog']]
```
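As a rough sketch of how the vocabulary size and dimensionality choices listed above might be combined (this example is not part of the original docs and assumes the returned `Embedding` objects expose a `.vector` numpy array):

```python
from whatlies.language import BytePairLanguage

# Pick one of the supported vocabulary sizes and dimensionalities;
# the BPEmb files download lazily on first use.
lang_small = BytePairLanguage(lang="en", vs=1000, dim=25)     # small, fast model
lang_large = BytePairLanguage(lang="en", vs=200000, dim=300)  # large vocabulary

emb = lang_small["python"]
print(emb.vector.shape)  # expected (25,), matching the chosen `dim`
```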
__getitem__(self, item)
Retrieve a single embedding or a set of embeddings. If a query consists of multiple sub-tokens, their vectors are averaged before retrieval.
Parameters
Name | Type | Description | Default
---|---|---|---
`item` | | single string or list of strings | required
Usage
```python
lang = BytePairLanguage(lang="en")
lang['python']
lang[['python', 'snake']]
lang[['nobody expects', 'the spanish inquisition']]
```
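To illustrate the sub-token averaging described above, the hedged sketch below (not taken from the library's docs) assumes each returned `Embedding` carries a `.vector` numpy array:

```python
from whatlies.language import BytePairLanguage

lang = BytePairLanguage(lang="en")

# A single token and a multi-word phrase; the phrase is split into
# byte-pair sub-tokens whose vectors are averaged into one embedding.
single = lang["python"]
phrase = lang["nobody expects the spanish inquisition"]

# Both embeddings should therefore have the same dimensionality.
assert single.vector.shape == phrase.vector.shape
```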
embset_similar(self, emb, n=10, lower=False, metric='cosine')
Retrieve an EmbeddingSet containing the embeddings that are most similar to the passed query.
Parameters
Name | Type | Description | Default
---|---|---|---
`emb` | `Union[str, whatlies.embedding.Embedding]` | query to use | required
`n` | `int` | the number of items you'd like to see returned | `10`
`metric` | | metric to use to calculate distance, must be scipy or sklearn compatible | `'cosine'`
`lower` | | only fetch lower case tokens | `False`
Important
This method is incredibly slow at the moment without a good top_n setting due to this bug.
Returns
Type | Description
---|---
`EmbeddingSet` | An EmbeddingSet containing the similar embeddings.
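A small usage sketch for `embset_similar` (not part of the original docs; it assumes the returned `EmbeddingSet` exposes its items via an `.embeddings` dict keyed by token name):

```python
from whatlies.language import BytePairLanguage

lang = BytePairLanguage(lang="en")

# Fetch the 5 tokens closest to "python" under cosine distance.
similar_set = lang.embset_similar("python", n=5, metric="cosine")

# Inspect which tokens came back.
print(list(similar_set.embeddings.keys()))
```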
score_similar(self, emb, n=10, metric='cosine', lower=False)
Retrieve a list of (Embedding, score) tuples that are most similar to the passed query.
Parameters
Name | Type | Description | Default
---|---|---|---
`emb` | `Union[str, whatlies.embedding.Embedding]` | query to use | required
`n` | `int` | the number of items you'd like to see returned | `10`
`metric` | | metric to use to calculate distance, must be scipy or sklearn compatible | `'cosine'`
`lower` | | only fetch lower case tokens | `False`
Returns
Type | Description
---|---
`List` | A list of (Embedding, score) tuples.
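For comparison, a short sketch of `score_similar`; the unpacking of `(Embedding, score)` tuples follows the return type listed above, while the `.name` attribute on `Embedding` is an assumption of this example:

```python
from whatlies.language import BytePairLanguage

lang = BytePairLanguage(lang="en")

# Each element is an (Embedding, score) tuple; with cosine distance,
# a lower score means a closer match.
for emb, score in lang.score_similar("python", n=5):
    print(f"{emb.name:<15} {score:.3f}")
```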