whatlies.language.CountVectorLanguage
This object is used to lazily fetch Embeddings or EmbeddingSets from a count-vector language backend. This object is meant for retrieval, not plotting.
This model first trains a scikit-learn CountVectorizer, after which it performs dimensionality reduction to turn the sparse count matrix into a dense vector representation. The reduction is done via TruncatedSVD, also from scikit-learn.
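Conceptually, this is equivalent to a small scikit-learn pipeline. The sketch below uses plain scikit-learn (not the whatlies API itself) to illustrate the idea; the variable names are illustrative:

```python
# Conceptual sketch of what CountVectorLanguage does internally:
# count character n-grams, then reduce dimensionality with TruncatedSVD.
# This uses plain scikit-learn, not the whatlies API itself.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

texts = ["pizza", "pizzas", "firehouse", "firehydrant"]
pipe = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    TruncatedSVD(n_components=2, random_state=42),
)
# Each text becomes a dense 2-dimensional vector.
vectors = pipe.fit_transform(texts)
print(vectors.shape)  # (4, 2)
```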
Warning
This method does not implement a word embedding in the traditional sense, so the interpretation needs to be adjusted: the information captured here relates only to the words/characters that are used in the text. No notion of semantic meaning should be inferred.
Also, in order to keep this system consistent with the rest of the API, the model is trained when you retrieve vectors via `__getitem__`. If you want to separate the train and test steps, you need to call `fit_manual` yourself or use the object in a scikit-learn pipeline.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `n_components` | `int` | Number of components that TruncatedSVD will reduce to. | required |
| `lowercase` | `bool` | Whether the tokens need to be lowercased beforehand. | `True` |
| `analyzer` | `str` | Which analyzer to use; can be `"word"`, `"char"` or `"char_wb"`. | `'char'` |
| `ngram_range` | `Tuple[int, int]` | The lower and upper bounds of the n-gram range. | `(1, 2)` |
| `min_df` | `Union[int, float]` | Ignore terms that have a document frequency strictly lower than the given threshold. | `1` |
| `max_df` | `Union[int, float]` | Ignore terms that have a document frequency strictly higher than the given threshold. | `1.0` |
| `binary` | `bool` | Determines whether the counts are binary or whether they can accumulate. | `False` |
| `strip_accents` | `str` | Remove accents and perform normalisation; can be set to `"ascii"` or `"unicode"`. | `None` |
| `random_state` | `int` | Random state for the SVD algorithm. | `42` |
For a more elaborate explanation of these arguments, check out the scikit-learn documentation.
Usage:

```python
from whatlies.language import CountVectorLanguage

lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
```
__getitem__(self, query)
Retrieve a set of embeddings.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Union[str, List[str]]` | A string or list of strings to fetch embeddings for. | required |
Usage

```python
from whatlies.language import CountVectorLanguage

lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
```
embset_similar(self, emb, n=10, lower=False, metric='cosine')
Retrieve an EmbeddingSet containing the embeddings most similar to the passed query.
Note that only words that were passed in the `.fit_manual()` step are considered.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `emb` | `Union[str, whatlies.embedding.Embedding]` | Query to use. | required |
| `n` | `int` | The number of items you'd like to see returned. | `10` |
| `metric` | | Metric used to calculate distance; must be scipy- or sklearn-compatible. | `'cosine'` |
| `lower` | | Only fetch lowercase tokens. | `False` |
Returns
| Type | Description |
|---|---|
| `EmbeddingSet` | An EmbeddingSet containing the similar embeddings. |
fit_manual(self, query)
Fit the model manually. This way you can call `__getitem__` independently of training.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | | A list of strings. | required |
Usage

```python
from whatlies.language import CountVectorLanguage

lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
lang.fit_manual(['pizza', 'pizzas', 'firehouse', 'firehydrant'])
lang[['piza', 'pizza', 'pizzaz', 'fyrehouse', 'firehouse', 'fyrehidrant']]
```
score_similar(self, emb, n=10, metric='cosine', lower=False)
Retrieve a list of (Embedding, score) tuples that are the most similar to the passed query.
Note that only words that were passed in the `.fit_manual()` step are considered.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `emb` | `Union[str, whatlies.embedding.Embedding]` | Query to use. | required |
| `n` | `int` | The number of items you'd like to see returned. | `10` |
| `metric` | | Metric used to calculate distance; must be scipy- or sklearn-compatible. | `'cosine'` |
| `lower` | | Only fetch lowercase tokens. | `False` |
Returns
| Type | Description |
|---|---|
| `List` | A list of (Embedding, score) tuples. |