Embeddings
Finalfusion Embeddings
-
class ffp.embeddings.Embeddings(storage: Optional[ffp.storage.storage.Storage] = None, vocab: Optional[ffp.vocab.vocab.Vocab] = None, norms: Optional[ffp.norms.Norms] = None, metadata: Optional[ffp.metadata.Metadata] = None)
Bases: object
Embeddings class.
Typically consists of a Storage and a Vocab. Other possible chunks are ffp.norms.Norms, corresponding to the embeddings of the in-vocab tokens, and Metadata.
If a vocabulary and a storage are provided, embeddings can be retrieved through three methods:
Embeddings.embedding() accepts a default value and returns it if no embedding could be found.
Embeddings.__getitem__() retrieves an embedding for the query but raises an exception if it cannot retrieve one.
Embeddings.embedding_with_norm() requires a Norms chunk and returns an embedding together with the corresponding L2 norm.
Embeddings wrap any combination of the 4 chunk types:
Storage, either NdArray or QuantizedArray
Vocab, one of SimpleVocab, FinalfusionBucketVocab, FastTextVocab and ExplicitVocab
Norms
Metadata
Examples
>>> storage = NdArray(np.float32(np.random.rand(2, 10)))
>>> vocab = SimpleVocab(["Some", "words"])
>>> metadata = Metadata({"Some": "value", "numerical": 0})
>>> norms = Norms(np.float32(np.random.rand(2)))
>>> embeddings = Embeddings(storage=storage, vocab=vocab, metadata=metadata, norms=norms)
>>> embeddings.vocab.words
['Some', 'words']
>>> np.allclose(embeddings["Some"], storage[0])
True
>>> try:
...     embeddings["oov"]
... except KeyError:
...     True
True
>>> _, n = embeddings.embedding_with_norm("Some")
>>> np.isclose(n, norms[0])
True
>>> embeddings.metadata
{'Some': 'value', 'numerical': 0}
-
__init__(storage: Optional[ffp.storage.storage.Storage] = None, vocab: Optional[ffp.vocab.vocab.Vocab] = None, norms: Optional[ffp.norms.Norms] = None, metadata: Optional[ffp.metadata.Metadata] = None)
Initialize Embeddings with the given chunks.
- Conditions
The following conditions need to hold if the respective chunks are passed.
Chunks need to have the expected type.
vocab.idx_bound == storage.shape[0]
len(vocab) == len(norms)
len(norms) == len(vocab) and len(norms) <= storage.shape[0]
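The conditions above can be sketched as a standalone check. This is an illustration only, not the library's implementation: `check_chunk_shapes` and its integer parameters are hypothetical stand-ins for `storage.shape[0]`, `vocab.idx_bound`, `len(vocab)` and `len(norms)`. The bound on the norms follows the `storage` and `norms` setter descriptions, which reject norms that outnumber the storage rows.

```python
from typing import Optional


def check_chunk_shapes(n_rows: Optional[int] = None,
                       idx_bound: Optional[int] = None,
                       vocab_len: Optional[int] = None,
                       norms_len: Optional[int] = None) -> None:
    """Hypothetical helper: assert the size conditions for the chunks that are given.

    n_rows    stands in for storage.shape[0]
    idx_bound stands in for vocab.idx_bound
    vocab_len stands in for len(vocab)
    norms_len stands in for len(norms)
    """
    if n_rows is not None and idx_bound is not None:
        assert idx_bound == n_rows, "vocab.idx_bound must equal storage.shape[0]"
    if vocab_len is not None and norms_len is not None:
        assert vocab_len == norms_len, "len(vocab) must equal len(norms)"
    if norms_len is not None and n_rows is not None:
        assert norms_len <= n_rows, "len(norms) must not exceed storage.shape[0]"


# A subword vocabulary with 2 words and 8 buckets: 10 storage rows, 2 norms.
check_chunk_shapes(n_rows=10, idx_bound=10, vocab_len=2, norms_len=2)
```

Each condition is only checked when both chunks it relates are present, which mirrors the statement that the conditions apply "if the respective chunks are passed".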
- Parameters
storage (Storage, optional) – Embeddings Storage.
vocab (Vocab, optional) – Embeddings Vocabulary.
norms (Norms, optional) – Embeddings Norms.
metadata (Metadata, optional) – Embeddings Metadata.
- Raises
AssertionError – If any of the conditions don’t hold.
-
__getitem__(item: str) → numpy.ndarray
Returns an embedding.
- Parameters
item (str) – The query item.
- Returns
embedding – The embedding.
- Return type
numpy.ndarray
- Raises
KeyError – If no embedding could be retrieved.
-
embedding(word: str, out: Optional[numpy.ndarray] = None, default: Optional[numpy.ndarray] = None) → Optional[numpy.ndarray]
Embedding lookup.
Looks up the embedding for the input word.
If an out array is specified, the embedding is written into the array.
If no embedding can be retrieved for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.
This method fails if either the storage or the vocab is not set.
- Parameters
word (str) – The query word.
out (numpy.ndarray, any, optional) – Optional output array to write the embedding into.
default (numpy.ndarray, any, optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
embedding – The retrieved embedding or the default value.
- Return type
numpy.ndarray, optional
Examples
>>> matrix = np.float32(np.random.rand(2, 10))
>>> storage = NdArray(matrix)
>>> vocab = SimpleVocab(["Some", "words"])
>>> embeddings = Embeddings(storage=storage, vocab=vocab)
>>> np.allclose(embeddings.embedding("Some"), matrix[0])
True
>>> # default value is None
>>> embeddings.embedding("oov") is None
True
>>> # It's possible to specify a default value
>>> default = embeddings.embedding("oov", default=storage[0])
>>> np.allclose(default, storage[0])
True
>>> # Embeddings can be written to an output buffer.
>>> out = np.zeros(10, dtype=np.float32)
>>> out2 = embeddings.embedding("Some", out=out)
>>> out is out2
True
>>> np.allclose(out, matrix[0])
True
-
embedding_with_norm(word: str, out: Optional[numpy.ndarray] = None, default: Optional[Tuple[numpy.ndarray, float]] = None) → Optional[Tuple[numpy.ndarray, float]]
Embedding lookup with norm.
Looks up the embedding for the input word together with its norm.
If an out array is specified, the embedding is written into the array.
If no embedding can be retrieved for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.
This method fails if any of storage, vocab or norms is not set.
- Parameters
word (str) – The query word.
out (Optional[numpy.ndarray]) – Optional output array to write the embedding into.
default (Optional[Tuple[numpy.ndarray, float]]) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
(embedding, norm) – Tuple with the retrieved embedding or the default value at the first index and the norm at the second index.
- Return type
tuple, optional
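How an embedding and its L2 norm relate can be sketched without the library. The snippet below is a toy model, not the ffp implementation: it assumes that the stored rows are unit-length vectors and that the norms hold the length each vector had before normalization. The names `l2_normalize`, `index`, `rows` and `norms` are hypothetical, and plain Python lists stand in for the vocab, storage and norms chunks.

```python
import math
from typing import Dict, List, Optional, Tuple


def l2_normalize(vector: List[float]) -> Tuple[List[float], float]:
    """Return the unit-length version of the vector and its L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm


# Toy stand-ins for the vocab, storage and norms chunks (assumption, not the ffp API).
raw = {"Some": [3.0, 4.0], "words": [0.0, 2.0]}
index: Dict[str, int] = {}
rows: List[List[float]] = []
norms: List[float] = []
for word, vector in raw.items():
    unit, norm = l2_normalize(vector)
    index[word] = len(rows)
    rows.append(unit)
    norms.append(norm)


def embedding_with_norm(word: str,
                        default: Optional[Tuple[List[float], float]] = None
                        ) -> Optional[Tuple[List[float], float]]:
    """Return (embedding, norm) for a known word, or the default otherwise."""
    idx = index.get(word)
    if idx is None:
        return default
    return rows[idx], norms[idx]


embedding, norm = embedding_with_norm("Some")
# Under the normalization assumption, scaling by the norm restores the original vector.
original = [component * norm for component in embedding]
```

Under that assumption, the returned norm is what a caller needs to undo the normalization, and an unknown word yields the default rather than raising.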
-
property storage
Get the ffp.storage.storage.Storage of the Embeddings. Returns None if no storage is set.
- Setter
Sets a new storage.
- Getter
Get the storage.
- Returns
storage – The embeddings storage.
- Return type
Storage, optional
- Raises
AssertionError – If embeddings.storage.shape[0] != embeddings.vocab.idx_bound or len(embeddings.norms) > embeddings.storage.shape[0].
TypeError – If storage is neither a Storage nor None.
-
property vocab
The Vocab.
- Getter
Returns None or the Vocabulary.
- Setter
Set the vocabulary.
- Returns
vocab – The vocabulary or None.
- Return type
Vocab, optional
- Raises
AssertionError – If embeddings.storage.shape[0] != embeddings.vocab.idx_bound or len(embeddings.norms) != len(embeddings.vocab).
TypeError – If vocab is neither a Vocab nor None.
Examples
>>> words = ['Some', 'words']
>>> vocab = SimpleVocab(words)
>>> embeddings = Embeddings(vocab=vocab)
>>> embeddings.vocab.words
['Some', 'words']
>>> embeddings.vocab['Some']
0
-
property norms
The Norms.
- Getter
Returns None or the Norms.
- Setter
Set the Norms.
- Returns
norms – The Norms or None.
- Return type
Norms, optional
- Raises
AssertionError – If embeddings.storage.shape[0] < len(embeddings.norms) or len(embeddings.norms) != len(embeddings.vocab).
TypeError – If norms is neither Norms nor None.
Examples
>>> norms = Norms(np.float32(np.abs(np.random.rand(5))))
>>> embeddings = Embeddings()
>>> embeddings.norms = norms
>>> np.isclose(embeddings.norms[0], norms[0])
True
-
property metadata
The Metadata.
- Getter
Returns None or the Metadata.
- Setter
Set the Metadata.
- Returns
metadata – The Metadata or None.
- Return type
Metadata, optional
- Raises
TypeError – If metadata is neither Metadata nor None.
Examples
>>> metadata = Metadata({"test": "value", "num": -1})
>>> embeddings = Embeddings()
>>> embeddings.metadata = metadata
>>> embeddings.metadata
{'test': 'value', 'num': -1}
-
bucket_to_explicit() → ffp.embeddings.Embeddings
Convert bucket embeddings to embeddings with explicit lookup.
Multiple embeddings can still map to the same bucket, but all buckets that are not indexed by in-vocabulary n-grams are eliminated. This can have a big impact on the size of the embedding matrix.
A side effect of this method is the conversion from a quantized storage to an array storage.
- Returns
embeddings – Embeddings with an ExplicitVocab instead of a hash-based vocabulary.
- Return type
Embeddings
- Raises
TypeError – If the current vocabulary is not a hash-based vocabulary (FinalfusionBucketVocab or FastTextVocab).
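The idea behind the conversion can be sketched on a toy example. This is not the ffp implementation: `ngrams`, `toy_bucket` and `to_explicit` are hypothetical helpers, the hash function is a deterministic stand-in, and only the subword rows of the matrix are modelled. The sketch keeps the rows of buckets that some in-vocabulary n-gram maps to and renumbers them densely, so n-grams that shared a bucket still share a row.

```python
from typing import Dict, List, Tuple

N_BUCKETS = 16  # toy bucket count (assumption)


def ngrams(word: str, n: int = 3) -> List[str]:
    """Character n-grams of the bracketed word."""
    bracketed = f"<{word}>"
    return [bracketed[i:i + n] for i in range(len(bracketed) - n + 1)]


def toy_bucket(ngram: str) -> int:
    """Deterministic stand-in for the real n-gram hash function."""
    return sum(ord(ch) for ch in ngram) % N_BUCKETS


def to_explicit(words: List[str], bucket_rows: List[List[float]]
                ) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only buckets hit by in-vocabulary n-grams and renumber them densely."""
    bucket_to_idx: Dict[int, int] = {}
    ngram_to_idx: Dict[str, int] = {}
    for word in words:
        for ngram in ngrams(word):
            bucket = toy_bucket(ngram)
            # N-grams that collided in one bucket keep sharing a single row.
            idx = bucket_to_idx.setdefault(bucket, len(bucket_to_idx))
            ngram_to_idx[ngram] = idx
    # Dicts preserve insertion order, so the kept rows line up with the new indices.
    kept_rows = [bucket_rows[bucket] for bucket in bucket_to_idx]
    return ngram_to_idx, kept_rows


bucket_rows = [[float(b)] for b in range(N_BUCKETS)]
ngram_to_idx, kept_rows = to_explicit(["some", "words"], bucket_rows)
```

The explicit n-gram index replaces hashing at lookup time, and the matrix shrinks from `N_BUCKETS` rows to the number of buckets actually in use.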
-
chunks() → List[ffp.io.Chunk]
Get the Embeddings Chunks as a list.
The Chunks are ordered in the expected serialization order:
1. Metadata
2. Vocabulary
3. Storage
4. Norms
- Returns
chunks – List of embeddings chunks.
- Return type
List[Chunk]
-
ffp.embeddings.load_finalfusion(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → ffp.embeddings.Embeddings
Read embeddings from a file in finalfusion format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in finalfusion format.
mmap (bool) – Toggles memory mapping the storage buffer.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
-
ffp.embeddings.load_word2vec(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings
Read embeddings in word2vec binary format.
Files are expected to start with a line containing rows and cols in utf-8. Words are encoded in utf-8 followed by a single whitespace. After the whitespace the embedding components are expected as little-endian float32.
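The layout described above can be read with the standard library alone. The following is a sketch of a parser for that layout, not the code behind `load_word2vec`: `read_word2vec_binary` is a hypothetical name, and it returns plain Python lists and tuples rather than an Embeddings object.

```python
import io
import struct
from typing import BinaryIO, List, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Tuple[List[str], List[Tuple[float, ...]]]:
    """Parse the header line, then one (word, vector) record per row."""
    header = stream.readline().decode("utf-8")
    rows, cols = (int(field) for field in header.split())
    words: List[str] = []
    vectors: List[Tuple[float, ...]] = []
    for _ in range(rows):
        # Read bytes up to the single space that terminates the word.
        word = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("unexpected end of file while reading a word")
            if byte == b" ":
                break
            word.extend(byte)
        # Some writers emit a newline after each vector; drop it from the next word.
        words.append(word.decode("utf-8").lstrip("\n"))
        raw = stream.read(4 * cols)
        if len(raw) != 4 * cols:
            raise EOFError("unexpected end of file while reading a vector")
        # "<" selects little-endian, "f" is a 4-byte float32.
        vectors.append(struct.unpack(f"<{cols}f", raw))
    return words, vectors


sample = (b"2 2\n"
          + b"Some " + struct.pack("<2f", 1.0, 2.0)
          + b"words " + struct.pack("<2f", 3.0, 4.0))
words, vectors = read_word2vec_binary(io.BytesIO(sample))
```

Because each vector is read as a fixed number of bytes, a float whose encoding happens to contain a space byte cannot be mistaken for a word separator.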
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
-
ffp.embeddings.load_textdims(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings
Read embeddings in textdims format.
The first line contains whitespace separated rows and cols, the rest of the file contains whitespace separated word and vector components.
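A parser for this layout fits in a few lines. The sketch below is illustrative and is not the code behind `load_textdims`: `read_textdims` is a hypothetical name, and it returns plain lists instead of an Embeddings object.

```python
import io
from typing import List, TextIO, Tuple


def read_textdims(stream: TextIO) -> Tuple[List[str], List[List[float]]]:
    """Parse a 'rows cols' header followed by 'word c1 c2 ...' lines."""
    rows, cols = (int(field) for field in stream.readline().split())
    words: List[str] = []
    vectors: List[List[float]] = []
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        components = [float(value) for value in fields[1:]]
        if len(components) != cols:
            raise ValueError(f"expected {cols} components, got {len(components)}")
        words.append(fields[0])
        vectors.append(components)
    if len(words) != rows:
        raise ValueError(f"expected {rows} rows, got {len(words)}")
    return words, vectors


sample = "2 3\nSome 0.1 0.2 0.3\nwords 0.4 0.5 0.6\n"
words, vectors = read_textdims(io.StringIO(sample))
```

The header makes it possible to validate the file: a line with the wrong number of components or a row count that does not match is reported instead of being silently accepted.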
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in textdims format.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings
-
ffp.embeddings.load_text(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings
Read embeddings in text format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.
- Returns
embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.
- Return type
Embeddings
-
ffp.embeddings.load_fastText(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings
Read embeddings from a file in fastText format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.
- Returns
embeddings – The embeddings from the input file.
- Return type
Embeddings