Embeddings

Finalfusion Embeddings

class ffp.embeddings.Embeddings(storage: ffp.storage.storage.Storage, vocab: ffp.vocab.vocab.Vocab, norms: Optional[ffp.norms.Norms] = None, metadata: Optional[ffp.metadata.Metadata] = None)[source]

Bases: object

Embeddings class.

Embeddings always contain a Storage and Vocab. Optional chunks are Norms corresponding to the embeddings of the in-vocab tokens and Metadata.

Embeddings can be retrieved through three methods:

  1. Embeddings.embedding() takes an optional default value and returns it if no embedding could be found.

  2. Embeddings.__getitem__() retrieves an embedding for the query but raises a KeyError if no embedding can be retrieved.

  3. Embeddings.embedding_with_norm() requires a Norms chunk and returns an embedding together with the corresponding L2 norm.

Embeddings are composed of four chunk types:

  1. Storage: either NdArray or QuantizedArray (required)

  2. Vocab: one of SimpleVocab, FinalfusionBucketVocab, FastTextVocab or ExplicitVocab (required)

  3. Norms (optional)

  4. Metadata (optional)

Examples

>>> storage = NdArray(np.float32(np.random.rand(2, 10)))
>>> vocab = SimpleVocab(["Some", "words"])
>>> metadata = Metadata({"Some": "value", "numerical": 0})
>>> norms = Norms(np.float32(np.random.rand(2)))
>>> embeddings = Embeddings(storage=storage, vocab=vocab, metadata=metadata, norms=norms)
>>> embeddings.vocab.words
['Some', 'words']
>>> np.allclose(embeddings["Some"], storage[0])
True
>>> try:
...     embeddings["oov"]
... except KeyError:
...     True
True
>>> _, n = embeddings.embedding_with_norm("Some")
>>> np.isclose(n, norms[0])
True
>>> embeddings.metadata
{'Some': 'value', 'numerical': 0}
__init__(storage: ffp.storage.storage.Storage, vocab: ffp.vocab.vocab.Vocab, norms: Optional[ffp.norms.Norms] = None, metadata: Optional[ffp.metadata.Metadata] = None)[source]

Initialize Embeddings.

Initializes Embeddings with the given chunks.

Conditions

The following conditions need to hold if the respective chunks are passed.

  • Chunks need to have the expected type.

  • vocab.idx_bound == storage.shape[0]

  • len(vocab) == len(norms)

  • len(norms) <= storage.shape[0]

Parameters
  • storage (Storage) – Embeddings Storage.

  • vocab (Vocab) – Embeddings Vocabulary.

  • norms (Norms, optional) – Embeddings Norms.

  • metadata (Metadata, optional) – Embeddings Metadata.

Raises

AssertionError – If any of the conditions don’t hold.
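These conditions can be made concrete with plain numpy arrays standing in for the chunks (a sketch with illustrative names and shapes, not ffp types):

```python
import numpy as np

# Toy stand-ins for the chunks. For subword vocabularies, idx_bound may
# exceed len(vocab), so the storage can have more rows than there are words.
storage = np.random.rand(5, 10).astype(np.float32)  # idx_bound x dims
vocab_words = ["Some", "words", "here"]             # len(vocab) == 3
norms = np.random.rand(3).astype(np.float32)        # one norm per in-vocab word

idx_bound = 5  # what vocab.idx_bound would report for this toy vocab

# The conditions Embeddings.__init__ asserts, expressed over these arrays:
assert idx_bound == storage.shape[0]   # vocab.idx_bound == storage.shape[0]
assert len(norms) == len(vocab_words)  # len(vocab) == len(norms)
assert len(norms) <= storage.shape[0]  # norms cover only the in-vocab rows
```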

__getitem__(item: str) → numpy.ndarray[source]

Returns the embedding for the query item.

Parameters

item (str) – The query item.

Returns

embedding – The embedding.

Return type

numpy.ndarray

Raises

KeyError – If no embedding could be retrieved.

embedding(word: str, out: Optional[numpy.ndarray] = None, default: Optional[numpy.ndarray] = None) → Optional[numpy.ndarray][source]

Embedding lookup.

Looks up the embedding for the input word.

If an out array is specified, the embedding is written into the array.

If it is not possible to retrieve an embedding for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.

This method never fails. If you do not provide a default value, check the return value for None. out is left untouched if no embedding can be found and default is None.

Parameters
  • word (str) – The query word.

  • out (numpy.ndarray, optional) – Optional output array to write the embedding into.

  • default (numpy.ndarray, optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.

Returns

embedding – The retrieved embedding or the default value.

Return type

numpy.ndarray, optional

Examples

>>> matrix = np.float32(np.random.rand(2, 10))
>>> storage = NdArray(matrix)
>>> vocab = SimpleVocab(["Some", "words"])
>>> embeddings = Embeddings(storage=storage, vocab=vocab)
>>> np.allclose(embeddings.embedding("Some"), matrix[0])
True
>>> # default value is None
>>> embeddings.embedding("oov") is None
True
>>> # It's possible to specify a default value
>>> default = embeddings.embedding("oov", default=storage[0])
>>> np.allclose(default, storage[0])
True
>>> # Embeddings can be written to an output buffer.
>>> out = np.zeros(10, dtype=np.float32)
>>> out2 = embeddings.embedding("Some", out=out)
>>> out is out2
True
>>> np.allclose(out, matrix[0])
True
embedding_with_norm(word: str, out: Optional[numpy.ndarray] = None, default: Optional[Tuple[numpy.ndarray, float]] = None) → Optional[Tuple[numpy.ndarray, float]][source]

Embedding lookup with norm.

Looks up the embedding for the input word together with its norm.

If an out array is specified, the embedding is written into the array.

If it is not possible to retrieve an embedding for the input word, the default value is returned. This defaults to None. An embedding cannot be retrieved if the vocabulary cannot provide an index for word.

This method raises a TypeError if norms are not set.

Parameters
  • word (str) – The query word.

  • out (numpy.ndarray, optional) – Optional output array to write the embedding into.

  • default (Tuple[numpy.ndarray, float], optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.

Returns

(embedding, norm) – Tuple with the retrieved embedding or the default value at the first index and the norm at the second index.

Return type

Tuple[numpy.ndarray, float], optional
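Finalfusion files typically store in-vocab embeddings l2-normalized, with the original norms kept in the Norms chunk. The relationship between the two values of the returned pair can be sketched with plain numpy (illustrative arrays, not ffp's implementation):

```python
import numpy as np

# Toy "raw" embeddings; the offset avoids zero vectors.
raw = np.random.rand(2, 10).astype(np.float32) + 0.1
norms = np.linalg.norm(raw, axis=1)   # what the Norms chunk holds
stored = raw / norms[:, None]         # unit-length rows, as stored

# embedding_with_norm conceptually returns (stored[idx], norms[idx]);
# scaling the unit embedding by its norm recovers the original vector.
idx = 0
embedding, norm = stored[idx], norms[idx]
assert np.allclose(embedding * norm, raw[idx], atol=1e-5)
```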

property storage

Get the Storage.

Returns

storage – The embeddings storage.

Return type

Storage

property vocab

The Vocab.

Returns

vocab – The vocabulary

Return type

Vocab

property norms

The Norms.

Getter

Returns None or the Norms.

Setter

Set the Norms.

Returns

norms – The Norms or None.

Return type

Norms, optional

Raises
  • AssertionError – If embeddings.storage.shape[0] < len(embeddings.norms) or len(embeddings.norms) != len(embeddings.vocab)

  • TypeError – If norms is neither Norms nor None.

property metadata

The Metadata.

Getter

Returns None or the Metadata.

Setter

Set the Metadata.

Returns

metadata – The Metadata or None.

Return type

Metadata, optional

Raises

TypeError – If metadata is neither Metadata nor None.

bucket_to_explicit() → ffp.embeddings.Embeddings[source]

Convert bucket embeddings to embeddings with explicit lookup.

Multiple embeddings can still map to the same bucket, but all buckets that are not indexed by in-vocabulary n-grams are eliminated. This can substantially reduce the size of the embedding matrix.

A side effect of this method is the conversion from a quantized storage to an array storage.

Returns

embeddings – Embeddings with an ExplicitVocab instead of a hash-based vocabulary.

Return type

Embeddings

Raises

TypeError – If the current vocabulary is not a hash-based vocabulary (FinalfusionBucketVocab or FastTextVocab)
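The bucket elimination can be illustrated with a toy subword scheme. The n-gram extraction and hash function below are deliberately simplified and are not finalfusion's actual hashing; they only show why many buckets go unused:

```python
import zlib

def ngrams(word, lo=3, hi=6):
    # Character n-grams of the angle-bracketed word, fastText-style.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(lo, hi + 1)
            for i in range(len(w) - n + 1)]

n_buckets = 2 ** 4  # tiny on purpose, to force collisions

def bucket(ngram):
    # Illustrative hash only, not finalfusion's scheme.
    return zlib.crc32(ngram.encode()) % n_buckets

vocab = ["apple", "apply"]
used = {b for w in vocab for b in map(bucket, ngrams(w))}

# bucket_to_explicit keeps only buckets reachable from in-vocab n-grams;
# with a realistic bucket count (e.g. 2**21) most matrix rows are never
# referenced and can be dropped.
print(len(used), "of", n_buckets, "buckets used")
```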

chunks() → List[ffp.io.Chunk][source]

Get the Embeddings Chunks as a list.

The Chunks are ordered in the expected serialization order:

  1. Metadata

  2. Vocabulary

  3. Storage

  4. Norms

Returns

chunks – List of embeddings chunks.

Return type

List[Chunk]

write(file: str)[source]

Write the Embeddings to the given file.

Writes the Embeddings to a finalfusion file at the given path.

Parameters

file (str) – Path of the output file.

ffp.embeddings.load_finalfusion(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → ffp.embeddings.Embeddings[source]

Read embeddings from a file in finalfusion format.

Parameters
  • file (str, bytes, int, PathLike) – Path to a file with embeddings in finalfusion format.

  • mmap (bool) – Whether to memory-map the storage buffer instead of reading it into memory.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
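The effect of mmap=True can be sketched with numpy's own memory mapping (a toy flat file, not the finalfusion chunk layout):

```python
import os
import tempfile

import numpy as np

# Write a toy float32 matrix to disk.
path = os.path.join(tempfile.mkdtemp(), "matrix.f32")
np.random.rand(1000, 10).astype(np.float32).tofile(path)

# Memory-map it: rows are paged in on access instead of being read up
# front, which is what mmap=True does for the storage chunk.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 10))
row = np.array(mapped[42])  # copy one row into regular memory
```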

ffp.embeddings.load_word2vec(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]

Read embeddings in word2vec binary format.

Files are expected to start with a line containing the number of rows and columns in utf-8. Each word is encoded in utf-8 and followed by a single space; after the space, the embedding components follow as little-endian float32 values.

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
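The layout described above can be made concrete with a toy writer and reader (an illustrative sketch using the stdlib and numpy, not ffp's implementation):

```python
import io

import numpy as np

# Write: a "rows cols" header line, then per word the utf-8 token,
# one space, and `cols` little-endian float32 components.
words = ["hello", "world"]
matrix = np.random.rand(2, 4).astype("<f4")

buf = io.BytesIO()
buf.write(f"{len(words)} {matrix.shape[1]}\n".encode("utf-8"))
for word, row in zip(words, matrix):
    buf.write(word.encode("utf-8") + b" " + row.tobytes())

# Read it back: scan each token up to the space, then the raw floats.
buf.seek(0)
rows, cols = map(int, buf.readline().split())
read_words, read_rows = [], []
for _ in range(rows):
    token = bytearray()
    while (ch := buf.read(1)) != b" ":
        token.extend(ch)
    read_words.append(token.decode("utf-8"))
    read_rows.append(np.frombuffer(buf.read(4 * cols), dtype="<f4"))
```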

ffp.embeddings.load_textdims(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]

Read embeddings in textdims format.

The first line contains whitespace-separated rows and cols; each of the remaining lines contains a word followed by its whitespace-separated vector components.

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in textdims format.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings
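A minimal parser for this layout (an illustrative sketch over an in-memory string, not ffp's implementation):

```python
import io

import numpy as np

# Toy textdims input: "rows cols" header, then word + components per line.
text = "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"

lines = io.StringIO(text)
rows, cols = map(int, lines.readline().split())
words, vectors = [], np.empty((rows, cols), dtype=np.float32)
for i, line in enumerate(lines):
    parts = line.split()
    words.append(parts[0])
    vectors[i] = np.array(parts[1:], dtype=np.float32)
```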

ffp.embeddings.load_text(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]

Read embeddings in text format.

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.

Returns

embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.

Return type

Embeddings

ffp.embeddings.load_fastText(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]

Read embeddings from a file in fastText format.

Parameters

file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.

Returns

embeddings – The embeddings from the input file.

Return type

Embeddings