Finalfusion in Python¶
ffp is a Python package to interface with finalfusion embeddings. ffp supports all common embedding formats, including finalfusion, fastText, word2vec binary, text and textdims.
ffp integrates nicely with numpy since its ffp.storage.Storage types can be treated as ndarrays.
The finalfusion format revolves around ffp.io.Chunks, which are specified in the finalfusion spec. Each component class in ffp implements the ffp.io.Chunk interface, which specifies serialization and deserialization. Any unique combination of chunks can make up ffp.Embeddings.
Quickstart¶
You can install ffp through:
pip install ffp
And use embeddings by:
import ffp
# load finalfusion embeddings
embeddings = ffp.load_finalfusion("/path/to/embeddings.fifu")
# embedding lookup
embedding = embeddings["Test"]
# embedding lookup with default value
embedding = embeddings.embedding("Test", default=0)
# access the storage and calculate its dot product with an embedding
similarities = embeddings.storage.dot(embedding)
# print 10 first vocab items
print(embeddings.vocab.words[:10])
# print embeddings metadata
print(embeddings.metadata)
ffp exports the most commonly used functions and types at the top level. See Top-Level Exports for an overview.
These re-exports are also available in their respective sub-packages and modules. The full API documentation can be found here.
Install¶
ffp can be installed from GitHub via:
$ pip install git+https://github.com/sebpuetz/ffp
or directly from PyPI:
$ pip install ffp
When building from source, ffp requires Cython.
Top-level Exports¶
ffp re-exports some common types at the top level. These types cover the typical use cases.
Embeddings¶
Embeddings | Embeddings class.
load_finalfusion | Read embeddings from a file in finalfusion format.
load_fastText | Read embeddings from a file in fastText format.
load_text | Read embeddings in text format.
load_textdims | Read embeddings in textdims format.
load_word2vec | Read embeddings in word2vec binary format.
Metadata¶
Embeddings metadata |
|
Load a Metadata chunk from the given file. |
Norms¶
Embedding Norms. |
|
|
Load a Norms chunk from the given file. |
Storage¶
Common interface to finalfusion storage types. |
|
|
Load any storage from a finalfusion file. |
Vocab¶
|
Finalfusion vocabulary interface. |
|
Load a vocabulary from a finalfusion file. |
API¶
Embeddings¶
Finalfusion Embeddings
-
class
ffp.embeddings.
Embeddings
(storage: ffp.storage.storage.Storage, vocab: ffp.vocab.vocab.Vocab, norms: Optional[ffp.norms.Norms] = None, metadata: Optional[ffp.metadata.Metadata] = None)[source]¶ Bases:
object
Embeddings class.
Embeddings always contain a Storage and Vocab. Optional chunks are Norms corresponding to the embeddings of the in-vocab tokens and Metadata.
Embeddings can be retrieved through three methods:
Embeddings.embedding() accepts a default value and returns this value if no embedding could be found.
Embeddings.__getitem__() retrieves an embedding for the query but raises an exception if it cannot retrieve an embedding.
Embeddings.embedding_with_norm() requires a Norms chunk and returns an embedding together with the corresponding L2 norm.
Embeddings are composed of the 4 chunk types:
Storage: either NdArray or QuantizedArray (required)
Vocab: one of SimpleVocab, FinalfusionBucketVocab, FastTextVocab and ExplicitVocab (required)
Norms (optional)
Metadata (optional)
Examples
>>> import numpy as np
>>> from ffp.embeddings import Embeddings
>>> from ffp.metadata import Metadata
>>> from ffp.norms import Norms
>>> from ffp.storage.ndarray import NdArray
>>> from ffp.vocab.simple_vocab import SimpleVocab
>>> storage = NdArray(np.float32(np.random.rand(2, 10)))
>>> vocab = SimpleVocab(["Some", "words"])
>>> metadata = Metadata({"Some": "value", "numerical": 0})
>>> norms = Norms(np.float32(np.random.rand(2)))
>>> embeddings = Embeddings(storage=storage, vocab=vocab, metadata=metadata, norms=norms)
>>> embeddings.vocab.words
['Some', 'words']
>>> np.allclose(embeddings["Some"], storage[0])
True
>>> try:
...     embeddings["oov"]
... except KeyError:
...     True
True
>>> _, n = embeddings.embedding_with_norm("Some")
>>> np.isclose(n, norms[0])
True
>>> embeddings.metadata
{'Some': 'value', 'numerical': 0}
-
__init__
(storage: ffp.storage.storage.Storage, vocab: ffp.vocab.vocab.Vocab, norms: Optional[ffp.norms.Norms] = None, metadata: Optional[ffp.metadata.Metadata] = None)[source]¶ Initialize Embeddings.
Initializes Embeddings with the given chunks.
- Conditions
The following conditions need to hold if the respective chunks are passed.
Chunks need to have the expected type.
vocab.idx_bound == storage.shape[0]
len(vocab) == len(norms)
len(norms) == len(vocab) and len(norms) <= storage.shape[0]
- Parameters
storage (Storage) – Embeddings Storage.
vocab (Vocab) – Embeddings Vocabulary.
norms (Norms, optional) – Embeddings Norms.
metadata (Metadata, optional) – Embeddings Metadata.
- Raises
AssertionError – If any of the conditions don’t hold.
-
__getitem__
(item: str) → numpy.ndarray[source]¶ Returns an embedding.
- Parameters
item (str) – The query item.
- Returns
embedding – The embedding.
- Return type
- Raises
KeyError – If no embedding could be retrieved.
See also
-
embedding
(word: str, out: Optional[numpy.ndarray] = None, default: Optional[numpy.ndarray] = None) → Optional[numpy.ndarray][source]¶ Embedding lookup.
Looks up the embedding for the input word.
If an out array is specified, the embedding is written into the array.
If it is not possible to retrieve an embedding for the input word, the default value is returned. This defaults to None. An embedding can not be retrieved if the vocabulary cannot provide an index for word.
This method never fails. If you do not provide a default value, check the return value for None.
out is left untouched if no embedding can be found and default is None.
- Parameters
word (str) – The query word.
out (numpy.ndarray, optional) – Optional output array to write the embedding into.
default (numpy.ndarray, optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
embedding – The retrieved embedding or the default value.
- Return type
numpy.ndarray, optional
Examples
>>> import numpy as np
>>> from ffp.embeddings import Embeddings
>>> from ffp.storage.ndarray import NdArray
>>> from ffp.vocab.simple_vocab import SimpleVocab
>>> matrix = np.float32(np.random.rand(2, 10))
>>> storage = NdArray(matrix)
>>> vocab = SimpleVocab(["Some", "words"])
>>> embeddings = Embeddings(storage=storage, vocab=vocab)
>>> np.allclose(embeddings.embedding("Some"), matrix[0])
True
>>> # default value is None
>>> embeddings.embedding("oov") is None
True
>>> # It's possible to specify a default value
>>> default = embeddings.embedding("oov", default=storage[0])
>>> np.allclose(default, storage[0])
True
>>> # Embeddings can be written to an output buffer.
>>> out = np.zeros(10, dtype=np.float32)
>>> out2 = embeddings.embedding("Some", out=out)
>>> out is out2
True
>>> np.allclose(out, matrix[0])
True
See also
-
embedding_with_norm
(word: str, out: Optional[numpy.ndarray] = None, default: Optional[Tuple[numpy.ndarray, float]] = None) → Optional[Tuple[numpy.ndarray, float]][source]¶ Embedding lookup with norm.
Looks up the embedding for the input word together with its norm.
If an out array is specified, the embedding is written into the array.
If it is not possible to retrieve an embedding for the input word, the default value is returned. This defaults to None. An embedding can not be retrieved if the vocabulary cannot provide an index for word.
This method raises a TypeError if norms are not set.
- Parameters
word (str) – The query word.
out (numpy.ndarray, optional) – Optional output array to write the embedding into.
default (Tuple[numpy.ndarray, float], optional) – Optional default value to return if no embedding can be retrieved. Defaults to None.
- Returns
(embedding, norm) – Tuple with the retrieved embedding or the default value at the first index and the norm at the second index.
- Return type
EmbeddingWithNorm, optional
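The relationship between an embedding and its L2 norm can be sketched in plain Python. This is only an illustration, assuming the stored vectors are unit-normalized and the norms keep the original lengths; `split_norm` and `restore` are hypothetical helpers, not part of ffp.

```python
import math


def split_norm(vector):
    """Return (unit_vector, l2_norm) for a non-zero vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector], norm


def restore(unit_vector, norm):
    """Scale a unit vector back to its original length."""
    return [x * norm for x in unit_vector]


# Hypothetical example vector: split it, then scale it back.
unit, norm = split_norm([3.0, 4.0])
original = restore(unit, norm)
```

Keeping the norm next to the normalized vector is what makes it possible to recover the unnormalized embedding when it is needed.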
See also
-
property
norms
¶ The Norms.
- Getter
Returns None or the Norms.
- Setter
Set the Norms.
- Returns
norms – The Norms or None.
- Return type
Norms, optional
- Raises
AssertionError – if embeddings.storage.shape[0] < len(embeddings.norms) or len(embeddings.norms) != len(embeddings.vocab)
TypeError – If norms is neither Norms nor None.
-
property
metadata
¶ The Metadata.
-
bucket_to_explicit
() → ffp.embeddings.Embeddings[source]¶ Convert bucket embeddings to embeddings with explicit lookup.
Multiple embeddings can still map to the same bucket, but all buckets that are not indexed by in-vocabulary n-grams are eliminated. This can have a big impact on the size of the embedding matrix.
A side effect of this method is the conversion from a quantized storage to an array storage.
- Returns
embeddings – Embeddings with an ExplicitVocab instead of a hash-based vocabulary.
- Return type
- Raises
TypeError – If the current vocabulary is not a hash-based vocabulary (FinalfusionBucketVocab or FastTextVocab)
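The idea behind the conversion can be sketched with plain dictionaries and lists. This is a simplified illustration and not ffp's implementation: `compact_buckets` is a hypothetical helper, and numbering the surviving buckets in ascending order is an assumption.

```python
def compact_buckets(ngram_to_bucket, bucket_rows):
    """Drop unused buckets and renumber the remaining ones contiguously.

    ngram_to_bucket: dict mapping in-vocabulary n-grams to bucket indices.
    bucket_rows: list of embedding rows, one per bucket.
    """
    # Only buckets that some in-vocabulary n-gram hashes to are kept.
    used = sorted(set(ngram_to_bucket.values()))
    remap = {old: new for new, old in enumerate(used)}
    # N-grams that shared a bucket still share a row after the conversion.
    explicit_index = {ng: remap[b] for ng, b in ngram_to_bucket.items()}
    rows = [bucket_rows[b] for b in used]
    return explicit_index, rows


# Hypothetical data: 8 buckets, only buckets 2 and 5 are in use.
ngram_to_bucket = {"<ab": 5, "ab>": 2, "<abc": 5}
bucket_rows = [[float(i)] for i in range(8)]
explicit_index, rows = compact_buckets(ngram_to_bucket, bucket_rows)
```

In this toy case the matrix shrinks from 8 rows to 2, which is the size effect described above.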
-
chunks
() → List[ffp.io.Chunk][source]¶ Get the Embeddings Chunks as a list.
The Chunks are ordered in the expected serialization order:
1. Metadata
2. Vocabulary
3. Storage
4. Norms
- Returns
chunks – List of embeddings chunks.
- Return type
List[Chunk]
-
ffp.embeddings.
load_finalfusion
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → ffp.embeddings.Embeddings[source]¶ Read embeddings from a file in finalfusion format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in finalfusion format.
mmap (bool) – Toggles memory mapping the storage buffer.
- Returns
embeddings – The embeddings from the input file.
- Return type
-
ffp.embeddings.
load_word2vec
(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]¶ Read embeddings in word2vec binary format.
Files are expected to start with a line containing rows and cols in utf-8. Words are encoded in utf-8 followed by a single whitespace. After the whitespace the embedding components are expected as little-endian float32.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in word2vec binary format.
- Returns
embeddings – The embeddings from the input file.
- Return type
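The layout described above can be illustrated with a small standard-library reader. This is a sketch of the format only, not how ffp parses files; `read_word2vec_binary` and `write_word2vec_binary` are hypothetical helpers, and the newline written after each vector is an assumption about common writers.

```python
import io
import struct


def write_word2vec_binary(words, vectors):
    """Serialize words and vectors into word2vec binary bytes."""
    out = io.BytesIO()
    out.write(f"{len(words)} {len(vectors[0])}\n".encode("utf-8"))
    for word, vector in zip(words, vectors):
        out.write(word.encode("utf-8") + b" ")
        out.write(struct.pack(f"<{len(vector)}f", *vector))
        out.write(b"\n")
    return out.getvalue()


def read_word2vec_binary(stream):
    """Parse word2vec binary data into (words, vectors)."""
    rows, cols = (int(x) for x in stream.readline().decode("utf-8").split())
    words, vectors = [], []
    for _ in range(rows):
        # Read the word byte by byte up to the separating space.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            word.extend(ch)
        # Drop the newline that may follow the previous vector.
        words.append(word.decode("utf-8").lstrip("\n"))
        raw = stream.read(4 * cols)
        if len(raw) != 4 * cols:
            raise EOFError("unexpected end of data while reading a vector")
        # Components are little-endian float32.
        vectors.append(list(struct.unpack(f"<{cols}f", raw)))
    return words, vectors


data = write_word2vec_binary(["Some", "words"], [[1.0, 2.0, 0.5], [0.25, -1.0, 4.0]])
parsed_words, parsed_vectors = read_word2vec_binary(io.BytesIO(data))
```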
-
ffp.embeddings.
load_textdims
(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]¶ Read embeddings in textdims format.
The first line contains whitespace separated rows and cols, the rest of the file contains whitespace separated word and vector components.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in textdims format.
- Returns
embeddings – The embeddings from the input file.
- Return type
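A textdims file as described above can be parsed in a few lines. The following is a minimal sketch that assumes words contain no whitespace; `parse_textdims` is a hypothetical helper and not the ffp reader.

```python
def parse_textdims(text):
    """Parse textdims content into (words, vectors)."""
    lines = text.strip().splitlines()
    # The first line holds the matrix shape: rows and cols.
    rows, cols = (int(x) for x in lines[0].split())
    words, vectors = [], []
    for line in lines[1:]:
        parts = line.split()
        vector = [float(x) for x in parts[1:]]
        if len(vector) != cols:
            raise ValueError(f"expected {cols} components, got {len(vector)}")
        words.append(parts[0])
        vectors.append(vector)
    if len(words) != rows:
        raise ValueError(f"expected {rows} rows, got {len(words)}")
    return words, vectors


sample = "2 3\nSome 1.0 2.0 0.5\nwords 0.25 -1.0 4.0\n"
td_words, td_vectors = parse_textdims(sample)
```

The plain text format differs only in that it lacks the leading shape line, so a reader has to infer the shape from the data.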
-
ffp.embeddings.
load_text
(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]¶ Read embeddings in text format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in text format.
- Returns
embeddings – Embeddings from the input file. The resulting Embeddings will have a SimpleVocab, NdArray and Norms.
- Return type
-
ffp.embeddings.
load_fastText
(file: Union[str, bytes, int, os.PathLike]) → ffp.embeddings.Embeddings[source]¶ Read embeddings from a file in fastText format.
- Parameters
file (str, bytes, int, PathLike) – Path to a file with embeddings in fastText format.
- Returns
embeddings – The embeddings from the input file.
- Return type
Storage¶
ffp.storage
|
Load any storage from a finalfusion file. |
|
Load an array chunk from the given file. |
|
Array storage. |
Load a quantized array chunk from the given file. |
|
QuantizedArray storage. |
|
|
Product Quantizer |
NdArray¶
-
class
ffp.storage.ndarray.
NdArray
(array: numpy.ndarray)[source]¶ Bases:
numpy.ndarray
,ffp.io.Chunk
,ffp.storage.storage.Storage
Array storage.
Essentially a numpy matrix, either in-memory or memory-mapped.
Examples
>>> import numpy as np
>>> from ffp.storage.ndarray import NdArray
>>> matrix = np.float32(np.random.rand(10, 50))
>>> ndarray_storage = NdArray(matrix)
>>> np.allclose(matrix, ndarray_storage)
True
>>> ndarray_storage.shape
(10, 50)
-
static
__new__
(cls, array: numpy.ndarray)[source]¶ Construct a new NdArray storage.
- Parameters
array (numpy.ndarray) – The storage buffer.
- Raises
TypeError – If the array is not a 2-dimensional float32 array.
-
property
shape
¶ The storage shape
-
classmethod
load
(file: BinaryIO, mmap=False) → ffp.storage.ndarray.NdArray[source]¶ Load Storage from the given finalfusion file.
- Parameters
file (BinaryIO) – File at the beginning of a finalfusion storage
mmap (bool) – Toggles memory mapping the buffer.
- Returns
storage – The storage from the file.
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.storage.ndarray.NdArray[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
mmap_storage
(file: BinaryIO) → ffp.storage.ndarray.NdArray[source]¶ Memory map the storage.
Parallel method to ffp.io.Chunk.read_chunk(). Instead of storing the Storage in-memory, it memory maps the embeddings.
- Parameters
file (BinaryIO) – File at the beginning of a finalfusion storage
- Returns
storage – The memory mapped storage.
- Return type
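What memory mapping buys can be shown with the standard library alone. The sketch below assumes a hypothetical file that contains nothing but a row-major little-endian float32 matrix, which is not the finalfusion chunk layout; `read_row` is an illustrative helper.

```python
import mmap
import os
import struct
import tempfile


def read_row(path, row, cols, offset=0):
    """Read one float32 row through a memory map instead of loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this row need to be touched.
            start = offset + 4 * cols * row
            return list(struct.unpack_from(f"<{cols}f", mm, start))


# Write a small 2x3 matrix to a temporary file and read back its second row.
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.bin")
    with open(path, "wb") as f:
        for values in matrix:
            f.write(struct.pack("<3f", *values))
    second_row = read_row(path, row=1, cols=3)
```

With a large embedding matrix this access pattern avoids reading the whole buffer up front, at the cost of page faults on first access.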
-
static
-
ffp.storage.ndarray.
load_ndarray
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → ffp.storage.ndarray.NdArray[source]¶ Load an array chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a ndarray chunk.
mmap (bool) – Toggles memory mapping the array buffer as read only.
- Returns
storage – The NdArray storage from the file.
- Return type
- Raises
ValueError – If the file did not contain an NdArray chunk.
Quantized¶
-
class
ffp.storage.quantized.
QuantizedArray
(pq: ffp.storage.quantized.PQ, quantized_embeddings: numpy.ndarray, norms: Optional[numpy.ndarray])[source]¶ Bases:
ffp.io.Chunk
,ffp.storage.storage.Storage
QuantizedArray storage.
QuantizedArrays support slicing, indexing with integers, lists of integers and arbitrary dimensional integer arrays. Slicing a QuantizedArray returns a new QuantizedArray but does not copy any buffers.
QuantizedArrays offer two ways of indexing:
QuantizedArray.__getitem__():
passing a slice returns a new view of the QuantizedArray.
passing an integer returns a single embedding, lists and arrays return ndims + 1 dimensional embeddings.
QuantizedArray.embedding():
embeddings can be written to an output buffer.
passing a slice returns a matrix holding reconstructed embeddings.
otherwise, this method behaves like __getitem__()
A QuantizedArray can be treated as numpy.ndarray through numpy.asarray(). This restores the original matrix and copies into a new buffer.
Using common numpy functions on a QuantizedArray will produce a regular ndarray in the process and is therefore an expensive operation.
-
__init__
(pq: ffp.storage.quantized.PQ, quantized_embeddings: numpy.ndarray, norms: Optional[numpy.ndarray])[source]¶ Initialize a QuantizedArray.
- Parameters
pq (PQ) – A product quantizer
quantized_embeddings (numpy.ndarray) – The quantized embeddings
norms (numpy.ndarray, optional) – Optional norms corresponding to the quantized embeddings. Reconstructed embeddings are scaled by their norm.
-
property
shape
¶ The storage shape
-
embedding
(key, out: numpy.ndarray = None)[source]¶ Get embeddings.
If key is an integer, a single reconstructed embedding is returned.
If key is a list of integers or a slice, a matrix of reconstructed embeddings is returned.
If key is an n-dimensional array, a tensor with reconstructed embeddings is returned. This tensor has one new axis in the last dimension containing the embeddings.
If out is passed, the reconstruction is written to this buffer. out.shape needs to match the dimensions described above.
- Parameters
key (int, list, numpy.ndarray, slice) – Key specifying which embeddings to retrieve.
out (numpy.ndarray) – Array to reconstruct the embeddings into.
- Returns
reconstruction – The reconstructed embedding or embeddings.
- Return type
-
property
quantized_len
¶ Length of the quantized embeddings.
- Returns
quantized_len – Length of quantized embeddings.
- Return type
-
classmethod
load
(file: BinaryIO, mmap=False) → ffp.storage.quantized.QuantizedArray[source]¶ Load Storage from the given finalfusion file.
- Parameters
file (BinaryIO) – File at the beginning of a finalfusion storage
mmap (bool) – Toggles memory mapping the buffer.
- Returns
storage – The storage from the file.
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.storage.quantized.QuantizedArray[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
mmap_storage
(file: BinaryIO) → ffp.storage.quantized.QuantizedArray[source]¶ Memory map the storage.
Parallel method to ffp.io.Chunk.read_chunk(). Instead of storing the Storage in-memory, it memory maps the embeddings.
- Parameters
file (BinaryIO) – File at the beginning of a finalfusion storage
- Returns
storage – The memory mapped storage.
- Return type
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
static
chunk_identifier
() → ffp.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
class
ffp.storage.quantized.
PQ
(quantizers: numpy.ndarray, projection: Optional[numpy.ndarray])[source]¶ Product Quantizer
Product Quantizers are vector quantizers which decompose high dimensional vector spaces into subspaces. Each of these subspaces is a slice of the original vector space. Embeddings are quantized by assigning their ith slice to the closest centroid.
Product Quantizers can reconstruct vectors by concatenating the slices of the quantized vector.
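Quantization and reconstruction as described above can be sketched with nested lists. This toy version leaves out the optional projection matrix and the norms, and `quantize` and `reconstruct` are hypothetical helpers rather than the PQ methods.

```python
def quantize(vector, quantizers):
    """Map each slice of the vector to the index of its closest centroid.

    quantizers is a nested list with shape
    [n_subquantizers][n_centroids][slice_len].
    """
    slice_len = len(quantizers[0][0])
    codes = []
    for i, centroids in enumerate(quantizers):
        piece = vector[i * slice_len:(i + 1) * slice_len]
        # Squared Euclidean distance from this slice to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(piece, c)) for c in centroids]
        codes.append(dists.index(min(dists)))
    return codes


def reconstruct(codes, quantizers):
    """Concatenate the centroids selected by the codes."""
    out = []
    for code, centroids in zip(codes, quantizers):
        out.extend(centroids[code])
    return out


# Two subquantizers with two centroids each, covering a 4-dimensional space.
quantizers = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = quantize([0.9, 1.2, 0.1, 0.8], quantizers)
approximation = reconstruct(codes, quantizers)
```

The reconstruction is an approximation of the input: each slice is replaced by its nearest centroid, so only one small integer per slice has to be stored.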
-
__init__
(quantizers: numpy.ndarray, projection: Optional[numpy.ndarray])[source]¶ Initializes a Product Quantizer.
- Parameters
quantizers (np.ndarray) – 3-d ndarray with dtype uint8
projection (np.ndarray, optional) – Projection matrix, must be a square matrix with shape [reconstructed_len, reconstructed_len]
- Raises
AssertionError – If the projection shape does not match the reconstructed_len
-
property
n_centroids
¶ Number of centroids per quantizer.
- Returns
n_centroids – The number of centroids per quantizer.
- Return type
-
property
projection
¶ Projection matrix.
- Returns
projection – Projection Matrix (2-d numpy array with datatype float32) or None.
- Return type
np.ndarray, optional
-
property
reconstructed_len
¶ Reconstructed length.
- Returns
reconstructed_len – Length of the reconstructed vectors.
- Return type
-
property
subquantizers
¶ Get the quantizers.
Returns a 3-d array with shape quantizers * n_centroids * reconstructed_len / quantizers
- Returns
quantizers (np.ndarray) – 3-d np.ndarray with dtype=np.uint8
-
reconstruct
(quantized: numpy.ndarray, out: numpy.ndarray = None) → numpy.ndarray[source]¶ Reconstruct vectors.
Input
- Parameters
quantized (np.ndarray) – Batch of quantized vectors. 2-d np.ndarray with integers required.
out (np.ndarray, optional) – 2-d np.ndarray to write the output into.
- Returns
out – Batch of reconstructed vectors.
- Return type
np.ndarray
- Raises
AssertionError – If out is passed and its last dimension does not match reconstructed_len or its first n-1 dimensions do not match the first n-1 dimensions of quantized.
-
-
ffp.storage.quantized.
load_quantized_array
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → ffp.storage.quantized.QuantizedArray[source]¶ Load a quantized array chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a quantized array chunk.
mmap (bool) – Toggles memory mapping the array buffer as read only.
- Returns
storage – The QuantizedArray storage from the file.
- Return type
- Raises
ValueError – If the file did not contain a QuantizedArray chunk.
Storage Interface¶
-
class
ffp.storage.
Storage
[source]¶ Common interface to finalfusion storage types.
-
abstract property
shape
¶ The storage shape
-
abstract classmethod
load
(file: BinaryIO, mmap=False) → ffp.storage.storage.Storage[source]¶ Load Storage from the given finalfusion file.
- Parameters
file (BinaryIO) – File at the beginning of a finalfusion storage
mmap (bool) – Toggles memory mapping the buffer.
- Returns
storage – The storage from the file.
- Return type
-
abstract static
mmap_storage
(file: BinaryIO) → ffp.storage.storage.Storage[source]¶ Memory map the storage.
Parallel method to ffp.io.Chunk.read_chunk(). Instead of storing the Storage in-memory, it memory maps the embeddings.
- Parameters
file (BinaryIO) – File at the beginning of a finalfusion storage
- Returns
storage – The memory mapped storage.
- Return type
-
abstract property
-
ffp.storage.
load_storage
(file: Union[str, bytes, int, os.PathLike], mmap: bool = False) → ffp.storage.storage.Storage[source]¶ Load any storage from a finalfusion file.
Loads the first known storage from a finalfusion file.
- Parameters
file (str) – Path to file containing a finalfusion storage chunk.
mmap (bool) – Toggles memory mapping the storage buffer as read-only.
- Returns
storage – First storage in the file.
- Return type
Union[ffp.storage.NdArray, ffp.storage.QuantizedArray]
- Raises
ValueError – If the file did not contain a storage.
Vocabularies¶
ffp.vocab
|
Load a vocabulary from a finalfusion file. |
Load a FinalfusionBucketVocab from the given finalfusion file. |
|
Load a FastTextVocab from the given finalfusion file. |
|
Load an ExplicitVocab from the given finalfusion file. |
|
Load a SimpleVocab from the given finalfusion file. |
|
Finalfusion vocabulary interface. |
|
|
Simple vocabulary. |
Interface for vocabularies with subword lookups. |
|
Finalfusion Bucket Vocabulary. |
|
|
FastText vocabulary |
|
A vocabulary with explicitly stored n-grams. |
|
Frequency Cutoff |
SimpleVocab¶
-
class
ffp.vocab.simple_vocab.
SimpleVocab
(words: List[str], index: Optional[Dict[str, int]] = None)[source]¶ Bases:
ffp.io.Chunk
,ffp.vocab.vocab.Vocab
Simple vocabulary.
SimpleVocabs provide a simple string to index mapping and index to string mapping. SimpleVocab is also the base type of other vocabulary types.
-
__init__
(words: List[str], index: Optional[Dict[str, int]] = None)[source]¶ Initialize a SimpleVocab.
Initializes the vocabulary with the given words and optional index. If no index is given, the nth word in the words list is assigned index n. The word list cannot contain duplicate entries and it needs to be of same length as the index.
- Parameters
words (List[str]) – List of unique words
index (Optional[Dict[str, int]]) – Dictionary providing an entry -> index mapping.
- Raises
ValueError – if the length of index and word doesn’t match.
-
static
from_corpus
(file: Union[str, bytes, int, os.PathLike], cutoff: ffp.vocab.cutoff.Cutoff = Cutoff(30, 'min_freq'))[source]¶ Construct a simple vocabulary from the given corpus.
- Parameters
file (str, bytes, int, PathLike) – Path to corpus file
cutoff (Cutoff) – Frequency cutoff or target size to restrict vocabulary size.
- Returns
(vocab, counts) – Tuple containing the Vocabulary as first item and counts of in-vocabulary items as the second item.
- Return type
Tuple[SimpleVocab, List[int]]
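How a frequency cutoff restricts a vocabulary can be sketched with collections.Counter. This is an illustration under assumptions: the mode name "target_size", the tie-breaking and the frequency ordering are hypothetical, and `vocab_from_tokens` is not an ffp function.

```python
from collections import Counter


def vocab_from_tokens(tokens, cutoff=30, mode="min_freq"):
    """Return (words, counts) for the tokens that survive the cutoff."""
    counts = Counter(tokens)
    # Most frequent first, ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if mode == "min_freq":
        # Keep every token seen at least `cutoff` times.
        kept = [(w, c) for w, c in ranked if c >= cutoff]
    elif mode == "target_size":
        # Keep at most `cutoff` tokens.
        kept = ranked[:cutoff]
    else:
        raise ValueError(f"unknown cutoff mode: {mode}")
    return [w for w, _ in kept], [c for _, c in kept]


tokens = "a b a c a b".split()
freq_words, freq_counts = vocab_from_tokens(tokens, cutoff=2, mode="min_freq")
size_words, size_counts = vocab_from_tokens(tokens, cutoff=1, mode="target_size")
```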
-
property
word_index
¶ Get the index of known words
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
idx_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.vocab.simple_vocab.SimpleVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
static
chunk_identifier
()[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
idx
(item, default=None)[source]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index – int if there is a single index for a known item, list of indices if the vocab can provide subword indices for an unknown item. The default item if the vocab can’t provide indices.
- Return type
-
-
ffp.vocab.simple_vocab.
load_simple_vocab
(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.simple_vocab.SimpleVocab[source]¶ Load a SimpleVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a SimpleVocab chunk.
- Returns
vocab – Returns the first SimpleVocab in the file.
- Return type
FinalfusionBucketVocab¶
-
class
ffp.vocab.subword.
FinalfusionBucketVocab
(words: List[str], indexer: ffp.subwords.hash_indexers.FinalfusionHashIndexer = None, index: Optional[Dict[str, int]] = None)[source]¶ Bases:
ffp.io.Chunk
,ffp.vocab.subword.SubwordVocab
Finalfusion Bucket Vocabulary.
-
__init__
(words: List[str], indexer: ffp.subwords.hash_indexers.FinalfusionHashIndexer = None, index: Optional[Dict[str, int]] = None)[source]¶ Initialize a FinalfusionBucketVocab.
Initializes the vocabulary with the given words and optional index and indexer.
If no indexer is passed, a FinalfusionHashIndexer with bucket exponent 21 is used.
If no index is given, the nth word in the words list is assigned index n. The word list cannot contain duplicate entries and it needs to be of same length as the index.
- Parameters
words (List[str]) – List of unique words
indexer (FinalfusionHashIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2^21 buckets with range 3-6.
index (Dict[str, int], optional) – Dictionary providing an entry -> index mapping.
- Raises
ValueError – if the length of index and word doesn’t match.
AssertionError – If the indexer is not a FinalfusionHashIndexer.
-
static
from_corpus
(file: Union[str, bytes, int, os.PathLike], cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, indexer: Optional[ffp.subwords.hash_indexers.FinalfusionHashIndexer] = None) → Tuple[ffp.vocab.subword.FinalfusionBucketVocab, List[int]][source]¶ Build a Finalfusion Bucket Vocabulary from a corpus.
- Parameters
file (str, bytes, int, PathLike) – File with white-space separated tokens.
cutoff (Cutoff) – Frequency cutoff or target size to restrict vocabulary size. Defaults to minimum frequency cutoff of 30.
indexer (FinalfusionHashIndexer) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2^21 buckets with range 3-6.
- Returns
(vocab, counts) – Tuple containing the Vocabulary as first item and counts of in-vocabulary items as the second item.
- Return type
Tuple[FinalfusionBucketVocab, List[int]]
- Raises
AssertionError – If the indexer is not a FinalfusionHashIndexer.
-
to_explicit
() → ffp.vocab.subword.ExplicitVocab[source]¶ Returns a Vocabulary with explicit storage built from this vocab.
- Returns
explicit_vocab – The converted vocabulary.
- Return type
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
word_index
¶ Get the index of known words
-
static
read_chunk
(file: BinaryIO) → ffp.vocab.subword.FinalfusionBucketVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
chunk_identifier
()[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
__getitem__
(item: str) → Union[int, List[int]]¶ Lookup the query item.
This method raises an exception if the vocab can’t provide indices.
- Parameters
item (str) – The query item
- Raises
KeyError – If no indices can be provided.
-
idx
(item: str, default=None) → Union[List[int], int, None]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index – int if there is a single index for a known item, list of indices if the vocab can provide subword indices for an unknown item. The default item if the vocab can’t provide indices.
- Return type
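The three possible outcomes of this lookup can be sketched as a plain function. This is a simplified stand-in, not the ffp method: `lookup` and its `subword_indices` callback are hypothetical.

```python
def lookup(item, word_index, subword_indices=None, default=None):
    """Return an int for known items, a list for subword hits, else default."""
    if item in word_index:
        # Known word: a single index.
        return word_index[item]
    if subword_indices is not None:
        # Unknown word: fall back to the indices of its n-grams, if any.
        indices = subword_indices(item)
        if indices:
            return indices
    # Nothing could be produced: hand back the fall-back value.
    return default


word_index = {"Some": 0, "words": 1}
known = lookup("Some", word_index)
missing = lookup("oov", word_index)
fallback = lookup("oov", word_index, default=-1)
# A toy subword callback that always yields one index past the word range.
subword_hit = lookup("oov", word_index, subword_indices=lambda w: [2 + len(w)])
```

Callers therefore need to handle both an int and a list as a successful result.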
-
property
idx_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
subword_indices
(item: str, bracket: bool = True) → List[int]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
- Parameters
item (str) – The query item.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
indices – The list of subword indices.
- Return type
List[int]
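Hash-based subword indexing can be sketched as follows. The example assumes that bucket indices are offset by the number of known words and uses FNV-1a as a stand-in hash; neither detail is confirmed by this page, and `bucket_indices` is a hypothetical helper.

```python
def fnv1a_64(data):
    """64-bit FNV-1a hash of a bytes object."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) % 2**64
    return h


def bucket_indices(ngrams, n_words, buckets_exp=21):
    """Map n-grams to indices in [n_words, n_words + 2**buckets_exp)."""
    n_buckets = 2**buckets_exp
    # Word indices occupy [0, n_words), so buckets start right after them.
    return [n_words + fnv1a_64(ng.encode("utf-8")) % n_buckets for ng in ngrams]


indices = bucket_indices(["<hi", "hi>", "<hi>"], n_words=2)
```

Because the number of buckets is fixed, distinct n-grams can collide and share an embedding row.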
-
subwords
(item: str, bracket: bool = True) → List[str]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
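N-gram extraction with bracketing can be sketched in a few lines. The order in which ffp returns the n-grams is not specified here, so the ordering below (shortest first, left to right) is an assumption, and `ngrams` is a hypothetical helper.

```python
def ngrams(item, min_n=3, max_n=6, bracket=True):
    """Return all character n-grams of item with min_n <= n <= max_n."""
    if bracket:
        # Mark the word boundaries so prefixes and suffixes stay distinct.
        item = f"<{item}>"
    return [
        item[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(item) - n + 1)
    ]


bracketed = ngrams("hi", min_n=3, max_n=4)
plain = ngrams("abcd", min_n=3, max_n=3, bracket=False)
```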
-
-
ffp.vocab.subword.
load_finalfusion_bucket_vocab
(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.FinalfusionBucketVocab[source]¶ Load a FinalfusionBucketVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FinalfusionBucketVocab chunk.
- Returns
vocab – Returns the first FinalfusionBucketVocab in the file.
- Return type
ExplicitVocab¶
-
class
ffp.vocab.subword.
ExplicitVocab
(words: List[str], indexer: ffp.subwords.explicit_indexer.ExplicitIndexer, index: Dict[str, int] = None)[source]¶ Bases:
ffp.io.Chunk
,ffp.vocab.subword.SubwordVocab
A vocabulary with explicitly stored n-grams.
-
__init__
(words: List[str], indexer: ffp.subwords.explicit_indexer.ExplicitIndexer, index: Dict[str, int] = None)[source]¶ Initialize an ExplicitVocab.
Initializes the vocabulary with the given words, subword indexer and an optional word index.
If no index is given, the nth word in the words list is assigned index n. The word list cannot contain duplicate entries and it needs to be of same length as the index.
- Parameters
words (List[str]) – List of unique words
indexer (ExplicitIndexer) – Subword indexer to use for the vocabulary.
index (Dict[str, int], optional) – Dictionary providing a word -> index mapping.
- Raises
ValueError – if the length of index and word doesn’t match.
AssertionError – If the indexer is not an ExplicitIndexer.
See also
-
static
from_corpus
(file: Union[str, bytes, int, os.PathLike], ngram_range=(3, 6), token_cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, ngram_cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None)[source]¶ Build an ExplicitVocab from a corpus.
- Parameters
file (str, bytes, int, PathLike) – File with white-space separated tokens.
ngram_range (Tuple[int, int]) – Specifies the n-gram range for the indexer.
token_cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict token vocabulary size. Defaults to minimum frequency cutoff of 30.
ngram_cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict ngram vocabulary size. Defaults to minimum frequency cutoff of 30.
- Returns
(vocab, token_counts, ngram_counts) – Tuple containing the vocabulary as the first item, counts of in-vocabulary tokens as the second item and counts of in-vocabulary n-grams as the last item.
- Return type
Tuple[ExplicitVocab, List[int], List[int]]
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
word_index
¶ Get the index of known words
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
static
chunk_identifier
()[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.vocab.subword.ExplicitVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
write_chunk
(file) → None[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
idx
(item: str, default=None) → Union[List[int], int, None]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index – int if there is a single index for a known item, list of indices if the vocab can provide subword indices for an unknown item. The default item if the vocab can’t provide indices.
- Return type
-
property
idx_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
subword_indices
(item: str, bracket: bool = True) → List[int]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
- Parameters
item (str) – The query item.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
indices – The list of subword indices.
- Return type
List[int]
-
subwords
(item: str, bracket: bool = True) → List[str]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
-
-
ffp.vocab.subword.
load_explicit_vocab
(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.ExplicitVocab[source]¶ Load an ExplicitVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing an ExplicitVocab chunk.
- Returns
vocab – Returns the first ExplicitVocab in the file.
- Return type
ExplicitVocab
FastTextVocab¶
-
class
ffp.vocab.subword.
FastTextVocab
(words: List[str], indexer: ffp.subwords.hash_indexers.FastTextIndexer = None, index: Optional[Dict[str, int]] = None)[source]¶ Bases:
ffp.io.Chunk
,ffp.vocab.subword.SubwordVocab
FastText vocabulary
-
__init__
(words: List[str], indexer: ffp.subwords.hash_indexers.FastTextIndexer = None, index: Optional[Dict[str, int]] = None)[source]¶ Initialize a FastTextVocab.
Initializes the vocabulary with the given words and optional index and indexer.
If no indexer is passed, a FastTextIndexer with 2,000,000 buckets is used.
If no index is given, the nth word in the words list is assigned index n. The word list cannot contain duplicate entries and needs to be of the same length as the index.
- Parameters
words (List[str]) – List of unique words
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2,000,000 buckets with range 3-6.
index (Dict[str, int], optional) – Dictionary providing an entry -> index mapping.
- Raises
ValueError – If the lengths of index and words don’t match.
AssertionError – If the indexer is not a FastTextIndexer.
-
static
from_corpus
(file: Union[str, bytes, int, os.PathLike], cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, indexer: Optional[ffp.subwords.hash_indexers.FastTextIndexer] = None) → Tuple[ffp.vocab.subword.FastTextVocab, List[int]][source]¶ Build a fastText vocabulary from a corpus.
- Parameters
file (str, bytes, int, PathLike) – File with white-space separated tokens.
cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict vocabulary size. Defaults to minimum frequency cutoff of 30.
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2,000,000 buckets with range 3-6.
- Returns
(vocab, counts) – Tuple containing the Vocabulary as first item and counts of in-vocabulary items as the second item.
- Return type
Tuple[FastTextVocab, List[int]]
- Raises
AssertionError – If the indexer is not a FastTextIndexer.
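The counting step behind a minimum-frequency cutoff can be sketched as follows. `count_tokens` is a hypothetical helper, not ffp's implementation; the ordering of the result (descending count, ties broken alphabetically) is an assumption made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_tokens(lines: Iterable[str],
                 min_freq: int = 30) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: count tokens and apply a minimum-frequency cutoff."""
    # Tokens are separated by arbitrary white-space.
    counter = Counter(token for line in lines for token in line.split())
    # Keep tokens that reach the cutoff; ordering is an assumption of this sketch.
    kept = sorted(
        ((token, count) for token, count in counter.items() if count >= min_freq),
        key=lambda item: (-item[1], item[0]),
    )
    words = [token for token, _ in kept]
    counts = [count for _, count in kept]
    return words, counts


corpus = ["a b a", "c a b"]
words, counts = count_tokens(corpus, min_freq=2)
print(words, counts)
```

The two returned lists line up by position, which matches the documented shape of the return value: the vocabulary entries and the counts of the in-vocabulary items.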
-
to_explicit
() → ffp.vocab.subword.ExplicitVocab[source]¶ Returns a Vocabulary with explicit storage built from this vocab.
- Returns
explicit_vocab – The converted vocabulary.
- Return type
-
property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
property
word_index
¶ Get the index of known words
-
static
read_chunk
(file: BinaryIO) → ffp.vocab.subword.FastTextVocab[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
write_chunk
(file: BinaryIO)[source]¶ Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
-
ffp.vocab.subword.
load_fasttext_vocab
(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.FastTextVocab[source]¶ Load a FastTextVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FastTextVocab chunk.
- Returns
vocab – Returns the first FastTextVocab in the file.
- Return type
FastTextVocab
Interfaces¶
-
class
ffp.vocab.vocab.
Vocab
[source]¶ Bases:
abc.ABC
Finalfusion vocabulary interface.
Vocabs provide at least a simple string to index mapping and index to string mapping. Vocab is the base type of all vocabulary types.
-
abstract property
words
¶ Get the list of known words
- Returns
words – list of known words
- Return type
List[str]
-
abstract property
word_index
¶ Get the index of known words
-
abstract property
idx_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
abstract
idx
(item: str, default: Union[List[int], int, None] = None) → Union[List[int], int, None][source]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index – int if there is a single index for a known item, list of indices if the vocab can provide subword indices for an unknown item. The default item if the vocab can’t provide indices.
- Return type
-
abstract property
-
class
ffp.vocab.subword.
SubwordVocab
[source]¶ Bases:
ffp.vocab.vocab.Vocab
Interface for vocabularies with subword lookups.
-
idx
(item: str, default=None) → Union[List[int], int, None][source]¶ Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index – int if there is a single index for a known item, list of indices if the vocab can provide subword indices for an unknown item. The default item if the vocab can’t provide indices.
- Return type
-
property
idx_bound
¶ The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
property
min_n
¶ Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
property
max_n
¶ Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
abstract property
subword_indexer
¶ Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
subwords
(item: str, bracket: bool = True) → List[str][source]¶ Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
-
subword_indices
(item: str, bracket: bool = True) → List[int][source]¶ Get the subword indices for the given item.
This list does not contain the index for known items.
- Parameters
item (str) – The query item.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
indices – The list of subword indices.
- Return type
List[int]
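The lookup behaviour of a subword vocabulary (an int for known words, a list of subword indices for unknown words, the default otherwise) can be illustrated with a toy class. `ToySubwordVocab` and `extract_ngrams` are hypothetical names, and offsetting subword indices by the number of words is an assumption based on the `offset` parameter of the indexers.

```python
from typing import List, Optional, Union


def extract_ngrams(item: str, min_n: int, max_n: int, bracket: bool = True) -> List[str]:
    """Hypothetical helper: all n-grams of the (optionally bracketed) item."""
    chars = f"<{item}>" if bracket else item
    return [chars[start:start + n]
            for start in range(len(chars))
            for n in range(min_n, max_n + 1)
            if start + n <= len(chars)]


class ToySubwordVocab:
    """Illustrative stand-in, not ffp's SubwordVocab."""

    def __init__(self, words: List[str], known_ngrams: List[str],
                 min_n: int = 3, max_n: int = 6):
        self.word_index = {word: i for i, word in enumerate(words)}
        # Assumption: subword indices start right after the word indices.
        self.ngram_index = {ng: len(words) + i for i, ng in enumerate(known_ngrams)}
        self.min_n, self.max_n = min_n, max_n

    @property
    def idx_bound(self) -> int:
        # Exclusive upper bound over word and subword indices.
        return len(self.word_index) + len(self.ngram_index)

    def subword_indices(self, item: str, bracket: bool = True) -> List[int]:
        # Only subword indices, never the index of a known word.
        return [self.ngram_index[ng]
                for ng in extract_ngrams(item, self.min_n, self.max_n, bracket)
                if ng in self.ngram_index]

    def idx(self, item: str, default=None) -> Union[List[int], int, None]:
        if item in self.word_index:
            return self.word_index[item]
        indices = self.subword_indices(item)
        return indices if indices else default


vocab = ToySubwordVocab(["tests"], ["<te", "est"], min_n=3, max_n=3)
print(vocab.idx("tests"), vocab.idx("test"), vocab.idx("xyz", default=-1))
```

In this toy setup the unknown word "test" still resolves to indices because two of its bracketed trigrams are known, while "xyz" shares no n-gram with the vocabulary and falls back to the default.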
-
-
ffp.vocab.
load_vocab
(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.vocab.Vocab[source]¶ Load a vocabulary from a finalfusion file.
Loads the first known vocabulary from a finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a finalfusion vocab chunk.
- Returns
vocab – First Vocab in the file.
- Return type
SimpleVocab, FastTextVocab, FinalfusionBucketVocab, ExplicitVocab
- Raises
ValueError – If the file did not contain a vocabulary.
Subwords¶
ffp.subwords
FastTextIndexer | FastTextIndexer
FinalfusionHashIndexer | FinalfusionHashIndexer
ExplicitIndexer | ExplicitIndexer
word_ngrams | Get the ngrams for the given word.
FinalfusionHashIndexer¶
-
class
ffp.subwords.hash_indexers.
FinalfusionHashIndexer
(bucket_exp=21, min_n=3, max_n=6)¶ FinalfusionHashIndexer
FinalfusionHashIndexer is a hash-based subword indexer. It hashes n-grams with the FNV-1a algorithm and maps the hash to a predetermined bucket space.
N-grams can be indexed directly through the __call__ method or all n-grams in a string can be indexed in bulk through the subword_indices method.
-
buckets_exp
¶ ‘uint64_t’
- Type
buckets_exp
-
idx_bound
¶ Get the exclusive upper bound
This is the number of distinct indices.
- Returns
idx_bound – Exclusive upper bound of the indexer.
- Return type
-
max_n
¶ ‘uint32_t’
- Type
max_n
-
min_n
¶ ‘uint32_t’
- Type
min_n
-
subword_indices
(self, unicode word, uint64_t offset=0, bool bracket=True, bool with_ngrams=False)¶ Get the subword indices for a word.
- Parameters
word (str) – The string to extract n-grams from
offset (int) – The offset to add to the index, e.g. the length of the word-vocabulary.
bracket (bool) – Toggles bracketing the input string with < and >
with_ngrams (bool) – Toggles returning tuples of (ngram, idx)
- Returns
indices – List of n-gram indices, optionally as (str, int) tuples.
- Return type
- Raises
TypeError – If word is None.
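How a hash-based indexer maps an n-gram into a bucket space of size 2^bucket_exp can be sketched with standard 64-bit FNV-1a. This is a sketch under assumptions: which bytes are fed to the hash and how the hash is reduced to a bucket are guesses made for illustration, so the indices are not expected to match those produced by FinalfusionHashIndexer. `fnv1a_64` and `bucket_index` are hypothetical names.

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: xor each byte, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return h


def bucket_index(ngram: str, bucket_exp: int = 21, offset: int = 0) -> int:
    """Hypothetical reduction: keep the lowest bucket_exp bits of the hash."""
    mask = (1 << bucket_exp) - 1
    # The offset shifts subword indices, e.g. past the word vocabulary.
    return (fnv1a_64(ngram.encode("utf-8")) & mask) + offset


print(bucket_index("<te"), bucket_index("<te", offset=1000))
```

With the default `bucket_exp=21`, every n-gram lands in one of 2,097,152 buckets, so distinct n-grams can collide on the same index.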
-
FastTextIndexer¶
-
class
ffp.subwords.hash_indexers.
FastTextIndexer
(n_buckets=2000000, min_n=3, max_n=6)¶ FastTextIndexer
FastTextIndexer is a hash-based subword indexer. It hashes n-grams with a slight variant of the FNV-1a algorithm and maps the hash to a predetermined bucket space.
N-grams can be indexed directly through the __call__ method or all n-grams in a string can be indexed in bulk through the subword_indices method.
-
max_n
¶ ‘uint32_t’
- Type
max_n
-
min_n
¶ ‘uint32_t’
- Type
min_n
-
n_buckets
¶ ‘uint64_t’
- Type
n_buckets
-
subword_indices
(self, unicode word, uint64_t offset=0, bool bracket=True, bool with_ngrams=False)¶ Get the subword indices for a word.
- Parameters
word (str) – The string to extract n-grams from
offset (int) – The offset to add to the index, e.g. the length of the word-vocabulary.
bracket (bool) – Toggles bracketing the input string with < and >
with_ngrams (bool) – Toggles returning tuples of (ngram, idx)
- Returns
indices – List of n-gram indices, optionally as (str, int) tuples.
- Return type
- Raises
TypeError – If word is None.
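The FNV-1a variant can be sketched as below. The sketch follows the hashing used in the fastText reference implementation, where each byte is interpreted as a signed 8-bit value before it is xor-ed into the 32-bit state; that ffp's indexer matches this bit for bit is an assumption. `fasttext_hash` and `fasttext_bucket` are hypothetical names.

```python
def fasttext_hash(ngram: str) -> int:
    """32-bit FNV-1a where bytes >= 0x80 are sign-extended before the xor."""
    h = 2166136261  # 32-bit FNV offset basis
    for byte in ngram.encode("utf-8"):
        if byte >= 0x80:
            # Emulates the cast from int8_t to uint32_t in C++.
            byte |= 0xFFFFFF00
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # 32-bit FNV prime, wrap to 32 bits
    return h


def fasttext_bucket(ngram: str, n_buckets: int = 2_000_000, offset: int = 0) -> int:
    """Map the hash into the bucket space and apply the index offset."""
    return fasttext_hash(ngram) % n_buckets + offset


print(fasttext_bucket("<te"), fasttext_bucket("ä"))
```

For pure ASCII input the sign extension never triggers, so the result equals plain 32-bit FNV-1a; the two only diverge for n-grams containing non-ASCII characters.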
-
ExplicitIndexer¶
-
class
ffp.subwords.explicit_indexer.
ExplicitIndexer
(ngrams: List[str], ngram_range: Tuple[int, int] = (3, 6), ngram_index: Optional[Dict[str, int]] = None)¶ ExplicitIndexer
Explicit Indexers do not index n-grams through hashing but define an actual lookup table.
It can be constructed from a list of unique ngrams. In that case, the ith ngram in the list will be mapped to index i. It is also possible to pass a mapping via ngram_index which allows mapping multiple ngrams to the same value.
N-grams can be indexed directly through the __call__ method or all n-grams in a string can be indexed in bulk through the subword_indices method.
subword_indices optionally returns tuples of form (ngram, idx), otherwise a list of indices belonging to the input string is returned.
-
idx_bound
¶ Get the exclusive upper bound
This is the number of distinct indices.
- Returns
idx_bound – Exclusive upper bound of the indexer.
- Return type
-
max_n
¶ ‘uint32_t’
- Type
max_n
-
min_n
¶ ‘uint32_t’
- Type
min_n
-
ngram_index
¶ Get the ngram-index mapping.
- Returns
ngram_index – The ngram -> index mapping.
- Return type
-
ngrams
¶ Get the list of n-grams.
- Returns
ngrams – The list of in-vocabulary n-grams.
- Return type
-
subword_indices
(self, unicode word, offset=0, bool bracket=True, bool with_ngrams=False)¶ Get the subword indices for a word.
- Parameters
word (str) – The string to extract n-grams from
offset (int) – The offset to add to the index, e.g. the length of the word-vocabulary.
bracket (bool) – Toggles bracketing the input string with < and >
with_ngrams (bool) – Toggles returning tuples of (ngram, idx)
- Returns
indices – List of n-gram indices, optionally as (str, int) tuples.
- Return type
- Raises
TypeError – If word is None.
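The lookup-table idea behind an explicit indexer can be sketched with a plain dict. `ToyExplicitIndexer` is a hypothetical stand-in and not the Cython class; returning None for unknown n-grams is an assumption of this sketch.

```python
from typing import Dict, List, Optional


class ToyExplicitIndexer:
    """Illustrative lookup-table indexer, not ffp's ExplicitIndexer."""

    def __init__(self, ngrams: List[str],
                 ngram_index: Optional[Dict[str, int]] = None):
        if ngram_index is None:
            # The ith n-gram in the list is mapped to index i.
            ngram_index = {ngram: i for i, ngram in enumerate(ngrams)}
        self.ngrams = list(ngrams)
        self.ngram_index = dict(ngram_index)

    @property
    def idx_bound(self) -> int:
        # Number of distinct indices, as described for idx_bound.
        return len(set(self.ngram_index.values()))

    def __call__(self, ngram: str) -> Optional[int]:
        # Assumption: unknown n-grams yield None instead of raising.
        return self.ngram_index.get(ngram)


indexer = ToyExplicitIndexer(["<te", "est", "st>"])
shared = ToyExplicitIndexer(["<te", "est"], {"<te": 0, "est": 0})
print(indexer("est"), indexer.idx_bound, shared.idx_bound)
```

The second instance shows the case mentioned above where a custom mapping sends several n-grams to the same index, which lowers the number of distinct indices below the number of n-grams.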
-
NGrams¶
-
ffp.subwords.ngrams.
word_ngrams
(unicode word, uint32_t min_n=3, uint32_t max_n=6, bool bracket=True)¶ Get the ngrams for the given word.
- Parameters
word (str) – The string to extract n-grams from
min_n (int) – Inclusive lower bound of n-gram range. Must be greater than zero and less than or equal to max_n
max_n (int) – Inclusive upper bound of n-gram range. Must be greater than zero and greater than or equal to min_n
bracket (bool) – Toggles bracketing the input string with < and >
- Returns
ngrams – List of n-grams.
- Return type
- Raises
AssertionError – If max_n < min_n or min_n <= 0.
TypeError – If word is None.
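N-gram extraction with bracketing can be sketched in pure Python. `ngrams_sketch` is a hypothetical re-implementation for illustration; the order in which the n-grams are returned is an assumption and may differ from word_ngrams.

```python
from typing import List


def ngrams_sketch(word: str, min_n: int = 3, max_n: int = 6,
                  bracket: bool = True) -> List[str]:
    """Hypothetical pure-Python counterpart of the documented behaviour."""
    if word is None:
        raise TypeError("word must not be None")
    assert 0 < min_n <= max_n, "need 0 < min_n <= max_n"
    # Bracketing marks word boundaries, so prefixes and suffixes stay distinct.
    chars = f"<{word}>" if bracket else word
    result = []
    for start in range(len(chars)):
        for n in range(min_n, max_n + 1):
            if start + n > len(chars):
                break  # longer n-grams would run past the end
            result.append(chars[start:start + n])
    return result


print(ngrams_sketch("abc", min_n=3, max_n=3))
print(ngrams_sketch("ab", min_n=2, max_n=3, bracket=False))
```

Both bounds are inclusive, so `min_n=3, max_n=3` yields only trigrams, and an input shorter than `min_n` after bracketing yields an empty list.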
Metadata¶
finalfusion metadata
-
class
ffp.metadata.
Metadata
[source]¶ Bases:
dict
,ffp.io.Chunk
Embeddings metadata
Metadata can be used as a regular Python dict. For serialization, the contents need to be serializable through toml.dumps. Finalfusion assumes metadata to be a TOML formatted string.
Examples
>>> metadata = Metadata({'Some': 'value', 'number': 1})
>>> metadata
{'Some': 'value', 'number': 1}
>>> metadata['Some']
'value'
>>> metadata['Some'] = 'other value'
>>> metadata['Some']
'other value'
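The relation between the dict interface and the TOML representation can be sketched with a small dict subclass. `ToyMetadata` and its `to_toml` method are hypothetical; the hand-written serializer only handles flat string, integer, float and boolean values with bare keys, whereas real code would rely on toml.dumps as described above.

```python
class ToyMetadata(dict):
    """Illustrative dict subclass, not ffp.metadata.Metadata."""

    def to_toml(self) -> str:
        """Render flat key/value pairs as TOML (deliberately minimal)."""
        lines = []
        for key, value in self.items():
            if isinstance(value, bool):
                # bool is checked first because it is a subclass of int.
                rendered = "true" if value else "false"
            elif isinstance(value, (int, float)):
                rendered = repr(value)
            elif isinstance(value, str):
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                rendered = f'"{escaped}"'
            else:
                raise TypeError(f"unsupported value type: {type(value).__name__}")
            lines.append(f"{key} = {rendered}")
        return "\n".join(lines) + "\n"


toy = ToyMetadata({"Some": "value", "number": 1})
toy["Some"] = "other value"
print(toy.to_toml())
```

Values that cannot be rendered raise a `TypeError` in this sketch, which mirrors the requirement that the contents be serializable before they can be written.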
-
static
chunk_identifier
()[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.metadata.Metadata[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
-
ffp.metadata.
load_metadata
(file: Union[str, bytes, int, os.PathLike]) → ffp.metadata.Metadata[source]¶ Load a Metadata chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a metadata chunk.
- Returns
metadata – The Metadata from the file.
- Return type
Metadata
- Raises
ValueError – If the file did not contain a Metadata chunk.
Norms¶
Norms module.
-
class
ffp.norms.
Norms
[source]¶ Bases:
numpy.ndarray
,ffp.io.Chunk
Embedding Norms.
Norms subclasses numpy.ndarray, so all typical numpy operations are available.
The ith norm is expected to correspond to the l2 norm of the ith row in the storage before normalizing it. Therefore, Norms should have at most the same length as a given Storage and are expected to match the length of the Vocabulary.
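The relation between norms and a normalized storage can be sketched with plain lists. The matrix below is made-up example data and the code does not use ffp; with numpy, the same norms could be computed with `numpy.linalg.norm(storage, axis=1)`.

```python
import math

# Made-up, unnormalized embedding matrix with one row per vocabulary item.
storage = [[3.0, 4.0], [0.0, 2.0]]

# The ith norm is the l2 norm of the ith row before normalization.
norms = [math.sqrt(sum(x * x for x in row)) for row in storage]

# Normalizing divides each row by its norm, giving unit-length rows.
normalized = [[x / norm for x in row] for row, norm in zip(storage, norms)]

# Keeping the norms allows the original rows to be restored later.
restored = [[x * norm for x in row] for row, norm in zip(normalized, norms)]

print(norms)
print(normalized)
```

Storing the norms separately is what makes normalization reversible: multiplying a unit-length row by its stored norm gives back the original vector up to floating-point error.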
-
static
chunk_identifier
()[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.norms.Norms[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static
-
ffp.norms.
load_norms
(file: Union[str, bytes, int, os.PathLike]) → ffp.norms.Norms[source]¶ Load a Norms chunk from the given file.
- Parameters
file (str, bytes, int, PathLike) – Finalfusion file with a norms chunk.
- Returns
norms – The Norms from the file.
- Return type
Norms
- Raises
ValueError – If the file did not contain a Norms chunk.
IO¶
This module defines some common IO operations and types.
Chunk
is the building block of finalfusion embeddings, each component
is serialized as its own, non-overlapping, chunk in finalfusion files.
ChunkIdentifier
is a unique integer identifier for a Chunk.
TypeId
is used to uniquely identify numerical types.
The Header
handles the preamble of finalfusion files.
FinalfusionFormatError
is raised upon reading from malformed finalfusion
files.
-
class
ffp.io.
Chunk
[source]¶ Bases:
abc.ABC
Basic building blocks of finalfusion files.
-
write
(file: Union[str, bytes, int, os.PathLike])[source]¶ Write the Chunk as a standalone finalfusion file.
-
abstract static
chunk_identifier
() → ffp.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
abstract static
read_chunk
(file: BinaryIO) → ffp.io.Chunk[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
-
class
ffp.io.
Header
(chunk_ids)[source]¶ Bases:
ffp.io.Chunk
Header Chunk
The header chunk handles the preamble.
-
property
chunk_ids
¶ Get the chunk IDs from the header
- Returns
chunk_ids – List of ChunkIdentifiers in the Header.
- Return type
List[ChunkIdentifier]
-
static
chunk_identifier
() → ffp.io.ChunkIdentifier[source]¶ Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
static
read_chunk
(file: BinaryIO) → ffp.io.Header[source]¶ Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
property
-
ffp.io.
find_chunk
(file: BinaryIO, chunks: List[ChunkIdentifier]) → Optional[ffp.io.ChunkIdentifier][source]¶ Find a Chunk in a file.
Looks for one of the specified chunks in the input file and seeks the file to the beginning of the first chunk found from chunks, i.e. the file is positioned before the content but after the header of a chunk.
The Chunk.read_chunk() method can be invoked on the Chunk corresponding to the returned ChunkIdentifier.
This method seeks the input file to the beginning before searching.
- Parameters
file (BinaryIO) – finalfusion file
chunks (List[ChunkIdentifier]) – List of Chunks to look for in the input file.
- Returns
chunk_id – The first ChunkIdentifier found in the file. None if none of the chunks could be found.
- Return type
Optional[ChunkIdentifier]
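The seek-and-skip behaviour can be sketched on an in-memory stream. The layout used here is a simplifying assumption: no file preamble, and each chunk consisting of a little-endian u32 identifier, a u64 payload length and the payload; the authoritative layout is the one in the finalfusion spec. `find_chunk_sketch`, `make_chunk` and the identifiers 1 and 2 are made up for the example.

```python
import io
import struct
from typing import BinaryIO, List, Optional

# Assumed chunk header: u32 identifier followed by u64 payload length.
CHUNK_HEADER = struct.Struct("<IQ")


def find_chunk_sketch(file: BinaryIO, wanted: List[int]) -> Optional[int]:
    """Seek to the first chunk whose identifier is in `wanted`."""
    file.seek(0)  # always search from the beginning
    while True:
        header = file.read(CHUNK_HEADER.size)
        if len(header) < CHUNK_HEADER.size:
            return None  # reached the end without a match
        chunk_id, length = CHUNK_HEADER.unpack(header)
        if chunk_id in wanted:
            # Positioned after the chunk header, before its contents.
            return chunk_id
        file.seek(length, io.SEEK_CUR)  # skip the payload of other chunks


def make_chunk(chunk_id: int, payload: bytes) -> bytes:
    """Build one chunk in the assumed layout."""
    return CHUNK_HEADER.pack(chunk_id, len(payload)) + payload


stream = io.BytesIO(make_chunk(1, b"vocab") + make_chunk(2, b"storage!"))
found = find_chunk_sketch(stream, [2])
payload = stream.read(8)
print(found, payload)
```

Because the function leaves the stream positioned right before the chunk contents, the caller can hand the same file object to a reader for that chunk type without any further seeking.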
-
class
ffp.io.
ChunkIdentifier
[source]¶ Bases:
enum.IntEnum
Known finalfusion Chunk types.
-
class
ffp.io.
TypeId
[source]¶ Bases:
enum.IntEnum
Known finalfusion data types.