FastTextVocab

class ffp.vocab.subword.FastTextVocab(words: List[str], indexer: ffp.subwords.hash_indexers.FastTextIndexer = None, index: Optional[Dict[str, int]] = None)

Bases: ffp.io.Chunk, ffp.vocab.subword.SubwordVocab

FastText vocabulary

__init__(words: List[str], indexer: ffp.subwords.hash_indexers.FastTextIndexer = None, index: Optional[Dict[str, int]] = None)

Initialize a FastTextVocab.

Initializes the vocabulary with the given words and optional index and indexer.

If no indexer is passed, a FastTextIndexer with 2,000,000 buckets is used.

If no index is given, the nth word in the words list is assigned index n. The word list must not contain duplicate entries and must have the same length as the index.

Parameters

words (List[str]) – List of unique words
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2,000,000 buckets and an n-gram length range of 3-6.
index (Dict[str, int], optional) – Dictionary providing an entry -> index mapping.

Raises

ValueError – If the lengths of index and words do not match.
AssertionError – If the indexer is not a FastTextIndexer.
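
The indexing rules above can be illustrated with a minimal sketch in plain Python. The helper `build_word_index` is hypothetical and is not part of ffp; it only mirrors the documented behaviour (default index n for the nth word, ValueError on a length mismatch) and assumes that duplicate words are detected through that same length check.

```python
from typing import Dict, List, Optional


def build_word_index(words: List[str],
                     index: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Return a word -> index mapping following the rules described above.

    Hypothetical helper for illustration, not ffp's implementation.
    """
    if index is None:
        # Default: the nth word in the list is assigned index n.
        index = {word: i for i, word in enumerate(words)}
    # Duplicate words collapse into a single dict key, so a list with
    # duplicates ends up longer than the index built from it.
    if len(index) != len(words):
        raise ValueError("words and index must have the same length")
    return index


word_index = build_word_index(["the", "quick", "fox"])
```

Passing a list such as `["a", "a"]` to this sketch raises ValueError, because the derived index has only one entry.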

static from_corpus(file: Union[str, bytes, int, os.PathLike], cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, indexer: Optional[ffp.subwords.hash_indexers.FastTextIndexer] = None) → Tuple[ffp.vocab.subword.FastTextVocab, List[int]]

Build a fastText vocabulary from a corpus.

Parameters

file (str, bytes, int, PathLike) – File with whitespace-separated tokens.
cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict vocabulary size. Defaults to a minimum frequency cutoff of 30.
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2,000,000 buckets and an n-gram length range of 3-6.

Returns

(vocab, counts) – Tuple containing the vocabulary as the first item and the counts of in-vocabulary items as the second item.

Return type

Tuple[FastTextVocab, List[int]]

Raises

AssertionError – If the indexer is not a FastTextIndexer.
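
As a rough illustration of what building a vocabulary from a corpus involves, the sketch below counts whitespace-separated tokens and applies a minimum frequency cutoff. The function `vocab_from_lines` and its `min_freq` parameter are hypothetical names, and the ordering of the result (most frequent first, ties broken alphabetically) is an assumption rather than documented ffp behaviour.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocab_from_lines(lines: Iterable[str],
                     min_freq: int = 30) -> Tuple[List[str], List[int]]:
    """Count tokens and keep those that occur at least min_freq times.

    Hypothetical helper for illustration, not ffp's implementation.
    """
    freq = Counter(token for line in lines for token in line.split())
    # Assumed ordering: descending frequency, then alphabetical.
    kept = sorted(((word, count) for word, count in freq.items()
                   if count >= min_freq),
                  key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in kept]
    counts = [count for _, count in kept]
    return words, counts


# Tiny corpus with a low cutoff so that some tokens survive.
words, counts = vocab_from_lines(["a b a", "c a b"], min_freq=2)
```

The two returned lists are aligned, so `counts[i]` is the frequency of `words[i]`, matching the (vocab, counts) shape described above.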

to_explicit() → ffp.vocab.subword.ExplicitVocab

Returns a Vocabulary with explicit storage built from this vocab.

Returns: explicit_vocab – The converted vocabulary.
Return type: ExplicitVocab
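
One way to picture the conversion from bucket storage to explicit storage is the sketch below. It is an assumption about the general technique, not ffp's code: every n-gram gets an explicit entry, and n-grams that share a bucket keep sharing one index, renumbered contiguously. `toy_bucket_indexer` is a hypothetical stand-in that uses CRC32 instead of the fastText hash.

```python
import zlib
from typing import Callable, Dict, List


def toy_bucket_indexer(ngram: str, buckets: int = 8) -> int:
    """Stand-in hash indexer (CRC32, NOT the fastText hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % buckets


def explicit_ngram_index(ngrams: List[str],
                         indexer: Callable[[str], int]) -> Dict[str, int]:
    """Map each n-gram to a contiguous index derived from its bucket."""
    bucket_to_new: Dict[int, int] = {}
    ngram_index: Dict[str, int] = {}
    for ngram in ngrams:
        if ngram in ngram_index:
            continue
        bucket = indexer(ngram)
        # First time a bucket is seen it gets the next free index;
        # n-grams colliding in that bucket reuse the same index.
        new_idx = bucket_to_new.setdefault(bucket, len(bucket_to_new))
        ngram_index[ngram] = new_idx
    return ngram_index


ngram_index = explicit_ngram_index(["<ab", "ab>", "<ab>"], toy_bucket_indexer)
```

Under this scheme the explicit vocabulary needs only as many subword indices as there are occupied buckets, instead of the full bucket count.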

property subword_indexer

Get this vocab’s subword Indexer.

The subword indexer produces indices for n-grams.

For bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.

Returns: subword_indexer – The subword indexer of the vocabulary.
Return type: ExplicitIndexer, FinalfusionHashIndexer, FastTextIndexer
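
To make "produces indices for n-grams" concrete, here is a minimal sketch of fastText-style bucket hashing. It assumes the 32-bit FNV-1a variant used by the reference fastText implementation (each byte is sign-extended before the XOR) and the default parameters mentioned in this document (2,000,000 buckets, n-gram lengths 3 to 6). The function names are hypothetical, and the sketch does not reproduce any offset a vocabulary may add to subword indices.

```python
from typing import List


def fasttext_hash(ngram: str) -> int:
    """32-bit FNV-1a over UTF-8 bytes, with bytes sign-extended."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        # Emulate the int8 -> uint32 cast: bytes >= 0x80 become negative
        # and are widened with leading one bits.
        if byte >= 0x80:
            byte |= 0xFFFFFF00
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """All n-grams of the bracketed word with min_n <= n <= max_n."""
    bracketed = f"<{word}>"
    return [bracketed[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(bracketed) - n + 1)]


def bucket_indices(word: str, buckets: int = 2_000_000) -> List[int]:
    """Hash every n-gram of the word into one of `buckets` slots."""
    return [fasttext_hash(ng) % buckets for ng in char_ngrams(word)]


indices = bucket_indices("hello")
```

Because the mapping is a plain hash modulo the bucket count, distinct n-grams can land in the same bucket; that is the trade-off against storing every n-gram explicitly.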

property words

Get the list of known words.

Returns: words – List of known words.
Return type: List[str]

property word_index

Get the index of known words.

Returns: dict – Mapping from each known word to its index.
Return type: Dict[str, int]

static read_chunk(file: BinaryIO) → ffp.vocab.subword.FastTextVocab

Read the Chunk and return it.

The file must be positioned before the contents of the Chunk but after its header.

Parameters: file (BinaryIO) – A finalfusion file containing the given Chunk.
Returns: chunk – The chunk read from the file.
Return type: Chunk

write_chunk(file: BinaryIO)

Write the Chunk to a file.

Parameters: file (BinaryIO) – Output file for the Chunk.

static chunk_identifier()

Get the ChunkIdentifier for this Chunk.

Returns: chunk_identifier
Return type: ChunkIdentifier

ffp.vocab.subword.load_fasttext_vocab(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.FastTextVocab

Load a FastTextVocab from the given finalfusion file.

Parameters: file (str, bytes, int, PathLike) – Path to a file containing a FastTextVocab chunk.
Returns: vocab – The first FastTextVocab in the file.
Return type: FastTextVocab