ExplicitVocab

class ffp.vocab.subword.ExplicitVocab(words: List[str], indexer: ffp.subwords.explicit_indexer.ExplicitIndexer, index: Dict[str, int] = None)

Bases: ffp.io.Chunk, ffp.vocab.subword.SubwordVocab

A vocabulary with explicitly stored n-grams.

__init__(words: List[str], indexer: ffp.subwords.explicit_indexer.ExplicitIndexer, index: Dict[str, int] = None)

Initialize an ExplicitVocab.

Initializes the vocabulary with the given words, subword indexer and an optional word index.

If no index is given, the nth word in the words list is assigned index n. The word list must not contain duplicate entries and must have the same length as the index.

Parameters
  • words (List[str]) – List of unique words

  • indexer (ExplicitIndexer) – Subword indexer to use for the vocabulary.

  • index (Dict[str, int], optional) – Dictionary providing a word -> index mapping.

Raises
  • ValueError – If the lengths of index and words don’t match.

  • AssertionError – If the indexer is not an ExplicitIndexer.

See also

ExplicitIndexer
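
The default-index rule described above can be illustrated with a small standalone sketch. It does not use ffp: build_word_index is a hypothetical helper, and the length check shown here is an assumption about how such validation might look, not the library's implementation.

```python
from typing import Dict, List, Optional


def build_word_index(words: List[str],
                     index: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Hypothetical helper mirroring the documented default-index rule."""
    if index is None:
        # The nth word in the list is assigned index n.
        index = {word: n for n, word in enumerate(words)}
    # Duplicate words collapse into a single dict key, so a length
    # mismatch also catches duplicates when the index is derived
    # from the word list.
    if len(index) != len(words):
        raise ValueError("words and index must have the same length")
    return index


print(build_word_index(["apple", "pear", "plum"]))
# {'apple': 0, 'pear': 1, 'plum': 2}
```

Passing a list with a repeated word, such as ["a", "a"], raises ValueError in this sketch because the derived index has only one entry.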

static from_corpus(file: Union[str, bytes, int, os.PathLike], ngram_range=(3, 6), token_cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, ngram_cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None)

Build an ExplicitVocab from a corpus.

Parameters
  • file (str, bytes, int, PathLike) – File with whitespace-separated tokens.

  • ngram_range (Tuple[int, int]) – Specifies the n-gram range for the indexer.

  • token_cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict the token vocabulary size. Defaults to a minimum frequency cutoff of 30.

  • ngram_cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict the n-gram vocabulary size. Defaults to a minimum frequency cutoff of 30.

Returns

(vocab, counts) – Tuple containing the vocabulary as the first item, the counts of in-vocabulary tokens as the second item, and the counts of in-vocabulary n-grams as the last item.

Return type

Tuple[ExplicitVocab, List[int], List[int]]
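
To make the two cutoffs concrete, here is a rough standalone sketch of corpus counting with a minimum-frequency threshold. It is not the ffp implementation: count_corpus is a hypothetical function, and it assumes that the n-gram range is inclusive, that tokens are bracketed with ‘<’ and ‘>’, and that n-grams are counted only over tokens that survive the token cutoff, weighted by token frequency. The library may differ on any of these points.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_corpus(lines: Iterable[str],
                 ngram_range: Tuple[int, int] = (3, 6),
                 min_freq: int = 30) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Hypothetical sketch: count tokens and n-grams, then apply a cutoff."""
    min_n, max_n = ngram_range
    # Count whitespace-separated tokens over the whole corpus.
    token_counts = Counter(tok for line in lines for tok in line.split())
    kept_tokens = {t: c for t, c in token_counts.items() if c >= min_freq}

    # Assumption: n-grams come from bracketed in-vocabulary tokens only,
    # and each n-gram occurrence is weighted by the token's frequency.
    ngram_counts: Counter = Counter()
    for token, count in kept_tokens.items():
        bracketed = f"<{token}>"
        for n in range(min_n, max_n + 1):
            for start in range(len(bracketed) - n + 1):
                ngram_counts[bracketed[start:start + n]] += count
    kept_ngrams = {g: c for g, c in ngram_counts.items() if c >= min_freq}
    return kept_tokens, kept_ngrams


corpus = ["the cat sat", "the cat ran", "the dog sat"]
tokens, ngrams = count_corpus(corpus, ngram_range=(3, 3), min_freq=2)
print(tokens)  # {'the': 3, 'cat': 2, 'sat': 2}
```

With a cutoff of 2, "ran" and "dog" are dropped before any n-grams are extracted, so they contribute nothing to the n-gram counts in this sketch.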

property words

Get the list of known words.

Returns

words – list of known words

Return type

List[str]

property word_index

Get the index of known words.

Returns

dict – index of known words

Return type

Dict[str, int]

property subword_indexer

Get this vocab’s subword Indexer.

The subword indexer produces indices for n-grams.

In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.

Returns

subword_indexer – The subword indexer of the vocabulary.

Return type

ExplicitIndexer, FinalfusionHashIndexer, FastTextIndexer

static chunk_identifier()

Get the ChunkIdentifier for this Chunk.

Returns

chunk_identifier

Return type

ChunkIdentifier

static read_chunk(file: BinaryIO) → ffp.vocab.subword.ExplicitVocab

Read the Chunk and return it.

The file must be positioned before the contents of the Chunk but after its header.

Parameters

file (BinaryIO) – a finalfusion file containing the given Chunk

Returns

chunk – The chunk read from the file.

Return type

Chunk

write_chunk(file) → None

Write the Chunk to a file.

Parameters

file (BinaryIO) – Output file for the Chunk

idx(item: str, default=None) → Union[List[int], int, None]

Lookup the given query item.

This lookup does not raise an exception if the vocab can’t produce indices.

Parameters
  • item (str) – The query item.

  • default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.

Returns

index – int if there is a single index for a known item; list of indices if the vocab can provide subword indices for an unknown item; the default item if the vocab can’t provide indices.

Return type

int, List[int], optional
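
The three possible outcomes of the lookup can be sketched without the library. In this minimal illustration, lookup and toy_subword_indices are hypothetical names, and the order of the checks (known word first, then subword indices, then the default) is an assumption based on the return description above.

```python
from typing import Callable, Dict, List, Optional, Union

Index = Union[int, List[int]]


def lookup(item: str,
           word_index: Dict[str, int],
           subword_indices: Callable[[str], List[int]],
           default: Optional[Index] = None) -> Optional[Index]:
    """Hypothetical sketch of a lookup that never raises."""
    # 1. Known item: return its single index.
    if item in word_index:
        return word_index[item]
    # 2. Unknown item: fall back to subword indices, if there are any.
    indices = subword_indices(item)
    if indices:
        return indices
    # 3. Nothing available: return the caller-supplied default.
    return default


def toy_subword_indices(item: str) -> List[int]:
    # Stand-in for a real indexer: only items starting with "tr"
    # have known n-grams in this toy setup.
    return [1, 2] if item.startswith("tr") else []


word_index = {"tree": 0}
print(lookup("tree", word_index, toy_subword_indices))      # 0
print(lookup("trees", word_index, toy_subword_indices))     # [1, 2]
print(lookup("xyz", word_index, toy_subword_indices, -1))   # -1
```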

property idx_bound

The exclusive upper bound of indices in this vocabulary.

Returns

idx_bound – Exclusive upper bound of indices covered by the vocabulary.

Return type

int

property max_n

Get the upper bound of the range of extracted n-grams.

Returns

max_n – upper bound of n-gram range.

Return type

int

property min_n

Get the lower bound of the range of extracted n-grams.

Returns

min_n – lower bound of n-gram range.

Return type

int

subword_indices(item: str, bracket: bool = True) → List[int]

Get the subword indices for the given item.

This list does not contain the index for known items.

Parameters
  • item (str) – The query item.

  • bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.

Returns

indices – The list of subword indices.

Return type

List[int]

subwords(item: str, bracket: bool = True) → List[str]

Get the n-grams of the given item as a list.

The n-gram range is determined by the min_n and max_n values.

Parameters
  • item (str) – The query item to extract n-grams from.

  • bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.

Returns

ngrams – List of n-grams.

Return type

List[str]
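
As an illustration of bracketed n-gram extraction, the following standalone sketch enumerates all substrings whose length lies in the range from min_n to max_n. extract_ngrams is a hypothetical function; the inclusive range and the output order (shortest n-grams first) are assumptions and may not match the order in which the library returns n-grams.

```python
from typing import List


def extract_ngrams(item: str, min_n: int = 3, max_n: int = 6,
                   bracket: bool = True) -> List[str]:
    """Hypothetical sketch of n-gram extraction with optional bracketing."""
    if bracket:
        # Mark word boundaries so prefixes and suffixes get distinct n-grams.
        item = f"<{item}>"
    return [item[start:start + n]
            for n in range(min_n, max_n + 1)
            for start in range(len(item) - n + 1)]


print(extract_ngrams("ab"))                 # ['<ab', 'ab>', '<ab>']
print(extract_ngrams("ab", bracket=False))  # []
```

Without bracketing, "ab" is shorter than the smallest n-gram length of 3, so nothing is extracted.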

write(file: Union[str, bytes, int, os.PathLike])

Write the Chunk as a standalone finalfusion file.

Parameters

file (str, bytes, int, PathLike) – Output file

Raises

TypeError – If the Chunk is a Header.

ffp.vocab.subword.load_explicit_vocab(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.ExplicitVocab

Load an ExplicitVocab from the given finalfusion file.

Parameters

file (str, bytes, int, PathLike) – Path to a file containing an ExplicitVocab chunk.

Returns

vocab – Returns the first ExplicitVocab in the file.

Return type

ExplicitVocab