SimpleVocab

class ffp.vocab.simple_vocab.SimpleVocab(words: List[str], index: Optional[Dict[str, int]] = None)

Bases: ffp.io.Chunk, ffp.vocab.vocab.Vocab

Simple vocabulary.

A SimpleVocab provides a string-to-index mapping and the inverse index-to-string mapping. SimpleVocab is also the base type of other vocabulary types.

__init__(words: List[str], index: Optional[Dict[str, int]] = None)

Initialize a SimpleVocab.

Initializes the vocabulary with the given words and an optional index. If no index is given, the nth word in the words list is assigned index n. The word list must not contain duplicate entries, and it must have the same length as the index.

Parameters
  • words (List[str]) – List of unique words

  • index (Optional[Dict[str, int]]) – Dictionary providing an entry -> index mapping.

Raises

ValueError – if the lengths of index and words don’t match.
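
The indexing rule described above can be illustrated with a minimal plain-Python sketch. It does not call ffp: `build_index` is a hypothetical helper written for this example, and the actual constructor's checks and error message may differ.

```python
from typing import Dict, List, Optional


def build_index(words: List[str],
                index: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Hypothetical helper mirroring the documented rules, not ffp code."""
    if index is None:
        # Default: the nth word in the list is assigned index n.
        index = {word: i for i, word in enumerate(words)}
    # Duplicate words collapse into one dict key, so a length mismatch
    # also exposes duplicates when the index was built from the list.
    if len(index) != len(words):
        raise ValueError("words and index must have the same length")
    return index


print(build_index(["the", "cat", "sat"]))  # {'the': 0, 'cat': 1, 'sat': 2}
```

Passing a list with a repeated word, such as `["a", "a"]`, makes this sketch raise ValueError, because the derived index has only one entry.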

static from_corpus(file: Union[str, bytes, int, os.PathLike], cutoff: ffp.vocab.cutoff.Cutoff = Cutoff(30, 'min_freq'))

Construct a simple vocabulary from the given corpus.

Parameters
  • file (str, bytes, int, PathLike) – Path to corpus file

  • cutoff (Cutoff) – Frequency cutoff or target size to restrict vocabulary size.

Returns

(vocab, counts) – Tuple containing the vocabulary as the first item and the counts of in-vocabulary items as the second item.

Return type

Tuple[SimpleVocab, List[int]]
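
To make the shape of the return value concrete, here is a rough sketch of a minimum-frequency cutoff in plain Python. It is not ffp's implementation: `vocab_from_lines` is a hypothetical function, whitespace tokenization is an assumption, and the ordering (most frequent first, ties broken alphabetically) is chosen here only to keep the output deterministic.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocab_from_lines(lines: Iterable[str],
                     min_freq: int) -> Tuple[List[str], List[int]]:
    """Hypothetical stand-in for a min-frequency cutoff, not ffp code."""
    # Assumption: tokens are separated by whitespace.
    counter = Counter(token for line in lines for token in line.split())
    # Keep tokens seen at least min_freq times, most frequent first;
    # ties are broken alphabetically so the result is deterministic.
    kept = sorted(
        ((tok, cnt) for tok, cnt in counter.items() if cnt >= min_freq),
        key=lambda pair: (-pair[1], pair[0]),
    )
    words = [tok for tok, _ in kept]
    counts = [cnt for _, cnt in kept]
    return words, counts


words, counts = vocab_from_lines(["a b a", "b a c"], min_freq=2)
print(words, counts)  # ['a', 'b'] [3, 2]
```

The two lists are parallel: `counts[i]` is the corpus frequency of `words[i]`, and the token `c` is dropped because it occurs only once.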

property word_index

Get the index of known words.

Returns

dict – index of known words

Return type

Dict[str, int]

property words

Get the list of known words.

Returns

words – list of known words

Return type

List[str]

property idx_bound

The exclusive upper bound of indices in this vocabulary.

Returns

idx_bound – Exclusive upper bound of indices covered by the vocabulary.

Return type

int
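
As a small illustration of what "exclusive upper bound" means, the sketch below computes one for a plain dictionary index. `exclusive_upper_bound` is a hypothetical helper, not part of ffp, and it assumes non-negative integer indices.

```python
from typing import Dict


def exclusive_upper_bound(index: Dict[str, int]) -> int:
    """Hypothetical helper: smallest integer greater than every index."""
    return max(index.values()) + 1 if index else 0


index = {"the": 0, "cat": 1, "sat": 2}
bound = exclusive_upper_bound(index)
# Every index i satisfies 0 <= i < bound, so range(bound) covers them all.
assert all(0 <= i < bound for i in index.values())
print(bound)  # 3
```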

static read_chunk(file: BinaryIO) → ffp.vocab.simple_vocab.SimpleVocab

Read the Chunk and return it.

The file must be positioned before the contents of the Chunk but after its header.

Parameters

file (BinaryIO) – a finalfusion file containing the given Chunk

Returns

chunk – The chunk read from the file.

Return type

Chunk

write_chunk(file: BinaryIO)

Write the Chunk to a file.

Parameters

file (BinaryIO) – Output file for the Chunk

static chunk_identifier()

Get the ChunkIdentifier for this Chunk.

Returns

chunk_identifier

Return type

ChunkIdentifier

idx(item, default=None)

Lookup the given query item.

This lookup does not raise an exception if the vocab can’t produce indices.

Parameters
  • item (str) – The query item.

  • default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.

Returns

index – int if there is a single index for a known item; list of indices if the vocab can provide subword indices for an unknown item; the default item if the vocab can’t provide indices.

Return type

int, List[int], optional
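
The non-raising behaviour described above resembles `dict.get`. The following sketch shows the idea for a vocabulary without subword indices. `lookup` is a hypothetical function used only for illustration, and it is not how ffp necessarily implements `idx`.

```python
from typing import Dict, List, Optional, Union

Index = Union[int, List[int]]


def lookup(index: Dict[str, int], item: str,
           default: Optional[Index] = None) -> Optional[Index]:
    """Hypothetical non-raising lookup for a vocab without subwords."""
    # dict.get returns the fall-back value instead of raising KeyError.
    return index.get(item, default)


index = {"the": 0, "cat": 1}
print(lookup(index, "cat"))      # 1
print(lookup(index, "dog"))      # None
print(lookup(index, "dog", -1))  # -1
```

Callers that need to tell "unknown word" apart from a valid index can pass a sentinel such as `-1` as the default, or leave it as `None` and test for that.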

ffp.vocab.simple_vocab.load_simple_vocab(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.simple_vocab.SimpleVocab

Load a SimpleVocab from the given finalfusion file.

Parameters

file (str, bytes, int, PathLike) – Path to file containing a SimpleVocab chunk.

Returns

vocab – Returns the first SimpleVocab in the file.

Return type

SimpleVocab