ExplicitIndexer

class ffp.subwords.explicit_indexer.ExplicitIndexer(ngrams: List[str], ngram_range: Tuple[int, int] = (3, 6), ngram_index: Optional[Dict[str, int]] = None)

Explicit Indexers do not index n-grams through hashing but define an actual lookup table.

An ExplicitIndexer can be constructed from a list of unique n-grams, in which case the i-th n-gram in the list is mapped to index i. Alternatively, a mapping can be passed via ngram_index, which allows several n-grams to map to the same index.

A single n-gram can be indexed directly through the __call__ method; all n-grams in a string can be indexed in bulk through the subword_indices method.

subword_indices returns a list of the indices belonging to the input string or, if with_ngrams is set, a list of (ngram, idx) tuples.
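The two construction modes described above can be illustrated with a plain dictionary. This is a minimal sketch of the lookup-table idea only, not the library's implementation: build_ngram_index and lookup are hypothetical helpers, and returning None for an unknown n-gram is an assumption.

```python
from typing import Dict, List, Optional


def build_ngram_index(ngrams: List[str]) -> Dict[str, int]:
    """Hypothetical helper: map the i-th n-gram of a unique list to index i."""
    if len(set(ngrams)) != len(ngrams):
        raise ValueError("ngrams must be unique")
    return {ngram: idx for idx, ngram in enumerate(ngrams)}


def lookup(ngram_index: Dict[str, int], ngram: str) -> Optional[int]:
    """Hypothetical helper: return the stored index, or None if the n-gram is unknown."""
    return ngram_index.get(ngram)


# Built from a list: the position in the list becomes the index.
from_list = build_ngram_index(["<ab", "abc", "bc>"])

# Built from a mapping: several n-grams may share one index.
shared = {"<ab": 0, "abc": 1, "bc>": 1}
```

Unlike a hashing indexer, nothing is computed from the n-gram's characters here: an n-gram either has an entry in the table or it does not.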

idx_bound

Get the exclusive upper bound of the indexer.

This is the number of distinct indices.

Returns

idx_bound – Exclusive upper bound of the indexer.

Return type

int
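As a rough illustration of "number of distinct indices", the sketch below counts the distinct values of a mapping in which two n-grams share an index. distinct_index_count is a hypothetical helper, and the sketch assumes the indices are contiguous and start at 0, so that the count is also an exclusive upper bound.

```python
from typing import Dict


def distinct_index_count(ngram_index: Dict[str, int]) -> int:
    """Hypothetical helper: count the distinct indices used by the mapping."""
    return len(set(ngram_index.values()))


# Three n-grams, but "abc" and "bc>" share index 1.
mapping = {"<ab": 0, "abc": 1, "bc>": 1}
bound = distinct_index_count(mapping)
```

Under the contiguity assumption, every index in the mapping is strictly smaller than bound, which is what makes it usable for sizing an embedding matrix.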

max_n

The maximum n-gram length (the upper end of ngram_range).

Type

uint32_t

min_n

The minimum n-gram length (the lower end of ngram_range).

Type

uint32_t

ngram_index

Get the ngram-index mapping.

Returns

ngram_index – The ngram -> index mapping.

Return type

dict

ngrams

Get the list of n-grams.

Returns

ngrams – The list of in-vocabulary n-grams.

Return type

list

subword_indices(word: str, offset: int = 0, bracket: bool = True, with_ngrams: bool = False)

Get the subword indices for a word.

Parameters
  • word (str) – The string to extract n-grams from.

  • offset (int) – The offset to add to each index, e.g. the length of the word vocabulary.

  • bracket (bool) – Toggles bracketing the input string with < and >.

  • with_ngrams (bool) – Toggles returning tuples of (ngram, idx).

Returns

indices – List of n-gram indices, optionally as (str, int) tuples.

Return type

list

Raises

TypeError – If word is None.
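The behaviour described by these parameters can be approximated in pure Python. The sketch below is a stand-in written for illustration, not the library's code: subword_indices_sketch is a hypothetical function, the order in which n-grams are emitted is an assumption, and so is the choice to skip n-grams that are missing from the table.

```python
from typing import Dict, List, Tuple, Union


def subword_indices_sketch(
    word: str,
    ngram_index: Dict[str, int],
    ngram_range: Tuple[int, int] = (3, 6),
    offset: int = 0,
    bracket: bool = True,
    with_ngrams: bool = False,
) -> List[Union[int, Tuple[str, int]]]:
    """Hypothetical stand-in for subword_indices on an explicit lookup table."""
    if word is None:
        raise TypeError("word must not be None")
    if bracket:
        word = f"<{word}>"
    min_n, max_n = ngram_range
    result: List[Union[int, Tuple[str, int]]] = []
    for start in range(len(word)):
        for n in range(min_n, max_n + 1):
            if start + n > len(word):
                break  # longer n-grams would run past the end of the word
            ngram = word[start:start + n]
            idx = ngram_index.get(ngram)
            if idx is None:
                continue  # assumption: unknown n-grams are skipped
            idx += offset
            result.append((ngram, idx) if with_ngrams else idx)
    return result


table = {"<ab": 0, "abc": 1, "bc>": 2, "<abc": 3}
plain = subword_indices_sketch("abc", table)
shifted = subword_indices_sketch("abc", table, offset=100, with_ngrams=True)
unbracketed = subword_indices_sketch("abc", table, bracket=False)
```

The offset is what lets subword rows sit after the word rows in one shared embedding matrix: with a word vocabulary of 100 entries, passing offset=100 shifts every n-gram index past the word indices.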