FinalfusionBucketVocab
-
class ffp.vocab.subword.FinalfusionBucketVocab(words: List[str], indexer: ffp.subwords.hash_indexers.FinalfusionHashIndexer = None, index: Optional[Dict[str, int]] = None)
Bases: ffp.io.Chunk, ffp.vocab.subword.SubwordVocab
Finalfusion Bucket Vocabulary.
-
__init__(words: List[str], indexer: ffp.subwords.hash_indexers.FinalfusionHashIndexer = None, index: Optional[Dict[str, int]] = None)
Initialize a FinalfusionBucketVocab.
Initializes the vocabulary with the given words and optional index and indexer.
If no indexer is passed, a FinalfusionHashIndexer with bucket exponent 21 is used.
If no index is given, the nth word in the words list is assigned index n. The word list must not contain duplicate entries, and it must have the same length as the index.
- Parameters
words (List[str]) – List of unique words
indexer (FinalfusionHashIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2^21 buckets and an n-gram range of 3-6.
index (Dict[str, int], optional) – Dictionary providing an entry -> index mapping.
- Raises
ValueError – If the lengths of index and words do not match.
AssertionError – If the indexer is not a FinalfusionHashIndexer.
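The rules for words and index described above can be sketched in plain Python. This is a minimal illustration and not the ffp implementation: the helper name build_word_index is hypothetical, and raising ValueError for duplicate words is an assumption, since the documentation only names ValueError for a length mismatch.

```python
from typing import Dict, List, Optional


def build_word_index(words: List[str],
                     index: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Hypothetical helper mirroring the documented rules for words and index."""
    # Assumption: duplicates are rejected with a ValueError in this sketch.
    if len(set(words)) != len(words):
        raise ValueError("words must not contain duplicate entries")
    if index is None:
        # No index given: the nth word in the list is assigned index n.
        index = {word: n for n, word in enumerate(words)}
    if len(index) != len(words):
        raise ValueError("words and index must have the same length")
    return index


print(build_word_index(["the", "quick", "fox"]))
# {'the': 0, 'quick': 1, 'fox': 2}
```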
-
static from_corpus(file: Union[str, bytes, int, os.PathLike], cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, indexer: Optional[ffp.subwords.hash_indexers.FinalfusionHashIndexer] = None) → Tuple[ffp.vocab.subword.FinalfusionBucketVocab, List[int]]
Build a Finalfusion Bucket Vocabulary from a corpus.
- Parameters
file (str, bytes, int, PathLike) – File with whitespace-separated tokens.
cutoff (Cutoff) – Frequency cutoff or target size to restrict vocabulary size. Defaults to a minimum frequency cutoff of 30.
indexer (FinalfusionHashIndexer) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2^21 buckets and an n-gram range of 3-6.
- Returns
(vocab, counts) – Tuple containing the Vocabulary as first item and counts of in-vocabulary items as the second item.
- Return type
Tuple[FinalfusionBucketVocab, List[int]]
- Raises
AssertionError – If the indexer is not a FinalfusionHashIndexer.
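As a rough picture of what building a vocabulary from a corpus with a minimum frequency cutoff involves, the sketch below counts whitespace-separated tokens and keeps those that reach the cutoff. It is not the ffp implementation: vocab_from_lines is a hypothetical helper, it reads an iterable of lines rather than a file, and ordering the result by descending frequency is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocab_from_lines(lines: Iterable[str],
                     min_freq: int = 30) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: return (words, counts) for tokens seen >= min_freq times."""
    # str.split() without arguments splits on any run of whitespace.
    counts = Counter(token for line in lines for token in line.split())
    # most_common() sorts by descending count (an ordering assumed here).
    kept = [(word, count) for word, count in counts.most_common()
            if count >= min_freq]
    words = [word for word, _ in kept]
    freqs = [count for _, count in kept]
    return words, freqs


words, freqs = vocab_from_lines(["a b a", "a b c"], min_freq=2)
print(words, freqs)
# ['a', 'b'] [3, 2]
```

The second item of the returned tuple lines up position by position with the word list, which matches the documented shape of (vocab, counts).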
-
to_explicit() → ffp.vocab.subword.ExplicitVocab
Returns a Vocabulary with explicit storage built from this vocab.
- Returns
explicit_vocab – The converted vocabulary.
- Return type
ExplicitVocab
-
write_chunk(file: BinaryIO)
Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
-
property subword_indexer
Get this vocab’s subword Indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
-
property words
Get the list of known words.
- Returns
words – list of known words
- Return type
List[str]
-
property word_index
Get the index of known words.
-
static read_chunk(file: BinaryIO) → ffp.vocab.subword.FinalfusionBucketVocab
Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
-
static chunk_identifier()
Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
-
__getitem__(item: str) → Union[int, List[int]]
Lookup the query item.
This method raises an exception if the vocab can’t provide indices.
- Parameters
item (str) – The query item
- Raises
KeyError – If no indices can be provided.
-
idx(item: str, default=None) → Union[List[int], int, None]
Lookup the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
int if there is a single index for a known item.
list of indices if the vocab can provide subword indices for an unknown item.
The default item if the vocab can’t provide indices.
- Return type
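The difference between idx and __getitem__ can be illustrated with a toy class. ToyVocab below is a hypothetical stand-in and not the ffp class: known words resolve to a single int, unknown items resolve to a list of subword indices taken from a hard-coded table, and everything else falls back to the default (idx) or raises KeyError (__getitem__).

```python
from typing import Dict, List, Optional, Union

Indices = Union[int, List[int]]


class ToyVocab:
    """Toy stand-in for the documented lookup behaviour (not the ffp class)."""

    def __init__(self, words: List[str], subword_table: Dict[str, List[int]]):
        self._index = {word: n for n, word in enumerate(words)}
        # Hypothetical table: unknown item -> precomputed subword indices.
        self._subword_table = subword_table

    def idx(self, item: str, default: Optional[Indices] = None) -> Optional[Indices]:
        if item in self._index:
            return self._index[item]        # single index for a known item
        subword_indices = self._subword_table.get(item, [])
        if subword_indices:
            return subword_indices          # list of indices for an unknown item
        return default                      # nothing available: fall back

    def __getitem__(self, item: str) -> Indices:
        result = self.idx(item)
        if result is None:
            raise KeyError(item)            # unlike idx, this lookup raises
        return result


vocab = ToyVocab(["the", "fox"], {"foxes": [7, 9]})
print(vocab.idx("fox"), vocab.idx("foxes"), vocab.idx("x", default=-1))
# 1 [7, 9] -1
```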
-
property idx_bound
The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
-
property max_n
Get the upper bound of the range of extracted n-grams.
- Returns
max_n – upper bound of n-gram range.
- Return type
-
property min_n
Get the lower bound of the range of extracted n-grams.
- Returns
min_n – lower bound of n-gram range.
- Return type
-
subword_indices(item: str, bracket: bool = True) → List[int]
Get the subword indices for the given item.
This list does not contain the index for known items.
- Parameters
item (str) – The query item.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
indices – The list of subword indices.
- Return type
List[int]
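To make the idea of bucketed subword indices concrete, the sketch below hashes a list of n-grams into 2^21 buckets. Several parts are assumptions: toy_bucket_indices is a hypothetical helper, zlib.crc32 is only a deterministic stand-in for the library's own hash function, and placing subword indices after the indices of the known words is assumed rather than taken from the documentation.

```python
import zlib
from typing import List


def toy_bucket_indices(ngrams: List[str], n_words: int,
                       buckets_exp: int = 21) -> List[int]:
    """Hypothetical helper: map each n-gram to one of 2**buckets_exp buckets."""
    n_buckets = 2 ** buckets_exp
    indices = []
    for ngram in ngrams:
        # Stand-in hash; the real indexer uses its own hash function.
        bucket = zlib.crc32(ngram.encode("utf-8")) % n_buckets
        # Assumption: subword indices start after the known-word indices.
        indices.append(n_words + bucket)
    return indices


indices = toy_bucket_indices(["<fo", "fox", "ox>"], n_words=3)
# Every index lies in [n_words, n_words + 2**21) under these assumptions.
print(all(3 <= i < 3 + 2 ** 21 for i in indices))
# True
```

Because different n-grams can hash to the same bucket, a scheme like this trades exactness for a fixed-size index space.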
-
subwords(item: str, bracket: bool = True) → List[str]
Get the n-grams of the given item as a list.
The n-gram range is determined by the min_n and max_n values.
- Parameters
item (str) – The query item to extract n-grams from.
bracket (bool) – Toggles bracketing the item with ‘<’ and ‘>’ before extraction.
- Returns
ngrams – List of n-grams.
- Return type
List[str]
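N-gram extraction with optional bracketing can be sketched as follows. The helper char_ngrams is hypothetical, its defaults of 3 and 6 follow the default range mentioned for the indexer, and the order in which the library returns n-grams may differ from the order produced here.

```python
from typing import List


def char_ngrams(item: str, min_n: int = 3, max_n: int = 6,
                bracket: bool = True) -> List[str]:
    """Hypothetical helper: all character n-grams with min_n <= n <= max_n."""
    if bracket:
        # Mark the word boundaries before extraction.
        item = f"<{item}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # An n-gram of length n can start at positions 0 .. len(item) - n.
        for start in range(len(item) - n + 1):
            ngrams.append(item[start:start + n])
    return ngrams


print(char_ngrams("fox"))
# ['<fo', 'fox', 'ox>', '<fox', 'fox>', '<fox>']
print(char_ngrams("fox", bracket=False))
# ['fox']
```

With bracketing, the prefix `<fo` and suffix `ox>` are distinct from the same character sequences occurring in the middle of a longer word.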
-
-
ffp.vocab.subword.load_finalfusion_bucket_vocab(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.FinalfusionBucketVocab
Load a FinalfusionBucketVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FinalfusionBucketVocab chunk.
- Returns
vocab – Returns the first FinalfusionBucketVocab in the file.
- Return type