FastTextVocab
-
class ffp.vocab.subword.FastTextVocab(words: List[str], indexer: ffp.subwords.hash_indexers.FastTextIndexer = None, index: Optional[Dict[str, int]] = None)
Bases: ffp.io.Chunk, ffp.vocab.subword.SubwordVocab
FastText vocabulary.
-
__init__(words: List[str], indexer: ffp.subwords.hash_indexers.FastTextIndexer = None, index: Optional[Dict[str, int]] = None)
Initialize a FastTextVocab.
Initializes the vocabulary with the given words and optional index and indexer.
If no indexer is passed, a FastTextIndexer with 2,000,000 buckets is used.
If no index is given, the nth word in the words list is assigned index n. The word list cannot contain duplicate entries and must have the same length as the index.
- Parameters
words (List[str]) – List of unique words
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2,000,000 buckets and an n-gram range of 3-6.
index (Dict[str, int], optional) – Dictionary providing an entry -> index mapping.
- Raises
ValueError – If the lengths of index and words do not match.
AssertionError – If the indexer is not a FastTextIndexer.
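The indexing rules described for the constructor can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's code: `build_word_index` is a hypothetical helper, and how ffp reports duplicate words is an assumption (here they surface as the documented ValueError).

```python
from typing import Dict, List, Optional


def build_word_index(words: List[str],
                     index: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Hypothetical helper mirroring the documented rules; not part of ffp."""
    if index is None:
        # Default: the nth word in the list is assigned index n.
        index = {word: i for i, word in enumerate(words)}
    # A duplicate word collapses into a single dict key, so with the
    # default index it also shows up here as a length mismatch.
    if len(index) != len(words):
        raise ValueError("words and index must have the same length")
    return index


print(build_word_index(["the", "cat", "sat"]))
```

Passing `["a", "a"]`, or two words together with a one-entry index, raises ValueError in this sketch.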
-
static from_corpus(file: Union[str, bytes, int, os.PathLike], cutoff: Optional[ffp.vocab.cutoff.Cutoff] = None, indexer: Optional[ffp.subwords.hash_indexers.FastTextIndexer] = None) → Tuple[ffp.vocab.subword.FastTextVocab, List[int]]
Build a fastText vocabulary from a corpus.
- Parameters
file (str, bytes, int, PathLike) – File with whitespace-separated tokens.
cutoff (Cutoff, optional) – Frequency cutoff or target size to restrict vocabulary size. Defaults to a minimum frequency cutoff of 30.
indexer (FastTextIndexer, optional) – Subword indexer to use for the vocabulary. Defaults to an indexer with 2,000,000 buckets and an n-gram range of 3-6.
- Returns
(vocab, counts) – Tuple containing the vocabulary as the first item and the counts of in-vocabulary items as the second item.
- Return type
Tuple[FastTextVocab, List[int]]
- Raises
AssertionError – If the indexer is not a FastTextIndexer.
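To make the returned (vocab, counts) pair concrete, the following sketch counts whitespace-separated tokens and applies a minimum-frequency cutoff. `count_with_min_freq` is a hypothetical helper, not an ffp function; it covers only the frequency-cutoff case, and ordering the result by descending count is an assumption of the sketch.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_with_min_freq(lines: Iterable[str],
                        min_freq: int = 30) -> Tuple[List[str], List[int]]:
    """Hypothetical helper; illustrates a minimum-frequency cutoff only."""
    # Tokens are split on any whitespace, as described for the corpus file.
    counts = Counter(token for line in lines for token in line.split())
    # Keep tokens seen at least `min_freq` times. Sorting by descending
    # count is an assumption of this sketch, not documented behaviour.
    kept = [(word, n) for word, n in counts.most_common() if n >= min_freq]
    words = [word for word, _ in kept]
    freqs = [n for _, n in kept]
    return words, freqs


words, freqs = count_with_min_freq(["a b a", "a c b"], min_freq=2)
print(words, freqs)
```

The two lists are parallel: `freqs[i]` is the corpus count of `words[i]`, which matches the shape of the documented return value.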
-
to_explicit() → ffp.vocab.subword.ExplicitVocab
Returns a Vocabulary with explicit storage built from this vocab.
- Returns
explicit_vocab – The converted vocabulary.
- Return type
ExplicitVocab
-
property subword_indexer
Get this vocab’s subword indexer.
The subword indexer produces indices for n-grams.
In case of bucket vocabularies, this is a hash-based indexer (FinalfusionHashIndexer, FastTextIndexer). For explicit subword vocabularies, this is an ExplicitIndexer.
- Returns
subword_indexer – The subword indexer of the vocabulary.
- Return type
FastTextIndexer
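As a rough illustration of how a hash-based indexer maps n-grams to bucket indices, the sketch below extracts character n-grams in the 3-6 range and hashes each into one of 2,000,000 buckets. Everything here is an assumption made for illustration: the helper names are hypothetical, the `<` and `>` word-boundary markers follow the usual fastText convention, and plain 32-bit FNV-1a stands in for the real hash, so the indices will not necessarily match those produced by FastTextIndexer.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Hypothetical helper: all n-grams of the bracketed word."""
    bracketed = f"<{word}>"
    ngrams = []
    for start in range(len(bracketed)):
        for n in range(min_n, max_n + 1):
            if start + n > len(bracketed):
                break  # longer n-grams would run past the end
            ngrams.append(bracketed[start:start + n])
    return ngrams


def fnv1a_32(data: bytes) -> int:
    """Plain 32-bit FNV-1a; a stand-in for the real indexer's hash."""
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket_index(ngram: str, n_buckets: int = 2_000_000) -> int:
    # Reduce the hash to a bucket number in [0, n_buckets).
    return fnv1a_32(ngram.encode("utf-8")) % n_buckets


for gram in char_ngrams("hi"):
    print(gram, bucket_index(gram))
```

Because distinct n-grams can land in the same bucket, a bucket vocabulary trades exactness for a fixed-size table; an explicit indexer would instead store each n-gram with its own index.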
-
property words
Get the list of known words.
- Returns
words – List of known words.
- Return type
List[str]
-
property word_index
Get the index of known words.
-
static read_chunk(file: BinaryIO) → ffp.vocab.subword.FastTextVocab
Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – A finalfusion file containing the given Chunk.
- Returns
chunk – The chunk read from the file.
- Return type
FastTextVocab
-
write_chunk(file: BinaryIO)
Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
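The positioning requirement of read_chunk (after the header, before the contents) can be illustrated with a small framing sketch. The layout used here is an assumption for illustration only: a 4-byte identifier followed by an 8-byte payload length, both little-endian, with a placeholder identifier value. The function names are hypothetical and none of this is ffp's actual reader or writer.

```python
import io
import struct

# Assumed layout for this sketch: 4-byte identifier, then 8-byte payload
# length, both little-endian. The identifier value is a placeholder.
HEADER = struct.Struct("<IQ")
PLACEHOLDER_ID = 0


def write_framed(file, payload: bytes) -> None:
    """Write the assumed header followed by the payload."""
    file.write(HEADER.pack(PLACEHOLDER_ID, len(payload)))
    file.write(payload)


def read_header(file):
    """Consume the header and return (identifier, payload length)."""
    return HEADER.unpack(file.read(HEADER.size))


def read_payload(file, length: int) -> bytes:
    # The caller must already have consumed the header, which mirrors
    # the positioning requirement described for read_chunk.
    return file.read(length)


buf = io.BytesIO()
write_framed(buf, b"vocab bytes")
buf.seek(0)
chunk_id, length = read_header(buf)
payload = read_payload(buf, length)
print(chunk_id, length, payload)
```

Splitting header and payload reading this way lets a caller inspect the identifier first and then dispatch to the matching reader.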
-
-
ffp.vocab.subword.load_fasttext_vocab(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.subword.FastTextVocab
Load a FastTextVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a FastTextVocab chunk.
- Returns
vocab – The first FastTextVocab in the file.
- Return type
FastTextVocab