SimpleVocab
-
class ffp.vocab.simple_vocab.SimpleVocab(words: List[str], index: Optional[Dict[str, int]] = None)
Bases: ffp.io.Chunk, ffp.vocab.vocab.Vocab
Simple vocabulary.
A SimpleVocab provides a string-to-index mapping and the inverse index-to-string mapping. SimpleVocab is also the base type of other vocabulary types.
-
__init__(words: List[str], index: Optional[Dict[str, int]] = None)
Initialize a SimpleVocab.
Initializes the vocabulary with the given words and optional index. If no index is given, the nth word in the words list is assigned index n. The word list must not contain duplicate entries and must have the same length as the index.
- Parameters
words (List[str]) – List of unique words
index (Optional[Dict[str, int]]) – Dictionary providing an entry -> index mapping.
- Raises
ValueError – if the lengths of words and index do not match.
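The initialization rules above can be sketched without the library. `TinyVocab` below is a hypothetical stand-in written for illustration, not the `ffp` implementation; it only mirrors the documented behaviour: default indices follow list position, and a length mismatch raises `ValueError`.

```python
from typing import Dict, List, Optional


class TinyVocab:
    """Illustrative stand-in for the documented behaviour; not ffp's SimpleVocab."""

    def __init__(self, words: List[str], index: Optional[Dict[str, int]] = None):
        if index is None:
            # Default rule from the text: the nth word gets index n.
            index = {word: n for n, word in enumerate(words)}
        # When the index is derived from the words, duplicate words collapse
        # into one dict key, so this length check also rejects duplicates.
        if len(index) != len(words):
            raise ValueError("words and index must have the same length")
        self.words = words
        self.word_index = index


vocab = TinyVocab(["the", "cat", "sat"])
print(vocab.word_index)
```

In this sketch, passing `["a", "a"]` without an index fails the length check, because the derived dictionary has only one key.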
-
static from_corpus(file: Union[str, bytes, int, os.PathLike], cutoff: ffp.vocab.cutoff.Cutoff = Cutoff(30, 'min_freq'))
Construct a simple vocabulary from the given corpus.
- Parameters
file (str, bytes, int, PathLike) – Path to corpus file
cutoff (Cutoff) – Frequency cutoff or target size to restrict vocabulary size.
- Returns
(vocab, counts) – Tuple containing the Vocabulary as first item and counts of in-vocabulary items as the second item.
- Return type
Tuple[SimpleVocab, List[int]]
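To make the idea of a minimum-frequency cutoff concrete, here is a minimal sketch in plain Python. The helper `vocab_from_lines`, the whitespace tokenization, and the ordering (most frequent first, ties broken alphabetically) are all assumptions made for this example; the documentation above does not specify how `from_corpus` tokenizes or orders words, and the target-size mode of `Cutoff` is not shown.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocab_from_lines(lines: Iterable[str], min_freq: int) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: count whitespace tokens and keep the frequent ones."""
    counts = Counter(token for line in lines for token in line.split())
    # Keep tokens seen at least min_freq times, most frequent first;
    # ties are broken alphabetically so the result is deterministic.
    kept = sorted(
        ((word, count) for word, count in counts.items() if count >= min_freq),
        key=lambda pair: (-pair[1], pair[0]),
    )
    words = [word for word, _ in kept]
    freqs = [count for _, count in kept]
    return words, freqs


corpus = ["the cat sat", "the cat ran", "the dog sat"]
words, freqs = vocab_from_lines(corpus, min_freq=2)
print(words, freqs)
```

As with the documented return value, the counts list is aligned with the word list: `freqs[i]` is the count of `words[i]`.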
-
property word_index
Get the index of known words.
-
property words
Get the list of known words.
- Returns
words – list of known words
- Return type
List[str]
-
property idx_bound
The exclusive upper bound of indices in this vocabulary.
- Returns
idx_bound – Exclusive upper bound of indices covered by the vocabulary.
- Return type
int
-
static read_chunk(file: BinaryIO) → ffp.vocab.simple_vocab.SimpleVocab
Read the Chunk and return it.
The file must be positioned before the contents of the Chunk but after its header.
- Parameters
file (BinaryIO) – a finalfusion file containing the given Chunk
- Returns
chunk – The chunk read from the file.
- Return type
SimpleVocab
-
write_chunk(file: BinaryIO)
Write the Chunk to a file.
- Parameters
file (BinaryIO) – Output file for the Chunk
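Both read_chunk and write_chunk operate on an already open binary file, and reading starts from the current file position. The sketch below illustrates that contract with a made-up layout: a 4-byte placeholder header, a word count, then length-prefixed UTF-8 words. The layout and the helpers `write_words` and `read_words` are assumptions for this example and should not be taken as the finalfusion on-disk format.

```python
import io
import struct
from typing import BinaryIO, List


def write_words(file: BinaryIO, words: List[str]) -> None:
    """Hypothetical layout: a word count, then length-prefixed UTF-8 words."""
    file.write(struct.pack("<Q", len(words)))
    for word in words:
        data = word.encode("utf-8")
        file.write(struct.pack("<I", len(data)))
        file.write(data)


def read_words(file: BinaryIO) -> List[str]:
    """Read back what write_words wrote, starting at the current position."""
    (n_words,) = struct.unpack("<Q", file.read(8))
    words = []
    for _ in range(n_words):
        (n_bytes,) = struct.unpack("<I", file.read(4))
        words.append(file.read(n_bytes).decode("utf-8"))
    return words


buffer = io.BytesIO()
buffer.write(b"HEAD")  # made-up 4-byte header standing in for a chunk header
write_words(buffer, ["the", "cat"])
buffer.seek(4)  # skip the header so the position is at the contents
restored = read_words(buffer)
print(restored)
```

If the reader started at offset 0 instead, it would interpret the header bytes as part of the word count, which is why the position requirement matters.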
-
static chunk_identifier()
Get the ChunkIdentifier for this Chunk.
- Returns
chunk_identifier
- Return type
ChunkIdentifier
-
idx(item, default=None)
Look up the given query item.
This lookup does not raise an exception if the vocab can’t produce indices.
- Parameters
item (str) – The query item.
default (Optional[Union[int, List[int]]]) – Fall-back value to return if the vocab can’t provide indices.
- Returns
index –
int if there is a single index for a known item.
list of indices if the vocab can provide subword indices for an unknown item.
The default item if the vocab can’t provide indices.
- Return type
Optional[Union[int, List[int]]]
-
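The fall-back behaviour of idx can be sketched with a plain dictionary. The function `lookup` below is a hypothetical stand-in, not the library method. A plain word-to-index mapping has no subword indices, so only two of the three documented outcomes appear here: the stored int for a known item, or the default for an unknown one.

```python
from typing import Dict, List, Optional, Union

Index = Union[int, List[int]]


def lookup(word_index: Dict[str, int], item: str, default: Optional[Index] = None) -> Optional[Index]:
    """Hypothetical stand-in: return the stored index or the fall-back value."""
    # dict.get never raises for a missing key, matching the no-exception contract.
    return word_index.get(item, default)


word_index = {"the": 0, "cat": 1}
known = lookup(word_index, "cat")
missing = lookup(word_index, "dog")
fallback = lookup(word_index, "dog", default=-1)
print(known, missing, fallback)
```

Since the default is `None` unless the caller overrides it, callers that need to tell "unknown" apart from a valid index should test the result with `is None` rather than truthiness, because index 0 is falsy.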
-
ffp.vocab.simple_vocab.load_simple_vocab(file: Union[str, bytes, int, os.PathLike]) → ffp.vocab.simple_vocab.SimpleVocab
Load a SimpleVocab from the given finalfusion file.
- Parameters
file (str, bytes, int, PathLike) – Path to file containing a SimpleVocab chunk.
- Returns
vocab – Returns the first SimpleVocab in the file.
- Return type
SimpleVocab