API Reference

dotjson

Classes

Vocabulary

A vocabulary for constructing an index.

Class Variables

max_token_id Returns the largest token id in the Vocabulary.

This is used for bounds checking in the logits processor.

eos_token_ids Returns the EOS (end of sequence) TokenIds for the Vocabulary.

This can be useful for checking termination. In order to support general tokenizers and advanced usage, we allow for multiple EOS tokens.
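As a rough illustration of how these two values might be used in a sampling loop, the sketch below treats them as plain Python values. The helper names `is_valid_token_id` and `is_finished` are hypothetical and not part of dotjson, the numbers are made up, and the sketch assumes token ids start at 0.

```python
def is_valid_token_id(token_id: int, max_token_id: int) -> bool:
    """Bounds check; assumes ids run from 0 to max_token_id inclusive."""
    return 0 <= token_id <= max_token_id


def is_finished(token_id: int, eos_token_ids: list[int]) -> bool:
    """A sequence terminates when the sampled token is any of the EOS ids."""
    return token_id in set(eos_token_ids)


# In real use these would be read from vocabulary.max_token_id and
# vocabulary.eos_token_ids; the values below are illustrative only.
max_token_id = 50256
eos_token_ids = [50256, 0]

print(is_valid_token_id(50257, max_token_id))
print(is_finished(0, eos_token_ids))
```

Checking membership against the whole list, rather than a single id, is what lets the same loop work with tokenizers that define more than one EOS token.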

Methods

`from_pretrained`

Create a Vocabulary from a pretrained Hugging Face model.

vocabulary = Vocabulary.from_pretrained(model, revision)

Parameters

model (str): The name of the tokenizer in the Hugging Face index.
revision (str, optional): The specific revision of the tokenizer on the hf-hub, with default value of empty string.
user_agent_first (list[str], optional): The user agent info keys for Hugging Face, in the form of a list.
user_agent_second (list[str], optional): The user agent info values for Hugging Face, in the form of a list. If provided, must be the same length as user_agent_first.
auth_token (str, optional): A valid user access token for hf-hub, with default value of empty string.
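Because user_agent_first and user_agent_second are parallel lists that must have the same length, it may be convenient to keep the information in a dictionary and split it just before the call. The helper `split_user_agent` and the dictionary contents below are hypothetical. The final call is left as a comment because it needs dotjson and network access, and passing these parameters by keyword is an assumption.

```python
def split_user_agent(info: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split a mapping into aligned key and value lists of equal length."""
    keys = list(info.keys())
    values = [info[key] for key in keys]
    return keys, values


# Illustrative user agent info; the keys and values are made up.
user_agent_first, user_agent_second = split_user_agent(
    {"application": "my-app", "version": "0.1.0"}
)

# Assumed usage (not executed here):
# vocabulary = Vocabulary.from_pretrained(
#     "gpt2",
#     user_agent_first=user_agent_first,
#     user_agent_second=user_agent_second,
# )
```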

`from_tuples`

A low-level, unchecked constructor that builds a serializable Vocabulary object from a set of (Token, TokenId) pairs.

vocabulary = Vocabulary.from_tuples(tokens, token_ids, eos_token_ids)

Parameters

tokens (list[bytes]): Token sequences; each item is a Python bytes representing one token (e.g., UTF-8-encoded).
token_ids (list[int]): Integer token IDs aligned with tokens (same length as tokens).
eos_token_ids (list[int]): Token IDs treated as end-of-sequence markers.
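Since this constructor is unchecked, it can help to validate the inputs before calling it. The sketch below prepares the three aligned arguments from a token-to-id mapping. The helper `to_tuple_args`, the toy vocabulary, and the specific checks (non-negative, unique ids) are assumptions made for illustration, not requirements stated by dotjson.

```python
def to_tuple_args(
    token_to_id: dict[bytes, int], eos_tokens: list[bytes]
) -> tuple[list[bytes], list[int], list[int]]:
    """Build aligned (tokens, token_ids, eos_token_ids) lists, sorted by id."""
    tokens: list[bytes] = []
    token_ids: list[int] = []
    seen: set[int] = set()
    for token, token_id in sorted(token_to_id.items(), key=lambda kv: kv[1]):
        if token_id < 0:
            raise ValueError(f"negative token id: {token_id}")
        if token_id in seen:
            raise ValueError(f"duplicate token id: {token_id}")
        seen.add(token_id)
        tokens.append(token)
        token_ids.append(token_id)
    # Raises KeyError if an EOS token is missing from the mapping.
    eos_token_ids = [token_to_id[token] for token in eos_tokens]
    return tokens, token_ids, eos_token_ids


# A toy vocabulary; tokens are bytes, here UTF-8 encoded.
toy = {b"<eos>": 0, b"{": 1, b"}": 2, "é".encode("utf-8"): 3}
tokens, token_ids, eos_token_ids = to_tuple_args(toy, [b"<eos>"])

# Assumed usage (not executed here):
# vocabulary = Vocabulary.from_tuples(tokens, token_ids, eos_token_ids)
```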

`serialize_to_disk`

Serialize the Vocabulary object to disk.

Parameters

path (str): Destination path for writing the Vocabulary.

`deserialize_from_disk`

Load the Vocabulary from a file.

Parameters

path (str): Source path for loading the Vocabulary.

Returns

The deserialized Vocabulary.
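One possible use of this pair of methods is a simple on-disk cache: load the object if the file exists, otherwise build it once and save it. The helper `load_or_create` below is hypothetical and only relies on the two method names documented here. It assumes `deserialize_from_disk` can be called on the class itself, which this reference does not state explicitly.

```python
import os
from typing import Any, Callable


def load_or_create(path: str, cls: Any, create: Callable[[], Any]) -> Any:
    """Return a cached object from `path`, or build and cache a new one."""
    if os.path.exists(path):
        # Assumption: deserialize_from_disk is callable on the class.
        return cls.deserialize_from_disk(path)
    obj = create()
    obj.serialize_to_disk(path)
    return obj


# Assumed usage (not executed here; the file name is illustrative):
# vocabulary = load_or_create(
#     "vocabulary.bin",
#     Vocabulary,
#     lambda: Vocabulary.from_pretrained("gpt2"),
# )
```

The same pattern would apply to any class that exposes both methods with these names.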

Index

An Index to perform structured generation over a schema.

index = Index(schema, vocabulary, disallow_whitespace=False, device="cpu")

Parameters

schema (str): JSON Schema for structured generation.
vocabulary (Vocabulary): A Vocabulary object encoding information about the model’s tokenizer.
disallow_whitespace (bool, optional): Whether to prevent the model from generating extra whitespace (such as spaces after commas or line breaks after {), with default value of False.
compliance_mode (bool, optional): Use the JSON Schema specification defaults for things like additionalProperties. Defaults to False, which uses the dottxt extension to the schema specification.
device (Device, optional): The Device to use, with default value of Device(cpu).

Note Using disallow_whitespace=True may cause unanticipated model performance issues, as it disables formatting that may be natural for the model. On the other hand, disallow_whitespace=True will produce outputs with better formatting.

Note Our implementation of JSON Schema enforces additionalProperties: false. This means that creating an Index from the empty schema { } results in an error, because no output would be allowed.
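Since schema is passed as a string, one way to prepare it is to write the JSON Schema as a Python dictionary and serialize it with the standard `json` module. The sketch below also uses `json.dumps` to show roughly what "extra whitespace" looks like in practice. The schema and record are made up, and the two formatted strings only approximate the styles involved; they are not output from dotjson.

```python
import json

# A small, non-empty schema with explicitly listed properties.
schema_dict = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}
schema = json.dumps(schema_dict)

# Assumed usage (not executed here):
# index = Index(schema, vocabulary, disallow_whitespace=False)

# The same record with and without optional whitespace.
record = {"name": "Ada", "age": 36}
with_whitespace = json.dumps(record)
without_whitespace = json.dumps(record, separators=(",", ":"))

print(with_whitespace)     # {"name": "Ada", "age": 36}
print(without_whitespace)  # {"name":"Ada","age":36}
```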

Class Variables

device Gets the device.

Methods

`serialize_to_disk`

Saves the Index to a file.

Parameters

path (str): Destination path for writing the Index.

`deserialize_from_disk`

Loads the Index from a file.

Parameters

path (str): Source path for loading the Index.
device (Device, optional): The Device to use, with default value of Device(cpu).

`to_device`

Moves the Index to the given Device.

Parameters

device (Device, optional): The Device to use, with default value of Device(cpu).

TokenSet

A structure for representing and checking collections of tokens.

Class Variables

device Gets the device.

Methods

`contains`

Check whether token_id is in the TokenSet.

Parameters

token_id (int): A token id to check.

Returns

bool

TokenSetBatch

A class containing the information needed for the LogitsProcessor to efficiently mask a batch of logits vectors.

Class Variables

batch_size Gets the batch size.

device Gets the device.

Methods

`contains`

Returns a vector of booleans indicating whether a particular token is marked as allowed in each token set of the batch.

Parameters

token_id (int): A token id to check.

Returns

list[bool]: A vector of booleans with length equal to the batch size, where each element indicates if the token is allowed in the corresponding batch element

`len`

Gets list of lengths for the token sets in the batch.
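To make the per-batch semantics of `contains` and `len` concrete, here is a toy stand-in built from ordinary Python sets. `ToyTokenSetBatch` is hypothetical and says nothing about how dotjson stores token sets internally; it only mirrors the documented return shapes, with one entry per batch element.

```python
class ToyTokenSetBatch:
    """Illustrative stand-in: one set of allowed token ids per batch element."""

    def __init__(self, allowed_per_sequence: list[list[int]]) -> None:
        self._sets = [frozenset(ids) for ids in allowed_per_sequence]

    @property
    def batch_size(self) -> int:
        return len(self._sets)

    def contains(self, token_id: int) -> list[bool]:
        # One boolean per batch element, in batch order.
        return [token_id in allowed for allowed in self._sets]

    def len(self) -> list[int]:
        # One length per batch element, in batch order.
        return [len(allowed) for allowed in self._sets]


batch = ToyTokenSetBatch([[1, 2, 3], [3], []])
print(batch.contains(3))  # [True, True, False]
print(batch.len())        # [3, 1, 0]
```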

Guide

A class for producing a set of allowed tokens with respect to an Index.

guide = Guide(index, batch_size=1)

Parameters

index (Index): The Index used for developing the mask.
batch_size (int, optional): The batch size for the logits updates, with default value of 1.

Class Variables

device Device the guide is on.

Methods

`get_start_tokensets`

Retrieve the set of allowed start tokens from the Guide.

`get_next_tokensets`

Get the next set of valid tokens.

Parameters

token_ids (list[int]): The vector of TokenIds that have just been sampled.

Raises This will throw an exception when the wrong number of tokens are given or when at least one of the tokens does not come from the allowed token set.
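The two methods above suggest a loop of the following shape: fetch the start token sets, pick an allowed token, then feed it back to obtain the next token sets. The sketch below handles a single sequence (batch_size=1). The function `run_guide` and the `choose_token` callback are hypothetical, the return values of the Guide methods are treated as opaque objects, and stopping before passing an EOS token back to the Guide is an assumption.

```python
from typing import Any, Callable


def run_guide(
    guide: Any,
    choose_token: Callable[[Any], int],
    eos_token_ids: list[int],
    max_steps: int = 64,
) -> list[int]:
    """Drive a Guide-like object for one sequence and return the token ids."""
    tokensets = guide.get_start_tokensets()
    generated: list[int] = []
    for _ in range(max_steps):
        # choose_token must return a token id allowed by `tokensets`,
        # otherwise get_next_tokensets is documented to raise.
        token_id = choose_token(tokensets)
        generated.append(token_id)
        if token_id in eos_token_ids:
            break
        # One sampled token per batch element; here the batch size is 1.
        tokensets = guide.get_next_tokensets([token_id])
    return generated
```

In a real pipeline `choose_token` would typically sample from model logits restricted to the allowed tokens, which is the role the LogitsProcessor plays.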

LogitsProcessor

A logits processor that modifies logits arrays in place for structured generation.

Create a new LogitsProcessor:

logits_processor = LogitsProcessor(guide: Guide, mask_value: Any, dtype: Optional[str] = None)

Parameters

guide (Guide): The Guide object used to produce the token sets.
mask_value (Any): The value with which to mask logits values.
dtype (str, optional): The dtype of the logits, either float32 or bfloat16, with default value of None.

Class Variables

device (Device)

vocabulary_size (int)

Methods

`update_logits`

Adaptively compute the mask and apply it in place to the logits array.

`update_logits_unchecked`

Adaptively compute the mask and apply it in place to the logits array, without checking the types of the logit values.
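To illustrate what masking logits in place means, the following sketch overwrites every disallowed position of a plain Python list with a mask value. The function `mask_logits_in_place` is hypothetical and is not dotjson's implementation, which operates on logits arrays and derives the allowed tokens from the Guide; the logits values here are made up.

```python
def mask_logits_in_place(
    logits: list[float],
    allowed_token_ids: list[int],
    mask_value: float = float("-inf"),
) -> None:
    """Overwrite logits of disallowed tokens; the list is modified in place."""
    allowed = set(allowed_token_ids)
    for token_id in range(len(logits)):
        if token_id not in allowed:
            logits[token_id] = mask_value


logits = [0.1, 2.0, -1.0, 0.5]
mask_logits_in_place(logits, allowed_token_ids=[0, 3])

# Greedy choice after masking: only allowed tokens can win.
best = max(range(len(logits)), key=lambda i: logits[i])
print(logits)  # [0.1, -inf, -inf, 0.5]
print(best)    # 3
```

With a mask value of negative infinity, a subsequent softmax assigns zero probability to masked tokens, so sampling can only select tokens from the allowed set.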

Device

A device (CPU or GPU) used for structured generation.

Class Variables

Methods

`cpu`

Convenience constructor for Device("cpu").

Installation