API Reference

dotjson

Classes

Vocabulary

A vocabulary for constructing an index.

Class Variables

max_token_id Returns the largest token id in the Vocabulary.

This is used for bounds checking in the logits processor.

eos_token_ids Returns the EOS (end of sequence) TokenIds for the Vocabulary.

This can be useful for checking termination. In order to support general tokenizers and advanced usage, we allow for multiple EOS tokens.
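
As a sketch of the termination check mentioned above: because a vocabulary can carry several EOS ids, a set membership test is the natural check. The ids below are made-up placeholders, and the helper names are hypothetical; in practice the ids would come from vocabulary.eos_token_ids.

```python
# Placeholder ids for illustration only; real values would come from
# vocabulary.eos_token_ids on a constructed Vocabulary.
eos_token_ids = [2, 32000]


def is_finished(token_id: int, eos_ids: list[int]) -> bool:
    """Return True when the sampled token is any of the EOS tokens."""
    return token_id in set(eos_ids)


def truncate_at_eos(token_ids: list[int], eos_ids: list[int]) -> list[int]:
    """Keep tokens up to, but not including, the first EOS token."""
    eos = set(eos_ids)
    kept = []
    for token_id in token_ids:
        if token_id in eos:
            break
        kept.append(token_id)
    return kept
```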

Methods

from_pretrained

Create a Vocabulary from a pretrained Hugging Face model.

vocabulary = Vocabulary.from_pretrained(model, revision)

Parameters

  • model (str): The name of the tokenizer in the Hugging Face index.
  • revision (str, optional): The specific revision of the tokenizer on the hf-hub, with default value of empty string.
  • user_agent_first (list[str], optional): The user agent info keys for Hugging Face, in the form of a list.
  • user_agent_second (list[str], optional): The user agent info values for Hugging Face, in the form of a list. If provided, must be the same length as user_agent_first.
  • auth_token (str, optional): A valid user access token for hf-hub, with default value of empty string.
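
The two user-agent parameters are parallel lists, so it can be convenient to keep the information in a dict and split it just before the call. The helper split_user_agent and the example keys are hypothetical. The from_pretrained call appears only as a comment; passing these parameters by keyword, and the placeholder model name, are assumptions based on the parameter list above.

```python
def split_user_agent(info: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split a dict into parallel key and value lists of equal length."""
    keys = list(info.keys())
    values = [info[key] for key in keys]
    return keys, values


# Hypothetical user agent info, for illustration only.
user_agent_first, user_agent_second = split_user_agent(
    {"application": "my-app", "version": "0.1.0"}
)

# Assumed usage, based on the parameter list above:
# vocabulary = Vocabulary.from_pretrained(
#     "<model name>",
#     user_agent_first=user_agent_first,
#     user_agent_second=user_agent_second,
# )
```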

from_tuples

A low-level, unchecked constructor that builds a serializable Vocabulary object from a set of (Token, TokenId) pairs.

vocabulary = Vocabulary.from_tuples(tokens, token_ids, eos_token_ids)

Parameters

  • tokens (list[bytes]): Token sequences; each item is a Python bytes object representing one token (e.g., UTF-8-encoded).
  • token_ids (list[int]): Integer token IDs aligned with tokens (same length as tokens).
  • eos_token_ids (list[int]): Token IDs treated as end-of-sequence markers.
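
Because from_tuples is unchecked, it may be worth validating the inputs yourself before calling it. The sketch below builds the three arguments from a toy token-to-id mapping. The helper name build_tuples, the toy vocabulary, and the duplicate-id check are illustrative assumptions, not part of dotjson.

```python
def build_tuples(
    token_to_id: dict[str, int], eos_tokens: list[str]
) -> tuple[list[bytes], list[int], list[int]]:
    """Build aligned (tokens, token_ids, eos_token_ids) lists.

    Tokens are UTF-8 encoded to bytes, and ids are checked for
    duplicates because the constructor itself performs no checks.
    """
    tokens = [token.encode("utf-8") for token in token_to_id]
    token_ids = list(token_to_id.values())
    if len(set(token_ids)) != len(token_ids):
        raise ValueError("token ids must be unique")
    eos_token_ids = [token_to_id[token] for token in eos_tokens]
    return tokens, token_ids, eos_token_ids


# Toy vocabulary for illustration only.
toy = {"{": 0, "}": 1, '"a"': 2, ":": 3, "1": 4, "</s>": 5}
tokens, token_ids, eos_token_ids = build_tuples(toy, ["</s>"])

# With dotjson installed, these would be passed on as:
# vocabulary = Vocabulary.from_tuples(tokens, token_ids, eos_token_ids)
```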

serialize_to_disk

Serialize the Vocabulary object to disk.

Parameters

  • path (str): Destination path for writing the Vocabulary.

deserialize_from_disk

Load the Vocabulary from a file.

Parameters

  • path (str): Source path for loading the Vocabulary.

Returns

  • The deserialized Vocabulary.

Index

An Index to perform structured generation over a schema.

index = Index(schema, vocabulary, disallow_whitespace=False, device="cpu")

Parameters

  • schema (str): JSON Schema for structured generation.
  • vocabulary (Vocabulary): A Vocabulary object encoding information about the model’s tokenizer.
  • disallow_whitespace (bool, optional): Controls whether the model may generate extra whitespace (such as spaces after commas or line breaks after {). Set to True to disallow it. Defaults to False.
  • compliance_mode (bool, optional): Use the JSON Schema specification defaults for things like additionalProperties. Defaults to False, which uses the dottxt extension to the schema specification.
  • device (Device, optional): The Device to use, with default value of Device(cpu).

Note Using disallow_whitespace=True may cause unanticipated model performance issues, as it disables formatting that may be natural for the model. On the other hand, disallow_whitespace=True will produce outputs with better formatting.

Note By default, our implementation of JSON Schema enforces additionalProperties: false. This means that building an Index from the empty schema { } results in an error, because no output would be allowed.
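
Since schema is passed as a string, one convenient way to produce it is to build a dict and serialize it with json.dumps. The schema contents below are an arbitrary example, and the Index call is shown only as a comment, using the signature given above.

```python
import json

# An arbitrary example schema. Per the note above, properties that are
# not listed here would not be allowed in the generated output.
schema_dict = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Index expects the schema as a str, not a dict.
schema = json.dumps(schema_dict)

# With dotjson installed and a Vocabulary in hand:
# index = Index(schema, vocabulary, disallow_whitespace=False, device="cpu")
```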

Class Variables

device Gets the device.

Methods

serialize_to_disk

Saves the Index to a file.

Parameters

  • path (str): Destination path for writing the Index.

deserialize_from_disk

Loads the Index from a file.

Parameters

  • path (str): Source path for loading the Index.
  • device (Device, optional): The Device to use, with default value of Device(cpu).

to_device

Sends the Index to a Device.

Parameters

  • device (Device, optional): The Device to use, with default value of Device(cpu).

TokenSet

A structure for representing and checking collections of tokens.

Class Variables

device Gets the device.

Methods

contains

Check whether token_id is in the TokenSet.

Parameters

  • token_id (int): A token id to check.

Returns

  • bool

TokenSetBatch

A class containing the information needed for the LogitsProcessor to efficiently mask a batch of logits vectors.

Class Variables

batch_size Gets the batch size.

device Gets the device.

Methods

contains

Returns a vector of booleans indicating whether a particular token is marked as allowed in each token set of the batch.

Parameters

  • token_id (int): A token id to check.

Returns

  • list[bool]: A vector of booleans with length equal to the batch size, where each element indicates if the token is allowed in the corresponding batch element

len

Gets the list of lengths for the token sets in the batch.
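
To illustrate the return shapes, here is a plain-Python stand-in that mimics the documented semantics with one set of allowed ids per batch element. FakeTokenSetBatch and its lengths method are hypothetical names for illustration; they say nothing about how dotjson stores token sets internally or how len is exposed.

```python
class FakeTokenSetBatch:
    """Hypothetical stand-in: one set of allowed token ids per batch element."""

    def __init__(self, allowed_per_element: list[set[int]]):
        self._allowed = allowed_per_element

    @property
    def batch_size(self) -> int:
        return len(self._allowed)

    def contains(self, token_id: int) -> list[bool]:
        # One boolean per batch element, in batch order.
        return [token_id in allowed for allowed in self._allowed]

    def lengths(self) -> list[int]:
        # Number of allowed tokens in each element's token set.
        return [len(allowed) for allowed in self._allowed]


batch = FakeTokenSetBatch([{0, 3}, {3}, set()])
print(batch.contains(3))  # [True, True, False]
print(batch.lengths())    # [2, 1, 0]
```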

Guide

A class for producing a set of allowed tokens with respect to an Index.

guide = Guide(index, batch_size=1)

Parameters

  • index (Index): The Index used for developing the mask.
  • batch_size (int, optional): The batch size for the logits updates, with default value of 1.

Class Variables

device Device the guide is on.

Methods

get_start_tokensets

Retrieve the set of allowed start tokens from the Guide.

get_next_tokensets

Get the next set of valid tokens.

Parameters

  • token_ids (list[int]): The vector of TokenIds that have just been sampled.

Raises

  • An exception when the wrong number of tokens is given or when at least one of the tokens does not come from the allowed token set.
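
The two methods are typically used together in a sampling loop: ask for the start token sets, pick an allowed token, then feed it back to get the next sets. The sketch below shows that flow for batch_size=1 against a duck-typed guide. FakeGuide, FakeTokenSets, and the greedy generate function are hypothetical stand-ins so the example runs without dotjson. It assumes the returned token sets expose contains(token_id) returning list[bool], as described for TokenSetBatch.

```python
class FakeTokenSets:
    """Hypothetical stand-in for the token sets returned by a Guide."""

    def __init__(self, allowed: set[int]):
        self._allowed = allowed

    def contains(self, token_id: int) -> list[bool]:
        return [token_id in self._allowed]  # batch_size == 1


class FakeGuide:
    """Hypothetical guide that only accepts the sequence 0, 1, 2 (2 = EOS)."""

    def __init__(self):
        self._step = 0

    def get_start_tokensets(self) -> FakeTokenSets:
        return FakeTokenSets({0})

    def get_next_tokensets(self, token_ids: list[int]) -> FakeTokenSets:
        # Mirrors the documented failure modes: wrong count or disallowed token.
        if len(token_ids) != 1 or token_ids[0] != self._step:
            raise ValueError("token not in the allowed set")
        self._step += 1
        return FakeTokenSets({self._step})


def generate(guide, vocab_size: int, eos_ids: set[int], max_steps: int) -> list[int]:
    """Greedy loop: always take the lowest allowed token id."""
    out = []
    token_sets = guide.get_start_tokensets()
    for _ in range(max_steps):
        allowed = [t for t in range(vocab_size) if token_sets.contains(t)[0]]
        token = allowed[0]  # a real sampler would use the masked logits
        out.append(token)
        if token in eos_ids:
            break
        token_sets = guide.get_next_tokensets([token])
    return out


print(generate(FakeGuide(), vocab_size=3, eos_ids={2}, max_steps=10))  # [0, 1, 2]
```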

LogitsProcessor

A logits processor that modifies logits arrays in place for structured generation.

Create a new LogitsProcessor:

logits_processor = LogitsProcessor(guide, mask_value, dtype=None)

Parameters

  • guide (Guide): The Guide object used to produce the token sets.
  • mask_value: The value with which to mask logits values.
  • dtype (str, optional): The dtype of the logits, either float32 or bfloat16, with default value of None.

Class Variables

device (Device)

vocabulary_size (int)

Methods

update_logits

Adaptively compute the mask and apply it in place to the logits array.

update_logits_unchecked

Adaptively compute the mask and apply it in place to the logits array, without checking the types of the logit values.
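
Conceptually, in-place masking overwrites the logits of disallowed tokens with mask_value so they cannot be sampled. The function below is a plain-Python illustration of that idea on a list of floats. It is not dotjson's implementation; the name mask_in_place and the use of negative infinity as the mask value are assumptions made for the example.

```python
import math


def mask_in_place(logits: list[float], allowed: set[int], mask_value: float) -> None:
    """Overwrite every disallowed position with mask_value, mutating logits."""
    for token_id in range(len(logits)):
        if token_id not in allowed:
            logits[token_id] = mask_value


logits = [0.5, 2.0, -1.0, 0.1]
mask_in_place(logits, allowed={1, 3}, mask_value=-math.inf)
print(logits)  # [-inf, 2.0, -inf, 0.1]
```

Because the list is mutated rather than copied, the caller's logits object reflects the mask as soon as the call returns.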

Device

A device (CPU or GPU) used for structured generation.

Class Variables

Methods

cpu

Convenience constructor for Device("cpu").