Usage
dotjson exposes five primary classes:
- Vocabulary to prepare the vocabulary.
- Index to compile an index for a given schema and model.
- Guide to produce the set of allowed tokens at each generation step.
- BatchedTokenSet to represent the allowed tokens at each generation step.
- LogitsProcessor to mask logits for tokens inconsistent with the provided schema.
Tip
Vocabulary and Index are both serializable. See the API reference for details.
General program flow
Programs using dotjson follow a similar pattern:
- Initialize the Vocabulary using a model’s HuggingFace identifier.
- Compile the Index using the schema and vocabulary.
- Create a Guide and a LogitsProcessor.
- For each step in the inference loop:
  - Perform a forward pass on your language model and place the results for each batch in the logits vector.
  - Retrieve the set of allowed tokens:
    - If this is the first step, get the initial set of allowed tokens from the guide with guide.get_start_tokensets().
    - Otherwise, get the next set of allowed tokens from the guide, passing the most recently sampled tokens to guide.get_next_tokensets(sampled_tokens).
  - Call the processor on the logits vector using the current token set with processor.update_logits(logits, token_set).
  - Choose new tokens for each sequence in the batch from the logits vector, via your preferred sampling method.
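Under toy assumptions, this flow can be sketched end to end. ToyGuide, mask_logits, and the constant logits below are hypothetical stand-ins, not dotjson APIs; they only illustrate the control flow of the loop:

```python
import numpy as np

# Toy stand-ins (hypothetical, for illustration only) mirroring the loop above
# without needing a model or the dotjson package.
class ToyGuide:
    def __init__(self, allowed_per_step):
        self._steps = iter(allowed_per_step)

    def get_start_tokensets(self):
        return next(self._steps)

    def get_next_tokensets(self, sampled_tokens):
        return next(self._steps)

def mask_logits(logits, token_sets, mask_value=-np.inf):
    # Disable every token outside each batch element's allowed set.
    for i, allowed in enumerate(token_sets):
        for t in range(logits.shape[1]):
            if t not in allowed:
                logits[i, t] = mask_value

eos_token = 4
guide = ToyGuide([[{1, 2}], [{3}], [{eos_token}]])
token_sets = guide.get_start_tokensets()
generated = []
while True:
    # Pretend forward pass: constant logits, one batch, five-token vocabulary.
    logits = np.array([[0.1, 0.9, 0.5, 0.2, 0.3]])
    mask_logits(logits, token_sets)
    sampled = [int(np.argmax(row)) for row in logits]  # greedy sampling
    generated.append(sampled[0])
    if sampled[0] == eos_token:
        break
    token_sets = guide.get_next_tokensets(sampled)
```

Each iteration masks the logits with the current token set, samples, and then asks the guide for the next token set before the next forward pass.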
Tip
Visit the example page for a complete example of how to use dotjson.
Compiling an index
Construct a vocabulary
# Basic usage with just the model name
# Assumes you have a hugging face token in your environment
import os
from dotjson import Vocabulary
model = "NousResearch/Hermes-3-Llama-3.1-8B"
vocabulary = Vocabulary.from_pretrained(model, auth_token=os.environ["HF_TOKEN"])

model must be the model identifier on the HuggingFace Hub, such as "NousResearch/Hermes-3-Llama-3.1-8B". dotjson will download the tokenizer for this model if it is not already cached on disk.
Vocabulary provides two convenience methods:
- Vocabulary.eos_token_ids() returns the EOS token IDs, used to determine when generation should terminate.
- Vocabulary.max_token_id() returns the maximum token ID, used to determine the size of the logits vector.
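These values typically feed directly into the inference loop. A small sketch, using hypothetical token IDs standing in for what a real tokenizer's Vocabulary would return:

```python
# Hypothetical values standing in for Vocabulary.eos_token_ids() and
# Vocabulary.max_token_id() on a real tokenizer.
eos_token_ids = {128001, 128009}
max_token_id = 128255

# Size the logits buffer from the maximum token ID.
vocab_size = max_token_id

# Stop generating a sequence once it samples an EOS token.
sampled_tokens = [42, 128009]
finished = [token in eos_token_ids for token in sampled_tokens]
```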
Construct an index
Pass the schema you wish to compile and the vocabulary to create an index:
from dotjson import Index
schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}"
index = Index(schema, vocabulary)

schema must be a string containing a valid JSON schema. See the list of supported JSON Schema features here.
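As an optional sanity check, you can confirm that the schema string parses as JSON before compiling it. This uses only the standard library, not a dotjson API:

```python
import json

# Plain-Python sanity check: the schema string must be valid JSON.
# (Index itself raises an exception on invalid schemas.)
schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}"
parsed = json.loads(schema)
```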
By default, the generated JSON allows whitespace (spaces after commas, linebreaks after {, etc.). To generate compact JSON with no extra whitespace, set the disallow_whitespace parameter to True:
# Generate compact JSON without extra whitespace
index = Index(schema, vocabulary, disallow_whitespace=True)

Note
Compile a new index whenever a new schema is received. If the schema is invalid, compilation will fail; see the API reference for the exceptions thrown for invalid schemas.
Loading and saving an index
If you use the same schema repeatedly, you can serialize the corresponding Index instance and save it to disk to avoid repeated compilations.
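A common pattern is to compile only when no serialized copy exists on disk. The helper below is hypothetical (not part of dotjson); it assumes the serialize_to_disk and deserialize_from_disk methods described in this section, and takes the index class as a parameter so the caching logic stays generic:

```python
import os

# Hypothetical caching helper: reuse a serialized index when one exists,
# otherwise compile and save it for next time.
def load_or_compile_index(schema, vocabulary, path, index_cls):
    if os.path.exists(path):
        return index_cls.deserialize_from_disk(path)
    index = index_cls(schema, vocabulary)
    index.serialize_to_disk(path)
    return index
```

With dotjson, you would pass Index as index_cls; the first call compiles and saves, later calls load from disk.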
To save an index after compilation:
import os
from dotjson import Vocabulary, Index
model = "NousResearch/Hermes-3-Llama-3.1-8B"
vocabulary = Vocabulary.from_pretrained(model, auth_token=os.environ["HF_TOKEN"])
schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}"
index = Index(schema, vocabulary)
path = "./simple_index"
index.serialize_to_disk(path)

To load an index from disk:
# Load the index
path = "./simple_index"
index = Index.deserialize_from_disk(path)

Prepare the set of allowed tokens
dotjson uses a Guide to generate sets of allowed tokens at each inference step, as determined by previously sampled tokens.
Guides separate the logic of determining allowed tokens from the logic of biasing the model’s logits. The distinction is important because it allows for parallel computation of allowed tokens during the model’s forward pass, rather than during the logit processing step.
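The payoff of this separation is that the two computations can overlap. A toy illustration of the idea using a thread pool; both functions are stand-ins, not dotjson APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def forward_pass():
    # Placeholder for the model's (slow) forward pass.
    return [0.1, 0.9, 0.5]

def compute_allowed_tokens():
    # Placeholder for the guide computing the next allowed-token set.
    return {1, 2}

# Run both in parallel, then combine the results for logit processing.
with ThreadPoolExecutor(max_workers=2) as pool:
    logits_future = pool.submit(forward_pass)
    allowed_future = pool.submit(compute_allowed_tokens)
    logits = logits_future.result()
    allowed = allowed_future.result()
```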
The Guide is constructed with an Index and a batch size:
from dotjson import Guide

# Create a guide from the index with the batch size
guide = Guide(index, batch_size=1)

The Guide generates a BatchedTokenSet for each batch, containing the set of valid token IDs.
The guide is used in two ways:
Generate the initial set of allowed tokens
# Get the initial set of allowed tokens
initial_token_set = guide.get_start_tokensets()

Generate the next set of allowed tokens
# Assume you have sampled a set of tokens as sampled_tokens
next_token_set = guide.get_next_tokensets(sampled_tokens)

Important
A BatchedTokenSet should never be reused after new tokens have been sampled.
Always get a fresh token set from the guide after each sampling step. Failing to do so results in silent failures where tokens are masked incorrectly.
Note
guide.get_start_tokensets() can only be called once. If you need to restart generation, create a new Guide instance.
Using BatchedTokenSet methods
The BatchedTokenSet class provides two utility methods to inspect the allowed tokens:
# Check if a specific token is allowed in each batch
token_to_check = 42
is_allowed = next_token_set.contains(token_to_check)
# Get the number of allowed tokens in each batch
allowed_count = len(next_token_set)

These methods can be useful for debugging or for implementing custom token sampling strategies. For example, if len shows that only one token is allowed in a batch element, you can sample that token directly without evaluating the model.
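That shortcut can be sketched in plain NumPy. Here allowed_mask is a boolean stand-in for calling contains on every token ID in one batch element, so the sketch runs without dotjson:

```python
import numpy as np

# Hypothetical sampling shortcut: skip the logits when only one token is legal.
def pick_token(logits, allowed_mask):
    if allowed_mask.sum() == 1:
        # Only one legal continuation: return it directly.
        return int(np.argmax(allowed_mask))
    # Otherwise mask disallowed tokens and sample greedily.
    masked = np.where(allowed_mask, logits, -np.inf)
    return int(np.argmax(masked))

logits = np.array([0.2, 0.9, 0.1])
forced = np.array([False, False, True])   # exactly one token allowed
open_set = np.array([True, True, False])  # several tokens allowed
```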
Preallocate the logits
Logits must be stored in a NumPy array or a PyTorch tensor, with one contiguous span of token logits per batch element.
A dummy initialization, which uses constant logits vectors for demonstration purposes, follows:
import numpy as np
# Get the vocabulary size
vocab_size = vocabulary.max_token_id()
# Allocate a single vector for all batches, initialized to 1 (uint16)
n_batches = 2  # example batch count
all_logits = np.ones(n_batches * vocab_size, dtype=np.uint16)
# Create non-overlapping views for each batch (rows are views, not copies)
logits = all_logits.reshape(n_batches, vocab_size)
# Example: logits[i] is the i-th batch slice (a view)

For multiple batches, an alternative approach is to use separate vectors for each batch:
# Store the vectors in a container that persists outside the loop
batch_logits_storage = [np.full(vocab_size, 1, dtype=np.uint16) for _ in range(n_batches)]
# Initialize each batch's logits:
logits = [arr.view() for arr in batch_logits_storage]
# Example: logits[i] is a NumPy view into its own (disjoint) buffer

Construct a logit processor
The logit processor is a function that modifies the logits in-place based on the set of allowed tokens. It is constructed with a Guide object and the mask value:
# Set the batch size
batch_size = 1
# The value to use for masking (aka disabling tokens)
# This will be determined by your quantization scheme,
# but in this example we use 0
mask_value = 0
from dotjson import LogitsProcessor

# Create the processor using the guide object
processor = LogitsProcessor(guide, mask_value)

LogitsProcessor is a function that masks tokens that are inconsistent with the schema.
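The right mask value depends on your logits dtype. A hedged sketch with toy values (not a dotjson API), using float logits alongside the uint16 quantization from the example above:

```python
import numpy as np

# For float logits, -inf is a common mask value: softmax assigns masked
# tokens zero probability.
float_logits = np.array([1.0, 2.0, 3.0], dtype=np.float32)
float_mask_value = -np.inf

# For unsigned integer logits, use the smallest representable value.
int_logits = np.array([1, 2, 3], dtype=np.uint16)
int_mask_value = np.iinfo(np.uint16).min  # 0 for uint16
int_logits[1] = int_mask_value            # mask the second token
```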
To use the processor, call:
processor.update_logits(logits, token_set)

This modifies the logits vector in place, using token_set to determine which tokens to allow.
For example, if logits is the following after the model’s forward pass:
# Single logits vector, with a three-token vocabulary
logits = [[1, 2, 3]]

Assuming token ID 1 (the second entry) is not in the allowed token set, logit processing will modify it to:
processor.update_logits(logits, token_set)
print(logits) # [[1, 0, 3]]

Example Program
Need help?
- Email us at [email protected]
- Your dedicated Slack channel
- Schedule a call with us here