Usage

dotjson exposes five primary classes:

  • Vocabulary prepares the model's vocabulary.
  • Index compiles an index for a given schema and vocabulary.
  • Guide produces the set of allowed tokens at each generation step.
  • BatchedTokenSet represents the allowed tokens for each batch element at a generation step.
  • LogitsProcessor masks the logits of tokens inconsistent with the provided schema.

  Tip

Vocabulary and Index are both serializable. See the API reference for details.

General program flow

Programs using dotjson follow a similar pattern:

  • Initialize the Vocabulary using a model’s HuggingFace identifier.
  • Compile the Index using the schema and vocabulary.
  • Create a Guide and a LogitsProcessor.
  • For each step in the inference loop:
    1. Perform a forward pass on your language model and place the results for each batch in the logits vector.
    2. Retrieve the set of allowed tokens:
    • If this is the first step, get the initial set of allowed tokens from the guide with guide.get_start_tokensets().
    • Otherwise, get the next set of allowed tokens using the most recently sampled tokens from the guide with guide.get_next_tokensets(sampled_tokens).
    3. Apply the processor to the logits vector using the current token set with processor.update_logits(logits, token_set).
    4. Choose new tokens for each sequence in the batch from the logits vector, via your preferred sampling method.
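The loop above can be sketched as a pure-Python simulation. FakeGuide and mask_logits below are hypothetical stand-ins for dotjson's Guide and LogitsProcessor, and the three-token vocabulary is made up for illustration; a real program would use the actual classes and a model forward pass.

```python
class FakeGuide:
    """Stand-in guide: allows token 0 first, then token 1, then EOS (2)."""
    def __init__(self):
        self._step = 0

    def get_start_tokensets(self):
        return [{0}]  # one batch element

    def get_next_tokensets(self, sampled_tokens):
        self._step += 1
        return [{1}] if self._step == 1 else [{2}]

def mask_logits(logits, token_sets, mask_value=0):
    """Stand-in for the processor: mask disallowed tokens in place."""
    for row, allowed in zip(logits, token_sets):
        for tok in range(len(row)):
            if tok not in allowed:
                row[tok] = mask_value

EOS = 2
guide = FakeGuide()
token_sets = guide.get_start_tokensets()
logits = [[0.0, 0.0, 0.0]]  # one batch element, three-token vocabulary
generated = []
while True:
    logits[0] = [1.0, 2.0, 3.0]  # stand-in for the model forward pass
    mask_logits(logits, token_sets)
    # Greedy sampling over the masked logits
    sampled = [max(range(len(row)), key=row.__getitem__) for row in logits]
    generated.append(sampled[0])
    if sampled[0] == EOS:
        break
    token_sets = guide.get_next_tokensets(sampled)

print(generated)  # [0, 1, 2]
```

Note that the token sets are refreshed from the guide after every sampling step, never reused.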

  Tip

Visit the example page for a complete example of how to use dotjson.

Compiling an index

Construct a vocabulary

# Basic usage with just the model name
# Assumes you have a Hugging Face token in your environment
import os

from dotjson import Vocabulary

model = "NousResearch/Hermes-3-Llama-3.1-8B"
vocabulary = Vocabulary.from_pretrained(model, auth_token=os.environ["HF_TOKEN"])

model must be the model identifier on the Hugging Face Hub, such as "NousResearch/Hermes-3-Llama-3.1-8B". dotjson downloads the tokenizer for this model if it is not already cached on disk.

Vocabulary contains two convenience functions:

  • Vocabulary.eos_token_ids() returns the EOS token IDs to determine when generation should terminate.
  • Vocabulary.max_token_id() returns the maximum token ID to determine the size of the logits vector.
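These values typically drive the termination check and array allocation in the sampling loop. A minimal sketch of the EOS check, with hypothetical token IDs (a real program would take eos_ids from vocabulary.eos_token_ids()):

```python
# Hypothetical EOS token IDs; in practice:
# eos_ids = set(vocabulary.eos_token_ids())
eos_ids = {128001, 128009}

def all_finished(sampled_tokens, eos_ids):
    """True once every sequence in the batch has just emitted an EOS token."""
    return all(tok in eos_ids for tok in sampled_tokens)

print(all_finished([128001, 42], eos_ids))      # False
print(all_finished([128001, 128009], eos_ids))  # True
```

In practice you would also track sequences that finished on an earlier step, rather than only inspecting the latest sample.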

Construct an index

Pass the schema you wish to compile and the vocabulary to create an index:

from dotjson import Index
schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}"
index = Index(schema, vocabulary)

schema must be a string containing a valid JSON schema. See the list of supported JSON Schema features here.

By default, the generated JSON allows whitespace (spaces after commas, line breaks after {, etc.). To generate compact JSON with no optional whitespace, set the disallow_whitespace parameter to True:

# Generate compact JSON without extra whitespace
index = Index(schema, vocabulary, disallow_whitespace=True)

  Note

Compile a new index whenever a new schema is received. Compilation fails for invalid schemas; see the API reference for the exceptions thrown in that case.

Loading and saving an index

If you use the same schema repeatedly, you can serialize the corresponding Index instance and save it to disk to avoid repeated compilations.

To save an index after compilation:

import os

from dotjson import Vocabulary, Index
model = "NousResearch/Hermes-3-Llama-3.1-8B"
vocabulary = Vocabulary.from_pretrained(model, auth_token=os.environ["HF_TOKEN"])
schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}"
index = Index(schema, vocabulary)
path = "./simple_index"
index.serialize_to_disk(path)

To load an index from disk:

# Load the index
path = "./simple_index"
index = Index.deserialize_from_disk(path)

Prepare the set of allowed tokens

dotjson uses a Guide to generate sets of allowed tokens at each inference step, as determined by previously sampled tokens.

Guides separate the logic of determining allowed tokens from the logic of biasing the model’s logits. The distinction is important because it allows for parallel computation of allowed tokens during the model’s forward pass, rather than during the logit processing step.

The Guide is constructed with an Index and a batch size:

from dotjson import Guide

# Create a guide from the index with the batch size
guide = Guide(index, batch_size=1)

The Guide generates a BatchedTokenSet, which contains the set of valid token IDs for each batch element.

The guide is used in two ways:

Generate the initial set of allowed tokens

# Get the initial set of allowed tokens
initial_token_set = guide.get_start_tokensets()

Generate the next set of allowed tokens

# Assume you have sampled a set of tokens as sampled_tokens
next_token_set = guide.get_next_tokensets(sampled_tokens)

  Important

A BatchedTokenSet should never be reused after new tokens have been sampled.

Always get a fresh token set from the guide after each sampling step. Reusing a stale token set fails silently: the wrong tokens are masked, and the generated output may no longer conform to the schema.

  Note

guide.get_start_tokensets() can only be called once. If you need to restart generation, create a new Guide instance.

Using BatchedTokenSet methods

The BatchedTokenSet class provides two utility methods to inspect the allowed tokens:

# Check if a specific token is allowed in each batch
token_to_check = 42
is_allowed = next_token_set.contains(token_to_check)

# Get the number of allowed tokens in each batch
allowed_count = len(next_token_set)

These methods can be useful for debugging or for implementing custom token sampling strategies. For example, if len shows that only one token is allowed in a batch element, you can select that token directly without evaluating the model.
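That shortcut can be sketched in plain Python. Here allowed is a stand-in for one batch element's allowed-token set; a real program would query the BatchedTokenSet instead:

```python
def choose_token(logits_row, allowed):
    """If only one token is allowed, return it directly without sampling;
    otherwise fall back to greedy selection over the allowed tokens."""
    if len(allowed) == 1:
        return next(iter(allowed))
    return max(allowed, key=lambda tok: logits_row[tok])

print(choose_token([1.0, 2.0, 3.0], {1}))     # 1 (forced: only one option)
print(choose_token([1.0, 2.0, 3.0], {0, 2}))  # 2 (greedy over allowed tokens)
```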

Preallocate the logits

Logits must be stored in a NumPy array or a PyTorch tensor. Each row of logits holds the logit values for one sequence in the batch.

A dummy initialization, which uses constant logits vectors for demonstration purposes, follows:

import numpy as np

# Number of sequences in the batch
n_batches = 1

# Get the vocabulary size
vocab_size = vocabulary.max_token_id()

# Allocate a single vector for all batches, initialized to 1 (uint16)
all_logits = np.ones(n_batches * vocab_size, dtype=np.uint16)

# Create non-overlapping views for each batch (rows are views, not copies)
logits = all_logits.reshape(n_batches, vocab_size)

# Example: logits[i] is the i-th batch slice (a view)
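Because the rows are views into a single buffer, writing through logits[i] mutates all_logits directly. A quick self-contained check, with small sizes chosen purely for illustration:

```python
import numpy as np

n_batches, vocab_size = 2, 4
all_logits = np.ones(n_batches * vocab_size, dtype=np.uint16)
logits = all_logits.reshape(n_batches, vocab_size)

# Masking a token through a row view updates the flat buffer in place
logits[0, 2] = 0
print(all_logits.tolist())        # [1, 1, 0, 1, 1, 1, 1, 1]
print(logits.base is all_logits)  # True: reshape returned a view, not a copy
```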

For multiple batches, an alternative approach is to use separate vectors for each batch:

# Store the vectors in a container that persists outside the loop
batch_logits_storage = [np.full(vocab_size, 1, dtype=np.uint16) for _ in range(n_batches)]


# Initialize each batch's logits:
logits = [arr.view() for arr in batch_logits_storage]

# Example: logits[i] is a NumPy view into its own (disjoint) buffer

Construct a logit processor

The logit processor is a function that modifies the logits in-place based on the set of allowed tokens. It is constructed with a Guide object and the mask value:

from dotjson import LogitsProcessor

# The value used to mask (i.e. disable) tokens.
# It is determined by your quantization scheme;
# in this example we use 0
mask_value = 0

# Create the processor using the guide object
processor = LogitsProcessor(guide, mask_value)

LogitsProcessor is a function that masks tokens that are inconsistent with the schema.

To use the processor, call:

processor.update_logits(logits, token_set)

This will modify the logits vector in place, using the token_set to determine which tokens to allow.

For example, if logits is the following after the model’s forward pass:

# Single logits vector, with a three-token vocabulary
logits = [[1, 2, 3]]

Assuming the second token (ID 1) is not in the allowed token set, logit processing will modify it to:

processor.update_logits(logits, token_set)
print(logits)   # [[1, 0, 3]]
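The in-place masking behaves like this minimal sketch, where update_logits is a hypothetical pure-Python stand-in for the processor:

```python
def update_logits(logits, allowed_per_batch, mask_value=0):
    """Stand-in for processor.update_logits: mask disallowed tokens in place."""
    for row, allowed in zip(logits, allowed_per_batch):
        for tok in range(len(row)):
            if tok not in allowed:
                row[tok] = mask_value

logits = [[1, 2, 3]]
update_logits(logits, [{0, 2}])  # token ID 1 is disallowed
print(logits)  # [[1, 0, 3]]
```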

Example Program

See the example
