Usage
dotjson exposes five primary classes:
- dotjson::Vocabulary to prepare the vocabulary.
- dotjson::Index to compile an index for a given schema and model.
- dotjson::Guide to produce a set of allowed tokens at each generation step.
- dotjson::BatchedTokenSet to represent the allowed tokens at each generation step.
- dotjson::LogitsProcessor to mask logits for tokens inconsistent with the provided schema.
Tip
dotjson::Vocabulary and dotjson::Index are both serializable. See the API reference for details.
General program flow
Programs using dotjson follow a similar pattern:
- Initialize the Vocabulary using a model’s HuggingFace identifier.
- Compile the Index using the schema and vocabulary.
- Create a Guide and a LogitsProcessor.
- For each step in the inference loop:
  - Perform a forward pass on your language model and place the results for each batch in the logits vector.
  - Retrieve the set of allowed tokens:
    - If this is the first step, get the initial set of allowed tokens from the guide with guide.get_start_tokensets().
    - Otherwise, get the next set of allowed tokens using the most recently sampled tokens with guide.get_next_tokensets(sampled_tokens).
  - Call the processor on the logits vector using the current token set with processor(logits, token_set).
  - Choose new tokens for each sequence in the batch from the logits vector, via your preferred sampling method.
Tip
Visit the example page for a complete example of how to use dotjson.
Compiling an index
Construct a vocabulary
// Basic usage with just the model name
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
dotjson::Vocabulary vocabulary(model);
// Advanced usage with all optional parameters
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
std::string revision = "main"; // specific model revision
std::vector<std::pair<std::string, std::string>> user_agent = {
{"application", "my-app-name"},
{"version", "1.0.0"}
};
std::string auth_token = "hf_..."; // HuggingFace access token
dotjson::Vocabulary vocabulary(model, revision, user_agent, auth_token);
model must be the model identifier on the HuggingFace hub, such as "NousResearch/Hermes-3-Llama-3.1-8B". dotjson will download the tokenizer for this model if it is not already cached on disk.
The user_agent parameter identifies your application when making requests to the HuggingFace hub. The library automatically augments your user agent with information about installed packages.
Alternatively, you can construct a vocabulary from a custom dictionary of token-to-ID mappings:
std::vector<std::pair<std::string, u_int32_t>> dictionary = {
{"hello", 1},
{"world", 2},
// ... more token mappings
};
std::vector<u_int32_t> eos_token_id = {3};
dotjson::Vocabulary vocabulary(dictionary, eos_token_id);
Note
When constructing a vocabulary from a dictionary, the EOS token ID must not be present in the dictionary. The dictionary must be valid and contain unique token-to-ID mappings.
Vocabulary contains two convenience functions:
- Vocabulary::eos_token_ids() returns the EOS token IDs, used to determine when generation should terminate.
- Vocabulary::max_token_id() returns the maximum token ID, used to determine the size of the logits vector.
Construct an index
Pass the schema you wish to compile and the vocabulary to create an index:
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
dotjson::Index index(schema, vocabulary);
schema must be a string containing a valid JSON schema. See the list of supported JSON Schema features here.
By default, the generated JSON allows whitespace (spaces after commas, line breaks after {, etc.). To generate compact JSON without extra whitespace, set the disallow_whitespace parameter to true:
// Generate compact JSON without extra whitespace
dotjson::Index index(schema, vocabulary, true);
Note
An index must be compiled whenever a new schema is received; compilation fails if the schema is invalid. See the API reference for information on the exceptions thrown for invalid schemas.
Loading and saving an index
If you use the same schema repeatedly, you can serialize the corresponding Index instance and save it to disk to avoid repeated compilations.
To save an index after compilation:
// Compile a vocabulary and index
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);
// Save the index
std::filesystem::path path = "path/to/file";
index.serialize_to_disk(path);
To load an index from disk:
// Load the index
std::filesystem::path path = "path/to/file";
dotjson::Index index(path);
Prepare the set of allowed tokens
dotjson uses a Guide to generate the set of allowed tokens at each inference step, as determined by previously sampled tokens.
Guides separate the logic of determining allowed tokens from the logic of biasing the model’s logits. The distinction is important because it allows for parallel computation of allowed tokens during the model’s forward pass, rather than during the logit processing step.
The Guide is constructed with an Index and a batch size:
// Create a guide from the index with the batch size
std::size_t batch_size = 1;
dotjson::Guide guide(index, batch_size);
The Guide generates a BatchedTokenSet for each batch element. This contains a set of valid token IDs.
The guide is used in two ways:
Generate the initial set of allowed tokens
// Get the initial set of allowed tokens
dotjson::BatchedTokenSet initial_token_set = guide.get_start_tokensets();
Generate the next set of allowed tokens
// Get the next set of allowed tokens
dotjson::BatchedTokenSet next_token_set = guide.get_next_tokensets(sampled_tokens);
Important
A BatchedTokenSet must never be reused after new tokens have been sampled.
Always get a fresh token set from the guide after each sampling step. Failing to do so results in silent failures where tokens are masked incorrectly.
Note
guide.get_start_tokensets() can only be called once. If you need to restart generation, create a new Guide instance.
Using BatchedTokenSet methods
The BatchedTokenSet class provides two utility methods to inspect the allowed tokens:
// Check if a specific token is allowed in each batch
u_int32_t token_to_check = 42;
std::vector<bool> is_allowed = token_set.contains(token_to_check);
// is_allowed[i] is true if token 42 is allowed in batch i
// Get the number of allowed tokens in each batch
std::vector<std::size_t> allowed_count = token_set.num_allowed();
// allowed_count[i] contains the number of allowed tokens in batch i
These methods can be useful for debugging or for implementing custom token sampling strategies. For example, if num_allowed reports that only one token is allowed for a batch element, you can emit that token directly without evaluating the model.
Preallocate the logits
Logits must be stored in a std::vector<std::span<u_int16_t>>. Each element of logits is a span over the token logits for one batch element.
A dummy initialization, which uses constant logits vectors for demonstration purposes, follows:
// Get the vocabulary size
u_int32_t vocab_size = vocabulary.max_token_id();
// Allocate a single vector for all batches
std::vector<u_int16_t> all_logits(n_batches * vocab_size, 1);
std::vector<std::span<u_int16_t>> logits(n_batches);
// Create non-overlapping spans pointing to sections of the vector
for (int i = 0; i < n_batches; ++i) {
logits[i] = std::span<u_int16_t>(all_logits.data() + i * vocab_size, vocab_size);
}
For multiple batches, an alternative approach is to use separate vectors for each batch:
// Store the vectors in a container that persists outside the loop
std::vector<std::vector<u_int16_t>> batch_logits_storage(n_batches);
std::vector<std::span<u_int16_t>> logits(n_batches);
// Initialize each batch's logits:
for (int i = 0; i < n_batches; ++i) {
batch_logits_storage[i] = std::vector<u_int16_t>(vocab_size, 1);
logits[i] = std::span<u_int16_t>(batch_logits_storage[i].data(), batch_logits_storage[i].size());
}
Important
Each span must not overlap in memory with another span. For example, the following will throw an exception:
std::vector<u_int16_t> logits_base_test(50258, 3);
std::span<u_int16_t> logits_span_test(logits_base_test.data(), 50257);
std::span<u_int16_t> logits_span2_test(logits_base_test.data() + 1, 50257);
std::vector<std::span<u_int16_t>> logits_test = {logits_span_test, logits_span2_test};
Construct a logit processor
The logit processor is a function that modifies the logits in-place based on the set of allowed tokens. It is constructed with a Guide object and the mask value:
// Set the batch size
std::size_t batch_size = 1;
// The value to use for masking (aka disabling tokens)
// This will be determined by your quantization scheme,
// but in this example we use 0
u_int16_t mask_value = 0;
// Create the processor using the guide object
dotjson::LogitsProcessor processor(guide, mask_value);
LogitsProcessor masks the logits of tokens that are inconsistent with the schema.
To use the processor, call:
processor(logits, token_set);
This modifies the logits vector in place, using the token_set to determine which tokens to allow.
For example, if logits is the following after the model’s forward pass:
// Single logits vector, with a three token vocabulary
logits = { {1, 2, 3} };
Assuming token 2 is not in the allowed token set, logit processing will modify it to:
processor(logits, token_set);
// logits = {{1, 0, 3}};
Example program
// Import dotjson
// Note: The include path may vary depending on your installation method
// For system installation, use: #include "dotjson.hpp"
// For local installation, use: #include "vendor/dotjson/include/dotjson.hpp"
#include "../src/dotjson.hpp"
#include <iostream>
#include <vector>
#include <limits>
#include <span>
// A function that returns logits in the appropriate form,
// taking the full sequence history as input for the model's forward pass
std::vector<std::span<u_int16_t>> get_logits(std::string model, const std::vector<std::vector<u_int32_t>>& sequences) {
// This is your language model's forward pass.
// It should use the sequences (token history) to compute the next token logits
// ...
// Return the computed logits
std::vector<std::span<u_int16_t>> logits;
// ... populate logits from your forward pass ...
return logits;
}
std::vector<u_int32_t> sample_tokens(std::vector<std::span<u_int16_t>> &logits) {
// A function that returns the next tokens generated from a given set of logits,
// using e.g. multinomial or greedy sampling.
std::vector<u_int32_t> sampled_tokens;
// ... populate sampled_tokens from the logits ...
return sampled_tokens;
}
int main() {
// Specify the model to use -- this downloads the tokenizer
// from HuggingFace.
std::string model = "gpt2";
// Specify a JSON schema to use.
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
// Compile the index and vocabulary for the schema.
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);
// Specify the mask value to use for disabling tokens.
// Tokens that don't match the schema will have their logit values set to this.
u_int16_t mask_value = 0;
std::size_t batch_size = 1;
// Create a guide to generate sets of allowed tokens
dotjson::Guide guide(index, batch_size);
// Create a logit processor using the guide
dotjson::LogitsProcessor processor(guide, mask_value);
// Get the initial set of allowed tokens
dotjson::BatchedTokenSet token_set = guide.get_start_tokensets();
// Initialize sequence tracking for each batch
std::vector<std::vector<u_int32_t>> sequences(batch_size);
// Maximum allowed sequence length
const size_t max_sequence_length = 1024;
// Initialize the first logits based on empty sequences
// (this is your language model's forward pass)
std::vector<std::span<u_int16_t>> logits = get_logits(model, sequences);
// Process the initial logits with the token set
processor(logits, token_set);
// Sample the first set of tokens
std::vector<u_int32_t> sampled_tokens = sample_tokens(logits);
// Add sampled tokens to the sequences history
for (size_t i = 0; i < batch_size; i++) {
sequences[i].push_back(sampled_tokens[i]);
}
// Tracking boolean to know when to end generation
bool is_completed = false;
// A complete inference loop would look like:
while (!is_completed) {
// Get the next set of allowed tokens based on the
// most recently sampled tokens
token_set = guide.get_next_tokensets(sampled_tokens);
// Get the next set of logits from the model
// The full sequences history is used for the model's context
logits = get_logits(model, sequences);
// Process logits with the current token set
processor(logits, token_set);
// After processing, only tokens that are valid according to the schema
// will retain their original values. Others will be set to mask_value.
// Sample the next tokens
sampled_tokens = sample_tokens(logits);
// Add the new tokens to the sequence history
for (size_t i = 0; i < batch_size; i++) {
sequences[i].push_back(sampled_tokens[i]);
// Check for completion (e.g., EOS token)
// In this case there is only one EOS token, but in general
// you need to check the entire vector.
if (sampled_tokens[i] == vocabulary.eos_token_ids()[0]) {
is_completed = true;
}
}
}
// Process the generated sequences to return the final
// completion to the user
for (size_t i = 0; i < batch_size; i++) {
// Convert token IDs back to text
// ...
}
}
Need help?
- Email us at [email protected]
- Your dedicated Slack channel
- Schedule a call with us here