# API Reference

## Namespaces

### `dotjson`

Main namespace containing the public API.

## Classes
### `BatchedTokenSet`

An opaque class containing the information needed for the dotjson logits processors to efficiently mask a batch of logits vectors.

#### Constructors

```cpp
explicit BatchedTokenSet(rust::Box<internal::BatchedTokenSet> token_set);
```

#### Methods

```cpp
std::vector<bool> contains(u_int32_t token_id);
```

- Description: Returns a vector of booleans indicating whether a particular token is marked as allowed in each token set of the batch
- Parameters:
  - `token_id`: The token id to check
- Returns: A vector of booleans with length equal to the batch size, where each element indicates whether the token is allowed in the corresponding batch element
> [!TIP]
> This can be used to stop early if the EOS token is available in the set of allowed tokens. For example, if the EOS token has ID 42, you can use
>
> ```cpp
> if (batched_token_set.contains(42)[0]) {
>     // EOS token is allowed in the first batch element, stop this sequence early
> }
> ```
>
> Early stopping may not be appropriate for all use cases. Typically, you will want to stop when the model has sampled the EOS token, not when it first becomes available.
```cpp
std::vector<std::size_t> num_allowed();
```

- Description: Returns a vector of length `batch_size` containing the number of allowed tokens in each token set
- Returns: A vector of integers with length equal to the batch size, where each element indicates the number of allowed tokens in the corresponding batch element

> [!TIP]
> `num_allowed` can be useful for skipping model forward passes. If only one token is allowed in a batch element, you can sample that token directly without evaluating the model.
### `Vocabulary`

A vocabulary for constructing an index.

#### Constructors

```cpp
Vocabulary(
    std::string &model,
    const std::string &revision = "",
    const std::vector<std::pair<std::string, std::string>> &user_agent = {},
    const std::string &auth_token = "");
```

- Description: Constructs a serializable vocabulary object for a given tokenizer
- Parameters:
  - `model`: The name of the tokenizer in the Hugging Face index
  - `revision`: (default: `""`) The specific revision of the tokenizer on the hf-hub
  - `user_agent`: (default: `{}`) The user agent info in the form of a dictionary or a single string. It will be completed with information about the installed packages.
  - `auth_token`: (default: `""`) A valid user access token for hf-hub
- Throws: This will throw a `std::exception` if the vocabulary cannot be built. This will typically happen in one of two cases: the tokenizer could not be found on the Hugging Face hub, or the tokenizer has unsupported features.
```cpp
Vocabulary(std::string &model, std::filesystem::path &path);
```

- Description: Constructs a serializable vocabulary object for a given tokenizer from local files
- Parameters:
  - `model`: The name of the tokenizer
  - `path`: The path to the directory containing the tokenizer and configuration JSON files
- Throws: This will throw a `std::exception` if the vocabulary cannot be built. This will typically happen in one of two cases: the `tokenizer.json` file or the configuration `.json` file could not be found in the path, or the tokenizer has unsupported features.
```cpp
Vocabulary(
    std::vector<std::pair<std::vector<uint8_t>, u_int32_t>> &dictionary,
    const std::vector<u_int32_t> &eos_token_ids);
```

- Description: A low-level, unchecked constructor for a serializable vocabulary object built from a set of `(Token, TokenId)` pairs
- Parameters:
  - `dictionary`: The pairs of `(Token, TokenId)`. Each token must be the string, as a vector of bytes, as it would be decoded.
  - `eos_token_ids`: The token ids of the end-of-string tokens. These should NOT be present in the dictionary. In order to support general tokenizers, we allow for multiple end-of-generation tokens. This vector must be non-empty.
- Throws: This will throw a `std::exception` if the vocabulary cannot be built. This will typically happen if the dictionary contains an EOS token id or is otherwise invalid, or if `eos_token_ids` is an empty vector.

> [!WARNING]
> This constructor does no further postprocessing of the tokens and assumes that all token strings are the decoded version. That is, a token string should not contain special characters like `▁` (SentencePiece encoders), `Ġ` (BPE encoders), or `Ċ` (the newline character in BPE). If these characters are present, the resulting index will most likely be incorrect.
```cpp
explicit Vocabulary(std::filesystem::path &path);
```

- Description: Constructs a vocabulary object from a serialized object on disk
- Parameters:
  - `path`: Path to the serialized object
- Throws: This will throw a `std::exception` in the event that a properly serialized vocabulary cannot be found at the specified path.
#### Methods

```cpp
void serialize_to_disk(std::string &path);
```

- Description: Serialize the `Vocabulary` object to disk
- Parameters:
  - `path`: The path to the file you want to serialize your vocabulary to
- Throws: This will throw a `std::exception` in the event that the serialization fails.

```cpp
u_int32_t max_token_id();
```

- Description: Returns the largest token id in the vocabulary. This is used for bounds checking in the logits processor.

```cpp
std::vector<u_int32_t> eos_token_ids();
```

- Description: Returns the EOS (end-of-string) token ids for the vocabulary. This can be useful for checking termination. In order to support general tokenizers and advanced usage, we allow for multiple EOS tokens.
### `Index`

An index for constructing a logits processor.

#### Constructors

```cpp
Index(std::string &schema, Vocabulary &vocabulary,
      bool disallow_whitespace = false);
```

- Description: Constructs a serializable index object for a given JSON schema
- Parameters:
  - `schema`: A JSON schema that generations should match
  - `vocabulary`: A `Vocabulary` object encoding information about the model's tokenizer
  - `disallow_whitespace`: (default: `false`) Don't generate JSON containing extra whitespace (such as spaces after commas or line breaks after `{`)
- Throws: This will throw a `std::exception` if the index cannot be built. This will typically happen in one of two cases: the JSON schema is malformed, or it contains unsupported features.

> [!NOTE]
> Using `disallow_whitespace=true` may cause unanticipated model performance issues, as it disables formatting that may be natural for the model.

> [!NOTE]
> Our implementation of the JSON schema enforces `additionalProperties: false`. This means that making an index based on the empty schema `{}` will result in an error, as it would lead to no allowable output.
```cpp
Index(Vocabulary &vocabulary,
      bool disallow_whitespace = false);
```

- Description: Constructs a serializable index object for producing valid JSON unconstrained by a specific schema
- Parameters:
  - `vocabulary`: A `Vocabulary` object encoding information about the model's tokenizer
  - `disallow_whitespace`: (default: `false`) Don't generate JSON containing extra whitespace (such as spaces after commas or line breaks after `{`)
- Throws: This will throw a `std::exception` if the index cannot be built.

> [!NOTE]
> Using `disallow_whitespace=true` may cause unanticipated model performance issues, as it disables formatting that may be natural for the model.
```cpp
explicit Index(std::filesystem::path &path);
```

- Description: Constructs an index object from a serialized object on disk
- Parameters:
  - `path`: Path to the serialized object
- Throws: This will throw a `std::exception` in the event that a properly serialized index cannot be found at the specified path.

#### Methods

```cpp
void serialize_to_disk(std::string &path);
```

- Description: Serialize the `Index` object to disk
- Parameters:
  - `path`: The path to the file you want to serialize your index to
- Throws: This will throw a `std::exception` in the event that the serialization fails.
### `Guide`

A guide class that reads batches of token ids and produces `BatchedTokenSet`s that can be used in the logits processor.

#### Constructors

```cpp
Guide(const Index &index, size_t batch_size) noexcept;
```

- Description: Constructs the `Guide` object
- Parameters:
  - `index`: The index used for developing the mask
  - `batch_size`: The batch size for the logits updates

#### Methods

```cpp
BatchedTokenSet get_start_tokensets() noexcept;
```

- Description: Constructs the set of allowed tokens needed to generate the first token.

```cpp
BatchedTokenSet get_next_tokensets(const std::vector<u_int32_t> &token_ids);
```

- Description: Reads a new batch of tokens and produces the mask
- Parameters:
  - `token_ids`: The vector of token ids that have just been sampled.
- Throws: This will throw an exception when the wrong number of tokens is given or when at least one of the tokens does not come from the allowed token set.
### `LogitsProcessor`

A logits processor that modifies logits arrays of `u_int16_t` elements in place for structured generation.

#### Constructors

```cpp
LogitsProcessor(Guide &guide, u_int16_t mask_value) noexcept;
```

- Description: Constructs the logits processor
- Parameters:
  - `guide`: The `Guide` object used to produce the token sets.
  - `mask_value`: The equivalent of `-std::numeric_limits<float>::infinity()` for `u_int16_t`.

> [!TIP]
> Sometimes you may be working on the probability scale rather than the logit scale. In this case, you can still use the `LogitsProcessor` framework with `mask_value` set to 0.
#### Methods

```cpp
void operator()(std::vector<std::span<u_int16_t>> &logits,
                BatchedTokenSet &token_set);
```

- Description: Adaptively computes the mask and applies it in place to the logits array
- Parameters:
  - `logits`: A vector of `std::span<u_int16_t>` containing the logits computed after reading the tokens in context. The vector's length must equal the `batch_size` used to initialize the processor, and the spans must all have the same size.
  - `token_set`: The set of tokens produced by the guide.
- Throws: A `std::exception` will be thrown on bounds errors.
### `LogitsProcessorF32`

A logits processor that modifies logits arrays of `float` elements in place for structured generation.

#### Constructors

```cpp
LogitsProcessorF32(Guide &guide, float mask_value) noexcept;
```

- Description: Constructs the logits processor
- Parameters:
  - `guide`: The `Guide` object used to produce the token sets.
  - `mask_value`: The equivalent of `-std::numeric_limits<float>::infinity()` for `float`.

> [!TIP]
> Sometimes you may be working on the probability scale rather than the logit scale. In this case, you can still use the `LogitsProcessorF32` framework with `mask_value` set to 0.0.
#### Methods

```cpp
void operator()(std::vector<std::span<float>> &logits,
                BatchedTokenSet &token_set);
```

- Description: Adaptively computes the mask and applies it in place to the logits array
- Parameters:
  - `logits`: A vector of `std::span<float>` containing the logits computed after reading the tokens in context. The vector's length must equal the `batch_size` used to initialize the processor, and the spans must all have the same size.
  - `token_set`: The set of tokens produced by the guide.
- Throws: A `std::exception` will be thrown on bounds errors.
## Example Usage

Here is a minimal example of how to use the dotjson library:

```cpp
// Create vocabulary and index
std::string model = "gpt2";
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);

// Create guide and processor
std::size_t batch_size = 1;
u_int16_t mask_value = 0; // Appropriate mask value
dotjson::Guide guide(index, batch_size);
dotjson::LogitsProcessor processor(guide, mask_value);

// Get the initial token set
dotjson::BatchedTokenSet token_set = guide.get_start_tokensets();

// Initial logit vector (generated by the LLM)
std::vector<std::span<uint16_t>> logits;
// Populate logits...

// Process the logits using the token set
processor(logits, token_set);

// Sample tokens based on the processed logits
std::vector<u_int32_t> sampled_tokens = sample_tokens(logits);

// Get the next token set based on the sampled tokens
token_set = guide.get_next_tokensets(sampled_tokens);
```

Alternatively, if you wish to generate any valid JSON (rather than constrain generations to a specific schema), you can replace the vocabulary and index construction above with:

```cpp
// Create vocabulary and index
std::string model = "gpt2";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(vocabulary);
```