# API Reference

## Namespaces

### `dotjson`

Main namespace containing the public API.

## Classes
### `BatchedTokenSet`

An opaque class containing the information needed for the dotjson logits processors to efficiently mask a batch of logits vectors.

#### Constructors

```cpp
explicit BatchedTokenSet(rust::Box<internal::BatchedTokenSet> token_set);
```

#### Methods

```cpp
std::vector<bool> contains(u_int32_t token_id);
```

- Description: Returns a vector of booleans indicating whether a particular token is marked as allowed in each token set of the batch
- Parameters:
  - `token_id`: The token id to check
- Returns: A vector of booleans with length equal to the batch size, where each element indicates whether the token is allowed in the corresponding batch element
> [!TIP]
> This can be used to stop early if the EOS token is available in the set of allowed tokens. For example, if the EOS token has ID 42, you can use
>
> ```cpp
> if (batched_token_set.contains(42)[0]) {
>     // EOS token is allowed in the first batch element, stop this sequence early
> }
> ```
>
> Early stopping may not be appropriate for all use cases. Typically, you will want to stop when the model has sampled the EOS token, not when it first becomes available.
```cpp
std::vector<std::size_t> num_allowed();
```

- Description: Returns a vector of length `batch_size` containing the number of allowed tokens in each token set
- Returns: A vector of integers with length equal to the batch size, where each element indicates the number of allowed tokens in the corresponding batch element

> [!TIP]
> `num_allowed` can be useful for skipping model forward passes. If only one token is allowed in a batch element, you can sample that token directly without evaluating the model.
### `Vocabulary`

A vocabulary for constructing an index.

#### Constructors

```cpp
Vocabulary(
    std::string &model,
    const std::string &revision = "",
    const std::vector<std::pair<std::string, std::string>> &user_agent = {},
    const std::string &auth_token = "");
```

- Description: Constructs a serializable vocabulary object for a given tokenizer
- Parameters:
  - `model`: The name of the tokenizer in the Hugging Face index
  - `revision`: (default: `""`) The specific revision of the tokenizer on the hf-hub
  - `user_agent`: (default: `{}`) The user agent info in the form of a dictionary or a single string. It will be completed with information about the installed packages.
  - `auth_token`: (default: `""`) A valid user access token for hf-hub
- Throws: This will throw a `std::exception` if the vocabulary cannot be built. This will typically happen in one of two cases: the tokenizer could not be found on the Hugging Face hub, or the tokenizer has unsupported features.
```cpp
Vocabulary(std::string &model, std::filesystem::path &path);
```

- Description: Constructs a serializable vocabulary object for a given tokenizer from local files
- Parameters:
  - `model`: The name of the tokenizer
  - `path`: The path to the directory containing the tokenizer and configuration JSON files
- Throws: This will throw a `std::exception` if the vocabulary cannot be built. This will typically happen in one of two cases: the `tokenizer.json` file or the configuration `.json` file could not be found in the path, or the tokenizer has unsupported features.
```cpp
Vocabulary(
    std::vector<std::pair<std::vector<uint8_t>, u_int32_t>> &dictionary,
    const std::vector<u_int32_t> &eos_token_ids);
```

- Description: A low-level, unchecked constructor for a serializable vocabulary object built from a set of `(Token, TokenId)` pairs
- Parameters:
  - `dictionary`: The pairs of `(Token, TokenId)`. Each token must be the string, as a vector of bytes, as it would be decoded.
  - `eos_token_ids`: The token ids of the end-of-string tokens. These should NOT be present in the dictionary. In order to support general tokenizers, we allow for multiple end-of-generation tokens. This vector must be non-empty.
- Throws: This will throw a `std::exception` if the vocabulary cannot be built. This will typically happen if the dictionary contains an EOS token id or is otherwise invalid, or if `eos_token_ids` is an empty vector.

> [!WARNING]
> This constructor does no further postprocessing of the tokens and assumes that all token strings are the decoded version. That is, a token string should not contain special characters like `▁` (SentencePiece encoders), `Ġ` (BPE encoders), or `Ċ` (the newline character in BPE). If these characters are present, the resulting index will most likely be incorrect.
```cpp
explicit Vocabulary(std::filesystem::path &path);
```

- Description: Constructs a vocabulary object from a serialized object on disk
- Parameters:
  - `path`: Path to the serialized object
- Throws: This will throw a `std::exception` in the event that a properly serialized vocabulary cannot be found at the specified path.
#### Methods

```cpp
void serialize_to_disk(std::string &path);
```

- Description: Serialize the `Vocabulary` object to disk
- Parameters:
  - `path`: The path to the file you want to serialize your vocabulary to
- Throws: This will throw a `std::exception` in the event that the serialization fails.

```cpp
u_int32_t max_token_id();
```

- Description: Returns the largest token id in the vocabulary. This is used for bounds checking in the logits processor.

```cpp
std::vector<u_int32_t> eos_token_ids();
```

- Description: Returns the EOS (end-of-string) token ids for the vocabulary. This can be useful for checking termination. In order to support general tokenizers and advanced usage, we allow for multiple EOS tokens.
### `Index`

An index for constructing a logits processor.

#### Constructors

```cpp
Index(std::string &schema, Vocabulary &vocabulary,
      bool disallow_whitespace = false);
```

- Description: Constructs a serializable index object for a given JSON schema
- Parameters:
  - `schema`: A JSON schema that generations should match
  - `vocabulary`: A `Vocabulary` object encoding information about the model's tokenizer
  - `disallow_whitespace`: (default: `false`) Don't generate JSON containing extra whitespace (such as spaces after commas or line breaks after `{`)
- Throws: This will throw a `std::exception` if the index cannot be built. This will typically happen in one of two cases: the JSON schema is malformed, or it contains unsupported features.

> [!NOTE]
> Using `disallow_whitespace=true` may cause unanticipated model performance issues, as it disables formatting that may be natural for the model.

> [!NOTE]
> Our implementation of the JSON schema enforces `additionalProperties: false`. This means that making an index based on the empty schema `{}` will result in an error, as it would lead to no allowable output.
```cpp
Index(Vocabulary &vocabulary,
      bool disallow_whitespace = false);
```

- Description: Constructs a serializable index object for producing valid JSON unconstrained by a specific schema
- Parameters:
  - `vocabulary`: A `Vocabulary` object encoding information about the model's tokenizer
  - `disallow_whitespace`: (default: `false`) Don't generate JSON containing extra whitespace (such as spaces after commas or line breaks after `{`)
- Throws: This will throw a `std::exception` if the index cannot be built.

> [!NOTE]
> Using `disallow_whitespace=true` may cause unanticipated model performance issues, as it disables formatting that may be natural for the model.
```cpp
explicit Index(std::filesystem::path &path);
```

- Description: Constructs an index object from a serialized object on disk
- Parameters:
  - `path`: Path to the serialized object
- Throws: This will throw a `std::exception` in the event that a properly serialized index cannot be found at the specified path.

#### Methods

```cpp
void serialize_to_disk(std::string &path);
```

- Description: Serialize the `Index` object to disk
- Parameters:
  - `path`: The path to the file you want to serialize your index to
- Throws: This will throw a `std::exception` in the event that the serialization fails.
### `Guide`

A guide class that reads batches of token ids and produces `BatchedTokenSet`s that can be used in the logits processor.

#### Constructors

```cpp
Guide(const Index &index, size_t batch_size) noexcept;
```

- Description: Constructs the `Guide` object
- Parameters:
  - `index`: The index used for developing the mask
  - `batch_size`: The batch size for the logits updates

#### Methods

```cpp
BatchedTokenSet get_start_tokensets() noexcept;
```

- Description: Constructs the set of allowed tokens needed to generate the first token.

```cpp
BatchedTokenSet get_next_tokensets(const std::vector<u_int32_t> &token_ids);
```

- Description: Reads a new batch of tokens and produces the mask
- Parameters:
  - `token_ids`: The vector of token ids that have just been sampled.
- Throws: This will throw an exception when the wrong number of tokens is given or when at least one of the tokens does not come from the allowed token set.
### `LogitsProcessor`

A logits processor that modifies logits arrays of `u_int16_t` elements in place for structured generation.

#### Constructors

```cpp
LogitsProcessor(Guide &guide, u_int16_t mask_value) noexcept;
```

- Description: Constructs the logits processor
- Parameters:
  - `guide`: The `Guide` object used to produce the token sets.
  - `mask_value`: The equivalent of `-std::numeric_limits<float>::infinity()` for `u_int16_t`.

> [!TIP]
> Sometimes you may be working on the probability scale rather than the logit scale. In this case, you can still use the `LogitsProcessor` framework with `mask_value` set to 0.
#### Methods

```cpp
void operator()(std::vector<std::span<u_int16_t>> &logits,
                BatchedTokenSet &token_set);
```

- Description: Adaptively computes the mask and applies it in place to the logits array
- Parameters:
  - `logits`: A vector of `std::span<u_int16_t>` containing the logits computed after reading the tokens in context. The vector's length must equal the `batch_size` used to initialize the processor, and the spans must all have the same size.
  - `token_set`: The set of tokens produced by the guide.
- Throws: A `std::exception` will be thrown on bounds errors.
### `LogitsProcessorF32`

A logits processor that modifies logits arrays of `float` elements in place for structured generation.

#### Constructors

```cpp
LogitsProcessorF32(Guide &guide, float mask_value) noexcept;
```

- Description: Constructs the logits processor
- Parameters:
  - `guide`: The `Guide` object used to produce the token sets.
  - `mask_value`: The equivalent of `-std::numeric_limits<float>::infinity()` for `float`.

> [!TIP]
> Sometimes you may be working on the probability scale rather than the logit scale. In this case, you can still use the `LogitsProcessorF32` framework with `mask_value` set to 0.0.
#### Methods

```cpp
void operator()(std::vector<std::span<float>> &logits,
                BatchedTokenSet &token_set);
```

- Description: Adaptively computes the mask and applies it in place to the logits array
- Parameters:
  - `logits`: A vector of `std::span<float>` containing the logits computed after reading the tokens in context. The vector's length must equal the `batch_size` used to initialize the processor, and the spans must all have the same size.
  - `token_set`: The set of tokens produced by the guide.
- Throws: A `std::exception` will be thrown on bounds errors.
## Example Usage

Here is a minimal example of how to use the dotjson library:

```cpp
// Create vocabulary and index
std::string model = "gpt2";
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);

// Create guide and processor
std::size_t batch_size = 1;
u_int16_t mask_value = 0; // Appropriate mask value
dotjson::Guide guide(index, batch_size);
dotjson::LogitsProcessor processor(guide, mask_value);

// Get the initial token set
dotjson::BatchedTokenSet token_set = guide.get_start_tokensets();

// Initial logit vector (generated by the LLM)
std::vector<std::span<uint16_t>> logits;
// Populate logits...

// Process the logits using the token set
processor(logits, token_set);

// Sample tokens based on the processed logits
std::vector<u_int32_t> sampled_tokens = sample_tokens(logits);

// Get the next token set based on the sampled tokens
token_set = guide.get_next_tokensets(sampled_tokens);
```

Alternatively, if you wish to generate any valid JSON (rather than constrain generations to a specific schema), you can replace the vocabulary and index construction above with:

```cpp
// Create vocabulary and index
std::string model = "gpt2";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(vocabulary);
```