Troubleshooting
Support
- Email us at [email protected]
- Your dedicated Slack channel
- Schedule a call with us here
Error Codes
Errors with the Vocabulary constructor
Problem: Vocabulary construction will throw a std::exception if the vocabulary cannot be built. This typically happens in one of two cases: the tokenizer could not be found on the Hugging Face Hub, or the tokenizer uses unsupported features.
Solution: Ensure that the tokenizer is available on the Hugging Face Hub under its identifier (e.g. gpt2). If the tokenizer is available but construction still fails, check whether the tokenizer uses any unsupported features and contact us.
Errors with deserialization of the Vocabulary object
Problem: Deserialization will throw a std::exception in the event that a properly serialized vocabulary cannot be found at the specified path.
Solution: Ensure that the vocabulary has been serialized and that the path is correct. If the vocabulary is present but the deserialization fails, it is possible that the vocabulary is corrupted. In this case, you should delete the vocabulary and re-build it.
Errors with serializing the Vocabulary object
Problem: Serialization will throw a std::exception if the vocabulary cannot be written to the specified path.
Solution: Ensure that the vocabulary has been built and that the path is correct.
Errors with the Index constructor
Problem: Index construction will throw a std::exception if the index cannot be built. This will typically happen in one of two cases: The JSON schema is malformed or contains unsupported features.
Solution: Ensure that the JSON schema is valid and follows the supported schema format. The schema should be a valid JSON object with appropriate type definitions and patterns.
Errors with deserialization of the Index object
Problem: Deserialization will throw a std::exception in the event that a properly serialized index cannot be found at the specified path.
Solution: Ensure that the index has been serialized and that the path is correct. If the index is present but the deserialization fails, it is possible that the index is corrupted. In this case, you should delete the index and re-build it.
Errors with serializing the Index object
Problem: Serialization will throw a std::exception if the index cannot be written to the specified path.
Solution: Ensure that the index has been built and that the path is correct.
Errors with the Guide
Problem: The get_start_tokensets() method throws an exception with a message like “This operation can only be performed on a fresh guide”.
Solution: The get_start_tokensets() method can only be called once at the beginning of token generation. If you’ve already called get_next_tokensets(), you cannot go back to the start. If you need to restart generation, create a new Guide instance.
Problem: The get_next_tokensets() method throws an exception about invalid tokens.
Solution: This occurs when one or more of the tokens you’re passing to the guide were not in the allowed token set from the previous step. Make sure you’re only sampling tokens from the logits after they’ve been processed by the processor, and that you’re passing the correct tokens to the guide.
Errors with the processor
Problem: The processor will throw a std::exception on bounds errors.
Solution: This can arise in the event of mismatches between the expected batch size from the Guide and the size of the logits vector. Ensure that:
- The logits vector size matches the batch size used to initialize the Guide
- All spans in the logits vector have the same size
- The Guide and LogitsProcessor are initialized with the same Guide object
Problem: The processor throws std::invalid_argument with message “The spans in the logits vector must be views into disjoint memory.”
Solution: Each span in the logits vector must reference completely separate memory regions with no overlap.
// INCORRECT: Spans overlap in memory
std::vector<std::uint16_t> logits_base(50258, 3);
std::span<std::uint16_t> span1(logits_base.data(), 50257);
std::span<std::uint16_t> span2(logits_base.data() + 1, 50257); // Overlaps with span1
Correct way to create non-overlapping spans:
// Method 1: Use a single contiguous buffer
std::vector<std::uint16_t> all_logits(n_batches * vocab_size, 1);
std::vector<std::span<std::uint16_t>> logits(n_batches);
for (int i = 0; i < n_batches; ++i) {
    logits[i] = std::span<std::uint16_t>(all_logits.data() + i * vocab_size, vocab_size);
}
// Method 2: Use separate vectors stored in a container
std::vector<std::vector<std::uint16_t>> batch_logits_storage(n_batches);
std::vector<std::span<std::uint16_t>> logits(n_batches);
for (int i = 0; i < n_batches; ++i) {
    batch_logits_storage[i] = std::vector<std::uint16_t>(vocab_size, 1);
    logits[i] = std::span<std::uint16_t>(batch_logits_storage[i].data(), batch_logits_storage[i].size());
}
Common errors with token sets
Problem: The wrong tokens are silently masked, allowing invalid JSON to be generated.
Solution: This is caused by reusing an old TokenBatchedSet in the LogitsProcessor. Never reuse a token set after sampling new tokens. Always get a fresh token set from the guide using get_next_tokensets() after each sampling step. For example:
// INCORRECT: Reusing the initial token set
TokenBatchedSet token_set = guide.get_start_tokensets();
processor(logits1, token_set);
// Sample tokens and move to next generation step
std::vector<u_int32_t> tokens = sample_tokens(logits1);
// Wrong! This applies the stale mask from the previous step, so tokens
// that should now be disallowed are not masked
processor(logits2, token_set);
// CORRECT approach:
TokenBatchedSet token_set = guide.get_start_tokensets();
processor(logits1, token_set);
// Sample tokens
std::vector<u_int32_t> tokens = sample_tokens(logits1);
// Get a new token set based on the sampled tokens
token_set = guide.get_next_tokensets(tokens);
// Now process with the updated token set
processor(logits2, token_set);
Excessive whitespace
If you set disallow_whitespace=false in the Index constructor, you may encounter excessive whitespace between keywords within the object. Excess whitespace will never be generated before or after the JSON output.
For example, the schema:
{
"type": "object",
"properties": {
"text": {
"type": "string"
}
},
"required": ["text"]
}
may produce raw text output like:
{\n "text": "Hello, how are you?" \n \n\n}
This behavior is expected. The model may wish to output JSON in a particular visual format that does not impact its machine-readability, as the influence of whitespace on model output quality is unknown. Allowing the model to choose the whitespace may enhance its ability to provide higher-quality answers.
Solutions
Option 1: Use disallow_whitespace
The most robust solution is to use the disallow_whitespace parameter when constructing your Index:
dotjson::Index index(schema, vocabulary, true);
This will prevent the generation of extra whitespace in the output JSON. However, note that disabling whitespace may have unknown effects on model output quality in cases where whitespace is important to the model’s problem understanding.
Output will have no whitespace before or after the JSON object, and no whitespace between keywords:
{"text":"Hello, how are you?"}
Option 2: Prompting
Provide one-shot examples in your prompt to demonstrate the expected whitespace formatting. Models tend to follow formatting patterns shown in examples.
Generate a JSON object following this example format:
{"text": "Example text"}
Option 3: Model Selection
Larger models typically produce more idiomatic JSON with appropriate whitespace. Consider upgrading to a more capable model if excessive whitespace is problematic.
Version Compatibility
This documentation covers dotjson v1.0. Breaking changes will be introduced with major version increments.
Need help?
- Email us at [email protected]
- Your dedicated Slack channel
- Schedule a support call with us here