When using an embedding model for text vectorization, I sometimes get exceptions because the input exceeds the model's maximum context token size. This led me to investigate TokenTextSplitter's splitting behavior, and I found that in certain cases, such as when the text contains a long run of repeated separators (e.g. ' . . . . . . . . . . . . . . . . . . . . . . . '), the resulting segments exceed the configured defaultChunkSize.

The version I'm currently using is 1.0.0-M4.

Even when I customize the parameters through the TokenTextSplitter constructor, the issue still occurs.

For example:

import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

private int defaultChunkSize = 400;      // target chunk size, in tokens
private int minChunkSizeChars = 175;     // minimum chunk size, in characters
private int minChunkLengthToEmbed = 5;   // chunks shorter than this are dropped
private int maxNumChunks = 10000;        // cap on chunks generated per text
private boolean keepSeparator = true;    // keep separators (e.g. newlines) in chunks

TokenTextSplitter tokenTextSplitter = new TokenTextSplitter(defaultChunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, keepSeparator);
List<Document> splitDocuments = tokenTextSplitter.apply(documents);
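
To sanity-check the output, I re-encode each produced chunk with jtokkit (which, as far as I can tell, is the same CL100K_BASE tokenizer the splitter uses internally) and flag anything over the limit. A rough sketch; note that getContent() is the Document accessor in 1.0.0-M4, so adjust if your version differs:

// Verification sketch: re-encode every chunk the splitter produced and
// report any that exceed the configured defaultChunkSize in tokens.
Encoding enc = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
for (Document doc : splitDocuments) {
    int tokenCount = enc.encode(doc.getContent()).size();
    if (tokenCount > defaultChunkSize) {
        System.out.println("Oversized chunk: " + tokenCount + " tokens");
    }
}

With the dot-heavy input, this loop is what surfaces the oversized segments.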

I'm not sure whether the issue lies in the third-party dependency jtokkit itself. However, I wrote a test that uses jtokkit directly to count the tokens of the overly long chunk that caused the error, and the count came out as expected. This suggests the error originates in the splitting process rather than in the token count calculation.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
// "text" holds the oversized chunk that triggered the embedding error
System.out.println(enc.encode(text).size());

The calculated count also matches OpenAI's tokenizer exactly, which further points to the splitting logic rather than the token counting.

Comment From: rliangzi

I am facing the same issue. How should I handle it?

Comment From: viosay

I am facing the same issue. How should I handle it?

The issue is most likely in the protected List<String> doSplit(String text, int chunkSize) method. I'll keep investigating, and it's worth taking a closer look at that part of the code as well.
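
To make the suspicion concrete, here is a heavily simplified sketch of the general pattern this kind of splitter follows (my own reconstruction for discussion, not the actual Spring AI code; it assumes jtokkit 1.x, where Encoding#encode returns an IntArrayList, plus java.util.ArrayList/List). The point of interest is that the emitted text and the cursor advance come from two separate encode/decode round trips, so they can disagree:

// Simplified reconstruction for discussion -- NOT the real doSplit code.
IntArrayList all = enc.encode(text);
List<String> chunks = new ArrayList<>();
int pos = 0;
while (pos < all.size()) {
    // Take a window of at most chunkSize tokens and decode it.
    int end = Math.min(pos + chunkSize, all.size());
    IntArrayList window = new IntArrayList();
    for (int i = pos; i < end; i++) {
        window.add(all.get(i));
    }
    String chunkText = enc.decode(window);
    // Optionally cut the decoded text back to the last sentence break.
    int cut = Math.max(chunkText.lastIndexOf('.'), chunkText.lastIndexOf('\n'));
    if (cut > minChunkSizeChars) {
        chunkText = chunkText.substring(0, cut + 1);
    }
    chunks.add(chunkText);
    // Advance by the token count of the *re-encoded* emitted text. If this
    // disagrees with the window actually consumed -- plausible for inputs
    // like long ' . ' runs -- the cursor drifts off the real boundaries.
    pos += Math.max(1, enc.encode(chunkText).size());
}

If I were debugging this, I would instrument the real method and compare the window size against the re-encoded advance on the dot-heavy input.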

Feel free to share more details if you find anything, and we can dive deeper into the potential root cause!
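
In the meantime, a possible stopgap is to post-process the splitter's output and hard-split any chunk that still exceeds the token budget, using plain token windows with no punctuation logic, which by construction never exceed the limit. A sketch (the method name and structure are mine, and I haven't battle-tested it):

// Hypothetical safety net: hard-split any chunk whose token count still
// exceeds maxTokens. Each emitted window holds at most maxTokens tokens.
static List<String> enforceTokenLimit(Encoding enc, List<String> chunks, int maxTokens) {
    List<String> safe = new ArrayList<>();
    for (String chunk : chunks) {
        IntArrayList tokens = enc.encode(chunk);
        if (tokens.size() <= maxTokens) {
            safe.add(chunk);
            continue;
        }
        for (int pos = 0; pos < tokens.size(); pos += maxTokens) {
            IntArrayList window = new IntArrayList();
            int end = Math.min(pos + maxTokens, tokens.size());
            for (int i = pos; i < end; i++) {
                window.add(tokens.get(i));
            }
            // Note: cutting at arbitrary token boundaries can split a
            // multi-byte character; acceptable for a diagnostic stopgap.
            safe.add(enc.decode(window));
        }
    }
    return safe;
}

That should at least keep the embedding calls from failing while the root cause in doSplit is tracked down.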