I am trying to use the embedding API with this code:

List<List<Double>> embeddings = embeddingModel.embed(List.of("Hello world", "How are you?"));

Getting: ai.onnxruntime.OrtException: Supplied array is ragged, expected 4, found 6

When I call it the following way, everything works: List<List<Double>> embeddings = embeddingModel.embed(List.of("Hello world", "Hello world"));

My conclusion is that the embed method expects all strings to have the same number of tokens. In this particular case, the first sentence has 4 tokens and the second has 6... Is my understanding correct? Is this a reasonable assumption?

Investigating the problem further, I found that there is the following check in TensorInfo.java:

  /**
   * Extracts the shape from a multidimensional array. Checks to see if the array is ragged or not.
   *
   * @param shape The shape array to write to.
   * @param curDim The current dimension to check.
   * @param obj The multidimensional array to inspect.
   * @throws OrtException If the array has a zero dimension, or is ragged.
   */
  private static void extractShape(long[] shape, int curDim, Object obj) throws OrtException {
    if (shape.length != curDim) {
      int curLength = Array.getLength(obj);
      if (curLength == 0) {
        throw new OrtException(
            "Supplied array has a zero dimension at "
                + curDim
                + ", all dimensions must be positive");
      } else if (shape[curDim] == 0L) {
        shape[curDim] = curLength;
      } else if (shape[curDim] != curLength) {
        throw new OrtException(
            "Supplied array is ragged, expected " + shape[curDim] + ", found " + curLength);
      }
      for (int i = 0; i < curLength; i++) {
        extractShape(shape, curDim + 1, Array.get(obj, i));
      }
    }
  }
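To illustrate the check above outside of ONNX Runtime: a ragged long[][] of token IDs (rows of different lengths) trips the shape[curDim] != curLength branch, while right-padding every row to the batch maximum produces a rectangular array that passes. This is a minimal sketch, not Spring AI code; the padToMax helper and the choice of 0 as the pad ID are my own assumptions.

```java
import java.util.Arrays;

public class PadDemo {

    // Right-pads each row of token IDs to the length of the longest row,
    // filling with padId (assumed here to be the tokenizer's pad token, 0).
    static long[][] padToMax(long[][] batch, long padId) {
        int max = 0;
        for (long[] row : batch) {
            max = Math.max(max, row.length);
        }
        long[][] padded = new long[batch.length][max];
        for (int i = 0; i < batch.length; i++) {
            Arrays.fill(padded[i], padId);
            System.arraycopy(batch[i], 0, padded[i], 0, batch[i].length);
        }
        return padded;
    }

    public static void main(String[] args) {
        // Ragged rows of different lengths, as produced by a tokenizer
        // that does not pad (4 vs 6 tokens, mirroring the reported error).
        long[][] ragged = {
            {0, 35378, 8999, 2},
            {0, 11249, 621, 398, 32, 2}
        };
        long[][] rect = padToMax(ragged, 0L);
        System.out.println(rect[0].length + " " + rect[1].length); // prints "6 6"
        System.out.println(Arrays.toString(rect[0]));
    }
}
```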

Comment From: inpink

Hello, thank you for raising this interesting issue.

Could you please let me know which embedding model and tokenizer you used? I tested the first snippet you mentioned using the TransformersEmbeddingModel and confirmed that it runs successfully. Please refer to the code below, which is based on the TransformersEmbeddingModelTests class from the Spring AI tests.

If you could provide more information, I would like to review it further. Thank you.

@Test
void embedList() throws Exception {
    TransformersEmbeddingModel embeddingModel = new TransformersEmbeddingModel();
    embeddingModel.afterPropertiesSet();
    List<List<Double>> embed = embeddingModel.embed(List.of("Hello world", "How are you?"));

    // Check the size of the overall embedding list
    assertThat(embed).hasSize(2);

    // Check the size and values of the first vector
    assertThat(embed.get(0)).hasSize(384);
    assertThat(DF.format(embed.get(0).get(0))).isEqualTo(DF.format(-0.19744634628295898));
    assertThat(DF.format(embed.get(0).get(383))).isEqualTo(DF.format(0.17298996448516846));

    // Check the size and values of the second vector
    assertThat(embed.get(1)).hasSize(384);
    assertThat(DF.format(embed.get(1).get(0))).isEqualTo(DF.format(0.037425532937049866));
    assertThat(DF.format(embed.get(1).get(383))).isEqualTo(DF.format(-0.08211533725261688));

    // Verify the two vectors are different
    assertThat(embed.get(0)).isNotEqualTo(embed.get(1));

    // Print the embedding vector for the first sentence "Hello world"
    System.out.println("Embedding for 'Hello world': " + embed.get(0));

    // Print the embedding vector for the second sentence "How are you?"
    System.out.println("Embedding for 'How are you?': " + embed.get(1));
}

Spring-ai Not able to call Embedding API with different sentence length in List

Comment From: JirHr

I am using the intfloat/multilingual-e5-large model: https://huggingface.co/intfloat/multilingual-e5-large/tree/main As it is a multipart model (having both model.onnx and model.onnx_data), I could not use the standard TransformersEmbeddingModel. I am using the tokenizer from the onnx directory, and I had to write my own TransformersEmbeddingModel class (extending the external parameters with a session and testing whether the session is null, for backward compatibility). I have my own code that loads the multipart model and creates the session. I went this way because I published the following request and got no response in two weeks: https://github.com/spring-projects/spring-ai/discussions/1034

Now I get correct embeddings when using a single input string. I am willing to share my current code (if somebody lets me know how; I am not very experienced in contributing to a project like this...)

Comment From: JirHr

I have tested the embed method with the model all-MiniLM-L6-v2 (which works fine, without the mentioned issue). The data after tokenization of "Hello world", "How are you?" are:

data[0] = long[128] = [101, 7592, 2088, 102, 0, 0, 0, ...]
data[1] = long[128] = [101, 2129, 2024, 2017, 1029, 102, 0, 0, 0, ...]

i.e. the rows are padded with trailing 0s up to the max context length.

I have tested the embed method with the model multilingual-e5-large (where the issue occurs). The data after tokenization of "Hello world", "How are you?" are:

data[0] = long[4] = [0, 35378, 8999, 2]
data[1] = long[6] = [0, 11249, 621, 398, 32, 2]

i.e. the difference is that the tokenizer of multilingual-e5-large does not pad with trailing 0s.

The model multilingual-e5-large has the parameter "max_position_embeddings": 514, i.e. if the data were padded with trailing 0s, I am sure there would be no error.
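As a sketch of that workaround (not the actual Spring AI or ONNX Runtime code): pad input_ids to the longest sequence in the batch and build a matching attention mask, so the model can ignore the padding positions during pooling. The padWithMask helper and the pad ID 0L are my own assumptions; a real fix would use the tokenizer's configured pad token.

```java
import java.util.Arrays;

public class MaskedPadDemo {

    // Pads token-ID rows to the batch maximum and builds an attention mask:
    // 1 over real tokens, 0 over padding. Returns {ids, mask}.
    static long[][][] padWithMask(long[][] batch, long padId) {
        int max = 0;
        for (long[] row : batch) {
            max = Math.max(max, row.length);
        }
        long[][] ids = new long[batch.length][max];
        long[][] mask = new long[batch.length][max];
        for (int i = 0; i < batch.length; i++) {
            Arrays.fill(ids[i], padId);
            System.arraycopy(batch[i], 0, ids[i], 0, batch[i].length);
            for (int j = 0; j < batch[i].length; j++) {
                mask[i][j] = 1L;
            }
        }
        return new long[][][] { ids, mask };
    }

    public static void main(String[] args) {
        // The e5 tokenizer output from the issue: 4 vs 6 tokens, unpadded.
        long[][] data = {
            {0, 35378, 8999, 2},
            {0, 11249, 621, 398, 32, 2}
        };
        long[][][] padded = padWithMask(data, 0L);
        System.out.println(Arrays.toString(padded[0][0])); // padded ids, row 0
        System.out.println(Arrays.toString(padded[1][0])); // attention mask, row 0
    }
}
```

The rectangular ids array can then be fed to the session without triggering the ragged-array check, while the mask keeps mean pooling from averaging over the pad positions.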

But shouldn't Spring AI also support this type of tokenizer? Is there a chance that the TransformersEmbeddingModel class could be extended to allow external specification of the session?