Thank you for taking time to contribute this pull request! You might have already read the [contributor guide][1], but as a reminder, please make sure to:
- Sign the contributor license agreement
- Rebase your changes on the latest
main
branch and squash your commits - Add/Update unit tests as needed
- Run a build and make sure all tests pass prior to submission
Comment From: nurlicht
The test 'BeanOutputParserTest$FormatTest' fails locally. This behavior also exists on the 'main' branch.
Comment From: tzolov
Thank your for the contribution @nurlicht
Is there are good reason to implement the Id generator for every hash codes algorithm? Wouldn't a simple MessageDigest impl. like in the snipped bellow be good enough? After all, the hash-based id is an heuristic approach, no 100% guaranties.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
....
public static String generateIdFrom(String contentWithMetadata) {
try {
byte[] hashBytes = MessageDigest.getInstance("SHA-256").digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
StringBuilder sb = new StringBuilder();
for (byte b : hashBytes) {
sb.append(String.format("%02x", b));
}
return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
}
catch (NoSuchAlgorithmException e) {
throw new IllegalArgumentException(e);
}
}
Comment From: nurlicht
Hi @tzolov It seems that the class MessageDigest is not thread-safe, whereas the Apache implementations are. As such, I tried to implement multiple implementations of an IdGenerator (the current random one, the proposed one, and seemingly thread-safe versions of the proposed one).
The code as implemented in this PR uses your proposed IdGenerator (as default), unless a client calls the new constructor of Document with a specific instance of IdGenerator.
With that said, if you still prefer to make it simpler (and not to use the Apache algorithms), feel free to change the code or let me know to do it.
Comment From: markpollack
I like the thoughtfulness with respect to thread safety but am a bit hesitant to bring in a third party library just for this functionalty and to offer the user multiple choices of digest options, it can be overwhelming. Perhaps a simpler wrapper approach such as done here would be sufficient. Or even a simpler approach of creating a new one every time it is needed since it is apparently not a performance penalty to do this - see this SO post.
Comment From: nurlicht
Thanks for the feedback. With the new commit: - The Apache dependency has been removed. - MessageDigest is cloned each time it is used. - The interface for the client remains as in the old code, but the new IdGenerator is used. - The code for the new IdGenerator is essentially the one provided in the definition of the issue #113, aside from cloning and some minor refactoring. - Only if the user wants to use another IdGenerator, the new cnstructor can be used (with the type of IdGenerator as an additional argument). - The abstraction for different IdGenerator has been preserved.
Please let me know if you find issues or rooms for improvement. Thanks.
Comment From: tzolov
Thank you @nurlicht, I will have a look shortly.
I wonder if for the sake of the cautiousness, set the default to RANDOM (e.g. to mimic the current behaviour) and switch to hash based one later?
@markpollack what do you think? Also (not related to this PR) I wish we had a better (more centralised) approach to configure stuff like the Id generator or Document's contentFormatter. I guess we need to discuss this after the 0.8.0 release.
Comment From: tzolov
Also @nurlicht what name should i put in the javadoc @author
? nurlicht?
Comment From: nurlicht
@tzolov, I changed the default IdGenerator to be the existing one (RANDOM). @author = Aliakbar Jafarpour
Comment From: tzolov
Refactored, rebased and merged at b0d8d2b9cef37b4af988990fdc086bbae32dc2cd
@nurlicht, I took the deliberate to refactored the PR a bit according to our current needs.
Please have a look and let me know if I have perhaps oversimplified or missed something.
Thank you for your contribution!