Introduction: Before, I created an issue at spring-data-mongodb, but they recommended creating an issue for spring boot. Therefore, I copied the issue and created a new one in this repository.
Context: We have scheduled services (jobs) that run multiple times a day. However, there is an issue where 1-3 runs per day fail with an error. All other runs are problem-free. Therefore, the error seems blinked.
Environment: 1) We run these services using an docker image on Kubernetes. 2) We connect to AWS DocumentDB.
Versions: I think the main dependencies are Spring and related MongoDB libraries (such as mongo-driver). We obtain these dependencies through spring-boot-dependencies. Also, every spring version have different mongo related lib versions. We have tested them with different versions:
1) spring 3.3.2 - no issues (or we never catch them) 2) spring 3.3.4 - error occurs 2) spring 3.3.5 - error occurs 3) spring 3.4.0 - error occurs
Verification: Since the issue is not consistent, we have examined the logs and checked for common causes such as missing classpath or empty BOOT-INF data and etc. Everything appears correct (and most of the runs are successful).
Error full stacktrace: trace.txt
Notes: 1) At different versions, sometimes we encounter a NoClassDefFoundError with other com/mongodb classes. 2) Mongo FAQ that related to issue but looks like everything at our sine is correct. That why was decided that problem could be on spring side. 3) Local Integration test + mongo testcontainer (not AWS DocumentDB) never get error. 4) Maybe something related with 38611
Comment From: wilkinsona
Thanks for the report.
There certainly appears to be some similarity with #38611, but in this case it seems that the failure's occurring during startup and on the main thread. The problem in #38611 was occurring off the main thread and only when it had been interrupted. Unfortunately, without a way to reproduce the failure, I don't think we'll be able to diagnose the problem and either fix it or determine that it has an external cause.
You could try using the classic loader implementation to see if that makes a difference. The new loader was introduced in Spring Boot 3.2.0 so, given that 3.3.2 works, it may not make a difference. It would also be interesting to know if the problem occurs with 3.3.3 as that would narrow down the differences between a working version and a failing version.
Beyond that, I think we're really going to need a minimal sample that reproduces the problem to make any progress here.
Comment From: Osmyslitelny
We tried using the classic loader on Spring 3.4.0 with a cron job set to run every 5 minutes. In the past 24 hours, we haven't had any failures, which is better than our usual rate of around 3 failures. Next, we will test version 3.3.3 without the classic loader for the next 24 hours, and also continue running the job without the classic loader for 48 hours to further evaluate its result.
Unfortunately, I have no idea how to reproduce the failure because when you get 3 failures at day with this (5 min) schedule there is no idea what to do to reproduce that or catch.
Tomorrow, I will update the information regarding 3.3.3. If you have any ideas on how to reproduce or expand the logs, or any other suggestions, please feel free to share them.
Comment From: Osmyslitelny
We checked other version and current data is:
spring 3.3.2 - no issues (or we never catch them) spring 3.3.3 - no issues (or we never catch them) spring 3.3.4 - error occurs spring 3.3.5 - error occurs spring 3.4.0 - error occurs spring 3.4.0 with classic loader - no issues (or we never catch them)
If its also needed we use amazoncorretto:21.0.5
Comment From: wilkinsona
Thanks for the additional details.
Looking more closely at the original stack trace, I've noticed a couple of things:
- A
MultiServerClusteris being created - The failure's occurring beneath
com.mongodb.internal.Locks.withInterruptibleLockwhich casts some doubt on the theory that thread interruption is the cause
1 requires some non-default configuration of Mongo. Please share that configuration with us.
For 2, if I call Thread.currentThread.interrupt() before Mongo client's created, it quickly fails with a com.mongodb.MongoInterruptedException. If I, through the debugger, interrupt the thread once the lock has been taken, creating of the cluster succeeds without any class loading problems. I won't discount thread interruption completely at this point, but it is looking unlikely so we need some other avenues to explore.
The loader can be configured to output some debug information (-Dloader.debug=true). It could be informative to compare this output from a run that succeeds and a run that fails. The two runs should be with the exact same binary so that all of the information about the positions of files in the archive and their sizes remains constant. Could you please gather this information and share it with us?
Comment From: Osmyslitelny
I will collect all the mentioned information on various versions to compare. The error hasn't happened again in the last few days, but please don't close the issues for now. Once I have all the required logs, I will update the thread.
Comment From: AceOfSnakes
looks like -Dloader.debug=true solved or masked real problem
Comment From: philwebb
That may well be because the System.out logging causes enough of a delay that the race condition (or whatever is causing the problem) doesn't occur.
Comment From: AceOfSnakes
Yes - exactly - by my opinion
Comment From: wilkinsona
That's unfortunate, but thanks for trying it. Working on the theory that it may be a race condition, could you try deploying the apps with -Dspring.backgroundpreinitializer.ignore=true?. This will (largely) limit class loading during startup to the main thread.
Comment From: Osmyslitelny
We collected full logs with different settings that you requested. Mongo settings could be found at 'without debug file'. They printed as 'log info' object.
spring_335_wihtout_debug_failed.log spring_335_without_debug_passed.log spring_335_debug_failed.log spring_335_debug_passed.log
Also, I run -Dspring.backgroundpreinitializer.ignore=true without debug flag but we need time to observe it (1-2 days I think).
Comment From: Osmyslitelny
With -Dspring.backgroundpreinitializer.ignore=true still get the same error (
Comment From: wilkinsona
Thanks, @Osmyslitelny. I had hoped to compare the debug_failed and debug_passed logs but, unfortunately, the former seems to be incomplete. debug_failed starts differently to debug_passed and only seems to contain content related to loading Spring Boot's failure analysers. This would only happen after the NoClassDefFoundError has been thrown so my guess is that some earlier logging has somehow been missed.
Would it be possible to capture the output again please?
Comment From: Osmyslitelny
I agree that logs from debug_failed look incomplete, but this is the full logs that pods return with the -Dloader.debug=true flag. For me, it is not obvious why all logs disappear with this additional flag that should make logs more "wider". I going to try to get more logs but actually I don't know why it happens and open to any idea and recommendation how get full logs and keep this flag.
Comment From: wilkinsona
Sorry, I don't know why that would be. The passed logs should that output from the beginning of start up can be produced successfully. If I were to guess, I'd guess that something in the pod is limiting or tailing the output, perhaps only once it has reached a certain length.
Comment From: spring-projects-issues
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
Comment From: Osmyslitelny
We found the full logs, but unfortunately, we couldn't capture the failure with debug information at this time (7 days 24/7 test period). The additional log output delay provided enough time to prevent the issue from being reproduced. Therefore, we only have full logs for the successful case. A potential proposal regarding this behavior was made earlier.