Description

When using WebFlux to upload large files, the upload sometimes hangs before the request ever reaches the route's business logic. Specifically, the hang occurs while org.springframework.http.codec.multipart.MultipartParser is consuming the client's upload and the part is being written to a temporary file. The problem is intermittent but reproduces with some probability when uploading files of around 200MB, 2-3GB, and 4-5GB.

Steps to Reproduce

  1. Set up a Spring Boot application using WebFlux to handle file upload requests (a minimal handler sketch follows this list).
  2. Upload a large file (e.g., 200MB, 2-3GB, or 4-5GB).
  3. Observe the upload process. Occasionally, the upload hangs before the file is fully transmitted to the server. At this point, the generated temporary file xxx.multipart is incomplete, and the request remains in a pending state without completing the upload.
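
For context, the handler boils down to something of roughly the following shape. This is only a minimal sketch under assumptions: the /upload path, the UploadController class, and the "file" part name are illustrative and not taken from the attached reproducer project.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import reactor.core.publisher.Mono;

import org.springframework.http.MediaType;
import org.springframework.http.codec.multipart.FilePart;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestPart;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UploadController {

    // The hang reported above happens while the multipart body is still being parsed,
    // i.e. before transferTo(...) below is ever reached.
    @PostMapping(value = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public Mono<String> upload(@RequestPart("file") Mono<FilePart> file) {
        return file.flatMap(part -> {
            try {
                Path target = Files.createTempFile("upload-", "-" + part.filename());
                return part.transferTo(target).thenReturn("saved to " + target);
            }
            catch (IOException ex) {
                return Mono.error(ex);
            }
        });
    }
}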

Additional Resources

  • A minimal reproduction project is attached that can replicate the issue. However, it is not guaranteed to happen every time; I needed several attempts: large-file-upload.zip
  • Screen Recording Demonstration:

https://github.com/user-attachments/assets/4d92b3e3-df14-4217-9081-52a321516900

In the screen recording, I used an ISO file that I downloaded from the Manjaro official website: https://download.manjaro.org/gnome/24.2.1/manjaro-gnome-24.2.1-241216-linux612.iso

The file upload progress in the video is stuck at 68% and never completes, no matter how long you wait, with no error messages displayed.

Expected Behavior

Large files should upload and be processed smoothly without hanging or blocking at any stage.

Actual Behavior

When uploading large files, the upload process sometimes hangs during the MultipartParser phase of writing to the temporary file, preventing the upload from completing.

Environment Information

  • Spring Boot Version: 3.4.1
  • Java Versions:
    • OpenJDK 17
    • Eclipse Temurin JDK 21.0.5
  • Operating Systems:
    • macOS
    • Linux
  • Other Relevant Dependencies: spring-boot-starter-webflux

Note: The issue has been tested and reproduced across all the above-mentioned systems and JDK versions.

Additional Information

This issue is not caused by business logic but by the framework's handling of large file uploads while writing them to temporary files, which blocks the request. We hope the team can investigate and resolve this issue to improve the stability and reliability of large file uploads.

Comment From: guqing

Perhaps it would be more appropriate to raise this issue in the spring-framework project. I apologize for overlooking this. Could the developers please transfer this issue to the spring-framework repository?🤪

Comment From: JohnNiang

Several upload attempts with the same file, 2.7G Nov 26 15:38 Fedora-Server-dvd-x86_64-41-1.4.iso, left these incomplete temporary files:

    1.7 GiB [####################]  6138071211851868018.multipart
    1.5 GiB [#################   ]  14413026015246421036.multipart
    1.4 GiB [################    ]  15576621334079039706.multipart
    1.1 GiB [############        ]  14246059143399172415.multipart

Comment From: sdeleuze

Thanks for the very high quality reproducer, that's much appreciated. So far I am unable to reproduce after 10 tries with manjaro-gnome-24.2.1-241216-linux612.iso. Could you share some indications on how frequent the issue is on your side?

Comment From: chenggangpro

From my debugging: whenever the condition upstream() != null && !this.sink.isCancelled() && this.sink.requestedFromDownstream() == 0 && !this.requestOutstanding.get() is hit, the upload hangs 100% of the time in my local tests (a few dozen runs). Whenever the upload succeeds, that debug point is never reached.

Comment From: bclozel

@chenggangpro it would be interesting to know which condition doesn't match. Is it "this.sink.requestedFromDownstream() == 0" or "this.requestOutstanding == false"? This is probably a concurrency issue in the parser, and we need to pinpoint it exactly in order to fix it.

I think the proposed fix in #34388 accidentally fixes things by over-requesting data, but queuing "onNext" signals might cause other issues: too much memory consumption, or even parsing errors?

Comment From: chenggangpro

@bclozel I think it's the condition this.sink.requestedFromDownstream() == 0, but I don't know what the cause is.

My local debugging is as below:

I add some debug logging points to MultipartParser:

  • MultipartParser#parse method

Original Source Code Location

public static Flux<Token> parse(Flux<DataBuffer> buffers, byte[] boundary, int maxHeadersSize, Charset headersCharset) {
    return Flux.create(sink -> {
        MultipartParser parser = new MultipartParser(sink, boundary, maxHeadersSize, headersCharset);
        sink.onCancel(parser::onSinkCancel);
        sink.onRequest(l -> logger.warn("===== Sink On request : " + l));// here is the debug logging point
        buffers.subscribe(parser);
    });
}

I didn't add parser.requestBuffer() into the sink.onRequest(...), just the logging point.

  • MultipartParser#requestBuffer method

Original Source Code Location

private void requestBuffer() {
    if (upstream() != null &&
            !this.sink.isCancelled() &&
            this.sink.requestedFromDownstream() > 0 &&
            this.requestOutstanding.compareAndSet(false, true)) {
        request(1);
    }
    else if (!this.requestOutstanding.get()) {
        // here is the debug logging point
        logger.warn("===== Request buffer called =================");
        logger.warn("===== Sink is cancelled :" + sink.isCancelled());
        logger.warn("===== Sink requested from down stream :" + sink.requestedFromDownstream());
        logger.warn("===== Request buffer called =================");
    }
}

Then I uploaded manjaro-gnome-24.2.1-241216-linux612.iso, and here is the log large-file-upload-original-debug-logging.zip.

You can see in lines L#483532 to L#483539 of the log file that sink.requestedFromDownstream() is ZERO. Furthermore, the requestBuffer() method is never called from within sink.onRequest(...). So in my opinion this is the bug point, but I don't know what causes sink.requestedFromDownstream() == 0. When I first dove into the parser, I added a debug point at MultipartParser.java#L192 with the condition sink.requestedFromDownstream() == 0; that is why I added the logging debug points mentioned earlier.

I kept PR #34388 running for 6 hours yesterday, and there were no hangs or parsing errors. However, I am not sure whether my fix is actually correct or whether there are potential issues I haven't noticed. I hope my debugging process is useful to you all in solving this issue.

Comment From: guqing

> Thanks for the very high quality reproducer, that's much appreciated. So far I am unable to reproduce after 10 tries with manjaro-gnome-24.2.1-241216-linux612.iso. Could you share some indications on how frequent the issue is on your side?

Thanks for looking into this! There isn't a fixed frequency for reproduction: it might take just 3 attempts sometimes, while other times it doesn't reproduce even after dozens of tries. It seems to require a bit of luck. However, in applications with a large user base, users do encounter this issue. Fortunately, they can usually work around it by retrying, as seen in cases like https://github.com/halo-dev/halo/issues/7170.

Comment From: sdeleuze

I just reproduced it.

[Screenshot attached]

Comment From: sdeleuze

@chemicL We suspect that here, when this.sink.requestedFromDownstream() == 0, request(1) is skipped (which is intended), but then the related requestBuffer() method is never invoked again, and the transfer of big files hangs forever.

#34388 proposed adding sink.onRequest(l -> parser.requestBuffer()); in the Flux#create lambda. Adding this additional sink.onRequest invocation kind of makes sense, but we were concerned about potentially over-requesting. I am wondering if we should introduce a refined version of sink.onRequest(l -> parser.requestBuffer()); that would invoke parser.requestBuffer() conditionally, only after this.sink.requestedFromDownstream() == 0 has been detected. Does that make sense from your POV? Do you have any guidelines or links to similar use cases that could help?
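
To make the idea concrete, here is a rough sketch of what such a conditional re-request could look like. The demandSkipped flag is hypothetical and not part of the current MultipartParser; this only illustrates the shape of the change, not a definitive fix.

// Hypothetical field on MultipartParser (java.util.concurrent.atomic.AtomicBoolean),
// not present in the current code:
private final AtomicBoolean demandSkipped = new AtomicBoolean(false);

public static Flux<Token> parse(Flux<DataBuffer> buffers, byte[] boundary, int maxHeadersSize, Charset headersCharset) {
    return Flux.create(sink -> {
        MultipartParser parser = new MultipartParser(sink, boundary, maxHeadersSize, headersCharset);
        sink.onCancel(parser::onSinkCancel);
        // Only re-drive the upstream when a previous requestBuffer() call was skipped
        // because downstream demand was zero, to avoid over-requesting.
        sink.onRequest(n -> {
            if (parser.demandSkipped.compareAndSet(true, false)) {
                parser.requestBuffer();
            }
        });
        buffers.subscribe(parser);
    });
}

private void requestBuffer() {
    if (upstream() != null &&
            !this.sink.isCancelled() &&
            this.sink.requestedFromDownstream() > 0 &&
            this.requestOutstanding.compareAndSet(false, true)) {
        request(1);
    }
    else if (upstream() != null && !this.sink.isCancelled()
            && this.sink.requestedFromDownstream() == 0) {
        // Remember that demand was zero so the next onRequest signal retries the request.
        this.demandSkipped.set(true);
    }
}

With something like this, the parser would only re-request once per detected zero-demand situation instead of on every onRequest signal, which should avoid the over-requesting concern mentioned above.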