
Large HTML File conversion to PDF hangs. #180

Closed
rajaningle opened this issue Mar 7, 2018 · 25 comments

Comments

@rajaningle

rajaningle commented Mar 7, 2018

Hi,

I am trying to convert a large HTML file (approximately 600 pages) to PDF, but the conversion never completes and the process hangs.

Following is my observation after debugging the core.
PdfRendererBuilder makes the following method calls:

1. renderer.layout(); // This takes significant time but completes.
2. renderer.createPDF(); // This never completes and hangs the process.

When I looked into it, renderer.createPDF() builds the entire PDF in memory (the document) and only starts writing to the OutputStream after completion.

Could we write it directly to the OutputStream page by page? I think this might solve the problem.

Following is my code snippet; please check whether I am doing anything wrong here.

public void exportToPdf(List<Map<String, Object>> data, String template, Map<String, Object> exportData,
            Configuration cfg, String pdfURL) throws Exception   {
        File exportedPdfFile = null;
        File exportedPdfFileTemp = null;
        FileChannel pdfFileSrcIS = null;
        FileChannel pdfFileDestOS = null;
        FileOutputStream tmpFileOS = null;
        BufferedOutputStream tmpFileBOS = null;
        FileInputStream pdfSrcIS = null;
        FileOutputStream exportedPdfFileOs = null;
        try {
            // Create New File
            exportedPdfFileTemp = new File(pdfURL + "_" + TEMP);
            LOGGER.info("### Temp PDF File Name After Creation :::" + pdfURL + "_" + TEMP);
            
            tmpFileOS = new FileOutputStream(exportedPdfFileTemp);
            tmpFileBOS = new BufferedOutputStream(tmpFileOS);
            // Create Builder
            PdfRendererBuilder builder = new PdfRendererBuilder();
            addFonts(builder);
            // Generate HTML Template String
            String htmlTemplateString = generateHtmlFromTemplate(data, template, exportData, cfg);
            // Generate Doc from the HTML string
            Document doc = html5ParseDocument(htmlTemplateString, PDF_GENERATION_TIMEOUT);// builder.withUri(url);
            builder.withW3cDocument(doc, null);
            // Write the PDF to file
            builder.toStream(tmpFileBOS);
            builder.run();

            LOGGER.info("::: PDF Generation Successful with :::");

            exportedPdfFile = new File(pdfURL);

            if (exportedPdfFileTemp.renameTo(exportedPdfFile)) {
                LOGGER.info("### Temp FILE Renamed To ::: " + pdfURL);
            } else {
                LOGGER.info("### Temp FILE Rename Failed Creating New File ::: " + pdfURL);
                pdfSrcIS = new FileInputStream(exportedPdfFileTemp);
                pdfFileSrcIS = pdfSrcIS.getChannel();
                exportedPdfFileOs = new FileOutputStream(exportedPdfFile);
                pdfFileDestOS = exportedPdfFileOs.getChannel();
                LOGGER.info("### Starting Copy Operation ::: " + pdfURL);
                pdfFileDestOS.transferFrom(pdfFileSrcIS, 0, pdfFileSrcIS.size());
                LOGGER.info("### Copy Operation Completed ::: " + pdfURL);
            }

            LOGGER.info("### *** File Created *** ::: " + pdfURL);
            LOGGER.info("### PDF Created successfully!");

        } catch (Exception e) {
            LOGGER.error("Error generating PDF :" + e.getMessage(), e);
            throw e;
        } finally {
            // Close all streams

            if (exportedPdfFileOs != null) {
                org.apache.commons.io.IOUtils.closeQuietly(exportedPdfFileOs);
            }
            if (pdfSrcIS != null) {
                org.apache.commons.io.IOUtils.closeQuietly(pdfSrcIS);
            }
            
            if (tmpFileBOS != null) {
                org.apache.commons.io.IOUtils.closeQuietly(tmpFileBOS);
            }
            if (tmpFileOS != null) {
                org.apache.commons.io.IOUtils.closeQuietly(tmpFileOS);
            }
            // Clean after creation
            try {
                if (exportedPdfFileTemp.isFile()) {
                    if (exportedPdfFileTemp.delete()) {
                        LOGGER.info("### Temp PDF File :::" + exportedPdfFileTemp.getName() + " is deleted!");
                    } else {
                        LOGGER.error("Temp File Delete operation is failed.");
                    }
                }

            } catch (Exception deleteException) {
                LOGGER.error("Error Deleting Temp File!  Name :::" + pdfURL + deleteException.getMessage());
            }
        }
    }

In the above code snippet, builder.run() never completes and the process hangs.

Please help me with the solution.

Thanks in advance.

@dilworks

Sounds silly to ask, but how much memory are you allocating to your JVM? Try setting a higher limit with -Xmx.

When there is not enough RAM, the generator will hang while eating all of your CPU time.

@danfickle
Owner

How long does it hang for? Could it be that it is hitting disk to use the swap space? As @dilworks asks, how much memory are you allocating to Java and how much physical memory is available to the machine?

Hanging is obviously unacceptable, so I'm keen to get to the bottom of this one. I'll also investigate the memory/disk options of PDF-BOX (currently it is constructed completely in memory) and reply here.

@rajaningle
Author

Hi, thanks for the reply @dilworks and @danfickle.

We have hosted it on an AWS t2.micro instance, where it never completes (hangs indefinitely). We have provided the following options:

Initial JVM heap size: 256m
JVM command line options: blank
Maximum JVM heap size: 256m
Maximum JVM permanent generation size: 64m

On my local machine it hangs for more than 20 minutes and eats all the CPU. The local machine has around 4 GB of free physical memory and a 256m heap.

I will try increasing the heap as @dilworks suggested.

But I feel it would be better to construct the document directly on disk instead of in memory, which should reduce memory pressure.

@danfickle please investigate and implement a solution. Meanwhile I am also investigating PDFBox options to construct it on disk and will post if I find something useful.

@rajaningle
Author

Hi @dilworks, I have tried assigning -Xmx2048m and it did not resolve the problem; it still hangs.

@danfickle it is hitting the disk for swap space. Please check the screenshot below.

[screenshot: memory/swap usage]

danfickle added a commit that referenced this issue Mar 27, 2018
+ Allow user to create their own PDDocument with memory settings of
their choice.
+ Fix silly bug in bidi splitter that was taking more than half the
time in my sample document (according to VisualVM).
@danfickle
Owner

Thanks @rajaningle

I added a builder method to pass in your own PDDocument which can be configured in the constructor with a MemoryUsageSetting to control how much memory/disk is used by PDFBOX.
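From the caller's side, using this might look roughly like the sketch below, with PDFBox's MemoryUsageSetting telling it to buffer the document in a temp file instead of main memory. The builder method name (usePDDocument) and the surrounding code are an assumption based on this comment, not a verbatim API reference, so check the current builder API:

```java
import java.io.OutputStream;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;

public class DiskBackedRender {
    // Render HTML to PDF while letting PDFBox spill page data to a
    // temporary file rather than holding the whole document in memory.
    public static void render(String html, OutputStream out) throws Exception {
        // Assumed API: PDDocument accepts a MemoryUsageSetting and the
        // builder accepts a user-supplied PDDocument, per this comment.
        PDDocument doc = new PDDocument(MemoryUsageSetting.setupTempFileOnly());
        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.usePDDocument(doc);
        builder.withHtmlContent(html, null);
        builder.toStream(out);
        builder.run();
    }
}
```

MemoryUsageSetting also offers mixed modes (e.g. setupMixed) if you want a RAM budget with disk overflow rather than temp-file-only buffering.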

However, with my simple testing of a large document, this didn't fix the problem, so I am now profiling with VisualVM to find CPU/memory hogs. I've already found a major CPU hog, as discussed in #170.

Thanks for your patience and hopefully we can get this fixed.

@rajaningle
Author

Thanks @danfickle
I will check with new fix whether it improves performance in my project.

I was checking PDFBox options and came across the doc.saveIncremental(outStream) method.

Link: https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html#saveIncremental(java.io.OutputStream)

Please check if we can use it and whether this method resolves our problem.

Thanks.

@javimartinez

Hi, today we hit the same issue as @rajaningle, trying to convert an HTML document of about 400 pages with version 0.0.1-RC12. After reading this issue, we tried the SNAPSHOT version, building a PDDocument with MemoryUsageSetting.setupTempFileOnly(), and this works fine for us. (We haven't done performance testing yet.)

Are you planning to do a new release?

Thanks!

@rajaningle
Author

Hi @danfickle, I tried with MemoryUsageSetting.setupTempFileOnly() and it did not solve the problem; it is still hogging the CPU and memory.

danfickle added a commit that referenced this issue Apr 3, 2018
@danfickle
Owner

OK, I generate a large (inline only) document with this code:

	private static void createLargeInlineDoc() throws IOException {
		OutputStream os2 = new FileOutputStream("/Users/me/Documents/pdf-issues/issue-180.htm");
		
		PrintWriter pw = new PrintWriter(os2);
		
		pw.println("<html>");
		pw.println("<head>");
		pw.println("</head>");
		pw.println("<body>");
		
		for (int i = 0; i < 100000; i++) {
			pw.println("Normal <strong>Bold</strong> <i>Italic</i>");
		}

		pw.println("</body>");
		pw.println("</html>");
		
		pw.close();
		os2.close();
	}

After fixing the two BIDI performance bugs it is down to 11 seconds on my machine, from a staggering 400 seconds before!

Next up in the profiler is this monstrosity (finally one that's not mine), from com.openhtmltopdf.layout.WhitespaceStripper:

    private static String collapseWhitespace(InlineBox iB, IdentValue whitespace, String text, boolean collapseLeading) {
        if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
            text = linefeed_space_collapse.matcher(text).replaceAll(EOL);
        } else if (whitespace == IdentValue.PRE) {
            text = space_before_linefeed_collapse.matcher(text).replaceAll(EOL);
        }

        if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
            text = linefeed_to_space.matcher(text).replaceAll(SPACE);
            text = tab_to_space.matcher(text).replaceAll(SPACE);
            text = space_collapse.matcher(text).replaceAll(SPACE);
        } else if (whitespace == IdentValue.PRE || whitespace == IdentValue.PRE_WRAP) {
            int tabSize = (int) iB.getStyle().asFloat(CSSName.TAB_SIZE);
            char[] tabs = new char[tabSize];
            Arrays.fill(tabs, ' ');
            text = tab_to_space.matcher(text).replaceAll(new String(tabs));
        } else if (whitespace == IdentValue.PRE_LINE) {
            text = tab_to_space.matcher(text).replaceAll(SPACE);
            text = space_collapse.matcher(text).replaceAll(SPACE);
        }

        if (whitespace == IdentValue.NORMAL || whitespace == IdentValue.NOWRAP) {
            // collapse first space against prev inline
            if (text.startsWith(SPACE) &&
                    collapseLeading) {
                text = text.substring(1, text.length());
            }
        }

        return text;
    }

Note that text in normal mode goes through four regular expression replaces and a substring. Unless someone else provides a replacement without regular expressions, I'll work on it tomorrow, and then do the release.
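For the normal/nowrap path, a regex-free replacement could be a single pass over the string. This is only a sketch of the idea, not the library's actual fix; the class and method names are made up:

```java
public class CollapseSketch {
    // Collapse runs of whitespace (space, tab, CR, LF) to a single space
    // in one pass, instead of four regex replaces plus a substring.
    // collapseLeading mimics collapsing the first space against the
    // previous inline box: when true, a leading space is dropped.
    public static String collapse(String text, boolean collapseLeading) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean lastWasSpace = collapseLeading;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                if (!lastWasSpace) {
                    sb.append(' ');
                }
                lastWasSpace = true;
            } else {
                sb.append(c);
                lastWasSpace = false;
            }
        }
        return sb.toString();
    }
}
```

This does the whole normal-mode collapse (newlines, tabs, space runs, and the leading-space case) in O(n) with one output buffer, instead of allocating an intermediate string per regex pass.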

@dilworks

dilworks commented Apr 5, 2018

I've decided to follow your steps and profile everything on my setup, using one of my RAM-eating please-have-mercy test cases: a rather simple table-based report (complete with headers and footers) that easily gets into the thousands of pages. It's a transaction log report for an entire year, and for a mid-sized customer it goes over 5000 pages; this was the reason I was forced to fiddle with -Xmx (apparently this flaw was inherited from FS). This report in particular is rather CPU-bound... until it's time to generate the PDF, when my JSF-generated XHTML brings FS/OH to its knees, this time massively eating RAM.

What I found was... this:

[profiler screenshots: profile_oh_hotpaths, profile_oh_loggingwhat, profile_oh_xrlog_2, profile_oh_xrlog_jboss]

A logging statement (shown in the profiler screenshots) is causing the JBoss/WildFly logging subsystem to go insane and drain a non-insignificant slice of CPU time! Leaving my own code aside, this single logging call ends up eating almost half of the CPU time.

(And if you were wondering: no, I never got my 5000+ page PDF - profiling makes everything go much slower, plus I was testing with some real data that easily ate the 3GB limit I had set)

danfickle added a commit that referenced this issue Apr 5, 2018
…ons with more performant loop for normal and no-wrap mode white-space settings.
danfickle added a commit that referenced this issue Apr 5, 2018
…ERE.

SEVERE is too severe for a common warning.
@danfickle
Owner

Thanks @dilworks

The only thing I could think of causing a slowdown is the fact that it was logging at SEVERE. Could WildFly be set up to do something special with SEVERE log messages? Anyway, I have downgraded it to WARNING to be consistent with other CSS warnings.

I also released RC-13, so we'll make the next release focused on performance and memory. Much work is needed to get 5000+ page documents running smoothly!

@dilworks

dilworks commented Apr 5, 2018

Loving that couple of fixes. After some quick tests, performance is now on par with FS, and it even beats it a few times with the same 18-page test doc I had attached. But then, that's just the beginning.

Thank you very much for the improvements @danfickle !

@rajaningle
Author

Thanks @danfickle, there is some performance improvement with the current fixes, but it still hangs while generating huge documents (5000+ pages). I have an open defect for this and need to resolve it ASAP, because all of our large documents need to be exported and the functionality breaks while generating huge PDFs. Please see if you can find a solution to this hang.

@rototor
Contributor

rototor commented Apr 6, 2018

@rajaningle This may not solve your problem, but with those 5000+ pages you have many DOM nodes in memory, and therefore need tons of memory for the DOM nodes alone.

=> You could try eXist-db to solve this memory problem. eXist-db allows you to store large amounts of XML in a persistent file. It also allows you to query it very fast using XQuery (this is what I used eXist-db for in another project ten years ago...). And all the nodes also implement org.w3c.dom.

Something like this could work:

If I understand the documentation correctly, XMLResource.getContentAsDOM() gets you the content as an org.w3c.dom node which is then lazily loaded from the database, so that only the nodes needed at a given time are held in memory.

You could then feed the DOM into the PdfRendererBuilder using withW3cDocument(). I cannot guarantee that this will work correctly and really reduce the memory pressure, but it is at least something you could try.

@rototor
Contributor

rototor commented Apr 6, 2018

I've just created a testcase for this problem, see #194. It takes 5m 49s on my MacBook Pro 2014 (16 GB RAM) to create an 18.5 MB HTML file and a 232 MB result PDF with 12694 pages, using JDK 1.7.0_52.

I.e., I cannot reproduce the problem; it works for me. @rajaningle, what JDK are you using, and on what OS? Please look at the testcase; it only contains text and tables. What else are you using in your report?

@dilworks

dilworks commented Apr 6, 2018

Managed to find some time to test one of my huge files. I'm attaching a sample (a couple of pages with test data; the real production report only has longer strings and bigger numbers, but nothing else) so you guys can check the layout. It's rather simple, as I've already said, yet it can easily grow in size since the report is a transaction log: with one of my datasets it generates a ~4300-page PDF. I'm testing with -Xmx3g (my laptop only has 6 GB RAM; thankfully our production setups either never have enough data to push things to the limit, or have at least 8 GB dedicated to our app).

  • On FS it takes about 6 minutes on this ancient Penryn laptop (Core 2 Duo P8600), but I eventually get my 4300-page document.
  • With OH it takes over 10 minutes... but I get no document; instead it gives up with an OutOfMemoryError:
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.getHashName(PdfBoxFontResolver.java:365)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(PdfBoxFontResolver.java:342)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(PdfBoxFontResolver.java:301)
	at com.openhtmltopdf.pdfboxout.PdfBoxFontResolver.resolveFont(PdfBoxFontResolver.java:70)
	at com.openhtmltopdf.layout.SharedContext.getFont(SharedContext.java:356)
	at com.openhtmltopdf.layout.LayoutContext.getFont(LayoutContext.java:336)
	at com.openhtmltopdf.render.InlineBox.getTextWidth(InlineBox.java:168)
	at com.openhtmltopdf.render.InlineBox.calcMinWidthFromWordLength(InlineBox.java:255)
	at com.openhtmltopdf.render.InlineBox.calcMinMaxWidth(InlineBox.java:378)
	at com.openhtmltopdf.render.BlockBox.calcMinMaxWidthInlineChildren(BlockBox.java:1688)
	at com.openhtmltopdf.render.BlockBox.calcMinMaxWidth(BlockBox.java:1562)
	at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.recalcColumn(TableBox.java:1247)
	at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.fullRecalc(TableBox.java:1221)
	at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.calcMinMaxWidth(TableBox.java:1516)
	at com.openhtmltopdf.newtable.TableBox.calcMinMaxWidth(TableBox.java:158)
	at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:221)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
	at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
	at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:985)
	at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:865)
	at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:794)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
	at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
	at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
	at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:985)
	at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:865)

Test case here:
mp_test.tar.gz
Just copy-paste the <table id=main> a few hundred times and you'll get a realistic test load similar to my case.

...thankfully that's the only heavyweight report in my app (and the least used one), yet due to $REGULATIONS our customers have to generate it at least a couple of times per year: even the first run with an empty transaction log easily goes over 100 pages (one per department).

@rototor
Contributor

rototor commented Apr 6, 2018

I've integrated parts of your sample in #194 and just put a big Freemarker loop around it. Something strange is going on in the Bidi-splitter stuff. There is one ParagraphSplitter$Paragraph object with 2.2 million entries in the textRuns hash map... this seems to be the root cause? At least it seems strange to me that one Paragraph object can have so many entries...

@danfickle you should be able to investigate that in my #194 pull request.

Did you disable the logging with XRLog.setLoggingEnabled(false)? The logging causes some overhead even if the logger does not write the output anywhere, because the log messages are generated regardless.

@danfickle
Owner

@rototor

Regarding the bidi splitter: it currently defines a paragraph as a block element. It should define a paragraph as anything block-like, for example a table cell. I meant to make this trivial fix in RC-13 but somehow forgot.

@dilworks
Will do some more performance work tomorrow based on your sample.

danfickle added a commit that referenced this issue Apr 8, 2018
Testcase big document #180 with perf improvements.
danfickle added a commit that referenced this issue Apr 8, 2018
…splitter.

Also define a paragraph as anything block-like or out-of-flow.
@danfickle
Owner

I've been thinking about the painting side. The core algorithm is:

For each page:
    For each layer:
        For each top-level box, such as a line box:
            Output if on this page.

This leads to a method call count of page-count x layer-count x box-count, or, for an 1800-page document with one layer and 50-something lines per page, about 180 million iterations, which I've observed in the profiler. This is essentially O(n^2). But we have a sorted list of pages, so we should be able to binary search and get down at least to O(n log n), or about a million iterations (a roughly 200-fold decrease) for the 1800-page document. That would really speed everything up.
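The binary search over the sorted page list could look roughly like this. It is a sketch with a simplified page representation (an array of page bottom Y coordinates); the real renderer works on PageBox objects, so names and types here are illustrative:

```java
public class PagePosition {
    // Given the sorted bottom Y coordinate of each page, find the index of
    // the first page whose bottom edge is at or below the box's top Y.
    // Each box then only needs to be checked against the few pages it can
    // actually intersect, instead of being tested against every page.
    public static int firstPageContaining(int[] pageBottoms, int boxTop) {
        int lo = 0, hi = pageBottoms.length - 1, result = pageBottoms.length;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (pageBottoms[mid] >= boxTop) {
                result = mid;      // candidate page; keep looking for an earlier one
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return result;             // pageBottoms.length means below the last page
    }
}
```

With this, painting iterates boxes once and binary-searches the page for each, giving box-count x log(page-count) work instead of page-count x box-count.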

danfickle added a commit that referenced this issue Apr 8, 2018
Also beginning of style caching with which conditions will disable it.
danfickle added a commit that referenced this issue Dec 20, 2018
danfickle added a commit that referenced this issue Dec 20, 2018
…w page.

Tests that rotated text on overflow page entirely clipped out by the page margin should not generate an overflow page as such page will be visually empty.
danfickle added a commit that referenced this issue Dec 23, 2018
…ng a larger replaced text.

On two vertical pages and one overflow page.
danfickle added a commit that referenced this issue Dec 23, 2018
… does not generate a horizontal overflow page.
danfickle added a commit that referenced this issue Dec 23, 2018
…oes not generate a horizontal overflow page.
danfickle added a commit that referenced this issue Dec 23, 2018
…T output table header, footer or caption on every page.
danfickle added a commit that referenced this issue Dec 23, 2018
…le header and footer on every page (but caption only on first page).
danfickle added a commit that referenced this issue Dec 23, 2018
danfickle added a commit that referenced this issue Dec 27, 2018
… them) despite being in overflow hidden containers.

Also enable test demonstrating this scenario.
danfickle added a commit that referenced this issue Dec 27, 2018
danfickle added a commit that referenced this issue Dec 27, 2018
…overflow pages.

Plus enable test from previous commit with this scenario.
danfickle added a commit that referenced this issue Dec 31, 2018
… [ci skip]

Tests that a nested float in a fixed element renders correctly.

Appears to be an ordering issue of the layers as if header and footer swap element order everything works.
danfickle added a commit that referenced this issue Jan 1, 2019
Some boxes can be layed out many times (to satisfy page constraints for example). If this happens we just mark our old layer for deletion and create a new layer. Not sure this is right, but doesn't break any correct tests.

Yes, this is a particularly hackish solution. This fix also brought up the correct response for positioning-absolute test, so I altered the html to match the expected output. (I hadn't noticed the missing box when I committed the expected test result previously).
danfickle added a commit that referenced this issue Jan 3, 2019
…o the sum of its child boxes using border-box sizing.
danfickle added a commit that referenced this issue Mar 8, 2019
Also cleaned up ContentFunctionFactory class.
danfickle added a commit that referenced this issue Mar 8, 2019
With leader function, attr function, target-counter function and overflow page in the middle.
@danfickle
Owner

Yeah, RC18 is finally released with a usable fast renderer.

@Infinity821

Infinity821 commented Jun 9, 2020

I am now using version 1.0.2, but the PDF build still hangs.
The HTML is 13,241,929 bytes (about 13 MB).
I have tried many times and increased the heap size to 4G.
My machine is an i5 4460 with 16 GB RAM.

Attached is the test HTML:
test.txt

My code for PDF generation is as follows:

    public byte[] generateFromHtml(String html) throws Exception {
        try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFont(getFont(PMingLiU), "PMingLiU");
            builder.useFont(getFont(PMingLiUExtB), "PMingLiU-ExtB");
            builder.useFont(getFont(seguiemj), "Segoe UI Emoji");
            builder.withHtmlContent(html, null);
            builder.useFastMode();
            builder.toStream(byteArrayOutputStream);
            builder.run();
            return byteArrayOutputStream.toByteArray();
        }
    }
