This blog is mainly about Java...

Monday, August 2, 2010

Migrating from JODConverter 2 to JODConverter 3 and converting PDF to PDF/A

In the previous posting I showed you how you could automate conversions of documents to PDF & PDF/A using JODConverter 2.  

JODConverter 3.0.beta has been out for some time, and even though it is still in beta, it is very stable, perhaps even more stable than JODConverter 2.

In this blog posting I will highlight the benefits of JODConverter 3 compared to its predecessor and show you how you can modify your code to create PDF/A documents with JODConverter 3.  
To be able to convert an existing PDF document to PDF/A in OpenOffice.org, you will need to install the Sun PDF Import extension!

JODConverter 2 versus 3
JODConverter 3 still uses OpenOffice.org to perform its conversions, and it is still a wrapper around the OOo API. What has changed is the JODConverter core library, which has been completely rewritten and is much cleaner and easier to use.

What's new?
  • No more init script(!) 
    • You don't have to manually start OpenOffice.org as a service anymore. This is handled automatically.
    • You can even create multiple processes, which is useful for multi-core CPUs. Best practice is one process per CPU core.
  • Automatically restart an OOo instance if it crashes.
    • If for some reason your process crashes, JODConverter will detect this and restart the process automatically. This was a hassle with JODConverter 2, as you had to do this manually on Linux.
  • Abort conversions that take too long (according to a configurable timeout parameter)
  • Automatically restart an OOo instance after n conversions (workaround for OOo memory leaks)
Additionally, the new architecture will make it easier to use the core JODConverter classes as a generic framework for working with OOo, not just for document conversions.
I am sure there will be more features when JODConverter 3 goes final.

Configuration

All you need to do is point the OfficeManager to your OpenOffice.org installation, and you are good to go.

OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .buildOfficeManager();
officeManager.start(); // start() returns void, so it cannot be chained into the assignment

This manager will use the default settings for Task Queue Timeout, Task Execution Timeout, Port Number etc., but you can easily change them:

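OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .setTaskExecutionTimeout(240000L) //4 minutes, in milliseconds
        .setTaskQueueTimeout(60000L) //1 minute, in milliseconds
        .buildOfficeManager();
officeManager.start(); // start() returns void, so it cannot be chained into the assignment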
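
Both timeouts are plain millisecond values, which makes them easy to misread. A minimal sketch, using only the JDK's TimeUnit, that derives them from readable units. The class name Timeouts and the constant names are my own and not part of JODConverter.

```java
import java.util.concurrent.TimeUnit;

public class Timeouts {
    // Maximum time a single conversion may run: 4 minutes
    public static final long TASK_EXECUTION_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(4);
    // Maximum time a task may wait in the queue: 1 minute
    public static final long TASK_QUEUE_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

    public static void main(String[] args) {
        // prints "240000 60000"
        System.out.println(TASK_EXECUTION_TIMEOUT_MS + " " + TASK_QUEUE_TIMEOUT_MS);
    }
}
```

The constants can then be passed to setTaskExecutionTimeout(...) and setTaskQueueTimeout(...) in place of the literals.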

If you want to use pipes instead of sockets (the recommendation is one process per CPU core), you need to set a VM argument that points java.library.path to the location of $URE_LIB, which on my Ubuntu machine is /usr/lib/ure/lib/
For instance:
-Djava.library.path="/usr/lib/ure/lib"

Then you can change your OfficeManager:

OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .setConnectionProtocol(OfficeConnectionProtocol.PIPE)
        .setPipeNames("office1","office2") //two pipes, i.e. two OOo processes
        .setTaskExecutionTimeout(240000L) //4 minutes
        .setTaskQueueTimeout(60000L)  // 1 minute
        .buildOfficeManager();
officeManager.start(); // start() returns void, so it cannot be chained into the assignment
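
The pipe names above are hard-coded for two processes. If you want to follow the one-process-per-core advice on any machine, the names can be generated instead. A minimal sketch: the class PipeNames and its methods are hypothetical helpers of mine, and the "office" prefix simply mirrors the example above.

```java
import java.util.Arrays;

public class PipeNames {
    /** Builds the names "office1" .. "officeN" for the given number of processes. */
    public static String[] forProcesses(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        String[] names = new String[count];
        for (int i = 0; i < count; i++) {
            names[i] = "office" + (i + 1);
        }
        return names;
    }

    /** One pipe name, and therefore one OOo process, per core reported by the JVM. */
    public static String[] onePerCore() {
        return forProcesses(Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(forProcesses(2))); // prints [office1, office2]
    }
}
```

Since setPipeNames takes a variable number of Strings, the array can be passed directly: .setPipeNames(PipeNames.onePerCore()).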



ConverterService3Impl
The following code performs all the conversions. It supports a File or a byte[] as input.

This is how you use it:
Let's say you have a PDF file as a byte[], and you want to convert it to PDF/A, also as a byte[].
All you have to do is call this method:
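
The ConverterService interface that the class implements is not listed in this post. Judging from the two public methods of the implementation further down, it presumably looks roughly like the following. Treat this as an assumption, not as the original interface.

```java
import java.io.File;
import java.io.IOException;

/** Presumed shape of the interface, inferred from the public methods of ConverterService3Impl. */
public interface ConverterService {

    /** Converts inputFile to the format named by extension, e.g. ".pdf" or ".html". */
    File convert(File inputFile, String extension) throws IOException;

    /** Converts a PDF held in memory to PDF/A. */
    byte[] convertToPDFA(byte[] pdfByte) throws IOException;
}
```

The implementation also declares ConnectException, which is allowed here because it is a subclass of IOException.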



byte[] pdfa = converterService.convertToPDFA(pdfFile);

Similarly, if you have a document (say an OpenOffice.org Writer document) and you want to convert it to PDF, you would call the method:

File doc = new File("myDocument.odt");
File pdfDocument = converterService.convert(doc, ".pdf");

Note that PDF output is always PDF/A compliant. To convert to another format, for example HTML, all you need to do is change the extension from ".pdf" to ".html" and the converter will do the magic.
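
How the service decides which export filter to use can be sketched without OpenOffice.org at all. The class FilterChoice and its enum are hypothetical names of mine. The sketch is a simplification that assumes lower-case extensions with a leading dot, as in the examples above; the real service in the source below is stricter about case in places.

```java
import java.util.regex.Pattern;

public class FilterChoice {
    /** The three paths the service can take. */
    public enum Filter { DRAW_PDFA, WRITER_PDFA, DEFAULT }

    // Same pattern as the service: "pdf" with an optional leading character, case-insensitive
    private static final Pattern PDF_TARGET = Pattern.compile("^.?pdf$", Pattern.CASE_INSENSITIVE);

    public static Filter choose(String inputFileName, String targetExtension) {
        boolean toPdf = PDF_TARGET.matcher(targetExtension).matches();
        boolean fromPdf = inputFileName.endsWith(".pdf");
        if (!toPdf) {
            return Filter.DEFAULT;       // e.g. ".html": let JODConverter pick the format
        }
        // PDF input is opened as a drawing, so it needs the draw export filter
        return fromPdf ? Filter.DRAW_PDFA : Filter.WRITER_PDFA;
    }

    public static void main(String[] args) {
        System.out.println(choose("scan.pdf", ".pdf"));        // prints DRAW_PDFA
        System.out.println(choose("myDocument.odt", ".pdf"));  // prints WRITER_PDFA
        System.out.println(choose("myDocument.odt", ".html")); // prints DEFAULT
    }
}
```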


Here is the source. Please read the comments in the source code if you want to understand it, or just ask in the comment section below.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.ejb.Local;
import javax.ejb.Stateless;

import lombok.Cleanup;

import org.apache.commons.io.FilenameUtils;
import org.apache.commons.io.IOUtils;
import org.artofsolving.jodconverter.OfficeDocumentConverter;
import org.artofsolving.jodconverter.document.DefaultDocumentFormatRegistry;
import org.artofsolving.jodconverter.document.DocumentFamily;
import org.artofsolving.jodconverter.document.DocumentFormat;
import org.artofsolving.jodconverter.document.DocumentFormatRegistry;

/**
 * This service converts files from one format to another, e.g. ODT to PDF, DOC to ODT etc.
 * @author Shervin Asgari
 *
 */
@Stateless
@Local(ConverterService.class)
public class ConverterService3Impl implements ConverterService {

  private static final String PDF_EXTENSION = ".pdf";
  private static final String PDF = "pdf";  

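  // Uncomment these when we want to use them
  // Note: despite its name, the value 1 is what OOo's "SelectPdfVersion" filter option
  // expects for PDF/A-1 (0 is plain PDF 1.4)

  // private final int PDFXNONE = 0;
  private final int PDFX1A2001 = 1;
  // private final int PDFX32002 = 2;
  // private final int PDFA1A = 3;
  // private final int PDFA1B = 4; 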


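  @Logger //Your favourite logger (e.g. Log4J) could be injected here 
  private Log log;

  //The started OfficeManager from the Configuration section; how it gets here (injection, setter, ...) is up to you
  private org.artofsolving.jodconverter.office.OfficeManager officeManager;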

  public File convert(File inputFile, String extension) throws IOException, ConnectException {
    if (inputFile == null) {
      throw new IOException("The document to be converted is null");
    }

    Pattern p = Pattern.compile("^.?pdf$", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(extension);
    OfficeDocumentConverter converter;
    
    //If the input file is a PDF, the format registry must treat PDF as a DRAWING document
    if(FilenameUtils.isExtension(inputFile.getName(), PDF) && m.find()) {
      DocumentFormatRegistry formatRegistry = new DefaultDocumentFormatRegistry();
      formatRegistry.getFormatByExtension(PDF).setInputFamily(DocumentFamily.DRAWING);
      converter = new OfficeDocumentConverter(officeManager, formatRegistry);
    } else {
      converter = new OfficeDocumentConverter(officeManager);
    }
    
    String inputExtension = FilenameUtils.getExtension(inputFile.getName());
    File outputFile = File.createTempFile(FilenameUtils.getBaseName(inputFile.getName()), extension);

    try {
      long startTime = System.currentTimeMillis();
      //If both input and output files are PDF
      if (FilenameUtils.isExtension(inputFile.getName(), PDF) && m.matches()) {
        //We need to add the DocumentFormat with DRAW
        converter.convert(inputFile, outputFile, toFormatPDFA_DRAW());
      } else if(FilenameUtils.isExtension(outputFile.getName(), PDF)) {
        converter.convert(inputFile, outputFile, toFormatPDFA());
      } else {
        converter.convert(inputFile, outputFile);
      }
      long conversionTime = System.currentTimeMillis() - startTime;
      log.info(String.format("successful conversion: %s [%db] to %s in %dms", inputExtension, inputFile.length(), extension, conversionTime));

      return outputFile;
    } catch (Exception exception) {
      log.error(String.format("failed conversion: %s [%db] to %s; %s; input file: %s", inputExtension, inputFile.length(), extension, exception, inputFile.getName()));
      //Keep the original exception as the cause so its stack trace is not lost
      throw new IOException("Converting failed", exception);
    } finally {
      //outputFile.deleteOnExit();
      //inputFile.deleteOnExit();
    }
  }
  
  /**
   * Converts a PDF file to PDF/A.
   * You will need to install the OpenOffice.org PDF Import extension to get it working.
   * @param pdfByte the PDF document as a byte array
   * @return the PDF/A document as a byte array
   * @throws IOException if the conversion fails
   */
  public byte[] convertToPDFA(byte[] pdfByte) throws IOException, ConnectException {
    @Cleanup InputStream is = new ByteArrayInputStream(pdfByte);
    File pdf = createFile(is, PDF_EXTENSION);
    log.debug("PDF is: #0 #1", pdf.getName(), pdf.isFile());
    return convert(pdf);
  }

  
  private byte[] convert(File pdf) throws IOException {
    if (pdf == null) {
      throw new IOException("The document to be converted is null");
    }

    File convertedPdfA = convert(pdf, PDF_EXTENSION);
    @Cleanup final InputStream inputStream = new BufferedInputStream(new FileInputStream(convertedPdfA));
    byte[] pdfa = IOUtils.toByteArray(inputStream);
    return pdfa;
  }

  /**
   * Creates a temp file and writes the content of InputStream to it. doesn't close input
   * 
   * @return File
   */
  private java.io.File createFile(InputStream in, String extension) throws IOException {
    java.io.File f = File.createTempFile("tmpFile", extension);
    @Cleanup BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(f));
    IOUtils.copy(in, out);
    return f;
  }
  
  /**
   * This DocumentFormat must be used when converting from document (not pdf) to pdf/a
   * For some reason "PDF/A-1" is called "SelectPdfVersion" internally; maybe they plan to add other PdfVersions later.
   */
  private DocumentFormat toFormatPDFA() {
    DocumentFormat format = new DocumentFormat("PDF/A", PDF, "application/pdf");
    Map<String, Object> properties = new HashMap<String, Object>();
    properties.put("FilterName", "writer_pdf_Export");

    Map<String, Object> filterData = new HashMap<String, Object>();
    filterData.put("SelectPdfVersion", this.PDFX1A2001);
    properties.put("FilterData", filterData);

    format.setStoreProperties(DocumentFamily.TEXT, properties);

    return format;
  }
  
  /**
   * This DocumentFormat must be used when converting from pdf to pdf/a
   * For some reason "PDF/A-1" is called "SelectPdfVersion" internally; maybe they plan to add other PdfVersions later.
   */
  private DocumentFormat toFormatPDFA_DRAW() {
    DocumentFormat format = new DocumentFormat("PDF/A", PDF, "application/pdf");
    Map<String, Object> properties = new HashMap<String, Object>();
    properties.put("FilterName", "draw_pdf_Export");

    Map<String, Object> filterData = new HashMap<String, Object>();
    filterData.put("SelectPdfVersion", this.PDFX1A2001);
    properties.put("FilterData", filterData);

    format.setStoreProperties(DocumentFamily.DRAWING, properties);

    return format;
  }

}
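
One thing the service above leaves open is cleanup: every conversion creates temp files, and the deleteOnExit() calls are commented out. A minimal sketch of one way to handle the byte[] path, assuming Java 7 or later for java.nio.file.Files. The class TempFiles and the method readAndDelete are hypothetical names, not part of the original service.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class TempFiles {
    /** Reads the whole file into memory, then deletes it, even if reading fails. */
    public static byte[] readAndDelete(File file) throws IOException {
        try {
            return Files.readAllBytes(file.toPath());
        } finally {
            Files.deleteIfExists(file.toPath());
        }
    }
}
```

In the private convert(File pdf) method, this could replace the stream handling for the converted file. The input temp file written by createFile would need the same treatment once the conversion is done.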


Remember to stop the OfficeManager (officeManager.stop()) when your application quits or shuts down, so that the OpenOffice.org processes it started are closed as well.
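
For a standalone application, one way to make sure this happens is a JVM shutdown hook. A minimal sketch: the Stoppable interface is a stand-in of mine for the stop() method of JODConverter's OfficeManager, so that the example runs without OpenOffice.org installed. In an EJB container, a @PreDestroy method on a suitable bean is probably the more natural place.

```java
public class ShutdownSketch {
    /** Hypothetical stand-in for the stop() call of JODConverter's OfficeManager. */
    public interface Stoppable {
        void stop();
    }

    /** Registers a JVM shutdown hook that stops the manager, and returns the hook thread. */
    public static Thread stopOnShutdown(final Stoppable manager) {
        Thread hook = new Thread(new Runnable() {
            public void run() {
                manager.stop();
            }
        }, "office-manager-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

With the real library you would wrap the manager, for example with an anonymous Stoppable whose stop() method calls officeManager.stop().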
