
Wednesday, May 12, 2010

Automate converting of documents to PDF & PDF/A using JODConverter 2

In this blog post I will show how to convert existing documents to PDF/A using an open source library called JODConverter.
Note that nothing prevents you from converting to regular PDF as well.

JODConverter leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today. It therefore requires an OpenOffice.org installation, and it supports every document format that OpenOffice.org supports.

JODConverter automates all conversions supported by OpenOffice.org, including:
  • Microsoft Office to OpenDocument, and vice versa
    • Word to OpenDocument Text (odt); OpenDocument Text (odt) to Word
    • Excel to OpenDocument Spreadsheet (ods); OpenDocument Spreadsheet (ods) to Excel
    • PowerPoint to OpenDocument Presentation (odp); OpenDocument Presentation (odp) to PowerPoint
  • Any format to PDF
    • OpenDocument (Text, Spreadsheet, Presentation) to PDF
    • Word to PDF; Excel to PDF; PowerPoint to PDF
    • RTF to PDF; WordPerfect to PDF; ...
  • And more
    • OpenDocument Presentation (odp) to Flash; PowerPoint to Flash
    • RTF to OpenDocument; WordPerfect to OpenDocument
    • Any format to HTML (with limitations)
    • Support for OpenOffice.org 1.0 and old StarOffice formats
    • ...
JODConverter can be used in many different ways:
  • As a Java library, embedded in your own Java application
  • As a command line tool, possibly invoked from your own scripts
  • As a simple web application: upload your input document, select the desired format and download the converted version
  • As a web service, invoked from your own application written in your favourite language (.NET, PHP, Python, Ruby, ...)
JODConverter is open source software released under the terms of the LGPL and can be downloaded from SourceForge.net.

Starting OpenOffice as a service

JODConverter needs to connect to a running OpenOffice.org instance in order to perform the document conversions. This is different from starting the OpenOffice.org program as you would normally do: OpenOffice.org can be configured to run as a service and listen for commands on a TCP port. One way of doing this on Linux is to run the following command (you only need to change the location of soffice to match your installation):
/usr/bin/soffice "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &
I suggest putting this command in a script under /etc/init.d/ so that it runs automatically at boot.
Note that you cannot open the regular OpenOffice.org user interface while this service is running in headless mode.
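
If you would rather start the service from your Java application than from an init script, the same command line can be assembled with ProcessBuilder. This is only a minimal sketch: the class name SofficeCommand and its methods are made up for illustration, and the path and port are whatever applies on your machine.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles the headless soffice command shown above. */
final class SofficeCommand {

    /**
     * @param sofficePath location of the soffice binary, e.g. /usr/bin/soffice
     * @param port        TCP port the service should listen on
     */
    static List<String> build(String sofficePath, int port) {
        List<String> command = new ArrayList<String>();
        command.add(sofficePath);
        // ProcessBuilder passes each argument verbatim, so no shell quoting is needed here.
        command.add("-accept=socket,host=localhost,port=" + port + ";urp;StarOffice.ServiceManager");
        command.add("-norestore");
        command.add("-nofirststartwizard");
        command.add("-nologo");
        command.add("-headless");
        return command;
    }

    /** Starts the process; the caller is responsible for destroying it on shutdown. */
    static Process start(String sofficePath, int port) throws IOException {
        // inheritIO() forwards the child's output to this process's console.
        return new ProcessBuilder(build(sofficePath, port)).inheritIO().start();
    }
}
```

Calling SofficeCommand.start("/usr/bin/soffice", 8100) would launch the listener on port 8100. Keep the returned Process around so you can call destroy() on it when your application exits.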
If you are running your system on Windows, you will need to set OpenOffice.org up as a Windows service instead.
See the Uno/FAQ on the OpenOffice.org Wiki for more on this topic.
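
Before handing documents to JODConverter it can be useful to verify that something is actually listening on the port. The following is a small sketch using only the standard library; the class name PortCheck is hypothetical, and a successful TCP connect only tells you that the port is open, not that the process behind it is OpenOffice.org.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
final class PortCheck {

    static boolean isListening(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            // Succeeds only if a process accepts the connection within the timeout.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful can be done if closing fails.
            }
        }
    }
}
```

For the setup described above you would call PortCheck.isListening("localhost", 8100, 1000) and only attempt a conversion when it returns true.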

Command-Line Tool (cli like)

You can run JODConverter as a command line (CLI) program.
To use it this way, download the 2.2.2 distribution, unpack it, and run it using Java. The jar file name in the commands below must match the version you actually downloaded.
To convert a single file, specify the input and output files as parameters:
java -jar lib/jodconverter-cli-2.2.0.jar document.doc document.pdf
To convert multiple files to a given format, specify the format using the -f (or --output-format) option and then pass the input files as parameters:
java -jar lib/jodconverter-cli-2.2.0.jar -f pdf *.odt
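
In batch mode you do not name the output files yourself. If you want to do the same kind of batch conversion from your own code, you need a rule for deriving each output name from its input name. The sketch below swaps the file extension for the target format; it is my own illustration (the class OutputNames is hypothetical), and it is an assumption that this matches what the command line tool does internally.

```java
/** Hypothetical helper: derives an output file name by swapping the extension. */
final class OutputNames {

    static String withExtension(String inputName, String targetExtension) {
        int dot = inputName.lastIndexOf('.');
        int slash = Math.max(inputName.lastIndexOf('/'), inputName.lastIndexOf('\\'));
        // Only treat the dot as an extension separator if it belongs to the
        // file name itself (not a directory) and is not the first character.
        String base = (dot > slash + 1) ? inputName.substring(0, dot) : inputName;
        return base + "." + targetExtension;
    }
}
```

With this rule, OutputNames.withExtension("report.odt", "pdf") gives "report.pdf", and a name without an extension simply gets ".pdf" appended.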

Usage in your Java applications
Using JODConverter in your own Java application is very easy. The following example shows the skeleton code required to perform a one-off conversion from a Word document to PDF:
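import java.io.File;
import com.artofsolving.jodconverter.DocumentConverter;
import com.artofsolving.jodconverter.openoffice.connection.OpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection;
import com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter;

File inputFile = new File("document.doc");
File outputFile = new File("document.pdf");

// connect to an OpenOffice.org instance running on port 8100
OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
connection.connect();
try {
    // convert; the target format is derived from the output file extension
    DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
    converter.convert(inputFile, outputFile);
} finally {
    // close the connection even if the conversion fails
    connection.disconnect();
}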

To convert the same document to PDF/A instead, you need to create a custom DocumentFormat that is configured for PDF/A and pass it to the convert method like this:
/**
 * Returns a DocumentFormat that makes OpenOffice.org export PDF/A.
 */
private DocumentFormat toDocumentFormatPDFA() {
    // Candidate values for "SelectPdfVersion". OpenOffice.org documents 0 as
    // regular PDF 1.4 and 1 as PDF/A-1, which is the value used below.
    final int PDFXNONE = 0;
    final int PDFX1A2001 = 1;
    final int PDFX32002 = 2;
    final int PDFA1A = 3;
    final int PDFA1B = 4;
    // create a PDF DocumentFormat (as normally configured in document-formats.xml):
    // display name, MIME type and file extension
    DocumentFormat customPdfFormat = new DocumentFormat("Portable Document Format", "application/pdf", "pdf");

    // now set our custom options; this only covers text documents (Writer)
    customPdfFormat.setExportFilter(DocumentFamily.TEXT, "writer_pdf_Export");
    /*
     * For some reason "PDF/A-1" is called "SelectPdfVersion" internally; maybe they plan to add other
     * PdfVersions later.
     */
    final Map<String, Integer> pdfOptions = new HashMap<String, Integer>();
    pdfOptions.put("SelectPdfVersion", PDFX1A2001);
    customPdfFormat.setExportOption(DocumentFamily.TEXT, "FilterData", pdfOptions);
    return customPdfFormat;
}
And then you call the convert method with toDocumentFormatPDFA() as parameter.
converter.convert(inputFile, outputFile, toDocumentFormatPDFA());
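
The method above only registers an export filter for text documents. If you also want PDF/A from spreadsheets or presentations, each document family needs its own OpenOffice.org export filter name. The sketch below collects those names and the FilterData map in one place. The class PdfExportFilters and its Family enum are stand-ins I made up so the example compiles without JODConverter on the classpath; in real code you would use JODConverter's DocumentFamily and call setExportFilter and setExportOption once per family.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup of OpenOffice.org PDF export filters per document family. */
final class PdfExportFilters {

    /** Local stand-in for JODConverter's DocumentFamily. */
    enum Family { TEXT, SPREADSHEET, PRESENTATION, DRAWING }

    static final int PDF_1_4 = 0; // regular PDF
    static final int PDF_A_1 = 1; // PDF/A-1

    /** Export filter name for each family (Writer, Calc, Impress, Draw). */
    static Map<Family, String> byFamily() {
        Map<Family, String> filters = new EnumMap<Family, String>(Family.class);
        filters.put(Family.TEXT, "writer_pdf_Export");
        filters.put(Family.SPREADSHEET, "calc_pdf_Export");
        filters.put(Family.PRESENTATION, "impress_pdf_Export");
        filters.put(Family.DRAWING, "draw_pdf_Export");
        return filters;
    }

    /** Builds the "FilterData" options for either regular PDF or PDF/A-1. */
    static Map<String, Integer> filterData(boolean pdfA) {
        Map<String, Integer> options = new HashMap<String, Integer>();
        options.put("SelectPdfVersion", pdfA ? PDF_A_1 : PDF_1_4);
        return options;
    }
}
```

Looping over byFamily() and applying filterData(true) to every entry would give you one DocumentFormat that produces PDF/A regardless of the input type. I have not tested every family, so treat it as a starting point.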

Note that this is a very simple example. I do not recommend opening and closing the connection for each conversion. Instead, open it once when the application starts (or the first time you need to convert), and close it when the application shuts down.
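
One way to get this open-once, close-once behaviour is a small holder that creates the connection lazily and releases it on shutdown. The sketch below is generic so that it compiles without JODConverter; the class SharedResource is hypothetical, and the opener and closer you pass in would wrap the connect() and disconnect() calls from the example above.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical holder: opens a resource on first use and closes it once at shutdown. */
final class SharedResource<T> {

    private final Supplier<T> opener;
    private final Consumer<T> closer;
    private T resource;

    SharedResource(Supplier<T> opener, Consumer<T> closer) {
        this.opener = opener;
        this.closer = closer;
    }

    /** Returns the shared instance, opening it the first time it is requested. */
    synchronized T get() {
        if (resource == null) {
            resource = opener.get();
        }
        return resource;
    }

    /** Closes the resource if it was opened; calling this again is harmless. */
    synchronized void shutdown() {
        if (resource != null) {
            closer.accept(resource);
            resource = null;
        }
    }
}
```

In an application you would create one SharedResource for the OpenOffice connection, call get() wherever you convert, and call shutdown() from your shutdown hook or container lifecycle callback. Since connect() declares a checked exception, the opener has to catch it and rethrow it as an unchecked one.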

3 comments:

Zarfishan Zahid said...

Use Aspose.PDF for Java for converting your MS Office document to pdf format its a very secure API available online that uses cloud technology which is growing really fast these days.

Kamal C said...

Is there a way to convert tif to pdf file respectively from JODConverter 2.2.2.

Shervin Asgari said...

I don't think so.
If it is possible directly from OpenOffice.org writer, then theoretically it should work from JODConverter 2.
However, I am not sure it will work out of the box.
