This blog is mainly about Java...

Showing posts with label PDF. Show all posts

Monday, August 2, 2010

Migrating from JODConverter 2 to JODConverter 3 and converting PDF to PDF/A

In the previous posting I showed you how you could automate conversions of documents to PDF & PDF/A using JODConverter 2.  

JODConverter 3.0.beta has been out for some time, and even though it is still in beta, it is very stable, maybe even more stable than JODConverter 2.

In this blog posting I will highlight the benefits of JODConverter 3 compared to its predecessor and show you how you can modify your code to create PDF/A documents with JODConverter 3.  
To be able to convert an existing PDF document to PDF/A in OpenOffice.org, you will need to install the Sun PDF Import extension.

JODConverter 2 versus 3
JODConverter 3 still uses OpenOffice.org to perform its conversions and is still a wrapper around the OOo API. What has changed is the JODConverter core library, which has been completely rewritten and is much cleaner and easier to use.

What's new?
  • No more init script(!) 
    • You don't have to manually start OpenOffice.org as a service anymore. This is handled automatically.
    • You can even create multiple processes, which is useful for multi-core CPUs. Best practice is one process for each CPU core.
  • Automatically restart an OOo instance if it crashes.
    • If for some reason your process crashes, JODConverter will detect this and restart the process automatically. This was a hassle with JODConverter 2, as you needed to do this manually on Linux.
  • Abort conversions that take too long (according to a configurable timeout parameter)
  • Automatically restart an OOo instance after n conversions (workaround for OOo memory leaks)
Additionally the new architecture will make it easier to use the core JODConverter classes as a generic framework for working with OOo - not just limited to document conversions.
I am sure there will be more features when JODConverter 3 goes final.
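
To give a feeling for what "abort conversions that take too long" means in practice, here is a minimal sketch of the general idea using only java.util.concurrent. This is not how JODConverter implements its task execution timeout; the class and method names (TimeoutSketch, runWithTimeout) are made up for this illustration.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of JODConverter
class TimeoutSketch {

    /**
     * Runs the task on a worker thread and gives up after timeoutMillis.
     * Returns an empty Optional when the task timed out, failed or returned null.
     */
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is taking too long
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw an exception
        } finally {
            executor.shutdownNow(); // never leave the worker thread behind
        }
    }
}
```

With JODConverter 3 you do not have to write anything like this yourself; the timeout is just a configuration parameter, as shown in the Configuration section.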

Configuration

All you need to do is point the OfficeManager to your OpenOffice.org installation, and you are good to go.

// start() returns void, so it cannot be chained onto buildOfficeManager()
OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .buildOfficeManager();
officeManager.start();

This manager will use the default settings for task queue timeout, task execution timeout, port number etc., but you can easily change them:

OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .setTaskExecutionTimeout(240000L)
        .setTaskQueueTimeout(60000L)
        .buildOfficeManager();
officeManager.start();

If you want to use pipes instead of TCP sockets, you will need to set a VM argument that points java.library.path to the location of $URE_LIB, which on my Ubuntu machine is /usr/lib/ure/lib/. (With several processes, the recommendation is one process per CPU core.)
For instance:
-Djava.library.path="/usr/lib/ure/lib"

And then you can change your OfficeManager:

OfficeManager officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("/usr/lib/openoffice")
        .setConnectionProtocol(OfficeConnectionProtocol.PIPE)
        .setPipeNames("office1","office2") //two pipes
        .setTaskExecutionTimeout(240000L) //4 minutes
        .setTaskQueueTimeout(60000L)  // 1 minute
        .buildOfficeManager();
officeManager.start();
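
If you want to follow the one-process-per-core recommendation without hard coding the pipe names, something like the following sketch could generate them. The class and method names (PipeNamesSketch, pipeNames) and the "office" prefix are just my choices for this example; only Runtime.availableProcessors() comes from the JDK.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JODConverter
class PipeNamesSketch {

    /** Builds "office1" .. "officeN" for the given number of processes. */
    static String[] pipeNames(int processCount) {
        if (processCount < 1) {
            throw new IllegalArgumentException("processCount must be at least 1");
        }
        String[] names = new String[processCount];
        for (int i = 0; i < processCount; i++) {
            names[i] = "office" + (i + 1);
        }
        return names;
    }

    public static void main(String[] args) {
        // One pipe (and therefore one office process) per CPU core
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(Arrays.toString(pipeNames(cores)));
    }
}
```

The resulting array could then be passed to setPipeNames(...) in place of the two hard coded names.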



ConverterService3Impl
The following code performs all the conversions. It supports a File or a byte[] as input.

This is how you use it:
Let's say you have a PDF file as a byte[], and you want to convert it to PDF/A, also as a byte[].
All you have to do is call this method:



byte[] pdfa = converterService.convertToPDFA(pdfFile);

Similarly, if you have a document (say an OpenOffice.org Writer document) and you want to convert it to PDF, you would call this method:

File doc = new File("myDocument.odt");
File pdfDocument = converterService.convert(doc, ".pdf");

Note that when the target is PDF you will always get a PDF/A-compliant file. To convert to another format instead, all you need to do is change the extension, for example from ".pdf" to ".html", and the converter does the magic.


Here is the source. Please read the comments in the source code if you want to understand it, or just ask in the comment section below.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.ejb.Local;
import javax.ejb.Stateless;

import lombok.Cleanup;

import org.apache.commons.io.FilenameUtils;
import org.apache.commons.io.IOUtils;
import org.artofsolving.jodconverter.OfficeDocumentConverter;
import org.artofsolving.jodconverter.document.DefaultDocumentFormatRegistry;
import org.artofsolving.jodconverter.document.DocumentFamily;
import org.artofsolving.jodconverter.document.DocumentFormat;
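import org.artofsolving.jodconverter.document.DocumentFormatRegistry;
import org.artofsolving.jodconverter.office.OfficeManager;

/**
 * This service converts files from one format to another, e.g. ODT to PDF, DOC to ODT etc.
 * @author Shervin Asgari
 *
 */
@Stateless
@Local(ConverterService.class)
public class ConverterService3Impl implements ConverterService {

  private static final String PDF_EXTENSION = ".pdf";
  private static final String PDF = "pdf";  

  // Values for the "SelectPdfVersion" filter option. OpenOffice.org documents
  // 0 as PDF 1.4 and 1 as PDF/A-1, so 1 is the value we need for PDF/A.
  // Uncomment the others when we want to use them

  // private final int PDFXNONE = 0;
  private final int PDFX1A2001 = 1;
  // private final int PDFX32002 = 2;
  // private final int PDFA1A = 3;
  // private final int PDFA1B = 4; 


  @Logger //Your favourite logger (e.g. Log4J) could be injected here 
  private Log log;

  // The OfficeManager from the Configuration section, built and started elsewhere;
  // how it reaches this class (injection, lookup, ...) depends on your application
  private OfficeManager officeManager;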

  public File convert(File inputFile, String extension) throws IOException, ConnectException {
    if (inputFile == null) {
      throw new IOException("The document to be converted is null");
    }

    Pattern p = Pattern.compile("^.?pdf$", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(extension);
    OfficeDocumentConverter converter;
    
    //If the input file is a PDF, the format registry must treat PDF as a DRAWING input
    if(FilenameUtils.isExtension(inputFile.getName(), PDF) && m.find()) {
      DocumentFormatRegistry formatRegistry = new DefaultDocumentFormatRegistry();
      formatRegistry.getFormatByExtension(PDF).setInputFamily(DocumentFamily.DRAWING);
      converter = new OfficeDocumentConverter(officeManager, formatRegistry);
    } else {
      converter = new OfficeDocumentConverter(officeManager);
    }
    
    String inputExtension = FilenameUtils.getExtension(inputFile.getName());
    File outputFile = File.createTempFile(FilenameUtils.getBaseName(inputFile.getName()), extension);

    try {
      long startTime = System.currentTimeMillis();
      //If both input and output file is PDF
      if (FilenameUtils.isExtension(inputFile.getName(), PDF) && m.matches()) {
        //We need to add the DocumentFormat with DRAW
        converter.convert(inputFile, outputFile, toFormatPDFA_DRAW());
      } else if(FilenameUtils.isExtension(outputFile.getName(), PDF)) {
        converter.convert(inputFile, outputFile, toFormatPDFA());
      } else {
        converter.convert(inputFile, outputFile);
      }
      long conversionTime = System.currentTimeMillis() - startTime;
      log.info(String.format("successful conversion: %s [%db] to %s in %dms", inputExtension, inputFile.length(), extension, conversionTime));

      return outputFile;
    } catch (Exception exception) {
      log.error(String.format("failed conversion: %s [%db] to %s; %s; input file: %s", inputExtension, inputFile.length(), extension, exception, inputFile.getName()));
      // Keep the original exception as the cause so the stack trace is not lost
      throw new IOException("Converting failed", exception);
    } finally {
      //outputFile.deleteOnExit();
      //inputFile.deleteOnExit();
    }
  }
  
  /**
   * Convert pdf file to pdf/a
   * You will need to install the OpenOffice.org PDF import extension to get it working
   * @param pdfByte the PDF document as a byte array
   * @return the PDF/A document as a byte array
   * @throws IOException
   */
  public byte[] convertToPDFA(byte[] pdfByte) throws IOException, ConnectException {
    @Cleanup InputStream is = new ByteArrayInputStream(pdfByte);
    File pdf = createFile(is, PDF_EXTENSION);
    log.debug("PDF is: #0 #1", pdf.getName(), pdf.isFile());
    return convert(pdf);
  }

  
  private byte[] convert(File pdf) throws IOException {
    if (pdf == null) {
      throw new IOException("The document to be converted is null");
    }

    File convertedPdfA = convert(pdf, PDF_EXTENSION);
    @Cleanup final InputStream inputStream = new BufferedInputStream(new FileInputStream(convertedPdfA));
    byte[] pdfa = IOUtils.toByteArray(inputStream);
    return pdfa;
  }

  /**
   * Creates a temp file and writes the content of InputStream to it. doesn't close input
   * 
   * @return File
   */
  private java.io.File createFile(InputStream in, String extension) throws IOException {
    java.io.File f = File.createTempFile("tmpFile", extension);
    @Cleanup BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(f));
    IOUtils.copy(in, out);
    return f;
  }
  
  /**
   * This DocumentFormat must be used when converting from document (not pdf) to pdf/a
   * For some reason "PDF/A-1" is called "SelectPdfVersion" internally; maybe they plan to add other PdfVersions later.
   */
  private DocumentFormat toFormatPDFA() {
    DocumentFormat format = new DocumentFormat("PDF/A", PDF, "application/pdf");
    Map<String, Object> properties = new HashMap<String, Object>();
    properties.put("FilterName", "writer_pdf_Export");

    Map<String, Object> filterData = new HashMap<String, Object>();
    filterData.put("SelectPdfVersion", this.PDFX1A2001);
    properties.put("FilterData", filterData);

    format.setStoreProperties(DocumentFamily.TEXT, properties);

    return format;
  }
  
  /**
   * This DocumentFormat must be used when converting from pdf to pdf/a
   * For some reason "PDF/A-1" is called "SelectPdfVersion" internally; maybe they plan to add other PdfVersions later.
   */
  private DocumentFormat toFormatPDFA_DRAW() {
    DocumentFormat format = new DocumentFormat("PDF/A", PDF, "application/pdf");
    Map<String, Object> properties = new HashMap<String, Object>();
    properties.put("FilterName", "draw_pdf_Export");

    Map<String, Object> filterData = new HashMap<String, Object>();
    filterData.put("SelectPdfVersion", this.PDFX1A2001);
    properties.put("FilterData", filterData);

    format.setStoreProperties(DocumentFamily.DRAWING, properties);

    return format;
  }

}
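
The convert method decides whether the target is PDF with the pattern "^.?pdf$". If you are unsure what that pattern accepts, here is a small standalone sketch you can run to try it out. The class name PdfExtensionSketch and the sample extensions are only for illustration.

```java
import java.util.regex.Pattern;

// Standalone illustration of the extension check used in convert()
class PdfExtensionSketch {

    // Same pattern as in convert(): one optional leading character, then "pdf"
    private static final Pattern PDF_PATTERN = Pattern.compile("^.?pdf$", Pattern.CASE_INSENSITIVE);

    static boolean isPdfExtension(String extension) {
        return PDF_PATTERN.matcher(extension).matches();
    }

    public static void main(String[] args) {
        for (String ext : new String[] {".pdf", "pdf", ".PDF", ".html", "xpdf"}) {
            System.out.println(ext + " -> " + isPdfExtension(ext));
        }
    }
}
```

Both ".pdf" and "pdf" are accepted, in any case. Because the leading character can be anything, "xpdf" is accepted as well, so a stricter pattern such as "^\\.?pdf$" might be worth considering.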


Remember to stop the OfficeManager (officeManager.stop()) when your application quits or shuts down.
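
One way to make sure this happens in a standalone application is a JVM shutdown hook. The sketch below uses a tiny stand-in interface (Stoppable) instead of the real OfficeManager so that it runs without JODConverter on the classpath; the names ShutdownSketch, Stoppable and stopOnShutdown are assumptions of mine, not part of the library.

```java
// Hypothetical helper, not part of JODConverter
class ShutdownSketch {

    /** Stand-in for the single OfficeManager method that matters here. */
    interface Stoppable {
        void stop();
    }

    /** Registers a JVM shutdown hook that stops the manager, and returns the hook thread. */
    static Thread stopOnShutdown(Stoppable manager) {
        Thread hook = new Thread(manager::stop, "office-manager-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

With the real library you would probably call it with something like stopOnShutdown(officeManager::stop). Inside an application server, a lifecycle callback such as @PreDestroy is likely a better fit than a shutdown hook.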

Wednesday, May 12, 2010

Automate converting of documents to PDF & PDF/A using JODConverter 2

In this blog post I will show you a great open source library called JODConverter, which converts existing documents to PDF/A.
Note that nothing prevents you from converting to normal PDF instead.

JODConverter leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today. Thus, it requires an installation of OpenOffice and it supports all documents which OpenOffice supports.

JODConverter automates all conversions supported by OpenOffice.org, including
  • Microsoft Office to OpenDocument, and vice versa
    • Word to OpenDocument Text (odt); OpenDocument Text (odt) to Word
    • Excel to OpenDocument Spreadsheet (ods); OpenDocument Spreadsheet (ods) to Excel
    • PowerPoint to OpenDocument Presentation (odp); OpenDocument Presentation (odp) to PowerPoint
  • Any format to PDF
    • OpenDocument (Text, Spreadsheet, Presentation) to PDF
    • Word to PDF; Excel to PDF; PowerPoint to PDF
    • RTF to PDF; WordPerfect to PDF; ...
  • And more
    • OpenDocument Presentation (odp) to Flash; PowerPoint to Flash
    • RTF to OpenDocument; WordPerfect to OpenDocument
    • Any format to HTML (with limitations)
    • Support for OpenOffice.org 1.0 and old StarOffice formats
    • ...
JODConverter can be used in many different ways
  • As a Java library, embedded in your own Java application
  • As a command line tool, possibly invoked from your own scripts
  • As a simple web application: upload your input document, select the desired format and download the converted version
  • As a web service, invoked from your own application written in your favourite language (.NET, PHP, Python, Ruby, ...)
JODConverter is open source software released under the terms of the LGPL and can be downloaded from SourceForge.net.

Starting OpenOffice as a service

JODConverter needs to connect to a running OpenOffice.org instance in order to perform the document conversions. This is different from starting the OpenOffice.org program as you would normally do. OpenOffice.org can be configured to run as a service and listen for commands on a TCP port. One way of doing this is to run the following command on Linux (you only need to change the location of soffice):
/usr/bin/soffice "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo -headless &
I suggest putting this command in a script in /etc/init.d/ so it will run automatically.
Note that you cannot open OpenOffice.org normally while this service is running in headless mode.
If you are running your system on Windows, you can read here for information on how to create a service on Windows.
See the Uno/FAQ on the OpenOffice.org Wiki for more on this topic.
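
Before trying to convert anything, it can be handy to check that the service really is listening. The following is a minimal sketch that uses only java.net; the class and method names (PortCheckSketch, isListening) and the one second timeout are my own choices. It only tells you that something accepts connections on the port, not that it is actually OpenOffice.org.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of JODConverter
class PortCheckSketch {

    /** Returns true if something accepts TCP connections on host:port within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out or unreachable
        }
    }

    public static void main(String[] args) {
        // 8100 is the port used in the soffice command above
        System.out.println("Listening on 8100: " + isListening("localhost", 8100, 1000));
    }
}
```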

Command-Line Tool (cli like)

You can run JODConverter as a CLI (command line interface) program.
To use it as a command line tool, you need to download the 2.2.2 distribution, unpack it, and run it using Java. The jar file name in the commands below has to match the version you downloaded.
To convert a single file, specify the input and output files as parameters:
java -jar lib/jodconverter-cli-2.2.0.jar document.doc document.pdf
To convert multiple files to a given format, specify the format using the -f (or --output-format) option and then pass the input files as parameters:
java -jar lib/jodconverter-cli-2.2.0.jar -f pdf *.odt

Usage in your Java applications
Using JODConverter in your own Java application is very easy. The following example shows the skeleton code required to perform a one-off conversion from a Word document to PDF:
File inputFile = new File("document.doc");
File outputFile = new File("document.pdf");
 
// connect to an OpenOffice.org instance running on port 8100
OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
connection.connect();
 
// convert
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
converter.convert(inputFile, outputFile);
 
// close the connection
connection.disconnect();

To convert the same document to PDF/A instead, you need to create a custom DocumentFormat for PDF/A and then pass it to the convert method like this:
/**
 * Returns DocumentFormat of PDF/A
 */
private DocumentFormat toDocumentFormatPDFA() {
    // Values for "SelectPdfVersion": OpenOffice.org documents 0 as PDF 1.4 and 1 as PDF/A-1
    final int PDFXNONE = 0;
    final int PDFX1A2001 = 1;
    final int PDFX32002 = 2;
    final int PDFA1A = 3;
    final int PDFA1B = 4;
    // create a PDF DocumentFormat (as normally configured in document-formats.xml);
    // the arguments are name, MIME type and file extension
    DocumentFormat customPdfFormat = new DocumentFormat("Portable Document Format", "application/pdf", "pdf");

    //now set our custom options
    customPdfFormat.setExportFilter(DocumentFamily.TEXT, "writer_pdf_Export");
    /*
     * For some reason "PDF/A-1" is called "SelectPdfVersion" internally; maybe they plan to add other
     * PdfVersions later.
     */
    final Map<String, Integer> pdfOptions = new HashMap<String, Integer>();
    pdfOptions.put("SelectPdfVersion", PDFX1A2001);
    customPdfFormat.setExportOption(DocumentFamily.TEXT, "FilterData", pdfOptions);
    return customPdfFormat;
}
And then you call the convert method with toDocumentFormatPDFA() as parameter.
converter.convert(inputFile, outputFile, toDocumentFormatPDFA());

Note that this is a very simple example. I do not recommend opening and closing the connection for each conversion. Open it once when the application starts (or the first time you want to convert), and close it when the application shuts down.

Wednesday, December 3, 2008

Dynamically generate ODT and PDF documents from Java

I wanted to generate an OpenOffice document and a PDF document from Java without having a running OpenOffice service. This was not as easy as it sounds, but I have found a solution.

You can create an ODT document fairly easily without a running instance of OpenOffice. Converting that document to PDF is harder, but I found a solution for that as well when running Linux.

All the following libraries are open source.

The easiest way to generate ODT documents from templates is by using an (unmaintained) library by the name of JOOReports. JOOReports uses Freemarker to create ODT documents based on templates.

Basically, what you need to do is create a template ODT document in OpenOffice, and wherever you want to insert something, you use the syntax
${anythingGoesHere}
e.g.: My name is ${name}

When you have finished the template, you must then create a properties file defining the values.

e.g.
name=
age=
birthday=
address=


From the Java program we can then load the properties file and fill in the predefined variables. We give the template and the properties file, as well as the output ODT file, as arguments to the program. After the document has been created successfully, the easiest way to get a PDF is to call a Linux program called
odt2pdf someFile.odt

which takes the odt file as argument and creates a pdf file with the same name. 
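
If you would rather trigger that step from Java than from a shell, a sketch along these lines could work. It assumes that odt2pdf is on the PATH and writes the PDF next to the input file, which I have not verified for every version of the tool; the class and method names (Odt2PdfSketch, command, expectedPdf) are hypothetical.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for calling the odt2pdf program from Java
class Odt2PdfSketch {

    /** Command line for the odt2pdf call described above. */
    static List<String> command(File odtFile) {
        return Arrays.asList("odt2pdf", odtFile.getPath());
    }

    /** The PDF that odt2pdf is assumed to create: same directory and name, .pdf extension. */
    static File expectedPdf(File odtFile) {
        String name = odtFile.getName();
        int dot = name.lastIndexOf('.');
        String baseName = dot >= 0 ? name.substring(0, dot) : name;
        return new File(odtFile.getParentFile(), baseName + ".pdf");
    }

    public static void main(String[] args) throws Exception {
        File odt = new File("someFile.odt");
        // inheritIO() forwards the tool's output to this program's console
        int exitCode = new ProcessBuilder(command(odt)).inheritIO().start().waitFor();
        System.out.println("exit code " + exitCode + ", PDF exists: " + expectedPdf(odt).isFile());
    }
}
```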

Voilà. As easy as that. You can also style the template as you like, and the styling will carry over to the generated output file.

This is a standalone program that creates a document from a template and a data file and converts it to the specified format.


// JOOReports - The Open Source Java/OpenOffice Report Engine 
// Copyright (C) 2004-2006 - Mirko Nasato 
// 
// This library is free software; you can redistribute it and/or 
// modify it under the terms of the GNU Lesser General Public 
// License as published by the Free Software Foundation; either 
// version 2.1 of the License, or (at your option) any later version. 
// 
// This library is distributed in the hope that it will be useful, 
// but WITHOUT ANY WARRANTY; without even the implied warranty of 
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 
// Lesser General Public License for more details. 
// http://www.gnu.org/copyleft/lesser.html 
// 
package net.sf.jooreports.tools; 

import java.io.File; 
import java.io.FileInputStream; 
import java.io.FileOutputStream; 
import java.net.ConnectException; 
import java.util.Properties; 

import org.apache.commons.io.FilenameUtils; 

import net.sf.jooreports.converter.DocumentConverter; 
import net.sf.jooreports.openoffice.connection.OpenOfficeConnection; 
import net.sf.jooreports.openoffice.connection.SocketOpenOfficeConnection; 
import net.sf.jooreports.openoffice.converter.OpenOfficeDocumentConverter; 
import net.sf.jooreports.templates.DocumentTemplate; 
import net.sf.jooreports.templates.UnzippedDocumentTemplate; 
import net.sf.jooreports.templates.ZippedDocumentTemplate; 
import freemarker.ext.dom.NodeModel; 

/** 
 * Command line tool to create a document from a template and a data file 
 * and convert it to the specified format. 
 * 
 * The data file can be in XML format or a simple .properties file. 
 * 
 * Requires an OpenOffice.org service to be running on localhost:8100 
 * (if the output format is other than ODT). 
 */ 
public class CreateAndConvertDocument { 

  public static void main(String[] args) throws Exception { 
    if (args.length < 3) { 
      System.err.println("USAGE: " + CreateAndConvertDocument.class.getName() + " <template-file> <data-file> <output-file>"); 
      System.exit(0); 
    } 
    File templateFile = new File(args[0]); 
    File dataFile = new File(args[1]); 
    File outputFile = new File(args[2]); 

    // A template can be an .odt file or an unzipped .odt directory
    DocumentTemplate template = null; 
    if (templateFile.isDirectory()) { 
      template = new UnzippedDocumentTemplate(templateFile); 
    } else { 
      template = new ZippedDocumentTemplate(templateFile); 
    } 

    // The data model comes from either an XML file or a .properties file
    Object model = null; 
    String dataFileExtension = FilenameUtils.getExtension(dataFile.getName()); 
    if (dataFileExtension.equals("xml")) { 
      model = NodeModel.parse(dataFile); 
    } else if (dataFileExtension.equals("properties")) { 
      Properties properties = new Properties(); 
      properties.load(new FileInputStream(dataFile)); 
      model = properties; 
    } else { 
      throw new IllegalArgumentException("data file must be 'xml' or 'properties'; unsupported type: " + dataFileExtension); 
    } 

    if ("odt".equals(FilenameUtils.getExtension(outputFile.getName()))) { 
      // ODT output needs no OpenOffice.org service
      template.createDocument(model, new FileOutputStream(outputFile)); 
    } else { 
      OpenOfficeConnection connection = new SocketOpenOfficeConnection(); 
      try { 
        connection.connect(); 
      } catch (ConnectException connectException) { 
        System.err.println("ERROR: connection failed. Please make sure OpenOffice.org is running and listening on port "+ SocketOpenOfficeConnection.DEFAULT_PORT +"."); 
        System.exit(1); 
      } 

      // Create the ODT in a temporary file first, then convert it to the requested format
      File temporaryFile = File.createTempFile("document", ".odt"); 
      temporaryFile.deleteOnExit(); 
      template.createDocument(model, new FileOutputStream(temporaryFile)); 

      try { 
        DocumentConverter converter = new OpenOfficeDocumentConverter(connection); 
        converter.convert(temporaryFile, outputFile); 
      } finally { 
        connection.disconnect(); 
      } 
    } 
  } 
}



