This tool works much the same way as the image cleanup tool, except that it is designed for the text parts of PDF files. It offers four operations: Duplicate, Merge Lines, Merge Paragraphs, and Classify. Merging follows the same principle as with images: when bounding boxes are closer to each other than the configured Delta values, they are merged into one bounding box. With Merge Paragraphs, whole paragraphs are merged into one bounding box.
Bounding boxes:
Merge Lines performed:
Merge Paragraphs performed:
The checkboxes at the bottom let the user display only the bounding boxes of interest. Duplicates, unclassified boxes, and boxes classified as Unit or Price can each be hidden. The sliders that set the Delta values are at the top of the main panel.

The Reset button discards all changes made to the document and restores its original form, drawing all the bounding boxes again. The Duplicates button cleans up bounding boxes that arise from artistic effects in PDF files, for instance the shadow of a piece of text. Such an effect is often stored as two overlapping images that together create the impression of a shadow. Using this button removes the extra boxes.

Merge Lines and Merge Paragraphs both depend on the Delta values set by the user. If two boxes lie closer together than the set value, they are merged into one box surrounding both. The Classify button labels each text as Unit or Price. Applying the changes makes them permanent in the document.
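The merging rule described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the box layout `(x0, y0, x1, y1)`, the function names, and the choice to treat a gap equal to the Delta as "close enough" are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def gap(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> float:
    """Distance between two intervals; 0 when they touch or overlap."""
    return max(0.0, max(a_lo, b_lo) - min(a_hi, b_hi))


def close_enough(a: Box, b: Box, dx: float, dy: float) -> bool:
    """True when the horizontal gap is within dx and the vertical gap within dy.

    Assumption: a gap exactly equal to the Delta still counts as close.
    """
    return gap(a[0], a[2], b[0], b[2]) <= dx and gap(a[1], a[3], b[1], b[3]) <= dy


def union(a: Box, b: Box) -> Box:
    """Smallest box surrounding both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes: List[Box], dx: float, dy: float) -> List[Box]:
    """Merge boxes repeatedly until no two remaining boxes are close enough."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, other in enumerate(result):
                if close_enough(box, other, dx, dy):
                    result[i] = union(box, other)
                    changed = True
                    break
            else:
                # No nearby box found so far; keep this one as is.
                result.append(box)
        merged = result
    return merged
```

In this sketch, a horizontal Delta joins words on the same line (as Merge Lines does), while a vertical Delta joins stacked lines (as Merge Paragraphs does). For example, `merge_boxes([(0, 0, 10, 5), (12, 0, 20, 5), (50, 0, 60, 5)], dx=3, dy=0)` joins the first two boxes and leaves the third untouched.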
The 'Duplicate' button removes duplicate bounding boxes that occur with text. For instance, there may be a second image that is merely a shadow of the text. The button combines such bounding boxes into one.
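One way to picture the duplicate cleanup is the following Python sketch. The source does not say how the tool detects duplicates, so everything here is an assumption: the `TextBox` type, the rule that duplicates carry the same text and overlap heavily, and the `min_overlap` threshold of 0.8 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    """Hypothetical text box: its text plus corner coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_ratio(a: TextBox, b: TextBox) -> float:
    """Intersection area divided by the area of the smaller box."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a.x1 - a.x0) * (a.y1 - a.y0), (b.x1 - b.x0) * (b.y1 - b.y0))
    return (w * h) / smaller if smaller > 0 else 0.0


def remove_duplicates(boxes: List[TextBox], min_overlap: float = 0.8) -> List[TextBox]:
    """Combine boxes with identical text that overlap by at least min_overlap.

    Assumption: a shadow copy has the same text as the original and is only
    slightly offset, so the two boxes overlap almost completely.
    """
    kept: List[TextBox] = []
    for box in boxes:
        for i, other in enumerate(kept):
            if box.text == other.text and overlap_ratio(box, other) >= min_overlap:
                # Replace the kept box with one that surrounds both copies.
                kept[i] = TextBox(
                    other.text,
                    min(box.x0, other.x0),
                    min(box.y0, other.y0),
                    max(box.x1, other.x1),
                    max(box.y1, other.y1),
                )
                break
        else:
            kept.append(box)
    return kept
```

With this rule, a text and its slightly offset shadow collapse into a single box, while boxes with different text or little overlap are left alone.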
Authors: Rob Kooper, Peter Bajcsy. Documentation: Peter Ferak.