Document layout analysis







 

 


In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and their arrangement in the correct reading order.[1] Detection and labeling of the different zones (or blocks) embedded in a document as text body, illustrations, math symbols, and tables is called geometric layout analysis.[2] Text zones also play different logical roles inside the document (titles, captions, footnotes, etc.), and this kind of semantic labeling is the scope of logical layout analysis.

Document layout analysis is the union of geometric and logical labeling. It is typically performed before a document image is sent to an OCR engine, but it can also be used to detect duplicate copies of the same document in large archives, or to index documents by their structure or pictorial content.

Document layout is formally defined in the international standard ISO 8613-1:1989.

Overview of methods

There are two main approaches to document layout analysis. Firstly, there are bottom-up approaches which iteratively parse a document based on the raw pixel data. These approaches typically first parse a document into connected regions of black and white, then these regions are grouped into words, then into text lines, and finally into text blocks.[3][4] Secondly, there are top-down approaches which attempt to iteratively cut up a document into columns and blocks based on white space and geometric information.[4]
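The first stage of a bottom-up method, extracting connected regions of black pixels, can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for black and 0 for white, and it uses 8-connectivity; the function name `connected_components` and the sample `page` are illustrative choices, not part of any cited algorithm.

```python
from collections import deque

def connected_components(image):
    """Find 8-connected regions of black pixels (value 1) in a binary image.

    `image` is a list of rows of 0/1 values. Returns one bounding box
    (top, left, bottom, right), inclusive, per region, in scan order.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# A tiny hypothetical page with three separate symbols.
page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
]
symbol_boxes = connected_components(page)
```

The resulting bounding boxes are the units that later stages would group into words, text lines, and text blocks.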

The bottom-up approaches are the traditional ones, and they have the advantage that they require no assumptions about the overall structure of the document. On the other hand, bottom-up approaches require iterative segmentation and clustering, which can be time-consuming.[4] Top-down approaches have the advantage that they parse the global structure of a document directly, thus eliminating the need to iteratively cluster together the possibly hundreds or even thousands of characters and symbols which appear in a document. They tend to be faster, but in order to operate robustly they typically require a number of assumptions to be made about the layout of the document.[4] Examples of top-down approaches include the recursive X-Y cut algorithm, which decomposes the document into rectangular sections.[5]
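As an illustration of the top-down idea, the following is a simplified sketch of a recursive X-Y cut that works on pixel projection profiles. It is not the bounding-box variant from the cited paper. It assumes a binary image stored as a list of rows (1 for black, 0 for white), and `min_gap` is a hypothetical tuning parameter giving the smallest run of empty rows or columns that counts as a separator.

```python
def xy_cut(image, min_gap=2):
    """Recursively split a binary image into rectangular blocks.

    Returns a list of (top, left, bottom, right) boxes, inclusive.
    """
    def widest_gap(profile):
        # Return (length, start) of the longest run of zeros in the profile.
        best_len, best_start, run_start = 0, None, None
        for i, value in enumerate(profile):
            if value == 0:
                if run_start is None:
                    run_start = i
                if i - run_start + 1 > best_len:
                    best_len, best_start = i - run_start + 1, run_start
            else:
                run_start = None
        return best_len, best_start

    def split(top, left, bottom, right):
        # Projection profiles: black-pixel counts per row and per column.
        rows = [sum(image[r][left:right + 1]) for r in range(top, bottom + 1)]
        cols = [sum(image[r][c] for r in range(top, bottom + 1))
                for c in range(left, right + 1)]
        ink_rows = [i for i, v in enumerate(rows) if v]
        ink_cols = [i for i, v in enumerate(cols) if v]
        if not ink_rows:
            return []  # the region contains no black pixels
        # Shrink the region to the bounding box of its black pixels.
        rows = rows[ink_rows[0]:ink_rows[-1] + 1]
        cols = cols[ink_cols[0]:ink_cols[-1] + 1]
        top, bottom = top + ink_rows[0], top + ink_rows[-1]
        left, right = left + ink_cols[0], left + ink_cols[-1]
        row_gap, row_start = widest_gap(rows)
        col_gap, col_start = widest_gap(cols)
        if max(row_gap, col_gap) < min_gap:
            return [(top, left, bottom, right)]  # no separator left: a leaf block
        if row_gap >= col_gap:
            cut = top + row_start  # cut along the widest band of empty rows
            return (split(top, left, cut - 1, right)
                    + split(cut + row_gap, left, bottom, right))
        cut = left + col_start     # cut along the widest band of empty columns
        return (split(top, left, bottom, cut - 1)
                + split(top, cut + col_gap, bottom, right))

    return split(0, 0, len(image) - 1, len(image[0]) - 1)

# Hypothetical page: two columns above a full-width line.
two_column_page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(two_column_page, min_gap=2)
```

The sketch makes the layout assumption explicit: it only works when blocks are separated by straight horizontal or vertical bands of white space, which is why skew must be removed first.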

There are two issues common to any approach to document layout analysis: noise and skew. Noise refers to image noise, such as salt-and-pepper noise or Gaussian noise. Skew refers to the fact that a document image may be rotated so that the text lines are not perfectly horizontal. It is a common assumption in both document layout analysis algorithms and optical character recognition algorithms that the characters in the document image are oriented so that text lines are horizontal. Therefore, if skew is present, it is important to rotate the document image so as to remove it.

It follows that the first steps in any document layout analysis code are to remove image noise and to come up with an estimate for the skew angle of the document.
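One common way to suppress salt-and-pepper noise is a median filter. The sketch below is a minimal 3x3 version, assuming a grayscale image stored as a list of rows of integers (0 for black, 255 for white); the function name and the sample image are illustrative, and the source does not prescribe a particular filter.

```python
from statistics import median_low

def median_filter_3x3(image):
    """Return a copy of a grayscale image with a 3x3 median filter applied.

    Border pixels use only the neighbors that exist, and `median_low`
    keeps the output values integers even for even-sized windows.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [image[y][x]
                      for y in range(max(0, r - 1), min(height, r + 2))
                      for x in range(max(0, c - 1), min(width, c + 2))]
            out[r][c] = median_low(window)
    return out

# A white 5x5 image with a single black "pepper" pixel in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter_3x3(noisy)
```

In this example the isolated black pixel is removed entirely. The same behavior would also erase a genuine one-pixel punctuation mark, so the filter size has to be chosen with the scan resolution in mind.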

Example of a bottom-up approach

In this section we will walk through the steps of a bottom-up document layout analysis algorithm developed in 1993 by O'Gorman.[3] The steps in this approach are as follows:

  1. Preprocess the image to remove Gaussian and salt-and-pepper noise. Note that some noise removal filters may consider commas and periods as noise, so some care must be taken.
  2. Convert the image into a binary image, i.e. convert each pixel value to completely white or completely black.
  3. Segment the image into connected components of black pixels. These are the symbols of the image. For each symbol, compute a bounding box and centroid.
  4. For each symbol, determine its k nearest neighbors, where k is an integer greater than or equal to four. O'Gorman suggests k=5 in his paper as a good compromise between robustness and speed. The reason to use at least k=4 is that for a symbol in a document, the two or three nearest symbols are the ones right next to it on the same text line. The fourth-nearest symbol is typically on a line right above or below, and these symbols must be included in the nearest-neighbor calculation for the steps that follow.
  5. Each nearest-neighbor pair of symbols is related by a vector pointing from one symbol's centroid to the other symbol's centroid. If these vectors are plotted for every pair of nearest-neighbor symbols, the result is what is called the docstrum of the document. One can also use the angle Θ from the horizontal and the distance D between two nearest-neighbor symbols to create a nearest-neighbor angle histogram and a nearest-neighbor distance histogram.
  6. Using the nearest-neighbor angle histogram, the skew of the document can be calculated. If the skew is acceptably low, continue to the next step. If it is not, rotate the image so as to remove the skew and return to step 3.
  7. The nearest-neighbor distance histogram has several peaks, and these peaks typically represent between-character spacing, between-word spacing, and between-line spacing. Calculate these values from the histogram and set them aside.
  8. For each symbol, look at its nearest neighbors and flag any whose distance from the symbol is within some tolerance of the between-character spacing or the between-word spacing. For each flagged neighbor, draw a line segment connecting its centroid to the symbol's centroid.
  9. Symbols connected to their neighbors by line segments form text lines. Using all the centroids in a text line, one can compute an actual line segment representing the text line with linear regression. This is important because it is unlikely that all the centroids of symbols in a text line are actually collinear.
  10. For each pair of text lines, one can compute a minimum distance between their corresponding line segments. If this distance is within some tolerance of the between-line spacing calculated in step 7, then the two text lines are grouped into the same text block.
  11. Finally, one can calculate a bounding box for each text block, and the document layout analysis is complete.
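Steps 4 to 7 can be sketched as follows, assuming the symbol centroids have already been extracted as (x, y) pairs. This is a simplified illustration, not O'Gorman's implementation: the skew is taken as the most populated bin of the angle histogram, and the spacings are read from distance histograms restricted to pairs lying along or across the text direction. The function names and the `bin_width` and `tolerance` parameters are assumptions made for the example.

```python
import math
from collections import Counter

def nearest_neighbor_pairs(centroids, k=5):
    """Return (angle_degrees, distance) for each symbol's k nearest neighbors.

    Angles are folded into [-90, 90) so the direction of a pair is ignored.
    """
    pairs = []
    for i, (x0, y0) in enumerate(centroids):
        others = sorted(
            (math.hypot(x1 - x0, y1 - y0), x1 - x0, y1 - y0)
            for j, (x1, y1) in enumerate(centroids) if j != i
        )
        for dist, dx, dy in others[:k]:
            angle = math.degrees(math.atan2(dy, dx))
            pairs.append(((angle + 90.0) % 180.0 - 90.0, dist))
    return pairs

def estimate_skew(pairs, bin_width=1.0):
    """Return the angle (in degrees) of the most populated histogram bin."""
    bins = Counter(round(angle / bin_width) for angle, _ in pairs)
    peak_bin, _ = bins.most_common(1)[0]
    return peak_bin * bin_width

def estimate_spacing(pairs, skew, tolerance=10.0):
    """Return (within-line spacing, between-line spacing) in pixels.

    Each value is the most frequent rounded distance among pairs whose
    angle is within `tolerance` degrees of the text direction or of its
    perpendicular; None is returned when no pair qualifies.
    """
    def angular_gap(a, b):
        d = abs(a - b) % 180.0
        return min(d, 180.0 - d)

    def peak_distance(target_angle):
        hist = Counter(round(dist) for angle, dist in pairs
                       if angular_gap(angle, target_angle) <= tolerance)
        return hist.most_common(1)[0][0] if hist else None

    return peak_distance(skew), peak_distance(skew + 90.0)

# Hypothetical page: 3 text lines of 6 symbols, 10 px apart, lines 30 px apart.
grid = [(10 * c, 30 * r) for r in range(3) for c in range(6)]
pairs = nearest_neighbor_pairs(grid, k=5)
skew = estimate_skew(pairs)
within_line, between_line = estimate_spacing(pairs, skew)
```

On real pages the within-line histogram usually shows separate peaks for between-character and between-word spacing, so a fuller implementation would extract more than one peak rather than only the most frequent distance.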

References

  1. Baird, H. S. (July 1992). "Anatomy of a versatile page reader". Proceedings of the IEEE. 80 (7): 1059–1065. CiteSeerX 10.1.1.40.8060. doi:10.1109/5.156469.
  2. Cattoni, R.; Coianiz, T.; Messelodi, S.; Modena, C. M. "Geometric Layout Analysis Techniques for Document Image Understanding: a Review". ITC-irst Technical Report TR#9703-09.
  3. O'Gorman, L. (1993). "The document spectrum for page layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 15 (11): 1162–1173. doi:10.1109/34.244677.
  4. Lee, Seong-Whan; Ryu, Dae-Seok (2001). "Parameter-free geometric document layout analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 23 (11): 1240–1256. CiteSeerX 10.1.1.574.7875. doi:10.1109/34.969115.
  5. Ha, Jaekyu; Haralick, Robert M.; Phillips, Ihsin T. (1995). "Recursive X-Y Cut using Bounding Boxes of Connected Components" (PDF). Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95).

    This page was last edited on 25 April 2024, at 07:11 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.


