
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

DOCUMENT PRESENTATION ENGINE FOR INDIAN OCR: A DOCUMENT LAYOUT ANALYSIS APPLICATION

Umesh Kumar, Jagdish Raheja

ABSTRACT

Today office automation is taking place in all fields. Everybody uses computers for fast data processing and for maintaining large amounts of data. But there is also a need for previously processed data that exists only as printed documents. There are two ways to use this old data. The first is to type the data into the computer. The second is to scan the document and use OCR (Optical Character Recognition) to convert the document image into an editable text format. There are more than 1000 languages and 14 scripts used by 112 million people in India [1, 2], so an OCR system for Indian scripts is needed and is under development. In OCR the document is scanned, and then noise cleaning, skew detection and correction, text/non-text classification, text line detection and segmentation, word segmentation, character segmentation and identification, and output file generation are performed. The main contribution of this paper is a method for maintaining the layout of the document. At present, OCR systems produce a text file as output without maintaining the layout of the document. OCR is an error-prone process, and the errors remaining in OCRed text can cause serious problems in reading and understanding if they do not refer to the original image representation. This paper presents the use of XML to generate an Open Office document file, following the Open Document standard approved by OASIS (Organization for the Advancement of Structured Information Standards) on Feb. 1, 2007 [3]. The main feature of the proposed solution is that it is script independent, so it can be applied to all Indian scripts.

Keywords: OCRed Document, Indian Scripts, Document layout.

1.0 INTRODUCTION

In the field of computer science, text and images are the main sources of information. A human can understand information easily when it is presented in a well-organized manner; for example, if text is presented together with its images in the original arrangement, the reader can follow it much more easily. If the presentation is poor, the information will not be understood. The OCR process is error prone, and it is time consuming and expensive to manually proofread OCR results. The errors remaining in OCRed text can cause serious problems in reading and understanding if they do not refer to the original image representation. Document representation after OCR is therefore a very important task: a document becomes hard to read and understand if its layout is not maintained as in the original image representation. Present systems scan the document image and place the text and images one after the other without maintaining the layout [2]. This paper gives a short discussion of OCR processing and a detailed discussion of the proposed system. The following figure gives an overview of the OCR process.

Figure 1: Flow of the OCR process

As Figure 1 shows, the first step of OCR is two-tone conversion, which converts the image into a binary image; then skew is detected and corrected. Noise cleaning is performed on the skew-corrected image. The text/non-text classification technique classifies the document image into text and image regions. The image part is extracted from the document image for further processing, and the remaining text part is passed to text line detection. After the text lines are detected, each line is processed, individual characters are generated, and the final output file is produced.

But at this stage there is a problem: in the output file the layout of the document image is not maintained. The main contribution of this paper is how to maintain the layout of the output document.

2.0 PROCESSING OF OCR

OCR is a combination of multiple processes, as shown in Figure 1. The first process of OCR is to acquire the document image using a scanner or a camera. The acquired image contains many colour combinations, but OCR processes a binary image, so the image is converted to binary: a global threshold value is generated and the image is converted to the corresponding binary image. Then the noise is removed from the input image; the most commonly used approach, the morphological component removal technique, is used to remove the noise. Figure 2 shows an input image before and after noise removal.
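The paper states only that a global threshold is generated, without naming the method. The following is a minimal sketch, assuming Otsu's method as one common way of choosing that threshold; the class and method names are illustrative and not taken from the system described here.

```java
import java.awt.image.BufferedImage;

/** Minimal sketch: global-threshold binarization (two-tone conversion) of a scanned page. */
public class Binarizer {

    /** Convert an image to black/white using a single global threshold (Otsu's method). */
    public static BufferedImage binarize(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        int[] hist = new int[256];
        int[][] gray = new int[h][w];

        // Build a grey-level histogram (luma approximation of each RGB pixel).
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int g = (int) (0.299 * ((rgb >> 16) & 0xFF)
                             + 0.587 * ((rgb >> 8) & 0xFF)
                             + 0.114 * (rgb & 0xFF));
                gray[y][x] = g;
                hist[g]++;
            }
        }

        // Otsu: pick the threshold that maximises the between-class variance.
        long total = (long) w * h, sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long) i * hist[i];
        long sumB = 0, wB = 0;
        double bestVar = -1;
        int threshold = 128;
        for (int t = 0; t < 256; t++) {
            wB += hist[t];
            if (wB == 0 || wB == total) continue;
            sumB += (long) t * hist[t];
            double mB = (double) sumB / wB;
            double mF = (double) (sumAll - sumB) / (total - wB);
            double var = (double) wB * (total - wB) * (mB - mF) * (mB - mF);
            if (var > bestVar) { bestVar = var; threshold = t; }
        }

        // Emit a binary image: text pixels black, background white.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out.setRGB(x, y, gray[y][x] < threshold ? 0x000000 : 0xFFFFFF);
        return out;
    }
}
```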

Figure 2: Noisy image, noise-cleaned image, skewed image and skew-corrected image

After noise removal, the resultant image is taken as input to the skew detection and correction technique. Skew is caused by improper alignment of the document paper during scanning: when the document is not placed properly on the scanner, the scanned image can be skewed. A skewed image may cause text detection in the document image to fail, so it is necessary to detect the skew and correct it. Figure 2 shows a skewed image and the resultant image after correction. After the preprocessing of the image, involving noise removal and skew correction, the image is segmented into two categories, text and non-text. This categorization is done by the text/non-text classification module. Document images contain text, images and tables; this module treats the images and tables as non-text areas and the remaining part as text areas [13, 14]. At this stage the non-text areas are extracted from the image and stored, and the remaining part of the image is used to detect the text. Figure 3 shows the input image, the text/non-text classification, and the extracted text area; red boundaries mark the non-text areas of the image.
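The paper does not name the skew-detection technique it uses. One standard approach, sketched below only as a hedged illustration, is a projection-profile search: the angle at which the horizontal projection of black pixels has the sharpest peaks (largest variance) is taken as the skew angle. The search range and step size are assumptions.

```java
/** Hedged sketch: projection-profile skew estimation on a binarized page. */
public class SkewEstimator {

    /** binary[y][x] is true for a black (text) pixel. Returns the estimated skew in degrees. */
    public static double estimateSkewDegrees(boolean[][] binary) {
        int h = binary.length, w = binary[0].length;
        double bestAngle = 0, bestScore = -1;

        for (double deg = -5.0; deg <= 5.0; deg += 0.1) {   // assumed search range and step
            double tan = Math.tan(Math.toRadians(deg));
            int[] profile = new int[h];

            // Accumulate black pixels per (sheared) row for this candidate angle.
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    if (!binary[y][x]) continue;
                    int row = (int) Math.round(y - x * tan);
                    if (row >= 0 && row < h) profile[row]++;
                }
            }

            // A correctly deskewed page gives sharp line peaks, i.e. large variance.
            double mean = 0;
            for (int v : profile) mean += v;
            mean /= h;
            double var = 0;
            for (int v : profile) var += (v - mean) * (v - mean);

            if (var > bestScore) { bestScore = var; bestAngle = deg; }
        }
        return bestAngle;
    }
}
```

The page is then rotated by the negative of the estimated angle to obtain the skew-corrected image.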

Figure 3: An input image, text/non-text classification, and extracted text area

Figure 3 shows the complete text part of the document image, which is used to detect the text. After classification each text block is identified and text lines are detected, as shown in Figure 5 for a single text block.
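A common way to realize text line detection on a deskewed text block (not necessarily the method used by the authors) is to look for blank rows in the horizontal projection profile. The sketch below returns the top and bottom row of each detected line.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: text line detection via blank rows in the horizontal projection. */
public class LineDetector {

    /** Returns [top, bottom] row pairs for each detected text line in a binary block. */
    public static List<int[]> detectLines(boolean[][] block) {
        int h = block.length, w = block[0].length;
        int[] rowCount = new int[h];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (block[y][x]) rowCount[y]++;

        List<int[]> lines = new ArrayList<>();
        int start = -1;
        for (int y = 0; y < h; y++) {
            boolean ink = rowCount[y] > 0;          // row contains text pixels
            if (ink && start < 0) start = y;        // a line begins
            if (!ink && start >= 0) {               // a line ends at the first blank row
                lines.add(new int[]{start, y - 1});
                start = -1;
            }
        }
        if (start >= 0) lines.add(new int[]{start, h - 1});
        return lines;
    }
}
```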

Figure 5: Detected text lines

After detecting the text lines, word segmentation, character segmentation, template matching and character replacement are performed. Standard techniques exist for these processes and are discussed elsewhere [5, 12], so detailed discussion is not given here; only a short introduction is given for the sake of completeness. Word segmentation is performed using a basic feature of the script, the white space between words. Each word is segmented and passed for further processing, namely character segmentation. Character segmentation is again performed using the same basic feature of the scripts, the white space or gap between characters. After this process the output is a set of individual character images, and for each one the matching character is searched and substituted.
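Since both word and character segmentation rely on the same white-space feature, one illustrative implementation (an assumption, not taken from the paper) scans the vertical projection of a text line and splits it wherever an empty run of columns reaches a given width: a large minimum gap yields words, a small one yields characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of white-space-based segmentation of one text line; minGap is an assumed parameter. */
public class GapSegmenter {

    /** Returns [left, right] column pairs for segments separated by gaps of >= minGap blank columns. */
    public static List<int[]> segment(boolean[][] line, int minGap) {
        int h = line.length, w = line[0].length;
        int[] colCount = new int[w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (line[y][x]) colCount[x]++;

        List<int[]> segments = new ArrayList<>();
        int start = -1, gap = 0;
        for (int x = 0; x < w; x++) {
            if (colCount[x] > 0) {
                if (start < 0) start = x;              // a segment begins
                gap = 0;
            } else if (start >= 0 && ++gap >= minGap) { // gap wide enough: close the segment
                segments.add(new int[]{start, x - gap});
                start = -1;
            }
        }
        if (start >= 0) segments.add(new int[]{start, w - 1});
        return segments;
    }
}
```

Calling segment(line, wideGap) on a line gives word boxes; calling it again on each word image with a narrow gap gives character boxes.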

3.0 OVERVIEW OF PROBLEM

The present OCR systems available for Indian scripts are able to convert the document image into editable text and produce a text file, as discussed above. They perform the pre-processing, classify the text and non-text areas, detect the text in the segmented text areas and generate the equivalent text. The editable text and images are placed in the text file. This works fine and causes no problem if the document image is single column. But what happens if the document image is multicolumn, as shown in the following image?

Figure 6: Multicolumn document image (left) and corresponding disordered block representation (right)

As shown in Figure 6 (left), there are five text blocks and two image blocks. Suppose block 1 and then blocks 3, 2, 4 and 5 are processed one by one and placed in the output file.

If the output does not follow the correct reading sequence, or an image is not placed at its proper location, it is not easy to understand the document. So it is important to maintain the layout of the document as it was in the input document image. Figure 6 (right) represents the output file of the OCR systems available now, and it illustrates the serious problem of document understanding. The document image in Figure 6 (left) is understandable because the reader knows which column to read after finishing one column, but in Figure 6 (right) the document presentation is lost, so the reader is not able to understand it. This paper presents a simple approach which takes the text output and the block coordinate information from the OCR and produces the output file in the odt format, in which the layout is maintained as it was in the input document image.

4.0 PROPOSED SOLUTION

The problem of layout retention for OCRed document images is solved using the Open Document Format specification v1.1, approved by OASIS in Feb. 2007. It specifies the XML support needed to generate the document and handle the document elements. The document contains the document roots, document metadata, body elements and document types, application settings, scripts, font face declarations, styles, page styles and layout [3]. Using Java, a simple parser is designed which defines the document elements in XML and compresses them into the document file. An Open Document Text file is generated as output, which can be viewed on Linux as well as on Windows using Open Office [3] and Microsoft Office (using plug-ins).

Algorithm / Flow Chart:

1. Read the block information file. If there are no blocks, exit.
2. Create the support files.
3. Select the font for the specific language.
4. Create the Content.xml file.
5. Write the root information in the content file.
6. Read the block coordinate information.
7. If (image != 0) then include the element meta-data for image and text; else if (text != 0) then include the element meta-data for text.
8. For I = 0 to N   // where N is the number of images
   8.1 Read the coordinate information.
   8.2 Insert the image, the image tag and the coordinate information.
9. For I = 0 to M   // where M is the number of text blocks
   9.1 Read the coordinate information.
   9.2 Put the tag for the text frame with the coordinate information.
   9.3 Insert the text in the tag written above.
10. Write the closing tag in the content file.
11. Generate the output file using the jar utility.
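A hedged sketch of steps 4-11 follows: it writes a minimal content.xml with the ODF namespaces and packages it, together with a manifest and the mandatory mimetype entry, into an .odt ZIP archive. The class name, method names and the reduced file set (a complete package also carries styles and meta files) are assumptions for illustration, not the paper's actual parser.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Sketch: write a minimal content.xml and package it as an .odt (ZIP) file. */
public class OdtWriter {

    static final String MIME = "application/vnd.oasis.opendocument.text";

    public static void write(String path, String bodyXml) throws IOException {
        // Root information plus the body elements produced from the block coordinates.
        String content =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<office:document-content " +
            "xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\" " +
            "xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\" " +
            "xmlns:draw=\"urn:oasis:names:tc:opendocument:xmlns:drawing:1.0\" " +
            "xmlns:svg=\"urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0\" " +
            "xmlns:xlink=\"http://www.w3.org/1999/xlink\" office:version=\"1.1\">" +
            "<office:body><office:text>" + bodyXml + "</office:text></office:body>" +
            "</office:document-content>";

        // Support file: the package manifest listing the entries.
        String manifest =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<manifest:manifest xmlns:manifest=\"urn:oasis:names:tc:opendocument:xmlns:manifest:1.0\">" +
            "<manifest:file-entry manifest:media-type=\"" + MIME + "\" manifest:full-path=\"/\"/>" +
            "<manifest:file-entry manifest:media-type=\"text/xml\" manifest:full-path=\"content.xml\"/>" +
            "</manifest:manifest>";

        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(path))) {
            // The mimetype entry must be the first entry and stored uncompressed.
            byte[] mime = MIME.getBytes(StandardCharsets.UTF_8);
            ZipEntry e = new ZipEntry("mimetype");
            e.setMethod(ZipEntry.STORED);
            e.setSize(mime.length);
            CRC32 crc = new CRC32();
            crc.update(mime);
            e.setCrc(crc.getValue());
            zip.putNextEntry(e);
            zip.write(mime);
            zip.closeEntry();

            addEntry(zip, "content.xml", content);
            addEntry(zip, "META-INF/manifest.xml", manifest);
        }
    }

    private static void addEntry(ZipOutputStream zip, String name, String data) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(data.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }
}
```

The ZipOutputStream here plays the role of the jar/zip utility mentioned in step 11; the resulting file opens directly in Open Office.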

The above algorithm and flow chart represent how the document is generated and how its elements are placed. Every document contains the document root tag, which carries the meta-data and static information. Each XML file has specific root tags which give information such as the version, date and type of the document, and a footer tag. The user can choose the language of the OCR, because the technique proposed here is script independent and works for all the languages. The next steps depend upon the block coordinate information file: from this file it is decided whether a block is an image block or a text block. If the document is not blank, the declaration data is written in the content file after the root information; this declaration data defines whether there are image blocks or text blocks. Now the document elements are inserted in the document using their coordinate information. For each image the drawing process is repeated, using the coordinate values, the width and height of the image, and the path of the image to insert. Once the image blocks are inserted in the document, text frames are inserted for the text information and the recognized text is written inside each text frame tag; this is repeated for every available text block. At last the document leaf tag, called the footer tag, is included; it is static and the same for all documents.
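To make the frame insertion concrete, the sketch below shows one way the per-block markup could be emitted: each image block becomes a draw:frame wrapping a draw:image, and each text block becomes a draw:frame wrapping a draw:text-box that holds the recognized text, both positioned by the block coordinates. The helper names, the centimetre unit and the omission of style and anchoring details are simplifications for illustration, not details taken from the paper.

```java
import java.util.Locale;

/** Sketch: build the per-block ODF markup from block coordinates (frames sit inside a text:p in the body). */
public class FrameMarkup {

    /** An image block becomes a frame wrapping a draw:image that points to a picture inside the package. */
    public static String imageFrame(String href, double xCm, double yCm, double wCm, double hCm) {
        return String.format(Locale.ROOT,
            "<draw:frame text:anchor-type=\"paragraph\" svg:x=\"%.3fcm\" svg:y=\"%.3fcm\" " +
            "svg:width=\"%.3fcm\" svg:height=\"%.3fcm\">" +
            "<draw:image xlink:href=\"%s\"/></draw:frame>",
            xCm, yCm, wCm, hCm, href);
    }

    /** A text block becomes a frame wrapping a text box holding the OCRed paragraph. */
    public static String textFrame(String ocrText, double xCm, double yCm, double wCm, double hCm) {
        return String.format(Locale.ROOT,
            "<draw:frame text:anchor-type=\"paragraph\" svg:x=\"%.3fcm\" svg:y=\"%.3fcm\" " +
            "svg:width=\"%.3fcm\" svg:height=\"%.3fcm\">" +
            "<draw:text-box><text:p>%s</text:p></draw:text-box></draw:frame>",
            xCm, yCm, wCm, hCm, escape(ocrText));
    }

    /** Minimal XML escaping for the recognized text. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Concatenating the frames for all blocks gives the bodyXml string passed to OdtWriter.write above.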

5.0 TESTING AND RESULTS

This section takes a reference input image and describes the result step by step.

Figure 7: A document image which will be OCRed

Pre-processing is performed on the image, the blocks are classified as text and non-text, and the block coordinate information is stored in a file. The text blocks are further processed and the equivalent text is produced. The document image above has four image blocks and three text blocks, and their corresponding block coordinate information, converted to the document size, is given below:

image 4
5734.05 10919.968 17416.018 17105.884
6755.892 151.892 17393.92 4012.946
249.936 154.94 6470.904 4121.912
228.092 4129.024 10721.086 10816.082

text 3
10870.946 4215.892 17375.886 1234.43
202.946 10997.946 5592.826 345.765
305.054 17347.946 17445.99 4567.87

There are four coordinate values for each of the image blocks and text blocks. Using these coordinate values, the text frames are drawn, the text from the text files is inserted, and the image paths are defined with their coordinate values, height and width. After all the elements of the document are defined in the content file, the jar/zip utility is used to compress all the files into filename.odt. The output file, opened in Open Office, has the correct layout, as shown in Figure 8.
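The paper does not state how the block coordinates are "converted to the document size". Assuming the blocks are first located in pixels and the scan resolution (DPI) is known, one standard conversion is from pixels to centimetres, which can then be written as cm lengths in the frame markup; the sketch below is only that assumed conversion.

```java
/** Sketch of a pixel-to-document-unit conversion (assumed, not specified in the paper). */
public class Units {
    /** 1 inch = 2.54 cm, so x_cm = x_px / dpi * 2.54. */
    public static double pixelsToCm(double pixels, double dpi) {
        return pixels / dpi * 2.54;
    }
}
```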

Figure 8: Output of the Document Presentation Engine process

6.0 CONCLUSION AND FUTURE WORKS

The approach presented in this paper has been implemented and tested with different Indian scripts as well as Roman script. It works very well, but it depends on the accuracy of the text/non-text classification process. Although it works well, much remains to be done: the OCR gives its output as plain text and does not recognize whether the text is bold, italic, underlined, or a heading, so many features remain to be implemented in the OCR.

REFERENCES

[1] http://tdil.mit.gov.in/resource_centre.html
[2] http://indiansaga.com/languages/index.html
[3] http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-v1.1.pdf
[4] Thomas M. Breuel, "High Performance Document Layout Analysis", Volume 3, pp. 61-64, 2002.
[5] Gaurav Gupta, Shobhit Niranjan, Ankit Shrivastava, Dr. RMK Sinha, "Document Layout Analysis and its application in OCR", 10th IEEE International Enterprise Distributed Object Computing Conference Workshops, 2006.
[6] Jignesh Dholakia, Atul Negi, S. Rama Mohan, "Zone Identification in Gujarati Text", Proceedings of the 2005 8th International Conference on Document Analysis and Recognition, IEEE, 2005.
[7] Suryaprakash Kompalli, Sankalp Nayak, Srirangaraj Setlur, "Challenges in OCR of Devanagari Documents", Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 1520-5263/05, IEEE, 2005.
[8] Pavlidis and Zhou J., "Page segmentation and classification", Graphical Models and Image Processing, vol. 54, pp. 484-496, 1982.
[9] Kolcz, A., Alspector, J., Augusteijn, M., Carlson, R., Popescu, G.V. [G. Viorel], "A Line-Oriented Approach to Word Spotting in Handwritten Documents", PAA(3), No. 2, 2000, pp. 153-168.
[10] Likforman-Sulem, L. [Laurence], Zahour, A. [Abderrazak], Taconet, B. [Bruno], "Text line segmentation of historical documents: a survey", IJDAR(9), No. 2-4, April 2007, pp. 123-138.
[11] Udo Miletzki, "Character Recognition in Practice Today and Tomorrow", IEEE, 1997.
[12] Tao Hong and Sargur N. Srihari, "Representing OCRed Documents in HTML", IEEE, 1997.
[13] Z. Shi, S. Setlur, and V. Govindaraju, "Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map", Eighth International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, pp. 794-798.
[14] Umesh Kumar, "Text Line detection in Multicolumn document image and Document Presentation Engine", M.Tech Thesis, GGSIPU, New Delhi, 2008.
