DATA SHEET

Read Text with OCR

Blue Prism’s Read Text with OCR action uses Google’s Tesseract open-source OCR (Optical Character Recognition) engine to read text without having to identify the font or disable font smoothing.

Using Read Text with OCR

To use Read Text with OCR, spy a Region element, drag it into a Read stage and the option will appear in the Data dropdown, as shown below.

Settings

The OCR engine can work well with its default settings, but the Read stage input parameters can be adjusted if necessary.

Language

• Specifies a language other than the default, which is English.

• Download the language pack for the correct version of Tesseract from https://code.google.com/p/tesseract-ocr/downloads/list or from https://github.com/tesseract-ocr/tessdata

• Extract the language pack to “C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata”

• Use the three-character ISO code to select the language within Blue Prism. The code is the same as the name of the extracted file, e.g. SPA for Spanish, FRA for French.
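The naming rule above can be checked before a run. The following Python sketch is illustrative only and is not part of Blue Prism. It assumes the language pack extracts to a file named `<code>.traineddata` (the convention used in the tessdata repository), and the helper name `language_pack_installed` is hypothetical.

```python
from pathlib import Path

# Folder named in this data sheet; adjust if Blue Prism is installed elsewhere.
TESSDATA_DIR = Path(
    r"C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata"
)


def language_pack_installed(code: str, tessdata_dir: Path = TESSDATA_DIR) -> bool:
    """Return True if <code>.traineddata exists in tessdata_dir (case-insensitive).

    Assumption: language packs are single files named after the
    three-character code, e.g. spa.traineddata for Spanish.
    """
    wanted = f"{code.lower()}.traineddata"
    if not tessdata_dir.is_dir():
        return False
    return any(p.name.lower() == wanted for p in tessdata_dir.iterdir())
```

For example, `language_pack_installed("SPA")` would report whether a Spanish pack is present in the default folder before "SPA" is passed to the Language input.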

Commercial in Confidence ®Blue Prism is a registered trademark of Blue Prism Group plc


Page Segmentation Mode

By default, the Tesseract engine expects a page of text when it processes an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the Page Segmentation Mode input parameter. Note that adding a small white border to text which is too tightly cropped may also help with page segmentation.

The different options available for the Page Segmentation Mode input are as follows:

Page Segmentation Mode settings    Description
OSD                                Orientation and script detection (OSD) only.
AutoWithOSD                        Automatic page segmentation with OSD.
AutoNoOCR                          Automatic page segmentation, but no OSD, or OCR.
Auto                               Fully automatic page segmentation, but no OSD. (Default)
Column                             Assume a single column of text of variable sizes.
VerticalBlock                      Assume a single uniform block of vertically aligned text.
Block                              Assume a single uniform block of text.
Line                               Treat the image as a single text line.
Word                               Treat the image as a single word.
CircledWord                        Treat the image as a single word in a circle.
Character                          Treat the image as a single character.

If the output quality of 'Read Text With OCR' is not as expected, change the Page Segmentation Mode to a setting appropriate for the text area being read. For example, if the text area should contain a single line of text, set this parameter to "Line". For further information on segmentation modes, consult the official Tesseract documentation.
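When cross-referencing the Tesseract documentation, it can help to see the mode names next to Tesseract's own numeric page segmentation modes. The Python sketch below is not Blue Prism code. The pairing of names to numbers is an assumption inferred from the matching descriptions, and `suggest_mode` is a hypothetical rule of thumb for picking a starting point.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Data sheet mode names paired with Tesseract page segmentation
    mode numbers (assumed mapping, based on matching descriptions)."""
    OSD = 0
    AutoWithOSD = 1
    AutoNoOCR = 2
    Auto = 3          # the default
    Column = 4
    VerticalBlock = 5
    Block = 6
    Line = 7
    Word = 8
    CircledWord = 9
    Character = 10


def suggest_mode(expected_lines: int, expected_words_per_line: int) -> PageSegMode:
    """Hypothetical heuristic: choose a starting mode from the expected
    shape of the text in the region. Always confirm by testing."""
    if expected_lines == 1 and expected_words_per_line == 1:
        return PageSegMode.Word
    if expected_lines == 1:
        return PageSegMode.Line
    return PageSegMode.Block
```

A region expected to hold one line of several words would start at `PageSegMode.Line`, matching the "Line" example above.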

Character Whitelist

• Used to restrict which characters can be recognised. For example, to ignore everything except digits and the minus sign, enter “1234567890-”

• The order of characters does not matter: “1234567890” works as well as “0987123456”

• Make sure to include any special characters that may be needed, e.g. . , $ ‘ – ()
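The two whitelist points above (order is irrelevant, and special characters are easy to forget) can be illustrated with a small Python sketch. This is not how the OCR engine applies the whitelist internally. Both helper names are hypothetical, and whitespace is skipped on the assumption that spaces do not need to be listed.

```python
def same_whitelist(a: str, b: str) -> bool:
    """Two whitelists are equivalent if they contain the same characters,
    regardless of order or repetition."""
    return set(a) == set(b)


def missing_from_whitelist(sample: str, whitelist: str) -> set:
    """Return the characters of an expected sample value that the
    whitelist would exclude (whitespace ignored by assumption)."""
    allowed = set(whitelist)
    return {ch for ch in sample if ch not in allowed and not ch.isspace()}
```

Checking a representative value such as "$1,234.50" against a digits-only whitelist shows that "$", "," and "." would need to be added before the field could be read in full.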

Diagnostics Path

• Optional location for the output of what gets OCR’d. This is helpful for diagnosing problems when the OCR is not working as expected.

• Files in the output folder are overwritten on each run.
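Because the diagnostics output is overwritten on each run, it may be worth copying the folder aside when comparing runs. The Python sketch below is one way to do that outside Blue Prism. The function name and the `ocr-diagnostics-<timestamp>` folder naming are assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_diagnostics(diagnostics_dir: Path, archive_root: Path) -> Path:
    """Copy the current diagnostics folder to a new timestamped folder
    under archive_root and return the path of the copy."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = archive_root / f"ocr-diagnostics-{stamp}"
    # copytree creates the target (and any missing parents) itself.
    shutil.copytree(diagnostics_dir, target)
    return target
```

Calling this after each run keeps one snapshot per run, so output produced with different Scale or Page Segmentation Mode settings can be compared side by side.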

Scale

• This is how much the engine will zoom in to read the image. The default is 4, but a value between 8 and 12 will often provide better results. Going over 14 produces poorer results within a larger region of text.

• Experiment with different values until you find the scale that returns the best results for your use case.


• When reading text from multiple columns, set the scale to 10 or higher to keep the text on a single row; otherwise the data may be returned in a single column.
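The experimentation recommended above can be made systematic by reading a region whose text is already known at several scales and scoring each result. The Python sketch below assumes a caller-supplied function `read_at_scale` (hypothetical) that performs the read for a given scale and returns the recognised text. It uses the standard library's `difflib.SequenceMatcher` for a similarity score between 0 and 1.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, Tuple


def best_scale(
    read_at_scale: Callable[[int], str],
    expected: str,
    scales: Iterable[int] = range(4, 15),  # 4 (the default) up to 14
) -> Tuple[int, Dict[int, float]]:
    """Score the OCR output at each scale against known text.

    Returns the best-scoring scale and all scores. On a tie, the
    lowest scale tried first wins.
    """
    scores = {
        scale: SequenceMatcher(None, expected, read_at_scale(scale)).ratio()
        for scale in scales
    }
    best = max(scores, key=scores.get)
    return best, scores
```

A single known sample is only a starting point. Repeating the comparison over many representative regions gives a more reliable choice of scale.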

Details

OCR is not intended as a replacement for Character Matching, and the Recognise Text action is still available. OCR and Character Matching are different recognition techniques, and each has advantages and disadvantages.

Tips

• The OCR feature works best on longer strings rather than on one to three words.

• Terminal emulators used by mainframes display mono-spaced fonts, so continue using Character Matching for them and create your own font if necessary.

• Unlike Character Matching, OCR does not need font smoothing to be switched off.

• Both methods require a clear view of the application screen.

Limitations

• The engine is designed to work from 300 dpi images rather than screen captures (~100 dpi), so it is not a complete replacement for Recognise Text.

• OCR can produce a ‘false positive’ or a ‘false negative’. A false positive is when the OCR incorrectly determines that some text value exists on the screen when in reality it does not. A false negative is when the OCR mistakenly decides that a value does not exist when in fact it does.

• By contrast, Character Matching is more deterministic: either there is a 100% match with the character shape or there is no match.

• Care should always be taken when using any OCR technology. Quality cannot be guaranteed in advance, and only large-scale testing of your specific use case will show whether the technology is suitable for your solution. Where possible, Recognise Text should always be used instead.
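For the large-scale testing mentioned above, false positives and false negatives can be counted from a set of labelled test cases. The Python sketch below is a minimal illustration. The `Tally` type and the representation of each case as a pair (value actually present, value reported by OCR) are assumptions.

```python
from typing import Iterable, NamedTuple, Tuple


class Tally(NamedTuple):
    correct: int
    false_positives: int
    false_negatives: int


def tally_results(cases: Iterable[Tuple[bool, bool]]) -> Tally:
    """Count outcomes for (actually_present, ocr_found) pairs.

    False positive: OCR reports a value that is not on the screen.
    False negative: OCR misses a value that is on the screen.
    """
    correct = false_pos = false_neg = 0
    for actually_present, ocr_found in cases:
        if ocr_found and not actually_present:
            false_pos += 1
        elif actually_present and not ocr_found:
            false_neg += 1
        else:
            correct += 1
    return Tally(correct, false_pos, false_neg)
```

Tracking the two error types separately matters because their costs usually differ: a false positive may trigger an action on data that is not there, while a false negative may only cause a retry.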

The information contained in this document is the proprietary and confidential information of Blue Prism Limited and should not be disclosed to a third party without the written consent of an authorised Blue Prism representative. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying without the written permission of Blue Prism Limited. © Blue Prism Limited, 2001 - 2017 All trademarks are hereby acknowledged and are used to the benefit of their respective owners. Blue Prism is not responsible for the content of external websites referenced by this document. Blue Prism Group plc, Centrix House, Crow Lane East, Newton-le-Willows, WA12 9UY, United Kingdom Registered in England: Reg. No. 4260035. Tel: +44 870 879 3000. Web: www.blueprism.com
