Thursday 3 November 2011

Extract text from images: a comparison of 10 free OCR tools

We reviewed the following online OCR services and desktop OCR programs, all of which are either free or have a free component.

Online OCR services:
  1. Google Docs
  2. Free Online OCR
  3. i2OCR
  4. OCRonline
  5. Online OCR

Desktop software:
  1. Cuneiform OpenOCR
  2. FreeOCR
  3. gImageReader
  4. Puma.NET
  5. SimpleOCR

Part 1: Online OCR software

Online OCR software runs in the web browser, so you don’t have to install anything on your computer. All you need to do is capture an image file with a scanner or a digital camera, upload it through the OCR web page and wait for the processed file to download.

1. Google Docs

If you have a Gmail or other Google account you might try Google Docs first. Google Docs is not a dedicated OCR tool but it provides the OCR power Google uses to digitize books and process PDFs for their search engine.
To get text from image or PDF files, you first need to upload the files and convert them to Google Docs format. You can then edit the result online and/or download it as PDF, DOC, TXT etc.
To upload files in Google Docs, click the Upload button, select Settings from the menu, check ‘Convert uploaded files to Google docs format’ and ‘Convert text from uploaded PDF and images files’, and then click Upload/Files. Alternatively, check ‘Confirm settings before each upload’ under Upload/Settings, so that each time you upload a file you are asked whether to convert it or leave it intact. This also lets you select which language dictionary is used for text recognition. The file is converted to a Google Docs document containing both the original image(s) and the converted text. You can review the text and delete the original images afterwards.
Google Docs conversion works quite well, especially with English texts. Over 30 languages can be selected, but if your language is not on the list, the conversion may fail with an error and the file will not be processed. If you don’t have a Google account, you can create one at any time.
  • Input image file types: most bitmap formats
  • Input PDF files: yes
  • Output file types: ODT, PDF, TXT, RTF, DOC, HTML
  • Languages: 30+
Google Docs / PROS:
  • Unlimited processing capacity
Google Docs / CONS:
  • Text in some minor languages may not be recognized

2. Free Online OCR

Free online OCR web page is more thoroughly reviewed in 
  • Input image file types: GIF, BMP, JPEG, TIFF, PNG
  • Input PDF files: yes
  • Output file types: DOC, PDF, RTF, TXT
  • Languages: English dictionary only
Free Online OCR / PROS:
  • No capacity limits for processing
  • Keeps original formatting and layout
Free Online OCR / CONS:
  • Only the English dictionary is supported. Text in other languages may not be recognized

3. i2OCR

  • Input image file types: TIF, JPEG, PNG, BMP, GIF, PBM, PGM, PPM
  • Input PDF files: no
  • Output file types: TXT
  • Languages: 30+
i2OCR / PROS:
  • No limits for uploading
  • Has a review option after character recognition: the original image and the resulting text are shown side by side on screen.
i2OCR / CONS:
  • Only text output; all original formatting is lost, though multi-column pages are at least handled correctly.
  • Creates “hard” linebreaks at the end of each line.
  • Does not process PDF files.
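
The “hard” linebreaks mentioned above can be removed after the fact. The following is a minimal Python sketch, assuming the OCR output separates paragraphs with blank lines and wraps every other line with a single newline. The function name unwrap_hard_linebreaks is made up for this illustration and is not part of i2OCR or any other tool reviewed here.

```python
import re


def unwrap_hard_linebreaks(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: paragraphs are separated by at least one blank line,
    and every other newline is a line wrap left by the OCR tool.
    """
    # Split into paragraphs on blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for paragraph in paragraphs:
        # Drop empty fragments and re-join the wrapped lines with spaces.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        unwrapped.append(" ".join(lines))
    return "\n\n".join(unwrapped)


sample = (
    "Online OCR software is available\n"
    "through the web browser.\n"
    "\n"
    "All you need is\n"
    "an image file.\n"
)
print(unwrap_hard_linebreaks(sample))
```

Words hyphenated across a line break are deliberately left alone here, because a trailing hyphen cannot be told apart from a genuine compound word without a dictionary.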

4. OCRonline

  • Input image file types: JPG, TIFF, PNG, GIF
  • Input PDF files: yes
  • Output file types: TXT, PDF, RTF, DOC
  • Languages: 150+
OCRonline / PROS:
  • Excellent recognition quality
  • Rebuilds original formatting
  • Impressive list of 150 language dictionaries
OCRonline / CONS:
  • Limited upload capacity: 5 pages per week, file size up to 10 MB. Extra pages must be paid for.

5. Online OCR

  • Input image file types: JPG, JPEG, BMP, TIFF, GIF
  • Input PDF files: only for registered users
  • Output file types: DOC, XLS, TXT (+ PDF for registered users)
  • Languages: 30+
Note: This site has a registered mode and a guest mode. In guest mode, 15 images per hour can be processed and the maximum file size is 4 MB. Registered mode adds some extras, such as uploading larger images, ZIP archives and multi-page PDFs. The initial credit after registering covers converting 20 pages.
Online OCR / PROS:
  • Supports some languages that other services do not support.
Online OCR / CONS:
  • Limited upload capacity. Extra capacity may be purchased or earned through a bonus program.

Our Recommendation: The last word on online OCR services

Of the online OCR solutions reviewed above, OCRonline provided good and stable OCR accuracy across a number of different fonts and texts. Unfortunately, the free service is limited to 5 pages per week. If you need more capacity, try the other providers, as they may also give good results depending on your source text.
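
To narrow down which of the five online services fits a particular job, the feature lists in Part 1 can be put into a small lookup. This is only a sketch in Python: the data is copied from the bullet points in this review, the names Service, SERVICES and candidates are invented for the example, and a size limit of None simply means that no file size limit is stated in the review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Service:
    name: str
    pdf_input: bool                # accepts PDF input without registering
    outputs: frozenset             # output file types listed in the review
    max_file_mb: Optional[float]   # None: no size limit stated in the review


# Values taken from the feature lists in Part 1 of this review.
SERVICES = [
    Service("Google Docs", True,
            frozenset({"ODT", "PDF", "TXT", "RTF", "DOC", "HTML"}), None),
    Service("Free Online OCR", True,
            frozenset({"DOC", "PDF", "RTF", "TXT"}), None),
    Service("i2OCR", False, frozenset({"TXT"}), None),
    Service("OCRonline", True,
            frozenset({"TXT", "PDF", "RTF", "DOC"}), 10),
    Service("Online OCR (guest mode)", False,
            frozenset({"DOC", "XLS", "TXT"}), 4),
]


def candidates(pdf_input: bool, output: str, size_mb: float) -> List[str]:
    """Return the names of the services that match the given requirements."""
    return [
        s.name
        for s in SERVICES
        if (s.pdf_input or not pdf_input)
        and output.upper() in s.outputs
        and (s.max_file_mb is None or size_mb <= s.max_file_mb)
    ]


# A 6 MB scanned PDF that should come back as RTF:
print(candidates(pdf_input=True, output="rtf", size_mb=6))
```

The lookup ignores language support and the per-week or per-hour page quotas, so it is a first filter rather than a recommendation.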

Part 2: Desktop OCR software

Desktop software has to be downloaded and installed on your computer, and it usually has more configurable options than online tools. Some programs can acquire images directly from a scanner, so you don’t need another program for that.
The following OCR software is reviewed: Cuneiform OpenOCR, FreeOCR, gImageReader, Puma.NET and SimpleOCR. There are a few more free tools available, which are mainly meant for more specific tasks. JOCR gets text from screenshots, requires Microsoft Office 2003 or later to be installed and has been previously reviewed here. There is also Nuance PDF Reader, which can upload scanned PDFs to its online service for character recognition; it has also been previously reviewed here. Finally, there is MyMorph, a program intended for converting document archive files from one format to another (TIFF, PDF, RTF etc.); it can convert image files to editable text files.

6. Cuneiform OpenOCR

OpenOCR is based on the commercial product Cuneiform, which was released as freeware in 2007.
  • License: freeware
  • Input image: most bitmap file formats
  • Input PDF: no
  • Scanner input: yes
  • Output: TXT, RTF, HTML + output to Word/Excel
  • Dictionary languages: 20+
Cuneiform OpenOCR / PROS:
  • Includes both single-file and batch processing modes.
Cuneiform OpenOCR / CONS:
  • The installer creates invalid Start menu shortcuts such as NewFolder1

7. FreeOCR

FreeOCR is one of the programs that use the open source Tesseract OCR engine. Tesseract was originally developed by HP and is currently sponsored by Google.
  • License: freeware
  • Requires: Microsoft .NET
  • Input image: TIFF, multi-page TIFF
  • Input PDF: yes
  • Scanner input: yes
  • Output: TXT
  • Dictionary languages: 9
FreeOCR / PROS:
  • Tesseract OCR engine has good accuracy.
FreeOCR / CONS:
  • Only text output, no formatting recognition
  • No multi-column support (must crop the image manually to one column)

8. gImageReader

gImageReader is one of the front-ends to the free Tesseract OCR engine. You need to download and install Tesseract separately from this page. The Tesseract engine uses OpenOffice dictionaries and spellcheckers that can be downloaded from here.
  • License: freeware (GNU)
  • Requires: Tesseract, need to download separately
  • Input PDF: yes
  • Dictionary languages: many, uses freely downloadable OpenOffice spellcheckers
  • Scanner input: yes
  • Input image: JPEG, GIF, PNG, TIFF
  • Output: TXT
gImageReader / PROS:
  • Tesseract OCR engine has good accuracy
  • OCR area(s) can be manually selected
gImageReader / CONS:
  • Only text output, no formatting recognition
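
Since Tesseract is installed separately, it can also be run without a front-end. The sketch below, in Python, assumes the tesseract executable is on the PATH and uses its basic command line form (image file, output base name, and -l for the language). The helper names build_tesseract_command and ocr_to_text are made up for this example and do not come from gImageReader or FreeOCR.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_to_text(image, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Assumption: Tesseract and the requested language data are installed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH; install it separately")
    image = Path(image)
    # Tesseract appends ".txt" to the output base name by itself.
    output_base = image.with_suffix("")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


# Shows the command that would be executed for a hypothetical file scan.tif:
print(" ".join(build_tesseract_command("scan.tif", "scan")))
```

As with the front-ends above, the result is plain text only; no formatting is reconstructed.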

9. Puma.NET

Puma.NET is not really an end-user solution but a development kit based on the CuneiForm OCR engine, though it includes a sample program with a front-end.
After installation there is no launch icon in the Start Menu, but you can find the program Puma.Net.Sample.exe deep in the C:\Program Files\Puma.NET\Sample\bin\x86\Debug\ folder.
  • License: freeware (BSD)
  • Requires: Microsoft .NET
  • Input image: BMP, GIF, EXIF, JPG, PNG and TIFF
  • Input PDF: no
  • Scanner input: no
  • Output: TXT, RTF, HTML
  • Dictionary languages: 27
Puma.NET / PROS:
  • Font and formatting detection
Puma.NET / CONS:
  • You have to create the shortcut to start the program yourself
  • Leaves “hard” linebreaks

10. SimpleOCR

SimpleOCR uses its own OCR engine that is capable of learning the fonts in a particular document.
  • License: free for all non-commercial purposes
  • Input image: TIFF, JPG, BMP
  • Input PDF: no
  • Scanner input: yes
  • Output: DOC, TXT
  • Dictionary languages: 3
Note: SimpleOCR seems to give better results from color JPEGs than from grayscale ones.
SimpleOCR / PROS:
  • Word-by-word text revision
  • Ability to train the engine to use specific fonts
  • Includes both single-file and batch processing modes
SimpleOCR / CONS:
  • Only 3 language dictionaries.
  • No font and format detection
