Tesseract OCR implementation in .NET Core & Spring Boot

Originally Posted On: https://medium.com/turkcell/tesseract-ocr-implementation-in-net-core-spring-boot-6f876a5d4ae5

My Purpose

This article shows how to implement Tesseract OCR with .NET Core and with Spring Boot. Both projects are proofs of concept, written without any high-level architecture or design patterns, so that they show the core Tesseract OCR integration as directly as possible. For that reason I chose two languages that are common in enterprise software, C# (.NET Core) and Java, and exposed both implementations as REST APIs. That is enough introduction. Let’s begin.

What is Tesseract OCR (Optical Character Recognition)?

Tesseract is an open-source OCR engine. Google began sponsoring its development in 2006.

Basically, this technology recognises text inside images such as scanned documents, photos, screenshots and PDFs. OCR converts virtually any image containing scanned, written or photographed text into machine-readable text data.

History

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP.

Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages “out of the box”.

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.

( Reference: https://github.com/tesseract-ocr/tesseract#brief-history )
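
To make the TSV output format more concrete: each row Tesseract writes carries twelve tab-separated columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The sketch below is a minimal Java parser for such rows. The class name `TsvWord` is hypothetical and not part of any library, and it assumes well-formed rows; treat it as an illustration rather than a complete reader.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of Tesseract TSV output (hypothetical helper, not part of any library). */
final class TsvWord {
    final int level, left, top, width, height;
    final float conf;
    final String text;

    private TsvWord(int level, int left, int top, int width, int height, float conf, String text) {
        this.level = level;
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
        this.conf = conf;
        this.text = text;
    }

    /** Parses one data row; returns null for the header row or a malformed row. */
    static TsvWord parse(String line) {
        // Limit -1 keeps trailing empty fields: block and line rows have an empty text column.
        String[] f = line.split("\t", -1);
        if (f.length < 12 || f[0].equals("level")) {
            return null;
        }
        try {
            return new TsvWord(Integer.parseInt(f[0]), Integer.parseInt(f[6]), Integer.parseInt(f[7]),
                    Integer.parseInt(f[8]), Integer.parseInt(f[9]), Float.parseFloat(f[10]), f[11]);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Keeps only word-level rows (level 5) whose text is not blank. */
    static List<TsvWord> words(String tsv) {
        List<TsvWord> out = new ArrayList<>();
        for (String line : tsv.split("\\r?\\n")) {
            TsvWord w = parse(line);
            if (w != null && w.level == 5 && !w.text.trim().isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }
}
```

Calling `TsvWord.words(tsvString)` on the engine's TSV output would give one entry per recognised word, with its bounding box and confidence, which is handy when the plain-text result is not enough.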

.NET CORE IMPLEMENTATION

Dependencies

System.Reflection.Emit – Version=4.6.0

Tesseract – Version=3.3.0

Tesseract OCR implementation code block in .NET Core

using System;
using System.IO;
using Microsoft.AspNetCore.Mvc;
using Tesseract;

namespace dotnet_ocr_tesseract.Controllers
{
    [Route("api/[controller]")]
    [ApiController]
    public class OcrController : ControllerBase
    {
        // Folder where uploaded images are stored before recognition.
        public const string folderName = "images/";
        // Folder that contains the *.traineddata language files.
        public const string trainedDataFolderName = "tessdata";

        [HttpPost]
        public String DoOCR([FromForm] OcrModel request)
        {
            var image = request.Image;

            // Nothing to recognise if the upload is missing or empty.
            if (image == null || image.Length == 0)
            {
                return "Ocr is finished. Return empty";
            }

            // Keep only the file name so a crafted name cannot escape the images folder.
            string imagePath = Path.Combine(folderName, Path.GetFileName(image.FileName));
            Directory.CreateDirectory(folderName);

            using (var fileStream = new FileStream(imagePath, FileMode.Create))
            {
                image.CopyTo(fileStream);
            }

            string result = "";

            // DestinationLanguage is the Tesseract language code, for example "eng".
            using (var engine = new TesseractEngine(trainedDataFolderName, request.DestinationLanguage, EngineMode.Default))
            using (var img = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(img))
            {
                result = page.GetText();
                Console.WriteLine(result);
            }

            return String.IsNullOrWhiteSpace(result) ? "Ocr is finished. Return empty" : result;
        }
    }
}

Repository link: https://github.com/fatihyildizli/dotnetcore-tesseract-ocr

Input Image:

Result:

Since you’re already working in .NET Core, IronOCR offers a cleaner integration path. It wraps Tesseract 5 with a native .NET API, so there’s no need to configure external binaries or manage PATH variables.

using IronOcr;

[HttpPost("ocr")]
public IActionResult ExtractText(IFormFile image)
{
    var ocr = new IronTesseract();
    using var input = new OcrInput();
    input.LoadImage(image.OpenReadStream());
    var result = ocr.Read(input);
    return Ok(new { text = result.Text });
}

For REST API scenarios, this removes the setup complexity while keeping the same OCR engine under the hood. It also includes built-in PDF support for scanned documents.

SPRING BOOT IMPLEMENTATION

Dependencies

net.sourceforge.tess4j – Version=3.4.0 (pom.xml)

Java – Version=1.8

Tesseract OCR implementation code block in Spring Boot

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import javax.imageio.ImageIO;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import com.ocr.model.OcrModel;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

@RestController
public class OcrController {

	@PostMapping("/api/ocr")
	public String DoOCR(@RequestParam("DestinationLanguage") String destinationLanguage,
			@RequestParam("Image") MultipartFile image) throws IOException {

		OcrModel request = new OcrModel();
		request.setDestinationLanguage(destinationLanguage);
		request.setImage(image);

		ITesseract instance = new Tesseract();

		try {
			// ImageIO.read returns null when no registered reader can decode the file.
			BufferedImage in = ImageIO.read(convert(image));
			if (in == null) {
				return "Error while reading image";
			}

			// Redraw the upload onto an ARGB canvas so the engine always gets the same pixel format.
			BufferedImage newImage = new BufferedImage(in.getWidth(), in.getHeight(), BufferedImage.TYPE_INT_ARGB);
			Graphics2D g = newImage.createGraphics();
			g.drawImage(in, 0, 0, null);
			g.dispose();

			// Language code (for example "eng") and the folder that holds the *.traineddata files.
			instance.setLanguage(request.getDestinationLanguage());
			instance.setDatapath("..//tessdata");

			return instance.doOCR(newImage);

		} catch (TesseractException | IOException e) {
			System.err.println(e.getMessage());
			return "Error while reading image";
		}
	}

	// Writes the upload to a local file so ImageIO can read it.
	public static File convert(MultipartFile file) throws IOException {
		// Keep only the last path segment so a crafted name cannot point outside the working directory.
		File convFile = new File(new File(file.getOriginalFilename()).getName());
		// try-with-resources closes the stream even if the write fails.
		try (FileOutputStream fos = new FileOutputStream(convFile)) {
			fos.write(file.getBytes());
		}
		return convFile;
	}
}

Repository link : https://github.com/fatihyildizli/springboot-tesseract-ocr
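
Both controllers pass the client's `DestinationLanguage` value straight to the engine, and Tesseract can only load a language whose `<code>.traineddata` file is present in the data folder. A small pre-check, sketched below, would let the API answer with a clear message instead of an engine error. The class `TrainedData` and its method are hypothetical names. The `+` separator for combining languages (for example `eng+tur`) follows Tesseract's language syntax, and the sketch assumes plain codes only, so data files kept in subfolders are out of scope.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that checks language codes against a tessdata folder. */
final class TrainedData {

    /** Returns the requested language codes that have no .traineddata file in tessdataDir. */
    static List<String> missing(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String code : languages.split("\\+")) {
            String trimmed = code.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Accept only plain codes such as "eng" or "chi_sim" so the value cannot act as a path.
            boolean plain = trimmed.matches("[A-Za-z0-9_]+");
            if (!plain || !Files.isRegularFile(tessdataDir.resolve(trimmed + ".traineddata"))) {
                missing.add(trimmed);
            }
        }
        return missing;
    }
}
```

In the controller, a non-empty result from `TrainedData.missing(Paths.get("..//tessdata"), destinationLanguage)` could be turned into an error response before `doOCR` is ever called.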

Input Image:

Result:

Some Alternatives to Tesseract OCR

Google Cloud Vision

Try it! | Cloud Vision API Documentation | Google Cloud

Use the application below to return image annotations for your image file. Click the Show JSON button to view the raw…

cloud.google.com

IronOcr

How To Read Text with OCR in C# and VB.Net | Iron OCR

How To Read & PDFs and Scanned Images (OCR) in C# and VB.Net IronOCR is a C# software library allowing .NET platform…

ironsoftware.com

References — Additional Resources

Postman collections link :

https://www.getpostman.com/collections/71606f547b77a9b79ed6
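
For readers without Postman, the request that both APIs expect can also be built by hand. The sketch below assembles a `multipart/form-data` body in Java with the two form fields the controllers read, `DestinationLanguage` and `Image`. The class name `OcrRequestBody`, the boundary string and the file name are hypothetical choices made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the multipart body for the OCR endpoints. */
final class OcrRequestBody {
    private static final String CRLF = "\r\n";

    /** Builds a multipart/form-data body with the language field and the image file. */
    static byte[] build(String boundary, String language, String fileName, byte[] imageBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Text part: the Tesseract language code, for example "eng".
        write(out, "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"DestinationLanguage\"" + CRLF + CRLF
                + language + CRLF);
        // File part: part headers first, then the raw image bytes.
        write(out, "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"Image\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: application/octet-stream" + CRLF + CRLF);
        out.write(imageBytes, 0, imageBytes.length);
        // The closing delimiter ends the body.
        write(out, CRLF + "--" + boundary + "--" + CRLF);
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.write(b, 0, b.length);
    }
}
```

The resulting bytes would be sent in a `POST` with the header `Content-Type: multipart/form-data; boundary=<boundary>`. For the Spring Boot project the route is `/api/ocr`; a URL such as `http://localhost:8080/api/ocr` is an assumption about a default local run, not something the article states.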

Github Source Code Repositories:

.net core (POC) — .net:

fatihyildizli/dotnetcore-tesseract-ocr

Brief: This project (POC) consists of how to implement Tesseract OCR engine in dotnetcore. API Endpoint…

github.com

Spring boot (POC) — Java:

fatihyildizli/springboot-tesseract-ocr

Brief: This project (POC) consists of how to implement Tesseract OCR engine in Spring boot. API Endpoint…

github.com

Official Resources:

tesseract-ocr/tesseract

This package contains an OCR engine – libtesseract and a command line program – tesseract. Tesseract 4 adds a new…

github.com

tesseract-ocr/tessdata

github.com

I hope you enjoyed this article.

Thank you for reading!