Originally Posted On: https://medium.com/turkcell/tesseract-ocr-implementation-in-net-core-spring-boot-6f876a5d4ae5
My Purpose:
This article shows how to implement Tesseract OCR with .NET Core and with Spring Boot. Both projects are proofs of concept, written without any high-level architecture or design patterns, so that they can explain the core Tesseract OCR integration quickly. For that reason I chose two widely used enterprise platforms, .NET Core and Java, and exposed both implementations as REST APIs. That is enough introduction. Let's begin.
What is Tesseract OCR (Optical Character Recognition)?
Tesseract is an open-source OCR engine. Google took over its development in 2006.
Basically, this technology recognises text inside images such as scanned documents, photos, screenshots, and PDFs. OCR is used to convert virtually any kind of image containing scanned, written, or photographed text into machine-readable text data.

History
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP.
Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages “out of the box”.
Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
( Reference: https://github.com/tesseract-ocr/tesseract#brief-history )
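Of the output formats listed above, TSV is the easiest one to post-process in code, because each row carries a word together with its bounding box and a confidence value. The following is a minimal Java sketch, not something taken from the two POC projects: the class `TsvWords`, its `Word` type, and the `minConfidence` parameter are hypothetical names made up for illustration. It assumes you already have well-formed TSV text in hand (for example from running the command line program as `tesseract scan.png out tsv`) and keeps only word-level rows at or above a confidence threshold.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal reader for Tesseract TSV output (hypothetical helper, not part of Tess4J). */
final class TsvWords {

    /** One recognised word with its confidence and bounding box. */
    static final class Word {
        final String text;
        final float confidence;
        final int left;
        final int top;
        final int width;
        final int height;

        Word(String text, float confidence, int left, int top, int width, int height) {
            this.text = text;
            this.confidence = confidence;
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Returns the word-level rows whose confidence is at least minConfidence.
     * Expected columns: level, page_num, block_num, par_num, line_num, word_num,
     * left, top, width, height, conf, text.
     */
    static List<Word> parse(String tsv, float minConfidence) {
        List<Word> words = new ArrayList<>();
        for (String line : tsv.split("\\R")) {
            // Limit of 12 keeps an empty trailing text column instead of dropping it
            String[] cols = line.split("\t", 12);
            if (cols.length < 12 || cols[0].equals("level")) {
                continue; // header row, blank line, or short row
            }
            if (!cols[0].equals("5")) {
                continue; // levels 1 to 4 describe pages, blocks, paragraphs, and lines
            }
            float confidence = Float.parseFloat(cols[10]);
            String text = cols[11].trim();
            if (text.isEmpty() || confidence < minConfidence) {
                continue;
            }
            words.add(new Word(text, confidence,
                    Integer.parseInt(cols[6]), Integer.parseInt(cols[7]),
                    Integer.parseInt(cols[8]), Integer.parseInt(cols[9])));
        }
        return words;
    }
}
```

Calling `TsvWords.parse(tsv, 60f)` would drop words the engine was unsure about, which is a cheap way to filter noise before returning text from a REST endpoint. The threshold of 60 is only an example value.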
.NET CORE IMPLEMENTATION
Dependencies
System.Reflection.Emit, Version 4.6.0
Tesseract, Version 3.3.0
Tesseract OCR implementation code block in .NET Core
using System;
using System.IO;
using Microsoft.AspNetCore.Mvc;
using Tesseract;

namespace dotnet_ocr_tesseract.Controllers
{
    [Route("api/[controller]")]
    [ApiController]
    public class OcrController : ControllerBase
    {
        public const string folderName = "images/";
        public const string trainedDataFolderName = "tessdata";

        [HttpPost]
        public String DoOCR([FromForm] OcrModel request)
        {
            var image = request.Image;
            if (image == null || image.Length == 0)
            {
                // Nothing was uploaded, so there is no file to run OCR on
                return "No image was uploaded";
            }

            // Keep only the file name so a crafted upload name cannot leave the images folder
            string imagePath = Path.Combine(folderName, Path.GetFileName(image.FileName));
            using (var fileStream = new FileStream(imagePath, FileMode.Create))
            {
                image.CopyTo(fileStream);
            }

            string result = "";
            // The engine loads <language>.traineddata from the tessdata folder
            using (var engine = new TesseractEngine(trainedDataFolderName, request.DestinationLanguage, EngineMode.Default))
            using (var img = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(img))
            {
                result = page.GetText();
                Console.WriteLine(result);
            }
            return String.IsNullOrWhiteSpace(result) ? "Ocr is finished. Return empty" : result;
        }
    }
}
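The controller above saves the upload using the file name sent by the client, so that name deserves some care: a value like `../../etc/passwd` must never be able to leave the images folder. Here is a minimal sketch of that idea in Java. `UploadNames`, `safeFileName`, and `resolveInside` are hypothetical helpers, not part of either POC project, and the fallback name `upload` is an arbitrary choice.

```java
import java.nio.file.Path;

/** Hypothetical helper that turns a client-supplied file name into a safe local path. */
final class UploadNames {

    /** Strips directory parts and unusual characters from an uploaded file name. */
    static String safeFileName(String original) {
        if (original == null) {
            return "upload";
        }
        // Treat both separator styles the same, whatever OS the server runs on
        String name = original.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        // Replace everything except letters, digits, dot, underscore, and hyphen
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return "upload";
        }
        return name;
    }

    /** Resolves the sanitised name against baseDir and rejects anything that ends up outside it. */
    static Path resolveInside(Path baseDir, String original) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(safeFileName(original)).normalize();
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("Upload path escapes " + base);
        }
        return target;
    }
}
```

The second check in `resolveInside` is redundant once the name has been sanitised, but it is cheap and guards against a later change to `safeFileName`.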
Input Image:

Result:

Since you’re already working in .NET Core, IronOCR offers a cleaner integration path. It wraps Tesseract 5 with a native .NET API, so there’s no need to configure external binaries or manage PATH variables.
using IronOcr;

[HttpPost("ocr")]
public IActionResult ExtractText(IFormFile image)
{
    var ocr = new IronTesseract();
    using var input = new OcrInput();
    input.LoadImage(image.OpenReadStream());
    var result = ocr.Read(input);
    return Ok(new { text = result.Text });
}
For REST API scenarios, this removes the setup complexity while keeping the same OCR engine under the hood. It also includes built-in PDF support for scanned documents.
SPRING BOOT IMPLEMENTATION
Dependencies
net.sourceforge.tess4j, Version 3.4.0 (pom.xml)
Java, Version 1.8
Tesseract OCR implementation code block in Spring Boot
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import com.ocr.model.OcrModel;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

@RestController
public class OcrController {

    @PostMapping("/api/ocr")
    public String DoOCR(@RequestParam("DestinationLanguage") String destinationLanguage,
            @RequestParam("Image") MultipartFile image) throws IOException {
        OcrModel request = new OcrModel();
        request.setDestinationLanguage(destinationLanguage);
        request.setImage(image);

        ITesseract instance = new Tesseract();
        File tempFile = null;
        try {
            tempFile = convert(image);
            BufferedImage in = ImageIO.read(tempFile);
            if (in == null) {
                // ImageIO.read returns null when no registered reader can decode the file
                return "Error while reading image";
            }
            // Redraw the upload into an ARGB image before handing it to Tesseract
            BufferedImage newImage = new BufferedImage(in.getWidth(), in.getHeight(), BufferedImage.TYPE_INT_ARGB);
            Graphics2D g = newImage.createGraphics();
            g.drawImage(in, 0, 0, null);
            g.dispose();

            instance.setLanguage(request.getDestinationLanguage());
            instance.setDatapath("..//tessdata");
            return instance.doOCR(newImage);
        } catch (TesseractException | IOException e) {
            System.err.println(e.getMessage());
            return "Error while reading image";
        } finally {
            // Remove the temporary copy whether or not OCR succeeded
            if (tempFile != null) {
                tempFile.delete();
            }
        }
    }

    // Copies the upload to a temporary file instead of trusting the client-supplied file name
    public static File convert(MultipartFile file) throws IOException {
        File convFile = File.createTempFile("ocr-upload-", ".img");
        try (FileOutputStream fos = new FileOutputStream(convFile)) {
            fos.write(file.getBytes());
        }
        return convFile;
    }
}
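The controller above hands DestinationLanguage straight to Tesseract, which then looks in the tessdata folder for a matching `<code>.traineddata` file (for example `eng.traineddata`). Several codes can be combined with `+`, as in `eng+tur`. Checking this before calling the engine gives the caller a clearer error than a failed OCR run. A minimal sketch, assuming the trained data files sit directly in one folder: `OcrLanguages` and `isAvailable` are hypothetical names, and the check deliberately rejects anything other than letters, digits, and underscores, so models stored in subfolders are not covered.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

/** Hypothetical pre-check for the language parameter of an OCR request. */
final class OcrLanguages {

    // Codes such as "eng", "tur" or "chi_sim"; no path separators or dots allowed
    private static final Pattern CODE = Pattern.compile("[A-Za-z0-9_]+");

    /**
     * Returns true only if every '+'-separated code is well formed and has a
     * matching <code>.traineddata file directly inside tessdataDir.
     */
    static boolean isAvailable(String languages, Path tessdataDir) {
        if (languages == null || languages.isEmpty()) {
            return false;
        }
        // Limit of -1 keeps empty trailing parts, so "eng+" is rejected
        for (String code : languages.split("\\+", -1)) {
            if (!CODE.matcher(code).matches()) {
                return false;
            }
            if (!Files.isRegularFile(tessdataDir.resolve(code + ".traineddata"))) {
                return false;
            }
        }
        return true;
    }
}
```

In a controller this could run before the engine is created, returning an HTTP 400 style message when it yields false, rather than the generic "Error while reading image".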
Input Image:

Result:

Some Alternatives to Tesseract OCR
- Google Cloud Vision
Try it! | Cloud Vision API Documentation | Google Cloud
Use the application below to return image annotations for your image file. Click the Show JSON button to view the raw…
- IronOcr
How To Read Text with OCR in C# and VB.Net | Iron OCR
How To Read & PDFs and Scanned Images (OCR) in C# and VB.Net IronOCR is a C# software library allowing .NET platform…
References — Additional Resources
Postman collections link :

Github Source Code Repositories:
.net core (POC) — .net:
fatihyildizli/dotnetcore-tesseract-ocr
Brief: This project (POC) consists of how to implement Tesseract OCR engine in dotnetcore. API Endpoint…
Spring boot (POC) — Java:
fatihyildizli/springboot-tesseract-ocr
Brief: This project (POC) consists of how to implement Tesseract OCR engine in Spring boot. API Endpoint…
Official Resources:
tesseract-ocr/tesseract
This package contains an OCR engine – libtesseract and a command line program – tesseract. Tesseract 4 adds a new…
tesseract-ocr/tessdata
I hope you enjoyed it!
Thank you for reading. Please press the clap button for me.