Originally Posted On: https://medium.com/the-constellar-digital-technology-blog/from-pixels-to-privacy-using-ocr-microsoft-presidio-to-redact-sensitive-data-in-id-cards-e46df9fce26b
Introduction
Today, I want to share a little journey I took while exploring Microsoft Presidio and its power in detecting and redacting Personally Identifiable Information (PII).
It started from a simple curiosity:
“What if I could automatically detect and censor sensitive information like Indonesian ID numbers (KTP NIK) or Singapore NRIC numbers from text and images?”
On paper, Presidio looked perfect. It can detect and anonymize text at scale. But when combined with OCR to process real-world ID images, I quickly learned that theory and practice are two very different things.
This article is not just about code — it’s about the lessons, the pitfalls, and the “aha moments” of combining OCR + Presidio for document privacy.
Disclaimer
> The ID card samples (KTP, NRIC) shown in this article are not real documents.
> They were sourced from Google for learning and demonstration purposes only.
> No personal data has been used in this exploration.
How OCR and Microsoft Presidio Work Together
Let’s break this down.
What is OCR?
Optical Character Recognition (OCR) is the process of converting images of text into machine-readable text.
I used Tesseract OCR, a popular open-source library. Here’s how it works in simple terms:
- Image preprocessing — enhancing contrast, removing noise, and straightening skewed text.
- Character recognition — scanning the image pixel-by-pixel to detect shapes of letters and numbers.
- Text output — exporting the recognized text (with optional bounding boxes showing where each word is).
OCR is powerful but sensitive to image quality. If the photo is blurry, tilted, or has shadows, results can become messy (e.g., “Identity Card” turning into “iDentiry carp”).
Where Presidio Fits
Presidio doesn’t look at pixels — it only analyzes text. That means the pipeline looks like this:
ID Image (KTP/NRIC) → Tesseract OCR → Extracted Text → Presidio Analyzer → Presidio Anonymizer → Redacted Text or Image
- Tesseract OCR → Extracts the words.
- Presidio Analyzer → Finds the sensitive data (e.g., NIK, NRIC, credit cards).
- Presidio Anonymizer → Masks, replaces, or redacts the sensitive parts.
- (Optional) Map back to bounding boxes for visual redaction on the original image.
In short: Tesseract gives Presidio eyes, Presidio gives Tesseract brains.
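The pipeline shape can be sketched in plain Python. To keep it self-contained, the OCR step is stubbed out and the analyzer/anonymizer are stand-ins (the `fake_ocr` function and the 16-digit NIK regex are my own illustrative assumptions, not Presidio's API — the real calls appear in the appendix):

```python
import re

# Stub standing in for Tesseract: a real pipeline would call
# pytesseract.image_to_string(image) here.
def fake_ocr(image_path: str) -> str:
    return "NIK 3171234567890123, Name: Mira Setiawan"

# Stand-in for the Presidio Analyzer: return (start, end) spans of PII.
def analyze(text: str) -> list:
    return [m.span() for m in re.finditer(r"\b\d{16}\b", text)]

# Stand-in for the Presidio Anonymizer: replace each span with a tag.
# Spans are applied right-to-left so earlier offsets stay valid.
def anonymize(text: str, spans: list) -> str:
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "<REDACTED>" + text[end:]
    return text

text = fake_ocr("ktp_sample.jpg")
print(anonymize(text, analyze(text)))  # NIK <REDACTED>, Name: Mira Setiawan
```

Each stage only needs the previous stage's output, which is why the OCR engine and the PII engine can be swapped independently.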

The Experiments: Text vs Images
Note: The following examples use sample ID Card images only.
Experiment 1: Pure Text
Starting with text was smooth. Presidio detected credit cards, phone numbers, and even custom regex like Indonesian NIK.
Before:
My name is Agus Setiawan, NIK 3171234567890123, and my credit card is 4111111111111111.
After (masked):
My name is , NIK ****************123, and my credit card is ************1111.
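The digit masking above can be reproduced in a few lines of plain Python. This keep-last-N helper is my own sketch, not Presidio's built-in `mask` operator, but it shows the idea:

```python
import re

def mask_digits(text: str, keep_last: int = 4, min_len: int = 13) -> str:
    """Mask long digit runs (credit cards, NIKs), keeping the last few digits."""
    def repl(m):
        digits = m.group()
        return "*" * (len(digits) - keep_last) + digits[-keep_last:]
    # Only digit runs of min_len+ characters count as sensitive here
    return re.sub(rf"\b\d{{{min_len},}}\b", repl, text)

print(mask_digits("my credit card is 4111111111111111"))
# my credit card is ************1111
```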
Experiment 2: KTP Images (Indonesia)
With KTP images, the workflow was:
- Extract text using Tesseract.
- Run Presidio Analyzer on the OCR output.
- Map detected PII to bounding boxes for redaction.
This worked surprisingly well. The NIK number was detected and redacted directly on the image.
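Step 3 — mapping Presidio's character offsets back to Tesseract's word boxes — is the subtle part. Here is a stdlib sketch of that mapping, using a hand-made stand-in for `pytesseract.image_to_data` output (the words and coordinates below are invented for illustration):

```python
import re

# Hand-made stand-in for pytesseract.image_to_data(..., output_type=Output.DICT)
ocr_data = {
    "text":   ["NIK", "3171234567890123", "Nama", "AGUS"],
    "left":   [10, 60, 10, 60],
    "top":    [20, 20, 50, 50],
    "width":  [40, 160, 45, 50],
    "height": [12, 12, 12, 12],
}

# Rebuild the flat text while remembering each word's character span
spans, pos = [], 0
for word in ocr_data["text"]:
    spans.append((pos, pos + len(word)))
    pos += len(word) + 1  # +1 for the joining space
text = " ".join(ocr_data["text"])

# Pretend the analyzer flagged this span (here: the 16-digit NIK)
match = re.search(r"\b\d{16}\b", text)
boxes = [
    (ocr_data["left"][i], ocr_data["top"][i],
     ocr_data["width"][i], ocr_data["height"][i])
    for i, (s, e) in enumerate(spans)
    if s < match.end() and e > match.start()  # word overlaps the entity
]
print(boxes)  # [(60, 20, 160, 12)]
```

Every word whose span overlaps the detected entity gets its box blacked out on the original image.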
Before (KTP sample):

After (KTP redacted):


This was my first “wow moment” — seeing the system automatically censor a sensitive field on a real ID.
Experiment 3: NRIC Images (Singapore)
When I moved from experimenting with Indonesian KTPs to Singapore’s NRIC, I thought it would be just a copy-paste challenge. Change the regex, feed in the sample image, and done, right?
But reality hit different.
OCR didn’t always read correctly:
- “Identity Card No.” became “iDentiry carp No.”
- NRIC like S1509088H sometimes turned into $1509088H.
At that moment, I realized: this wasn’t going to be a smooth ride. The format looked simple — one letter, seven digits, one letter. But the OCR saw $ instead of S, or swapped a 5 for an S. Suddenly, Presidio had no clue there was even an NRIC number there.
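One mitigation I tried is to let the regex itself tolerate the common misreads. This is my own sketch (assuming the standard NRIC shape of one letter, seven digits, one letter): the leading character class accepts $ and 5 as stand-ins for S, and a normalization step maps the match back to a canonical form:

```python
import re

# Leading class accepts the common OCR confusions for S ($ and 5).
# A lookbehind replaces \b because $ is not a word character.
NRIC_NOISY = re.compile(r"(?<![A-Za-z0-9])[STFG$5]\d{7}[A-Z]\b")

def find_nric(text: str) -> list:
    """Find NRIC-shaped tokens, normalizing noisy leading characters."""
    fixes = str.maketrans({"$": "S", "5": "S"})
    return [m.group()[0].translate(fixes) + m.group()[1:]
            for m in NRIC_NOISY.finditer(text)]

print(find_nric("Identity Card No. $1509088H"))  # ['S1509088H']
```

The trade-off is real: widening the pattern raises recall on noisy scans but also raises the false-positive rate, so a score penalty or human review belongs downstream.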
If you’re working in the .NET ecosystem and want to avoid the OCR accuracy headaches I ran into, IronOCR is worth checking out. It wraps Tesseract with built-in preprocessing filters — so things like deskewing, binarization, and noise removal don’t require separate manual steps.
using IronOcr;
var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadImage("nric_card.png");
// Built-in filters to improve accuracy on messy scans
input.Deskew();
input.DeNoise();
input.Binarize();
OcrResult result = ocr.Read(input);
Console.WriteLine(result.Text);
The preprocessing pipeline handles a lot of the image quality issues that caused my $ vs S misreads. It’s not a drop-in for Python workflows, but for .NET projects dealing with ID cards or documents, the built-in filters make the “OCR is fragile” problem a lot more manageable.

Because of that, Presidio often failed to detect the PII. Even when it did, mapping the detection back to bounding boxes sometimes failed because the OCR coordinates were off.


Lessons Learned from NRIC
After testing NRIC images, it became clear that the biggest challenge was the image quality itself.
- Sharp, flat, well-lit images → detection works smoothly.
- Blurry, angled, or low-light photos → OCR misreads letters and numbers.
- Regex needs to tolerate noise → small errors can break detection.
Even with advanced tools like Presidio, redacting sensitive data from images requires careful attention to how the images are captured and processed.
Key Insights
- Presidio + OCR is powerful for structured IDs. (KTP, NRIC, credit cards, passports)
- But OCR is fragile. Low-quality images make Presidio fail downstream.
- Regex flexibility is essential. Real OCR outputs are messy, so regex must handle variations.
- Image preprocessing helps. Deskewing, binarization, noise removal all improve accuracy.
- Real-world use needs fallback. Automation first, human review second (especially in compliance-heavy industries).
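To make the preprocessing point concrete, here is a toy illustration of binarization (thresholding) on a synthetic grayscale grid. In practice you would use `cv2.threshold` or IronOCR's `Binarize()`; this just shows the operation those filters perform:

```python
# Toy 4x4 grayscale "image" (0 = black, 255 = white)
image = [
    [30, 200, 210, 40],
    [35, 220, 215, 45],
    [25, 190, 205, 50],
    [20, 210, 220, 30],
]

def binarize(img, threshold=128):
    """Push every pixel to pure black or pure white, giving OCR
    crisp character edges instead of smudged grays."""
    return [[255 if px >= threshold else 0 for px in row] for row in img]

print(binarize(image)[0])  # [0, 255, 255, 0]
```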
Practical Use Cases
- Finance: Automatically censor credit cards in receipts.
- Healthcare: Remove patient IDs from scanned forms.
- Government: Secure ID documents during verification workflows.
- Customer Support: Redact PII before tickets are shared internally.
Conclusion: Reflections on This Learning
What I learned is that privacy automation is not about being flawless — it’s about scaling protection.
Presidio shines when text is clean and structured. When combined with Tesseract OCR, it unlocks new possibilities: automatically sanitizing ID images, securing workflows, and reducing human error.
But it also teaches a deeper lesson:
- Machines are great at speed and consistency.
- Humans are still needed for judgment and edge cases.
In other words, automation doesn’t replace responsibility — it amplifies it.
And maybe that’s the true balance: letting machines handle the routine, while humans stay in charge of trust.
Appendix: Code Snippets
Text Analysis with Presidio
import cv2
import pytesseract
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine

# OCR the ID card image into plain text
text = pytesseract.image_to_string(cv2.imread("ktp_sample.jpg"))

# Register a custom recognizer for the 16-digit Indonesian NIK
# (patterns must be passed by keyword; the second positional
# argument of PatternRecognizer is the recognizer name, not patterns)
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(
    PatternRecognizer(
        supported_entity="INDONESIAN_NIK",
        patterns=[Pattern("NIK", r"\b\d{16}\b", 0.9)],
    )
)
results = analyzer.analyze(text=text, entities=["INDONESIAN_NIK"], language="en")
anonymized = AnonymizerEngine().anonymize(text=text, analyzer_results=results)
print(anonymized.text)
OCR Extraction with Tesseract
import cv2
import pytesseract
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
from pytesseract import Output

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image = cv2.imread("ktp_sample.jpeg")
# image_to_data returns each word together with its bounding box
ocr_data = pytesseract.image_to_data(image, output_type=Output.DICT)
text = " ".join(ocr_data["text"])

analyzer = AnalyzerEngine()
nik_pattern = Pattern(name="NIK Pattern", regex=r"\b\d{16}\b", score=0.9)
analyzer.registry.add_recognizer(
    PatternRecognizer(supported_entity="NIK_ID", patterns=[nik_pattern])
)
results = analyzer.analyze(text=text, entities=["NIK_ID"], language="en")

# Black out every OCR word that overlaps a detected entity
for i in range(len(ocr_data["text"])):
    word = ocr_data["text"][i]
    if word.strip() and any(text[r.start:r.end] in word for r in results):
        x, y = ocr_data["left"][i], ocr_data["top"][i]
        w, h = ocr_data["width"][i], ocr_data["height"][i]
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 0), -1)
cv2.imwrite("ktp_redacted.jpeg", image)
OCR + Presidio Pipeline
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# OCR result from Tesseract
extracted_text = "NIK 3171234567890123, Name: Mira Setiawan"

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

results = analyzer.analyze(text=extracted_text, language="en")
redacted = anonymizer.anonymize(text=extracted_text, analyzer_results=results)

print("=== Redacted ===")
print(redacted.text)
Visual Redaction with Bounding Boxes
import cv2
import pytesseract
image = cv2.imread("ktp_sample.jpg")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data["text"]):
    if word.isdigit() and len(word) == 16:  # NIK is a 16-digit number
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 0), -1)
cv2.imwrite("ktp_redacted.jpg", image)