python - How to provide an image to Tesseract from memory
I'm using Tesseract to OCR millions of PDFs, and I'm trying to squeeze out as much performance as I can.
My current pipeline uses convert to turn each PDF into PNG files (one per page), and then runs Tesseract on each of those.
During profiling, I've discovered that a lot of time is spent writing files to disk and then reading them back again, so I'd like to move all of this into memory.
I've got the PDF to PNG conversion working in memory, but I need a way to pass an in-memory blob to Tesseract instead of giving it a path to a file. I haven't been able to find any documentation or examples of this.
You can use pytesseract. It's a Python wrapper for Google's Tesseract.
Usage:
    import pytesseract

    image = ...  # read the image into memory
    result = pytesseract.image_to_string(image, lang="eng")
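To make the "read the image into memory" step concrete, here is a minimal sketch that assumes the page is already available as encoded PNG bytes and that Pillow is installed. The helper names `image_from_bytes` and `ocr_bytes` are made up for illustration; only `Image.open` and `pytesseract.image_to_string` are real library calls. One caveat: pytesseract drives the tesseract executable behind the scenes and, as far as I know, still writes a temporary file internally, so this removes the disk round trip from your own code but may not eliminate it entirely.

```python
import io

from PIL import Image  # Pillow, assumed to be installed


def image_from_bytes(blob: bytes) -> Image.Image:
    """Decode an encoded image (PNG, JPEG, ...) held in memory into a PIL image."""
    image = Image.open(io.BytesIO(blob))
    image.load()  # decode the pixel data now instead of lazily
    return image


def ocr_bytes(blob: bytes, lang: str = "eng") -> str:
    """Run Tesseract on an in-memory image blob and return the recognized text."""
    # Imported here so the decoding helper above works without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image_from_bytes(blob), lang=lang)
```

Calling `ocr_bytes(png_blob)` with the bytes produced by your in-memory PDF to PNG step should then return the page text as a string, provided the tesseract binary and the "eng" language data are installed.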