How Do I Create a Searchable Archive of PDFs?

In this week’s tech-advice column at Lifehacker—keep your questions coming, folks!—we’re helping out a reader who has way too many important papers that need to make a magical transition to the digital realm. At least, that sounds a lot more exciting than “Optical Character Recognition,” which doesn’t really roll off the tongue.

Lifehacker reader Phil writes:

“David, your columns are informative, helpful and well written. Thank you.

I’m copying (digitizing by scanning) a non-profit’s meeting minutes. The organization meets twice a month. Each set of minutes representing a meeting, usually just one or two typewritten pages, comprises a file. I have about 20 years’ worth of minutes to do. I will merge all the files of one year’s meetings into one “year” file, then merge 10 of those files into another, separate “decade” file. Too much information, I know.

Anyway, I would like for these files to be searchable.

Could you recommend a free or low-cost decent desktop OCR program or any other method that would make the minutes searchable? I’d very much like to retain the formatting of the original minutes. The minutes never contain photos. I’m using Windows 10.”

Thanks for the kind words, Phil! I’m happy to help—not because of the flattery, but because your question is one that a lot of readers have probably thought about (myself included). I have a whole stack of things that I’d love to move from the physical world to the digital world, so I can then Marie Kondo the original documents and photos into oblivion. Stacks of paper do not bring me joy.

You have a few options you can try. I’d start with an obvious one: Google. Assuming you’re creating PDFs, upload your file(s) to Google Drive. Right-click on any individual PDF, hover your mouse over “Open With,” and select “Google Docs.” Google will then attempt to run some OCR on your PDF, and you should be able to save the resulting file as a document. You can then search through this document (and any others you convert) via Drive itself.

The more I think about it, though, that solution seems a little inelegant given how many files you have to work with. Instead, I might try a piece of software like TesseractStudio.Net, or just Tesseract OCR if you don’t fear the command line. You should be able to use either one to add OCR text to your files, and you can then search through them directly via Windows or macOS. OCRmyPDF is another option that’s similar to Tesseract OCR, but, again, you’ll be typing commands to apply OCR to your files. There’s no GUI, nor is there (direct) Windows support.
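If you do go the command-line route, here is a rough sketch of how a folder of scans might be run through these tools from Python. Treat it as a starting point under a few assumptions, not a finished script: the function names and folder layout are made up for illustration, it assumes OCRmyPDF (or Tesseract) is already installed and on your PATH, and it assumes English-language minutes. The commands themselves use OCRmyPDF’s --skip-text and -l options and Tesseract’s "pdf" output mode.

```python
import shutil
import subprocess
from pathlib import Path


def ocrmypdf_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build an OCRmyPDF call that adds a searchable text layer to a PDF."""
    # --skip-text leaves pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", "-l", language, str(src), str(dst)]


def tesseract_command(image: Path, out_base: Path, language: str = "eng") -> list[str]:
    """Build a Tesseract call that turns one scanned image into a searchable PDF."""
    # Tesseract adds the ".pdf" extension to the output base name itself.
    return ["tesseract", str(image), str(out_base), "-l", language, "pdf"]


def plan_batch(scan_dir: Path, out_dir: Path) -> list[list[str]]:
    """Return one OCRmyPDF command per PDF found in scan_dir, in sorted order."""
    return [
        ocrmypdf_command(pdf, out_dir / pdf.name)
        for pdf in sorted(scan_dir.glob("*.pdf"))
    ]


def run_batch(scan_dir: Path, out_dir: Path) -> None:
    """Run the planned commands (hypothetical helper; needs OCRmyPDF installed)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for command in plan_batch(scan_dir, out_dir):
        # check=True stops the batch at the first file that fails.
        subprocess.run(command, check=True)
```

Calling run_batch(Path("scans"), Path("searchable")) would write an OCR’d copy of each PDF into a separate folder and leave the original scans alone, which seems like the safer choice for an archive.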

There’s also Paperwork, an open-source document cataloging tool that comes with OCR built right in, which I would definitely consider given that it’s designed to be an all-in-one piece of software for archiving, sorting, and searching documents. That sounds like it might be just what you’re looking for.

I haven’t used PDF-XChange Viewer, but others have recommended it as an option. The free version will drop watermarks into your PDFs, but it can create PDFs from images and, if I’m correct, add OCR to these and any existing PDFs you have. It’s worth exploring, even if it’s not the ideal (free) solution. Similarly, FreeOCR can take your images or PDFs, apply OCR, and export the results as plain text files or Word documents. If you don’t mind searching through your archives that way, it’s an option.
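If you end up with a folder of plain text exports, searching them doesn’t have to mean opening each file by hand. Here’s a minimal sketch of a case-insensitive search across a folder of .txt files. It isn’t tied to FreeOCR in any way; the function name and the one-text-file-per-meeting layout are assumptions for illustration.

```python
from pathlib import Path


def search_minutes(text_dir: Path, phrase: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for every line containing phrase."""
    needle = phrase.casefold()
    hits = []
    for path in sorted(text_dir.glob("*.txt")):
        # errors="replace" keeps the search going if OCR output has odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.casefold():
                    hits.append((path.name, number, line.strip()))
    return hits
```

Something like search_minutes(Path("minutes_text"), "budget") would list every line mentioning the budget, along with the file and line number it came from. Keep in mind that OCR mistakes (say, "budqet") won’t match, so a miss here doesn’t guarantee the word is absent from the scan.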

As for paid solutions, there’s always Adobe Acrobat Pro or Foxit PhantomPDF. Both will allow you to add OCR to PDFs, and you should be able to process all of your documents as a big batch (or create a script that does this with a folder’s worth of contents). You might even be able to get this all done during the apps’ free trials, if they don’t put limitations on their OCR capabilities. I’ve also seen others with your particular problem find success using an app like PDF OCR, which could be a cheaper alternative.

That’s everything I can think of off the top of my head (and with a little research). Hopefully, one of those solutions works out for you—without costing you a small fortune. Write back and let me know which app worked best for you!

Do you have a tech question keeping you up at night? Tired of troubleshooting your Windows or Mac? Looking for advice on apps, browser extensions, or utilities to accomplish a particular task? Let us know! Tell us in the comments below or send us an email.
