A while back I shared a post about how we were able to use the excellent, yet Java-based, Tika text extraction library in our .NET applications. Alongside that post I also created a GitHub repo where the code for TikaOnDotNet lives and is maintained. At Dovetail Software we continue to use this library with great success on the .NET platform. Externally, I’ve gotten responses from people using the ideas in the project, but they often have problems creating their own release or getting Tika up and running in their projects. Today I gave the project some love and polish and updated it so it can be consumed as a NuGet package, which makes it really easy to use from your code base. Let’s take a look at how to use Tika in your .NET projects.
From Zero to Text Extraction
Step 1 – Install the NuGet Package
Step 2 – New up a TextExtractor
Something to note here: the text extractor class uses Tika’s auto-detection mechanism. We are not explicitly using Tika’s Office document extraction parsers. Tika can detect the type of the incoming content and extract the text from it, which is very useful for search engines like Dovetail Seeker. Tika is an extensive library and I have wrapped just this one mechanism. If you like, you can use the full power of Tika directly from your .NET code. Check out Tika in Action for more about what Tika can do.
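To make the auto-detection point concrete, here is a minimal sketch of extracting text without telling Tika what kind of file it is looking at. It assumes the TikaOnDotNet NuGet package is installed, that `TextExtractor` lives in the `TikaOnDotNet` namespace, and that the result exposes `Text` and `ContentType` properties; the namespace and member names may differ between package versions. The `ExtractionDemo` class, the `ExtractText` helper, and the file name are hypothetical and only here for illustration.

```csharp
using System;
using TikaOnDotNet; // assumption: namespace may differ between package versions

public static class ExtractionDemo
{
    // Hypothetical helper: returns the text Tika extracts from the file.
    // No parser is selected here; Tika detects the content type itself.
    public static string ExtractText(string path)
    {
        var textExtractor = new TextExtractor();
        var result = textExtractor.Extract(path);
        return result.Text;
    }

    public static void Main(string[] args)
    {
        // Hypothetical file name; pass a real path as the first argument.
        var path = args.Length > 0 ? args[0] : "ancient-word-document.doc";

        var result = new TextExtractor().Extract(path);

        // ContentType reports what Tika detected for the incoming content.
        Console.WriteLine("Detected content type: " + result.ContentType);
        Console.WriteLine(result.Text);
    }
}
```

The same call should work for a PDF, a plain text file, or an Office document, since the detection happens inside Tika rather than in your code.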
Finale – Compare the text with the original
The code above runs an ancient Word document through Tika’s text extraction engine. This document is part of our testing of Tika’s compatibility with… ancient Word documents.