umbraco examine indexing pdf and open xml office documents gives incomplete index - API Questions

Dragan R. 3 posts 33 karma points

May 10, 2015 @ 18:32

Umbraco Examine: Indexing PDF and Open XML Office documents gives incomplete index


Hello all,

On our web site, we need to include media files in search results. "Media files" include: PDF, xlsx, docx, pptx.

We're using a third-party component to perform the indexing. Its details seem irrelevant here, so I'll leave them out for now.

We’ve encountered a strange issue: indexing media files (PDF, xlsx, docx and pptx) seems fragile and very sensitive to errors. To be precise, some errors during indexing (for example, a corrupt PDF document) result in either an incomplete Lucene index or no index at all. This happens on application startup as well as on manual re-indexing. As a consequence, search on our web site misses most of the media files.

When I manually remove all corrupted (or potentially “dangerous”) files from the server, at a certain point the indexing succeeds and the Lucene index is generated correctly. But this is obviously not feasible in the production environment, because the client updates media content without our intervention.
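
Since removing suspect files by hand isn't an option in production, a cheap automated pre-screen could flag the most obviously damaged PDFs before they reach the indexer. The sketch below is hypothetical: PdfPreScreen and LooksLikeCompletePdf are names invented here, not part of Examine or of any indexing component. It only checks that the file starts with the %PDF- header and carries the startxref keyword near its end, which a truncated upload typically lacks. It is a strict, shallow check, so a file that passes can still fail to parse.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Examine or any third-party indexer).
public static class PdfPreScreen
{
    // Number of trailing bytes searched for the "startxref" keyword (an arbitrary choice).
    private const int TailSize = 2048;

    // Returns true when the file begins with "%PDF-" and "startxref" occurs near its end.
    public static bool LooksLikeCompletePdf(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            // Header check: the first five bytes must be "%PDF-".
            var head = new byte[5];
            if (ReadFully(stream, head) < head.Length) return false;
            if (Encoding.ASCII.GetString(head) != "%PDF-") return false;

            // Trailer check: look for "startxref" in the last TailSize bytes.
            long tailStart = Math.Max(0, stream.Length - TailSize);
            stream.Seek(tailStart, SeekOrigin.Begin);
            var tail = new byte[(int)(stream.Length - tailStart)];
            int read = ReadFully(stream, tail);
            string tailText = Encoding.ASCII.GetString(tail, 0, read);
            return tailText.Contains("startxref");
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

An indexer could call this once per file and skip (and log) anything that fails, assuming the indexing component offers a place to hook such a check in.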

It seems that, for some reason, Lucene segments get lost during optimization, or don't get created at all.

There are about 500 MB of files of the mentioned document types on the server.

When an incomplete index is created, the index folder contains three files:

a .cfs file (for example, _2.cfs);
segments.gen;
segments_x

As already mentioned, the .cfs file contains just a small subset (say, five or ten) of the total documents on the web site.

Questions:

does this situation sound familiar?
can I control the behavior during indexing, and basically tell the indexer "don't break on error, regardless of the error severity, but continue with indexing"? I haven't found any such setting for Examine BaseIndexProvider, IndexWriter or other classes I've looked into...

I've tried to handle different indexing events at application startup, but none of them seem to give me what I need:

using Examine;
using System.Web.Optimization;
using Umbraco.Core;

public class InitializationEvents : ApplicationEventHandler
{
    #region ApplicationStarted
    protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
    {
        base.ApplicationStarted(umbracoApplication, applicationContext);

        if (applicationContext.IsConfigured && applicationContext.DatabaseContext.IsDatabaseConfigured)
        {
            /* INDEXING */
            // Subscribe to the indexer's events before triggering the rebuild.
            var indexerMedia = ExamineManager.Instance.IndexProviderCollection["XfsMediaIndexer"];
            if (indexerMedia != null)
            {
                // GatheringNodeDataHandlerMedia is defined in the code removed below.
                indexerMedia.GatheringNodeData += GatheringNodeDataHandlerMedia;
                //indexerMedia.GatheringFieldData += indexerMedia_GatheringFieldData;
                //indexerMedia.IgnoringNode += indexerMedia_IgnoringNode;
                indexerMedia.IndexingError += indexerMedia_IndexingError;
                //indexerMedia.NodeIndexing += indexerMedia_NodeIndexing;
                //indexerMedia.NodesIndexing += indexerMedia_NodesIndexing;
                //indexerMedia.NodeIndexed += indexerMedia_NodeIndexed;
                //indexerMedia.NodesIndexed += indexerMedia_NodesIndexed;
            }

            // Rebuild all indexes on every application start.
            ExamineManager.Instance.RebuildIndex();

            /* BUNDLING */

            /* enable bundling for custom-defined bundles (reference: gist.github.com/jkarsrud/5143239) */
            BundleConfig.RegisterBundles(BundleTable.Bundles);
        }
    }

    void indexerMedia_IndexingError(object sender, IndexingErrorEventArgs e)
    {
        // LOG THE ERROR MESSAGE
    }
    #endregion

    // REMOVED IRRELEVANT CODE
}

I won't add any entries from ExamineSettings.config and ExamineIndex.config: as I already said, after removing the corrupt PDF files the index gets created correctly, so I'm pretty confident everything is configured correctly. But I'll supply this info in a later post if anyone asks for it.

Any help would be appreciated!


Alex Skrypnyk 6132 posts 23951 karma points MVP 7x admin c-trib

May 10, 2015 @ 22:52

Hi Dragan,

Have you read these docs?

https://our.umbraco.org/Documentation/Reference/Searching/Examine/full-configuration

http://24days.in/umbraco/2013/getting-started-with-examine/

They include some info about PDF indexing:

Used to index PDF content in Umbraco's media section.

**** NOTE: Not all PDFs can have text read from them!!! ****

This shows the PDF specific configuration and the default values applied when 
they are not specified.

Thanks, Alex


Dragan R. 3 posts 33 karma points

May 11, 2015 @ 09:32

Hi Alex,

thanks for the response.

I've already read through the material you referenced, and I'm convinced the Examine indexers are configured correctly.

Below is an excerpt from ExamineSettings.config (there are other indexers/searchers; I've left only the most relevant ones). A small explanation: in the config below you'll see that we're using a third-party component, XfsSearch. This component internally uses iTextSharp and IFilters for parsing PDFs and Office documents.

<?xml version="1.0"?>
<Examine>
  <ExamineIndexProviders>
    <providers>
      <!--[...]-->
      <add name="XfsMediaIndexer" type="Xuntos.Xfs.MediaIndexer, Xuntos.Xfs.ContentIndexer" umbracoFileProperty="umbracoFile" indexSet="XfsMediaIndexSet" />
    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
      <!--[...]-->
      <add name="XfsMediaSearcher" type="UmbracoExamine.LuceneExamineSearcher, UmbracoExamine" indexSet="XfsMediaIndexSet" />
    </providers>
  </ExamineSearchProviders>
</Examine>

Unfortunately, I'm not sure whether the problem lies in the component or in Examine/iTextSharp/IFilters.

Another interesting fact: I've enabled detailed logging, and the log files reveal a few different exceptions. Some exceptions seem to be "non-fatal" (the index is still created), while others are "fatal" (the index is incomplete or missing entirely).

Exception message and stack trace, for a "non-fatal" exception:

Could not read PDF
'>' not expected at file pointer 3343
   at iTextSharp.text.pdf.PRTokeniser.ThrowError(String error)
   at iTextSharp.text.pdf.PRTokeniser.NextToken()
   at Xuntos.Xfs.helpers.PdfIndexHelper.PDFParser.ParsePdfText(String sourcePDF, Action`1 onError)

Exception message and stack trace, for the "fatal" exception (this is caused by the corrupt PDF):

Error indexing queue items
Rebuild failed: trailer not found.; Original message: PDF startxref not found.
   at iTextSharp.text.pdf.PdfReader.ReadPdf()
   at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
   at Xuntos.Xfs.helpers.PdfIndexHelper.PDFParser.ParsePdfText(String sourcePDF, Action`1 onError)
   at Xuntos.Xfs.MediaIndexer.ExtractTextFromPdfFile(FileInfo file)
   at Xuntos.Xfs.MediaIndexer.GetDataToIndex(XElement node, String type)
   at Examine.LuceneEngine.Providers.LuceneIndexer.ProcessIndexQueueItem(IndexOperation op, IndexWriter writer)
   at Examine.LuceneEngine.Providers.LuceneIndexer.ForceProcessQueueItems()
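
The "fatal" trace suggests that the exception thrown while opening the PDF travels up through GetDataToIndex into the queue processing, instead of being handled per file. One possible mitigation, assuming the extraction call can be wrapped (for example in a subclass or a patched build of the indexer, which is an assumption about the component), is to turn any extraction failure into empty text plus a log entry. A minimal sketch follows; SafeExtraction, ExtractOrEmpty and the delegate parameters are hypothetical names, not part of Examine, iTextSharp or XfsSearch.

```csharp
using System;

// Hypothetical guard (not an Examine, iTextSharp or XfsSearch API).
public static class SafeExtraction
{
    // Runs the supplied extractor for one file. If it throws, the error is
    // reported through onError and an empty string is returned, so a single
    // unreadable file does not abort the rest of the batch.
    public static string ExtractOrEmpty(
        string path,
        Func<string, string> extract,
        Action<string, Exception> onError)
    {
        try
        {
            // Treat a null result the same as "no text found".
            return extract(path) ?? string.Empty;
        }
        catch (Exception ex)
        {
            if (onError != null) onError(path, ex);
            return string.Empty;
        }
    }
}
```

With a guard like this, the node for a corrupt file would still be indexed (with empty file content), and the failure would show up only in the log.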


Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib

May 11, 2015 @ 10:23

Dragan,

I have in the past indexed PDFs using the Examine PDF indexer and found that some PDFs gave me errors. I then wrote my own media indexer, which uses Tika under the hood (see https://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer), and that fixed my issue. Please note that it was written for v6 but should work for v7; if not, you can download the source and update it. One issue with it is speed when you have a lot of media content, as it uses Java IKVM, which wraps around the Tika libraries.

Regards

Ismail


Dragan R. 3 posts 33 karma points

Jun 17, 2015 @ 11:38

@Ismail: Thanks for the input, and thank you so much for reminding me of your component!

After initial testing, it seems that it suits our needs perfectly, so we'll probably use CogUmbracoExamineMediaIndexer for indexing files, and XfsSearch for everything else related to the search functionality. I've marked your comment as the solution, because it gave us a way to circumvent the original issue. But my initial question is still relevant, and it would be nice if someone else from the Umbraco community could provide more information about the problem and a possible solution...

Just a small reminder for myself and any future visitor: it's not enough to just do "right-click > Save As" on the "tikka-app-1.2.dll" link on the product page (that saves a file with the correct name, but completely wrong content)... Instead, click the link and then click the appropriate button on the Dropbox page. The "tikka" dll should be about 28MB in size. ;)

Regards, Dragan.

PS Sorry for the very late response, but for some reason the Umbraco forum didn't accept my comment earlier.


Tim 1193 posts 2675 karma points MVP 3x c-trib

Jun 29, 2015 @ 11:04

I've got a similar situation with one of my clients, on 7.2.2. Every now and again, the PDF index will clear itself completely, and the only way to get the index to rebuild is to kick the App Pool. When the index rebuilds, it doesn't add all of the files back in; you have to manually re-publish them all to get them into the index correctly.

I'm still looking into this, but I'll let you know if I get to the bottom of the issue.


Tim 1193 posts 2675 karma points MVP 3x c-trib

Jun 29, 2015 @ 12:08

Hiya,

I've found that this issue goes away if you limit the index to just contain items with the "File" node type. Here's an example config:

<IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/PDFs">
        <IncludeNodeTypes>
            <add Name="File" />
        </IncludeNodeTypes>
</IndexSet>

That has stopped the incomplete indexes for me, and the indexes rebuild correctly now.

