  • Tom 161 posts 322 karma points
    Jan 02, 2018 @ 11:41
    Tom
    0

    Ways to Improve Lucene Search Engine results

    Hello: I am using Umbraco 7.5.9 with three indexers defined, and they work fine, except that when a user searches the site it takes 18 seconds to run through the three indexers before the results come back. So I retooled the code to use C# parallel processing, and now the search takes half the time, 9 seconds (still too slow).

    For my site, when users search on "housing" there are a total of 303 items: 131 items belong to the Content indexer, 250 items belong to the PDF indexer, and 2 items belong to the Inbox indexer (a custom backoffice App Plugin we wrote).

    Question: I have broken my code down and found the bottleneck is in the PDF indexer.

    Does anyone have any suggestions on how to improve the PDF indexer?

    Thank You

    Tom

    PS Here is the code. Note: I am using ConcurrentBag and Parallel processing.

            var indexesToSearch = new List<Tuple<SearchIndexType, string>>
            {
                new Tuple<SearchIndexType, string>(SearchIndexType.Content, @"MembersOnlyIndexSet"),
                new Tuple<SearchIndexType, string>(SearchIndexType.Media, @"MembersOnlyPDFIndexSet"),
                new Tuple<SearchIndexType, string>(SearchIndexType.InboxMessage, "MembersOnlyInboxMessageIndexSet")
            };
    
    
            var results = new List<SearchResultVM>();
    
            if (string.IsNullOrWhiteSpace(searchTerm))
                return results;
    
            var analyzer = new StandardAnalyzer(Version.LUCENE_29);
            // Reuse the analyzer created above instead of allocating a second one
            var parser = new MultiFieldQueryParser(Version.LUCENE_29,
                QueryFields,
                analyzer);
            var query = parser.Parse(searchTerm);
    
            // Build the highlighter
            var formatter = new SimpleHTMLFormatter("<span class=\"lucene-highlight\">", "</span>");
            var scorer = new QueryScorer(query);
            var highlighter = new Highlighter(formatter, scorer);
            highlighter.SetTextFragmenter(new SimpleFragmenter(FragementLength));
    
            var sets = IndexSets.Instance.Sets;
            UmbracoContext context = UmbracoContext.Current;
    
            ConcurrentBag<SearchResultVM> bag = new ConcurrentBag<SearchResultVM>();
    
            // Iterate the list declared above (it was referenced as "indexSets")
            Parallel.ForEach(indexesToSearch, (index) =>
            {
                var set = sets[index.Item2];
                var dirInfo = new DirectoryInfo(Path.Combine(set.IndexDirectory.FullName, @"Index"));
    
                using (var indexDir = FSDirectory.Open(dirInfo))
                {
                    using (var indexSearcher = new IndexSearcher(indexDir, true))
                    {
                        var collect = TopScoreDocCollector.create(3000, true);
                        indexSearcher.Search(query, collect);
    
                        var docs = collect.TopDocs();
                        // ScoreDocs is capped at 3000 entries, while the total hit count can be larger
                        for (int i = 0; i < docs.ScoreDocs.Length; i++)
                        {
                            var rec = docs.ScoreDocs[i];
                            var doc = indexSearcher.Doc(rec.doc);
    
                            SearchResultVM item;
                            switch (index.Item1)
                            {
                                case SearchIndexType.Content:
                                    item = BuildSearchContentItem(rec, analyzer, highlighter, doc);
                                    break;
                                case SearchIndexType.Media:
                                    item = BuildSearchMediaItem(rec, analyzer, highlighter, doc, context);
                                    break;
                                case SearchIndexType.InboxMessage:
                                    item = BuildSearchInboxMessageItem(rec, analyzer, highlighter, doc);
                                    break;
                                default:
                                    Log.DebugFormat("Unrecognized search index type {0}", index.Item1);
                                    item = null;
                                    break;
                            }
    
                            if (item != null)
                            {
                                bag.Add(item);
                            }
                        }
                    }
                }
            });
    
            results = bag.ToList();
            return results.OrderByDescending(o => o.Score).ToList();
        }
    
  • Tom 161 posts 322 karma points
    Jan 02, 2018 @ 12:24
    Tom
    0

    Here's the routine that builds each PDF search result:

       private static SearchResultVM BuildSearchMediaItem(ScoreDoc rec, StandardAnalyzer analyzer, Highlighter highlighter, Document doc, UmbracoContext context)
        {
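            var id = doc.GetField("__NodeId");
            int nodeId;
            // Skip documents that carry no usable node id
            if (id == null || !int.TryParse(id.StringValue(), out nodeId))
                return null;
    
            var mediaItem = context.MediaCache.GetById(nodeId);
            // The index can still reference media that is no longer in the cache
            if (mediaItem == null)
                return null;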
    
            var item = new SearchResultVM
            {
                Score = rec.score,
                SearchIndexType = SearchIndexType.Media,
                Id = nodeId,
                Name = mediaItem.Name,
                Url = BuildMediaItemUrl(mediaItem, context),
                LastUpdated = mediaItem.UpdateDate.ToShortDateString(),
                HighlightedFragment = HighlightContent(analyzer, highlighter, doc, @"FileTextContent")
            };
    
            return item;
        }
    
  • John Bergman 483 posts 1132 karma points
    Jan 02, 2018 @ 17:51
    John Bergman
    0

    What's in BuildMediaItemUrl()? The only other place I can see that could be taking your time (in my light experience with media in searches) would be the HighlightContent() method. It looks like you need to break it down a little further to locate the bottleneck.

  • Tom 161 posts 322 karma points
    Jan 03, 2018 @ 12:08
    Tom
    100

    John:

    Thanks so much for replying.

    BuildMediaItemUrl() simply returns a friendly URL for each item.

    And when I removed highlighting from the Media search index, performance improved by just a little.

    Do you have any other suggestions? Note: I found Cogworks.ExamineFileIndexer online. Do you have any experience with it? Is it faster than Umbraco 7.0's Examine v0.1.89?

        private static string BuildMediaItemUrl(IPublishedContent mediaItem, UmbracoContext context)
        {
            var ctypeSvc = ApplicationContext.Current.Services.ContentTypeService;
            var contentSvc = ApplicationContext.Current.Services.ContentService;
            var urlHelper = new UmbracoHelper(context);
    
            if (string.Equals(mediaItem.DocumentTypeAlias, @"membersOnlyPDF", StringComparison.InvariantCultureIgnoreCase)
                || string.Equals(mediaItem.DocumentTypeAlias, @"membersOnlyFile", StringComparison.InvariantCultureIgnoreCase))
            {
                var cTypeMOHomePage = ctypeSvc.GetContentType("membersOnlyHomepage");
                var moHomePage = contentSvc.GetContentOfContentType(cTypeMOHomePage.Id).FirstOrDefault();
                if (moHomePage != null)
                {
                    return $"{urlHelper.NiceUrlWithDomain(moHomePage.Id).TrimEnd('/')}{mediaItem.Url()}";
                }
            }
    
            if (string.Equals(mediaItem.DocumentTypeAlias, @"File", StringComparison.InvariantCultureIgnoreCase))
            {
                var cTypePWSHomePage = ctypeSvc.GetContentType("Homepage");
                var homePage = contentSvc.GetContentOfContentType(cTypePWSHomePage.Id).FirstOrDefault();
                if (homePage != null)
                {
                    return $"{urlHelper.NiceUrlWithDomain(homePage.Id).TrimEnd('/')}{mediaItem.Url()}";
                }
            }
            return string.Empty;
        }
    
  • Niels Hartvig 1951 posts 2391 karma points c-trib
    Jan 04, 2018 @ 09:04
    Niels Hartvig
    0

    I believe a bottleneck could be due to using the ContentService, which is the full CRUD API and thus not optimised for read-only access (and not cached).

    There's more in-depth information in the Common Pitfalls part of the documentation, which can be a really good and enlightening read. Here's the section on using the Services in views: https://our.umbraco.org/documentation/Reference/Common-Pitfalls/#using-the-services-layer-in-your-views
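
    To illustrate why per-result service calls add up, here is a minimal, self-contained sketch (not Umbraco code). LookupHomePageUrl is a hypothetical stand-in for the ContentTypeService/ContentService round trip in BuildMediaItemUrl, and the example.org URL is made up. The idea is simply to resolve the home page prefix once per search instead of once per PDF hit:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlPrefixDemo
{
    // Counts how often the simulated expensive lookup runs.
    public static int LookupCalls;

    // Hypothetical stand-in for the uncached service calls that resolve
    // a home page URL; this is NOT a real Umbraco API.
    public static string LookupHomePageUrl(string contentTypeAlias)
    {
        LookupCalls++;
        return "https://example.org/" + contentTypeAlias.ToLowerInvariant() + "/";
    }

    // Builds one URL per media path while resolving the prefix only once.
    // Lazy<T> is thread-safe by default, so the same instance could be
    // shared by the Parallel.ForEach workers.
    public static List<string> BuildUrls(IEnumerable<string> mediaPaths, string homeAlias)
    {
        var prefix = new Lazy<string>(() => LookupHomePageUrl(homeAlias).TrimEnd('/'));
        return mediaPaths.Select(p => prefix.Value + p).ToList();
    }

    public static void Main()
    {
        var paths = Enumerable.Range(1, 250).Select(i => "/media/" + i + "/file.pdf");
        var urls = BuildUrls(paths, "membersOnlyHomepage");
        Console.WriteLine(urls.Count + " URLs built with " + LookupCalls + " lookup(s)");
    }
}
```

    With 250 PDF hits, the original per-item approach would perform the lookup 250 times; in this sketch it runs once. Whether that accounts for all of the delay on a real site is an assumption worth measuring.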

    Hope this helps!

    Best,

    Niels...

  • Tom 161 posts 322 karma points
    Jan 03, 2018 @ 13:44
    Tom
    0

    John:

    I figured it out. If I bypass BuildMediaItemUrl and just use the out-of-the-box URL, I am back to sub-second response time for search.

        var mediaItem = context.MediaCache.GetById(nodeId);
        Url = mediaItem.Url,

    Thanks for your help and tips.

  • John Bergman 483 posts 1132 karma points
    Jan 04, 2018 @ 01:57
    John Bergman
    0

    Interesting. The method must be creating a whole bunch of temporary objects; that's the only thing I can think of that would cause a performance hit like the one you are describing.

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jan 04, 2018 @ 10:49
    Ismail Mayat
    0

    Tom,

    Was there a reason for using Lucene.Net directly and not Examine? Was it so you could get the highlighter working? Also, both Lucene and Examine have a multi-index searcher, which allows you to search over more than one index.

    Regards

    Ismail
