
  • Thanh Pham 22 posts 132 karma points
    1 week ago
    Thanh Pham
    0

    Umbraco Examine - Search result highlighting

    Hi guys,

    I'm trying to implement search result highlighting (like Google's) in an Umbraco web app. I followed this thread: https://our.umbraco.org/forum/developers/extending-umbraco/13571-Umbraco-Examine-Search-Results-Highlighting. However, it's 8 years old, and I want to target multiple fields with fuzzy search, so here is my code:

            var stdAnalyzer = new StandardAnalyzer(Version.LUCENE_29);
            var formatter = new SimpleHTMLFormatter();
            var finalQuery = new BooleanQuery();
            var tmpQuery = new BooleanQuery();
    
            var multiQueryParser = new MultiFieldQueryParser(Version.LUCENE_29, fields, stdAnalyzer);
            var externalIndexSet = Examine.LuceneEngine.Config.IndexSets.Instance.Sets["ExternalIndexSet"];
            var externalSearcher = new IndexSearcher($"{externalIndexSet.IndexDirectory.FullName}\\Index", true);
            var terms = searchTerm.RemoveStopWords().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    
            foreach (var term in terms)
            {
                tmpQuery.Add(multiQueryParser.Parse(term.Replace("~", "") + $@"~{fuzzyScore}"),
                    BooleanClause.Occur.SHOULD);
            }
            tmpQuery.Add(multiQueryParser.Parse("noIndex:1"), BooleanClause.Occur.MUST_NOT);
    
            finalQuery.Add(multiQueryParser.Parse($@"{tmpQuery}"),
                BooleanClause.Occur.MUST);
            finalQuery.Add(multiQueryParser.Parse("__IndexType:content"), BooleanClause.Occur.MUST);
    
    
            var hits = externalSearcher.Search(finalQuery, 100);
            var qs = new QueryScorer(finalQuery);
            var highlighter = new Highlighter(formatter, qs);
            var fragmenter = new SimpleFragmenter();
            highlighter.SetTextFragmenter(fragmenter);
            highlighter.SetMaxDocBytesToAnalyze(int.MaxValue);
    
            foreach (var item in hits.ScoreDocs)
            {                
                var document = externalSearcher.Doc(item.doc);
                var description = document.Get("description");
                var tokenStream = TokenSources.GetTokenStream(externalSearcher.GetIndexReader(), item.doc,
                    "description", stdAnalyzer);
                var frags = highlighter.GetBestFragments(tokenStream, description, 10);
            }
    
            externalSearcher.Dispose();
    

    Everything seems to work except that I can't get a token stream, no matter how many methods from different classes I try, so no fragments are returned. I then looked at the Lucene.Net source code at https://lucenenet.apache.org/docs/3.0.3/df/d43/tokensources8cssource.html and found that the GetTokenStream method throws an ArgumentException (see image below) if the term vector of the "description" field I use above is not a TermPositionVector. I got exactly this exception when I debugged it. How do I fix this issue?

    enter image description here

    I use default ExternalSearcher & ExternalIndexSet provided by Umbraco (7.7.6) to index & query content within BackOffice.

    Thanks.

    TP

  • Thanh Pham 22 posts 132 karma points
    1 week ago
    Thanh Pham
    0

    Update.

    I used Luke to inspect the index Umbraco created and found that the description field has the Term Vector option ticked, but not positions or offsets (see image below). That means Umbraco Examine only stores the number of occurrences of each term, not the positions and offsets that are required to get the token stream I mentioned in the initial post. Reference: http://makble.com/what-is-term-vector-in-lucene

    Can anyone shed some light on how to fix this? Thanks.

    enter image description here
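
    For context, in plain Lucene.Net a field only keeps positions and offsets in its term vector when it is added with Field.TermVector.WITH_POSITIONS_OFFSETS. Below is a minimal sketch outside Examine. The class and method names are hypothetical, the field name matches the one above, and how Examine's own indexer sets this option is not shown here.

    ```csharp
    using Lucene.Net.Documents;

    // Hypothetical helper, for illustration only (not part of Examine).
    public static class TermVectorExample
    {
        // Builds a document whose "description" field stores a term vector
        // that includes term positions and character offsets.
        public static Document BuildDocument(string description)
        {
            var document = new Document();
            document.Add(new Field(
                "description",
                description,
                Field.Store.YES,                            // keep the original text for display
                Field.Index.ANALYZED,                       // tokenise the text so it is searchable
                Field.TermVector.WITH_POSITIONS_OFFSETS));  // store positions and offsets
            return document;
        }
    }
    ```

    A field indexed like this should show Term Vector, positions and offsets all ticked in Luke.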

  • Thanh Pham 22 posts 132 karma points
    1 week ago
    Thanh Pham
    0

    Can anyone help, please? Our client really wants this feature once they decommission the Google search plugin.

  • Dan Diplo 1209 posts 4304 karma points
    1 week ago
    Dan Diplo
    1

    Here's how I do search result highlighting in Lucene:

    First, add a reference to the NuGet package Lucene.Net.Contrib 2.9.4.1 (ensure it's the 2.9.4.1 version and not the latest).

    Then I have the following class with various methods to generate highlighting:

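    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Highlight; // highlighter namespace in Lucene.Net.Contrib 2.9.4.1
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    public class LuceneHighlighter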
    {
        private readonly Lucene.Net.Util.Version _luceneVersion = Lucene.Net.Util.Version.LUCENE_29;
    
        /// <summary>
        /// Initialises the queryparsers with an empty dictionary
        /// </summary>
        protected Dictionary<string, QueryParser> QueryParsers = new Dictionary<string, QueryParser>();
    
        /// <summary>
        /// Get or set the separator string (default = "...")
        /// </summary>
        public string Separator { get; set; }
    
        /// <summary>
        /// Get or set the maximum number of highlights to show (default = 5)
        /// </summary>
        public int MaxNumHighlights { get; set; }
    
        /// <summary>
        /// Get or set the Formatter to use (default = SimpleHTMLFormatter)
        /// </summary>
        public Formatter HighlightFormatter { get; set; }
    
        /// <summary>
        /// Get or set the Analyzer to use (default = StandardAnalyzer)
        /// </summary>
        public Analyzer HighlightAnalyzer { get; set; }
    
        /// <summary>
        /// Get the index searcher being used
        /// </summary>
        public IndexSearcher Searcher { get; private set; }
    
        /// <summary>
        /// Get the Query to be used for highlighting
        /// </summary>
        public Query LuceneQuery { get; private set; }
    
        /// <summary>
        /// Initialise a new LuceneHighlighter instance
        /// </summary>
        /// <param name="searcher">The IndexSearcher being used</param>
        /// <param name="luceneQuery">The underlying Lucene Query being used</param>
        /// <param name="highlightCssClassName">The name of the CSS class used to wrap around highlighted words</param>
        public LuceneHighlighter(IndexSearcher searcher, Query luceneQuery, string highlightCssClassName)
        {
            this.Searcher = searcher;
            this.LuceneQuery = luceneQuery;
            this.Separator = "...";
            this.MaxNumHighlights = 5;
            this.HighlightAnalyzer = new StandardAnalyzer(_luceneVersion);
            this.HighlightFormatter = new SimpleHTMLFormatter("<span class=\"" + highlightCssClassName + "\">", "</span> ");
        }
    
        /// <summary>
        /// Get the highlighted string for a value and a field
        /// </summary>
        /// <param name="value">The field value</param>
        /// <param name="highlightField">The field name</param>
        /// <returns>A string containing the highlighted result</returns>
        public string GetHighlight(string value, string highlightField)
        {
            value = Regex.Replace(value, "content", "", RegexOptions.IgnoreCase); // weird bug in GetBestFragments always adds "content"
    
            var scorer = new QueryScorer(LuceneQuery.Rewrite(Searcher.GetIndexReader()));
    
            var highlighter = new Highlighter(HighlightFormatter, scorer);
    
            var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
            return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
        }
    
        /// <summary>
        /// Get the highlighted field for a value and field
        /// </summary>
        /// <param name="value">The field value</param>
        /// <param name="searcher">The Lucene IndexSearcher</param>
        /// <param name="highlightField">The highlight field</param>
        /// <param name="luceneQuery">The query being used</param>
        /// <returns>A string containing the highlighted result</returns>
        public string GetHighlight(string value, IndexSearcher searcher, string highlightField, Query luceneQuery)
        {
            var scorer = new QueryScorer(luceneQuery.Rewrite(searcher.GetIndexReader()));
            var highlighter = new Highlighter(HighlightFormatter, scorer);
    
            var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
            return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
        }
    
        /// <summary>
        /// Gets a query parser for a highlight field
        /// </summary>
        /// <param name="highlightField">The field</param>
        /// <returns>A query parser</returns>
        protected QueryParser GetQueryParser(string highlightField)
        {
            if (!QueryParsers.ContainsKey(highlightField))
            {
                QueryParsers[highlightField] = new QueryParser(_luceneVersion, highlightField, HighlightAnalyzer);
            }
            return QueryParsers[highlightField];
        }
    }
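
    A minimal sketch of calling the class above, assuming an IndexSearcher and Query built as in the first post. The helper class, its method name, the hit limit of 100 and the "search-highlight" CSS class are illustrative assumptions, not part of Examine or Lucene.Net.

    ```csharp
    using System.Collections.Generic;
    using Lucene.Net.Search;

    // Hypothetical usage helper for the LuceneHighlighter class above.
    public static class LuceneHighlighterUsage
    {
        // Runs the query and returns one highlighted snippet per hit that has a stored value.
        public static List<string> GetSnippets(IndexSearcher searcher, Query query, string fieldName)
        {
            // "search-highlight" is an assumed CSS class name; use whatever the site styles.
            var luceneHighlighter = new LuceneHighlighter(searcher, query, "search-highlight");
            var snippets = new List<string>();

            foreach (var scoreDoc in searcher.Search(query, 100).ScoreDocs)
            {
                var value = searcher.Doc(scoreDoc.doc).Get(fieldName);
                if (string.IsNullOrEmpty(value))
                {
                    continue; // the field is not stored for this document
                }

                // Best fragments joined by Separator, with matched terms wrapped in the span
                snippets.Add(luceneHighlighter.GetHighlight(value, fieldName));
            }

            return snippets;
        }
    }
    ```

    Because GetHighlight analyses the stored field value again with HighlightAnalyzer, this approach does not depend on term vectors with positions and offsets being present in the index.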
    
  • Thanh Pham 22 posts 132 karma points
    1 week ago
    Thanh Pham
    0

    Thanks heaps Dan, I'll try it and let you know how it goes.

  • Thanh Pham 22 posts 132 karma points
    1 week ago
    Thanh Pham
    0

    Hi Dan,

    Woohoo, it's working. Thank you very much :).

    By the way, I found that my code was pretty much the same as yours except for the argument passed to QueryScorer. Mine did not call .Rewrite on the query, which turned out to be the root of the issue. Again, thank you.
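
    For reference, a sketch of that change applied to the scorer lines from the first post, reusing its variable names (finalQuery, externalSearcher, formatter). As far as I understand it, a fuzzy query holds no concrete terms until it is rewritten against the index reader, so the scorer has nothing to match on without this step.

    ```csharp
    // Fragment only: replaces the QueryScorer/Highlighter lines in the first post.
    // Rewrite expands the fuzzy clauses into the actual terms found in the index,
    // which is what QueryScorer needs in order to score and highlight matches.
    var rewrittenQuery = finalQuery.Rewrite(externalSearcher.GetIndexReader());
    var qs = new QueryScorer(rewrittenQuery);
    var highlighter = new Highlighter(formatter, qs);
    ```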
