  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 22, 2017 @ 13:47
    Nik
    0

    Examine issue when text includes stop words

    Hi all,

    I've got an issue that seems to have come up on the forum a few times. I've got a site with a search feature, but when a search phrase contains words like "on", "and", "if", other Examine stop words, or two-character terms, it throws an exception.

    Current test phrase:

    "Focus on sales" - This fails

    "Focus sales" - This works

    Unfortunately, I didn't write the original search script so I'm not 100% on what it is doing.

    var q = Request.QueryString["Query"];
    var q_split = q.Trim().Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries);
    
    var fieldsToSearch = new[]
    {
        "nodeName", "seoDescription", "archetypeBody"
    };
    
    var searcher = ExamineManager.Instance.SearchProviderCollection["ContentSearcher"];
    
    var criteria = searcher.CreateSearchCriteria(IndexTypes.Content, BooleanOperation.And);
    var query = criteria.Field("seoTitle", q_split.First().MultipleCharacterWildcard().Value.Boost(8));
    query = query.Or().GroupedOr(fieldsToSearch, q_split.First().MultipleCharacterWildcard());
    
    criteria.GroupedOr(new[] {"siteIdentifier"}, domainId.ToString());
    
    
    foreach (var term in q_split.Skip(1))
    {
        query = query.Or().Field("seoTitle", term.MultipleCharacterWildcard().Value.Boost(8)).Or().GroupedOr(fieldsToSearch, term.MultipleCharacterWildcard());
    }
    
    var searchResults = searcher.Search(query.Compile()).OrderByDescending(x => x.Score).TakeWhile(x => x.Score > 0.5f);
    

    Having read the following threads:

    https://our.umbraco.org/forum/developers/extending-umbraco/71626-examine-search-fails-for-two-character-words

    https://our.umbraco.org/forum/ourumb-dev-forum/bugs/20727-Lucene-fails-when-seaching-on-common-words

    they pointed me in the direction of stop words, but I can't put my finger on the right solution. There are times when a user will want to search for phrases containing stop words or two-letter abbreviations.

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jun 22, 2017 @ 13:58
    Ismail Mayat
    0

    Nik,

    Which analyser is being used?

    Also can you write out the generated query?

    Regards

    Ismail

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 22, 2017 @ 14:01
    Nik
    0
    <add name="ContentSearchIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
          supportUnpublished="false"
          supportProtected="false"
          indexSet="ContentSearchIndexSet"
          analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>
    

    Found this as well

    <add name="ContentSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" 
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" indexSet="ContentSearchIndexSet" 
           enableLeadingWildcards="true"/>
    

    Looks like the standard analyzer

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 22, 2017 @ 14:08
    Nik
    0

    When it generates a working query it looks like this:

    +(+seoTitle:focus^8.0 (nodeName:focus* seoDescription:focus* archetypeBody:focus*) +(siteIdentifier:1052) seoTitle:safety^8.0 (nodeName:safety* seoDescription:safety* archetypeBody:safety*)) +__IndexType:content
    

    I'll see if I can find out what the failing query looks like as it is causing a YSOD at the moment when it fails.

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 22, 2017 @ 14:19
    Nik
    0

    Hi Ismail,

    The problem is that when it hits a stop word, the following line throws the exception:

    query = query.Or().Field("seoTitle", term.MultipleCharacterWildcard().Value.Boost(8)).Or().GroupedOr(fieldsToSearch, term.MultipleCharacterWildcard());
    

    Something somewhere in there is causing a null reference exception, but when I check the parameters individually everything looks fine. I think it is caused somewhere in the Field() method or the GroupedOr() method.

    So I never get to the point of having a query to process.

    I did check whether the MultipleCharacterWildcard() extension method returned null, but that isn't the case; it seems to be absolutely fine at that point.

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jun 22, 2017 @ 15:33
    Ismail Mayat
    0

    Nik,

    What's the actual exception?

    Regards

    Ismail

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 22, 2017 @ 15:43
    Nik
    0

    This is the error:

    Object reference not set to an instance of an object.
    
    Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code. 
    
    Exception Details: System.NullReferenceException: Object reference not set to an instance of an object.
    
    Source Error: 
    
    
    Line 88:     foreach (var term in q_split.Skip(1))
    Line 89:     {
    Line 90:         query = query.Or().Field("seoTitle", term.MultipleCharacterWildcard().Value.Boost(8));
    Line 91:         query = query.Or().GroupedOr(fieldsToSearch, term.MultipleCharacterWildcard());
    Line 92:     }
    

    I split the original single line into the two statements on lines 90 and 91. Nothing that I can watch/debug is null at any point.

    When it raises the exception:

    • term = "on",
    • term.MultipleCharacterWildcard() has a value
    • term.MultipleCharacterWildcard().Value = "on*"
    • term.MultipleCharacterWildcard().Value.Boost(8) has a value
    • query also has a value

    Additional stack trace info:

    [NullReferenceException: Object reference not set to an instance of an object.]
    Examine.LuceneEngine.SearchCriteria.LuceneSearchCriteria.GetFieldInternalQuery(String fieldName, IExamineValue fieldValue, Boolean useQueryParser) +565
    Examine.LuceneEngine.SearchCriteria.LuceneQuery.Field(String fieldName, IExamineValue fieldValue) +29
    
  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jun 22, 2017 @ 16:03
    Ismail Mayat
    0

    Nik,

    I suspect that when 'on' is run through the standard analyser it's removed and you are left with just the wildcard. Can you experiment and take off the wildcard, so the lines:

    query = query.Or().Field("seoTitle", term.MultipleCharacterWildcard().Value.Boost(8));
    
    query = query.Or().GroupedOr(fieldsToSearch, term.MultipleCharacterWildcard());
    

    change to

    query = query.Or().Field("seoTitle", term.Boost(8));
    
    query = query.Or().GroupedOr(fieldsToSearch, term);
    

    See if that works?

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 22, 2017 @ 16:29
    Nik
    0

    Hey Ismail,

    I just changed the first line of the two to:

    query = query.Or().Field("seoTitle", term.Boost(8));
    

    This still results in the null reference exception being thrown within the Field method. It definitely doesn't like this very much.

  • Nicholas Westby 2054 posts 7100 karma points c-trib
    Jun 22, 2017 @ 17:07
    Nicholas Westby
    0

    This suggests that you can configure Lucene (and by extension Examine) to have no stop words: https://stackoverflow.com/a/17453193/2052963

    Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET = new System.Collections.Hashtable();
    

    I've never tried that myself, but it's worth a go. FWIW, I typically strip out any stop words and escape special characters with QueryParser.Escape.
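
    A minimal sketch of the "strip out any stop words" step mentioned above, for anyone who wants something to start from. StopWordFilter is a hypothetical helper, not part of Examine or Lucene.Net, and the word list is a hand-copied assumption of the default English stop words used by Lucene's StandardAnalyzer, so check it against the Lucene.Net version you are running. Escaping with QueryParser.Escape is left out to keep the sketch dependency-free.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of Examine or Lucene.Net.
public static class StopWordFilter
{
    // Assumption: hand-copied list of the default English stop words.
    // Verify against the analyzer and Lucene.Net version actually in use.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[]
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these", "they", "this",
            "to", "was", "will", "with"
        },
        StringComparer.OrdinalIgnoreCase); // match "On" and "on" alike

    // True when the term is in the assumed stop word list.
    public static bool IsStopWord(string term)
    {
        return term != null && StopWords.Contains(term);
    }

    // Returns the terms in their original order with stop words dropped.
    public static string[] RemoveStopWords(IEnumerable<string> terms)
    {
        return terms.Where(t => !IsStopWord(t)).ToArray();
    }
}
```

    With the script from the first post, this could be applied as q_split = StopWordFilter.RemoveStopWords(q_split) before the query is built. If every term is a stop word the array comes back empty, so the q_split.First() call would need a guard for that case.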

  • Tom Steer 161 posts 596 karma points
    Jun 23, 2017 @ 07:18
    Tom Steer
    100

    Hey Nik,

    I think you are hitting the same issue I came across the other day, which is down to an issue in Examine with boosted stop words (https://github.com/Shazwazza/Examine/issues/34)

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jun 23, 2017 @ 07:36
    Ismail Mayat
    1

    That was going to be my next suggestion: take off the boost. Basically the stop word is being removed and you are left with only the boost, hence it blows up.

    I would remove the stop words; that should fix the issue.

    Regards

    Ismail

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 23, 2017 @ 08:21
    Nik
    0

    Is there an easy way to check if a term is a stop word? I was looking at trying this:

     Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET.ContainsKey(term)
    

    and this

     Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET.ContainsValue(term)
    

    But both return false when the term is a stop word. Also, when I inspect ENGLISH_STOP_WORDS_SET it has a count of 33, but every list I find in the object is empty, so I'm a bit at a loss as to where they are.

  • Nik 1591 posts 7148 karma points MVP 6x c-trib
    Jun 23, 2017 @ 08:32
    Nik
    0

    Never mind, I missed a line of code in the issue linked by Tom that allows me to do the check I need :-)

    Thanks guys, you all rock!
