indexing richtext fields with examine

Søren Løhr 7 posts 25 karma points

Mar 29, 2011 @ 23:38

Indexing richtext fields with Examine

API Questions

I'm trying to use Examine on a slightly older Umbraco site, running Umbraco 4.0.0 using Examine 0.10.0.292, and it seems to work well except for two issues:

1: Special characters entered in a richtext field are replaced with their HTML entity names, e.g. ö is stored as &ouml;. They are also indexed that way, making it impossible to find words containing these characters in the index.

2: The other issue is searching for multiple words, where I always get 0 hits. If this topic was in my index, I could find it searching for "multiple" or "words", but not "multiple words". I'm searching using code like this:

searchCriteria = ExamineManager.Instance.SearchProviderCollection["mySearcher"].CreateSearchCriteria(BooleanOperation.Or);

filter = searchCriteria.GroupedOr(new string[] { "ShortDescription", "Description" }, Examine.LuceneEngine.SearchCriteria.LuceneSearchExtensions.Fuzzy(searchString));

searchResults = ExamineManager.Instance.SearchProviderCollection["mySearcher"].Search(filter.Compile());

If I copy the query from filter.Compile().ToString() and enter it directly as a search in Luke, I do however get the results I would expect.

Both the index provider and the search provider are configured to use the StandardAnalyzer, and so is my search in Luke.

Any help resolving these issues will be greatly appreciated.
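
One possible cause of the second issue, not confirmed anywhere in this thread: if the whole search string is handed to Fuzzy() as a single value, it may be searched as one term containing a space, which an index built with StandardAnalyzer would never contain, while Luke re-parses the query text into separate terms. A minimal sketch, assuming that is the cause, of splitting the input into individual terms first. SplitTerms is a hypothetical helper and not part of Examine; each resulting term would then be passed to Fuzzy() on its own.

```csharp
using System;

public static class SearchTerms
{
    // Hypothetical helper: breaks the raw search string into whitespace-separated terms
    // so that each one can be searched individually.
    public static string[] SplitTerms(string searchString)
    {
        if (string.IsNullOrEmpty(searchString))
        {
            return new string[0];
        }

        char[] separators = new[] { ' ', '\t', '\r', '\n' };
        return searchString.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Prints each term of the example phrase from the post on its own line.
        foreach (string term in SplitTerms("multiple words"))
        {
            Console.WriteLine(term);
        }
    }
}
```

How the per-term results are combined (one criteria object with several values, or several searches merged afterwards) depends on the overloads available in the Examine version in use, so that part is left out here.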

Søren Løhr 7 posts 25 karma points

Mar 31, 2011 @ 14:54

I found an old forum post describing how to change entity encoding from 'named' to 'raw' in the tinymce javascript file. This almost fixed the problem with character encoding, but it only seems to work for new content. If I save some of the existing content, it is still being encoded. Any pointers on how to prevent this?

Søren Løhr 7 posts 25 karma points

Mar 31, 2011 @ 16:17

Seems tinymce isn't the only place content was/is encoded.

Now that tinymce is set to raw, I can type a word like "søster" in a richtext field, and when viewing the field's HTML, I can see that it still says "søster". When published and displayed on the site, it says s&oslash;ster in the source, and in the Lucene index it also says s&oslash;ster. The field's HTML remains "søster" when viewed in Umbraco.

Any help to solve this is highly appreciated.
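
Since the entities seem to be introduced somewhere other than the editor, one workaround worth considering is to decode them just before the text reaches the index. A minimal sketch, assuming the field value is available as a plain string at indexing time. DecodeForIndex is a hypothetical helper name; HttpUtility.HtmlDecode is the standard .NET call from System.Web, which an Umbraco site already references. Where exactly to hook this into Examine's indexing is not covered here and would depend on the Examine version.

```csharp
using System;
using System.Web; // requires a reference to System.Web.dll

public static class EntityDecoding
{
    // Hypothetical helper: turns named or numeric HTML entities (e.g. &oslash;)
    // back into the raw characters so the index stores searchable text.
    public static string DecodeForIndex(string fieldValue)
    {
        if (string.IsNullOrEmpty(fieldValue))
        {
            return fieldValue;
        }

        return HttpUtility.HtmlDecode(fieldValue);
    }

    public static void Main()
    {
        // The encoded form seen in the published source and in the index.
        string encoded = "s&oslash;ster";
        Console.WriteLine(DecodeForIndex(encoded));
    }
}
```

This only changes what gets indexed; the published page source would still contain the entity form unless the encoding step itself is found and changed.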
