indexing - Lucene search with dashes does not return consistent results
Hi, I have a problem with Lucene search not returning consistent results. Indexing is done with StandardAnalyzer, and the Lucene version is 3.0.
An example entry in the database:
a1bc-1-12345678 - au-01 / 123456 - no.1 abc defg xx-yyy example data
If I search for the whole string, it does not return any results.
If I take out the single dashes and slashes and search for
a1bc-1-12345678 au-01 123456 no.1 abc defg xx-yyy example data
it does not return any results.
If I replace the dash in xx-yyy with whitespace and search for
a1bc-1-12345678 au-01 123456 no.1 abc defg xx yyy example data
it returns a result!
Now, if I put the single dashes and the slash back, keep the dash in xx-yyy replaced with whitespace, and search for
a1bc-1-12345678 - au-01 / 123456 - no.1 abc defg xx yyy example data
it does not return any results.
Finally, if I replace the dash in both au-01 and xx-yyy with whitespace and search for
a1bc-1-12345678 au 01 123456 no.1 abc defg xx yyy example data
it does not return any results.
In conclusion, "xx-yyy" is not valid but "au-01" is valid, while "xx yyy" is valid and "au 01" is not valid. That seems to be the problem?
How can I solve this?
I think I've got the answer to this. According to the Lucene documentation, StandardAnalyzer uses StandardTokenizer, and the index is built according to these rules:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
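To see how these two rules apply to the example entry, here is a rough plain-Java model of them. It is only a sketch of the rules as quoted above, not Lucene's actual tokenizer grammar: the class name `HyphenRuleSketch` is made up, it only handles ASCII letters and digits, and it ignores stop-word removal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: a simplified model of the two quoted rules,
// NOT the real StandardTokenizer implementation.
final class HyphenRuleSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Drop punctuation at the edges; a stand-alone "-" or "/" vanishes here.
            String trimmed = chunk.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.matches(".*[0-9].*")) {
                // Contains a digit: treated as a "product number" and kept whole.
                tokens.add(trimmed);
            } else {
                // No digit: split at hyphens and other punctuation,
                // but keep a dot that sits inside the token.
                for (String part : trimmed.split("[^a-z0-9.]+")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }
}
```

Under this model, "au-01" stays a single token because it contains a digit, while "xx-yyy" is split into "xx" and "yyy". That would be consistent with "au-01" and "xx yyy" being the forms that matched.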
I think WhitespaceAnalyzer with a lowercase filter will suit my needs.
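For comparison, this is roughly what splitting on whitespace and lowercasing does to the same text. Again this is a plain-Java sketch with a made-up class name (`WhitespaceLowercaseSketch`), meant to illustrate the idea rather than reproduce Lucene's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: models "split on whitespace, then lowercase".
final class WhitespaceLowercaseSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (!chunk.isEmpty()) {
                // Punctuation is left untouched; only the case changes.
                tokens.add(chunk.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this approach "xx-yyy" and "au-01" are both kept whole, so hyphenated terms are treated the same way whether or not they contain a digit. One thing to watch for: the stand-alone "-" and "/" in the example entry become tokens of their own, and the query has to be analyzed the same way as the index for the terms to line up.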