Python: Fuzzywuzzy not working for foreign characters
When I try a simple fuzzywuzzy expression with foreign characters, I get erroneous results using the process.extractOne method (I've tried with and without the u prefix):
>>> choices = [u"הלכות חנוכה", u"הלכות פורים", u"הלכות סוכה"]
>>> process.extractOne("הלכות סוכה", choices)
(u'\u05d4\u05dc\u05db\u05d5\u05ea \u05d7\u05e0\u05d5\u05db\u05d4', 0)
Yet it runs smoothly with fuzz.ratio:
>>> fuzz.ratio("הלכות ראש השנה", "הלכות תעניות")
69
And the same code works great with regular characters:
>>> choices = ['this', 'that', 'those']
>>> process.extractOne("these", choices)
('those', 80)
What might be the problem?
Pass fuzz.ratio in the scorer= argument, and add u in front of the string you're trying to match.
The code below works:
choices = [u"הלכות חנוכה", u"הלכות פורים", u"הלכות סוכה"]
process.extractOne(u"הלכות סוכה", choices, scorer=fuzz.ratio)
(u'\u05d4\u05dc\u05db\u05d5\u05ea \u05e1\u05d5\u05db\u05d4', 100)
And the other choices get sensible scores as well:
choices = [u"הלכות חנוכה", u"הלכות פורים", u"הלכות סוכה"]
process.extract(u"הלכות סוכה", choices, scorer=fuzz.ratio)
[(u'\u05d4\u05dc\u05db\u05d5\u05ea \u05e1\u05d5\u05db\u05d4', 100), (u'\u05d4\u05dc\u05db\u05d5\u05ea \u05d7\u05e0\u05d5\u05db\u05d4', 86), (u'\u05d4\u05dc\u05db\u05d5\u05ea \u05e4\u05d5\u05e8\u05d9\u05dd', 67)]
Versions used: fuzzywuzzy 0.7.0 and Python 2.7.x
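If you want to see why a plain ratio scorer copes with Hebrew, here is a rough standard-library sketch. It is written for Python 3 (where every str is already Unicode) and does not use fuzzywuzzy's own code: simple_ratio, strip_non_ascii and extract_one are hypothetical names made up for this illustration, and simple_ratio only approximates fuzz.ratio using difflib.SequenceMatcher. The strip_non_ascii helper reflects my assumption, not something confirmed above, that the score of 0 comes from a preprocessing step that discards non-ASCII characters before the default scorer compares the strings.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity as an integer 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def strip_non_ascii(s):
    """Mimic an ASCII-only cleanup step (an assumption, for illustration)."""
    return s.encode("ascii", "ignore").decode("ascii").strip()


def extract_one(query, choices, scorer=simple_ratio):
    """Return the (choice, score) pair with the highest score."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return max(scored, key=lambda pair: pair[1])


choices = ["הלכות חנוכה", "הלכות פורים", "הלכות סוכה"]
query = "הלכות סוכה"

# Comparing the Unicode strings directly finds the exact match.
best = extract_one(query, choices)
print(best)

# After an ASCII-only cleanup, nothing is left to compare.
stripped = [strip_non_ascii(choice) for choice in choices]
print(stripped)
```

The point of the sketch is only that a character-by-character comparison works on Unicode text, while anything that filters the strings down to ASCII first leaves every Hebrew choice empty. That would explain why passing scorer=fuzz.ratio helps, but treat it as a guess until you check the fuzzywuzzy source for your version.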