Running Tika through tika-python in Windows produces encoding errors


I have Python code that extracts text from PDF files using the Tika server through tika-python. It stores the resulting output in individual JSON files.

The command I run to execute the script is:

python extraction.py <full path local directory> 

I'm using Python 3.5.

It works perfectly on several different MacBook Pro computers. It doesn't work as expected on Windows, using an up-to-date Windows 10.

Some PDF files are processed, but others produce an error such as:

'charmap' codec can't encode characters in position 3648-3649: character maps to <undefined>
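
For illustration, here is a minimal sketch of how this kind of error can arise. It is not the original extraction.py (which is not shown), and it assumes the failing step encodes the extracted text with a Windows ANSI code page such as cp1252. The sample string and the helper name try_encode are hypothetical.

```python
# Hypothetical illustration; this is not the original extraction.py.
# Sample text containing characters that cp1252 cannot represent.
text = "caf\u00e9 \u2192 \u4e2d\u6587"


def try_encode(s, encoding):
    """Return (True, encoded bytes) on success, else (False, error message)."""
    try:
        return True, s.encode(encoding)
    except UnicodeEncodeError as exc:
        return False, str(exc)


# cp1252 is a charmap-based codec, so a failure is reported as 'charmap'.
ok_cp1252, detail = try_encode(text, "cp1252")
# UTF-8 can represent every Unicode code point in the sample.
ok_utf8, _ = try_encode(text, "utf-8")

print("cp1252:", ok_cp1252, detail)
print("utf-8:", ok_utf8)
```

If the cp1252 attempt fails while the UTF-8 attempt succeeds, the characters themselves are fine and the limiting factor is the target encoding.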

I have tried changing the code page to 65001 and changing the console font to Lucida Console, based on other questions posted on Stack Overflow, including 388490 (Unicode characters in Windows command line - how?), 14109024 (How to make Unicode charset in cmd.exe by default?), and 1259084 (What encoding/code page is cmd.exe using?).
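
As a diagnostic aid, the sketch below prints the encodings Python itself reports. It assumes nothing about the original script. One relevant detail: in Python 3.5, open() without an encoding argument uses locale.getpreferredencoding(False), which on Windows is typically the ANSI code page and may not follow the console code page.

```python
import locale
import sys

# Encoding used by print() when writing to the console; it may be None
# if stdout has been replaced by an object without an encoding attribute.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Default encoding used by open() in text mode when none is given.
file_default_encoding = locale.getpreferredencoding(False)

# Encoding used to convert file names to and from the operating system.
filesystem_encoding = sys.getfilesystemencoding()

print("stdout encoding:      ", stdout_encoding)
print("open() default:       ", file_default_encoding)
print("filesystem encoding:  ", filesystem_encoding)
```

Comparing the first two values on the Windows machine and on a Mac may show whether the failure comes from printing to the console or from writing files.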

I also tried installing ConEmu (http://conemu.github.io/en/unicodesupport.html) and changing the default encoding of the consoles.

Other references mention win_unicode_console (https://github.com/drekin/win-unicode-console), a Python patch, but the recommended instructions are not working on my machine.

I use the Anaconda Python distribution.

I am interested in knowing how to run the Python code on Windows without having these encoding problems. From what I have read, this is not a problem with the Python code or the Tika server, but rather a Windows encoding issue.
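
One possible approach, sketched under the assumption that the error is raised while writing the JSON files, is to pass an explicit UTF-8 encoding instead of relying on the platform default. The function write_json_utf8, the sample record, and the output file name are hypothetical stand-ins, not part of the original script.

```python
import json
import os
import tempfile


def write_json_utf8(record, path):
    """Write record as JSON, forcing UTF-8 rather than the platform default."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(record, fh, ensure_ascii=False)


# Stand-in for the text extracted from one PDF (assumption).
record = {"content": "caf\u00e9 \u2192 \u4e2d\u6587"}

out_path = os.path.join(tempfile.mkdtemp(), "sample.json")
write_json_utf8(record, out_path)

# Read the file back with the same encoding to confirm a lossless round trip.
with open(out_path, encoding="utf-8") as fh:
    round_tripped = json.load(fh)

print(round_tripped == record)
```

If the error instead comes from print() calls, setting the PYTHONIOENCODING environment variable to utf-8 before running the script is another option that may be worth trying.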

Thank you all,

German

