go - Golang stdin reads German umlauts wrong


I'm from Germany, so I use the umlauts ä, ö, and ü. Go doesn't read them correctly from stdin.

When I execute this simple program:

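    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        // Create the reader once so buffered input is not lost between reads.
        reader := bufio.NewReader(os.Stdin)
        for {
            b, _, err := reader.ReadLine()
            if err != nil {
                return // stop on EOF or a read error
            }
            printBytes(b)
        }
    }

    // printBytes prints each byte of the slice in hexadecimal.
    func printBytes(bytes []byte) {
        for _, b := range bytes {
            fmt.Printf("0x%x ", b)
        }
        fmt.Println()
    }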

I get this output:

    c:\dev\golang>go run test.go
    ä
    0xe2 0x80 0x9e

E2 80 9E isn't the correct UTF-8 byte sequence for ä (this tool tells me it's "double low-9 quotation mark" -> „), and when I print out what I've read, it prints „. I've written a small "hack" that seems to read the characters correctly:

    package main

    /*
    #include <stdio.h>
    #include <stdlib.h>

    // Reads one line (including the trailing '\n') from stdin into a
    // heap-allocated buffer that grows as needed.
    char * getline(void) {
        char * line = malloc(100), * linep = line;
        size_t lenmax = 100, len = lenmax;
        int c;

        if(line == NULL)
            return NULL;

        for(;;) {
            c = fgetc(stdin);
            if(c == EOF)
                break;

            if(--len == 0) {
                len = lenmax;
                char * linen = realloc(linep, lenmax *= 2);

                if(linen == NULL) {
                    free(linep);
                    return NULL;
                }
                line = linen + (line - linep);
                linep = linen;
            }

            if((*line++ = c) == '\n')
                break;
        }
        *line = '\0';
        return linep;
    }

    void freeline(char* ptr) {
        free(ptr);
    }
    */
    import "C"

    import (
        "fmt"
        "golang.org/x/text/encoding/charmap"
    )

    // getLineFromCp850 reads a raw line via C and decodes it from CP850 to UTF-8.
    func getLineFromCp850() string {
        line := C.getline()
        goline := C.GoString(line)
        C.freeline(line)
        b := []byte(goline)
        ub, _ := charmap.CodePage850.NewDecoder().Bytes(b)
        return string(ub)
    }

    func main() {
        for {
            line := getLineFromCp850()
            printBytes([]byte(line))
        }
    }

    func printBytes(bytes []byte) {
        for _, b := range bytes {
            fmt.Printf("0x%x ", b)
        }
        fmt.Println()
    }

And it prints out:

    c:\dev\golang>go run test.go
    ä
    0xc3 0xa4 0xa

C3 A4 is the correct byte sequence for ä (0A is the line feed, which the hack doesn't strip). So it seems that reading the bytes and converting them from CP850 to UTF-8 does the job, as expected. But why does Go give me gibberish when I read a line using Go's own functionality instead of cgo? What is going wrong so that Go gives me these values? Doesn't it interpret the input bytes as the CP850 charset? Is there a better Go-only way to handle this problem?

This problem only arises when reading from stdin. When I print a UTF-8 ä to stdout, it prints correctly in the console.
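The two byte sequences from the outputs above can be checked with Go's standard `unicode/utf8` package. The following is only a small sketch; the helper name `describe` is made up for this illustration and is not part of the original programs.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical helper) decodes the first UTF-8 encoded rune in b
// and reports its code point and the number of bytes it occupies.
func describe(b []byte) string {
	r, size := utf8.DecodeRune(b)
	if r == utf8.RuneError && size <= 1 {
		return "invalid UTF-8"
	}
	return fmt.Sprintf("U+%04X (%d bytes)", r, size)
}

func main() {
	// Bytes printed by the pure Go program.
	fmt.Println(describe([]byte{0xE2, 0x80, 0x9E})) // U+201E (3 bytes)
	// Bytes printed by the cgo hack (without the trailing line feed).
	fmt.Println(describe([]byte{0xC3, 0xA4})) // U+00E4 (2 bytes)
}
```

U+201E is the double low-9 quotation mark and U+00E4 is ä, which matches what the outputs suggest.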

So this was a bug in Go on Windows systems, specifically on Windows systems where the overall charset and the console charset differ (where the GetACP() and GetConsoleCP() WinAPI functions return different things). In Germany, for example (and maybe in other West European countries), Windows uses code page 1252 as the overall charset but uses code page 850 in the cmd.exe console. I'm not sure why, but that's how it is. Go wrongly used GetACP() to decode the input to UTF-8 when it should have used the code page returned by GetConsoleCP(). We found the problem in the issue I created, and we'll see the fix merged in the next version of Go.
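The effect of picking the wrong code page can be reproduced without Windows. The sketch below assumes the console delivered the single byte 0x84 for ä (its CP850 encoding) and decodes it with two deliberately tiny, partial lookup tables; the names `cp850`, `cp1252`, and `decode` are made up for this illustration, and real code would use complete tables such as those in `golang.org/x/text/encoding/charmap`.

```go
package main

import "fmt"

// Partial tables for illustration only: just two bytes of each code page.
var cp850 = map[byte]rune{0x84: '\u00E4', 0x94: '\u00F6'}  // ä, ö
var cp1252 = map[byte]rune{0x84: '\u201E', 0x94: '\u201D'} // „, ”

// decode (hypothetical helper) converts single-byte encoded input into a
// UTF-8 Go string using the given table. ASCII passes through unchanged;
// bytes missing from the partial table become U+FFFD.
func decode(table map[byte]rune, in []byte) string {
	out := make([]rune, 0, len(in))
	for _, b := range in {
		if b < 0x80 {
			out = append(out, rune(b))
			continue
		}
		r, ok := table[b]
		if !ok {
			r = '\uFFFD'
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	raw := []byte{0x84} // assumed console input for ä in CP850

	fmt.Printf("% x\n", decode(cp850, raw))  // c3 a4
	fmt.Printf("% x\n", decode(cp1252, raw)) // e2 80 9e
}
```

Decoding with the CP1252 table yields exactly the E2 80 9E sequence from the question, which is consistent with the input having been interpreted using the GetACP() code page instead of the console one.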

We also found a problem where, on Windows, Go decoded characters into decomposed UTF-8 characters (i.e. it read ä as the character a followed by a combining diaeresis ̈), which led to other problems; for example, printing the decomposed characters shows them as separate characters instead of one combined character.
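The difference between the precomposed and the decomposed form is visible in the raw bytes. This is a minimal sketch using only the standard library; the helper name `summary` is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// summary (hypothetical helper) returns the number of code points in s
// and its UTF-8 bytes as space-separated hex.
func summary(s string) (int, string) {
	return utf8.RuneCountInString(s), fmt.Sprintf("% x", s)
}

func main() {
	composed := "\u00e4"    // ä as a single precomposed code point
	decomposed := "a\u0308" // a followed by U+0308 COMBINING DIAERESIS

	n, hex := summary(composed)
	fmt.Println(n, hex) // 1 c3 a4

	n, hex = summary(decomposed)
	fmt.Println(n, hex) // 2 61 cc 88

	// The two strings look alike when rendered but are not equal as bytes.
	fmt.Println(composed == decomposed) // false
}
```

If decomposed input has to be handled, one possible approach (not something the original post describes) is to normalize it to NFC, for example with `norm.NFC.String` from `golang.org/x/text/unicode/norm`.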

