go - Golang stdin reads German umlauts wrong
I'm from Germany and use umlauts such as ä, ö, and ü. Go doesn't read them correctly from stdin.
When I execute this simple program:
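    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        // Create the reader once so buffered input is not lost between lines.
        reader := bufio.NewReader(os.Stdin)
        for {
            b, _, err := reader.ReadLine()
            if err != nil {
                return // stop on EOF or read error
            }
            printBytes(b)
        }
    }

    // printBytes prints every byte of the line as a hex value.
    func printBytes(bytes []byte) {
        for _, b := range bytes {
            fmt.Printf("0x%x ", b)
        }
        fmt.Println()
    }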
I get this output:

    C:\dev\golang>go run test.go
    ä
    0xe2 0x80 0x9e
E2 80 9E isn't the correct byte sequence for ä in UTF-8 (this tool tells me it's the "double low-9 quotation mark" -> „), and when I print out what I've read, it prints ". I've written a small "hack" which seems to read the characters correctly:
    package main

    /*
    #include <stdio.h>
    #include <stdlib.h>

    // Reads one line (including the trailing '\n') from stdin into a
    // heap buffer that grows as needed. The caller must free the result.
    char * getline(void) {
        char * line = malloc(100), * linep = line;
        size_t lenmax = 100, len = lenmax;
        int c;

        if(line == NULL)
            return NULL;

        for(;;) {
            c = fgetc(stdin);
            if(c == EOF)
                break;

            if(--len == 0) {
                len = lenmax;
                char * linen = realloc(linep, lenmax *= 2);

                if(linen == NULL) {
                    free(linep);
                    return NULL;
                }
                line = linen + (line - linep);
                linep = linen;
            }

            if((*line++ = c) == '\n')
                break;
        }
        *line = '\0';
        return linep;
    }

    void freeline(char* ptr) {
        free(ptr);
    }
    */
    import "C"

    import (
        "fmt"

        "golang.org/x/text/encoding/charmap"
    )

    // getLineFromCp850 reads a raw line via C and decodes it from CP850 to UTF-8.
    func getLineFromCp850() string {
        line := C.getline()
        goline := C.GoString(line)
        C.freeline(line)
        b := []byte(goline)
        ub, _ := charmap.CodePage850.NewDecoder().Bytes(b)
        return string(ub)
    }

    func main() {
        for {
            line := getLineFromCp850()
            printBytes([]byte(line))
        }
    }

    // printBytes prints every byte of the line as a hex value.
    func printBytes(bytes []byte) {
        for _, b := range bytes {
            fmt.Printf("0x%x ", b)
        }
        fmt.Println()
    }
And it prints out:

    C:\dev\golang>go run test.go
    ä
    0xc3 0xa4 0xa
C3 A4 is the correct byte sequence for ä (0A is the line feed, which the hack doesn't strip). So it seems that reading the line in C and converting it from CP850 to UTF-8 does the job, as expected. But why does Go give me gibberish when I read the line using Go's own functionality instead of cgo? What's wrong with the values Go gives me, and why doesn't it interpret the input bytes using the CP850 charset? Is there a better Go-only way to handle this problem?
This problem arises only when reading from stdin. When I print a UTF-8 ä to stdout, it prints correctly in the console.
So this was a bug in Go on Windows systems, specifically on Windows systems where the overall charset and the console charset are different (where the WinAPI functions GetACP() and GetConsoleCP() return different things). In Germany, for example (and maybe in other West European countries), Windows uses codepage 1252 as the overall charset but uses codepage 850 for the console in cmd.exe. I'm not sure why, but that's how it is. Go wrongly used the codepage from GetACP() to decode the input to UTF-8 when it should have used the codepage returned by GetConsoleCP(). We found the problem in the issue I created, and we'll see the fix merged in the next version of Go.
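The mismatch described above can be reproduced without a Windows console. The following is only a minimal sketch: the two lookup tables are hand-written, hypothetical subsets covering just the bytes 0x84 and 0x94 (not full codepage tables), and `decode` and `utf8Hex` are illustrative helper names. It shows that the same input byte 0x84 yields the bytes from the question depending on which codepage is assumed.

```go
package main

import "fmt"

// Hypothetical partial tables: only the two bytes needed for this demo.
// In CP850, 0x84 is ä and 0x94 is ö.
var cp850Subset = map[byte]rune{0x84: 'ä', 0x94: 'ö'}

// In Windows-1252, 0x84 is „ (U+201E) and 0x94 is ” (U+201D).
var cp1252Subset = map[byte]rune{0x84: '\u201E', 0x94: '\u201D'}

// decode maps a single input byte to a rune using the given table.
// ASCII bytes map to themselves; unknown bytes become U+FFFD.
func decode(b byte, table map[byte]rune) rune {
	if b < 0x80 {
		return rune(b)
	}
	if r, ok := table[b]; ok {
		return r
	}
	return '\uFFFD'
}

// utf8Hex returns the UTF-8 encoding of r as space-separated hex bytes.
func utf8Hex(r rune) string {
	return fmt.Sprintf("% x", []byte(string(r)))
}

func main() {
	const raw = 0x84 // the byte cmd.exe delivers for ä under CP850

	fmt.Println("as CP850: ", utf8Hex(decode(raw, cp850Subset)))  // c3 a4
	fmt.Println("as CP1252:", utf8Hex(decode(raw, cp1252Subset))) // e2 80 9e
}
```

Decoding with the console codepage gives C3 A4 (ä), while decoding with codepage 1252 gives E2 80 9E („), which matches the output seen in the question.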
We also found a problem where, on Windows, Go decoded characters into decomposed UTF-8 characters (i.e. it read ä as the character a followed by a combining diaeresis ̈), which can lead to other problems. For example, printing the decomposed characters prints them separately instead of as one combined character.
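To make the difference between the composed and decomposed forms concrete, here is a small sketch using only the standard library. The helper name `describe` is illustrative. It does not reproduce the Windows behaviour itself; it only shows how the two representations of ä differ at the byte and rune level.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the rune count and the UTF-8 bytes of s.
func describe(s string) string {
	return fmt.Sprintf("%d rune(s), bytes % x", utf8.RuneCountInString(s), []byte(s))
}

func main() {
	composed := "\u00e4"    // ä as a single code point
	decomposed := "a\u0308" // a followed by a combining diaeresis

	fmt.Println(describe(composed))     // 1 rune(s), bytes c3 a4
	fmt.Println(describe(decomposed))   // 2 rune(s), bytes 61 cc 88
	fmt.Println(composed == decomposed) // false
}
```

Because Go compares strings byte by byte, the two forms are not equal even though they are meant to render as the same character. Normalizing input to one form before comparing or printing is one way to avoid surprises.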