From 8a095bc79f8502c77f60f2397617264aa55d89ff Mon Sep 17 00:00:00 2001
From: "Kay Marquardt (Gnadelwartz)"
Date: Mon, 4 Jan 2021 14:49:04 +0100
Subject: [PATCH] doc: add known locale problems

---
 doc/4_expert.md   | 56 +++++++++++++++++++++++++++++++++++++++++++++--
 modules/jsonDB.sh |  2 +-
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/doc/4_expert.md b/doc/4_expert.md
index 19a4171..ca46ef3 100644
--- a/doc/4_expert.md
+++ b/doc/4_expert.md
@@ -35,7 +35,59 @@ export 'LANGUAGE=den_US.UTF-8'
 ```
 3. make sure your bot scripts use the correct settings, eg. include the lines above at the beginning of your scripts
 
-To display all available locales on your system run `locale -a | more`. [Gentoo Wiki](https://wiki.gentoo.org/wiki/UTF-8)
+#### Known UTF-8 pitfalls
+
+##### Missing locale C
+
+Although required by the POSIX standard, some systems (e.g. Manjaro Linux) have no locale `C` and `C.UTF-8` installed.
+If bashbot displays a warning about a missing locale, you must install locale `C` and `C.UTF-8`.
+
+If you don't know which locales are installed on your system, use `locale -a | more` to display them.
+[Gentoo Wiki](https://wiki.gentoo.org/wiki/UTF-8).
+
+
+##### Character classes
+
+In ASCII times it was clear that `[:lower:]` and `[a-z]` mean ONLY the lowercase letters `[abcd...xyz]`.
+With the introduction of locales, character classes and ranges contain every character fitting the class definition.
+
+This means `[:lower:]` and `[a-z]` contain ALL lowercase letters, e.g. `ä á ø dž ȼ`,
+see also [Unicode Latin lowercase letters](https://www.fileformat.info/info/unicode/category/Ll/list.htm)
+
+If that's ok for your script you're fine, but many scripts rely on the idea of ASCII ranges and may produce undesired results.
+
+```bash
+# try with different locales ...
+lowercase="abcäöü"
+
+[[ "$lowercase" =~ ^[a-z]+$ ]] && echo "String is all lower case"
+
+LANG="en_US.UTF-8"
+[[ "$lowercase" =~ ^[a-z]+$ ]] && echo "String is all lower case"
+
+LANG="C"
+[[ "$lowercase" =~ ^[a-z]+$ ]] && echo "String is all lower case"
+```
+
+There are three solutions:
+
+1. list exactly the characters you want: `[abcd...]`
+2. instruct bash to use `C` locale for ranges: `shopt -s "globasciiranges"`
+3. use `LC_COLLATE` to change behavior of all programs: `export LC_COLLATE=C`
+
+
+To work independently of language and bash settings, bashbot uses solution 1 and its own "classes" if an exact match is mandatory:
+
+```bash
+azazaz='abcdefghijklmnopqrstuvwxyz' # a-z :lower:
+AZAZAZ='ABCDEFGHIJKLMNOPQRSTUVWXYZ' # A-Z :upper:
+R090909='0123456789' # 0-9 :digit:
+azAZaz="${azazaz}${AZAZAZ}" # a-zA-Z :alpha:
+azAZ09="${azAZaz}${R090909}" # a-zA-Z0-9 :alnum:
+
+# e.g. characters allowed for key in key/value pairs
+JSSH_KEYOK="[-${azAZ09},._]"
+```
 
 #### Bashbot UTF-8 Support
 Bashbot handles all messages transparently, regardless of the charset in use. The only exception is when converting from JSON data to strings.
@@ -378,5 +430,5 @@ for every poll until the maximum of BASHBOT_SLEEP ms.
 #### [Prev Advanced Use](3_advanced.md)
 #### [Next Best Practice](5_practice.md)
 
-#### $$VERSION$$ v1.21-0-gc85af77
+#### $$VERSION$$ v1.21-4-g966ee5d
 
diff --git a/modules/jsonDB.sh b/modules/jsonDB.sh
index 4dfecb6..d27288b 100644
--- a/modules/jsonDB.sh
+++ b/modules/jsonDB.sh
@@ -5,7 +5,7 @@
 # This file is public domain in the USA and all free countries.
 # Elsewhere, consider it to be WTFPLv2. (wtfpl.net/txt/copying)
 #
-#### $$VERSION$$ v1.21-3-ge072afa
+#### $$VERSION$$ v1.21-4-g966ee5d
 #
 # source from commands.sh to use jsonDB functions
 #