upper-case/lower-case with umlauts?

HPW
Posts: 1390
Joined: Thu Sep 26, 2002 9:15 am
Location: Germany

Post by HPW »

Is there any way to support upper-case/lower-case with umlauts?

Example:

(upper-case "Testöäüß")

gives:

"TESTöäüß"

One possibility is to define my own function with parsing/replacing the umlauts. Other ideas?
Hans-Peter
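
The hand-written approach mentioned above could be sketched in C roughly as follows. This is only an illustration under assumptions: the function name `upper_latin1_de` is hypothetical, and the text is assumed to be single-byte ISO-8859-1 (Latin-1), where ä, ö, ü sit at 0xE4, 0xF6, 0xFC and their upper-case forms are 32 lower. The sharp s (0xDF) has no single-byte upper-case form and is left alone.

```c
/* Upper-case a NUL-terminated string in place.
   Assumption: single-byte ISO-8859-1 (Latin-1) text. */
void upper_latin1_de(unsigned char *s)
{
    for (; *s != '\0'; ++s) {
        if (*s >= 'a' && *s <= 'z') {
            *s = (unsigned char)(*s - 32);   /* a-z -> A-Z */
        } else if (*s == 0xE4 || *s == 0xF6 || *s == 0xFC) {
            *s = (unsigned char)(*s - 32);   /* ä ö ü -> Ä Ö Ü */
        }
        /* everything else, including ß (0xDF), stays unchanged */
    }
}
```

A table like this only covers the characters it lists and breaks as soon as the text uses another encoding, which is why a locale-based solution is more attractive.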

nigelbrown
Posts: 429
Joined: Tue Nov 11, 2003 2:11 am
Location: Brisbane, Australia

Post by nigelbrown »

I found
http://mail.python.org/pipermail/python ... 04165.html
that discusses toupper conversion (in the context of C libraries in Python?)
viz:<quote>
> On POSIX systems there are a several environment variables used to
> control the default locale settings for a users session. For example
> on my SuSE Linux system currently running in the german locale the
> environment variable LC_CTYPE=de_DE is automatically set by a file
> /etc/profile during login, which causes automatically the C-library
> function toupper('ä') to return an 'Ä' ---you should see
> a lower case a-umlaut as argument and an upper case umlaut as return
> value--- without having all applications to call 'setlocale' explicitly.
>
> So this simply works well as intended without having to add calls
> to 'setlocale' to all application program using this C-library functions.

I don't believe that. According to the ANSI standard, a C program
*must* call setlocale(LC_..., "") if it wants the environment
variables to be honored; without this call, the locale is always the
"C" locale, which should *not* honor the environment variables.
<end quote>

This suggests that newLISP could set the locale itself, which would lead to the correct conversion (if Borland supports it).
I'm not currently at a computer with the Borland compiler installed, so I haven't looked at the Borland docs.
Regards
Nigel
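
The point made in the quoted reply can be sketched in C. The helper name `upper_in_locale` is hypothetical; `setlocale` and `toupper` are the standard library calls. In the "C" locale only a-z are upper-cased, so a byte such as 0xE4 comes back unchanged. What happens after selecting "" depends on the environment and the C library, so no particular result is claimed for that case.

```c
#include <ctype.h>
#include <locale.h>

/* Upper-case one byte after selecting the given LC_CTYPE locale.
   Pass "C" for the portable default or "" to adopt the user's environment.
   Note: this changes the locale of the whole process as a side effect.
   Returns the byte unchanged if the locale cannot be selected. */
int upper_in_locale(const char *locale_name, unsigned char byte)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return byte;
    return toupper(byte);
}
```

Calling `upper_in_locale("C", 0xE4)` returns 0xE4, which matches the behaviour seen in the original question.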

Post by nigelbrown »

Further to my earlier reply: from the Borland helpfile BCB5.HLP
<quote>
Syntax

#include <locale.h>
char *setlocale(int category, const char *locale);
wchar_t * _wsetlocale( int category, const wchar_t *locale);

Description

Use the setlocale to select or query a locale.

Borland C++ supports all locales supported in NT 3.5x and Win95/NT 4.0 operating systems. See your system documentation for details.

The possible values for the category argument are as follows:

Value        Affect

LC_ALL       Affects all the following categories
LC_COLLATE   Affects strcoll and strxfrm
LC_CTYPE     Affects single-byte character handling functions. The mbstowcs and mbtowc functions are not affected.
<end quote>...
<quote>
To take advantage of dynamically loadable locales in your application, define __USELOCALES__ for each module. If __USELOCALES__ is not defined, all locale-sensitive functions and macros will work only with the default C locale.
<end quote>
This could be tried.
Regards
Nigel

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

Thanks for the pointers Nigel.

I left "newlisp.exe.7305" and "README_7305.txt" in the development directory. This version tries to set the locale of your country automatically. Look into the "README_7305.txt" for instructions.

I had the opportunity to log on to a German Linux computer and it worked well doing the uppercase on the German character set.

I have not had the chance to test a Borland C build on a German machine, but yes, the Borland compiler supports the setlocale() function. It is now called automatically in newLISP and is also available as a builtin function.

The builtin function does not return a correct value (the locale 'C' or 'POSIX' or country code) as it does on CYGWIN and Linux, but perhaps it does it on a German Windows. Here in US BorlandC I get a NULL-pointer which I am converting to 'nil'. In US CYGWIN I get "C" and in German Linux I get "de_DE".

So give it a try and tell me what's happening.

Lutz

ps: FP exception in this version behaves like on UNIX, also changes for big-buffer in Tcl/Tk NewlispEvaluateBuffer(), but still working on newlisp-tk.tcl.

Post by nigelbrown »

Lutz
README_7305.txt is missing
Nigel

Post by HPW »

newLISP v7.3.5 Copyright (c) 2003 Lutz Mueller. All rights reserved.

> (upper-case "ASDasdöäüßÖÄÜ")
"ASDASDÖÄÜßÖÄÜ"
> (lower-case "ASDasdöäüßÖÄÜ")
"asdasdöäüßöäü"
>
Works well on WIN2K PRO and WIN XP PRO (Both German)!
Great!
Hans-Peter

Post by Lutz »

I am glad 'locale' works! Another problem solved. About 50% of newLISP users are outside USA, so this was important.

BTW, readme_7305.txt is now visible in the development directory, it would be good to hear from other countries too.

Lutz

Post by nigelbrown »

Using the functions in the readme, I noticed that the square root sign \251 is changed by upper-case.
Compare 7.3.3, which leaves the sqrt sign alone:
newLISP v7.3.3 Copyright (c) 2003 Lutz Mueller. All rights reserved.

> (char "\251")
251
> (char (upper-case "\251"))
251
> (char (upper-case "a"))
65
> (char "a")
97
>
with 7.3.5 de novo that seems to subtract 32 to make it "uppercase":

C:\temp>newlisp
newLISP v7.3.5 Copyright (c) 2003 Lutz Mueller. All rights reserved.

> (char "\251")
251
> (char (upper-case "\251"))
219
> (set-locale 0)
nil
>


I'm in Australia with Win XP Pro

Regards
Nigel

Post by nigelbrown »

Sorry if my last post caused confusion. The variation in console fonts on Windows systems is the source of it.
My comments above on the sqrt sign apply when using the "Lucida Console" font in the DOS box on my Win98SE setup, which has the sqrt sign at \251. However, I now see that many fonts have u-hat û at that value, which is converted to upper case by subtracting 32.
Also, although the Command Prompt box on my WinXP Pro is set to Lucida Console, which 'should' have u-hat, a line-draw symbol actually appears:

C:\temp>newlisp
newLISP v7.3.5 Copyright (c) 2003 Lutz Mueller. All rights reserved.

> "\251"
"¹"
> (upper-case "\251")
"█"
>
(I note that here \251 is ¹, superscript 1, and upper-cases to a block.)

Very confusing in the upper decimal characters.

Nigel
I guess the use of an unexpected display font will muddy the waters.
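
One likely reading of the two posts above is that the same byte is simply being drawn through different code pages. The C sketch below records what bytes 251 (0xFB) and 219 (0xDB) stand for in three common single-byte code pages. The table and the function name `glyph_name` are illustrative only; which code page a given console or C runtime actually uses is an assumption that has to be checked on each system.

```c
#include <string.h>

struct glyph {
    const char   *code_page;
    unsigned char byte;
    const char   *name;
};

/* Meaning of two byte values in three single-byte code pages. */
static const struct glyph glyphs[] = {
    { "CP437",      0xFB, "square root sign" },
    { "CP437",      0xDB, "full block" },
    { "CP850",      0xFB, "superscript one" },
    { "CP850",      0xDB, "full block" },
    { "ISO-8859-1", 0xFB, "small u with circumflex" },
    { "ISO-8859-1", 0xDB, "capital U with circumflex" },
};

/* Look up the glyph name, or "unknown" if the pair is not in the table. */
const char *glyph_name(const char *code_page, unsigned char byte)
{
    size_t i;
    for (i = 0; i < sizeof glyphs / sizeof glyphs[0]; ++i) {
        if (glyphs[i].byte == byte &&
            strcmp(glyphs[i].code_page, code_page) == 0)
            return glyphs[i].name;
    }
    return "unknown";
}
```

So 251 - 32 = 219 is a sensible case mapping only when the byte is read as Latin-1 (û to Û). Seen through CP437 or CP850, the same arithmetic turns a sqrt sign or a superscript 1 into a block.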

Post by Lutz »

I am still exploring the whole 'locale' thing; this is what I have done so far. On startup, newLISP 7.3.5 and later does a:

(set-locale 0xFF "") ; switch all option on for your locale

Internally it does: setlocale(LC_ALL, ""), LC_ALL is defined in locale.h as 0xff.

To go back to a pre 7.3.5 status I think you would do a:

(set-locale 0 "C") ; switch to ISO 'C' locale available in all countries

When set-locale is given only the first parameter it is supposed to return the current locale (newLISP passes null to setlocale(option, null)). When giving "" as the second it is supposed to switch to the local locale.

The question is: how should I distribute newLISP?

(1) With locale switching as the default. This broke Turtle.lsp in Germany because of the decimal comma in floats, but made upper-case etc. work right away for HPW.

(2) With the ISO 'C' locale as the default, as before 7.3.5. This guarantees a newLISP that behaves the same way everywhere in the world, but may not be practical for writing everyday applications.

I haven't documented 'set-locale' yet because I want to clear up these questions first. I wonder what other languages do, e.g. Perl or Python. Hans-Peter, how is it in Germany with Perl and Python?

Lutz
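
The decimal-comma problem comes from LC_NUMERIC being switched along with everything else. At the C level, one way to keep locale-aware case conversion while still producing a decimal point is to switch LC_NUMERIC to "C" only around number formatting and restore it afterwards. The sketch below uses standard library calls only; the function name `format_c_numeric` is hypothetical, and this is not a description of how newLISP does it.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format a double with a decimal point, whatever locale is active.
   The current LC_NUMERIC setting is saved, "C" is selected for the
   formatting call, and the saved setting is restored afterwards. */
int format_c_numeric(char *buf, size_t size, double value)
{
    char *saved = NULL;
    const char *current = setlocale(LC_NUMERIC, NULL);  /* query only */
    int written;

    if (current != NULL) {
        /* copy the name: later setlocale calls may overwrite it */
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    setlocale(LC_NUMERIC, "C");
    written = snprintf(buf, size, "%.2f", value);

    if (saved != NULL) {
        setlocale(LC_NUMERIC, saved);
        free(saved);
    }
    return written;
}
```

A program could alternatively select only LC_CTYPE at startup instead of LC_ALL, so that case conversion follows the user's locale while number formatting stays in the "C" locale.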

Post by HPW »

> (set-locale 0 "C") 
"C"
> (upper-case "asdöäüÖÄÜß")
"ASDöäüÖÄÜß"

> (set-locale 0xFF "") 
"LC_MONETARY=German_Germany.850\nLC_TIME=German_Germany.850\nLC_NUMERIC=German_Germany.850\nLC_COLLATE=German_Germany.850\nLC_CTYPE=German_Germany.850\n"
> (upper-case "asdöäüÖÄÜß")
"ASDÖÄÜÖÄÜß"
With locale "C" original Turtle.lsp works.
With german original Turtle has the bug.

>(1) with locale switching as the default.

Yes, I would prefer it as the default. It should be well documented.
Put the switch code above into Turtle.lsp and switch temporarily back to "C" inside Turtle.lsp. Then everyone can look at the sample code to see how to avoid such problems. Works for me here with 7.3.7.

;; Turtle.lsp - graphics demo for newLISP-tk
;;
;; to run: (Turtle:run)
;;
;;
;;

(set-locale 0 "C")

...
...


(define (run )
  (tk "if {[winfo exists .tw] == 1} {destroy .tw}")
  (tk "toplevel .tw")
  (tk "canvas .tw.can -width 500 -height 400 -bg #FFFEC0")
  (tk "pack .tw.can")
  (tk "wm geometry .tw +100+160")
  (tk "wm title .tw { Turtle.lsp}")
  (tk ".tw.can create text 380 70 -fill navy -font {Times 12} -text {Dragon Fractal}")
  (tk ".tw.can create text 100 350 -fill navy -font {Times 16} -text {Turtle Graphics}")
  (tk "after 300; update idletasks")
  (turtle-start 300 50)
  (dragon-curve 12 "red")
  (draw)
  (turtle-start 120 200)
  (rose "blue")
  (set-locale 0xFF ""))

>Perl or Python. Hans-Peter how is it in Germany with Perl, Python ?!

I have to investigate and ask my Python colleague.
Hans-Peter

Post by nigelbrown »

An extensive discussion of perl locale use and issues is at:
http://www.perldoc.com/perl5.6/pod/perllocale.html

An issue for newLISP is how to have upper-case and regex work with the same locale. The PCRE lib docs suggest that quite a bit of fiddling is needed (compiling custom tables for each desired charset) for PCRE to be locale-aware. Otherwise upper case will mean different things to regex and to upper-case.

Perhaps the standard newLISP could work with the default locale (option (2) of Lutz's post), with a special compilation flag used if a locale-aware build is desired. The Perl locale discussion shows what a can of worms locale can open.

Nigel

Post by Lutz »

For now we will leave locale switching in as the default in the development versions and document everything well until more research is done.

In PCRE there are different issues. I looked through the code and there is no locale switching. Everything seems to be dependent on character tables, which are generated before compiling. PCRE may be a reason not to switch the locale automatically, but to distribute newLISP with (set-locale 0 "C").

Unfortunately I don't know how find/replace/regex perform, e.g. on case-specific operations, when the locale is switched.

Again, I think I have to do some reading first, to figure out how others solve these issues.

Lutz
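
The table issue can be illustrated with a small C analogy. This is not PCRE's code, and the function name `build_upper_table` is hypothetical. A 256-entry table filled from `toupper` is a snapshot of whatever locale was active when it was built. If the program later switches locale, `toupper` may start giving different answers while the table stays as it was.

```c
#include <ctype.h>
#include <locale.h>

/* Fill a 256-entry upper-case table from the C library, using the
   locale that is active at the moment of the call. The table does not
   follow later calls to setlocale. */
void build_upper_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; ++c)
        table[c] = (unsigned char)toupper(c);
}
```

A table built under the "C" locale maps only a-z, so a matcher using it would treat ä and Ä as unrelated even after the rest of the program has switched to a German locale.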

Post by Lutz »

See new thread "Localization in newLISP" in "Lisp in general" group.

Lutz
