How to take one byte from a string

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

I'm trying to read a string byte by byte (to convert it from an 8-bit codepage to UTF-8). But (pop the-string) returns a varying number of bytes, and so does (the-string 0), etc.:
http://img7.imageshost.ru/imgs/091008/3 ... /11005.png

(set-locale "C") did not help either. The only working way I have found is to write a temporary file and then use the read-char function.

Code:

; Usage: (cyr-win-utf "text in windows-1251 encoding")
; Decodes text from windows-1251 to utf-8
(define (cyr-win-utf t-linea)
  ; Loading encoding table
  (set 'en-win-1251 '((255 "я") (254 "ю") (253 "э") (252 "ь") (251 "ы") 
  (250 "ъ") (249 "щ") (248 "ш") (247 "ч") (246 "ц") (245 "х") (244 "ф") 
  (243 "у") (242 "т") (241 "с") (240 "р") (239 "п") (238 "о") (237 "н")
  (236 "м") (235 "л") (234 "к") (233 "й") (232 "и") (231 "з") (230 "ж")
  (184 "ё") (229 "е") (228 "д") (227 "г") (226 "в") (225 "б") (224 "а")
  (223 "Я") (222 "Ю") (221 "Э") (220 "Ь") (219 "Ы") (218 "Ъ") (217 "Щ")
  (216 "Ш") (215 "Ч") (214 "Ц") (213 "Х") (212 "Ф") (211 "У") (210 "Т") 
  (209 "С") (208 "Р") (207 "П") (206 "О") (205 "Н") (204 "М") (203 "Л")
  (202 "К") (201 "Й") (200 "И") (199 "З") (198 "Ж") (168 "Ё") (197 "Е") 
  (196 "Д") (195 "Г") (194 "В") (193 "Б") (192 "А")))
  ; saving string to a temp file
  (set 't-file-name (append "/tmp/" (crypto:md5 (string (random)))))
  (write-file t-file-name t-linea)
  ; loading characters to the t-out
  (set 't-out "")
  (set 't-file (open t-file-name "read"))
  (while (set 't-char (read-char t-file))
    (push (or (lookup t-char en-win-1251) (char t-char)) t-out -1))
  (close t-file)
  t-out)

Maybe there is a shorter way, without writing a file? I need this function on both Linux and Windows, and the Windows temp directory has a different name.

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latiitude 50N longitude 3W
Contact:

Post by cormullion »

Does unpack help at all?

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

cormullion wrote: Does unpack help at all?
Thank you! I think so, yes: "unpack" is a solution. The function is much shorter now:

Code:

(define (cyr-koi-utf-2 t-linea)
  ; putting character codes to the list
  (set 't-list (unpack (dup "b" (mul 2 (length t-linea))) t-linea))
  ; decoding characters from 't-list to the 't-out
  (set 't-out "")
  (dolist (t-char t-list)
    (push (or (lookup t-char en-koi8r) (char t-char)) t-out -1))
  t-out)

It works OK. I found a funny thing, by the way. Manual says: "Length... returns... the number of characters in a string". But (length "one-russian-letter-in-utf-8") returns 2, not 1.
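
For reference, a minimal sketch of the same byte-list idea, assuming length counts bytes (as the result of 2 for one UTF-8 letter suggests), so one "b" (unsigned byte) format character per byte is enough. The name string-bytes is hypothetical, not a built-in:

```newlisp
; Hypothetical helper (not part of newLISP): list the bytes of s as integers 0..255.
; Assumes (length s) returns the byte count, so no doubling is needed.
(define (string-bytes s)
  (unpack (dup "b" (length s)) s))

(println (string-bytes "AB"))
```

Each element is then a plain integer that can be passed to lookup, as in the function above.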

Jeff
Posts: 604
Joined: Sat Apr 07, 2007 2:23 pm
Location: Ohio
Contact:

Post by Jeff »

dostring processes a string one char at a time...

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

Jeff wrote: dostring processes a string one char at a time...
dostring takes several bytes from the string at a time, and I need only one byte:
http://img7.imageshost.ru/imgs/091008/e ... /4489d.png

m35
Posts: 171
Joined: Wed Feb 14, 2007 12:54 pm

Post by m35 »

Fritz wrote: Manual says: "Length... returns... the number of characters in a string". But (length "one-russian-letter-in-utf-8") returns 2, not 1.
What version of the manual are you using? The current manual says:
The manual wrote: Returns ... the number of bytes in a string.
There is also utf8len for UTF-8 strings.

I've run into trouble myself when treating strings as binary data. It would work fine in regular newLISP, then blow up when running in UTF-8 newLISP. I can't remember what I did to make things universal, though.

Edit: I looked at the functions in the manual and I see three functions that work with bytes regardless: unpack (as you know), slice, and get-char.

You could just loop over the bytes with slice or get-char:

Code:

(for (i 0 (- (length s) 1))
   (setq c (slice s i 1))
   ; or
   (setq c (char (get-char (+ i (address s)))))
)

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latiitude 50N longitude 3W
Contact:

Post by cormullion »

You could even use implicit slicing:

Code:

(offset length utf8-str)

but don't confuse it with implicit indexing:

Code:

(utf8-str offset length)

which works on characters, not bytes.

You can sometimes write code that works in both UTF-8 and non-UTF-8 versions. E.g.:

Code:

(define (string-length s)
    (if unicode (utf8len s) (length s)))

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

I think I have an old manual. That is good: now I can be sure my "unpack" will work in future versions too.
m35 wrote: You could just loop over the bytes with slice or get-char

Code:

(for (i 0 (- (length s) 1))
   (setq c (slice s i 1))
   ; or
   (setq c (char (get-char (+ i (address s)))))
)

Slice works, at least with a utf8 locale and an ASCII-encoded line. But get-char gives me only some strange negative numbers. Only this convoluted construction works:

Code:

(dotimes (i (length rln))
  (print (or (lookup (+ 256 (get-char (+ i (address rln)))) en-win-1251) "?")))
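
A possible explanation, stated as an assumption rather than something confirmed in this thread: get-char seems to return the byte as a signed 8-bit value here, so bytes 128..255 come back as -128..-1. Adding 256 corrects those, but it would push plain ASCII bytes (0..127) out of range. Masking with 0xFF should handle both cases. The name byte-at is hypothetical:

```newlisp
; Hypothetical helper (not part of newLISP): byte at offset i of string s, as 0..255.
; Assumption: get-char may return a signed value; the mask keeps the low 8 bits,
; which is harmless if the value is already unsigned.
(define (byte-at s i)
  (& 0xFF (get-char (+ i (address s)))))

(println (byte-at "A" 0))
(println (byte-at "\255" 0))
```

With this, the lookup key stays in the 0..255 range used by the en-win-1251 table for every byte, not only for the Cyrillic ones.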

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

cormullion wrote: You could even use implicit slicing:

Code:

(offset length utf8-str)

But how? ((address str) 1 1)?

cormullion
Posts: 2038
Joined: Tue Nov 29, 2005 8:28 pm
Location: latiitude 50N longitude 3W
Contact:

Post by cormullion »

How about:

Code:

(set 's "\004\003\002\001")
(for (i 0 3)
  (println (get-char (address (i 1 s)))))
4
3
2
1

where i is the offset, 1 is the length, and s is the string you're slicing...

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Post by Lutz »

You can do without 'address' if the argument is a string. This will do it too:

Code:

(for (i 0 3) (println (get-char (i 1 s))))

Fritz
Posts: 66
Joined: Sun Sep 27, 2009 12:08 am
Location: Russia

Post by Fritz »

Lutz wrote: You can do without 'address' if the argument is a string. This will do it too:

Code:

(for (i 0 3) (println (get-char (i 1 s))))

Code:

(get-char (address (i 1 s)))

always gives "0" as a result.

Code:

(get-char (i 1 s))

works, but only in this strange form:

Code:

(+ 256 (get-char (i 1 s)))

PS: it's a pity "explode" cannot work with raw bytes, so I cannot use "map".
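
One way map might still be used, sketched under the assumption that unpack with the "b" format returns one unsigned integer per byte: map over that integer list instead of over explode's output. The names decode-bytes and demo-table are hypothetical, and demo-table is only a two-entry stand-in for the full en-win-1251 list from the first post:

```newlisp
; Tiny illustrative table (a stand-in for the full en-win-1251 list).
(set 'demo-table '((192 "А") (193 "Б")))

; Hypothetical helper: decode s byte by byte, using table for known codes
; and falling back to (char b) for everything else (e.g. plain ASCII).
(define (decode-bytes s table)
  (join (map (fn (b) (or (lookup b table) (char b)))
             (unpack (dup "b" (length s)) s))))

(println (decode-bytes "\192\193!" demo-table))
```

The source file has to be saved as UTF-8 for the Cyrillic literals in the table to come out as UTF-8 in the result.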
