Q&A's, tips, howto's
			
		
		
			
				
																			
								Fritz 							 
									
		Posts:  66  		Joined:  Sun Sep 27, 2009 12:08 am 		
		
											Location:  Russia 
							
						
		 
		
						
						
													
							
						
									
								by Fritz   »  Wed Oct 07, 2009 9:15 pm 
			
			
			
			
			I'm trying to read a string byte by byte (to convert from an 8-bit codepage to UTF-8). But (pop the-string) returns a varying number of bytes, and so does (the-string 0), etc.:
http://img7.imageshost.ru/imgs/091008/3 ... /11005.png 
(set-locale "C") did not help either. The only working way I have found is to write a temporary file and then use the read-char function.
Code: Select all 
; Usage: (cyr-win-utf "text in windows-1251 encoding")
; Decodes text from windows-1251 to utf-8
(define (cyr-win-utf t-linea)
  ; Loading encoding table
  (set 'en-win-1251 '((255 "я") (254 "ю") (253 "э") (252 "ь") (251 "ы") 
  (250 "ъ") (249 "щ") (248 "ш") (247 "ч") (246 "ц") (245 "х") (244 "ф") 
  (243 "у") (242 "т") (241 "с") (240 "р") (239 "п") (238 "о") (237 "н")
  (236 "м") (235 "л") (234 "к") (233 "й") (232 "и") (231 "з") (230 "ж")
  (184 "ё") (229 "е") (228 "д") (227 "г") (226 "в") (225 "б") (224 "а")
  (223 "Я") (222 "Ю") (221 "Э") (220 "Ь") (219 "Ы") (218 "Ъ") (217 "Щ")
  (216 "Ш") (215 "Ч") (214 "Ц") (213 "Х") (212 "Ф") (211 "У") (210 "Т") 
  (209 "С") (208 "Р") (207 "П") (206 "О") (205 "Н") (204 "М") (203 "Л")
  (202 "К") (201 "Й") (200 "И") (199 "З") (198 "Ж") (168 "Ё") (197 "Е") 
  (196 "Д") (195 "Г") (194 "В") (193 "Б") (192 "А")))
  ; saving string to a temp file
  (set 't-file-name (append "/tmp/" (crypto:md5 (string (random)))))
  (write-file t-file-name t-linea)
  ; loading characters to the t-out
  (set 't-out "")
  (set 't-file (open t-file-name "read"))
  (while (set 't-char (read-char t-file))
    (push (or (lookup t-char en-win-1251) (char t-char)) t-out -1))
  (close t-file)
  t-out)
Maybe there is a shorter way, without writing a file? I need this function on both Linux and Windows, and the Windows temp directory has a different name.
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								cormullion 							 
									
		Posts:  2038  		Joined:  Tue Nov 29, 2005 8:28 pm 		
		
																Location:  latiitude 50N longitude 3W 
							
							
								by cormullion   »  Wed Oct 07, 2009 9:52 pm 
			
			
			
			
			Does unpack  help at all?
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								by Fritz   »  Wed Oct 07, 2009 10:14 pm 
			
			
			
			
			cormullion wrote: Does unpack  help at all?
Thank you! Yes, I think "unpack" is the solution. The function is much shorter now:
Code: Select all 
(define (cyr-koi-utf-2 t-linea)
  ; putting character codes to the list
  (set 't-list (unpack (dup "b" (mul 2 (length t-linea))) t-linea))
  ; decoding characters from 't-list to the 't-out
  (set 't-out "")
  (dolist (t-char t-list)
    (push (or (lookup t-char en-koi8r) (char t-char)) t-out -1))
  t-out)
It works OK. I found a funny thing, by the way. The manual says: "Length... returns... the number of characters in a string". But (length "one-russian-letter-in-utf-8") returns 2, not 1.
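The result of 2 suggests that length counts bytes here rather than characters, so one "b" per byte should be enough for the unpack format, and the doubling in the function above is probably not needed. A minimal sketch, where string-bytes is a hypothetical helper name, not something from the manual:

```newlisp
; Hypothetical helper: list the byte values (0..255) of a string.
; "b" is unpack's unsigned 8-bit format, and length counts bytes,
; so the format gets exactly one "b" per byte.
(define (string-bytes s)
  (unpack (dup "b" (length s)) s))

(string-bytes "abc")   ; → (97 98 99)
```

The resulting list can then be walked with dolist exactly as in cyr-koi-utf-2.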
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								Jeff 							 
									
		Posts:  604  		Joined:  Sat Apr 07, 2007 2:23 pm 		
		
																Location:  Ohio 
							
							
								by Jeff   »  Wed Oct 07, 2009 10:14 pm 
			
			
			
			
			dostring processes a string one char at a time...
			
			
									
						 
		 
				
		
		 
	 
	 
				
	 
				
		
		
			
				
																			
								m35 							 
									
		Posts:  171  		Joined:  Wed Feb 14, 2007 12:54 pm 		
		
						
						
		 
		
						
						 
													
							
						
									
								by m35   »  Thu Oct 08, 2009 3:35 pm 
			
			
			
			
			Fritz wrote: Manual says: "Length... returns... the number of characters in a string". But (length "one-russian-letter-in-utf-8") returns 2, not 1.
What version of the manual are you using? The current manual says:
The manual wrote: Returns ... the number of bytes in a string.
There is also utf8len for UTF-8 strings.
I've run into trouble myself when treating strings as binary data. It would work fine in normal newLISP, then blow up when running in UTF-8 newLISP. I can't remember what I did to make things universal, though.
Edit:
I looked at the functions in the manual and I see three that work with bytes regardless: unpack (as you know), slice, and get-char.
You could just loop over the bytes with slice or get-char:
Code: Select all 
(for (i 0 (- (length s) 1))
   (setq c (slice s i 1))
   ; or
   (setq c (char (get-char (+ i (address s)))))
)
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								by cormullion   »  Thu Oct 08, 2009 5:00 pm 
			
			
			
			
			You could even use implicit slicing:
  
but don't confuse it with implicit indexing:
   
which does work on characters, not bytes.
You can sometimes write code that runs in both UTF-8 and non-UTF-8 versions. E.g.:
Code: Select all 
(define (string-length s)
    (if unicode (utf8len s) (length s)))
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								by Fritz   »  Thu Oct 08, 2009 6:09 pm 
			
			
			
			
			I think I have an old manual. That is good: now I can be sure my "unpack" will work in future versions too.
m35 wrote: 
You could just loop over the bytes with slice or get-char
Code: Select all 
(for (i 0 (- (length s) 1))
   (setq c (slice s i 1))
   ; or
   (setq c (char (get-char (+ i (address s)))))
)
Slice works, at least with a UTF-8 locale and an ASCII-encoded line. But get-char gives me only some strange negative numbers. Only this convoluted construction works:
Code: Select all 
(dotimes (i (length rln))
  (print (or (lookup (+ 256 (get-char (+ i (address rln)))) en-win-1251) "?")))
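The negative numbers suggest that get-char returns the byte as a signed value in this build, so bytes of 128 and above come back as -128..-1. Adding 256 repairs those, but it would push plain ASCII bytes out of the 0..255 range. A small sketch that masks with 0xFF instead, which should give 0..255 in either case (byte-at is a hypothetical helper name):

```newlisp
; Hypothetical helper: unsigned value of the byte at offset i in s.
; (& 0xFF n) keeps only the low 8 bits, so -1 becomes 255 and 65 stays 65.
(define (byte-at s i)
  (& 0xFF (get-char (+ i (address s)))))

(byte-at "A\255" 0)   ; → 65
(byte-at "A\255" 1)   ; → 255
```

With this, the lookup in the loop above could take (byte-at rln i) directly, and ASCII characters would no longer fall through to "?".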
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								by Fritz   »  Thu Oct 08, 2009 6:42 pm 
			
			
			
			
			cormullion wrote: You could even use implicit slicing:
  
But how? ((address str) 1 1) ?
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								by cormullion   »  Thu Oct 08, 2009 9:51 pm 
			
			
			
			
			How about 
Code: Select all 
(set 's "\004\003\002\001")
(for (i 0 3)
  (println (get-char (address (i 1 s)))))
4
3
2
1
where i is the offset, 1 is the length, and s is the string you're slicing...
 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								Lutz 							 
									
		Posts:  5289  		Joined:  Thu Sep 26, 2002 4:45 pm 		
		
																Location:  Pasadena, California 
							
							
								by Lutz   »  Fri Oct 09, 2009 1:11 am 
			
			
			
			
			You can do without 'address' if the argument is a string. This will do it too:
Code: Select all 
(for (i 0 3) (println (get-char (i 1 s)))) 
			
			
									
									
						 
		 
				
		
		 
	 
	 
				
		
		
			
				
																			
								by Fritz   »  Sat Oct 10, 2009 9:32 pm 
			
			
			
			
			Lutz wrote: You can do without 'address' if the argument is a string. This will do it too:
Code: Select all 
(for (i 0 3) (println (get-char (i 1 s)))) 
 
always gives "0" as a result.
works, but only in this strange form:
PS: it's a pity "explode" cannot work with raw bytes, so I cannot use "map".
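explode may not split on raw bytes, but unpack already returns a plain list of byte values, so map can run over that instead. A minimal sketch, assuming a lookup table shaped like en-win-1251 from the first post and a UTF-8 encoded source file; decode-bytes and the two-entry demo-table are made-up names for illustration:

```newlisp
; Hypothetical helper: decode an 8-bit string through an association
; table of (byte-value "utf-8 string") pairs.
; Bytes missing from the table fall back to (char b), as in the
; functions earlier in the thread.
(define (decode-bytes s table)
  (join (map (fn (b) (or (lookup b table) (char b)))
             (unpack (dup "b" (length s)) s))))

; Tiny demo table, not the full windows-1251 mapping.
(set 'demo-table '((224 "а") (225 "б")))
(decode-bytes "x\224\225" demo-table)   ; → "xаб"
```

This avoids both the temporary file and the get-char address arithmetic, at the cost of building the intermediate byte list.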