Manipulating byte strings -- SOLVED

Thorstein
Posts: 5
Joined: Tue May 26, 2015 11:30 pm

Post by Thorstein »

[See solution in thread below.]

I'm trying to implement several versions of the Lempel-Ziv-x and Snappy compression algorithms. Ordinarily, I like to get my logic straight in Lisp, and then, if I need the speed, I'll port the tight loops to a C library. In this case, however, newLISP has been atypically difficult to debug. I wonder if there are some simple code patterns I'm overlooking.

It would, of course, be simpler to use a non-UTF-8-enabled build of newLISP, but I want to compress UTF-8 strings that I'm processing within newLISP.

So given a UTF-8 string us, I understand that (slice us i 1) will give me an 8-bit "char". I also found that defining

Code:

(define (byte s (i 0))
  (char s i true))

helped in some situations. But then I ran into problems trying to unpack a code like 32765 into two bytes. In the following examples I thought I could get the low byte, 253, like this:

Code:

> (mod 32765 256)
253

;; but
> (byte (mod 32765 256)) 
ý

;; and
> (byte (byte (mod 32765 256)))
195

And while the following use of (char) looks OK,

Code:

>(char (char (mod 32765 256)))
253

>(char (mod 32765 256))
"ý"

>(length "ý")
2

the UTF-8 encoding of that char is two bytes long, which messes with the byte discipline of the compression algorithms.

At last, I found that (pack) can work:

Code:

>(pack "b" (& 32765 0xff))
"�"

;; and
> (byte (pack "b" (& 32765 0xff)))
253

;; (and for the high byte):
>(byte (pack "b" (/ 32765 256)))
127

But, a little confusingly, there were still some gotchas. For example, (pack) doesn't work with the result of (mod):

Code:

> (byte (pack "b"  (mod 32765 256)))
16

So, long story short, I've got these manipulations more or less working, but I wonder if there's a more direct way to manipulate such bytes and 8-bit chars?
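
One possible explanation for the (mod) gotcha, offered as an assumption rather than something confirmed in this thread: newLISP's mod is a floating-point function, so (mod 32765 256) yields a float that merely prints as 253, while the "b" format of pack expects an integer. The sketch below sticks to the integer operators >>, & and %. The name split16 is a hypothetical helper, not part of newLISP.

```newlisp
;; Sketch: split a 16-bit code into (high low) bytes with integer-only operators.
;; split16 is a hypothetical helper name.
(define (split16 n)
  (list (& (>> n 8) 0xff)    ; high byte
        (& n 0xff)))         ; low byte

(split16 32765)              ; → (127 253)
(% 32765 256)                ; → 253, integer remainder (unlike mod)

;; Build the two-byte string, high byte first
(set 'code (pack "b b" 127 253))
(length code)                ; → 2
```

Because every intermediate value stays an integer, the result can be handed straight to (pack "b" ...).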
Last edited by Thorstein on Tue Jan 05, 2021 3:49 pm, edited 1 time in total.

fdb
Posts: 66
Joined: Sat Nov 09, 2013 8:49 pm

Re: Manipulating byte strings

Post by fdb »

Not sure what you are trying to do here, but according to the documentation on UTF-8 you can use explode: "Use explode to obtain an array of UTF-8 characters and to manipulate characters rather than bytes when a UTF-8–enabled function is unavailable".

Hope this helps!

Thorstein
Posts: 5
Joined: Tue May 26, 2015 11:30 pm

SOLVED: Re: Manipulating byte strings

Post by Thorstein »

Thanks, fdb, but I don't want UTF-8 chars; I need a byte stream.

But I think I've found a major source of my confusion, so I'll mark this thread "SOLVED":

(char "x") => a Unicode code-point. This should perhaps have been obvious to me, but "code-point" is not mentioned in the Manual. Consequently my helper function

Code:

(define (byte s (i 0))
  (char s i true))

;; given a number, returns a UTF-8 char
(byte 218)
"Ú"
;; and even though the code-point for "Ú" is 218
(char "Ú")
218
;; the UTF-8 encoding of that code-point is 2 bytes long!
(length (byte 218))
2
;; so, confusingly,
(char (byte 218))
218                      ;; the code-point value fits in one byte
;; but
(byte (byte 218))
195

where 195 is the first byte of the 2-byte UTF-8 encoding of the code-point.
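
Why both 218 and 253 come back as 195 follows from how UTF-8 encodes code-points from 0x80 to 0x7ff in two bytes: the upper bits go into a lead byte of the form 110xxxxx and the lower six bits into a continuation byte of the form 10xxxxxx. A small sketch of that arithmetic (utf8-pair is a hypothetical name, and it covers only this two-byte range):

```newlisp
;; Sketch: the two UTF-8 bytes for a code-point in the range 0x80..0x7ff.
;; utf8-pair is a hypothetical helper name.
(define (utf8-pair cp)
  (list (| 0xC0 (>> cp 6))       ; lead byte: 110xxxxx
        (| 0x80 (& cp 0x3F))))   ; continuation byte: 10xxxxxx

(utf8-pair 218)   ; → (195 154)
(utf8-pair 253)   ; → (195 189)
```

Every code-point from 0xC0 to 0xFF shares the lead byte 195, so reading only the first byte loses the original value.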

It appears my helper function should have been this:

Code:

(define (byte x)
  (if (number? x)
      (pack "b" x)         ;; number -> 1-byte string
      (char x 0 true)))    ;; string -> value of its first byte

;; and now
(byte (byte 218))
218

Said differently, while (char) reciprocally translates one-byte code-points in the lower ASCII range to one-byte chars, (char) does not do so for code-points in the range 0x80-0xff and beyond.

My code now uses (slice stringx n 1) to fetch a byte and the revised (byte) to reciprocally transform 1-byte chars. It never directly calls (char), and this appears to be working. Yay!
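
For 16-bit codes, pack and unpack can also handle both bytes at once. A minimal sketch, assuming the "u" (unsigned 16-bit) format and the ">" big-endian switch behave as described in the newLISP manual; the variable name buf is illustrative only:

```newlisp
;; Sketch: round-trip a 16-bit code through a two-byte string.
(set 'buf (pack ">u" 32765))           ; two raw bytes, high byte first
(length buf)                           ; → 2
(char buf 0 true)                      ; → 127
(char buf 1 true)                      ; → 253
(unpack ">u" buf)                      ; → (32765)
(first (unpack "b" (slice buf 1 1)))   ; → 253
```

Since no UTF-8 encoding step is involved, the string stays exactly two bytes long whatever the byte values are.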
