leave-string: any idea for simplifying?

hartrock · Post by **hartrock** » Sun Oct 11, 2015 5:19 pm

The following function leave-string keeps the left part (positive pos) or the right part (negative pos) of a string; it also works when pos is out of range:

Code: Select all

> 
(define (leave-string str pos)
  (if (>= pos 0)
      (0 pos str)
      ((- (min (- pos) (length str))) str)))
;;
(set 'str "foobar")
(leave-string str 3) ; first 3
(leave-string str -3) ; last 3
;;
(leave-string str 7) ; first 7 (all with pos one out of range)
(leave-string str -7) ; last 7 (all with pos one out of range)

(lambda (str pos) 
 (if (>= pos 0) 
  (0 pos str) 
  ((- (min (- pos) (length str))) str)))
"foobar"
"foo"
"bar"
"foobar"
"foobar"
>

Any idea for simplifying the negative pos case?
Or is there another way to get this truncating string functionality that I have missed?

ralph.ronnquist · Post by **ralph.ronnquist** » Thu Oct 15, 2015 12:28 pm

Not really simpler, but the negative index case would be slightly shorter with the following.

Code: Select all

((length 0 pos str) str)

xytroxon · Post by **xytroxon** » Thu Oct 15, 2015 5:34 pm

A potential problem exists in handling UTF-8 string values in UTF-8 enabled versions of newLISP. In UTF-8 versions, the length function would return the number of bytes and not the number of UTF-8 characters in the string. Hence, your function would not return the correct number of characters from your string!

In such cases utf8len must be used. Since single-byte ASCII characters (0-127) are a subset of UTF-8, length function problems may not be noticed until a user has multi-byte (non-English) UTF-8 characters to process.

A further complication is that non-UTF-8 versions of newLISP do not include the utf8len function!

A possible solution is to use this "ambidextrous" strlen function in place of length:

Code: Select all

(define (strlen str) (if (= (& (sys-info 9) 128) 128) (utf8len str) (length str)))

But of course it fails to return the correct length if your users are trying to process multi-byte UTF-8 character strings on non-UTF-8 versions of newLISP, for example when dealing with UTF-8 HTML pages that include "fancy" left and right quotes in "English only" text.
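To make the difference concrete, here is a small sketch using the strlen definition from above. The sample string is my own choice, and the byte count assumes the source is saved as UTF-8:

```newlisp
;; strlen as defined above: utf8len on UTF-8 builds, length otherwise
(define (strlen str)
  (if (= (& (sys-info 9) 128) 128) (utf8len str) (length str)))

(set 's "äöü")        ; sample string (assumption): 3 characters, 2 bytes each in UTF-8
(println (length s))  ; byte count: 6
(println (strlen s))  ; 3 on UTF-8 builds, 6 on non-UTF-8 builds
```

On a non-UTF-8 build both lines print the byte count, which is exactly the failure mode described above.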

A truly "simplified" or "efficient" version of your function may not be entirely possible depending on how robust you want your code to be - like in module code designed to run on all versions of newLISP.

-- xytroxon

hartrock · Post by **hartrock** » Fri Oct 16, 2015 11:39 pm

ralph.ronnquist wrote: Not really simpler, but the negative index case would be slightly shorter with the following.
Code: Select all
((length 0 pos str) str)

Thanks for the suggestion!
Corrected, it reads:

Code: Select all

((length (0 pos str)) str)

which is shorter and easier to understand than:

Code: Select all

((- (min (- pos) (length str))) str)

The longer variant avoids creating a temporary copy of part of str just to compute a valid negative index, though.
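For reference, a sketch of the complete function with the shorter negative branch. The name leave-string-short is only for illustration. The out-of-range case relies on (0 pos str) yielding an empty string when the negative length exceeds the string length, which is what the corrected expression depends on:

```newlisp
(define (leave-string-short str pos)
  (if (>= pos 0)
      (0 pos str)                    ; keep the first pos bytes
      ((length (0 pos str)) str)))   ; skip all but the last -pos bytes

(set 'str "foobar")
(println (leave-string-short str 3))   ; "foo"
(println (leave-string-short str -3))  ; "bar"
(println (leave-string-short str -7))  ; out of range: compare with leave-string
```

The temporary string built by (0 pos str) is the copy that the longer variant avoids.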

The asymmetry that triggered this thread is:

Code: Select all

> (set 'str "foobar")
"foobar"
> (11 str)
""
> (-11 str)

ERR: invalid string index
>

The same holds for (slice str pos). There may be reasons for these semantics, though: e.g. negative indices near the tail of a list are inefficient compared to positive ones near its head, which could be a reason for making negative indices less comfortable to use.

hartrock · Post by **hartrock** » Sat Oct 17, 2015 1:52 am

xytroxon wrote: A potential problem exists in handling UTF-8 string values in UTF-8 enabled versions of newLISP. In UTF-8 versions, the length function would return the number of bytes and not the number of UTF-8 characters in the string. Hence, your function would not return the correct number of characters from your string!

Thanks for the pointer!

This is indeed a problem for the use case that triggered this thread: the newLISP interpreter's stdin/stdout/stderr are transferred via the WebSocket protocol, with the text sent in chunks so that intermediate results are available during longer-running computations. Splitting into chunks is byte-string oriented, but each chunk is transferred as an encoded JSON string. JSON encoding can fail if a UTF-8 character is cut at the border between two chunks, so that its fragments end up in different chunks.
So there is something to do; trying to provoke the problem triggers an error (both the code to be evaluated and the results of the evaluations are transferred via WebSocket as UTF-8 text):

Code: Select all

> (set 's "我能吞下玻璃而不伤身体。")(dup s 500) ; dup leading to chunked transfer
"我能吞下玻璃而不伤身体。"
[text]我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。我能吞下玻璃而不伤身体。...[/text]
> (set 's "我能吞下玻璃而不伤身体。")(dup s 500) ; dup leading to chunked transfer

-> This run failed while transferring the second dup result (chunk borders vary); the first one succeeded and is shown truncated.
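One possible way to avoid splitting a character would be to move each chunk border back to the nearest character start before JSON encoding. A minimal sketch, assuming the chunks are plain byte strings holding valid UTF-8 and that slice works on byte offsets (as the newLISP manual states, also for UTF-8 builds). continuation-byte? and utf8-safe-cut are hypothetical helpers, not part of newLISP or of the WebSocket code discussed here:

```newlisp
;; true if the byte at offset idx is a UTF-8 continuation byte (bit pattern 10xxxxxx)
(define (continuation-byte? str idx)
  (= (& (first (unpack "b" (slice str idx 1))) 0xC0) 0x80))

;; largest byte offset <= pos that does not fall inside a multi-byte character
;; (pos is assumed to be non-negative)
(define (utf8-safe-cut str pos)
  (let (cut (if (> pos (length str)) (length str) pos))
    (while (and (> cut 0) (< cut (length str)) (continuation-byte? str cut))
      (set 'cut (- cut 1)))
    cut))

(set 's "a我b")                      ; bytes: 61 E6 88 91 62
(set 'border (utf8-safe-cut s 3))    ; offset 3 is inside 我, so the border backs up to 1
(println (slice s 0 border))         ; chunk ending on a character border: a
(println (slice s border))           ; remainder for the next chunk: 我b
```

The remainder would be carried over and prepended to the next chunk, so no character is ever encoded in two pieces.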

newlispfanclub.alh.net
