Strange problem with dolist

Darth.Severus · Post by **Darth.Severus** » Sun Aug 23, 2015 12:34 pm

In a script I run (read-file <path>) and (parse str "\n") to get a list of the file's lines, stored in the symbol input-list. Then I run the following code:

Code:

(dolist (temp input-list)
		(when (not (or 
					(starts-with temp "#")
					(starts-with temp "\t")))
			(setq temp (replace " " temp "&nbsp;"))
			(push (string temp "<br>") result-list -1)) 
		(when (starts-with temp "#")
				(push (heading temp) result-list -1)))

It does what it should, except for the first line. There is no (exit) at the end of the script, so I can inspect input-list afterwards. The first line is "### whatever" and, for example, the fifth is "### whatever-again". But the first line is handled by the first (when), while all the other lines starting with "#" are handled by the second (when). This is completely crazy.

Linux my-notebook 3.14-0.bpo.2-686-pae #1 SMP Debian 3.14.15-2~bpo70+1 (2014-08-21) i686 GNU/Linux
newLISP v.10.6.2 32-bit on Linux IPv4/6 UTF-8 libffi

TedWalther · Post by **TedWalther** » Sun Aug 23, 2015 6:14 pm

Code:

    (dolist (temp input-list)
          (when (not (or
                   (starts-with temp "#")
                   (starts-with temp "\t")))
             (replace " " temp "&nbsp;")
             (extend temp "<br>")
             (push temp result-list -1))
          (when (starts-with temp "#")
                (push (heading temp) result-list -1)))

Here, cleaned it up a little for you.

Here is a question: are you intentionally skipping lines that start with a tab "\t"?

I don't know why that is a bug; if you send me a fuller code sample I'll run it and take a look. Tell me if this works:

Code:

    (dolist (temp input-list)
          (cond
          ((starts-with temp "\t") nil) ; do nothing
          ((starts-with temp "#") (push (heading temp) result-list -1))
          (default (replace " " temp "&nbsp;") (extend temp "<br>") (push temp result-list -1))))

(this isn't a bug-fixed version, just how I would have implemented it)

Darth_Severus · Post by **Darth_Severus** » Mon Aug 24, 2015 11:01 am

Here, cleaned it up a little for you.

Thanks, but you also put an error in it:

Code:

(push temp result-list-1))
should be (push temp result-list -1))

Here is a question: are you intentionally skipping lines that start with a tab "\t"?

Yes, for the moment. I wanted to decide later what to do instead.

if you send me a fuller code sample I'll run it and take a look.

Thanks. I can't send you my current input file. I'll first try it on my own a bit, with another input file.

Tell me if this works:

No sorry. I mean it's not working.

(this isn't a bug-fixed version, just how I would have implemented it)

Interesting, thanks. Only your lines are too long for my taste; I use a large font in my editor. You also put in the same error as described above. There's only one result-list. Never mind.

Darth_Severus · Post by **Darth_Severus** » Mon Aug 24, 2015 3:19 pm

Update

I found the nerve to try it again. I created a new file and it worked correctly, until I used the option in Geany called "Write Unicode BOM". I had always activated this without knowing for sure whether I needed it. It turns out that for this use case I don't. I'm quite sure I started activating it because I had problems in another program without it.

https://en.wikipedia.org/wiki/Byte_order_mark

TedWalther · Post by **TedWalther** » Mon Aug 24, 2015 4:54 pm

Weird. I think BOM is supposed to only happen once, at the very beginning of the file.

About the error: my eyesight. Didn't see that the -1 was detached from the variable name. I THOUGHT it was a weird variable name... :)

So even in the (cond ...) style, the code doesn't work if BOM is in the file?

TedWalther · Post by **TedWalther** » Mon Aug 24, 2015 4:58 pm

Ok, try this code with the BOM enabled:

Code:

(dolist (temp result-list)
  (println (char (first temp)) {(} (first temp) {) } temp))

Darth_Severus · Post by **Darth_Severus** » Tue Aug 25, 2015 10:42 am

TedWalther wrote: Weird. I think BOM is supposed to only happen once, at the very beginning of the file.

Right, and that fits exactly the error I got. Only the first line causes trouble.

So even in the (cond ...) style, the code doesn't work if BOM is in the file?

Yes. I think it's clearly a bug. Unlike other programs, newLISP handles the BOM as if it were part of the text.

Ok, try this code with the BOM enabled:

Before I even mentioned the problem here, I checked what newLISP gives me with print or when accessing the data by index. With print the first line is always shown as it should be, but when newLISP processes the data it sees the BOM as the start of the first line.

I looked again into it, and could see the problem:

(setq data (read-file "/pathdeleted/untitled"))
(println data)
### Unicode?
### Test0
### lülülü
'''Test1
Test2
"### Unicode?\n### Test0\n### lülülü\n'''Test1\n Test2"

(data 0)
""

(starts-with data "#")
nil

(println (char (first data)) {(} (first data) {) } data)
65279() ### Unicode?
### Test0
### lülülü
'''Test1
Test2
"### Unicode?\n### Test0\n### lülülü\n'''Test1\n Test2"

(char 65279)
""
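The same symptom can be reproduced outside newLISP. This is a minimal Python sketch, not part of the original post; the byte string is an assumption that stands in for the real input file, which was not posted. It shows that plain UTF-8 decoding keeps the BOM as the code point 65279 (U+FEFF) in front of the first line, so only that line fails the "#" test:

```python
# Simulated file content (an assumption): UTF-8 BOM followed by two heading lines.
raw = b"\xef\xbb\xbf### Unicode?\n### Test0\n"

# Plain UTF-8 decoding keeps the BOM as the single code point U+FEFF.
text = raw.decode("utf-8")
lines = text.split("\n")
first = lines[0]

print(repr(first))                # repr() makes the BOM visible as \ufeff
print(ord(first[0]))              # 65279
print(first.startswith("#"))      # False
print(lines[1].startswith("#"))   # True
```

Printing the string directly would show nothing unusual on most terminals, which matches the println output above.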

TedWalther · Post by **TedWalther** » Tue Aug 25, 2015 4:46 pm

Darth_Severus wrote:
TedWalther wrote: Weird. I think BOM is supposed to only happen once, at the very beginning of the file.
Right, and that fits exactly the error I got. Only the first line causes trouble.

Oh! I read your post as saying the opposite; I thought the first line worked, and the others didn't. In that case, the fix is easy. newLISP is doing the right thing and leaving the BOM alone. So, add another starts-with clause that includes the BOM. Like this:

Code:

(starts-with temp "\xFE\xFF#")     ; for UTF16 (big-endian BOM bytes FE FF)
(starts-with temp "\xEF\xBB\xBF#") ; for UTF8 (BOM bytes EF BB BF)

Or, before you even enter your loop, do this:

Code:

(setf (input-list 0) (2 (input-list 0)))

That chops the BOM off the first input line (assuming UTF16; for UTF8 change it to (3 (input-list 0))).
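As an illustration of the same chop-off-the-BOM idea in another language, here is a minimal Python sketch. The function name strip_bom is hypothetical; the BOM constants come from Python's standard codecs module. Like the slice above, it works on raw bytes and only covers the UTF-8 and UTF-16 marks mentioned here:

```python
import codecs

# Byte-order marks handled by this sketch (UTF-8 and both UTF-16 byte orders).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

def strip_bom(data: bytes) -> bytes:
    """Return data without a leading BOM, or unchanged if none is present."""
    for bom in BOMS:
        if data.startswith(bom):
            return data[len(bom):]
    return data

print(strip_bom(b"\xef\xbb\xbf### Unicode?"))  # b'### Unicode?'
print(strip_bom(b"### Test0"))                 # b'### Test0'
```

Checking for the mark first, instead of always dropping a fixed number of bytes, keeps files without a BOM intact.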

Darth_Severus · Post by **Darth_Severus** » Tue Aug 25, 2015 6:17 pm

I think programs are not supposed to read the BOM when reading a file (this way). When I do

Code:

$> cat file

in Linux, then the BOM is not shown, nor in any other program. As I showed above, in newLISP it even makes a difference whether you use println or starts-with. This makes no sense at all; people may run into the same problem as me over and over again.

TedWalther · Post by **TedWalther** » Tue Aug 25, 2015 7:03 pm

Darth_Severus wrote: I think programs are not supposed to read the BOM when reading a file (this way). When I do
Code:
$> cat file
in Linux, then the BOM is not shown, nor in any other program. As I showed above, in newLISP it even makes a difference whether you use println or starts-with. This makes no sense at all; people may run into the same problem as me over and over again.

newLISP isn't just a program; it is a general purpose programming language. Some things NEED to see the BOM. You are the one writing the program; it is up to you to handle the BOM. As I just showed you with that one-liner, BOM handling can be done fairly simply. Yes, it is something to watch out for. Not sure where that info belongs in the manual; that is language-independent general Unicode knowledge.

Darth_Severus · Post by **Darth_Severus** » Tue Aug 25, 2015 9:00 pm

Even if this were the right way to do it, the print function would also have to show it. You can't be serious about just keeping it how it is. It's invisible. A programmer can't know which files a user will supply as input, so this is one more thing to think of when writing a program. Your line, or something like it, would have to be part of every script that reads user-generated or third-party text files. Not to forget, non-advanced programmers won't have Unicode knowledge.

Is it really done this way in other languages, like Python?

I'd strongly prefer it not to be handled like this, and I see no need for it. Finding out what kind of file it is should only be done when it is really needed, and a function like "file" in Linux would be a better way to do that. Maybe also some way to write a file with a BOM.

Yeah, but I see - Linux does it the same way:

Code:

> ((exec "cat ~/untitled")0)
"### Unicode?"
(((exec "cat ~/untitled")0)0)
""

Horrible, but it seems to be standard.
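On the Python question: with the plain "utf-8" codec Python also keeps the BOM as the first character of the text, and it only drops it when the "utf-8-sig" codec is requested. A minimal sketch of that difference follows; the temporary file is an assumption that stands in for a file saved with Geany's "Write Unicode BOM" option:

```python
import os
import tempfile

# Create a small file that starts with a UTF-8 BOM (stand-in for the real input).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"\xef\xbb\xbf### Unicode?\n")
    path = f.name

try:
    # "utf-8" decodes the BOM into the code point U+FEFF and keeps it.
    with open(path, encoding="utf-8") as fh:
        kept = fh.read()
    # "utf-8-sig" skips a leading BOM if one is present.
    with open(path, encoding="utf-8-sig") as fh:
        stripped = fh.read()
finally:
    os.remove(path)

print(kept.startswith("#"))      # False
print(stripped.startswith("#"))  # True
```

So the burden is similar: the program has to ask for BOM handling explicitly, although here it is a choice of codec and not a manual slice.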

TedWalther · Post by **TedWalther** » Tue Aug 25, 2015 9:08 pm

Yeah. I find in general, newLISP doesn't put any burden on you that it doesn't have to. When dealing with Unicode, there are lots of characters that don't show up when you print. Even with regular ASCII, there are codes like "\0" that don't show up at all. If you don't know where your data is coming from, you have to do checks to sanitize it. Just a fact of life. newLISP does make it really easy to check and sanitize data. But binary data is binary data; only you know how you are going to interpret it. So newLISP couldn't practically be changed to handle every type of data format. Instead it gives us a small set of very powerful tools so we can handle every type of binary data format.
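One way such a check could look, sketched in Python and not taken from any script in this thread: flag every character whose Unicode general category is "control" or "format", since those are the ones that tend to vanish when printed. The function name invisible_chars is hypothetical, and note that ordinary newlines and tabs are control characters too, so they are reported as well:

```python
import unicodedata

def invisible_chars(text):
    """List (index, 'U+XXXX', name) for control and format characters in text."""
    found = []
    for i, ch in enumerate(text):
        # Cc = control characters, Cf = format characters (U+FEFF is one of these)
        if unicodedata.category(ch) in ("Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found

print(invisible_chars("\ufeff### Unicode?\x00"))
# [(0, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE'), (13, 'U+0000', '<unnamed>')]
```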

That said, I would make a "pop-bom" function that would strip the BOM out of a data stream. In fact, I've written a bunch of small scripts where I go character by character, and convert or drop specific Unicode characters depending on what I'm interested in. newLISP has been the ideal language for my work on the text of the Dead Sea Scrolls and other old manuscripts that are in Unicode.

One of my most useful scripts just reads in a stream a character at a time and makes a histogram: it counts every unique character, and prints out the count, with the FULL Unicode name of that character, plus the hex and decimal value of that character. I call it unicode-histogram.lsp. If you're interested, I could post it here.

Darth_Severus wrote: Even if this were the right way to do it, the print function would also have to show it. You can't be serious about just keeping it how it is. It's invisible. A programmer can't know which files a user will supply as input, so this is one more thing to think of when writing a program. Your line, or something like it, would have to be part of every script that reads user-generated or third-party text files. Not to forget, non-advanced programmers won't have Unicode knowledge.

Is it really done this way in other languages, like Python?

I'd strongly prefer it not to be handled like this, and I see no need for it. Finding out what kind of file it is should only be done when it is really needed, and a function like "file" in Linux would be a better way to do that. Maybe also some way to write a file with a BOM.

Yeah, but I see - Linux does it the same way:
Code:
> ((exec "cat ~/untitled")0)
"### Unicode?"
(((exec "cat ~/untitled")0)0)
""
Horrible, but it seems to be standard.

TedWalther · Post by **TedWalther** » Tue Aug 25, 2015 9:26 pm

Never mind, it is simple enough. Here is my script; it helps with debugging Unicode issues.

Code:

#!/usr/bin/newlisp

(load "unicode-names.lsp")
(define histogram:histogram)

(define (hex n) (push "0x" (upper-case (format "%x" n))))

(while (setq c (read-utf8 0))
       (setq c (char c))
       (if (histogram c)
	 (++ (histogram c))
	 (histogram c 1)))

(dolist (i (sort (histogram) (fn (x y) (< (char (x 0)) (char (y 0))))))
  (println (format "%s (decimal %d) %s (%s) occurs %d times."
    (hex (char (i 0))) (char (i 0)) (i 0) (unicode-name (i 0)) (i 1 0))))

(exit)

And here is some output from a project I recently did:

0xA (decimal 10)
(LINE FEED (LF)) occurs 17645 times.
0x20 (decimal 32) (SPACE) occurs 377553 times.
0x26 (decimal 38) & (AMPERSAND) occurs 3 times.
0x28 (decimal 40) ( (LEFT PARENTHESIS) occurs 282 times.
0x29 (decimal 41) ) (RIGHT PARENTHESIS) occurs 282 times.
0x2A (decimal 42) * (ASTERISK) occurs 96 times.
0x2D (decimal 45) - (HYPHEN-MINUS) occurs 148 times.
0x2E (decimal 46) . (FULL STOP) occurs 9086 times.
0x30 (decimal 48) 0 (DIGIT ZERO) occurs 4429 times.
...
0x1372 (decimal 4978) ፲ (ETHIOPIC NUMBER TEN) occurs 84 times.
0x1373 (decimal 4979) ፳ (ETHIOPIC NUMBER TWENTY) occurs 77 times.
0x1374 (decimal 4980) ፴ (ETHIOPIC NUMBER THIRTY) occurs 67 times.
0x1375 (decimal 4981) ፵ (ETHIOPIC NUMBER FORTY) occurs 30 times.
0x1376 (decimal 4982) ፶ (ETHIOPIC NUMBER FIFTY) occurs 49 times.
0x1377 (decimal 4983) ፷ (ETHIOPIC NUMBER SIXTY) occurs 30 times.
0x1378 (decimal 4984) ፸ (ETHIOPIC NUMBER SEVENTY) occurs 43 times.
0x1379 (decimal 4985) ፹ (ETHIOPIC NUMBER EIGHTY) occurs 11 times.
0x137A (decimal 4986) ፺ (ETHIOPIC NUMBER NINETY) occurs 6 times.
0x137B (decimal 4987) ፻ (ETHIOPIC NUMBER HUNDRED) occurs 372 times.

The unicode-names.lsp file is structured very simply, this should give you the idea:

Code:

;; names of unicode code points provided by
;;   http://www.fileformat.info/info/unicode/block/

(define unicode-name:unicode-name)
(unicode-name (char 0x0000) "NULL")
(unicode-name (char 0x0001) "START OF HEADING")
(unicode-name (char 0x0002) "START OF TEXT")
(unicode-name (char 0x0003) "END OF TEXT")
(unicode-name (char 0x0004) "END OF TRANSMISSION")
(unicode-name (char 0x0005) "ENQUIRY")
(unicode-name (char 0x0006) "ACKNOWLEDGE")
(unicode-name (char 0x0007) "BELL")
(unicode-name (char 0x0008) "BACKSPACE")
...
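For comparison, the same histogram idea can be sketched in Python. This is not a port of the script above, only an approximation: the standard unicodedata module stands in for unicode-names.lsp, the function name unicode_histogram and the sample string are assumptions, and control characters such as line feed have no name in unicodedata, so they are reported as "<no name>" here:

```python
import unicodedata
from collections import Counter

def unicode_histogram(text):
    """Return one report line per distinct character, ordered by code point."""
    counts = Counter(text)
    report = []
    for ch in sorted(counts):
        name = unicodedata.name(ch, "<no name>")
        report.append(
            f"0x{ord(ch):X} (decimal {ord(ch)}) {ch} ({name}) occurs {counts[ch]} times."
        )
    return report

# Hypothetical sample input; a real run would read the text from a file or stdin.
sample = "a b a"
for line in unicode_histogram(sample):
    print(line)
```

Run on a file that starts with a BOM, a report like this would list U+FEFF as ZERO WIDTH NO-BREAK SPACE, which makes the invisible first character easy to spot.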

Darth_Severus · Post by **Darth_Severus** » Wed Aug 26, 2015 12:43 pm

Thanks for your help so far. I might look into your code to see if I can use it.

newlispfanclub.alh.net
