Yeah. I find in general, newLISP doesn't put any burden on you that it doesn't have to. When dealing with Unicode, there are lots of characters that don't show up when you print. Even with regular ASCII, there are codes like "\0" that don't show up at all. If you don't know where your data is coming from, you have to do checks to sanitize it. Just a fact of life. newLISP does make it really easy to check and sanitize data. But binary data is binary data; only you know how you are going to interpret it. So newLISP couldn't practically be changed to handle every type of data format. Instead it gives us a small set of very powerful tools so we can handle every type of binary data format.
That said, I would make a "pop-bom" function that strips the BOM out of a data stream. In fact, I've written a bunch of small scripts where I go character by character and convert or drop specific Unicode characters, depending on what I'm interested in. newLISP has been the ideal language for my work on the text of the Dead Sea Scrolls and other old manuscripts that are in Unicode.
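A minimal sketch of what such a "pop-bom" could look like. The body is only an assumption about how one might write it: it handles just the UTF-8 BOM (the bytes EF BB BF) and works on a string already read into memory rather than on a true stream.

```newlisp
;; Hypothetical helper: the name comes from the post above, the body is a guess.
;; Drops a leading UTF-8 byte order mark (bytes EF BB BF) if one is present.
(define (pop-bom str)
  (if (starts-with str "\xEF\xBB\xBF")
      (slice str 3)   ; slice counts bytes, so 3 skips exactly the BOM
      str))           ; no BOM: return the input unchanged

;; Usage
(pop-bom "\xEF\xBB\xBFabc")   ; the BOM is removed, leaving "abc"
(pop-bom "abc")               ; nothing to strip, still "abc"
```

Wrapping the read, as in (pop-bom (read-file "untitled")), keeps the check in one place, so the rest of a script never sees the BOM.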
One of my most useful scripts just reads in a stream a character at a time and makes a histogram: it counts every unique character and prints out the count, with the FULL Unicode name of that character, plus its hex and decimal value. I call it unicode-histogram.lsp. If you're interested, I could post it here.
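This is not unicode-histogram.lsp itself, only a rough sketch of the counting idea. It assumes a UTF-8 enabled newLISP, so that explode and char work on whole characters rather than bytes. The function names char-histogram and print-histogram are made up for illustration, and the full Unicode names are left out because they would need a lookup table such as the Unicode Consortium's UnicodeData.txt.

```newlisp
;; Sketch only: assumes a UTF-8 enabled newLISP.
;; Returns a list of (character count) pairs in order of first appearance.
(define (char-histogram text)
  (letn (chars (explode text)      ; one string per character
         uniq  (unique chars))     ; each distinct character once
    ;; count gives, for each element of uniq, how often it occurs in chars
    (map list uniq (count uniq chars))))

;; Prints one line per character: hex code point, decimal code point, count.
;; Invisible characters such as the BOM (U+FEFF) show up by number.
(define (print-histogram text)
  (dolist (entry (char-histogram text))
    (letn (ch (first entry)
           code (char ch))         ; code point of the character
      (println (format "U+%04X %7d %7d" code code (last entry))))))

;; Usage
(char-histogram "abca")   ; (("a" 2) ("b" 1) ("c" 1))
```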
Darth_Severus wrote: Even if this were the right way to do it, the print function would also have to show it. You can't be serious about just keeping it how it is. It's invisible. A programmer can't know which files a user will use as input, so this is one more thing to think about when writing a program. Your line, or something like it, would have to be part of every script that reads user-generated or third-party text files. Not to forget, non-advanced programmers won't have Unicode knowledge.
Is it really done this way in other languages, like Python?
I'd strongly prefer to have it not handled like this. I also see no need for it. Finding out what kind of file it is should only happen when it is really needed, and a function like the "file" command in Linux would be a better way to do that. Maybe also some way to write a file with a BOM.
Yeah, but I see - Linux does it the same way:
Code:
> ((exec "cat ~/untitled")0)
"### Unicode?"
> (((exec "cat ~/untitled")0)0)
""
Horrible, but it seems to be standard.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence. Nine months later, they left with a baby named newLISP. The women of the ivory towers wept and wailed. "Abomination!" they cried.