read-file doesn't work with /dev/stdin

TedWalther
Posts: 608
Joined: Mon Feb 05, 2007 1:04 am
Location: Abbotsford, BC

read-file doesn't work with /dev/stdin

Post by TedWalther »

How can I make read-file work with /dev/stdin? When the user doesn't specify an input file, I want my application to fall back to stdin to slurp its data in. The input is potentially UTF-8 data, and I will want to iterate over each character in the stream.

Writing to /dev/stdout works just fine.

Perhaps the simplest solution is to let read-file etc. accept file descriptors (integers) as well as file name strings?

Example code:

; formats and I/O targets come from the environment
(setq from-format (sym (or (env "FROM") "utf8")))
(setq to-format (sym (or (env "TO") "neo-paleo-hebrew")))
(setq output-file (or (env "SAVEAS") "/dev/stdout"))
(setq input-file (or (env "OPEN") "/dev/stdin"))

; copy the input to the output one character at a time
(dostring (c (read-file input-file))
    (append-file output-file (char c)))

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

'read-file' can only be used to read from files.

You could use 'read-char' to read from stdin, using 0 as the device; it returns one byte at a time. Or use 'read-line' and then 'explode' the line into UTF-8 multi-byte characters. In both cases processing does not start until a line-feed arrives.

Code: Select all

#!/usr/bin/newlisp

(while (!= (setq ch (read-char 0)) (char "q"))
    (println "->" ch))

(exit)
Or use 'read-key', which will process immediately after each key-stroke:

Code: Select all

#!/usr/bin/newlisp

(while (!= (setq ch (read-key)) (char "q"))
    (println "->" ch))

(exit)
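The 'read-line' plus 'explode' variant mentioned above could look like this (untested sketch, assuming the UTF-8 enabled version of newLISP):

Code: Select all

#!/usr/bin/newlisp

; read stdin line by line, then split each line
; into UTF-8 characters with 'explode'
(while (setq line (read-line))
    (dolist (c (explode line))
        (println "->" c)))

(exit)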

TedWalther
Posts: 608
Joined: Mon Feb 05, 2007 1:04 am
Location: Abbotsford, BC

Post by TedWalther »

Lutz, could it be time to replace read-char with read-byte, and make read-char do the UTF-8 thing?

Would it be hard to have read-file slurp in fd 0? Generally I cat a file into the program, and that is no different from reading a regular file. The only difference is that read-char takes an already opened file descriptor, while read-file wants to do the open itself. Once read-file had accepted all available input from fd 0, it would make sense for it to close it.
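
Something like this untested sketch is what I have in mind ('slurp-stdin' is just a made-up name):

Code: Select all

; slurp everything from fd 0 using 'read-buffer'
(define (slurp-stdin)
    (let (buf "" chunk "")
        (while (read-buffer 0 'chunk 4096)    ; nil at EOF
            (setq buf (append buf chunk)))
        buf))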

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

I see 'read-char', 'read-buffer' and 'read-line' as the low-level API, using file descriptors and dealing with octets, and 'read/write/append-file' as the high-level API, dealing with file names.
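
For example, the two levels side by side (sketch; 'data.bin' is only a placeholder name):

Code: Select all

; low-level: explicit file handle, octets
(setq fd (open "data.bin" "read"))
(read-buffer fd 'chunk 16)    ; 'chunk' now holds up to 16 octets
(close fd)

; high-level: file name in, whole contents out
(setq contents (read-file "data.bin"))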

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

A future version of newLISP will have a 'read-utf8' which works like 'read-char', but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead. 'explode' splits a string on multi-byte character borders in UTF-8 enabled versions of newLISP.
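
For example, in the UTF-8 enabled version:

Code: Select all

(explode "שלום")        ; → ("ש" "ל" "ו" "ם"), one element per character
(explode (read-line))   ; the same splitting applied to a line from stdin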

TedWalther
Posts: 608
Joined: Mon Feb 05, 2007 1:04 am
Location: Abbotsford, BC

utf8, pack, char, and iso8859-8 (hebrew)

Post by TedWalther »

Lutz wrote:A future version of newLISP will have a 'read-utf8' which works like 'read-char', but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead. 'explode' splits a string on multi-byte character borders in UTF-8 enabled versions of newLISP.
Thank you for the explanation. I'm meditating now on how the whole UTF-8 vs. byte-oriented paradigm works in newLISP. I should have spent more time meditating on the pack and unpack functions.

Does (char ...) use (pack ...) internally? Should it? I recently had some problems when trying to output some Hebrew characters in code page ISO-8859-8. I think (char ...) was interpreting them as UTF-8 characters.

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

'char' does not use 'pack' or 'unpack' internally, but 'char' is UTF-8-sensitive when using the UTF-8 enabled version of newLISP. If a Windows installation works with ISO-8859-x character sets, then one should not use the UTF-8 enabled version for Hebrew and most Eastern European (Cyrillic) character sets.

When the UTF-8 enabled version of newLISP sees Hebrew ISO-8859-8 characters (greater than 127), it sees them not as Hebrew but as something else. All characters below 128 will be interpreted as ASCII in both ISO-8859-x and UTF-8/Unicode. All characters above 127 are one-byte-wide in ISO-8859-x character sets but could start different one- or multi-byte-wide characters in UTF-8.
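
For example, byte 224 is alef (א) in ISO-8859-8, but in the UTF-8 enabled version 'char' produces a two-byte UTF-8 sequence instead; 'pack' always emits the raw octet (sketch):

Code: Select all

(char 224)        ; UTF-8 version: two-byte string "à" (0xC3 0xA0)
(pack "b" 224)    ; always the single octet 0xE0 - alef in ISO-8859-8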

Here is a list of all functions sensitive to UTF-8 characters when using the UTF-8 enabled version: http://www.newlisp.org/newlisp_manual.html#utf8_capable

PS: by default the Windows installer ships as non-UTF-8, and the Mac OS X installer is UTF-8 enabled.

newdep
Posts: 2038
Joined: Mon Feb 23, 2004 7:40 pm
Location: Netherlands

Post by newdep »

Lutz,

Why do you make a difference in function names between UTF-8 and non-UTF-8?

It's somewhat awkward to have that inside a language...
I would expect global functionality in a function instead of separate behaviour...

Just a thought...
-- (define? (Cornflakes))

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California

Post by Lutz »

Some functions work differently in the UTF-8 and non-UTF-8 versions, like these: http://www.newlisp.org/newlisp_manual.html#utf8_capable They are meant to work strictly on displayable strings and can show this global behavior. But there are also functions where you need both versions, like 'read-char' and 'read-utf8'. If you let 'read-char' switch behavior like the others in the link, you would not be able to read binary files or ISO-8859 files correctly.
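
A small demonstration (sketch; '/tmp/heb.txt' is just a scratch file): the Hebrew letter shin is one character to 'explode', but two octets to 'read-char':

Code: Select all

#!/usr/bin/newlisp

; "ש" is a single character but two UTF-8 octets (215 169)
(write-file "/tmp/heb.txt" "ש")

(println (explode "ש"))      ; ("ש") - one character in the UTF-8 version

(setq fd (open "/tmp/heb.txt" "read"))
(println (read-char fd))     ; 215 - first octet
(println (read-char fd))     ; 169 - second octet
(close fd)

(exit)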
