qa-float crash

newdep · Post by **newdep** » Tue Sep 22, 2009 10:17 am

Hi Lutz,

On OS/2 with newLISP 10.1.5, I'm getting a crash on return from this code inside qa-float ->

(set 'result '())
(set 'u 1.0)
;(while (> u 0.0) (set 'u (mul u 0.5)) (push u result))
 (while (> u 0.0) (set 'u (mul u 0.5)) (println u))

Not sure whether it should crash or not; I don't have any other OS at the moment to compare it with..

Code: Select all

..
..
..
1.822780505e-304
9.113902524e-305
4.556951262e-305
2.278475631e-305
1.139237816e-305
5.696189078e-306
2.848094539e-306
1.424047269e-306
7.120236347e-307
3.560118174e-307
1.780059087e-307
8.900295434e-308
4.450147717e-308
2.225073859e-308

Killed by SIGFPE
pid=0x0097 ppid=0x0040 tid=0x0001 slot=0x006e pri=0x0200 mc=0x0001
E:\PROG\NL\NEWLISP-10.1.5\NEWLISP.EXE
NEWLISP 0:00011eb9
cs:eip=005b:00021eb9      ss:esp=0053:0017faf0      ebp=0017fb28
 ds=0053      es=0053      fs=150b      gs=0000     efl=00002297
eax=005503a0 ebx=005503a0 ecx=00550a40 edx=3fe00000 edi=0017fb08 esi=00000003
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

Lutz · Post by **Lutz** » Tue Sep 22, 2009 11:20 am

This is a rare underflow condition that OS/2 handles by raising an exception. You could set up a signal handler for SIGFPE in the function setupAllSignals() around line 310 in newlisp.c, or you could set it up in newLISP itself in the startup code. All other OSs handle subnormals without trapping: the operation simply returns the tiny (denormalized) result, which eventually becomes zero.

This has always been in OS/2 but with qa-float we are forcing it to show up for the first time.

newdep · Post by **newdep** » Tue Sep 22, 2009 4:11 pm

Aha!

This is what I get now..after the sigfpe fix..

A far better error report ;-)

Code: Select all

[E:\prog\nl\newlisp-10.1.5].\newlisp qa-float
SYS1808:
The process has stopped.  The software diagnostic
code (exception code) is  0097.

Here is the simple addon for the newlisp.c code at line 310 ->

Code: Select all

#ifndef WIN_32

#if defined(SOLARIS) || defined(TRU64) || defined(AIX)
setupSignalHandler(SIGALRM, sigalrm_handler);
setupSignalHandler(SIGVTALRM, sigalrm_handler);
setupSignalHandler(SIGPROF, sigalrm_handler);
setupSignalHandler(SIGPIPE, sigpipe_handler);
setupSignalHandler(SIGCHLD, sigchld_handler);
#else
setupSignalHandler(SIGALRM, signal_handler);
setupSignalHandler(SIGVTALRM, signal_handler);
setupSignalHandler(SIGPROF, signal_handler);
setupSignalHandler(SIGPIPE, signal_handler);
setupSignalHandler(SIGCHLD, signal_handler);
#ifdef OS2
setupSignalHandler(SIGFPE, signal_handler);
#endif
#endif

#endif
}

Btw.. I'll put that in for the (import ...) too, that one crashes far too often here ;-)

newdep · Post by **newdep** » Tue Sep 22, 2009 5:46 pm

Mmm, actually something is fishy with the NaNs in the OS/2 code..
even a simple (sqrt -1) takes ages and then crashes..

I'll have a closer look at the NaN handling in the code; I had not actually checked that..

I'm not sure if it's the Pentium I'm working on, or the compiler, or the code ;-)

Lutz · Post by **Lutz** » Tue Sep 22, 2009 7:30 pm

I assume it is the FP library in OS/2, but check Windows or Linux on this machine, if you can. Both should be fine on qa-float, unless it's one of those (very old) Pentiums with problems in the FP processing units.

newdep · Post by **newdep** » Tue Sep 22, 2009 9:26 pm

Now this is getting interesting.. It's good to read that a bunch of hardware coders and
compiler writers don't even care about precision ;-) But that's a different story..

Searching the internet for a faulty P4, I did indeed run into stories from back in 1995;
this P4 is from around 1998, so I expect it to have a bug anyway, because it is one from
a low-cost Compaq mainstream line where the sticker "Intel Inside" is bigger than the
machine itself..

Looking at the GCC compiler optimizations, I added -march=pentium4 together
with -O2. This does indeed help a bit, but not yet enough to cover the DBL_MIN value
(DBL_MIN 2.2250738585072014e-308), which is what is bothering me..

So this is what I did on the command line, a max and a min.. The only place
where it crashes is the -308, not the max..

Code: Select all

> (mul 2e308 2)
inf
> (mul 2e+308 2)
inf
> (div 2e+308 2)
inf
> (div 2e308 2)
inf
> (div 2e-308 2)
SYS1808:
The process has stopped.  The software diagnostic
code (exception code) is  009A.

I'm digging deeper...

PS: A small C program compiled with GCC also crashed.. So it's now time
to dig into this gcc port..

PPS: And why is there a difference of 6 all the time between the result and the
original??

Code: Select all

> (div (mul 4195835e128 3145727e128 ) 3145727e128 )
4.195835e+134

Is 4195835e128 equal to 4.195835e+134? 6 more zeros? Is this a rounding error?

> (div (mul 4195835e134 3145727e134 ) 3145727e134 )
4.195835e+140

..

> (div (mul 4195835e140 3145727e140) 3145727e140 )
4.195835e+146

> (div (mul 4195835e146 3145727e146) 3145727e146 )
4.195835e+152

> (div (mul 4195835e152 3145727e152) 3145727e152 )


SYS1808:
The process has stopped.  The software diagnostic
code (exception code) is  0098.

PPPS: I did some tests for the FDIV P4 bug, but that's not my P4.. So let's assume
Compaq did sell a working "Intel Inside" for a second...

newdep · Post by **newdep** » Wed Sep 23, 2009 11:00 am

Ah yes, OK, the 6 zeros are obvious..
(not when you've been staring at it for a few hours already ;-)

I now have 2 possibilities: either it's gcc that has the issue, or the Klibc I'm working against...

...still digging...

newdep · Post by **newdep** » Thu Sep 24, 2009 12:58 pm

I dug into gcc (the OS/2 port) and can't make much sense of it.. It's a spaghetti of #defines..

Anyway.. in a simple test in plain C, both a (sqrt -1) and a (div 0 0) trigger the SIGFPE.
I'm unable to get those to return NaN..

The (div 1 0) returns inf
The (div 0 0) crashes (or, with the adjustment, traps and then exits)

Where exactly in the newLISP code do I need to make the adjustment to make
it return "nan"? I tried several places.. no luck without a trap..

It would be fine for me to get a return of "nan" via the trap, but I would like to keep newLISP
running from that point on... not sure if that's possible..

Norman

Lutz · Post by **Lutz** » Thu Sep 24, 2009 1:41 pm

Division by zero is caught by newLISP in file nl-math.c. For integers it is the arithmetikOp() function and for floats the floatOp() function.

newdep · Post by **newdep** » Thu Sep 24, 2009 3:20 pm

OK, it's not that simple, and it seems that NaN is simply not defined correctly in the GCC for OS/2.

For now I use the generic error return ERR_MATH,
which is an official newLISP error and does not alter any code,
and MODULO still uses it too ->

> (div 0 0)

ERR: division by zero in function div

Now I still need to fix

(div 0)

(sqrt -1)

and a few others..

Lutz · Post by **Lutz** » Thu Sep 24, 2009 3:40 pm

I wonder what the following C program produces in OS2:

Code: Select all

#include <stdio.h>

#ifdef __BIG_ENDIAN__
#define __nan_bytes     { 0x7f, 0xf8, 0, 0, 0, 0, 0, 0 }
#endif

#ifdef __LITTLE_ENDIAN__
#define __nan_bytes     { 0, 0, 0, 0, 0, 0, 0xf8, 0x7f }
#endif

int main(int argc, char * argv[])
{
double dFloat;
char bytes[8] = __nan_bytes;

dFloat = *(double *)bytes;

printf("NaN = %lf\n", dFloat);
}

produces:

Code: Select all

NaN = nan

on Mac OS X

The bit pattern used is derived from this:

Code: Select all

> (unpack "bbbbbbbb" (pack "<lf" (sqrt -1)))
(0 0 0 0 0 0 248 255)
> (unpack "bbbbbbbb" (pack ">lf" (sqrt -1)))
(255 248 0 0 0 0 0 0)
>

newdep · Post by **newdep** » Thu Sep 24, 2009 4:09 pm

;-) That's what I thought about too last night..
Good that you bring it up, actually.. I'll try that tonight...

Actually the raw (sqrt -1) in C crashes too, so I can't test anything with a (sqrt -1) in it...
But I'll test that endian code..

newdep · Post by **newdep** » Thu Sep 24, 2009 4:19 pm

I did adjust __xxx_ENDIAN__ to __xxx_ENDIAN;
this is what it returns..

Code: Select all


#ifdef __BIG_ENDIAN
#define __nan_bytes     { 0x7f, 0xf8, 0, 0, 0, 0, 0, 0 }
#endif

#ifdef __LITTLE_ENDIAN
#define __nan_bytes     { 0, 0, 0, 0, 0, 0, 0xf8, 0x7f }
#endif

int main(int argc, char * argv[])
{
double dFloat;
char bytes[8] = __nan_bytes;

dFloat = *(double *)bytes;

printf("NaN = %lf\n", dFloat);
}
[E:\PROG\NL\newlisp-10.1.5]nan
NaN = nan

Using BIG_ENDIAN it returns -> nan
using LITTLE_ENDIAN it returns -> NaN = 0.000000

Now, because you previously mentioned IEEE 754, it got me thinking...

newdep · Post by **newdep** » Thu Sep 24, 2009 8:51 pm

Summary: "totally, utterly flabbergasted".. I can't find it..

newdep · Post by **newdep** » Sat Sep 26, 2009 4:05 pm

Let's see if I can adjust nl-math.c with
http://www.gnu.org/s/libc/manual/html_n ... d-NaN.html
and then explicitly test with isfinite() before returning..

newdep · Post by **newdep** » Sat Sep 26, 2009 7:56 pm

Aah, fixed it!

Lutz, I sent you a PM on this... I'll post the solution in here once you have checked it..

Look Mam... no hands!

> (/ 0 0)

ERR: division by zero in function /
> (div 0)

ERR: division by zero in function div
> (div 0 0)
nan
> (log 0)
-inf
> (sqrt -1)
nan
> (div 0)
inf
>

Lutz · Post by **Lutz** » Sat Sep 26, 2009 8:24 pm

yes, seems to be solved. This will avoid the compiler warning:

Code: Select all

#ifdef OS2
    case SIGFPE:
        errorProc(ERR_MATH);
        break;
#endif

So does it run qa-float (the one in 10.1.6 checking signed inf) well?

I will either make a development release for 10.1.6, or perhaps wait until the next release update and post just the affected files in the development directory.

newdep · Post by **newdep** » Sat Sep 26, 2009 8:39 pm

It seems that the very first time a NaN or Inf occurs it returns
the "ERR: division by zero" message.. The next time you run the same
function again it returns the NaN or Inf..

So somewhere the ERR: is still in the way...
qa-float now stops at the ERR:

Code: Select all

* fresh startup *

> (sqrt -1)

ERR: division by zero in function sqrt

> (sqrt -1)
nan

Code: Select all

* fresh startup *

> (div 0)

ERR: division by zero in function div
> (div 0)
inf
>

Lutz · Post by **Lutz** » Sat Sep 26, 2009 9:17 pm

Do you have this in line 343 in function setupAllSignals()?

Code: Select all

#ifdef OS2
setupSignalHandler(SIGFPE, signal_handler);
#endif

Lutz · Post by **Lutz** » Sat Sep 26, 2009 9:19 pm

... perhaps just take "errorProc(...)" out and let it catch the signal and do nothing:

Code: Select all

#ifdef OS2
    case SIGFPE:
        break;
#endif

in line 412

newdep · Post by **newdep** » Sat Sep 26, 2009 9:31 pm

No, that doesn't work. I read somewhere that to actually catch the SIGFPE you need
a longjmp or to create a function... I think that's what is happening now with the errorProc action..

Calling only PrintErrorMessage(...) directly doesn't work.. leaving it empty with a
break causes the real SIGFPE again.. So I need to re-route the signal and clear
the ERR before it displays the NaN...

* added *

What does the return(nilCell); do in errorProcAll? I think I need that in
the SIGFPE case.. Or a signal reset?

newdep · Post by **newdep** » Sat Sep 26, 2009 9:56 pm

Just wanted to see what actually happens with the signals.

The first time, when newLISP starts and sees a division by zero, it reports the ERR:
and I get a trap number 8 (which is SIGFPE).. The second time NO signal! but directly
the "inf".. Is this just dumb luck? Or is there really something in between? The second time it's not from the SIGFPE, else I would have seen the Signal message again... Mmmm

newLisp v 10.1.6 ........

> (div 0)
Signal = 8

ERR: division by zero in function div

> (div 0)
inf
>

newdep · Post by **newdep** » Sat Sep 26, 2009 10:13 pm

I'll try a different signal handler tomorrow..
GNU writes this about SIGFPE..

— Macro: int SIGFPE

The SIGFPE signal reports a fatal arithmetic error. Although the name is derived from “floating-point exception”, this signal actually covers all arithmetic errors, including division by zero and overflow. If a program stores integer data in a location which is then used in a floating-point operation, this often causes an “invalid operation” exception, because the processor cannot recognize the data as a floating-point number. Actual floating-point exceptions are a complicated subject because there are many types of exceptions with subtly different meanings, and the SIGFPE signal doesn't distinguish between them. The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985 and ANSI/IEEE Std 854-1987) defines various floating-point exceptions and requires conforming computer systems to report their occurrences. However, this standard does not specify how the exceptions are reported, or what kinds of handling and control the operating system can offer to the programmer.

BSD systems provide the SIGFPE handler with an extra argument that distinguishes various causes of the exception. In order to access this argument, you must define the handler to accept two arguments, which means you must cast it to a one-argument function type in order to establish the handler. The GNU library does provide this extra argument, but the value is meaningful only on operating systems that provide the information (BSD systems and GNU systems).

FPE_INTOVF_TRAP
Integer overflow (impossible in a C program unless you enable overflow trapping in a hardware-specific fashion).
FPE_INTDIV_TRAP
Integer division by zero.
FPE_SUBRNG_TRAP
Subscript-range (something that C programs never check for).
FPE_FLTOVF_TRAP
Floating overflow trap.
FPE_FLTDIV_TRAP
Floating/decimal division by zero.
FPE_FLTUND_TRAP
Floating underflow trap. (Trapping on floating underflow is not normally enabled.)
FPE_DECOVF_TRAP
Decimal overflow trap. (Only a few machines have decimal arithmetic and C never uses it.)

newdep · Post by **newdep** » Sun Sep 27, 2009 7:25 pm

Hi Lutz,

This works out of the box on my OS/2 machine, no strange things at all.

In newLISP, the very first time the SIGFPE occurs I still get the ERR:... and then
from the second time on the nan or inf..

Perhaps you know where the "ERR:" mix-up could be in newLISP?
Because I can't find it... ;-)

Code: Select all

#include <stdio.h>
#include <float.h>
#include <signal.h>
#include <math.h>
#include <setjmp.h>

/* testing NaN and Inf return */


/* store stack */
jmp_buf errorJump;
int errorReg = 0;

void signal_handler(int sig)
{
switch(sig)
	{
	case SIGFPE: 
/*	signal(SIGFPE,SIG_DFL); */
	printf("%s", "SIGFPE!\n"); 
	longjmp(errorJump,errorReg);	
	break;	
	default: return;	
	}

}

int main ()
{

	/* save stack */
	setjmp(errorJump);

	/* nan-inf go through sigfpe */
	signal(SIGFPE, signal_handler); 

	double nfloat;

	nfloat = (sqrt (-1));
	printf("sqrt=%f\n", nfloat );

	nfloat = (log (0));
	printf("log=%f\n", nfloat );

	nfloat /= 0;
	printf("div=%f\n",  nfloat ); 

}

Code: Select all

[E:\PROG\NL\newlisp-10.1.6]f
SIGFPE!
sqrt=nan
log=-inf
div=-inf

Here I removed the errorProc function and only used the longjmp.
The first time that returns nothing, and then the nan. That's also
not like my example; it seems there is still some code messing around
with the results inside newLISP? But I can't find it..

Code: Select all

> (sqrt -1)
> (sqrt -1)
nan
>

newdep · Post by **newdep** » Mon Sep 28, 2009 9:07 am

I found the double entry that causes the problem..

It's inside the errorReg check...
Mmm, actually it's a setjmp/longjmp issue
where the int var is 0 or 1.. because of the number
of jumps used now..

Code: Select all

if((errorReg = setjmp(errorJump)) != 0) 
    {
    printf("ErrorReg2=%d\n", errorReg);

    if(errorReg && (errorEvent != nilSymbol) ) 
        executeSymbol(errorEvent, NULL, NULL);
    else  exit(-1);

    goto AFTER_ERROR_ENTRY;
    }

I first thought there might be a difference in gcc or OS/2 regarding the setjmp
behaviour, so I tested with this -> http://www-personal.umich.edu/~williams ... pmode.html
But that's identical on both my Linux and OS/2..

A closer look shows this flow inside newLISP ->
The first setjmp = 0 (= errorReg) in the function above, then the (div 0) appears.
The signal_handler for SIGFPE sees errorReg = 0 (initial).
Because there is a longjmp, the next setjmp gets a 1 (from the longjmp).
So the errorReg check does "goto AFTER_ERROR_ENTRY" with a new
setjmp, but that one of course returns 1 (due to the last longjmp)..
At this point the signal_handler & the jmp_buf are both 1 and the stack is the same.

OK.. I'm looking inside the code now for a fix, because these "saved stacks" need to be in sync ;-)