Online FPC with FPU

Posted by ALB42 on 8 November 2020, No Comments

You know the online FPC compiler, right? Until now it only compiled with soft float, because that is usually enough. I have now added the option to compile with the FPU enabled (the resulting program then needs an FPU, but runs much faster). For example, the raycaster I presented over the last few days only runs this fast with an FPU. And because it is just a single file, it is perfect for the online FPC compiler. Have fun!

Vampire Gold 2.8 FPU

Posted by ALB42 on 25 March 2018, 3 Comments

During my vacation two new versions of the Vampire core were released. One (2.7.1) only adds more serial numbers, so there are no actual core changes, and the second (2.8) has some bugfixes. One bugfix is called "- minor FPU fix", sadly with no more information about what that means. I heard the rounding issue is solved and MUIMapparium (FPU version) should work now, so let's try that.

But sadly MUIMapparium still shows nothing, and that is because the rounding issue is still not solved in this core (as you see in the picture). The precision problem persists as well. So I'm not sure what they fixed in this version and what happened to the MUIMapparium fixes they showed before 🙁 More waiting.

Vampire 2.7 FPU

Posted by ALB42 on 3 March 2018, No Comments

I just found out you can turn off the FPU with "VCONTROL FP=0". Sometimes it works, sometimes not (it just crashes; whether this is because of the sloppy power supply on my Vampire or just bad timing, who knows). But afterwards you can start FEmu as with 2.5, and the test codes work again. The MUIMapparium FPU version is working as well :-D.
In principle that would be the better alternative to the current situation: implement single precision inside the FPGA and trap the rest, which can then be covered by FEmu. That would make Quake, demos and so on fast, but would not violate IEEE 754.

In the current situation the Amiga (Vampire) is no longer IEEE 754 compliant. The Double format is now something like a 24-bit significand (like IEEE single) with an 11-bit exponent (like IEEE double): it can still go up to 1e308, but the precision is much lower, as shown before. Also strange: if you multiply two big numbers to produce an overflow, for example 1e200 * 1e200 in double precision, that should give "+Inf" or an exception, but the Vampire FPU shows 1.8e+308 (something close to the maximum Double value) and you can continue to calculate with that.
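The overflow check described above can be reproduced with a few lines. This is only a minimal sketch, not the test program used for this post; it assumes the Math unit of the target provides SetExceptionMask and IsInfinite, and the helper Multiply is a made-up name that keeps the multiplication at run time.

```pascal
program overflowtest;
{$mode objfpc}
uses
  Math;

// Hypothetical helper: multiplies at run time, so the compiler cannot
// fold the product into a constant.
function Multiply(x, y: Double): Double;
begin
  Result := x * y;
end;

var
  big, r: Double;
begin
  // Mask all FPU exceptions, so an overflow yields +Inf instead of a
  // run-time error.
  SetExceptionMask([exInvalidOp, exDenormalized, exZeroDivide,
                    exOverflow, exUnderflow, exPrecision]);
  big := 1e200;
  r := Multiply(big, big);
  if IsInfinite(r) then
    writeln('overflow gives infinity')
  else
    writeln('overflow gives a finite value: ', r);
end.
```

On an IEEE 754 compliant FPU the first branch should be taken; the second branch is what the Vampire behaviour described above would correspond to.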

Vampire 2.7 FPU Part 2

Posted by ALB42 on 3 March 2018, No Comments

I played a little bit more with the Vampire FPU. Yesterday's example shows another interesting effect: the calculation of a * 1.3 gives an incorrect result. There is some problem in the last digits when comparing with the 68882 in the A1200 or the 68060 in UAE.
Let's dig a little deeper into that and test how well the double precision calculation works. MUIMapparium needs double precision everywhere; I tried to switch it to single, but the calculations give really wrong results (Eiffel Tower somewhere near London, that kind of thing). To test double precision we multiply 1 by a value only very slightly larger than 1, very often, to see if the low bits are handled correctly. For example like this:

program testfpu;
{$mode objfpc}
var
  i: Integer;
  b: Double;
begin
  b := 1;
  for i := 1 to 100000 do
  begin
    b := b * 1.0000000001;
  end;
  writeln(b);
end.

As you can see, we multiply the 1 in b by 1 plus a very little bit more, and this little bit more is just beyond what single precision can resolve. Here is how the results look:

Vampire 2.7 1.0000000000000000E+000
Amiga 1200 68882 1.0000100000502183E+000
Linux x86_64 1.0000100000494698E+000

Ehm.. no, Vampire, that's wrong. It seems it does all FPU calculations in single precision, so even if they repair the rounding problem, MUIMapparium would still not be usable.
Btw. I tried the same thing with FEmu back in the day and it worked as expected.
So maybe that's also the reason for the very good benchmark results… if you do the calculations in single instead of double, of course you can be much faster (and the original 68k FPUs calculate everything in extended, so they are even more precise). I'm really thinking about going back to Gold 2; it was slower but reliable.
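To see why a result of exactly 1 points to single precision, the same loop can be run once with a Single and once with a Double accumulator. This is a small sketch for illustration only (the function names are made up, and it is not one of the TestFPU programs); on an IEEE-conforming system the Single value should stay at 1, because 1.0000000001 rounds back to 1 after every step.

```pascal
program singlevsdouble;
{$mode objfpc}

const
  Steps = 100000;

// Same multiplication as in the test above, but the accumulator is
// rounded to Single after every step.
function AccumulateSingle: Single;
var
  i: Integer;
begin
  Result := 1;
  for i := 1 to Steps do
    Result := Result * 1.0000000001;
end;

// Reference version with a Double accumulator.
function AccumulateDouble: Double;
var
  i: Integer;
begin
  Result := 1;
  for i := 1 to Steps do
    Result := Result * 1.0000000001;
end;

begin
  writeln('Single: ', AccumulateSingle);
  writeln('Double: ', AccumulateDouble);
end.
```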
As always the downloads: TestFPU3, TestFPU3.pas

Vampire V2.7 with FPU

Posted by ALB42 on 2 March 2018, 8 Comments

The new Vampire firmware V2.7 is released, and it contains a hardware FPU in the FPGA (some rarely used 68881/68882 instructions are still emulated, as on the 68060/68040). Nevertheless, that's very nice for my MUIMapparium (you remember the problem?). Of course I flashed the new version right away. First the bad news: it's VERY unstable for me. It is said that it needs some soldering, because there are some errors on the early Vampire cards that make the power supply to the FPGA bad… something like this, and I'm affected by it. So it's a little bit annoying to work with, because there are drawing errors on the screen and it crashes often. I reduced the screen resolution and closed all background programs, which made it much better. But to really try it out I have to wait until someone fixes my card. I can't do that myself: the scariest thing in the world is a coder with a screwdriver, let alone soldering equipment :-P.
But that was not the topic of this post. I tried the MUIMapparium FPU version on my new Vampire 2.7… good news: it starts and does not crash; bad news: the map stays empty. The same executable worked well with FEmu on the old Gold 2 (I checked specifically before I flashed the new core) and still works in UAE. The FPU calculation itself seems to work, because the mouse pointer movement shows reasonable coordinates. I was a little surprised that even the GUI in the map window was gone. I checked the code, and ah yes, there is a tiny floating point calculation there; fine, let's see what that is. And my guess was right, it is the floating point calculation: the button size is calculated as font size * 1.2 to have a little more space around it. After adding some debug output, it seems that the floating point calculation works well but the rounding always returns zero. So I wrote a little test program to test the rounding; here are the outputs of the test program on my setups:

Vampire 2.7:
a := 5 = 5
a * 1.0 = 5.000000000E+00
Round(a * 1.0) = 0
b := a * 1.3 =  6.499999523E+00
Round(b) = 0
Ceil(b) = 7
Floor(b) = 6
Floor(b) = 6
press enter
end

Amiga 1200/030/68882:
a := 5 = 5
a * 1.0 = 5.000000000E+00
Round(a * 1.0) = 5
b := a * 1.3 =  6.500000000E+00
Round(b) = 6
Ceil(b) = 7
Floor(b) = 6
Floor(b) = 6
press enter
end

UAE 68060 emul:
a := 5 = 5
a * 1.0 = 5.000000000E+00
Round(a * 1.0) = 5
b := a * 1.3 =  6.500000000E+00
Round(b) = 6
Ceil(b) = 7
Floor(b) = 6
Floor(b) = 6
press enter
end

Well? Who spots the difference? Funny that only Round() has this problem, but Ceil, Trunc and Floor do not. This also explains why MUIMapparium shows no maps at all, if everything is rounded to 0. OK, I have to wait until they fix that… yeah, I could replace every Round(a) by Floor(a+0.5), but why should I do that? Something is clearly broken in the FPGA here.
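For reference, the Floor(a+0.5) workaround mentioned above would look roughly like this. It is a sketch only; RoundViaFloor is a made-up name and nothing like it is part of MUIMapparium.

```pascal
program roundworkaround;
{$mode objfpc}
uses
  Math;

// Hypothetical replacement for Round(): shift by 0.5 and round down.
function RoundViaFloor(x: Double): Integer;
begin
  Result := Floor(x + 0.5);
end;

var
  b: Double;
begin
  b := 6.5;
  writeln('Round(b)         = ', Round(b));
  writeln('RoundViaFloor(b) = ', RoundViaFloor(b));
end.
```

Note that the two are not identical: Round() in Free Pascal rounds exact halves to the nearest even number, which is why the reference machines above print 6 for 6.5, while the Floor variant gives 7. So the replacement would also change results on platforms where Round() works.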
 
If you want to try it on your own computer: exe for m68k with FPU: TestFPU, and the source: TestFPU.pas

Vampire FPU emulation

Posted by ALB42 on 29 July 2017, 2 Comments

The very first version of the SoftFPU, called femu 0.1, is released, and of course I want to try how well (and how fast) it works. It is the first version, so no one should expect wonders. It comes in three versions: 030, 040 and 080 (why is there no 020?). In principle I wanted to try all of them, but only the 080 version works on the Vampire. The 030 version crashes directly, the 040 version crashes on the first FPU instruction. So I have to stay with the 080 version.
First again my Mandelbrot program. (Sadly the picture output does not work currently, it is not big-endian compatible 😉 so I cannot check whether the result is OK.)

Mandelbrot results (Runtimes, shorter is better)

Test                      Mandelbrot single precision   Mandelbrot double precision
68060/50 MHz FPU          0.12 s                        0.15 s
68060/50 MHz SoftFPU      9.53 s                        23.72 s
Vampire SoftFPU           3.81 s                        13.37 s
68030 68882/50 MHz FPU    2.14 s                        2.31 s
68030 SoftFPU             38.03 s                       71.87 s
Vampire Femu 0.10         11.14 s                       10.31 s

That's already rather interesting: it seems that femu calculates everything in double, which makes sense because the FPU always uses extended. There was already a hint in the manual that femu needs the double precision math libraries from the system. It seems that femu is just a wrapper that routes the TRAPs to the libraries. Not a bad idea actually. In double it's even a little bit faster than the FPC SoftFPU, not bad; as guessed, there is a lot of optimization potential in the FPC SoftFPU 😉

Next the Scimark test:

SciMark2 results (MFlops, higher is better)

Vampire V600 V2+ 128 MB FPU Code femu 0.10
Mininum running time = 2.00 seconds
Composite Score MFlops:     0.08
FFT             Mflops:     0.04    (N=1024)
SOR             Mflops:     0.12    (100 x 100)
MonteCarlo:     Mflops:     0.05
Sparse matmult  Mflops:     0.09    (N=1000, nz=5000)
LU              Mflops:     0.09    (M=100, N=100)


Vampire V600 V2+ 128 MB SoftFPU code
Mininum running time = 2.00 seconds
Composite Score MFlops:     0.06
FFT             Mflops:     0.03    (N=1024)
SOR             Mflops:     0.12    (100 x 100)
MonteCarlo:     Mflops:     0.03
Sparse matmult  Mflops:     0.08    (N=1000, nz=5000)
LU              Mflops:     0.02    (M=100, N=100)

These SciMark tests are usually done in double precision, so we see the same trend as in the Mandelbrot. It's very nice that these tests already run without any problems. Kudos to the coder, it works.

To check more FPU instructions I took out my real-time raytracer (OK, on Amiga not that real-time anymore :-P), changed it to a routine that saves a single picture, and compiled it for FPU and SoftFPU. It works and the picture looks very nice, as it should:

TraceRay FPU on Vampire with femu 0.10

As visible in the picture, it needed 730 s to render that picture (as I said, not really real-time); with the FPC SoftFPU it needs 280 s, and the 68030/68882/50 MHz needs 224 s (side note: on my AROS i386 box that image needs 0.2 s). In all cases the picture looks good. femu does what it promised: not actually very fast, but reliable. A little disturbing, of course, is the mouse freezing when the TRAPs occur. But here the coder of femu can't do anything. As far as I understood, he works closely together with the Vampire developers, so maybe he will get a faster TRAP mechanism (or even one without TRAPs) for the emulation in a later Vampire firmware.
At the moment I would still prefer to use FPC's SoftFPU for MUIMapparium, because there the mouse does not freeze, so the GUI feels snappier.

To FPU or not to FPU

Posted by ALB42 on 28 May 2017, No Comments

When I work on MUIMapparium I usually only work on Linux and test on AROS Linux-hosted, which is very convenient and fast. When I started MUIMapparium, I also tested at the end on every platform whether it works and how fast it is. For the last two releases I skipped this part, due to lack of time.

But yesterday I tried MUIMapparium on my Amiga 600 with Vampire and was shocked how slowly it behaves. Moving the map is just not usable, with around a second of reaction time. I downloaded and compiled older versions to check when this problem appeared. Deep in the back of my mind I already guessed that the fixed position calculation could be the reason (see here). That is pure floating point calculation, and a lot of it. I tested that with the initial implementation and it did not seem too slow, because for simple map moving and zooming this conversion has to be done only two or three times, so the influence is not very big.

So why is it so slow now? The difference is that before I tested with a bare MUIMapparium without any markers or tracks loaded. A marker only adds a single conversion to the list. But tracks need a conversion for every recorded (and maybe drawn) point. Remember that most GPS devices measure the position once per second, which means that for an hour's walk you get about 3600 points (usually the GPS already strips the "not moved" points, but you still get around 1000 points). For NG Amigas with their massive computing power, especially on the FPU side, this is not much of a problem: 1000 FPU calculations at some hundred MFlops are done in a few milliseconds.
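To give an idea of what "one conversion per track point" costs, here is a rough sketch. It is not the MUIMapparium code; it assumes the common Web Mercator "slippy map" formula for turning a coordinate into a tile position, and CoordToTile as well as the track data are made up for illustration.

```pascal
program trackcost;
{$mode objfpc}

// Hypothetical conversion from latitude/longitude (degrees) to fractional
// tile coordinates, using the common Web Mercator formula (an assumption).
procedure CoordToTile(Lat, Lon: Double; Zoom: Integer; out TileX, TileY: Double);
var
  n, latRad: Double;
begin
  n := 1 shl Zoom;               // number of tiles per axis at this zoom
  latRad := Lat * Pi / 180.0;
  TileX := (Lon + 180.0) / 360.0 * n;
  TileY := (1.0 - Ln(Sin(latRad) / Cos(latRad) + 1.0 / Cos(latRad)) / Pi) / 2.0 * n;
end;

const
  TrackPoints = 1000; // about one hour of walking, as estimated above
var
  i: Integer;
  tx, ty, sumX: Double;
begin
  sumX := 0;
  for i := 0 to TrackPoints - 1 do
  begin
    // Made-up track: a straight line heading east from 52N 13E
    CoordToTile(52.0, 13.0 + i * 0.0001, 12, tx, ty);
    sumX := sumX + tx;
  end;
  writeln('converted ', TrackPoints, ' points, mean tile X = ', sumX / TrackPoints);
end.
```

Every point needs a logarithm, a sine, a cosine and several divisions in double precision, so with an emulated FPU a track multiplies the cost of each redraw by the number of points.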
But on the Vampire it's a different story: no FPU, so it has to use the SoftFPU emulation of Free Pascal. This raised the question: how fast is the SoftFPU emulation on a Vampire compared to a real 68060/50 MHz FPU? The Vampire's integer performance is much higher than that of the 68060 (around twice as fast, see here), but for an emulated FPU a lot of code is needed to emulate it correctly.
I used two tests for that: a simple Mandelbrot algorithm in single and double precision, and the well-known SciMark from NIST, each compiled either with the FPC SoftFPU emulation or with 68881 FPU support.

Mandelbrot results (Runtimes, shorter is better)

Test                      Mandelbrot single precision   Mandelbrot double precision
68060/50 MHz FPU          0.12 s                        0.15 s
68060/50 MHz SoftFPU      9.53 s                        23.72 s
Vampire SoftFPU           3.81 s                        13.37 s

When comparing the SoftFPU times of the 060 and the Vampire you can see the factor of 2-3 that I already observed before. But the (often called "very slow") 68060 FPU leaves the SoftFPU Vampire far behind in the dust. (In fact the dust has already settled again before the SoftFPU finishes the calculation.) Of course the error bars for the calculations with FPU are huge, since the time is too short for a reliable measurement, but a bigger calculation would just take ages with SoftFPU 😉 and the trend is nicely visible.

Next is the SciMark. It uses various real-life floating point calculations, like FFT, matrix multiplication and Monte Carlo simulation. If you work in science you know that stuff; if not, just believe me that this is what science programs do all day 😉

SciMark2 results (MFlops, higher is better)


Vampire V600 V2+ 128 MB SoftFPU code
**                                                               **
** SciMark2a Numeric Benchmark, see http://math.nist.gov/scimark **
**                                                               **
** Delphi Port, see http://code.google.com/p/scimark-delphi/     **
**                                                               **
Mininum running time = 2.00 seconds
Composite Score MFlops:     0.06
FFT             Mflops:     0.03    (N=1024)
SOR             Mflops:     0.12    (100 x 100)
MonteCarlo:     Mflops:     0.03
Sparse matmult  Mflops:     0.08    (N=1000, nz=5000)
LU              Mflops:     0.02    (M=100, N=100)

Amiga1200 68060/50 FPU code
**                                                               **
** SciMark2a Numeric Benchmark, see http://math.nist.gov/scimark **
**                                                               **
** Delphi Port, see http://code.google.com/p/scimark-delphi/     **
**                                                               **
Mininum running time = 2.00 seconds
Composite Score MFlops:     2.26
FFT             Mflops:     1.18    (N=1024)
SOR             Mflops:     5.05    (100 x 100)
MonteCarlo:     Mflops:     0.86
Sparse matmult  Mflops:     1.81    (N=1000, nz=5000)
LU              Mflops:     2.41    (M=100, N=100)

So it shows just the same trend. Attention: do not compare these MFlops with the theoretical MFlops most speed tests show you (like SysInfo); you can see how differently the tests behave. It depends very strongly on which instructions are used and how much memory bandwidth is needed.

In conclusion, this shows really nicely why MUIMapparium with a track is currently so slow on a Vampire: because of the slow SoftFPU. It is very sad that the Vampire still lacks proper FPU support. We (ChainQ and I) believe that it is possible to optimize the SoftFPU performance, maybe by 50%, or even to double it. Or let's aim for the stars: 10 times faster than now (I do not believe that is even close to possible). For the people who believe a SoftFPU implementation could replace a native FPU in the FPGA: even then it would still be around 5 times slower than a 68060/50 MHz FPU.

That means: if it reacts very slowly on the Vampire, just remove the track. 😉 I will work on this: reduce the needed calculations (by using more memory), see in which places I could possibly go down to single precision (not much hope there ;-)) and of course reduce the number of points, in principle a LOD based on the zoom level.

P.S.
If you want to test SciMark you can download the FPU and SoftFPU exe from my server: SciMark FPU Version, SciMark SoftFPU Version. (I would be very interested in 68881/2 results.)

Amiga „Cluster“ with FPU

Posted by ALB42 on 16 November 2015, One Comment

The flickering is much better; the picture was painted too often and the message loop was polled too rarely. I optimized this a little, and now it happens very rarely. When I first started these tests I implemented a simple speed test which does the same calculation for one second and then counts the successfully calculated pixels.
So of course it would be nice to have it in this graphical server as well, to optimize the number of pixels requested from each client. I noticed that the Amiga 1200 with 68060/50 MHz only finished 3-4 pixels per second. I know it is slow, but that slow? (Besides this, I'm surprised how fast the Mac mini with MorphOS is, nice :-)) But then I remembered that the m68k Free Pascal does not use the FPU. By default it uses softfloat routines, which of course are very slow. But my automatic compiler server also compiles the FPU version of all units (enable it in the compiler with -Cf68881 and use the FPU-compiled units). The FPU-compiled version is much, much faster: now 200 px/s. Wow, now the Amiga really adds some pixels to the finished image. (Before also some, but less than a line ;-))
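A one-second speed test of the kind described above could be sketched like this. The per-pixel calculation used by the cluster is not shown in this post, so a Mandelbrot iteration serves only as a stand-in, and all names here are made up.

```pascal
program speedtest;
{$mode objfpc}
uses
  SysUtils;

const
  MaxIter = 255;

// Stand-in workload: iterates z := z*z + c for one pixel and returns
// the number of iterations done.
function MandelPixel(cr, ci: Double): Integer;
var
  zr, zi, t: Double;
begin
  zr := 0;
  zi := 0;
  Result := 0;
  while (Result < MaxIter) and (zr * zr + zi * zi <= 4.0) do
  begin
    t := zr * zr - zi * zi + cr;
    zi := 2.0 * zr * zi + ci;
    zr := t;
    Inc(Result);
  end;
end;

// Counts how many pixels can be calculated within the given time.
function PixelsPerInterval(Milliseconds: QWord): Integer;
var
  start: QWord;
begin
  Result := 0;
  start := GetTickCount64;
  while GetTickCount64 - start < Milliseconds do
  begin
    MandelPixel(-0.5, 0.0); // point inside the set, so always MaxIter iterations
    Inc(Result);
  end;
end;

begin
  writeln('Pixels per second: ', PixelsPerInterval(1000));
end.
```

Compiled once with the default softfloat settings and once with -Cf68881, the same source should show the kind of gap described above, although the absolute numbers depend on the workload.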

Speed test of the "Amiga Cluster"

It should be noted that 192.168.0.122 (and 127.0.0.1) is a computer with eight cores, but only one is used here (in fact two, because 127.0.0.1 is AROS hosted on the same computer). I didn't implement multicore threading.

Shi(f)t FPU

Posted by ALB42 on 2 August 2014, No Comments

I found out what's wrong with my OpenGL Shift-it. There is a division by zero exception in the GL library, especially on tiny vertices (like the one on the tip of a sphere) or on tiny slices of clipped boxes.
The question was why it works in C but not in Free Pascal: I tried to make a sphere using gluSphere() in C and it worked perfectly, but in my program it always crashed.
On the hunt for the source of the exception, I first had to find out which kind of FPU exception/trap it is. So I wrote my own trap handler and checked the parameters on entering it. Sadly the FPU content was already cleared, and the buffer where the content of the FPU registers should be stored was not assigned.
So I took another way: I set the FPU exceptions to masked one by one and checked when the crash disappeared.
I expected bad things to happen in other parts of the calculation with the exceptions masked. Surprisingly, it worked VERY nicely. That way I found out that masking the Division by Zero exception makes the crash disappear.
This made me curious, and I checked what the settings of standard C programs are in AROS. I made a simple C program which reads out the FPU control register (which controls the exceptions) and I got $37F.. this really made me laugh out loud (my wife must have thought I had gone mad), because that simply means all FPU exceptions are masked… meaning ignored. I also checked my C OpenGL program: the same. In my Free Pascal programs I always find $1372, which means "Invalid Operation", "Division By Zero" and "Overflow" are not masked.
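The two values can be decoded by hand. The following sketch assumes the x87 control word layout (this was AROS on x86), where the low six bits are the exception masks and a set bit means "masked"; the function name is made up.

```pascal
program fpucw;
{$mode objfpc}

const
  // Exception mask bits 0..5 of the x87 control word
  MaskNames: array[0..5] of string = (
    'Invalid Operation', 'Denormal Operand', 'Division By Zero',
    'Overflow', 'Underflow', 'Precision');

// Returns the names of all exceptions that are NOT masked in the
// given control word.
function UnmaskedExceptions(cw: Word): string;
var
  bit: Integer;
begin
  Result := '';
  for bit := 0 to 5 do
    if (cw and (1 shl bit)) = 0 then
    begin
      if Result <> '' then
        Result := Result + ', ';
      Result := Result + MaskNames[bit];
    end;
end;

begin
  writeln('$037F: ', UnmaskedExceptions($037F));
  writeln('$1372: ', UnmaskedExceptions($1372));
end.
```

For $037F the list is empty (everything masked), and for $1372 it contains exactly the three exceptions named above.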
I found out that Delphi also deactivates all FPU exceptions by default if OpenGL is used (in the initialization section of the opengl unit). Now I'm thinking: one way would be to deactivate FPU exceptions for all FPC programs on AROS (because I have no way to catch them), or simply to deactivate them for programs using OpenGL (as Delphi does).

A nice side note: this error only happens with software rendering; if you use hardware rendering it disappears completely. I didn't know that I can use hardware rendering in Linux-hosted mode. A user at aros-exec told me I should try the mesa.library in sys:storage/libs, which has hardware support for Linux-hosted installations. This is indeed very nice, and then the game is really fun.

The next step for the game would be to include sounds, the password system and some GUI for configuration, but I have postponed all these things for now. The game is already rather nice, and the coding was really fun. And it is proof that you really can make nice games/programs for AROS with FPC 😉

PiStorm Emu68

Posted by ALB42 on 11 December 2022, No Comments

As I wrote before, the PiStorm is already very nice and stable, but not that fast. I heard that with Emu68 it should be much faster (though with fewer features: no network, AHI direct access and so on).

I used this manual. I also attached a cooler to the Raspi so that it does not overheat anymore, and it works nicely: it stays just barely under 70 degrees even with whole-day use.

You can feel that it is much faster… SysInfo tells you that it is much faster, but to be honest, I never trusted SysInfo, so let's get out my old test codes. Sadly the links to the benchmark game are gone, it seems the page does not exist anymore, but I still have the sources, so I can still play around with them 🙂

Especially the tests fannkuch 10 and BinaryTree 15. (68060/50 MHz = A1200 with Blizzard 1260, MorphOS = Mac mini 1.4 GHz, UAE/AROS/Linux = Athlon FX 8120 3.16 GHz)

machine               BinaryTree                      fannkuch
                      Time [s]   Speed rel. to 060    Time [s]   Speed rel. to 060
68060/50Mhz           42.9       1.0                  59.6       1.0
Vampire A600          23.4       1.8                  28.3       2.1
PiStorm A600 Emu68    11.4       3.8                  11.9       5.0
MorphOS 68k emul      3.0        14.3                 33.0       1.3
WinUAE OS3.9          1.4        30.6                 2.0        29.8
MorphOS PPC           1.33       32.3                 2.3        25.9
AROS i386             0.6        71.5                 0.7        85.1
Linux x86_64          0.3        143                  0.2        298

4-5 times the speed of a 68060/50, not bad. It also emulates an FPU (68040), so FPU tests are possible as well, for example the SciMark test:

Amiga1200 68060/50 FPU code
Mininum running time = 2.00 seconds
Composite Score MFlops:     2.26
FFT             Mflops:     1.18    (N=1024)
SOR             Mflops:     5.05    (100 x 100)
MonteCarlo:     Mflops:     0.86
Sparse matmult  Mflops:     1.81    (N=1000, nz=5000)
LU              Mflops:     2.41    (M=100, N=100)

Now for the PiStorm A600 Emu68 it looks like this:
Mininum running time = 2.00 seconds
Composite Score MFlops:    20.35
FFT             Mflops:    13.12    (N=1024)
SOR             Mflops:    41.80    (100 x 100)
MonteCarlo:     Mflops:     6.61
Sparse matmult  Mflops:    18.16    (N=1000, nz=5000)
LU              Mflops:    22.07    (M=100, N=100)

So here we are around 9 to 10 times faster than the good old 68060/50 (composite score 20.35 vs. 2.26).

Here again it feels much faster, because the RTG output is so smooth and the hard disk access is so much faster; speed-wise it feels more like UAE. MUIMapparium for example is really nicely usable and scrolls very smoothly.

Emu68 also boots much faster than the other method and shows a nice boot picture, so you do not see the Raspi booting; it just takes a bit longer than a usual Amiga boot (as if you had disconnected the floppy).

And the most important thing… it seems to be as stable as the other method: some hours of DeliTracker playing without a crash, and rebooting is much more reliable than with the other method… it always works. I think I will stay with Emu68 for now.