I blogged a while back about the basic engine, so lets start there :-
Operations like 32 bit addition can be done in 32 cycles one bit at a time.
That blog post was about fully-parallel, searchable, content-addressable memory. This type of memory is used in lookup hardware for caches, it is not a commonly available part in large sizes, because it is power-hungry and not in demand.
But - a poor man’s version of this is to emulate the parallel operation with a bit-serial processing engine (a PE below) - and often the interesting operations like addition and multiplication are bit-serial anyway.
Now, we can use regular RAM, but accessing it “column-wise” instead of row addressing. Everything takes (much ..) longer, but you can be doing a lot of these same operations at a time.
A 4 Gig RAM chip today makes 64k squared - the chip itself is 64K rows and 64k columns.
Instructions (broadcast to all PEs) are just a Karnaugh Map of all inputs :-
- a single bit input from RAM address X or Y
- An accumulator or two
- static data from the instruction (Immediate)
So - a super simple PE - basically enough to do bit-serial addition (about the most complicated thing it will do) - but also logic operations.
The point is that the data is processed in situ - not pulled across a bus one at a time and stored back.
If we take a clock speed of 1Ghz for easy math, addition (X+Y = Z) would take about 3 clocks per bit (read X, read Y, write Z). So - 32 bits at 10 MHz. A multiply? 32 of those, so 300 Khz.
We can do 300k multiplies per second, on 64k fields simultaneously - about 20Gig multiplies / second, on a single chip. Obviously, you can multiply that up by the number of chips you have.
There are problems :-
Data I/O - we have to keep the computer fed. This will be as big a job as the compute, and will probably need its own PE.
If it is read out the same way as the PEs read the data, it will also need to be turned through 90 degrees. The SIMD engine can chunk it up to word-size, simplifying the job for the server.
Communication. Shifts - and some binary tree with latency and bottlenecks.
Reduction. Maybe as simple as all accumulators == 0.
Algorithms. What types of problems lend themselves to this SIMD processing?
This is definitely a solution looking for an algorithm, and needing an accomodating partner.
I have an Artix 7 FPGA development board from Digilent - a reasonable size FPGA - and the (free) Xilinx development tools. It is great to be back into hardware design. I first of all need an executive on the chip, and I have chosen a Forth engine - it is very small, and has an extensible instruction set that I think can drive the SIMD stuff directly. It is based heavily on j1eforth , written in Verilog.
It talks via a USB UART to a minicom terminal on my laptop.
The FPGA has 50 x 512 x 72bit RAM blocks, but there are limitations on the configurations that mean my optimal depth is 1024.
I should be able to go 1024 bits wide by 1024 bits deep. It doesn’t have to be square, it just turned out that way - but really I am testing the principle rather than the performance at this time.
I am recently using the Raspberry Pi for its SPI serial communications bus.
My SIMD Engine works, but there are still some funny bugs that need to be found. Written in Verilog, it is programmed in forth.
For I/O, I am using SPI to get it to a Raspberry Pi, which brings it all with a sigh of relief into network-space. The Artix7 has a network port - maybe I should use their Microblaze to get direct access to the port.
I think the forth engine is fast enough to do all the I/O by polling, because the FPGA can help out on the difficult bits. Conversely, I/O is terrible ..
Some bugs, but the basic engine seems to be all there.