
[NOSC] Steamer16 Q&A

Rick Hohensee's questions about Steamer16:

1) It's a one-stack MISC with 3 parameter stack cells?

Yes, that's all that would fit on the CY37128P84. I figured
this was enough to do something useful after a paper
evaluation using the quadratic solution as a benchmark,
as it could be performed on a Hewlett-Packard RPN calculator.
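
As a rough illustration of that kind of paper evaluation, the sketch
below checks that the discriminant b*b - 4*a*c of the quadratic
formula can be evaluated RPN-style without ever needing more than 3
stack cells. The class Stack3 and the function discriminant are
hypothetical names, and the multiply and subtract steps are
calculator operations, not Steamer16 instructions.

```python
# Sketch only: a 3-cell RPN evaluation stack. Names are hypothetical and
# the arithmetic is that of a calculator, not Steamer16's instruction set.

class Stack3:
    DEPTH = 3  # number of parameter stack cells

    def __init__(self):
        self.cells = []
        self.max_depth = 0

    def push(self, value):
        if len(self.cells) >= self.DEPTH:
            raise OverflowError("more than 3 cells needed")
        self.cells.append(value)
        self.max_depth = max(self.max_depth, len(self.cells))

    def binary(self, op):
        # Pop two operands, push one result (RPN style).
        right = self.cells.pop()
        left = self.cells.pop()
        self.push(op(left, right))


def discriminant(a, b, c):
    """Return (b*b - 4*a*c, deepest stack use) for the given coefficients."""
    s = Stack3()
    s.push(b); s.push(b); s.binary(lambda x, y: x * y)  # b*b
    s.push(4); s.push(a); s.binary(lambda x, y: x * y)  # 4*a
    s.push(c); s.binary(lambda x, y: x * y)             # 4*a*c
    s.binary(lambda x, y: x - y)                        # b*b - 4*a*c
    return s.cells[0], s.max_depth
```

Calling discriminant(1, -3, 2) evaluates 9 - 8 and reports a deepest
stack use of 3 cells, so this ordering of the operands just fits.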

2) how many gates/transistors on the device in question?

Cypress's datasheet doesn't use those metrics to quantify things.
The report excerpt under the next question sets the record straight.

3) how much room is left?

The following is an excerpt from the report file generated by
Cypress's Warp2 VHDL compiler. There is not enough left over
to add more instructions or stack registers. I know because
I tried ;)

  Information: Macrocell Utilization.

                     Description        Used     Max
                 | Dedicated Inputs   |    1  |    1  |
                 | Clock/Inputs       |    2  |    4  |
                 | I/O Macrocells     |   59  |   64  |
                 | Buried Macrocells  |   52  |   64  |
                 | PIM Input Connects |  242  |  312  |
                                         356  /  445   = 80  %

                                      Required     Max (Available)
          CLOCK/LATCH ENABLE signals     2           12
          Input REG/LATCH signals        0           69
          Input PIN signals              3            5
          Input PINs using I/O cells     0            0
          Output PIN signals            59           64

          Total PIN signals             62           69
          Macrocells Used              111          128
          Unique Product Terms         476          640
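
The headroom implied by the report can be worked out directly. The
sketch below only re-does the arithmetic on the figures quoted above;
the variable and function names are made up for illustration.

```python
# Figures copied from the Warp2 report excerpt: (used, max) per resource.
resources = {
    "Dedicated Inputs": (1, 1),
    "Clock/Inputs": (2, 4),
    "I/O Macrocells": (59, 64),
    "Buried Macrocells": (52, 64),
    "PIM Input Connects": (242, 312),
}

def percent(used, maximum):
    """Utilization as a percentage."""
    return 100.0 * used / maximum

# Totals of the first table (the report prints 356 / 445 = 80 %).
total_used = sum(used for used, _ in resources.values())
total_max = sum(maximum for _, maximum in resources.values())

# Macrocells are the I/O and buried macrocells taken together.
macrocells_used = resources["I/O Macrocells"][0] + resources["Buried Macrocells"][0]
macrocells_free = 128 - macrocells_used

# Product terms, from the last line of the report.
product_terms_free = 640 - 476

print(total_used, total_max, percent(total_used, total_max))
print(macrocells_used, macrocells_free, percent(macrocells_used, 128))
print(product_terms_free, percent(476, 640))
```

That leaves 17 of 128 macrocells and 164 of 640 product terms unused,
i.e. roughly 87% and 74% utilization respectively.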

4) how long did it take to write the VHDL?

The initial implementation (~18 months ago) was written in 3 days.
After reviewing the boolean equations in the report file and
convincing myself that I expressed myself correctly, I spent 4
days running simulations. This was essentially a warmup exercise
with the then-unfamiliar tools prior to doing "serious" work for
a startup which went nowhere. The design sat in the can for a
year and I kept wondering whether it was worth realizing, due
to its obvious limitations and much sexier examples of silicon
that are out there. The CY37128P84 was chosen because it was the
biggest gun available that would fit in a socket I could deal
with, with a view to hand-wiring a prototype for the (unfunded)
aforementioned startup. I had inventory of a 60 pc. min. qty.
purchase all dressed up with nowhere to go, so I dusted off
the design files and convinced myself that it WAS worth taking
to completion.

I wrote myself an assembler (in Forth, of course ;) and became
unhappy with the fact that the NOP, padding that was frequently
required between the last explicit instruction in a packet and a
following (packet-aligned) jump target label cost useless clock
cycles to execute, so I redesigned the instruction sequencer to
force a fetch cycle under those circumstances. While I was at it,
I made the master reset synchronous. This took 1 day, followed by
4 days of fresh simulator sessions.
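
The padding rule can be sketched as a small packing routine. This is
not the Forth assembler described here, only a minimal Python model
under stated assumptions: 5 instruction slots per packet (the figure
mentioned later in this message), labels always starting a fresh
packet, and inline literals ignored entirely. The function name pack
and the program format are hypothetical.

```python
SLOTS_PER_PACKET = 5  # assumption taken from the "2 out of 5 slots" remark

def pack(program):
    """Pack ("op", name) and ("label", name) items into packets.

    Returns (packets, labels), where labels maps each label to the index
    of the packet it starts. Inline literals are not modelled, so the
    indices are not real memory addresses.
    """
    packets, labels, current = [], {}, []

    def flush():
        # Pad a partly filled packet with NOP, and close it.
        if current:
            current.extend(["NOP,"] * (SLOTS_PER_PACKET - len(current)))
            packets.append(list(current))
            current.clear()

    for kind, name in program:
        if kind == "label":
            flush()                      # jump targets are packet-aligned
            labels[name] = len(packets)
        else:
            current.append(name)
            if len(current) == SLOTS_PER_PACKET:
                flush()
    flush()
    return packets, labels
```

For example, two instructions followed by a label leave three NOP,
slots in the first packet, which is exactly the padding that the
redesigned sequencer skips by forcing a fetch.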

All hell broke loose when I fired up the prototype! Series
termination on the clocks and high-speed strobes solved some of
the problems, but some of the test software crashed consistently.
Persistence and trial and error showed that the zgo, instruction
failed when a jump was taken in 2 out of 5 slots: inline literals
were being executed. It took me 2 days of staring at the VHDL code
for the instruction packet sequencer to figure out what the problem
was and fix it. I also changed the write strobe timing prior to
identifying the real problem and have kept it that way since. I
took a look at my test vectors and confirmed that they never
generated the scenario which caused the zgo, instruction to fail.

So in summary it took ~6 days of VHDL design or debugging and ~8
days of simulation (should have been more in retrospect) to bring
things to the current happy state of affairs.

5) you're using real good old SRAM with it?

I'm using a pair of 25 nS 32Kx8 SRAMs liberated from the L2 cache
sockets on obsolete PC motherboards on their way to the dumpster. A15
goes directly to the /CE pins of the SRAMs, mapping them to the low 32K
words of the address space.
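
The decoding amounts to a single bit test. A minimal sketch, with a
hypothetical function name, assuming only that /CE is active low as
the slash notation indicates:

```python
def sram_selected(address):
    """True when A15 is low, i.e. the active-low /CE pins are asserted."""
    a15 = (address >> 15) & 1
    return a15 == 0

# Addresses 0x0000..0x7FFF select the SRAMs; 0x8000..0xFFFF do not.
print(sram_selected(0x7FFF), sram_selected(0x8000))
```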

6) what are its bus widths?

Both the address and data busses are 16 bits. There is no byte
addressing capability. 2 early (R/W and W/R) and 2 late (/RD and /WR)
control signals are generated to obviate the need for external
decoding (and consequent delays) in most foreseeable cases. W/R goes
to the SRAM /OE pins and /WR goes to the /WE pins. The /WR strobe is
synchronous to the 40 MHz 2x master clock and pipelined such that it
pulses low when the 1x clock (derived from the 2x clock) goes low to
put data and address hold time well clear of the SRAM's 0 nS minimum.
The /RD strobe is combinatorially generated when the 20 MHz 1x clock
is low to assure data hold time during read cycles, but this signal
is not used on the prototype system. All of the internal registers are
synchronously clocked by the rising edge of the 1x clock.

7) etc etc

The 20 MHz clock limit is due to the ripple-carry adder implementation.
BTW, the first VHDL module I wrote used the syntax "+" to specify the
behavior of the adder, and that _in_itself_ exceeded the capacity of the
chip. Apparently, a full lookahead-carry implementation was attempted.
Later on I noticed that there were radio buttons in the project options
dialog box: Goal <- area|speed which defaulted to speed. I had already
explicitly written the ripple-carry solution and haven't tried the "+"
syntax with area as the goal.
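
For readers unfamiliar with the trade-off: a ripple-carry adder is a
chain of one-bit full adders, cheap in logic but slow because the
carry has to pass through every stage in turn. The Python model below
is only a behavioral sketch of that structure, not the VHDL used in
Steamer16; the function names are hypothetical.

```python
WIDTH = 16  # data path width

def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def ripple_add(x, y, carry_in=0):
    """Add two 16-bit values one bit at a time, least significant first.

    Each stage needs the carry from the stage before it, which is why
    the worst-case delay grows with the width.
    """
    result = 0
    carry = carry_in
    for bit in range(WIDTH):
        s, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry
```

Adding 0xFFFF and 1 is the worst case: the carry ripples through all
16 stages before the result settles at 0 with a carry out of 1.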

Interrupts, wait states, and bus sharing features are not implemented
due to fitting limitations. A 2nd CY37128 is used on my prototype
system as an I/O companion chip with 16 bits of parallel input, 16
bits of parallel output, a free-running 20 MHz 16-bit timer, and a
register which provides the 2/ function. It is self-decoded near the
top of memory. A cable to the host PC provides the JTAG interface
for burning the JEDEC fuse maps into the devices and downloading code
using the boundary scan registers to drive the target busses. The
JTAG code downloader is integrated with the assembler.
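
A 16-bit timer clocked at 20 MHz wraps quickly, so software timing
with it has to allow for rollover. The sketch below assumes the timer
counts up by one per clock, which is not stated above, and uses
hypothetical function names.

```python
TIMER_HZ = 20_000_000  # free-running timer clock
TIMER_BITS = 16
TIMER_MASK = (1 << TIMER_BITS) - 1

def rollover_seconds():
    """Time for the timer to count through all 65536 states."""
    return (1 << TIMER_BITS) / TIMER_HZ

def elapsed_ticks(start, now):
    """Ticks between two readings, correct across one rollover.

    Assumes an up-counting timer and less than one full rollover
    between the two readings.
    """
    return (now - start) & TIMER_MASK

print(rollover_seconds())             # about 3.28 milliseconds
print(elapsed_ticks(0xFFF0, 0x0010))  # 32 ticks, despite the wrap
```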

The 5V operating current was measured at 440 mA on the prototype system
with 2x CY37128 (Steamer16 + I/O), 2x 25 nS SRAMs, and a 40 MHz 2x
clock oscillator module with the system out of reset and running a real
application. The situation could be improved by invoking the low-power
options of the CY37128, downgrading the 1x clock to ~10 MHz, and using
low-power 70 nS SRAMs, which could be battery backed up to retain code
and data during the powerdown condition.

The code is not particularly space efficient despite the compact
instruction encoding due to the heavy incidence of inline literals,
the primitive instruction set, and NOP, padding statistics.

The software tools are written in a bastard dialect of Forth-79 that I
have been using since 1990. The assembler proper is lean and mean.
Colon definitions in an include file are used to implement macros.

Because Steamer16 is not a true Forth chip, and in fact lacks call and
return instructions, it is necessary to use a re-entrant stack frame
strategy I call ArF (don't ask unless you have a sense of humor) which
is a hybrid between Forth and C under the hood. A call is implemented
as a sequence of 7 instructions, a return is 5 instructions, and a read
or write indexed into the stack frame is 5 instructions. The coding
style that inevitably results can be considered bad Forth that
over-uses static variables and the PICK operator, but on the other
hand relieves the programmer of optimizing the order of operands on
the stack or resorting to the use of stack reordering operators. ArF
mandates preservation of the input arguments: the results are
appended to the list and the calling parent is solely responsible for
building up or tearing down the stack frame as in C, therefore Forth
operators like DUP, OVER, and R@ are not required. The on-chip
evaluation stack is nevertheless used for bursts of Forth-like
activity until it is appropriate to write an intermediate or final
result to the stack frame or a static variable. The fine-grained
subroutine factoring that is one of Forth's major strengths is not
as attractive as it is with a true Forth chip, but there are some
compensations and opportunities for optimization that are unique
to the Steamer16/ArF genre. I have experimented enough with the
software end of things to discover that agonizing over
shaving off a cycle or two from a subroutine is very seldom
worthwhile and sometimes, very surprisingly, retrograde. I believe
this is due to the instruction alignment statistics, which are
deterministic but difficult to exactly predict as one is writing
code. This is not to say that the programmer should not be
performance-conscious, but rather that the straightforward ArF macros
offer _pretty_good_ runtime performance. The more I use it the more I
like it, and it beats the snot out of any off-the-shelf CPU that I
have used so far, except for DSPs.
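
The instruction counts quoted above give a rough idea of what a
subroutine costs under ArF. The sketch below is only arithmetic on
those counts, with hypothetical names. It counts instructions, not
clock cycles, since the cycle count also depends on packet alignment
and inline literals.

```python
# Instruction counts quoted for the ArF macros.
CALL_INSTRUCTIONS = 7
RETURN_INSTRUCTIONS = 5
FRAME_ACCESS_INSTRUCTIONS = 5  # one indexed read or write of the stack frame

def arf_overhead(frame_accesses):
    """Instructions spent on linkage and frame accesses for one call."""
    return (CALL_INSTRUCTIONS
            + RETURN_INSTRUCTIONS
            + FRAME_ACCESS_INSTRUCTIONS * frame_accesses)

# A subroutine that reads two arguments and writes one result:
print(arf_overhead(3))  # 27 instructions before any useful work
```

With a fixed 12 instructions per call and return pair, very small
subroutines are dominated by linkage, which is consistent with
fine-grained factoring being less attractive here.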

The response to my postings has been enough to make me decide to get
off my butt and finally get a personal website together. I hope to
have presentable documentation organized within 2 weeks. VHDL source
for the CPU and I/O chips and Forth for the assembler and JTAG code
downloader will be available. I hope to include schematics for the
prototype system, if the scanner I have access to will co-operate.

Myron Plichota
