[NOSC] Steamer16 Q&A

> Rick Hohensee's questions about Steamer16:
> 1) It's a one-stack MISC with 3 parameter stack cells?
> Yes, that's all that would fit on the CY37128P84. A paper
> evaluation, using the quadratic solution as a benchmark as it
> could be performed on a Hewlett-Packard RPN calculator, convinced
> me this was enough to do something useful.
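
The claim that three stack cells suffice for the quadratic benchmark can be checked with a small sketch. The key sequence below is an assumption (the post does not give the one actually used), `sqrt` stands in for a calculator key rather than a Steamer16 instruction, and `run_rpn`, `PROGRAM`, and `MAX_DEPTH` are hypothetical names.

```python
import math
import operator

MAX_DEPTH = 3  # the three parameter stack cells mentioned in the answer

BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}

def run_rpn(tokens, variables):
    """Evaluate RPN tokens, refusing to exceed MAX_DEPTH; return (result, deepest)."""
    stack = []
    deepest = 0
    for tok in tokens:
        if tok in BINARY:
            y = stack.pop()
            x = stack.pop()
            stack.append(BINARY[tok](x, y))
        elif tok == "sqrt":
            stack.append(math.sqrt(stack.pop()))
        else:
            # Operand: a named variable (fetched from memory) or a literal.
            stack.append(variables[tok] if tok in variables else float(tok))
        if len(stack) > MAX_DEPTH:
            raise OverflowError("stack deeper than %d cells" % MAX_DEPTH)
        deepest = max(deepest, len(stack))
    return stack[-1], deepest

# One root: (sqrt(b*b - 4*a*c) - b) / (2*a), ordered to keep the stack shallow.
PROGRAM = "b b * 4 a * c * - sqrt b - 2 a * /".split()

root, deepest = run_rpn(PROGRAM, {"a": 1.0, "b": -3.0, "c": 2.0})
print(root, deepest)
```

With this ordering the operands are re-fetched by name instead of being duplicated on the stack, which is what keeps the depth at three.
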
> 2) how many gates/transistors on the device in question?
> Cypress's datasheet doesn't use those metrics to quantify things.
> The next heading sets the record straight.
> 3) how much room is left?
> The following is an excerpt from the report file generated by
> Cypress's Warp2 VHDL compiler. There is not enough left over
> to add more instructions or stack registers. I know because
> I tried ;)
>   Information: Macrocell Utilization.
>                      Description        Used     Max
>                  ______________________________________
>                  | Dedicated Inputs   |    1  |    1  |
>                  | Clock/Inputs       |    2  |    4  |
>                  | I/O Macrocells     |   59  |   64  |
>                  | Buried Macrocells  |   52  |   64  |
>                  | PIM Input Connects |  242  |  312  |
>                  ______________________________________
>                                          356  /  445   = 80  %
>                                       Required     Max (Available)
>           CLOCK/LATCH ENABLE signals     2           12
>           Input REG/LATCH signals        0           69
>           Input PIN signals              3            5
>           Input PINs using I/O cells     0            0
>           Output PIN signals            59           64
>           Total PIN signals             62           69
>           Macrocells Used              111          128
>           Unique Product Terms         476          640
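
The totals in the report excerpt can be re-derived from its own rows. This is only an arithmetic check on the figures quoted above; the dictionary layout and the `percent` helper are hypothetical.

```python
# (used, max) pairs copied from the Warp2 report excerpt above.
resources = {
    "Dedicated Inputs": (1, 1),
    "Clock/Inputs": (2, 4),
    "I/O Macrocells": (59, 64),
    "Buried Macrocells": (52, 64),
    "PIM Input Connects": (242, 312),
}

def percent(used, maximum):
    """Utilization as a percentage."""
    return 100.0 * used / maximum

total_used = sum(used for used, _ in resources.values())
total_max = sum(maximum for _, maximum in resources.values())

# Macrocells are the I/O and buried macrocells taken together.
macrocells_used = resources["I/O Macrocells"][0] + resources["Buried Macrocells"][0]
macrocells_max = resources["I/O Macrocells"][1] + resources["Buried Macrocells"][1]

print("overall       %d / %d = %.1f %%" % (total_used, total_max, percent(total_used, total_max)))
print("macrocells    %d / %d = %.1f %%" % (macrocells_used, macrocells_max, percent(macrocells_used, macrocells_max)))
print("product terms 476 / 640 = %.1f %%" % percent(476, 640))
```

The macrocell figure (111 of 128) is the tighter constraint than the 80 % headline number, which fits the remark that nothing more could be added.
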
> 4) how long did it take to write the VHDL?
> The initial implementation (~18 months ago) was written in 3 days.
> After reviewing the boolean equations in the report file and
> convincing myself that I expressed myself correctly, I spent 4
> days running simulations. This was essentially a warmup exercise
> with the then-unfamiliar tools prior to doing "serious" work for
> a startup which went nowhere. The design sat in the can for a
> year and I kept wondering whether it was worth realizing, due
> to its obvious limitations and much sexier examples of silicon
> that are out there. The CY37128P84 was chosen because it was the
> biggest gun available that would fit in a socket I could deal
> with, with a view to hand-wiring a prototype for the (unfunded)
> aforementioned startup. I had inventory of a 60 pc. min. qty.
> purchase all dressed up with nowhere to go, so I dusted off
> the design files and convinced myself that it WAS worth taking
> to completion.
> I wrote myself an assembler (in Forth, of course ;) and became
> unhappy with the fact that the NOP, padding that was frequently
> required between the last explicit instruction in a packet and a
> following (packet-aligned) jump target label cost useless clock
> cycles to execute, so I redesigned the instruction sequencer to
> force a fetch cycle under those circumstances. While I was at it,
> I made the master reset synchronous. This took 1 day, followed by
> 4 days of fresh simulator sessions.
> All hell broke loose when I fired up the prototype! Series
> termination on the clocks and high-speed strobes solved some of
> the problems, but some of the test software crashed consistently.
> Persistence and trial and error showed that the zgo, instruction
> failed when a jump was taken in 2 out of 5 slots: inline literals
> were being executed. It took me 2 days of staring at the VHDL code
> for the instruction packet sequencer to figure out what the problem
> was and fix it. I also changed the write strobe timing prior to
> identifying the real problem and have kept it that way since. I
> took a look at my test vectors and confirmed that they never
> generated the scenario which caused the zgo, instruction to fail.
> Oops!
> So in summary it took ~6 days of VHDL design or debugging and ~8
> days of simulation (should have been more in retrospect) to bring
> things to the current happy state of affairs.
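
The gap in the test vectors described above is the kind of thing a coverage checklist can catch before hardware is built. A minimal sketch, assuming each scenario is a (slot, outcome) pair for the zgo, instruction; the slot numbering, the `missing_scenarios` helper, and the coverage log are invented for illustration, and which two slots actually failed is not stated in the post.

```python
from itertools import product

SLOT_COUNT = 5  # from the "2 out of 5 slots" remark; numbering 0..4 is assumed
OUTCOMES = ("taken", "not_taken")

def missing_scenarios(covered):
    """List every (slot, outcome) pair that no test vector exercises."""
    wanted = product(range(SLOT_COUNT), OUTCOMES)
    return [pair for pair in wanted if pair not in covered]

# Hypothetical coverage log, invented for illustration only.
covered = {(slot, "not_taken") for slot in range(SLOT_COUNT)}
covered |= {(0, "taken"), (2, "taken"), (4, "taken")}

gaps = missing_scenarios(covered)
print(gaps)
```
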
> 5) you're using real good old SRAM with it?
> I'm using a pair of 25 nS 32Kx8 SRAMs liberated from the L2 cache
> sockets on obsolete PC motherboards on their way to the dumpster. A15 
> goes directly to the /CE pins of the SRAMs, mapping them to the low 32K 
> cells.
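
The decoding described above amounts to a one-bit test. A sketch, assuming /CE is active low (as the slash prefix suggests) and that address lines A0..A14 select the cell; the function names are hypothetical.

```python
def sram_selected(addr):
    """True when the SRAM pair is enabled: A15 drives the active-low /CE pins."""
    a15 = (addr >> 15) & 1
    return a15 == 0

def sram_cell(addr):
    """Cell index within the 32K x 16 SRAM pair (address lines A0..A14)."""
    return addr & 0x7FFF

for addr in (0x0000, 0x7FFF, 0x8000, 0xFFFF):
    print("%04X selected=%s" % (addr, sram_selected(addr)))
```
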
> 6) what are its bus widths?
> Both the address and data busses are 16 bits. There is no byte
> addressing capability. 2 early (R/W and W/R) and 2 late (/RD and /WR)
> control signals are generated to obviate the need for external
> decoding (and consequent delays) in most foreseeable cases. W/R goes
> to the SRAM /OE pins and /WR goes to the /WE pins. The /WR strobe is
> synchronous to the 40 MHz 2x master clock and pipelined such that it
> pulses low when the 1x clock (derived from the 2x clock) goes low to
> put data and address hold time well clear of the SRAM's 0 nS minimum.
> The /RD strobe is combinatorially generated when the 20 MHz 1x clock
> is low to assure data hold time during read cycles, but this signal
> is not used on the prototype system. All of the internal registers are
> synchronously clocked by the rising edge of the 1x clock.
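
The clock figures above translate into the following time budget. This is a back-of-envelope sketch: it assumes the 1x clock is the 2x clock divided by two with a 50 % duty cycle and that a memory cycle spans one full 1x period; CPLD propagation delays are not given in the post and are left out.

```python
def period_ns(freq_hz):
    """Clock period in nanoseconds."""
    return 1e9 / freq_hz

MASTER_2X_HZ = 40_000_000        # 2x master clock
CPU_1X_HZ = MASTER_2X_HZ // 2    # 1x clock derived from the 2x clock
SRAM_ACCESS_NS = 25.0            # speed grade of the salvaged SRAMs

cycle_ns = period_ns(CPU_1X_HZ)          # one 1x clock cycle
low_phase_ns = period_ns(MASTER_2X_HZ)   # half of a 1x cycle (assumed 50 % duty)
margin_ns = cycle_ns - SRAM_ACCESS_NS    # left over for CPLD delays and setup

print(cycle_ns, low_phase_ns, margin_ns)
```
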
> 7) etc etc
> The 20 MHz clock limit is due to the ripple-carry adder implementation.
> BTW, the first VHDL module I wrote used the syntax "+" to specify the
> behavior of the adder, and that _in_itself_ exceeded the capacity of the
> chip. Apparently, a full lookahead-carry implementation was attempted.
> Later on I noticed that there were radio buttons in the project options
> dialog box: Goal <- area|speed which defaulted to speed. I had already
> explicitly written the ripple-carry solution and haven't tried the "+"
> syntax with area as the goal.
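
For readers unfamiliar with the term, a ripple-carry adder chains one full adder per bit, so the carry has to pass through all 16 stages in the worst case. The model below is a generic textbook sketch in Python, not the VHDL from the design; `full_adder` and `ripple_add16` are hypothetical names.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add16(x, y, cin=0):
    """16-bit add built from chained full adders; returns (sum, carry_out)."""
    result = 0
    carry = cin
    for i in range(16):
        # Each stage waits on the carry from the stage below it.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_add16(0xFFFF, 0x0001))
```

Adding 1 to 0xFFFF is the worst case: the carry generated at bit 0 propagates through every stage, which is the serial delay that limits the clock rate.
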
> Interrupts, wait states, and bus sharing features are not implemented
> due to fitting limitations. A 2nd CY37128 is used on my prototype
> system as an I/O companion chip with 16 bits of parallel input, 16
> bits of parallel output, a free-running 20 MHz 16-bit timer, and a
> register which provides the 2/ function. It is self-decoded near the
> top of memory. A cable from the host PC provides the JTAG interface
> for burning the JEDEC fuse maps into the devices and downloading code
> using the boundary scan registers to drive the target busses. The
> JTAG code downloader is integrated with the assembler.
> The 5V operating current was measured at 440 mA on the prototype system
> with 2x CY37128 (Steamer16 + I/O), 2x 25 nS SRAMs, and a 40 MHz 2x
> clock oscillator module with the system out of reset and running a real
> application. The situation could be improved by invoking the low-power
> options of the CY37128, downgrading the 1x clock to ~10 MHz, and using
> low-power 70 nS SRAMs, which could be battery backed up to retain code
> and data during the powerdown condition.
> The code is not particularly space efficient despite the compact
> instruction encoding due to the heavy incidence of inline literals,
> the primitive instruction set, and NOP, padding statistics.
> The software tools are written in a bastard dialect of Forth-79 that I
> have been using since 1990. The assembler proper is lean and mean.
> Colon definitions in an include file are used to implement macros.
> Because Steamer16 is not a true Forth chip, and in fact lacks call and
> return instructions, it is necessary to use a re-entrant stack frame
> strategy I call ArF (don't ask unless you have a sense of humor) which
> is a hybrid between Forth and C under the hood. A call is implemented
> as a sequence of 7 instructions, a return is 5 instructions, and a read
> or write indexed into the stack frame is 5 instructions. The coding
> style that inevitably results can be considered bad Forth that
> over-uses static variables and the PICK operator, but on the other
> hand relieves the programmer of optimizing the order of operands on
> the stack or resorting to the use of stack reordering operators. ArF
> mandates preservation of the input arguments: the results are
> appended to the list and the calling parent is solely responsible for
> building up or tearing down the stack frame as in C, therefore Forth
> operators like DUP, OVER, and R@ are not required. The on-chip
> evaluation stack is nevertheless used for bursts of Forth-like
> activity until it is appropriate to write an intermediate or final
> result to the stack frame or a static variable. The fine-grained
> subroutine factoring that is one of Forth's major strengths is not
> as attractive as it is with a true Forth chip, but there are some
> compensations and opportunities for optimization that are unique
> to the Steamer16/ArF genre. I have experimented enough with the
> software end of things to discover that agonizing over shaving
> a cycle or two off a subroutine is very seldom worthwhile and
> sometimes, very surprisingly, retrograde. I believe
> this is due to the instruction alignment statistics, which are
> deterministic but difficult to exactly predict as one is writing
> code. This is not to say that the programmer should not be
> performance-conscious, but rather that the straightforward ArF macros
> offer _pretty_good_ runtime performance. The more I use it the more I
> like it, and it beats the snot out of any off-the-shelf CPU that I
> have used so far, except for DSPs.
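
The stack frame convention described above can be modelled roughly as follows. This is a toy illustration of the stated rules only (arguments preserved, results appended, caller builds and tears down the frame); the real ArF macros are not shown in the post, and `FrameStack`, `add_sub`, and `caller` are hypothetical names.

```python
class FrameStack:
    """Toy memory-resident frame stack; index 0 is the most recently pushed cell."""

    def __init__(self):
        self.cells = []

    def push(self, value):
        self.cells.append(value)

    def pick(self, n):
        """Read the cell n places below the top (like Forth's PICK)."""
        return self.cells[-1 - n]

    def store(self, n, value):
        """Write the cell n places below the top."""
        self.cells[-1 - n] = value

    def drop(self, count):
        """Discard the top count cells."""
        del self.cells[len(self.cells) - count:]

def add_sub(frame):
    # Callee: frame layout from the top is diff, sum, b, a.
    a = frame.pick(3)
    b = frame.pick(2)
    frame.store(1, a + b)   # results go into the appended slots
    frame.store(0, a - b)   # the arguments themselves are left untouched

def caller():
    frame = FrameStack()
    frame.push(10)          # argument a
    frame.push(3)           # argument b
    frame.push(0)           # result slot: sum
    frame.push(0)           # result slot: diff
    add_sub(frame)
    total, diff = frame.pick(1), frame.pick(0)
    args_intact = (frame.pick(3), frame.pick(2)) == (10, 3)
    frame.drop(4)           # the caller alone tears the frame down
    return total, diff, args_intact, len(frame.cells)

print(caller())
```
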


I'd heard a prominent Forther was thinking of doing a one-stack Forth, and
I thought he was BSing me, but if the one stack was the parameter stack
that would explain it. Your instruction counts for call/return are about
proportional to how long one near call instruction actually takes on a 386
anyway, since it's at some level doing the same things. 

CLD DD Clear Direction Flag

Opcode    Instruction     Clocks   Description

FC        CLD             2        Clear direction flag; SI and DI
                                   will increment during string

E8  cw    CALL rel16      7+m      Call near, displacement relative
                                   to next instruction
                          ^^^ best-case timing

Linux has 60% more calls than returns. That's probably a LOT of functions
that are in fact inline code that's being called. The subroutine threaded
H3sm has 3 times as many calls as returns though, since all thread words
are mostly calls. If I did hardware I'd be looking into what I was talking
to Jan Coombs about: almost always doing a call on each instruction
fetch. Then you can maybe get the return stack activity in parallel.

Rick Hohensee

> The response to my postings has been enough to make me decide to get
> off my butt and finally get a personal website together. I hope to
> have presentable documentation organized within 2 weeks. VHDL source
> for the CPU and I/O chips and Forth for the assembler and JTAG code
> downloader will be available. I hope to include schematics for the
> prototype system, if the scanner I have access to will co-operate.
> Myron Plichota
> ------------------------
> To Unsubscribe from this list, send mail to Mdaemon@xxxxxxxxxxxxxxxxxx with:
> unsubscribe NOSC
> as the first and only line within the message body
> Problems   -   List-Admin@xxxxxxxxxxxxxxxxxx
> Main 4th site   -   http://www.

