Software Development Paradigm Trap main page

Further Reading

Feedback: Software Development Paradigm Trap

Bandit - 8/4/2006


I read your initial article, and all of the feedback on your webpage.

I am an embedded systems engineer with 30 years' experience. I cut my eye-teeth on both the 8080 and mainframes. (The first system I used was at age 8: a GE 225 running timeshared BASIC. My father was on the OS team at GE.)

The multi-uP approach is very appealing, and you point out the issue of debugging. This is a critical aspect of SoC - will we be able to build the tools and processes that let us debug the darn thing?

The one (two?) piece of real negative feedback you have received on this is power/cost. Power has many of the same factors as cost.

I must admit cost is important for consumer items. However, there is a balance between NRE, development time, and manufacturing cost. Most things for the consumer market have a natural upper limit on complexity, simply because most consumer items are not that complex compared to industry projects, including the big embedded stuff like telecom and avionics. The little stuff - toys, remote controls, hotel door knobs - is very price-sensitive, but limited in complexity. Probably the most common consumer embedded device is the cell phone. One modern cell phone has more compute power than all of the computers - world-wide - in 1970.

As Jack Ganssle points out, if you reduce the NRE by a factor of 2, only half the NRE overhead ends up on the final product.

Let me use another proposal of mine that I have used in the past to develop boards of reasonable complexity. The original idea is not mine; I saw this fairly early in my career.

The problem with developing a board these days is the size and packages of the components. Attach a logic probe to a BGA package? Right. You first. And the boards are always built to the final size. You have to solder little wires onto little vias and pads. Makes me want to shout and carry on.

The proposal is this: the development board can be as big as you want. If the final size is a 3U cPCI board, there is no technical reason the development board cannot be 2 feet by 2 feet. (Yes, there are cases where signal path length is important, but those should be fairly easy to isolate.) You can put all components on the top side. You can put 0.1-inch headers on all signal paths. Add any hook you can imagine to aid development. Cable out to the bus interface.

(I normally propose going even further: do the first layout and cut of the development board, then start the final layout to the point of routing all paths, to get an idea of how the final route will go. As development requires hardware changes, update the schematic. When you think all of the hardware development is done, lay out and cut rev 2 of the development board and test everything. Then, when you are happy with all of the hardware and software, finish the layout and cut the final board. If any problems do occur at this stage, you have a good handle on them. Rev 1 gets put on the wall as a trophy, and rev 2 is carefully packed away.)

Now - how does this apply to the problem at hand?

First, the major cost of the NRE is human time. Us hairy nerds sitting down and scratching ourselves. The cost of my proposed boards is a day or two of our grandiose salaries. Maybe a week if we really go hawg wild on the things. We can easily burn a week or two on a medium-hard problem - we've all been there.

Second: assume you can obtain the uPs you are going to use as both chips and IP. This is not that hard - ARM and 8051 are both available this way, as are many other cores. If only the IP is available, put the thing in an FPGA and make it look like a chip. Maybe just start with a bunch of bigger-than-needed FPGAs.

Make the first development board. Put all of the headers, test points, etc., on it. Have a button that orders pizza and Chinese. You do not care about power - you have a power plant behind you. If you have N developers, make 2*N boards (some will fail). You don't need to populate all of the chips on all the boards - just enough on most of them to let each developer do his/her thing.

Everybody gets to the point where integration is the Next Big Step. Choose two modules that mumble with each other (either peers or master/slave). Have at each other. If the general architecture is a pipeline, start at the head. If it is master/slave with the master being the bigger chip (i.e., the ARM in this example), prepopulate a set of the development boards with a single ARM and a single 8051. They will be used in both the 8051-only phase and the integration phase.

Lather, rinse, repeat. Stop when everything is working correctly (a big word!).

You now have a working set. In most cases it is worth one more hardware spin, for paranoia's sake. You have your development platform.

Now - all of the cores are IP. If you were clever up front, the interfaces between them should be simple and clean, so you now shove everything into an FPGA. (Maybe two FPGAs, the master in one and all the slaves in the other.) You should probably use the same integration sequence as before. However, with JTAG and "test points" on unused/dedicated FPGA pins, you can see what is going on. Tektronix has a pretty good method of muxing test pins on an FPGA.

Of course, various issues - like each core's clock, and how the outside world is accounted for and controlled - need to be thought about up front.

BTW - if all of the code will fit in an internal RAM or flash, there is no need for a cache. But that is a more complex discussion for later.

Take care... bandit

Jack Ganssle - 8/4/2006


Wow - you're making a radical proposal. You're saying we should build a (shudder) prototype!

Alas, it seems everyone wants the final version as the first step. My observation is that this initial final version then gets re-spun... and re-re-spun... and re-re-re-spun.

There are no software prototypes; if the darn thing works at all, it gets shipped. Now that dysfunctional approach is getting into the hardware zeitgeist, too. And that's a shame.

As always, good to hear from you.

All the best,

Bandit - 8/4/2006

Well - yeah - I guess it is radical.

But - the problem with SoC is precisely the problem that a prototype is intended to solve. The tools for SoC are just not here yet at the maturity needed to properly do the job. We also need to re-think the mechanisms we use on-chip to see what is going on in the wee beastie. JTAG is a Good Thing, but I suspect (gut feeling, no numbers backing me up) that JTAG chains are not going to be up to the task as more stuff gets crammed into/onto the silicon.

The trick, for me, is getting a client to do a prototype first, then being able to show the cost savings in time/NRE. I had one really good HW manager at a client, and he bought into it. Of course, he knew the value.

Of course, you also run into the situation where the first board needs mods but they don't respin it, and all the units go out with 20..30 blue wires. Fortunately, I work in the avionics and telecom markets, where that sort of thing is frowned upon.


Take care... bandit

Mark Bereit - 8/4/2006

Bandit, Jack,

Thanks for some lively back-and-forth on this topic!

Your observations about testable prototypes are good, and call to mind several examples good and bad in my own experience. But I would take this a step further: I already believe that in some hypothetical lots-of-modules system with good partitioning, the present model of code debugging isn't gonna fly. We're already there: the model of single-stepping through a sequence of events fails spectacularly in a multi-threaded system, and that's all happening in just a single core!

I think the debugging approaches will need to come down to two things: individual objects (components, whatever term you want to use) being heavily verifiable in isolated test-bench situations (finally yielding to XP approaches), and an ability to "probe" an interconnected system. When my digital hardware design isn't running I don't try to single-step the master clock; rather, I attach probes to points of flow and try to capture clues in a logic state analyzer or capture scope. The ability to do this to computational models needs to be maintained even when the components share the same chip. Just as today's MCUs routinely devote a chunk of transistors to in-system debugging, so tomorrow's multi-core FPGAs and such will need to devote a chunk of transistors to real-time data flow capture. (Not just probing internal pins in a frozen system; this is simply the next step.)

This is just a quick summary of a thought process that's chewing a lot of my personal compute cycles lately. (Now that I've opened my big mouth about exploring new paradigms I feel challenged to keep exploring myself.) But I thank you both for throwing more thoughts into the mix.

Jack Ganssle - 8/4/2006


Another aspect of this I've been pondering is the technology of hierarchical state charts, a la Quantum Leaps. Somehow - not quite sure - it sure seems like that could map into the many CPUs idea.

All the best,

Bandit - 8/4/2006

I think we are in violent agreement (it's this way!! Yes, it's this way!!) about the need to: prototype, do it right, verify it, release it.

And yes, there are lots of problems that occur when you have threads. One of the implications of this whole discussion is to put either one thread per CPU or only a couple of threads per CPU. If there is more than one thread per CPU, they should be either compatible (not getting in each other's way) or complementary (sharing the load).

I can think of a complementary set of threads for handling devices: one just deals with the hardware (part of the thread is an ISR, or treat the ISR as a thread), and the other thread(s) parse out and cook the data. For example, in a network protocol handler, the ISR gets the packet from the device; the next layer/thread handles the simple stuff like ping and verifying the packet format (thus eliminating buffer overflows, mangled packets, etc.); and the third level/thread handles the data extraction. The third level only sees clean packets. It accepts data from two sources: the second level, or the master (for output).

The need to debug is obviously reduced when the "module/object" is small. Also - when created in isolation, it's easier to verify.

I do a lot of device drivers. I will get one working, then create a test that just schleps bits. I will run it for days looking for a failure, testing various parameters (different baud rates, packet sizes, etc.). My device drivers work because I do this.

We need to create test systems where the pieces can interact, and we can watch the interactions. The SoC stuff needs to have better debug stuff built-in to the hardware.

Your point about single-step vs. "let-er-rip" is very valid. I did a voltage-controller device driver for a flash tester. I put all sorts of debugging into it, at many levels of "slime." All slime went into a file, under program control (it ran under NT, controlling a big chunk of hardware we would feed via a cable). The simplest was a call to turn on a parameter dump, to the file, of the inputs and outputs of the API functions.

I got a lot of grief from some of the other programmers. (They were more "pure" software guys.) They would single-step through to "verify" their code, and that was good enough for them.

However, the Apps folks were jumping for joy because they could run a test and see where the problem was. The API calls were not consistent (not my fault!) and the slot/pin arguments were mostly (slot, pin) but there were a couple that were (pin, slot) and it was easy to forget the special cases.

There were other cases where the machine would not act properly, and they could tell that because the return values were wrong.

The API call I created would take a bitmask that I used to determine whether to send slime to the file, and at what level of verbosity. This gave them control over the detail level they needed to create a test or other application.

I saw a demo at the West Coast embedded show, by TI (HP?). They have some soft IP they give away (as I recollect) with their stuff that creates a big mux, so a lot of test points can funnel through a few pins. I forget some of the details, but it allows code running on the FPGA to select the "input" to the mux and where/when it comes out.

I would think a few dedicated FIFOs would be a good start. The FIFOs would use a couple of the MS bits to indicate who is sending the rest of the bits, which are then interpreted by knowing the sender.

I also think we should be looking at the toolsets - i.e., the C-to-hardware compilers - to add in the kind of debugging we need. There should be keywords/macros that would be used in a stand-alone execution (i.e., on a PC) but then converted to a call or other hardware interface, so we can see things and the order they occur in. This is critical with threads and shared resources.

My proposal for the hardware prototypes gives a lot of exposure to the various interfaces, which are always one of the sources for bugs.

This is a good thread :^)

... bandit