Ever-smaller mobile devices demand ever more highly integrated, lower-power implementations. We have been working on building blocks for system-on-chip (ASIC) and system-on-programmable-chip (FPGA) designs. Specifically, we have been considering a low-power, highly configurable, full-featured embedded processor that we have named JCN.
FPGAs are particularly useful for prototyping and small-scale production and are widely used in our lab. But FPGAs are not inherently low-power devices, making power efficiency an even greater concern. Our low-power work has focused on technology-independent techniques that are equally applicable to FPGA and ASIC implementations, and that can complement existing power-saving strategies.
Despite claims to the contrary, very few available processor designs successfully target both FPGA and ASIC implementations; the FPGA implementation is typically either prohibitively resource-hungry or inefficient. The Xilinx implementation of JCN is compact (1500 Xilinx LUTs, or ~30K ASIC gates) and efficient (25MHz without optimisations on a <$25 FPGA device). It is also technology independent and has comprehensive tool support with a full ANSI C/C++ compiler.
The JCN processor is a fairly standard 32-bit RISC design. It offers a four-stage pipeline with single-cycle execution of instructions (apart from taken branches or reading/writing memory, which take two cycles). When a branch is taken, the instruction sequentially following the branch is not executed. There is support for hardware interrupts and software traps, along with a low-power sleep mode.
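The timing rule above can be illustrated with a rough cycle estimate. This is only a sketch: the instruction kinds are hypothetical labels rather than JCN mnemonics, and it assumes one cycle per instruction, two for memory accesses and taken branches, with pipeline fill and interrupts ignored.

```python
# Hypothetical instruction classes; the names are illustrative, not JCN mnemonics.
TWO_CYCLE_KINDS = {"load", "store", "taken_branch"}


def estimate_cycles(trace):
    """Estimate the cycle count for a list of instruction kinds.

    Assumes one cycle per instruction, two for memory reads/writes and
    taken branches, and ignores pipeline fill and interrupts.
    """
    return sum(2 if kind in TWO_CYCLE_KINDS else 1 for kind in trace)


trace = ["alu", "load", "alu", "untaken_branch", "store", "taken_branch"]
print(estimate_cycles(trace))  # prints 9
```

A model like this is only useful for back-of-the-envelope estimates; the instruction set simulator is the appropriate tool for real measurements.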
There are three basic instruction formats for JCN:
Many aspects of the processor can be easily configured. For example, the number of registers can be reduced to 16 from the standard 32. Similarly, the number of hardware interrupt sources may also be changed.
In addition, even the instruction set is described in XML, which allows it to be extended by a designer if the application would benefit from exotic instructions.
The tool chain is based around the standard GNU tools. The standard tools consist of:
- C/C++ Compiler
- Linker (ELF format)
- Download and programming utilities
- Runtime environment (including monitor)
- ANSI standard C library
- Instruction set simulator/debugger
- Profiling/code coverage tools
The compiler has been optimised for the JCN instruction set. In addition to the usual features of GCC, it offers a number of extensions to support the use of JCN in an embedded environment.
The HDL, standard peripherals and compiler form the 'core' of the JCN processor. In addition, we have been working on a number of extensions:
Configuration of the processor is performed using an XML specification. Items that can be modified include the number of registers and the number of hardware interrupts.
The XML description is also used to specify the target hardware device, and specifies which peripherals and optional extensions should be included in the generated hardware.
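As an illustration of what such a configuration might look like, the sketch below reads a small XML description with Python's standard library. The element and attribute names, the device name and the peripheral names are all assumptions made for this example; the actual JCN schema is not shown on these pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file; every element and attribute name here
# is an assumption, not the real JCN schema.
CONFIG_XML = """
<processor name="jcn">
  <registers count="16"/>
  <interrupts count="4"/>
  <target device="example-fpga"/>
  <peripheral name="uart"/>
  <peripheral name="timer"/>
  <extension name="debug"/>
</processor>
"""


def load_config(xml_text):
    """Parse the hypothetical configuration into a plain dictionary."""
    root = ET.fromstring(xml_text)
    registers = int(root.find("registers").get("count"))
    # The text mentions 32 registers as standard, reducible to 16.
    if registers not in (16, 32):
        raise ValueError("register count must be 16 or 32")
    return {
        "registers": registers,
        "interrupts": int(root.find("interrupts").get("count")),
        "device": root.find("target").get("device"),
        "peripherals": [p.get("name") for p in root.findall("peripheral")],
        "extensions": [e.get("name") for e in root.findall("extension")],
    }


config = load_config(CONFIG_XML)
print(config)
```

A generator script could then use a dictionary like this to decide which HDL modules to instantiate.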
- Extending the Instruction Set:
The processor instruction set has been designed to efficiently target C and C++. It is not based on any existing instruction sets, so is free of any licensing issues.
The instruction set is also specified in XML. From this XML file all of the control files to drive the tools are generated, along with the documentation describing the instruction set. This allows the instruction set to be easily extended to add application-specific instructions.
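The idea of deriving both tool tables and documentation from one XML source can be sketched as follows. The XML layout, the opcode values and the `popcnt` instruction are hypothetical and chosen only for illustration; they do not describe the real JCN instruction set.

```python
import xml.etree.ElementTree as ET

# Hypothetical instruction-set description; names and opcodes are invented.
ISA_XML = """
<isa>
  <instruction mnemonic="add" opcode="0x01" summary="Add two registers"/>
  <instruction mnemonic="sub" opcode="0x02" summary="Subtract two registers"/>
  <instruction mnemonic="popcnt" opcode="0x3f" summary="Application-specific: count set bits"/>
</isa>
"""


def parse_isa(xml_text):
    """Read every <instruction> element into a dictionary."""
    return [
        {
            "mnemonic": node.get("mnemonic"),
            "opcode": int(node.get("opcode"), 16),
            "summary": node.get("summary"),
        }
        for node in ET.fromstring(xml_text).findall("instruction")
    ]


def opcode_table(instructions):
    """Build an opcode-to-mnemonic table, rejecting duplicate opcodes."""
    table = {}
    for ins in instructions:
        if ins["opcode"] in table:
            raise ValueError(f"duplicate opcode 0x{ins['opcode']:02x}")
        table[ins["opcode"]] = ins["mnemonic"]
    return table


def documentation(instructions):
    """Render one plain-text documentation line per instruction."""
    return "\n".join(
        f"{ins['mnemonic']:<8}0x{ins['opcode']:02x}  {ins['summary']}"
        for ins in instructions
    )


instructions = parse_isa(ISA_XML)
print(opcode_table(instructions))
print(documentation(instructions))
```

Adding an application-specific instruction then amounts to adding one element to the XML and regenerating the outputs.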
- 16-bit Instructions:
When designing the instruction set, we reserved bit #0 to indicate 'extensions' to the basic instruction set. These could be user-defined instructions, but we have also used this for an experimental 16-bit instruction set:
Bit #0 indicates whether each word contains a single 32-bit instruction, or two 16-bit instructions. At the expense of slightly more complicated instruction fetch hardware, we are also investigating allowing non-aligned 32-bit instructions, which would allow greater code density.
All 16-bit instructions have a direct mapping to a 32-bit instruction, so the translation block simply extends the bit pattern, filling in any blanks. This makes it very efficient to implement.
Work on finalising the 16-bit instruction subset is ongoing, but preliminary results suggest we achieve around a 30% code size reduction.
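The fetch-and-expand step described above might be sketched like this. The encodings are entirely hypothetical: the sketch assumes that a set bit #0 marks a packed word, that the upper half-word is the first instruction, and that the 16-bit and 32-bit field layouts are as given in the comments. None of these details come from the JCN specification.

```python
PACKED_FLAG = 0x1  # assumption: bit 0 set means "two 16-bit instructions"


def expand16(half):
    """Widen one hypothetical 16-bit instruction to a hypothetical 32-bit one.

    Assumed 16-bit layout: [15:12] opcode, [11:8] rd, [7:4] rs, [3:1] imm.
    Assumed 32-bit layout: [31:26] opcode, [25:21] rd, [20:16] rs, [15:1] imm.
    """
    opcode = (half >> 12) & 0xF
    rd = (half >> 8) & 0xF
    rs = (half >> 4) & 0xF
    imm = (half >> 1) & 0x7
    # Each field is copied into its wider slot; the blanks stay zero and
    # bit 0 of the result is clear, marking a plain 32-bit instruction.
    return (opcode << 26) | (rd << 21) | (rs << 16) | (imm << 1)


def decode_word(word):
    """Return the 32-bit instruction(s) held in one fetched word."""
    if word & PACKED_FLAG:
        first = (word >> 16) & 0xFFFF   # assumption: upper half runs first
        second = word & 0xFFFF
        return [expand16(first), expand16(second)]
    return [word]


print([hex(i) for i in decode_word(0x12343211)])
print([hex(i) for i in decode_word(0x12343210)])
```

Because the expansion is pure bit routing with no table lookup, a hardware equivalent would need only wiring and a multiplexer, which is consistent with the efficiency claim in the text.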
- In-circuit Debug:
One of the optional components that can be added via the XML configuration is a collection of in-circuit debug features. Each of the features can be enabled or disabled, as required by a design, allowing a trade-off between the size of the resulting design and the debug facilities available.
The options include:
- Breakpoints (number is user configurable)
- Single stepping
- Register read/write
- Memory read/write
- User-defined extensions
The register and memory access can be carried out while the processor is executing user code. Also, the debug interface does not significantly reduce the overall performance of the core processor.
By default, all debug operations are carried out over the cheap, industry-standard JTAG interface to the FPGA. For higher performance, parallel interfaces are also possible.
- Power Optimising Compiler:
Our power saving technology is based around the observation that the power used by a CMOS transistor is related to the frequency with which it changes and the load it must drive. Signals with the largest load include various system buses, memory address and data buses, and any external buses.
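The observation above matches the usual first-order model of dynamic CMOS power, in which power is proportional to switching activity, load capacitance, the square of the supply voltage and the clock frequency. The sketch below measures only the activity term, by counting how many bus lines change between consecutive words. The function names and the example word sequence are made up for illustration.

```python
def transitions(prev, curr, width=32):
    """Count the bus lines that change between two consecutive words."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def bus_activity(words, width=32):
    """Total number of bit transitions over a sequence of bus words."""
    return sum(transitions(a, b, width) for a, b in zip(words, words[1:]))


# Made-up sequence of words appearing on a 32-bit bus.
words = [0x00000000, 0xFFFFFFFF, 0xFFFF0000, 0xFFFF0001]
print(bus_activity(words))  # prints 49
```

Comparing this count for two encodings of the same program gives a technology-independent estimate of relative switching activity on the bus.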
We have developed a number of patent-pending techniques to reduce the switching activity on these system buses. These techniques include a novel number representation for immediate data values, rescheduling of instructions, and allocation of redundant instruction bits to minimise transitions.
Our results show that these modifications can result in a 45% reduction in the switching activity of the instruction fetch bus. When applied to our hardware prototyping platform consisting of an FPGA connected to a Flash-based memory, this resulted in a 20% reduction in the power consumed by the system. This saving was without any loss in performance.
The hardware modifications required to support our optimisations are minimal, and the techniques are widely applicable to a variety of DSP and RISC processors. As they are applied to a design before it is mapped to an implementation technology, they are complementary to existing techniques such as clock-gating, voltage scaling and so on.
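To give a flavour of one of the ideas mentioned above, allocating redundant instruction bits to minimise transitions, the sketch below copies the previous word's values into any bits the current instruction does not use. This is a simple illustration under stated assumptions, not the patent-pending method itself: the masks of unused bits, the example words and the assumption that the bus idles at zero are all hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")


def fill_unused_bits(prev_word, word, unused_mask, width=32):
    """Copy prev_word's bits into the positions that `word` does not use."""
    mask = (1 << width) - 1
    used = word & ~unused_mask & mask
    return used | (prev_word & unused_mask & mask)


def encode_stream(words, unused_masks):
    """Re-encode a stream so unused bits follow the previously fetched word."""
    out = []
    prev = 0  # assumption: the bus idles at zero before the first fetch
    for word, unused in zip(words, unused_masks):
        prev = fill_unused_bits(prev, word, unused)
        out.append(prev)
    return out


# Hypothetical stream: the second instruction ignores its low 16 bits.
words = [0xAB00FFFF, 0xCD000000]
unused_masks = [0x00000000, 0x0000FFFF]
encoded = encode_stream(words, unused_masks)

print(hamming(words[0], words[1]))      # prints 20
print(hamming(encoded[0], encoded[1]))  # prints 4
```

Since the filled bits are ignored by the decoder, the program's behaviour is unchanged while the fetch bus toggles less often.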
We have also produced variations of the processor that will operate without an external clock source, and a reduced-power DMA bus specification.
If you would like any further information not covered by these pages, then please contact us.
Paul Webster | Phil Endecott | Alan Mycroft
Summer Intern (2001)
These pages were last modified 7/1/02.
Please note that some of the techniques disclosed here have been the subject of patent applications.