7. DSP Processors and Modern Alternatives // Digital Signal Processing // bhaswanth

A digital filter, an FFT, or a multirate cascade is just code, and it will run on any computer. But computers differ hugely in how efficiently they execute DSP workloads. This section surveys the architectures specialized for DSP and the ways they have evolved.

7.1 What makes a processor DSP-friendly

Three things:

Single-cycle multiply-accumulate (MAC). A digital filter is dominated by acc += coef * sample operations. A DSP processor has a hardware multiplier with an accumulator output, so each MAC takes one cycle (regardless of operand size, within the chip's data width).
Multiple-access memory. A MAC needs two operands per cycle (coefficient and sample) plus an instruction fetch. A classic von Neumann machine with a single memory bus would need three memory cycles for this. DSPs use a modified Harvard architecture: separate program memory (for instructions) and one or two data memories (for operands), so all three fetches happen in one cycle.
Specialized addressing modes and loop hardware:
- Modulo (circular) addressing for delay lines: a pointer wraps automatically when it hits a buffer boundary. Lets a delay line live in a fixed memory region without per-sample bounds checking.
- Bit-reversed addressing for FFTs: the natural-order address gets bit-reversed in hardware to access the right input sample. Lets DIT FFTs read input samples in bit-reversed order without explicit permutation.
- Zero-overhead loops: a hardware loop counter that decrements and branches with no instruction overhead. Lets innermost FIR or FFT loops run at one instruction per tap (one MAC per cycle), with no cycles spent on loop control.
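
The features above can be mimicked in software to see what the hardware is saving. The following is a minimal sketch, not a model of any particular chip: `CircularFir` and `bit_reverse` are hypothetical names chosen for illustration. The `%` operations stand in for modulo addressing, the inner `for` loop stands in for a zero-overhead loop with one MAC per tap, and `bit_reverse` computes the index that a bit-reversed addressing unit would generate.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index.

    A DSP with bit-reversed addressing does this in the address unit;
    here it is done explicitly, one bit at a time.
    """
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


class CircularFir:
    """FIR filter whose delay line lives in a fixed-size circular buffer."""

    def __init__(self, coefs):
        self.coefs = list(coefs)
        self.delay = [0.0] * len(self.coefs)  # fixed memory region
        self.head = 0                          # index of the newest sample

    def step(self, sample):
        """Push one input sample and return one output sample."""
        n = len(self.coefs)
        # Modulo addressing: the write pointer wraps; no data is shifted.
        self.head = (self.head + 1) % n
        self.delay[self.head] = sample

        acc = 0.0
        idx = self.head
        for coef in self.coefs:            # zero-overhead loop on a DSP
            acc += coef * self.delay[idx]  # one MAC per tap
            idx = (idx - 1) % n            # step back to the next-older sample
        return acc


if __name__ == "__main__":
    fir = CircularFir([1.0, 2.0, 3.0])
    # An impulse input reproduces the coefficients at the output.
    print([fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]])
    # Input order a hardware unit would generate for an 8-point DIT FFT.
    print([bit_reverse(i, 3) for i in range(8)])
```

On a general-purpose CPU each `%` and each loop iteration costs instructions; on a DSP the address wrap, the pointer update, and the loop branch all happen alongside the MAC.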

7.2 The TMS320 family (TI)

Texas Instruments' TMS320 line is the canonical DSP family. Generations:

TMS320C54x (1990s). 16-bit fixed-point. 40-bit accumulators, 17×17-bit multiplier, single-cycle MAC. Used in cellphones, modems, and consumer audio equipment. Parts in this family are still in production for legacy designs.
TMS320C6x. Floating-point and fixed-point variants. VLIW (Very Long Instruction Word) architecture: 8 functional units, up to 8 operations per cycle. Used in 3G/4G base stations, communications infrastructure, defense radar.
TMS320C66x. Multi-core VLIW. Used in 4G/5G base stations and medical imaging.
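
Why a 40-bit accumulator on a 16-bit machine? A product of two 16-bit values needs up to 32 bits, so the extra 8 bits act as guard bits that let many products be summed before the register overflows. The sketch below illustrates this with plain integers; `wrap_signed` and `mac_sum` are hypothetical helper names, and the model assumes wrap-around on overflow rather than the saturation modes real parts also offer.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement register of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # If the sign bit is set, reinterpret the pattern as a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_sum(coefs, samples, acc_bits):
    """Accumulate integer products in a register acc_bits wide (no saturation)."""
    acc = 0
    for c, x in zip(coefs, samples):
        acc = wrap_signed(acc + c * x, acc_bits)
    return acc


# 64 taps, every coefficient and sample at the largest positive 16-bit value.
taps = 64
coefs = [32767] * taps
samples = [32767] * taps
exact = sum(c * x for c, x in zip(coefs, samples))

acc40 = mac_sum(coefs, samples, 40)  # guard bits absorb the growth
acc32 = mac_sum(coefs, samples, 32)  # wraps around and gives a wrong result

if __name__ == "__main__":
    print("exact:", exact)
    print("40-bit accumulator correct:", acc40 == exact)
    print("32-bit accumulator correct:", acc32 == exact)
```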

Analog Devices made the ADSP-21xx (16-bit fixed-point) and ADSP-2106x SHARC (32-bit floating-point) lines, aimed at the same kinds of applications. SHARC chips are widely used in professional audio (high-end mixing consoles, audio effects processors).

7.3 Single-instruction multiple-data (SIMD) on general-purpose CPUs

For a long time, DSP chips and general CPUs were separate. Then SIMD changed the picture. A SIMD instruction operates on multiple data elements in parallel, exploiting the wide internal datapaths in modern CPUs.

Intel SSE/SSE2/AVX/AVX-512. 128-bit (SSE), 256-bit (AVX), and 512-bit (AVX-512) registers. A single AVX-512 fused multiply-add instruction performs 16 single-precision multiply-adds (512 / 32 lanes), and every core has its own vector units.
ARM NEON. 128-bit SIMD on ARM cores. Standard in every smartphone. Used for audio codecs, video decoding, image processing.
ARM SVE. Scalable Vector Extension: vector-length-agnostic SIMD in which each implementation picks a vector length from 128 to 2048 bits. Found in server-class ARM cores.
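
To make the lane arithmetic concrete, here is a small sketch that emulates a SIMD multiply-add in plain Python. It is only a model: Python lists stand in for vector registers, nothing here executes real SIMD instructions, and `lanes` and `simd_dot` are hypothetical names. The point is the instruction count, which drops by roughly the lane width compared with a scalar loop.

```python
def lanes(register_bits, element_bits=32):
    """Number of elements of the given size that fit in one SIMD register."""
    return register_bits // element_bits


def simd_dot(a, b, width):
    """Dot product computed `width` lanes at a time.

    Returns (result, number of emulated multiply-add instructions).
    """
    assert len(a) == len(b)
    acc = [0.0] * width                  # one partial sum per lane
    full = len(a) - len(a) % width       # portion that fills whole registers
    instructions = 0
    for i in range(0, full, width):
        # One vector multiply-add: every lane updates in the same step.
        acc = [s + x * y for s, x, y in zip(acc, a[i:i + width], b[i:i + width])]
        instructions += 1
    total = sum(acc)                     # horizontal add across the lanes
    for x, y in zip(a[full:], b[full:]): # scalar tail for leftover elements
        total += x * y
        instructions += 1
    return total, instructions


if __name__ == "__main__":
    a = [float(i) for i in range(40)]
    b = [1.0] * 40
    for bits in (128, 256, 512):
        width = lanes(bits)
        print(bits, "bits:", width, "lanes ->", simd_dot(a, b, width))
```

A scalar loop over the same 40-element vectors would need 40 multiply-adds; with 16 lanes the emulation needs 2 vector steps plus 8 scalar steps for the tail.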

A modern Cortex-A core (as found in any flagship phone) can run a real-time multi-band audio EQ, a multi-channel speaker crossover, and dynamic range compression together while using a fraction of a percent of its CPU time. A decade ago this needed a dedicated DSP chip; today it is software on a SIMD-equipped CPU.
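
A back-of-the-envelope estimate shows where a figure like "a fraction of a percent" can come from. Every number below is an assumption chosen for illustration, not a measurement: a 10-band stereo EQ built from one biquad per band per channel, 5 multiply-adds per biquad per sample, a 48 kHz sample rate, and a 2 GHz core sustaining one 4-lane vector multiply-add per cycle. Real code will not reach that peak (biquad feedback limits vectorization, and memory traffic costs cycles), so treat the result as an optimistic bound.

```python
def macs_per_second(bands, channels, sample_rate_hz, macs_per_biquad=5):
    """Multiply-adds per second for a bank of biquad filters."""
    return bands * channels * macs_per_biquad * sample_rate_hz


def cpu_load_percent(required_macs, clock_hz, lanes, macs_per_lane_per_cycle=1):
    """Share of peak MAC throughput consumed, in percent."""
    peak = clock_hz * lanes * macs_per_lane_per_cycle
    return 100.0 * required_macs / peak


# Assumed workload and CPU figures (illustrative only).
required = macs_per_second(bands=10, channels=2, sample_rate_hz=48_000)
load = cpu_load_percent(required, clock_hz=2_000_000_000, lanes=4)

if __name__ == "__main__":
    print(f"required: {required / 1e6:.1f} million MAC/s")
    print(f"load at peak throughput: {load:.3f} %")
```

Even if sustained throughput were ten times below the assumed peak, the load under these assumptions would still be well under one percent.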

7.4 GPUs

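GPUs are massively parallel machines with thousands of arithmetic lanes ("cores") that execute in SIMD fashion. For DSP problems with high parallelism (image processing, large FFTs, deep neural networks), they are hard to beat. Peak single-precision throughput on modern GPUs is in the tens of teraflops, and libraries such as cuFFT (NVIDIA) put it to work on FFTs, although large FFTs are typically limited by memory bandwidth rather than by arithmetic.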

GPUs win when there is enough parallel work to fill thousands of cores. They lose for low-latency single-channel work because kernel-launch overhead and host-to-device transfers add a fixed cost to every call.
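
The trade-off can be captured in a toy cost model: the GPU pays a fixed overhead per call and then processes work at a high rate, while the CPU starts immediately but runs slower. The rates and the 10 microsecond overhead below are assumed round numbers for illustration, and the function names are hypothetical; the model ignores memory bandwidth and occupancy effects.

```python
def cpu_time(ops, cpu_rate):
    """Seconds for the CPU to do `ops` operations at `cpu_rate` ops/s."""
    return ops / cpu_rate


def gpu_time(ops, gpu_rate, overhead_s):
    """Seconds for the GPU: fixed per-call overhead plus compute time."""
    return overhead_s + ops / gpu_rate


def break_even_ops(cpu_rate, gpu_rate, overhead_s):
    """Work size at which CPU and GPU take the same time."""
    return overhead_s / (1.0 / cpu_rate - 1.0 / gpu_rate)


if __name__ == "__main__":
    CPU_RATE = 1e10      # assumed: 10 Gop/s on the CPU
    GPU_RATE = 1e13      # assumed: 10 Top/s on the GPU
    OVERHEAD = 10e-6     # assumed: 10 microseconds fixed cost per call
    n = break_even_ops(CPU_RATE, GPU_RATE, OVERHEAD)
    print(f"break-even at about {n:,.0f} operations per call")
    for ops in (1e3, 1e5, 1e9):
        print(f"{ops:.0e} ops: CPU {cpu_time(ops, CPU_RATE):.2e} s, "
              f"GPU {gpu_time(ops, GPU_RATE, OVERHEAD):.2e} s")
```

Below the break-even size the fixed cost dominates and the CPU finishes first; far above it the GPU's throughput advantage takes over.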

7.5 FPGAs

For ultra-low-latency or ultra-high-throughput DSP, an FPGA holds the math directly in custom logic. Each multiply-and-add can be a configured DSP slice (Xilinx DSP48, Intel/Altera DSP block) that runs at 500+ MHz. A pipelined FFT processor on an FPGA can produce one output bin per clock indefinitely, with bounded latency.
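
"One output bin per clock" translates directly into throughput figures. The sketch below works the arithmetic for a streaming FFT core, assuming one sample in and one bin out per clock cycle; the 500 MHz clock and 1024-point size are example values, and the function names are hypothetical.

```python
def frame_period_us(fft_size, clock_hz):
    """Microseconds needed to stream one FFT frame at one sample per clock."""
    return fft_size / clock_hz * 1e6


def frames_per_second(fft_size, clock_hz):
    """Back-to-back FFT frames completed per second."""
    return clock_hz / fft_size


if __name__ == "__main__":
    clock_hz = 500_000_000   # assumed DSP-slice clock
    fft_size = 1024          # assumed transform length
    print(f"sample rate: {clock_hz / 1e6:.0f} Msamples/s")
    print(f"frame period: {frame_period_us(fft_size, clock_hz):.3f} us")
    print(f"frames per second: {frames_per_second(fft_size, clock_hz):,.2f}")
```

Because the pipeline depth is fixed at design time, the delay from input to output is a constant number of clock cycles, which is what makes the latency bounded.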

Used in: software-defined radio (Ettus USRP, RFSoC), high-frequency trading (microsecond-deterministic order processing), 5G remote radio heads, defense and instrumentation.

The downside: FPGA design is hard. HDL coding, timing closure, and tooling are all nontrivial.

7.6 ASICs and NPUs

For high-volume consumer products (cellphone modems, hearing aids, AirPods), the DSP becomes a custom ASIC. The NPU (Neural Processing Unit) is the latest variant: it is optimized for INT8 matrix multiplication at extremely low power and is used for on-device AI inference (image recognition in your phone's camera app, voice activation in your earbuds).
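
INT8 matrix multiplication is the same multiply-accumulate pattern as a filter, applied to quantized values. The following sketch shows one common scheme under stated assumptions: symmetric per-tensor quantization, 8-bit operands, and a wide integer accumulator (Python integers here; 32-bit accumulators are typical in hardware). The function names and scale values are hypothetical and do not correspond to any particular NPU's API.

```python
def quantize(matrix, scale):
    """Map floats to int8 values using a symmetric per-tensor scale."""
    return [[max(-128, min(127, round(v / scale))) for v in row] for row in matrix]


def int8_matmul(a_q, b_q):
    """Integer matrix product; each output is a sum of 8-bit x 8-bit products."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    return [[sum(a_q[i][k] * b_q[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


def dequantize(acc, scale_a, scale_b):
    """Convert integer accumulators back to real values."""
    return [[v * scale_a * scale_b for v in row] for row in acc]


if __name__ == "__main__":
    a = [[0.5, -1.0], [0.25, 0.75]]
    b = [[1.0, 0.0], [0.0, 1.0]]
    scale_a, scale_b = 1 / 128, 1 / 64   # assumed scales, powers of two
    a_q, b_q = quantize(a, scale_a), quantize(b, scale_b)
    acc = int8_matmul(a_q, b_q)
    print("int8 A:", a_q)
    print("accumulators:", acc)
    print("dequantized:", dequantize(acc, scale_a, scale_b))
```

All of the heavy arithmetic happens on small integers, which is what lets NPU hardware pack many multipliers into a small power budget; floating point appears only in the scale factors at the edges.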

The trend is clear: classical DSP chips are losing ground to general-purpose CPUs with SIMD (for moderate workloads), GPUs (for massive parallel batches), FPGAs (for low-latency / high-throughput specialty), and ASICs/NPUs (for shipped products). Pure-DSP chips remain in the niches where they win: very-low-power audio (hearing aids, headphone ANC), medical implants, industrial controllers, and legacy infrastructure.