CA-10: The CPU & ALU – Your VLSI Journey Starts Here

🧠CPU Overview

The Central Processing Unit (CPU) is the brain of the computer — the component that fetches instructions from memory, interprets them, and executes them. All other elements of the computer (memory, I/O, bus) exist to bring data to the CPU and take results away.

A CPU contains three primary sub-components working together:

Arithmetic Logic Unit (ALU): The computational heart — performs all arithmetic and logic operations on data.
Control Unit (CU): Directs the flow of information — reads instructions, generates control signals, sequences the ALU and register operations.
Register file: A set of fast on-chip storage locations — holds operands and intermediate results for the ALU, the program counter, the instruction register, and processor status flags.

Notable CPU architectures that shaped computer history include: Intel x86, Zilog Z80, Motorola 68000, MIPS, SPARC, HP PA-RISC, DEC Alpha, PowerPC, and ARM. Today, ARM dominates mobile and much of embedded design, RISC-V is gaining ground in embedded systems, and x86 remains dominant in desktop and server markets.

Figure 1 — CPU internal structure: Control Unit, ALU, and Register File with data flow

CPU internal structure. The Control Unit (top-left) decodes instructions and generates control signals. The ALU (top-right) performs all computations and updates the Flags/PSW register. The Register File (bottom) holds operands and results. All components communicate through the Internal CPU Bus. The System Bus Interface connects the CPU to main memory and I/O via MAR (address), MBR (data), and control signals.

⚙️ALU — Inputs, Outputs & Flags

The ALU takes operands from registers, performs an operation selected by the Control Unit, and produces a result. Simultaneously, it updates a set of condition flags in the PSW (Program Status Word) register that describe the result’s properties:

Figure 2 — ALU inputs, outputs, and flag generation

ALU inputs and outputs. The Control Unit selects the operation (opcode). Two operands arrive from registers A and B. The result is written back to a destination register. Four condition flags are set: N (result is negative), Z (result is zero), C (carry out from MSB), V (signed overflow occurred). Conditional branch instructions (BEQ, BNE, BGT…) test these flags to control program flow.

ALU operations by category

Category	Operations	Notes
Integer arithmetic	ADD SUB MUL DIV NEG ABS	Basic arithmetic on integers. MUL produces a 2n-bit product and DIV typically takes a 2n-bit dividend, so both often use register pairs.
Logical	AND OR NOT XOR	Bitwise operations — used for masking, bit manipulation, and Boolean logic.
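Shift / Rotate	SHL SHR SAR ROL ROR	SHL/SHR: logical shifts (fill with 0). SAR: arithmetic right shift (preserves sign bit). ROL/ROR: rotate, so bits shifted out of one end re-enter at the other; the rotate-through-carry variants (x86 RCL/RCR) include the carry flag in the loop.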
Comparison	CMP TEST	CMP sets flags without storing result (subtraction with discarded result). TEST is bitwise AND with discarded result.
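
The flag behaviour described above can be sketched in software. This is a minimal Python model, not a hardware description: the function name alu, the 8-bit default width, and the convention that C is the raw adder carry-out for SUB/CMP (as on ARM; x86 reports the inverse as a borrow) are assumptions made for illustration.

```python
def alu(op, a, b, n=8):
    """Toy n-bit ALU model (hypothetical helper). Returns (result, flags)."""
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    carry = 0
    overflow = 0
    if op in ("ADD", "SUB", "CMP"):
        # SUB and CMP reuse the adder: A + NOT(B) + 1
        operand = b if op == "ADD" else (~b & mask)
        carry_in = 0 if op == "ADD" else 1
        total = a + operand + carry_in
        result = total & mask
        carry = total >> n  # carry out of the MSB
        # Signed overflow: both adder inputs share a sign the result lacks
        overflow = int(((a ^ result) & (operand ^ result) & sign) != 0)
    elif op in ("AND", "TEST"):
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "XOR":
        result = a ^ b
    else:
        raise ValueError(f"unsupported operation: {op}")
    flags = {
        "N": result >> (n - 1),   # MSB of the result
        "Z": int(result == 0),
        "C": carry,
        "V": overflow,
    }
    # For CMP and TEST the caller keeps only the flags and drops the result.
    return result, flags
```

For example, alu("ADD", 0x7F, 0x01) yields 0x80 with N=1 and V=1, because +127 + 1 does not fit in 8 signed bits.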

📦CPU Registers

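Registers are the fastest storage in the entire computer: on-chip storage cells that can typically be read within a single clock cycle. A CPU has two categories: control and status registers (PC, IR, MAR, MBR, PSW), which the hardware uses to sequence execution, and user-visible registers (general-purpose registers, SP, LR), which programs name directly: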

Register	Name	Function
PC	Program Counter	Holds the address of the next instruction to fetch. Incremented after each fetch; overwritten by branch/jump instructions.
IR	Instruction Register	Holds the currently-executing instruction. The CU decodes its opcode field and operand fields.
MAR	Memory Address Register	Holds the address sent out on the address bus. Loaded from PC (instruction fetch) or from an operand address (data access).
MBR	Memory Buffer Register	Holds the word coming from or going to memory. Connected directly to the data bus.
R0–R15	General-Purpose Registers	Programmer-accessible. Hold operands and results. 32-bit ARM has 16 (R0–R15); x86-64 has 16 (RAX–R15); RISC-V has 32 (x0–x31).
SP	Stack Pointer	Points to the top of the stack. Used by PUSH/POP and function call/return.
LR	Link Register	In ARM: holds the return address when a function is called (the BL instruction stores the address of the following instruction into LR). RISC-V does the same with its ra register (x1). x86 pushes the return address onto the stack instead.
PSW	Program Status Word	Contains condition flags (N, Z, C, V), interrupt enable bit, processor mode (user/supervisor), and other status bits.

📝Register Transfer Language (RTL)

RTL (Register Transfer Language) is a symbolic notation for describing the flow of data between registers and the operations performed on them. It is the standard language for specifying CPU micro-operations at the hardware design level:

RTL notation	Meaning	Example
`R1 ← R2`	Copy contents of R2 into R1	Move/transfer operation
`R1 ← R2 + R3`	Add R2 and R3, store in R1	ADD R1, R2, R3
`MAR ← PC`	Load MAR with current PC (start of fetch)	Address bus = PC
`MBR ← M[MAR]`	Read memory at address MAR into MBR	Memory read operation
`IR ← MBR`	Load instruction register from memory buffer	Instruction register loaded
`PC ← PC + 1`	Increment program counter to next instruction	Sequential fetch advance
`if Z=1: PC ← target`	Conditional transfer — branch if Zero flag set	BEQ (branch if equal)

RTL notation underlies both CPU microcode and register-transfer-level (RTL) hardware design. When you write SystemVerilog to implement a CPU pipeline stage, you are expressing register transfers in a hardware description language. The instruction ADD R1, R2, R3 becomes three register transfers in the execute stage: R2 → ALU input A, R3 → ALU input B, ALU result → R1.
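
To make the notation concrete, here is a rough Python sketch that steps through the fetch micro-operations from the table and then executes one ADD. The dictionary-based CPU state, the tuple instruction encoding ("ADD", rd, rs1, rs2), and the function names are all assumptions chosen for illustration; a real design would express the same transfers in an HDL.

```python
def make_cpu(num_regs=4):
    """Hypothetical CPU state: special registers plus a small register file."""
    return {"PC": 0, "IR": None, "MAR": 0, "MBR": None, "R": [0] * num_regs}

def fetch(cpu, memory):
    cpu["MAR"] = cpu["PC"]             # MAR <- PC
    cpu["MBR"] = memory[cpu["MAR"]]    # MBR <- M[MAR]
    cpu["IR"] = cpu["MBR"]             # IR  <- MBR
    cpu["PC"] = cpu["PC"] + 1          # PC  <- PC + 1

def execute(cpu):
    op, rd, rs1, rs2 = cpu["IR"]       # decode the (assumed) tuple encoding
    if op != "ADD":
        raise ValueError(f"unsupported opcode: {op}")
    alu_a = cpu["R"][rs1]              # R[rs1] -> ALU input A
    alu_b = cpu["R"][rs2]              # R[rs2] -> ALU input B
    cpu["R"][rd] = (alu_a + alu_b) & 0xFFFFFFFF  # ALU result -> R[rd], 32-bit wrap

# Run ADD R1, R2, R3 with R2 = 5 and R3 = 7
memory = [("ADD", 1, 2, 3)]
cpu = make_cpu()
cpu["R"][2], cpu["R"][3] = 5, 7
fetch(cpu, memory)
execute(cpu)
```

After the two calls, R1 holds 12 and PC has advanced to 1.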

🔢Integer Representation

Computers store all data as binary digits (0 and 1). Representing non-negative integers in binary is straightforward: positional notation with powers of 2. Representing negative numbers requires a convention. Three representations are compared here:

Unsigned: All n bits represent magnitude. Range: 0 to 2ⁿ−1. No negative numbers.
Sign-Magnitude: MSB is sign bit (0=positive, 1=negative); remaining bits are magnitude.
Two’s Complement: The universal standard for signed integers in modern processors.

Figure 3 — 4-bit integer representations: unsigned, sign-magnitude, and two’s complement

4-bit integer representation comparison. Unsigned: values 0–15, no negatives. Sign-Magnitude: values +0, -0, ±1 through ±7 — note the two zeros problem (0000 and 1000 both mean zero). Two’s Complement: values -8 through +7 — single zero, and one extra negative value (-8) that has no positive counterpart. Two’s complement is universally used because the same adder circuit works for both positive and negative numbers.
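
The comparison can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name interpretations is an assumption, and because Python integers have no negative zero, the sign-magnitude pattern 1000 simply comes back as 0.

```python
def interpretations(bits, n=4):
    """Read one n-bit pattern three ways: (unsigned, sign-magnitude, two's complement)."""
    bits &= (1 << n) - 1
    unsigned = bits
    # Sign-magnitude: top bit is the sign, the rest is the magnitude
    magnitude = bits & ((1 << (n - 1)) - 1)
    sign_magnitude = -magnitude if bits >> (n - 1) else magnitude
    # Two's complement: patterns with the top bit set sit 2^n below their unsigned value
    twos = bits - (1 << n) if bits >> (n - 1) else bits
    return unsigned, sign_magnitude, twos

for pattern in range(16):
    u, sm, tc = interpretations(pattern)
    print(f"{pattern:04b}  unsigned={u:2d}  sign-mag={sm:2d}  two's-comp={tc:2d}")
```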

±Sign-Magnitude Representation

In sign-magnitude, the MSB (bit n-1) is the sign bit. The remaining n-1 bits hold the magnitude:

🔍 Sign-Magnitude Examples (8-bit)

+18 = 0001 0010 (0 = positive, magnitude = 18 = 001 0010)

-18 = 1001 0010 (1 = negative, magnitude = 18 = 001 0010)

Two zeros: +0 = 0000 0000 and -0 = 1000 0000 — different bit patterns, same value.

Range: For 8-bit sign-magnitude: -(2⁷-1) to +(2⁷-1) = -127 to +127.

Problems: To add +3 and -5 you must compare signs, then subtract magnitudes. The hardware needs separate logic for this — it cannot simply add the bit patterns. This is why sign-magnitude is almost never used in modern processors.

🔄Two’s Complement Representation

Two’s complement solves both sign-magnitude problems: it has a single zero and uses the same binary adder for all arithmetic.

Value formula

For an n-bit two’s complement number A with bits aₙ₋₁…a₀:

A = −2ⁿ⁻¹ · aₙ₋₁ + Σᵢ₌₀ⁿ⁻² 2ⁱ · aᵢ

The MSB has a negative weight (−2ⁿ⁻¹). All other bits have positive weights (powers of 2). A 1 in the MSB means the number is negative.
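
As a quick check of the formula, the sketch below evaluates the weighted sum directly on a bit string. The function name twos_value and the string input format are assumptions made for readability.

```python
def twos_value(bit_string):
    """Evaluate -2^(n-1)*a[n-1] + sum(2^i * a[i]) for a string such as '11101110'."""
    bits = [int(ch) for ch in bit_string if ch in "01"]  # ignore spaces
    n = len(bits)
    msb, rest = bits[0], bits[1:]
    value = -(2 ** (n - 1)) * msb          # negative weight on the MSB
    for position, bit in enumerate(rest):
        weight = 2 ** (n - 2 - position)   # positive weights for the other bits
        value += weight * bit
    return value
```

For 1110 1110 the terms are −128 + 64 + 32 + 8 + 4 + 2, which gives −18.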

How to negate (form the two’s complement)

Invert all bits (bitwise NOT / one’s complement)
Add 1 to the result

🔍 Worked Examples — Two’s Complement Negation (8-bit)

Negate +18:

+18 = 0001 0010
NOT = 1110 1101 (invert all bits)
+ 1
= 1110 1110 = −18 ✓

Verify: negate −18 back to +18:

−18 = 1110 1110
NOT = 0001 0001
+ 1
= 0001 0010 = +18 ✓

Special case (zero): NOT(0000 0000) = 1111 1111; adding 1 gives 1 0000 0000; the carry out of the MSB is discarded, leaving 0000 0000. So −0 = 0 ✓

Special case (most negative number): −128 = 1000 0000. NOT gives 0111 1111; adding 1 gives 1000 0000. So −(−128) = −128. This is the unavoidable anomaly of two’s complement: the range is asymmetric, −2ⁿ⁻¹ to +(2ⁿ⁻¹−1), so +128 has no 8-bit representation.
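
The invert-and-add-one rule, including both special cases, can be sketched in Python as follows. The function name negate and the 8-bit default are assumptions for illustration; the final mask models discarding the carry out of the MSB.

```python
def negate(x, n=8):
    """Two's complement negation of an n-bit pattern: invert all bits, add 1."""
    mask = (1 << n) - 1
    inverted = ~x & mask            # bitwise NOT, limited to n bits
    return (inverted + 1) & mask    # add 1; any carry out of bit n-1 is dropped

print(f"{negate(0b00010010):08b}")  # pattern for -18
print(f"{negate(0b10000000):08b}")  # -128 maps back onto itself
```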

Sign extension

To extend an n-bit two’s complement number to m bits (m > n): copy the sign bit into all the new higher-order positions (fill with 0s for positive, 1s for negative):

Number	8-bit	16-bit extension	Rule
+18	0001 0010	0000 0000 0001 0010	Fill with 0s (positive)
−18	1110 1110	1111 1111 1110 1110	Fill with 1s (negative)
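
A short Python sketch of the sign-extension rule, assuming a hypothetical helper sign_extend(x, n, m) that widens an n-bit pattern to m bits:

```python
def sign_extend(x, n, m):
    """Extend an n-bit two's complement pattern to m bits (m > n)."""
    x &= (1 << n) - 1
    if x & (1 << (n - 1)):                  # sign bit set: negative value
        fill = ((1 << (m - n)) - 1) << n    # m-n ones in the new upper positions
        x |= fill
    return x                                # positive values are already zero-filled

print(f"{sign_extend(0b00010010, 8, 16):016b}")
print(f"{sign_extend(0b11101110, 8, 16):016b}")
```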

➕Two’s Complement Arithmetic — Addition

The beauty of two’s complement is that addition works identically for positive and negative numbers — the same binary adder handles all cases. Any carry out of the MSB is simply discarded.

🔍 Worked Examples — Two’s Complement Addition (4-bit)

(a) (+3) + (+4) = +7

0011
+ 0100
──────
0111 = +7 ✓

(b) (−7) + (+5) = −2

1001
+ 0101
──────
1110 = −2 ✓ (1110 two’s complement = −2)

(c) (−4) + (+4) = 0

1100
+ 0100
──────
(1)0000 → carry ignored → 0000 = 0 ✓

(d) (−4) + (−1) = −5

1100
+ 1111
──────
(1)1011 → carry ignored → 1011 = −5 ✓
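
The four cases can be replayed with a small Python sketch. The helper names add_discard_carry and to_signed are assumptions; the point is that one unsigned addition followed by masking covers every sign combination.

```python
def to_signed(x, n=4):
    """Interpret an n-bit pattern as a two's complement value."""
    return x - (1 << n) if x & (1 << (n - 1)) else x

def add_discard_carry(a, b, n=4):
    """Add two n-bit patterns; return (n-bit sum, carry out of the MSB)."""
    mask = (1 << n) - 1
    total = (a & mask) + (b & mask)
    return total & mask, total >> n

for a, b in [(0b0011, 0b0100), (0b1001, 0b0101), (0b1100, 0b0100), (0b1100, 0b1111)]:
    s, carry = add_discard_carry(a, b)
    print(f"{a:04b} + {b:04b} = ({carry}){s:04b} -> {to_signed(s)}")
```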

⚠️Overflow Detection

Overflow occurs when the true result of an addition lies outside the range representable in n bits (greater than 2ⁿ⁻¹−1 or less than −2ⁿ⁻¹). The overflow rule for two’s complement addition is:

Overflow rule: If two numbers of the same sign are added and the result has the opposite sign, overflow has occurred. (Adding two positives cannot give a negative; adding two negatives cannot give a positive. If it does, the result has wrapped around.)

Figure 4 — Overflow detection in two’s complement addition (4-bit examples)

Overflow examples. No overflow (left): adding opposite-sign numbers can never overflow; adding same-sign numbers is safe if the result has the same sign. Overflow (right, c): +5 + +4 should give +9, but in 4-bit two’s complement +9 cannot be represented — the result bit pattern 1001 reads as −7, which is wrong. The V flag is set to 1 to signal the error.

Hardware overflow detection: In an n-bit adder, overflow is detected by XORing the carry into the MSB with the carry out of the MSB: V = Cₙ XOR Cₙ₋₁. If these two carries differ, overflow has occurred. This is a single gate — overflow detection adds negligible hardware cost. The processor sets the V flag in the PSW, and the program can test it with a branch instruction (BVS — branch if overflow set).
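
The carry-based rule can be checked against the sign-based rule with a bit-level sketch. The ripple-carry loop below is an assumed software stand-in for the adder (the function name add_with_overflow is hypothetical); it records the carry into and out of the MSB and XORs them.

```python
def add_with_overflow(a, b, n=4):
    """Ripple-carry add of two n-bit patterns; returns (sum, carry_out, V)."""
    carry = 0
    carry_into_msb = 0
    result = 0
    for i in range(n):
        if i == n - 1:
            carry_into_msb = carry              # carry entering the MSB position
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i        # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi)) # full-adder carry out
    carry_out = carry
    v = carry_into_msb ^ carry_out              # the two carries differ on overflow
    return result, carry_out, v

print(add_with_overflow(0b0101, 0b0100))  # +5 + +4 in 4 bits
```

Looping over all 4-bit operand pairs shows that this V agrees with the same-sign-in, opposite-sign-out rule in every case.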

➖Subtraction via Two’s Complement

Subtraction (A − B) is performed by adding A and the two’s complement of B:

A − B = A + (−B) = A + (NOT B + 1)

This means the hardware needs only an adder and a complementer (NOT gates + carry-in of 1) — no separate subtractor circuit. When the CPU executes SUB, it passes operand B through the complementer (which inverts all bits) and sets carry-in = 1 on the adder. The adder then computes A + (~B) + 1 = A − B automatically.

🔍 Worked Example — Subtraction (M − S) in two’s complement (8-bit)

(a) 2 − 7 = −5: M = +2 = 0000 0010 S = +7 = 0000 0111

S’ = two’s complement of 7 = 1111 1001
M + S’ = 0000 0010 + 1111 1001 = 1111 1011 = −5 ✓

(b) −5 − 2 = −7: M = −5 = 1111 1011 S = +2 = 0000 0010

S’ = two’s complement of 2 = 1111 1110
M + S’ = 1111 1011 + 1111 1110 = (1)1111 1001 → carry ignored → 1111 1001 = −7 ✓
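
Both subtraction examples follow from one line of arithmetic, sketched here in Python. The name subtract and the 8-bit default are assumptions; the expression mirrors the datapath described above: invert S, then add with carry-in = 1.

```python
def subtract(m, s, n=8):
    """Compute M - S on n-bit patterns as M + NOT(S) + 1, dropping the carry out."""
    mask = (1 << n) - 1
    total = (m & mask) + (~s & mask) + 1   # the +1 plays the role of carry-in
    return total & mask

print(f"{subtract(0b00000010, 0b00000111):08b}")  # 2 - 7
print(f"{subtract(0b11111011, 0b00000010):08b}")  # -5 - 2
```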

✖️Multiplication — Partial Products

Multiplication of two n-bit numbers produces a result up to 2n bits long. The basic algorithm generates partial products — one for each bit of the multiplier — then sums them with appropriate left-shifts:

🔍 Worked Example — Unsigned Binary Multiplication (4-bit)

Multiplicand M = 1011 (11) Multiplier Q = 1101 (13)

    1011 × 1101
──────────────
       1011      (1011 × 1 × 2⁰: Q bit 0 is 1)
      0000       (1011 × 0 × 2¹: Q bit 1 is 0, shift left 1)
     1011        (1011 × 1 × 2²: Q bit 2 is 1, shift left 2)
    1011         (1011 × 1 × 2³: Q bit 3 is 1, shift left 3)
──────────────
   10001111 = 143 ✓   (11 × 13 = 143)

Key observations: (1) Each partial product = 0 (if Q bit=0) or M shifted left by bit position (if Q bit=1). (2) The final product is 8 bits — twice the 4-bit operand width. (3) Processor registers must be wide enough to hold the 2n-bit result (e.g. 64-bit result of 32×32-bit multiply uses a register pair).
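
The partial-product view translates almost directly into code. In this sketch the function name partial_products is an assumption; it returns one shifted copy of M (or zero) per multiplier bit, and the product is their sum.

```python
def partial_products(m, q, n=4):
    """Return the n partial products of unsigned M x Q, least significant first."""
    products = []
    for i in range(n):
        q_bit = (q >> i) & 1
        products.append((m << i) if q_bit else 0)  # M shifted left by the bit position, or 0
    return products

rows = partial_products(0b1011, 0b1101)
for row in rows:
    print(f"{row:08b}")
print(f"{sum(rows):08b} = {sum(rows)}")
```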

Hardware multiplication circuit

An efficient hardware multiplier uses registers A (accumulator), M (multiplicand), Q (multiplier), and a carry bit C. The control logic scans Q one bit at a time:

If Q₀ = 1: add M to A (partial product accumulation)
Shift C, A, Q right by 1 (shifting the running sum right has the same effect as shifting the next partial product left)
Repeat n times (once per multiplier bit)
Result in A:Q (upper n bits in A, lower n bits in Q)
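
The register-level procedure above can be simulated as follows. This is a sketch under the assumption that C, A, and Q behave as one (2n+1)-bit shift register; the function name shift_add_multiply is hypothetical.

```python
def shift_add_multiply(m, q, n=4):
    """Unsigned shift-add multiply using registers C, A, Q. Returns the 2n-bit product."""
    mask = (1 << n) - 1
    a, c = 0, 0
    for _ in range(n):
        if q & 1:                       # Q0 = 1: accumulate the multiplicand
            total = a + m
            c = total >> n              # carry out of the n-bit add
            a = total & mask
        # Shift C, A, Q right by one position as a single register
        combined = ((c << (2 * n)) | (a << n) | q) >> 1
        c = 0                           # C has moved into the top of A
        a = (combined >> n) & mask
        q = combined & mask
    return (a << n) | q                 # upper half in A, lower half in Q

print(shift_add_multiply(0b1011, 0b1101))
```

Only an n-bit adder is needed, because the running sum is shifted right instead of widening the multiplicand.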

Problem with negative numbers: The unsigned partial-product algorithm does not correctly handle two’s complement negative numbers. For example, −5 × −3 in 4 bits uses the bit patterns 1011 × 1101, which the unsigned algorithm multiplies as 11 × 13 = 143 (1000 1111, or −113 as a signed 8-bit value) instead of +15, because the sign bits are treated as ordinary bits with positive weight.

⚡Booth’s Algorithm

Booth’s algorithm (Andrew Booth, 1951) multiplies two’s complement numbers correctly and is often more efficient than the basic approach, because it skips over blocks of consecutive 1s or 0s.

Key insight

A block of consecutive 1s in the multiplier can be replaced by one subtraction at the start of the block and one addition at the end, instead of n additions for n ones:

M × 00011110 = M × (2⁵ − 2¹) instead of M × (2⁴+2³+2²+2¹)

Algorithm rules

Examine two bits: the current multiplier bit (Q₀) and the bit to its right (Q₋₁, initially 0):

Q₀ (current)	Q₋₁ (previous)	Action	Meaning
0	0	Arithmetic right shift only	Middle of block of 0s — skip
0	1	A ← A + M, then right shift	End of block of 1s — add
1	0	A ← A − M, then right shift	Start of block of 1s — subtract
1	1	Arithmetic right shift only	Middle of block of 1s — skip

🔍 Worked Example — Booth’s Algorithm: 7 × 3 = 21

M = 0111 (7) Q = 0011 (3) A = 0000 Q₋₁ = 0 initially

Step	A	Q	Q₋₁	Action (A, Q, Q₋₁ shown after the shift)
Initial	0000	0011	0	none
1	1100	1001	1	Q₀=1, Q₋₁=0 → A←A−M = 1001; shift
2	1110	0100	1	Q₀=1, Q₋₁=1 → shift only
3	0010	1010	0	Q₀=0, Q₋₁=1 → A←A+M = 0101; shift
4	0001	0101	0	Q₀=0, Q₋₁=0 → shift only

Result: A:Q = 0001 0101 = 21 = 7 × 3 ✓

The algorithm produces the correct result for both positive and negative numbers without any special case handling — this universality is its key advantage over unsigned multiplication with sign correction.
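
The rules can be exercised with a short simulation. This is a sketch, not a reference implementation: the function name booth_multiply and the returned trace are assumptions, and because A is kept to n bits here, the most negative multiplicand (M = −2ⁿ⁻¹) is not handled, since −M does not fit in n bits.

```python
def booth_multiply(m, q, n=4):
    """Booth multiply of two n-bit signed ints. Returns (product, trace).

    trace holds (A, Q, Q-1) after each shift. Assumes m != -2**(n-1).
    """
    mask = (1 << n) - 1
    msb = 1 << (n - 1)
    m &= mask
    q &= mask
    a, q_1 = 0, 0
    trace = []
    for _ in range(n):
        pair = (q & 1, q_1)
        if pair == (1, 0):
            a = (a - m) & mask          # start of a block of 1s: subtract M
        elif pair == (0, 1):
            a = (a + m) & mask          # end of a block of 1s: add M
        # Arithmetic right shift of the combined A:Q:Q-1 register
        q_1 = q & 1
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (a & msb)        # keep the sign bit of A
        trace.append((a, q, q_1))
    product = (a << n) | q
    if product & (1 << (2 * n - 1)):    # reinterpret A:Q as a signed 2n-bit value
        product -= 1 << (2 * n)
    return product, trace

product, trace = booth_multiply(7, 3)
print(product, [(f"{a:04b}", f"{q:04b}", q1) for a, q, q1 in trace])
```

Calling booth_multiply(-5, -3) returns 15 with no sign correction step, which is the property the paragraph above describes.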

Efficiency advantage: For a multiplier with long runs of identical bits, Booth’s algorithm requires far fewer add/subtract operations. Example: multiplying by 01111110 (a block of six 1s) takes 6 additions with the naive method but only 2 operations with Booth (1 subtraction + 1 addition). Modern hardware multipliers extend this idea to Radix-4 or Radix-8 Booth encoding, which retire 2 or 3 multiplier bits per step and so cut the number of partial products to about one half or one third.

🔬Floating Point Unit (FPU)

The ALU handles integer arithmetic. For floating-point operations, a dedicated FPU (Floating Point Unit) is typically provided. Floating-point numbers use the form:

± significand × 2^exponent

Figure 5 — IEEE 754 single-precision (32-bit) floating-point format

IEEE 754 single-precision format. The 32-bit word is divided into: 1 sign bit, 8-bit biased exponent (biased by 127 — the actual exponent is stored_exponent − 127), and 23-bit significand. The leading “1.” of the normalised number is implied (not stored), giving 24 bits of effective precision (~7 decimal digits). Double-precision uses 1+11+52 = 64 bits with ~15 decimal digits of precision.
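
The field layout can be inspected from Python using the standard struct module. The helper names decode_float32 and normal_value are assumptions, and normal_value deliberately covers only normalised numbers (stored exponent 1 to 254), not zeros, subnormals, infinities, or NaNs.

```python
import struct

def decode_float32(x):
    """Split a value, rounded to single precision, into (sign, stored exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 32-bit word
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    fraction = bits & 0x7FFFFF          # 23 stored significand bits
    return sign, exponent, fraction

def normal_value(sign, exponent, fraction):
    """Rebuild a normalised value: (-1)^s * 1.fraction * 2^(exponent - 127)."""
    significand = 1 + fraction / 2**23  # restore the implied leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

print(decode_float32(-18.0))
print(normal_value(*decode_float32(-18.0)))
```

For −18.0 = −1.001₂ × 2⁴ the stored exponent is 4 + 127 = 131 and the fraction field is 001 followed by twenty zeros.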

When no FPU is present, software emulates floating-point using the integer ALU, executing dozens of integer operations to accomplish one floating-point operation. An FPU can be: (1) a separate chip (the 8087 math co-processor paired with the 8086/8088), (2) on-chip but a separate unit (the i486DX integrated FPU), or (3) fully integrated into the main pipeline (all modern CPUs, GPUs, and neural accelerators).

🔬VLSI Connections

🔬 ALU implementation in RTL — from specification to gate-level

A synthesisable ALU in SystemVerilog is a straightforward combinational block: a case statement on the opcode selects the arithmetic or logical operation. For a 32-bit ALU: `case(alu_op) ADD: result = a + b; SUB: result = a - b; AND: result = a & b; ... endcase`. The synthesis tool maps this to adders, XOR trees, and multiplexers. The flags N, Z, C, V are derived combinationally from the result and operand MSBs. For addition, the V (overflow) flag is computed as: `assign overflow = (a[31] == b[31]) && (result[31] != a[31]);`. For subtraction the operand test flips, because overflow is only possible when the operands have different signs: `(a[31] != b[31]) && (result[31] != a[31])`. The complete ALU, including a 32×32-bit Booth multiplier and a 32-bit carry-lookahead adder, is typically 3,000–15,000 gates and is one of the core RTL blocks in any CPU design you will write or verify. Timing closure on the adder/multiplier critical path is one of the first challenges in physical design.

🔬 Two’s complement in hardware — the same adder for all arithmetic

The elegance of two’s complement is physical: a single hardware adder (plus an inverter on each bit of operand B and a carry-in control) implements all of ADD, SUB, NEG, and CMP. (TEST is a bitwise AND, so it uses the logic unit rather than the adder.) Conceptually, the x86 SUB instruction and the ARM SUBS instruction both route operand B through a row of inverters and set carry-in=1, then feed the result to the same adder used for ADD. This single-adder design is a major reason two’s complement displaced sign-magnitude and one’s complement from the 1960s onward and remains the standard today. The full-adder cells in your standard cell library are the building blocks of this datapath. When you do DFT insertion (scan chain testing) on a CPU, the patterns shifted in through the scan flip-flops around the ALU must exercise the adder on its corner cases, including the overflow anomaly at −2ⁿ⁻¹.

🔬 Booth encoding in modern multipliers — Radix-4 and Wallace trees

Modern high-performance CPU and GPU multipliers use Modified Booth Encoding (MBE) at Radix-4: they examine 3 multiplier bits at a time (overlapping by 1 bit) and replace them with a partial product coefficient from {−2M, −M, 0, +M, +2M}. This halves the number of partial products compared to basic Booth, reducing the adder tree depth. The partial products are then summed using a Wallace tree, a network of carry-save adders (CSAs) that reduces n partial products to two numbers in O(log n) levels, rather than by sequential addition. The final carry-propagate addition typically uses a parallel prefix adder such as Kogge-Stone or Brent-Kung. This Booth + CSA + parallel-prefix architecture is the basis of most fast multipliers in modern CPUs and DSPs. Writing the RTL for a 32×32 Booth-encoded multiplier with a Wallace tree is a standard advanced VLSI design exercise.
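
The Radix-4 recoding step can be sketched in Python. The function name radix4_booth_digits is an assumption, n is assumed to be even, and only the recoding is modelled (digit = −2·q[i+1] + q[i] + q[i−1] with q[−1] = 0); the carry-save tree and the final adder are left out.

```python
def radix4_booth_digits(q, n):
    """Recode an n-bit two's complement multiplier into n/2 digits in {-2,-1,0,1,2}.

    Digits are returned least significant first; digit k has weight 4**k.
    """
    q &= (1 << n) - 1
    padded = q << 1                       # append the implicit q[-1] = 0
    digits = []
    for i in range(0, n, 2):
        triplet = (padded >> i) & 0b111   # bits q[i+1], q[i], q[i-1]
        high = (triplet >> 2) & 1
        mid = (triplet >> 1) & 1
        low = triplet & 1
        digits.append(-2 * high + mid + low)
    return digits

def multiply_radix4(m, q, n):
    """Sum the n/2 partial products digit * M * 4**k (plain integer arithmetic)."""
    return sum(d * m * 4**k for k, d in enumerate(radix4_booth_digits(q, n)))

print(radix4_booth_digits(0b01111110, 8))
print(multiply_radix4(7, -3, 8))
```

An 8-bit multiplier yields four digits, so four partial products replace the eight of the bit-at-a-time scheme.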

Summary — CA-10 key points: The CPU contains three primary sub-systems: the Control Unit (decodes instructions, generates control signals), the ALU (performs all arithmetic and logic operations, updates N/Z/C/V flags), and the Register File (fast on-chip storage for operands, PC, SP, IR, MAR, MBR). RTL notation describes register transfers and operations at the micro-architecture level. Integer representation: Unsigned (0 to 2ⁿ-1), Sign-Magnitude (MSB is sign, two zeros, complex arithmetic), Two’s Complement (single zero, asymmetric range, same adder for add and subtract — universally used). Two’s complement negation: invert all bits, add 1. Overflow: same-sign operands produce opposite-sign result. Subtraction: A−B = A + (NOT B) + 1. Multiplication: partial products shifted and summed; unsigned straightforward, negative numbers need Booth’s algorithm. Booth’s algorithm handles two’s complement directly and is more efficient by skipping blocks of identical bits. IEEE 754 defines floating-point: 1 sign + 8 exponent + 23 significand (32-bit single), or 1+11+52 (64-bit double).
