CA-10: The CPU & ALU — Registers, Control Unit, Integer Arithmetic — VLSI Trainers
Computer Architecture · Article 10 of 12

CA-10: The CPU & ALU

The internal organisation of a CPU — registers, control unit, and ALU. Integer representation in sign-magnitude and two’s complement. Integer arithmetic: addition, subtraction, overflow detection, and multiplication via partial products and Booth’s algorithm. Register Transfer Language (RTL) for hardware description.

🧠CPU Overview

The Central Processing Unit (CPU) is the brain of the computer — the component that fetches instructions from memory, interprets them, and executes them. All other elements of the computer exist to bring data to the CPU and take results away.

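A CPU contains three primary sub-components working together: the Control Unit, which decodes instructions and generates control signals; the ALU, which performs all arithmetic and logic operations and updates the N/Z/C/V flags; and the Register File, which provides fast on-chip storage for operands and results.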

Notable CPU architectures: Intel x86, Zilog Z80, Motorola 68000, MIPS, SPARC, DEC Alpha, PowerPC, ARM, and RISC-V. Today, ARM dominates embedded and mobile design, RISC-V is gaining ground in embedded designs, and x86 remains dominant in desktop and server markets.

Figure 1 — CPU internal structure: Control Unit, ALU, and Register File with data flow
[Diagram: inside the CPU boundary sit the Control Unit (decodes the IR opcode, generates control signals, sequences ALU operations, holds IR, PC, MAR and MBR, and controls memory via MAR/MBR), the ALU (ADD, SUB, MUL, DIV, AND, OR, NOT, XOR, SHL, SHR, CMP) with its Flags/PSW register (N, Z, C, V), and the Register File (R0–R15, PC, SP, LR, IR, MAR/MBR), all joined by the internal CPU bus. The System Bus Interface drives the address bus from MAR, exchanges data through MBR, and carries control signals from the Control Unit.]

CPU internal structure. The Control Unit (top-left) decodes instructions and generates control signals. The ALU (top-right) performs all computations and updates the Flags/PSW register. The Register File (bottom) holds operands and results. All components communicate through the Internal CPU Bus. The System Bus Interface connects the CPU to main memory and I/O via MAR, MBR, and control signals.

⚙️ALU — Inputs, Outputs & Flags

The ALU takes operands from registers, performs an operation selected by the Control Unit, and produces a result. Simultaneously, it updates a set of condition flags in the PSW (Program Status Word) register:

Figure 2 — ALU inputs, outputs, and flag generation
[Diagram: Operand A (from register Rn) and Operand B (from register Rm) enter the ALU, and the Control Unit supplies the opcode (ADD, SUB, MUL, DIV, AND, OR, NOT, XOR, SHL, SHR, ROL, ROR, CMP, TEST, NEG, ABS). The ALU outputs the result, written to register Rd, and the condition flags in the PSW: N (Negative), Z (Zero), C (Carry), V (Overflow).]

ALU inputs and outputs. The Control Unit selects the operation (opcode). Two operands arrive from registers A and B. The result is written back to a destination register. Four condition flags are set: N (result is negative), Z (result is zero), C (carry out from MSB), V (signed overflow occurred). Conditional branch instructions (BEQ, BNE, BGT…) test these flags to control program flow.
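The flag behaviour described above can be sketched in Python. This is a minimal model under stated assumptions, not a description of any particular processor: the function name `alu`, the 8-bit default width, and the choice to support only ADD and SUB are all illustrative.

```python
def alu(op, a, b, width=8):
    """Hypothetical n-bit ALU model: returns (result, flags) for 'ADD' or 'SUB'."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    if op == "ADD":
        operand, carry_in = b, 0
    elif op == "SUB":
        # Subtraction reuses the adder: A + NOT(B) + 1
        operand, carry_in = (~b) & mask, 1
    else:
        raise ValueError(f"unsupported op: {op}")

    total = a + operand + carry_in
    result = total & mask
    msb = width - 1

    def sign(x):
        return (x >> msb) & 1

    flags = {
        "N": sign(result),                # result is negative
        "Z": int(result == 0),            # result is zero
        # Carry out of the adder. For SUB this follows the ARM convention;
        # x86 reports the inverse (a borrow flag).
        "C": (total >> width) & 1,
        # Signed overflow: both adder inputs share a sign, the result does not.
        "V": int(sign(a) == sign(operand) and sign(result) != sign(a)),
    }
    return result, flags


if __name__ == "__main__":
    print(alu("ADD", 5, 4, width=4))   # 4-bit (+5) + (+4): overflow expected
    print(alu("SUB", 2, 7))            # 8-bit 2 - 7
```

A conditional branch such as BEQ would then simply test `flags["Z"]` after a CMP, which is the SUB path with the result discarded.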

ALU operations by category

| Category | Operations | Notes |
|---|---|---|
| Integer arithmetic | ADD SUB MUL DIV NEG ABS | Basic arithmetic on integers. MUL produces a 2n-bit product, and DIV produces a quotient and a remainder, so both often use register pairs. |
| Logical | AND OR NOT XOR | Bitwise operations, used for masking, bit manipulation, and Boolean logic. |
| Shift / Rotate | SHL SHR SAR ROL ROR | SHL/SHR: logical shifts (fill with 0). SAR: arithmetic right shift (preserves sign bit). ROL/ROR: rotates (bits shifted out at one end re-enter at the other). |
| Comparison | CMP TEST | CMP is a subtraction whose result is discarded; only the flags are kept. TEST is a bitwise AND whose result is discarded. |

📦CPU Registers

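Registers are the fastest storage in the entire computer: on-chip SRAM cells with 0–1 cycle access time. A CPU has two broad categories: programmer-visible registers (the general-purpose registers, SP and LR) and control and status registers (PC, IR, MAR, MBR and PSW).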

| Register | Name | Function |
|---|---|---|
| PC | Program Counter | Holds the address of the next instruction to fetch. Incremented after each fetch; overwritten by branch/jump instructions. |
| IR | Instruction Register | Holds the currently executing instruction. The CU decodes its opcode field and operand fields. |
| MAR | Memory Address Register | Holds the address sent out on the address bus. Loaded from PC (instruction fetch) or from an operand address (data access). |
| MBR | Memory Buffer Register | Holds the word coming from or going to memory. Connected directly to the data bus. |
| R0–R15 | General-Purpose Registers | Programmer-accessible. Hold operands and results. ARM has 16 (R0–R15); x86-64 has 16 (RAX–R15); RISC-V has 32 (x0–x31). |
| SP | Stack Pointer | Points to the top of the stack. Used by PUSH/POP and function call/return. |
| LR | Link Register | Holds the return address when a function is called. On ARM, the BL instruction stores the address of the following instruction in LR; RISC-V uses register ra (x1) for the same purpose by convention. |
| PSW | Program Status Word | Contains condition flags (N, Z, C, V), interrupt enable bit, processor mode (user/supervisor), and other status bits. |

📝Register Transfer Language (RTL)

RTL (Register Transfer Language) is a symbolic notation for describing the flow of data between registers and the operations performed on them:

| RTL notation | Meaning | Example |
|---|---|---|
| R1 ← R2 | Copy contents of R2 into R1 | Move/transfer operation |
| R1 ← R2 + R3 | Add R2 and R3, store in R1 | ADD R1, R2, R3 |
| MAR ← PC | Load MAR with current PC (start of fetch) | Address bus = PC |
| MBR ← M[MAR] | Read memory at address MAR into MBR | Memory read operation |
| IR ← MBR | Load instruction register from memory buffer | Instruction register loaded |
| PC ← PC + 1 | Increment program counter to next instruction | Sequential fetch advance |
| if Z=1: PC ← target | Conditional transfer: branch if Zero flag set | BEQ (branch if equal) |

Register transfer notation underlies CPU microcode and register-transfer-level hardware design (also abbreviated RTL). When you write SystemVerilog to implement a CPU pipeline stage, you are expressing register transfers in a hardware description language. The instruction ADD R1, R2, R3 becomes three register-transfer steps in the execute stage: read R2 → ALU input A, read R3 → ALU input B, write ALU result → R1.
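The register transfers in the table map almost line for line onto a small simulation. The sketch below assumes a word-addressed memory modelled as a Python list and a CPU state held in a dictionary; `new_cpu`, `fetch` and `execute_add` are hypothetical names chosen for illustration.

```python
def new_cpu():
    """Hypothetical CPU state: control registers plus 16 general-purpose registers."""
    return {"PC": 0, "IR": 0, "MAR": 0, "MBR": 0, "R": [0] * 16}


def fetch(cpu, memory):
    """One instruction fetch, written as the four register transfers from the table."""
    cpu["MAR"] = cpu["PC"]             # MAR <- PC
    cpu["MBR"] = memory[cpu["MAR"]]    # MBR <- M[MAR]
    cpu["IR"] = cpu["MBR"]             # IR  <- MBR
    cpu["PC"] = cpu["PC"] + 1          # PC  <- PC + 1 (word-addressed memory assumed)


def execute_add(cpu, rd, rn, rm):
    """ADD Rd, Rn, Rm as three steps: read A, read B, write back the ALU result."""
    alu_a = cpu["R"][rn]               # read Rn -> ALU input A
    alu_b = cpu["R"][rm]               # read Rm -> ALU input B
    cpu["R"][rd] = alu_a + alu_b       # Rd <- A + B


if __name__ == "__main__":
    cpu = new_cpu()
    memory = [0xA1, 0xB2]              # two placeholder instruction words
    fetch(cpu, memory)
    cpu["R"][2], cpu["R"][3] = 10, 32
    execute_add(cpu, 1, 2, 3)
    print(cpu["IR"], cpu["PC"], cpu["R"][1])
```

Real hardware performs several of these transfers in the same clock cycle; the sequential Python statements only show the data flow, not the timing.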

🔢Integer Representation

Computers store all data as binary digits. Representing negative numbers requires a convention. Three approaches exist:

  1. Unsigned: All n bits represent magnitude. Range: 0 to 2ⁿ−1. No negative numbers.
  2. Sign-Magnitude: MSB is sign bit (0=positive, 1=negative); remaining bits are magnitude.
  3. Two’s Complement: The universal standard for signed integers in modern processors.
Figure 3 — 4-bit integer representations: unsigned, sign-magnitude, and two’s complement
| Binary | Unsigned | Sign-Magnitude | Two’s Complement |
|---|---|---|---|
| 0111 | 7 | +7 | +7 |
| 0110 | 6 | +6 | +6 |
| 0001 | 1 | +1 | +1 |
| 0000 | 0 | +0 | 0 |
| 1000 | 8 | −0 (duplicate zero) | −8 (extra value) |
| 1001 | 9 | −1 | −7 |
| 1110 | 14 | −6 | −2 |
| 1111 | 15 | −7 | −1 |

Sign-magnitude problems: two zeros (0000 = +0 and 1000 = −0); addition needs a sign comparison; a separate adder/subtractor is needed.
Two’s complement advantages: a single zero; the same adder works for addition and subtraction; overflow is simple to detect; used in virtually all modern processors.
4-bit ranges: unsigned 0 to 15; sign-magnitude −7 to +7 (two zeros); two’s complement −8 to +7.

4-bit integer representation comparison. Unsigned: values 0–15. Sign-Magnitude: +0, -0, ±1 through ±7 — note the two zeros problem. Two’s Complement: -8 through +7 — single zero, one extra negative value (-8). Two’s complement is universally used because the same adder circuit works for both positive and negative numbers.

±Sign-Magnitude Representation

In sign-magnitude, the MSB (bit n-1) is the sign bit. The remaining n-1 bits hold the magnitude:

🔍 Sign-Magnitude Examples (8-bit)

+18 = 0001 0010    (0 = positive, magnitude = 18)

-18 = 1001 0010    (1 = negative, magnitude = 18)

Two zeros: +0 = 0000 0000 and -0 = 1000 0000 — different bit patterns, same value.

Range: For 8-bit sign-magnitude: -(2⁷-1) to +(2⁷-1) = -127 to +127.

Problems: To add +3 and -5 you must compare signs, then subtract magnitudes. The hardware needs separate logic — it cannot simply add the bit patterns. This is why sign-magnitude is almost never used in modern processors.
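A short Python sketch makes both problems concrete. The helper names `sm_encode` and `sm_decode` are hypothetical, and the 8-bit width matches the examples above.

```python
def sm_encode(value, width=8):
    """Encode a signed integer as a sign-magnitude bit pattern (illustrative helper)."""
    mag_bits = width - 1
    if abs(value) > (1 << mag_bits) - 1:
        raise ValueError("magnitude does not fit")
    sign = 1 if value < 0 else 0
    return (sign << mag_bits) | abs(value)


def sm_decode(pattern, width=8):
    """Decode a sign-magnitude bit pattern back to a signed integer."""
    mag_bits = width - 1
    magnitude = pattern & ((1 << mag_bits) - 1)
    return -magnitude if (pattern >> mag_bits) & 1 else magnitude


if __name__ == "__main__":
    print(format(sm_encode(18), "08b"), format(sm_encode(-18), "08b"))
    # Both zero patterns decode to the same value.
    print(sm_decode(0b00000000), sm_decode(0b10000000))
    # Adding the raw bit patterns of +3 and -5 does not give -2.
    naive = (sm_encode(3) + sm_encode(-5)) & 0xFF
    print(sm_decode(naive))
```

The last line decodes to −8 rather than −2, which is why a sign-magnitude adder needs the extra compare-and-subtract logic.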

🔄Two’s Complement Representation

Two’s complement solves both sign-magnitude problems: it has a single zero and uses the same binary adder for all arithmetic.

How to negate (form the two’s complement)

  1. Invert all bits (bitwise NOT / one’s complement)
  2. Add 1 to the result
🔍 Worked Examples — Two’s Complement Negation (8-bit)

Negate +18:

+18 = 0001 0010
NOT = 1110 1101
 +1 = 1110 1110 = −18 ✓

Special case — most negative number: −128 = 1000 0000. NOT = 0111 1111 + 1 = 1000 0000. So −(−128) = −128 — the unavoidable anomaly of two’s complement. The range is asymmetric: −2ⁿ⁻¹ to +(2ⁿ⁻¹−1).
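The invert-and-add-one rule, including the −128 anomaly, can be checked with a few lines of Python. The function names `negate` and `to_signed` are illustrative, and the fixed width is modelled with a bit mask.

```python
def negate(pattern, width=8):
    """Two's complement negation: invert all bits, add 1, keep only `width` bits."""
    mask = (1 << width) - 1
    return ((~pattern & mask) + 1) & mask


def to_signed(pattern, width=8):
    """Interpret a `width`-bit pattern as a two's complement signed integer."""
    return pattern - (1 << width) if (pattern >> (width - 1)) & 1 else pattern


if __name__ == "__main__":
    p = negate(0b00010010)               # negate +18
    print(format(p, "08b"), to_signed(p))
    q = negate(0b10000000)               # negate the most negative value
    print(format(q, "08b"), to_signed(q))
```

Negating 1000 0000 returns the same pattern, so the most negative value has no positive counterpart at that width.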

Sign extension

| Number | 8-bit | 16-bit extension | Rule |
|---|---|---|---|
| +18 | 0001 0010 | 0000 0000 0001 0010 | Fill with 0s (positive) |
| −18 | 1110 1110 | 1111 1111 1110 1110 | Fill with 1s (negative) |
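Sign extension amounts to copying the sign bit into every new high-order position. A minimal sketch, with `sign_extend` as a hypothetical helper name:

```python
def sign_extend(pattern, from_bits, to_bits):
    """Widen a two's complement pattern by replicating its sign bit."""
    sign = (pattern >> (from_bits - 1)) & 1
    if sign:
        # Build a run of 1s covering the new high-order bits.
        fill = ((1 << (to_bits - from_bits)) - 1) << from_bits
        return pattern | fill
    return pattern


if __name__ == "__main__":
    print(format(sign_extend(0b00010010, 8, 16), "016b"))   # +18
    print(format(sign_extend(0b11101110, 8, 16), "016b"))   # -18
```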

Two’s Complement Arithmetic — Addition

The beauty of two’s complement is that addition works identically for positive and negative numbers — the same binary adder handles all cases. Any carry out of the MSB is simply discarded.

🔍 Worked Examples — Two’s Complement Addition (4-bit)

(a) (+3) + (+4) = +7

0011
+ 0100
──────
0111 = +7 ✓

(b) (−7) + (+5) = −2

1001
+ 0101
──────
1110 = −2 ✓

(c) (−4) + (+4) = 0

1100
+ 0100
──────
(1)0000 → carry ignored → 0000 = 0 ✓

(d) (−4) + (−1) = −5

1100
+ 1111
──────
(1)1011 → carry ignored → 1011 = −5 ✓

⚠️Overflow Detection

Overflow occurs when the true result of an addition falls outside the range representable in n bits: either too large a positive value or too large a negative one.

Overflow rule: If two numbers of the same sign are added and the result has the opposite sign, overflow has occurred. (Adding two positives cannot give a negative; adding two negatives cannot give a positive. If it does, the result has wrapped around.)
Figure 4 — Overflow detection in two’s complement addition (4-bit examples)
[Diagram: two panels. No overflow: (a) (−7) + (+5): 1001 + 0101 = 1110 = −2; operands of opposite sign never overflow. (b) (−4) + (−1): 1100 + 1111 = (1)1011 = −5; both negative, result negative, carry ignored. Overflow: (c) (+5) + (+4): 0101 + 0100 = 1001, which reads as −7 although both operands are positive. (d) (−7) + (−6): 1001 + 1010 = (1)0011, which reads as +3 although both operands are negative.]

Overflow examples. No overflow (left): adding opposite-sign numbers can never overflow. Overflow (right, c): +5 + +4 should give +9, but in 4-bit two’s complement the result bit pattern 1001 reads as −7, which is wrong. The V flag is set to 1 to signal the error.

Hardware overflow detection: In an n-bit adder, overflow is detected by XORing the carry into the MSB with the carry out of the MSB: V = Cₙ XOR Cₙ₋₁. If these two carries differ, overflow has occurred. This is a single gate — overflow detection adds negligible hardware cost.
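The sign rule and the carry rule can be compared directly in Python. The sketch below computes both for a 4-bit adder; `add_with_overflow` is an illustrative name, and the carry into the MSB is obtained by adding only the lower n−1 bits.

```python
def add_with_overflow(a, b, width=4):
    """Return (result, v_carry, v_sign) for an n-bit two's complement addition.

    v_carry: carry into the MSB XOR carry out of the MSB.
    v_sign:  operands share a sign and the result has the opposite sign.
    """
    mask = (1 << width) - 1
    msb = width - 1
    a &= mask
    b &= mask

    low_mask = (1 << msb) - 1
    carry_into_msb = ((a & low_mask) + (b & low_mask)) >> msb
    total = a + b
    carry_out_of_msb = total >> width
    result = total & mask

    v_carry = carry_into_msb ^ carry_out_of_msb
    v_sign = int((a >> msb) == (b >> msb) and (result >> msb) != (a >> msb))
    return result, v_carry, v_sign


if __name__ == "__main__":
    print(add_with_overflow(0b0101, 0b0100))   # (+5) + (+4)
    print(add_with_overflow(0b1001, 0b1010))   # (-7) + (-6)
    print(add_with_overflow(0b1001, 0b0101))   # (-7) + (+5)
```

Running both checks over all 256 pairs of 4-bit operands shows that they always agree, which is why hardware can use the cheaper XOR of two carries.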

Subtraction via Two’s Complement

Subtraction (A − B) is performed by adding A and the two’s complement of B:

A − B = A + (−B) = A + (NOT B + 1)

This means the hardware needs only an adder and a complementer (NOT gates + carry-in of 1) — no separate subtractor circuit. When the CPU executes SUB, it passes operand B through the complementer and sets carry-in = 1 on the adder.

🔍 Worked Example — Subtraction (M − S) in two’s complement (8-bit)

(a) 2 − 7 = −5:

S’ = two’s complement of 7 = 1111 1001
M + S’ = 0000 0010 + 1111 1001 = 1111 1011 = −5 ✓

(b) −5 − 2 = −7:

S’ = two’s complement of 2 = 1111 1110
M + S’ = 1111 1011 + 1111 1110 = (1)1111 1001 → carry ignored → 1111 1001 = −7 ✓
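The same adder-plus-complementer datapath can be modelled in a few lines. `subtract` is a hypothetical function name; the carry-in of 1 appears as the literal `+ 1`.

```python
def subtract(m, s, width=8):
    """Compute M - S as M + NOT(S) + 1, keeping only `width` bits."""
    mask = (1 << width) - 1
    total = (m & mask) + (~s & mask) + 1   # adder with carry-in = 1
    return total & mask                    # carry out of the MSB is discarded


if __name__ == "__main__":
    print(format(subtract(2, 7), "08b"))            # 2 - 7
    print(format(subtract(0b11111011, 2), "08b"))   # (-5) - 2
```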

✖️Multiplication — Partial Products

Multiplication of two n-bit numbers produces a result up to 2n bits long. The basic algorithm generates partial products — one for each bit of the multiplier — then sums them with appropriate left-shifts:

🔍 Worked Example — Unsigned Binary Multiplication (4-bit)

Multiplicand M = 1011 (11)   Multiplier Q = 1101 (13)

    1011 × 1101
──────────────
    1011      (Q bit 0 is 1)
   0000       (Q bit 1 is 0, shift left 1)
  1011        (Q bit 2 is 1, shift left 2)
 1011         (Q bit 3 is 1, shift left 3)
──────────────
10001111 = 143 ✓   (11 × 13 = 143)

Key observations: (1) The final product is 8 bits — twice the 4-bit operand width. (2) Processor registers must be wide enough to hold the 2n-bit result (e.g. 64-bit result of 32×32-bit multiply uses a register pair).
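The partial-product method can be written as a shift-and-add loop. This sketch is for unsigned operands only, and `multiply_unsigned` is an illustrative name; it also returns the list of partial products so they can be compared with the worked example.

```python
def multiply_unsigned(m, q, width=4):
    """Shift-and-add multiplication: one partial product per multiplier bit."""
    product = 0
    partials = []
    for i in range(width):
        bit = (q >> i) & 1
        partial = (m << i) if bit else 0   # multiplicand shifted left by i, or zero
        partials.append(partial)
        product += partial
    return product, partials


if __name__ == "__main__":
    product, partials = multiply_unsigned(0b1011, 0b1101)
    for p in partials:
        print(format(p, "08b"))
    print(format(product, "08b"), product)
```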

Booth’s Algorithm

Booth’s algorithm (Andrew Booth, 1951) multiplies two’s complement numbers directly, which the unsigned partial-product method above does not do, and it can also be more efficient because it skips over blocks of consecutive 1s or 0s.

Algorithm rules

Examine two bits: the current multiplier bit (Q₀) and the bit to its right (Q₋₁, initially 0):

| Q₀ (current) | Q₋₁ (previous) | Action | Meaning |
|---|---|---|---|
| 0 | 0 | Arithmetic right shift only | Middle of block of 0s: skip |
| 0 | 1 | A ← A + M, then arithmetic right shift | End of block of 1s: add |
| 1 | 0 | A ← A − M, then arithmetic right shift | Start of block of 1s: subtract |
| 1 | 1 | Arithmetic right shift only | Middle of block of 1s: skip |

Efficiency advantage: For a multiplier with long runs of identical bits, Booth’s algorithm requires far fewer add/subtract operations. Modern hardware multipliers extend this idea to Radix-4 or Radix-8 Booth encoding, processing 2 or 3 multiplier bits per step, which cuts the number of partial products to roughly a half or a third.
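The four rules can be traced with a small Python model. This is a sketch under one simplifying assumption: the accumulator A is kept as an unbounded signed Python integer rather than a fixed-width register, which sidesteps the corner case where the multiplicand is the most negative value. The name `booth_multiply` and the returned operation count are illustrative.

```python
def booth_multiply(m, q, width=4):
    """Booth's algorithm for signed m and q, with q in `width`-bit range.

    Returns (product, ops), where ops counts the add/subtract steps performed.
    """
    mask = (1 << width) - 1
    a = 0                  # accumulator A, kept as a signed Python int (assumption)
    q &= mask              # multiplier register Q
    q_minus1 = 0           # the extra bit to the right of Q
    ops = 0
    for _ in range(width):
        q0 = q & 1
        if q0 == 1 and q_minus1 == 0:
            a -= m         # start of a block of 1s
            ops += 1
        elif q0 == 0 and q_minus1 == 1:
            a += m         # end of a block of 1s
            ops += 1
        # Arithmetic right shift of the combined A : Q : Q-1 register.
        q_minus1 = q0
        q = (q >> 1) | ((a & 1) << (width - 1))
        a >>= 1            # Python's >> on ints preserves the sign
    return (a << width) | q, ops


if __name__ == "__main__":
    print(booth_multiply(7, 3))
    print(booth_multiply(-8, 1))
    print(booth_multiply(5, 7))   # multiplier 0111: one subtract and one add
```

For the multiplier 0111 the loop performs two add/subtract steps where the plain shift-and-add method would add three partial products, which illustrates the saving on runs of 1s.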

🔬Floating Point Unit (FPU)

The ALU handles integer arithmetic. For floating-point operations, a dedicated FPU (Floating Point Unit) is typically provided. Floating-point numbers use the form:

± significand × 2^exponent

Figure 5 — IEEE 754 single-precision (32-bit) floating-point format
[Diagram: bit 31 holds the sign S (1 bit); bits 30–23 hold the biased exponent (8 bits, bias = 127 for single precision and 1023 for double, giving an actual exponent range of −126 to +127); bits 22–0 hold the significand or mantissa (23 bits, with an implied leading 1 that is not stored, for an effective 24-bit precision of about 7 significant decimal digits). Value = (−1)^s × 1.significand × 2^(exponent−127). Double precision uses 1+11+52 bits. Special cases: an exponent of all 1s encodes NaN or ±Infinity; an exponent of all 0s encodes zero or a denormalised number.]

IEEE 754 single-precision format. The 32-bit word is divided into: 1 sign bit, 8-bit biased exponent (actual exponent = stored_exponent − 127), and 23-bit significand. The leading “1.” of the normalised number is implied (not stored), giving 24 bits of effective precision (~7 decimal digits). Double-precision uses 1+11+52 = 64 bits with ~15 decimal digits of precision.
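The field layout can be inspected from Python using the standard `struct` module. The helper names `decode_float32` and `normal_value` are hypothetical, and `normal_value` applies only to normalised numbers (stored exponent between 1 and 254).

```python
import struct


def decode_float32(x):
    """Split a value, rounded to IEEE 754 single precision, into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # stored (biased) exponent
    fraction = bits & 0x7FFFFF         # 23 stored significand bits
    return sign, exponent, fraction


def normal_value(sign, exponent, fraction):
    """Rebuild a normalised value: (-1)^s * 1.fraction * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


if __name__ == "__main__":
    fields = decode_float32(-6.5)
    print(fields)                      # -6.5 = -1.625 * 2^2
    print(normal_value(*fields))
```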

🔬VLSI Connections

🔬 ALU implementation in RTL — from specification to gate-level

A synthesisable ALU in SystemVerilog is a straightforward combinational block: a case statement on the opcode selects the arithmetic or logical operation. For a 32-bit ALU: case(alu_op) ADD: result = a + b; SUB: result = a - b; AND: result = a & b; ... endcase. The synthesis tool maps this to adders, XOR trees, and multiplexers. The flags N, Z, C, V are derived combinationally from the result and operand MSBs. For addition, the V (overflow) flag is computed as: assign overflow = (a[31] == b[31]) && (result[31] != a[31]);. For subtraction the sign test on b is inverted, because the adder actually sees NOT b. The complete ALU, including a 32×32-bit Booth multiplier and a 32-bit carry-lookahead adder, is typically 3,000–15,000 gates and is one of the core RTL blocks in any CPU design you will write or verify.

🔬 Two’s complement in hardware — the same adder for all arithmetic

The elegance of two’s complement is physical: one hardware adder (plus a row of inverters on operand B and a carry-in mux) implements ADD, SUB, NEG, and CMP. TEST is a bitwise AND, so it uses the ALU’s logic unit rather than the adder. The x86 SUB instruction and the ARM SUBS instruction both compute A + NOT B with carry-in = 1, conceptually on the same adder used for ADD. The full-adder cells in your standard cell library are the building blocks of this adder. The adder itself is combinational logic, so during scan-based testing (DFT) it is exercised through the scan flip-flops around it, and verification should cover its corner cases, including the overflow anomaly at −2ⁿ⁻¹.

🔬 Booth encoding in modern multipliers — Radix-4 and Wallace trees

Modern high-performance CPU and GPU multipliers commonly use Modified Booth Encoding (MBE) at Radix-4: they examine 3 multiplier bits at a time (overlapping by 1 bit) and replace them with a partial product coefficient from {−2M, −M, 0, +M, +2M}. This halves the number of partial products compared with one per multiplier bit, reducing the adder tree depth. The partial products are then summed using a Wallace tree, a network of carry-save adders (CSAs) that reduces n partial products to two numbers in O(log n) levels. A final fast carry-propagate adder, often a parallel-prefix design, adds those two numbers. This Booth + CSA + parallel-prefix architecture is typical of fast multipliers in modern CPUs and DSPs. Writing the RTL for a 32×32 Booth-encoded multiplier with a Wallace tree is a standard advanced VLSI design exercise.
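The Radix-4 recoding step can be sketched in Python to show where the coefficients {−2, −1, 0, +1, +2} come from. This models only the digit selection, not the Wallace tree or the final adder; the function names are illustrative, and the width is assumed to be even.

```python
def radix4_booth_digits(q, width=8):
    """Recode a `width`-bit two's complement multiplier into radix-4 Booth digits.

    Each digit comes from three overlapping bits (b[i+1], b[i], b[i-1]) and
    equals -2*b[i+1] + b[i] + b[i-1], with b[-1] taken as 0.
    """
    if width % 2:
        raise ValueError("width must be even for this sketch")
    padded = (q & ((1 << width) - 1)) << 1      # append the implicit b[-1] = 0
    digits = []
    for i in range(0, width, 2):
        triplet = (padded >> i) & 0b111
        b_hi, b_mid, b_lo = (triplet >> 2) & 1, (triplet >> 1) & 1, triplet & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits                                # least significant digit first


def radix4_multiply(m, q, width=8):
    """Sum the recoded partial products: digit * M * 4^k."""
    return sum(d * m * 4**k for k, d in enumerate(radix4_booth_digits(q, width)))


if __name__ == "__main__":
    print(radix4_booth_digits(13))     # 8-bit multiplier needs only 4 digits
    print(radix4_multiply(11, 13))
    print(radix4_multiply(-7, -3))
```

An 8-bit multiplier yields four digits, so four partial products replace eight, and each one is a shift or negation of M rather than a true multiplication.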

Summary — CA-10 key points: The CPU contains three primary sub-systems: the Control Unit (decodes instructions, generates control signals), the ALU (performs all arithmetic and logic operations, updates N/Z/C/V flags), and the Register File (fast on-chip storage for operands, PC, SP, IR, MAR, MBR). RTL notation describes register transfers and operations at the micro-architecture level. Integer representation: Unsigned (0 to 2ⁿ-1), Sign-Magnitude (MSB is sign, two zeros, complex arithmetic), Two’s Complement (single zero, asymmetric range, same adder for add and subtract — universally used). Two’s complement negation: invert all bits, add 1. Overflow: same-sign operands produce opposite-sign result. Subtraction: A−B = A + (NOT B) + 1. Multiplication: partial products shifted and summed. Booth’s algorithm handles two’s complement directly and is more efficient. IEEE 754 defines floating-point: 1 sign + 8 exponent + 23 significand (32-bit single), or 1+11+52 (64-bit double).