The internal organisation of a CPU — registers, control unit, and ALU. Integer representation in sign-magnitude and two’s complement. Integer arithmetic: addition, subtraction, overflow detection, and multiplication via partial products and Booth’s algorithm. Register Transfer Language (RTL) for hardware description.
The Central Processing Unit (CPU) is the brain of the computer — the component that fetches instructions from memory, interprets them, and executes them. All other elements of the computer exist to bring data to the CPU and take results away.
A CPU contains three primary sub-components working together:
Notable CPU architectures: Intel x86, Zilog Z80, Motorola 68000, MIPS, SPARC, DEC Alpha, PowerPC, and ARM. Today, ARM and RISC-V dominate embedded and mobile design while x86 remains dominant in desktop and server markets.
CPU internal structure. The Control Unit (top-left) decodes instructions and generates control signals. The ALU (top-right) performs all computations and updates the Flags/PSW register. The Register File (bottom) holds operands and results. All components communicate through the Internal CPU Bus. The System Bus Interface connects the CPU to main memory and I/O via MAR, MBR, and control signals.
The ALU takes operands from registers, performs an operation selected by the Control Unit, and produces a result. Simultaneously, it updates a set of condition flags in the PSW (Program Status Word) register:
ALU inputs and outputs. The Control Unit selects the operation (opcode). Two operands arrive from registers A and B. The result is written back to a destination register. Four condition flags are set: N (result is negative), Z (result is zero), C (carry out from MSB), V (signed overflow occurred). Conditional branch instructions (BEQ, BNE, BGT…) test these flags to control program flow.
| Category | Operations | Notes |
|---|---|---|
| Integer arithmetic | ADD SUB MUL DIV NEG ABS | Basic arithmetic on integers. DIV and MUL often require 2n-bit result register pairs. |
| Logical | AND OR NOT XOR | Bitwise operations — used for masking, bit manipulation, and Boolean logic. |
| Shift / Rotate | SHL SHR SAR ROL ROR | SHL/SHR: logical shifts (fill with 0). SAR: arithmetic right shift (preserves sign bit). ROL/ROR: rotate through carry. |
| Comparison | CMP TEST | CMP sets flags without storing result (subtraction with discarded result). TEST is bitwise AND with discarded result. |
Registers are the fastest storage in the entire computer — on-chip SRAM cells with 0–1 cycle access time. A CPU has two categories:
| Register | Name | Function |
|---|---|---|
| PC | Program Counter | Holds the address of the next instruction to fetch. Incremented after each fetch; overwritten by branch/jump instructions. |
| IR | Instruction Register | Holds the currently-executing instruction. The CU decodes its opcode field and operand fields. |
| MAR | Memory Address Register | Holds the address sent out on the address bus. Loaded from PC (instruction fetch) or from an operand address (data access). |
| MBR | Memory Buffer Register | Holds the word coming from or going to memory. Connected directly to the data bus. |
| R0–R15 | General-Purpose Registers | Programmer-accessible. Hold operands and results. ARM has 16 (R0–R15); x86-64 has 16 (RAX–R15); RISC-V has 32 (x0–x31). |
| SP | Stack Pointer | Points to the top of the current stack frame. Used by PUSH/POP and function call/return. |
| LR | Link Register | In ARM/RISC-V: holds the return address when a function is called (BL instruction stores PC into LR). |
| PSW | Program Status Word | Contains condition flags (N, Z, C, V), interrupt enable bit, processor mode (user/supervisor), and other status bits. |
RTL (Register Transfer Language) is a symbolic notation for describing the flow of data between registers and the operations performed on them:
| RTL notation | Meaning | Example |
|---|---|---|
R1 ← R2 | Copy contents of R2 into R1 | Move/transfer operation |
R1 ← R2 + R3 | Add R2 and R3, store in R1 | ADD R1, R2, R3 |
MAR ← PC | Load MAR with current PC (start of fetch) | Address bus = PC |
MBR ← M[MAR] | Read memory at address MAR into MBR | Memory read operation |
IR ← MBR | Load instruction register from memory buffer | Instruction register loaded |
PC ← PC + 1 | Increment program counter to next instruction | Sequential fetch advance |
if Z=1: PC ← target | Conditional transfer — branch if Zero flag set | BEQ (branch if equal) |
ADD R1, R2, R3 becomes three RTL statements in the execute stage: read R2 → ALU input A, read R3 → ALU input B, write ALU result → R1.Computers store all data as binary digits. Representing negative numbers requires a convention. Three approaches exist:
4-bit integer representation comparison. Unsigned: values 0–15. Sign-Magnitude: +0, -0, ±1 through ±7 — note the two zeros problem. Two’s Complement: -8 through +7 — single zero, one extra negative value (-8). Two’s complement is universally used because the same adder circuit works for both positive and negative numbers.
In sign-magnitude, the MSB (bit n-1) is the sign bit. The remaining n-1 bits hold the magnitude:
+18 = 0001 0010 (0 = positive, magnitude = 18)
-18 = 1001 0010 (1 = negative, magnitude = 18)
Two zeros: +0 = 0000 0000 and -0 = 1000 0000 — different bit patterns, same value.
Range: For 8-bit sign-magnitude: -(2⁷-1) to +(2⁷-1) = -127 to +127.
Problems: To add +3 and -5 you must compare signs, then subtract magnitudes. The hardware needs separate logic — it cannot simply add the bit patterns. This is why sign-magnitude is almost never used in modern processors.
Two’s complement solves both sign-magnitude problems: it has a single zero and uses the same binary adder for all arithmetic.
Negate +18:
+18 = 0001 0010
NOT = 1110 1101
+ 1
= 1110 1110 = −18 ✓
Special case — most negative number: −128 = 1000 0000. NOT = 0111 1111 + 1 = 1000 0000. So −(−128) = −128 — the unavoidable anomaly of two’s complement. The range is asymmetric: −2ⁿ⁻¹ to +(2ⁿ⁻¹−1).
| Number | 8-bit | 16-bit extension | Rule |
|---|---|---|---|
| +18 | 0001 0010 | 0000 0000 0001 0010 | Fill with 0s (positive) |
| −18 | 1110 1110 | 1111 1111 1110 1110 | Fill with 1s (negative) |
The beauty of two’s complement is that addition works identically for positive and negative numbers — the same binary adder handles all cases. Any carry out of the MSB is simply discarded.
(a) (+3) + (+4) = +7
0011
+ 0100
──────
0111 = +7 ✓
(b) (−7) + (+5) = −2
1001
+ 0101
──────
1110 = −2 ✓
(c) (−4) + (+4) = 0
1100
+ 0100
──────
(1)0000 → carry ignored → 0000 = 0 ✓
(d) (−4) + (−1) = −5
1100
+ 1111
──────
(1)1011 → carry ignored → 1011 = −5 ✓
Overflow occurs when the result of an addition is too large to be represented in n bits.
Overflow examples. No overflow (left): adding opposite-sign numbers can never overflow. Overflow (right, c): +5 + +4 should give +9, but in 4-bit two’s complement the result bit pattern 1001 reads as −7, which is wrong. The V flag is set to 1 to signal the error.
V = Cₙ XOR Cₙ₋₁. If these two carries differ, overflow has occurred. This is a single gate — overflow detection adds negligible hardware cost.Subtraction (A − B) is performed by adding A and the two’s complement of B:
A − B = A + (−B) = A + (NOT B + 1)
This means the hardware needs only an adder and a complementer (NOT gates + carry-in of 1) — no separate subtractor circuit. When the CPU executes SUB, it passes operand B through the complementer and sets carry-in = 1 on the adder.
(a) 2 − 7 = −5:
S’ = two’s complement of 7 = 1111 1001
M + S’ = 0000 0010 + 1111 1001 = 1111 1011 = −5 ✓
(b) −5 − 2 = −7:
S’ = two’s complement of 2 = 1111 1110
M + S’ = 1111 1011 + 1111 1110 = (1)1111 1001 → carry ignored → 1111 1001 = −7 ✓
Multiplication of two n-bit numbers produces a result up to 2n bits long. The basic algorithm generates partial products — one for each bit of the multiplier — then sums them with appropriate left-shifts:
Multiplicand M = 1011 (11) Multiplier Q = 1101 (13)
1011 × 1101
──────────────
1011 (Q bit 0 is 1)
0000 (Q bit 1 is 0)
1011 (Q bit 2 is 1, shift left 2)
1011 (Q bit 3 is 1, shift left 3)
──────────────
10001111 = 143 ✓ (11 × 13 = 143)
Key observations: (1) The final product is 8 bits — twice the 4-bit operand width. (2) Processor registers must be wide enough to hold the 2n-bit result (e.g. 64-bit result of 32×32-bit multiply uses a register pair).
Booth’s algorithm (Andrew Booth, 1951) solves two’s complement multiplication correctly and is also more efficient — it skips over blocks of consecutive 1s or 0s.
Examine two bits: the current multiplier bit (Q₀) and the bit to its right (Q₋₁, initially 0):
| Q₀ (current) | Q₋₁ (previous) | Action | Meaning |
|---|---|---|---|
| 0 | 0 | Arithmetic right shift only | Middle of block of 0s — skip |
| 0 | 1 | A ← A + M, then right shift | End of block of 1s — add |
| 1 | 0 | A ← A − M, then right shift | Start of block of 1s — subtract |
| 1 | 1 | Arithmetic right shift only | Middle of block of 1s — skip |
The ALU handles integer arithmetic. For floating-point operations, a dedicated FPU (Floating Point Unit) is typically provided. Floating-point numbers use the form:
± significand × 2exponent
IEEE 754 single-precision format. The 32-bit word is divided into: 1 sign bit, 8-bit biased exponent (actual exponent = stored_exponent − 127), and 23-bit significand. The leading “1.” of the normalised number is implied (not stored), giving 24 bits of effective precision (~7 decimal digits). Double-precision uses 1+11+52 = 64 bits with ~15 decimal digits of precision.
A synthesisable ALU in SystemVerilog is a straightforward combinational block: a case statement on the opcode selects the arithmetic or logical operation. For a 32-bit ALU: case(alu_op) ADD: result = a + b; SUB: result = a - b; AND: result = a & b; ... endcase. The synthesis tool maps this to adders, XOR trees, and multiplexers. The flags N, Z, C, V are derived combinationally from the result and operand MSBs. The V (overflow) flag is computed as: assign overflow = (a[31] == b[31]) && (result[31] != a[31]);. The complete ALU — including a 32×32-bit Booth multiplier and a 32-bit carry-lookahead adder — is typically 3,000–15,000 gates and is one of the core RTL blocks in any CPU design you will write or verify.
The elegance of two’s complement is physical: exactly one hardware adder (plus a complement gate and a carry-in mux) implements all of ADD, SUB, NEG, CMP, and TEST. The x86 SUB instruction and the ARM SUBS instruction both route operand B through a NOT tree and set carry-in=1, then feed the result to the same adder used for ADD. Every full-adder cell in your standard cell library is designed around this principle. When you do DFT insertion (scan chain testing) on a CPU, the adder cells in the ALU represent a significant fraction of the total scan chain length and must be characterised for all corner cases — including the overflow anomaly at −2ⁿ⁻¹.
Modern high-performance CPU and GPU multipliers use Modified Booth Encoding (MBE) at Radix-4 — examining 3 multiplier bits at a time (overlapping by 1 bit) and replacing them with a partial product coefficient from {−2M, −M, 0, +M, +2M}. This halves the number of partial products compared to basic Booth, reducing the adder tree depth. The partial products are then summed using a Wallace tree — a network of carry-save adders (CSAs) that reduces n partial products to two numbers in O(log n) levels. This Booth + CSA + parallel-prefix architecture is what every fast multiplier in a modern CPU or DSP uses. Writing the RTL for a 32×32 Booth-encoded multiplier with a Wallace tree is a standard advanced VLSI design exercise.