CPU Internal Organisation, Integer Arithmetic and RTL — Complete Guide

🧠CPU Overview

The Central Processing Unit (CPU) is the brain of the computer — the component that fetches instructions from memory, interprets them, and executes them. All other elements of the computer exist to bring data to the CPU and take results away.

A CPU contains three primary sub-components working together:

Arithmetic Logic Unit (ALU): The computational heart — performs all arithmetic and logic operations on data.
Control Unit (CU): Directs the flow of information — reads instructions, generates control signals, sequences the ALU and register operations.
Register file: A set of fast on-chip storage locations — holds operands and intermediate results for the ALU, the program counter, the instruction register, and processor status flags.

Notable CPU architectures: Intel x86, Zilog Z80, Motorola 68000, MIPS, SPARC, DEC Alpha, PowerPC, and ARM. Today, ARM and RISC-V dominate embedded and mobile design while x86 remains dominant in desktop and server markets.

Figure 1 — CPU internal structure: Control Unit, ALU, and Register File with data flow

CPU internal structure. The Control Unit (top-left) decodes instructions and generates control signals. The ALU (top-right) performs all computations and updates the Flags/PSW register. The Register File (bottom) holds operands and results. All components communicate through the Internal CPU Bus. The System Bus Interface connects the CPU to main memory and I/O via MAR, MBR, and control signals.

⚙️ALU — Inputs, Outputs & Flags

The ALU takes operands from registers, performs an operation selected by the Control Unit, and produces a result. Simultaneously, it updates a set of condition flags in the PSW (Program Status Word) register:

Figure 2 — ALU inputs, outputs, and flag generation

ALU inputs and outputs. The Control Unit selects the operation (opcode). Two operands arrive from registers A and B. The result is written back to a destination register. Four condition flags are set: N (result is negative), Z (result is zero), C (carry out from MSB), V (signed overflow occurred). Conditional branch instructions (BEQ, BNE, BGT…) test these flags to control program flow.

ALU operations by category

Category	Operations	Notes
Integer arithmetic	ADD SUB MUL DIV NEG ABS	Basic arithmetic on integers. DIV and MUL often require 2n-bit result register pairs.
Logical	AND OR NOT XOR	Bitwise operations — used for masking, bit manipulation, and Boolean logic.
Shift / Rotate	SHL SHR SAR ROL ROR	SHL/SHR: logical shifts (fill with 0). SAR: arithmetic right shift (preserves sign bit). ROL/ROR: rotate through carry.
Comparison	CMP TEST	CMP sets flags without storing result (subtraction with discarded result). TEST is bitwise AND with discarded result.

📦CPU Registers

Registers are the fastest storage in the entire computer — on-chip SRAM cells with 0–1 cycle access time. A CPU has two categories:

Register	Name	Function
PC	Program Counter	Holds the address of the next instruction to fetch. Incremented after each fetch; overwritten by branch/jump instructions.
IR	Instruction Register	Holds the currently-executing instruction. The CU decodes its opcode field and operand fields.
MAR	Memory Address Register	Holds the address sent out on the address bus. Loaded from PC (instruction fetch) or from an operand address (data access).
MBR	Memory Buffer Register	Holds the word coming from or going to memory. Connected directly to the data bus.
R0–R15	General-Purpose Registers	Programmer-accessible. Hold operands and results. ARM has 16 (R0–R15); x86-64 has 16 (RAX–R15); RISC-V has 32 (x0–x31).
SP	Stack Pointer	Points to the top of the current stack frame. Used by PUSH/POP and function call/return.
LR	Link Register	In ARM/RISC-V: holds the return address when a function is called (BL instruction stores PC into LR).
PSW	Program Status Word	Contains condition flags (N, Z, C, V), interrupt enable bit, processor mode (user/supervisor), and other status bits.

📝Register Transfer Language (RTL)

RTL (Register Transfer Language) is a symbolic notation for describing the flow of data between registers and the operations performed on them:

RTL notation	Meaning	Example
`R1 ← R2`	Copy contents of R2 into R1	Move/transfer operation
`R1 ← R2 + R3`	Add R2 and R3, store in R1	ADD R1, R2, R3
`MAR ← PC`	Load MAR with current PC (start of fetch)	Address bus = PC
`MBR ← M[MAR]`	Read memory at address MAR into MBR	Memory read operation
`IR ← MBR`	Load instruction register from memory buffer	Instruction register loaded
`PC ← PC + 1`	Increment program counter to next instruction	Sequential fetch advance
`if Z=1: PC ← target`	Conditional transfer — branch if Zero flag set	BEQ (branch if equal)

RTL is the basis of all CPU microcode and RTL design. When you write SystemVerilog to implement a CPU pipeline stage, you are expressing RTL operations in a hardware description language. The instruction ADD R1, R2, R3 becomes three RTL statements in the execute stage: read R2 → ALU input A, read R3 → ALU input B, write ALU result → R1.

🔢Integer Representation

Computers store all data as binary digits. Representing negative numbers requires a convention. Three approaches exist:

Unsigned: All n bits represent magnitude. Range: 0 to 2ⁿ−1. No negative numbers.
Sign-Magnitude: MSB is sign bit (0=positive, 1=negative); remaining bits are magnitude.
Two’s Complement: The universal standard for signed integers in modern processors.

Figure 3 — 4-bit integer representations: unsigned, sign-magnitude, and two’s complement

4-bit integer representation comparison. Unsigned: values 0–15. Sign-Magnitude: +0, -0, ±1 through ±7 — note the two zeros problem. Two’s Complement: -8 through +7 — single zero, one extra negative value (-8). Two’s complement is universally used because the same adder circuit works for both positive and negative numbers.

±Sign-Magnitude Representation

In sign-magnitude, the MSB (bit n-1) is the sign bit. The remaining n-1 bits hold the magnitude:

🔍 Sign-Magnitude Examples (8-bit)

+18 = 0001 0010 (0 = positive, magnitude = 18)

-18 = 1001 0010 (1 = negative, magnitude = 18)

Two zeros: +0 = 0000 0000 and -0 = 1000 0000 — different bit patterns, same value.

Range: For 8-bit sign-magnitude: -(2⁷-1) to +(2⁷-1) = -127 to +127.

Problems: To add +3 and -5 you must compare signs, then subtract magnitudes. The hardware needs separate logic — it cannot simply add the bit patterns. This is why sign-magnitude is almost never used in modern processors.

🔄Two’s Complement Representation

Two’s complement solves both sign-magnitude problems: it has a single zero and uses the same binary adder for all arithmetic.

How to negate (form the two’s complement)

Invert all bits (bitwise NOT / one’s complement)
Add 1 to the result

🔍 Worked Examples — Two’s Complement Negation (8-bit)

Negate +18:

+18 = 0001 0010
NOT = 1110 1101
+ 1
= 1110 1110 = −18 ✓

Special case — most negative number: −128 = 1000 0000. NOT = 0111 1111 + 1 = 1000 0000. So −(−128) = −128 — the unavoidable anomaly of two’s complement. The range is asymmetric: −2ⁿ⁻¹ to +(2ⁿ⁻¹−1).

Sign extension

Number	8-bit	16-bit extension	Rule
+18	0001 0010	0000 0000 0001 0010	Fill with 0s (positive)
−18	1110 1110	1111 1111 1110 1110	Fill with 1s (negative)

➕Two’s Complement Arithmetic — Addition

The beauty of two’s complement is that addition works identically for positive and negative numbers — the same binary adder handles all cases. Any carry out of the MSB is simply discarded.

🔍 Worked Examples — Two’s Complement Addition (4-bit)

(a) (+3) + (+4) = +7

0011
+ 0100
──────
0111 = +7 ✓

(b) (−7) + (+5) = −2

1001
+ 0101
──────
1110 = −2 ✓

(c) (−4) + (+4) = 0

1100
+ 0100
──────
(1)0000 → carry ignored → 0000 = 0 ✓

(d) (−4) + (−1) = −5

1100
+ 1111
──────
(1)1011 → carry ignored → 1011 = −5 ✓

⚠️Overflow Detection

Overflow occurs when the result of an addition is too large to be represented in n bits.

Overflow rule: If two numbers of the same sign are added and the result has the opposite sign, overflow has occurred. (Adding two positives cannot give a negative; adding two negatives cannot give a positive. If it does, the result has wrapped around.)

Figure 4 — Overflow detection in two’s complement addition (4-bit examples)

Overflow examples. No overflow (left): adding opposite-sign numbers can never overflow. Overflow (right, c): +5 + +4 should give +9, but in 4-bit two’s complement the result bit pattern 1001 reads as −7, which is wrong. The V flag is set to 1 to signal the error.

Hardware overflow detection: In an n-bit adder, overflow is detected by XORing the carry into the MSB with the carry out of the MSB: V = Cₙ XOR Cₙ₋₁. If these two carries differ, overflow has occurred. This is a single gate — overflow detection adds negligible hardware cost.

➖Subtraction via Two’s Complement

Subtraction (A − B) is performed by adding A and the two’s complement of B:

A − B = A + (−B) = A + (NOT B + 1)

This means the hardware needs only an adder and a complementer (NOT gates + carry-in of 1) — no separate subtractor circuit. When the CPU executes SUB, it passes operand B through the complementer and sets carry-in = 1 on the adder.

🔍 Worked Example — Subtraction (M − S) in two’s complement (8-bit)

(a) 2 − 7 = −5:

S’ = two’s complement of 7 = 1111 1001
M + S’ = 0000 0010 + 1111 1001 = 1111 1011 = −5 ✓

(b) −5 − 2 = −7:

S’ = two’s complement of 2 = 1111 1110
M + S’ = 1111 1011 + 1111 1110 = (1)1111 1001 → carry ignored → 1111 1001 = −7 ✓

✖️Multiplication — Partial Products

Multiplication of two n-bit numbers produces a result up to 2n bits long. The basic algorithm generates partial products — one for each bit of the multiplier — then sums them with appropriate left-shifts:

🔍 Worked Example — Unsigned Binary Multiplication (4-bit)

Multiplicand M = 1011 (11) Multiplier Q = 1101 (13)

    1011 × 1101
──────────────
    1011      (Q bit 0 is 1)
    0000      (Q bit 1 is 0)
   1011      (Q bit 2 is 1, shift left 2)
  1011      (Q bit 3 is 1, shift left 3)
──────────────
10001111 = 143 ✓   (11 × 13 = 143)

Key observations: (1) The final product is 8 bits — twice the 4-bit operand width. (2) Processor registers must be wide enough to hold the 2n-bit result (e.g. 64-bit result of 32×32-bit multiply uses a register pair).

⚡Booth’s Algorithm

Booth’s algorithm (Andrew Booth, 1951) solves two’s complement multiplication correctly and is also more efficient — it skips over blocks of consecutive 1s or 0s.

Algorithm rules

Examine two bits: the current multiplier bit (Q₀) and the bit to its right (Q₋₁, initially 0):

Q₀ (current)	Q₋₁ (previous)	Action	Meaning
0	0	Arithmetic right shift only	Middle of block of 0s — skip
0	1	A ← A + M, then right shift	End of block of 1s — add
1	0	A ← A − M, then right shift	Start of block of 1s — subtract
1	1	Arithmetic right shift only	Middle of block of 1s — skip

Efficiency advantage: For a multiplier with long runs of identical bits, Booth’s algorithm requires far fewer add/subtract operations. Modern hardware multipliers extend this idea to Radix-4 or Radix-8 Booth encoding, processing 2 or 3 multiplier bits per step, halving or quartering the number of partial products.

🔬Floating Point Unit (FPU)

The ALU handles integer arithmetic. For floating-point operations, a dedicated FPU (Floating Point Unit) is typically provided. Floating-point numbers use the form:

± significand × 2^exponent

Figure 5 — IEEE 754 single-precision (32-bit) floating-point format

IEEE 754 single-precision format. The 32-bit word is divided into: 1 sign bit, 8-bit biased exponent (actual exponent = stored_exponent − 127), and 23-bit significand. The leading “1.” of the normalised number is implied (not stored), giving 24 bits of effective precision (~7 decimal digits). Double-precision uses 1+11+52 = 64 bits with ~15 decimal digits of precision.

🔬VLSI Connections

🔬 ALU implementation in RTL — from specification to gate-level

A synthesisable ALU in SystemVerilog is a straightforward combinational block: a case statement on the opcode selects the arithmetic or logical operation. For a 32-bit ALU: case(alu_op) ADD: result = a + b; SUB: result = a - b; AND: result = a & b; ... endcase. The synthesis tool maps this to adders, XOR trees, and multiplexers. The flags N, Z, C, V are derived combinationally from the result and operand MSBs. The V (overflow) flag is computed as: assign overflow = (a[31] == b[31]) && (result[31] != a[31]);. The complete ALU — including a 32×32-bit Booth multiplier and a 32-bit carry-lookahead adder — is typically 3,000–15,000 gates and is one of the core RTL blocks in any CPU design you will write or verify.

🔬 Two’s complement in hardware — the same adder for all arithmetic

The elegance of two’s complement is physical: exactly one hardware adder (plus a complement gate and a carry-in mux) implements all of ADD, SUB, NEG, CMP, and TEST. The x86 SUB instruction and the ARM SUBS instruction both route operand B through a NOT tree and set carry-in=1, then feed the result to the same adder used for ADD. Every full-adder cell in your standard cell library is designed around this principle. When you do DFT insertion (scan chain testing) on a CPU, the adder cells in the ALU represent a significant fraction of the total scan chain length and must be characterised for all corner cases — including the overflow anomaly at −2ⁿ⁻¹.

🔬 Booth encoding in modern multipliers — Radix-4 and Wallace trees

Modern high-performance CPU and GPU multipliers use Modified Booth Encoding (MBE) at Radix-4 — examining 3 multiplier bits at a time (overlapping by 1 bit) and replacing them with a partial product coefficient from {−2M, −M, 0, +M, +2M}. This halves the number of partial products compared to basic Booth, reducing the adder tree depth. The partial products are then summed using a Wallace tree — a network of carry-save adders (CSAs) that reduces n partial products to two numbers in O(log n) levels. This Booth + CSA + parallel-prefix architecture is what every fast multiplier in a modern CPU or DSP uses. Writing the RTL for a 32×32 Booth-encoded multiplier with a Wallace tree is a standard advanced VLSI design exercise.

Summary — CA-10 key points: The CPU contains three primary sub-systems: the Control Unit (decodes instructions, generates control signals), the ALU (performs all arithmetic and logic operations, updates N/Z/C/V flags), and the Register File (fast on-chip storage for operands, PC, SP, IR, MAR, MBR). RTL notation describes register transfers and operations at the micro-architecture level. Integer representation: Unsigned (0 to 2ⁿ-1), Sign-Magnitude (MSB is sign, two zeros, complex arithmetic), Two’s Complement (single zero, asymmetric range, same adder for add and subtract — universally used). Two’s complement negation: invert all bits, add 1. Overflow: same-sign operands produce opposite-sign result. Subtraction: A−B = A + (NOT B) + 1. Multiplication: partial products shifted and summed. Booth’s algorithm handles two’s complement directly and is more efficient. IEEE 754 defines floating-point: 1 sign + 8 exponent + 23 significand (32-bit single), or 1+11+52 (64-bit double).