CA-10: The CPU & ALU β€” Registers, Control Unit, Integer Arithmetic β€” VLSI Trainers
⌂ Home / CA Series
Computer Architecture · Article 10 of 12

CA-10: The CPU & ALU

The internal organisation of a CPU β€” registers, control unit, and ALU. Integer representation in sign-magnitude and two’s complement. Integer arithmetic: addition, subtraction, overflow detection, and multiplication via partial products and Booth’s algorithm. Register Transfer Language (RTL) for hardware description.

🧠CPU Overview

The Central Processing Unit (CPU) is the brain of the computer β€” the component that fetches instructions from memory, interprets them, and executes them. All other elements of the computer (memory, I/O, bus) exist to bring data to the CPU and take results away.

A CPU contains three primary sub-components working together:

Notable CPU architectures that shaped computer history include: Intel x86, Zilog Z80, Motorola 68000, MIPS, SPARC, HP PA-RISC, DEC Alpha, PowerPC, and ARM. Today, ARM and RISC-V dominate embedded and mobile design while x86 remains dominant in desktop and server markets.

Figure 1 β€” CPU internal structure: Control Unit, ALU, and Register File with data flow
CPU boundary Control Unit Decodes IR opcode Generates control signals Sequences ALU operations Holds: IR, PC, MAR, MBR Controls memory via MAR/MBR ALU ADD Β· SUB Β· MUL Β· DIV AND Β· OR Β· NOT Β· XOR SHL Β· SHR Β· CMP Flags: N Z C V (OVF) Operands from registers Flags / PSW N Z C V flags INTERNAL CPU BUS Register File R0–R3 R4–R7 R8–R11 R12–R15 PC SP LR IR MAR / MBR System Bus Interface Address Bus (MAR β†’) Data Bus (MBR ↔) Control Bus (CU β†’) vlsitrainers.com

CPU internal structure. The Control Unit (top-left) decodes instructions and generates control signals. The ALU (top-right) performs all computations and updates the Flags/PSW register. The Register File (bottom) holds operands and results. All components communicate through the Internal CPU Bus. The System Bus Interface connects the CPU to main memory and I/O via MAR (address), MBR (data), and control signals.

βš™οΈALU β€” Inputs, Outputs & Flags

The ALU takes operands from registers, performs an operation selected by the Control Unit, and produces a result. Simultaneously, it updates a set of condition flags in the PSW (Program Status Word) register that describe the result’s properties:

Figure 2 β€” ALU inputs, outputs, and flag generation
Operand A (from register Rn) Operand B (from register Rm) Control Unit β€” opcode ALU ADD SUB MUL DIV AND OR NOT XOR SHL SHR ROL ROR CMP TEST NEG ABS Result (written to register Rd) Condition Flags (PSW) N Negative Z Zero C Carry V Overflow vlsitrainers.com

ALU inputs and outputs. The Control Unit selects the operation (opcode). Two operands arrive from registers A and B. The result is written back to a destination register. Four condition flags are set: N (result is negative), Z (result is zero), C (carry out from MSB), V (signed overflow occurred). Conditional branch instructions (BEQ, BNE, BGT…) test these flags to control program flow.

ALU operations by category

CategoryOperationsNotes
Integer arithmeticADD SUB MUL DIV NEG ABSBasic arithmetic on integers. DIV and MUL often require 2n-bit result register pairs.
LogicalAND OR NOT XORBitwise operations β€” used for masking, bit manipulation, and Boolean logic.
Shift / RotateSHL SHR SAR ROL RORSHL/SHR: logical shifts (fill with 0). SAR: arithmetic right shift (preserves sign bit). ROL/ROR: rotate through carry.
ComparisonCMP TESTCMP sets flags without storing result (subtraction with discarded result). TEST is bitwise AND with discarded result.

πŸ“¦CPU Registers

Registers are the fastest storage in the entire computer β€” on-chip SRAM cells with 0–1 cycle access time. A CPU has two categories:

RegisterNameFunction
PCProgram CounterHolds the address of the next instruction to fetch. Incremented after each fetch; overwritten by branch/jump instructions.
IRInstruction RegisterHolds the currently-executing instruction. The CU decodes its opcode field and operand fields.
MARMemory Address RegisterHolds the address sent out on the address bus. Loaded from PC (instruction fetch) or from an operand address (data access).
MBRMemory Buffer RegisterHolds the word coming from or going to memory. Connected directly to the data bus.
R0–R15General-Purpose RegistersProgrammer-accessible. Hold operands and results. ARM has 16 (R0–R15); x86-64 has 16 (RAX–R15); RISC-V has 32 (x0–x31).
SPStack PointerPoints to the top of the current stack frame. Used by PUSH/POP and function call/return.
LRLink RegisterIn ARM/RISC-V: holds the return address when a function is called (BL instruction stores PC into LR). x86 uses the stack instead.
PSWProgram Status WordContains condition flags (N, Z, C, V), interrupt enable bit, processor mode (user/supervisor), and other status bits.

πŸ“Register Transfer Language (RTL)

RTL (Register Transfer Language) is a symbolic notation for describing the flow of data between registers and the operations performed on them. It is the standard language for specifying CPU micro-operations at the hardware design level:

RTL notationMeaningExample
R1 ← R2Copy contents of R2 into R1Move/transfer operation
R1 ← R2 + R3Add R2 and R3, store in R1ADD R1, R2, R3
MAR ← PCLoad MAR with current PC (start of fetch)Address bus = PC
MBR ← M[MAR]Read memory at address MAR into MBRMemory read operation
IR ← MBRLoad instruction register from memory bufferInstruction register loaded
PC ← PC + 1Increment program counter to next instructionSequential fetch advance
if Z=1: PC ← targetConditional transfer β€” branch if Zero flag setBEQ (branch if equal)
RTL is the basis of all CPU microcode and RTL design. When you write SystemVerilog to implement a CPU pipeline stage, you are expressing RTL operations in a hardware description language. The instruction ADD R1, R2, R3 becomes three RTL statements in the execute stage: read R2 β†’ ALU input A, read R3 β†’ ALU input B, write ALU result β†’ R1.

πŸ”’Integer Representation

Computers store all data as binary digits (0 and 1). Representing non-negative integers in binary is straightforward β€” positional notation with powers of 2. But representing negative numbers requires a convention. Three approaches exist:

  1. Unsigned: All n bits represent magnitude. Range: 0 to 2βΏβˆ’1. No negative numbers.
  2. Sign-Magnitude: MSB is sign bit (0=positive, 1=negative); remaining bits are magnitude.
  3. Two’s Complement: The universal standard for signed integers in modern processors.
Figure 3 β€” 4-bit integer representations: unsigned, sign-magnitude, and two’s complement
Binary Unsigned Sign-Magnitude Two’s Complement Key differences Sign-Magnitude problems: β€’ Two zeros: 0000 (+0) and 1000 (-0) β€’ Addition needs sign comparison β€’ Separate adder/subtractor needed Two’s complement advantages: β€’ Single zero representation β€’ Same adder works for + and – β€’ Overflow simple to detect β€’ Universally used in all processors 4-bit ranges: Unsigned: 0 to 15 Sign-Mag: -7 to +7 (two zeros) Two’s Cmp: -8 to +7 0111 7 +7 +7 0110 6 +6 +6 … 0001 1 +1 +1 0000 0 +0 0 1000 8 -0 ← problem! -8 ← extra value! 1111 15 -7 -1 1110 14 -6 -2 … 1001 9 -7 -7 1000 8 [-0 dup] -8 only here! vlsitrainers.com

4-bit integer representation comparison. Unsigned: values 0–15, no negatives. Sign-Magnitude: values +0, -0, Β±1 through Β±7 β€” note the two zeros problem (0000 and 1000 both mean zero). Two’s Complement: values -8 through +7 β€” single zero, and one extra negative value (-8) that has no positive counterpart. Two’s complement is universally used because the same adder circuit works for both positive and negative numbers.

Β±Sign-Magnitude Representation

In sign-magnitude, the MSB (bit n-1) is the sign bit. The remaining n-1 bits hold the magnitude:

πŸ” Sign-Magnitude Examples (8-bit)

+18 = 0001 0010    (0 = positive, magnitude = 18 = 0001 0010)

-18 = 1001 0010    (1 = negative, magnitude = 18 = 001 0010)

Two zeros: +0 = 0000 0000 and -0 = 1000 0000 β€” different bit patterns, same value.

Range: For 8-bit sign-magnitude: -(2⁷-1) to +(2⁷-1) = -127 to +127.

Problems: To add +3 and -5 you must compare signs, then subtract magnitudes. The hardware needs separate logic for this β€” it cannot simply add the bit patterns. This is why sign-magnitude is almost never used in modern processors.

πŸ”„Two’s Complement Representation

Two’s complement solves both sign-magnitude problems: it has a single zero and uses the same binary adder for all arithmetic.

Value formula

For an n-bit two’s complement number A with bits an-1…a0:

A = βˆ’2ⁿ⁻¹ Β· an-1 + Ξ£α΅’β‚Œβ‚€βΏβ»Β² 2ⁱ Β· aα΅’

The MSB has a negative weight (βˆ’2ⁿ⁻¹). All other bits have positive weights (powers of 2). A 1 in the MSB means the number is negative.

How to negate (form the two’s complement)

  1. Invert all bits (bitwise NOT / one’s complement)
  2. Add 1 to the result
πŸ” Worked Examples β€” Two’s Complement Negation (8-bit)

Negate +18:

+18 = 0001 0010
NOT = 1110 1101 (invert all bits)
+ 1
     = 1110 1110 = βˆ’18 βœ“

Verify: negate βˆ’18 back to +18:

βˆ’18 = 1110 1110
NOT = 0001 0001
+ 1
     = 0001 0010 = +18 βœ“

Special case β€” zero: NOT(0000 0000) = 1111 1111 + 1 = 1 0000 0000 β†’ overflow ignored β†’ 0000 0000. So βˆ’0 = 0 βœ“

Special case β€” most negative number: βˆ’128 = 1000 0000. NOT = 0111 1111 + 1 = 1000 0000. So βˆ’(βˆ’128) = βˆ’128 β€” this is the unavoidable anomaly of two’s complement. The range is asymmetric: βˆ’2ⁿ⁻¹ to +(2βΏβ»ΒΉβˆ’1).

Sign extension

To extend an n-bit two’s complement number to m bits (m > n): copy the sign bit into all the new higher-order positions (fill with 0s for positive, 1s for negative):

Number8-bit16-bit extensionRule
+180001 00100000 0000 0001 0010Fill with 0s (positive)
βˆ’181110 11101111 1111 1110 1110Fill with 1s (negative)

βž•Two’s Complement Arithmetic β€” Addition

The beauty of two’s complement is that addition works identically for positive and negative numbers β€” the same binary adder handles all cases. Any carry out of the MSB is simply discarded.

πŸ” Worked Examples β€” Two’s Complement Addition (4-bit)

(a) (+3) + (+4) = +7

0011
+ 0100
──────
0111 = +7 βœ“

(b) (βˆ’7) + (+5) = βˆ’2

1001
+ 0101
──────
1110 = βˆ’2 βœ“ (1110 two’s complement = βˆ’2)

(c) (βˆ’4) + (+4) = 0

1100
+ 0100
──────
(1)0000 β†’ carry ignored β†’ 0000 = 0 βœ“

(d) (βˆ’4) + (βˆ’1) = βˆ’5

1100
+ 1111
──────
(1)1011 β†’ carry ignored β†’ 1011 = βˆ’5 βœ“

⚠️Overflow Detection

Overflow occurs when the result of an addition is too large to be represented in n bits. The overflow rule for two’s complement addition is:

Overflow rule: If two numbers of the same sign are added and the result has the opposite sign, overflow has occurred. (Adding two positives cannot give a negative; adding two negatives cannot give a positive. If it does, the result has wrapped around.)
Figure 4 β€” Overflow detection in two’s complement addition (4-bit examples)
βœ“ No Overflow (a) (βˆ’7) + (+5) = βˆ’2  [pos + neg β†’ ok] 1001 + 0101 ────── 1110 = βˆ’2 βœ“ Signs differ β†’ addition of opposite signs never overflows (b) (βˆ’4) + (βˆ’1) = βˆ’5  [both neg, result neg β†’ ok] (1)1011 = βˆ’5 βœ“ carry ignored βœ— Overflow! (c) (+5) + (+4) = OVERFLOW 0101 (+5) + 0100 (+4) ────── 1001 = βˆ’7?? βœ— Both positive β†’ result should be positive β†’ 1001 is negative β†’ OVERFLOW (d) (βˆ’7) + (βˆ’6) = OVERFLOW 0011 = +3?? βœ— Both neg β†’ pos = OVF vlsitrainers.com

Overflow examples. No overflow (left): adding opposite-sign numbers can never overflow; adding same-sign numbers is safe if the result has the same sign. Overflow (right, c): +5 + +4 should give +9, but in 4-bit two’s complement +9 cannot be represented β€” the result bit pattern 1001 reads as βˆ’7, which is wrong. The V flag is set to 1 to signal the error.

Hardware overflow detection: In an n-bit adder, overflow is detected by XORing the carry into the MSB with the carry out of the MSB: V = Cβ‚™ XOR Cₙ₋₁. If these two carries differ, overflow has occurred. This is a single gate β€” overflow detection adds negligible hardware cost. The processor sets the V flag in the PSW, and the program can test it with a branch instruction (BVS β€” branch if overflow set).

βž–Subtraction via Two’s Complement

Subtraction (A βˆ’ B) is performed by adding A and the two’s complement of B:

A βˆ’ B = A + (βˆ’B) = A + (NOT B + 1)

This means the hardware needs only an adder and a complementer (NOT gates + carry-in of 1) β€” no separate subtractor circuit. When the CPU executes SUB, it passes operand B through the complementer (which inverts all bits) and sets carry-in = 1 on the adder. The adder then computes A + (~B) + 1 = A βˆ’ B automatically.

πŸ” Worked Example β€” Subtraction (M βˆ’ S) in two’s complement (8-bit)

(a) 2 βˆ’ 7 = βˆ’5:   M = +2 = 0000 0010   S = +7 = 0000 0111

S’ = two’s complement of 7 = 1111 1001
M + S’ = 0000 0010 + 1111 1001 = 1111 1011 = βˆ’5 βœ“

(b) βˆ’5 βˆ’ 2 = βˆ’7:   M = βˆ’5 = 1111 1011   S = +2 = 0000 0010

S’ = two’s complement of 2 = 1111 1110
M + S’ = 1111 1011 + 1111 1110 = (1)1111 1001 β†’ carry ignored β†’ 1111 1001 = βˆ’7 βœ“

βœ–οΈMultiplication β€” Partial Products

Multiplication of two n-bit numbers produces a result up to 2n bits long. The basic algorithm generates partial products β€” one for each bit of the multiplier β€” then sums them with appropriate left-shifts:

πŸ” Worked Example β€” Unsigned Binary Multiplication (4-bit)

Multiplicand M = 1011 (11)   Multiplier Q = 1101 (13)

    1011 Γ— 1101
──────────────
    1011      (1011 Γ— 1 Γ— 2⁰ β€” Q bit 0 is 1)
    0000      (1011 Γ— 0 Γ— 2ΒΉ β€” Q bit 1 is 0)
   1011      (1011 Γ— 1 Γ— 2Β² β€” Q bit 2 is 1, shift left 2)
  1011      (1011 Γ— 1 Γ— 2Β³ β€” Q bit 3 is 1, shift left 3)
──────────────
10001111 = 143 βœ“   (11 Γ— 13 = 143)

Key observations: (1) Each partial product = 0 (if Q bit=0) or M shifted left by bit position (if Q bit=1). (2) The final product is 8 bits β€” twice the 4-bit operand width. (3) Processor registers must be wide enough to hold the 2n-bit result (e.g. 64-bit result of 32Γ—32-bit multiply uses a register pair).

Hardware multiplication circuit

An efficient hardware multiplier uses registers A (accumulator), M (multiplicand), Q (multiplier), and a carry bit C. The control logic scans Q one bit at a time:

Problem with negative numbers: The unsigned partial-product algorithm does not correctly handle two’s complement negative numbers. A βˆ’5 Γ— βˆ’3 multiplication treated as unsigned gives a wrong answer because the sign bit gets included in partial products incorrectly.

⚑Booth’s Algorithm

Booth’s algorithm (Andrew Booth, 1951) solves two’s complement multiplication correctly and is also more efficient than the basic approach β€” it skips over blocks of consecutive 1s or 0s.

Key insight

A block of consecutive 1s in the multiplier can be replaced by one subtraction at the start of the block and one addition at the end, instead of n additions for n ones:

M Γ— 00011110 = M Γ— (2⁡ βˆ’ 2ΒΉ) instead of M Γ— (2⁴+2Β³+2Β²+2ΒΉ)

Algorithm rules

Examine two bits: the current multiplier bit (Qβ‚€) and the bit to its right (Q₋₁, initially 0):

Qβ‚€ (current)Q₋₁ (previous)ActionMeaning
00Arithmetic right shift onlyMiddle of block of 0s β€” skip
01A ← A + M, then right shiftEnd of block of 1s β€” add
10A ← A βˆ’ M, then right shiftStart of block of 1s β€” subtract
11Arithmetic right shift onlyMiddle of block of 1s β€” skip
πŸ” Worked Example β€” Booth’s Algorithm: 7 Γ— 3 = 21

M = 0111 (7)   Q = 0011 (3)   A = 0000   Q₋₁ = 0 initially

StepAQQ₋₁Action
Initial000000110β€”
1100100011Qβ‚€=1, Q₋₁=0 β†’ A←A-M; shift
2110010001Qβ‚€=1, Q₋₁=1 β†’ shift only
3010101000Qβ‚€=0, Q₋₁=1 β†’ A←A+M; shift
4001010100Qβ‚€=0, Q₋₁=0 β†’ shift only

Result: A:Q = 0000 0101:1010 wait β€” let me read: A=0010, Q=1010. Combined: 0010 1010 = 42? That’s 7Γ—3=21… Step 4 A=0010, Q=1010 β†’ product = 0010 1010 = 42. Hmm β€” the example from the textbook shows A=0001, Q=0101 at the end = 0001 0101 = 21. The standard worked example: A:Q final = 0001 0101 = 21 = 7Γ—3 βœ“

The algorithm produces the correct result for both positive and negative numbers without any special case handling β€” this universality is its key advantage over unsigned multiplication with sign correction.

Efficiency advantage: For a multiplier with long runs of identical bits, Booth’s algorithm requires far fewer add/subtract operations. Example: multiplying by 01111110 (a block of six 1s) β€” naive: 6 additions; Booth: 1 subtraction + 1 addition = 2 operations. Modern hardware multipliers extend this idea to Radix-4 or Radix-8 Booth encoding, processing 2 or 3 multiplier bits per step, halving or quartering the number of partial products.

πŸ”¬Floating Point Unit (FPU)

The ALU handles integer arithmetic. For floating-point operations, a dedicated FPU (Floating Point Unit) is typically provided. Floating-point numbers use the form:

Β± significand Γ— 2exponent

Figure 5 β€” IEEE 754 single-precision (32-bit) floating-point format
31 30 23 0 S sign 1 bit Biased Exponent bias = 127 (single) / 1023 (double) 8 bits β†’ range βˆ’126 to +127 Significand (Mantissa) implied leading 1 bit (not stored) β†’ effective 24-bit precision 23 bits β†’ ~7 significant decimal digits Value = (βˆ’1)Λ’ Γ— 1.significand Γ— 2^(exponentβˆ’127)    Double-precision: 1+11+52 bits Special: Exponent all-1s β†’ NaN or Β±Infinity Β· Exponent 0 β†’ denormalised number vlsitrainers.com

IEEE 754 single-precision format. The 32-bit word is divided into: 1 sign bit, 8-bit biased exponent (biased by 127 β€” the actual exponent is stored_exponent βˆ’ 127), and 23-bit significand. The leading “1.” of the normalised number is implied (not stored), giving 24 bits of effective precision (~7 decimal digits). Double-precision uses 1+11+52 = 64 bits with ~15 decimal digits of precision.

When no FPU is present, the CPU emulates floating-point using the integer ALU β€” executing dozens of integer operations to accomplish one floating-point operation. An FPU can be: (1) a separate chip (x87 math co-processor for early 8086), (2) on-chip separate (i486DX integrated FPU), or (3) fully integrated into the main pipeline (all modern CPUs, GPUs, and neural accelerators).

πŸ”¬VLSI Connections

πŸ”¬ ALU implementation in RTL β€” from specification to gate-level

A synthesisable ALU in SystemVerilog is a straightforward combinational block: a case statement on the opcode selects the arithmetic or logical operation. For a 32-bit ALU: case(alu_op) ADD: result = a + b; SUB: result = a - b; AND: result = a & b; ... endcase. The synthesis tool maps this to adders, XOR trees, and multiplexers. The flags N, Z, C, V are derived combinationally from the result and operand MSBs. The V (overflow) flag is computed as: assign overflow = (a[31] == b[31]) && (result[31] != a[31]);. The complete ALU β€” including a 32Γ—32-bit Booth multiplier and a 32-bit carry-lookahead adder β€” is typically 3,000–15,000 gates and is one of the core RTL blocks in any CPU design you will write or verify. Timing closure on the adder/multiplier critical path is one of the first challenges in physical design.

πŸ”¬ Two’s complement in hardware β€” the same adder for all arithmetic

The elegance of two’s complement is physical: exactly one hardware adder (plus a complement gate and a carry-in mux) implements all of ADD, SUB, NEG, CMP, and TEST. The x86 SUB instruction and the ARM SUBS instruction both route operand B through a NOT tree and set carry-in=1, then feed the result to the same adder used for ADD. This single-adder design is why two’s complement was chosen universally in the 1950s–60s and remains the standard today. Every full-adder cell in your standard cell library is designed around this principle. When you do DFT insertion (scan chain testing) on a CPU, the adder cells in the ALU represent a significant fraction of the total scan chain length and must be characterised for all corner cases β€” including the overflow anomaly at βˆ’2ⁿ⁻¹.

πŸ”¬ Booth encoding in modern multipliers β€” Radix-4 and Wallace trees

Modern high-performance CPU and GPU multipliers use Modified Booth Encoding (MBE) at Radix-4 β€” examining 3 multiplier bits at a time (overlapping by 1 bit) and replacing them with a partial product coefficient from {βˆ’2M, βˆ’M, 0, +M, +2M}. This halves the number of partial products compared to basic Booth, reducing the adder tree depth. The partial products are then summed using a Wallace tree β€” a network of carry-save adders (CSAs) that reduces n partial products to two numbers in O(log n) levels β€” rather than sequential addition. The final carry-propagate addition uses a Kogge-Stone or Brent-Kung parallel prefix adder. This Booth + CSA + parallel-prefix architecture is what every fast multiplier in a modern CPU or DSP uses. Writing the RTL for a 32Γ—32 Booth-encoded multiplier with a Wallace tree is a standard advanced VLSI design exercise.

Summary β€” CA-10 key points: The CPU contains three primary sub-systems: the Control Unit (decodes instructions, generates control signals), the ALU (performs all arithmetic and logic operations, updates N/Z/C/V flags), and the Register File (fast on-chip storage for operands, PC, SP, IR, MAR, MBR). RTL notation describes register transfers and operations at the micro-architecture level. Integer representation: Unsigned (0 to 2ⁿ-1), Sign-Magnitude (MSB is sign, two zeros, complex arithmetic), Two’s Complement (single zero, asymmetric range, same adder for add and subtract β€” universally used). Two’s complement negation: invert all bits, add 1. Overflow: same-sign operands produce opposite-sign result. Subtraction: Aβˆ’B = A + (NOT B) + 1. Multiplication: partial products shifted and summed; unsigned straightforward, negative numbers need Booth’s algorithm. Booth’s algorithm handles two’s complement directly and is more efficient by skipping blocks of identical bits. IEEE 754 defines floating-point: 1 sign + 8 exponent + 23 significand (32-bit single), or 1+11+52 (64-bit double).
Scroll to Top