Unsigned binary addition and subtraction with carry/borrow, hardware adder–subtractor design, multiplication of negative numbers via Booth’s algorithm, and IEEE 754 floating-point — representation, special values, and the FPU pipeline.
Unsigned binary addition follows the same positional rules as decimal addition — but with carries at 2 instead of 10. The fundamental single-bit addition table:
| Augend (A) | Addend (B) | Carry out | Sum bit |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
Only 1+1 generates a carry. For multi-bit numbers, carries propagate from the LSB (bit 0) to the MSB (bit n-1). If a carry propagates out of the MSB and the numbers are unsigned, this indicates the result exceeded the n-bit word capacity.
(a) 1001₂ + 0101₂ = ? (9 + 5 = 14)
Carry: 1
1001
+ 0101
──────
1110 = 14₁₀ ✓
(b) 1110₂ + 0111₂ = ? (14 + 7 = 21)
Carry: 111
1110
+ 0111
──────
10101 = 21₁₀ ✓ (5-bit result — carry out of MSB means we need one more bit)
Subtraction follows the same positional rule as decimal, but with borrows at 2. The single-bit subtraction table (minuend − subtrahend):
| Minuend (x) | Subtrahend (y) | Borrow out | Difference |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
(a) 1001₂ − 0010₂ = ? (9 − 2 = 7)
Borrow: 11
1001
− 0010
──────
0111 = 7₁₀ ✓
(b) 0011₂ − 0110₂ = ? (3 − 6, result would be negative)
The borrow indicates the result is negative. In practice, processors detect this via the borrow (carry) flag.
The hardware insight from two’s complement: A − B = A + (NOT B) + 1. A single hardware block — one binary adder plus a controllable complementer — implements both addition and subtraction:
Adder–subtractor block diagram. The SUB control signal simultaneously inverts all bits of Register B and sets carry-in=1. With SUB=0: A+B. With SUB=1: A + NOT(B) + 1 = A − B. One circuit, two operations. The OVF flag is generated by XOR-ing the carry into and out of the MSB.
As covered in CA-10, multiplying negative two’s complement numbers requires Booth’s algorithm. The key rules (with an additional example):
Booth’s algorithm for (7) × (−3) = −21. M = 0111 (+7). Multiplier Q = 1101 (−3 in 4-bit two’s complement). Red rows: subtract M (start of a 1-block). Green rows: add M (end of a 1-block). Yellow rows: arithmetic right shift. Final A:Q = 1110 1011 = −21 ✓.
All the integer representations discussed in CA-10 and this article are examples of fixed-point representation — the binary point is fixed, implicitly, to the right of the least significant bit. A programmer can represent fractions by mentally repositioning the binary point:
| Stored integer | Binary | Fixed-point value (4.4 format) |
|---|---|---|
| 0001 0000 | 0001.0000 | 1.0000 (integer 16 ÷ 16) |
| 0000 1000 | 0000.1000 | 0.5000 (integer 8 ÷ 16) |
| 0001 0100 | 0001.0100 | 1.2500 (integer 20 ÷ 16) |
Floating-point represents numbers in scientific notation: sign × significand × 2exponent. The binary point “floats” — its position is determined by the exponent. This gives an enormous dynamic range:
1.xxx in binary — the leading 1 is implied (not stored), giving one extra bit of precision for freeIEEE 754 (1985, revised 2008) is the universal standard for floating-point arithmetic. Three common formats:
Three IEEE 754 formats. Half-precision (fp16) is popular in AI inference. Single-precision is the default for most CPU/GPU work. Double-precision is used where rounding errors would accumulate. All use biased exponent encoding so that comparing two floats is equivalent to comparing two integers of the same bit width.
Step 1 — Convert to binary: 0.75 = 0.11₂ (since 0.5 + 0.25 = 0.75)
Step 2 — Normalise: 0.11₂ = 1.1 × 2⁻¹
Step 3 — Encode sign: Negative → S = 1
Step 4 — Encode exponent: Actual exponent = −1. Stored = −1 + 127 = 126 = 0111 1110₂
Step 5 — Encode significand: 1.1₂ → drop the leading 1 → store 1000 0000 0000 0000 0000 000₂
Final encoding:
S=1 Exp=0111 1110 Mantissa=100 0000 0000 0000 0000 0000
= 0xBF400000
Hex → binary: 0x41200000 = 0 10000010 01000000000000000000000
Sign bit: 0 → positive
Exponent: 10000010₂ = 130₁₀. Actual exponent = 130 − 127 = 3
Significand: 01000000000000000000000 → restore leading 1: 1.01₂
Value: (+1) × 1.01₂ × 2³ = 1010₂ = 10.0₁₀ ✓
IEEE 754 reserves certain exponent bit patterns for special values that represent mathematical corner cases:
| Stored exponent | Significand | Represents | Example |
|---|---|---|---|
| All zeros (0x00) | All zeros | ±0 (positive or negative zero) | 0x00000000 = +0, 0x80000000 = −0 |
| All zeros (0x00) | Non-zero | Denormalised number (subnormal) | Very small numbers near zero — no implied leading 1 |
| All ones (0xFF) | All zeros | ±Infinity | 0x7F800000 = +∞ · Result of 1.0/0.0 |
| All ones (0xFF) | Non-zero | NaN (Not a Number) | 0x7FC00000 = quiet NaN · Result of 0.0/0.0, √(−1) |
Adding two floating-point numbers requires aligning the binary points (making exponents equal) before adding significands:
Floating-point addition requires four steps: (1) compare exponents and right-shift the smaller operand’s significand to align binary points; (2) add the aligned significands; (3) normalise the result back to 1.xxx form; (4) round to the target precision. IEEE 754 mandates “round to nearest, ties to even” as the default mode.
A = 9.25: 1001.01₂ = 1.00101 × 2³
B = 1.5: 1.1₂ = 1.1 × 2⁰
Step 1 — Align: Exponent difference = 3. Shift B right by 3: 0.001 1 × 2³
Step 2 — Add:
1.001 0100 × 2³
+ 0.001 1000 × 2³
──────────────────
1.010 1100 × 2³
Step 3 — Normalise: Already normalised (starts with 1.)
Result: 1.0101100 × 2³ = 1010.1100₂ = 10.75₁₀ ✓
A high-performance FPU pipelines the four addition steps so a new floating-point operation can begin every clock cycle:
Pipelined FPU with 4 stages. Latency is 4 cycles. Throughput is 1 result/cycle once the pipeline is full. Multiply typically uses 5–7 stages. Division is often non-pipelined and takes 20–50 cycles — a common FP performance bottleneck.
Implementing a correct IEEE 754 compliant FPU is one of the hardest RTL design tasks. The four-stage addition pipeline hides dozens of tricky corner cases: denormalised number handling (no implicit leading 1), catastrophic cancellation (subtracting nearly-equal numbers), sticky bit computation for correct rounding, and correct handling of all NaN and infinity propagation rules. Intel’s Pentium FDIV bug (1994) — a table error in the divide algorithm — resulted in a $475 million recall. Formal verification of FPU arithmetic is now an industry standard, using theorem provers (ACL2, Coq) to mathematically prove that the RTL produces IEEE 754 correct results for all possible input combinations.
Modern AI inference chips (Google TPU, NVIDIA A100/H100, Apple Neural Engine) use reduced-precision floating-point to pack more multiply-accumulate (MAC) units per mm². bfloat16 (Brain Float 16, 1+8+7) uses the same exponent field as float32 (1+8+23) but truncates the significand — this allows lossless conversion between float32 and bfloat16 and was developed by Google Brain for training large language models. NVIDIA’s Tensor Cores natively execute 16×16 matrix multiplications in fp16 or bfloat16 in a single clock cycle. Understanding why bfloat16 was chosen — it preserves float32’s range (critical for gradient updates) while halving memory bandwidth requirements — requires understanding why the exponent field width determines dynamic range.
The multiplier is almost always the critical-path bottleneck in a CPU or DSP data path. A 32×32-bit Booth-encoded multiplier produces 16 partial products (Radix-4 MBE), which are reduced by a Wallace tree of carry-save adders to two 64-bit numbers, then a final carry-propagate adder produces the 64-bit result. The Wallace tree depth is approximately log₁.₅(16) ≈ 6 levels of full adders; the final carry-propagate adder adds ~log₂(64) = 6 gate delays with a carry-lookahead tree. Total critical path: ~12–20 gate delays = 1–3 ns at 28nm. Physical designers floorplan the multiplier carefully — placing it close to the register file and ALU and manually placing critical cells to meet timing closure.