Unsigned binary addition and subtraction with carry/borrow, hardware adder–subtractor design, multiplication of negative numbers via Booth’s algorithm, and IEEE 754 floating-point — representation, special values, and the FPU pipeline.
Unsigned binary addition follows exactly the same positional rules as decimal addition — but with carries at 2 instead of 10. The fundamental single-bit addition table:
| Augend (A) | Addend (B) | Carry out | Sum bit |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |
Only 1+1 generates a carry. For multi-bit numbers, carries propagate from the LSB (bit 0) to the MSB (bit n-1). If a carry propagates out of the MSB, and the numbers are unsigned, this indicates the result exceeded the n-bit word capacity.
(a) 1001₂ + 0101₂ = ? (9 + 5 = 14)
Carry: 1
1001
+ 0101
──────
1110 = 14₁₀ ✓
(b) 1110₂ + 0111₂ = ? (14 + 7 = 21)
Carry: 111
1110
+ 0111
──────
10101 = 21₁₀ ✓ (5-bit result — carry out of MSB means we need one more bit)
Subtraction follows the same positional rule as decimal, but with borrows at 2. The single-bit subtraction table (minuend − subtrahend):
| Minuend (x) | Subtrahend (y) | Borrow out | Difference |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
When the minuend bit is smaller than the subtrahend bit (0 − 1), a borrow is needed from the next higher bit position. Borrowing adds 2 to the current bit (turning 0 into 2), and propagates a borrow-in of 1 to the next position.
(a) 1001₂ − 0010₂ = ? (9 − 2 = 7)
Borrow: 11
1001
− 0010
──────
0111 = 7₁₀ ✓
(b) 0011₂ − 0110₂ = ? (3 − 6, result would be negative)
Borrow: 1
0011
− 0110
──────
11101₂ (the carry represents a borrow of 16; remaining bits 1101 = positive 13;
so result = 13 − 16 = −3, which is correct: 3 − 6 = −3)
Key point: Unsigned subtraction is not defined when the result would be negative (minuend < subtrahend). In practice, processors detect this via the borrow (carry) flag and the programmer decides whether the numbers should be treated as signed (two’s complement) or truly unsigned.
The hardware insight from two’s complement: A − B = A + (NOT B) + 1. A single hardware block — one binary adder plus a controllable complementer — implements both addition and subtraction:
Adder–subtractor block diagram. The SUB control signal simultaneously: (1) enables the complementer to invert all bits of Register B (NOT gates), and (2) connects SUB to the carry-in of the adder. With SUB=0: A+B (normal addition). With SUB=1: A + NOT(B) + 1 = A − B. One circuit, two operations. The overflow flag (OVF) is generated by XOR-ing the carry into and out of the MSB.
As covered in CA-10, multiplying negative two’s complement numbers requires Booth’s algorithm. The key rules (repeated here for completeness with additional examples):
Booth’s algorithm for (7) × (−3) = −21. M = 0111 (+7). Multiplier Q = 1101 (−3 in 4-bit two’s complement). Red rows: subtract M (start of a 1-block, Q₀=1,Q₋₁=0). Green rows: add M (end of a 1-block, Q₀=0,Q₋₁=1). Yellow rows: arithmetic right shift (sign bit preserved). Final A:Q = 1110 1011 = −21 as an 8-bit two’s complement number ✓.
All the integer representations discussed in CA-10 and this article are examples of fixed-point representation — the binary (radix) point is fixed, implicitly, to the right of the least significant bit. All numbers are integers.
A programmer can represent fractions using fixed-point by mentally repositioning the binary point. For example, treating an 8-bit integer as a 4.4 fixed-point number (4 bits before the point, 4 after) divides all values by 16:
| Stored integer | Binary | Fixed-point value (4.4 format) |
|---|---|---|
| 0001 0000 | 0001.0000 | 1.0000 (integer 16 ÷ 16) |
| 0000 1000 | 0000.1000 | 0.5000 (integer 8 ÷ 16) |
| 0001 0100 | 0001.0100 | 1.2500 (integer 20 ÷ 16) |
Fixed-point arithmetic is used in DSP (digital signal processing), embedded systems, and GPU shaders where floating-point hardware is absent or too expensive. The programmer handles the scaling manually. All the addition/subtraction/multiplication operations discussed in CA-10 and this article apply directly to fixed-point numbers — just keep track of where the decimal point is.
Floating-point represents numbers in scientific notation: sign × significand × 2exponent. The binary point “floats” — its position is not fixed but determined by the exponent. This gives an enormous dynamic range — the ability to represent both very large and very small numbers:
The key components of a floating-point number:
1.xxx in binary — the leading 1 is implied (not stored), giving one extra bit of precision for freeIEEE 754 (1985, revised 2008) is the universal standard for floating-point arithmetic. Three common formats:
Three IEEE 754 formats. Half-precision (fp16) is popular in AI inference due to its small size. Single-precision is the default for most CPU/GPU work. Double-precision is used where rounding errors would accumulate (scientific simulations, financial calculations). All use biased exponent encoding (actual_exp = stored_exp − bias) so that comparing two floats is equivalent to comparing two integers of the same bit width.
Step 1 — Convert to binary: 0.75 = 0.11₂ (since 0.5 + 0.25 = 0.75)
Step 2 — Normalise: 0.11₂ = 1.1 × 2⁻¹ (shift binary point 1 place right)
Step 3 — Encode sign: Negative → S = 1
Step 4 — Encode exponent: Actual exponent = −1. Stored = −1 + 127 = 126 = 0111 1110₂
Step 5 — Encode significand: 1.1₂ → drop the leading 1 → store 1000 0000 0000 0000 0000 000₂ (23 bits, pad with zeros)
Final 32-bit encoding:
S=1 Exp=0111 1110 Mantissa=100 0000 0000 0000 0000 0000
= 1 01111110 10000000000000000000000 = 0xBF400000
Hex → binary: 0x41200000 = 0 10000010 01000000000000000000000
Sign bit: 0 → positive
Exponent: 10000010₂ = 130₁₀. Actual exponent = 130 − 127 = 3
Significand: 01000000000000000000000 → restore leading 1: 1.01₂
Value: (+1) × 1.01₂ × 2³ = 1010₂ = 10.0₁₀ ✓
IEEE 754 reserves certain exponent bit patterns for special values that represent mathematical corner cases:
| Stored exponent | Significand | Represents | Example |
|---|---|---|---|
| All zeros (0x00) | All zeros | ±0 (positive or negative zero) | 0x00000000 = +0, 0x80000000 = −0 |
| All zeros (0x00) | Non-zero | Denormalised number (subnormal) | Very small numbers near zero — no implied leading 1 |
| All ones (0xFF) | All zeros | ±Infinity | 0x7F800000 = +∞ · Result of 1.0/0.0 |
| All ones (0xFF) | Non-zero | NaN (Not a Number) | 0x7FC00000 = quiet NaN · Result of 0.0/0.0, √(−1) |
Adding two floating-point numbers requires aligning the binary points (making exponents equal) before adding significands. This is the most complex basic FP operation:
Floating-point addition requires four steps: (1) compare exponents and right-shift the smaller operand’s significand to align binary points; (2) add the aligned significands with integer addition; (3) normalise the result back to 1.xxx form, adjusting the exponent; (4) round to the target precision. Rounding is the most subtle step — IEEE 754 mandates “round to nearest, ties to even” as the default mode.
A = 9.25: 1001.01₂ = 1.00101 × 2³ (S=0, E=130, M=00101…)
B = 1.5: 1.1₂ = 1.1 × 2⁰ (S=0, E=127, M=10000…)
Step 1 — Align: Exponent difference = 3−0 = 3. Shift B right by 3: 1.1 × 2⁰ → 0.001 1 × 2³
Step 2 — Add:
1.001 0100 × 2³
+ 0.001 1000 × 2³
──────────────────
1.010 1100 × 2³
Step 3 — Normalise: 1.0101100 × 2³ → already normalised (starts with 1.)
Step 4 — Round: Exact representation, no rounding needed.
Result: 1.0101100 × 2³ = 1010.1100₂ = 10.75₁₀ ✓ (9.25 + 1.5 = 10.75)
A high-performance FPU pipelines the four addition steps so a new floating-point operation can begin every clock cycle, even though one operation takes multiple cycles to complete:
Pipelined FPU with 4 stages. Latency is 4 cycles (from input to result). Throughput is 1 result/cycle once the pipeline is full — four different operations overlap simultaneously. Multiply typically uses 5–7 stages. Division is often non-pipelined and takes 20–50 cycles — a common FP performance bottleneck.
Implementing a correct IEEE 754 compliant FPU is one of the hardest RTL design tasks. The four-stage addition pipeline is straightforward in principle but hides dozens of tricky corner cases: denormalised number handling (no implicit leading 1), catastrophic cancellation (subtracting nearly-equal numbers leaving only a few significant bits), sticky bit computation for correct rounding, and correct handling of all NaN and infinity propagation rules. Intel’s Pentium FDIV bug (1994) — a table error in the divide algorithm that gave wrong results in 1 in 9 billion divisions — resulted in a $475 million recall. Formal verification of FPU arithmetic is now an industry standard, using theorem provers (ACL2, Coq) to mathematically prove that the RTL produces IEEE 754 correct results for all possible input combinations. Companies like AMD and ARM use extensive formal FPU verification teams.
Modern AI inference chips (Google TPU, NVIDIA A100/H100, Apple Neural Engine) use reduced-precision floating-point to pack more multiply-accumulate (MAC) units per mm². IEEE 754 fp16 (1+5+10) gives ~7 decimal digits of precision — insufficient for training but adequate for inference. bfloat16 (Brain Float 16, 1+8+7) uses the same exponent field as float32 (1+8+23) but truncates the significand — this allows lossless conversion between float32 and bfloat16 and was developed by Google Brain for training large language models. NVIDIA’s Tensor Cores natively execute 16×16 matrix multiplications in fp16 or bfloat16 in a single clock cycle. Understanding the IEEE 754 format — specifically why the exponent field width determines dynamic range and the significand width determines precision — is the prerequisite for understanding why bfloat16 was chosen: it preserves float32’s range (critical for gradient updates in training) while halving memory bandwidth requirements.
The multiplier is almost always the critical-path bottleneck in a CPU or DSP data path. A 32×32-bit Booth-encoded multiplier produces 16 partial products (Radix-4 MBE), which are reduced by a Wallace tree of carry-save adders to two 64-bit numbers, then a final carry-propagate adder produces the 64-bit result. The Wallace tree depth is approximately log₁.₅(16) ≈ 6 levels of full adders. Each CSA level adds ~1 gate delay; the final carry-propagate adder adds ~log₂(64) = 6 gate delays with a carry-lookahead tree. Total critical path: ~12–20 gate delays = 1–3 ns at 28nm. If your target clock period is 1 ns (1 GHz), the multiplier is the hardest block to close timing on. Physical designers floorplan the multiplier carefully — placing it close to the register file and ALU, aligning it with the clock distribution tree, and sometimes manually placing critical cells.