CA-11: Binary & Floating-Point Arithmetic — VLSI Trainers
VLSI Trainers CA Series · 11 / 12
Computer Architecture · Article 11 of 12

CA-11: Binary & Floating-Point Arithmetic

Unsigned binary addition and subtraction with carry/borrow, hardware adder–subtractor design, multiplication of negative numbers via Booth’s algorithm, and IEEE 754 floating-point — representation, special values, and the FPU pipeline.

Unsigned Binary Addition

Unsigned binary addition follows the same positional rules as decimal addition — but with carries at 2 instead of 10. The fundamental single-bit addition table:

Augend (A)Addend (B)Carry outSum bit
0000
0101
1001
1110

Only 1+1 generates a carry. For multi-bit numbers, carries propagate from the LSB (bit 0) to the MSB (bit n-1). If a carry propagates out of the MSB and the numbers are unsigned, this indicates the result exceeded the n-bit word capacity.

🔍 Worked Examples — Unsigned Binary Addition

(a) 1001₂ + 0101₂ = ?    (9 + 5 = 14)

Carry:  1
    1001
+  0101
──────
    1110 = 14₁₀ ✓

(b) 1110₂ + 0111₂ = ?    (14 + 7 = 21)

Carry: 111
    1110
+  0111
──────
 10101 = 21₁₀ ✓   (5-bit result — carry out of MSB means we need one more bit)

Unsigned Binary Subtraction

Subtraction follows the same positional rule as decimal, but with borrows at 2. The single-bit subtraction table (minuend − subtrahend):

Minuend (x)Subtrahend (y)Borrow outDifference
0000
0111
1001
1100
🔍 Worked Examples — Unsigned Binary Subtraction

(a) 1001₂ − 0010₂ = ?    (9 − 2 = 7)

Borrow:  11
    1001
−  0010
──────
    0111 = 7₁₀ ✓

(b) 0011₂ − 0110₂ = ?    (3 − 6, result would be negative)

The borrow indicates the result is negative. In practice, processors detect this via the borrow (carry) flag.

🔧Hardware Adder–Subtractor

The hardware insight from two’s complement: A − B = A + (NOT B) + 1. A single hardware block — one binary adder plus a controllable complementer — implements both addition and subtraction:

Figure 1 — Hardware adder–subtractor: one circuit for both ADD and SUB
Register A (augend / minuend) Register B (addend / subtrahend) Complementer (NOT gates, controlled by SUB signal) SUB control 0=ADD, 1=SUB Cᵢₙ = SUB n-Bit Binary Adder computes: A + B_eff + Cᵢₙ OVF flag Cₙ XOR Cₙ₋₁ Result — goes back to A register (or result register) Operation modes: SUB=0: Adder sees B unchanged, Cᵢₙ=0 → A + B SUB=1: Adder sees NOT(B), Cᵢₙ=1 → A + NOT(B) + 1 = A − B vlsitrainers.com

Adder–subtractor block diagram. The SUB control signal simultaneously inverts all bits of Register B and sets carry-in=1. With SUB=0: A+B. With SUB=1: A + NOT(B) + 1 = A − B. One circuit, two operations. The OVF flag is generated by XOR-ing the carry into and out of the MSB.

✖️Two’s Complement Multiplication — Booth’s Algorithm Summary

As covered in CA-10, multiplying negative two’s complement numbers requires Booth’s algorithm. The key rules (with an additional example):

Figure 2 — Booth’s algorithm: register state evolution for (7) × (−3) = −21
Step A (accumulator) Q (multiplier) Q₋₁ Action Reason M = 0111 (+7)    Initial state: A=0000, Q=1101(−3), Q₋₁=0 Initial 0000 1101 0 1a: SUB 1001 1101 0 A ← A − M Q₀=1, Q₋₁=0 → start block 1b: shift 1100 1110 1 arith. right shift 2a: ADD 0011 1110 1 A ← A + M Q₀=0, Q₋₁=1 → end block 2b: shift 0001 1111 0 3a: SUB 1010 1111 0 A ← A − M Q₀=1, Q₋₁=0 → start block 3b: shift 1101 0111 1 4: shift 1110 1011 1 Result: 1110 1011 = −21 ✓ vlsitrainers.com

Booth’s algorithm for (7) × (−3) = −21. M = 0111 (+7). Multiplier Q = 1101 (−3 in 4-bit two’s complement). Red rows: subtract M (start of a 1-block). Green rows: add M (end of a 1-block). Yellow rows: arithmetic right shift. Final A:Q = 1110 1011 = −21 ✓.

📍Fixed-Point Representation

All the integer representations discussed in CA-10 and this article are examples of fixed-point representation — the binary point is fixed, implicitly, to the right of the least significant bit. A programmer can represent fractions by mentally repositioning the binary point:

Stored integerBinaryFixed-point value (4.4 format)
0001 00000001.00001.0000   (integer 16 ÷ 16)
0000 10000000.10000.5000   (integer 8 ÷ 16)
0001 01000001.01001.2500   (integer 20 ÷ 16)
Fixed-point vs floating-point trade-off: Fixed-point is fast and area-efficient but has limited dynamic range. A 16-bit floating-point (IEEE 754 half-precision) can represent values from ~±6×10⁻⁵ to ~±65,504 — a dynamic range of 10 billion to 1. For audio/image processing with known range, fixed-point is preferred. For scientific computing and ML, floating-point is essential.

🌊Floating-Point Concept

Floating-point represents numbers in scientific notation: sign × significand × 2exponent. The binary point “floats” — its position is determined by the exponent. This gives an enormous dynamic range:

📐IEEE 754 Standard

IEEE 754 (1985, revised 2008) is the universal standard for floating-point arithmetic. Three common formats:

Figure 3 — IEEE 754 floating-point formats: half, single, and double precision
Half-precision (16-bit, fp16) S Exp (5 bits, bias=15) Significand (10 bits) Range: ±6.55×10⁴ · Precision: ~3.3 decimal digits · Used in ML/AI inference Single-precision (32-bit, float) S Exponent (8 bits, bias=127) Significand / Mantissa (23 bits) — implied leading 1, effective 24 bits Range: ±3.4×10³⁸ · Precision: ~7.2 decimal digits · Default for most CPU/GPU floating-point Double-precision (64-bit, double) S Exp (11 bits, bias=1023) Significand (52 bits) — effective 53 bits with implied leading 1 Range: ±1.8×10³⁰⁸ · Precision: ~15.9 decimal digits · Used for scientific computing, financial calculations vlsitrainers.com

Three IEEE 754 formats. Half-precision (fp16) is popular in AI inference. Single-precision is the default for most CPU/GPU work. Double-precision is used where rounding errors would accumulate. All use biased exponent encoding so that comparing two floats is equivalent to comparing two integers of the same bit width.

🔍 Worked Example — Encoding −0.75 as IEEE 754 Single-Precision

Step 1 — Convert to binary: 0.75 = 0.11₂ (since 0.5 + 0.25 = 0.75)

Step 2 — Normalise: 0.11₂ = 1.1 × 2⁻¹

Step 3 — Encode sign: Negative → S = 1

Step 4 — Encode exponent: Actual exponent = −1. Stored = −1 + 127 = 126 = 0111 1110₂

Step 5 — Encode significand: 1.1₂ → drop the leading 1 → store 1000 0000 0000 0000 0000 000₂

Final encoding:

S=1   Exp=0111 1110   Mantissa=100 0000 0000 0000 0000 0000
= 0xBF400000

🔍 Worked Example — Decoding IEEE 754 Single-Precision 0x41200000

Hex → binary: 0x41200000 = 0 10000010 01000000000000000000000

Sign bit: 0 → positive

Exponent: 10000010₂ = 130₁₀. Actual exponent = 130 − 127 = 3

Significand: 01000000000000000000000 → restore leading 1: 1.01₂

Value: (+1) × 1.01₂ × 2³ = 1010₂ = 10.0₁₀

IEEE 754 Special Values

IEEE 754 reserves certain exponent bit patterns for special values that represent mathematical corner cases:

Stored exponentSignificandRepresentsExample
All zeros (0x00)All zeros±0 (positive or negative zero)0x00000000 = +0, 0x80000000 = −0
All zeros (0x00)Non-zeroDenormalised number (subnormal)Very small numbers near zero — no implied leading 1
All ones (0xFF)All zeros±Infinity0x7F800000 = +∞ · Result of 1.0/0.0
All ones (0xFF)Non-zeroNaN (Not a Number)0x7FC00000 = quiet NaN · Result of 0.0/0.0, √(−1)
Why two zeros? −0 and +0 are mathematically equal but represent signed limits from different directions: 1/+0 = +∞, 1/−0 = −∞. In VLSI design, NaN propagation bugs are notoriously hard to debug — a single NaN input to a long computation chain silently contaminates all results.

Floating-Point Addition — Step-by-Step

Adding two floating-point numbers requires aligning the binary points (making exponents equal) before adding significands:

Figure 4 — Floating-point addition algorithm: four-step process
① Compare Exponents Find which operand has the smaller exponent. Shift its significand right by the difference in exponent values. Example: A = 1.001 × 2³ B = 1.010 × 2¹ Align B: 0.00101 × 2³ (right-shift B by 2) ② Add Significands Add the two aligned significands using integer addition. 1.001 00 × 2³ + 0.001 01 × 2³ ───────────── 1.010 01 × 2³ ③ Normalise Adjust the result so it has the form 1.xxx (one non-zero bit before the binary point). If result = 10.xxxx shift right 1, exp+1. If = 0.00xxx, shift left. ④ Round The significand may have more bits than the format allows. Round according to the rounding mode: • Round to nearest even • Round toward +∞ • Round toward −∞ • Round toward zero vlsitrainers.com

Floating-point addition requires four steps: (1) compare exponents and right-shift the smaller operand’s significand to align binary points; (2) add the aligned significands; (3) normalise the result back to 1.xxx form; (4) round to the target precision. IEEE 754 mandates “round to nearest, ties to even” as the default mode.

🔍 Worked Example — IEEE 754 Single-Precision: 9.25 + 1.5

A = 9.25: 1001.01₂ = 1.00101 × 2³

B = 1.5: 1.1₂ = 1.1 × 2⁰

Step 1 — Align: Exponent difference = 3. Shift B right by 3: 0.001 1 × 2³

Step 2 — Add:

  1.001 0100 × 2³
+ 0.001 1000 × 2³
──────────────────
  1.010 1100 × 2³

Step 3 — Normalise: Already normalised (starts with 1.)

Result: 1.0101100 × 2³ = 1010.1100₂ = 10.75₁₀

🏭FPU Pipeline

A high-performance FPU pipelines the four addition steps so a new floating-point operation can begin every clock cycle:

Figure 5 — Pipelined FPU: four stages, one result per cycle at steady state
Stage 1 Exp compare & align Stage 2 Significand add Stage 3 Normalise Stage 4 Round & encode Result Cycle N: op A Cycle N: op B Cycle N: op C Cycle N: op D Cycle N+1: op B Cycle N+1: op C Cycle N+1: op D Cycle N+1: op E Latency = 4 cycles per operation · Throughput = 1 result/cycle (once pipeline is full) vlsitrainers.com

Pipelined FPU with 4 stages. Latency is 4 cycles. Throughput is 1 result/cycle once the pipeline is full. Multiply typically uses 5–7 stages. Division is often non-pipelined and takes 20–50 cycles — a common FP performance bottleneck.

FPU latency matters for performance: An FP addition taking 4 cycles is fine if the result is not immediately needed. But if the next instruction depends on the result, the CPU must stall. This is why compilers perform instruction scheduling — reordering FP operations to keep the pipeline busy while results are pending.

🔬VLSI Connections

🔬 FPU RTL design and IEEE 754 corner cases

Implementing a correct IEEE 754 compliant FPU is one of the hardest RTL design tasks. The four-stage addition pipeline hides dozens of tricky corner cases: denormalised number handling (no implicit leading 1), catastrophic cancellation (subtracting nearly-equal numbers), sticky bit computation for correct rounding, and correct handling of all NaN and infinity propagation rules. Intel’s Pentium FDIV bug (1994) — a table error in the divide algorithm — resulted in a $475 million recall. Formal verification of FPU arithmetic is now an industry standard, using theorem provers (ACL2, Coq) to mathematically prove that the RTL produces IEEE 754 correct results for all possible input combinations.

🔬 fp16 and bfloat16 in AI accelerators

Modern AI inference chips (Google TPU, NVIDIA A100/H100, Apple Neural Engine) use reduced-precision floating-point to pack more multiply-accumulate (MAC) units per mm². bfloat16 (Brain Float 16, 1+8+7) uses the same exponent field as float32 (1+8+23) but truncates the significand — this allows lossless conversion between float32 and bfloat16 and was developed by Google Brain for training large language models. NVIDIA’s Tensor Cores natively execute 16×16 matrix multiplications in fp16 or bfloat16 in a single clock cycle. Understanding why bfloat16 was chosen — it preserves float32’s range (critical for gradient updates) while halving memory bandwidth requirements — requires understanding why the exponent field width determines dynamic range.

🔬 Booth multiplier timing closure in physical design

The multiplier is almost always the critical-path bottleneck in a CPU or DSP data path. A 32×32-bit Booth-encoded multiplier produces 16 partial products (Radix-4 MBE), which are reduced by a Wallace tree of carry-save adders to two 64-bit numbers, then a final carry-propagate adder produces the 64-bit result. The Wallace tree depth is approximately log₁.₅(16) ≈ 6 levels of full adders; the final carry-propagate adder adds ~log₂(64) = 6 gate delays with a carry-lookahead tree. Total critical path: ~12–20 gate delays = 1–3 ns at 28nm. Physical designers floorplan the multiplier carefully — placing it close to the register file and ALU and manually placing critical cells to meet timing closure.

Summary — CA-11 key points: Unsigned binary addition uses positional carry-propagation; subtraction uses borrow-propagation. A single adder–subtractor circuit implements both ADD and SUB by complementing operand B and setting carry-in=1 when SUB=1. Booth’s algorithm correctly handles two’s complement multiplication by examining pairs of multiplier bits and performing subtract (start of 1-block), add (end of 1-block), or shift-only (middle of block). Fixed-point representation uses integer hardware with a programmer-defined implicit binary point. Floating-point uses sign + biased exponent + significand — IEEE 754 is the universal standard (half/single/double precision). Special values: ±0, denormals, ±Infinity, NaN. FP addition requires four steps: align, add significands, normalise, round. FPU pipelines these stages for throughput of 1 result/cycle. Division is typically non-pipelined and much slower.
CPU & ALU ☰ CA Series Index Combinational & Sequential Logic
Scroll to Top