CA-11: Binary & Floating-Point Arithmetic – Your VLSI Journey Starts Here

➕Unsigned Binary Addition

Unsigned binary addition follows exactly the same positional rules as decimal addition — but with carries at 2 instead of 10. The fundamental single-bit addition table:

Augend (A)	Addend (B)	Carry out	Sum bit
0	0	0	0
0	1	0	1
1	0	0	1
1	1	1	0

Only 1+1 generates a carry. For multi-bit numbers, carries propagate from the LSB (bit 0) to the MSB (bit n-1). If a carry propagates out of the MSB, and the numbers are unsigned, this indicates the result exceeded the n-bit word capacity.

🔍 Worked Examples — Unsigned Binary Addition

(a) 1001₂ + 0101₂ = ? (9 + 5 = 14)

Carry: 1
    1001
+  0101
──────
    1110 = 14₁₀ ✓

(b) 1110₂ + 0111₂ = ? (14 + 7 = 21)

Carry: 111
    1110
+  0111
──────
10101 = 21₁₀ ✓   (5-bit result — carry out of MSB means we need one more bit)

➖Unsigned Binary Subtraction

Subtraction follows the same positional rule as decimal, but with borrows at 2. The single-bit subtraction table (minuend − subtrahend):

Minuend (x)	Subtrahend (y)	Borrow out	Difference
0	0	0	0
0	1	1	1
1	0	0	1
1	1	0	0

When the minuend bit is smaller than the subtrahend bit (0 − 1), a borrow is needed from the next higher bit position. Borrowing adds 2 to the current bit (turning 0 into 2), and propagates a borrow-in of 1 to the next position.

🔍 Worked Examples — Unsigned Binary Subtraction

(a) 1001₂ − 0010₂ = ? (9 − 2 = 7)

Borrow: 11
    1001
−  0010
──────
    0111 = 7₁₀ ✓

(b) 0011₂ − 0110₂ = ? (3 − 6, result would be negative)

Borrow: 1
    0011
−  0110
──────
11101₂    (the carry represents a borrow of 16; remaining bits 1101 = positive 13;
            so result = 13 − 16 = −3, which is correct: 3 − 6 = −3)

Key point: Unsigned subtraction is not defined when the result would be negative (minuend < subtrahend). In practice, processors detect this via the borrow (carry) flag and the programmer decides whether the numbers should be treated as signed (two’s complement) or truly unsigned.

🔧Hardware Adder–Subtractor

The hardware insight from two’s complement: A − B = A + (NOT B) + 1. A single hardware block — one binary adder plus a controllable complementer — implements both addition and subtraction:

Figure 1 — Hardware adder–subtractor: one circuit for both ADD and SUB

Adder–subtractor block diagram. The SUB control signal simultaneously: (1) enables the complementer to invert all bits of Register B (NOT gates), and (2) connects SUB to the carry-in of the adder. With SUB=0: A+B (normal addition). With SUB=1: A + NOT(B) + 1 = A − B. One circuit, two operations. The overflow flag (OVF) is generated by XOR-ing the carry into and out of the MSB.

✖️Two’s Complement Multiplication — Booth’s Algorithm Summary

As covered in CA-10, multiplying negative two’s complement numbers requires Booth’s algorithm. The key rules (repeated here for completeness with additional examples):

Figure 2 — Booth’s algorithm: register state evolution for (7) × (−3) = −21

Booth’s algorithm for (7) × (−3) = −21. M = 0111 (+7). Multiplier Q = 1101 (−3 in 4-bit two’s complement). Red rows: subtract M (start of a 1-block, Q₀=1,Q₋₁=0). Green rows: add M (end of a 1-block, Q₀=0,Q₋₁=1). Yellow rows: arithmetic right shift (sign bit preserved). Final A:Q = 1110 1011 = −21 as an 8-bit two’s complement number ✓.

📍Fixed-Point Representation

All the integer representations discussed in CA-10 and this article are examples of fixed-point representation — the binary (radix) point is fixed, implicitly, to the right of the least significant bit. All numbers are integers.

A programmer can represent fractions using fixed-point by mentally repositioning the binary point. For example, treating an 8-bit integer as a 4.4 fixed-point number (4 bits before the point, 4 after) divides all values by 16:

Stored integer	Binary	Fixed-point value (4.4 format)
0001 0000	0001.0000	1.0000 (integer 16 ÷ 16)
0000 1000	0000.1000	0.5000 (integer 8 ÷ 16)
0001 0100	0001.0100	1.2500 (integer 20 ÷ 16)

Fixed-point arithmetic is used in DSP (digital signal processing), embedded systems, and GPU shaders where floating-point hardware is absent or too expensive. The programmer handles the scaling manually. All the addition/subtraction/multiplication operations discussed in CA-10 and this article apply directly to fixed-point numbers — just keep track of where the decimal point is.

Fixed-point vs floating-point trade-off: Fixed-point is fast and area-efficient (no exponent logic needed) but has limited dynamic range. A 16-bit fixed-point number can represent values from −32,768 to +32,767. A 16-bit floating-point (IEEE 754 half-precision) can represent values from ~±6×10⁻⁵ to ~±65,504 — a dynamic range of 10 billion to 1. For audio/image processing with known range, fixed-point is preferred. For scientific computing and ML, floating-point is essential.

🌊Floating-Point Concept

Floating-point represents numbers in scientific notation: sign × significand × 2^exponent. The binary point “floats” — its position is not fixed but determined by the exponent. This gives an enormous dynamic range — the ability to represent both very large and very small numbers:

Very large: Avogadro’s number ≈ 6.022 × 10²³
Very small: Electron charge ≈ 1.602 × 10⁻¹⁹
No fixed-point scheme with finite bits can represent both in the same format

The key components of a floating-point number:

Sign (S): 0 = positive, 1 = negative
Exponent (E): Stored with a bias (offset) so that the stored value is always positive — enables direct comparison of two floating-point numbers using integer comparison
Significand / Mantissa (M): The fractional part after normalisation. A normalised number always has the form 1.xxx in binary — the leading 1 is implied (not stored), giving one extra bit of precision for free

📐IEEE 754 Standard

IEEE 754 (1985, revised 2008) is the universal standard for floating-point arithmetic. Three common formats:

Figure 3 — IEEE 754 floating-point formats: half, single, and double precision

Three IEEE 754 formats. Half-precision (fp16) is popular in AI inference due to its small size. Single-precision is the default for most CPU/GPU work. Double-precision is used where rounding errors would accumulate (scientific simulations, financial calculations). All use biased exponent encoding (actual_exp = stored_exp − bias) so that comparing two floats is equivalent to comparing two integers of the same bit width.

🔍 Worked Example — Encoding −0.75 as IEEE 754 Single-Precision

Step 1 — Convert to binary: 0.75 = 0.11₂ (since 0.5 + 0.25 = 0.75)

Step 2 — Normalise: 0.11₂ = 1.1 × 2⁻¹ (shift binary point 1 place right)

Step 3 — Encode sign: Negative → S = 1

Step 4 — Encode exponent: Actual exponent = −1. Stored = −1 + 127 = 126 = 0111 1110₂

Step 5 — Encode significand: 1.1₂ → drop the leading 1 → store 1000 0000 0000 0000 0000 000₂ (23 bits, pad with zeros)

Final 32-bit encoding:

S=1 Exp=0111 1110 Mantissa=100 0000 0000 0000 0000 0000
= 1 01111110 10000000000000000000000 = 0xBF400000

🔍 Worked Example — Decoding IEEE 754 Single-Precision 0x41200000

Hex → binary: 0x41200000 = 0 10000010 01000000000000000000000

Sign bit: 0 → positive

Exponent: 10000010₂ = 130₁₀. Actual exponent = 130 − 127 = 3

Significand: 01000000000000000000000 → restore leading 1: 1.01₂

Value: (+1) × 1.01₂ × 2³ = 1010₂ = 10.0₁₀ ✓

⚡IEEE 754 Special Values

IEEE 754 reserves certain exponent bit patterns for special values that represent mathematical corner cases:

Stored exponent	Significand	Represents	Example
All zeros (0x00)	All zeros	±0 (positive or negative zero)	`0x00000000` = +0, `0x80000000` = −0
All zeros (0x00)	Non-zero	Denormalised number (subnormal)	Very small numbers near zero — no implied leading 1
All ones (0xFF)	All zeros	±Infinity	`0x7F800000` = +∞ · Result of 1.0/0.0
All ones (0xFF)	Non-zero	NaN (Not a Number)	`0x7FC00000` = quiet NaN · Result of 0.0/0.0, √(−1)

Why two zeros? −0 and +0 are mathematically equal but represent the signed limit approached from different directions: 1/+0 = +∞, 1/−0 = −∞. Two NaN types exist: quiet NaN (propagates silently through calculations) and signalling NaN (raises a floating-point exception if used). In VLSI design, NaN propagation bugs are notoriously hard to debug — a single NaN input to a long computation chain silently contaminates all results.

➕Floating-Point Addition — Step-by-Step

Adding two floating-point numbers requires aligning the binary points (making exponents equal) before adding significands. This is the most complex basic FP operation:

Figure 4 — Floating-point addition algorithm: four-step process

Floating-point addition requires four steps: (1) compare exponents and right-shift the smaller operand’s significand to align binary points; (2) add the aligned significands with integer addition; (3) normalise the result back to 1.xxx form, adjusting the exponent; (4) round to the target precision. Rounding is the most subtle step — IEEE 754 mandates “round to nearest, ties to even” as the default mode.

🔍 Worked Example — IEEE 754 Single-Precision: 9.25 + 1.5

A = 9.25: 1001.01₂ = 1.00101 × 2³ (S=0, E=130, M=00101…)

B = 1.5: 1.1₂ = 1.1 × 2⁰ (S=0, E=127, M=10000…)

Step 1 — Align: Exponent difference = 3−0 = 3. Shift B right by 3: 1.1 × 2⁰ → 0.001 1 × 2³

Step 2 — Add:

1.001 0100 × 2³
+ 0.001 1000 × 2³
──────────────────
1.010 1100 × 2³

Step 3 — Normalise: 1.0101100 × 2³ → already normalised (starts with 1.)

Step 4 — Round: Exact representation, no rounding needed.

Result: 1.0101100 × 2³ = 1010.1100₂ = 10.75₁₀ ✓ (9.25 + 1.5 = 10.75)

🏭FPU Pipeline

A high-performance FPU pipelines the four addition steps so a new floating-point operation can begin every clock cycle, even though one operation takes multiple cycles to complete:

Figure 5 — Pipelined FPU: four stages, one result per cycle at steady state

Pipelined FPU with 4 stages. Latency is 4 cycles (from input to result). Throughput is 1 result/cycle once the pipeline is full — four different operations overlap simultaneously. Multiply typically uses 5–7 stages. Division is often non-pipelined and takes 20–50 cycles — a common FP performance bottleneck.

FPU latency matters for performance: An FP addition taking 4 cycles is fine if the result is not immediately needed. But if the next instruction depends on the result, the CPU must stall (or the compiler must insert other instructions). This FP latency is the key reason compilers perform instruction scheduling — reordering FP operations to keep the pipeline busy while results are pending.

🔬VLSI Connections

🔬 FPU RTL design and IEEE 754 corner cases

Implementing a correct IEEE 754 compliant FPU is one of the hardest RTL design tasks. The four-stage addition pipeline is straightforward in principle but hides dozens of tricky corner cases: denormalised number handling (no implicit leading 1), catastrophic cancellation (subtracting nearly-equal numbers leaving only a few significant bits), sticky bit computation for correct rounding, and correct handling of all NaN and infinity propagation rules. Intel’s Pentium FDIV bug (1994) — a table error in the divide algorithm that gave wrong results in 1 in 9 billion divisions — resulted in a $475 million recall. Formal verification of FPU arithmetic is now an industry standard, using theorem provers (ACL2, Coq) to mathematically prove that the RTL produces IEEE 754 correct results for all possible input combinations. Companies like AMD and ARM use extensive formal FPU verification teams.

🔬 fp16 and bfloat16 in AI accelerators

Modern AI inference chips (Google TPU, NVIDIA A100/H100, Apple Neural Engine) use reduced-precision floating-point to pack more multiply-accumulate (MAC) units per mm². IEEE 754 fp16 (1+5+10) gives ~7 decimal digits of precision — insufficient for training but adequate for inference. bfloat16 (Brain Float 16, 1+8+7) uses the same exponent field as float32 (1+8+23) but truncates the significand — this allows lossless conversion between float32 and bfloat16 and was developed by Google Brain for training large language models. NVIDIA’s Tensor Cores natively execute 16×16 matrix multiplications in fp16 or bfloat16 in a single clock cycle. Understanding the IEEE 754 format — specifically why the exponent field width determines dynamic range and the significand width determines precision — is the prerequisite for understanding why bfloat16 was chosen: it preserves float32’s range (critical for gradient updates in training) while halving memory bandwidth requirements.

🔬 Booth multiplier timing closure in physical design

The multiplier is almost always the critical-path bottleneck in a CPU or DSP data path. A 32×32-bit Booth-encoded multiplier produces 16 partial products (Radix-4 MBE), which are reduced by a Wallace tree of carry-save adders to two 64-bit numbers, then a final carry-propagate adder produces the 64-bit result. The Wallace tree depth is approximately log₁.₅(16) ≈ 6 levels of full adders. Each CSA level adds ~1 gate delay; the final carry-propagate adder adds ~log₂(64) = 6 gate delays with a carry-lookahead tree. Total critical path: ~12–20 gate delays = 1–3 ns at 28nm. If your target clock period is 1 ns (1 GHz), the multiplier is the hardest block to close timing on. Physical designers floorplan the multiplier carefully — placing it close to the register file and ALU, aligning it with the clock distribution tree, and sometimes manually placing critical cells.

Summary — CA-11 key points: Unsigned binary addition uses positional carry-propagation; subtraction uses borrow-propagation. A single adder–subtractor circuit implements both ADD and SUB by complementing operand B and setting carry-in=1 when SUB=1 (A − B = A + NOT(B) + 1). Booth’s algorithm correctly handles two’s complement multiplication by examining pairs of multiplier bits and performing subtract (start of 1-block), add (end of 1-block), or shift-only (middle of block). Fixed-point representation uses integer hardware with a programmer-defined implicit binary point. Floating-point uses sign + biased exponent + significand — IEEE 754 is the universal standard (half/single/double precision). Special values: ±0, denormals, ±Infinity, NaN. FP addition requires four steps: align (shift smaller operand), add significands, normalise (restore 1.xxx form), round. FPU pipelines these stages for throughput of 1 result/cycle. Division is typically non-pipelined and much slower.

← CA-10: CPU & ALU ↑ Series Index CA-12: Combinational & Sequential Logic →