Binary and Floating-Point Arithmetic — Complete Guide for Computer Architecture

➕Unsigned Binary Addition

Unsigned binary addition follows the same positional rules as decimal addition — but with carries at 2 instead of 10. The fundamental single-bit addition table:

Augend (A)	Addend (B)	Carry out	Sum bit
0	0	0	0
0	1	0	1
1	0	0	1
1	1	1	0

Only 1+1 generates a carry. For multi-bit numbers, carries propagate from the LSB (bit 0) to the MSB (bit n-1). If a carry propagates out of the MSB and the numbers are unsigned, this indicates the result exceeded the n-bit word capacity.

🔍 Worked Examples — Unsigned Binary Addition

(a) 1001₂ + 0101₂ = ? (9 + 5 = 14)

Carry: 1
    1001
+  0101
──────
    1110 = 14₁₀ ✓

(b) 1110₂ + 0111₂ = ? (14 + 7 = 21)

Carry: 111
    1110
+  0111
──────
10101 = 21₁₀ ✓   (5-bit result — carry out of MSB means we need one more bit)

➖Unsigned Binary Subtraction

Subtraction follows the same positional rule as decimal, but with borrows at 2. The single-bit subtraction table (minuend − subtrahend):

Minuend (x)	Subtrahend (y)	Borrow out	Difference
0	0	0	0
0	1	1	1
1	0	0	1
1	1	0	0

🔍 Worked Examples — Unsigned Binary Subtraction

(a) 1001₂ − 0010₂ = ? (9 − 2 = 7)

Borrow: 11
    1001
−  0010
──────
    0111 = 7₁₀ ✓

(b) 0011₂ − 0110₂ = ? (3 − 6, result would be negative)

The borrow indicates the result is negative. In practice, processors detect this via the borrow (carry) flag.

🔧Hardware Adder–Subtractor

The hardware insight from two’s complement: A − B = A + (NOT B) + 1. A single hardware block — one binary adder plus a controllable complementer — implements both addition and subtraction:

Figure 1 — Hardware adder–subtractor: one circuit for both ADD and SUB

Adder–subtractor block diagram. The SUB control signal simultaneously inverts all bits of Register B and sets carry-in=1. With SUB=0: A+B. With SUB=1: A + NOT(B) + 1 = A − B. One circuit, two operations. The OVF flag is generated by XOR-ing the carry into and out of the MSB.

✖️Two’s Complement Multiplication — Booth’s Algorithm Summary

As covered in CA-10, multiplying negative two’s complement numbers requires Booth’s algorithm. The key rules (with an additional example):

Figure 2 — Booth’s algorithm: register state evolution for (7) × (−3) = −21

Booth’s algorithm for (7) × (−3) = −21. M = 0111 (+7). Multiplier Q = 1101 (−3 in 4-bit two’s complement). Red rows: subtract M (start of a 1-block). Green rows: add M (end of a 1-block). Yellow rows: arithmetic right shift. Final A:Q = 1110 1011 = −21 ✓.

📍Fixed-Point Representation

All the integer representations discussed in CA-10 and this article are examples of fixed-point representation — the binary point is fixed, implicitly, to the right of the least significant bit. A programmer can represent fractions by mentally repositioning the binary point:

Stored integer	Binary	Fixed-point value (4.4 format)
0001 0000	0001.0000	1.0000 (integer 16 ÷ 16)
0000 1000	0000.1000	0.5000 (integer 8 ÷ 16)
0001 0100	0001.0100	1.2500 (integer 20 ÷ 16)

Fixed-point vs floating-point trade-off: Fixed-point is fast and area-efficient but has limited dynamic range. A 16-bit floating-point (IEEE 754 half-precision) can represent values from ~±6×10⁻⁵ to ~±65,504 — a dynamic range of 10 billion to 1. For audio/image processing with known range, fixed-point is preferred. For scientific computing and ML, floating-point is essential.

🌊Floating-Point Concept

Floating-point represents numbers in scientific notation: sign × significand × 2^exponent. The binary point “floats” — its position is determined by the exponent. This gives an enormous dynamic range:

Sign (S): 0 = positive, 1 = negative
Exponent (E): Stored with a bias (offset) so that the stored value is always positive — enables direct comparison of two floating-point numbers using integer comparison
Significand / Mantissa (M): The fractional part after normalisation. A normalised number always has the form 1.xxx in binary — the leading 1 is implied (not stored), giving one extra bit of precision for free

📐IEEE 754 Standard

IEEE 754 (1985, revised 2008) is the universal standard for floating-point arithmetic. Three common formats:

Figure 3 — IEEE 754 floating-point formats: half, single, and double precision

Three IEEE 754 formats. Half-precision (fp16) is popular in AI inference. Single-precision is the default for most CPU/GPU work. Double-precision is used where rounding errors would accumulate. All use biased exponent encoding so that comparing two floats is equivalent to comparing two integers of the same bit width.

🔍 Worked Example — Encoding −0.75 as IEEE 754 Single-Precision

Step 1 — Convert to binary: 0.75 = 0.11₂ (since 0.5 + 0.25 = 0.75)

Step 2 — Normalise: 0.11₂ = 1.1 × 2⁻¹

Step 3 — Encode sign: Negative → S = 1

Step 4 — Encode exponent: Actual exponent = −1. Stored = −1 + 127 = 126 = 0111 1110₂

Step 5 — Encode significand: 1.1₂ → drop the leading 1 → store 1000 0000 0000 0000 0000 000₂

Final encoding:

S=1 Exp=0111 1110 Mantissa=100 0000 0000 0000 0000 0000
= 0xBF400000

🔍 Worked Example — Decoding IEEE 754 Single-Precision 0x41200000

Hex → binary: 0x41200000 = 0 10000010 01000000000000000000000

Sign bit: 0 → positive

Exponent: 10000010₂ = 130₁₀. Actual exponent = 130 − 127 = 3

Significand: 01000000000000000000000 → restore leading 1: 1.01₂

Value: (+1) × 1.01₂ × 2³ = 1010₂ = 10.0₁₀ ✓

⚡IEEE 754 Special Values

IEEE 754 reserves certain exponent bit patterns for special values that represent mathematical corner cases:

Stored exponent	Significand	Represents	Example
All zeros (0x00)	All zeros	±0 (positive or negative zero)	`0x00000000` = +0, `0x80000000` = −0
All zeros (0x00)	Non-zero	Denormalised number (subnormal)	Very small numbers near zero — no implied leading 1
All ones (0xFF)	All zeros	±Infinity	`0x7F800000` = +∞ · Result of 1.0/0.0
All ones (0xFF)	Non-zero	NaN (Not a Number)	`0x7FC00000` = quiet NaN · Result of 0.0/0.0, √(−1)

Why two zeros? −0 and +0 are mathematically equal but represent signed limits from different directions: 1/+0 = +∞, 1/−0 = −∞. In VLSI design, NaN propagation bugs are notoriously hard to debug — a single NaN input to a long computation chain silently contaminates all results.

➕Floating-Point Addition — Step-by-Step

Adding two floating-point numbers requires aligning the binary points (making exponents equal) before adding significands:

Figure 4 — Floating-point addition algorithm: four-step process

Floating-point addition requires four steps: (1) compare exponents and right-shift the smaller operand’s significand to align binary points; (2) add the aligned significands; (3) normalise the result back to 1.xxx form; (4) round to the target precision. IEEE 754 mandates “round to nearest, ties to even” as the default mode.

🔍 Worked Example — IEEE 754 Single-Precision: 9.25 + 1.5

A = 9.25: 1001.01₂ = 1.00101 × 2³

B = 1.5: 1.1₂ = 1.1 × 2⁰

Step 1 — Align: Exponent difference = 3. Shift B right by 3: 0.001 1 × 2³

Step 2 — Add:

1.001 0100 × 2³
+ 0.001 1000 × 2³
──────────────────
1.010 1100 × 2³

Step 3 — Normalise: Already normalised (starts with 1.)

Result: 1.0101100 × 2³ = 1010.1100₂ = 10.75₁₀ ✓

🏭FPU Pipeline

A high-performance FPU pipelines the four addition steps so a new floating-point operation can begin every clock cycle:

Figure 5 — Pipelined FPU: four stages, one result per cycle at steady state

Pipelined FPU with 4 stages. Latency is 4 cycles. Throughput is 1 result/cycle once the pipeline is full. Multiply typically uses 5–7 stages. Division is often non-pipelined and takes 20–50 cycles — a common FP performance bottleneck.

FPU latency matters for performance: An FP addition taking 4 cycles is fine if the result is not immediately needed. But if the next instruction depends on the result, the CPU must stall. This is why compilers perform instruction scheduling — reordering FP operations to keep the pipeline busy while results are pending.

🔬VLSI Connections

🔬 FPU RTL design and IEEE 754 corner cases

Implementing a correct IEEE 754 compliant FPU is one of the hardest RTL design tasks. The four-stage addition pipeline hides dozens of tricky corner cases: denormalised number handling (no implicit leading 1), catastrophic cancellation (subtracting nearly-equal numbers), sticky bit computation for correct rounding, and correct handling of all NaN and infinity propagation rules. Intel’s Pentium FDIV bug (1994) — a table error in the divide algorithm — resulted in a $475 million recall. Formal verification of FPU arithmetic is now an industry standard, using theorem provers (ACL2, Coq) to mathematically prove that the RTL produces IEEE 754 correct results for all possible input combinations.

🔬 fp16 and bfloat16 in AI accelerators

Modern AI inference chips (Google TPU, NVIDIA A100/H100, Apple Neural Engine) use reduced-precision floating-point to pack more multiply-accumulate (MAC) units per mm². bfloat16 (Brain Float 16, 1+8+7) uses the same exponent field as float32 (1+8+23) but truncates the significand — this allows lossless conversion between float32 and bfloat16 and was developed by Google Brain for training large language models. NVIDIA’s Tensor Cores natively execute 16×16 matrix multiplications in fp16 or bfloat16 in a single clock cycle. Understanding why bfloat16 was chosen — it preserves float32’s range (critical for gradient updates) while halving memory bandwidth requirements — requires understanding why the exponent field width determines dynamic range.

🔬 Booth multiplier timing closure in physical design

The multiplier is almost always the critical-path bottleneck in a CPU or DSP data path. A 32×32-bit Booth-encoded multiplier produces 16 partial products (Radix-4 MBE), which are reduced by a Wallace tree of carry-save adders to two 64-bit numbers, then a final carry-propagate adder produces the 64-bit result. The Wallace tree depth is approximately log₁.₅(16) ≈ 6 levels of full adders; the final carry-propagate adder adds ~log₂(64) = 6 gate delays with a carry-lookahead tree. Total critical path: ~12–20 gate delays = 1–3 ns at 28nm. Physical designers floorplan the multiplier carefully — placing it close to the register file and ALU and manually placing critical cells to meet timing closure.

Summary — CA-11 key points: Unsigned binary addition uses positional carry-propagation; subtraction uses borrow-propagation. A single adder–subtractor circuit implements both ADD and SUB by complementing operand B and setting carry-in=1 when SUB=1. Booth’s algorithm correctly handles two’s complement multiplication by examining pairs of multiplier bits and performing subtract (start of 1-block), add (end of 1-block), or shift-only (middle of block). Fixed-point representation uses integer hardware with a programmer-defined implicit binary point. Floating-point uses sign + biased exponent + significand — IEEE 754 is the universal standard (half/single/double precision). Special values: ±0, denormals, ±Infinity, NaN. FP addition requires four steps: align, add significands, normalise, round. FPU pipelines these stages for throughput of 1 result/cycle. Division is typically non-pipelined and much slower.