Count the Number of 1s
Four implementations of the population count (popcount) circuit — combinational loop, sequential serial counter, parallel operator, and tree-adder — with an exhaustive self-checking testbench covering all 256 input values of an 8-bit input.
🔢 Introduction & Theory
Counting the number of 1-bits in a binary word is called a population count (popcount), Hamming weight, or bit count. It is one of the most fundamental operations in digital design, appearing in:
- Error correction codes — computing syndrome weight (Hamming, BCH)
- Cryptography — Hamming distance between keys
- Compression — counting set bits in bitmaps
- Processor ISAs — POPCNT instruction in x86, CNT in ARM, PCNT in RISC-V
- Network processing — counting set bits in masks and prefix lengths
📊 Truth Table & Examples
| Input [7:0] | Binary pattern | Count [3:0] |
|---|---|---|
| 8’h00 | 0000_0000 | 0000 = 0 |
| 8’h01 | 0000_0001 | 0001 = 1 |
| 8’h03 | 0000_0011 | 0010 = 2 |
| 8’h0F | 0000_1111 | 0100 = 4 |
| 8’h55 | 0101_0101 | 0100 = 4 |
| 8’hAA | 1010_1010 | 0100 = 4 |
| 8’hF0 | 1111_0000 | 0100 = 4 |
| 8’hFF | 1111_1111 | 1000 = 8 |
| 8’h7F | 0111_1111 | 0111 = 7 |
| 8’h80 | 1000_0000 | 0001 = 1 |
for i = 0 to N-1
🔁 Implementation 1 — For Loop (Combinational)
The simplest approach: a for loop inside an always @(*) block iterates over all bits and accumulates the count. Synthesis unrolls the loop into a chain of adders.
// ============================================================ // Module : count_ones_loop // Method : Combinational for-loop — iterate all bits, accumulate // Synthesis: Unrolled to a chain of 8 single-bit adders // Latency : O(N) adder chain — 8 additions deep // ============================================================ `timescale 1ns/1ps `default_nettype none module count_ones_loop #( parameter N = 8, // input width parameter CNT = $clog2(N+1) // output width: 4 for N=8 ) ( input [N-1:0] data, // N-bit input word output reg [CNT:0] count // count of 1-bits (0 to N) ); integer i; always @(*) begin count = 0; // start at zero each evaluation for (i = 0; i < N; i = i + 1) count = count + data[i]; // add each bit end endmodule `default_nettype wire
c1 = data[0]; c2 = c1 + data[1]; ... count = c7 + data[7]. The result is a ripple-carry adder chain — simple but not the fastest possible circuit. The critical path passes through all 8 additions.
⏱ Implementation 2 — Sequential Serial Counter
A sequential implementation shifts the input register one bit at a time, adding the shifted-out LSB to the running count. Uses minimum combinational area (just one adder) but takes N clock cycles to produce the result — a classic area-speed trade-off.
// ============================================================ // Module : count_ones_serial // Method : Sequential — load, shift, accumulate over N cycles // Latency : N clock cycles (8 cycles for N=8) // Area : 1 adder + N+CNT flip-flops (minimum logic) // Ports : start=1 loads data; done=1 when count is valid // ============================================================ `timescale 1ns/1ps `default_nettype none module count_ones_serial #( parameter N = 8, parameter CNT = $clog2(N+1) ) ( input clk, input rst_n, input start, // pulse high to load data and begin input [N-1:0] data, output reg [CNT:0] count, output reg done // asserts when count is valid ); reg [N-1:0] shift_reg; // holds bits still to process reg [3:0] bit_cnt; // how many bits processed so far always @(posedge clk) begin if (!rst_n) begin shift_reg <= 0; count <= 0; bit_cnt <= 0; done <= 0; end else if (start) begin // Load: capture data, reset accumulator shift_reg <= data; count <= 0; bit_cnt <= 0; done <= 0; end else if (bit_cnt < N) begin // Process one bit per clock cycle count <= count + shift_reg[0]; // add LSB shift_reg <= shift_reg >> 1; // shift right bit_cnt <= bit_cnt + 1; done <= (bit_cnt == N - 1); // last iteration end end endmodule `default_nettype wire
done output signals to downstream logic when the count is valid.
⚡ Implementation 3 — Parallel Add ($countones)
The most concise implementation uses Verilog’s built-in $countones() system function (SystemVerilog) or a direct arithmetic sum. Both synthesize to an efficient parallel popcount tree.
// ============================================================ // Module : count_ones_parallel // Method : Parallel — sum all bit signals in one expression // Synthesis: Tool maps to popcount hardware (Wallace tree etc.) // Latency : O(log N) — fastest possible combinational path // ============================================================ `timescale 1ns/1ps `default_nettype none module count_ones_parallel #( parameter N = 8, parameter CNT = $clog2(N+1) ) ( input [N-1:0] data, output [CNT:0] count ); // ── Method A: Explicit bit sum (works in both Verilog and SV) ── assign count = data[0] + data[1] + data[2] + data[3] + data[4] + data[5] + data[6] + data[7]; // ── Method B: SystemVerilog $countones (one-liner) ───────────── // assign count = $countones(data); // Note: $countones is a SV system function — use method A for // pure Verilog-2001 compatibility // ── Method C: Parameterised generate (any N) ─────────────────── // genvar i; // wire [CNT:0] partial [0:N]; // assign partial[0] = 0; // generate // for (i = 0; i < N; i = i+1) // assign partial[i+1] = partial[i] + data[i]; // endgenerate // assign count = partial[N]; endmodule `default_nettype wire
🌳 Implementation 4 — Tree Adder (Fastest)
The tree adder (also called a Wallace tree or carry-save adder tree) computes the popcount in log₂(N) stages by adding bits in pairs, then pairs of sums, and so on. This is the minimum-depth implementation — the critical path is O(log N) instead of O(N).
// ============================================================ // Module : count_ones_tree // Method : Binary tree adder — log2(N) depth stages // Critical : 3 additions deep for N=8 (vs 7 for loop method) // Explicitly shows tree structure — synthesis also finds this // ============================================================ `timescale 1ns/1ps `default_nettype none module count_ones_tree ( input [7:0] data, output [3:0] count ); // ── Stage 1: Add pairs of bits (4 additions in parallel) ── wire [1:0] s0 = data[0] + data[1]; // 0..2 wire [1:0] s1 = data[2] + data[3]; // 0..2 wire [1:0] s2 = data[4] + data[5]; // 0..2 wire [1:0] s3 = data[6] + data[7]; // 0..2 // ── Stage 2: Add pairs of stage-1 sums (2 additions in parallel) wire [2:0] t0 = s0 + s1; // 0..4 wire [2:0] t1 = s2 + s3; // 0..4 // ── Stage 3: Final addition (1 addition) ───────────────── assign count = t0 + t1; // 0..8 → 4-bit result endmodule `default_nettype wire
a+b+c+d+...) into tree adders, so the explicit tree code and the parallel one-liner typically produce identical netlists after optimization.
🧪 Comprehensive Testbench
The testbench instantiates all three combinational implementations simultaneously and verifies them exhaustively against a simple reference model across all 256 possible 8-bit input values. The serial implementation is tested separately with correct clocking protocol.
// ============================================================ // Testbench : count_ones_tb // Strategy : Exhaustive — all 256 values of 8-bit input // DUTs : loop, parallel, tree (combinational, simultaneous) // serial (separate sequential test) // Reference : $countones() system function // ============================================================ `timescale 1ns/1ps `default_nettype none module count_ones_tb; // ── Shared stimulus ──────────────────────────────────────── reg [7:0] data; wire [3:0] cnt_loop, cnt_par, cnt_tree; // ── Instantiate combinational DUTs ───────────────────────── count_ones_loop dut_loop (.data(data), .count(cnt_loop)); count_ones_parallel dut_par (.data(data), .count(cnt_par)); count_ones_tree dut_tree (.data(data), .count(cnt_tree)); // ── Serial DUT ───────────────────────────────────────────── reg clk=0, rst_n=1, start=0; wire [3:0] cnt_ser; wire done_ser; count_ones_serial dut_ser ( .clk(clk),.rst_n(rst_n),.start(start), .data(data),.count(cnt_ser),.done(done_ser) ); always #5 clk = ~clk; initial begin $dumpfile("count_ones.vcd"); $dumpvars(0, count_ones_tb); end integer pass_cnt=0, fail_cnt=0, test_num=0; integer v, exp; initial begin $display("\n============================================================"); $display(" Count-the-1s Testbench — Exhaustive 256-Vector Coverage"); $display("============================================================"); // ── Combinational tests: all 256 values ─────────────────── $display("\n --- Combinational (loop, parallel, tree) ---"); for (v=0; v<=255; v=v+1) begin data = v[7:0]; #2; // combinational settle // Reference model: manual popcount using integer arithmetic exp = v[0]+v[1]+v[2]+v[3]+v[4]+v[5]+v[6]+v[7]; test_num++; if (cnt_loop===exp && cnt_par===exp && cnt_tree===exp) begin if (v % 32 == 0) // print every 32nd result $display(" PASS data=0x%02h (%08b) → count=%0d", data, data, exp); pass_cnt++; end else begin $display(" FAIL data=0x%02h | loop=%0d par=%0d tree=%0d exp=%0d", data, cnt_loop, cnt_par, cnt_tree, exp); fail_cnt++; end end // ── Serial tests: selected values ────────────────────────── $display("\n --- Serial (clocked, 8 cycles per test) ---"); rst_n=0; @(posedge clk); #1; rst_n=1; // Test selected values with the serial module begin : serial_tests integer test_vals [0:7]; integer k; test_vals[0] = 8'h00; test_vals[1] = 8'hFF; test_vals[2] = 8'hAA; test_vals[3] = 8'h55; test_vals[4] = 8'h01; test_vals[5] = 8'h80; test_vals[6] = 8'h0F; test_vals[7] = 8'hF0; for (k=0; k<8; k=k+1) begin data = test_vals[k]; start = 1; @(posedge clk); #1; start = 0; // Wait 8 cycles for serial result repeat(9) @(posedge clk); #1; exp = data[0]+data[1]+data[2]+data[3]+ data[4]+data[5]+data[6]+data[7]; test_num++; if (cnt_ser===exp && done_ser===1'b1) begin $display(" PASS Serial data=0x%02h → count=%0d (done=%b)", data, cnt_ser, done_ser); pass_cnt++; end else begin $display(" FAIL Serial data=0x%02h | got=%0d exp=%0d done=%b", data, cnt_ser, exp, done_ser); fail_cnt++; end end end // ── Summary ───────────────────────────────────────────── $display("\n============================================================"); $display(" RESULTS: %0d / %0d PASS | %0d FAIL", pass_cnt,test_num,fail_cnt); $display("============================================================"); if(fail_cnt==0) $display(" ✅ ALL TESTS PASSED — all 256 input values verified\n"); else $fatal(1," ❌ %0d FAILURE(S)\n",fail_cnt); #20; $finish; end endmodule `default_nettype wire
📈 Simulation Waveform
The combinational waveform shows the count output responding immediately as the input changes through selected test values. The serial waveform shows the 8-cycle latency after each start pulse.
💻 Simulation Console Output
How to Run
# ── Icarus Verilog ──────────────────────────────────────────── iverilog -o popcount_sim \ count_ones_loop.v \ count_ones_serial.v \ count_ones_parallel.v \ count_ones_tree.v \ count_ones_tb.v vvp popcount_sim gtkwave count_ones.vcd # ── ModelSim ────────────────────────────────────────────────── vlog count_ones_loop.v count_ones_serial.v \ count_ones_parallel.v count_ones_tree.v count_ones_tb.v vsim -c count_ones_tb -do "run -all; quit -f"
🔬 Design Analysis & Comparison
| Implementation | Critical path depth | Area (adders) | Latency (cycles) | Synthesizable? | Best for |
|---|---|---|---|---|---|
| Loop (for) | O(N) — 7 adders in series | N-1 adders | 0 (combinational) | ✅ Yes | Clarity, any N via parameter |
| Serial | 1 adder (per cycle) | 1 adder + N+4 FFs | N cycles | ✅ Yes | Ultra-low area, high throughput with pipelining |
| Parallel (assign) | O(log N) after tool opt. | N-1 adders (tree) | 0 (combinational) | ✅ Yes | Concise code, trust synthesis |
| Tree adder | O(log N) — 3 levels | N-1 adders (tree) | 0 (combinational) | ✅ Yes | Maximum speed, explicit parallelism |
Critical Path Comparison — N=8
🔴 Loop (chain): 7 additions deep
d[0] → +d[1] → +d[2] → +d[3]
→ +d[4] → +d[5] → +d[6] → +d[7]
Delay = 7 × t_ADD
🟢 Tree: 3 additions deep
Stage1: 4 adds in parallel (1×t_ADD)
Stage2: 2 adds in parallel (1×t_ADD)
Stage3: 1 final add (1×t_ADD)
Delay = 3 × t_ADD = log₂(8) × t_ADD
parameter N = 8 with parameter CNT = $clog2(N+1) to automatically derive the output width. To count 1s in a 32-bit word, instantiate with #(.N(32)) — CNT becomes 6 (to hold values 0–32), the loop runs 32 iterations, and the tree adder becomes 5 levels deep. The serial module will take 32 clock cycles. No other changes required.
