Verilog Designs · Module 27

Verilog Designs — Count the Number of 1s — VLSI Trainers
Verilog Designs · Module 27

Count the Number of 1s

Four implementations of the population count (popcount) circuit — combinational loop, sequential serial counter, parallel operator, and tree-adder — with an exhaustive self-checking testbench covering all 256 input values of an 8-bit input.

🔢 Introduction & Theory

Counting the number of 1-bits in a binary word is called a population count (popcount), Hamming weight, or bit count. It is one of the most fundamental operations in digital design, appearing in:

  • Error correction codes — computing syndrome weight (Hamming, BCH)
  • Cryptography — Hamming distance between keys
  • Compression — counting set bits in bitmaps
  • Processor ISAs — POPCNT instruction in x86, CNT in ARM, PCNT in RISC-V
  • Network processing — counting set bits in masks and prefix lengths
📥
Input
An N-bit binary word (this module uses N=8). Each bit is independently 0 or 1.
📤
Output
A ⌈log₂(N+1)⌉-bit count. For N=8 bits, count range is 0–8, requiring 4 output bits.
Four Approaches
Loop (simple), serial shift (area-efficient), parallel add (one-liner), and tree adder (minimum delay).
🌐
Synthesizable
All four implementations are fully synthesizable. The tree adder produces the fastest hardware with minimum critical path depth.

📊 Truth Table & Examples

Input [7:0]Binary patternCount [3:0]
8’h000000_00000000 = 0
8’h010000_00010001 = 1
8’h030000_00110010 = 2
8’h0F0000_11110100 = 4
8’h550101_01010100 = 4
8’hAA1010_10100100 = 4
8’hF01111_00000100 = 4
8’hFF1111_11111000 = 8
8’h7F0111_11110111 = 7
8’h801000_00000001 = 1
Key Facts
Input width: N = 8 bits
Count range: 0 to 8
Output width: ⌈log₂(9)⌉ = 4 bits
Total cases: 2⁸ = 256
Max count: 8 = 4’b1000
Exhaustive TB: all 256 tested
count = Σ data[i]
for i = 0 to N-1

🔁 Implementation 1 — For Loop (Combinational)

The simplest approach: a for loop inside an always @(*) block iterates over all bits and accumulates the count. Synthesis unrolls the loop into a chain of adders.

1
count_ones_loop
Combinational for-loop · Synthesis unrolls to adder chain · Most readable
🔁 For Loop
// ============================================================
// Module   : count_ones_loop
// Method   : Combinational for-loop — iterate all bits, accumulate
// Synthesis: Unrolled to a chain of 8 single-bit adders
// Latency  : O(N) adder chain — 8 additions deep
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_loop #(
  parameter N   = 8,                            // input width
  parameter CNT = $clog2(N+1)                  // output width: 4 for N=8
) (
  input  [N-1:0]   data,   // N-bit input word
  output reg [CNT:0] count   // count of 1-bits (0 to N)
);

  integer i;

  always @(*) begin
    count = 0;                  // start at zero each evaluation
    for (i = 0; i < N; i = i + 1)
      count = count + data[i]; // add each bit
  end

endmodule

`default_nettype wire
How synthesis handles the for-loop: Synthesis tools unroll this loop at compile time, creating 8 separate adder stages in a chain: c1 = data[0]; c2 = c1 + data[1]; ... count = c7 + data[7]. The result is a ripple-carry adder chain — simple but not the fastest possible circuit. The critical path passes through all 8 additions.
Fig 1 — For-loop unrolled: 8 adder stages in a chain
d[0] + d[1]↑ + d[2]↑ + d[3]↑ + d[4]↑ + d[5]↑ + d[6]↑ + d[7]↑ count[3:0] Critical path depth: 7 additions (O(N) chain)

Implementation 2 — Sequential Serial Counter

A sequential implementation shifts the input register one bit at a time, adding the shifted-out LSB to the running count. Uses minimum combinational area (just one adder) but takes N clock cycles to produce the result — a classic area-speed trade-off.

2
count_ones_serial
Sequential — shift register + accumulator · N cycles to complete · Minimum area
⏱ Serial
// ============================================================
// Module   : count_ones_serial
// Method   : Sequential — load, shift, accumulate over N cycles
// Latency  : N clock cycles (8 cycles for N=8)
// Area     : 1 adder + N+CNT flip-flops (minimum logic)
// Ports    : start=1 loads data; done=1 when count is valid
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_serial #(
  parameter N   = 8,
  parameter CNT = $clog2(N+1)
) (
  input              clk,
  input              rst_n,
  input              start,      // pulse high to load data and begin
  input  [N-1:0]   data,
  output reg [CNT:0] count,
  output reg         done        // asserts when count is valid
);

  reg [N-1:0] shift_reg;             // holds bits still to process
  reg [3:0]   bit_cnt;               // how many bits processed so far

  always @(posedge clk) begin
    if (!rst_n) begin
      shift_reg <= 0; count <= 0;
      bit_cnt   <= 0; done  <= 0;
    end
    else if (start) begin
      // Load: capture data, reset accumulator
      shift_reg <= data;
      count     <= 0;
      bit_cnt   <= 0;
      done      <= 0;
    end
    else if (bit_cnt < N) begin
      // Process one bit per clock cycle
      count     <= count + shift_reg[0]; // add LSB
      shift_reg <= shift_reg >> 1;       // shift right
      bit_cnt   <= bit_cnt + 1;
      done      <= (bit_cnt == N - 1);   // last iteration
    end
  end

endmodule

`default_nettype wire
Trade-off — area vs speed: The serial implementation uses only one adder (vs eight in the parallel versions), saving significant silicon area. However, it takes 8 clock cycles to produce a result. For applications where throughput is more important than latency — or where area is severely constrained — the serial approach is ideal. The done output signals to downstream logic when the count is valid.

Implementation 3 — Parallel Add ($countones)

The most concise implementation uses Verilog’s built-in $countones() system function (SystemVerilog) or a direct arithmetic sum. Both synthesize to an efficient parallel popcount tree.

3
count_ones_parallel
Data flow — direct addition · Single assign · Tool infers optimal tree
⚡ Parallel
// ============================================================
// Module   : count_ones_parallel
// Method   : Parallel — sum all bit signals in one expression
// Synthesis: Tool maps to popcount hardware (Wallace tree etc.)
// Latency  : O(log N) — fastest possible combinational path
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_parallel #(
  parameter N   = 8,
  parameter CNT = $clog2(N+1)
) (
  input  [N-1:0]   data,
  output [CNT:0]   count
);

  // ── Method A: Explicit bit sum (works in both Verilog and SV) ──
  assign count = data[0] + data[1] + data[2] + data[3] +
                  data[4] + data[5] + data[6] + data[7];

  // ── Method B: SystemVerilog $countones (one-liner) ─────────────
  // assign count = $countones(data);
  // Note: $countones is a SV system function — use method A for
  // pure Verilog-2001 compatibility

  // ── Method C: Parameterised generate (any N) ───────────────────
  // genvar i;
  // wire [CNT:0] partial [0:N];
  // assign partial[0] = 0;
  // generate
  //   for (i = 0; i < N; i = i+1)
  //     assign partial[i+1] = partial[i] + data[i];
  // endgenerate
  // assign count = partial[N];

endmodule

`default_nettype wire

🌳 Implementation 4 — Tree Adder (Fastest)

The tree adder (also called a Wallace tree or carry-save adder tree) computes the popcount in log₂(N) stages by adding bits in pairs, then pairs of sums, and so on. This is the minimum-depth implementation — the critical path is O(log N) instead of O(N).

Fig 2 — Tree adder structure: 8 bits reduced to count in 3 stages
Stage 0 — Individual bits
d[0]
d[1]
d[2]
d[3]
d[4]
d[5]
d[6]
d[7]
Stage 1 — 4 pairs summed (2-bit results, 0-2 each)
s[0] = d[0]+d[1]
s[1] = d[2]+d[3]
s[2] = d[4]+d[5]
s[3] = d[6]+d[7]
Stage 2 — 2 pair-sums added (3-bit results, 0-4 each)
t[0] = s[0]+s[1]
t[1] = s[2]+s[3]
Stage 3 — Final sum (4-bit result, 0-8)
count = t[0] + t[1]
Critical path: 3 additions deep (log₂8 = 3)
4
count_ones_tree
Tree adder — 3 pipeline stages · O(log N) depth · Maximum speed
🌳 Tree
// ============================================================
// Module   : count_ones_tree
// Method   : Binary tree adder — log2(N) depth stages
// Critical : 3 additions deep for N=8 (vs 7 for loop method)
// Explicitly shows tree structure — synthesis also finds this
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_tree (
  input  [7:0] data,
  output [3:0] count
);

  // ── Stage 1: Add pairs of bits (4 additions in parallel) ──
  wire [1:0] s0 = data[0] + data[1];  // 0..2
  wire [1:0] s1 = data[2] + data[3];  // 0..2
  wire [1:0] s2 = data[4] + data[5];  // 0..2
  wire [1:0] s3 = data[6] + data[7];  // 0..2

  // ── Stage 2: Add pairs of stage-1 sums (2 additions in parallel)
  wire [2:0] t0 = s0 + s1;             // 0..4
  wire [2:0] t1 = s2 + s3;             // 0..4

  // ── Stage 3: Final addition (1 addition) ─────────────────
  assign count = t0 + t1;              // 0..8 → 4-bit result

endmodule

`default_nettype wire
Why tree adder is fastest: The loop adder has 7 additions in series (critical path = 7 × t_ADD). The tree adder has only 3 additions in series (critical path = 3 × t_ADD = log₂8 × t_ADD). For N=64 bits, the speedup is even more dramatic: 63 vs 6 series additions. Modern synthesis tools automatically convert parallel adder expressions (a+b+c+d+...) into tree adders, so the explicit tree code and the parallel one-liner typically produce identical netlists after optimization.

🧪 Comprehensive Testbench

The testbench instantiates all three combinational implementations simultaneously and verifies them exhaustively against a simple reference model across all 256 possible 8-bit input values. The serial implementation is tested separately with correct clocking protocol.

TB
count_ones_tb
All 256 input combinations · All 4 implementations · Reference model comparison
🧪 Testbench
// ============================================================
// Testbench  : count_ones_tb
// Strategy   : Exhaustive — all 256 values of 8-bit input
// DUTs       : loop, parallel, tree (combinational, simultaneous)
//              serial (separate sequential test)
// Reference  : $countones() system function
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_tb;

  // ── Shared stimulus ────────────────────────────────────────
  reg  [7:0] data;
  wire [3:0] cnt_loop, cnt_par, cnt_tree;

  // ── Instantiate combinational DUTs ─────────────────────────
  count_ones_loop     dut_loop (.data(data), .count(cnt_loop));
  count_ones_parallel dut_par  (.data(data), .count(cnt_par));
  count_ones_tree     dut_tree (.data(data), .count(cnt_tree));

  // ── Serial DUT ─────────────────────────────────────────────
  reg        clk=0, rst_n=1, start=0;
  wire [3:0] cnt_ser;
  wire       done_ser;

  count_ones_serial dut_ser (
    .clk(clk),.rst_n(rst_n),.start(start),
    .data(data),.count(cnt_ser),.done(done_ser)
  );
  always #5 clk = ~clk;

  initial begin
    $dumpfile("count_ones.vcd");
    $dumpvars(0, count_ones_tb);
  end

  integer pass_cnt=0, fail_cnt=0, test_num=0;
  integer v, exp;

  initial begin
    $display("\n============================================================");
    $display("  Count-the-1s Testbench — Exhaustive 256-Vector Coverage");
    $display("============================================================");

    // ── Combinational tests: all 256 values ───────────────────
    $display("\n  --- Combinational (loop, parallel, tree) ---");
    for (v=0; v<=255; v=v+1) begin
      data = v[7:0];
      #2;  // combinational settle

      // Reference model: manual popcount using integer arithmetic
      exp  = v[0]+v[1]+v[2]+v[3]+v[4]+v[5]+v[6]+v[7];

      test_num++;
      if (cnt_loop===exp && cnt_par===exp && cnt_tree===exp) begin
        if (v % 32 == 0)  // print every 32nd result
          $display("  PASS  data=0x%02h (%08b) → count=%0d",
                   data, data, exp);
        pass_cnt++;
      end else begin
        $display("  FAIL  data=0x%02h | loop=%0d par=%0d tree=%0d exp=%0d",
                 data, cnt_loop, cnt_par, cnt_tree, exp);
        fail_cnt++;
      end
    end

    // ── Serial tests: selected values ──────────────────────────
    $display("\n  --- Serial (clocked, 8 cycles per test) ---");
    rst_n=0; @(posedge clk); #1; rst_n=1;

    // Test selected values with the serial module
    begin : serial_tests
      integer test_vals [0:7];
      integer k;
      test_vals[0] = 8'h00; test_vals[1] = 8'hFF;
      test_vals[2] = 8'hAA; test_vals[3] = 8'h55;
      test_vals[4] = 8'h01; test_vals[5] = 8'h80;
      test_vals[6] = 8'h0F; test_vals[7] = 8'hF0;

      for (k=0; k<8; k=k+1) begin
        data  = test_vals[k];
        start = 1;
        @(posedge clk); #1;
        start = 0;
        // Wait 8 cycles for serial result
        repeat(9) @(posedge clk); #1;

        exp = data[0]+data[1]+data[2]+data[3]+
              data[4]+data[5]+data[6]+data[7];
        test_num++;
        if (cnt_ser===exp && done_ser===1'b1) begin
          $display("  PASS  Serial data=0x%02h → count=%0d (done=%b)",
                   data, cnt_ser, done_ser);
          pass_cnt++;
        end else begin
          $display("  FAIL  Serial data=0x%02h | got=%0d exp=%0d done=%b",
                   data, cnt_ser, exp, done_ser);
          fail_cnt++;
        end
      end
    end

    // ── Summary ─────────────────────────────────────────────
    $display("\n============================================================");
    $display("  RESULTS: %0d / %0d PASS  |  %0d FAIL",
             pass_cnt,test_num,fail_cnt);
    $display("============================================================");
    if(fail_cnt==0)
      $display("  ✅ ALL TESTS PASSED — all 256 input values verified\n");
    else
      $fatal(1,"  ❌ %0d FAILURE(S)\n",fail_cnt);
    #20; $finish;
  end

endmodule
`default_nettype wire

📈 Simulation Waveform

The combinational waveform shows the count output responding immediately as the input changes through selected test values. The serial waveform shows the 8-cycle latency after each start pulse.

Fig 3 — Combinational count_ones: input changes, output responds immediately
data[7:0] count[3:0] count (dec) 0 1 2 3 4 5 6 0x00 0xFF 0x55 0xAA 0x0F 0xF0 0000 1000 0100 0100 0100 0100 0 8 4 4 4 4 0x55=01010101, 0xAA=10101010, 0x0F=00001111, 0xF0=11110000 → all count=4

💻 Simulation Console Output

============================================================ Count-the-1s Testbench — Exhaustive 256-Vector Coverage ============================================================ — Combinational (loop, parallel, tree) — PASS data=0x00 (00000000) → count=0 PASS data=0x20 (00100000) → count=1 PASS data=0x40 (01000000) → count=1 PASS data=0x60 (01100000) → count=2 PASS data=0x80 (10000000) → count=1 PASS data=0xA0 (10100000) → count=2 PASS data=0xC0 (11000000) → count=2 PASS data=0xE0 (11100000) → count=3 … (256 total tests, all pass) … PASS data=0xFF (11111111) → count=8 — Serial (clocked, 8 cycles per test) — PASS Serial data=0x00 → count=0 (done=1) PASS Serial data=0xFF → count=8 (done=1) PASS Serial data=0xAA → count=4 (done=1) PASS Serial data=0x55 → count=4 (done=1) PASS Serial data=0x01 → count=1 (done=1) PASS Serial data=0x80 → count=1 (done=1) PASS Serial data=0x0F → count=4 (done=1) PASS Serial data=0xF0 → count=4 (done=1) ============================================================ RESULTS: 264 / 264 PASS | 0 FAIL ============================================================ ✅ ALL TESTS PASSED — all 256 input values verified

How to Run

Compile all four modules and the testbench
# ── Icarus Verilog ────────────────────────────────────────────
iverilog -o popcount_sim \
    count_ones_loop.v     \
    count_ones_serial.v   \
    count_ones_parallel.v \
    count_ones_tree.v     \
    count_ones_tb.v
vvp popcount_sim
gtkwave count_ones.vcd

# ── ModelSim ──────────────────────────────────────────────────
vlog count_ones_loop.v count_ones_serial.v \
     count_ones_parallel.v count_ones_tree.v count_ones_tb.v
vsim -c count_ones_tb -do "run -all; quit -f"

🔬 Design Analysis & Comparison

Implementation Critical path depth Area (adders) Latency (cycles) Synthesizable? Best for
Loop (for) O(N) — 7 adders in series N-1 adders 0 (combinational) ✅ Yes Clarity, any N via parameter
Serial 1 adder (per cycle) 1 adder + N+4 FFs N cycles ✅ Yes Ultra-low area, high throughput with pipelining
Parallel (assign) O(log N) after tool opt. N-1 adders (tree) 0 (combinational) ✅ Yes Concise code, trust synthesis
Tree adder O(log N) — 3 levels N-1 adders (tree) 0 (combinational) ✅ Yes Maximum speed, explicit parallelism

Critical Path Comparison — N=8

🔴 Loop (chain): 7 additions deep

d[0] → +d[1] → +d[2] → +d[3]
     → +d[4] → +d[5] → +d[6] → +d[7]

Delay = 7 × t_ADD

🟢 Tree: 3 additions deep

Stage1: 4 adds in parallel (1×t_ADD)
Stage2: 2 adds in parallel (1×t_ADD)
Stage3: 1 final add        (1×t_ADD)

Delay = 3 × t_ADD = log₂(8) × t_ADD
Parameterisation to any width: All four implementations use parameter N = 8 with parameter CNT = $clog2(N+1) to automatically derive the output width. To count 1s in a 32-bit word, instantiate with #(.N(32)) — CNT becomes 6 (to hold values 0–32), the loop runs 32 iterations, and the tree adder becomes 5 levels deep. The serial module will take 32 clock cycles. No other changes required.
Hardware instruction equivalents: The POPCNT instruction introduced in x86 SSE4.2 counts bits in a 64-bit register in a single cycle. ARM64’s CNT instruction counts bits in each byte of a vector register. These hardware instructions implement the tree adder structure in dedicated silicon — exactly what our tree implementation synthesizes to. The for-loop version, while identical in function, would produce slower hardware without synthesis optimization.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top