Verilog Designs — Count the Number of 1s — VLSI Trainers

Verilog Designs · Module 27

Count the Number of 1s

Four implementations of the population count (popcount) circuit — combinational loop, sequential serial counter, parallel operator, and tree-adder — with an exhaustive self-checking testbench covering all 256 input values of an 8-bit input.

🔢 Introduction & Theory

Counting the number of 1-bits in a binary word is called a population count (popcount), Hamming weight, or bit count. It is one of the most fundamental operations in digital design, appearing in:

Error correction codes — computing syndrome weight (Hamming, BCH)
Cryptography — Hamming distance between keys
Compression — counting set bits in bitmaps
Processor ISAs — POPCNT instruction in x86, CNT in ARM, PCNT in RISC-V
Network processing — counting set bits in masks and prefix lengths

📥

Input

An N-bit binary word (this module uses N=8). Each bit is independently 0 or 1.

📤

Output

A ⌈log₂(N+1)⌉-bit count. For N=8 bits, count range is 0–8, requiring 4 output bits.

⚡

Four Approaches

Loop (simple), serial shift (area-efficient), parallel add (one-liner), and tree adder (minimum delay).

🌐

Synthesizable

All four implementations are fully synthesizable. The tree adder produces the fastest hardware with minimum critical path depth.

📊 Truth Table & Examples

Input [7:0]	Binary pattern	Count [3:0]
8’h00	0000_0000	0000 = 0
8’h01	0000_0001	0001 = 1
8’h03	0000_0011	0010 = 2
8’h0F	0000_1111	0100 = 4
8’h55	0101_0101	0100 = 4
8’hAA	1010_1010	0100 = 4
8’hF0	1111_0000	0100 = 4
8’hFF	1111_1111	1000 = 8
8’h7F	0111_1111	0111 = 7
8’h80	1000_0000	0001 = 1

Key Facts

Input width: N = 8 bits

Count range: 0 to 8

Output width: ⌈log₂(9)⌉ = 4 bits

Total cases: 2⁸ = 256

Max count: 8 = 4’b1000

Exhaustive TB: all 256 tested

          count = Σ data[i]
for i = 0 to N-1
        

🔁 Implementation 1 — For Loop (Combinational)

The simplest approach: a for loop inside an always @(*) block iterates over all bits and accumulates the count. Synthesis unrolls the loop into a chain of adders.

count_ones_loop

Combinational for-loop · Synthesis unrolls to adder chain · Most readable

🔁 For Loop

// ============================================================
// Module   : count_ones_loop
// Method   : Combinational for-loop — iterate all bits, accumulate
// Synthesis: Unrolled to a chain of 8 single-bit adders
// Latency  : O(N) adder chain — 8 additions deep
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_loop #(
  parameter N   = 8,                            // input width
  parameter CNT = $clog2(N+1)                  // output width: 4 for N=8
) (
  input  [N-1:0]   data,   // N-bit input word
  output reg [CNT:0] count   // count of 1-bits (0 to N)
);

  integer i;

  always @(*) begin
    count = 0;                  // start at zero each evaluation
    for (i = 0; i < N; i = i + 1)
      count = count + data[i]; // add each bit
  end

endmodule

`default_nettype wire

How synthesis handles the for-loop: Synthesis tools unroll this loop at compile time, creating 8 separate adder stages in a chain: c1 = data[0]; c2 = c1 + data[1]; ... count = c7 + data[7]. The result is a ripple-carry adder chain — simple but not the fastest possible circuit. The critical path passes through all 8 additions.

Fig 1 — For-loop unrolled: 8 adder stages in a chain

⏱ Implementation 2 — Sequential Serial Counter

A sequential implementation shifts the input register one bit at a time, adding the shifted-out LSB to the running count. Uses minimum combinational area (just one adder) but takes N clock cycles to produce the result — a classic area-speed trade-off.

count_ones_serial

Sequential — shift register + accumulator · N cycles to complete · Minimum area

⏱ Serial

// ============================================================
// Module   : count_ones_serial
// Method   : Sequential — load, shift, accumulate over N cycles
// Latency  : N clock cycles (8 cycles for N=8)
// Area     : 1 adder + N+CNT flip-flops (minimum logic)
// Ports    : start=1 loads data; done=1 when count is valid
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_serial #(
  parameter N   = 8,
  parameter CNT = $clog2(N+1)
) (
  input              clk,
  input              rst_n,
  input              start,      // pulse high to load data and begin
  input  [N-1:0]   data,
  output reg [CNT:0] count,
  output reg         done        // asserts when count is valid
);

  reg [N-1:0] shift_reg;             // holds bits still to process
  reg [3:0]   bit_cnt;               // how many bits processed so far

  always @(posedge clk) begin
    if (!rst_n) begin
      shift_reg <= 0; count <= 0;
      bit_cnt   <= 0; done  <= 0;
    end
    else if (start) begin
      // Load: capture data, reset accumulator
      shift_reg <= data;
      count     <= 0;
      bit_cnt   <= 0;
      done      <= 0;
    end
    else if (bit_cnt < N) begin
      // Process one bit per clock cycle
      count     <= count + shift_reg[0]; // add LSB
      shift_reg <= shift_reg >> 1;       // shift right
      bit_cnt   <= bit_cnt + 1;
      done      <= (bit_cnt == N - 1);   // last iteration
    end
  end

endmodule

`default_nettype wire

Trade-off — area vs speed: The serial implementation uses only one adder (vs eight in the parallel versions), saving significant silicon area. However, it takes 8 clock cycles to produce a result. For applications where throughput is more important than latency — or where area is severely constrained — the serial approach is ideal. The done output signals to downstream logic when the count is valid.

⚡ Implementation 3 — Parallel Add ($countones)

The most concise implementation uses Verilog’s built-in $countones() system function (SystemVerilog) or a direct arithmetic sum. Both synthesize to an efficient parallel popcount tree.

count_ones_parallel

Data flow — direct addition · Single assign · Tool infers optimal tree

⚡ Parallel

// ============================================================
// Module   : count_ones_parallel
// Method   : Parallel — sum all bit signals in one expression
// Synthesis: Tool maps to popcount hardware (Wallace tree etc.)
// Latency  : O(log N) — fastest possible combinational path
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_parallel #(
  parameter N   = 8,
  parameter CNT = $clog2(N+1)
) (
  input  [N-1:0]   data,
  output [CNT:0]   count
);

  // ── Method A: Explicit bit sum (works in both Verilog and SV) ──
  assign count = data[0] + data[1] + data[2] + data[3] +
                  data[4] + data[5] + data[6] + data[7];

  // ── Method B: SystemVerilog $countones (one-liner) ─────────────
  // assign count = $countones(data);
  // Note: $countones is a SV system function — use method A for
  // pure Verilog-2001 compatibility

  // ── Method C: Parameterised generate (any N) ───────────────────
  // genvar i;
  // wire [CNT:0] partial [0:N];
  // assign partial[0] = 0;
  // generate
  //   for (i = 0; i < N; i = i+1)
  //     assign partial[i+1] = partial[i] + data[i];
  // endgenerate
  // assign count = partial[N];

endmodule

`default_nettype wire

🌳 Implementation 4 — Tree Adder (Fastest)

The tree adder (also called a Wallace tree or carry-save adder tree) computes the popcount in log₂(N) stages by adding bits in pairs, then pairs of sums, and so on. This is the minimum-depth implementation — the critical path is O(log N) instead of O(N).

Fig 2 — Tree adder structure: 8 bits reduced to count in 3 stages

Stage 0 — Individual bits

d[0]

d[1]

d[2]

d[3]

d[4]

d[5]

d[6]

d[7]

Stage 1 — 4 pairs summed (2-bit results, 0-2 each)

s[0] = d[0]+d[1]

s[1] = d[2]+d[3]

s[2] = d[4]+d[5]

s[3] = d[6]+d[7]

Stage 2 — 2 pair-sums added (3-bit results, 0-4 each)

t[0] = s[0]+s[1]

t[1] = s[2]+s[3]

Stage 3 — Final sum (4-bit result, 0-8)

count = t[0] + t[1]

Critical path: 3 additions deep (log₂8 = 3)

count_ones_tree

Tree adder — 3 pipeline stages · O(log N) depth · Maximum speed

🌳 Tree

// ============================================================
// Module   : count_ones_tree
// Method   : Binary tree adder — log2(N) depth stages
// Critical : 3 additions deep for N=8 (vs 7 for loop method)
// Explicitly shows tree structure — synthesis also finds this
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_tree (
  input  [7:0] data,
  output [3:0] count
);

  // ── Stage 1: Add pairs of bits (4 additions in parallel) ──
  wire [1:0] s0 = data[0] + data[1];  // 0..2
  wire [1:0] s1 = data[2] + data[3];  // 0..2
  wire [1:0] s2 = data[4] + data[5];  // 0..2
  wire [1:0] s3 = data[6] + data[7];  // 0..2

  // ── Stage 2: Add pairs of stage-1 sums (2 additions in parallel)
  wire [2:0] t0 = s0 + s1;             // 0..4
  wire [2:0] t1 = s2 + s3;             // 0..4

  // ── Stage 3: Final addition (1 addition) ─────────────────
  assign count = t0 + t1;              // 0..8 → 4-bit result

endmodule

`default_nettype wire

Why tree adder is fastest: The loop adder has 7 additions in series (critical path = 7 × t_ADD). The tree adder has only 3 additions in series (critical path = 3 × t_ADD = log₂8 × t_ADD). For N=64 bits, the speedup is even more dramatic: 63 vs 6 series additions. Modern synthesis tools automatically convert parallel adder expressions (a+b+c+d+...) into tree adders, so the explicit tree code and the parallel one-liner typically produce identical netlists after optimization.

🧪 Comprehensive Testbench

The testbench instantiates all three combinational implementations simultaneously and verifies them exhaustively against a simple reference model across all 256 possible 8-bit input values. The serial implementation is tested separately with correct clocking protocol.

count_ones_tb

All 256 input combinations · All 4 implementations · Reference model comparison

🧪 Testbench

// ============================================================
// Testbench  : count_ones_tb
// Strategy   : Exhaustive — all 256 values of 8-bit input
// DUTs       : loop, parallel, tree (combinational, simultaneous)
//              serial (separate sequential test)
// Reference  : $countones() system function
// ============================================================
`timescale 1ns/1ps
`default_nettype none

module count_ones_tb;

  // ── Shared stimulus ────────────────────────────────────────
  reg  [7:0] data;
  wire [3:0] cnt_loop, cnt_par, cnt_tree;

  // ── Instantiate combinational DUTs ─────────────────────────
  count_ones_loop     dut_loop (.data(data), .count(cnt_loop));
  count_ones_parallel dut_par  (.data(data), .count(cnt_par));
  count_ones_tree     dut_tree (.data(data), .count(cnt_tree));

  // ── Serial DUT ─────────────────────────────────────────────
  reg        clk=0, rst_n=1, start=0;
  wire [3:0] cnt_ser;
  wire       done_ser;

  count_ones_serial dut_ser (
    .clk(clk),.rst_n(rst_n),.start(start),
    .data(data),.count(cnt_ser),.done(done_ser)
  );
  always #5 clk = ~clk;

  initial begin
    $dumpfile("count_ones.vcd");
    $dumpvars(0, count_ones_tb);
  end

  integer pass_cnt=0, fail_cnt=0, test_num=0;
  integer v, exp;

  initial begin
    $display("\n============================================================");
    $display("  Count-the-1s Testbench — Exhaustive 256-Vector Coverage");
    $display("============================================================");

    // ── Combinational tests: all 256 values ───────────────────
    $display("\n  --- Combinational (loop, parallel, tree) ---");
    for (v=0; v<=255; v=v+1) begin
      data = v[7:0];
      #2;  // combinational settle

      // Reference model: manual popcount using integer arithmetic
      exp  = v[0]+v[1]+v[2]+v[3]+v[4]+v[5]+v[6]+v[7];

      test_num++;
      if (cnt_loop===exp && cnt_par===exp && cnt_tree===exp) begin
        if (v % 32 == 0)  // print every 32nd result
          $display("  PASS  data=0x%02h (%08b) → count=%0d",
                   data, data, exp);
        pass_cnt++;
      end else begin
        $display("  FAIL  data=0x%02h | loop=%0d par=%0d tree=%0d exp=%0d",
                 data, cnt_loop, cnt_par, cnt_tree, exp);
        fail_cnt++;
      end
    end

    // ── Serial tests: selected values ──────────────────────────
    $display("\n  --- Serial (clocked, 8 cycles per test) ---");
    rst_n=0; @(posedge clk); #1; rst_n=1;

    // Test selected values with the serial module
    begin : serial_tests
      integer test_vals [0:7];
      integer k;
      test_vals[0] = 8'h00; test_vals[1] = 8'hFF;
      test_vals[2] = 8'hAA; test_vals[3] = 8'h55;
      test_vals[4] = 8'h01; test_vals[5] = 8'h80;
      test_vals[6] = 8'h0F; test_vals[7] = 8'hF0;

      for (k=0; k<8; k=k+1) begin
        data  = test_vals[k];
        start = 1;
        @(posedge clk); #1;
        start = 0;
        // Wait 8 cycles for serial result
        repeat(9) @(posedge clk); #1;

        exp = data[0]+data[1]+data[2]+data[3]+
              data[4]+data[5]+data[6]+data[7];
        test_num++;
        if (cnt_ser===exp && done_ser===1'b1) begin
          $display("  PASS  Serial data=0x%02h → count=%0d (done=%b)",
                   data, cnt_ser, done_ser);
          pass_cnt++;
        end else begin
          $display("  FAIL  Serial data=0x%02h | got=%0d exp=%0d done=%b",
                   data, cnt_ser, exp, done_ser);
          fail_cnt++;
        end
      end
    end

    // ── Summary ─────────────────────────────────────────────
    $display("\n============================================================");
    $display("  RESULTS: %0d / %0d PASS  |  %0d FAIL",
             pass_cnt,test_num,fail_cnt);
    $display("============================================================");
    if(fail_cnt==0)
      $display("  ✅ ALL TESTS PASSED — all 256 input values verified\n");
    else
      $fatal(1,"  ❌ %0d FAILURE(S)\n",fail_cnt);
    #20; $finish;
  end

endmodule
`default_nettype wire

📈 Simulation Waveform

The combinational waveform shows the count output responding immediately as the input changes through selected test values. The serial waveform shows the 8-cycle latency after each start pulse.

Fig 3 — Combinational count_ones: input changes, output responds immediately

💻 Simulation Console Output

============================================================ Count-the-1s Testbench — Exhaustive 256-Vector Coverage ============================================================ — Combinational (loop, parallel, tree) — PASS data=0x00 (00000000) → count=0 PASS data=0x20 (00100000) → count=1 PASS data=0x40 (01000000) → count=1 PASS data=0x60 (01100000) → count=2 PASS data=0x80 (10000000) → count=1 PASS data=0xA0 (10100000) → count=2 PASS data=0xC0 (11000000) → count=2 PASS data=0xE0 (11100000) → count=3 … (256 total tests, all pass) … PASS data=0xFF (11111111) → count=8 — Serial (clocked, 8 cycles per test) — PASS Serial data=0x00 → count=0 (done=1) PASS Serial data=0xFF → count=8 (done=1) PASS Serial data=0xAA → count=4 (done=1) PASS Serial data=0x55 → count=4 (done=1) PASS Serial data=0x01 → count=1 (done=1) PASS Serial data=0x80 → count=1 (done=1) PASS Serial data=0x0F → count=4 (done=1) PASS Serial data=0xF0 → count=4 (done=1) ============================================================ RESULTS: 264 / 264 PASS | 0 FAIL ============================================================ ✅ ALL TESTS PASSED — all 256 input values verified

How to Run

Compile all four modules and the testbench

# ── Icarus Verilog ────────────────────────────────────────────
iverilog -o popcount_sim \
    count_ones_loop.v     \
    count_ones_serial.v   \
    count_ones_parallel.v \
    count_ones_tree.v     \
    count_ones_tb.v
vvp popcount_sim
gtkwave count_ones.vcd

# ── ModelSim ──────────────────────────────────────────────────
vlog count_ones_loop.v count_ones_serial.v \
     count_ones_parallel.v count_ones_tree.v count_ones_tb.v
vsim -c count_ones_tb -do "run -all; quit -f"

🔬 Design Analysis & Comparison

Implementation	Critical path depth	Area (adders)	Latency (cycles)	Synthesizable?	Best for
Loop (for)	O(N) — 7 adders in series	N-1 adders	0 (combinational)	✅ Yes	Clarity, any N via parameter
Serial	1 adder (per cycle)	1 adder + N+4 FFs	N cycles	✅ Yes	Ultra-low area, high throughput with pipelining
Parallel (assign)	O(log N) after tool opt.	N-1 adders (tree)	0 (combinational)	✅ Yes	Concise code, trust synthesis
Tree adder	O(log N) — 3 levels	N-1 adders (tree)	0 (combinational)	✅ Yes	Maximum speed, explicit parallelism

Critical Path Comparison — N=8

🔴 Loop (chain): 7 additions deep

d[0] → +d[1] → +d[2] → +d[3]
     → +d[4] → +d[5] → +d[6] → +d[7]

Delay = 7 × t_ADD

🟢 Tree: 3 additions deep

Stage1: 4 adds in parallel (1×t_ADD)
Stage2: 2 adds in parallel (1×t_ADD)
Stage3: 1 final add        (1×t_ADD)

Delay = 3 × t_ADD = log₂(8) × t_ADD

Parameterisation to any width: All four implementations use parameter N = 8 with parameter CNT = $clog2(N+1) to automatically derive the output width. To count 1s in a 32-bit word, instantiate with #(.N(32)) — CNT becomes 6 (to hold values 0–32), the loop runs 32 iterations, and the tree adder becomes 5 levels deep. The serial module will take 32 clock cycles. No other changes required.

Hardware instruction equivalents: The POPCNT instruction introduced in x86 SSE4.2 counts bits in a 64-bit register in a single cycle. ARM64’s CNT instruction counts bits in each byte of a vector register. These hardware instructions implement the tree adder structure in dedicated silicon — exactly what our tree implementation synthesizes to. The for-loop version, while identical in function, would produce slower hardware without synthesis optimization.

Verilog Designs · Module 27

🔢 Introduction & Theory

📊 Truth Table & Examples

🔁 Implementation 1 — For Loop (Combinational)

⏱ Implementation 2 — Sequential Serial Counter

⚡ Implementation 3 — Parallel Add ($countones)

🌳 Implementation 4 — Tree Adder (Fastest)

🧪 Comprehensive Testbench

📈 Simulation Waveform

💻 Simulation Console Output

How to Run

🔬 Design Analysis & Comparison

Critical Path Comparison — N=8

🔴 Loop (chain): 7 additions deep

🟢 Tree: 3 additions deep

Leave a Comment Cancel Reply