CA-03: Von Neumann Architecture — Organisation, Structure & Function — VLSI Trainers
Computer Architecture · Article 3 of 12

CA-03: Von Neumann Architecture

Computer architecture vs organisation, structure and function, the four computer functions, Von Neumann’s stored-program model, CPU internals, performance factors, and how every concept maps to modern SoC design.

🏛️Architecture vs Organisation

These two terms are often used interchangeably, but they describe fundamentally different levels of a computer system. Understanding the distinction is essential for anyone working in VLSI — it is the difference between the specification a programmer sees and the physical implementation that an engineer builds.

COMPUTER ARCHITECTURE

Definition: Attributes visible to a programmer — those with a direct impact on the logical execution of a program.

  • Instruction set (ISA) — what instructions exist
  • Number of bits for each data type (8-bit, 16-bit, 32-bit, 64-bit)
  • I/O mechanisms — how peripherals are addressed
  • Memory addressing techniques — flat, segmented, virtual
  • Whether a MUL instruction exists in the ISA

Architecture persists across many hardware generations: software written for the IBM System/370 architecture of 1970 still runs on today's IBM Z mainframes.

COMPUTER ORGANISATION

Definition: Operational units and their interconnections that realise the architectural specification — hardware details transparent to the programmer.

  • Control signals — how the control unit sequences operations
  • Interfaces between CPU and peripherals
  • Memory technology used (SRAM, DRAM, flash)
  • Whether MUL uses a dedicated multiply unit or repeated addition
  • Bus width, clocking scheme, pipeline depth

Organisation changes with every technology generation while the architecture remains stable — protecting software investment.

🔍 Worked Example — Architecture vs Organisation for MUL instruction

Architectural question: “Should this CPU support a MUL (multiply) instruction?” → This is an ISA decision. If the answer is yes, every programmer targeting this architecture can use MUL.

Organisational question: “How should MUL be implemented?” → This is a micro-architecture decision. Options:

Option A: Dedicated hardware multiplier (1-cycle latency, larger die area)
Option B: Reuse the existing adder over several cycles, e.g. shift-and-add (multi-cycle, smaller area, less power)

The programmer only sees “MUL exists and produces the correct result.” The implementation choice is invisible — but it profoundly affects performance, power, and cost. This trade-off is at the heart of every CPU micro-architecture design decision.
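To make the split concrete, here is a minimal Python sketch of the two organisations. The function names and cycle costs are assumptions for illustration: Option A is modelled as a single-cycle multiplier, and Option B as a shift-and-add loop that reuses one adder and spends one cycle per multiplier bit. Both return the same product, which is all the architecture promises; only the latency differs.

```python
def mul_dedicated(a: int, b: int) -> tuple[int, int]:
    """Option A (assumed model): dedicated multiplier, result in one cycle.
    Returns (product, cycles)."""
    return a * b, 1


def mul_shift_add(a: int, b: int, width: int = 8) -> tuple[int, int]:
    """Option B (assumed model): one adder reused for `width` cycles.

    Operands are treated as unsigned `width`-bit values.
    Returns (product, cycles).
    """
    product = 0
    cycles = 0
    for bit in range(width):
        if (b >> bit) & 1:          # multiplier bit set: add the shifted multiplicand
            product += a << bit     # one pass through the shared adder
        cycles += 1                 # one cycle per multiplier bit, add or not
    return product, cycles


# Architecturally identical: every pair of 8-bit operands gives the same product.
for a in range(256):
    for b in range(256):
        assert mul_shift_add(a, b)[0] == mul_dedicated(a, b)[0]

print(mul_dedicated(13, 11))   # (143, 1)
print(mul_shift_add(13, 11))   # (143, 8)
```

A real shift-and-add unit also needs a shift register and a small controller; the sketch only shows that the programmer-visible result is identical while the cycle count is an organisational choice.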

IBM System/370, the classic example: introduced in 1970, this architecture has been implemented in many successive hardware generations, each with a completely different organisation (technology, transistor count, clock frequency), while running the same software. Extended over the decades (ESA/390, z/Architecture) but kept backward compatible, it has survived from the punch-card mainframes of the 1970s to today's IBM Z servers. This is why architecture stability is commercially invaluable.

🔗Structure vs Function

A computer is a complex hierarchical system. At each level of the hierarchy, we describe it in terms of two aspects:

Aspect | Definition | Example at CPU level
Structure | The way in which components are related to each other: the physical and logical connections between parts | The ALU connects to registers via an internal CPU bus; the control unit connects to the ALU via control lines
Function | The operation of each individual component as part of the structure: what each part does | The ALU performs arithmetic and logic operations; the control unit sequences instruction execution; registers hold temporary data

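This structure/function duality applies at every level of the hierarchy: the complete system, the CPU inside it, the ALU inside the CPU, and the logic gates inside the ALU.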

Design principle: At each level, the designer only needs to understand the simplified, abstracted behaviour of the level below — not its internal details. An ALU designer specifies inputs and outputs; they do not need to know the transistor-level implementation of each gate. This abstraction is what makes complex systems manageable to design and verify.
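The abstraction principle can be sketched in Python as a three-level hierarchy: gates, a full adder built from gates, and a ripple-carry adder built from full adders. This is a teaching model, not how RTL is written, and all function names are hypothetical. Each level uses only the interface (function) of the level below, never its internal structure.

```python
# Level 1: gates. Only their function (truth table) matters to the level above.
def not_gate(a: int) -> int:
    return 1 - a


def and_gate(a: int, b: int) -> int:
    return a & b


def or_gate(a: int, b: int) -> int:
    return a | b


def xor_gate(a: int, b: int) -> int:
    # Structure: XOR composed from AND, OR and NOT.
    return or_gate(and_gate(a, not_gate(b)), and_gate(not_gate(a), b))


# Level 2: a full adder, structured from gates.
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    partial = xor_gate(a, b)
    total = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return total, carry_out


# Level 3: an n-bit ripple-carry adder, structured from full adders.
# It calls full_adder through its interface and never touches the gates.
def ripple_carry_add(x: int, y: int, width: int = 4) -> tuple[int, int]:
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


print(ripple_carry_add(9, 7))   # (0, 1): 9 + 7 = 16 overflows 4 bits
```

Replacing `xor_gate` with a different implementation would not change a line of code in the two levels above it, which is exactly the property that makes large designs manageable.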

⚙️The Four Computer Functions

At the top level, every computer performs exactly four functions — regardless of its technology, generation, or classification:

Figure 1 — The four fundamental computer functions
  • Data processing: the CPU performs arithmetic and logical operations on data (ADD, AND, SHIFT …). Responsible: ALU
  • Data storage: short-term in registers and RAM; long-term as files on disk or flash. Responsible: memory hierarchy
  • Data movement: I/O between the CPU and peripherals; data communications over a network. Responsible: I/O modules
  • Control: manages and orchestrates all other components in response to instructions. Responsible: Control Unit

The four fundamental computer functions. All four operate interdependently — control orchestrates the other three. Data processing is meaningless without storage to hold operands, data movement to bring data in and take results out, and control to sequence the operations correctly.

Data movement: I/O vs data communication

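Data movement has two sub-categories with an important distinction:

  • Input/output (I/O): data is exchanged with a device directly connected to the computer (keyboard, monitor, disk) through an I/O module.
  • Data communications: data is moved over longer distances, to or from a remote device, over a network.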

🖥️Top-Level Computer Structure

At the highest structural level, a computer is defined by four main components and how they connect. Everything else is a refinement of this structure.

Figure 2 — Top-level computer structure: four main components
  • CPU: Control Unit, ALU, Registers
  • Main Memory: address 0000 holds an instruction, 0001 holds data, 0002 holds an instruction (data and instructions co-resident)
  • I/O Modules: buffers and control registers, connecting to the keyboard, monitor and disk
  • System bus: address lines (which location?), data lines (what value?), control lines (read/write/IRQ)

Top-level computer structure. Three main blocks (CPU, Main Memory, I/O Modules) connect through the system bus. The key Von Neumann property is visible in Main Memory: data and instructions occupy the same address space — the CPU cannot distinguish between them by location alone.

The system bus — three component sets

Bus component | Direction | Function | Width (typical)
Address bus | CPU → memory / I/O (unidirectional) | Specifies the memory address or I/O device to access | 32-bit or 64-bit
Data bus | Bidirectional | Carries the actual data being read or written | 8 / 16 / 32 / 64-bit
Control bus | Bidirectional | Carries command signals: read, write, interrupt request (IRQ), bus grant, clock | Varies by design
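One bus transaction can be sketched in Python as "control says read or write, address selects a target, data carries the value". The memory map below is hypothetical, and it assumes memory-mapped I/O (I/O registers share the address space with memory), which is only one of the I/O addressing mechanisms an architecture may choose.

```python
MEM_SIZE = 0x100   # hypothetical map: 0x000-0x0FF decode to main memory
IO_BASE = 0x100    # hypothetical map: 0x100-0x10F decode to an I/O module
IO_SIZE = 0x10


class SystemBus:
    """Toy system bus. One call to cycle() is one bus transaction:
    control lines say READ or WRITE, address lines select the location,
    data lines carry the value."""

    def __init__(self) -> None:
        self.memory = [0] * MEM_SIZE
        self.io_registers = [0] * IO_SIZE
        self.trace: list[tuple[str, int, int]] = []   # (control, address, data)

    def _decode(self, address: int) -> tuple[list[int], int]:
        # Address decoding: pick the one target that responds to this address.
        if 0 <= address < MEM_SIZE:
            return self.memory, address
        if IO_BASE <= address < IO_BASE + IO_SIZE:
            return self.io_registers, address - IO_BASE
        raise ValueError(f"bus error: nothing mapped at {address:#x}")

    def cycle(self, control: str, address: int, data: int = 0) -> int:
        target, offset = self._decode(address)
        if control == "WRITE":
            target[offset] = data        # data flows CPU -> target
        elif control == "READ":
            data = target[offset]        # data flows target -> CPU
        else:
            raise ValueError(f"unknown control signal: {control}")
        self.trace.append((control, address, data))
        return data


bus = SystemBus()
bus.cycle("WRITE", 0x010, 42)      # store to main memory
bus.cycle("WRITE", 0x101, 0x55)    # same bus, but this address selects an I/O register
print(bus.cycle("READ", 0x010))    # 42
```

The CPU issues the same kind of transaction in both writes; only the address decoder decides whether memory or the I/O module responds.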

💡Von Neumann Architecture — Three Key Concepts

John von Neumann’s architecture, described in his 1945 EDVAC report, is built on three fundamental concepts that define virtually all computers built since:

Figure 3 — Von Neumann architecture: three defining concepts
  ① Shared memory: data and instructions are stored in the same read-write memory. Any address can hold either data or code; RAM is uniform, so loading a program is just a series of memory writes.
  ② Sequential execution: instructions execute one after another in memory order. The PC increments after each instruction fetch; jump/branch instructions alter this default flow.
  ③ Basic logic operations: a small set of binary logic components (AND, OR, NOT) can be combined to build any computation, including all arithmetic. This is why the ALU is a universal computing engine.

The three Von Neumann principles. Modern general-purpose processors, from small embedded cores to 64-core server processors, build on these three ideas; pure Harvard designs such as 8-bit AVR microcontrollers, which keep program and data in separate memories, relax the first one. The shared memory concept enables software to be loaded and changed without rewiring; sequential execution provides a default program flow; basic logic operations make the ALU a universal computing engine.

Why shared memory was revolutionary: Before von Neumann, changing a program meant physically re-wiring the machine (ENIAC). With shared memory, a program is just data in memory. To run a different program, you load different data. This made software possible as an independent artifact separate from hardware — the foundation of the entire software industry.
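The three ideas can be seen together in a tiny fetch-decode-execute loop. This is a minimal Python sketch with a made-up instruction encoding (word = opcode × 100 + address) and a hypothetical five-instruction set; it is not any real ISA. Code and data live in one list, the PC advances by one after each fetch, and a jump overrides that default.

```python
# Hypothetical encoding for this sketch: word = opcode * 100 + address
HALT, LOAD, ADD, STORE, JUMP = 0, 1, 2, 3, 4


def run(memory: list[int], max_steps: int = 1000) -> tuple[int, int]:
    """Fetch-decode-execute over ONE memory that holds code and data.
    Returns (accumulator, instructions executed)."""
    pc = 0
    acc = 0
    for executed in range(1, max_steps + 1):
        word = memory[pc]                  # fetch from the shared memory
        pc += 1                            # default: sequential execution
        opcode, addr = divmod(word, 100)   # decode
        if opcode == HALT:
            return acc, executed
        if opcode == LOAD:
            acc = memory[addr]
        elif opcode == ADD:
            acc += memory[addr]
        elif opcode == STORE:
            memory[addr] = acc
        elif opcode == JUMP:
            pc = addr                      # a jump overrides the incremented PC
        else:
            raise ValueError(f"illegal opcode {opcode} at address {pc - 1}")
    raise RuntimeError("step limit reached")


# "Loading a program" is just writing words into memory.
program = [
    106,   # 0: LOAD 6   acc = mem[6]
    207,   # 1: ADD 7    acc += mem[7]
    404,   # 2: JUMP 4   skip the word at address 3
    999,   # 3: never executed; the jump skips it
    308,   # 4: STORE 8  mem[8] = acc
    0,     # 5: HALT
    20,    # 6: data
    22,    # 7: data
    0,     # 8: result
]
print(run(program), program[8])   # (42, 5) 42
```

Nothing in memory marks a word as "code" or "data": the word 20 at address 6 is data only because the PC never reaches it.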

🔩CPU Internal Structure

Inside the CPU, the same structure/function duality applies at a lower level. The CPU’s four structural components are:

Figure 4 — CPU internal structure: Control Unit, ALU, Registers, and Internal Bus
  • Control Unit (CU): interprets instructions, generates control signals and sequences all CPU operations; uses IR, PC, MAR and MBR, and controls memory via MAR/MBR
  • ALU: add, subtract, multiply, divide; AND, OR, NOT, XOR, shift; compares operands and sets flags. Outputs a result plus N, Z, C, V flags; operands come from registers
  • Internal CPU bus (address, data, control) linking the CU, ALU and register file
  • Register file: R0–R15, PC, SP, FLAGS, ACC/IR. Fastest on-chip storage (~1-cycle access); operands go to the ALU, results come back
  • System bus interface: address bus (driven from MAR), data bus (exchanged through MBR), control bus (driven by the CU), leading to main memory and I/O

CPU internal structure. The horizontal Internal CPU Bus connects all three components — Control Unit (top-left), ALU (top-right), and Register File (bottom). All operands flow from registers up to the ALU; results flow back down. The System Bus Interface (right) connects the internal bus to the external world: MAR drives the address bus, MBR exchanges data, and the CU drives control signals to reach Main Memory and I/O.

Component | Interaction with other components | Primary function
Control Unit (CU) | Drives all CPU components via control signals | Interprets instructions; generates the timing and control signals that direct every operation
ALU | Operated by the CU; reads and writes registers | Performs all arithmetic (+, −, ×, ÷) and logical (AND, OR, NOT, XOR, shift) operations
Registers | Accessed by the CU and ALU | Fastest on-chip storage; hold current operands, results, PC, stack pointer and flags
Internal CPU bus | Carries data between the CU, ALU and registers | High-speed interconnect inside the CPU boundary, separate from the external system bus
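How an ALU produces a result plus the N, Z, C, V flags can be sketched in Python for an assumed 8-bit datapath. Two conventions here are assumptions, not taken from the text: subtraction is computed as a + NOT(b) + 1 with C = 1 meaning "no borrow" (the ARM-style convention), and logical operations simply clear C and V. A compare instruction would be a SUB whose result is discarded and only the flags kept.

```python
WIDTH = 8                      # assumed 8-bit datapath for the sketch
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def add_with_carry(a: int, b: int, carry_in: int) -> tuple[int, bool, bool]:
    raw = a + b + carry_in
    result = raw & MASK
    carry = raw > MASK                                  # carry out of the top bit
    overflow = bool(~(a ^ b) & (a ^ result) & SIGN)     # same-sign inputs, different-sign result
    return result, carry, overflow


def alu(op: str, a: int, b: int) -> tuple[int, dict[str, bool]]:
    a, b = a & MASK, b & MASK
    if op == "ADD":
        result, c, v = add_with_carry(a, b, 0)
    elif op == "SUB":
        # a - b = a + NOT(b) + 1; C = 1 means "no borrow" (assumed convention)
        result, c, v = add_with_carry(a, ~b & MASK, 1)
    elif op == "AND":
        result, c, v = a & b, False, False   # assumed: logic ops clear C and V
    elif op == "OR":
        result, c, v = a | b, False, False
    elif op == "XOR":
        result, c, v = a ^ b, False, False
    else:
        raise ValueError(f"unsupported op: {op}")
    flags = {"N": bool(result & SIGN), "Z": result == 0, "C": c, "V": v}
    return result, flags


print(alu("ADD", 0x7F, 0x01))   # 127 + 1 overflows signed 8-bit: N and V set
print(alu("SUB", 5, 5))         # equal operands: Z set, C set (no borrow)
```

Note that the subtractor is the same adder with an inverted operand and a carry-in of 1, which is one reason a single adder sits at the heart of most ALUs.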

🔀Harvard vs Von Neumann Architecture

The Von Neumann architecture’s single shared memory for data and instructions creates a fundamental bottleneck: the CPU can only do one thing with the bus at a time — fetch an instruction or read/write data. This is the Von Neumann bottleneck.

Figure 5 — Von Neumann vs Harvard architecture
  • Von Neumann: the CPU connects over a single bus to one shared memory holding instructions and data in the same address space. Bottleneck: the CPU must alternate between fetching instructions and accessing data over the same bus and memory.
  • Harvard: the CPU has an instruction bus to a separate instruction memory and a data bus to a separate data memory, so it can fetch an instruction and access data simultaneously. Used in DSPs, microcontrollers and CPU caches (modified Harvard).

Von Neumann (left): single shared memory and bus for both instructions and data — simple but creates a bottleneck. Harvard (right): separate instruction and data memories with separate buses — eliminates the bottleneck, enables simultaneous fetch+access. Modern CPUs use a “modified Harvard” approach: separate L1 instruction and data caches backed by unified L2/L3.

Modern processors use Modified Harvard: internally, the L1 instruction cache and L1 data cache are separate (Harvard-style, so fetches and data accesses can proceed simultaneously), but the L2 cache is unified and main memory presents a single address space (Von Neumann-style). This gives the speed advantage of Harvard at the first level while retaining the programming simplicity of Von Neumann at the system level. Most cached ARM, x86 and RISC-V processors use this hybrid.
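The cost of the bottleneck can be estimated with an idealised count of memory cycles. This Python sketch assumes every instruction fetch and every data access takes exactly one memory cycle, with no caches or stalls, and that a Harvard machine can overlap the two perfectly; the workload numbers are hypothetical.

```python
def bus_cycles(instructions: int, data_accesses: int, harvard: bool) -> int:
    """Idealised memory-cycle count (assumed: one cycle per fetch and per
    data access, no caches, no stalls)."""
    if harvard:
        # Separate instruction and data paths can both work in the same cycle.
        return max(instructions, data_accesses)
    # One shared bus: fetches and data accesses must take turns.
    return instructions + data_accesses


# Hypothetical workload: 1000 instructions, 400 of which touch data memory.
vn = bus_cycles(1000, 400, harvard=False)
hv = bus_cycles(1000, 400, harvard=True)
print(vn, hv, vn / hv)   # 1400 1000 1.4
```

With 40% of instructions accessing data, the shared bus needs 40% more memory cycles in this model; real speed-ups depend on caches, pipelining and access patterns.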

📈Performance Factors

Computer performance is not determined by a single factor; it depends on several interacting parts of the system. Understanding these factors is essential for both computer architecture and SoC design.

Figure 6 — Factors that determine CPU and system performance
  • Clock rate: higher frequency = more steps per second
  • CPI (clocks per instruction): lower CPI = more efficient execution
  • Instruction count: fewer instructions = less work
  • Memory speed: slow RAM is a bottleneck; caches reduce the average access time
  • Bus width: a wider bus moves more data per cycle
  • I/O technique: interrupt-driven I/O and DMA vs polling

Six factors that determine system performance. No single factor tells the whole story: a fast CPU is bottlenecked by slow memory, and a wide bus helps little if the CPU has a high CPI. Performance optimisation must consider all six together.
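Why a fast CPU is held back by slow memory can be sketched with the common textbook memory-stall model, CPI = base CPI + memory references per instruction × miss rate × miss penalty. This Python sketch assumes main-memory latency is fixed in nanoseconds, so a faster clock pays more cycles per miss; every number below is hypothetical.

```python
def cpu_time_s(ic, base_cpi, mem_refs_per_instr, miss_rate, mem_latency_ns, clock_hz):
    """Simplified memory-stall model:
    CPI = base CPI + refs/instr x miss rate x miss penalty (in cycles)."""
    cycle_ns = 1e9 / clock_hz
    # Assumed: memory latency is fixed in ns, so a faster clock waits more cycles per miss.
    miss_penalty_cycles = mem_latency_ns / cycle_ns
    cpi = base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles
    return ic * cpi / clock_hz


# Hypothetical workload and memory system
params = dict(ic=1e9, base_cpi=1.0, mem_refs_per_instr=1.3, miss_rate=0.02, mem_latency_ns=100)
t_1ghz = cpu_time_s(**params, clock_hz=1e9)
t_2ghz = cpu_time_s(**params, clock_hz=2e9)
print(f"{t_1ghz:.2f} s -> {t_2ghz:.2f} s, speed-up {t_1ghz / t_2ghz:.2f}x")
# 3.60 s -> 3.10 s, speed-up 1.16x
```

Doubling the clock rate buys only about 16% here, because the miss penalty in cycles doubles too; lowering the miss rate would attack the real bottleneck.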

High performance vs low cost: bus design trade-offs

Design decision | High-performance option | Low-cost option
Bus type | Separate address and data buses (parallel transfers) | Multiplexed address/data bus (fewer pins)
Data bus width | Wider bus (64-bit, more data per transfer) | Narrower bus (low pin count, cheaper package)
Bus masters | Multiple masters with arbitration (parallelism) | Single master (no arbitration logic needed)
Transfer size | Multiple words per transfer (burst mode) | Single word per transfer (simpler protocol)
Clocking | Synchronous (faster, predictable timing) | Asynchronous (works with devices of any speed)
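The first rows of the table can be compared with an idealised cycle count for moving a block of data. This Python sketch assumes one cycle per address phase and one per data beat, no wait states, and that separate buses let the next address overlap the current data beat; the block size and widths are hypothetical.

```python
def transfer_cycles(n_bytes: int, bus_width_bytes: int, multiplexed: bool, burst: bool) -> int:
    """Idealised cycles to move n_bytes over a bus.
    Assumed timing: address phase = 1 cycle, data beat = 1 cycle, no wait states."""
    beats = -(-n_bytes // bus_width_bytes)   # ceiling division: data-bus transfers needed
    if burst:
        return 1 + beats                     # one address phase, then back-to-back data beats
    if multiplexed:
        return 2 * beats                     # shared wires: address then data for every word
    return 1 + beats                         # separate buses: next address overlaps current data


# Moving a 64-byte block under the two extreme corners of the table
print(transfer_cycles(64, 4, multiplexed=True, burst=False))   # 32: low-cost corner
print(transfer_cycles(64, 8, multiplexed=False, burst=True))   # 9: high-performance corner
```

Each high-performance option saves cycles on its own; combining a wider bus with burst transfers is what closes most of the gap, at the cost of more pins and more complex interface logic.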

🧮CPU Performance Equation

CPU performance can be precisely characterised by three measurable quantities. Their relationship is captured in the CPU Performance Equation:

🔍 CPU Performance Equation

CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

Or equivalently: CPU Time = (IC × CPI) / Clock Rate

Where:

IC = Instruction Count — total instructions executed (determined by algorithm + ISA + compiler)
CPI = Cycles Per Instruction — average clock cycles per instruction (determined by CPU micro-architecture)
Clock Cycle Time = 1 / Clock Rate (Hz) — duration of one clock tick (determined by silicon technology + logic depth)

Example: A program executes 10 million instructions. CPI = 4. Clock rate = 1 GHz.
CPU Time = (10×10⁶ × 4) / (1×10⁹) = 40×10⁶ / 10⁹ = 0.04 seconds = 40 ms

To improve performance, reduce any of the three factors:
— Reduce IC: better compiler, better algorithm, richer ISA (CISC)
— Reduce CPI: superscalar execution, out-of-order, better pipeline, larger cache
— Reduce clock cycle time: advanced process node, deeper pipeline (less logic per stage, though hazards then cost more cycles), faster cells
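The worked example and the three levers can be checked with a few lines of Python. The baseline reproduces the 40 ms figure above; the 25% improvements applied to each lever are hypothetical and chosen only to compare the levers side by side.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


base = cpu_time(10e6, 4, 1e9)
print(f"baseline: {base * 1e3:.1f} ms")   # 40.0 ms, as in the worked example

# Each lever applied on its own (hypothetical 25% changes)
options = {
    "IC x 0.75 (better compiler)": cpu_time(7.5e6, 4, 1e9),
    "CPI 4 -> 3 (better pipeline)": cpu_time(10e6, 3, 1e9),
    "clock 1 -> 1.25 GHz": cpu_time(10e6, 4, 1.25e9),
}
for name, t in options.items():
    print(f"{name}: {t * 1e3:.1f} ms, speed-up {base / t:.2f}x")
```

Because the three terms multiply, cutting IC or CPI by 25% gives a 1.33x speed-up, while raising the clock by 25% gives 1.25x; in practice the levers also interact (a faster clock often raises CPI).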

MIPS — a misleading metric: MIPS (Millions of Instructions Per Second) is only meaningful when comparing machines with the same ISA. A RISC machine executing simple instructions at high MIPS can be slower than a CISC machine executing complex instructions at lower MIPS — because the CISC machine accomplishes more per instruction. Always use execution time as the definitive performance metric.
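The MIPS trap can be reproduced with two hypothetical machines running the same task. The instruction counts, CPIs and clock rates below are invented for illustration: the RISC machine needs three times as many instructions but executes them faster.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    instruction_count: float   # instructions this machine needs for the SAME task
    cpi: float
    clock_hz: float

    def time_s(self) -> float:
        return self.instruction_count * self.cpi / self.clock_hz

    def mips(self) -> float:
        # MIPS = IC / (time x 10^6) = clock rate / (CPI x 10^6)
        return self.clock_hz / (self.cpi * 1e6)


# Hypothetical machines running the same program
risc = Machine("RISC", instruction_count=150e6, cpi=1.0, clock_hz=2.0e9)
cisc = Machine("CISC", instruction_count=50e6, cpi=2.0, clock_hz=1.5e9)

for m in (risc, cisc):
    print(f"{m.name}: {m.mips():.0f} MIPS, {m.time_s() * 1e3:.1f} ms")
```

With these assumed numbers the RISC machine reports 2000 MIPS against 750, yet takes 75.0 ms against 66.7 ms. MIPS ignores instruction count, which is exactly the term that differs between ISAs.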

🔬VLSI Connections

🔬 Architecture vs Organisation in your daily VLSI work

When you join a chip design team, this distinction is immediately practical. The architecture team defines the ISA: what instructions exist, how many registers, what addressing modes, what memory model. This becomes the specification your RTL team implements. The micro-architecture team decides how to implement it: pipeline depth, cache sizes, issue width, branch predictor design. You might be verifying that the micro-architectural implementation correctly realises the architectural specification. A bug where a multiply instruction produces the wrong result is an organisational failure (a wrong RTL implementation), not an architectural one. SVA (SystemVerilog Assertions) is one widely used language for expressing properties the implementation must satisfy; assertions are checked in simulation or proven with formal tools.

🔬 The system bus → AXI, APB, AHB interconnect

The system bus concept (address bus, data bus and control bus connecting CPU, memory and I/O) is implemented in modern SoCs as the AMBA bus family. AXI4 (Advanced eXtensible Interface) is the high-performance bus connecting the CPU cores to caches and memory controllers. AHB (Advanced High-performance Bus) typically connects mid-speed components, and APB (Advanced Peripheral Bus) connects low-speed peripherals (GPIO, UART, SPI). AXI4's five independent channels (read address, read data, write address, write data, write response) descend directly from the separate address and data buses in the high-performance column of the bus design trade-off table above. Understanding the conceptual system bus is the prerequisite for understanding the AXI4 VALID/READY handshake, which every SoC verification engineer works with daily.
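The core of AXI4 handshaking is that a transfer completes on a clock edge where both VALID (from the source) and READY (from the destination) are high. This Python sketch models only that rule on a single channel; the function name and the READY pattern are hypothetical, and IDs, bursts, reset behaviour and the other four channels are left out.

```python
def simulate_channel(payloads: list[str], ready: list[bool]) -> list[tuple[int, str]]:
    """Cycle-level sketch of one VALID/READY channel.

    payloads: items the source wants to send, in order.
    ready:    READY level driven by the destination on each clock cycle.
    Returns (cycle, item) for every completed handshake.
    """
    pending = list(payloads)
    transfers = []
    for cycle, dest_ready in enumerate(ready):
        # The source asserts VALID whenever it has data (it must not wait for READY)
        # and holds VALID and the data stable until the transfer is accepted.
        valid = bool(pending)
        if valid and dest_ready:            # transfer only when VALID and READY are both high
            transfers.append((cycle, pending.pop(0)))
    return transfers


# Destination stalls (READY low) in cycles 0 and 3
print(simulate_channel(["A", "B", "C"], [False, True, True, False, True, True]))
# [(1, 'A'), (2, 'B'), (4, 'C')]
```

Back-pressure falls out naturally: when READY is low, the item simply waits, and nothing is lost or duplicated.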

🔬 CPU performance equation → micro-architecture design decisions

The CPU performance equation (IC × CPI × Clock Cycle Time) drives every micro-architectural design decision. Lowering CPI typically means wider issue (superscalar), better branch prediction or lower cache miss rates, and each of these costs significant area and power. Shortening the clock cycle requires either a process-technology shrink or a deeper pipeline with less logic per stage, and a deeper pipeline increases CPI because hazards and mispredicted branches cost more cycles. These three-way trade-offs (IC, CPI, frequency) explain why processor design involves simultaneous optimisation of dozens of micro-architectural parameters. When you run simulations comparing RTL implementations on the same workload, you are measuring cycle counts and hence CPI. When you run static timing analysis, you are measuring the achievable clock cycle time. The performance equation is the quantitative framework that connects all three metrics.
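The tension between clock cycle time and CPI can be explored with a toy pipeline model in Python. Every parameter is hypothetical: total logic delay is split evenly across stages plus a fixed per-stage register overhead, and each hazard is assumed to flush (stages − 1) cycles of work. The model is meant to show the shape of the trade-off, not to predict any real design.

```python
def time_per_instr_ns(stages: int, logic_ns: float = 10.0, latch_ns: float = 0.5,
                      hazard_rate: float = 0.04) -> float:
    """Toy pipeline model (all parameters hypothetical).

    Cycle time: logic delay split across stages plus per-stage register overhead.
    CPI: 1, plus a hazard penalty that grows with depth, since each hazard
    (e.g. a mispredicted branch) is assumed to flush (stages - 1) cycles.
    """
    cycle_ns = logic_ns / stages + latch_ns
    cpi = 1 + hazard_rate * (stages - 1)
    return cpi * cycle_ns


best = min(range(1, 61), key=time_per_instr_ns)
for s in (1, 5, best, 40):
    print(f"{s:2d} stages: {time_per_instr_ns(s):.3f} ns/instr")
```

With these numbers the optimum lands at 22 stages: beyond it, the rising CPI outweighs the shrinking cycle time, which is the same trade-off that pushes real designs away from extremely deep pipelines.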

Summary — CA-03 key points: Architecture = programmer-visible attributes (ISA, data types, addressing); Organisation = implementation details (control signals, memory technology, bus width). Structure = how components connect; Function = what each component does. Four computer functions: data processing, data storage, data movement (I/O + communications), control. Top-level structure: CPU + Main Memory + I/O + System Bus. Von Neumann architecture: (1) shared memory for data and instructions, (2) sequential execution, (3) basic logic operations. CPU structure: Control Unit + ALU + Registers + Internal bus. Harvard architecture separates instruction and data memories for simultaneous access — modern CPUs use a hybrid (modified Harvard L1 cache, unified Von Neumann address space). CPU performance = IC × CPI × Clock Cycle Time.