Tuesday, June 26, 2007

Background: I/O Subsystem Bottlenecks

Background: I/O Subsystem Bottlenecks

New I/O buses are typically developed in response to changing system requirements and to promote lower cost implementations. Current-generation I/O buses such as PCI are rapidly falling behind the capabilities of other system components such as processors and memory. Some of the reasons why the I/O bottlenecks are becoming more apparent are described below.

Server Or Desktop Computer: Three Subsystems

A server or desktop computer system is comprised of three major subsystems:

  1. Processor (in servers, there may be more than one)

  2. Main DRAM Memory. There are a number of different synchronous DRAM types, including SDRAM, DDR, and Rambus.

  3. I/O (Input/Output devices). Generally, all components which are not processors or DRAM are lumped together in this subsystem group. This would include such things as graphics, mass storage, legacy hardware, and the buses required to support them: PCI, PCI-X, AGP, USB, IDE, etc.

CPU Speed Makes Other Subsystems Appear Slow

Because of improvements in CPU internal execution speed, processors are more demanding than ever when they access external resources such as memory and I/O. Each external read or write by the processor represents a huge performance hit compared to internal execution.

Multiple CPUs Aggravate The Problem

In systems with multiple CPUs, such as servers, the problem of accessing external devices becomes worse because of competition for access to system DRAM and the single set of I/O resources.

DRAM Memory Keeps Up Fairly Well

Although it is external to the processor(s), system DRAM memory keeps up fairly well with the increasing demands of CPUs for a couple of reasons. First, the performance penalty for accessing external memory is mitigated by the use of internal processor caches. Modern processors generally implement multiple levels of internal caches that run at the full CPU clock rate and are tuned for high "hit rates". Each fetch from an internal cache eliminates the need for an external bus cycle to memory.

In addition, in cases where an external memory fetch is required, DRAM technology and the use of synchronous bus interfaces to it (e.g. DDR, RAMBUS, etc.) have allowed it to maintain bandwidths comparable with the processor external bus rates.

I/O Bandwidth Has Not Kept Pace

While the processor internal speed has raced forward, and memory access speed has managed to follow along reasonably well with the help of caches, I/O subsystem evolution has not kept up.

This Slows Down The Processor

Although external DRAM accesses by processors can be minimized through the use of internal caches, there is no way to avoid external bus operations when accessing I/O devices. The processor must perform small, inefficient external transactions which then must find their way through the I/O subsystem to the bus hosting the device.

It Also Hurts Fast Peripherals

Similarly, bus master I/O devices using PCI or other subsystem buses to reach main memory are also hindered by the lack of bandwidth. Some modern peripheral devices (e.g. SCSI and IDE hard drives) are capable of running much faster than the busses they live on. This represents another system bottleneck. This is a particular problem in cases where applications are running that emphasize time-critical movement of data through the I/O subsystem over CPU processing.

Reducing I/O Bottlenecks

Two important schemes have been used to connect I/O devices to main memory. The first is the shared bus approach, as used in PCI and PCI-X. The second involves point-to-point component interconnects, and includes some proprietary busses as well as open architectures such as HyperTransport. These are described here, along with the advantages and disadvantages of each.

The Shared Bus Approach

Figure 1-1 on page 12 depicts the common "North-South" bridge PCI implementation. Note that the PCI bus acts as both an "add-in" bus for user peripheral cards and as an interconnect bus to memory for all devices residing on or below it. Even traffic to and from the USB and IDE controllers integrated in the South Bridge must cross the PCI bus to reach main memory.

No comments: