Tuesday, June 26, 2007

ISA/LPC Buses

The ISA and LPC buses typically reside on the PCI bus. These buses support bus mastering and legacy DMA transfers. These devices differ from HT and PCI devices in that they support neither split transactions nor retries.

Deadlocks
The specification defines two possible deadlock conditions that can occur because the ISA and LPC (Low Pin-Count) buses do not support transaction retry. For example, if an ISA (LPC) Master initiates a transaction that requires a response, the bus cannot handle a new request prior to the current transaction having completed. This type of protocol is extremely simple from an ordering perspective because all transactions must complete before the next one begins; thus, no ordering rules are required. Of course the downside to this approach is that all other devices are stalled while they wait for the current transaction to complete. Delayed transactions supported by the PCI bus and split transactions supported by PCI-X and HyperTransport can handle new transactions while a response to a previous transaction is pending. The price — complex ordering rules to ensure that transactions complete in the intended order.

Deadlock Scenario 1

Consider the following sequence of events as they relate to the limitations of the ISA/LPC bus as discussed above and to the PCI-based Producer/Consumer transaction ordering model.

  1. An ISA/LPC Master initiates a transaction that requires a response from the Host-to-HT Bridge (e.g., a memory read from main memory).

  2. The CPU initiates a write operation targeting a device on the ISA/LPC bus, and the Host Bridge issues this write as a posted operation.

  3. The posted write reaches the HT-to-PCI bridge, where it is sent across the PCI bus to the south bridge.

  4. The south bridge cannot accept the write targeting the ISA bus because the ISA/LPC bus is waiting for the outstanding response. So, the south bridge issues a retry.

  5. The read response reaches the HT/PCI bridge. However, the Producer/Consumer model requires that all previously posted writes headed to the PCI bus be completed before a read response is sent. The read response is now stuck behind a posted write that cannot complete until the read response is delivered. Result: Deadlock!


The recommended solution to this problem is to require that all requests targeting the ISA/LPC bus be non-posted operations. This eliminates the problem because non-posted operations can be forwarded to the PCI bus in any order.

Deadlock Scenario 2

Once again, because the ISA or LPC bus cannot accept any requests while it waits for a response to its own request, a deadlock is possible. This deadlock can occur when the downstream non-posted request channel fills up while awaiting a response to an ISA DMA request. The sequence of events is as follows:

  1. A DMA request is issued by an ISA/LPC device to main memory.

  2. Downstream requests targeting the ISA bus are initiated but stack up because the south bridge will not accept them while it waits for a response to the previously issued DMA request. Consequently, it is possible for the downstream non-posted request channel to fill.

  3. A peer-to-peer operation is initiated to a device on the same chain and sits in the non-posted request queue ahead of the ISA/LPC request from step 1. This peer-to-peer transaction is sent to the Host, which attempts to reflect the transaction downstream to the target device. However, because the downstream request channel is full, the upstream non-posted peer request stalls, as does the request from the ISA bus behind it. This prevents the ISA/LPC bridge from making forward progress.

The solution to this deadlock is for the host to limit the number of requests it makes to the ISA/LPC bus to a known number (typically one) that the bridge can accept. Because the host cannot limit peer requests without eventually blocking the upstream non-posted channel (and causing another deadlock), no peer requests to the ISA/LPC bus are allowed. Peer requests to devices below the ISA/LPC bridge on the chain (including other devices in the same node as the ISA/LPC bridge) cannot be performed without deadlock unless the ISA/LPC bridge sinks the above-mentioned known number of requests without blocking requests forwarded down the chain. This can be implemented with a buffer (or set of buffers) for requests targeting the bridge, separate from the buffering for other requests.
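
The buffering arrangement described above can be sketched in C. The structure below is purely illustrative (the queue depth, field names, and the accept routine are assumptions, not taken from the specification); it shows a reserved slot for host requests that target the ISA/LPC bridge, kept apart from the queue of requests forwarded down the chain, so a stalled ISA/LPC target never blocks forwarded traffic.

    #include <stdbool.h>
    #include <stddef.h>

    #define CHAIN_QUEUE_DEPTH 8        /* hypothetical depth for forwarded requests */

    struct request { unsigned unit_id; bool targets_isa_lpc; };

    struct south_bridge_buffers {
        struct request isa_slot;       /* reserved buffer for the one outstanding   */
        bool           isa_slot_busy;  /* host request aimed at the ISA/LPC bridge  */
        struct request chain_queue[CHAIN_QUEUE_DEPTH];
        size_t         chain_count;    /* requests forwarded down the chain         */
    };

    /* Accept a downstream request without letting one path block the other. */
    static bool accept_downstream(struct south_bridge_buffers *b,
                                  const struct request *req)
    {
        if (req->targets_isa_lpc) {
            if (b->isa_slot_busy)
                return false;          /* host must not issue more than this        */
            b->isa_slot = *req;
            b->isa_slot_busy = true;
            return true;
        }
        if (b->chain_count == CHAIN_QUEUE_DEPTH)
            return false;              /* chain path full; ISA slot unaffected      */
        b->chain_queue[b->chain_count++] = *req;
        return true;
    }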

AGP Bus Issues

AGP Configuration Space Requirements

Some legacy operating systems require that the AGP capability registers be mapped at Bus 0, Device 0, Function 0. Also, the AGP aperture base address configuration register must be at Bus 0, Device 0, Function 0, Offset 10h. In a legacy system, these registers are located within the Host-to-PCI bridge configuration space (the Host-to-HT bridge in our example).


For complete legacy software support, the specification recommends that the AGP subsystem be designed as follows:

  • AGP bridges are placed logically on HyperTransport chain 0 (Bus 0).

  • The AGP interface uses multiple UnitIDs due to AGP configuration being split between the Host to HT bridge and the Host to AGP bridge (i.e., virtual PCI to PCI bridge).

  • During initialization the base UnitID of an AGP device must be assigned a non-zero value to support configuration of chain 0. Following HT initialization the base UnitID should be changed to zero.

  • Device number zero, derived from the base UnitID register value, should contain the capabilities header and the AGP aperture base address register (at Offset 10h).

  • Device number 1, derived from the base UnitID+1, should be used for the Host to AGP bridge.

  • The UnitID that matches the base (0) is not used for any AGP-initiated I/O streams or responses so that there is no conflict with host-initiated I/O streams or responses. Only UnitIDs greater than the base may be used for I/O streams.

  • Legacy implementations place the AGP graphics address remapping table (GART) in the host. Thus, the AGP aperture base address register and any other registers that are located in the AGP device but required by the host are copied by software into implementation-specific host registers. These implementation-specific registers should be placed somewhere other than Device 0, to avoid conflicts with other predefined AGP registers. In a sharing double-hosted chain, this requires the hosts to implement the Device Number field so that the hosts may address each other after the AGP bridge has assumed Device 0.

Note that if legacy OS support is not required, the AGP device's base UnitID register may be programmed to any permissible value.
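
A rough sketch of the two-step UnitID assignment described above is shown below in C. The configuration-access helpers, the register offset, and the device numbers are placeholders (only the 5-bit Base UnitID field in the slave interface Command register is assumed); real firmware would use the platform's own configuration mechanism and capability-walking code.

    #include <stdint.h>

    /* Hypothetical config-space accessors; a real implementation would use the
     * platform's configuration mechanism and locate the HT slave capability block. */
    extern uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off);
    extern void     cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t off,
                                uint16_t val);

    #define HT_SLAVE_CMD_OFF   0x44       /* placeholder offset of the Slave Command register */
    #define BASE_UNITID_MASK   0x001Fu    /* Base UnitID field (assumed low 5 bits)           */

    /* During chain 0 enumeration: give the AGP device a non-zero base UnitID so the
     * chain can be configured normally. */
    static void agp_assign_temp_unitid(uint8_t dev, uint16_t temp_id)
    {
        uint16_t cmd = cfg_read16(0, dev, 0, HT_SLAVE_CMD_OFF);
        cmd = (uint16_t)((cmd & ~BASE_UNITID_MASK) | (temp_id & BASE_UNITID_MASK));
        cfg_write16(0, dev, 0, HT_SLAVE_CMD_OFF, cmd);
    }

    /* After HT initialization: move the base UnitID to zero so the AGP capability
     * block and aperture BAR appear at Bus 0, Device 0 for legacy operating systems.
     * Note the device renumbers itself, so subsequent accesses use Device 0. */
    static void agp_move_unitid_to_zero(uint8_t dev)
    {
        uint16_t cmd = cfg_read16(0, dev, 0, HT_SLAVE_CMD_OFF);
        cmd = (uint16_t)(cmd & ~BASE_UNITID_MASK);
        cfg_write16(0, dev, 0, HT_SLAVE_CMD_OFF, cmd);
    }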

AGP Ordering Requirements

Three categories of AGP transaction types lead to three separate sets of ordering rules. These categories can be thought of as three separate transaction channels. These three channels are completely independent of each other with respect to ordering, and should have their own UnitIDs. The transaction types are:

  • PCI-based

  • Low Priority

  • High Priority

The specification makes the following observation, which leads to HT-based AGP ordering requirements being slightly less complex than the PCI-based requirements:

The ordering rules presented here for reads are somewhat different from what appears in the AGP specification. That document defines ordering between reads in terms of the order that data is returned to the requesting device. We are concerned here with the order in which the reads are seen at the target (generally, main memory). The I/O bridges can reorder returning read data if necessary. This leads to a slightly relaxed set of rules.

See MindShare's AGP System Architecture book for details regarding the AGP ordering rules.

PCI-Based Ordering

AGP transactions based on the PCI protocol follow the same rules as PCI.

Low Priority Ordering

Ordering rules for the low priority AGP transactions are:

  • Reads (including flushes) must not pass writes.

  • Writes must not pass writes.

  • Fences must not pass other transactions or be passed by other transactions.

High Priority Ordering

High priority transactions only carry graphics data using split transactions. Consequently, the Producer/Consumer model has no relevance and ordering requirements can be reduced to the following single rule:

  • Writes must not pass writes.



PCI Bus Issues

Several features of the PCI bus must be handled in the correct fashion when interfacing with the HT bus. For background information and details regarding PCI ordering, refer to MindShare's PCI System Architecture book, 4th edition.

PCI Ordering Requirements

Transaction ordering on the PCI bus is based on the Producer/Consumer programming model. This model involves 5 elements:

  1. Producer — PCI master that sources data to a memory target

  2. Target — main memory or any PCI device containing memory

  3. Consumer — PCI master that reads and processes the Producer data from the target

  4. Flag element — a memory or I/O location updated by the producer to indicate that all data has been delivered to the target, and checked by the Consumer to determine when it can begin to read and process the data.

  5. Status element — a memory or I/O location updated by the Consumer to indicate that it has processed all of the Producer data, and checked by the Producer to determine when the next batch of data can be sent.

This model works flawlessly in PCI when all elements reside on the same shared PCI bus. When these elements reside on different PCI buses (i.e., across PCI-to-PCI bridges), the model can fail without adherence to the PCI ordering rules.
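
To make the five roles concrete, here is a minimal sketch of the model in C using ordinary shared memory and polling. The structure and function names are illustrative only; the point is that the Consumer must never observe the flag set before the Producer's data writes have actually reached the target, which is exactly what the ordering rules preserve across bridges.

    #include <stdbool.h>
    #include <stddef.h>

    /* The five elements of the Producer/Consumer model, reduced to shared memory.
     * 'status' is assumed to start out true so the Producer may send the first batch. */
    struct pc_model {
        unsigned char data[256];   /* target: memory written by the Producer         */
        volatile bool flag;        /* flag element: "all data has been delivered"    */
        volatile bool status;      /* status element: "Consumer has processed data"  */
    };

    /* Producer: wait for the previous batch to be consumed, deliver the data to the
     * target, then set the flag. Ordering must guarantee the data writes reach the
     * target before the flag write becomes visible. */
    void producer_send(struct pc_model *m, const unsigned char *src, size_t n)
    {
        while (!m->status)
            ;                      /* poll the status element                       */
        m->status = false;
        for (size_t i = 0; i < n; i++)
            m->data[i] = src[i];   /* data writes (possibly posted)                 */
        m->flag = true;            /* must not pass the data writes                 */
    }

    /* Consumer: wait for the flag, read and process the data, then set status. */
    void consumer_receive(struct pc_model *m, unsigned char *dst, size_t n)
    {
        while (!m->flag)
            ;                      /* poll the flag element                         */
        m->flag = false;
        for (size_t i = 0; i < n; i++)
            dst[i] = m->data[i];
        m->status = true;          /* allow the Producer to send the next batch     */
    }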

The PCI specification, versions 2.2 and 2.3, defines the required transaction ordering rules. These ordering rules are included in this section as a review and to identify rules that may have no purpose in some HT designs.

  • PMW stands for posted memory write.

  • DRR and DRC stand for Delayed Read Request and Delayed Read Completion, respectively.

  • DWR and DWC stand for Delayed Write Request and Delayed Write Completion, respectively.

  • "Yes" specifies that the transaction just latched must be ordered ahead of the previously latched transaction indicated in the column heading.

  • "No" specifies that the transaction just latched must never be ordered ahead of the previously latched transaction indicated in the column heading.

  • "Yes/No" entries means that the transaction just latched is allowed to be ordered ahead of the previously-latched operation indicated in the column heading, but such reordering is not required. The Producer/Consumer Model works correctly either way.


Avoiding Deadlocks

PCI ordering rules require that Posted Memory Writes (PMWs) in Row 1 be ordered ahead of the delayed requests and delayed completions listed in columns 2-5. This requirement is based on avoiding potential deadlocks. Each of the deadlocks involves scenarios arising from the use of PCI bridges based on earlier versions of the specification. If all PCI bridge designs used in HT platforms are based on the 2.1 and later versions of the PCI specification, the PCI ordering rules with "Yes" entries in row 1 can be treated as "Yes/No."

Subtractive Decode

PCI employs a technique referred to as subtractive decode to handle devices that are mapped into memory or I/O address space by user selection of switches and jumpers (e.g. ISA devices). Consequently, configuration software has no knowledge of the resources assigned to these devices. Fortunately, these PC legacy devices are mapped into relatively small ranges of address space that can be reserved by platform configuration software.

Subtractive Decode: The PCI Method

Subtractive decode is a process of elimination. Since configuration software allocates and assigns address space for PCI, HT, AGP and other devices, any access to address locations not assigned can be presumed to target a legacy device, or may be an errant address.

All PCI devices must perform a positive decode to determine if they are being targeted by the current request. This decode must be performed as a fast, medium, or slow decode. The device targeted must indicate that it will respond to the request by signaling device select (DEVSEL#) across the shared bus. When device driver software issues a request with an address that has not been assigned by configuration software, no PCI device is targeted (i.e. no DEVSEL# is asserted within the time allowed). By process of elimination, the subtractive decode agent recognizes that no PCI device has responded; it therefore asserts DEVSEL# and forwards the transaction to the ISA bus, where the request is completed.

Subtractive Decode: HT Systems Requiring Extra Support

When the subtractive decode agent is not at the end of a single-hosted chain, or when more than one HT I/O chain is implemented in a system, subtractive decode becomes more difficult.

The Problem

HyperTransport devices in a chain do not share the same bus as in PCI, so a subtractive decode agent cannot detect if a request has not been claimed by other devices on the chain.

The Solution

As described previously, configuration software assigns addresses to all HT, PCI, and AGP devices. Therefore, the host knows when a request will result in a positive decode and when it will not. The specification requires that all hosts connecting to HyperTransport I/O chains implement registers that identify the positive decode ranges for all HyperTransport technology I/O devices and bridges (except as noted in the simple method). One of these I/O chains may also include a subtractive bridge (typically leading to an ISA, or LPC bus). Requests that do not match any of the positive ranges must be issued with the compat bit set, and must be routed to the chain containing the subtractive decode bridge. This chain is referred to as the compatibility chain.

The Compat bit indicates to the subtractive decode bridge that it should claim the request, regardless of address. Requests that fall within the positive decode ranges must not have the Compat bit set, and are passed to the I/O chain upon which the target device resides. The target chain may be the compatibility or any other I/O chain.
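
The host's routing decision can be sketched as follows; the structures and function are hypothetical, standing in for the positive decode range registers the specification requires the host to implement.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* One positive-decode window programmed into the host for an I/O chain. */
    struct decode_range {
        uint64_t base, limit;      /* 40-bit HT addresses                     */
        int      chain;            /* chain this window routes to             */
    };

    struct route { int chain; bool compat; };

    /* A hit on a positive decode range clears the Compat bit and routes to the
     * owning chain; a miss sets Compat and routes to the compatibility chain,
     * where the subtractive decode bridge claims the request regardless of address. */
    static struct route route_request(uint64_t addr,
                                      const struct decode_range *ranges, size_t n,
                                      int compat_chain)
    {
        for (size_t i = 0; i < n; i++) {
            if (addr >= ranges[i].base && addr <= ranges[i].limit)
                return (struct route){ .chain = ranges[i].chain, .compat = false };
        }
        return (struct route){ .chain = compat_chain, .compat = true };
    }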

PCI Burst Transactions

PCI permits long burst transactions, with either contiguous or discontiguous byte masks (byte enables), that may not be supported by HT. These long bursts must be broken into multiple requests to support the HT protocol, as follows (a sketch appears after the list):

  • PCI read requests with discontiguous byte masks that cross aligned 4-byte boundaries must be broken into multiple 4-byte HT RdSized (byte) requests.

  • PCI write requests with discontiguous byte masks that cross 32-byte boundaries must be broken into multiple 32-byte HT WrSized (byte) requests. Note that the resulting sequence of write requests must be strongly ordered in ascending address order.

  • PCI write requests with contiguous byte masks that cross 64-byte boundaries must be broken into multiple 64-byte HT WrSized (dword) request
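
A minimal sketch of the boundary-splitting logic common to the three rules above, assuming a hypothetical bridge helper; emission of the resulting HT request packets is simulated with printf, and the choice of boundary (4, 32, or 64 bytes) is made by the caller according to which rule applies.

    #include <stdint.h>
    #include <stdio.h>

    /* Split a PCI burst of 'len' bytes at 'addr' into pieces that never cross the
     * given alignment boundary. A real bridge would build RdSized/WrSized request
     * packets here instead of printing. */
    static void split_burst(uint64_t addr, uint32_t len, uint32_t boundary)
    {
        while (len != 0) {
            uint32_t room  = boundary - (uint32_t)(addr % boundary); /* bytes left in window */
            uint32_t chunk = (len < room) ? len : room;
            printf("HT request: addr=0x%010llx len=%u\n",
                   (unsigned long long)addr, chunk);
            addr += chunk;
            len  -= chunk;
        }
    }

    int main(void)
    {
        /* e.g. a 200-byte contiguous PCI write starting at 0x10000030 becomes
         * several WrSized (dword) requests, none crossing a 64-byte boundary. */
        split_burst(0x10000030ULL, 200, 64);
        return 0;
    }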

The Need For Networking Extensions

While HyperTransport was initially developed to address bandwidth and scalability problems associated with moving data through the I/O subsystems of desktops and servers, the networking extensions bring a number of enhancements which permit the advantages of HyperTransport technology to be extended to communications processing applications. There are some major differences between the requirements of host-centric systems, such as desktops and servers, and those of communications processing systems.

Communications Processing Is Often Less Vertical

In communications applications, there may be a number of processors or coprocessors located in various corners of the topology. The host processor may assume responsibility for configuration and control of coprocessors and interface devices, while the coprocessors perform specialized data processing tasks. Because of the distributed responsibility for control and data handling tasks, these systems tend to be much less host processor-centric.

As a result of decentralizing data processing in communications systems, information flow may be omni-directional as coprocessors initiate transactions targeting devices under their control. When switch components are added to the topology, elaborate multi-port configurations are possible.

Summary Of Anticipated Networking Extension Features

Networking Extensions Add Message Semantics

In handling the special problems of communications processing, the HyperTransport networking extensions add message semantics to the storage semantics used in the 1.04 revision of the HyperTransport I/O Link Specification. Storage semantics were described in the last section. Message semantics are more efficient in handling variable length transfers, broadcasting messages, etc. The 64-byte HyperTransport packets are concatenated to form longer messages, and additions to request packet fields identify the start of a message, end of a message, or may even be used to signal the abort of a scheduled transaction. Unlike storage semantics, in which the payload is data targeting an address, messages can also be sent which convey interrupts and other housekeeping events.

Another difference between message semantics and storage semantics is the concept of addressing. In storage semantics, addresses are managed by the source device, and each byte of data transferred is associated with a particular address in the system memory map. This makes sense because the locations are within (and owned by) the device being targeted. In message semantics, the message is tagged as to which stream it belongs, and the destination determines where it goes. The ultimate destination is often external to the system, where the system memory map has no meaning.

16 New Posted Write Virtual Channels

Release 1.1 adds 16 new optional Posted Write Virtual Channels to the hardware of each node (above the three already required). Each of these new virtual channels may be given a dedicated bandwidth allocation, and an arbitration mechanism is defined for managing them.

An End-To-End flow control mechanism has also been added to allow devices to put millions of user streams into these 16 additional virtual channels. In this way, very large numbers of independent real-time streams (e.g. audio or video) may be handled.

Direct Peer-to-Peer Transfers Added

HyperTransport supports the full producer-consumer ordering model of PCI. In cases where this strict global ordering is needed, transactions from one HyperTransport I/O device to another (called peer-to-peer transfers) must first move upstream to the host bridge where they are then reissued downstream to the target device (a process HyperTransport calls reflection). Release 1.1 adds the option of sending some traffic directly peer-to-peer when the application does not require strict global ordering (it often isn't a concern in communications processing).

Link-Level Error Detection And Handling

With the addition of direct peer-to-peer transfers, Release 1.1 permits coprocessors and other devices to communicate directly without involvement of the host bridge. Along with this capability, network extensions provide for error detection and correction on the individual link level. In the event of an error, the receiver sends information back to the transmitter which causes a re-transmission of the packet. Obviously, the packet can't be consumed or forwarded until its validity is checked.

64 Bit Addressing Option

In keeping with the very large address space of many newer systems, Release 1.05 allows the optional extension of the normal 40-bit HyperTransport request address field to 64 bits.

Increased Number Of Host Transactions

Release 1.05 increases the number of outstanding transactions that a host bridge may have in progress from 32 to 128.

End-To-End Flow Control

In communication systems, there are occasions when devices are transferring packets to distant targets (not immediate neighbors) which may go "not ready" (or to another state which makes them unable to accept traffic) for extended periods. Prior to Release 1.1, HyperTransport devices only have flow control information for their immediate neighbors. Release 1.1 adds new end-to-end flow control packets which distant devices may send to each other to indicate their ability to participate in transfers. If a device is not ready, the source device does not start sending (or continue sending) packets; this helps eliminate bottlenecks which otherwise occur when the flow control buffers of devices in the path between source and target become full of packets which cannot be forwarded.

Switch Devices Formally Defined

Finally, Release 1.05 formally defines the switch device type which may be used to help implement the complex topologies required in communications systems. A switch behaves much like a two-level HyperTransport-HyperTransport bridge with multiple secondary interfaces. The basic characteristics of a switch include:

  1. A switch consumes one or more UnitIDs on its host interface. The port attached to the host is the default upstream port.

  2. The switch acts as host bridge for each of its other interfaces. Each interface has its own bus number.

  3. Switches, like bridges, are allowed to reassign UnitID, Sequence ID, and SrcTag for transactions passed to other busses. The switch maintains a table of outstanding (non-posted) requests in order to handle returning responses.

  4. Switches may be programmed to perform address translation.

  5. Switches must maintain full producer-consumer ordering for all combinations of transaction paths.

  6. Switches must provide a method for configuration of downstream devices on all ports.

Server And Desktop Topologies Are Host-Centric

A typical desktop or server platform is somewhat vertical. It has one or more processors at the top of the topology, the I/O subsystem at the bottom, and main system DRAM memory in the middle acting as a holding area for processor code and data as well as the source and destination for I/O DMA transactions performed on behalf of the host processor(s). The host processor plays the central role in both device control and in processing data; this is sometimes referred to as managing both the control plane and the data plane.

HyperTransport works well in this dual role because of its bandwidth and the fact that the protocol permits control information including configuration cycles, error handling events, interrupt messages, flow control, etc. to travel over the same bus as data — eliminating the need for a separate control bus or additional sideband signals.

Upstream And Downstream Traffic

There is a strong sense of upstream and downstream data flow in server and desktop systems because very little occurs in the system that is not under the direct control of the processor, acting through the host bridge. Nearly all I/O initiated requests move upstream and target main memory; peer-peer transactions between I/O devices are the infrequent exception.

Storage Semantics In Servers And Desktops

Without the addition of networking extensions, HyperTransport protocol follows the conventional model used in desktop and server busses (CPU host bus, PCI, PCI-X, etc.) in which all data transfers are associated with memory addresses. A write transaction is used to store a data value at an address location, and a read transaction is used to later retrieve it. This is referred to as associating storage semantics with memory addresses. The basic features of the storage semantics model include:

Targets Are Assigned An Address Range In Memory Map

At boot time, the amount of DRAM in the system is determined and a region at the beginning of the system address map is reserved for it. In addition, each I/O device conveys its resource requirements to configuration software, including the amount of prefetchable or non-prefetchable memory-mapped I/O address space it needs in the system address map. Once the requirements of all target devices are known, configuration software assigns the appropriate starting address to each device; the target device then "owns" the address range between the start address and the start address plus the request size.

Each Byte Transferred Has A Unique Target Address

In storage semantics, each data packet byte is associated with a unique target address. The first byte in the data packet payload maps to the start address and successive data packet bytes are assumed to be in sequential addresses following the start address.

The Requester Manages Target Addresses

An important aspect of storage semantics is the fact that the requester is completely responsible for managing transaction addresses within the intended target device. The target has no influence over where the data is placed during write operations or retrieved in read operations.

In HyperTransport, the requester generates request packets containing the target start address, then exchanges packets with the target device. The maximum packet data payload is 64 bytes (16 dwords). Transfers larger than 64 bytes are composed of multiple discrete transactions, each to an adjusted start address. Using HyperTransport's storage semantics, an ordered sequence of transactions may be initiated using posted writes or by including a non-zero SeqID field in the non-posted requests, but there is no concept of streaming data, per se.

Storage Semantics Work Fine In Servers And Desktops

As long as each requester is programmed to know the addresses it must target, managing address locations from the initiator side works well for general-purpose PIO, DMA, and peer-to-peer data exchanges involving CPU(s), memory, and I/O devices. When the target is prefetchable memory, storage semantics also help support performance enhancements such as write-posting, read pre-fetching, and caching, all of which depend on a requester having full control of target addresses.

1.04 Protocol Optimized For Host-Centric Systems

Because the HyperTransport I/O Link Protocol was initially developed as an alternative to earlier server and desktop bus protocols that use storage semantics (e.g. PCI), the 1.04 revision of the protocol is optimized to improve performance while maintaining backwards compatibility in host-centric systems:

  1. The strongly ordered producer-consumer model used in PCI transactions which guarantees flag and data coherence regardless of the location of the producer, consumer, flag location, or data storage location is available in the HyperTransport protocol.

  2. Virtual channel ordering may optionally be relaxed in transfers where the full producer-consumer model is not required.

  3. The strong sense of upstream and downstream traffic on busses such as PCI is also preserved in HyperTransport. Programmed I/O (PIO) transactions move downstream from CPU to I/O device via the host bridge. I/O bus master transactions move upstream towards main memory.

  4. Direct peer-peer transfers are not supported in the 1.04 revision of the HyperTransport I/O Link Specification; requests targeting interior devices must travel up to the host bridge, then be reissued (reflected) back downstream towards the target.

All of the above features work well for what they are intended to do: support a host-centric system in which control and data processing functions are both handled by the host processor(s), and I/O devices perform DMA data transfers using main system memory as a source and sink for data.

Some Systems Are Not Host-Centric

Unlike server and desktop computers, some processing applications do not lend themselves well to a host-centric topology. This includes cases where there are multiple levels of processing, complex look-up functions, protocol translation, etc. In these cases, a single processor (or even multiple CPUs on a host bus) can quickly become a bottleneck. Often what works more effectively is to assign control functions to a host processor and distribute data processing functions across multiple co-processors under its control. In some cases, pipeline (cascaded) co-processing is used to reduce latency.

X86 Power Management Support

X86 power management is based on the ACPI specification for the Windows operating environment. The specification defines specific timing requirements associated with STPCLK and SMI message cycles related to power management events. The specification also describes ACPI-defined system state transitions that relate to wakeup event signaling via LDTREQ#. See the specification for reference information related to these events.

Stop Clock Signal

STPCLK# is one of the basic x86 power management signals. When power management logic asserts this signal, it places the CPU into its Stop Grant State, which has the following effects (Intel PIII example). The processor:

  • issues a Stop Grant Acknowledge transaction

  • stops driving the AGTL FSB signals, allowing them to return to the minimum power state (pulled up by termination resistors to VTT)

  • turns off clocks to internal architecture regions, except external bus (FSB) and interrupt sections (e.g. IOAPIC).

  • latches incoming interrupts, but does not service them until the CPU returns to the Normal State.

  • handles requests for Snoop transactions on the FSB; to do this the CPU transitions to the HALT/Grant Snoop State to perform the snoop, then returns to the Stop Grant State upon completion.

When STPCLK# is deasserted, the CPU returns to the Normal State. Many newer CPUs have an additional signal which may be used to expand the number of low power states. For example, the Intel Pentium III has a SLP# (Sleep) signal used in conjunction with STPCLK# to drive the CPU into a very deep low power state (e.g., clocks are stopped, no interrupts are recognized, and no snoops are performed). This is the next best thing to being powered down completely, and the time to recover to normal operation is much faster.

Two Types Of Double-Hosted Chains

There are two basic arrangements for double-hosted chains: sharing and non-sharing.

Sharing Double-Hosted Chain

In a sharing double-hosted chain, traffic is allowed to flow from end to end. Either host may target any of the devices in the chain, including the other host. In this arrangement, one host is the master host bridge and the other is the slave host bridge. The determination about which host is master or slave is not defined in the specification, but must be defined before reset occurs. Most likely, the system board layout will determine master/slave host bridges — possibly through a strapping option on the motherboard.


If Possible, Assign All Devices To Master Host Bridge

The HyperTransport specification recommends that all resources in a sharing double-hosted chain be assigned to the master host bridge if possible; this eliminates a potential deadlock condition in peer-to-peer transactions. The Slave Command Register Master Host and Default Direction bits in PCI configuration space are used to program tunnel devices with the information needed to recognize the "upstream vs. downstream" directions. This is important because interior devices always issue requests and responses in the upstream direction. They only accept responses in the downstream direction.

If Slave Must Access Devices, It Uses Peer-to-Peer Transfers

The slave host in a sharing double-hosted chain may be required to access the devices on the link. To do so, it may have its Command Register Act as Slave bit set to 1. When this is done, all packets it issues travel first to the master host bridge, where they are reissued back to the target devices as peer-to-peer transactions.

Non-Sharing Double-Hosted Chain

A non-sharing double-hosted chain appears logically as two distinct chains with a host bridge at each end.

Software May Break The Chain

Software chooses a point at which to break the chain into two parts and then performs the following steps (a register-level sketch follows the list):

  1. While the link is idle, the link between the two tunnel devices is broken by programming the End Of Chain (EOC) bits in the appropriate tunnel Link Control registers on each side. The Transmit Off bit in each of the Link Control registers can also be set.

  2. The slave host bridge writes to the Slave Command register for each device now under its control to force the Master Host and Default Direction bits in each to point at the slave host bridge.

  3. Unique bus numbers are assigned to each segment in a non-sharing double-hosted chain. The bus number is used so that chains may be uniquely identified and so type 1 configuration cycles may be forwarded and/or converted to type 0 cycles by bridges.

  4. If peer-to-peer transactions are not required, software link partitioning can also be used for load balancing.
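
The first two steps can be sketched as register writes. Everything below other than the register and bit names taken from the text is an assumption for illustration: the access helpers, register offsets within the capability blocks, and exact bit positions would come from the HyperTransport capability definitions, not from this sketch.

    #include <stdint.h>

    /* Hypothetical accessors into each device's HT capability registers. */
    extern uint16_t ht_read16(uint8_t bus, uint8_t dev, uint16_t off);
    extern void     ht_write16(uint8_t bus, uint8_t dev, uint16_t off, uint16_t val);

    #define LINK_CTRL_OFF        0x04       /* placeholder: Link Control register offset  */
    #define LINK_CTRL_EOC        (1u << 6)  /* assumed End Of Chain bit position          */
    #define LINK_CTRL_TXOFF      (1u << 7)  /* assumed Transmit Off bit position          */

    #define SLAVE_CMD_OFF        0x02       /* placeholder: Slave Command register offset */
    #define SLAVE_CMD_MASTERHOST (1u << 10) /* assumed Master Host bit position           */
    #define SLAVE_CMD_DEFDIR     (1u << 11) /* assumed Default Direction bit position     */

    /* Step 1: while the link is idle, mark End Of Chain (and optionally Transmit Off)
     * in the Link Control register on each side of the link being severed. */
    static void break_link(uint8_t bus, uint8_t dev_a, uint8_t dev_b)
    {
        ht_write16(bus, dev_a, LINK_CTRL_OFF,
                   ht_read16(bus, dev_a, LINK_CTRL_OFF) | LINK_CTRL_EOC | LINK_CTRL_TXOFF);
        ht_write16(bus, dev_b, LINK_CTRL_OFF,
                   ht_read16(bus, dev_b, LINK_CTRL_OFF) | LINK_CTRL_EOC | LINK_CTRL_TXOFF);
    }

    /* Step 2: the slave host bridge updates the Master Host and Default Direction bits
     * of each device it now owns. Whether the bits are set or cleared depends on which
     * link of the tunnel faces the slave host bridge; setting both is shown only as an
     * example. */
    static void claim_device(uint8_t new_bus, uint8_t dev)
    {
        uint16_t cmd = ht_read16(new_bus, dev, SLAVE_CMD_OFF);
        ht_write16(new_bus, dev, SLAVE_CMD_OFF,
                   cmd | SLAVE_CMD_MASTERHOST | SLAVE_CMD_DEFDIR);
    }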


Additional Notes About Double-Hosted Chains

Initialization In A Double-Hosted Chain

One of the responsibilities of a master host bridge in a double-hosted chain is to help with initialization after reset. Following low-level link initialization, the slave host bridge "sleeps" pending set up by the master. The basic steps in master initialization include:

  1. The master host bridge sets the Slave Command CSR master host bit to point towards the master host bridge in all slave devices it finds. This bit is set automatically whenever the Slave Command CSR is written.

  2. When the master host bridge discovers the slave host bridge, it sets the Host Command CSR Double Ended bit in both its own and the slave's Host Command registers. This informs the slave (when it wakes up) that it is in a double-hosted chain and that it is not required to configure devices below it.

  3. If the Double Ended bit is not set in the slave, it will initialize its end of the double ended chain when it awakens.

Type 0 Configuration Cycles In A Double-Hosted Chain

Because all host bridges tend to own UnitID 0, a configuration cycle carrying a device number field of "0" in a double-hosted chain might be misinterpreted. The direction a type 0 configuration cycle request is traveling determines which host bridge is the target. If configuration software wishes to prevent a host bridge (e.g. the slave host) in a double-hosted chain from accessing another host's configuration space, the Host Command Register Host Hide bit may be set to 1.

Other Fields In The Header of HT Tech

Primary Latency Timer Register

This register is not implemented by HyperTransport devices and should return 0s if read by software. If the primary bus is PCI or PCI-X, use of this register follows that protocol.

Base Address Registers

The two Base Address Registers (BARs) are used by bridges in much the same way as for PCI bridge devices, with the following limits if the primary interface is HyperTransport:

I/O BAR

For an I/O request, a single BAR is implemented. Only the lower 25 bits of the value programmed into the BAR are used for address comparison by the target, and the upper bits of the BAR should be written to zeros by system software. Any I/O request packet sent out on a link should have start address bits 39-25 programmed for the I/O range in the HyperTransport memory map.
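
A sketch of the target-side compare implied above; the function and its size parameter are hypothetical, and a real device would derive the window size from the bits it hardwires to zero in the BAR.

    #include <stdbool.h>
    #include <stdint.h>

    #define HT_IO_BAR_CMP_MASK 0x01FFFFFFu   /* only address bits 24:0 participate */

    /* 'bar' is the programmed I/O BAR value, 'size' is the power-of-two window size,
     * and 'req_addr' is the 40-bit address from the incoming request. Bits 39:25 of
     * the request select the HT I/O range and are ignored in this compare. */
    static bool io_bar_hit(uint64_t req_addr, uint32_t bar, uint32_t size)
    {
        uint32_t window = ~(size - 1u);                    /* aligned window mask    */
        uint32_t a = (uint32_t)req_addr & HT_IO_BAR_CMP_MASK;
        uint32_t b = bar & HT_IO_BAR_CMP_MASK;
        return (a & window) == (b & window);               /* compare bits 24:0 only */
    }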

Memory BAR

A request for memory using 32-bit addressing can be accomplished using a single BAR, just as in PCI. This would limit the assigned target start address for the device to the lower 4GB of the 1 TB (40 bit) HyperTransport address map.

Optionally, a HyperTransport device may support 64 bit address decoding, and use a pair of BARs to support it. If this is done, only the lower 40 bits of the 64 bit BAR memory address will be valid, and the upper bits are assumed to be zeros.

Memory windows for HyperTransport devices are always assigned in BARs on 64-byte boundaries; this assures that even the largest transfer (16 dwords/64 bytes) will never cross a device address boundary. This is important because HyperTransport does not support a disconnect mechanism (such as PCI uses) to force early transaction termination.

Capabilities Pointer

This field contains a pointer to the first advanced capability block. Because all HyperTransport bridge devices have at least one advanced capability, this register is always implemented. The pointer is an absolute byte offset from the beginning of configuration space to the first byte of the first advanced capability register block.

Interrupt Line Register

The HyperTransport specification indicates that this register should be read-writable and may be used as a software scratch pad. The interrupt routing information programmed into this register in PCI devices isn't required in HyperTransport because interrupt messages are sent over the links and sideband interrupts are not defined. If the primary bridge interface is PCI or PCI-X, this register is used by software to program the system interrupt mapped to this device.

Interrupt Pin Register

This register is reserved in the HyperTransport Specification. It may optionally be implemented for compatibility with software which may expect to gather interrupt pin information from all PCI-compatible devices. If the primary bus interface is PCI or PCI-X, this register is hard-coded with the interrupt pin driven by this device (if any).

Cache Line Size Register

This register is not implemented by HyperTransport devices. If both interfaces are HyperTransport, its bits should be tied low and return 0s if read by software. If either interface is PCI, this register is read-write.

Basic Jobs Of A HyperTransport Bridge

As in the case of PCI bridges, a HyperTransport bridge has a number of responsibilities:

  1. It extends the topology through the addition of one or more secondary buses. Each HyperTransport chain (bus) can support up to 32 UnitIDs. Because a device is permitted to consume multiple UnitIDs, implementing a bridge is a reasonable way to add a new chain that can support 32 additional UnitIDs (the bridge secondary interface consumes at least one of the new UnitIDs).

  2. It acts as host for each of its secondary chains. There are many aspects to this, including ordering responsibilities, error handling, maintaining a queue for outstanding transactions routed to other buses, reflecting peer-to-peer transactions originating below it, decoding memory addresses so it may claim and forward transactions moving between the primary and secondary bus, forwarding/converting configuration cycles based on target bus number, etc.

  3. In cases where it bridges between HyperTransport and PCI/PCI-X, the bridge also must translate protocols for transactions going in either direction. It may also have to remap address ranges between the 40-bit HyperTransport address range and the 32/64-bit PCI or PCI-X range.

Why Use Pseudo-Synchronous Clock Mode?

The specification does not address any specific application for pseudo-synchronous clock mode. Its main advantage appears to be that a link can transfer data in one direction at a higher rate than in the other. This raises the question, "Why not transfer in both directions at the highest speed possible, thereby keeping bus efficiency as high as possible?" Clocking one direction at a slower rate does, however, offer possible advantages: power savings, reduced EMI, and reduced transmit PHY complexity.

Implementation Issues

Pseudo-synchronous clocking mode must take into account the same clock variance issues as synchronous mode. Additionally, several other key issues must be considered for pseudo-synchronous clocking mode. These issues include:

  • Methods and procedures required to implement pseudo-sync mode.

  • Managing the FIFOs and pointers given the different transmit and receive clock frequencies.

  • Is support mandatory?

Methods and Procedures

The specification does not define a mechanism to lower the transmit clock frequency, nor does it provide a method for determining which clock modes are supported by a given HT device. The specification states that:

"The means by which the operating mode is selected for a device that can support multiple modes is outside the scope of this specification."

Further, no definition exists regarding the level of software that would be involved in transitioning a device to the pseudo-sync mode.

FIFO Management

Pseudo-sync mode must consider the same sources of clock variation as synchronous mode: the receive FIFOs must be sized appropriately, and the separation between the write and read pointers must be established.

Because Tx Clock Out may run slower than Rx Clk in pseudo-synchronous mode, incoming packets may be clocked into the receive FIFO more slowly than they are clocked out. This situation results in a buffer underrun condition. To prevent this from happening, the unload pointer occasionally must be stopped and then restarted when sufficient data is present in the receive FIFO. One approach to solving the potential underrun problem is to implement the FIFO to set a flag when the read pointer reaches the write pointer. The unload pointer could be stopped to keep additional reads from occurring until the situation is corrected. When sufficient separation between the load and unload pointers has accumulated, the flag can be cleared and reads can continue.
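
A minimal sketch of that flag-and-stall approach, assuming an illustrative 16-entry circular FIFO and a restart threshold of two entries (both values are arbitrary choices for the example, not anything mandated by the specification):

    #include <stdbool.h>
    #include <stdint.h>

    #define FIFO_DEPTH     16u   /* illustrative depth (power of two)   */
    #define MIN_SEPARATION  2u   /* illustrative restart threshold      */

    struct rx_fifo {
        uint32_t entries[FIFO_DEPTH];
        unsigned load;           /* write pointer, Tx Clock Out domain  */
        unsigned unload;         /* read pointer, Rx Clk domain         */
        bool     stalled;        /* set when read pointer catches write */
    };

    static unsigned separation(const struct rx_fifo *f)
    {
        return (f->load - f->unload) & (FIFO_DEPTH - 1u);
    }

    /* Called on each Rx Clk edge. Returns false (reads nothing) while the FIFO is
     * stalled, preventing underrun when Tx Clock Out runs slower than Rx Clk. */
    static bool fifo_read(struct rx_fifo *f, uint32_t *out)
    {
        if (separation(f) == 0)
            f->stalled = true;                 /* read pointer reached write pointer */
        if (f->stalled && separation(f) < MIN_SEPARATION)
            return false;                      /* hold the unload pointer            */
        f->stalled = false;
        *out = f->entries[f->unload];
        f->unload = (f->unload + 1u) & (FIFO_DEPTH - 1u);
        return true;
    }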

Is Support for Pseudo-Sync Mode Required?

The HT specification clearly requires support for synchronous clocking mode for all devices. It further states that:

"Devices may also implement Pseudo-sync and Async modes based on their unique requirements."

This statement suggests that Pseudo-sync mode is conditionally required; that is, it's optional unless a device has some special conditions that require the support. Further, the specification does not mention any requirement for standard synchronous devices to operate correctly when attached to devices that operate in pseudo-sync mode. Presumably, all synchronous clocking mode devices are expected to be able to interoperate with pseudo-sync devices. As discussed in the previous section, support for pseudo-sync mode at the receiving end simply requires that the FIFO read pointer not be allowed to advance to the same entry as the write pointer.

Asynchronous Clock Mode

The asynchronous clock mode permits the transmit and receive clocks to be derived from different sources. The specification limits the maximum difference permitted between the transmit and receive clock frequency. In this case, either the transmit clock or the receive clock may run faster than the other. So, both situations must be taken into account.

Transmit Clock Slower Than Receive Clock

In this case, a potential underrun condition can develop. The solution for preventing underrun is the same as that discussed for the pseudo-synchronous clock mode (see "FIFO Management" above). In summary, the FIFO read pointer is prevented from reaching the write pointer by stopping the read clock until the transmit clock has had a chance to catch up.

Transmit Clock Faster Than Receive Clock

Tx Clock Out can run slightly faster than Rx Clk in asynchronous mode (but by no more than 2000 ppm); thus, incoming packets may be clocked into the receive FIFO faster than they are clocked out. This situation would result in a buffer overrun condition, and the receiver has no way of stopping or slowing the incoming packets. The following discussion describes how the buffer overrun condition is prevented.

CRC bits appear on the link for 4 bit-times (on 8-, 16-, and 32-bit links) after every 512 bit-times. These CRC bits are detected by the receiver but NOT clocked into the receive FIFO. Instead, the CRC bits are routed into the CRC error checking logic. Consequently, the FIFO write pointer does not increment during the CRC bit times, but the read pointer continues to increment and data continues to be read from the FIFO. As a result, the unload pointer has sufficient time to catch up by clocking data out of the receive FIFO before the buffer overruns.
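
A back-of-the-envelope check of why the CRC window is sufficient, assuming the behavior described above (512 data bit-times followed by 4 CRC bit-times per window) and the 2000 ppm limit mentioned earlier:

    #include <stdio.h>

    int main(void)
    {
        const double window_bits  = 512.0 + 4.0; /* data plus CRC bit-times per window */
        const double catchup_bits = 4.0;         /* entries the reader reclaims per
                                                    window (CRC bits are not written)  */
        const double max_ppm      = 2000.0;      /* maximum allowed clock mismatch     */

        /* Worst-case extra entries the transmitter gains per window due to the
         * frequency mismatch, versus what the receiver reclaims per window. */
        double drift_per_window = window_bits * (max_ppm / 1e6);
        printf("drift per window   : %.3f entries\n", drift_per_window); /* ~1.03 */
        printf("catch-up per window: %.3f entries\n", catchup_bits);     /*  4.00 */
        printf("overrun avoided    : %s\n",
               catchup_bits > drift_per_window ? "yes" : "no");
        return 0;
    }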

Clock Initialization in HT Technology

The receive FIFO in each device must be able to absorb timing differences between the transmit and receive clocks. Data is written into the FIFO in the transmit clock domain and read in the receive clock domain.

The design and operation of this FIFO must account for the dynamic variations in phase between the transmit clock domain (Tx Clock Out) and the receive clock domain (Rx Clock). The FIFO depth must be large enough to store all transmitted data until it has been safely read into the receive clock domain. The separation between the write pointer (the entry to which FIFO data is written) and the read pointer (the entry from which it is read), called the write-to-read separation, must be large enough to ensure each FIFO location can be read safely into the receive clock domain.

The deassertion of the incoming CTL/CAD signals across a rising CLK edge is used in the transmit clock domain within each receiver to initialize the write (load) pointer. The same deassertion of CTL and CAD is read from the FIFO synchronously to the receive clock domain and used to initialize the read (unload) pointer. The separation between the write and read pointers is calculated based on the worst-case variation between the transmit and receive clocks.

Note also that CTL cannot be used to initialize the pointers for byte lanes other than 0 in a multi-byte link, because CTL only exists within the byte 0 transmit clock domain.



Synchronous Clock Mode

The specification requires that all HT devices support the synchronous clock mode. This mode is the least complicated method of transferring data from transmitter to receiver. Synchronous clock mode requires that the transmit clock and receive clock have the same source, and operate at the same frequency. If we were to assume that the transmit clock and the receive clock always remained synchronized, then a simple clocking interface could be used as described in the following example.

A Conceptual Example

In this synchronous example, the transmit clock (Tx Clock) and receive clock (Rx Clock) are presumed to be in synchronization. Note, however, that source synchronous clocking requires that Transmit Clock Out (Tx Clk Out) be 90° phase shifted from Tx Clock. In this example all other sources of transmit to receive clock variation are ignored, including the expected clock drift associated with PLLs.

The transmitter delivers data synchronously across the link using the transmit clock. Tx Clock Out is sourced later and lags the data by 90° (or one-half bit time), thereby centering the clock edge in the middle of the valid data interval. When the data arrives at the receiver it is clocked into the FIFO using Tx Clock Out. Note that the receive FIFO has two entries, which provides a separation of 1 between Tx Clock Out and Rx Clock. Data written into the FIFO during clock 1 would not be read from the FIFO using Rx Clock until clock 2. This one-entry separation (called write-to-read separation) permits time for the sample to be stored prior to being read (i.e. the FIFO entry is not being written to and read from in the same clock cycle). In short, two FIFO entries are sufficient to provide the separation needed to ensure that data is safely stored and transferred into the receive clock domain.
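
The idealized two-entry case can be sketched as a ping-pong buffer; the names and clock-callback structure are illustrative only, and the sketch assumes the read pointer has been initialized one clock behind the write pointer as described above.

    #include <stdint.h>

    /* Two-entry receive FIFO for the idealized case: Tx Clock Out and Rx Clock are
     * exactly in phase, and the read pointer trails the write pointer by one entry
     * (write-to-read separation of 1), so an entry is never written and read in the
     * same clock cycle. */
    struct two_entry_fifo {
        uint32_t entry[2];
        unsigned wr;    /* toggles 0,1,0,1,... on each Tx Clock Out edge */
        unsigned rd;    /* toggles one clock behind the write pointer    */
    };

    static void on_tx_clock_out(struct two_entry_fifo *f, uint32_t cad_sample)
    {
        f->entry[f->wr] = cad_sample;   /* written in the transmit clock domain */
        f->wr ^= 1u;
    }

    static uint32_t on_rx_clock(struct two_entry_fifo *f)
    {
        uint32_t v = f->entry[f->rd];   /* read one clock later, receive domain */
        f->rd ^= 1u;
        return v;
    }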

However, in the real world many factors contribute to timing differences between the transmit and receive clock that are potentially significant, even though the clocks originate from the same source. These real world perturbations result in somewhat more complicated implementations that must account for and manage the worst case variation between the transmit and receive clocks. Specifically, the specification describes the receive FIFO implementation for handling the variation between the transmit and receive clocks.

Sources of Transmit and Receive Clock Variance

The specification defines and details the sources of transmit and receive clock variation that can exist. These clock differences can create FIFO overflow or underflow if not identified and taken into account. The clock differences can be attributed to two different categories or sources:

  • Invariant sources — components that represent a constant phase shift between the transmit and receive clock domain.

  • Variant sources — dynamic variations in the transmit and receive clock domains (these phase variations can occur even though both the transmit and receive clocks are running at the same frequency).

The sources of clock variation in some cases can accumulate, causing the clock variation to increase over time. However, all of the sources of clock variation are naturally limited in terms of the maximum amount of change that can occur. For example, a PLL is designed to produce an output clock that is synchronized with the input source clock, but with certain limitations: the output clock is specified not to drift beyond a certain phase shift. The time over which the clock phase may change can be relatively short or perhaps much longer depending upon conditions. The consideration and assessment of the sources of clock variance is done to determine a FIFO size that can absorb the worst-case clock variation. This worst case would occur if all sources of clock variation simultaneously reached their extremes, a very unlikely circumstance.

This chapter discusses the variant and invariant sources of transmit clock to receive clock variance. It also provides an example timing budget for each source.

Invariant Sources

The time-invariant factors contribute a small proportion of the overall clock variance. The invariant factors include:

  1. Cross-byte skew in multi-byte link implementations

  2. Sampling Error

Cross-byte skew in multi-byte link implementations

Differences in the arrival of Tx Clock Out at the receiver (CLKIN) between byte lanes are caused by path length mismatch. This constant skew is termed Tbytelaneconst in the specification, which allows up to 1000ps for it. Consequently, when multiple bytes are clocked into the FIFO, the maximum skew could result in one of the bytes being clocked into the FIFO 1000ps later than its associated bytes. Thus, when the associated bytes are clocked out of the FIFO by Rx Clock, a byte that arrived late may be left behind. This problem is solved by adding entries to the FIFOs to handle the maximum lane-to-lane skew, ensuring that all associated bytes are clocked out at the same time. Note that lane-to-lane skew may also change due to the effects of temperature, voltage change, etc.; this parameter, called Tbytelanevar, is included in the variant source list.
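
A quick worked example of how many extra FIFO entries the 1000ps constant skew costs at a few transfer rates (the rates chosen here are just examples):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Extra FIFO entries needed to absorb the constant cross-byte-lane skew
         * (Tbytelaneconst, up to 1000ps) at a given transfer rate. */
        const double t_bytelane_const_ps = 1000.0;
        const double rates_mts[] = { 400.0, 800.0, 1600.0 };

        for (int i = 0; i < 3; i++) {
            double bit_time_ps = 1.0e6 / rates_mts[i];          /* ps per bit time */
            double skew_bits   = t_bytelane_const_ps / bit_time_ps;
            printf("%4.0f MT/s: bit time %6.1f ps, skew = %.2f bit times -> %d extra entries\n",
                   rates_mts[i], bit_time_ps, skew_bits, (int)ceil(skew_bits));
        }
        return 0;
    }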

Sampling Error

This is the uncertainty in the read pointer due to CTL sampling error in the receive clock domain (one device-specific Rx Clock bit time). The specification does not specifically define the source of this sampling error, but it is likely caused by phase variations between Tx Clock Out and Rx Clock that could cause a sample to be missed. Adding an additional bit time solves this problem.

Variant Sources

The phase difference between the transmit and receive clock may change significantly due to dynamic factors such as:

  1. Reference Clock Distribution Skew.

  2. PLL Variation in Transmitter and Receiver.

  3. Transmitter and Link Transfer Variation

  4. Receiver Transfer Variation

  5. Dynamic Cross Byte Lane Variation

All time variant parameters must be considered in terms of their worst-case variance. The total dynamic phase variation due to these factors is called Tvariant. Additionally, the transmit clock could either LEAD the receive clock by Tvariant or it could LAG the receive clock by Tvariant. Consequently, the receive FIFO must be sized to accommodate both phase variations.

Reference Clock Distribution Skew

Synchronous clock mode requires that the input reference clocks to the transmitter and receiver be derived from the same time base. The distribution of the reference clock to the transmitter and the receiver results in skew between the two reference clocks. This is due to:

  • differences in the output skew of the clock source, including phase error associated with Spread Spectrum Clocking in the reference clock generator, and the skew associated with the mismatch in the distribution path.

  • differences in the distribution of the clocks to their PLLs due primarily to temperature and voltage changes.

This skew results in phase difference between the Transmit and Receive Clocks and must be included in the Tvariant calculation.

PLL Variation in Transmitter and Receiver

The largest contribution to the overall Tx Clock to Rx Clock variance comes from the PLLs. The PLL is constantly making adjustments to the output frequency as a result of a feedback loop. In addition, voltage and temperature changes also add to the possible output clock variation. The sample timing budget included within the specification allows a maximum PLL output phase variation of 3500ps. This represents >1 bit time at the 400 MT/s rate and approximately 5.6 bit times at the 1600MT/s rate.

Transmitter and Link Transfer Variation

The transmitter clock error (accumulated over a single bit time), the transmitter PHY, and the interconnect contribute small amounts of phase error into the link transfer clock domain through all of the parameters included in the link transfer timing. This includes noise on the PCB that affects both the clock and data in the same way causing a minor shift in frequency or phase of clock and data. (Note that if the noise affected the clock and data differently, this would affect the maximum bit transfer rate due to potential violations of TSU and THD).

Receiver Transfer Variation

The receiver contributes small amounts of phase error in the received CLKIN due to distribution effects.

Write-to-Read and Read-to-Write Separation

Recall that the FIFO depth must be large enough to store all transmitted data until it has been safely read into the receive clock domain. The separation between the write pointer location where data is written and the read pointer location from which data is read must be large enough to ensure that a FIFO location can be read safely into the receive clock domain.

To accommodate the clock variance in this example, the read pointer within the FIFO would need to be separated from the write pointer by 8 entries (or bit times). The following scenarios are provided to explain the operation of the FIFO and its pointers.

Stage A — the write pointer has progressed from entry 0 to entry 8. Because the separation between the write and read pointer is 8, Rx Clock is prevented from clocking data from the FIFO until the separation reaches 8. At this stage, the separation has just been reached, so Rx Clock clocks data from entry 0, while the Tx Clock Out clocks data into entry 8.

Stage B — the write pointer has progressed to entry 15 and because there is still no phase difference between Tx Clock Out and Rx Clock the separation between the pointers remains at 8. Rx Clock is clocking data from entry 7 as Tx Clock Out is clocking data into entry 15.

Stage C — the write pointer has rolled from entry 15 back to entry 0 while the read pointer has advanced to entry 8. This simply illustrates that the separation is still maintained when the write pointer reaches the end of the FIFO and wraps back to entry 0.

Scenario 3: Rx Clock Lags Tx Clock Out

This scenario presents the opposite condition that was illustrated in scenario 2. In this example, the receive clock lags the transmit clock. As in the previous example, the phase difference between the clocks would not likely accumulate so quickly.

Stage A — the write pointer has previously traversed all of the entries and is back at entry 0 again, while the read pointer is at entry 8. This scenario focuses on the possibility that Rx Clock lags the Tx Clock Out clock. In this case, the read-to-write separation becomes critical. In stage A this separation is 8.

Stage B — the write pointer has advanced to entry 13, while the read pointer has only advanced to entry 15. The write pointer has moved ahead by 13 entries while the read pointer has moved only 7, leaving a read-to-write separation of only 2.

Once again, the large change in clock variance over such a short period of time as illustrated in stage B would not actually occur. But the example does serve to illustrate that over time the clock variance can accumulate, and that an appropriately sized FIFO will be able to absorb the clock variance without overflow.
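
The following sketch models the FIFO pointer behavior described in these scenarios. It is a simplified, single-threaded illustration assuming a 16-entry FIFO and a required separation of 8, as in the example above; the names tx_clock and rx_clock are purely illustrative.

    #include <stdio.h>

    #define FIFO_DEPTH  16   /* entries, as in the example          */
    #define SEPARATION   8   /* required write-to-read separation   */

    /* Minimal model of the receive FIFO pointers. Writes are driven by the
     * transmit clock domain, reads by the receive clock domain; the read
     * pointer is held until the write pointer is SEPARATION entries ahead. */
    static unsigned wr, rd, level;   /* level = entries written but not yet read */

    static void tx_clock(void)           /* one Tx Clock Out edge: write an entry */
    {
        wr = (wr + 1) % FIFO_DEPTH;
        level++;
    }

    static void rx_clock(void)           /* one Rx Clock edge: read an entry      */
    {
        if (level >= SEPARATION) {       /* separation reached: safe to read      */
            rd = (rd + 1) % FIFO_DEPTH;
            level--;
        }
    }

    int main(void)
    {
        /* Reads begin only after 8 entries have been written; thereafter the
         * pointers track each other while the clocks remain in phase, and the
         * write pointer wraps from entry 15 back to entry 0 without losing
         * the separation.                                                    */
        for (int t = 0; t < 24; t++) {
            tx_clock();
            rx_clock();
            printf("t=%2d  wr=%2u  rd=%2u  separation=%u\n", t, wr, rd, level);
        }
        return 0;
    }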

Pseudo-Synchronous Clock Mode

In pseudo-synchronous mode, both Rx Clk in the receiver device and Tx Clk in the transmitter device are generated from the same time base clock just as in the synchronous mode case. During initialization, software configures each link to the maximum common frequency based on the values reported in each device's frequency capability register. The highest frequency supported by both devices is loaded into the Link Frequency register of each device. This value defines the highest frequency that both devices can use when sending packets over the link. In synchronous implementations this would be the exact frequency used by both devices. However, a device implementing pseudo-synchronous mode may arbitrarily lower the transmit clock frequency (Tx Clk or Tx Clock Out) below that specified by the Link Frequency register. Note that the receiver clock (Rx Clk) still runs at the frequency specified by the Link Frequency register.
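
A minimal sketch of the frequency negotiation step is shown below. It assumes the two Frequency Capability registers can be treated as bitmasks in which higher set bits correspond to higher supported frequencies; the actual bit-to-frequency encoding is defined by the specification, and highest_common_freq() is an illustrative helper, not a spec-defined interface.

    #include <stdint.h>

    /* Pick the highest link frequency supported by both ends of a link.
     * Each Frequency Capability register is treated as a bitmask in which a
     * set bit means "this frequency is supported"; higher bit positions are
     * assumed to correspond to higher frequencies (see the specification for
     * the actual bit-to-frequency encoding).                                */
    static int highest_common_freq(uint16_t tx_cap, uint16_t rx_cap)
    {
        uint16_t common = tx_cap & rx_cap;     /* frequencies both sides support */
        for (int bit = 15; bit >= 0; bit--)
            if (common & (1u << bit))
                return bit;                    /* bit index of the highest common frequency */
        return -1;                             /* no common frequency found */
    }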

Other Fields In The Header

The other fields in the type 0 header region of a HyperTransport device are used as follows:

Cache Line Size Register. (Offset 0Ch)

This read-only register is not implemented by HyperTransport devices; it should return 0s if read by software.

Latency Timer Register. (Offset 0Dh)

This register is not implemented by HyperTransport devices; it should return 0s if read by software.

Base Address Registers. (Offset 10h-24h)

The six Base Address Registers (BARs) are used in much the same way as for PCI devices, with the following limits:

I/O BAR

For an I/O request, a single BAR is implemented. Only the lower 25 bits of the value programmed into the BAR are used for address comparison by the target, and the upper bits of the BAR should be written with zeros by system software. Any I/O request packet sent out on a link should have start address bits 39:25 set to the I/O range in the HyperTransport memory map.
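
The target-side comparison might look like the sketch below. This is illustrative only: io_bar_hit() and window_size are hypothetical names, and only the lower 25 bits participate in the compare, as described above.

    #include <stdbool.h>
    #include <stdint.h>

    #define ADDR_25_MASK  0x1FFFFFFULL          /* bits 24:0 */

    /* Compare an incoming I/O request address against the I/O BAR. Both the
     * BAR value and the request address are masked to 25 bits; the low two
     * BAR bits are type bits and do not carry address information.         */
    static bool io_bar_hit(uint64_t request_addr, uint32_t io_bar,
                           uint32_t window_size)
    {
        uint64_t base   = io_bar & ADDR_25_MASK & ~0x3u;
        uint64_t offset = request_addr & ADDR_25_MASK;
        return offset >= base && offset < base + window_size;
    }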

Memory BAR

A request for memory using 32-bit addressing can be accomplished using a single BAR, just as in PCI. This would limit the assigned target start address for the device to the lower 4GB of the 1 TB (40 bit) HyperTransport address map.

Optionally, a HyperTransport device may support 64-bit address decoding using a pair of BARs. If this is done, only the lower 40 bits of the 64-bit BAR memory address are valid, and the upper bits are assumed to be zeros.

Memory windows for HyperTransport devices are always assigned in BARs on 64-byte boundaries; this assures that even the largest transfer (16 dwords/64 bytes) will never cross a device address boundary. This is important because HyperTransport does not support a disconnect mechanism (such as PCI uses) to force early transaction termination.
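
The following sketch shows how a 64-bit memory BAR pair might be combined and truncated to the 40-bit HyperTransport address map; memory_bar_base() is an illustrative helper, not a defined API.

    #include <stdint.h>

    #define HT_ADDR_MASK   ((1ULL << 40) - 1)   /* 40-bit HyperTransport address map */
    #define BAR_TYPE_MASK  0xFULL               /* low memory BAR type/prefetch bits */

    /* Combine the two dwords of a 64-bit memory BAR pair, strip the low
     * type bits, and keep only the lower 40 bits, since the upper bits are
     * assumed to be zeros in HyperTransport.                              */
    static uint64_t memory_bar_base(uint32_t bar_lo, uint32_t bar_hi)
    {
        uint64_t base = ((uint64_t)bar_hi << 32) | bar_lo;
        base &= ~BAR_TYPE_MASK;
        base &= HT_ADDR_MASK;
        return base;    /* 64-byte aligned when assigned per the rule above */
    }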

CardBus CIS Pointer. (Offset 28h)

This register is not implemented by HyperTransport devices; it should return 0s if read by software.

Capabilities Pointer. (Offset 34h)

This field contains a pointer to the first advanced capability block. Because all HyperTransport devices have at least one advanced capability, this register is always implemented. The pointer is an absolute byte offset from the beginning of configuration space to the first byte of the first advanced capability register block.

Interrupt Line Register. (Offset 3Ch)

The HyperTransport Specification indicates that this register should be read-writable and may be used as a software scratch pad. The interrupt routing information programmed into this register in PCI devices isn't required in HyperTransport because interrupt messages are sent over the links and sideband interrupts are not defined.

Interrupt Pin Register. (Offset 3Dh)

This register is reserved in the HyperTransport Specification. It may optionally be implemented for compatibility with software that expects to gather interrupt pin information from all PCI-compatible devices.

Min_Gnt and Max_Latency Registers. (Offsets 3Eh and 3Fh)

These register fields are associated with PCI shared-bus arbitration and are not implemented by HyperTransport devices; they should return 0s if read by software.

Block Formats Vary With Capability And Device Type

Each of the HyperTransport capability blocks has its own format. The Type field in the first dword of each capability block defines the format of the entire block. In addition, one of the principal capability block types (Slave/Primary Interface) also varies with the device that implements it, because tunnel devices interface to two links while end (cave) devices interface to only one.

The Slave/Primary Interface Block

HyperTransport defines two principal advanced capability register block formats, Slave/Primary and Host/Secondary, which reflect the two possible roles a device interface can perform on a link. The Slave/Primary format is used by all tunnels and single-link peripheral (cave) devices. These devices never act as a host for a bus (they are slaves). In addition, because they are not bridges, they have a single primary interface to the bus and no secondary interfaces.

One complicating factor is the fact that while an end (cave) device interfaces to only one link, a tunnel must interface to two links (still only one bus, though). To accommodate this difference, each Slave/Primary interface has two sets of link management registers, one for each link. A tunnel device implements one Slave/Primary interface and both sets of link management registers; an end (cave) device also implements one Slave/Primary interface but only one set of link management registers.

HyperTransport Configuration Space Format

This section describes the general format of the configuration space used by a HyperTransport functional device. The discussion here focuses on two major areas:

  • How a HyperTransport device is similar to, and different from, a PCI device in its use of the generic header region of configuration space.

  • The use of the required and optional HyperTransport advanced capability register blocks also located in the required 256 byte configuration space.

Two Header Formats Are Used

The first one-fourth (16 dwords) of any PCI configuration space is called the header. As in the case of PCI devices, HyperTransport devices use two header formats: one for HT-to-HT bridges, called header type 1, and the other for all non-bridge devices (including tunnels and single-link end (cave) devices), called header type 0. The lower bits of the Header Type field within both types of PCI configuration header are hard-coded with the type code; software checks this field early in the device discovery process to determine which header format it is dealing with.
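
A minimal sketch of that early check is shown below, assuming a platform-provided cfg_read8() configuration-read primitive (the name is illustrative): the Header Type byte sits at offset 0Eh, and its low bits carry the type code.

    #include <stdint.h>

    /* Read the Header Type field (offset 0Eh) and return the header format.
     * cfg_read8() stands in for whatever configuration-read primitive the
     * platform provides.                                                   */
    extern uint8_t cfg_read8(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset);

    enum header_format { HEADER_TYPE_0 = 0, HEADER_TYPE_1 = 1 };

    static enum header_format read_header_format(uint8_t bus, uint8_t dev, uint8_t func)
    {
        uint8_t header_type = cfg_read8(bus, dev, func, 0x0E);
        return (enum header_format)(header_type & 0x7F);  /* bit 7 = multifunction flag */
    }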


The Type 0 Header Format

Basic PCI functionality is managed by having BIOS or other low-level software read certain hard-coded header fields to obtain device requirements, and then program other fields to set up plug-and-play options.

PCI Advanced Capability Registers

While many early PCI devices were managed using just the register fields in the configuration space header, many additional features have been added over the years that require dedicated registers to manage them. For devices with capabilities beyond basic PCI compliance, the generic PCI header registers are augmented by one or more additional register sets outside of the header area, but still within the 256-byte PCI configuration space. PCI calls these advanced capability register blocks.

Many Advanced Capabilities Are Defined

Under the current PCI specification, advanced capability block register sets have been defined for all sorts of purposes. Two important classes are:

  • Register sets for bus extensions such as HyperTransport, PCI-X, and AGP.

  • Register sets for enhanced device management, including Message Signalled Interrupts (MSI), Power Management, Vital Product Data, etc.

When a PCI-compatible device is designed, the basic PCI configuration space type 0 or type 1 header fields are implemented, as are any additional advanced capability register blocks that may be needed. The format of an advanced capability block varies with its type, and a Capability ID byte at the start of each block identifies which type it is; the capability ID for HyperTransport is 08h. At a minimum, a HyperTransport device must implement the 256-byte PCI configuration space, containing a header and at least one HyperTransport advanced capability block (Host/Secondary or Slave/Primary Interface block).

Discovering The Advanced Capability Blocks

If a PCI-compatible device implements advanced capability blocks, low-level software must find and configure each one. Because the specific location of advanced capability blocks within the 256-byte configuration space is not specified, they must be "discovered" by executing some variation of the following software configuration process (a code sketch follows the list):

  1. Use the capability pointer (CapPtr) at dword 13 in the header to determine the configuration space offset (from the beginning of configuration space) to the first advanced capability register block. Check the first byte in the block to determine the capability ID (HT = 08).

  2. Next, check the upper byte in the first dword to determine the HyperTransport capability block Type. HyperTransport supports a number of these: Host/Secondary, Slave/Primary, Interrupt Discovery & Configuration, etc.

  3. Set up all of the registers in the capability block using configuration cycles.

  4. Use the next pointer (NPtr) contained in the second byte of the first advanced capability block to determine the offset (from the beginning of configuration space) to the next capability block in the "linked list". If the ID field is "08", this is another HyperTransport capability block. Read the Type field, and set up the register fields as appropriate.

  5. Continue the discovery and set-up process until the last capability block has been located and set up. If a block is the last one, its NPtr field is zero, indicating the end of the linked list of advanced capability blocks.
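
A compact sketch of this discovery loop is shown below, again assuming an illustrative cfg_read8() primitive: the walk starts at the Capabilities Pointer (offset 34h), checks each block's Capability ID, and follows NPtr until it reads zero.

    #include <stdint.h>

    extern uint8_t cfg_read8(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset);

    #define CAP_PTR_OFFSET  0x34   /* Capabilities Pointer in the header */
    #define CAP_ID_HT       0x08   /* Capability ID for HyperTransport   */

    /* Walk the linked list of advanced capability blocks for one function. */
    static void walk_capabilities(uint8_t bus, uint8_t dev, uint8_t func)
    {
        uint8_t offset = cfg_read8(bus, dev, func, CAP_PTR_OFFSET);

        while (offset != 0) {
            uint8_t id   = cfg_read8(bus, dev, func, offset);       /* Capability ID */
            uint8_t next = cfg_read8(bus, dev, func, offset + 1);   /* NPtr          */

            if (id == CAP_ID_HT) {
                /* The upper byte of the block's first dword encodes the HT
                 * capability Type (Host/Secondary, Slave/Primary, Interrupt
                 * Discovery, ...); set up the block's registers here.      */
            }
            offset = next;   /* zero terminates the list */
        }
    }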

Refer to MindShare's PCI System Architecture, 4th Ed. book for a complete description of configuration space advanced capability management.

HyperTransport Configuration Type 0 Header Fields

In this section, the configuration header format for non-bridge HyperTransport devices (type 0 header format) is described. For the most part, HyperTransport devices use these fields in the same way as PCI devices; the few differences are described here. Header fields not mentioned are used in the same way as in PCI devices.

Header Command Register

The Command register occupies the lower 16 bits of dword 01. It is used by BIOS or other software to enable basic capabilities of the device on the primary bus, including bus mastering, target address decoding, error response capability, etc.
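
For example, software might enable a device with a single write to the Command register. The sketch below assumes an illustrative cfg_write16() primitive and uses the standard PCI Command register bit positions.

    #include <stdint.h>

    extern void cfg_write16(uint8_t bus, uint8_t dev, uint8_t func,
                            uint8_t offset, uint16_t value);

    #define CMD_OFFSET         0x04        /* Command register offset          */
    #define CMD_MEM_ENABLE     (1u << 1)   /* respond to memory-space accesses */
    #define CMD_BUS_MASTER     (1u << 2)   /* allow the device to master       */
    #define CMD_PERR_RESPONSE  (1u << 6)   /* enable parity/error response     */

    /* Enable target address decoding, bus mastering, and error response. */
    static void enable_device(uint8_t bus, uint8_t dev, uint8_t func)
    {
        cfg_write16(bus, dev, func, CMD_OFFSET,
                    CMD_MEM_ENABLE | CMD_BUS_MASTER | CMD_PERR_RESPONSE);
    }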

How HyperTransport Handles Configuration Accesses

Configuration Cycles Are Memory Mapped

To generate a configuration space read or write, a HyperTransport bridge simply sends a RdSized or non-posted WrSized request using a reserved address range in the 40-bit HyperTransport memory map. This 32MB range, recognized by all devices, is what distinguishes configuration accesses from ordinary memory read/write requests.

How The 32MB Configuration Area Is Used

The 32MB HyperTransport memory map address space reserved for configuration cycles is used to access the 256-byte configuration space of each function in each device on each bus. How the address range is interpreted, and how a particular device recognizes configuration cycles it should claim vs. those it must forward, is described in the sections that follow.

Upper 16 Address Bits Indicate Type 0 And Type 1 Cycle

As in PCI, HyperTransport requires two variants of configuration read/write cycles, type 1 and type 0. A type 0 configuration cycle is generated by a bridge once the cycle has reached the target bus (chain) where the device being accessed resides; a type 1 cycle is in transit to the target bus and should be forwarded by bridges or tunnels in the target path. The bridge to the destination bus converts it to type 0.

Because HyperTransport configuration cycles are distinguished from other read/write requests only by the fact they target the 32MB reserved configuration address range, the first problem is how to distinguish type 1 from type 0 cycles. The 32MB configuration address range is further divided into two parts: request packets carrying addresses in the upper 16MB of the range are type 1 cycles; requests with addresses in the lower 16MB are type 0 cycles.

HyperTransport Type 1 Configuration Cycle

If a RdSized or WrSized request carries an address with the upper 16 bits set to FDFFh, then the cycle is a type 1 configuration request. Only bridges are allowed to accept these requests, and only if the bus number field in the address (bits 23:16) falls into the range defined by the bridge's Secondary and Subordinate bus number registers. The bridge then passes the request downstream.

HyperTransport Type 0 Configuration Cycle

If a RdSized or WrSized request carries an address with the upper 16 bits set to FDFEh, then the cycle is a type 0 configuration request. The cycle is claimed by the device whose UnitID matches the device number field (bits 15:11) in the address. That device then uses the function number and dword fields to target the particular internal function and configuration space offset.
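
Putting the two cases together, a receiver might decode an incoming configuration address along the lines of the sketch below. This assumes the function and dword fields mirror the familiar PCI layout (bits 10:8 and 7:2); decode_cfg_address() is illustrative, not a spec-defined interface.

    #include <stdbool.h>
    #include <stdint.h>

    #define CFG_TYPE1_UPPER  0xFDFFu    /* in transit to the target bus */
    #define CFG_TYPE0_UPPER  0xFDFEu    /* target bus reached           */

    struct cfg_fields {
        bool    type1;
        uint8_t bus;        /* meaningful for type 1 only */
        uint8_t device;     /* compared against UnitIDs   */
        uint8_t function;
        uint8_t dword;
    };

    /* Decode a 40-bit address; returns false if it is not in the 32MB
     * configuration range at all.                                     */
    static bool decode_cfg_address(uint64_t addr, struct cfg_fields *out)
    {
        uint16_t upper = (addr >> 24) & 0xFFFF;

        if (upper != CFG_TYPE1_UPPER && upper != CFG_TYPE0_UPPER)
            return false;                     /* not a configuration cycle */

        out->type1    = (upper == CFG_TYPE1_UPPER);
        out->bus      = (addr >> 16) & 0xFF;
        out->device   = (addr >> 11) & 0x1F;
        out->function = (addr >> 8)  & 0x07;
        out->dword    = (addr >> 2)  & 0x3F;
        return true;
    }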

No IDSEL Signal Needed In HyperTransport

Finally, there is no IDSEL signal to accompany a type 0 configuration cycle in the HyperTransport protocol. The need for this signal has been eliminated because a Base UnitID field has been included in the HyperTransport advanced capability register block so that a device is programmed to "know" its UnitID number(s). This allows the device to decode its own configuration cycles rather than depending on the upstream bridge to do it with IDSEL.



Events In HT Configuration Example
  1. Low-level software executing on the CPU requires access to the configuration space of Device 2 on Bus (chain) number 1.

  2. The Host Bridge checks its secondary bus number register, recognizes the target bus is not its secondary bus, and sends a request packet for type 1 configuration cycle onto bus 0 (using the upper half of configuration address range).

  3. The HT-to-HT bridge on Bus 0 checks the bus number field in the request and compares it with its own secondary and subordinate bus numbers. Because the target bus is below it, the HT-to-HT bridge forwards the configuration cycle onto bus 1; at the same time it converts it to a type 0 because the target bus has been reached. Conversion to type 0 simply means shifting the configuration address into the lower half of the configuration address range. Note that the bus number field is stripped off when the cycle is converted to type 0 (see the sketch following this list).

  4. Device 1 claims the cycle because it is a type 0 configuration cycle AND it carries a device number which matches one of its assigned UnitIDs.

  5. Device 1 then uses the function number and dword offset fields in the request packet to target the specific internal function and offset location in its configuration space.
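
The conversion performed in step 3 can be sketched as a simple address rewrite: move the address into the lower (type 0) half of the configuration range and strip the bus number field. convert_type1_to_type0() is an illustrative helper, not a defined interface.

    #include <stdint.h>

    /* Convert a type 1 configuration address into the equivalent type 0
     * address: clear the bus number field and move the address from the
     * upper half of the configuration range into the lower half, leaving
     * the device, function, and dword fields intact.                     */
    static uint64_t convert_type1_to_type0(uint64_t type1_addr)
    {
        uint64_t addr = type1_addr;

        addr &= ~(0xFFULL << 16);       /* strip the bus number field       */
        addr &= ~(0xFFFFULL << 24);     /* clear the upper 16 address bits  */
        addr |=  (0xFDFEULL << 24);     /* select the type 0 half of the range */
        return addr;
    }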

Initializing Bus Numbers And Unit IDs

One of the first steps in HyperTransport configuration is the initial assignment of bus numbers and UnitIDs for each device and chain in the topology. Using a depth-first search algorithm, enumeration software assigns IDs to each device it discovers; if it finds any HyperTransport bridges, it also assigns the primary, secondary, and subordinate bus numbers so that later configuration cycles may find their way to target buses other than bus 0.

Case 1: A Single Chain With One Host Bridge

In a single chain with only one host bridge, enumeration is fairly simple (a code sketch follows the list):

  1. Following a reset assertion on a chain, the Base UnitID field in the Slave Command register of each HyperTransport device is cleared to "0".

  2. In addition, reset forces the primary, secondary, and subordinate bus number registers in each HyperTransport bridge and the secondary bus number register in host bridges to "0" as well.

  3. The transmitter and receiver interfaces on each link perform the low-level negotiation to determine starting bus width. They also perform the required link initialization sequence. Once synchronization is complete, the Initialization Complete bit in each active Link Control register is set.

  4. After link synchronization, each active transmitter issues buffer release (NOP) packets to the corresponding receiver to indicate its own input flow control buffer capacities. Once this is done, each transmitter issues NOPs until configuration starts.

  5. The host bridge initializes its UnitID counter so it can start assigning UnitIDs to slave devices it discovers (it reserves UnitID 0 for itself).

  6. If the host bridge's Link Control register Initialization Complete and End Of Chain bits indicate that another device is attached to its secondary bus, the host bridge sends a series of configuration cycles to the first device in the chain. These type 0 configuration cycles target Bus 0, Device 0 (UnitID 0), Function 0. Because all devices default to UnitID 0, the first device will claim the cycles. Read cycles will target configuration space locations containing Vendor ID, Device ID, Class Code, Header Type, etc.

  7. At some point, the host bridge assigns new UnitID(s) to the device by reading the Unit Count field in the Slave Command Register and then programming (writing) the Base UnitID field with the next available UnitID (1). For devices which request more than one Unit ID, this Base UnitID is the first in a sequential set. Note that the act of writing the Command register causes the Base UnitID field to be updated and the Master Host bit to be set (indicating the device link which points towards the host bridge). Thereafter, the device uses its new UnitID when claiming configuration cycles, etc. Only a reset or rewriting the Slave Command register causes the Base UnitID field to change.

  8. Once all functions in the first device are configured, the host repeats the process to access the next device in the chain. It again uses the configuration cycle attributes of Bus 0, Device 0 (UnitID 0), Function 0. Now the device already assigned UnitID 1 forwards the transaction downstream because the UnitID in the request (0) does not match its own. The second device is then programmed as the first one was, but the UnitID(s) assigned to it start where the previous device left off (i.e., UnitID 2).

  9. After programming each device, the host bridge checks the End-Of-Chain (EOC) bit in the device's downstream Link Control register. If this bit is set to 1, the enumeration process for the chain is complete.
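
A high-level sketch of this loop is shown below. The helper functions are placeholders for the configuration reads and writes described in the steps above; their names and granularity are illustrative, not spec-defined interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    extern bool    device_responds_at_unitid0(void);   /* probe Bus 0, Device 0 (UnitID 0) */
    extern uint8_t read_unit_count(void);              /* Unit Count in the Slave Command register */
    extern void    write_base_unitid(uint8_t id);      /* program Base UnitID (Slave Command register) */
    extern bool    end_of_chain_reached(void);         /* EOC bit in the downstream Link Control register */

    /* Assign UnitIDs along a single chain, working outward from the host. */
    static void enumerate_chain(void)
    {
        uint8_t next_unitid = 1;                /* UnitID 0 is kept by the host bridge */

        while (device_responds_at_unitid0()) {
            uint8_t count = read_unit_count();  /* how many UnitIDs the device needs  */
            write_base_unitid(next_unitid);     /* device now answers to its new IDs  */
            next_unitid += count;

            if (end_of_chain_reached())         /* EOC set: enumeration is complete   */
                break;
        }
    }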

Case 2: A HyperTransport Bridge Is Discovered

If the enumeration process on a chain encounters a HyperTransport-to-HyperTransport bridge or a bridge from HyperTransport to a compatible protocol (PCI, AGP, PCI-X), then some additional initialization is needed. A bridge is detected when a read of the Header Type field in the configuration header indicates that the device uses the type 1 header format. Software must program the device in accordance with the type 1 header format, which includes the following (a brief code sketch follows the list):

  1. Programming the secondary and subordinate bus number registers with the next available bus number (1). This allows the bridge to forward and/or convert subsequent configuration cycles targeting the new bus(es) below it.

  2. Setting up the Base Address Registers and other fields in the configuration header in accordance with the protocol being used on the secondary bus (HyperTransport, PCI-X, PCI, etc.).
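
A brief sketch of step 1 is shown below, assuming an illustrative cfg_write8() primitive and the standard type 1 header offsets for the secondary (19h) and subordinate (1Ah) bus number registers.

    #include <stdint.h>

    extern void cfg_write8(uint8_t bus, uint8_t dev, uint8_t func,
                           uint8_t offset, uint8_t value);

    #define SECONDARY_BUS_OFFSET    0x19   /* type 1 header: secondary bus number   */
    #define SUBORDINATE_BUS_OFFSET  0x1A   /* type 1 header: subordinate bus number */

    /* Give a newly discovered bridge its secondary bus number; the
     * subordinate bus number starts equal to it and is raised later if
     * more bridges are found below this one.                           */
    static void program_bridge(uint8_t bus, uint8_t dev, uint8_t func,
                               uint8_t new_bus)
    {
        cfg_write8(bus, dev, func, SECONDARY_BUS_OFFSET,   new_bus);
        cfg_write8(bus, dev, func, SUBORDINATE_BUS_OFFSET, new_bus);
    }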

It is permissible for a HyperTransport bridge to have more than one secondary bus and/or a tunnel interface for its primary bus.

A Note About Bus Numbering In HyperTransport

Bus numbering in HyperTransport systems makes no distinction between HyperTransport, PCI, AGP, or PCI-X buses. As bridges to other protocols are discovered during enumeration, bus numbers are assigned without regard to the particular protocol.