Hypertransport CPU technology: Transactions in HT Technology

RdSized And WrSized Requests: Transaction Limits

Using the various request packet option bits when constructing RdSized and WrSized transactions makes it possible to perform byte and dword read and write transfers in a number of variations. The following section describes some of the key limits associated with RdSized and WrSized requests.

RdSized And WrSized (Dword) Transactions

Sized dword read and write transactions can transfer any number of contiguous dwords within a 64 byte, address-aligned block. The request packet Mask/Count field provides the number of dwords to be transferred, beginning at the start address and indexing addresses sequentially upward until the limit defined by the Mask/Count field is reached. All bytes in the range are considered valid. Dword read and write start addresses must be dword aligned. If the start address is 64 byte aligned, the transfer may include the entire 64 byte (16 dword) region; if the start address is not 64 byte aligned, the transfer can only go to the end of the current 64-byte address-aligned block. Dword requests which would cross 64 byte address boundaries must be broken into multiple transactions.

RdSized (Byte) Transactions

Sized byte read transactions can transfer any combination of bytes within one address-aligned dword; requests which would cross an aligned dword address boundary must be broken into multiple transactions. The request packet Mask/Count field provides the "byte enable" mask pattern, indicating which bytes are valid. Mask[0] qualifies byte 0, Mask[1] qualifies byte 1, etc. Any mask pattern is legal; mask bits can be ignored by targets reading from "pre-fetchable" locations (all four bytes in the target dword are always returned).

WrSized (Byte) Transactions

Sized byte write transactions can transfer any combination of bytes within a 32-byte address-aligned region. The request packet Mask/Count field provides the total number of dwords to be transferred including the required single dword "write mask" pattern. The mask itself is sent just ahead of the data byte payload, and indicates which of the data bytes that follow are valid. Mask bit[0] qualifies byte 0, Mask bit [31] qualifies byte 31, etc. Byte write start address must be dword aligned. If the start address is 32 byte aligned, the write transfer may be as large as the entire 32 byte (8 dword) region; if the start address is not 32 byte aligned, the transfer can only go to the end of the current 32 byte address-aligned block. Basically, start address bits [4:2] identify the first the valid dword of data within the 32-byte region defined by start address bits [39:5]. Byte write requests which would cross 32 byte address boundaries must be broken into multiple transactions. A couple of subtle things about these transfers:

The entire dword (32 bit) mask is always sent ahead of the data payload, regardless of start address and number of bytes being transferred. Mask bit fields are cleared for all invalid bytes in the 32-byte region ahead of the start address, for all invalid bytes within the transfer range itself, and for all unsent bytes remaining in the 32-byte region beyond the transfer limit implied by the Mask/Count field.
While it isn't illegal to send invalid dwords at the front and back of a WrSized (Byte) transfer, it is more efficient to adjust the start address and Mask/Count field to trim off completely invalid dwords in front of the first and after the last dwords containing at least one valid byte in the 32 byte aligned region.

RdSized And WrSized Requests: Other Notes

Coherency

The coherency bit in the Command field of RdSized and WrSized request packets (Byte 0, bit 0) indicates whether host cache coherency is a concern when HyperTransport RdSized and WrSized requests target host memory. Some buses, such as PCI, require coherency enforcement any time a transaction originating in the I/O subsystem targets main memory. This can represent a serious performance hit as processors spend much of their time snooping internal caches for accesses which they may not cache anyway.

HyperTransport uses the coherency bit in the Command field of the request packet to inform the system whether coherency actions are required. If the coherency bit is set:

All HyperTransport writes targeting host memory result in the CPU updating or invalidating the relevant cache line.
All HyperTransport reads targeting main memory must result in the latest copy being returned to the requestor. If the CPU has a modified cache line, the system must assure that this is the one returned to the requestor.

If a device has no particular requirement for coherency, it may chose to keep the coherency bit cleared. In this case, the request will complete without any coherency events.

Special Case: Forcing A Coherency Event. A RdSized (byte) targeting host memory with all Mask/Count bits set = 0 (no valid bytes) and coherency bit set = 1 in the request packet Command field causes a host coherency action, using the address provided in the read. One dword of invalid data will be returned.

WrSized Requests And The Posted Bit

Sized write request packets may or may not set the posted bit (bit 5 of the CMD field). The implications of this bit are as follows:

If set, the bit indicates the write request will travel in the posted request virtual channel and that there will not be a response from the target. Each device in the transaction path may de-allocate its buffers as soon as the posted request is transmitted. This also means that the SrcTag field is not used (reserved) because posted writes have no outstanding responses to track. This is in contrast to non- posted requests which require a unique SrcTag field for each request issued.

It the posted bit is not set, the requestor expects a confirmation that the data written has reached the destination — and is willing to suffer the performance penalty and wait for it. Eventually, a Target Done response will be routed back to the original requestor. In HyperTransport, certain address ranges require non-posted writes; this includes configuration and I/O cycles.

Errors During RdSized Transactions

In the event of a read error (SizedRd command), a response and all requested data is returned to the requestor, even though some or all of the data is not valid. Proceeding with a "dummy" read of invalid data is mainly for the benefit of devices in the transaction path that have already allocated flow control buffer space for the returning data. These devices use the return of each byte to simplify de-allocation of buffer space.

PassPW and Response May Pass Posted Requests bits

HyperTransport supports the strict producer-consumer ordering model found in PCI systems. There are occasions when strict producer/consumer ordering may not be required. In these cases, devices are allowed some flexibility in reordering of posted and non-posted request packets, as well as response packets. Ordering rules, including relaxed ordering, are described in more detail in the chapter entitled Ordering. Relaxing ordering rules is application-specific, and may provide better system performance in some cases.

The source of a transaction indicates whether or non relaxed ordering is permitted through the setting or clearing of two bits in a request:

PassPW bit. The PassPW request packet bit (Byte 1, bit 7) is programmed in the request packet and affects how ordering rules are applied to request as it moves toward the target. If set = 1, relaxed ordering is enabled; if PassPW is clear, relaxed ordering is not allowed.
Response May Pass Posted Requests bit. For RdSized transactions, there is also a bit in the Command field of the RdSized request packet called Response May Pass Posted Requests (Byte 0, bit 3). This bit state will be replicated in the PassPW bit of the returning response and affects how ordering rules are applied to response as it moves back to the original source. The Response May Pass Posted Requests bit does not apply to commands other than RdSized. For reads, the bit should be cleared if the strict producer/consumer ordering model is required; otherwise this bit and the PassPW bit should both be set in the request.

Compatibility Bit

In keeping with PCI subtractive decoding, HyperTransport may use the Compat bit in RdSized and WrSized request packets (Byte 2, bit 5) to enable them to reach legacy hardware (e.g. boot firmware) behind the system subtractive decoder. When the Compat bit is set, all system devices should pass the request downstream through the "compatibility chain" to the subtractive decoder. Only the subtractive decoder may claim these transactions. The Compat bit is reserved and must not be set for upstream requests or configuration cycles.

Flush Requests: Transaction Limits

The Flush request is a tool used to manage posted writes headed toward host memory. Two important limitations of the Flush request are:

If the posted writes target memory other than host memory (e.g. peer-to-peer transfers), then the flush request and response only guarantee that the posted writes have reached the destination host bridge, not the ultimate target. After the host bridge re-issues all peer-to-peer requests downstream towards the intended targets, it sends the target done response back to the original requestor; it is entirely possible the flush response (target done) will reach the original requestor before the request is seen at the target.
Flushes have no impact on the isochronous virtual channels. If isochronous flow control is not enabled on a link, then packets which do have the Isoc bit set actually travel in the normal virtual channels and will be affected by Flush requests.

Fence Requests

Another tool in the management of posted write transactions is the HyperTransport Fence command. The main features of the Fence request are:

A Fence request provides a barrier between posted writes which applies to all UnitID's (transaction streams). This is different from the Flush which is specific to the posted writes associated with a single transaction stream. When the Fence is decoded by the bridge, it sends any previously posted writes in its buffers toward memory. As always, ordering is maintained for posted writes within individual single transaction streams, but no particular ordering is required for different streams.
The Fence request travels in the posted virtual channel, meaning that there is no response expected or sent.

Fence Requests: Transaction Limits

The Fence request is a tool used to manage posted writes headed toward host memory from all transaction streams. Limitations of the Fence request include:

Fence requests are issued from a device to a host bridge, or from one host bridge to another. While a tunnel forwards fence requests it sees, tunnels and single-link cave devices are never the target of a fence request and are never required to perform the fence function internally.
Fences have no impact on the isochronous virtual channels. If isochronous flow control is not enabled, then other packets which do have the Isoc bit set actually travel in the normal virtual channels and will be affected by fence requests.
If a fence request is seen by an end-of-chain device, it decodes the transaction and drops it. It may optionally choose to log the event as an end-of-chain error.

Atomic Read-Modify-Write Requests

While sized read and sized write requests can handle most general purpose HyperTransport data transfers, there are times when a combined, or atomic, read/write command is needed.

Two Problems In Shared Memory Schemes

Two problems related to shared memory schemes include:

A memory location may be used for storing a "semaphore" to be checked by multiple devices (e.g. CPUs or I/O masters) before using a shared system resource. If the contents of the semaphore location indicate the resource is available, the device which reads it then over-writes the semaphore value to indicate the resource is now busy. If another agent reads the semaphore location and sees it is busy, it must wait until the agent using it clears the semaphore location, thus indicating it is again free. The problem arises when a sharing agent has read the semaphore and found the device is not busy. Before it over-writes the data value to claim the resource, another agent reads the semaphore location and also concludes the device is not busy. Now there is a race condition which can result in both devices attempting to over-write the semaphore and use the resource.
The second problem is simpler. If a shared memory location is being used as an accumulator, agents will periodically read the current value, add a constant to it, and write the result back. Again, there is a hazard that the location will be read by one agent and before it can modify it and write it back, another agent may read it with a similar intention. In this case, one of the addends may be lost from the sum.

Most modern bus protocols that support shared memory include a mechanism to avoid the conditions just described. HyperTransport uses the Atomic Read-Modify-Write request for this purpose. The purpose of the Atomic RMW is to force a one-qword (8 byte) memory location to remain "locked" for the duration of the read/modify/write operation required to check and change the targeted location. No other agent is allowed to access the address carried by the Atomic RMW request packet until the entire transaction completes. It is the responsibility of the bridge managing the memory to enforce the locking mechanism.

As a transaction, the Atomic RMW behaves like non-posted write that generates a read response. The read response is accompanied by a single qword of data — the value read from the targeted memory location before any changes are made.

Atomic RMW Variants

The Atomic Read-Modify-Write request has two variants that are designed to address the two cases just described.

Compare And Swap

The Compare and Swap variant of the Atomic RMW sends two qwords of data with the request. One qword (the compare value) is to be checked against the current value in memory; the other qword (the input value) is the data to be written to the memory location if the compare value is equal to the current value. If the compare value is not equal to the current value, the input value is not written to memory. In either case, a read response will be returned accompanied by the original qword read from memory.

Fetch And Add

The Fetch and Add variant of Atomic RMW sends a single qword (the input value) of data with the request. When the Atomic RMW reaches the bridge to main memory, the bridge unconditionally reads the current value from memory, adds the input value to it, and writes the result back to memory. The memory location remains locked to other transactions while the read-modify-write is in progress. A read response is then returned to the requestor, accompanied by the original qword read from memory.

Atomic RMW Requests: Transaction Limits

The Atomic RMW request locks a qword memory address block while a read-modify-write operation is performed. Limitations of the Atomic RMW request include:

The request transfer size, as indicated in the Mask/Count field, is restricted to either one or two qwords. Following the request, a read response returns a single qword of data from memory.
These transactions are designed to be generated by I/O devices or bridges, and target system memory. Other than the host bridge, no HyperTransport devices are expected to support atomic operations. If a target detects an unsupported RMW, it may return a one qword read response with the error bit set or perform a non-atomic read-modify-write. The current HyperTransport Specification does not require peer-to-peer reflection of Atomic RMW.

Control Packets: Responses

There are two response types used in HyperTransport: Read Response and Target Done. Responses are returned by target devices following a non-posted request, and much of the response packet field information is extracted from the requests that caused them. Because responses are routed back to the original requestor either implicitly or based on UnitID, they don't require a 40 bit address field like requests do. All response packets are four bytes.

Read Responses

The four-byte read response is returned when data requests are made, including RdSized and Atomic RMW requests. All HyperTransport read transactions are non-posted and split; this means that data is never returned immediately as it generally is on buses such as PCI. The advantage of split reads is that the latency involved, in waiting for a target to access its internal memory before returning read data, can be minimized by sending the request, releasing the bus, and waiting for the target to initiate the return of data when it has it.

In HyperTransport, the read response is used by the target to indicate the return of previously requested data. The read response immediately precedes the data, and contains the following general information:

The response packet type.
Whether the response should travel in the standard or isochronous virtual channel.
UnitID which acts as an address for responses.
A direction bit indicating whether the response is moving upstream or downstream.
Whether relaxed ordering may be used for this response relative to posted writes moving in the same stream.
Error bits indicating whether or not the returning data can be considered valid; if it is invalid, error bits indicate whether the error occurred at the target or if the request inadvertently reached an end-of-chain device.

Hypertransport CPU technology

Tuesday, June 26, 2007

Transactions in HT Technology