Tuesday, June 26, 2007

Error Reporting in HyperTransport Technology

Error Reporting

The three error reporting methods (error responses, fatal and non-fatal interrupts, and sync flood) have different system implications. They are described here in order of increasing severity.

Error Responses (Non-Posted Requests Only)

The HyperTransport specification considers error responses the preferred error reporting mechanism because they are the most localized (conveyed only from target to requester). Error responses are transaction-specific and do not prevent the link from performing other transfers — even to or from the same device.

Every RdSized or Atomic Read-Modify-Write request results in the return of a Read response from the target, followed by all of the requested data. All non-posted WrSized and Flush requests result in the return of a Target Done response which confirms the completion of the operation, but is not accompanied by data.

When either a Read or Target Done response packet is returned to a requester, the requester checks the state of the two error bits — Error and NXA (Non-Existent Address) — contained in the packet to determine if the transaction completed properly. The two sources of error responses are the target device and, in the case of a non-existent address, the end-of-chain device.

Error Response Returned By The Target

If a non-posted request reaches a target, but the target cannot complete the operation (can't source or accept data, etc.), the target will return the appropriate response with the Error bit set. If the request called for the return of data (RdSized or Atomic RMW), all requested data (as indicated in the Mask/Count field of the request) will also be returned. Sending the data (even though it is invalid) allows devices in the path to deallocate buffer space and retire the outstanding transaction.

A returning response with Error set and NXA cleared is equivalent to a PCI target abort; HyperTransport requesters detecting this "non-NXA" error response set the Received Target Abort bit in the PCI Status register. Bridges seeing this error on a secondary bus would set the bit in the Secondary Status CSR.

Error Response Returned By An End-Of-Chain Device

If a non-posted request fails to reach the target (bad address, etc.), an end-of-chain device must send the response on its behalf. The response will have both the Error and NXA bits set. As in the target response above, if the request called for the return of data, all requested data will be returned; the data is again invalid, and every byte is returned as FFh.
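As a rough illustration of the filler data described above, the following minimal sketch builds the payload an end-of-chain device might return for a dword read it cannot route. The function name is hypothetical, and the sketch assumes a dword read in which the 4-bit Count field holds the number of dwords minus one.

```python
def nxa_read_data(count_field: int) -> bytes:
    """Return the invalid filler data for an unroutable dword read.

    count_field is the 4-bit Count value from the request; the sketch
    assumes it encodes (number of dwords - 1).
    """
    if not 0 <= count_field <= 0xF:
        raise ValueError("Count is a 4-bit field")
    dwords = count_field + 1
    # Every byte of the returned data is FFh.
    return b"\xff" * (4 * dwords)
```

Returning the full amount of data, even though it is meaningless, is what lets devices along the path free their buffers for the transaction.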

A returning response with both Error and NXA set is equivalent to a PCI master abort; HyperTransport requesters detecting the NXA error response set the Received Master Abort bit in the PCI Status register. Bridges seeing this error on a secondary bus would set the equivalent bit in the Secondary Status CSR.
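The mapping from the two response error bits to the PCI-style status bits can be sketched as follows. The names are hypothetical; the bit positions are those of the standard PCI Status register (Received Target Abort is bit 12, Received Master Abort is bit 13). The sketch also assumes that NXA carries no meaning unless Error is set.

```python
from enum import Enum

# Bit positions in the standard PCI Status register.
RECEIVED_TARGET_ABORT = 1 << 12
RECEIVED_MASTER_ABORT = 1 << 13


class Outcome(Enum):
    OK = "ok"
    TARGET_ABORT = "target abort"   # Error set, NXA clear
    MASTER_ABORT = "master abort"   # Error set, NXA set


def classify_response(error: bool, nxa: bool) -> Outcome:
    """Classify a Read or Target Done response by its Error/NXA bits."""
    if not error:
        # Assumption: NXA is ignored when Error is clear.
        return Outcome.OK
    return Outcome.MASTER_ABORT if nxa else Outcome.TARGET_ABORT


def update_pci_status(status: int, outcome: Outcome) -> int:
    """Return the Status register value after the requester logs the outcome."""
    if outcome is Outcome.TARGET_ABORT:
        return status | RECEIVED_TARGET_ABORT
    if outcome is Outcome.MASTER_ABORT:
        return status | RECEIVED_MASTER_ABORT
    return status
```

A bridge seeing the same response on its secondary bus would apply the same mapping to its Secondary Status register instead.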

Fatal And Non-Fatal Interrupts

Using interrupts to inform the system of errors is slightly more complex because the interrupt message must travel up through the topology to the host. An interrupt can indicate a non-fatal error (roughly analogous to INTR# in an x86 machine), which implies that the device issuing it has seen an error but may be able to recover from it. Alternatively, an interrupt can indicate a fatal error (analogous to NMI# in an x86 machine), which means that recovery is not possible. Interrupts of either type do not prevent the link from performing other transfers. The conditions under which fatal or non-fatal interrupts are to be used are device and driver specific.

In HyperTransport, interrupts are typically sent using an interrupt message scheme rather than sideband interrupt signals as found in other buses. Devices are not prevented from using external pins as an option, although this method is beyond the scope of the HyperTransport specification.

An interrupt message transaction is actually a special case of the standard sized byte write (WrSized Byte) request. Devices in the system can distinguish interrupt messages from other sized writes by the following attributes of the request:

  • Interrupt requests target a reserved address in the system address map (from FD_0000_0000h to FD_F8FF_FFFFh).

  • The command type is WrSized (byte).

  • The Count field is always programmed with a "0", indicating a single dword of data content follows. In standard byte writes, this would be the byte mask dword; in interrupt requests, the single dword data payload contains information about the interrupt.
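The three attributes above can be combined into a simple check. This is only a sketch: the function name and the string used to represent the command type are hypothetical, and the address bounds are taken directly from the range quoted above.

```python
# Reserved interrupt range in the system address map, as quoted in the text.
INTERRUPT_BASE = 0xFD_0000_0000
INTERRUPT_LIMIT = 0xFD_F8FF_FFFF


def is_interrupt_message(command: str, address: int, count: int) -> bool:
    """Return True if a request looks like an interrupt message.

    command is a hypothetical string tag for the request type;
    count is the raw Count field of the request.
    """
    return (
        command == "WrSized(byte)"
        and INTERRUPT_BASE <= address <= INTERRUPT_LIMIT
        and count == 0  # a single dword of payload follows
    )
```

For example, `is_interrupt_message("WrSized(byte)", 0xFD_F800_0000, 0)` evaluates to `True`, while the same request aimed at an address outside the reserved range is treated as an ordinary byte write.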


Sync Flood: When All Else Fails

In some cases, one or more links in HyperTransport may get into a state where ordinary packets cannot be sent reliably. For example, a device may detect a series of CRC errors which indicates to it that either the external link is broken or, more likely, it may not be synchronized with its neighbor with respect to CRC stuffing in the CAD stream. If this is the case, it can't send new packets; it also can't convey the fault using fatal/non-fatal interrupts because they travel in the same channels as other packets.

Sync flood reports errors that cannot be signalled by other methods. It is roughly analogous to the PCI SERR# (system error) event and has a serious impact on the entire chain. Sync flood packets put the chain into an inactive state pending a warm reset to restore normal packet protocol. The behavior of the device initiating the sync flood is slightly different from the other devices which propagate it. The basic rules are described below.

Device Initiating The Sync Flood
  1. The device initiating sync flood must have the SERR# Enable bit in the configuration header Command register set to 1 before it initiates a sync flood for any reason.

  2. If the device intends to initiate sync flood for CRC errors, buffer overflow errors, or protocol errors, it must first check the corresponding "flood enable" bits in the Error Handling and Link Control registers.

  3. The device which initiates a sync flood sets the Signaled System Error bit in the configuration header Status register, the Link Fail bit in its Link Control register, and the Chain Fail bit in the Error Handling CSR. Note: if all conditions for a sync flood have been met, Link Fail is always set, even if the SERR# Enable bit in the configuration header Command register is clear (which prevents the sync flood packets from actually being sent).
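The initiator rules above can be sketched as a small decision function. All names here are hypothetical, and the per-error flood enable bits are collapsed into a single flag. One point is an assumption rather than something stated above: the sketch sets Signaled System Error and Chain Fail only when the flood is really sent, since the text only says that Link Fail is set regardless of SERR# Enable.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    serr_enable: bool    # SERR# Enable bit, Command register
    flood_enable: bool   # flood enable bit for the detected error class
    signaled_system_error: bool = False
    link_fail: bool = False
    chain_fail: bool = False


def try_sync_flood(dev: DeviceState) -> bool:
    """Apply the initiator rules; return True if sync packets are sent."""
    if not dev.flood_enable:
        # This error class is not allowed to escalate to a sync flood.
        return False
    # The flood conditions are met, so Link Fail is set in any case.
    dev.link_fail = True
    if not dev.serr_enable:
        # SERR# Enable is clear: no sync packets leave the device.
        return False
    # Assumption: these bits are set only when the flood is actually sent.
    dev.signaled_system_error = True
    dev.chain_fail = True
    return True
```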

Devices Detecting Sync Flood
  1. Devices detecting sync flood at a receiver input cease all normal packet transmission on the affected chain.

  2. Each device sets the Chain Fail bit in its Error Handling CSR.

  3. Each device drives sync packets onto all transmitter interfaces, including back to the device which initiated the flood. This ensures that sync is seen on all links on the chain.
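To see why these rules cover the whole chain, the propagation can be modeled on a simple linear chain. This is a toy sketch with hypothetical names: devices are numbered 0 to n-1, each one has links only to its immediate neighbors, and timing is ignored.

```python
from collections import deque
from typing import List


def propagate_sync_flood(num_devices: int, initiator: int) -> List[bool]:
    """Return the Chain Fail bit of every device after the flood settles."""
    chain_fail = [False] * num_devices
    chain_fail[initiator] = True  # the initiator sets its own Chain Fail bit
    pending = deque([initiator])
    while pending:
        dev = pending.popleft()
        # A flooding device drives sync on all of its transmitters.
        for neighbor in (dev - 1, dev + 1):
            if 0 <= neighbor < num_devices and not chain_fail[neighbor]:
                # The neighbor detects sync, stops normal traffic, and
                # starts flooding in turn.
                chain_fail[neighbor] = True
                pending.append(neighbor)
    return chain_fail
```

Wherever the flood starts, every device in the modeled chain ends up with Chain Fail set, because each device that sees sync repeats it on every link it owns.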

Sync Flooding And HyperTransport Bridges
  1. Bridges set the Detected System Error bit in the Secondary Status register if they see a sync flood on the secondary bus.

  2. A bridge may forward a secondary bus sync flood upstream to the primary bus if the SERR# Enable bit in its PCI Command register is set. This is similar to the behavior of PCI-PCI bridges when SERR# is detected on a secondary bus. The bridge may optionally convert the secondary bus sync flood to a fatal or non-fatal interrupt on the primary bus.

  3. Bridges always propagate primary bus sync floods downstream onto their secondary bus(es).
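The bridge rules can be summarized in a short sketch. The class, its fields, and the action strings are all hypothetical. The `convert_to_interrupt` flag stands in for the optional conversion policy, and the sketch assumes that this conversion, when selected, takes the place of forwarding and does not depend on SERR# Enable; the text above does not settle that point.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bridge:
    serr_enable: bool                    # SERR# Enable, PCI Command register
    convert_to_interrupt: bool = False   # hypothetical policy setting
    detected_system_error: bool = False  # Secondary Status register bit
    actions: List[str] = field(default_factory=list)


def on_secondary_flood(bridge: Bridge) -> None:
    """Handle a sync flood seen on the secondary bus."""
    bridge.detected_system_error = True
    if bridge.convert_to_interrupt:
        # Assumption: conversion replaces forwarding when it is selected.
        bridge.actions.append("interrupt upstream")
    elif bridge.serr_enable:
        bridge.actions.append("sync flood upstream")


def on_primary_flood(bridge: Bridge) -> None:
    """Handle a sync flood seen on the primary bus."""
    # Primary bus floods are always propagated downstream.
    bridge.actions.append("sync flood downstream")
```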

Miscellaneous Notes
Flooding Continues Until Reset

Once a device commences the sync flood operation, it must continue until a reset is detected on the affected bus. This ensures that the sync flood propagates throughout the chain.
