Hypertransport CPU technology: Errors | Error Checking in HT Technology CPU

HyperTransport defines six types of errors, and three basic ways they may be reported to the system.

Types Of Errors

The error types which may be detected, logged, and reported are:

CRC (Cyclic Redundancy Code) Errors
Protocol Errors
Receive Buffer Overflow Errors
End Of Chain Errors
Chain Down Errors
Response Errors

Reporting Methods

Once an error is detected, it can be conveyed to other devices in the system in the following ways:

Error Responses
Error Interrupts (fatal and non-fatal)
Sync Flooding

The Role Of PCI Configuration Space

The PCI Configuration Space required of each HyperTransport device performs several roles in error handling. The Command and Status registers in the header and the Link Error and Error Handling registers in the HyperTransport Advanced Capability Register block are used to report error handling capabilities, program the error reporting mechanism to be used if an error occurs, and to log the errors which occur so that software can later assess the error events seen by each device.

Once the error capabilities of a device have been determined and the error reporting strategy is programmed in configuration space, any errors which occur will be handled accordingly. For example, a HyperTransport device which detects a protocol error may be programmed to set the corresponding log bit in the configuration space Error Handling register and generate a fatal interrupt message.

Most Types Of Error Checking Are Optional

To accommodate differences in how devices and applications may view certain types of errors, the specification only requires CRC generation/checking on each link; other aspects of error detection and handling are optional. If a particular error is not checked, the corresponding enable and logging bits in configuration space must be hardwired to 0.

System Handling Of HyperTransport Errors Varies

As in many other bus protocols, HyperTransport bus behavior during error events is well specified but the action taken by the system in response to reported errors is implementation specific. However, if Sync flood is used as a reporting mechanism, a reset is required on the affected chain(s) to restore proper protocol.

The Error Types

The following section summarizes the required CRC generation/checking as well as the optional protocol, receive buffer overflow, end of chain, chain down, and response error handling.

CRC Errors

The Cyclic Redundancy Code (CRC) is used to detect transmission errors on all enabled byte lanes on each link. The 32-bit CRC value is calculated and sent at prescribed intervals by each transmitter, then checked against the CRC value calculated by the corresponding receiver as packets arrive. The CRC is the remainder left when the transmitted bit stream (the CAD bits plus the CTL signal during each bit time), treated as one long polynomial, is divided by the CRC polynomial. The polynomial used is:

X³² + X²⁶ + X²³ + X²² + X¹⁶ + X¹² + X¹¹ + X¹⁰ + X⁸ + X⁷ + X⁵ + X⁴ + X² + X + 1
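
As a quick check on the polynomial, the sketch below builds the 32-bit mask from the exponents listed above and runs a generic bit-serial shift-register division. The names poly_mask and crc_shift_in, the bit order (least significant bit first), and any starting register value are assumptions made for illustration; the specification defines the exact ordering used on a link.

```python
# Exponents of the CRC polynomial as listed in the text.
CRC_EXPONENTS = (32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0)


def poly_mask(exponents):
    """Build the 32-bit mask for the polynomial (the X^32 term is implicit)."""
    mask = 0
    for exponent in exponents:
        if exponent < 32:
            mask |= 1 << exponent
    return mask


POLY = poly_mask(CRC_EXPONENTS)  # 0x04C11DB7


def crc_shift_in(crc, value, nbits):
    """Shift nbits of value into a 32-bit CRC register, LSB first (assumed order).

    For an 8-bit lane, one bit time would contribute 9 bits: 8 CAD bits plus CTL.
    """
    for i in range(nbits):
        top = (crc >> 31) & 1                      # bit about to fall off the register
        crc = ((crc << 1) & 0xFFFFFFFF) | ((value >> i) & 1)
        if top:
            crc ^= POLY                            # subtract the polynomial (mod 2)
    return crc
```

Because the division is linear, two inputs that differ in a single bit always leave different remainders, which is why any single-bit transmission error in a window is detected.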

CRC On 8, 16, or 32 bit Interfaces

For interfaces which are 8-, 16-, or 32-bits wide, CRC is independently generated and checked for each byte of CAD width.

CRC Generation/Checking: 8/16/32 bit links

After link initialization, each transmitter begins sending packets (NOP, etc.). CRC calculation is based on "raw" CAD/CTL bit patterns on each CAD byte without regard to the packet types being sent.
512 bit times after initialization, the first 32-bit CRC value has been calculated for each byte lane. That value is "stuffed" into its CAD stream starting 64 bit times into the next window. Note: because of this delay, there is no CRC sent during the first window.
Although each window for CRC calculation is 512 bit times, every window after the first actually spans 516 bit times, because the CRC for each window is inserted into the following one and occupies four additional bit times. Note that the CRC value stuffed into a window is not included in the CRC calculation for that window.
There is no special signalling associated with CRC transmission; both devices simply count the bit times starting with link initialization and "know" where the CRC payload falls in each window.
CRC is calculated and sent independently for each 8 bits of CAD width. The CTL signal itself is included in the CRC calculation for the lowest byte of CAD (bits 0-7). On a bus wider than 8 bits, the CTL signal is also factored into the CRC calculation for each of the upper CAD bytes, but is assumed to be 0 during all bit times.
While the CRC value itself is being driven, the transmitter drives the CTL signal to 1 (Control). The CRC bits are inverted before being transmitted onto the link.
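
The window arithmetic described above can be sketched as a small function that reports whether a given bit time on one byte lane carries CRC. This is one reading of the text, assuming bit times are counted from zero at link initialization and that the stuffed CRC occupies offsets 64 through 67 of every window after the first; the function and constant names are hypothetical.

```python
FIRST_WINDOW = 512   # bit times in the first window (no CRC is stuffed into it)
LATER_WINDOW = 516   # 512 payload bit times plus 4 bit times of stuffed CRC
STUFF_OFFSET = 64    # CRC starts this many bit times into a window
CRC_LENGTH = 4       # bit times needed to send 32 CRC bits on one 8-bit lane


def is_crc_bit_time(t):
    """Return True if bit time t (0 = first bit time after init) carries CRC."""
    if t < FIRST_WINDOW:
        return False                       # first window: nothing to send yet
    offset = (t - FIRST_WINDOW) % LATER_WINDOW
    return STUFF_OFFSET <= offset < STUFF_OFFSET + CRC_LENGTH
```

Both ends of the link can evaluate the same function independently, which mirrors the point that no special signalling marks the CRC: each side simply counts bit times.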

CRC Generation/Checking: 2/4 bit links

On links narrower than 8 bits, the CRC value is generated in the same way as for 8-bit links carrying the same value. It simply takes longer to move the packets and CRC value across the link — causing the calculation window and stuffing point for the CRC value to be stretched accordingly. The extra assertions of the CTL signal (after the first bit time in each byte) are not used by the transmitter or receiver in the CRC calculation.

4 Bit CAD Width

A CAD width of four bits requires twice as many bit times as an 8 bit bus for moving information across the link. Therefore:

The CRC window size is 1024 bit times.
The CRC stuffing point starts 128 bit times after the start of a window.
It takes 8 bit times to transfer the 32-bit CRC value.

2 Bit CAD Width

A CAD width of two bits requires four times as many bit times as an eight bit bus for moving information across the link. Therefore:

The CRC window size is 2048 bit times.
The CRC stuffing point starts 256 bit times after the start of a window.
It takes 16 bit times to transfer the 32-bit CRC value.
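
The 4-bit and 2-bit figures above all follow from stretching the 8-bit values by the ratio of lane widths. A minimal sketch of that scaling, with a hypothetical function name and return layout:

```python
def crc_timing(cad_width):
    """Return CRC timing in bit times for a link of the given CAD width.

    Links of 8, 16 or 32 bits use the per-byte-lane values; narrower links
    stretch every figure by 8 / width.
    """
    if cad_width not in (2, 4, 8, 16, 32):
        raise ValueError("unsupported CAD width: %r" % (cad_width,))
    lane_width = min(cad_width, 8)     # CRC is computed per 8 bits of CAD
    stretch = 8 // lane_width          # 1, 2 or 4
    return {
        "window": 512 * stretch,        # CRC calculation window
        "stuff_offset": 64 * stretch,   # where the CRC starts within a window
        "crc_bit_times": 4 * stretch,   # bit times to transfer 32 CRC bits
    }
```

For example, crc_timing(4) gives a 1024 bit time window, a stuffing point at 128, and 8 bit times for the CRC transfer, matching the list for 4-bit links.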

Logging CRC Errors

CRC errors impact both control and data information; if these errors occur on any CAD byte lane, the corresponding error bit(s) will be set in the Link Control CSR of the HyperTransport Advanced Capability block. There are four such bits, one for each byte lane.
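
Software that reads the log can turn the four per-lane bits into a list of failing byte lanes. The sketch below assumes the four bits are contiguous and start at bit 8 of the Link Control value; that position is an assumption to check against the specification, and the function name is hypothetical.

```python
CRC_ERROR_SHIFT = 8    # ASSUMED position of the lowest CRC error bit
CRC_ERROR_MASK = 0xF   # four bits, one per byte lane


def lanes_with_crc_errors(link_control):
    """Return the byte lane numbers (0-3) whose CRC error bit is set."""
    bits = (link_control >> CRC_ERROR_SHIFT) & CRC_ERROR_MASK
    return [lane for lane in range(4) if bits & (1 << lane)]
```

On an 8-bit link only lane 0 is in use, so at most the first bit would ever be reported.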

Programming The CRC Error Reporting Policy

Informing the system of a CRC error on one or more of the links is handled in the manner programmed at boot time in the Advanced Capability Error Handling and Link Control Registers. Options include sending a fatal interrupt message, non-fatal interrupt message, or initiation of a sync flood.

CRC Test Mode

If both devices on a link support the CRC diagnostic testing mode (determined by checking bit 2 in the Feature Capability register for each device), then software may enable a test sequence that allows stress tests of CRC generation and checking. The basic events involved in link CRC testing include:

Software writes a "1" to the CRC Start Test bit of the Link Control register. Setting this bit informs the transmitter interface that it should enter the CRC diagnostic mode for the following 512 bit times on each enabled byte lane. For 4- or 2-bit CAD widths, this time is stretched to 1024 or 2048 bit times, respectively.
The transmitter sends a NOP packet with the Diag bit set; this informs the receiver that it should ignore the CAD and CTL signals for the next 512 bit times but must still check CRC. Again, for 4- or 2-bit CAD widths, this time is stretched to 1024 or 2048 bit times, respectively.
With the normal buffers suspended, the transmitter may generate any test pattern it wants; CRC is still stuffed into the CAD test pattern stream in the normal way.
CRC errors detected during this time will be logged normally, and if the Sync flood is enabled, it will be performed. All data content is "don't care" during this time and is dropped.
If the CRC Force Error (CFE) bit is also set during the test, then the test pattern sent by the transmitter will contain at least one CRC error in each of the active byte lanes.
When the test is complete, hardware automatically clears the CRC Start Test bit. This bit may be polled by software to check completion.
At the end of the CRC Diagnostic test, normal packet transfer resumes.
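
From the software side, the sequence above amounts to setting the start bit, optionally setting CFE, and polling until hardware clears the start bit. The sketch below models that loop against a stand-in register object. The bit positions, the FakeLinkControl class, and run_crc_test are all assumptions for illustration, not a real driver interface.

```python
CRC_START_TEST = 1 << 2    # ASSUMED bit position of CRC Start Test
CRC_FORCE_ERROR = 1 << 3   # ASSUMED bit position of CRC Force Error (CFE)


class FakeLinkControl:
    """Stand-in register: 'hardware' clears Start Test after a few reads."""

    def __init__(self, polls_until_done=3):
        self.value = 0
        self._remaining = polls_until_done

    def read(self):
        if self.value & CRC_START_TEST:
            self._remaining -= 1
            if self._remaining <= 0:
                self.value &= ~CRC_START_TEST   # test finished
        return self.value

    def write(self, value):
        self.value = value & 0xFFFF


def run_crc_test(reg, force_error=False, max_polls=100):
    """Start the CRC diagnostic test and poll for completion.

    Returns the number of polls needed; raises TimeoutError if the start
    bit is still set after max_polls reads.
    """
    value = reg.read() | CRC_START_TEST
    if force_error:
        value |= CRC_FORCE_ERROR
    reg.write(value)
    for polls in range(1, max_polls + 1):
        if not reg.read() & CRC_START_TEST:
            return polls
    raise TimeoutError("CRC test did not complete")
```

After the test completes, software would read the CRC error log bits to see whether any lane reported an error; with CFE set, every active lane is expected to.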

Protocol Errors

Protocol errors are failures on the link involving low-level packet violations. These include the following:

CTL Signal Four-Byte Boundary Violation

The CTL signal may only change state (low to high or high to low) on four-byte boundaries. The exception to this rule is during the CRC diagnostic test mode. If an illegal transition is detected, then either the transmitter or the receiver has lost track of where packets start and end.
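
A receiver-side check for this rule can be sketched as follows. It assumes an 8-bit link, where four bytes take four bit times, that the sample list starts on a four-byte boundary, and that the CRC diagnostic mode is not active; the function name is hypothetical.

```python
def ctl_boundary_violations(ctl_samples, bit_times_per_dword=4):
    """Return the bit times at which CTL changed off a four-byte boundary.

    ctl_samples holds one CTL value (0 or 1) per bit time, with index 0
    assumed to fall on a four-byte boundary.
    """
    violations = []
    for t in range(1, len(ctl_samples)):
        changed = ctl_samples[t] != ctl_samples[t - 1]
        if changed and t % bit_times_per_dword != 0:
            violations.append(t)
    return violations
```

For a 4-bit or 2-bit link the same check would apply with bit_times_per_dword set to 8 or 16.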

CTL Deassertion Violation

Other than when CRC diagnostic test mode is in use, a transmitter only deasserts the CTL signal during data packets associated with earlier requests requiring them. Deasserting CTL when data packets are not in transit is another protocol violation.

CTL/Data Interleaving Violation

A transmitter is allowed to interleave new control packets into the data packet associated with an earlier request if the new control packet does not have any immediate data of its own. If an attempt is made to interleave a control packet with immediate data (e.g. a write request) into a data packet already in transit, this is a protocol violation.

Bad Command Code In Control Packet

Control packets (request, response, information) have a 6-bit command field in the first byte to encode the intended operation. Some codes are not used, and are reserved. Sending an illegal command code is another protocol violation.

CTL Deassertion Timeout Violation

The HyperTransport specification limits the amount of time the CTL signal may be deasserted. There are two maximum timeout options (1 millisecond or 1 second) and the one in effect is programmed in bit 15 of the Link Error Register. If the transmitter exceeds the programmed maximum CTL deassertion timeout, it is a protocol violation.

CTL Deasserted During CRC Transmission

CTL is always asserted during the transmission of the 32-bit CRC code in each calculation window. If a receiver detects CTL deasserted during a CRC stuffing period, it is a protocol violation.

Logging Protocol Errors

Protocol error checking is optional. If protocol violations are checked, the Link Error register logs the errors; refer to Figure 10-5 on page 239.

Programming The Protocol Error Reporting Policy

Informing the system of a protocol error on one or more of the links is handled in much the same way as for CRC errors. Protocol errors may be mapped to a fatal or non-fatal interrupt message, or to a sync flood. The reporting strategy is programmed in the Error Handling CSR.

Receive Buffer Overflow Errors

Receive buffer overflow errors can occur if a link transmitter no longer maintains an accurate count of available flow control buffers at the receiver. If a flow-controlled packet (posted request, non-posted request, or response) is sent without an available receiver flow control buffer to accept it, the packet will be lost.
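
The bookkeeping that prevents this error can be illustrated with a simple per-channel credit counter on the transmitter side. This is a simplified model, assuming one counter per packet class and ignoring any split between command and data buffers; the class and method names are hypothetical.

```python
class CreditTracker:
    """Transmitter-side count of free receiver buffers, per packet class."""

    CHANNELS = ("posted", "nonposted", "response")

    def __init__(self, initial):
        # initial maps channel name -> buffers advertised by the receiver
        self.credits = {ch: initial.get(ch, 0) for ch in self.CHANNELS}

    def can_send(self, channel):
        return self.credits[channel] > 0

    def send(self, channel):
        """Consume one credit; sending without one would overflow the receiver."""
        if not self.can_send(channel):
            raise RuntimeError("no receive buffer available for " + channel)
        self.credits[channel] -= 1

    def release(self, channel, count=1):
        """Record that the receiver has freed count buffers."""
        self.credits[channel] += count
```

If the counters ever drift from the receiver's real buffer state, for example after a corrupted buffer release, send() can succeed when no buffer is actually free, and that is the overflow condition described above.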

End-Of-Chain Errors

End-Of-Chain (EOC) errors result when a packet moving through HyperTransport is either not claimed by, or does not reach, the intended recipient. Other devices which see the packet forward it, and eventually it reaches the device at the end of the chain, where the packet must be handled. Possible reasons for EOC errors include: an improper address in a request, an invalid UnitID in a response, a broken target device, or a target that has not been programmed properly with its UnitID or target base address range.

EOC errors are analogous to the master abort event in PCI. Unlike PCI, however, "misdirected" transactions must be handled by the EOC device rather than simply having the initiator of the transaction time out after a prescribed amount of time. This is important in HyperTransport because it is a series of point-to-point connections rather than a shared bus, and an initiator simply sends packets to the neighboring device and has no way of immediately "knowing" whether the ultimate recipient receives it. The EOC error handling mechanism helps with link management in two ways:

For posted requests and responses which inadvertently reach an EOC device, the EOC error bit and reporting mechanism may be used to let the system know a packet never reached its destination — information that otherwise would be unknown.
For non-posted requests which reach an EOC device in error, the error logging and reporting can also be used. In addition, the EOC device will act as a surrogate for the target and send back a Read or Target Done response to the requester (with error bits set). For read requests, all of the requested data is also sent back by the EOC device, although it is obviously invalid (all data values are driven to FFh). Sending back the responses (and data) allows all devices in the path back to the requester to deallocate internal buffer space and retire the outstanding transaction. The original requester examines the response, decodes the error bits, and takes whatever action is appropriate.
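
The two cases above can be summarized as a small decision function for an end-of-chain device. The packet dictionary layout, the kind strings, and the function name are hypothetical; only the behavior (log the error, answer non-posted requests with an error response, fill read data with FFh) comes from the text.

```python
def handle_at_end_of_chain(packet):
    """Decide what an EOC device does with a packet it cannot accept.

    packet is a dict with 'kind' in {'posted', 'response',
    'nonposted_read', 'nonposted_write'} plus 'unit_id' and 'src_tag';
    reads also carry 'length' in bytes.
    """
    result = {"log_eoc_error": True, "response": None}
    kind = packet["kind"]
    if kind in ("posted", "response"):
        return result                      # nothing to send back, only log
    if kind == "nonposted_read":
        result["response"] = {
            "type": "RdResponse",
            "error": True,
            "unit_id": packet["unit_id"],
            "src_tag": packet["src_tag"],
            "data": b"\xff" * packet["length"],   # invalid filler data
        }
    elif kind == "nonposted_write":
        result["response"] = {
            "type": "TgtDone",
            "error": True,
            "unit_id": packet["unit_id"],
            "src_tag": packet["src_tag"],
        }
    else:
        raise ValueError("unknown packet kind: %r" % (kind,))
    return result
```

Echoing the UnitID and source tag is what lets the response travel back along the chain and lets each device on the way retire its buffer for the transaction.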

How A Device Knows It Is At The End Of A Chain

Single-link peripherals (also known as End or Cave devices) are always end-of-chain devices. Any packets reaching these devices that they are not programmed to accept (by Command type, UnitID, or Address range) are considered lost. No software programming is required for these devices to carry out their EOC function other than setting up the error reporting mechanism to be used.

Chain Down Errors

If a device detects a Sync flood or an error that would cause a Sync flood, it sets the Chain Fail bit in its Error Handling register and waits for a bus reset. The action taken when the chain goes down depends on the device type:

Host interfaces track outstanding non-posted requests for devices below them. On chain down errors, they flush the state of all internal non-posted requests and return non-NXA error responses to the requesters for each one that is pending.
Slave devices have their internal states re-initialized when the RESET# occurs after a chain goes down; there is generally no need for a flush operation of non-posted requests by these devices. If a slave device were implemented that maintained its state through a HyperTransport RESET#, it would need to perform the non-posted request flush operation after the chain goes down as well.
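
The host-side flush can be pictured as walking the table of outstanding non-posted requests and completing each one with an error. The table layout, field names, and function name below are hypothetical; the sketch only illustrates that every pending request receives a non-NXA error response and that the table ends up empty.

```python
def flush_on_chain_down(outstanding):
    """Complete every pending non-posted request with an error response.

    outstanding maps src_tag -> {'requester': ..., 'expects_data': bool}.
    The table is emptied; the generated responses are returned in tag order.
    """
    responses = []
    for src_tag in sorted(outstanding):
        request = outstanding[src_tag]
        responses.append({
            "requester": request["requester"],
            "src_tag": src_tag,
            "type": "RdResponse" if request["expects_data"] else "TgtDone",
            "error": True,
            "nxa": False,      # a non-NXA error, as described in the text
        })
    outstanding.clear()
    return responses
```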

Response Errors

All non-posted requests that are issued require either a Read or Target Done response. The requester programs UnitID and source tag information into each request packet it issues so that when the response is returned it may be tagged with the same information and find its way back to the original requester. When a downstream response is detected, each device compares the UnitID to its own to see if it should claim the response; if so, it then checks the source tag to determine which of its outstanding transactions is being completed.

It is possible a response may return and be claimed by a requester (UnitID is OK), but not be recognized as being valid. Some of the reasons this might happen include:

A read response (RdResponse) is received by a device which carries the correct UnitID, but has an invalid source tag (SrcTag field). The recipient cannot associate the response with any of its outstanding transactions.
A read response (with data) is received with the correct UnitID and SrcTag fields, but the response type is incorrect (requester is expecting a Target Done response).
A Target Done response is received for a RdSized or Atomic RMW request.
A read response (with data) is received for a RdSized or Atomic RMW, but the (data) count field doesn't match what the requester originally asked for.
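
The four cases above map onto a short validation routine that a requester might run once it has claimed a response by UnitID. The dictionary layout, the request kind strings other than RdSized, and the returned error labels are hypothetical.

```python
def check_response(outstanding, response):
    """Return None if the response is valid, else a short error label.

    outstanding maps src_tag -> {'kind': 'RdSized' | 'AtomicRMW' | 'WrSized',
    'count': int}. response has 'src_tag', 'type' ('RdResponse' or 'TgtDone')
    and, for read responses, 'count'.
    """
    request = outstanding.get(response["src_tag"])
    if request is None:
        return "unknown_src_tag"              # no matching transaction
    expects_data = request["kind"] in ("RdSized", "AtomicRMW")
    if response["type"] == "RdResponse" and not expects_data:
        return "unexpected_read_response"     # Target Done was expected
    if response["type"] == "TgtDone" and expects_data:
        return "unexpected_target_done"       # data was expected
    if expects_data and response["count"] != request["count"]:
        return "count_mismatch"               # wrong amount of data returned
    return None
```

A device that checks for response errors would log any non-None result and report it using whichever mechanism has been programmed.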


Tuesday, June 26, 2007
