Pentium III's SSE implementation
Slot 1 Pentium III CPU mounted on a motherboard
Because Katmai was built on the same 0.25 µm process as the Pentium II "Deschutes", it had to implement SSE using as little silicon as possible. To that end, Intel implemented the 128-bit architecture by double-cycling the existing 64-bit data paths and by merging the SIMD-FP multiplier with the x87 scalar FPU multiplier into a single unit. To use the existing 64-bit data paths, Katmai issues each SIMD-FP instruction as two μops. To compensate partially for implementing only half of SSE's architectural width, Katmai implements the SIMD-FP adder as a separate unit on the second dispatch port. This organization allows one half of a SIMD multiply and one half of an independent SIMD add to be issued together, bringing peak throughput back to four floating-point operations per cycle, at least for code with an even distribution of multiplies and adds.
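The arithmetic behind the "four floating-point operations per cycle" figure can be sketched with a toy model. This is a back-of-the-envelope illustration, not a description of Katmai's real scheduler: it assumes all instructions are independent single-precision packed operations, that each μop covers two 32-bit lanes of a 64-bit data path, and that the multiplier port and the adder port each accept one μop per cycle. The function name and constants are invented for this example.

```python
from fractions import Fraction

# Assumptions of this toy model (not taken from Intel documentation):
LANES_PER_UOP = 2       # a 64-bit data path holds two 32-bit floats
UOPS_PER_SIMD_INSN = 2  # each 128-bit SIMD-FP instruction becomes two uops


def flops_per_cycle(multiplies: int, adds: int) -> Fraction:
    """Sustained FP operations per cycle for a block of independent
    packed multiplies and packed adds.

    The multiplier and the adder sit on different dispatch ports, so the
    two uop streams drain in parallel; the longer stream sets the time.
    """
    if multiplies < 0 or adds < 0 or multiplies + adds == 0:
        raise ValueError("need a non-empty, non-negative instruction mix")

    mul_cycles = multiplies * UOPS_PER_SIMD_INSN  # one multiply uop per cycle
    add_cycles = adds * UOPS_PER_SIMD_INSN        # one add uop per cycle
    cycles = max(mul_cycles, add_cycles)

    total_flops = (multiplies + adds) * UOPS_PER_SIMD_INSN * LANES_PER_UOP
    return Fraction(total_flops, cycles)


if __name__ == "__main__":
    print(flops_per_cycle(8, 8))  # balanced mix: both ports stay busy
    print(flops_per_cycle(8, 0))  # multiplies only: the adder port idles
    print(flops_per_cycle(6, 2))  # skewed mix: somewhere in between
```

Under these assumptions a balanced mix reaches 4 operations per cycle, while a stream of only multiplies (or only adds) drops to 2, which is why the peak figure holds only for code with an even distribution of the two.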
The problem was that Katmai's hardware implementation contradicted the parallelism model implied by the SSE instruction set. Programmers faced a code-scheduling dilemma: should SSE code be tuned for Katmai's limited execution resources, or for a future processor with more resources? Katmai-specific SSE optimizations yielded the best possible performance from the Pentium III family but were suboptimal for later Intel processors, such as the Pentium 4 and Core.