Roger Ngo's Website

Personal thoughts about life and tech written down.

Analysis of the Pentium Pro

I recently came into possession of a few brand new sealed boxes of Pentium Pro processors. They are all 180 MHz with 256 KB L2 cache. Of course, with my obsession with retro-computing hardware, I just had to grab these, not just for collecting, but also for studying and to scratch my PC building itch.

While waiting for these to be shipped, I decided to do an architectural and historical study of these CPUs. Why not kick back with a nice cup of coffee and sandwich and read through this to learn about the Pentium Pro?

The Pentium Pro Today and Yesterday

These days, the Pentium Pro is a sought-after item not for collecting or vintage retro computing, but rather for gold scrapping. My personal opinion of this is quite negative. It saddens me to see these beautiful pieces of silicon destroyed. The Pentium Pro holds real significance in computing history: the CPU contained early versions of many of the processor technologies we see today.

Released in November 1995, the Pentium Pro was intended to be the chip that would supersede the Pentium. With the Pentium Pro, Intel introduced the P6 microarchitecture. Intel had intended for the P6 to overtake the P5, on which the original Pentium and Pentium MMX were based, in the consumer market. The Pentium Pro’s P6 microarchitecture would eventually become the base of the Pentium II and Pentium III as direct successors, along with the Pentium M and Core microarchitectures as indirect successors.

The Pentium Pro was also introduced as Intel’s first attempt at a high-end workstation and server CPU. Thus, the Pentium Pro was ideal for those looking to build servers and high-end workstations, for graphic artists, and in general for consumers with lots and lots of money and no regard for a budget when buying a computer.

Specifications

The Pentium Pro was manufactured from November 1, 1995 and was sold up until 1998, when the Pentium II, its direct successor, was already selling in the market. The CPU came in speeds from 150 MHz up to 200 MHz, with L2 cache sizes of 256 KB to 1024 KB. Depending on the model, the Pentium Pro was manufactured on either the 0.35um or the 0.50um process. Regardless, each chip had 5.5 million transistors on the CPU die alone.

Pentium Pro Specifications

Date in Production: November 1995 to 1998
Manufacturer: Intel
CPU Clock Speed: 150 MHz to 200 MHz
Front Side Bus Speed: 60 MHz / 66 MHz
Manufacturing Process: 0.35um or 0.50um
Architecture: x86
Microarchitecture: P6
Core Count: 1 (single core)
Socket Platform: Socket 8
Transistor Count: 5.5 million
SMP: Yes

At the time, Intel touted the Pentium Pro as a solution for scaling up the clock frequencies of its chips. Intel's original intention was for the Pentium Pro to succeed the Pentium at around the 166 MHz clock speed. Due to the sheer cost of the Pentium Pro in the consumer market and the relatively low yield rates, which I will discuss later, the Pentium Pro did not overtake the Pentium in the market as planned. Instead, Intel eventually released faster Pentiums along with the Pentium MMX to stay competitive within the consumer market.

Architecture - Historical Context

Before I begin the discussion of the architecture of the Pentium Pro, I would like to define some terms I will most likely end up using interchangeably. I want to get this out first so that there will be less confusion as you read through this article.

The Pentium Pro and P6 microarchitecture will be used interchangeably. The Pentium and P5 microarchitecture will also be used interchangeably. The Pentium MMX and all other non-Pro Pentium variants are based on the P5 microarchitecture, and will simply be referred to as P5 to make the context easier to understand.

Okay, now that we have gotten that out of the way, let's begin...

Intel's initial goal with the Pentium Pro was to be able to work with a few constraints with regards to their manufacturing process at the time. The Pentium was planned to eventually shrink into the 0.35um process in 1995. Intel had wanted to maintain the same process with their Pentium Pro design. Thus, the challenge was to increase performance through the design of a better chip microarchitecture. This is how the P6 came to be.

At the time of development of the Pentium Pro, the 100 MHz Pentium was the fastest variant available on the market, and Intel wanted to use the performance of that chip as a baseline. While the Pentium had a 5-stage pipeline in its superscalar architecture, the P6 microarchitecture took it to 12 stages and decoupled the phases from each other. With less work to do in each stage, each pipeline stage could run about 33% faster. In theory this meant that, under the same manufacturing process, producing a 133 MHz Pentium Pro was no more difficult than producing a 100 MHz Pentium. With the Pentium's superscalar architecture, where there were two execution units within the microarchitecture, the peak theoretical instructions per clock was two. Intel could not in practice produce a meaningfully improved chip using the same fetch-decode-execute approach. Thus, dynamic execution was born.
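
The clock-scaling claim above is simple arithmetic, and a tiny sketch makes it concrete. The 33% figure and the 100 MHz baseline come from the text; the function name is just something I made up for illustration.

```python
# Back-of-the-envelope only: both constants are taken from the article text.
PENTIUM_BASELINE_MHZ = 100   # fastest Pentium while the P6 was in development
STAGE_SPEEDUP = 1.33         # each P6 pipeline stage assumed ~33% faster

def scaled_clock_mhz(baseline_mhz: float, stage_speedup: float) -> float:
    """Clock reachable on the same process if every stage is this much faster."""
    return baseline_mhz * stage_speedup

p6_clock = scaled_clock_mhz(PENTIUM_BASELINE_MHZ, STAGE_SPEEDUP)
print(round(p6_clock))  # 133
```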

Intel had found that in general applications, most CPU cores were not utilized to their full potential. There was a lot of idle time measured in between operations. Traditionally, things such as cache misses and memory accesses stall a CPU while it waits for the bus interface to deliver the data needed by the application. As the speed of CPUs increased, Intel discovered that the CPU was operating faster than the bus interface could retrieve data. Thus the linear fetch-decode-execute cycle became impractical. Such a linear cycle meant that a CPU could operate at fast speeds but would eventually sit idle waiting for a slow memory access. This is similar to stop-and-go, bumper-to-bumper traffic on the highway. The CPU is fast and issues requests quickly, but the path to memory cannot provide enough throughput to satisfy them.

To solve such a problem, there were several "easy" solutions to consider. Intel could have kept the CPU busier by making the external components that interfaced with the CPU faster and larger. For example, with a faster chipset and bus interface, the time data takes to travel across the system would decrease, which in turn reduces the time the CPU spends stalled. A solution such as this would have resulted in a higher cost for the overall platform, and Intel had no intention of taking this design choice to market.

The second obvious approach to reducing stalls was to increase the L2 cache size in order to decrease the number of cache misses the CPU would incur. From a cost perspective, this would also have been impractical since SRAM, the memory type used in L2 caches, was quite expensive at the time.

Therefore, instead of adding more resources and dependencies to the platform, Intel decided to build out-of-order execution into the P6. This was dubbed “dynamic execution”. In summary, with dynamic execution, instructions are examined within a window of typically 20-30 instructions and analyzed for interdependencies before being executed. The CPU considers the branch outcomes and the effects each instruction will have on the state of the machine, and rearranges the instructions accordingly. Since the CPU is relatively fast, it can quickly work through the data flow among these 20-30 instructions. Essentially the CPU is trading stall time for a little added latency at the beginning of the fetch-decode-execute cycle. This latency was minimized with the multiple execution units that Intel had designed into the P6.

Architecture - Comparing the Pentium (P5) and Pentium Pro (P6)

As noted and deduced, the Pentium Pro is considered to be a different microarchitecture than the Pentium. While the Pentium's P5 microarchitecture was considered a 5th generation x86 CPU, the Pentium Pro was considered to be 6th generation. Intel dubbed the microarchitecture for the Pentium Pro the P6. This architecture is synonymous with i686.

The Pentium Pro's P6 microarchitecture brought considerable improvements over the Pentium’s P5 architecture.

Some of the basic improvements were:

  1. A multi-staged pipeline with faster pipeline stages decoupled from each other, with improvement of up to 33% compared to the Pentium.
  2. A larger address bus width. The P6 microarchitecture had an address bus width of 36 bits as opposed to the P5's 32 bits. This allowed the Pentium Pro to address up to 64 GB of RAM as opposed to the 4 GB limit of the Pentium.
  3. An 8 KB instruction cache from which 16 bytes are fetched every clock cycle.
  4. Exceptional 32 bit code execution performance for its time. The Pentium Pro was usually about 25-35% faster than the Pentiums. Paired with a 32 bit operating system such as Windows NT, Linux, UNIX or OS/2 at the time, the Pentium Pro could prove to be a great investment for applications on those platforms.
  5. An inherently highly scalable architecture. The P6 microarchitecture started with the 150 MHz Pentium Pro in 1995 and scaled all the way up to the 1.4 GHz Pentium III Tualatin in 2002. The architecture would continue to evolve, and Intel would release derivatives in the Pentium M, Core and Core 2 microarchitectures.
  6. Better overall branch prediction.
  7. Denser CPU die with 5.5 million transistors as opposed to the 3.1 million transistors found on your average Pentium. This meant more specialized logic packed into the die.
  8. Advantage for multiprocessor configurations. The Pentium Pro was optimized to be scalable at dual and quad processor configurations.
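
The 4 GB and 64 GB figures in item 2 follow directly from the address bus widths. Here is a quick check, with a helper name of my own choosing:

```python
GIB = 2 ** 30  # bytes in one binary gigabyte

def addressable_bytes(address_bits: int) -> int:
    """Each extra address line doubles the number of addressable bytes."""
    return 2 ** address_bits

print(addressable_bytes(32) // GIB)  # 4  (P5, 32-bit address bus)
print(addressable_bytes(36) // GIB)  # 64 (P6, 36-bit address bus)
```

Four extra address lines give 2^4 = 16 times the address space, which is where the jump from 4 GB to 64 GB comes from.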

Though the things above brought performance increases, there were two notable technology shifts that made the Pentium Pro stand out against other processors of its time: out-of-order instruction execution, and a non-blocking on-package L2 cache running at full processor speed, with cache accesses carried over a dedicated back-side bus.

Architecture - Dynamic Execution and Superpipelining

Intel marketed the superscalar architecture heavily when the Pentium was released. Compared to the 486, the Pentium had two execution units which could execute up to two instructions per clock cycle. For the Pentium Pro, the headline feature was dynamic execution. While dynamic execution includes out-of-order execution, it is more than that: it is a combination of branch prediction, data flow analysis and speculative execution.

While the P5 used a 5-stage pipeline, the P6 brought forth a longer pipeline consisting of 12 modular stages. Intel dubbed this “superpipelining”, with each stage of the pipeline being almost 33% faster than its P5 counterpart. With more stages in the pipeline, Intel was able to clock the CPU much faster than the Pentiums of its time to achieve higher throughput per unit of time.

Conceptually, the basic disadvantage of pipelining becomes more apparent with a larger number of stages: stalling. Instead of adding more dependencies to the platform, Intel sought to reduce the stall time of the CPU by improving the way instructions are decoded and executed. The result of these efforts was an implementation of out-of-order and speculative execution. To understand "dynamic execution" we first need to understand how a traditional CPU prior to the P6 microarchitecture executed instructions.

In a typical CPU, an instruction goes through a linear fetch-execute cycle. Instructions are first fetched from program memory at the address pointed to by the program counter. These instructions are fetched and executed one at a time regardless of whether or not the CPU is pipelined. Instructions are fetched, decoded and executed sequentially, with no instruction executing before another in a logical sense.

With the P5, there were two execution units, creating a superscalar architecture that could process up to 2 instructions per clock. Superscalar execution performed well on code that did not have to wait for data in the flow of execution. Ideally, if the instructions did not rely on memory accesses that resulted in a cache miss, a superscalar CPU could peak at two instructions per cycle, with one instruction flowing through each execution unit. This form of execution allowed all parts of the CPU to work as much as possible. It was not entirely perfect, however, as the CPU wasted cycles whenever a cache miss occurred: it would stall for a number of cycles while waiting for the memory access to complete.

With the P6 and its dynamic execution, instructions are rearranged to prevent stalling as much as possible. Conceptually, CPU stalls still happen when memory accesses are needed, but their impact is reduced because the instructions are arranged to minimize the stop-and-go effect. This keeps the CPU performing useful work as much as possible. The Pentium Pro also had a fully pipelined cache which supported multiple outstanding cache misses at the same time. This increased performance by reducing the number of consecutive stalls within the pipeline. This shows how aggressive the P6 microarchitecture is when it comes to getting work done.

Architecture - Fetch and Decode

In the Pentium Pro’s fetch-execute cycle, the CPU leverages a component within the chip called the instruction pool. Instructions are broken down into micro-operations and placed in this pool, and the execution phase can then tap into the pool and execute the micro-operations out of order to avoid idle time. While the Pentium's superscalar design had two execution units, the Pentium Pro is organized around three independent units. These three units all communicate with the instruction pool during the execution of an instruction: the fetch/decode unit (FDU), the dispatch/execute unit (DEU) and the retirement unit (RU).

A reimplemented fetch and decode stage was added to the Pentium Pro so that the CPU could translate classical x86 instructions into 118-bit micro-instructions in a format reminiscent of RISC. Each micro-instruction is triadic and always consists of an operation, two sources and one destination. A single IA (Intel Architecture) instruction will typically be converted into anywhere from 1 to 4 micro-operations, depending on the complexity of the instruction itself.
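
To make the triadic format a little more concrete, here is a toy sketch. It only models the shape described above (one operation, two sources, one destination). The field names, the `tmp0` temporary and the two hand-picked translations are my own illustration and are not Intel's actual 118-bit encoding or decode tables.

```python
from typing import List, NamedTuple

class MicroOp(NamedTuple):
    op: str     # operation
    src1: str   # first source
    src2: str   # second source ("none" when the operation needs only one)
    dest: str   # destination

def decode(instruction: str) -> List[MicroOp]:
    """Toy decoder: hypothetical translations for two instructions only."""
    table = {
        # Register-to-register add: a single micro-operation is enough.
        "add eax, ebx": [MicroOp("add", "eax", "ebx", "eax")],
        # Add with a memory operand: load first, then add.
        "add eax, [ebx]": [
            MicroOp("load", "ebx", "none", "tmp0"),
            MicroOp("add", "eax", "tmp0", "eax"),
        ],
    }
    return table[instruction]

for uop in decode("add eax, [ebx]"):
    print(uop)
```

The point is only that a more complex x86 instruction fans out into several fixed-shape micro-operations, while a simple one maps to a single micro-operation.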

The fetch/decode unit reads from an instruction cache called the ICache, which is indexed with the aid of the "next instruction pointer". The unit decodes instructions in program order and has one-way access to the instruction pool: once it has finished translating an instruction, it places the resulting micro-operations into the pool. This front-end translation of instructions into micro-operations was the beginning of a design approach that x86 CPUs still carry today.

The reason Intel favored this approach was that the IA instruction set was growing very large and x86 instructions had varying lengths. Extra internal hardware complexity was just not seen as a scalable answer to the ever-growing instruction set. In order to have a consistent basis for a scalable future design, Intel decided to accept the minor latency of translating CISC instructions into RISC-like micro-instructions in exchange for consistent instruction widths.

The instruction pool in which the micro-operations are placed is also known as the reorder buffer, or ROB. These micro-operations are eventually picked from the ROB and passed to the reservation station (RS), which sends them on to be executed.

For the execution stage, the Pentium Pro makes use of this scheduling to increase throughput. The Pentium Pro further divides execution into separate dispatch/execute and retire phases. Dividing the work this way allows instructions to be started in any order but always be committed in the original logical program order.

Architecture - Execution and Retirement Process

The dispatch/execute unit (DEU) communicates with the instruction pool bidirectionally. This is also the unit that performs out-of-order execution of the instructions. Because execution is out of order, the results the DEU generates are purely speculative and do not yet reflect the final state of the machine.

The dispatch unit will select micro-operations within the instruction pool and send them to be executed. The reservation station within the DEU receives each micro-operation and communicates with several execution units that will process it. The reservation station has five ports for this communication (numbered 0 through 4, with ports 3 and 4 working as a pair for stores). The type of micro-operation determines which port it flows through to reach the appropriate execution unit. We can therefore think of the reservation station as a router that dispatches each micro-operation to the appropriate execution unit.

In summary, at a more granular level, the Pentium Pro maintains two states: (1) the speculative state, where instructions are executed out of order but are not committed, and (2) the committed state, where the executed instructions change the machine state in the logical order in which they are required to.

The reservation station's ports lead to the following units:

  • Port 0: Port 0 is shared between a floating point execution unit and an integer execution unit.
  • Port 1: Port 1 routes the micro-operation to the jump execution unit and the second integer execution unit.
  • Port 2: Port 2 routes the micro-operation to the address generation unit if it is a load operation in memory.
  • Port 3/4: Ports 3 and 4 are shared routes that lead the micro-operation to the address generation unit if it is a store operation to memory. There are two units connected to these ports: one for generating the address and the other for the data.
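
The "reservation station as a router" idea can be sketched as a small lookup table. This is a simplification of the port list above, not a model of the real scheduler: the kind names, the `route` function and the "first free port wins" rule are all assumptions of mine.

```python
# Which ports can accept each kind of micro-operation, per the list above.
PORTS_BY_KIND = {
    "fp": ["0"],          # floating point unit sits on port 0
    "int": ["0", "1"],    # an integer unit exists on both port 0 and port 1
    "jump": ["1"],        # jump execution unit sits on port 1
    "load": ["2"],        # load address generation
    "store": ["3/4"],     # store address and store data, treated as one pair
}

def route(kind, busy=frozenset()):
    """Return the first free port for this kind, or None if it must wait."""
    for port in PORTS_BY_KIND[kind]:
        if port not in busy:
            return port
    return None  # stays in the reservation station for now

print(route("int"))               # 0
print(route("int", busy={"0"}))   # 1
print(route("fp", busy={"0"}))    # None
```

Note how an integer micro-operation has two possible routes while a floating point one has to wait if port 0 is taken.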

Once the instructions have been executed, they are handed to the retirement unit, as the executed instructions are still in a speculative state. The retirement unit is what commits them. Essentially, the results of the micro-operations are reassembled per original x86 instruction and put back in the correct logical order. Retirement then moves the machine from the speculative state to the committed state, where the changes to registers become visible to software.
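
Here is a minimal sketch of the "finish in any order, commit in program order" rule. The class and method names are hypothetical, and a real reorder buffer tracks far more than a done flag.

```python
from collections import deque

class ReorderBuffer:
    """Toy model: entries sit in program order, oldest on the left."""

    def __init__(self):
        self.entries = deque()

    def allocate(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Commit finished entries, but only from the oldest end."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer()
load, add, mul = (rob.allocate(n) for n in ("load", "add", "mul"))

mul["done"] = True    # the youngest micro-operation finishes first
print(rob.retire())   # [] because the older load is still pending
load["done"] = True
print(rob.retire())   # ['load']
add["done"] = True
print(rob.retire())   # ['add', 'mul']
```

Even though `mul` finished first, nothing it did becomes visible until everything older than it has retired.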

Architecture - Example

  • The following slide pictures were taken from: https://people.cs.clemson.edu/~mark/330/colwell/p6des.pdf
  • The Intel document gives an example of why such a design is beneficial compared with the execution style found in Pentium CPUs.

    In a Pentium processor (P5), the superscalar architecture allows two instructions to be executed at once. With instructions that do not demand any memory accesses, the processor can achieve its maximum throughput of 2 instructions at a time. This is, however, an example of a controlled, ideal world. A lot of applications demand communication between the CPU and memory. Cache accesses will eventually lead to cache misses, which will then lead to main memory accesses. When this happens, the pipeline stalls and the CPU is left sitting idle. Over time, the number of instructions completed per clock drops, since more clock cycles pass while fewer instructions complete.

    The Pentium Pro avoids this by looking ahead into the instruction pool for any subsequent instructions that can do useful work while waiting for the bus transfer to complete. Say the next instruction depends on the memory instruction finishing, but the CPU finds that the two instructions after that are executable. The CPU will then execute those 2 instructions out-of-order. The CPU will not commit the results of these instructions to the final machine state immediately, since it still must maintain the original program execution order. Instead, the CPU hands these executed instructions to the retire unit, which keeps them in order. The instructions are then removed in logical order and committed to machine state.
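
The scenario above (a slow load, one dependent instruction, two independent ones) can be played through with a small timing model. Everything in it is an assumption for illustration: the latencies are made up, the in-order case is modeled as fully blocking, and the out-of-order case ignores limits on execution units and window size.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    latency: int       # cycles until the result is ready (made-up numbers)
    deps: tuple = ()   # indices of earlier instructions this one waits on

# Hypothetical program: a load that misses the cache, one instruction that
# needs the loaded value, and two instructions that do not.
program = [
    Instr("load r1, [mem]", latency=10),
    Instr("add r2, r1, 1", latency=1, deps=(0,)),
    Instr("mul r3, r4, r5", latency=1),
    Instr("sub r6, r3, r7", latency=1, deps=(2,)),
]

def in_order_cycles(program):
    """Blocking in-order model: each instruction waits for the previous one."""
    return sum(ins.latency for ins in program)

def out_of_order_finish_times(program):
    """Instruction i can start at cycle i (one fetched per cycle) or when its
    dependencies have finished, whichever is later."""
    finish = []
    for i, ins in enumerate(program):
        ready = max((finish[d] for d in ins.deps), default=0)
        finish.append(max(i, ready) + ins.latency)
    return finish

finish = out_of_order_finish_times(program)
completion_order = sorted(range(len(program)), key=lambda i: finish[i])

print(in_order_cycles(program))  # 13
print(max(finish))               # 11
print(completion_order)          # [2, 3, 0, 1]
```

The two independent instructions complete while the load is still outstanding, yet retirement would still happen in the order 0, 1, 2, 3.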

    The Pentium Pro will use a FIFO scheduling algorithm that favors executing micro-operations that are back-to-back. By nature, many micro-operations deal with branches. Though the Pentium Pro has a nifty branch predictor, there are times when it cannot predict a branch correctly.

    When a branch is predicted, the branch micro-operation is tagged with its fall-through address and the destination that was predicted for it. Then when the branch executes, the real result is compared against the predicted result, and if they match, the branch retires. The CPU can then be confident that the speculative work following the branch is correct.

    Now, if they do not coincide, the jump execution unit, or JEU, will change the status of all the micro-operations behind the branch and will remove them from the instruction pool.
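
A rough sketch of that check, with the instruction pool reduced to a plain list in program order. The function name, the example micro-operations and the addresses are hypothetical.

```python
def resolve_branch(pool, branch_index, predicted_target, actual_target):
    """Return the pool contents after the branch at branch_index executes."""
    if predicted_target == actual_target:
        return pool                    # prediction held, speculative work stays
    return pool[: branch_index + 1]    # drop everything younger than the branch

pool = ["cmp", "jne", "add", "mul"]    # "add" and "mul" were fetched speculatively

print(resolve_branch(pool, 1, predicted_target=0x40, actual_target=0x40))
# ['cmp', 'jne', 'add', 'mul']
print(resolve_branch(pool, 1, predicted_target=0x40, actual_target=0x80))
# ['cmp', 'jne']
```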

    Cache Design

    Cache is ultimately one of the most important things when it comes to memory. CPU cache is the first level of memory in which the CPU will access outside of its registers. Therefore, having a fast cache will bring performance improvement.

    Unique and innovative at the time was the on-package L2 cache die, connected by a full-speed bus, that the Pentium Pro was bestowed with. At the time, most CPU L2 caches were upgradeable components placed on the motherboard. Communication between that on-board L2 cache and the CPU was limited by the bandwidth of the external bus interface. By comparison, the Pentium Pro's on-package L2 cache was extremely fast. The L2 cache size varied from 256 KB to 1024 KB depending on the SKU of the chip.

    Aside from the full-speed on-package L2 cache, a new addition was the back-side bus, part of what Intel called a dual independent bus. A dual independent bus meant that the CPU could access both main memory and cache concurrently. This removed the huge bottleneck of cache accesses and main memory accesses being mutually exclusive. The cache was also non-blocking, meaning that the CPU could have more than one cache request outstanding at a time. This reduced the cost of cache miss penalties. As a result the L2 cache was immensely faster than the motherboard-based caches of older processors.

    The integrated cache that the Pentium Pro possessed also scaled well in an SMP setting. The I/O performance of the Pentium Pro would skyrocket when in dual or quad processor configuration.

    Platform

    The Pentium Pro was designed for the Socket 8 platform. The Socket 8 platform was unique due to the rectangular shape of the socket. The socket was much larger than Socket 7, which was targeted at the Pentium at the time. The size of the socket accounted for a larger CPU package, mainly due to the addition of the on-package cache. This socket provided 387 pins, and only two types of CPUs were supported on this platform: the Pentium Pro and the Pentium II OverDrive. The main criticism of this socket platform was that there was no real, affordable upgrade path aside from the Pentium II OverDrive.

    Upon initial design of the Pentium Pro, Intel already had plans of an OverDrive processor in mind for the socket as a potential upgrade path. A voltage regulator module, or VRM would be needed to regulate the voltage required for this CPU.

    This OverDrive processor would eventually be released as the Pentium II OverDrive in 300 MHz and 333 MHz variants. Particularly interesting was the OverDrive processor's fixed multiplier of 5.0. There was only a single SKU of the Pentium II OverDrive, and its clock speed depended on the FSB speed set through the jumper on the motherboard. If the motherboard’s jumper was set to 60 MHz, from having a 150 MHz or 180 MHz Pentium Pro, then the OverDrive CPU would clock to 300 MHz. If the motherboard was jumpered to 66 MHz, then the OverDrive would clock at 333 MHz upon upgrade. This meant that it was quite easy to achieve the full 333 MHz clock speed by simply moving the jumper to 66 MHz if the computer had originally been at 60 MHz.
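
The two OverDrive speeds fall straight out of the fixed multiplier. One assumption in this sketch: I treat the "66 MHz" bus as its nominal 66.67 MHz (200/3), which is what makes 5.0 times the bus speed land on 333 rather than 330.

```python
OVERDRIVE_MULTIPLIER = 5.0     # fixed, per the text above
SLOW_BUS_MHZ = 60.0
FAST_BUS_MHZ = 200 / 3         # assumption: nominal value of the "66 MHz" bus

def core_clock_mhz(fsb_mhz, multiplier=OVERDRIVE_MULTIPLIER):
    """Core clock is simply front side bus speed times the multiplier."""
    return fsb_mhz * multiplier

print(round(core_clock_mhz(SLOW_BUS_MHZ)))  # 300
print(round(core_clock_mhz(FAST_BUS_MHZ)))  # 333
```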

    The Pentium II OverDrive was quite different in that it addressed most of the criticized weaknesses of the Pentium Pro, and was actually faster than the Pentium II itself due to the full-speed cache that Intel had maintained for the Socket 8 platform. The Pentium II OverDrive would be a sort of direct ancestor of the Intel Xeon chip that was released shortly afterward.

    Since the Pentium Pro was targeted at clients that needed to have lots of work processed, the RAM support for Socket 8 motherboards was quite large. It was not uncommon to see 1024 MB (1 GB) of RAM supported at that time. This design was kept in mind due to the Pentium Pro being meant for true 32 bit operating systems.

    The Socket 8 platform may not have been as great as the Pentium Pro chip itself. It did not have full support for SDRAM and AGP. At a time when multimedia applications were starting to become popular, the need for higher bandwidth memory and peripheral access became very important. This left the platform obsolete for that area of computing.

    Weaknesses

    Despite all the advantages in which the Pentium Pro brought to the table, the Pentium Pro was not without its flaws.

    Manufacturing the CPU was a challenge for Intel. The on-package cache arrangement at the time was unique. The processor and cache were placed on separate dies and were connected by a full speed data path (hence giving it full speed bus access).

    This meant that the two dies, the CPU and cache, had to be bonded together early in the production process before any sort of testing was possible. The consequence of this meant that any minor flaw in either die forced Intel to discard the entire chip. This caused low production yields and a high cost due to a lot of wasted dies. The most expensive chip to produce for the Pentium Pro was the 200 MHz variant with 1024 KB L2 cache. Instead of the typical two die bonding of just the CPU die and the cache die, the 200 MHz 1024 KB L2 cache variant had to be bonded with three dies, a CPU die and two 512 KB cache dies.

    Though performance of 32 bit code surpassed the Pentium, 16 bit code did not fare nearly as well. The average improvement in 16 bit code execution was about 5%, with sometimes no improvement or, at worst, a regression compared to the Pentium. This was a problem for some users since at the time of release, 16 bit code was still very dominant. Most computing platforms still ran MS-DOS, Windows 3.1x and Windows 95. The out-of-order execution also slowed down 16 bit code, since such code tended to cause more stalls in the pipeline.

    As PC multimedia and gaming became more and more popular, Intel released the Pentium MMX in response to the demand for a higher performing processor for multimedia applications. From a very high level, the MMX instruction extensions allow a single instruction to perform the same operation on multiple pieces of data at once. This gave multimedia applications a huge performance boost when executed on a Pentium MMX. Unfortunately the Pentium Pro was released before the Pentium MMX and lacked these instructions, which meant that MMX-aware multimedia applications would perform slower on it than on a similarly clocked Pentium MMX.
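
To give a feel for what "one instruction, multiple operations" means, here is a plain Python imitation of a packed byte add, in the spirit of MMX treating a 64-bit register as eight independent 8-bit lanes. This is only a software sketch with a function name of my own; on real hardware the eight additions happen in a single instruction.

```python
def packed_add_bytes(a: int, b: int) -> int:
    """Add eight 8-bit lanes of two 64-bit words, wrapping each lane at 256."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift   # carries never leak into the next lane
    return result

print(hex(packed_add_bytes(0x01020304050607FF, 0x0101010101010101)))
# 0x203040506070800
```

Each byte is incremented on its own, and the lowest lane wraps from 0xFF to 0x00 without disturbing its neighbor.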

    Just like the Pentium, the Pentium Pro had a bug that was discovered in the FPU after it was released. It was dubbed the "Pentium Pro FPU bug". Intel called it the “flag erratum”. The bug would occur under certain conditions during a floating point to integer conversion, when the floating point number would not fit into the smaller integer format. In those cases the FPU could generate inconsistent results. Intel claimed that this bug was minor and that there were very few, if any, applications that were truly impacted.

    Consumer and Enthusiast Reception

    The Pentium Pro was released in a variety of different SKUs.

    • 150 MHz (60 MHz bus)
    • 166 MHz (66 MHz bus)
    • 180 MHz (60 MHz bus)
    • 200 MHz (66 MHz bus)

    Though overclocking was not yet as well known then as it is today, it was possible to overclock the CPUs through jumper settings on the motherboard. The 180 MHz version was often able to run at 200 MHz, simply by setting the bus speed to 66 MHz from 60 MHz while keeping the multiplier constant. The 200 MHz variant was often able to run at 233 MHz, although since it already used the 66 MHz bus, that meant raising the multiplier instead (3.5 x 66 MHz).
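
The bus-jumper overclock is easy to work through from the SKU list above. Two assumptions in this sketch: the "66 MHz" bus is taken at its nominal 66.67 MHz, and multipliers are assumed to come in steps of 0.5.

```python
FAST_BUS_MHZ = 200 / 3   # assumption: nominal value of the "66 MHz" bus

# Rated core clock -> stock bus speed, from the SKU list above.
STOCK_BUS_MHZ = {150: 60.0, 166: FAST_BUS_MHZ, 180: 60.0, 200: FAST_BUS_MHZ}

def multiplier(rated_mhz):
    """Recover the multiplier, assuming it comes in 0.5 steps."""
    return round(rated_mhz / STOCK_BUS_MHZ[rated_mhz] * 2) / 2

def clock_on_fast_bus(rated_mhz):
    """Core clock after moving the bus jumper to 66 MHz, same multiplier."""
    return multiplier(rated_mhz) * FAST_BUS_MHZ

print(multiplier(180), round(clock_on_fast_bus(180)))  # 3.0 200
print(multiplier(200), round(clock_on_fast_bus(200)))  # 3.0 200
```

In this simple model a part that already runs on the 66 MHz bus gains nothing from the bus jumper alone, so any further gain has to come from the multiplier.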

    The Pentium Pro was quite an expensive chip when it was first released. The Pentium Pro retailed on market with an MSRP of $974 to $1989. Thus, the chip was quite costly in 1995, and was not geared towards the average consumer. For consumers with a big wallet however, there was nothing faster than a Pentium Pro for x86 machines. With 3D DOS gaming being the most popular at the time, the chip was fast enough to run pretty much about anything compared to the 486 and the Pentium competitors in that era.

    The Pentium Pro was also popular amongst engineers, CAD users and graphic artists. The high speed cache that the Pro possessed proved to be beneficial for applications that manipulated lots of data on screen.

    The professional and server market was the primary target for the Pentium Pro. Since these processors could run in a multiprocessor configuration, they were very popular in dual and quad CPU configurations as servers. It also was beneficial in that the Socket 8 motherboards which housed these CPUs usually came with a lot of memory slots for large amounts of RAM. 32 bit operating systems such as Windows NT, Unix, Linux and OS/2 were also much more performant on the Pentium Pro than on the Pentium due to the Pro’s affinity towards 32 bit code.

    HPC was another area where the Pentium Pro found use. The ASCI Red supercomputer originally used 200 MHz Pentium Pros, with its nodes totaling up to 9298 processors. The computer was the first to achieve 1 TFLOPS in the LINPACK benchmark. Pentium II OverDrive CPUs as a drop-in upgrade gave it a longer life and eventually let it achieve 2 TFLOPS.

    In retrospect, the Pentium Pro was a highly successful microprocessor from a design and engineering perspective. When looking at sales of the processor as a means of measuring success, it did not sell as successfully as the Pentium. The simple fact was that the Pentium Pro was too expensive of a platform for the average user to appreciate.

    The Pentium Pro Today

    Although a direct, modern descendant of the Pentium Pro is not on the market today, we most recently saw its design indirectly within the Core 2 microarchitecture, which arrived in 2006 and was the mainstream CPU until about late 2009.

    Predecessors of the Core 2 such as the Core and Pentium M were also indirect descendants of the P6 microarchitecture. The most recent direct successor to the Pentium Pro which used the P6 microarchitecture was the Pentium III. The Pentium III was also a rather interesting chip for its time. Released in 1999 and sold until about 2003, it was active at a time when clock speeds were the be-all and end-all of computing performance.

    The Tualatin iteration of the Pentium III was the most interesting of the iterations. At a time when Intel had already released the Pentium 4, the intention was that the Pentium III would die quickly in favor of the higher clocked Pentium 4.

    To Intel's surprise, the Tualatin did perform favorably against the early Pentium 4 chips clock-for-clock. It was not at all strange to see a Tualatin at 1.4 GHz keep up with a similarly clocked Pentium 4. This shows the power and scalability of the P6.

    As the Pentium 4's clock speed surpassed 2 GHz and onwards to 3 GHz, the Pentium 4 had far surpassed the Pentium III in performance due to the ability to use DDR-RAM for higher bandwidth memory transfers and greater clock speeds for CPU operations. As the clock speed of the Pentium 4 pushed further and with the introduction of the Pentium 4 Prescott core, Intel had issues with thermal output. Consequently, Intel went back to the basic design of the P6 with the Tualatin as a base to create the Pentium M, which ran much cooler.

    The basic designs of the Pentium Pro live on today. Out-of-order execution, multiple execution units, and modern cache design principles have helped iteratively improve processor performance generation after generation.

    References

    1. https://www.pctechguide.com/pentium-cpus/pentium-pro
    2. http://www.anandtech.com/show/250/2
    3. http://www.pcguide.com/ref/cpu/fam/g6PPro-c.html
    4. https://en.wikipedia.org/wiki/P6_(microarchitecture)
    5. http://www.dexsilicium.com/Intel_PentiumPro.pdf
    6. https://people.cs.clemson.edu/~mark/330/p6.html
    7. https://en.wikipedia.org/wiki/P5_(microarchitecture)
    8. http://www.redhill.net.au/c/c-8.html
    9. https://en.wikipedia.org/wiki/Transistor_count
    10. https://en.wikipedia.org/wiki/Superscalar_processor
    11. https://en.wikipedia.org/wiki/Pentium_III
    12. https://people.apache.org/~xli/presentations/history_Intel_CPU.pdf
    13. http://archive.arstechnica.com/cpu/ppro_editorial.html
    14. https://people.cs.clemson.edu/~mark/330/colwell/p6tour.pdf
    15. https://people.cs.clemson.edu/~mark/330/colwell/p6des.pdf
    16. http://dayintechhistory.com/dith/november-1-1995-pentium-pro-ibm-pcjr-introduced/
    17. https://www.youtube.com/watch?v=dH1fo7jAFnc
    18. https://tams.informatik.uni-hamburg.de/lehre/2001ss/proseminar/mikroprozessoren/papers/pentium-pro-performance.pdf
    19. http://www.os2museum.com/wp/intel-overdrive-part-iii-pentium-ii-overdrive/
    20. https://en.wikipedia.org/wiki/ASCI_Red
    21. http://www.cpushack.com/SocketID.html