Intel Hyper-Threading Technology: Once Again About Hyper-Threading

Many Intel processors include modules supporting Hyper-Threading Technology, which, according to its developers, should increase chip performance and speed up the PC as a whole. What are the specifics of this solution from the American corporation, and how can you take advantage of Hyper-Threading?

Technology Basics

Let's start with some key facts. Hyper-Threading is a technology developed by Intel and first introduced to the public in 2001, originally to increase server performance. Its main principle is distributing processor calculations across several threads. This works even if the chip has only one core (and on chips with two or more cores, where work is already spread across cores, the technology complements that mechanism).

The main PC chip runs several threads by keeping a copy of the architectural state for each of them while sharing a single set of execution resources. If an application uses this feature, practically significant operations complete noticeably faster. It is also important that the technology is supported by the computer's input/output system, the BIOS.
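Whether a chip advertises this capability can be checked programmatically. A minimal sketch of ours, using the MSVC __cpuid intrinsic; note that the HTT flag (CPUID leaf 1, EDX bit 28) only says the package may expose several logical processors, and it is also set on multi-core chips without Hyper-Threading:

```cpp
#include <intrin.h>
#include <cstdio>

int main() {
    int regs[4] = {0};               // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                // CPUID leaf 1: feature flags
    bool htt = (regs[3] >> 28) & 1;  // EDX bit 28: HTT flag
    // Caution: this flag is also set on multi-core chips without
    // Hyper-Threading; it only means the package can expose more
    // than one logical processor.
    printf("HTT flag: %s\n", htt ? "set" : "clear");
    return 0;
}
```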

Enabling Hyper-Threading

If the processor installed in the PC supports the technology, it is usually activated automatically. In some cases, though, you have to enable Hyper-Threading manually. Doing so is simple.

You need to enter the main BIOS interface. To do this, at the very start of boot, press DEL (sometimes F2 or F10, occasionally another key; the right one is always shown in one of the lines of text displayed on screen immediately after the PC is switched on). In the BIOS interface, find the Hyper-Threading item; in I/O systems that support it, it is usually in a prominent place. Select the option, press Enter, and set it to Enabled. If this mode is already set, Hyper-Threading Technology is already working and you can take advantage of all its benefits. Having activated the technology, save the settings by selecting Save and Exit Setup. After this, the computer reboots in a mode where the processor runs with Hyper-Threading support. Hyper-Threading is disabled the same way: set the item to Disabled and save the settings.

Having covered how to enable and disable Hyper-Threading, let's look at its features in more detail.

Processors with Hyper-Threading Support

The first processor to implement the concept, according to some sources, was the Intel Xeon MP, also known as Foster MP. This chip shares a number of architectural components with the Pentium 4, on which the technology was subsequently implemented as well. Multi-threaded execution later appeared in Xeon server processors with the Prestonia core.

As for the current prevalence of Hyper-Threading, which processors support it? Among the most popular are chips in the Core and Xeon families; similar algorithms are also reported to be implemented in Itanium and Atom processors.

Having covered the basics of Hyper-Threading and the processors that support it, let's consider the most remarkable facts in the technology's development history.

Development history

As noted above, Intel showed the concept to the public in 2001, but the first steps toward the technology were taken back in the early 1990s, when the company's engineers noticed that PC processor resources were not fully utilized in many operations.

According to Intel's estimates, while a user works on a PC, the chip sits idle for significant stretches and is actively used only about 30% of the time. Expert opinion on this figure varies widely: some consider it clearly understated, others fully agree with the American developers' thesis.

However, most IT specialists agreed that even if it is not a full 70% of processor capacity standing idle, the wasted share is very significant.

The main task of the developers

Intel decided to correct this state of affairs with a qualitatively new approach to the efficiency of the main PC chip: a technology that would make more active use of the processor's capabilities. Practical development began in 1996.

According to the corporation's concept, while processing data from one program, the processor could direct idle resources to another application (or to a component of the current one with a different structure that requires additional resources). The algorithm also assumed effective interaction with other PC hardware components: RAM, the chipset, and software.

Intel managed to solve the problem. The technology, originally known under the name Willamette, was introduced into the architecture of some processors in 1999, and testing began. Soon it received its modern name, Hyper-Threading; it is hard to say whether this was simple rebranding or a radical revision of the platform. The subsequent facts about the technology's public appearances and its implementation in various Intel processor models we already know. Among the names in common use today is Hyper-Threading Technology.

Technology Compatibility Considerations

How well is Hyper-Threading supported by operating systems? With modern versions of Windows, there is no problem taking full advantage of Intel Hyper-Threading Technology. Of course, it is also essential that the technology is supported by the input/output system, as noted above.

Software and hardware factors

As for older operating systems, Windows 98, NT, and the relatively dated XP, a necessary condition for Hyper-Threading compatibility is ACPI support. If the OS does not implement it, not all the computation threads formed by the corresponding modules will be recognized by the computer. Windows XP on the whole does let you benefit from the technology. It is also highly desirable that multithreading be implemented in the applications the PC owner uses.

Sometimes a PC upgrade may be required if processors with Hyper-Threading support are being installed in place of originals that were not compatible with the technology. As with operating systems, however, there are no particular problems as long as the user has a modern PC, or at least one whose hardware is contemporary with the first Hyper-Threading processors noted above; motherboard chipsets adapted to such processors fully support the corresponding functions of the chip.

Acceleration criteria

If the computer's hardware and software are not compatible with Hyper-Threading, the technology can in theory even slow it down. This led some IT specialists to doubt the prospects of Intel's solution: they decided it was not a technological leap but a marketing ploy, and that Hyper-Threading's architecture was incapable of significantly speeding up a PC. The critics' doubts, however, were quickly dispelled by Intel engineers.

So, the basic conditions for the technology to be successfully used:

Hyper-Threading support by I/O system;

Compatibility of the motherboard with the processor of the corresponding type;

Technology support by the operating system and the specific application running in it.

While the first two points should pose no particular problems, program compatibility with Hyper-Threading can still raise issues. But if an application supports, say, dual-core processors, it is almost guaranteed to be compatible with Intel's technology.

There are studies showing roughly a 15-18% performance increase in programs adapted to dual-core chips when the processor's Hyper-Threading modules are active. We already know how to disable them, should the user doubt the technology's usefulness, though tangible reasons for such doubts are probably few.

Practical usefulness of Hyper-Threading

Has the technology made a tangible difference for Intel? Opinions differ, but many note that Hyper-Threading became popular enough to be indispensable to many server system manufacturers, and it was received positively by ordinary PC users as well.

Hardware processing

The technology's main advantage is its hardware implementation: the bulk of the work is performed inside the processor by dedicated modules, rather than by software algorithms pushed up to the chip's main core, which would reduce overall PC performance. On the whole, IT experts agree that Intel's engineers solved the problem they identified at the start of development: making the processor work more efficiently. Indeed, tests showed that for many practically significant user tasks, Hyper-Threading noticeably sped things up.

Among Pentium 4 chips, those equipped with modules supporting the technology worked noticeably more efficiently than the first modifications. This showed up largely as the ability to run in genuinely multitasking mode, with several different Windows applications open at once, where it is highly undesirable for one application's increased resource consumption to slow the others.

Simultaneous solution of different problems

Thus, processors supporting Hyper-Threading are better adapted than incompatible chips to simultaneously running, say, a browser, music playback, and document editing. Of course, the user feels these advantages in practice only if the PC's software and hardware are sufficiently compatible with this mode of operation.

Similar developments

Hyper-Threading technology is not the only one that was created to improve PC performance through multi-threaded computing. It has analogues.

For example, IBM's POWER5 processors also support multithreading: each of the two cores on the chip can execute two threads, so the chip processes four computation threads simultaneously.

AMD has also done notable work on multithreading concepts. The Bulldozer architecture is known to use algorithms similar to Hyper-Threading; the peculiarity of AMD's solution is that each thread is handled by its own processor blocks, while the second-level cache remains shared. Similar ideas are implemented in AMD's Bobcat architecture, aimed at laptops and small PCs.

Of course, the concepts from AMD, IBM, and Intel are direct analogues only in a very loose sense, as are their approaches to processor architecture in general. But the principles behind the respective technologies are quite similar, and the developers' goals for chip efficiency are close in essence, if not identical.

These are the key facts about this most interesting Intel technology: what it is, and how to enable or disable Hyper-Threading. Its value lies in practical use of its advantages, which you can tap once you have made sure the PC's hardware and software support the technology.

Some time ago we had occasion to evaluate memory performance in the context of Hyper-Threading technology and concluded that its influence is not always positive. When a quantum of free time appeared, we wanted to continue the research and examine the processes involved with the precision of machine clock cycles and bits, using software of our own design.

Platform under study

The test subject is an ASUS N750JK laptop with an Intel Core i7-4700HQ processor: base clock 2.4 GHz, rising to 3.4 GHz in Intel Turbo Boost mode, with 16 GB of DDR3-1600 (PC3-12800) RAM in dual-channel mode, running 64-bit Microsoft Windows 8.1.

Fig.1 Configuration of the platform under study.

The processor of the platform under study has 4 cores, which with Hyper-Threading enabled provides hardware support for 8 threads, i.e. 8 logical processors. The platform firmware passes this information to the operating system via the ACPI MADT (Multiple APIC Description Table). Since the platform contains only one RAM controller, there is no SRAT (System Resource Affinity Table) declaring the proximity of processor cores to memory controllers. Obviously, the laptop under study is not a NUMA platform, but for the sake of uniformity the operating system treats it as a NUMA system with a single domain, as the line NUMA Nodes = 1 indicates. A fact fundamental to our experiments: the first-level data cache is 32 KB for each of the four cores, and the two logical processors sharing a core share its L1 and L2 caches.
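For reference, this topology can also be queried from user mode. A minimal sketch of ours (not the authors' tool) using the Windows GetLogicalProcessorInformation API to count cores and logical processors and report the L1 data caches:

```cpp
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    DWORD len = 0;
    GetLogicalProcessorInformation(nullptr, &len);  // query required buffer size
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    GetLogicalProcessorInformation(info.data(), &len);

    int cores = 0, logical = 0;
    for (const auto& e : info) {
        if (e.Relationship == RelationProcessorCore) {
            ++cores;
            // Each set bit in ProcessorMask is one logical processor on this core.
            for (DWORD_PTR m = e.ProcessorMask; m; m >>= 1) logical += (int)(m & 1);
        } else if (e.Relationship == RelationCache && e.Cache.Level == 1 &&
                   e.Cache.Type == CacheData) {
            printf("L1 data cache: %lu KB\n", e.Cache.Size / 1024);
        }
    }
    printf("Cores: %d, logical processors: %d\n", cores, logical);
    return 0;
}
```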

Operation under study

We will study how the reading speed of a data block depends on its size. To do this, we choose the most productive method: reading 256-bit operands with the AVX instruction VMOVAPD. In the graphs, the X axis shows the block size and the Y axis the reading speed. Around the point on X corresponding to the L1 cache size, we expect to see an inflection, since performance should drop once the processed block no longer fits in the cache. In our test, under multi-threaded processing, each of the 16 initiated threads works with a separate address range. To control Hyper-Threading within the application, each thread calls the SetThreadAffinityMask API function, which sets a mask with one bit per logical processor: a set bit allows the thread to use the given processor, a cleared bit forbids it. For the 8 logical processors of our platform, mask 11111111b allows all of them (Hyper-Threading engaged), while mask 01010101b allows one logical processor per core (Hyper-Threading avoided).
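A hedged sketch of how such per-thread masking might look (a simplified reconstruction; the thread count, names, and work function are our own assumptions, not the authors' code):

```cpp
#include <windows.h>
#include <thread>
#include <vector>

// Pin the calling thread to the logical processors allowed by `mask`.
// 0xFF = 11111111b: all 8 logical processors (Hyper-Threading used).
// 0x55 = 01010101b: one logical processor per core (Hyper-Threading avoided).
void worker(DWORD_PTR mask, int index) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
    // ... read this thread's own address range here (selected by `index`) ...
}

int main() {
    const DWORD_PTR mask = 0x55;   // or 0xFF to engage both threads per core
    std::vector<std::thread> pool;
    for (int i = 0; i < 16; ++i)
        pool.emplace_back(worker, mask, i);
    for (auto& t : pool) t.join();
    return 0;
}
```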

The following abbreviations are used in the graphs:

MBPS (Megabytes per Second): block reading speed in megabytes per second;

CPI (Clocks per Instruction): number of clock cycles per instruction;

TSC (Time Stamp Counter): CPU cycle counter.

Note: The TSC register clock speed may not match the processor clock speed when running in Turbo Boost mode. This must be taken into account when interpreting the results.
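For illustration, an interval can be measured in TSC ticks with a compiler intrinsic (a sketch of ours; converting ticks to seconds requires a separately calibrated TSC frequency, precisely because of the Turbo Boost caveat above):

```cpp
#include <intrin.h>
#include <cstdio>

int main() {
    unsigned long long t0 = __rdtsc();   // read Time Stamp Counter
    volatile long long sink = 0;
    for (int i = 0; i < 1000000; ++i) sink += i;  // workload under test
    unsigned long long t1 = __rdtsc();
    // Note: on modern CPUs the TSC ticks at a fixed reference rate and does
    // NOT track Turbo Boost frequency changes; calibrate before converting
    // ticks to nanoseconds.
    printf("Elapsed: %llu TSC ticks\n", t1 - t0);
    return 0;
}
```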

On the right side of the graphs, the hexadecimal dump of the instructions making up the loop body of the target operation executed in each program thread (or the first 128 bytes of this code) is shown.
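The core of such a reading loop can be written with AVX intrinsics; _mm256_load_pd compiles to the VMOVAPD instruction named above. A minimal sketch (our own illustration, assuming 32-byte-aligned data, not the authors' exact loop):

```cpp
#include <immintrin.h>
#include <cstddef>

// Read a block of doubles with 256-bit aligned loads (VMOVAPD).
// `data` must be 32-byte aligned; `count` is the number of doubles.
double read_block(const double* data, size_t count) {
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i + 4 <= count; i += 4)
        acc = _mm256_add_pd(acc, _mm256_load_pd(data + i));  // VMOVAPD + VADDPD
    alignas(32) double out[4];
    _mm256_store_pd(out, acc);
    return out[0] + out[1] + out[2] + out[3];  // keep the loads observable
}
```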

Experiment No. 1. One thread



Fig.2 Single thread reading

The maximum speed is 213,563 megabytes per second. The inflection point occurs at a block size of about 32 KB.

Experiment No. 2. 16 threads on 4 processors, Hyper-Threading disabled



Fig.3 Reading in sixteen threads. The number of logical processors used is four

Hyper-Threading is disabled. The maximum speed is 797,598 megabytes per second, and the inflection point again occurs at a block size of about 32 KB. As expected, compared to single-threaded reading, the speed increased roughly 4 times, in line with the number of working cores.

Experiment No. 3. 16 threads on 8 processors, Hyper-Threading enabled



Fig.4 Reading in sixteen threads. The number of logical processors used is eight

Hyper-Threading is enabled. The maximum speed is 800,722 megabytes per second, barely higher than before. The big minus: the inflection point now occurs at a block size of about 16 KB, half the previous value, so the average speed dropped significantly. This is not surprising: each core has its own L1 cache, and the two logical processors of a core share it.

Conclusions

The operation studied scales quite well on a multi-core processor. The reasons: each core has its own L1 and L2 cache, the target block size is comparable to the cache size, and each thread works with its own address range. For academic purposes, we created these conditions in a synthetic test, recognizing that real applications are usually far from ideally optimized. But even under these conditions, enabling Hyper-Threading had a negative effect: for a slight increase in peak speed, there is a significant loss in processing speed for blocks between 16 and 32 KB.

15.03.2013

Hyper-Threading technology appeared in Intel processors, frightening to say, more than 10 years ago, and it is now an important element of Core processors. The question of whether HT is needed in games, however, is still not completely settled. We decided to run tests to understand whether gamers need a Core i7, or whether a Core i5 is the better buy, and also to find out how much better a Core i3 is than a Pentium.


Hyper-Threading Technology, developed by Intel and used exclusively in the company's processors starting with the memorable Pentium 4, is now taken for granted. A significant number of current- and previous-generation processors are equipped with it, and it will be used for the foreseeable future.

And it must be admitted that Hyper-Threading is useful and has a positive effect on performance; otherwise Intel would not use it to position processors within its line, and not as a secondary element but as one of the most important, if not the most important. To make clear what we mean, we have prepared a table that makes it easy to see how Intel segments its processors.


As you can see, there are very few differences between the Pentium and Core i3, or between the Core i5 and Core i7. In fact, the i3 and i7 differ from the Pentium and i5 only in the amount of third-level cache per core (clock frequency aside): the first pair has 1.5 MB, the second 2 MB. That difference alone cannot fundamentally affect performance, since it is small. That is why the Core i3 and Core i7 received Hyper-Threading support: it is the main element giving these processors a performance advantage over the Pentium and Core i5, respectively.

As a result, a slightly larger cache plus Hyper-Threading support allows noticeably higher prices. For example, Pentium-line processors (about 10 thousand tenge) are roughly half the price of the Core i3 (about 20 thousand tenge), even though physically, at the hardware level, they are practically identical and presumably cost the same to make. The gap between the Core i5 (about 30 thousand tenge) and Core i7 (about 50 thousand tenge) is also very large, though less than twofold in the younger models.


How justified is this markup? What real gain does Hyper-Threading provide? The answer has long been known: it varies, depending on the application and its optimization. We decided to check what HT can do in games, one of the most demanding "household" applications. This test is also a good follow-up to our earlier material on how the number of processor cores affects gaming performance.

Before moving on to the tests, let's recall (or learn) what Hyper-Threading Technology is. As Intel said when introducing it years ago, there is nothing particularly complicated about it: at the physical level, all that is needed is a second set of registers and a second interrupt controller per physical core. In Pentium 4 processors, these additional elements increased the transistor count by only about five percent. In modern Ivy Bridge cores (as in Sandy Bridge and the future Haswell), the extra elements for even four cores grow the die by less than 1 percent.


The additional registers and interrupt controller, coupled with software support, let the operating system see not one physical core but two logical ones. Data from the two streams sent by the system is still processed on the same core, but with a twist: one thread has the whole processor at its disposal, and as soon as some CPU blocks become free and idle, they are immediately given to the second thread. This makes it possible to use all the processor's blocks simultaneously and thereby raise its efficiency. As Intel stated, the gain under ideal conditions can reach 30 percent, though that figure applied to the Pentium 4 with its very long pipeline; modern processors benefit from HT less.
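This "two logical processors" view is directly visible to software. A trivial sketch of ours: the C++ standard library reports the logical count, not the physical one:

```cpp
#include <thread>
#include <cstdio>

int main() {
    // On a 4-core CPU with Hyper-Threading enabled this prints 8;
    // with HT disabled in the BIOS it prints 4.
    printf("Logical processors: %u\n", std::thread::hardware_concurrency());
    return 0;
}
```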

But ideal conditions for Hyper-Threading do not always hold, and most importantly, the worst case for HT is not a missing gain but a loss: under certain conditions, a processor with HT performs worse than one without it, because the overhead of dividing and queuing threads significantly exceeds the gain from processing them in parallel. Such cases occur far more often than Intel would like, and many years of Hyper-Threading use have not improved the situation. This is especially true in games, whose computation is complex and far from standard.

To find out Hyper-Threading's impact on gaming performance, we again used our long-suffering Core i7-2700K test processor and simulated four processors at once by disabling cores and toggling HT. Conventionally, they can be called Pentium (2 cores, HT disabled), Core i3 (2 cores, HT enabled), Core i5 (4 cores, HT disabled), and Core i7 (4 cores, HT enabled). Why conventionally? First of all, because by some characteristics they do not match the real products: disabling cores does not correspondingly reduce the third-level cache, which stays at 8 MB for all of them, and all our "conditional" processors run at the same 3.5 GHz, a frequency not yet reached by every processor in the Intel line.


However, this is even for the better: by holding all important parameters constant, we can find out the real impact of Hyper-Threading on gaming performance without reservations. The percentage difference between our "conditional" Pentium and Core i3 will be close to that between the real processors at equal frequencies. Nor should the use of a Sandy Bridge processor be confusing: our efficiency tests (see the article "Bare Performance: Examining the Efficiency of ALUs and FPUs") showed that Hyper-Threading's influence in the latest Core generations is unchanged, so this material will most likely remain relevant for the upcoming Haswell processors too.

Well, it seems all the questions about the testing methodology and the workings of Hyper-Threading Technology have been covered, so it's time to move on to the most interesting part: the tests.

3DMark 11

Even in the test where we studied the impact of the number of processor cores on gaming performance, we found that 3DMark 11 is completely relaxed about CPU performance, working fine even on a single core. Hyper-Threading had a similarly "powerful" influence: as you can see, the test notices no difference between a Pentium and a Core i7, to say nothing of the intermediate models.

Metro 2033

But Metro 2033 clearly noticed the appearance of Hyper-Threading, and it reacted negatively! Yes, that's right: enabling HT in this game reduces performance. A small reduction, of course: 0.5 frames per second with four physical cores and 0.7 with two. But this gives every reason to say that in Metro 2033 the Pentium is faster than the Core i3, and the Core i5 is better than the Core i7, confirming that Hyper-Threading is not effective always and everywhere.

Crysis 2

This game produced very interesting results. First, the influence of Hyper-Threading is clearly visible on dual-core processors: the Core i3 leads the Pentium by almost 9 percent, which is a lot for this game. A victory for HT and Intel? Not quite, since the Core i7 showed no gain over the noticeably cheaper Core i5. There is a reasonable explanation: Crysis 2 cannot use more than four data streams. Hence the good gain on the dual-core with HT (four threads, even logical ones, are better than two), while the Core i7's extra threads had nowhere to go; four physical cores were quite enough. So this test shows a positive effect of HT in the Core i3, which is noticeably better than the Pentium here, while among the quad-cores the Core i5 again looks the more sensible choice.

Battlefield 3

The results here are very strange. If in the core-count test Battlefield 3 was an example of a microscopic but linear increase, enabling Hyper-Threading introduced chaos into the results. We can state that the Core i3, with its two cores and HT, turned out best of all, ahead of even the Core i5 and Core i7. Strange, certainly, though the Core i5 and Core i7 were again level with each other. What explains this is unclear; most likely the testing methodology in this game, which gives larger errors than standard benchmarks, played a role.

F1 2011

In the core-count test, F1 2011 proved to be one of the games most sensitive to the number of cores, and here it again surprised us, this time with the excellent impact of Hyper-Threading on performance. As in Crysis 2, enabling HT worked very well on dual-core processors: the difference between our conditional Core i3 and Pentium is more than twofold! Clearly, the game is starved for cores on two, and its code is parallelized so well that the effect is striking. On the other hand, four physical cores are hard to argue with: the Core i5 is noticeably faster than the Core i3. But the Core i7, as in previous games, showed nothing outstanding over the Core i5, and for the same reason: the game cannot use more than 4 threads, and HT's overhead drags the Core i7 below the Core i5.

This old warrior needs Hyper-Threading about as much as a hedgehog needs a T-shirt: its influence is nowhere near as noticeable as in F1 2011 or Crysis 2. Still, enabling HT on a dual-core processor did bring 1 extra frame. That is hardly enough to call the Core i3 better than the Pentium; at the very least, the improvement clearly does not match the price difference between these processors. The price gap between the Core i5 and Core i7 is not even worth discussing, since the processor without HT again turned out faster, and noticeably so: by 7 percent. Whichever way you look at it, four threads is the maximum for this game, so Hyper-Threading here does not help the Core i7 but hinders it.

In the past, we talked about Simultaneous Multi-Threading (SMT) technology as used in Intel processors. Originally carried as a possible option under the codename Jackson Technology, it was officially announced by Intel at the IDF forum last fall, and the codename was replaced by the more fitting Hyper-Threading. To understand how the new technology works, we need some background. Namely: what is a thread, and how are threads executed? Why does an application run at all? How does the processor know which operations to perform on which data?

All of this information is contained in the compiled code of the running application. As soon as the application receives a command or any data from the user, threads are immediately sent to the processor, which performs what is required in response to the user's request. From the processor's point of view, a thread is a set of instructions that must be executed. When a projectile hits you in Quake III Arena, or when you open a Microsoft Word document, the processor is sent a specific set of instructions to execute.

The processor knows exactly where to get these instructions: a rarely mentioned register called the Program Counter (PC) serves this purpose. It points to the location in memory where the next instruction to be executed is stored. When a thread is dispatched to the processor, the thread's starting memory address is loaded into the program counter so the processor knows exactly where to begin. After each instruction, the register's value is incremented, and this continues until the thread terminates, at which point the address of the next instruction to execute is placed into the program counter. Threads can interrupt one another: the processor saves the program counter's value on the stack and loads a new value into the counter. But one limitation remains in this process: only one thread can execute at any given moment.

There is a well-known way around this limitation: use two processors. If one processor can execute one thread at a time, two processors can execute two threads in the same time. Note that this approach is far from ideal and brings many other problems, some of which you are probably familiar with. First, multiple processors always cost more than one. Second, managing two processors is not trivial either, and the division of resources between them must be dealt with; for example, before the AMD 760MP chipset appeared, every x86 platform with multiprocessing support shared the system bus bandwidth among all available processors. The main drawback, however, is different: for such a setup to work, both the applications and the operating system itself must support multiprocessing. The ability to distribute the execution of multiple threads across computer resources is often called multithreading, and both the operating system and the applications must support it to make the most of the machine. Keep this in mind as we look at another approach to multithreading: Intel's new Hyper-Threading technology.

Productivity is never enough

There is always a lot of talk about efficiency, not only in corporate settings and serious projects but in everyday life too. They say homo sapiens uses only part of the brain's capacity; the same applies to the processors of modern computers.

Take the Pentium 4, for example. It has a total of seven execution units, two of which can run at double speed, performing two operations (micro-ops) per clock cycle. But in any case, you would not find a program able to fill all these units with instructions: ordinary programs make do with simple integer calculations and a few load/store operations, leaving floating point aside, while other programs (Maya, for example) load mainly the floating-point units.

To illustrate the situation, let's imagine a processor with three execution units: an integer arithmetic logic unit (ALU), a floating-point unit (FPU), and a load/store unit (for writing and reading data from memory). Additionally, assume that our processor can perform any operation in one clock cycle and can dispatch operations to all three units simultaneously. Now imagine that a thread of instructions like the following is sent to this processor for execution:
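The original listing survives only as a lost figure; below is a hedged C++ stand-in consistent with the surrounding description (one operation per cycle, the FPU untouched):

```cpp
// Hypothetical thread 1 (a reconstruction, not the original listing).
// Each statement maps to one of the three execution units per cycle.
static int mem[4];

void thread1() {
    int a = mem[0];   // cycle 1: load/store unit reads memory
    a = a + 1;        // cycle 2: integer ALU adds
    mem[0] = a;       // cycle 3: load/store unit writes memory
    // The FPU is never used: one unit of three busy per cycle = 33% load.
}
```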

The figure below illustrates the load on the execution units (gray indicates an idle unit, blue a working one):

So, you see that in each clock cycle only 33% of the execution units are used, and the FPU stays completely unused. According to Intel, most IA-32 x86 programs use no more than 35% of the Pentium 4 processor's execution units.

Now imagine another thread and send it to the processor for execution. This time it consists of load, add, and store operations, executed in the following order:
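Again a hedged stand-in for the lost listing (load, add, store, one operation per cycle):

```cpp
// Hypothetical thread 2, same caveat as above.
static int mem2[4];

void thread2() {
    int b = mem2[1];  // cycle 1: load/store unit
    b = b + 2;        // cycle 2: integer ALU
    mem2[1] = b;      // cycle 3: load/store unit
    // Again 33% utilization, on the same units as thread 1; this is
    // exactly why the two threads will later conflict under Hyper-Threading.
}
```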

And again, the execution units are only 33% loaded.

A good way out of this situation would be instruction-level parallelism (ILP), where several instructions execute simultaneously because the processor can fill several parallel execution units at once. Unfortunately, most x86 programs are poorly suited to ILP, so other ways of raising performance must be found. For instance, with two processors in the system, two threads could execute simultaneously; this is called thread-level parallelism (TLP), and it is, by the way, quite an expensive solution.

What other ways are there to increase the execution throughput of modern x86 processors?

Hyper-Threading

The underutilization of execution units has several causes. Broadly speaking, if the processor cannot receive data at the required speed (the result of insufficient system bus and memory bus bandwidth), the units are used less efficiently. Another reason is the lack of instruction-level parallelism in most instruction streams.

Currently, most manufacturers improve processor speed by raising the clock rate and enlarging caches. That does raise performance, but the processor's potential still goes unused. If we could run several threads simultaneously, we could use the processor much more efficiently. This is precisely the essence of Hyper-Threading technology.

Hyper-Threading is Intel's name for a technology that previously existed outside the x86 world: Simultaneous Multi-Threading (SMT). The idea is simple: one physical processor presents itself to the operating system as two logical processors, and the OS sees no difference between one SMT processor and two ordinary ones. In both cases, it dispatches threads as if to a dual-processor system; the rest is handled in hardware.

In a processor with Hyper-Threading, each logical processor has its own set of registers (including a separate program counter); to keep the technology simple, instruction fetch/decode is not performed for two threads simultaneously but alternates between them. Only ordinary instruction execution happens in parallel.

The technology was officially announced at the Intel Developer Forum last fall and demonstrated on a Xeon processor rendering in Maya; in that test, the Xeon with Hyper-Threading performed 30% better than the standard Xeon. A nice performance boost, but the most interesting part is that the technology is already present in shipping Pentium 4 and Xeon cores, just switched off.

The technology has not yet been launched, but those of you who bought a 0.13-micron Xeon and installed it on a board with an updated BIOS were probably surprised to find a BIOS option to enable/disable Hyper-Threading.

In the meantime, Intel is leaving the Hyper-Threading option disabled by default; enabling it requires only a BIOS update. All this concerns workstations and servers; for the personal computer market, the company has no near-term plans for the technology, although motherboard makers may well offer a special BIOS that turns Hyper-Threading on.

Which leaves a very interesting question: why does Intel want this option disabled?

Going deeper into technology

Remember those two threads from the earlier examples? Assume this time that our processor is equipped with Hyper-Threading, and let's see what happens if we try to execute them simultaneously:

As before, blue rectangles indicate execution of the first thread's instructions, and green rectangles the second thread's. Gray rectangles show unused execution units, and red ones show a conflict, where two different instructions from different threads arrived at the same unit.

So what do we see? Thread-level parallelism failed: the execution units are used even less efficiently, and instead of running the threads in parallel, the processor runs them more slowly than it would without Hyper-Threading. The reason is simple: we tried to run two very similar threads simultaneously, both consisting of load/store and addition operations. Had we run an "integer" application in parallel with a floating-point one, the situation would be much better. As you can see, Hyper-Threading's effectiveness depends strongly on the type of load on the PC.

Currently, most PC users use their computers roughly as in our example: the processor performs many very similar operations. Unfortunately, similar operations bring additional management difficulties: situations arise when no execution unit of the required type is free while, as luck would have it, there are twice as many instructions as usual. In most cases, if home computer processors used Hyper-Threading, there would be no performance gain, and perhaps even a 0-10% drop.

On workstations, however, Hyper-Threading has more room to raise performance, though it depends on the specific use: a workstation can mean a high-end machine for 3D graphics, or simply a heavily loaded computer.

The greatest performance gains from Hyper-Threading are seen in server applications, mainly thanks to the wide variety of operations sent to the processor: a transactional database server can run 20-30% faster with Hyper-Threading enabled, and slightly smaller gains are seen on web servers and in other areas.

Maximum efficiency from Hyper-Threading

Do you think Intel developed Hyper-Threading only for its server line? Of course not; if that were the case, it would not waste die area on it in its other processors. In fact, the NetBurst architecture used in the Pentium 4 and Xeon is perfectly suited to a core supporting simultaneous multithreading. Imagine our processor again, this time with one more execution unit: a second integer unit. Let's see what happens when the threads run on both:

With the second integer unit, the only conflict occurred on the last operation. In this respect our theoretical processor somewhat resembles the Pentium 4, which has three integer units (two ALUs and one slow integer unit for rotate/shift operations). More importantly, both Pentium 4 integer ALUs run at double speed, performing two micro-ops per clock cycle, which means either of them could execute those two addition operations from different threads within one clock cycle.

But this does not solve our problem. It would make little sense to pad the processor with extra execution units just to raise Hyper-Threading performance; in terms of silicon area, that would be extremely expensive. Instead, Intel proposed that developers optimize their programs for Hyper-Threading.

Using the HALT instruction, one of the logical processors can be suspended, improving the performance of applications that gain nothing from Hyper-Threading: the application will not run slower, one logical processor is simply stopped and the system runs on the remaining one, with performance the same as on a single-processor computer. Then, when the application decides it would benefit from Hyper-Threading, the second logical processor simply resumes its work.

There is a presentation on the Intel website that describes exactly how to program to get the most out of Hyper-Threading.

Conclusions

While we were all extremely excited by rumors of Hyper-Threading in the cores of all modern Pentium 4/Xeons, it will not be free performance in every case. The reasons are clear, and the technology has a long way to go before we see it running on all platforms, including home computers. But with developer support, it can certainly be a good ally for the Pentium 4, Xeon, and Intel's future processor generations.

Given current constraints and available packaging technology, Hyper-Threading seems a smarter choice for the consumer market than, say, AMD's SledgeHammer approach, whose processors carry two full cores. Until packaging technologies such as Bumpless Build-Up Layer mature, the cost of developing multi-core processors may simply be prohibitive.

It's interesting to see how different AMD and Intel have become over the past few years. AMD once practically copied Intel's processors; now the companies have developed fundamentally different approaches to future server and workstation processors. AMD has come a very long way, and if SledgeHammer processors really do carry two cores, that solution will outperform Hyper-Threading: besides doubling all the execution units, it avoids the problems we described above.

Hyper-Threading won't hit the mainstream PC market for a while, but with good developer support, it could be the next technology to make its way down from the server level to mainstream PCs.

"...And we are proud and our enemy is proud
Hand, forget about laziness. Let's see,
who has whose boots at the end
will finally bow his knees..."
© film "D'Artagnan and the Three Musketeers"

Some time ago, the author allowed himself to "grumble a little" about Intel's new Hyper-Threading paradigm. To Intel's credit, the author's bewilderment did not go unnoticed, and he was offered help in finding out (as the corporation's managers delicately put it) the "real" situation with Hyper-Threading technology. Well, the desire to find out the truth can only be praised. Isn't that right, dear reader? At least one of the truisms says as much: truth is good. We will try to act in accordance with that phrase. Moreover, a certain amount of new information has indeed appeared.

First, let's formulate exactly what we know about Hyper-Threading technology:

1. This technology is designed to increase processor efficiency. The fact is that, by Intel's estimates, all the execution units in the processor are busy only 30% of the time (a rather controversial figure, by the way; the details of its calculation are unknown). Agree, that is rather a shame. And the idea of somehow loading up the remaining 70% looks quite logical (all the more so since the Pentium 4 processor itself, in which this technology will be implemented, does not suffer from excessive per-megahertz performance). So the author is forced to admit the idea is quite sound.

2. The essence of Hyper-Threading is that while one "thread" of a program is executing, idle execution units can begin executing another "thread" of the program (or a "thread" of another program). Or, for example, while one instruction sequence waits for data from memory, another sequence can execute.

3. Naturally, when executing different "threads", the processor must somehow distinguish which commands belong to which "thread". This means there is some mechanism (a mark of some kind) by which the processor tells them apart.

4. It is also clear that, given the small number of general-purpose registers in the x86 architecture (8 in total), each thread has its own set of registers. This is no longer news: the architectural limitation has long been circumvented by "register renaming", whereby there are many more physical registers than logical ones. The Pentium III has 40 of them; for the Pentium 4 the number is surely greater. The author has no grounded figure (apart from considerations of "symmetry" :-) but supposes there are about a hundred; no reliable information about their number could be found. According to as yet unconfirmed data there are 256; according to other sources, a different number. Complete uncertainty, in short... Intel's position on this point is, by the way, entirely opaque :-( and the author does not understand the reason for such secrecy.

5. It is also known that when several "threads" contend for the same resources, or one "thread" waits for data, the programmer should insert a special "pause" instruction to avoid a drop in performance (a sketch of this "pause" in use follows after this list). Naturally, this will require yet another recompilation of programs.

6. It is also clear that situations are possible where attempting to execute several "threads" simultaneously will reduce performance. For example, since the L2 cache is not infinite and active "threads" will try to load it, such a "fight for the cache" may lead to constant flushing and reloading of data in the second-level cache.

7. Intel claims that the gain from optimizing programs for this technology will be up to 30% (or rather, Intel claims up to 30% on today's server applications and today's systems). Hmm... that is more than enough incentive to optimize.
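As far as we can tell, the "pause" in item 5 is the x86 PAUSE instruction, exposed in C/C++ as the _mm_pause intrinsic. A minimal sketch of its canonical use in a spin-wait loop (the names are ours):

```cpp
#include <emmintrin.h>   // _mm_pause
#include <atomic>

std::atomic<bool> ready{false};

void spin_wait() {
    while (!ready.load(std::memory_order_acquire)) {
        // PAUSE hints to the core that this thread is spinning, freeing
        // shared execution resources for the sibling logical processor.
        _mm_pause();
    }
}
```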

Well, we have formulated some features. Now let's try to think through some implications (where possible, based on the information known to us). What can we say? First we need to look closely at exactly what is being offered to us. Is this cheese really "free"? Let's first figure out how exactly the "simultaneous" processing of several "threads" will happen. And what, by the way, does Intel mean by the word "thread"?

The author has the (possibly mistaken) impression that what is meant is a program fragment that a multitasking operating system assigns for execution to one of the processors of a multiprocessor hardware system. "Wait!" the attentive reader will say, "that is one of the standard definitions! What's new here?" Nothing; the author claims no originality on this point. He would like to figure out what Intel was "original" about :-). Well, let's take it as a working hypothesis.

Next, a certain thread executes. Meanwhile the instruction decoder (which is, by the way, fully asynchronous and not part of the notorious 20 NetBurst pipeline stages) fetches and decodes instructions (with all their interdependencies) into micro-operations. What the author means by "asynchronous" is that x86 instructions are "collapsed" into micro-operations in the decode unit; each x86 instruction may decode into one, two, or more micro-operations. At the processing stage, interdependencies are resolved and the needed data is delivered over the system bus, so the speed of this unit often depends on the speed of data access from memory and, in the worst case, is determined by it. It would be logical to decouple it from the pipeline in which the micro-operations are actually executed, and this was done by placing the decode unit before the trace cache. What does this achieve? A simple thing: if micro-operations are present in the trace cache for execution, the processor works more efficiently. Naturally, this unit runs at processor frequency, unlike the Rapid Engine. By the way, the author has the impression that this decoder is something like a pipeline up to 10-15 stages long; thus, from fetching data out of the cache to getting a result, there are apparently about 30-35 stages in total (including the NetBurst pipeline; see Microdesign Resources, August 2000, Microprocessor Report, Volume 14, Archive 8, page 12).

The resulting set of micro-operations, with all their interdependencies, accumulates in the trace cache, which holds approximately 12,000 micro-ops. By rough estimate (the source of the estimate being the structure of the P6 micro-op, since the micro-op length is unlikely to have changed fundamentally, and a micro-op with its service fields is about 100 bits long), the trace cache works out to between 96 KB and 120 KB!!! Against this background, an 8 KB data cache looks somehow asymmetrical :-)... and pale. Of course, as size grows, access latency grows (for example, at 32 KB the latency would be 4 clock cycles instead of two). But is the access speed of this data cache really so critical that a 2-cycle latency increase (against the total length of the whole pipeline) makes the larger size unprofitable? Or is it simply reluctance to enlarge the die? But then, in the move to 0.13 microns, increasing precisely this cache should have been the first step, not the second-level cache. Those who doubt this thesis should recall the transition from the Pentium to the Pentium MMX: thanks to the doubled first-level cache, almost all programs gained 10-15% in performance. What can one say of a quadrupling (especially since processor speeds have risen to 2 GHz and the multiplier from 2.5 to 20)? According to unconfirmed reports, in the next modification of the Pentium 4 core (Prescott) the first-level cache will be increased to 16 or 32 KB, and the second-level cache will also grow. At the moment, though, all this is nothing more than rumor. Frankly, it is a slightly puzzling situation. Although, let us make a reservation: the author fully admits that such an idea may be blocked by some specific reason, for example certain requirements on the geometry of block placement, or a banal lack of free space near the pipeline (clearly, the data cache must be located close to the ALU).

Without getting distracted, let's follow the process further. The pipeline is running; suppose the current instructions use the ALU while the FPU, SSE, SSE2 and the rest sit idle. Not so fast: Hyper-Threading comes into play. Noticing that micro-operations and their data are ready for a new thread, the register renaming unit allocates a portion of the physical registers to it. Two options are possible here: the physical register file is shared by all threads, or is separate for each. Judging by the fact that Intel's Hyper-Threading presentation does not list the register renaming unit among the blocks that need changing, the first option was chosen. Is that good or bad? From the technologists' standpoint, clearly good, since it saves transistors; from the programmers' standpoint, it is still unclear, although if the number of physical registers really is 128, no "register shortage" can arise at any reasonable number of threads. The micro-operations are then sent to the scheduler, which dispatches them to an execution unit (if it is free) or queues them if the unit is not currently available. Thus, ideally, more efficient use of the existing execution units is achieved, and at that moment the CPU itself looks to the OS like two "logical" processors. Hmm... Is everything really so cloudless? Let's look closer: part of the hardware (the caches, the Rapid Engine, the branch prediction module) is shared between the two processors; branch prediction accuracy will most likely suffer slightly from this, especially if the simultaneously executing threads are unrelated. And other blocks (for example MIS, the micro-instruction sequencer, a kind of ROM containing pre-programmed sequences of common operations, and RAT, the register alias table) must be distinguished between the threads running on the "different" processors. From the cache sharing it also follows that if two threads are cache-greedy (that is, enlarging the cache helps them a lot), using Hyper-Threading can even reduce speed. This is because at present a "competitive" cache mechanism is implemented: the currently "active" thread evicts the "inactive" one's data. The caching mechanism may apparently change, however. It is also clear that speed (at least for now) will decrease in those applications in which it decreased under honest SMP: SPEC ViewPerf, for example, usually shows better results on single-processor systems, so results on a system with Hyper-Threading will probably be lower than without it. Actually, the results of practical Hyper-Threading testing can be viewed at.

By the way, information leaked onto the Internet that the ALUs in the Pentium 4 are 16-bit. At first the author was very skeptical of such information (what won't the envious dream up :-), but the publication of the same in the Microprocessor Report made him wonder: what if it is true? And although this information is not directly related to the topic of the article, it is hard to resist :-). As far as the author "understood enough", the point is that the ALU really is 16-bit (I stress: the ALU only; this has nothing to do with the bit width of the processor itself). Thus, in half a clock cycle (this is called a tick) the double-frequency ALU computes only 16 bits; the other 16 are computed in the next half clock. Hence, by the way, the need for the ALU to be twice as fast is easy to understand: it is required for timely "grinding" of the data, so a full 32 bits are computed in a full clock cycle. The digging (about which a separate poem could be written) yielded the following: each ALU is split into two 16-bit halves. In the first half-tick, one half processes the low 16 bits of two numbers and forms the carry bits for the other half, which at that time is finishing the previous pair of numbers. In the second tick, the first half processes the low 16 bits of the next pair of numbers and forms their carries, while the second half processes the high 16 bits of the first pair and produces the finished 32-bit result (see the sketch after the list below). The latency to the first result is 1 clock cycle, but after that a 32-bit result appears every half clock. Quite witty and effective. Why was this particular ALU model chosen? Apparently, with this organization, Intel kills several birds with one stone:

  1. It is clear that a 16-bit wide pipeline is easier to clock high than a 32-bit wide one, if only because of crosstalk and the like.

  2. Apparently, Intel considered integer operations frequent enough to be worth accelerating the ALU rather than, say, the FPU. Most likely, either lookup tables or carry-save schemes are used when computing the results of integer operations. For comparison: one 32-bit table would hold 2^32 entries, i.e., 4 GB; two 16-bit tables are 2×64 KB, i.e., 128 KB. Feel the difference! Besides, carry accumulation over two 16-bit portions goes faster than over one 32-bit portion.

  3. It saves transistors and... heat. After all, it is no secret that all these architectural tricks generate heat. Apparently, heat was quite a large (and perhaps the main) problem: the Thermal Monitor technology alone speaks volumes! There is no strict need for such a technology as such; of course, it is nice that it exists, but let's be honest: a simple thermal cutoff would have been enough for adequate reliability. Since such a complex technology was provided anyway, it means the scenario was seriously considered in which on-the-fly frequency changes were one of the normal operating modes. Or perhaps even the main one? Not for nothing were there rumors that the Pentium 4 had been planned with a much larger number of execution units; then heat would simply have become the main problem. According to the same rumors, heat dissipation was to reach 150 W. In that case it is entirely logical to take measures so that the processor runs "at full capacity" only in systems where proper cooling is provided, especially since most cases of "Chinese" origin do not shine with thoughtful thermal design. Hm... we have come a long way :-)
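To make the staggered-halves scheme described above more tangible, here is a minimal C sketch that performs one 32-bit addition as two 16-bit half-additions with an explicit carry between them. This is purely an illustration of the arithmetic idea, not a model of the actual silicon; the function name and sample values are invented for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* One 32-bit addition done as two 16-bit half-additions:
   the low halves are added first and produce a carry,
   which then feeds the addition of the high halves. */
static uint32_t staggered_add(uint32_t a, uint32_t b)
{
    uint32_t lo    = (a & 0xFFFFu) + (b & 0xFFFFu); /* first tick: low halves   */
    uint32_t carry = lo >> 16;                      /* carry into high halves   */
    uint32_t hi    = (a >> 16) + (b >> 16) + carry; /* second tick: high halves */
    return (hi << 16) | (lo & 0xFFFFu);             /* "glue" the result back   */
}

int main(void)
{
    printf("%u\n", staggered_add(70000u, 80000u)); /* prints 150000 */
    return 0;
}
```

In hardware, of course, the two halves work on different pairs of numbers at the same time, which is exactly what lets a result come out every half cycle once the pipeline is full.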

But all this is theorizing. Are there processors today that use this technology? There are: the Xeon (Prestonia) and the Xeon MP. Interestingly, the Xeon MP differs from the Xeon in supporting up to 4 processors (chipsets like IBM Summit support up to 16; the technique is roughly the same as in the ProFusion chipset) and in having a third-level cache of 512 KB or 1 MB integrated into the core. By the way, why integrate a third-level cache? Why not enlarge the first-level cache? There must be some reasonable cause... And why not enlarge the second-level cache? Perhaps the reason is that the Advanced Transfer Cache needs relatively low latency, and increasing a cache's size increases its latency. So the third-level cache is essentially "presented" to the core and the second-level cache as a bus. Just a bus :-). Progress is obvious: everything has been done to feed data into the core as quickly as possible (and, at the same time, to load the memory bus less).

Well, so it turns out there are no particular bottlenecks, and the author has nothing to "grumble" about? One processor, and the OS sees two. Fine! Two processors, and the OS sees four! Beauty! Stop! And which OS will work with four processors? Operating systems from Microsoft that understand more than two processors cost completely different money. For example, 2000 Professional, XP Professional, and NT 4.0 understand only two processors. And, given that for now this technology is intended for the workstation (and server) market and is available only in the corresponding processors, it is just plain offensive: today we can use processors with this technology only by buying a dual-processor board and installing a single CPU. "Curiouser and curiouser!", as Alice in Wonderland said... That is, a person eager to use this technology is simply forced to buy the Server or Advanced Server versions of current operating systems. The "free" logical processor turns out to be a bit expensive... It is worth adding, perhaps, that Intel is currently actively "communicating" with Microsoft, trying to tie the licensing policy to the physical processor: at least Windows XP is already licensed by the number of physical processors.

Naturally, you can always turn to operating systems from other vendors, but let's be honest: this is not much of a way out of the current situation... So one can understand the hesitation of Intel, which deliberated for quite a long time over whether to use this technology or not.

And let's not forget a rather important conclusion: using Hyper-Threading can lead to both performance gains and performance losses. The losses have already been discussed, so let's figure out what is needed to win. For a gain, the following must know about this technology:

  1. Motherboard BIOS
  2. Operating system (!!!)
  3. Actually, the application itself

Let me dwell on the last point in more detail. The BIOS is not an issue, and we discussed operating systems a little earlier. But threads that, for example, are waiting for data from memory will have to include a special pause instruction so as not to slow the processor down: without its data, a thread is capable of tying up some of the execution units. And to insert this instruction, applications will have to be recompiled; this is not great, but, thanks to Intel, everyone has been getting used to it lately :-). Thus, the main (in the author's opinion) disadvantage of Hyper-Threading technology is the need for yet another recompilation. The main advantage of this approach is that, along the way, such recompilation will also (and probably more noticeably :-)) raise performance on "honest" dual-processor systems, and this can only be welcomed. Incidentally, experimental studies already confirm that in most cases programs optimized for SMP gain from 15% to 18% from Hyper-Threading, which is quite good. There are also known cases in which Hyper-Threading leads to a drop in performance.
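For illustration, here is a minimal sketch of such a waiting loop in C. It uses the _mm_pause() intrinsic, the compiler-level spelling of the pause instruction mentioned above (available in GCC, Clang, and MSVC via <xmmintrin.h>); the function name and the flag protocol are this example's own assumptions:

```c
#include <stdatomic.h>   /* C11 atomics */
#include <xmmintrin.h>   /* _mm_pause(): emits the pause instruction */

/* Spin until another thread sets *flag to a nonzero value.
   Without pause, the busy loop would saturate the execution units
   and starve the sibling logical processor; with it, the core is
   hinted that this thread is merely waiting. */
void spin_wait(const atomic_int *flag)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        _mm_pause();
}
```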

And finally, let's try to imagine what could change (improve) as this idea develops further. It is quite obvious that the development of this technology will be directly tied to the development of the Pentium 4 core, so let's imagine potential changes in that core. What is next on the plan? The 0.09-micron process, better known as 90 nm... The author is inclined to believe (at the moment) that this processor family will develop in several directions at once:

  • Thanks to the finer process, processor frequency will climb even higher.
  • Let's hope the data cache will be enlarged, at least to 32 KB.
  • An "honest" 32-bit ALU will be built, which should improve performance.
  • The system bus speed will be raised (though this is already in the near future).
  • Dual-channel DDR memory will appear (again, the wait is relatively short).
  • Perhaps an analogue of x86-64 will be introduced, if that technology (thanks to AMD) takes root. The author hopes with all his might that any such analogue will be compatible with x86-64; enough of creating mutually incompatible extensions... Also of interest here is a statement by Jerry Sanders that AMD and Intel agreed last year on cross-licensing of everything except the Pentium 4 system bus. Does this mean that Intel will build x86-64 into the next Pentium 4 (Prescott) core, while AMD will build Hyper-Threading into its processors? An interesting question...
  • Perhaps the number of execution units will be increased. True, like the previous point, this is rather debatable, since it requires an almost complete redesign of the core, and that is a long and labor-intensive process.

I wonder whether the idea of Hyper-Threading itself will be developed further? Quantitatively, there is nowhere for it to grow: two physical processors are clearly better than three logical ones, and positioning would not be easy... Interestingly, though, Hyper-Threading can also be useful when integrating two (or more) processor cores on one chip. As for qualitative changes, the presence of such technology in ordinary desktops means that most users will in fact be working on [almost] dual-processor machines, which is very good: such machines run noticeably more smoothly and respond better to user actions even under heavy load. This, from the author's point of view, is very good.

Instead of an afterword

The author must admit that his attitude towards Hyper-Threading changed several times while working on this article. As information was collected and processed, the attitude swung from generally positive to the opposite and back :-). As things stand, the following can be written:

There are only two ways to raise performance: raise the frequency, or raise the work done per clock. And if the entire Pentium 4 architecture is designed around the first path, then Hyper-Threading is precisely the second. From this point of view, it can only be welcomed. Hyper-Threading also carries several interesting consequences: a change in the programming paradigm, the arrival of multiprocessing for the masses, and an increase in processor performance. However, there are several "big bumps" on this path on which it is important not to "get stuck": the lack of normal support from operating systems and, most importantly, the need to recompile (and in some cases rework the algorithms of) applications so that they can take full advantage of Hyper-Threading. In addition, Hyper-Threading would make it possible to run the operating system and applications truly in parallel, rather than in "pieces" taking turns, as happens now. Provided, of course, that there are enough free execution units.

The author would like to express his gratitude to Maxim Lenya (aka C.A.R.C.A.S.S.) and Ilya Vaitsman (aka Stranger_NN) for their repeated and invaluable assistance in writing this article.
I would also like to thank all the forum participants who offered valuable comments.
