Hyper-Threading: what it is, how it works, and how to enable it

"...And we are proud and our enemy is proud
Hand, forget about laziness. Let's see,
who has whose boots at the end
will finally bow his knees..."
© film "D"Artagnan and the Three Musketeers"

Some time ago, the author allowed himself to "grumble a little" about Intel's new Hyper-Threading paradigm. To Intel's credit, the author's bewilderment did not go unnoticed, and he was offered help in clarifying (as the corporation's managers delicately put it) the "real" situation with Hyper-Threading technology. Well, the desire to find out the truth can only be praised. Isn't that right, dear reader? At the very least, one of the truisms says: truth is good. So we will try to act in accordance with that phrase, all the more so since a certain amount of genuinely new information has appeared.

First, let’s formulate what exactly we know about Hyper Threading technology:

1. This technology is designed to increase processor efficiency. The fact is that, by Intel's estimate, only about 30% of all execution units in the processor are busy most of the time (a rather controversial figure, by the way; the details of its calculation are unknown). Agree, that is rather a shame. And the idea of somehow putting the remaining 70% to work looks quite logical (all the more so since the Pentium 4 itself, in which this technology will be implemented, does not suffer from excessive per-megahertz performance). So the author is forced to admit that the idea is quite sound.

2. The essence of Hyper-Threading is that while one "thread" of a program is executing, idle execution units can begin executing another "thread" of the same program (or a "thread" of another program). For example, while one instruction sequence waits for data from memory, another sequence can execute.

3. Naturally, when executing different "threads", the processor must somehow distinguish which instructions belong to which "thread". This means there is some mechanism (some kind of tag) by which the processor tells the "threads" apart.

4. It is also clear that, given the small number of general-purpose registers in the x86 architecture (eight in total), each thread has its own set of registers. However, this is no longer news: the architectural limitation has long been circumvented by "register renaming", meaning there are many more physical registers than logical ones. The Pentium III has 40 of them. The number in the Pentium 4 is surely larger; the author has no evidence (beyond considerations of "symmetry" :-) for his opinion that there are around a hundred. No reliable information about their number could be found: according to as-yet-unconfirmed reports there are 256; according to other sources, a different number. In general, complete uncertainty... By the way, Intel's position on this point is entirely incomprehensible :-( The author does not understand what caused such secrecy.

5. It is also known that when several "threads" claim the same resources, or one of the "threads" is waiting for data, the programmer must insert a special "pause" instruction to avoid a drop in performance. Naturally, this will require yet another recompilation of programs.

6. It is also clear that there can be situations where attempting to execute several "threads" simultaneously reduces performance. For example, because the L2 cache is not infinite and the active "threads" will both try to load it, such a "fight for the cache" may lead to constant flushing and reloading of data in the second-level cache.

7. Intel claims that optimizing programs for this technology will yield gains of up to 30% (more precisely, Intel claims up to 30% on today's server applications and today's systems). Hmm... That is more than enough incentive for optimization.
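Point 2 above — idle units picking up a second "thread" while the first waits on memory — can be sketched as a toy model. Everything here (the op names, the 10-cycle memory latency, the scheduler) is invented for illustration; real Hyper-Threading interleaves instructions in hardware, every cycle:

```python
# A toy software model: two instruction streams share one set of
# execution units. When the active stream stalls on a memory fetch,
# the other stream's instructions fill the idle cycles.

def run(streams, mem_latency=10):
    """Each stream is a list of ops: 'alu' costs one busy cycle,
    'mem' parks the stream for mem_latency cycles while others run.
    Returns (total_cycles, busy_cycles)."""
    ready = [(0, i, iter(s)) for i, s in enumerate(streams)]  # (wake, id, ops)
    cycle = busy = 0
    while ready:
        ready.sort(key=lambda t: t[:2])   # run the stream that wakes earliest
        wake, i, ops = ready.pop(0)
        cycle = max(cycle, wake)
        op = next(ops, None)
        if op is None:
            continue                      # this stream has finished
        if op == "alu":
            cycle += 1                    # one cycle of useful work
            busy += 1
            ready.append((cycle, i, ops))
        else:                             # 'mem': stall; the other stream may run
            ready.append((cycle + mem_latency, i, ops))
    return cycle, busy

t1 = ["alu", "mem", "alu"]                # stalls in the middle
t2 = ["alu", "alu", "alu", "alu"]         # pure computation
together, _ = run([t1, t2])
alone = run([t1])[0] + run([t2])[0]       # back-to-back, no sharing
```

In this model the two streams together take 13 cycles versus 16 back-to-back: the second stream's arithmetic hides most of the first stream's memory stall.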

Well, we have formulated some features. Now let's try to think through some implications (based, wherever possible, on information known to us). What can we say? Well, first, we need to take a closer look at exactly what is being offered to us. Is this cheese really "free"? First, let's figure out exactly how the "simultaneous" processing of several "threads" will occur. And what, by the way, does Intel mean by the word "thread"?

The author has the impression (possibly erroneous) that in this case a "thread" means a program fragment that a multitasking operating system assigns for execution to one of the processors of a multiprocessor system. "Wait!" the attentive reader will say, "that is one of the standard definitions! What's new here?" Nothing, and the author claims no originality on this point. He would merely like to figure out where Intel has been "original" :-). Well, let's take this as a working hypothesis.

Next, a certain thread is executed. Meanwhile, the instruction decoder (which, by the way, is fully asynchronous and not part of the notorious 20 NetBurst pipeline stages) fetches and decodes (with all interdependencies) the x86 instructions into micro-operations. Here it is necessary to clarify what the author means by "asynchronous": the "collapse" of x86 instructions into micro-operations takes place in the decoding block, and each x86 instruction can be decoded into one, two, or more micro-operations. At this stage, interdependencies are resolved and the necessary data is delivered over the system bus. Accordingly, the speed of this block often depends on the speed of data access from memory and, in the worst case, is determined by it. It would be logical to "decouple" it from the pipeline in which the micro-operations are actually executed, and this was done by placing the decoding block before the trace cache. What do we achieve by this? A simple thing: if there are micro-operations ready for execution in the trace cache, the processor works more efficiently. Naturally, this block operates at processor frequency, unlike the Rapid Engine. By the way, the author got the impression that this decoder is something like a pipeline up to 10-15 stages long. Thus, from fetching data from the cache to obtaining the result there appear to be about 30-35 stages (including the NetBurst pipeline; see Microdesign Resources, August 2000, Microprocessor Report, Volume 14, Archive 8, page 12).

The resulting set of micro-operations, along with all interdependencies, accumulates in the trace cache, which holds approximately 12,000 micro-operations. The source of the following rough estimate is the structure of the P6 micro-operation, since the fundamental length of the instructions is unlikely to have changed dramatically: taking the length of a micro-operation together with its service fields as about 100 bits, the trace cache size comes out at 96 KB to 120 KB!!! However! Against this background, an 8 KB data cache looks somehow asymmetrical :-)... and pale. Of course, as size increases, access latency increases (for example, growing to 32 KB would raise the latency from two clock cycles to four). But is access speed to this data cache really so important that an extra 2 cycles of latency (against the total length of the whole pipeline) makes such a size increase unprofitable? Or is it simply reluctance to enlarge the die? But then, in the move to 0.13 microns, the first step should have been to enlarge this particular cache (not the second-level cache). Those who doubt this thesis should recall the transition from the Pentium to the Pentium MMX: thanks to the doubling of the first-level cache, almost all programs gained 10-15% in performance. What, then, of a fourfold increase (especially considering that processor frequencies have grown to 2 GHz and the multiplier from 2.5 to 20)? According to unconfirmed reports, in the next revision of the Pentium 4 core (Prescott) the first-level cache will be increased to 16 or 32 KB, and the second-level cache will also grow. At the moment, however, all this is nothing more than rumor. Frankly, the situation is a little odd. Although, let us note, the author fully admits that such an idea may be blocked by some specific reason: for example, requirements on the placement geometry of the blocks, or a banal lack of free space near the pipeline (it is clear that the data cache needs to sit close to the ALU).

Without getting distracted, let's follow the process further. The pipeline is running; suppose the current instructions use the ALU. It is clear that the FPU, SSE, SSE2 and the rest are idle. But no: Hyper-Threading comes into play. Noticing that micro-operations together with data are ready for a new thread, the register renaming unit allocates a portion of the physical registers to that thread. Incidentally, two options are possible: the block of physical registers is either shared by all threads or separate for each. Judging by the fact that Intel's Hyper-Threading presentation does not list the register renaming block among the blocks that need changing, the first option was chosen. Is that good or bad? From the technologists' point of view it is clearly good, since it saves transistors; from the programmers' point of view it is still unclear. If the number of physical registers really is 128, then no reasonable number of threads can produce a "register shortage". The micro-operations are then sent to the scheduler, which dispatches them to an execution unit (if it is free) or "queues" them if that unit is currently unavailable. Thus, ideally, more efficient use of the existing execution units is achieved. All this time, the CPU itself looks to the OS like two "logical" processors. Hmm... Is everything really so cloudless? Let's take a closer look: some of the hardware (such as the caches, the Rapid Engine, and the branch prediction module) is shared between both processors. Branch prediction accuracy, by the way, will most likely suffer slightly from this, especially if the simultaneously executing threads are unrelated. And some blocks (for example, the MIS, the micro-instruction sequencer, a kind of ROM containing a set of pre-programmed sequences of common operations, and the RAT, the register alias table) must distinguish between the threads running on the "different" processors.
Along the way, it follows (from the shared cache) that if two threads are cache-greedy (that is, enlarging the cache helps them greatly), then using Hyper-Threading can even reduce speed. This happens because, at the moment, a "competitive" mechanism of fighting for the cache has been implemented: the currently "active" thread evicts the "inactive" one. However, the caching mechanism may apparently change. It is also clear that speed will decrease (at least for now) in those applications where it also decreased under honest SMP. For example, SPEC ViewPerf usually shows better results on single-processor systems, so results on a system with Hyper-Threading will probably be lower than without it. Actual results of practical Hyper-Threading testing are published separately.

By the way, information leaked onto the Internet that the ALU in the Pentium 4 is 16-bit. At first the author was quite skeptical about such information: envious rumor-mongering, surely :-). But then the publication of this information in the Microprocessor Report made him wonder: what if it's true? And although it is not directly related to the topic of the article, it is hard to resist :-). As far as the author could make out, the point is that the ALU really is 16-bit wide. Let me emphasize: only the ALU; this has nothing to do with the bit width of the processor itself. Thus, in half a clock cycle (called a tick), the ALU (double-frequency, as you remember) computes only 16 bits; the other 16 are computed in the next half-cycle. Hence, incidentally, it is easy to see why the ALU had to be made twice as fast: to "grind through" the data in time. A full 32 bits are thus computed in one full clock cycle. At first it seemed that two cycles would be needed because of the "gluing" and "ungluing" of the halves, but the question called for clarification. Actually, the digging (about which a separate poem could be written) yielded the following: each ALU is divided into two 16-bit halves. In the first half-cycle, the first half processes the low 16 bits of a pair of numbers and forms the carry bits for the other half, which at that moment is finishing the previous pair. In the second half-cycle, the first half processes the low 16 bits of the next pair of numbers and forms their carries, while the second half processes the high 16 bits of the first pair and produces a finished 32-bit result. The latency to the first result is one clock cycle, but after that a 32-bit result emerges every half-cycle. Quite witty and effective. Why was this particular ALU design chosen? Apparently, with this organization Intel kills several birds with one stone:

1. It is clear that a pipeline 16 bits wide is easier to clock high than one 32 bits wide, if only because of crosstalk and similar effects, and

2. Apparently, Intel considered integer operations frequent enough to be worth accelerating the ALU rather than, say, the FPU. Most likely, when computing the results of integer operations, either lookup tables or carry-accumulate schemes are used. For comparison: one 32-bit table would need 2^32 addresses, i.e., 4 gigabytes; two 16-bit tables are 2x64K, or 128K. Feel the difference! And accumulating carries in two 16-bit portions is faster than in one 32-bit portion.

3. It saves transistors and... heat. After all, it is no secret that all these architectural tricks generate heat. Apparently, this was a rather large (and perhaps the main) problem; what else is a technology like Thermal Monitor worth! There is not much need for such a technology as such; of course, it is nice that it exists, but to be honest, a simple shutdown lock would suffice for reliability. Since such a complex technology was provided, it means the option was seriously considered in which on-the-fly frequency changes were one of the normal operating modes. Or perhaps the main one? Not for nothing were there rumors that the Pentium 4 was planned with a much larger number of execution units; then heat would simply have become the main problem. According to the same rumors, heat dissipation was to reach up to 150 W. In that case it is quite logical to take measures so that the processor runs "at full capacity" only in systems where normal cooling is assured, especially since most cases of "Chinese" origin do not shine with thoughtful thermal design. Hmm... we've wandered far afield :-)
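The staggered 16-bit scheme described above can be modelled in software. This is a sketch of the arithmetic only — the split into halves and the carry handoff — not of the real circuit's half-cycle timing:

```python
# Software model of the staggered ALU: each 32-bit add is done as two
# 16-bit adds, with the carry out of the low half fed into the high
# half (in hardware, one half-cycle later).
MASK16 = 0xFFFF

def staggered_add32(a, b):
    # half-cycle 1: low 16 bits, producing a carry for the high half
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16
    # half-cycle 2 (while, in hardware, the low half starts the next pair):
    hi = (a >> 16) + (b >> 16) + carry
    return ((hi & MASK16) << 16) | (lo & MASK16)
```

The result matches an ordinary 32-bit add with wraparound, which is the whole point: the split is invisible to software, only the timing changes.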
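The size comparison in point 2 is easy to check; the figures below are just the arithmetic from the text, not a claim about how the real ALU is implemented:

```python
# Point 2's table arithmetic: a table indexed by one full 32-bit operand
# needs 2**32 entries, while splitting the operand into two 16-bit
# halves needs only two tables of 2**16 entries each.
full_table_entries = 2 ** 32            # one 32-bit table: ~4 G addresses
split_table_entries = 2 * 2 ** 16       # two 16-bit tables: 128 K entries

print(full_table_entries)
print(split_table_entries)
print(full_table_entries // split_table_entries)  # the "feel the difference" ratio
```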

But all this is theorizing. Do processors using this technology exist today? They do: the Xeon (Prestonia) and the Xeon MP. Interestingly, the Xeon MP differs from the Xeon in supporting up to 4 processors (chipsets like IBM Summit support up to 16; the technique is roughly the same as in the ProFusion chipset) and in having a third-level cache of 512 KB or 1 MB integrated on the die. Why, by the way, integrate a third-level cache? Why not enlarge the first-level cache? There must be some reasonable explanation... Why not enlarge the second-level cache? Perhaps the reason is that the Advanced Transfer Cache needs relatively low latency, and increasing cache size increases latency. So the third-level cache is simply "presented" to the core and the second-level cache as a bus. Just a bus :-). Still, progress is evident: everything has been done so that data is fed into the core as quickly as possible (while loading the memory bus less).

Well, so it turns out there are no particular bottlenecks? Will the author have nothing to "grumble" about? One processor, and the OS sees two. Fine! Two processors, and the OS sees four! Beauty! Stop: what OS will we run with 4 processors? Microsoft operating systems that understand more than two processors cost quite different money. For example, Windows 2000 Professional, XP Professional, and NT 4.0 understand only two. And, given that for now this technology is intended for the workstation (and server) market and is available only in the corresponding processors, it is just plain frustrating: today we can use processors with this technology only by buying a dual-processor board and installing a single CPU. Curiouser and curiouser, as Alice in Wonderland used to say... That is, a person eager to use this technology is simply forced to buy the Server or Advanced Server versions of current operating systems. Ouch, that "free" processor is turning out a bit expensive... It is worth adding, perhaps, that Intel is currently actively "communicating" with Microsoft, trying to tie licensing policy to the physical processor. At least according to the document, new Microsoft operating systems will be licensed per physical processor; Windows XP, at any rate, is already licensed by the number of physical processors.

Naturally, you can always turn to operating systems from other manufacturers, but let's be honest: this is not much of a way out of the current situation... So one can understand the hesitation of Intel, which deliberated for quite a long time over whether or not to ship this technology.

Well, let's not forget a rather important conclusion: using Hyper-Threading can lead to both performance gains and losses. Since we have already discussed the losses, let's try to understand what is needed in order to win. For a gain, this technology must be known to:

  1. Motherboard BIOS
  2. Operating system (!!!)
  3. Actually, the application itself
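From the software side, layer 2 of this list boils down to the OS exposing some number of logical processors. Here is a small sketch of how one might inspect that from user space today: `os.cpu_count()` is standard library and already counts Hyper-Threading siblings, while the `/proc/cpuinfo` parsing is a Linux-only, best-effort assumption:

```python
# Counting logical CPUs (what the OS schedules onto) versus physical
# cores. The logical count is portable; the physical count here relies
# on Linux's /proc/cpuinfo and falls back to None elsewhere.
import os

def logical_cpus():
    return os.cpu_count() or 1

def physical_cores_linux():
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo.
    Returns None off Linux or if those fields are absent."""
    try:
        cores, phys = set(), None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    cores.add((phys, line.split(":")[1].strip()))
        return len(cores) or None
    except OSError:
        return None

n_logical = logical_cpus()
n_physical = physical_cores_linux()
if n_physical and n_logical > n_physical:
    print(f"SMT is on: {n_logical} logical CPUs over {n_physical} cores")
```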

Let me dwell on this last point in more detail. The BIOS is no problem, and we discussed the operating system a little earlier. But into those threads that, for example, wait for data from memory, a special pause instruction will have to be inserted so as not to slow down the processor: after all, in the absence of data, a thread is capable of blocking some of the execution units. Inserting this instruction means applications will have to be recompiled; that is not good, but, thanks to Intel, everyone has lately begun to get used to it :-). Thus, the main drawback of Hyper-Threading technology (in the author's view) is the need for yet another recompilation. The main advantage of this approach is that such recompilation will, along the way, also (and probably even more noticeably :-) improve performance on "honest" dual-processor systems, which can only be welcomed. Indeed, there are already experimental studies confirming that in most cases programs optimized for SMP gain from 15% to 18% from Hyper-Threading. That is quite good. You can also see in which cases Hyper-Threading leads to a performance drop.
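The pause hint mentioned above cannot be issued from Python (in C it would be the `_mm_pause()` intrinsic); this sketch only mimics the idea at the OS level: a polling loop that explicitly yields on every spin instead of hogging the shared execution units. The producer/consumer names and the 0.05 s "memory latency" are invented for illustration:

```python
# A polling thread that waits politely: time.sleep(0) plays the role
# of the "be courteous" hint inside the spin loop, so the waiting
# thread does not monopolize resources shared with the producer.
import threading
import time

data_ready = False
result = []

def producer():
    global data_ready
    time.sleep(0.05)          # simulate a slow memory fetch
    result.append(42)
    data_ready = True

def polite_consumer():
    while not data_ready:     # spin-wait for the producer...
        time.sleep(0)         # ...but yield on every spin (PAUSE-like hint)
    return result[0]

t = threading.Thread(target=producer)
t.start()
value = polite_consumer()
t.join()
```

The hot loop still polls, but each iteration hands the CPU back to the scheduler, which is the software analogue of what PAUSE does for the sibling logical processor in hardware.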

And finally, let's try to imagine what could change (improve) as this idea develops further. It is quite obvious that the development of this technology will be directly linked to the development of the Pentium 4 core, so let's imagine potential changes to the core. What is next on the roadmap? The 0.09-micron process, better known as 90 nm... The author is inclined to believe (at the moment) that the development of this processor family will proceed in several directions at once:

  • Thanks to a finer process technology, processor frequencies will climb even higher.
  • Let's hope the data cache will be enlarged, at least to 32 KB.
  • An "honest" 32-bit ALU will be made. This should improve performance.
  • The system bus speed will be increased (though that is already close at hand).
  • Dual-channel DDR memory will appear (again, the wait is relatively short).
  • Perhaps an analogue of x86-64 will be introduced, if that technology (thanks to AMD) takes root. The author hopes with all his might that any such analogue will be compatible with x86-64: enough of creating mutually incompatible extensions... Also of interest here is Jerry Sanders' statement that AMD and Intel agreed last year on cross-licensing for everything except the Pentium 4 system bus. Does this mean that Intel will build x86-64 into the next Pentium 4 core (Prescott), and AMD will build Hyper-Threading into its processors? An interesting question...
  • Perhaps the number of execution units will be increased. True, like the previous point, this is rather doubtful, since it would require an almost complete redesign of the core, a long and labor-intensive process.

One wonders whether the idea of Hyper-Threading will be developed further. Quantitatively it has nowhere to go: two physical processors are clearly better than three logical ones, and positioning would not be easy... Interestingly, though, Hyper-Threading can also be useful when integrating two (or more) processors on one chip. As for qualitative changes, the author means that the presence of such technology in ordinary desktops will mean that most users will in effect be working on [almost] dual-processor machines, which is excellent: such machines run much more smoothly and respond more readily to user actions even under heavy load. From the author's point of view, that alone is worth a lot.

Instead of an afterword

The author must admit that while working on this article his attitude toward Hyper-Threading changed several times. As information was collected and processed, it swung now generally positive, now the opposite :-). At the moment, the following can be written:

There are only two ways to improve performance: increase the frequency, or increase the performance per clock. And if the entire Pentium 4 architecture is designed for the first path, then Hyper-Threading is precisely the second; from this point of view it can only be welcomed. Hyper-Threading also has several interesting consequences: a change in the programming paradigm, bringing multiprocessing to the masses, and an increase in processor performance. However, there are several "big bumps" on this road on which it is important not to "get stuck": the lack of normal support from operating systems and, most importantly, the need to recompile (and in some cases redesign the algorithms of) applications so that they can take full advantage of Hyper-Threading. In addition, Hyper-Threading would make it possible to run the operating system and applications truly in parallel, rather than in alternating "chunks" as now — provided, of course, that there are enough free execution units.

The author expresses his gratitude to Maxim Lenya (aka C.A.R.C.A.S.S.) and Ilya Vaitsman (aka Stranger_NN) for repeated and invaluable assistance in writing this article.
Thanks also to all the forum participants who offered valuable comments.

Hyper-Threading is a technology developed by Intel that allows a processor core to execute more than one thread (usually two) at a time. Since it was found that a typical processor in most tasks uses no more than 70% of its computing power, it was decided to employ a technology that, when certain computing units are idle, loads them with work from another thread. Depending on the task, this can increase core performance by 10 to 80%.

Understanding how Hyper-Threading works

Let's say the processor is performing simple calculations while the instruction block and SIMD extensions sit idle.

The addressing module detects this and sends data there for subsequent computation. If the data is of a kind those blocks handle poorly, they will process it more slowly, but at least they will not sit idle; or they will pre-process it for subsequent fast processing by the appropriate block. Either way, this yields an additional performance gain.

Naturally, a virtual thread does not amount to a full-fledged core, but it allows almost 100% utilization of the computing power, keeping nearly the whole processor busy instead of letting it idle. For all that, implementing HT takes only about 5% of additional die area, while the performance gain can sometimes reach 50%. This additional area accommodates extra register blocks and branch prediction logic, which work out on the fly where computing power is currently free and send data there from the additional addressing block.

The technology first appeared in Pentium 4 processors, but there was no big performance increase, since the processor itself did not have high computing power. The increase was at best 15-20%, and in many tasks the processor ran much slower than without HT.

The processor slows down because of Hyper-Threading when:

  • There is not enough cache for everything, so it reloads cyclically, slowing the processor down.
  • The branch predictor cannot handle the data correctly. This occurs mainly through a lack of optimization in particular software, or of support from the operating system.
  • There are data dependencies: for example, the first thread requires data from the second immediately, but it is not ready yet, or is waiting in line behind another thread. Or cyclic data needs certain blocks for fast processing while those blocks are busy with other data. There can be many variations of data dependency.
  • The core is already heavily loaded, yet the "insufficiently smart" branch prediction module keeps sending it data, slowing the processor down (this is relevant mainly for the Pentium 4).

After the Pentium 4, Intel resumed using the technology only starting with the first-generation Core i7, skipping the Core 2 series.

By then the computing power of processors had become sufficient to implement hyper-threading fully without much harm, even for unoptimized applications. Later, Hyper-Threading appeared on mid-range and even budget and mobile processors; it is used across the Core i series (i3, i5, i7) and on Atom mobile processors (though not all of them). Interestingly, dual-core processors with HT gain more from Hyper-Threading than quad-cores do, reaching up to 75% of a full quad-core.

Where is Hyper-Threading technology useful?

It is useful in conjunction with professional graphics, analytical, mathematical, and scientific programs, video and audio editors, and archivers (Photoshop, CorelDRAW, Maya, 3ds Max, WinRAR, Sony Vegas, etc.). HT will definitely benefit any program that performs a large volume of computation. Fortunately, in 90% of cases such programs are well optimized for it.

Hyper-Threading is indispensable for server systems; indeed, it was partly developed for this niche. Thanks to HT, processor throughput can be increased significantly when there are many tasks: each thread is loaded only half as much, which benefits data addressing and branch prediction.

Many computer games react badly to the presence of Hyper-Threading, with the number of frames per second dropping. This stems from a lack of Hyper-Threading optimization on the game's side; optimization by the operating system alone is not always enough, especially when working with unusual, varied, and complex data.

On motherboards that support HT, you can always disable hyperthreading technology.

Hello computer and hardware lovers.

Would you like a high-performance processor in your computer, one that can carry out many tasks simultaneously at lightning speed? Who would refuse, right? Then I suggest you get acquainted with hyper-threading technology: what it is and how it works, you will learn from this article.


Explanation of the concept

Hyper-threading translates from English as, roughly, "hyper-threadedness". The technology received such a loud name for a reason: the operating system treats one physical processor equipped with it as two logical cores. Consequently, more instructions get processed without any loss of performance.

How is this possible? Because the processor:

  • Stores information about several running threads at once;
  • Has one set of registers (blocks of fast internal memory) and one interrupt block per logical processor. The latter is responsible for sequentially handling requests from different devices.

What does this look like in practice? Suppose the physical processor is currently processing instructions from the first logical processor. But the latter stalls — for example, it has to wait for data from memory. The physical processor will not waste time and will immediately switch to the second logical processor.

About the performance gain

The efficiency of a physical processor is, as a rule, no more than 70%. Why? Often certain blocks are simply not needed for a given task. For example, when the CPU performs trivial computations, the instruction block and SIMD extensions go unused. Or a miss occurs in the branch prediction module or on a cache access.

In situations like this, Hyper-threading fills the "gaps" with other tasks. The effectiveness of the technology lies in the fact that useful work is handed to otherwise idle units rather than left waiting.

Appearance and implementation

We can consider that Hyper-threading has already celebrated its 15th anniversary. It was developed on the basis of super-threading technology, was released in 2002 and first worked in Xeon products, then was integrated into the Pentium 4 later that same year. The rights to these technologies belong to Intel.

HT is implemented in processors based on the NetBurst microarchitecture, with its high clock speeds. Support appeared in models of the Core vPro, M, and Xeon families; the Core 2 series ("Duo", "Quad"), however, lacks it. A technology similar in principle of operation is implemented in the Atom and Itanium processors.

How do you enable it? You need not only one of the processors listed above, but also an operating system that supports the technology and a BIOS with an option to turn HT on and off. If there is no such option, update the BIOS.

Pros and cons of Hyper-threading

You may already have deduced some of the technology's advantages from the information above. Let me add a couple more:

  • Stable operation of several programs in parallel;
  • Reduced response time when surfing the Internet or working with applications.

As you may have guessed, there is a fly in the ointment. A performance increase may fail to appear for the following reasons:

  • Insufficient cache memory. For example, 4-core i7 processors have an 8 MB cache and the same number of logical cores, eight. That works out at only 1 MB per logical core, which is not enough for most programs' computational tasks. Because of this, performance not only stalls but can even drop.

  • Data dependency. Say the first thread urgently needs information from the second, but it is not ready yet or is queued behind another thread. It also happens that cyclic data needs certain blocks to complete a task quickly, but they are already busy with other work.
  • Core overload. A core may already be overloaded, yet the scheduler still sends work to it, as a result of which the computer begins to slow down.
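The cache arithmetic from the first point can be sketched in a few lines (the 8 MB figure comes from the example above, not from any particular datasheet):

```python
# Cache-per-thread arithmetic from the bullet above (hypothetical i7 figures).
cache_mb = 8                        # total cache of the example 4-core i7
physical_cores = 4
logical_cores = physical_cores * 2  # HT doubles the visible core count

per_physical_mb = cache_mb / physical_cores  # 2.0 MB per physical core
per_logical_mb = cache_mb / logical_cores    # 1.0 MB per logical core
print(per_physical_mb, per_logical_mb)
```

The same division explains why two cache-hungry threads on one core can end up slower than one.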

Where is Hyper-threading needed?

The technology is useful in resource-intensive programs: audio, video and photo editors, games, archivers. These include Photoshop, Maya, 3ds Max, CorelDRAW, WinRAR, etc.

It is important that the software is optimized to work with Hyper-Threading; otherwise delays may occur. The point is that programs treat logical cores as physical ones, so they can send different tasks to the same physical unit.


Intel's proprietary Hyper-Threading technology debuted back in February 2002. What is it, and why has it become almost ubiquitous today? This material answers that question and more.

History of the emergence of HT technology

The first desktop processor to support logical multithreading was the fourth-generation Pentium. Hyper-Threading is a technology that, in this case, made it possible to process two data streams at once on one physical core. The chip was installed in the PGA478 processor socket, operated in 32-bit mode and ran at a clock frequency of 3.06 GHz. Before that, the technology could be found only in server processors of the Xeon series.

After achieving success in this niche, Intel extended HT to the desktop segment, and a whole family of such processors was released within PGA478. After the LGA775 socket debuted, HT was temporarily shelved, but with the start of LGA1156 sales in 2009 it got a second wind. Since then it has been a regular attribute of Intel processor solutions, both in the high-performance segment and in budget computer systems.

The concept of this technology

The essence of Intel Hyper-Threading technology is that through minimal changes in the layout of the microprocessor device, developers ensure that at the system and software levels, code is processed in two threads on one physical core. All elements of the computing module remain unchanged, only special registers and a redesigned interrupt controller are added.

If for some reason the physical computing module begins to stand idle, then a second program thread is launched on it, while the first one waits for the necessary data or information to be received. That is, if previously downtime in the operation of the computing part of chips was quite frequent, Hyper-Threading almost completely eliminates this possibility. Let's look at what this technology is below.

At the hardware level

Increased demands are placed on hardware when using Hyper-Threading: the motherboard, BIOS and processor must all support it. Within the PGA478 socket in particular, such compatibility had to be checked carefully. Not all chipsets were designed for HT, and the same was true of the processors themselves. And even if the coveted abbreviation appeared in a motherboard's specifications, that did not guarantee the chip would initialize correctly, because a BIOS update was often required.

The situation changed dramatically with LGA1156. This computing platform was designed for Hyper-Threading from the outset, so users encountered no significant problems using it. The same is true for subsequent processor sockets such as LGA1155, LGA1150 and LGA1151.

High-performance sockets LGA1366, LGA2011 and LGA2011-v3 could boast of a similar lack of problems with the use of HT. To top it off, Intel's direct competitor, AMD, implemented a very similar logical multitasking technology - SMT - in the latest generation of its processors for AM4. It uses an almost identical concept. The only difference is in the name.

Main components from the software side

It should be noted that even if HT is fully supported by the hardware, it will not always work at the software level. To begin with, the operating system must be able to work with several computing cores simultaneously. Outdated system software such as MS-DOS or Windows 98 lacks this ability; Windows 10, by contrast, is designed from the start for such hardware resources.

Now let's figure out how to enable Hyper-Threading in Windows. All the necessary control software must be installed on the computer; as a rule, this is a special utility from the motherboard's CD. It has a tab where you can change BIOS values in real time. Switching the Hyper-Threading option there to Enabled activates the additional logical threads, even without rebooting the operating system.

Enabling Technology

Many novice users quite often, at the initial stage of using a new computer, ask one important question regarding Hyper-Threading: how to enable it? There are two possible ways to solve this problem. One of them is using BIOS. In this case, you need to do the following:

  • When you turn on the PC, enter the BIOS. To do this, hold down the DEL key when the test screen appears (on some boards, F2).
  • After the blue screen appears, use the navigation keys to go to the ADVANCED tab.
  • Find the Hyper-Threading item there.
  • Set its value to Enabled.

The key disadvantage of this method is the need to restart the computer. A real alternative is the motherboard configuration utility, described in detail in the previous section; with it, there is no need to enter the BIOS at all.

Disabling HT

By analogy with the methods for enabling HT, there are two ways to deactivate this function. One of them can be performed only during initialization of the computer system, which is not very convenient in practice, so experts opt for the second method, based on the utility supplied with the motherboard. In the first case, the following steps are performed:

  1. When the computer boots, enter the basic input-output system (BIOS) as described earlier.
  2. Use the cursor keys to navigate to the Advanced menu item.
  3. Find the Hyper-Threading item (on some motherboard models it may be labeled HT). Using the PG DN and PG UP keys, set its value to Disabled.
  4. Save the changes made with F10.
  5. Exit the BIOS and reboot the personal computer.

In the second case, when using the motherboard diagnostic utility, there is no need to restart the PC. This is its key advantage. The algorithm in this case is identical. The difference is that it uses a pre-installed special utility from the motherboard manufacturer.

Previously, two main ways to disable Hyper-Threading were described. Although the second of them is nominally considered more complex, it is more practical for the reason that it does not require restarting the computer.

Processor models supporting HT

Initially, as noted earlier, Hyper-Threading support was implemented only in Pentium 4 series processors, and only in the PGA478 version. But starting with LGA1156 and later computing platforms, the technology has been used in almost all chip models. With its help, Celeron processors turned from single-core into dual-thread solutions, Pentium and i3 chips could already process 4 code streams, and the flagship i7 series can work with 8 logical processors simultaneously.

For clarity, here is how HT is used within the current computing platform from Intel, LGA1151:

  • Celeron series CPUs do not support this technology and have only 2 computing units.
  • Pentium line chips are equipped with 2 cores and 4 threads. As a result, HT is fully supported here.
  • More powerful processor devices of the Core i3 model range have a similar layout: 2 physical modules can operate in 4 threads.
  • Like most budget Celeron chips, Core i5 is not equipped with HT support.
  • Flagship i7 solutions also support HT. Only in this case, instead of 2 real cores, there are already 4 code processing units. They, in turn, can already work in 8 threads.

Hyper-Threading - what is this technology and what is its main purpose? This is logical multitasking, which allows, through minimal hardware adjustments, to increase the performance of the computer system as a whole.

In what cases is this technology best used?

In some cases, as noted earlier, HT increases the speed at which the processor processes program code. But Hyper-Threading can only work effectively with well-optimized multithreaded software. Typical examples are video and audio encoders, professional graphics packages and archivers. The technology can also significantly improve the performance of a server system. With single-threaded program code, however, the benefit of Hyper-Threading evaporates: you get an ordinary processor solving one task on one core.
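The difference between single-threaded and parallel code can be illustrated with a minimal Python sketch using the standard multiprocessing module; `transcode_chunk` is a hypothetical stand-in for a CPU-heavy task such as encoding one video frame:

```python
from multiprocessing import Pool

def transcode_chunk(n):
    """Hypothetical stand-in for a CPU-heavy task (e.g. encoding one frame)."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    work = [100_000] * 8      # eight independent chunks of work
    with Pool() as pool:      # one worker per logical processor by default
        results = pool.map(transcode_chunk, work)
    print(len(results))       # 8
```

Only code split into independent chunks like this can occupy the extra logical processors; a single call to `transcode_chunk` runs on one core regardless of HT.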

Advantages and disadvantages

There are certain disadvantages to Intel Hyper-Threading technology. The first is the increased cost of the CPU; but greater speed and a more complex silicon layout will in any case raise the price. The larger semiconductor die of the processor also leads to an increase in power consumption and temperature. The difference is insignificant, no more than 5%, but it exists. There are no other significant shortcomings.

Now about the benefits. Intel's proprietary HT technology never degrades performance: such a computer will not fall below the level of the same chip running without it. And if the software supports parallel computing well, there will be a tangible increase in speed and, of course, productivity.

Tests show that in some cases the increase can reach 20%. The best-optimized software includes various multimedia transcoders, archivers and graphics packages. With games things are not so good: most of them can use at most 4 threads, so flagship chips are unable to outperform mid-range processors here.

A modern alternative from AMD

Hyper-Threading is no longer the only technology of its kind. It has a real alternative: with the release of the AM4 platform, AMD offered a worthy competitor in the form of SMT. At the hardware level these are near-identical solutions; the difference is that the flagship from Intel processes 8 threads while the leading AMD chip processes 16. This circumstance alone suggests that the second solution is more promising.

Therefore, Intel has been forced to urgently adjust its product plans and offer new processor solutions capable of competing with the newcomers from AMD, though as of this writing they have not yet appeared. So if you need an affordable computer platform, LGA1151 from Intel is the better choice; if you need maximum multithreaded performance, AM4 from AMD is preferable.

There was a time when it was necessary to evaluate memory performance in the context of Hyper-threading technology. We have come to the conclusion that its influence is not always positive. When a quantum of free time appeared, there was a desire to continue research and consider the ongoing processes with an accuracy of machine clock cycles and bits, using software of our own design.

Platform under study

The object of the experiments is an ASUS N750JK laptop with an Intel Core i7-4700HQ processor. The clock frequency is 2.4 GHz, rising to 3.4 GHz in Intel Turbo Boost mode. Installed are 16 gigabytes of DDR3-1600 RAM (PC3-12800) operating in dual-channel mode. The operating system is Microsoft Windows 8.1 64-bit.

Fig.1 Configuration of the platform under study.

The processor of the platform under study contains 4 cores, which, with Hyper-Threading technology enabled, provide hardware support for 8 threads, or logical processors. The platform firmware passes this information to the operating system via the ACPI table MADT (Multiple APIC Description Table). Since the platform contains only one RAM controller, there is no SRAT (System Resource Affinity Table), which would declare the proximity of processor cores to memory controllers. Obviously, the laptop under study is not a NUMA platform, but the operating system, for the sake of unification, treats it as a NUMA system with one domain, as indicated by the line NUMA Nodes = 1. A fact fundamental for our experiments is that the first-level data cache is 32 kilobytes for each of the four cores. The two logical processors of one core share its L1 and L2 caches.

Operation under study

We will study the dependence of the reading speed of a data block on its size. To do this, we will choose the most productive method, namely reading 256-bit operands using the AVX instruction VMOVAPD. In the graphs, the X axis shows the block size, and the Y axis shows the reading speed. Around point X, which corresponds to the size of the L1 cache, we expect to see an inflection point, since performance should drop after the processed block leaves the cache limits. In our test, in the case of multi-threaded processing, each of the 16 initiated threads works with a separate address range. To control Hyper-Threading technology within the application, each thread uses the SetThreadAffinityMask API function, which sets a mask in which one bit corresponds to each logical processor. A single bit value allows the specified processor to be used by a given thread, a zero value prohibits it. For 8 logical processors of the platform under study, mask 11111111b allows the use of all processors (Hyper-Threading is enabled), mask 01010101b allows the use of one logical processor in each core (Hyper-Threading is disabled).
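The mask arithmetic described above can be sketched in Python (the measurement code itself is native; this only reproduces how the bit patterns 11111111b and 01010101b passed to SetThreadAffinityMask are built):

```python
def affinity_mask(cores, threads_per_core=2, use_ht=True):
    """Build a SetThreadAffinityMask-style bitmask: one bit per logical CPU."""
    if use_ht:
        # allow every logical processor on every core
        return (1 << (cores * threads_per_core)) - 1
    # allow only the first logical processor of each core
    mask = 0
    for core in range(cores):
        mask |= 1 << (core * threads_per_core)
    return mask

print(f"{affinity_mask(4, use_ht=True):08b}")   # 11111111
print(f"{affinity_mask(4, use_ht=False):08b}")  # 01010101
```

This assumes logical processors of one core are numbered adjacently, which matches the enumeration on the platform under study.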

The following abbreviations are used in the graphs:

MBPS (Megabytes per Second): block read speed in megabytes per second;

CPI (Clocks per Instruction): number of clock cycles per instruction;

TSC (Time Stamp Counter): CPU cycle counter.

Note: The TSC register clock speed may not match the processor clock speed when running in Turbo Boost mode. This must be taken into account when interpreting the results.

On the right side of the graphs, a hexadecimal dump of the instructions that make up the loop body of the target operation executed in each of the program threads, or the first 128 bytes of this code, is visualized.

Experience No. 1. One thread



Fig.2 Single thread reading

The maximum speed is 213,563 megabytes per second. The inflection point occurs at a block size of about 32 kilobytes.

Experience No. 2. 16 threads on 4 processors, Hyper-Threading disabled



Fig.3 Reading in sixteen threads. The number of logical processors used is four

Hyper-Threading is disabled. The maximum speed is 797,598 megabytes per second. The inflection point occurs at a block size of about 32 kilobytes. As expected, compared to single-threaded reading, the speed increased approximately 4-fold, in line with the number of working cores.
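The scaling claim is easy to verify with a couple of lines, using the figures from the two measurements above:

```python
single_thread = 213_563  # MB/s, one thread (Fig. 2)
four_cores = 797_598     # MB/s, 16 threads on 4 physical cores (Fig. 3)

speedup = four_cores / single_thread
efficiency_pct = 100 * speedup / 4  # relative to the 4 working cores
print(round(speedup, 2), round(efficiency_pct, 1))  # 3.73 93.4
```

A scaling efficiency above 90% confirms that the threads, each in its own address range and core-local cache, barely interfere with one another.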

Experience No. 3. 16 threads on 8 processors, Hyper-Threading enabled



Fig.4 Reading in sixteen threads. The number of logical processors used is eight

Hyper-Threading is enabled. The maximum speed is 800,722 megabytes per second; enabling Hyper-Threading barely increased it. The big minus is that the inflection point now occurs at a block size of about 16 kilobytes, half the previous value, so the average speed dropped significantly. This is not surprising: each core has its own L1 cache, and the two logical processors of a core share it.
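The halved inflection point matches simple arithmetic on the shared L1 cache:

```python
l1_data_kb = 32       # per-core L1 data cache on the i7-4700HQ
threads_per_core = 2  # logical processors sharing that cache with HT on

effective_kb = l1_data_kb // threads_per_core
print(effective_kb)   # 16, matching the observed inflection point
```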

Conclusions

The operation studied scales quite well on a multi-core processor. The reasons: each core contains its own L1 and L2 cache, the target block size is comparable to the cache size, and each thread works with its own address range. For academic purposes, we created these conditions in a synthetic test, recognizing that real-world applications are usually far from ideal optimization. But even under these conditions, enabling Hyper-Threading had a negative effect: for a slight increase in peak speed, there is a significant loss in the processing speed of blocks whose size ranges from 16 to 32 kilobytes.