The performance increase depends entirely on the type of workload, but it can almost double performance under ideal conditions. Most of the time it’s far less than that, but still noticeably better. Test it yourself: just turn off HT in your BIOS/UEFI.
antic604 wrote: Tue Jun 04, 2019 2:32 pm
Thank you - that was very educational!
AdvancedFollower wrote: Tue Jun 04, 2019 12:18 pm
The idea behind SMT (called Hyperthreading by Intel) is to share unused resources of one physical core among two logical cores. Modern CPUs have multiple execution resources and are in fact over-provisioned most of the time. A single instruction making its way through the CPU core's pipeline usually won't fully utilize the resources at every stage, and may also be bottlenecked at certain stages while other stages of the pipeline are idle. So by presenting the core as two virtual cores to the OS, the pipeline can be fed more efficiently and fewer execution resources stay idle.
A CPU core already tries to extract instruction-level parallelism to keep multiple operations "in flight" at the same time, but this can be done much more easily and efficiently when executing two independent threads at the same time, where the result of one thread isn't necessarily immediately dependent on the outcome of some calculation in the other thread (unlike an operation inside a thread, which might depend on the outcome of a previous operation).
In some cases, the two logical cores might end up competing for the same execution resource, but it's almost always a net performance gain overall. CPU schedulers are incredibly complex and can usually avoid those situations. The allocation of execution resources constantly changes; it's not a static 50/50 split or anything like that. In modern CPUs, some resources have also been duplicated for the purpose of SMT.
Are there any (easy to grasp) documents about the pros & cons of HT/SMT, in particular ones researching the benefits in real-life scenarios? I thought it's typically in the 5-20% range, based on what I've seen from asynchronous GPU compute (which is a similar idea, I think?), but some people here claim to have almost twice the performance?
Hyperthreading in DAW - ON or OFF?
- KVRist
- 351 posts since 24 Aug, 2017
- Banned
- Topic Starter
- 11467 posts since 4 Jan, 2017 from Warsaw, Poland
Yeah, thanks - I've figured it out since the topic was posted over a year ago
mixtur.se wrote: Wed Jun 19, 2019 12:16 am
This is exactly the case, see my previous post. 1+1=2.5 in this case
-
- KVRer
- 1 posts since 12 Aug, 2018
If it's not already in the thread:
1. Even a simple hypothetical HT system will improve performance over a non-HT core because, fundamentally, it addresses poor CPU utilisation caused by cache waits.
One core with one instruction queue has to wait for long periods to fetch data from cache, memory or disk. In most tasks the CPU spends a significant amount of time idle. Even if you were to execute a program with all its data in place, you would probably end up with thermal throttling after a few seconds anyway, because CPUs are designed knowing these delays exist.
What Hyperthreading provides is one core with two instruction queues, so it executes instructions from one queue while waiting for data for the other. In cases where both threads need lots of fresh data, you would easily see a huge performance improvement, as the raw clock cycles of a core are only part of a CPU's performance.
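To make point 1 concrete, here is a toy model in Python. It is only a sketch under loud assumptions: each thread does a fixed burst of work and then waits a fixed number of cycles for data, and the core runs at most one instruction stream per cycle. Real SMT cores share execution units within a cycle, so this understates what the hardware does. The function name and the cycle counts are made up for illustration.

```python
def core_utilisation(hw_threads, compute=4, stall=6, cycles=10_000):
    """Fraction of cycles in which the core does useful work.

    Toy assumptions: every thread repeats `compute` busy cycles followed by
    `stall` cycles waiting for data, and the core can run one ready thread
    per cycle (lowest-numbered ready thread wins).
    """
    ready_at = [0] * hw_threads          # cycle at which each thread's data has arrived
    work_left = [compute] * hw_threads   # busy cycles left in the current burst
    busy = 0
    for now in range(cycles):
        for t in range(hw_threads):
            if ready_at[t] <= now:
                busy += 1
                work_left[t] -= 1
                if work_left[t] == 0:
                    # Burst finished: the thread now waits for its next data.
                    work_left[t] = compute
                    ready_at[t] = now + 1 + stall
                break                    # only one thread executes per cycle
    return busy / cycles


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n} hardware thread(s): {core_utilisation(n):.0%} busy")
```

With these defaults the single-queue core is busy 40% of the time and the two-queue core 80%, which is the "almost double under ideal conditions" case. If you shrink the stall (say `compute=8, stall=2`) the gain shrinks with it, which matches the far smaller improvements people see on workloads that rarely wait on memory.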
2. When the number of software threads exceeds the number of hardware instruction queues, a very simplified hypothetical threading implementation (a simple example to make the point) would need to load a new instruction queue into a hardware instruction queue slot.
With a hyperthreaded system you can keep twice as many queues loaded at all times, reducing processing delays when switching in new instruction queues (I think - though there have probably been huge improvements in how this is implemented for some time).
When people migrated from Windows XP to multithreaded Windows systems, those with single-threaded CPUs saw a management overhead of between 5 and 15%, and this is due in part to both of the cases above, among more complex causes.
A system written to expect 2 or more cores is going to double the number of hardware threads it expects to see, instead of optimising for a single instruction queue. So constant swapping of instruction queues, plus total request volumes and timings at the latency-sensitive hardware level, significantly degrade the performance of a CPU, even if it's much faster.
Going into the Intel Turbo feature:
The core mechanics discussed above, relating to CPU waits, show how something else can impact performance: a multicore system that can raise the speed of a few cores really fast, or of several cores quite fast, can drastically alter system performance as a factor outside the hardware core count or base speed.
An example of a cycle-dependent case might be a DAW host that is poorly optimised for multiple cores, or any software where a handful of threads need reliability at very low latencies (kernel-level drivers spring to mind).
The best working example off the top of my head of a program exhibiting these issues, and one which can be partly solved by having this CPU flexibility, is DirectX 11.
The rewrite of the API and driver framework in DirectX 12 introduced much more balanced multithreaded performance, but if you played a DX11 game you would spot that one thread was up at 80% while all other threads were at 5-10%. A CPU which can consistently add a GHz to one core can silently address the issues of DX11 and of programs or drivers like it.
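A rough way to see why boosting one core helps here: if one thread carries most of the per-frame work, the frame time is set by that thread, not by the total core count. The sketch below is plain arithmetic with hypothetical cycle counts and clock speeds (none of them measured from a real game), and `frames_per_second` is a name I made up.

```python
def frames_per_second(main_cycles, worker_cycles, workers, main_ghz, worker_ghz):
    """Frame rate when a frame is done only once the slowest part is done.

    main_cycles   -- cycles of serial work on the one heavily loaded thread
    worker_cycles -- cycles of work that splits evenly across `workers` threads
    """
    main_time = main_cycles / (main_ghz * 1e9)
    worker_time = (worker_cycles / workers) / (worker_ghz * 1e9)
    return 1.0 / max(main_time, worker_time)


if __name__ == "__main__":
    # Hypothetical frame: 30M cycles on the main thread, 12M spread over workers.
    base = frames_per_second(30e6, 12e6, workers=6, main_ghz=3.0, worker_ghz=3.0)
    more_cores = frames_per_second(30e6, 12e6, workers=12, main_ghz=3.0, worker_ghz=3.0)
    boosted = frames_per_second(30e6, 12e6, workers=6, main_ghz=4.0, worker_ghz=3.0)
    print(f"base: {base:.1f} fps, more cores: {more_cores:.1f} fps, "
          f"main core +1GHz: {boosted:.1f} fps")
```

In this made-up case doubling the worker count changes nothing, while adding a GHz to the one busy core raises the frame rate by about a third.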
The Turbo system and the turbo features of a CPU are as important as the base clock speed.
For example, I could hypothetically order a pair of CPUs with 6 cores each at a base speed of 3.3GHz, or a pair of CPUs with 8 cores each and a base speed of 2.9GHz.
On the surface, getting a guaranteed 3.3GHz for broad performance in audio is perhaps the obvious choice. But inspecting the turbo, the 6-core CPU might only be able to accelerate 1 core to 3.6GHz, or a few to 3.4GHz. The 8-core CPU might have a turbo of 3.8GHz, and be able to accelerate 3 cores to 3.7GHz.
In this case, from the perspective of having several synth tracks, the two 8-core CPUs offer the high performance needed on up to 6 cores for clock-speed-sensitive synths or programs, and still leave 10 cores which can guarantee around 3GHz for tracks with modest or reasonably well implemented parallel optimisations. A chip with HT can also hold instruction queues for each thread of each synth, which take turns processing the next audio computations while their pair fetches data.
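The core counts in that comparison can be tallied in a few lines. This is just the arithmetic from the hypothetical example, with two assumptions of mine: "a few" boosted cores on the 6-core part is taken as 3, and the boosted cores are assumed to hold their turbo clock while the rest sit at base, which real chips only manage within their power and thermal limits. `CpuPair` and its fields are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class CpuPair:
    name: str
    cores_per_cpu: int
    base_ghz: float
    boosted_cores_per_cpu: int   # cores that can turbo together (assumed)
    boosted_ghz: float
    sockets: int = 2             # the example talks about a pair of CPUs

    def fast_cores(self):
        return self.sockets * self.boosted_cores_per_cpu

    def remaining_cores(self):
        return self.sockets * (self.cores_per_cpu - self.boosted_cores_per_cpu)

    def aggregate_ghz(self):
        # Crude "total clock" figure; ignores IPC, memory and thermal limits.
        return (self.fast_cores() * self.boosted_ghz
                + self.remaining_cores() * self.base_ghz)


six = CpuPair("2 x 6-core", 6, 3.3, boosted_cores_per_cpu=3, boosted_ghz=3.4)
eight = CpuPair("2 x 8-core", 8, 2.9, boosted_cores_per_cpu=3, boosted_ghz=3.7)

if __name__ == "__main__":
    for option in (six, eight):
        print(f"{option.name}: {option.fast_cores()} cores at {option.boosted_ghz}GHz, "
              f"{option.remaining_cores()} cores at {option.base_ghz}GHz, "
              f"aggregate {option.aggregate_ghz():.1f}GHz")
```

Under those assumptions the 8-core pair gives 6 cores at 3.7GHz plus 10 at 2.9GHz, against 6 at 3.4GHz plus 6 at 3.3GHz for the 6-core pair. Aggregate GHz is a blunt figure, but it shows why the lower base clock is not automatically the worse buy.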
As a side note: I've noticed Repro 5 seems to avoid loading tasks on two instruction queues on the same CPU core, whereas some others don't, or are at least much more ambiguous (Cubase).
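For anyone curious what "avoid two instruction queues on the same core" could look like in practice, here is a sketch for Linux in Python. It says nothing about how Repro or Cubase actually schedule their threads; it only shows one way a process could limit itself to one logical CPU per physical core. The helper names are made up. The sysfs `thread_siblings_list` files and `os.sched_setaffinity` are Linux-specific, so treat this as an assumption-laden example rather than portable code.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '2-3' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of SMT siblings."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists if s.strip()})


def read_sibling_lists():
    """Linux only: read every logical CPU's sibling list from sysfs."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


def pin_to_physical_cores():
    """Restrict the current process to one logical CPU per physical core."""
    cpus = one_cpu_per_core(read_sibling_lists())
    if cpus and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)   # 0 means "this process"
    return cpus
```

On a 2-core/4-thread machine whose sibling lists read `0,2` and `1,3`, `one_cpu_per_core` would return `[0, 1]`. Whether pinning like this actually helps depends on the workload, so it is worth measuring with HT both on and off.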
Which solution is better for a given piece of software is difficult to quantify without delving into some complex concepts I would need to research beforehand: fetching from cache, shared resources, how the CPU manages its instruction queues across the silicon, how it organises RAM in relation to the cores, and which instruction sets the software uses when compiled (AVX-512 can apply the same operation to 8 64-bit data words at once, whereas a scalar x86 or SSE instruction might only handle 1 or 2).
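The word counts in that parenthesis are just register width divided by element width. A minimal sketch of the arithmetic (the dictionary and function names are mine; a scalar instruction handles one value, so it is left out of the table):

```python
# Vector register widths in bits for the common x86 SIMD extensions.
SIMD_REGISTER_BITS = {"SSE2": 128, "AVX2": 256, "AVX-512": 512}


def lanes(register_bits, element_bits):
    """How many elements of a given size fit in one vector register."""
    return register_bits // element_bits


if __name__ == "__main__":
    for name, bits in SIMD_REGISTER_BITS.items():
        print(f"{name}: {lanes(bits, 64)} x 64-bit or {lanes(bits, 32)} x 32-bit per instruction")
```

So AVX-512 handles 8 64-bit words per instruction against 2 for SSE2, and twice as many if the data is 32-bit. Those are peak widths only; how much of that a given plugin sees depends on how it was compiled and on whether the CPU lowers its clock under heavy vector load.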
I hope this is useful. I only skimmed the thread, so I apologise if I've restated someone else's contribution, and I hope my understanding is somewhat correct. I'm a software engineer, but we only have limited knowledge of these details unless we work with very latency- and performance-sensitive code.
- KVRian
- 1186 posts since 21 Aug, 2017 from Brasil
Hyper-Threading and ZombieLoad
https://www.pcworld.com/article/3395439 ... ploit.html
-
- KVRian
- 1262 posts since 15 May, 2002 from Finland
Wow, that was a lot of good info, I understand the benefit of turbo boost better now.