(co-authored with Robert Watson)
Recently, our group was treated to a presentation by Ruby Lee of Princeton University, who discussed novel cache architectures which can prevent some cache-based side channel attacks against AES and RSA. The new architecture was fascinating, in particular because it may actually increase cache performance (though this point was spiritedly debated by several systems researchers in attendance). For the security group, though, it raised two interesting and troubling questions. What is the proper defence against side-channels due to processor cache? And why hasn’t it been implemented despite these attacks being around for years?
The history of cached memory being used a side channel goes all the way back to the classic page-fault password attacks against TENEX reported in 1976. The possibility of using processor cache as a side channel against crypto routines was suggested by Paul Kocher in 1996, and again by John Kelsey et al. in 1998, but largely ignored during AES standardisation. Rijndael was selected despite its heavy reliance on table lookup to achieve good performance in software. Daniel Page described theoretical cache attacks against DES in more detail in 2002. Daniel Bernstein finally broke the flood gates open in 2005 with an experimental statistical timing attack on AES. This was followed over the next year by Colin Percival’s cache-snooping attacks against RSA on hyperthreaded processors, Dag Osvik et al.’s cache-preloading-and-inspection attacks, Guido Bertoni et al.’s cache power-consumption attacks, and my own cache-collision timing attacks.
All of the AES attacks have a common structure and cause: AES performs table lookups at indexes dependent on individual bytes of the key. Cache being a shared resource, attackers can potentially see the side-effects of these lookups, such as eviction of the attacker’s data from cache, or increased time and power consumption due to the ratio of hits to misses. This is an excellent example of how security vulnerabilities get overlooked: a bizarre interaction of algorithmic optimisations of AES and the architecture of processor caches. Both of these are messy details which were designed to be abstracted away from the majority of their users.
Here we are in 2009, and the vulnerability still exists. Interestingly, the problem hasn’t seen a lack of proposed solutions (there have been dozens), but too many solutions at different levels of abstraction, each with its own drawbacks:
- Algorithmic level: AES could be de-certified for situations where an attacker may have access to side-channel information. AES runner-up Serpent avoided table-lookups, as do most candidates in the current eSTREAM and SHA-3 competitions. Of course, this relies on millions of implementers to determine if side-channel attacks may apply and then choose an appropriate alternative to AES. For the future, we can chalk up the use of table-lookups in cryptographic software as a one-time mistake and move on, but for now we’re stuck with AES for decades.
- Software level: Many neat implementation tricks have been proposed to protect AES, from altering or randomising the use of caches to bitslicing the encryption and eliminating caches. A software patch worked for RSA, where re-shuffling the pre-computed values of the message has been deployed to mitigate Percival’s hyperthreading attacks against bit-windowed exponentiation. Unfortunately, the cost of this approach is prohibitive for AES as re-shuffling the AES lookup tables between encryptions is many times more costly than encrypting with no tables at all, an approach whose performance got us into the reliance on table-lookup in the first place. Even worse, randomisation doesn’t prevent collision attacks.
- OS level: The operating system, possibly with the assistance of new hardware-instructions, could close the cache side-channel by partitioning the cache between processes, locking critical sections of cache, or simply disabling shared cache or hyperthreading during sections of code marked as “security critical” by application programmers. The downsides here are multifold: this adds a lot of complexity for OS programmers, and the very transparency of caching becomes a problem-will the OS scheduling policy have to be changed for each minor cache design change in hardware, and will OS developer misunderstandings of hardware-specific cache behavior progress from edge-case performance loss to crypto vulnerability? Requiring a large number of people to understand the intricacies of caching behavior is almost certainly unrealistic (try giving section 10 of Intel’s manual a read). We think there’s a nice analogy to the number of bugs introduced by concurrent execution-if each one of these were potentially a security vulnerability, it would be trouble.
- Cache implementation level: This is where Ruby Lee and her colleague’s proposal comes in. Perhaps we can exploit the fact that caches are just a performance optimisation which software shouldn’t rely on in detail, silently changing the caching behaviour to be randomised. This is nice in that action is required of relatively few people, but even assuming there is no performance penalty it is unsettling in that it makes caches even more complex, raising the possibility of future side-channels being found. The proposal also does nothing to prevent collision attacks.
- Instruction set level: Intel has finally announced dedicated AES support with its AVX extensions, due out in 2010. To most people’s satisfaction, a hardware AES implementation eliminates cache vulnerabilities, but servers purchased today will likely be running for decades.
So, in the end we are left with many options and few good ones. We advocate in general that we should aim for a fix which requires the smallest number of people to make changes, and one which reduces complexity so as to prevent future vulnerabilities. By this metric the first and last options seem the best, but they also take the longest to implement, meaning for the short-term we’ll need to rely on hacks at the software and OS level, which means a lot of pain for crypto and operating system implementers. And while crypto algorithms are clear targets for cache attacks due to their iterated implementation which facilitates statistical attacks, we can’t rule out more general attacks in the future against other security-sensitive code.
4 thoughts on “When Layers of Abstraction Don’t Get Along: The Difficulty of Fixing Cache Side-Channel Vulnerabilities”
I disagree on the first. Decertifying AES JUST in the case where your adversary is running on the same processor as you has the effect of causing HUGE issues for incompatibility, so its hardly the “smallest change”, but rather the largest, because this is the entire reason why the AES standardization did not pick two algorithms.
In fact, I’m not a huge fan of the cache-based sidechannel attacks because they require the attacker to be running code on the same system as the victim, in which case why doesn’t the attacker just escalate privilidge to root and just go from there. Its not that they aren’t real, but that there are so many weaker points to attack when your adversary is running code on the same processor as you are.
And by your “Minimum change”, the OS is probably the place to effect the change by your standard on legacy hardware, simply because the AES routines should be OS-supported anyway: If the OS is compromised, you’re unable to win anyway.
But by making the crypto an OS supported library, you get a chance to control interruptions and can just prefetch the cache when starting the routine, and then pin those blocks so they aren’t evicted until the group of blocks is done.
The only case where you have to prevent other processes is when your caches are too small for your implementation, in which case then you have to instead do the “no other processes during crypto” and wipe the cache state when done.
Thus it isn’t a matter of marking “Security critical” code, its a matter of providing an AES API which takes care of the vaguarities.
Side channels are as people are now recognising, an interesting example of “unintended consiquences”.
I guess they are arguably a “TEMPEST” attack against a system. And are a consiquence of trying to use the hardware in a cost effective or efficient manner.
One rule of TEMPEST design is “red and green channel segregation”. Normally it is thought of incorectly as just an electromagnetic or “cabeling/layout issue”. It’s not it’s more fundemental than that, it is an axiom of the design philosophy.
A number of “TEMPEST” rules of thumb for design are now fairly well known (energy/bandwidth). But the “clock your inputs and outputs” and “single function” or “function segregation” do not appear to be. Or are ignored for the sake of cost or efficiancy.
Ignoring these rules will almost always give opportunities for confidential information to be leaked via complex interaction within a device or system, be it deliberate or unintentional.
One of the most obvious of these side channels is “timing attacks” which is what we are basical talking about here (however there are several others as well).
In a single function unit generaly the point to apply the attack is at the input and the point to monitor it is the output.
That is if I vary the input timing I can cause the delays of various parts of the system to be seen modulated on the output.
However if the inputs and outputs are “clocked” then the available bandwidth of this attack is limited to the recipricol of the clock rate.
More importantly any time invalid input is more easily detected and the whole function should be aborted when such an error is detected thereby limiting the available information bandwidth to the attacker even further.
And importantly also raising an alarm to the posibility of an active attack.
However in a multi function device the number of points to both apply and monitor an attack go up extrodinarily.
But importantly as they are mainly within the device the potential for limiting the bandwidth or information leaked is effectivly removed. As is the ability to detect active attacks.
Which is why you have the single function or segregated function rules.
As designing effective segregation in a device is extreamly difficult at the best of times most designs opt for keeping it simple and over engineered.
In essence the design is single function only, and usually internaly further broken down into smaller segregated functional units. With aditional monitoring and alarms.
That is internaly the functional units are “pipe-lined” with all steps designed to be done well within a single clock cycle. With other associated circuitry at each stage on sensitive input or output lines to detect and deal with fault states.
This is just one of the reasons TEMPEST designed products are expensive.
However not only are they expensive from the costs of the segregation they are also functionaly inefficient as well adding further cost.
Whilst these costs might be acceptable to governmental departments invariably they are not to commercial organisations.
Which is why the use of software running on COTS technology is so seductive.
Effectivly you currently make an ordinary application of your crypto functions and it sits on top of a consumer grade operating system on consumer grade hardware.
The problem is that consumer grade technology is almost always designed to be efficient and absolutly no consideration is given for side channel segregation or other TEMPEST related issues.
Further the end user likewise wants to efficiently use their equipment so the crypto application is very likley to end up on a system with other ordinary user accessable applications via a shared high bandwidth communications channel (ethernet network).
Even if the crypto app is the only non OS application running on the machine the OS and ethernet network alow for timing attacks to be carried out.
Whilst none of these issues are insermountable they require a bottom up re-think of how to go about them.
In the short term I suspect it will be by “fuzzing” the timing by use of spread spectrum type techniques. In the longer term by incorperating appropriate changes in both the hardware and OS.
But neither of these will 100% solve the side channel issues on non segregated systems.
As noted above the obvious timing side channel is not the only one. And comercial entities are unlikley to want to give up the “efficiency” of running user accessable applications on the same system as the crypto.
So I suspect that whatever the solution it will not be either “sufficiently efficient” or “sufficiently secure” for software on ordinary COTS technology use.
Which raises the question of should crypto ever be certified for software use on unknown or insecure hardware?
All of which beings you around to the age old question,
“What price for security”.
Interestingly, I heard this line of reasoning proposed by a CPU vendor while on a teleconference with a major ISP consumer of their CPUs discussing this problem. It became clear quite quickly that the vendor and ISP didn’t see eye-to-eye: while it’s economically convenient for CPU manufacturers to argue that you should buy one server for each customer, scalability in hosting environments relies on sharing, be it for reasons of performance, physical footprint, administrative cost, hardware cost, or power. Ssoftware security models in current OSes are known to have weaknesses, but virtual hosting ISPs will mitigate those weaknesses using a variety of technologies — hardening and sandboxing in various forms, OS-level virtualization, machine-level virtualization, and so on. None of these protect you against the cache attacks on crypto, but do quite effectively protect you against local-to-root exploits, which are decreasingly common attack vectors on hosting environments. If you are a virtual provider host then protecting the integrity of your service platform is fundamental to protecting individual customers and your own service provisioning system, and the idea that a customer might be able to extract the host’s SSH host key, or the SSL private key of a management component or another customer in another virtual OS instance is pretty worrying.
Padlock XCRYPT instruction has been around how many years longer than yet to get here AVX?
A few watts and you get 780,259.42kB/sec aes-256-cbc. 266,131.14kB/sec sha256. tens of Mbps of hw entropy. etc, etc.
Entropy, mont. mult, digests and ciphers should all be in hardware at instruction level in any self respecting core!