I have started a blog on Theoretical Security and Software Engineering. Here is my first post. I hope you enjoy it :)
Designing and implementing secure software solutions usually involves a discussion about the level of security and the effort and cost of achieving that level of security. The cost and effort are a function of the upfront cost of development, time to market, and the cost of fixing the security problem when they are exploited. Depending on the product, the company, and the severity of the problem, the later cost could also involve regaining the trust of the customers (e.g., numerous security breaches of Facebook) or losing a significant part of the business as in the case of VFEmail. Based on the risk tolerance and other factors, software products fall somewhere in the following spectrum of security.
- No security! It is fast, easy, and cheap to develop but carries significant risk and will require a huge cost to make it secure later.
- Some levels of security have been considered and implemented and the rest is protected with NDAs or other non-technical means.
- Absolute security without trusting any third-party (more formally known as Information-theoretic security or Unconditional security).
In this post, I will talk about the last item and discuss how feasible or practical it is. I will argue that although absolute security is not possible in practice, it is important that we put the effort in research in fields such as "Secure Multiparty Computation" to provide the tools necessary for making a reasonable compromise with regard to the security of the product and the number of things that we have to trust.
Let's start with a simple example to demonstrate if absolute security is achievable or not. In this example, we have a client
which wants to encrypt a secret file
and outsource it to a cloud
for storage such that the cloud can never find the content of the file.
Theory
You can achieve information-theoretic security by using a simple XOR
function. Consider the case where the content of the secret file
is a binary string 101010
. Now assume the client has picked a password which is also a binary string, say 100101
. The client XORs the file content with the password (using the following function) and sends the result to the cloud.
def encrypt(content, password):
m = int(content, 2)
k = int(password, 2)
c = m ^ k
return "{0:b}".format(c).zfill(6)
encrypt("101010", "100101")
#=> prints '001111'
Why is this way of encryption secure? Assume you are the cloud and you have received 001111
from the client. Also assume that you have unlimited CPU and memory and you decide to try all possible passwords (i.e., perform a brute-force attack) to find the content of the secret file. You notice that the message is 6 characters long and therefore you try all the passwords between 000000
to 111111
using the following function.
def brute_force(secret):
c = int(secret, 2)
possible_results = []
for possible_password in xrange(int("111111", 2)+1):
possibility = "{0:b}".format(possible_password ^ c)
possible_results.append(possibility.zfill(6))
return possible_results
brute_force("001111")
#=> Can you guess what it prints?
If you run this function, you will notice that it is printing all values between 000000
and 111111
. In other words, it will print all the possible contents of the secret file! This means that brute-forcing did not help at all and you (i.e., the cloud) still have no idea which of the possibilities is more likely! This is what we call information-theoretic security: regardless of your computation power, the best you can do is guess which one of all the possible contents is the answer.
Alright, we have achieved information-theoretic security and we are done, right? No! We have yet to examine the practical side.
Practice
In our brute-force example, the cloud is looking for the correct answer among all the possibilities. An observant reader might have noticed that the content of the secret file
is 101010
or 42
and of course 42 is the answer! Joking aside, this shows an attack when the attacker has some knowledge about the form of the secret. Even if we do not consider this type of attack, there can still be problems with the implementations. For example, consider the following implementation of the encryption function.
def bad_encrypt(content, password):
m = int(content, 2)
k = int(password, 2)
c = m ^ k
print "DEBUG: Encrypting {0} with {1} resulting in {2}".format(m, k, c)
return "{0:b}".format(c).zfill(6)
bad_encrypt("101010", "100101")
#=> prints DEBUG: Encrypting 42 with 37 resulting in 15
# '001111'
hmmm, that does not look very secure! As is evident, even when using algorithms that give you information-theoretic security, you are still trusting that the developers have implemented a secure program. Now, let's assume that the program is written securely and there is no vulnerability in the code. The next question is where to store the password. If an attacker can find the password, then all bets are off. Therefore, even with theoretically secure algorithms and assuming that your programs have no vulnerabilities, you are still trusting that your operational security is perfect!
Now, assume that your code is secure and your operational practices are secure, and you have no malicious insider that is leaking your secrets, would you choose the XOR function for your encryption needs? Probably not! Note that to encrypt a message of length 6, you needed to create a password of length 6. Similarly, to encrypt 10 Terabyte of data that you are outsourcing to a cloud, you would need to store 10 Terabytes of passwords locally! Which is a huge and pointless price to pay to achieve information-theoretic security.
Reality
At this point, you might be wondering why are we even bothering with theoretical security! Well, the picture that I painted so far was an extreme case to make a point about the importance of security in all aspects of software development and also to point out that high level of security is not cheap and in the vast majority of cases you pay it through more computation or more memory/network usage. The solution is research on making theoretical approaches more efficient and to find reasonable compromises on practical security. In the past few decades, there have been significant researches that are bringing us closer to an efficient and reasonable theoretical and practical security.
In my PhD thesis, I have done a small part in furthering such research (AMPR14,AHMR15, AMR17) in the field of "Secure Multiparty Computation" and I believe it is one the most promising fields. In my future blog posts, I will introduce this field from the point of view of a Software Engineer with the goal of encouraging other Software Engineers to adapt and use the results of the amazing research that is being done in this field.
Top comments (2)
Perhaps you should also give the hashtag of
cryptography
? Also, nice post. This is also known as an one-time pad. But the key must be the length of the message to avoid leaking information.Do not use the key more than once!
Also, it's not a true one-time pad unless you have true random data. However it's not feasible to accomplish as of today.
This type of encryption is vulnerable to known-plaintext attacks.
It's hard to meet the requirements to properly implement one-time pads.
Thanks for your comment Tari. The properties you said are correct :) I did not mention them to keep the post lightweight and more appealing and encouraging to everyone. That being said, a short description similar to the one you said could definitely improve the quality of the post.
Also, nice suggestion on the cryptography tag :)