Almost all modern web applications need, in one way or another, to encrypt their users' passwords. We could say that, from the moment that an application has users, and users sign in using a password, these passwords have to be stored in an encrypted way.
There are some intuitive reasons for this: our data stores can be compromised, and so can our communications. But the most important reason is that we have to think of our users' passwords as sensitive personal data. Their passwords are their key to their privacy, so they are personal, they are sensitive, and no one (not even us) has the right to know them. And we must honor this if we want to gain our user's trust.
So, we have to encrypt passwords, but... how? Here comes our first rule:
This is because, except for some specific scenarios (mainly regarding legacy integration), there is absolutely no reason for a password being decrypted. If you encrypt your passwords using password-based encryption (a two-way technique) and an attacker gets to know your encryption password, all of your user passwords will be revealed (and, probably, all at a time). If you don't have such encryption password (or key) to be able to decrypt, this risk disappears, and the attacker will have to trust on brute force or similar strategies.
'But what if one of my users loses his/her password? Can't I remind it to him/her?'
The answer is a loud and clear NO. Not only you cannot do such a thing as reminding their passwords to them, but in fact you should not even have a way to get to read/know/see your users' passwords, no matter if you are the system administrator! If one of your users loses his/her password, just reset it to a new value and send him/her a message to a verified email address with the new one, asking to change it as soon as possible.
Now that it is clear that password digesting is a must, which digest algorithm should we use? Well, there are several, and it largely depends on your needs. The most used ones are:
MD5 algorithm
SHA family: SHA-1 algorithm and SHA-2 variants (SHA-224, SHA-256, SHA-384 and SHA-512)
In most cases, both MD5 or SHA-1 will be adequate choices for password digesting, although applying these algorithms will not be enough, as we will see later on.
When we are told that we should use digests for password encryption, and given that digests are one-way techniques, the next question that arises to our minds usually is: 'If I cannot decrypt passwords... how will I check if my users entered the right one?'
There is a very simple answer to this question, which we will adopt as our second rule:
Which means that, once our users have entered their passwords at sign in, we will digest their input with the same algorithm we have previously used when storing the password, and then compare both digests. As digest algorithms guarantee that two equal inputs will get equal digests (which is not true in the opposite direction), if digests match we can then consider the password input by the user as valid.
All the digest algorithms (also cryptographic hash functions) mentioned above share a feature: they are public. They are well-known and largely implemented algorithms, and so anyone can use them, and not only us.
If anyone can use the same algorithm that we use, and for some reason attackers get to see our database of password digests, how can we be sure that they will not be able to guess some of our user passwords by simply trying all the possibilities until they find one which digest matches the stored one (brute force)?
Well, the answer is... we cannot. But we can make this such an overwhelming and time-consuming task that they will not want to, unless they can wait for eternity. For achieving this, two concepts come in our help: the salt and the iteration count.
The salt is a sequence of bytes that is added to the password before being digested. This makes our digests different to what they would be if we encrypted the password alone, and as a result protects us against dictionary attacks. We can adopt two different strategies regarding salt:
Use a fixed salt, a sequence of bytes that we will use for digesting every password. We can keep this salt hidden and consider it an added security value, but it can make our system more vulnerable to birthday attacks and, in general, attacks driven against our database of passwords as a whole.
Use a variable salt, which is usually a safer option (better if it is random). This is generated or computed separately for each password being digested and it allows each stored password to be decoupled from the others, creating a stronger overall protection and highly improving safety for attacks driven against our database of passwords as a whole.
In practice, random (or at least variable) salt is a much better idea because, although its being random will force us to store it unencrypted along with the digest (so that we can recover it) and this will make it trivial for an attacker to know it, it will still let each of our users' passwords remain decoupled from the rest, so that they will have to be attacked separately.
Think that, if we use a fixed salt and the attacker gets to know this fixed salt, the security of the whole database of passwords will be dramatically decreased. And there are several ways an attacker can get to know this salt, like applying brute force on his own password or on a password he/she has got somehow from a valid user. In a fixed-salt scenario, a weak user could lead us to a weak overall password system.
Nevertheless, if still you want to keep some part of the salt secret, a good approach could be a mix of both techniques, using a salt composed of both a fixed secret portion and a random one, with only the random bytes being stored undigested along with the digest result.
The minimum recommended size of salt is 8 bytes. If a mixed approach is used, at least 8 of its bytes should be random. This said, we can state our third rule:
The iteration count refers to the number of times that the hash function with which we are digesting is applied to its own results.
This means that, once we have selected a salt and concatenated the password to it, we will have to apply the hash function (say, an MD5 algorithm), get the result, and then pass it again as input to the same hash function, then do the same again and again and again... a number of times.
The minimum recommended number of iterations is 1,000, and it will provide us with a good amount of extra security. Think that, when you are creating a single password digest for a new user, the difference between applying the hash function once or a thousand times won't be a problem for you, maybe a couple hundreds of milliseconds... but attackers would have to generate an enormous amount of tentative password digests when brute-forcing and, for an attacker, the difference between applying the hash function once and applying it a thousand times for each try would be a real computational problem.
So, it seems that we have a fourth rule:
Before being able of correctly storing our password digests, there still is something that we should care about, which is the mismatch between character strings and byte sequences: two identical character strings may be represented with different byte sequences depending on the encoding being applied for translation (ISO-8859-1, UTF-8, etc...).
This will affect us because passwords are usually input by users as character strings, but digest algorithms work at a byte level.
Imagine that a new user signs up with a password containing a non-ASCII character, and your sign-up application, running on a Windows 2000 machine in Western Europe, converts the input character string to bytes without choosing a specific encoding for the operation, and thus using the default one, which in that Windows system would be ISO-8859-1. The sign-up logic digests the resulting bytes and the password gets stored.
Then, this user comes around again and this time he/she does not visit your sign-up application, but instead some other app running on the same password database but on a Linux machine. The user correctly inputs his/her password and... they don't match!
Why? because most Linux systems use UTF-8 as a default encoding, and thus the passwords input by the user in the two different scenarios have been translated into different byte sequences by each of the machines, affecting correct password matching.
How to solve this: by setting a fixed encoding for string-to-byte translations. Here comes rule number five:
If you are using Java, where String objects are encoding-independent (although backed by UTF-16), you won't have to worry whether your application uses ISO-8859-1 or any other encoding instead of UTF-8 for its user interface, as will not be necessary that the encoding for password digesting matches that of the user interface. It is just necessary that this encoding be a fixed one, and UTF-8 will bring you the right balance between size and charset completeness (Unicode).
Also, you should care about Unicode Normalization, as depending on the input system, you could get different character sequences (and thus byte sequences) for the same unicode visual character representation. A normalization operation which makes sure your code is in NFC form may be necessary here. For more information on Unicode normalization, see this issue of Core Java Technologies Tech Tips.
We will normally want to manage and store the digested password as a character string, but the digest function will output a sequence of bytes which will not necessarily represent a valid character string in any encoding. So, translating our digested byte sequence back into a character string could not be possible, and we could risk data loss when doing it.
This is where BASE64 encoding comes to the rescue. By encoding our digested sequence of bytes in BASE64, we will make sure that the output byte sequence represents a valid, displayable, US-ASCII character string. So we will be able to safely translate the BASE64-encoded byte sequence into a character string specifying US-ASCII as the encoding.
As an alternative to BASE64, you could also encode your output as hexadecimal strings, which would be an equally valid method (although you would get more lengthy digest strings).
We now have a complete list of rules to be applied:
I. Encrypt passwords using one-way techniques, this is, digests.
II. Match input and stored passwords by comparing digests, not unencrypted strings.
III. Use a salt containing at least 8 random bytes, and attach these random bytes, undigested, to the result.
IV. Iterate the hash function at least 1,000 times.
V. Prior to digesting, perform string-to-byte sequence translation using a fixed encoding, preferably UTF-8.
VI. Finally, apply BASE64 encoding and store the digest as an US-ASCII character string.
The easiest way of encrypting passwords in Java using the explained techniques is using Jasypt, which already does all this processing transparently for you.
If we cannot use jasypt, or if for some reason we wish to develop encryption features ourselves, we will need the following tools for our task:
java.security.MessageDigest for creating digests. This class allows to specify the digest algorithm we wish to use.
java.security.SecureRandom for generating random salt in a secure manner, using algorithms like SHA1PRNG.
The java.lang.String.getBytes(String charsetName) method, for obtaining a byte sequence from the input String, specifying a fixed encoding ("UTF-8").
org.apache.commons.codec.binary.Base64, part of the Apache Commons-Codec library, for performing BASE64 translations on hash output.
java.text.Normalizer (only in Java SE 6) or com.ibm.icu.text.Normalizer (part of the International Components for Unicode package), for Unicode normalization operations.
Also, the source code for classes org.jasypt.digest.StandardByteDigester and org.jasypt.digest.StandardStringDigester, available at the Source Repository, can be used as a guide.
Performed on: A single user password.
Description: The attacker tries to get the user's password by exhaustively generating all possible passwords, digesting them and testing if they match with the user's password digest. Learn more [wikipedia.org].
Our defense: By iterating the hash function to a number like 1,000 (minimum recommended), the overhead of password digest creation for the user at sign-up or sign-in time is not significant, but the accumulated cost for a brute force attacker generating millions of digests will be very considerable. Remember that one of the best ways to protect your encrypted data is making the cost of breaking your security too high to be worth the effort.
Performed on: Either a single user password or a database of user passwords as a whole.
Description: The attacker tries to get the user's password by matching its digest against a set of "most possible" password digests, typically generated from a list of words in a dictionary. This attack exploits a severe weakness in nowadays applications, as an important amount of users set a dictionary word as their password. Learn more [wikipedia.org].
Our defense: By adding a random salt, the weakness of the dictionary-based passwords many people use is reduced (they are no longer dictionary words), and the possibility of the digest appearing on a set of digests previously created by the attacker is minimal.
Performed on: Database of user passwords as a whole.
Description: This attack exploits the Birthday paradox [wikipedia.org], which in brief states that, having a large set of user password digests, the probability of generating a password which digest collides with at least one of the digests in the set is very much higher than what you would intuitively expect. And this probability increases dramatically as the size of the set (the number of users) augments. Learn more [wikipedia.org].
Our defense: By adding a random salt the possibilities of a birthday attack to succeed are minimum, because the attacker would have to attack each password separately, and not the set of passwords as a whole, to find a collision. This is because he/she would have to find a password that creates the same digest as the attacked one using the same salt which was used for digesting it, which is different for each password (this is, it would become a brute force attack).