Next to encryption, hashing is perhaps the most important building block of modern cryptosystems. But what is a hash? Why is it important? How can some ways of computing a hash be better than others, and what makes a particular method suitable for cryptography?
- What is a hash?
- What do we use hashes for?
- What makes a hash suitable for cryptographic purposes?
- Testing it with OpenSSL
- Which hashing algorithms are secure in 2021?
- Conclusion
What is a hash?
A “hash” or “digest” is the output of a one-way function which takes arbitrary input and computes an output of a predictable length which “describes” the input. Run the same input through the function a thousand times and you will always get the same result. If you change even a single bit of the input, your output is likely to be wildly different. Crucially, it is very computationally expensive to attempt to go in the other direction – given only the output, backtracking to the input of the function is non-obvious and computationally expensive. Keep in mind that there are only so many possible outputs that can exist of a fixed length, but an infinite number of possible inputs. This means that the output of a hashing algorithm is not unique between all possible inputs. In fact, there is a special name for two inputs which result in the same output – a collision.
What do we use hashes for?
What are hashes used for? Hashes are useful in a handful of circumstances, both cryptographic and otherwise. Have you ever seen a hash provided alongside a download available on a website? It is a common misconception that this is done for security purposes, but in reality, these hashes are provided so that a user can verify that the bits uploaded to the server are the same as the bits downloaded by the user. Imagine flashing a firmware image to a piece of network equipment that’s missing a few bits at the end! (Other mechanisms, such as CRC exist for this reason as well). Additionally, hashes are used heavily alongside asymmetric cryptography for creating digital signatures. A technique that is very popular is referred to as an “HMAC” (hash-based message authentication code). The practice is fairly straightforward: when sending a message, take the message and run it through a hashing function to create a digest. Encrypt the digest with your private key, and append it to the original message. The recipient, who has previously chosen to trust your public key simply computes their own hash of the message, decrypts the hash you provided at the end of the message using your public key, and compares it to the hash they computed from the message. As only the holder of the corresponding private key could have performed this feat, the message is authenticated!
What makes a hash suitable for cryptographic purposes?
As we now know, collisions are part of the course when it comes to hashing algorithms. What should not be possible are "predictable collisions". A collision attack occurs when a bad actor attempts to force a collision with a maliciously modified input which hashes to the same output. When the client goes to check the message’s integrity, it would believe the malicious payload to be the original message! Typically, these kinds of attacks involve changing the original message and playing with padding. Let’s say we have a message from one international super spy to another:
"Meet me at the London Eye at 4 PM"
The sha256 hash of this phrase is:
c36672122a8e8a8a36880e59ba8e6ca994f2771247ce88731d2486d7775c1233
Yet, take a look at the SHA-256 digest of the string below:
"Meet me at the London Eye at 4 PM "
The sha256 hash of this phrase is:
90100f64c2a8918ac581ce7f559289483bc543f37c78d77756ae7e85fd30dcf9
Confused? This version of the message contains a space at the end! Even though it is not semantically any different, it is a different combination of characters, so it results in a different hash. Imagine if an attacker changed the message to the following:
Meet me at the London Eye at 6 PM
With a cryptographically secure hashing algorithm, this would always result in a different hash. With a hashing algorithm vulnerable to predictable collision attacks, perhaps if an attacker included just the right combination of semantically irrelevant characters, they could make it look as if their modified message passed muster. Imagine if an attacker could do this in order to issue themselves SSL/TLS certificates that any browser in the world would trust!
Testing it with OpenSSL
Using OpenSSL, let’s create a keypair and use the private key to sign the hash:
"C:\program files\OpenSSL\bin\openssl.exe" genrsa -des3 -out private.pem 2048
Private.pem contains both the public and private key, let’s export just the public key:
openssl rsa -in private.pem -outform PEM -pubout -out public.pem
echo "Meet me at the London Eye at 4 PM" > sign.txt
"C:\program files\OpenSSL\bin\openssl.exe" dgst -sha256 -sign private.pem -out sign.txt.sha256 sign.txt
The contents of the file sign.txt.sha256 can be validated by anyone that trusts my public key using the following command:
openssl dgst -sha256 -verify <(openssl x509 -in "public.pem" -pubkey -noout) -signature sign.txt.sha256 sign.txt
Which hashing algorithms are secure in 2021?
Currently, the gold standard in the industry is any hash in the SHA-2 family of hashes of bit-length 256 or greater. This includes SHA-256, SHA-384, and SHA-512. These hashing algorithms are recommended via the NIST in FIPS PUB 180-2 (which has since been superseded by FIPS PUB 180-4 but retains the recommendation). Until fairly recently, SHA-1 was considered to be relatively secure, but within the past 5 years new ways of leveraging modern hardware to come up with predictable collisions have become a threat. MD5 and MD4 have been unsuitable for cryptographic purposes for over a decade, but are still perfectly fine for validating the bits of your router firmware match what was uploaded!
Of these recommended hashing functions, SHA-256 is by far the most commonly implemented in software. This is because with current computing power, SHA-256, SHA-384 and SHA-512 are equally secure (it is infeasible for an attacker to compute collisions). SHA-384 and SHA-512 are a means of future-proofing the algorithm, but in 2021 there is very little reason to compute hashes that take longer, consume more storage, and use more battery power.
Conclusion
Hashing algorithms are here to stay. The properties they provide are invaluable to cryptographers, and are ubiquitous anywhere cryptography is found. While there is no guarantee that SHA-256 will continue to be reasonably secure, other hashing algorithms do exist, including SHA-3 which was finalized in 2015 largely that if in the event SHA-2 is broken similarly to SHA-1 that a potential suitable replacement already exists. The next time you see SHA-2 referenced in the news, try to figure out why the system in question is employing hashing.