DEV Community

Cover image for Using Hash Codes as Unique Identifiers
Igvir
Igvir

Posted on • Originally published at blog.igvir.com

3

Using Hash Codes as Unique Identifiers

Introduction

In this article, we will cover a common use case for software developers: Using and generating Unique Identifiers.

Let´s start with a definition:
What is an identifier? The simplest explanation usually comes from Wikipedia:

An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique class of objects, where the "object" or class may be an idea, physical countable object (or class thereof), or physical noncountable substance (or class thereof). The abbreviation ID often refers to identity, identification (the process of identifying), or an identifier (that is, an instance of identification). An identifier may be a word, number, letter, symbol, or any combination of those.

Unique identifiers (UIDs), or identifiers, can be a string value or an integer that we use to address unique resources in a domain. Consumers (Systems or Users that need to refer to that resource) then use these identifiers to fetch the resource from a collection of resources. Without a unique identifier, the simple task of separating the resources and invoking those resources is almost impossible.

We use Identifiers to refer to database structural elements, like the names of tables, files in a collection, or elements in a set.
Another important concept here is Metadata. And again from Wikipedia:

Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself.
As we will see in this article, metadata and identifiers can be related for different purposes, including generating a Unique Identifier (UID).

Use Metadata to create Unique Identifiers

The metadata can be generated using algorithms, an example of this is a Hashing algorithm. Hashing consists of converting a general string of information into metadata that represent it. This is done to scramble the data so that it completely transforms the original value, making the hashed value utterly different from the original while also describing it in a certain way, to the point of identifying it.
Hans Peter Luhn invented Hashing. He was a German researcher and scientist who worked at IBM. During his professional life, he created over 80 patents in the field of Computer Science and Information Science.

Hashing is used for various functions, but here we are especially interested in its ability to verify the integrity of an object without having to recognize its content. Two files can be compared for equality easily through hashing. There is no need to open the two documents individually. Hashing compares them byte-by-byte and the computed hash value instantly tells if they are distinct.

Are Hash Codes Unique Identifiers?

A hash code is of a fixed length, so from a mathematical point of view it cannot be unique for all possible inputs. But all such hash functions are carefully designed to minimize the probability of a collision (two distinct files with the same hash value).

Scott Chacon and Ben Straub in "Pro Git" book talk about Hashing collision options using SHA-1 as UID:

If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (6.5 million Git objects) and pushing it into one enormous Git repository, it would take roughly 2 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision.

Therefore we can say that in most practical cases it is possible to use Hashing algorithms to generate Unique Identifiers.

Using Hashing to Create a Unique File Names

A simple use case where a Unique Identifier is needed is the naming process for an Object Storage. For this example, we assume that objects are immutable so if the object changes a new hashing Identifier will be associated with the new version. The idea is simple we will create a Hash code for every object and use it as a Unique Identifier of that object version.

Prerequisites

We use Java for our sample code but it could be easily ported to other programming languages. To run the file you will need to install a Java SDK. If you need it, you can see How to download and install JDK.

Hands-on with Code

The code could be separated into three steps:

  1. Select the algorithm
  2. Parse file(s)
  3. Generate the UID

Step 1: The Algorithm

There are three main options when you think about the Hashing algorithm you want to use: MD5, SHA-1, or SHA-256. A Hash code is fixed length, MD5 generates a 128 bits code, SHA-1 is 160 bits and SHA-256 is 256 bits. Longer numbers minimize the likelihood of a collision of generated codes.

Federal Information Processing Standards define Secure Hash Standard (SHS) FIPS PUB 180-4 names seven algorithms: SHA-1, SHA-224, SHA-256,
SHA-384, SHA-512, SHA-512/224, and SHA-512/256.

In 2017, researchers were able to produce two different documents with the same SHA-1 hash. It took them 6610 years of processor time to do it.

We use java.security.MessageDigest class and SHA-1 Algorithm in this example.

Step 2: Parse File(s)

As mentioned earlier, we use java.security.MessageDigest class and SHA-1 Algorithm in this example. Message digests are secure one-way hash functions that take arbitrary-sized data and output a fixed-length hash value.

The code creates a new file input stream for reading the file content. The data is processed using the update method call. Once all the data has been updated, the digest methods should be called to complete the hash computation.

See the code:

//New file input stream for reading the file content
FileInputStream fis = new FileInputStream(file);
//Create byte array to read data in chunks
byte[] byteArray = new byte[1024];
int bytesCount = 0;
//Read file data and update in message digest
while ((bytesCount = fis.read(byteArray)) != -1) {
digest.update(byteArray, 0, bytesCount);
};
//close the stream; We don't need it now.
fis.close();
//Get the hash's bytes
byte[] bytes = digest.digest();

Step 3: Generate the UID

The digest method returns an array of bytes in decimal format. We need to convert it to hexadecimal format before call toString() method to finally to get an UID based on file data.

//Get the hash's bytes
byte[] bytes = digest.digest();
//This bytes[] has bytes in decimal format;
//Convert it to hexadecimal format
StringBuilder sb = new StringBuilder();
for(int i=0; i< bytes.length ;i++)
{
sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
}
//return complete hash
return sb.toString();

If the file changes in any way the resultant hash code will be different. You can handle those differences according to the context as a new version or as an integrity check failure.

Complete Sample Code

The complete code can be found bellow:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class FileCheckSum {
//Hashing Algorithms
public static final String MD5="MD5";
public static final String SHA1="SHA-1";
//Parameter options for cmd
//-f file list
//-a hashing algorithm
public static final String FILE="f";
public static final String ALGORITHM="a";
/**
* @param fileName
* @param algorithm
* @return
**/
public static String generateCheckSum(String fileName,String algorithm) throws NoSuchAlgorithmException, IOException{
// Create checksum for this file
File file = new File(fileName);
// Use MD5 or SAH1 algorithm
MessageDigest md5Digest = MessageDigest.getInstance(algorithm);
// Get the checksum
String checksum = getFileChecksum(md5Digest, file);
// see checksum
// System.out.println(checksum);
return checksum;
}
private static String getFileChecksum(MessageDigest digest, File file) throws IOException{
//New file input stream for reading the file content
FileInputStream fis = new FileInputStream(file);
//Create byte array to read data in chunks
byte[] byteArray = new byte[1024];
int bytesCount = 0;
//Read file data and update in message digest
while ((bytesCount = fis.read(byteArray)) != -1) {
digest.update(byteArray, 0, bytesCount);
};
//close the stream; We don't need it now.
fis.close();
//Get the hash's bytes
byte[] bytes = digest.digest();
//This bytes[] has bytes in decimal format;
//Convert it to hexadecimal format
StringBuilder sb = new StringBuilder();
for(int i=0; i< bytes.length ;i++)
{
sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
}
//return complete hash
return sb.toString();
}
/**
* @param args
*/
public static void main(String[] args) {
final Map<String, List<String>> params = new HashMap<>();
List<String> options = null;
for (int i = 0; i < args.length; i++) {
final String a = args[i];
if (a.charAt(0) == '-') {
if (a.length() < 2) {
System.err.println("Error at argument " + a);
return;
}
options = new ArrayList<>();
params.put(a.substring(1), options);
}
else if (options != null) {
options.add(a);
}
else {
System.err.println("Illegal parameter usage");
return;
}
}
try{
String algorithm=MD5;
long startTime=0;
long endTime=0;
String checkSum;
if(params.get(ALGORITHM) !=null){
algorithm=params.get(ALGORITHM).get(0);
if (!MD5.equals(algorithm) & !SHA1.equals(algorithm) ){
//use default SHA1
System.out.println("Invalid algorithm parameter. Using default: SHA-1 algorithm.");
algorithm=SHA1;
}
}
//add multiple files support using iterator
List<String> files=params.get(FILE);
if(files==null){
System.out.println("Use -f [filename]");
return;
}
for (String fname : files) {
startTime = System.nanoTime();
checkSum=FileCheckSum.generateCheckSum(fname,algorithm);
endTime = System.nanoTime();
System.out.println("["+((endTime-startTime)/1000000)+"] File: "+fname+" CheckSum: "+checkSum);
}
} catch (IOException e) {
System.err.println(e);
} catch (NoSuchAlgorithmException e){
System.err.println(e);
} catch (Exception e){
System.err.println(e);
}
}
}

Conclusions

We have made a tour of the concepts associated with the generation of Hash codes and their applications for the generation of Unique identifiers for objects in our application domain.
The main strength of Hashing algorithms is that they offer, at the same time, the possibility of generating an integrity verification code and an identifier reliable enough to be the unique identifier. Hashing libraries are widely available as standard in practically all programming languages, so their use is very feasible and quick to implement.

Although in the example we have used the case of generating filenames, many other uses of Hashing algorithms are well known. Just to name a few we can mention:

  • Database indexing
  • Password storage
  • Data compression
  • Search algorithms
  • Cryptography

Credits

Postmark Image

Speedy emails, satisfied customers

Are delayed transactional emails costing you user satisfaction? Postmark delivers your emails almost instantly, keeping your customers happy and connected.

Sign up

Top comments (0)

The Most Contextual AI Development Assistant

Pieces.app image

Our centralized storage agent works on-device, unifying various developer tools to proactively capture and enrich useful materials, streamline collaboration, and solve complex problems through a contextual understanding of your unique workflow.

👥 Ideal for solo developers, teams, and cross-company projects

Learn more

👋 Kindness is contagious

Discover a treasure trove of wisdom within this insightful piece, highly respected in the nurturing DEV Community enviroment. Developers, whether novice or expert, are encouraged to participate and add to our shared knowledge basin.

A simple "thank you" can illuminate someone's day. Express your appreciation in the comments section!

On DEV, sharing ideas smoothens our journey and strengthens our community ties. Learn something useful? Offering a quick thanks to the author is deeply appreciated.

Okay