Detailed explanation of core algorithm of device fingerprinting

Most of risks come from uncertainty of identity.

For example, the well-known phishing, bogus transaction, account theft, and fraud through registration and login are the direct consequences of the uncertainty of identity.

So, how do you make sure your identity is established and that you can't be easily stolen or copied?

Device fingerprinting provides a solution. Device fingerprinting refers to the unique identification of the device generated by the hardware, network, and environmental information of the user's Internet device, and the most important indicators of device fingerprint are uniqueness and stability.

This article starts with the stability and uniqueness of device fingerprinting.

It is not easy to ensure the stability and uniqueness of device fingerprinting

As a unique ID for identifying mobile phone or browser, it's required to collect data from user devices, but it is not easy for device fingerprinting.

First of all, the legal level of the user's personal information data protection authority increased.

In recent years, the protection provisions of personal information data have been continuously tightening the rights of data collection, and the collection rate of features with the unique identification function of the device is becoming less and less, which increases the difficulty of device fingerprint technology to a certain extent.

In the past, the methods for collecting the unique identification of device fingerprint include MAC address (Ethernet physical address), IMEI (mobile equipment identification number), and IDFA (advertising identifier). With the updage of system and emergence of new cheating methods and tools, parameters on the device can be tampered with, turning it into a new device easily, which compromises the effect of unique identification of the device. Therefore, the device fingerprinting requires more information to ensure the uniqueness and stability of the device fingerprint.

Taking iOS as example, IDFA collection on iOS 14 or later requires user authorization, the same applies for Serial and IMEI. This change has made the generation of stable and unique identity more complicated.

The collection rate of some feature statistics (total number of devices > 500,000), the actual availability rate of android-mac here will be lower, because of arbitrary nature of mac

Second, the amount of abnormal data increases.

Different mobile phone models and operating system versions will lead to different characteristics of the collected data. It's especially so for Android, as there are so many different brands of mobile phones run on Android, with different OS version and ROM, the data collected are inconsistent.

Consider IMEI, the following are actual abnormal data collected.
000000000000000
88888888888888
666666666666667
111111111111119
A000005EAAACCC
868686868686863
444646464643346
100000000150705
010000000053015
A0000060A60A0B
A0000070000AAB

Abnormal MAC address.
00:00:00:00:00:0a
02:00:00:00:00:00
00:00:00:00:00:01
00:00:00:00:00:10
6a:aa:6a:aa:6a:6a
00:02:00:00:00:00
00:00:00:80:00:00
10:00:00:00:00:12
f2:0f:f0:02:f0:22
32:12:31:23:32:32
66:00:44:40:06:66
c0:00:00:00:00:d0
00:00:00:00:00:12
04:00:00:50:54:04

These abnormal data may be because the device has been tampered with, or abnormal ROM, or special model device, it's also true for android_id, Seria, etc. If these abnormal data are ignored, the effect of fingerprints will definitely be affected. Therefore, it is necessary to have the support of big data to formulate a reasonable fingerprint algorithm.

In addition, the fingerprint calculation algorithm is not static. The release of a new version of operating system and a new model of mobile phone may lead to abnormal data collection at any time, which requires the device fingerprint to have the ability to continuously monitor abnormal data.

Last is the impact of illegal and semi-illegal behaviors.

In the field of risk control, device fingerprint is a basic tool to fight against illegal and semi-illegal behaviors, so there are inevitably ways to bypass the tracking and detection of fingerprints. A common way is to modify the device fingerprint by changing the device fingerprint manually.

In addition, the tampering of device information may also cause fingerprint data corruption. If the tampered data is associated with normal user's device, the data of normal user will be compromised, and it is difficult to remove such data from the system.

Accurate matching based on feature information to ensure device fingerprint stability

So this requires changes made to device fingerprinting, Ding Xiang chooses to do so through the cloud + terminal model, in addition to collection of more detailed information, the timeliness and security of equipment have been significantly enhanced, based on the previous experience and data precipitation of the industries. When new attack methods appear, how can fingerprinting technology respond and identify it faster.

How to ensure the stability and uniqueness of device fingerprints?

Let's consider the stability first.

The stability calculation logic is to construct a multi-dimensional device portrait based on the collected data. After reporting the data, the algorithm matches the closest device and uses the found device fingerprint, otherwise, new device fingerprint is generated.

The reason for emphasizing the stability of the device fingerprint is that many operations will cause the characteristic information of the device to change, such as app reinstallation, permission change, advertising identifier reset, machine restart, system upgrade, device parameter modification, and factory reset.

In order to calculate a stable unique identity of the device in these cases, it is of course impossible to rely on only one or two device parameters, and more comprehensive device information is needed to build a device profile.

To ensure the high stability of device fingerprints, it is necessary to collect multi-dimensional data, analyze the degree of discrimination, stability, and change cycle, and analyze the characteristics of the characteristics, and use the characteristics reasonably.

Taking CPU, storage, and memory as example, it has very good stability, but the discrimination is very poor. So such features are not suitable for use alone, it needs to be used together with other features.

It is necessary for us to further observe the period of feature stability (the size of the period in which the feature remains constant) and the difficulty of the stable change of the feature (the degree of difficulty of changing the feature), and decide how to use the feature based on the different characteristics of the feature, when selecting a feature, it should have an obvious advantage in at least some index.

In order to ensure the stability of the device fingerprint, it is necessary to construct multiple device portraits (short-term and long-term) for the device based on different characteristics. First, a batch of similar devices have to be located through the similarity search algorithm, and then accurately picked by the exact matching algorithm.

The process of basic algorithm (SimHash) is as follow:

Select the device information collected as feature pool;
Hash coding the features according to the degree of discrimination, stability, and difficulty of change of the collected information after corresponding processing. For example, splice multiple features with weak discrimination, and hash code to obtain coding vector C;
Set the weight W of the coding vector C according to features, (note: weight W can be obtained through big data learning, or set according to expert experience, weight W varies depending on the operating system version of device);
Multiply the coding vector C by the corresponding weight W to obtain the feature vector C';
Add all eigenvectors C' to get a eigenvector, and binarize (values greater than 0 become 1, and values less than 0 become 0) to obtain the final index D;
Calculate the similarity between the index D and the indexes of other devices in the server, (calculate the Hamming distance - in the information coding, the number of different bits encoded on the corresponding bits of the two legal codes is called code distance, also known as Hamming distance. The bit number with different values for the corresponding bits of the two code words is called Hamming distance of two code words. For example, the first, fourth, and fifth digit of 10101 and 00110 are different, so the Hamming distance is 3) return the device information whose similarity is within the threshold as candidate pool for further precise matching.

It should be noted that the stability of device fingerprint depends on whether sufficient and appropriate features are selected, whether the characteristics of each feature are properly understood, and whether the matching algorithm used conforms to the current situation of data acquisition.

Algorithm update + collision detection = the magic weapon to achieve uniqueness for device fingerprinting

Next, let's talk about uniqueness.

Uniqueness is a key indicator of device fingerprinting.

As basic service, device fingerprint is at the core of data services, such as user's commonly used devices, device-dependent policies in risk control, and unique advertising identifier. If there is a deviation in the fingerprint uniqueness calculation, or collision, it may cause serious misjudgment.

For example, after collision, the device may be considered to be in wrong area and associated with wrong account, resulting in erroneous punishment; if there is a large-scale collision, a large number of normal users will be blocked. Collision will affect the reliability of fingerprints, as a result, online services no longer rely on fingerprint-based judgments. Therefore, for good device fingerprint, the uniqueness must not have severe deviation.

The collision may be resulted from non-uniqueness of important equipment features that should be unique, or improper feature selection, lack of unique identifier for a combination of features, or bad matching algorithm.

Therefore, to ensure the uniqueness of the device fingerprint, it requires not only real-time dynamic update algorithm, but also collision detection.

First of all, as far as the algorithm update is concerned, it is not enough to rely on the application of the daily abnormal data detection, it is necessary to analyze the current data regularly in the offline warehouse, and the abnormal features can be found and extracted in time, and then fed back to the online algorithm optimization.