Ricardo Iván Vieitez Parra for Apeleg Limited

Originally published at apeleg.com

Handling IP addresses

IP addresses pose unique technical challenges because they are sent in the clear with every network request and because they may, in some cases, identify or help identify an individual, which makes them personal data. In the EU, this was already affirmed in case C-582/14. Since the GDPR came into force, data protection agencies in several countries, such as Austria, the Netherlands and France, have applied this criterion to the transfer of IP addresses to the United States by the popular website analytics tool Google Analytics, and others, like Norway, may soon follow suit. Similarly, in a recent case in Germany, a website operator was ordered to compensate a visitor for embedding Google Fonts and thus violating the visitor's privacy by sharing their IP address with Google.

Since IP addresses are sent along with every packet on the Internet, and precisely because they identify the sender to some extent, they also have many practical uses. For instance, they are a valuable data point for detecting and investigating data breaches and other security incidents. They can also serve as input to automated moderation or filtering systems for purposes like blocking spam or online fraud, and they can provide vital analytics for responding to traffic, like finding optimal routes for transmitting content to our users.

Fortunately, regulations like the GDPR contemplate these use cases under categories like 'overriding legitimate interest', which means that not all use or processing of IP addresses is disallowed without explicit user consent. However, since IP addresses are personal information, it is important to ensure either that they are only used in connection with essential purposes that don't require explicit consent, or that consent is obtained prior to processing. The particular challenge of IP addresses compared to other personal information is that, by the nature of how the Internet works, they can be very easily relayed to third parties by things as simple as having the user download a file.

To avoid this situation, we are left with three options: not using external resources, obtaining the user's consent or self-hosting.

Not using (certain) external resources

While this addresses the concern of sharing the user's IP address without consent, it's not always a viable option, because those resources may very well provide value to our website and our users. For example, jQuery might be what our website uses to provide an interactive page, or those resources might be fonts that are important for consistent branding. Nonetheless, not using an external resource is an option worth considering because, when feasible, it is generally the easiest alternative. It works best for resources that are not really used or that are easily replaceable.

Apart from not sharing IP addresses with third parties, the advantage of not using certain resources is that we improve the user's experience with a smaller page that loads faster.

Obtaining the user's consent

While this is a valid option to consider, it might not always be practical or preferable, because the user's consent needs to come first, before any resource is loaded. Take the example of public CDNs, which host resources that may be essential for our website to work, or at least for providing a rich and high-quality user experience. If we seek the user's consent, then depending on what we use those external resources for, in practice we will deliver a degraded user experience until consent is obtained. While this could be acceptable for certain sections or functionality of our page, it probably isn't for core functionality. More importantly, it can unnecessarily pressure the user into giving consent (removing the 'freely given' element of it), since refusing implies having a poorer experience interacting with our site.

Self-hosting

Self-hosting, as the name implies, is hosting resources ourselves in our own infrastructure. While doing this can in some cases be an involved process, especially if we are implementing self-hosting on a site that currently loads several resources from third party origins, when done well it has the advantages of not leaking visitors' IP addresses to others, not requiring explicit consent and not resulting in limited functionality or interactivity.
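
Before moving resources in-house, it helps to know which third-party origins a page currently loads from. The following is a minimal sketch of such an audit using only the Python standard library. The class and function names are hypothetical, and the list of tags it inspects is an assumption that is neither complete nor exact (for instance, some `link` elements do not trigger a download).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalResourceFinder(HTMLParser):
    """Collects the hosts of resources referenced from other origins."""

    # Tag -> attribute holding a URL the browser may fetch (partial list).
    URL_ATTRS = {"script": "src", "link": "href", "img": "src", "iframe": "src"}

    def __init__(self, own_host: str) -> None:
        super().__init__()
        self.own_host = own_host
        self.external_hosts: set[str] = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                host = urlparse(value).hostname
                # Relative URLs have no hostname: they are served by us.
                if host and host != self.own_host:
                    self.external_hosts.add(host)


def find_external_hosts(html: str, own_host: str) -> set[str]:
    """Return the set of foreign hosts referenced by the given HTML."""
    finder = ExternalResourceFinder(own_host)
    finder.feed(html)
    return finder.external_hosts
```

Running `find_external_hosts` over each page template gives a list of hosts to either drop, gate behind consent or replace with self-hosted copies.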

Server logs

Self-hosting doesn't address all of the challenges that come with processing personal information, but it does put us back in control, and from that position we can implement policies that protect our users' privacy. One of those challenges is that most web servers by default store incoming IP addresses in their various logs. The fact that IP addresses are stored is not necessarily a problem, as they can provide useful information, for example when responding to a security incident.

However, it's crucial to treat these entries with IP addresses as containing personal information and to ensure the right procedures and policies are in place, starting with how the information will be used. For example, while it is most certainly appropriate to keep logs for security purposes, using the IP entries in log files to also correlate sessions and derive behavioural analytics is likely to require the users' informed and actively given consent.

A second consideration when keeping log files with IP addresses is establishing a data retention policy. The benefits of having logs of IP addresses for security purposes diminish over time and need to be weighed against the disadvantages. Storing information opens us up to leaking it (for example, by accident or due to server compromise), and users might request that their personal information be deleted, which can be time-consuming and expensive to handle manually. Hence, in many cases the optimal solution is retaining logs for a period long enough to respond to incidents but short enough to avoid the information turning into a liability.
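
As an illustration of enforcing a retention period, the sketch below filters access log lines by age. It assumes lines in Common Log Format with a bracketed timestamp such as `[10/Oct/2023:13:55:36 +0000]`, and the 30-day default is a made-up value rather than a recommendation. The function names are hypothetical. In practice the server's own log rotation settings may already cover this.

```python
import re
from datetime import datetime, timedelta

# Bracketed timestamp of a Common Log Format line (assumed log format).
TIMESTAMP = re.compile(r"\[([^\]]+)\]")
TIMESTAMP_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def within_retention(line: str, now: datetime, retention: timedelta) -> bool:
    """Return True if the line's timestamp is no older than the retention period."""
    match = TIMESTAMP.search(line)
    if match is None:
        # Lines we cannot date are dropped rather than kept indefinitely.
        return False
    try:
        logged_at = datetime.strptime(match.group(1), TIMESTAMP_FORMAT)
    except ValueError:
        return False
    return now - logged_at <= retention


def purge_expired(lines, now: datetime, retention: timedelta = timedelta(days=30)):
    """Keep only the log lines that are still within the retention period.

    `now` must be a timezone-aware datetime, since log timestamps carry an offset.
    """
    return [line for line in lines if within_retention(line, now, retention)]
```

Dropping lines that cannot be parsed is a deliberate choice here: the alternative, keeping them, would let malformed entries with IP addresses outlive the policy.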

Different jurisdictions

Different jurisdictions have different regulations and requirements about processing personal data, and what is permissible or even required in one place can be disallowed in another. Therefore, as far as possible, the preferred solution is to avoid transferring personal information between jurisdictions, especially between those that don't have comparable levels of data protection. One example is data transfers between the EU and the US, which, after the ECJ invalidated the Privacy Shield framework, require additional considerations to ensure that all EU regulations are being followed.

If we host our server in the United States and receive visitors from the EU, safeguards need to be in place to ensure that all personal data being transferred, including IP addresses, are going to be handled appropriately and in accordance with the EU regulations. Since this can be burdensome, a solution for organisations using a CDN with many local points of presence might be to simply implement firewall policies at the edge and strip requests of all personal information, thus avoiding this requirement. For example, Cloudflare offers request transform rules that can remove information present in request headers, like IP addresses.
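
The logic of such an edge rule can be sketched as a filter over request headers. This is not Cloudflare's configuration syntax, only an illustration of the idea in Python. The header list is an assumption covering headers that commonly carry the visitor's address behind proxies and CDNs, and the function name is hypothetical.

```python
# Headers that commonly carry the visitor's IP address when a request
# passes through a proxy or CDN. This list is an assumption; adjust it
# to whatever the edge provider actually adds.
IP_BEARING_HEADERS = frozenset(
    {
        "x-forwarded-for",
        "forwarded",
        "x-real-ip",
        "cf-connecting-ip",
        "true-client-ip",
    }
)


def strip_ip_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the headers without those that reveal the client IP."""
    return {
        name: value
        for name, value in headers.items()
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() not in IP_BEARING_HEADERS
    }
```

With the headers removed at the edge, the origin server only ever sees the address of the CDN node that forwarded the request.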

Analytics services

Analytics provide insightful information about how our web properties are used, which can in turn be valuable feedback for improving user experience. On the other hand, when implementing analytics, as with everything else, we must consider how our visitors' personal data will be processed and the basis for that processing. A wide range of services, including the popular and ubiquitous Google Analytics, enable publishers to collect and process various types of metrics.

However, using third-party tools presents many of the same challenges as loading content from external sources does. Although Google Analytics has an opt-in mechanism to avoid collecting users' IP addresses, such measures have, as mentioned earlier, been found to be inadequate in many countries in the EU. The solutions are also in many ways the same as for loading external content.

Since analytics services are in most cases non-essential, collecting users' consent could be more viable here than it is for loading essential libraries, and many websites nowadays do so in the shape of consent prompts. Still, having to present these prompts and warn of the risks of cross-border data transfers is not necessarily good for either user experience or user trust.

When using analytics services provided by a third party, it can be worth proxying connections to the service through our own infrastructure instead of having users connect directly to the third-party analytics collection endpoint. This way, it is possible to scrub the data being sent and remove any personal information. In the case of Google Analytics, this means proxying requests to https://www.google-analytics.com/collect through our own domain.
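
The scrubbing step of such a proxy might look like the sketch below, which removes selected parameters from a hit's query string before it is forwarded. The parameter names (`uip`, `ua`, `uid`) are assumptions based on the Universal Analytics Measurement Protocol and should be checked against the service's documentation. The function name is hypothetical, and which fields count as personal data is a policy decision this code does not settle.

```python
from urllib.parse import parse_qsl, urlencode

# Assumed names of parameters that may carry personal data:
# IP override, user agent override and user ID.
SCRUBBED_PARAMS = frozenset({"uip", "ua", "uid"})


def scrub_payload(query: str) -> str:
    """Return the query string with the scrubbed parameters removed."""
    pairs = parse_qsl(query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in SCRUBBED_PARAMS]
    return urlencode(kept)
```

For example, `scrub_payload("v=1&t=pageview&dp=%2Fhome&uip=203.0.113.7")` returns `"v=1&t=pageview&dp=%2Fhome"`. Because the proxy makes the outgoing request itself, the analytics provider sees the proxy's address rather than the visitor's.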

Analytics services can also be self-hosted. While this is a more expensive solution from a management perspective, it is the solution that affords the greatest flexibility for implementing and enforcing privacy protections. There are many offerings in this space, including established projects like Matomo (formerly Piwik) and Open Web Analytics.
