DEV Community

Edward Amor

Originally published at edwardamor.xyz

High Level Overview of Quantile Quantile Plots

The quantile-quantile plot belongs in any data analyst's toolkit for working with one-dimensional data. Colloquially referred to as Q-Q plots, these visualizations are used mainly to compare two samples, or to compare a sample against a distribution. Although they're not intuitive at first, Q-Q plots are valuable tools, especially for assessing whether a sample fits a known distribution such as the Gaussian distribution.

Q-Q plots work by plotting the quantiles of one distribution (x-coordinate), typically a theoretical distribution, against the quantiles of another (y-coordinate), typically an observed dataset. If the two distributions being compared are similar, the resulting points will lie approximately on the line y=x. There are some variations on the Q-Q plot, and each one tells you something different about the data being compared. Q-Q plots are also somewhat open to interpretation; a good heuristic is that if the points generally lie close to the line y=x, then you're golden. Even data randomly drawn from the Gaussian distribution won't lie exactly on the line y=x, so there is wiggle room.

This Q-Q plot shows the quantiles of 75 data points randomly drawn from the standard normal distribution, compared to those of the normal distribution. One would intuitively expect the points to lie perfectly on the line y=x; however, this isn't the case, which is why we say Q-Q plots are somewhat open to interpretation.
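As a rough illustration of the construction described above, here is a minimal sketch that computes the points of a normal Q-Q plot using only Python's standard library. The helper name `normal_qq_points`, the seed, and the plotting positions `(i + 0.5) / n` are assumptions made for this example (other plotting-position conventions exist), and the sample is freshly simulated rather than the data behind the figure.

```python
import random
from statistics import NormalDist


def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)  # the ordered data are the observed quantiles
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1);
    # this convention is an assumption of the sketch.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(0.0, 1.0) for _ in range(75)]
points = normal_qq_points(sample)

# Points near the line y = x have a small gap between the two quantiles.
mean_gap = sum(abs(y - x) for x, y in points) / len(points)
print(f"mean |observed - theoretical| = {mean_gap:.3f}")
```

Plotting the pairs in `points` as a scatter plot, with the line y=x drawn on top, gives the Q-Q plot. The gap is small but not zero, which is the wiggle room mentioned earlier.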

One place where Q-Q plots are routinely applied is linear regression. In linear regression, there are assumptions that have to be met in order for the resulting model to be considered valid and not misleading. One of these assumptions is that the residuals of the model are normally distributed. To check that this assumption has not been violated, we typically use a Q-Q plot to quickly compare the distribution of the residuals to the Gaussian distribution. If the residuals loosely fit the line y=x, then one can state that the assumption has not been violated.
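The residual check can be sketched as follows. This is a hedged example on simulated data, not the model from this post: the data-generating line, the noise level, the seed, and the `pearson` helper are all hypothetical. The residuals are divided by their standard deviation first, which is an assumption of the sketch, so that y=x is the appropriate reference line against the standard normal.

```python
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(1)

# Hypothetical data: y depends linearly on x, plus Gaussian noise.
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0.0, 1.5) for x in xs]

# Ordinary least squares with a single predictor.
x_bar, y_bar = fmean(xs), fmean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed value minus predicted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Standardize and sort to get the observed quantiles.
scale = pstdev(residuals)
observed = sorted(r / scale for r in residuals)
n = len(observed)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = fmean(a), fmean(b)
    cov = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    var_a = sum((u - a_bar) ** 2 for u in a)
    var_b = sum((v - b_bar) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


# A correlation close to 1 means the Q-Q points are close to a straight line.
r = pearson(theoretical, observed)
print(f"Q-Q correlation of standardized residuals: {r:.4f}")
```

The correlation is only a numeric summary of how straight the Q-Q points are; it does not replace looking at the plot, since curvature in the tails can hide behind a high value. Libraries such as SciPy and statsmodels offer ready-made Q-Q plotting functions for the same purpose.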

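This Q-Q plot was generated by fitting a multivariate linear regression model and plotting the residuals from the training data against the standard normal distribution. One can see from the curvature of the points that the residuals don't appear to be normal. This upward curvature denotes a positive skew in the residuals. Taking a residual as the observed value minus the predicted value, a positive skew typically means most residuals are small and negative while a few are large and positive, so the model slightly over-predicts most observations and substantially under-predicts a few, even on our training set.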

Just like any other graphical method for analyzing data, Q-Q plots have strengths and weaknesses, and one has to know when best to use them to get the most benefit. Q-Q plots are immensely beneficial when comparing two distributions (theoretical or empirical), as they show how location (mean), scale (standard deviation), and skewness compare between the two distributions. They're also extremely beneficial when assessing the residuals of a regression model, as shown previously.
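To make the location, scale, and skewness reading concrete, here is a small sketch on simulated data. The helper names `normal_quantiles` and `qq_line`, the seed, and the chosen distributions are assumptions for illustration. For a normal sample plotted against the standard normal, the slope of the line through the Q-Q points approximates the standard deviation and the intercept approximates the mean; for a positively skewed sample, both ends of the Q-Q plot sit above the line y=x, which is the upward curvature described earlier.

```python
import random
from statistics import NormalDist, fmean, pstdev


def normal_quantiles(n):
    """Standard normal quantiles at plotting positions (i + 0.5) / n."""
    return [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]


def qq_line(sample):
    """Least-squares line through the normal Q-Q points: (slope, intercept)."""
    observed = sorted(sample)
    theoretical = normal_quantiles(len(observed))
    t_bar, o_bar = fmean(theoretical), fmean(observed)
    sto = sum((t - t_bar) * (o - o_bar) for t, o in zip(theoretical, observed))
    stt = sum((t - t_bar) ** 2 for t in theoretical)
    slope = sto / stt
    return slope, o_bar - slope * t_bar


rng = random.Random(2)

# Location and scale: a normal sample with mean 5 and standard deviation 2.
shifted = [rng.gauss(5.0, 2.0) for _ in range(2000)]
slope, intercept = qq_line(shifted)
print(f"slope ~ scale: {slope:.2f}, intercept ~ location: {intercept:.2f}")

# Skewness: an exponential sample, standardized so only the shape differs.
skewed = [rng.expovariate(1.0) for _ in range(2000)]
mu, sigma = fmean(skewed), pstdev(skewed)
z = sorted((v - mu) / sigma for v in skewed)
theoretical = normal_quantiles(len(z))
lowest_gap = z[0] - theoretical[0]     # positive: left end above y = x
highest_gap = z[-1] - theoretical[-1]  # positive: right end above y = x
print(f"gap at lowest quantile: {lowest_gap:.2f}, at highest: {highest_gap:.2f}")
```

In other words, a straight Q-Q line that differs from y=x points to a difference in location or scale only, while curvature points to a difference in shape such as skewness.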

In my eyes, the biggest weakness of Q-Q plots is the steep initial learning curve, but luckily the Internet offers a trove of information; one of the most beneficial resources I found was a post on StackExchange. Beyond that, the other major issue with Q-Q plots is that there is some room for interpretation on whether your data lies close enough to the line y=x. One person's assessment will not always line up with another's, but after some practice Q-Q plots provide an immense benefit when quickly assessing data.
