Tom Kochuyt
2 min read · Apr 26, 2021


Even though you are not affecting the actual data, you are affecting its visualization, which creates the possibility of obscuring the actual signal in the plot, or of amplifying some of the noise to the point where it looks like a signal although it isn’t. In other words, don’t jitter just because you can, and when you do jitter, tread carefully.

Personally, I find the unjittered plot in this case already quite informative about the relation between the variables, so I would not jitter. The ‘columns’ are not really a problem; they reflect the measurement precision (cm). Adding jitter actually hides that fact about the data while creating the perception that the precision is higher than it actually is.

But I can imagine cases where jittering can be useful to make a plot easier to read. If you decide to add noise based on a normal distribution, don’t make an arbitrary choice for the standard deviation. Use the facts you have about the data to choose a value that minimizes the risk of disturbing the signal.

E.g. in this case you know that in the given dataset the precision of height is 1 cm. Given that, adding noise generated from a normal with mu = 0 and sigma = 2 means roughly 62% of your data points will be plotted at least 1 tick too low or too high on the x-axis. Roughly 32% will be plotted at least 2 ticks too high or too low, etc. In other words, your plot is not a faithful representation of your actual data.
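These fractions can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `fraction_shifted_at_least` is made up for illustration, and it assumes that "k ticks off" means the noise has an absolute value of at least k cm.

```python
import math

def fraction_shifted_at_least(k, sigma):
    """Two-sided tail P(|X| >= k) for noise X ~ Normal(0, sigma).

    Hypothetical helper, not from the original post. Uses the identity
    P(|Z| >= z) = erfc(z / sqrt(2)) for a standard normal Z.
    """
    return math.erfc(k / (sigma * math.sqrt(2)))

sigma = 2.0  # standard deviation of the jitter, in cm
for ticks in (1, 2, 3):
    share = fraction_shifted_at_least(ticks, sigma)
    print(f"shifted by at least {ticks} cm: {share:.1%}")
```

Rerunning the loop with a smaller `sigma` shows how quickly the share of displaced points drops.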

So how to choose a value for sigma that minimizes the distortion/shift?

Well, we know that roughly 99.7% of a normal distribution lies within 3 sigma of its mean (for a standard normal with mu = 0 and sigma = 1, that is between -3 and +3). So we could simply choose ‘precision / 3’ as sigma for our noise-generating normal. Then 3 sigma equals the precision, so almost all datapoints of h are plotted no more than 1 tick below or above h.

This is fine for a particular height h, but it does not take into account that datapoints for heights h-1 and h+1 are also jittered. This might create artificial groups of plot points: a mix of points belonging to h-1 and h on the left of h, and a mix of points belonging to h and h+1 on the right of h. To account for this and minimize the risk of mixing, you could choose ‘precision / 6’ as sigma. Then 3 sigma is half a tick, so almost all points of h stay between h - 0.5 and h + 0.5 and do not cross over into the range of a neighbouring height.
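To make the difference between the two choices concrete, here is a small simulation. It is only a sketch under stated assumptions: the helpers `jitter` and `fraction_within` are hypothetical, the heights are made-up values at a single tick, and the standard library `random` module stands in for whatever plotting tool you actually use.

```python
import random

def jitter(values, precision=1.0, divisor=6, seed=None):
    """Add Normal(0, precision / divisor) noise to each value (hypothetical helper)."""
    rng = random.Random(seed)
    sigma = precision / divisor
    return [v + rng.gauss(0.0, sigma) for v in values]

def fraction_within(original, jittered, limit):
    """Share of points whose jittered position is at most `limit` from the original."""
    shifts = [abs(j - o) for o, j in zip(original, jittered)]
    return sum(s <= limit for s in shifts) / len(shifts)

heights = [170] * 10_000  # made-up data: many measurements at the same whole cm

for divisor in (3, 6):
    plotted = jitter(heights, precision=1.0, divisor=divisor, seed=1)
    print(
        f"sigma = precision / {divisor}: "
        f"{fraction_within(heights, plotted, 1.0):.1%} within 1 tick, "
        f"{fraction_within(heights, plotted, 0.5):.1%} within half a tick"
    )
```

With ‘precision / 3’ a noticeable share of points ends up more than half a tick away from h, which is where they start to mix with the neighbouring columns. With ‘precision / 6’ almost none do.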
