sPlot: a technique to reconstruct components of a mixture
This post explains what an sPlot is. This well-known method was recently added to the hep_ml library.
An sPlot is a way to reconstruct features of mixture components based on known properties of distributions. This method is frequently used in High Energy Physics.
Simple example of sPlot
Let’s start with a simple (and not very useful in practice) example.
Assume we have two types of particles (say, electrons and positrons).
The distribution of some characteristic is different for them (let this be the px momentum projection).
Observed distributions
The picture above shows what these distributions should look like, but because classification is imperfect we will observe a different picture. As we’ll see later, this matters.
Let’s assume that each particle is classified correctly with a probability of 80% (and that px is not used during classification).
When we look at the distribution of px for particles which were classified as electrons or positrons, we see that the distributions are distorted.
We have lost the original shapes of the distributions.
Applying sWeights
We can think of it in the following way: there are 2 bins. The first bin contains 80% of the electrons and 20% of the positrons, and vice versa for the second bin.
To reconstruct the initial distribution, we can plot a histogram where each event from the first bin has weight 0.8 and each event from the second bin has weight -0.2. These numbers are called sWeights.
In other words, let’s say we had 8000 $e^{-}$ + 2000 $e^{+}$ in the first bin and 8000 $e^{+}$ + 2000 $e^{-}$ in the second ($e^-, e^+$ denote electrons and positrons). After summing with the introduced sWeights:
$ (8000 e^{-} + 2000 e^{+}) \times 0.8 + (8000 e^{+} + 2000 e^{-}) \times (-0.2) = 6000 e^{-} $
The positrons with positive and negative weights cancel each other, and we are left with “pure electrons”.
For the moment we ignore the normalization of the sWeights (it plays no role when we only want to reconstruct the shape of a distribution).
Compare
Let’s compare the reconstructed distribution for electrons with the original:
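A minimal sketch of this comparison, using NumPy only. The Gaussian px shapes, the sample size, and the binning are illustrative assumptions, not values from the post; only the 80% classification accuracy and the weights 0.8 and -0.2 come from the text above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical px shapes: electrons and positrons follow different Gaussians.
px_electron = rng.normal(loc=-1.0, scale=1.0, size=n)
px_positron = rng.normal(loc=+1.0, scale=1.0, size=n)

px = np.concatenate([px_electron, px_positron])
is_electron = np.concatenate([np.ones(n, dtype=bool), np.zeros(n, dtype=bool)])

# Each particle is classified correctly with probability 0.8, independently of px.
correct = rng.random(2 * n) < 0.8
classified_as_electron = np.where(correct, is_electron, ~is_electron)

# sWeights for the electron component: +0.8 in the "electron" bin, -0.2 in the other.
sweights_electron = np.where(classified_as_electron, 0.8, -0.2)

bins = np.linspace(-5, 5, 41)
true_hist, _ = np.histogram(px_electron, bins=bins)
distorted_hist, _ = np.histogram(px[classified_as_electron], bins=bins)
reco_hist, _ = np.histogram(px, bins=bins, weights=sweights_electron)


def shape(hist):
    """Normalize a histogram to unit sum so that only shapes are compared."""
    return hist / hist.sum()


# Largest per-bin deviation from the true electron shape.
err_distorted = np.abs(shape(distorted_hist) - shape(true_hist)).max()
err_reco = np.abs(shape(reco_hist) - shape(true_hist)).max()
print(f"distorted: {err_distorted:.4f}, reconstructed: {err_reco:.4f}")
```

The sWeighted histogram should follow the true electron shape up to statistical noise, while the histogram of particles merely classified as electrons keeps a visible positron admixture.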
More complex case
When we have only two ‘bins’, things are simple and straightforward.
But when there are more than two bins, the solution is not unique: many combinations of sWeights are appropriate (as in the example below with 3 bins). Which one should we choose?
Things are even more complex in practice: we do not have bins, but continuous distributions (which can be treated as many bins).
Typically this is a distribution over mass. By fitting the mass distribution we are able to split the mixture into two parts: the signal channel and everything else.
Building sPlot over mass
Let’s show how this works. First we generate two fake distributions (signal and background) with 2 variables: mass and momentum.
Of course, we don’t have labels telling us which events are signal and which are background.
And we observe the mixture of two distributions:
We have no information about real labels
But we know a priori that the background follows an exponential distribution and the signal a Gaussian (more complex models are met in practice, but the idea is the same).
After fitting the mixture (let me skip this process), we get the following result:
Fitting doesn’t give us information about real labels
But it gives information about probabilities, which allows us to estimate the number of signal and background events within each bin.
We won’t use bins; instead, for each event we’ll compute the probability that it is signal or background (this probability is computed from the mass; make sure you see the connection with the previous plot):
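As a rough sketch of this step: given the fitted yields and shapes, the probability that an event is signal is the signal density at its mass divided by the total density. All numbers below (yields, peak position, width, exponential scale) are hypothetical stand-ins for fit results, and the true generation parameters are used in place of an actual fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical yields and shape parameters, standing in for the fit results.
n_sig, n_bkg = 3000, 7000
mean, sigma = 4.0, 0.5   # Gaussian signal peak
scale = 4.0              # exponential background: pdf(m) = exp(-m / scale) / scale

mass = np.concatenate([
    rng.normal(mean, sigma, n_sig),
    rng.exponential(scale, n_bkg),
])


def gaussian_pdf(m, mean, sigma):
    return np.exp(-0.5 * ((m - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def exponential_pdf(m, scale):
    return np.where(m >= 0, np.exp(-m / scale) / scale, 0.0)


# Expected densities of signal and background events at each observed mass.
dens_sig = n_sig * gaussian_pdf(mass, mean, sigma)
dens_bkg = n_bkg * exponential_pdf(mass, scale)

# Per-event probabilities; they sum to 1 for every event.
p_sig = dens_sig / (dens_sig + dens_bkg)
p_bkg = 1.0 - p_sig
```

Summed over all events, `p_sig` should come out close to the signal yield, which is a quick sanity check that the model and the data agree.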
Applying sPlot
sPlot converts probabilities to sWeights. We use the implementation from hep_ml:
As you can see, there are also negative sWeights, which are needed to compensate for the contributions of the other class (remember that in the first example we needed negative weights).
Using sWeights to reconstruct initial distribution
Let’s check that we achieved our goal and can now reconstruct the momentum distribution for signal and background using sWeights:
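The whole chain can be sketched end to end with NumPy. This is not the hep_ml implementation: the sWeights are obtained by solving the linear system derived later in this post, the toy distributions are hypothetical, and the true shape parameters stand in for a mass fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sig, n_bkg = 20_000, 30_000

# Hypothetical toy data: mass and momentum are independent within each class.
sig_mass = rng.normal(4.0, 0.5, n_sig)
bkg_mass = rng.exponential(4.0, n_bkg)
sig_p = rng.normal(5.0, 1.0, n_sig)
bkg_p = rng.normal(3.0, 1.0, n_bkg)
mass = np.concatenate([sig_mass, bkg_mass])
p = np.concatenate([sig_p, bkg_p])

# Probabilities from the mass model (true parameters stand in for a fit).
f_sig = n_sig * np.exp(-0.5 * ((mass - 4.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
f_bkg = n_bkg * np.where(mass >= 0, np.exp(-mass / 4.0) / 4.0, 0.0)
probs = np.column_stack([f_sig, f_bkg]) / (f_sig + f_bkg)[:, None]

# sWeights as linear combinations of the probabilities.
# Column k of the result satisfies sum_x sw_k(x) p_i(x) = N_k if i == k, else 0.
V = probs.T @ probs
yields = probs.sum(axis=0)
sweights = probs @ np.linalg.solve(V, np.diag(yields))

# Reconstruct the momentum histogram of the signal using only mass information.
bins = np.linspace(0, 9, 31)
reco_sig, _ = np.histogram(p, bins=bins, weights=sweights[:, 0])
true_sig, _ = np.histogram(sig_p, bins=bins)

# Weighted means are a compact check of the reconstruction.
reco_mean_sig = np.average(p, weights=sweights[:, 0])
reco_mean_bkg = np.average(p, weights=sweights[:, 1])
print(reco_mean_sig, reco_mean_bkg)
```

The labels are never used when computing the weights; `sig_p` appears only to build the reference histogram.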
Important requirement of sPlot
The reconstructed variable (i.e. $p$) and the splotted variable (i.e. mass) shall be statistically independent within each class.
Read the line above again: the requirement is per class. In the mixture as a whole, the reconstructed and splotted variables are correlated!
As a demonstration of why this is important, let’s use sWeights to reconstruct the mass (obviously, the mass is correlated with the mass):
Derivation of sWeights
Now that we have seen how this works, let’s derive the formula for sWeights.
The only information we have from fitting over the mass is $ p_s(x) $ and $ p_b(x)$, the probabilities of event $x$ to be signal and background.
Our main goal is to correctly reconstruct a histogram. Let’s reconstruct the number of signal events in a particular bin. We introduce the unknowns $p_s$ and $p_b$ (without an argument): the probabilities that a signal or a background event falls into that bin.
(Since mass and reconstructed variable are statistically independent for each class, $p_s$ and $p_b$ do not depend on mass.)
The expected number of signal events in the bin is obviously $p_s N_s$, where $N_s$ is the total number of signal events, available from the fit.
Let’s also introduce the random variable $1_{x \in bin}$, which is 1 iff event $x$ lies in the selected bin.
The estimate for the number of signal events in the bin is:
$ X = \sum_x sw_s(x) \; 1_{x \in bin} $
where $sw_s(x)$ are the sPlot weights, which are yet to be found.
First main property of sWeights
Property 1. We expect the estimate to be unbiased:
$ \mathbb{E} \, X = p_s N_s $
Corollary. Let’s understand what this means for the sPlot weights.
$ p_s N_s = \mathbb{E} \, X = \sum_x sw_s(x) \; \mathbb{E} \, 1_{x \in bin} = \sum_x sw_s(x) \; (p_s p_s(x) + p_b p_b(x)) $
In the line above I used the assumption that variables are statistically independent for each class.
Since the previous equation should hold for all possible $p_s$ and $p_b$, we get two equalities:
$ p_s N_s = \sum_x sw_s(x) \; p_s p_s(x) $
$ 0 = \sum_x sw_s(x) \; p_b p_b(x) $
After reduction:
$ N_s = \sum_x sw_s(x) \; p_s(x) $
$ 0 = \sum_x sw_s(x) \; p_b(x) $
This way we can guarantee that the mean contribution of background is 0 (the expectation is zero, but the observed number will not be zero due to statistical deviation), while the expected contribution of signal is $N_s$.
Under assumption of linearity:
Assume that the sPlot weight can be computed as a linear combination of the conditional probabilities:
$ sw_s(x) = a_1 p_b(x) + a_2 p_s(x)$
We can easily find these coefficients. First let’s rewrite our system:
$ \sum_x (a_1 p_b(x) + a_2 p_s(x)) \; p_b(x) = 0$
$ \sum_x (a_1 p_b(x) + a_2 p_s(x)) \; p_s(x) = N_s$
$ a_1 V_{bb} + a_2 V_{bs} = 0$
$ a_1 V_{sb} + a_2 V_{ss} = N_s$
where $V_{ss} = \sum_x p_s(x) \; p_s(x) $, $V_{bs} = V_{sb} = \sum_x p_s(x) \; p_b(x)$, $V_{bb} = \sum_x p_b(x) \; p_b(x)$.
Having solved this linear system, we get the needed coefficients (the same as those in the paper).
NB. There is a small difference between the $V$ matrix I use and the $V$ matrix in the paper.
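The two-class system above is small enough to solve directly. The sketch below does so with NumPy and checks both constraints numerically; the function name `signal_sweights` and the uniformly random probabilities are illustrative assumptions, not part of hep_ml.

```python
import numpy as np


def signal_sweights(p_s, p_b, n_sig):
    """Solve the 2x2 system for a_1, a_2 and return sw_s(x) = a_1 p_b(x) + a_2 p_s(x)."""
    V = np.array([[p_b @ p_b, p_b @ p_s],
                  [p_s @ p_b, p_s @ p_s]])
    # Row 1: a_1 V_bb + a_2 V_bs = 0;  row 2: a_1 V_sb + a_2 V_ss = N_s.
    a_1, a_2 = np.linalg.solve(V, np.array([0.0, n_sig]))
    return a_1 * p_b + a_2 * p_s


# Hypothetical per-event probabilities (in practice they come from the mass fit).
rng = np.random.default_rng(3)
p_s = rng.uniform(0, 1, 1000)
p_b = 1 - p_s
n_sig = p_s.sum()

sw_s = signal_sweights(p_s, p_b, n_sig)
print(sw_s @ p_b, sw_s @ p_s, n_sig)
```

Because the probabilities here sum to 1 for every event, adding the two constraints shows that the signal sWeights themselves sum to $N_s$.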
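Minimization of variance
The previous part allows one to get the correct result, but it still does not explain the reason for linearity.
Apart from having the correct mean, we should also minimize the variance of any reconstructed variable. Let’s try to optimize it (treating events as independent):
$ \mathbb{V} \, X = \sum_x sw_s(x)^2 \; \mathbb{V} \, 1_{x \in bin} = \sum_x sw_s(x)^2 \; (p_s p_s(x) + p_b p_b(x)) (1 - p_s p_s(x) - p_b p_b(x)) $
A bit complex, isn’t it? Instead of optimizing such a complex expression (which is individual for each bin), let’s minimize its uniform upper estimate:
$ \mathbb{V} \, X = \sum_x sw_s(x)^2 \; \mathbb{V} \, 1_{x \in bin} \leq \sum_x sw_s(x)^2 $
So if we are going to minimize this upper estimate, we should solve the following optimization problem with constraints: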
$\sum_x sw_s(x)^2 \to \min $
$\sum_x sw_s(x) \; p_b(x) = 0$
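$\sum_x sw_s(x) \; p_s(x) = N_s$
Let’s write the Lagrangian of the optimization problem:
$ \mathcal{L} = \sum_x sw_s(x)^2 + \lambda_1 \left[ \sum_x sw_s(x) \; p_b(x) \right] + \lambda_2 \left[ \sum_x sw_s(x) \; p_s(x) - N_s \right] $
After taking the derivative with respect to $ sw_s(x) $ we get the equality:
$ 0 = \frac{\partial \mathcal{L}}{\partial \, sw_s(x)} = 2 sw_s(x) + \lambda_1 p_b(x) + \lambda_2 p_s(x) $
which holds for every $x$. Thus, after renaming for convenience $a_1 = - \lambda_1 / 2$, $a_2 = - \lambda_2 / 2$, we get the needed linear dependency $ sw_s(x) = a_1 p_b(x) + a_2 p_s(x) $.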
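This result can be checked numerically: the minimum-norm solution of the two linear constraints, computed with the Moore-Penrose pseudoinverse, should coincide with the linear combination of $p_b$ and $p_s$. The probabilities below are hypothetical random numbers used only for the check.

```python
import numpy as np

rng = np.random.default_rng(7)
p_s = rng.uniform(0, 1, 500)
p_b = 1 - p_s
n_sig = p_s.sum()

# The two constraints in matrix form: A @ sw = b.
A = np.vstack([p_b, p_s])
b = np.array([0.0, n_sig])

# Minimum-norm solution of the underdetermined system.
sw_min_norm = np.linalg.pinv(A) @ b

# Linear-combination solution sw = a_1 p_b + a_2 p_s.
a = np.linalg.solve(A @ A.T, b)
sw_linear = a @ A

# Any other feasible solution differs by a vector orthogonal to p_b and p_s,
# which can only increase the sum of squares.
noise = rng.normal(size=500)
null_part = noise - np.linalg.pinv(A) @ (A @ noise)  # projection onto the null space of A
sw_other = sw_linear + null_part

print(np.sum(sw_linear ** 2), np.sum(sw_other ** 2))
```

Both weight vectors satisfy the constraints, but the one built from the probabilities has the smaller sum of squares, in line with the Lagrangian argument.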
Statistical independence
The main assumption we used here is that, for each class, the distribution of the reconstructed variable is identical inside each bin.
In other words, we stated that there is no correlation between the index of the bin and the reconstructed variable. Remember that a bin corresponds to some interval in mass, so finally we get:
the reconstructed variable shall not be correlated with the mass variable (or any other splotted variable) within each class.
Conclusion
- sPlot allows reconstructing the distributions of some variables for each component of a mixture.
- the only information used is the probabilities taken from a fit over some variable. In fact, any probability estimates will do.
- the source of probabilities should be statistically independent of the reconstructed variable (for each class!).
- the mixture may contain more than 2 classes (this is supported by hep_ml.splot as well)
Sources and code
The code for this post may be found in the hep_ml repository.
Links
A very close explanation was written by Michael Schmelling.
Thanks to Konstantin Schubert for correcting this post.