Data manipulation with numpy: tips and tricks, part 1¶
Some non-obvious examples of what you can do with numpy are collected here.
The examples mostly come from machine learning, but they will be useful for anyone doing number crunching in Python.
from __future__ import print_function # for python 2 & python 3 compatibility
%matplotlib inline
import numpy as np
1. numpy.argsort¶
Sorting values of one array according to another¶
Say we want to order people according to their ages, keeping their heights aligned with them.
ages = np.random.randint(low=30, high=60, size=10)
heights = np.random.randint(low=150, high=210, size=10)
print(ages)
print(heights)
sorter = np.argsort(ages)
print(ages[sorter])
print(heights[sorter])
Once you have computed the permutation, you can apply it many times; this is fast.
To solve this problem people frequently use sorted(zip(ages, heights)), which is much slower (10-20 times slower on large arrays).
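For comparison, here is a minimal sketch of the slower pure-Python approach next to the argsort-based one (it reuses the ages and heights arrays from above):
# the slower pure-Python alternative: sort (age, height) pairs, then unzip them
pairs = sorted(zip(ages, heights))
ages_slow = np.array([age for age, height in pairs])
heights_slow = np.array([height for age, height in pairs])
# the argsort-based version; ages always agree, heights may differ only where ages are tied
print(np.array_equal(ages_slow, ages[np.argsort(ages)]))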
Computing inverse of permutation¶
Permutations in numpy are simply arrays:
permutation = np.random.permutation(10)
original = np.array(list('abcdefghij'))
print(permutation)
print(original)
print(original[permutation])
Inverse permutation is computed using numpy.argsort (again!)
inverse_permutation = np.argsort(permutation)
print(original[permutation][inverse_permutation])
This is true because of two facts:
- the indexing operation is associative (a remarkably simple and useful fact): a[b][c] = a[b[c]], provided that a, b, c are one-dimensional arrays
- permutation[inverse_permutation] is the identity permutation:
permutation[inverse_permutation]
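A quick sanity check of the first fact, sketched with arbitrary index arrays:
# indexing with integer arrays is associative: a[b][c] equals a[b[c]]
a = np.random.random(10)
b = np.random.randint(0, 10, size=7)
c = np.random.randint(0, 7, size=5)
print(np.array_equal(a[b][c], a[b[c]]))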
Even faster inverse permutation¶
As was said, numpy.argsort returns the inverse permutation, but it takes $O(n \log n)$ time, while computing the inverse permutation should take only $O(n)$.
This optimal way can be written in numpy:
print(np.argsort(permutation))
inverse_permutation = np.empty(len(permutation), dtype=int)
inverse_permutation[permutation] = np.arange(len(permutation))
print(inverse_permutation)
Computing order of elements in array¶
Frequently it is important to compute the order (rank) of each value in an array.
In other words, for each element of the array we want to find the number of elements smaller than it.
data = np.random.random(10)
print(data)
print(np.argsort(np.argsort(data)))
NB: there is a scipy function that does the same, but it's more general and faster, so prefer using it:
from scipy.stats import rankdata
rankdata(data) - 1
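As a quick cross-check (a sketch; the random data here has no ties), the ordinal variant of rankdata matches the double argsort:
# rankdata with method='ordinal' gives 1-based ranks, so shift by 1 to compare
print(np.array_equal(np.argsort(np.argsort(data)), rankdata(data, method='ordinal') - 1))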
IronTransform (flattener of distribution)¶
Sometimes you need to write a monotonic transformation that turns some distribution into a uniform one.
This method is useful to compare distributions or to work with distributions with heavy tails or strange shape.
class IronTransform:
    def fit(self, data, weights):
        weights = weights / weights.sum()
        sorter = np.argsort(data)
        self.x = data[sorter]
        self.y = np.cumsum(weights[sorter])
        return self

    def transform(self, data):
        return np.interp(data, self.x, self.y)
Let's apply the iron transformation to some data:
sig_pred = np.random.normal(size=10000) + 1
bck_pred = np.random.normal(size=10000) - 1
from matplotlib import pyplot as plt
plt.figure()
plt.hist(sig_pred, bins=30, alpha=0.5)
plt.hist(bck_pred, bins=30, alpha=0.5)
plt.show()
Now we can see that IronTransform was indeed able to turn the signal distribution into a uniform one:
iron = IronTransform().fit(sig_pred, weights=np.ones(len(sig_pred)))
plt.figure()
plt.hist(iron.transform(sig_pred), bins=30, alpha=0.5)
plt.hist(iron.transform(bck_pred), bins=30, alpha=0.5)
plt.show()
How IronTransform works¶
To turn any distribution into a uniform one, you should use the mapping $x \to F(x)$, where $F(x)$ is the cumulative distribution function.
In other words, for each point $x$ we compute the fraction of the distribution lying to its left.
This was done by summing all weights up to that point using numpy.cumsum (we also had to normalize the weights).
To apply the learned mapping, linear interpolation (numpy.interp) was used.
We can visualize this mapping:
plt.plot(iron.x, iron.y)
IronTransform and histogram equalization¶
The same method is applied in computer vision to 'improve' low-contrast images; there it is called histogram equalization.
from PIL import Image
image = np.array(Image.open('AquaTermi_lowcontrast.jpg').convert('L'))
flat_image = image.flatten()
result = IronTransform().fit(flat_image, weights=np.ones(len(flat_image))).transform(image)
plt.figure(figsize=[14, 7])
plt.subplot(121), plt.imshow(image, cmap='gray'), plt.title('original')
plt.subplot(122), plt.imshow(result, cmap='gray'), plt.title('after histogram equalization')
plt.show()
Also worth mentioning: to fight very small / large numbers, use numpy.clip. Simple and very handy:
np.clip([-100, -10, 0, 10, 100], a_min=-15, a_max=15)
x = np.arange(-5, 5)
print(x)
print(x.clip(0))
2. Broadcasting, numpy.newaxis¶
Broadcasting is a very useful trick that simplifies many operations.
Weighted covariance matrix¶
numpy has a cov function, but it doesn't support sample weights. Let's write our own weighted covariance matrix.
Let $X_{ij}$ be the original data, where $i$ indexes samples and $j$ indexes features. Then $COV_{jk} = \sum_i X_{ij} w_i X_{ik}$.
data = np.random.normal(size=[100, 5])
weights = np.random.random(100)
def covariation(data, weights):
    weights = weights / weights.sum()
    return data.T.dot(weights[:, np.newaxis] * data)
np.cov(data.T)
covariation(data, np.ones(len(data)))
covariation(data, weights)
An alternative way to do this is via numpy.einsum:
np.einsum('ij, ik, i -> jk', data, data, weights / weights.sum())
Pearson correlation¶
One more example of broadcasting: computing the Pearson correlation coefficient.
np.corrcoef(data.T)
def my_corrcoef(data, weights):
    data = data - np.average(data, axis=0, weights=weights)
    shifted_cov = covariation(data, weights)
    cov_diag = np.diag(shifted_cov)
    return shifted_cov / np.sqrt(cov_diag[:, np.newaxis] * cov_diag[np.newaxis, :])
my_corrcoef(data, np.ones(len(data)))
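With uniform weights the result should coincide with numpy's built-in corrcoef; a quick sanity check:
# the 1/n normalization cancels out in the ratio, so both agree up to numerical precision
print(np.allclose(my_corrcoef(data, np.ones(len(data))), np.corrcoef(data.T)))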
Pairwise distances between vectors¶
We're given a set of vectors (say, 1000 of them). The problem is to compute their pairwise distances.
X = np.random.normal(size=[1000, 100])
distances = ((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2).sum(axis=2) ** 0.5
Also, since algebra is your friend, you can do this much faster: $$\|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2 (x_i, x_j)$$ so dot products are the only thing you need.
products = X.dot(X.T)
distances2 = products.diagonal()[:, np.newaxis] + products.diagonal()[np.newaxis, :] - 2 * products
distances2 **= 0.5
... but keep in mind that there is sklearn.metrics.pairwise, which does this for you and supports different options.
from sklearn.metrics.pairwise import pairwise_distances
distances_sklearn = pairwise_distances(X)
np.allclose(distances, distances_sklearn), np.allclose(distances2, distances_sklearn)
3. numpy.unique, numpy.searchsorted¶
Converting categories to small integers¶
Although categories are easier to read, operating on them is much faster after you represent them as small integers
(for instance, in the range [0, n_categories - 1]).
First generate some categories:
from itertools import combinations
possible_categories = list(map(lambda x: x[0] + x[1], combinations('abcdefghijklmn', 2)))
categories = np.random.choice(possible_categories, size=10000)
print(categories)
Now we use numpy.unique to convert them to numbers (pay attention to return_inverse=True):
unique_categories, new_categories = np.unique(categories, return_inverse=True)
print(unique_categories)
print(new_categories)
We can also reconstruct the old categories if needed:
print(categories)
print(unique_categories[new_categories])
Mapping new categories¶
When we need to map new categories using the same lookup table, we can use numpy.searchsorted:
categories2 = np.random.choice(possible_categories, size=10000)
new_categories2 = np.searchsorted(unique_categories, categories2)
print(categories2)
print(unique_categories[new_categories2])
Checking values present in the lookup¶
Usually, we use a Python set to perform many checks of whether an element is present in an array. But the numpy approach (numpy.in1d against the sorted output of numpy.unique) is a better option: it works dozens of times faster.
np.in1d(['ab', 'ac', 'something new'], unique_categories)
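A rough timing sketch of this comparison (the exact speedup will depend on the data and array sizes):
lookup_set = set(unique_categories)
%timeit np.array([c in lookup_set for c in categories2])
%timeit np.in1d(categories2, unique_categories)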
Handling missing categories¶
When a new category appears in the test data, this is always trouble. Possible solutions:
- map it randomly to one of the existing categories (this is, e.g., what the hashing trick does)
- introduce a special 'unknown' category to which all new values are mapped; a special value may then be chosen for it. This approach is more general, but requires more effort.
Below you can find a transformer for the second approach:
class CategoryMapper:
    def fit(self, categories):
        self.lookup = np.unique(categories)
        return self

    def transform(self, categories):
        """Converts categories to numbers, 0 is reserved for new values (not present in fitted data)"""
        return (np.searchsorted(self.lookup, categories) + 1) * np.in1d(categories, self.lookup)
CategoryMapper().fit(categories).transform([unique_categories[0], 'abc'])
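A small usage sketch: assuming a made-up '<unknown>' placeholder, the reserved 0 code can be decoded back along with the known categories:
mapper = CategoryMapper().fit(categories)
codes = mapper.transform([unique_categories[0], 'something new', unique_categories[5]])
# prepend the placeholder so that code 0 decodes to '<unknown>' and code i + 1 to lookup[i]
decode_table = np.concatenate([['<unknown>'], mapper.lookup])
print(codes, decode_table[codes])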
4. numpy.percentile, numpy.bincount¶
Splitting data into bins¶
Splitting data into bins is frequently necessary. The simplest way, done e.g. by numpy.histogram, is to split the region into bins of equal width.
This usually works poorly for distributions with long tails (or strange shapes) and requires an additional transformation to avoid empty or overly large bins.
In most cases, however, creating bins that contain an equal number of events gives more convenient and stable results.
Selecting the edges of the bins can be done with numpy.percentile.
x = np.random.exponential(size=20000)
_ = plt.hist(x, bins=40)
edges = np.percentile(x, np.linspace(0, 100, 20))
_ = plt.hist(x, bins=edges, density=True)
The number of events within each bin is now equal.
Defining the index of bin¶
Use numpy.searchsorted to compute the index of the bin for each value in x (numpy.digitize is the other option; it does the same).
numpy.searchsorted assumes that the first array is sorted and uses binary search, so it is efficient even for a large number of bins.
np.searchsorted(edges, x)
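For reference, a sketch checking the correspondence with numpy.digitize: for increasing bins, digitize with right=True matches searchsorted's default side='left'.
# both return, for each value, the index of the first edge that is >= that value
print(np.array_equal(np.searchsorted(edges, x), np.digitize(x, edges, right=True)))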
Counter¶
Need to know the number of occurrences of each category? Use numpy.bincount to compute the statistics:
new_categories
This will compute how many times each category is present in the data:
np.bincount(new_categories)
The same can be done with collections.Counter (which is much slower):
from collections import Counter
np.array(sorted(Counter(new_categories).items()))[:, 1]
Replace category with number of occurrences¶
There are several tricks that are important when processing categorical features. The best known is one-hot encoding, but let's talk about others.
One trick is to replace the category in each event (which cannot be processed directly by most algorithms) with the number of times it appears in the data.
This can be written as a numpy one-liner:
np.bincount(new_categories)[new_categories]
Averaging predictions over category¶
Another important trick: replace the category with the average prediction over events from that category.
This is trickier, since we need to pass weights to numpy.bincount:
predictions = np.random.normal(size=len(new_categories))
means_over_category = np.bincount(new_categories, weights=predictions) / np.bincount(new_categories)
means_over_category[new_categories]
Variation of predictions over category¶
The same trick can be applied to measure the variance of predictions within each category after involving some math: $$ \mathbb{V}x = \mathbb{E}x^2 - (\mathbb{E}x)^2 $$
means_of_squares_over_category = np.bincount(new_categories, weights=predictions**2) / np.bincount(new_categories)
var_over_category = means_of_squares_over_category - means_over_category ** 2
var_over_category[new_categories]
You can also assign weights to events and compute the weighted mean / weighted variance over each category using the same functions, as sketched below.
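A minimal sketch of the weighted mean over each category (event_weights here is made up for illustration):
# hypothetical per-event weights
event_weights = np.random.exponential(size=len(new_categories))
# weighted mean per category: sum(w * prediction) / sum(w)
weighted_means = (np.bincount(new_categories, weights=event_weights * predictions)
                  / np.bincount(new_categories, weights=event_weights))
weighted_means[new_categories]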
5. numpy.ufunc.at¶
To proceed further, we need to know that there is a class of binary (and unary) functions in numpy: ufuncs.
Examples: numpy.add, numpy.subtract, numpy.multiply, numpy.maximum, numpy.minimum.
In the following example we build a new column with the maximal prediction over each category using ufunc.at:
max_over_category = np.zeros(new_categories.max() + 1) - np.inf
np.maximum.at(max_over_category, new_categories, predictions)
print(max_over_category[new_categories])
Important note: numpy.add.at and numpy.bincount are very similar, but the latter is dozens of times faster. Always prefer it to numpy.add.at.
%timeit np.add.at(max_over_category, new_categories, predictions)
%timeit np.bincount(new_categories, weights=predictions)
Multidimensional statistics: numpy.add.at and smart indexing¶
When you need to compute the number of times each combination occurred, there are two ways:
- convert the pair of variables into a new variable, e.g. via feature1 * (numpy.max(feature2) + 1) + feature2
- create a multidimensional table
I prefer the first; the code below demonstrates the second approach and its benefits (a sketch of the first approach follows right after it).
first_category = np.random.randint(0, 100, len(new_categories))
second_category = np.random.randint(0, 100, len(new_categories))
counters = np.zeros([first_category.max() + 1, second_category.max() + 1])
np.add.at(counters, [first_category, second_category], 1)
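For comparison, a sketch of the first approach mentioned above: encode the pair as a single integer category and count it with numpy.bincount; the counts coincide with the table built by numpy.add.at.
combined = first_category * (second_category.max() + 1) + second_category
counters_flat = np.bincount(combined, minlength=counters.size)
print(np.allclose(counters_flat.reshape(counters.shape), counters))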
Now for some useful things we can do with the collected statistics. The first is to create a feature, for which we use fancy indexing:
counters[first_category, second_category]
We can also compute various other statistics, like:
# occurrences of the second category
counters.sum(axis=0)[second_category]
# maximal number of co-occurrences of the second category with any single value of the first category
counters.max(axis=0)[second_category]
# number of occurrences of the second category together with a fixed first category (here, 42):
counters[42, second_category]
- You are not limited to numpy.add; remember that there are other ufuncs.
- Not enough space to keep the whole matrix? Use scipy.sparse if there are many zero elements (see the sketch below). Alternatively, you can use matrix decompositions to keep an approximation of the matrix and compute the result on the fly.
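A sketch of the scipy.sparse option mentioned above; duplicate (row, column) pairs are summed when the COO matrix is converted, so it reproduces the same table:
from scipy import sparse
sparse_counters = sparse.coo_matrix(
    (np.ones(len(first_category)), (first_category, second_category)),
    shape=counters.shape).tocsr()
print(np.allclose(sparse_counters.toarray(), counters))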
Computing ROC curve¶
As an important exercise, let's implement the computation of the ROC curve (an interactive demo of the ROC curve may be found here).
We will write the same thing as sklearn.metrics.roc_curve (actually, a simplified version of it).
The ROC curve computation takes three arrays: labels (binary, 0 or 1), predictions (real-valued) and sample weights.
def roc_curve(labels, predictions, sample_weight):
    sig_weights = sample_weight * (labels == 1)
    bck_weights = sample_weight * (labels == 0)
    thresholds, predictions = np.unique(predictions, return_inverse=True)
    tpr = np.bincount(predictions, weights=sig_weights)[::-1].cumsum()
    fpr = np.bincount(predictions, weights=bck_weights)[::-1].cumsum()
    tpr /= tpr[-1]
    fpr /= fpr[-1]
    return fpr, tpr, thresholds[::-1]
np.random.seed(1)
size = 1000
labels = (np.random.random(size) > 0.5) * 1
predictions = np.random.normal(size=size) + labels
weights = np.random.exponential(size=size)
fpr1, tpr1, thr1 = roc_curve(labels, predictions, weights)
Let's check what the result looks like:
plt.figure(figsize=[12, 5])
plt.subplot(121)
plt.hist(predictions[labels == 0], bins=20, alpha=0.5)
plt.hist(predictions[labels == 1], bins=20, alpha=0.5)
plt.subplot(122)
plt.plot(fpr1, tpr1)
plt.xlabel('FPR'), plt.ylabel('TPR')
# sometimes sklearn adds one more element to all arrays.
# I have no idea why or when this is needed, but in that case all answers turn to False
from sklearn.metrics import roc_curve as sklearn_roc_curve
fpr2, tpr2, thr2 = sklearn_roc_curve(labels, predictions, sample_weight=weights)
np.allclose(fpr1, fpr2), np.allclose(tpr1, tpr2), np.allclose(thr1, thr2)
Weighted Kolmogorov-Smirnov distance¶
The ROC curve is an important object in itself and is frequently used in ML to compare distributions, but the magic is that it contains all order-based information about the original distributions (order-based information = information which is preserved under monotonic transformations).
Next we compute the KS distance based on the ROC curve:
$$ KS = \max_x \left| F_1(x) - F_2(x) \right| $$
def ks_2samp(distribution1, distribution2, weights1, weights2):
    labels = np.array([0] * len(distribution1) + [1] * len(distribution2))
    predictions = np.concatenate([distribution1, distribution2])
    weights = np.concatenate([weights1, weights2])
    fpr, tpr, _ = roc_curve(labels, predictions, sample_weight=weights)
    return np.max(np.abs(fpr - tpr))
data1 = np.random.normal(loc=0, size=size)
data2 = np.random.normal(loc=1, size=size)
weights1 = np.random.exponential(size=size)
weights2 = np.random.exponential(size=size)
Compare the results with the standard function:
from scipy.stats import ks_2samp as ks_2samp_scipy
print( ks_2samp(data1, data2, weights1=np.ones(size), weights2=np.ones(size)) )
print( ks_2samp_scipy(data1, data2)[0] )
An important point is that the function we wrote supports weights, while the scipy version doesn't; a usage sketch follows.
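A usage sketch of the weighted case, reusing the exponential weights generated above:
# the weighted KS distance; scipy's ks_2samp has no counterpart for this
print(ks_2samp(data1, data2, weights1=weights1, weights2=weights2))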
TODO: optimal F1, optimal accuracy based on ROC curve.
Conclusion¶
Many things can be written in numpy in an efficient way; moreover, the resulting code will be short and (surprise!) more readable.
But this skill requires a lot of practice.
See also¶
- Second part of post: numpy tips and tricks 2
- I used the example from histogram equalization with numpy
- Numpy log-likelihood benchmark
This post was written in IPython. You can download the notebook from the repository.