Data manipulation with numpy: tips and tricks, part 2
from __future__ import print_function
import numpy as np
Rolling window, strided tricks¶
When working with time series or images, you frequently need to perform operations on sliding windows.
The simplest case: taking the mean over a running window:
sequence = np.random.normal(size=10000) + np.arange(10000)
Doing this in pure python is a very bad idea:
def running_average_simple(seq, window=100):
    result = np.zeros(len(seq) - window + 1)
    for i in range(len(result)):
        result[i] = np.mean(seq[i:i + window])
    return result
running_average_simple(sequence)
A somewhat better approach is to use as_strided:
from numpy.lib.stride_tricks import as_strided
def running_average_strides(seq, window=100):
    stride = seq.strides[0]
    sequence_strides = as_strided(seq, shape=[len(seq) - window + 1, window], strides=[stride, stride])
    return sequence_strides.mean(axis=1)
running_average_strides(sequence)
From the computation side, as_strided does nothing: no copies and no computations, it only gives a new view of the data, which is two-dimensional this time.
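A tiny check (on a small made-up array) that the strided view indeed shares memory with the original:
small_seq = np.arange(10, dtype=float)
stride = small_seq.strides[0]
small_view = as_strided(small_seq, shape=[6, 5], strides=[stride, stride])
small_seq[0] = 100.
small_view[0, 0]   # -> 100.0: the view sees the change, nothing was copied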
numpy.cumsum¶
However, the right way to compute the mean over a rolling window is using numpy.cumsum (this one is unbeatable in speed if the window is not small):
def running_average_cumsum(seq, window=100):
    s = np.insert(np.cumsum(seq), 0, [0])
    return (s[window:] - s[:-window]) * (1. / window)
running_average_cumsum(sequence)
See also for this purpose:
- scipy.signal.smooth
- pandas.rolling_mean and similar functions
Remark: numpy.cumsum is equivalent to numpy.add.accumulate, but there are also:
- numpy.maximum.accumulate, numpy.minimum.accumulate (running max and min)
- numpy.multiply.accumulate, which is equivalent to numpy.cumprod
Remark: for computing the rolling mean, numpy.cumsum is best; however, for other window statistics like min/max/percentile, use the strides trick (a sketch follows below).
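For instance, a rolling maximum can be sketched with the same as_strided view as above (reusing the sequence defined earlier):
def running_max_strides(seq, window=100):
    # the same (n - window + 1, window) view as in running_average_strides, still no copies
    stride = seq.strides[0]
    windows = as_strided(seq, shape=[len(seq) - window + 1, window], strides=[stride, stride])
    return windows.max(axis=1)
running_max_strides(sequence)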
Strides and training on sequences¶
ML algorithms in python typically take numpy.arrays as input. In many cases, when working with sequences, you need to pass the same data many times as parts of different chunks.
Example: you have exchange rates for a year, and you want a GBDT to predict the next exchange rate based on the previous 10.
window = 10
rates = np.random.normal(size=1000)
# target in training
y = rates[window:]
Typically the solution used is:
X1 = np.zeros([len(rates) - window, window])
for day in range(len(X1)):
    X1[day, :] = rates[day:day + window]
But strided tricks are a better way, since they don't need additional space:
stride, = rates.strides
X2 = as_strided(rates, [len(rates) - window, window], strides=[stride, stride])
np.allclose(X1, X2)
And if I want to pass not every window, but only every few (usually to speed up training), I use slicing, which again performs no operations on the data:
X_new = X2[::3]
y_new = y[::3]
Strides and images¶
The same tricks work with images. For instance, for each window of size [window_w, window_h]
we want to compute the max, mean and variance. Let's compare the naive and strided approaches.
def compute_window_mean_and_var(image, window_w, window_h):
    w, h = image.shape
    w_new, h_new = w - window_w + 1, h - window_h + 1
    means = np.zeros([w_new, h_new])
    maximums = np.zeros([w_new, h_new])
    variations = np.zeros([w_new, h_new])
    for i in range(w_new):
        for j in range(h_new):
            window = image[i:i+window_w, j:j+window_h]
            means[i, j] = np.mean(window)
            maximums[i, j] = np.max(window)
            variations[i, j] = np.var(window)
    return means, maximums, variations
def compute_window_mean_and_var_strided(image, window_w, window_h):
    w, h = image.shape
    strided_image = np.lib.stride_tricks.as_strided(image,
        shape=[w - window_w + 1, h - window_h + 1, window_w, window_h],
        strides=image.strides + image.strides)
    # important: trying to reshape this strided view would create a complete 4-dimensional copy
    means = strided_image.mean(axis=(2, 3))
    mean_squares = (strided_image ** 2).mean(axis=(2, 3))
    maximums = strided_image.max(axis=(2, 3))
    variations = mean_squares - means ** 2
    return means, maximums, variations
image = np.random.random([500, 500])
%time a1, b1, c1 = compute_window_mean_and_var(image, 20, 20)
%time a2, b2, c2 = compute_window_mean_and_var_strided(image, 20, 20)
np.allclose(a1, a2), np.allclose(b1, b2), np.allclose(c1, c2)
Remark: numpy.cumsum allows awesomely fast computations in this case as well (but not for the max). Find out how!
Oblivious decision trees¶
An oblivious decision tree (ODT) is a tree that uses the same split in all nodes at a particular depth.
ODT of depth $n$ consists of $n$ pairs (feature index, split) and $2^n$ real values, which correspond to predictions in each leaf.
def predict_odt(data, features, splits, leaf_values):
    # first compute the leaf for each sample in data
    leaf_indices = np.zeros(len(data), dtype=int)
    for feature, split in zip(features, splits):
        leaf_indices *= 2
        leaf_indices += data[:, feature] > split
    predictions = leaf_values[leaf_indices]
    return predictions
That's it. It was really simple; now let's create some fake data and a fake ODT to measure speed.
data = np.random.normal(size=[100000, 11])
depth = 7
features = np.arange(depth)
splits = np.random.normal(size=depth).astype('float32')
leaf_values = np.random.random(2 ** depth).astype('float32')
%timeit predict_odt(data, features, splits, leaf_values)
This is a good result, but can we do it faster? Let's apply bitwise operations instead of arithmetic ones.
def predict_odt_bit_operations(data, features, splits, leaf_values):
    leaf_indices = np.zeros(len(data), dtype='uint8')
    for feature, split in zip(features, splits):
        leaf_indices <<= 1
        leaf_indices |= data[:, feature] > split
    predictions = leaf_values[leaf_indices]
    return predictions
%timeit predict_odt_bit_operations(data, features, splits, leaf_values)
Ok, but what if we apply the bitwise operations to 64-bit integers? This should require 8 times fewer operations, because currently these are 8-bit operations (not handled natively by 64-bit processors).
The result of the computation is the same, but this requires the length of the data to be divisible by 8.
This trick significantly speeds up the shift << and bitwise or | in the code, and the major part of the time (>80%) is now spent on checking the conditions.
def predict_odt_fast(data, features, splits, leaf_values):
    leaf_indices = np.zeros(len(data), dtype='uint8').view('uint64')
    for feature, split in zip(features, splits):
        leaf_indices <<= 1
        # the boolean mask (1 byte per sample) is viewed as uint64, so 8 samples are updated per operation
        leaf_indices |= (data[:, feature] > split).view('uint64')
    predictions = leaf_values[leaf_indices.view('uint8')]
    return predictions
%timeit predict_odt_fast(data, features, splits, leaf_values)
Banks in RAM, strides and order of elements¶
Data in numpy is stored as a continuous chunk of memory. This gives high flexibility (instant transpositions, reshaping, views and strided tricks), but sometimes we need to keep an eye on strides (in particular, on the order of dimensions).
In the following simple example:
data = np.random.random([100000, 64]).astype(order='C', dtype='float32')
%timeit predict_odt_fast(data, features, splits, leaf_values)
data = np.random.random([100000, 64]).astype(order='F', dtype='float32')
%timeit predict_odt_fast(data, features, splits, leaf_values)
How did we get this huge difference in speed?¶
Operations on sequential data are faster thanks to the CPU cache.
By default, numpy uses C-order, so elements of one column are far from each other,
while elements within a row are placed together in memory.
C-order is fine when operating on rows, while our ODT code uses column-wise operations.
The actual speedup may differ significantly depending on the structure of your CPU cache, but it is expected to be substantial in any case.
This is especially pressing when the stride is a power of 2, because of memory banks, which are used to access memory in parallel. In C-order, the distance (in bytes) between the two closest elements in a column is:
np.random.random([10000, 64]).astype(order='C', dtype='float32').strides[0]
The stride is 256 bytes, so each access hits the same memory bank, which is the worst possible situation.
Conclusion: order of indexing matters. Mind the cache.
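If the array already exists in C-order, a minimal sketch of making a column-friendly copy with np.asfortranarray (the copy is paid once and pays off when many column-wise passes follow):
data_c = np.random.random([100000, 64]).astype('float32')  # C-order by default
data_f = np.asfortranarray(data_c)                         # column-major copy
data_f.strides   # the distance along a column is now 4 bytes instead of 256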
Computing optimal step for AdaLoss function¶
In GBDT, an important step after training a tree is finding the optimal values in its leaves.
AdaLoss (also known as ExpLoss) can be written as (here $y_i = \pm 1$):
$$ \sum_i \exp[-y_i d(x_i)] $$
Contrary to other losses, for AdaLoss one can write an exact expression for the optimal value of a leaf:
$$ \text{leaf\_value} = \frac{1}{2} \log \left( \dfrac{w_{leaf,+}}{w_{leaf,-}} \right), $$
where $w_{leaf,+}, w_{leaf,-}$ are the weights of signal and background events in the leaf. Let's write this procedure in beautiful numpy code:
def compute_optimal_leaf_values(terminal_regions, labels, predictions):
    """
    terminal_regions are integers, corresponding to the index of the leaf each event belongs to
    labels are +1 and -1
    predictions are real numbers - the sum of predictions given by previous trees in boosting
    """
    weights = np.exp(- labels * predictions)
    w_plus = np.bincount(terminal_regions, weights * (labels == +1))
    w_minus = np.bincount(terminal_regions, weights * (labels == -1))
    return np.log(w_plus / w_minus) / 2.
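A small usage sketch on made-up data (the names and sizes below are illustrative, not from the original):
n_events, n_leaves = 1000, 8
terminal_regions = np.random.randint(0, n_leaves, size=n_events)
labels = np.random.choice([-1, +1], size=n_events)
predictions = np.random.normal(size=n_events)
compute_optimal_leaf_values(terminal_regions, labels, predictions)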
Logistic function¶
Logistic function is frequently used in machine learning to convert real-valued output to probability: $$ \sigma (x) = \dfrac{e^x}{e^x + 1} = \dfrac{1}{1 + e^{-x}} $$
However, it is a bad idea to use this definition directly, because:
def logistic(x):
    return np.exp(x) / (1 + np.exp(x))
logistic(1000)
The first expression fails. The second expression raises a warning, but returns the correct answer (earlier versions of numpy returned NaN in this case as well).
Instead, use scipy.special.expit, which is fast and reliable:
from scipy.special import expit
expit([-1000, 0, 1000])
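If scipy is not at hand, a numerically stable pure-numpy version can be sketched by choosing the formula based on the sign of the argument (only a sketch; expit remains the recommended tool):
def logistic_stable(x):
    x = np.asarray(x, dtype=float)
    result = np.empty_like(x)
    positive = x >= 0
    # for x >= 0, exp(-x) cannot overflow
    result[positive] = 1. / (1. + np.exp(-x[positive]))
    # for x < 0, exp(x) cannot overflow either
    exp_x = np.exp(x[~positive])
    result[~positive] = exp_x / (1. + exp_x)
    return result
logistic_stable(np.array([-1000., 0., 1000.]))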
Logistic loss function¶
Expressed via the decision function, the logistic loss for a single event has the following form:
$$ \text{LogLoss} = \log(1 + e^{-yd}) $$
For simplicity, let's implement the function $\log(1 + e^x)$:
def logloss(x):
    return np.log(1 + np.exp(x))
logloss([1, 10, 100, 1000])
Whoops, something went wrong. Of course, there was an overflow after applying the exponent: the resulting number didn't fit into float64.
Use numpy.logaddexp to avoid these problems!
np.logaddexp(0, [1, 10, 100, 1000])
Remark: numpy.logaddexp.at is a good replacement for numpy.maximum.at if you want to keep a 'soft' maximum.
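A tiny illustration of that replacement (the arrays here are made up for the example):
group_ids = np.array([0, 0, 1, 1, 1])
values = np.array([1., 2., -1., 0.5, 3.])
soft_max = np.full(2, -np.inf)              # -inf is the identity element for logaddexp
np.logaddexp.at(soft_max, group_ids, values)
# soft_max[g] is log(sum over group g of exp(value)), a smooth analogue of the group maximum
soft_max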
Masked array¶
numpy doesn't support jagged arrays, but this can be partially worked around by using masked arrays.
Imagine the following situation: there are $n$ houses and we want to select a location to build a new one. For each candidate position we analyze the prices of the $k$ neighbouring houses to guess the best place.
houses_prices = np.random.exponential(size=10000)
n_neighs = 20
position_neighbours = np.random.randint(0, len(houses_prices), size=[100000, n_neighs])
Computing the average price of the nearest houses is fairly simple:
houses_prices[position_neighbours].mean(axis=1)
The same method is used to get the total/min/max/std price of the closest houses.
The problem is computing these things when part of the data should be ignored. For instance, houses which are very expensive (top 5%) are probably expensive for reasons other than their position. Can we ignore them when computing the average?
Masked arrays come to the rescue:
# ignore houses with higher price
max_price = np.percentile(houses_prices, 95)
data = houses_prices[position_neighbours]
ignored = houses_prices[position_neighbours] > max_price
# pay attention that data[~ignored] will return one-dimensional array,
# that's why we need masked arrays
masked_prices = np.ma.MaskedArray(data=data, mask=ignored)
masked_prices.mean(axis=1)
np.array(masked_prices.mean(axis=1))
numpy.ufunc.reduceat to work with multicategories¶
A multicategory is a type of feature: each sample in the dataset can belong to several (possibly zero) groups.
For instance, the samples are people, and we have information about which languages they program in. Programming languages form a multicategory.
Assume we have two parallel arrays, where each corresponding pair denotes a programmer id and a language id he/she knows.
salaries = np.random.random(2000)
programers = np.sort(np.random.randint(0, len(salaries), size=10000))
languages = np.random.randint(0, 50, size=len(programers))
print(programers)
print(languages)
Let's compute, for each language, the average salary of its users:
lang_average_salaries = np.bincount(languages, salaries[programers]) / np.bincount(languages)
For fun, we can now compute the salary a programmer could obtain for the best-paid of his languages:
programmer_top_salary = np.zeros(programers.max() + 1)
np.maximum.at(programmer_top_salary, programers, lang_average_salaries[languages])
print(programmer_top_salary)
Get the id of the best-paid language¶
We could guess the language from its average salary, but this approach can be unreliable in other cases.
Let's combine tricks studied earlier:
lang_sorter = np.argsort(lang_average_salaries)
# languages ranked by average salaries:
lang_salaries_ordered = np.argsort(lang_sorter)
programmer_top_language = np.zeros(programers.max() + 1, dtype=int)
np.maximum.at(programmer_top_language, programers, lang_salaries_ordered[languages])
# now we need to decode order of language back to original ID
programmer_top_language = lang_sorter[programmer_top_language]
programmer_top_language
Checking against the previous result: taking the average salaries for the top languages.
NB: there were programmers with no languages; their 'top language' became the worst-paid one. It may be worth creating a special pseudo-language to denote such situations (a sketch of one option follows the check below).
lang_average_salaries[programmer_top_language]
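A hedged sketch of one possible option (not from the original): use -1 as a pseudo-language id for programmers without languages.
# rank -1 marks 'no languages yet'; np.maximum.at leaves it in place if nothing is recorded
top_language_rank = np.full(programers.max() + 1, -1, dtype=int)
np.maximum.at(top_language_rank, programers, lang_salaries_ordered[languages])
programmer_top_language = np.where(top_language_rank >= 0,
                                   lang_sorter[top_language_rank], -1)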
numpy.ufunc.reduceat¶
numpy.maximum.at is not fast at the moment, though it is very convenient.
Let's show a faster, but more complicated, way using numpy.maximum.reduceat:
sorter = np.argsort(programers)
ordered_programers = programers[sorter]
ordered_languages = languages[sorter]
limits_in_programmers = np.insert(np.bincount(programers).cumsum()[:-1], 0, [0])
programmer_top_salary = np.maximum.reduceat(lang_average_salaries[ordered_languages], limits_in_programmers)
Remarks:
- numpy.ufunc.reduceat has non-obvious behavior when the list of categories for a sample is empty: in this case the value for the first category of the next sample is returned (see the tiny illustration below).
- a significant degradation in speed happens when there are too few categories per sample; with one category per sample, the speed is comparable to numpy.ufunc.at.
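A tiny illustration of the first remark (the numbers are made up):
values = np.array([1., 5., 2.])
starts = np.array([0, 2, 2])           # the second segment (starts[1]:starts[2]) is empty
np.maximum.reduceat(values, starts)    # -> [5., 2., 2.]: the empty segment just copies values[2]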
Sparse arrays and multicategories¶
There is an alternative to numpy.bincount, which is especially handy when using weights.
A minor modification of the previous example: each programmer has a skill value, describing how well he knows each programming language (it plays the role of a weight).
skills = np.random.random(len(languages))
Sparse matrices contain all this information together:
from scipy.sparse import coo_matrix
matrix = coo_matrix((skills, (programers, languages)))
Easy normalization tricks are a strong side of sparse matrices.
Let's compute the average salary with weights proportional to 'skill':
matrix_normed_for_language = matrix / (matrix.sum(axis=0) + 0.01)
lang_average_salaries = matrix_normed_for_language.T.dot(salaries)
matrix.dot(lang_average_salaries.T)
Sparse matrices are slower than bincount if used only once, because of the cost of creating the matrix, but they are faster when applied many times.
Fast computing of efficiencies for group¶
In the uBoost algorithm, we need to compute local efficiencies many times: for each event, which fraction of its neighbours passes the given threshold. A sparse matrix is a good option in this case. Similar things happen in other classifiers (uGB+kNN, RankBoost).
As an exercise, let's compute which fraction of programmers knowing language X have a salary greater than salary_threshold (printing the result only for the first 5 languages):
for salary_threshold in np.linspace(0, 1, 10):
    print(matrix_normed_for_language.T.dot(salaries > salary_threshold)[0, :5])
Fast residual¶
When dealing with words or other categorical variables, there is frequently trouble with applying one-hot encoding (you don't want to keep the mapping around, or the model ends up with too many parameters).
The hashing trick is a good alternative: each value is mapped by a hash to a small number.
A funny fact is that taking the integer remainder is not a very cheap operation, even though it has to be applied many times:
categories = np.random.randint(0, 1e10, size=1000000)
%timeit categories % 128
If you select the number of resulting buckets to be a power of 2, you can do the same with a bitmask:
%timeit categories & 127
How about rocket speed? (Please don't apply this in practice, it's just to show that this is not the limit.)
%%timeit
# build the mask 0x7f7f...7f (127 repeated in every byte of an int64)
mask = np.array(127, dtype='int64')
for _ in range(8):
    mask |= mask << 8
# take the lowest byte of each int64 (little-endian), pack 8 of them into one uint64,
# mask all 8 bytes in one operation, then unpack back to bytes
result = (categories.view('uint8')[::8].copy().view('uint64') & mask).view('uint8')
Please keep such simple bit tricks in mind if you're going to implement some algorithm.
# plain python approach
from collections import defaultdict
def compute_online_counter(categories_stream):
    online_counter = np.zeros_like(categories_stream)
    counters = defaultdict(int)
    for i in range(len(categories_stream)):
        online_counter[i] = counters[categories_stream[i]]
        counters[categories_stream[i]] += 1
    return online_counter
Simple example for online counter¶
For each position we compute how many times its number (representing a category) has appeared earlier in the stream:
compute_online_counter([1, 2, 1, 1, 3, 2, 5, 6, 3, 3, 3])
categories_stream = np.random.randint(0, 10000, size=1000000)
%timeit online_counter = compute_online_counter(categories_stream)
def compute_online_counter_numpy(categories_stream):
    occurences = np.bincount(categories_stream)
    # number of elements with a strictly smaller category
    before = occurences.cumsum() - occurences
    # rank in a stable sort = (#smaller categories) + (#earlier equal categories)
    positions = categories_stream.argsort(kind='mergesort').argsort(kind='mergesort')
    return positions - before[categories_stream]
%timeit compute_online_counter_numpy(categories_stream)
This is a situation where numpy is really not of much help: applying sorting twice is a poor approach when no sorting should be needed at all.
pandas suggests a good one-line alternative, but it is ...hmhm... a bit slow.
import pandas
df = pandas.DataFrame({'cat': categories_stream, 'ones': np.ones(len(categories_stream))})
# truncating to first 10000, otherwise it will never finish
df = df[:10000]
%time result = df.groupby('cat')['ones'].transform(pandas.Series.cumsum)
Ok, then it is time to use some external tool. Let's do it with parakeet.
import parakeet
parakeet.config.backend = 'c' # one thread
@parakeet.jit
def compute_online_counter_parakeet(categories_stream):
    counters = np.zeros(np.max(categories_stream) + 1)
    online_counter = np.zeros_like(categories_stream)
    for i in range(len(categories_stream)):
        online_counter[i] = counters[categories_stream[i]]
        counters[categories_stream[i]] += 1
    return online_counter
compute_online_counter_parakeet(categories_stream)
%timeit compute_online_counter_parakeet(categories_stream)
However, keep in mind that you probably won't be able to use this code on Windows. Use external tools only when they are really needed.
Conclusion¶
- numpy is very powerful if properly used.
- learning numpy is harder than a + b, but it pays off in shortness and readability
- there are very few cases when you really need to switch from python code to cython or c during research.
- many of the things above can be implemented in other vector environments: theano / matlab / R
Links¶
There are many more things to know about numpy. Here are some helpful links you may also be interested in:
This post was written in IPython. You can download the notebook from repository.