Testing implementations of LibFM¶
By LibFM I mean an approach to solving classification and regression problems. This approach is frequently used in recommender systems, because it generalizes matrix decomposition.
LibFM has proved quite useful for dealing with highly categorical data (user id / movie id / movie language / site id / advertisement id / etc.).
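To make the connection concrete: a second-order factorization machine predicts via a linear part plus pairwise feature interactions, where each interaction weight is the inner product of two low-rank factor vectors. Below is a minimal numpy sketch of this prediction rule (illustrative only, not any library's internals):
import numpy

def fm_predict(x, w0, w, V):
    # x: feature vector [n_features], w0: bias, w: linear weights [n_features],
    # V: factor matrix [n_features, rank]
    linear = w0 + numpy.dot(w, x)
    # pairwise term sum_{i<j} <V_i, V_j> x_i x_j, computed with the
    # O(n * rank) identity: 0.5 * sum_f ((V^T x)_f^2 - ((V^2)^T x^2)_f)
    pairwise = 0.5 * numpy.sum(numpy.dot(x, V) ** 2 - numpy.dot(x ** 2, V ** 2))
    return linear + pairwise
When only two one-hot features are active (say, a user id and a movie id), the pairwise term reduces to the inner product of the user's and the movie's factor vectors, which is exactly a matrix decomposition.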
Implementations tested¶
- The original and widely known implementation was written by Steffen Rendle (available on GitHub).
  - contains SGD, SGDA, ALS and MCMC optimizers
  - command-line interface, no official Python / R wrapper
  - does not provide a way to save / load a trained formula: each time you want to predict something, you need to rerun the training process
  - has some extensions (which almost nobody uses)
  - supports Linux, macOS
  - has a non-official Python wrapper
- FastFM (github repo)
  - claimed to be faster in the author's article
  - has both a command-line interface and a convenient Python wrapper, which almost follows scikit-learn conventions
  - supports SGD, ALS and MCMC optimizers
  - supports save / load (except for MCMC)
  - supports Linux, macOS (though with some issues on macOS)
- pylibFM (github repo)
  - uses SGDA
  - a Python library implemented with Cython
  - save / load works normally
  - supports any platform, provided Cython works normally
  - slow and requires additional tuning; the number of iterations for pylibFM is reduced in the tests below
None of the libraries is pip-installable and all of them need some manual setup. FastFM is the only one that installs itself properly into site-packages.
What is tested¶
ALS (alternating least squares) is a very useful optimization technique for factorization models; however, there is still one parameter to pass, namely the regularization. The quality of classification / regression is quite sensitive to this parameter, so for quick tests a data analyst prefers to leave the selection of regularization to the machine learning itself.
MCMC is usually proposed as a solution: the optimization algorithm should "find" the optimal regularization on its own. MCMC does, however, use some priors (which don't influence the result that much).
So I am testing the quality the libraries provide without additional tuning, to check how well Bayesian inference and other heuristics work.
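As an illustration with fastFM (one of the libraries under test): the ALS solver requires explicit regularization strengths, while the MCMC solver has none to tune. A sketch, with arbitrary parameter values:
from fastFM import als, mcmc

# ALS: the l2 regularizations for linear weights and factors must be chosen by hand
# (typically via grid search on a validation set)
als_model = als.FMRegression(rank=8, n_iter=100, l2_reg_w=0.1, l2_reg_V=0.5)

# MCMC: no regularization parameters, the sampler infers them from its priors
mcmc_model = mcmc.FMRegression(rank=8, n_iter=100)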
Logistic regression¶
Logistic regression is used as a stable baseline, because it is the basic method for working with highly categorical data.
However, logistic regression does not capture the interaction between user variables and movie variables (in the context of movie recommendations), so on its own this approach cannot provide meaningful recommendations.
import numpy
import pandas
import load_problems
import cPickle as pickle
from sklearn.metrics import roc_auc_score, mean_squared_error
from fastFM.mcmc import FMClassification, FMRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.datasets import dump_svmlight_file
Defining functions for benchmarking¶
LIBFM_PATH = '/moosefs/ipython_env/python_libfm/bin/libFM'
PYLIBFM_PATH = '/moosefs/ipython_env/python_pylibFM/'
import sys
if PYLIBFM_PATH not in sys.path:
    sys.path.insert(0, PYLIBFM_PATH)
import pylibfm
def fitpredict_logistic(trainX, trainY, testX, classification=True, **params):
    # one-hot encode the categorical columns, then fit a linear model
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    if classification:
        clf = LogisticRegression(**params)
        clf.fit(trainX, trainY)
        return clf.predict_proba(testX)[:, 1]
    else:
        clf = Ridge(**params)
        clf.fit(trainX, trainY)
        return clf.predict(testX)
def fitpredict_fastfm(trainX, trainY, testX, classification=True, rank=8, n_iter=100):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    # fastFM's MCMC solver trains and predicts in a single pass
    if classification:
        clf = FMClassification(rank=rank, n_iter=n_iter)
        return clf.fit_predict_proba(trainX, trainY, testX)
    else:
        clf = FMRegression(rank=rank, n_iter=n_iter)
        return clf.fit_predict(trainX, trainY, testX)
def fitpredict_libfm(trainX, trainY, testX, classification=True, rank=8, n_iter=100):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    # libFM is a standalone binary, so data is passed through svmlight files
    train_file = 'libfm_train.txt'
    test_file = 'libfm_test.txt'
    with open(train_file, 'w') as f:
        dump_svmlight_file(trainX, trainY, f=f)
    with open(test_file, 'w') as f:
        dump_svmlight_file(testX, numpy.zeros(testX.shape[0]), f=f)
    task = 'c' if classification else 'r'
    # IPython shell magic: $variables are expanded before the command is run
    console_output = !$LIBFM_PATH -task $task -method mcmc -train $train_file -test $test_file -iter $n_iter -dim '1,1,$rank' -out output.libfm
    libfm_pred = pandas.read_csv('output.libfm', header=None).values.flatten()
    return libfm_pred
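Note that the line starting with ! is IPython shell magic, so this function only runs inside IPython / Jupyter. Outside a notebook, the same binary could be invoked through subprocess, e.g. (a sketch using the same flags as above; LIBFM_PATH is defined earlier):
import subprocess

def run_libfm(task, train_file, test_file, rank=8, n_iter=100, out_file='output.libfm'):
    # plain-Python equivalent of the IPython ! call above
    subprocess.check_output([
        LIBFM_PATH, '-task', task, '-method', 'mcmc',
        '-train', train_file, '-test', test_file,
        '-iter', str(n_iter), '-dim', '1,1,{}'.format(rank),
        '-out', out_file])
    return pandas.read_csv(out_file, header=None).values.flatten()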
def fitpredict_pylibfm(trainX, trainY, testX, classification=True, rank=8, n_iter=10):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    task = 'classification' if classification else 'regression'
    fm = pylibfm.FM(num_factors=rank, num_iter=n_iter, verbose=False, task=task)
    if classification:
        fm.fit(trainX, trainY)
    else:
        # pylibfm expects float targets for regression
        fm.fit(trainX, trainY * 1.)
    return fm.predict(testX)
Executing all of the tests takes a lot of time¶
Below is a simple mechanism that preserves results between runs.
from collections import OrderedDict
import time
all_results = OrderedDict()
# load previously computed results, if any
try:
    with open('./saved_results.pkl') as f:
        all_results = pickle.load(f)
except Exception:
    pass
def test_on_dataset(trainX, testX, trainY, testY, task_name, classification=True, use_pylibfm=True):
    algorithms = OrderedDict()
    algorithms['logistic'] = fitpredict_logistic
    algorithms['libFM'] = fitpredict_libfm
    algorithms['fastFM'] = fitpredict_fastfm
    if use_pylibfm:
        algorithms['pylibfm'] = fitpredict_pylibfm
    results = pandas.DataFrame()
    for name, fit_predict in algorithms.items():
        start = time.time()
        predictions = fit_predict(trainX, trainY, testX, classification=classification)
        spent_time = time.time() - start
        results.loc[name, 'time'] = spent_time
        if classification:
            results.loc[name, 'ROC AUC'] = roc_auc_score(testY, predictions)
        else:
            results.loc[name, 'RMSE'] = mean_squared_error(testY, predictions) ** 0.5
    # persist results so reruns of the notebook can skip finished benchmarks
    all_results[task_name] = results
    with open('saved_results.pkl', 'w') as f:
        pickle.dump(all_results, f)
    return results
Testing on movielens-100k dataset, only ids¶
The MovieLens dataset is a famous dataset in recommender systems. The task is to predict users' ratings of movies.
trainX, testX, trainY, testY = load_problems.load_problem_movielens_100k(all_features=False)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='ml100k, ids', classification=False)
Testing on movielens-100k dataset, with additional information¶
trainX, testX, trainY, testY = load_problems.load_problem_movielens_100k(all_features=True)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='ml100k', classification=False)
Testing on movielens-1m dataset, only ids¶
trainX, testX, trainY, testY = load_problems.load_problem_movielens_1m(all_features=False)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='ml-1m,ids', classification=False)
Testing on movielens-1m dataset, with additional information¶
trainX, testX, trainY, testY = load_problems.load_problem_movielens_1m(all_features=True)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='ml-1m', classification=False)
Test on flights dataset - 1m¶
The flights dataset is quite famous due to these benchmarks by szilard.
Based on different characteristics, the goal is to predict whether a flight was delayed by 15 minutes or more.
trainX, testX, trainY, testY = load_problems.load_problem_flight(large=False, convert_to_ints=False)
trainX.head()
trainX, testX, trainY, testY = load_problems.load_problem_flight(large=False, convert_to_ints=True)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='flight1m', classification=True)
Test on flights dataset - 10m¶
pylibFM crashes the kernel here, so it doesn't participate in this comparison.
trainX, testX, trainY, testY = load_problems.load_problem_flight(large=True, convert_to_ints=True)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='flight10m', classification=True, use_pylibfm=False)
Flights dataset with additional features¶
We simply add some 'quadratic' features, as sketched below.
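The exact construction is hidden inside load_problems, but the idea can be sketched as combining pairs of categorical columns into new interaction columns (a hypothetical helper, not the actual load_problems code):
import itertools
import pandas

def add_quadratic_features(df):
    df = df.copy()
    for a, b in itertools.combinations(df.columns, 2):
        combined = df[a].astype(str) + '_' + df[b].astype(str)
        # factorize back to integers so that OneHotEncoder can consume the result
        df['{}_x_{}'.format(a, b)] = pandas.factorize(combined)[0]
    return df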
trainX, testX, trainY, testY = load_problems.load_problem_flight_extended(large=False)
trainX.head()
test_on_dataset(trainX, testX, trainY, testY, task_name='flight1m, ext', classification=True)
Test on Avazu dataset (100k)¶
The Avazu dataset comes from a Kaggle challenge; the goal is to predict the click-through rate.
All the given variables are categorical; LibFM gave good results in this challenge.
trainX, testX, trainY, testY = load_problems.load_problem_ad(train_size=100000)
# taking max hash of 1000 for each category
trainX = trainX % 1000
testX = testX % 1000
trainX.head()
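The modulo above is a crude form of the hashing trick: it caps the cardinality of every categorical column at 1000, trading a few collisions for a bounded one-hot dimensionality. A toy illustration:
# distinct raw ids may collide after the modulo,
# but each column contributes at most 1000 one-hot columns
raw_ids = numpy.array([17, 1017, 2017, 42])
print(raw_ids % 1000)  # -> [17 17 17 42]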
test_on_dataset(trainX, testX, trainY, testY,
task_name='avazu100k', classification=True, use_pylibfm=False)
Avazu 1m¶
trainX, testX, trainY, testY = load_problems.load_problem_ad(train_size=1000000)
# taking max hash of 1000 for each category
trainX = trainX % 1000
testX = testX % 1000
test_on_dataset(trainX, testX, trainY, testY,
task_name='avazu1m', classification=True, use_pylibfm=False)
Results¶
Composing all results in one table. RMSE should be minimal, ROC AUC maximal.
results_table = pandas.DataFrame()
tuples = []
for name in ['ml100k, ids', 'ml-1m,ids', 'ml100k', 'ml-1m', 'flight1m', 'flight1m, ext', 'flight10m', 'avazu100k', 'avazu1m']:
    df = all_results[name]
    results_table[name + ' (time)'] = df['time']
    metric_name = df.columns[-1]
    results_table[name + ' ' + metric_name] = df[metric_name]
    tuples.append([name, 'time'])
    tuples.append([name, metric_name])
results_table = results_table.T
results_table.index = pandas.MultiIndex.from_tuples(tuples, names=['dataset', 'value'])
results_table.T
Conclusion¶
- pylibfm is out of the game: it is slow, it crashes on large datasets, it sometimes simply diverges, and it can hardly compete in quality. Nothing new, adaptive methods require babysitting.
- FastFM and LibFM are quite stable and fast, but LibFM, being a bit faster, on average provides much better results.
As a side note, the flights dataset example showed that some feature engineering, namely providing quadratic features, gives a very significant boost in quality: even logistic regression can work much better and faster than FMs on the original features.
This post was written in IPython. You can download the notebook from the repository.