
The SVD Model in Python's surprise Library


7. The SVD Model in Python's surprise Library

  • The prediction $\hat{r}_{ui}$ is set as:

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$

If user \(u\) is unknown, then the bias \(b_u\) and the factors \(p_u\) are assumed to be zero. The same applies for item \(i\) with \(b_i\) and \(q_i\).

  • Minimization is performed with stochastic gradient descent (SGD), where $e_{ui} = r_{ui} - \hat{r}_{ui}$ is the prediction error, $\gamma$ is the learning rate, and $\lambda$ is the regularization term: $$\begin{split}b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)\\ b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)\\ p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)\end{split}$$
  • Main parameters
    • n_factors – The number of factors. Default is 100.
    • n_epochs – The number of iterations of the SGD procedure. Default is 20.
    • lr_all – The learning rate for all parameters. Default is 0.005.
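The prediction rule and the SGD updates above can be sketched in plain Python. This is only an illustrative sketch, not the library's implementation: the function names `predict` and `sgd_step`, the toy values, and the regularization value `reg=0.02` are assumptions made for this example.

```python
import random


def predict(mu, b_u, b_i, p_u, q_i):
    """Return mu + b_u + b_i + q_i . p_u for one (user, item) pair."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """Apply one SGD update for a single observed rating r_ui.

    lr plays the role of gamma and reg the role of lambda (assumed value).
    All four updates use the parameter values from before the step.
    """
    e_ui = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (e_ui - reg * b_u)
    new_b_i = b_i + lr * (e_ui - reg * b_i)
    new_p_u = [p + lr * (e_ui * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (e_ui * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Toy setup: 3 latent factors instead of the default 100.
random.seed(0)
n_factors = 3
mu = 3.5                      # global mean rating (made-up value)
b_u, b_i = 0.0, 0.0
p_u = [random.gauss(0, 0.1) for _ in range(n_factors)]
q_i = [random.gauss(0, 0.1) for _ in range(n_factors)]
r_ui = 5.0                    # one observed rating (made-up value)

error_before = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))
for _ in range(20):           # n_epochs = 20 passes over this single rating
    b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
error_after = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))
print(error_before, error_after)

# Unknown user: b_u and p_u are taken to be zero, leaving mu + b_i.
unknown_user_estimate = predict(mu, 0.0, 0.2, [0.0] * n_factors, q_i)
print(unknown_user_estimate)
```

With a small learning rate the error shrinks slowly, which is why the number of epochs and the learning rate are tuned together later in this section.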

7.1. A Practical Example Based on MovieLens Data

Loading the MovieLens data

In [2]:
import os
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')


In [3]:
# Each raw rating is a tuple of (user id, item id, rating, timestamp).
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rate", "timestamp"])
In [4]:
df.head()
Out[4]:
   user  item  rate  timestamp
0   196   242   3.0  881250949
1   186   302   3.0  891717742
2    22   377   1.0  878887116
3   244    51   2.0  880606923
4   166   346   1.0  886397596

Training and saving the model

In [5]:
import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)  # fit() replaces train(), which is deprecated in newer Surprise releases

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
Predictions are the same

Computing accuracy

In [6]:
from surprise import Dataset
from surprise import SVD
from surprise import accuracy


data = Dataset.load_builtin('ml-100k')

algo = SVD()

trainset = data.build_full_trainset()
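algo.fit(trainset)  # fit() replaces train(), which is deprecated in newer Surprise releases

# The test set is built from the training ratings themselves, so this RMSE
# measures fit to already-seen data and is lower than a held-out estimate.
testset = trainset.build_testset()
predictions = algo.test(testset)

accuracy.rmse(predictions)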
RMSE: 0.6763
Out[6]:
0.6763421071136434

Model optimization (parameter tuning)

  • surprise's GridSearch class
    • Main parameters
      • algo_class (AlgoBase) – A class object of the algorithm to evaluate.
      • param_grid (dict) – The dictionary has algo_class parameters as keys (string) and list of parameters as the desired values to try. All combinations will be evaluated with desired algorithm.
      • measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
      • verbose (int) – Level of verbosity. If 0, nothing is printed. If 1, accuracy measures for each parameter combination are printed, with combination values. If 2, fold accuracy values are also printed. Default is 1.
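To make "all combinations will be evaluated" concrete, here is a small sketch of how a parameter grid expands into individual settings. It uses only the standard library; the helper name `expand_param_grid` is hypothetical and is not part of surprise.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
combos = expand_param_grid(param_grid)
for combo in combos:
    print(combo)
```

Two values for each of two parameters give four combinations, and each one is trained and scored once per fold, so the cost of a grid search grows multiplicatively with every parameter added.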
In [7]:
import random

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import GridSearch  # older API; newer Surprise releases provide surprise.model_selection.GridSearchCV


# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearch(SVD, param_grid, measures=['RMSE'], verbose=1)
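# best_params is still empty here because evaluate() has not run yet;
# print it again after evaluate() to see the best combination.
print(grid_search.best_params)
grid_search.evaluate(data)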
Grid Search...
defaultdict(<class 'list'>, {})
Running grid search for the following parameter combinations:
{'n_epochs': 5, 'lr_all': 0.002}
{'n_epochs': 5, 'lr_all': 0.005}
{'n_epochs': 10, 'lr_all': 0.002}
{'n_epochs': 10, 'lr_all': 0.005}
Resulsts:
{'n_epochs': 5, 'lr_all': 0.002}
{'RMSE': 0.98969008899623423}
----------
{'n_epochs': 5, 'lr_all': 0.005}
{'RMSE': 0.96388170867799128}
----------
{'n_epochs': 10, 'lr_all': 0.002}
{'RMSE': 0.96890656349975923}
----------
{'n_epochs': 10, 'lr_all': 0.005}
{'RMSE': 0.95219313886797485}
----------

Per-user movie recommendation example: an offline scenario in which recommendations are precomputed and stored in a separate table, then shown on the web when that user logs in

In [ ]:
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)  # fit() replaces train(), which is deprecated in newer Surprise releases

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
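The offline scenario described for this example (store the recommendations in a separate table, then look them up at login) could be sketched as follows with the standard-library `sqlite3` module. The table name `recommendations`, its columns, the helper names, and the sample `top_n_sample` values are all hypothetical; a real service would use its own schema and database.

```python
import sqlite3

# Hypothetical data in the same shape that get_top_n() returns:
# {raw user id: [(raw item id, estimated rating), ...]}
top_n_sample = {
    '196': [('242', 4.6), ('302', 4.4)],
    '186': [('377', 4.1)],
}


def save_recommendations(conn, top_n):
    """Write every user's ranked recommendations into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id TEXT, rank_pos INTEGER, item_id TEXT, estimate REAL, "
        "PRIMARY KEY (user_id, rank_pos))"
    )
    rows = [
        (uid, pos, iid, est)
        for uid, items in top_n.items()
        for pos, (iid, est) in enumerate(items, start=1)
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO recommendations VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def recommendations_for(conn, user_id):
    """Return the stored item ids for one user, best first."""
    cursor = conn.execute(
        "SELECT item_id FROM recommendations "
        "WHERE user_id = ? ORDER BY rank_pos",
        (user_id,),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(':memory:')   # in-memory database for the demo
save_recommendations(conn, top_n_sample)
logged_in_user_items = recommendations_for(conn, '196')
print(logged_in_user_items)
```

Because prediction over the full anti-testset is expensive, running it as a batch job and serving only table lookups keeps the login path fast.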

Evaluating recommendation performance

  1. RMSE (Root Mean Squared Error): $$ \text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}(r_{ui} - \hat{r}_{ui})^2} $$
  2. MAE (Mean Absolute Error): $$ \text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}|r_{ui} - \hat{r}_{ui}| $$
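Both metrics are short enough to compute by hand. The sketch below is not the surprise `accuracy` module; the function names and the three (true rating, estimate) pairs are made up for illustration.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Made-up predictions: errors of 0.5, 1.0 and 0.5 stars.
pairs = [(3.0, 3.5), (1.0, 2.0), (5.0, 4.5)]
rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(round(rmse_value, 4), round(mae_value, 4))
```

RMSE squares each error before averaging, so the single 1.0-star miss weighs more heavily in RMSE than in MAE.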

Evaluating algorithms

In [17]:
import surprise
from surprise import Dataset
data = Dataset.load_builtin('ml-100k')
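# 3-fold split. data.split() and surprise.evaluate() belong to the older API;
# newer Surprise releases use surprise.model_selection.cross_validate instead.
data.split(n_folds=3)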


In [23]:
sim_options = {'name': 'msd'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9911
MAE:  0.7833
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9833
MAE:  0.7771
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9894
MAE:  0.7819
------------
------------
Mean RMSE: 0.9879
Mean MAE : 0.7808
------------
------------
Out[23]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.78332085007914287,
                             0.7771041490260826,
                             0.78185859406389302],
                            'rmse': [0.99109548719657858,
                             0.98332811659672703,
                             0.9893776110540401]})
In [24]:
sim_options = {'name': 'cosine'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0244
MAE:  0.8108
------------
Fold 2
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0187
MAE:  0.8061
------------
Fold 3
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0226
MAE:  0.8094
------------
------------
Mean RMSE: 1.0219
Mean MAE : 0.8088
------------
------------
Out[24]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.81079510431791924,
                             0.80605631128339117,
                             0.80941768920884594],
                            'rmse': [1.0243634073960175,
                             1.0187482414191331,
                             1.0225720777877443]})
In [25]:
sim_options = {'name': 'pearson'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0239
MAE:  0.8120
------------
Fold 2
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0167
MAE:  0.8083
------------
Fold 3
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0214
MAE:  0.8099
------------
------------
Mean RMSE: 1.0207
Mean MAE : 0.8101
------------
------------
Out[25]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.81200632767328063,
                             0.80826931547507119,
                             0.80989597847789219],
                            'rmse': [1.0239246403571562,
                             1.0167128227144424,
                             1.0213664721488831]})
In [58]:
sim_options = {'name': 'pearson_baseline'}
algo = surprise.KNNBasic(sim_options=sim_options)
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0049
MAE:  0.7968
------------
Fold 2
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0103
MAE:  0.7985
------------
Fold 3
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0110
MAE:  0.8002
------------
------------
Mean RMSE: 1.0087
Mean MAE : 0.7985
------------
------------
Out[58]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.79675695245350209,
                             0.7984790061210334,
                             0.8002162404915526],
                            'rmse': [1.0048624637086425,
                             1.0102925000197331,
                             1.0110090350939811]})
In [26]:
algo = surprise.SVD()
surprise.evaluate(algo, data)
Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9498
MAE:  0.7498
------------
Fold 2
RMSE: 0.9458
MAE:  0.7463
------------
Fold 3
RMSE: 0.9446
MAE:  0.7453
------------
------------
Mean RMSE: 0.9467
Mean MAE : 0.7471
------------
------------
Out[26]:
CaseInsensitiveDefaultDict(list,
                           {'mae': [0.749794336106132,
                             0.7462662006765739,
                             0.74533450362095799],
                            'rmse': [0.94976543487583887,
                             0.94580670524240762,
                             0.94459061787140985]})
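The five runs above are easier to compare side by side. The sketch below copies the mean RMSE and MAE values printed in the outputs above into a dictionary and sorts them by RMSE; the labels and variable names are chosen here for illustration.

```python
# Mean RMSE and MAE over the 3 folds, copied from the outputs above.
results = {
    'KNNBasic (msd)': (0.9879, 0.7808),
    'KNNBasic (cosine)': (1.0219, 0.8088),
    'KNNBasic (pearson)': (1.0207, 0.8101),
    'KNNBasic (pearson_baseline)': (1.0087, 0.7985),
    'SVD': (0.9467, 0.7471),
}

# Lower RMSE is better, so sort in ascending order of the first value.
ranking = sorted(results.items(), key=lambda item: item[1][0])
for name, (mean_rmse, mean_mae) in ranking:
    print(f"{name:<30} RMSE={mean_rmse:.4f}  MAE={mean_mae:.4f}")
```

On this split SVD has the lowest error on both metrics, and among the KNNBasic variants the msd similarity performs best.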