I'm working through the MovieLens 1M data set, but am hitting an issue at the merge step. I checked all of my DataFrames and confirmed that the data imported from the tables correctly, even though the book/git don't have appropriately marked links. On page 28 you are supposed to merge ratings with users, then merge the result with movies. My code looks like this:
data = pd.merge(pd.merge(ratings, users), movies)
the output that results looks like this:
MergeError                                Traceback (most recent call last)
//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     37         right_on=right_on, left_index=left_index,
     38         right_index=right_index, sort=sort, suffixes=suffixes,
---> 39         copy=copy)
     40     return op.get_result()
     41     if debug:

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy)
    181         (self.left_join_keys,
    182          self.right_join_keys,
--> 183          self.join_names) = self._get_merge_keys()
    184
    185     def get_result(self):

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_merge_keys(self)
    299         left_keys, right_keys
    300         """
--> 301         self._validate_specification()
    302
    303         left_keys = []

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
    398                                            self.right.columns)
    399         if len(common_cols) == 0:
--> 400             raise MergeError('No common columns to perform merge on')
    401         self.left_on = self.right_on = common_cols
    402         elif self.on is not None:

MergeError: No common columns to perform merge on
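For reference, this error reproduces with any two frames whose column names don't intersect. A minimal sketch with hypothetical toy frames (using the modern `pd.errors.MergeError` location for the exception; in the pandas version in the traceback it lived in `pandas.tools.merge`):

```python
import pandas as pd

# Two toy frames whose key columns differ only by '_' vs '-'
left = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
right = pd.DataFrame({'user-id': [1, 2], 'rating': [5, 3]})

try:
    pd.merge(left, right)  # no `on=` given, so pandas intersects the column names
except pd.errors.MergeError as err:
    msg = str(err)

print(msg)  # starts with "No common columns to perform merge on"
```

With no `on=` argument, `merge` joins on the intersection of the two frames' columns; an empty intersection raises `MergeError`, which is exactly what the mismatched `user_id` / `user-id` names produce.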
Comment From: jreback
Please post your numpy/pandas versions, and your code up to this point.
Comment From: WLazar
import numpy as np
import pandas as pd
cd /Users/WL/Documents/Data_science/pydata-book/ch02
np.__version__ -->
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('movielens/users.dat', sep='::', header=None, names=unames)
rnames = ['user-id', 'movie-id', 'rating', 'timestamp']
ratings = pd.read_table('movielens/ratings.dat', sep='::', header=None, names=rnames)
mnames = ['movie-id', 'title', 'genres']
movies = pd.read_table('movielens/movies.dat', sep='::', header=None, names=mnames)
users.merge(ratings, on='user-id')
Comment From: jreback
Looks like a typo in the book; I think it should be 'user_id' for the `on` field (not 'user-id'), and the names in ratings should match. There may be some way to report typos if you are inclined.
In [23]: rnames = ['user_id', 'movie-id', 'rating', 'timestamp']
In [24]: ratings = pd.read_table('movielens/ratings.dat', sep='::', header=None,names=rnames)
In [25]: users.merge(ratings, on='user_id')
Out[25]:
user_id gender age occupation zip movie-id rating timestamp
0 1 F 1 10 48067 1193 5 978300760
1 1 F 1 10 48067 661 3 978302109
2 1 F 1 10 48067 914 3 978301968
3 1 F 1 10 48067 3408 4 978300275
4 1 F 1 10 48067 2355 5 978824291
5 1 F 1 10 48067 1197 3 978302268
6 1 F 1 10 48067 1287 5 978302039
7 1 F 1 10 48067 2804 5 978300719
8 1 F 1 10 48067 594 4 978302268
9 1 F 1 10 48067 919 4 978301368
10 1 F 1 10 48067 595 5 978824268
11 1 F 1 10 48067 938 4 978301752
12 1 F 1 10 48067 2398 4 978302281
13 1 F 1 10 48067 2918 4 978302124
14 1 F 1 10 48067 1035 5 978301753
15 1 F 1 10 48067 2791 4 978302188
16 1 F 1 10 48067 2687 3 978824268
17 1 F 1 10 48067 2018 4 978301777
18 1 F 1 10 48067 3105 5 978301713
19 1 F 1 10 48067 2797 4 978302039
20 1 F 1 10 48067 2321 3 978302205
21 1 F 1 10 48067 720 3 978300760
22 1 F 1 10 48067 1270 5 978300055
23 1 F 1 10 48067 527 5 978824195
24 1 F 1 10 48067 2340 3 978300103
25 1 F 1 10 48067 48 5 978824351
26 1 F 1 10 48067 1097 4 978301953
27 1 F 1 10 48067 1721 4 978300055
28 1 F 1 10 48067 1545 4 978824139
29 1 F 1 10 48067 745 3 978824268
30 1 F 1 10 48067 2294 4 978824291
31 1 F 1 10 48067 3186 4 978300019
32 1 F 1 10 48067 1566 4 978824330
33 1 F 1 10 48067 588 4 978824268
34 1 F 1 10 48067 1907 4 978824330
35 1 F 1 10 48067 783 4 978824291
36 1 F 1 10 48067 1836 5 978300172
37 1 F 1 10 48067 1022 5 978300055
38 1 F 1 10 48067 2762 4 978302091
39 1 F 1 10 48067 150 5 978301777
40 1 F 1 10 48067 1 5 978824268
41 1 F 1 10 48067 1961 5 978301590
42 1 F 1 10 48067 1962 4 978301753
43 1 F 1 10 48067 2692 4 978301570
44 1 F 1 10 48067 260 4 978300760
45 1 F 1 10 48067 1028 5 978301777
46 1 F 1 10 48067 1029 5 978302205
47 1 F 1 10 48067 1207 4 978300719
48 1 F 1 10 48067 2028 5 978301619
49 1 F 1 10 48067 531 4 978302149
50 1 F 1 10 48067 3114 4 978302174
51 1 F 1 10 48067 608 4 978301398
52 1 F 1 10 48067 1246 4 978302091
53 2 M 56 16 70072 1357 5 978298709
54 2 M 56 16 70072 3068 4 978299000
55 2 M 56 16 70072 1537 4 978299620
56 2 M 56 16 70072 647 3 978299351
57 2 M 56 16 70072 2194 4 978299297
58 2 M 56 16 70072 648 4 978299913
59 2 M 56 16 70072 2268 5 978299297
... ... ... ... ... ... ... ...
[1000209 rows x 8 columns]
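With the same underscore naming applied consistently across all three frames (including 'movie_id' in both ratings and movies), the book's one-line triple merge works as written. A sketch with toy stand-in data, since the MovieLens `.dat` files aren't reproduced here:

```python
import pandas as pd

# Toy stand-ins for the three MovieLens tables, with consistent key names
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M'],
                      'age': [1, 56], 'occupation': [10, 16],
                      'zip': ['48067', '70072']})
ratings = pd.DataFrame({'user_id': [1, 1, 2], 'movie_id': [1193, 661, 1357],
                        'rating': [5, 3, 5],
                        'timestamp': [978300760, 978302109, 978298709]})
movies = pd.DataFrame({'movie_id': [1193, 661, 1357],
                       'title': ["One Flew Over the Cuckoo's Nest (1975)",
                                 'James and the Giant Peach (1996)',
                                 'Shine (1996)'],
                       'genres': ['Drama', 'Animation', 'Drama']})

# ratings shares user_id with users and movie_id with movies,
# so each merge finds its key automatically
data = pd.merge(pd.merge(ratings, users), movies)
print(data.shape)  # (3, 10): three ratings, ten combined columns
```

The inner merge joins on `user_id` (the only shared column), the outer on `movie_id`, with no `on=` argument needed at either step.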
Comment From: jreback
@Lingzy your example is not reproducible. If you are having an issue, please open a new issue with a copy-pastable example.