#Summary:A Programmer’s Guide to Data Mining
READ: 2014097 RATING: 8/10
How to find stuff?

Years ago, in that small town, our friends helped us find stuff.

We used experts to help us find stuff.

We also use the thing itself to help us find stuff.
These methods of finding relevant stuff— friends, experts, the thing itself—are still present today but we need some computational help to transform them into the 21st century where we have billions of choices.
Chapter 2: Getting Started with Recommendation Systems
 How a recommendation system works.
 How social filtering works
 How to find similar items
 Manhattan distance
 Euclidean distance
 Minkowski distance
 Pearson Correlation Coefficient
 Cosine similarity
Manhattan Distance
Euclidean Distance
Pearson Correlation Coefficient
Chapter 3: Implicit ratings and item based filtering
 Explicit ratings: Explicit ratings are when the user herself explicitly rates the item. One example of this is the thumbs up / thumbs down rating on sites such as Pandora and YouTube.

Implicit Ratings: For implicit ratings, we don’t ask users to give any ratings—we just observe their behavior. An example of this is keeping track of what a user clicks on in the online New York Times.

Problems with explicit ratings:

Problem 1: People are lazy and don’t rate items.

Problem 2: People may lie or give only partial information.

Problem 3: People don’t update their ratings.


Problems with Userbased filtering:

Scalability. As we have just discussed, the computation increases as the number of users increases. Userbased methods work fine for thousands of users, but scalability gets to be a problem when we have a million users.

Sparsity. Most recommendation systems have many users and many products but the average user rates a small fraction of the total products. For example, Amazon carries millions of books but the average user rates just a handful of books.

 Cosine Similarity
This formula is from a seminal article in collaborative filtering: “Itembased collaborative filtering recommendation algorithms”
Chapter 4: Classification based on item attributes

A classifier is a program that uses an object’s attributes to determine what group or class it belongs to.A classifier uses a set of objects that are already labeled with the class they belong to. It uses that set to classify new, unlabeled objects.
 Classifiers can be used in a wide range of applications:
 Twitter Sentiment Classification
 Automatic identification of people in photos.
 Classification for Targeted Political Ads
 Targeted Marketing

Normalizing data is critical when attributes have drastically different scales (for example, income and age). In order to get accurate distance measurements, we should rescale the attributes so they all have the same scale.
 Recall that one way to normalize an attribute on a scale between 0 and 1 is to find the minimum (min) and maximum (max) values of that attribute. The normalized value of a value is then value − min/max− min
Chapter 5: Further Explorations in Classification

We divided each of these datasets in turn into two subsets. One subset we used to construct the classifier. This data set is called the training set. The other set was used to evaluate the classifier. That data is called the test set. Training set and test set are common terms in data mining.

People in data mining never test with the data they used to train the system.

The accuracy based on a single test set may not reflect the true accuracy when our classifier is used with new data. A solution to this problem might be to repeat the process a number of times and average the results.

10fold Cross Validation: With this method we have one data set which we divide randomly into 10 parts. We use 9 of those parts for training and reserve one tenth for testing. We repeat this procedure 10 times each time reserving a different tenth for testing.
 Tenfold cross validation example:
 Step 1, we equally divide the data into 10 buckets:
 Step 2, we iterate through the following steps ten times:
 Step 3, we sum up the results.
 LeaveOneOut: In the machine learning literature, nfold cross validation (where n is the number of samples in our data set) is called leaveoneout. We already mentioned one benefit of leaveoneout— at every iteration we are using the largest possible amount of our data for training. The other benefit is that it is deterministic.