This is my personal notebook ^_^

Notes for Machine Learning

Basic Knowledge

Three Core Components:

  • Representation: how to represent your problem (decision tree? neural network?)
  • Evaluation: how to evaluate your model (precision and recall? RMSE?)
  • Optimization: how to search among alternatives (greedy search? gradient descent?)

Underfitting: high bias; low variance
Overfitting: low bias; high variance

Evaluation:

  • Cross-validation: k-fold, leave-one-out, etc.
  • Confusion matrix:
                 predicted +   predicted -
      True +     a (TP)        b (FN)
      True -     c (FP)        d (TN)
  • Accuracy = (a+d)/(a+b+c+d)
  • Precision = a/(a+c)
  • Recall = a/(a+b)
  • Sensitivity = a/(a+b)
  • Specificity = d/(c+d)
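A minimal sketch of these metrics in Python, assuming the cells a, b, c, d are laid out as in the table above (a = true positives, d = true negatives); the example counts are made up:

```python
# Compute the evaluation metrics above from confusion-matrix cells a, b, c, d
# (rows = true class, columns = predicted class, as in the table).
def confusion_metrics(a, b, c, d):
    return {
        "accuracy":    (a + d) / (a + b + c + d),
        "precision":   a / (a + c),   # of everything predicted +, how much is truly +
        "recall":      a / (a + b),   # of everything truly +, how much was found (= sensitivity)
        "specificity": d / (c + d),   # of everything truly -, how much was correctly rejected
    }

# Example with made-up counts: 80 TP, 20 FN, 10 FP, 90 TN
print(confusion_metrics(a=80, b=20, c=10, d=90))
```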

Ensemble Methods:

  • Bootstrap:
    • Given a dataset of size N;
    • Draw N samples with replacement to create a new dataset;
    • Repeat ~1000 times and get ~1000 sample datasets;
    • Compute ~1000 sample statistics.
  • Bagging:
    • Draw N bootstrap samples;
    • Retrain the model on each sample;
    • Combine the results (average for regression; majority vote for classification);
    • Works well for overfit models: it decreases variance without changing bias, but doesn’t help much with underfit/high-bias models (see the sketch after this list)
  • Boosting:
    • Instead of selecting data randomly with the bootstrap, favor the misclassified points
    • Initialize the weights
    • Repeat: resample with respect to weights; retrain the model; recompute weights
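A minimal numpy sketch of the bootstrap and bagging steps above; the data, the choice of decision trees as the base model, and the numbers of resamples are placeholders, not from the notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data (placeholder): y = x + noise
X = rng.uniform(0, 10, size=(200, 1))
y = X.ravel() + rng.normal(0, 1, size=200)
N = len(X)

# Bootstrap: draw N samples with replacement, ~1000 times, and compute a
# sample statistic (here, the mean of y) on each resampled dataset.
boot_means = [y[rng.integers(0, N, size=N)].mean() for _ in range(1000)]
print("bootstrap mean(y):", np.mean(boot_means), "+/-", np.std(boot_means))

# Bagging: retrain the model on each bootstrap sample and average the
# predictions (for classification this would be a majority vote instead).
trees = [DecisionTreeRegressor().fit(X[idx], y[idx])
         for idx in (rng.integers(0, N, size=N) for _ in range(50))]
X_new = np.array([[2.5], [7.5]])
print("bagged predictions:", np.mean([t.predict(X_new) for t in trees], axis=0))
```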

Both boosting and bagging are ensemble techniques: instead of learning a single classifier, several are trained and their predictions are combined. While bagging uses an ensemble of independently trained classifiers, boosting is an iterative process in which each later model focuses on correcting the errors of the earlier ones.
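A simplified sketch of the boosting loop from the list above (resample with respect to the weights, retrain, recompute the weights); the doubling weight update and decision stumps are placeholder choices, not the exact AdaBoost recipe:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data (placeholder): two noisy classes in one dimension
X = rng.normal(0, 1, size=(300, 1))
y = (X.ravel() + rng.normal(0, 0.5, size=300) > 0).astype(int)
N = len(X)

weights = np.full(N, 1.0 / N)            # initialize the weights uniformly
models = []

for _ in range(20):
    # Resample with respect to the weights: misclassified points are favored
    idx = rng.choice(N, size=N, replace=True, p=weights)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    models.append(stump)

    # Recompute the weights: up-weight the points this model got wrong
    wrong = stump.predict(X) != y
    weights[wrong] *= 2.0
    weights /= weights.sum()             # keep the weights a probability distribution

# Combine the boosted models by majority vote
votes = np.mean([m.predict(X) for m in models], axis=0)
print("ensemble training accuracy:", np.mean((votes > 0.5) == y))
```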

Random Forest Algorithm

  • Repeat k times:
    • Draw a bootstrap sample from the dataset
    • Train a decision tree
      • Until the tree reaches maximum size:
        • Choose the next leaf node
        • Select m attributes at random from the p available
        • Pick the best attribute/split as usual
    • Measure out-of-bag error
      • Evaluate against the samples that were not selected in the bootstrap
      • Provides measures of strength (inverse error rate), correlation between trees (which increases the forest error rate), and variable importance
  • Make a prediction by majority vote among the k trees
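This loop is essentially what scikit-learn's RandomForestClassifier implements; a minimal usage sketch, where the dataset and hyperparameter values are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# k trees, each grown on a bootstrap sample, considering m random attributes
# at every split; oob_score=True evaluates each tree on the samples left out
# of its bootstrap sample (the out-of-bag error).
forest = RandomForestClassifier(
    n_estimators=500,      # k
    max_features="sqrt",   # m attributes per split
    bootstrap=True,
    oob_score=True,
    random_state=0,
).fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
print("variable importances:", forest.feature_importances_[:5])
print("majority-vote predictions:", forest.predict(X[:3]))   # vote among the k trees
```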

Optimization

  • Gradient Descent: at each step, use all data points
  • Stochastic Gradient Descent: at each step, pick one random data point
  • Minibatch Gradient Descent: at each step, pick a small random subset of data points
  • Parallel Stochastic Gradient Descent: in each of k threads, pick a random data point, compute the gradient, and update the weights; the weights are then “mixed” across threads
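A minimal numpy sketch contrasting the three sequential variants on least-squares linear regression; the data, learning rate, and batch size are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (placeholder): y = 1 + 3x + noise, with a bias column in X
X = np.c_[np.ones(500), rng.uniform(-1, 1, 500)]
y = X @ np.array([1.0, 3.0]) + rng.normal(0, 0.1, 500)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr = 0.1
w_gd, w_sgd, w_mb = np.zeros(2), np.zeros(2), np.zeros(2)
for step in range(1000):
    # Gradient Descent: use all data points at each step
    w_gd -= lr * grad(w_gd, X, y)

    # Stochastic Gradient Descent: pick one random data point
    i = rng.integers(len(y))
    w_sgd -= lr * grad(w_sgd, X[i:i+1], y[i:i+1])

    # Minibatch Gradient Descent: pick a small random subset
    idx = rng.choice(len(y), size=32, replace=False)
    w_mb -= lr * grad(w_mb, X[idx], y[idx])

print("GD:", w_gd, "SGD:", w_sgd, "minibatch:", w_mb)
```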

Notes for Regression from the Coursera UW course.

Please see here