Explain the step by step implementation of ADA Boost.

Let’s say we have an AdaBoost model with M trees that has to be trained on a dataset with N rows. We want to use this model for a binary classification problem where the target variable takes values in {1, -1}. Following are the steps taken by the AdaBoost algorithm:

  1. Initialize the weights so that each observation’s weight, say w_i, is 1/N. This step is executed only for the first tree in the sequence; from the second tree onwards the algorithm assigns the weights itself, through the process described in step 5.
  2. Fit the tree, say G(x), on the weighted data and make predictions.
  3. Compute the error using the following formula: err = Σ w_i · R(x_i) / Σ w_i, where R(x_i) = 1 if the condition is true, i.e. G(x_i) ≠ y_i, and 0 otherwise. In other words, the numerator in the above equation only counts the wrongly predicted observations.
  4. Compute the contribution factor of the given tree, denoted by α, in such a manner that a tree with higher performance gets a higher value. The contribution is calculated as α = ½ · ln((1 − err) / err).
  5. Compute the new weights in such a manner that the observations that were wrongly predicted by the previous tree are highlighted, i.e. given more weightage. The weights are updated as w_i ← w_i · exp(α · Q(x_i)), where Q(x_i) = 1 if the condition is true, i.e. G(x_i) ≠ y_i, and −1 otherwise. In other words, wrong predictions end up with higher weights.
  6. Go back to step 2 and repeat the process for the remaining trees.
  7. Final predictions are made by aggregating the outputs of all the trees, with preference given to trees with higher performance. For classification the aggregated prediction is sign(Σ α_m · G_m(x)), i.e. the sign of the α-weighted sum of the individual tree outputs; for regression a weighted combination of the tree outputs (e.g. the weighted median, as in AdaBoost.R2) is used instead. A code sketch of these steps follows the list.

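The steps above can be put together in a short piece of code. Below is a minimal Python sketch of the algorithm, assuming scikit-learn decision stumps as the weak learners and sample weights (rather than explicit resampling) to emphasize the wrongly predicted rows; the function names adaboost_fit and adaboost_predict are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """Minimal sketch of discrete AdaBoost for labels in {1, -1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                     # step 1: initialise weights to 1/N
    trees, alphas = [], []
    for m in range(M):
        # step 2: fit a weak learner (here a decision stump) on the weighted data
        tree = DecisionTreeClassifier(max_depth=1)
        tree.fit(X, y, sample_weight=w)
        pred = tree.predict(X)

        # step 3: weighted error, counting only the wrong predictions
        miss = (pred != y).astype(float)        # R(x): 1 if wrong, 0 otherwise
        err = np.sum(w * miss) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)    # avoid division by zero / log(0)

        # step 4: contribution factor -- better trees get a larger alpha
        alpha = 0.5 * np.log((1 - err) / err)

        # step 5: re-weight so that wrongly predicted rows are highlighted
        q = np.where(pred != y, 1.0, -1.0)      # Q(x): 1 if wrong, -1 otherwise
        w = w * np.exp(alpha * q)
        w = w / np.sum(w)                       # normalise the weights

        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def adaboost_predict(trees, alphas, X):
    # step 7: the sign of the alpha-weighted sum of the individual tree outputs
    scores = sum(a * t.predict(X) for t, a in zip(trees, alphas))
    return np.sign(scores)
```

Calling adaboost_fit(X, y) returns the fitted trees and their α values, and adaboost_predict aggregates their weighted votes exactly as in step 7.
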
What are weak and strong classifiers?

A weak classifier is an algorithm that performs only slightly better than random guessing. The concept of Boosting came as an answer to the question – “Can a set of weak learners create a single strong learner?”.

Note that in practical applications an algorithm with 80% accuracy would be considered weak compared to one with 90% accuracy and would thus become a suitable candidate for Boosting.

How do boosting and bagging differ?

Following are the differences between Bagging and Boosting (a short code comparison follows the list):

  1. Like Bagging, Boosting also employs bootstrapping, but the resampling is done in a strategic manner to ensure that each successive Decision Tree gets more informative training data.
  2. There is a theoretical guarantee that, as long as each weak learner performs better than random guessing, the training error of the boosted ensemble keeps decreasing, so the ensemble outperforms the individual learners it was built from. No comparable guarantee exists for Bagging.
  3. In Boosting sequential learning happens, while in bagging all the individual models work independently.
  4. In Bagging equal weightage is given to the output from all trees, while Boosting gives preference to high performers.
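
As a quick illustration of the last two differences, the following sketch, assuming scikit-learn and a synthetic dataset generated with make_classification, fits both ensembles with their default tree-based base learners side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: trees are trained independently on bootstrap samples,
# and their outputs are combined with equal weight.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: trees are trained sequentially, each focusing on the rows the
# previous trees got wrong, and better-performing trees get a larger say.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("Bagging accuracy :", bag.score(X_te, y_te))
print("Boosting accuracy:", boost.score(X_te, y_te))
```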

What is Boosting?

Boosting is an ensemble approach that generally employs Decision Trees, arranged sequentially, to make predictions. The output from all the Decision Trees is considered while making the final prediction, but more weightage is given to the trees with higher performance, i.e. the output from high-performing trees is boosted (amplified) while aggregating the outputs to make the final prediction.

Like Bagging, Boosting also employs bootstrapping, but the resampling is done in a strategic manner to ensure that each successive Decision Tree gets more informative training data. By more informative I mean that the training instances against which wrong predictions were made are highlighted. In other words, while building the next Decision Tree, the learnings from the previous tree are incorporated.

A distinguishing factor of Boosting algorithms is that they combine weak predictors into a strong one. A weak predictor is an algorithm that performs only slightly better than random guessing. The concept of Boosting came as an answer to the question “Can a set of weak learners create a single strong learner?”. Note that in practical applications an algorithm with 80% accuracy would be considered weak compared to one with 90% accuracy and would thus become a suitable candidate for Boosting.

Here we’ve focused on Decision trees, but other algorithms can also be ensembled using Boosting.
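
If scikit-learn’s AdaBoostClassifier is used, the per-tree contribution described above is exposed as the estimator_weights_ attribute. The snippet below, assuming a synthetic dataset, simply prints these weights to show that better-performing trees receive a larger say in the final vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# By default each weak learner is a one-level decision tree (a stump).
model = AdaBoostClassifier(n_estimators=5, random_state=0).fit(X, y)

# alpha for each tree: better-performing trees receive a larger weight,
# so their votes are "boosted" when the final prediction is aggregated.
print(model.estimator_weights_)
```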

 

Why is stacking done?

The purpose of stacking is to ensure that the patterns in each subset of the training data have been properly learned. Let’s say one of the base models consistently makes wrong predictions; the meta model, which receives output from it and from the other models (which make correct predictions), will learn this and give less weightage to the outputs of the faulty model. Thus the learning process becomes more accurate.
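
A minimal sketch of this idea, assuming scikit-learn’s StackingClassifier with logistic regression and an SVM as base models and a random forest as the meta learner (the same combination used in the worked example below): the meta learner is fit on cross-validated predictions of the base models, so it can learn to discount a base model that is consistently wrong.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Base models at the lower level; the random forest meta learner at the
# upper level is trained on their cross-validated predictions, so it can
# give less weight to a base model that is consistently wrong.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("svm", SVC())],
    final_estimator=RandomForestClassifier(random_state=0),
)
stack.fit(X, y)
print(stack.score(X, y))
```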

Explain the working behind Stacking.

Stacking, also called Stacked Generalization, stacks models on top of each other, i.e. the output of a model at the lower level is used as input for the model at the upper level. The lower level consists of two or more models (base models), and the output from these models is fed to the model at the higher level (meta model).

The purpose of stacking is to ensure that the patterns in each subset of the training data have been properly learned. Let’s say one of the base models consistently makes wrong predictions; the meta model, which receives output from it and from the other models (which make correct predictions), will learn this and give less weightage to the outputs of the faulty model.

Let’s understand the steps involved in stacking using the following example: say we have a simple stacked generalization model with only two layers, where the base layer consists of two models, logistic regression and an SVM classifier, with a random forest acting as the meta learner in the top layer. Following are the steps involved in the creation, training and validation of this stacked generalization model (a code sketch of these steps follows the list):

  1. The dataset is split in the ratio 50:50, with (x1, y1) and (x2, y2) being the independent and dependent variables of the two subsets respectively.
  2. Now we’ll further split the first subset into training and validation sets in the ratio 80:20. Let’s call the independent and dependent variables of the training set x1_train and y1_train respectively. Similarly, the independent and dependent variables of the validation set will be referred to as x1_test and y1_test respectively.
  3. Now the logistic regression and SVM models will be trained on the training data, i.e. x1_train and y1_train.
  4. The models resulting from training in step 3 are validated using the validation set, i.e. x1_test and y1_test. If their performance is satisfactory we proceed; otherwise we choose some other algorithm and repeat step 3.
  5. The validated models make predictions on the independent variables of the second subset, i.e. x2. Since we have two base models we get two sets of predictions, say y’21 and y’22, where y’21 and y’22 are the predictions on the x2 data by the logistic regression and SVM classifier respectively.
  6. Now a new dataset is created with y2 acting as the dependent variable and y’21 & y’22 as the independent variables.
  7. The meta learner model, here the random forest, is trained on the dataset created in step 6.
  8. The resulting model from step 7 is validated using the validation data, i.e. x1_test and y1_test: the base models predict on x1_test, their predictions are fed to the meta learner, and its output is compared with y1_test.
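
The following is a minimal Python sketch of the steps above, assuming scikit-learn and a synthetic dataset from make_classification; the variable names (x1_train, y_pred_21, meta_features, etc.) simply mirror the description and are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Step 1: split the data 50:50 into (x1, y1) and (x2, y2).
x1, x2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 2: split the first subset 80:20 into training and validation sets.
x1_train, x1_test, y1_train, y1_test = train_test_split(
    x1, y1, test_size=0.2, random_state=0
)

# Steps 3-4: train the base models and check them on the validation set.
lr = LogisticRegression(max_iter=1000).fit(x1_train, y1_train)
svm = SVC().fit(x1_train, y1_train)
print("LR validation accuracy :", lr.score(x1_test, y1_test))
print("SVM validation accuracy:", svm.score(x1_test, y1_test))

# Step 5: the validated base models predict on x2.
y_pred_21 = lr.predict(x2)
y_pred_22 = svm.predict(x2)

# Step 6: build a new dataset from the base-model predictions.
meta_features = np.column_stack([y_pred_21, y_pred_22])

# Step 7: train the meta learner (random forest) on that dataset.
meta = RandomForestClassifier(random_state=0).fit(meta_features, y2)

# Step 8: validate the stacked model using x1_test / y1_test.
meta_test = np.column_stack([lr.predict(x1_test), svm.predict(x1_test)])
print("Stacked model accuracy :", meta.score(meta_test, y1_test))
```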

Stacking differs from Bagging in two aspects:

  1. Each base model is trained using a different algorithm (rather than the same algorithm on different bootstrap samples).
  2. Instead of simple aggregation, a meta learner model is used for combining the outputs of the different models.