What is Logistic Regression?

Logistic regression, despite its name, is a classification algorithm, and it is most appropriate for binary classification problems. Instead of predicting the actual value of the target variable, it predicts the probability of the target variable falling in either class. Since the predicted value, a probability, is continuous, the term regression in its name is justified.

Like linear regression, logistic regression also tries to find the slope and intercept of a line. But instead of finding the line closest to the data points, it finds a line that divides the points belonging to one class from those of the other class, covering as many data points as possible while staying as far as possible from the points of either class.
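
To make this concrete, below is a minimal sketch in plain NumPy of how the line's output is turned into a probability and then a class label. The weights and bias here are made-up values chosen purely for illustration, not the result of any actual training.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range, so the
    # output can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters: the slope (weights) and
# intercept (bias) of the dividing line.
w = np.array([0.8, -0.4])
b = 0.1

x = np.array([2.0, 1.5])        # a single data point with two features
p = sigmoid(np.dot(w, x) + b)   # predicted probability of the positive class
label = int(p >= 0.5)           # threshold the probability to get a class

print(f"P(y=1 | x) = {p:.3f} -> predicted class {label}")
```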

Enumerate the differences between classification and clustering problems

Classification is often confused with clustering, but the two are quite different. The following are the key differences between classification and clustering problems.

  • Supervised vs Unsupervised: Classification problems fall under supervised learning, while clustering falls under unsupervised learning.
  • Pre-defined number of categories: In classification problems we have a predefined set of categories, and they are unaffected by the choice of algorithm. In clustering problems, on the other hand, the algorithm has to figure out both how many categories there are and what they are; the number of categories may change with the choice of algorithm.
  • Anomalies: In clustering problems it is entirely possible that a new data point does not belong to any of the existing categories; in classification problems every point is assigned to one of the predefined classes.
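
As a concrete illustration of the first two points, here is a short scikit-learn sketch on synthetic data; the dataset and model choices are assumptions made purely for demonstration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic two-class data, used only to show the contrast.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Classification (supervised): the labels y, and hence the set of
# categories, are given to the algorithm up front.
clf = LogisticRegression().fit(X, y)

# Clustering (unsupervised): no labels are given; we must supply or
# guess the number of groups ourselves, and a different algorithm
# may well find a different number of groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("classifier prediction:", clf.predict(X[:1]))
print("cluster assignment:  ", km.labels_[:1])
```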

What is a classification problem?

Classification problems are one of the two subgroups of supervised learning, the other being regression problems. Unlike in regression, the target variable in classification problems is a categorical variable: instead of predicting its actual value, we try to assign it to an appropriate category.
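
A tiny sketch of that contrast, using illustrative data and scikit-learn models chosen purely for demonstration: on the same features, a regressor predicts a continuous value while a classifier assigns a category.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_continuous = np.array([1.1, 1.9, 3.2, 3.9])  # regression target: real values
y_category = np.array([0, 0, 1, 1])            # classification target: categories

reg = LinearRegression().fit(X, y_continuous)
clf = LogisticRegression().fit(X, y_category)

print("regression predicts a value:   ", reg.predict([[2.5]]))  # a real number
print("classification assigns a class:", clf.predict([[2.5]]))  # 0 or 1
```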

Classification algorithms try to draw a curve that separates the data points so that all similar points are grouped together. Classification is often confused with clustering, but the two are miles apart; the differences are enumerated in the previous answer.

How to improve generalization performance?

The methods to improve generalization performance vary between equation-based algorithms and neural-network-based algorithms. In this post we limit ourselves to equation-based algorithms, i.e. conventional machine learning.

The following methods can be used to reduce overfitting and improve generalization (a code sketch combining methods 2-4 appears after the list).

  1. Train with more data: Easier said than done, but this method improves model generalization significantly. If you can arrange for more data in an economically viable manner, for example by purchasing it, you should do so. You can also leverage similar datasets from public repositories. Research has shown, however, that beyond a certain point adding more data to conventional machine learning models won't help with generalization.
  2. Feature Selection: Remove irrelevant variables from the training data. They don't help in explaining the variance of the dependent variable and only introduce unnecessary information.
  3. Early Stopping: Limit the number of iterations. Model generalization and the number of iterations have a parabolic (inverted-U) relationship: they are positively correlated up to a certain point and negatively correlated beyond it.
  4. Regularization: Regularization caps the coefficients of the independent variables and thus curbs the tendency to overfit.
  5. Ensembling: Apply this last to extract maximum benefit from it; it can improve generalization when all other methods fail. Choose the ensembling technique as per the problem at hand: bagging helps with complex models, while boosting helps with simpler models.
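
Here is a sketch combining three of the methods above on synthetic data: feature selection (2), early stopping (3), and L2 regularization (4). It assumes a recent scikit-learn, and all dataset sizes and hyperparameters are illustrative assumptions rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with many irrelevant features mixed in.
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    SelectKBest(f_classif, k=10),   # feature selection: drop irrelevant columns
    SGDClassifier(
        loss="log_loss",            # logistic regression trained by SGD
        penalty="l2", alpha=1e-3,   # regularization: caps the coefficients
        early_stopping=True,        # early stopping: halt training when the
        validation_fraction=0.1,    #   validation score stops improving
        random_state=0,
    ),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```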