Logistic regression estimates the probability of a binary output variable from a linear combination of one or more predictor variables through the logit link function: the linear predictor is mapped to a probability by the inverse logit (logistic) function. This nonlinear transformation makes logistic regression useful for classification problems.
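
For example, assuming a data frame df with a binary response y and a numeric predictor x (both hypothetical names), a logistic regression can be fitted in R with glm:

# fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) using the logit link
model <- glm(y ~ x, data = df, family = binomial(link = "logit"))
summary(model)

# predicted probabilities for each observation
p_hat <- predict(model, type = "response")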

Assumptions of logistic regression

Logistic regression makes fewer assumptions about the input than linear regression.

  • It does not require a linear relationship between the dependent and independent variables.
  • The features are not assumed to be multivariate normal or homoscedastic.
  • The residuals are also not assumed to be normally distributed.
  • The features need not be interval or ratio scaled.

Logistic regression requires larger sample sizes than simple linear regression to obtain stable estimates.

Goodness of fit and statistical tests in logistic regression

Pseudo-R² measures, such as McFadden's R², can be used to measure the variance explained by the model.
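
As a sketch, McFadden's R² compares the log-likelihood of the fitted model with that of an intercept-only (null) model; this assumes the hypothetical model and df from the example above:

# McFadden's R^2 = 1 - logLik(fitted model) / logLik(null model)
null_model <- glm(y ~ 1, data = df, family = binomial)
1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))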

Another simple check is whether the difference between the null deviance and the residual deviance is significant. This difference approximately follows a chi-squared distribution, with degrees of freedom equal to the difference between the null and residual degrees of freedom (the number of predictors), so we can compute an approximate p-value for it.

If the p-value we obtain is tiny, we can be reasonably confident that the model produces predictions that are better than those of the intercept-only null model.
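
Assuming the hypothetical model object from above, the test can be computed directly from the fitted glm object; the same quantities appear in the printed glm output shown below:

# chi-squared test of (null deviance - residual deviance)
pchisq(model$null.deviance - model$deviance,
       df = model$df.null - model$df.residual,
       lower.tail = FALSE)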

# output from glm
... 
    Null deviance: 4080.2  on 3023  degrees of freedom
Residual deviance: 1451.0  on 3023  degrees of freedom
AIC: 1651

> 1 - pchisq(4080.2 - 1451.0, df = (3023 - 3023))  # df = null df - residual df
[1] 0

The low p-value means that the null hypothesis can be rejected: the model does explain a significant part of the variance in the output.

ROC Curve

A ROC (Receiver Operating Characteristic) curve summarizes classifier performance over the range of trade-offs between true positive (TP) and false positive (FP) rates. It plots sensitivity (the ability of the model to predict an event correctly) against 1 - specificity, i.e. the false positive rate, for the possible cut-off classification probability values.

The ROC of random guessing lies on the diagonal line. The ROC of a perfect classifier passes through the upper left corner of the graph, where the TP rate is 1.0 and the FP rate is 0.

Each point on the curve corresponds to one threshold setting for the true positive rate (TPR) and false positive rate (FPR), and the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (the ideal) than to 0.5.
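
To make the role of the threshold concrete, here is a small base-R sketch (again assuming the hypothetical p_hat and df from the earlier examples) that computes a single (FPR, TPR) point; sweeping the cutoff from 0 to 1 traces out the full curve:

cutoff <- 0.5                          # one possible classification threshold
pred_class <- as.integer(p_hat >= cutoff)

tp <- sum(pred_class == 1 & df$y == 1) # true positives
fp <- sum(pred_class == 1 & df$y == 0) # false positives
fn <- sum(pred_class == 0 & df$y == 1) # false negatives
tn <- sum(pred_class == 0 & df$y == 0) # true negatives

tpr <- tp / (tp + fn)                  # sensitivity
fpr <- fp / (fp + tn)                  # 1 - specificity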

The ROCR package provides functions to plot the ROC curve; see the code example in the AUC section below.

AUC

The Area Under the Curve (AUC), also known as the c-statistic, is a performance metric derived from the ROC curve: the higher the area under the curve, the better the predictive power of the model. For example, c = 0.8 means that a randomly selected individual from the positive group has a higher predicted value than a randomly chosen individual from the negative group 80 percent of the time.
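
This interpretation can be verified directly with a small sketch (again assuming the hypothetical p_hat and df from the earlier examples): the AUC is the fraction of positive/negative pairs that the model ranks correctly.

pos <- p_hat[df$y == 1]  # predicted probabilities for the positive group
neg <- p_hat[df$y == 0]  # predicted probabilities for the negative group

# fraction of pairs ranked correctly; ties count as half
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))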

The ROCR package provides functions to plot the ROC curve and to calculate the model's AUC:

library(ROCR)

# predicted_outcomes: predicted probabilities from the fitted model
# original_outcome:   the observed class labels
pred <- prediction(predicted_outcomes, original_outcome)
roc <- performance(pred, "tpr", "fpr")
plot(roc)

# extract the area under the curve
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
