The purpose of this assignment is to perform logistic regression, interpret the results, and analyze whether or not the information generated can be used to address a specific business problem.
For this assignment, you will use the “Adult Incomes” data set from the Topic Materials.
The marketing department is interested in creating advertising directed primarily at high-income individuals, and it has come to you seeking very specific customer data. The director of marketing explains that individuals with large amounts of disposable income tend to purchase luxury items. Therefore, understanding which predictors are correlated with high income can be very useful for a marketing department because it can help it tailor messages to the high-earning cohort. For example, individuals who earn capital gains tend to be high-income earners, and advertisements for luxury items can be targeted toward them on realty or investment websites.
As a member of the analytics team, you have been asked to determine a list of predictors and their relative impact on the likelihood of an individual being a high-income earner. Individuals earning more than $50,000 annually are considered high-income earners. In your summary, include a discussion of how the marketing department can use your results to devise a smart advertising strategy.
Question 1: Partition the data to create a training data set (70%) and test data set (30%). With a cut-off of 0.5, run logistic regression with “Income” as the target and the following predictors: “Capital_Gain,” “Hours_Per_Week,” “Sex,” “Age,” and “Race.” Show the model summary and variables in the equation. Which probability is being modeled? Include the “Model Summary” and “Variables in the Equation” outputs when submitting the answer.
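The 70/30 partition in Question 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition`, the fixed seed, and the stand-in `records` list are hypothetical, and your statistics package may partition the data differently (for example, by assigning each row to a set at random rather than cutting at an exact count).

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (training, test) lists."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the rows of the data set (hypothetical; not the real data).
records = list(range(1000))
train, test = partition(records)
print(len(train), len(test))
```

Fixing the seed means the same rows land in the training set on every run, which keeps the model output reproducible across Questions 1 through 5.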
Question 2: Is “Race” a statistically significant predictor when modeling whether incomes are greater than $50,000 annually? Explain your answer. Use a 5% significance level.
Question 3: Rerun the model without “Race” while still using a cut-off of 0.5. Show the model summary and variables in the equation. Write the equation showing the probability as a function of the predictors. Interpret the meaning of the coefficients for “Age” and “Sex.” Include the “Model Summary” and “Variables in the Equation” outputs when submitting the answer.
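As a reminder of the general form the equation in Question 3 takes, a logistic model with these four predictors can be written as below. The symbols \beta_0 through \beta_4 are placeholders for the estimates reported in your own "Variables in the Equation" output, and the sketch assumes the software models the probability of the ">50K" category and codes "Sex" as a 0/1 indicator; check both against your output.

```latex
\[
P(\text{Income} > 50\text{K}) = \frac{1}{1 + e^{-z}}, \qquad
z = \beta_0 + \beta_1\,\text{Capital\_Gain} + \beta_2\,\text{Hours\_Per\_Week}
      + \beta_3\,\text{Sex} + \beta_4\,\text{Age}
\]
```

Holding the other predictors fixed, a one-unit increase in a predictor multiplies the odds of the modeled outcome by e raised to that predictor's coefficient, which is the usual starting point for interpreting "Age" and "Sex."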
Question 4: Given that approximately 26% of the individuals in the data have incomes greater than $50,000 annually, rerun the model in Question 3 with a cut-off of 0.26. Show the classification tables and percent correct for each predicted outcome (>50K and <=50K) for the training data and test data. Why is the percent correct usually lower when the test data are used? Include the “Training Classification Table” and “Test Classification Table” outputs when submitting the answer.
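The effect of the cut-off on a classification table can be illustrated with a small Python sketch. The function names and the eight actual/probability pairs are invented purely for illustration and are not taken from the "Adult Incomes" data; the sketch also assumes a case is classified as ">50K" when its predicted probability is at or above the cut-off.

```python
def classification_table(actual, predicted_prob, cutoff):
    """Count cases by (actual class, predicted class); 1 means >50K, 0 means <=50K."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, prob in zip(actual, predicted_prob):
        predicted = 1 if prob >= cutoff else 0   # assumption: ties go to >50K
        counts[(a, predicted)] += 1
    return counts

def percent_correct(counts, cls):
    """Percent of cases whose actual class is cls that were predicted as cls."""
    total = counts[(cls, 0)] + counts[(cls, 1)]
    return 100.0 * counts[(cls, cls)] / total if total else float("nan")

# Made-up values for illustration only.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
probs = [0.80, 0.30, 0.10, 0.40, 0.20, 0.20, 0.05, 0.60]

for cutoff in (0.5, 0.26):
    table = classification_table(actual, probs, cutoff)
    print(cutoff, percent_correct(table, 1), percent_correct(table, 0))
```

In this toy example, lowering the cut-off from 0.5 to 0.26 classifies more cases as ">50K," so the percent correct rises for the ">50K" group and falls for the "<=50K" group. The same trade-off is worth looking for in your own training and test tables.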
Question 5: Consider the following individual: Age=30, Sex=Female, Hours_Per_Week=40, Capital_Gain=$0. Based on the logistic model from Question 4, what is the probability of this individual earning more than $50,000 annually? What would be the predicted class for this individual? Explain your answer.
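The calculation in Question 5 follows a fixed pattern: compute the linear predictor, convert it to a probability, and compare the probability with the cut-off. The Python sketch below shows that pattern only. Every coefficient in it is a HYPOTHETICAL placeholder, not a result from the data, and the coding Sex = 0 for Female is an assumption; substitute the estimates and coding from your own output.

```python
import math

def predicted_probability(intercept, coefficients, features):
    """Logistic probability 1 / (1 + exp(-z)), where z is the linear predictor."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def predicted_class(probability, cutoff=0.26):
    """Classify as >50K when the probability is at or above the cut-off."""
    return ">50K" if probability >= cutoff else "<=50K"

# HYPOTHETICAL placeholder values; replace with your fitted estimates.
intercept = -4.0
coefficients = {"Age": 0.04, "Sex": 1.0, "Hours_Per_Week": 0.03, "Capital_Gain": 0.0003}

# The individual from Question 5, assuming Sex is coded 0 = Female, 1 = Male.
person = {"Age": 30, "Sex": 0, "Hours_Per_Week": 40, "Capital_Gain": 0}

p = predicted_probability(intercept, coefficients, person)
print(round(p, 4), predicted_class(p))
```

Because the placeholder coefficients are invented, the printed probability has no meaning for the assignment; only the sequence of steps carries over.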
Question 6: Based upon your analysis, what are the predictors that can determine whether or not an individual would be considered a high-income earner? Discuss how the marketing department can use this information in formulating its advertising strategy. Present your findings in the form of a 250-word executive summary that includes relevant data, charts, and tables to validate the conclusions presented.
Submit the answers to Questions 1-5 and the executive summary as Word documents.