Household Spend Data Analysis

Scenario

You are the lead marketing data scientist for Bedding Bathing & Yonder (BBY), an American chain of domestic merchandise retail stores with an online presence. The chain operates stores primarily throughout the United States. BBY offers a membership and loyalty program that gives customers easy access to benefits and discounts through email and physical direct-mail coupons tailored to each household’s shopping habits. Loyalty customers have shared some basic information about their households, and as a result the company’s marketing department has excellent data assets for these customers.

Senior management has asked you to model household revenue from a predefined household data set of these loyalty customers. Since pleasing these customers and maximizing their revenue potential is important to the company’s success, the marketing team has also purchased additional data sets to enrich BBY’s internal household data.

In addition to the training data, the marketing department has a test set of households used primarily for marketing campaign evaluation. Once your predictive model is built, you will evaluate it on both the training set and the test set. Your business colleagues will expect model metrics for both data partitions.

Provided the model is accurate and consistent across both partitions, it will then be applied to a third set of households that the marketing department has segmented as prospects. The model needs to predict household spend among these prospects so that the campaign designers can tailor the ads according to the predicted amounts. For example, a household that is expected to spend $20 may receive a "buy one item, get the second one half off" offer to encourage higher spending. Similarly, a prospect household with a higher predicted spend, such as $250, may receive a coupon for 10% off to trigger the purchase of a large single item.

Lastly, with careful exploratory data analysis (EDA), you are also encouraged to share any data or model insights that help shape the company’s understanding of its best customers.

You are asked to examine the attributes of 15,000 households and their corresponding BBY spending habits. After examination, build one or more predictive models on the training set. Then evaluate the model(s) within the training set for insights (important features) and accuracy (RMSE), and apply the model(s) to the test set of households to confirm consistency. Once satisfied, predict household spend on the prospect household data. Lastly, create a PowerPoint for your data science peers covering data information and insights, the modeling methods employed, and model results, and provide the predicted prospect values in a CSV. You are NOT expected to create marketing campaign suggestions based on the model output, because those are customarily created alongside other business departments.
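The evaluation steps above can be sketched in a few lines. This is a minimal illustration only: the data is synthetic, the two features are invented, an ordinary least squares fit stands in for whatever model you actually choose, and the output column name `predicted_spend` is an assumption rather than a required format.

```python
import csv
import io

import numpy as np

rng = np.random.default_rng(0)

def make_households(n):
    """Synthetic stand-in data: two made-up features and a noisy spend value."""
    X = rng.normal(size=(n, 2))
    y = 100 + 30 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=n)
    return X, y

def fit_linear(X, y):
    """Ordinary least squares with an intercept (a baseline, not a recommendation)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted coefficients to new feature rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted spend."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def predictions_to_csv(ids, preds):
    """Render (tmpID, predicted spend) pairs as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["tmpID", "predicted_spend"])
    for household_id, pred in zip(ids, preds):
        writer.writerow([household_id, round(float(pred), 2)])
    return buf.getvalue()

X_train, y_train = make_households(800)
X_test, y_test = make_households(200)
X_prospects, _ = make_households(50)  # in practice, prospect spend is unknown

coef = fit_linear(X_train, y_train)

# Report the same metric on both partitions so they can be compared
train_rmse = rmse(y_train, predict(coef, X_train))
test_rmse = rmse(y_test, predict(coef, X_test))
print(f"train RMSE: {train_rmse:.2f}  test RMSE: {test_rmse:.2f}")

# Score the prospects; the CSV text can then be written to a file for delivery
csv_text = predictions_to_csv(range(len(X_prospects)), predict(coef, X_prospects))
print(csv_text[:80])
```

A test RMSE close to the training RMSE suggests the model is consistent across partitions; a much larger test RMSE is a sign of overfitting and worth investigating before scoring prospects.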

Data

Source: The data has been synthesized using existing proprietary household data sets from various third parties obtained for a single US community. Some variables have been randomized, others have been anonymized, and further de-identification has been performed on the geolocation attributes (the lat/lon values have been completely manufactured). Thus, the data likely cannot be reconstructed in its original proprietary form, but it is still representative in some regards.

All data tables have a unique identifier, "tmpID", which can be used to join the data. If joined properly, the training data will have 80 variables, described below. Not all variables will necessarily be useful, or even ethical, to use in a model.
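As a rough sketch of the join on "tmpID", assuming pandas is available: the two small tables and every column name other than `tmpID` are hypothetical stand-ins, not the actual BBY files.

```python
import pandas as pd

# Hypothetical stand-ins for the internal table and one purchased enrichment table.
# Every column name except "tmpID" is invented for illustration.
internal = pd.DataFrame({
    "tmpID": [101, 102, 103],
    "household_spend": [20.0, 250.0, 75.5],
})
enrichment = pd.DataFrame({
    "tmpID": [101, 102, 104],
    "est_home_value": [180000, 320000, 210000],
})

def join_on_tmpid(left, right):
    """Left-join `right` onto `left` by tmpID, failing fast on duplicate keys."""
    # validate="one_to_one" raises pandas.errors.MergeError if tmpID repeats
    # on either side, which would otherwise silently duplicate households.
    return left.merge(right, on="tmpID", how="left", validate="one_to_one")

joined = join_on_tmpid(internal, enrichment)

# A left join keeps exactly the households in the internal table
assert len(joined) == len(internal)

# Households missing from the enrichment table show up as NaN and need handling
print(joined)
print(joined.isna().sum())
```

Checking the row count and the missing-value counts after each join is a cheap way to confirm that the tables were joined properly before counting the resulting variables.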

FIPS Code Explanation

*Source: https://www.policymap.com/2012/08/tips-on-fips-a-quick-guide-to-geographic-place-codes-part-iii/*

Presentation goals

Organization – Was the presentation well organized?

Delivery – Was the content delivered clearly and persuasively with the audience in mind?

Code Documentation – Was the data mined to support the conclusion?

Written Supplemental – Is the information clear and supported in narration and code? Did the information satisfy the case problem? Were external and trustworthy sources used?

Data Mining & Modeling Process – Overall, as a complete portfolio of work, is the topic interesting, organized, researched, supported and delivered effectively? Was CRISP-DM, SEMMA, or a similar workflow followed to organize the work?