In any production process in which one or more workers are engaged in a variety of tasks, the total time spent in production varies as a function of the size of the work pool and the level of output of the various activities. For example, in a large metropolitan department store, the number of hours worked (HOURS) per day by the clerical staff may depend on the following variables:

MAIL: number of pieces of mail processed (opened, sorted, etc.)

CERT: number of money orders and gift certificates sold

ACC: number of window payments (customer charge accounts) transacted

CHANGE: number of change order transactions processed

CHECK: number of checks cashed

MISC: number of pieces of miscellaneous mail processed on an “as available” basis

TICKETS: number of tickets sold.

The data for 52 working days are stored in the data file clerical.txt, attached to this assignment. The data set contains all the variables listed above and the variable DAY: day of the week (Mon, Tue, Wed, Thu, Fri and Sat) in the following order:

DAY, HOURS, MAIL, CERT, ACC, CHANGE, CHECK, MISC, TICKETS.

Conduct a regression analysis of the data to model the number of hours (HOURS) worked per day by the clerical staff. Find the best regression model that explains the relationship among the variables. Do not include the variable DAY in your analysis!

a) Create scatterplots for HOURS vs each of the independent variables (MAIL, CERT, ACC, CHANGE, CHECK, MISC, TICKETS). What conclusions can you draw about the relationships between HOURS and the independent variables? (No need to include the scatterplots in your submission)

b) Build a boxplot to see if the daily number of hours (HOURS) varies by day of the week (DAY). Which days seem to be busier?

c) Fit the FULL regression model to predict HOURS using the following independent variables: {MAIL, CERT, ACC, CHANGE, CHECK, MISC, TICKETS}. (OPTION: You have the option to include DAY as an additional independent variable after creating appropriate dummy variables)
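
If the optional DAY variable is used in part c), one common way to create the dummy variables is to pick a baseline day and code the remaining five days as 0/1 indicators. The sketch below is only an illustration: the function name `day_dummies`, the column names `DAY_Tue` etc., and the choice of Mon as baseline are assumptions, not part of the assignment.

```python
# Sketch: encode DAY as indicator (dummy) variables with one baseline level.
# Six levels give five indicators; the baseline day is coded as all zeros.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def day_dummies(day, baseline="Mon"):
    """Return a dict of 0/1 indicators for every level except the baseline."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    return {f"DAY_{d}": int(day == d) for d in DAYS if d != baseline}


# Example: a Wednesday sets only the Wed indicator to 1.
print(day_dummies("Wed"))
```

With this coding, each dummy coefficient compares that day's mean HOURS with the baseline day, holding the other predictors fixed.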

d) Use the goodness of fit test to check if any of the independent variables is associated with the response variable HOURS.
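
For part d), assuming the "goodness of fit test" refers to the overall F-test of the full model with the seven predictors (DAY excluded), the hypotheses and test statistic can be written as follows. Here SSR is the regression sum of squares, SSE is the error sum of squares, n = 52 days and k = 7 predictors, so the reference distribution has 7 and 52 - 7 - 1 = 44 degrees of freedom.

```latex
H_0:\ \beta_1 = \beta_2 = \dots = \beta_7 = 0
\quad\text{vs.}\quad
H_a:\ \text{at least one } \beta_j \neq 0,
\qquad
F = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)} \sim F_{7,\,44} \ \text{under } H_0 .
```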

e) Does multicollinearity seem to be a problem here? What is your evidence? Compute and analyze the VIF statistics.
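
As a reminder for part e), the variance inflation factor of predictor j is defined from the R-squared obtained when that predictor is regressed on all the other predictors. The cutoff used to flag a problem is a convention rather than a rule; values above roughly 10 (some texts use 5) are often treated as a warning sign.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
R_j^2 = \text{coefficient of determination from regressing } X_j \text{ on the remaining predictors.}
```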

f) Apply TWO model selection procedures to find the best model to predict HOURS. You can choose any two procedures (backward selection, forward selection, adj-R2, Cp, stepwise, etc…). Choose the model that provides the best fit for the data. (NOTE: Don’t worry about interaction terms, just fit a model with main effects)

g) Select a regression model based on the results in f). Write down the expression of the estimated model.

h) Draw a scatter plot of the studentized residuals against the predicted values. Does the plot show any striking pattern indicating problems in the regression analysis?

i) Analyze normal probability plot of residuals. Is there any evidence that the assumption of normality is not satisfied?

j) Are there any outliers or influential points? Compute appropriate statistics.
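
For part j), the statistics usually examined are the leverage, the studentized residual and Cook's distance. In the notation below (an assumption, since the assignment does not fix one), H is the hat matrix, p is the number of estimated coefficients including the intercept, and r_i is the internally studentized residual. Typical screening cutoffs, which vary between textbooks, are h_ii > 2p/n for leverage, |r_i| > 3 for outliers, and D_i > 1 (or 4/n) for influence.

```latex
h_{ii} = \left[ X (X^\top X)^{-1} X^\top \right]_{ii},
\qquad
D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} .
```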

k) Use the fitted model in f) to predict the average number of hours worked by clerical staff on a day where 3200 pieces of mail are processed (MAIL), 120 certificates are sold (CERT), 600 payments are processed (ACC), 250 change orders are processed (CHANGE), 500 checks are cashed (CHECK), 70 pieces of miscellaneous mail are processed (MISC), and 400 tickets are sold (TICKETS). Provide the confidence interval for your estimate.
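
For part k), the interval requested is a confidence interval for the mean response, not a prediction interval for a single day. A sketch of the standard formula is given below, where s is the residual standard error and p is the number of estimated coefficients including the intercept. The vector x_0 is written assuming all seven predictors are retained; if the selected model drops some of them, the corresponding entries are dropped as well.

```latex
\hat{y}_0 \;\pm\; t_{\alpha/2,\; n-p}\; s \sqrt{x_0^\top (X^\top X)^{-1} x_0},
\qquad
x_0 = (1,\ 3200,\ 120,\ 600,\ 250,\ 500,\ 70,\ 400)^\top .
```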

l) Discuss the results of your analysis and explain what your model says about the relationships between HOURS and the other predictors. What are the most important predictors for HOURS?

Problem 2 Logistic Regression

Offshore oil drilling near an Alaskan estuary has led to increased air traffic, mostly large helicopters. The Fish and Wildlife Service commissioned a study to investigate the impact these helicopters have on the flocks of Pacific Brant geese, which inhabit the estuary in fall before migrating. Two large helicopters were flown repeatedly over the estuary at different altitudes and lateral distances from the flock. The flight responses of the geese (recorded as “low (0)” or “high (1)”), Altitude (hundreds of meters), and lateral Distance (hundreds of meters) for each of 464 helicopter overflights were recorded and are saved in the pacgeese.txt file.

a) Fit a logistic regression model to predict the probability of high flight response from the geese using altitude and lateral distances of helicopter flights. Write down the expression of the fitted model. (HINT: probability of interest is p = pr(flight response=HIGH))
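
The general form of the model in part a) can be written as below; the fitted expression replaces each beta with its estimate from the software output. The symbols beta_0, beta_1 and beta_2 are notation chosen here for illustration.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{Altitude} + \beta_2\,\mathrm{Distance},
\qquad
p = \frac{1}{1 + \exp\!\big(-(\beta_0 + \beta_1\,\mathrm{Altitude} + \beta_2\,\mathrm{Distance})\big)} .
```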

b) Conduct a test to determine if flight response of the geese depends on the altitude of the helicopter. Test using alpha = 0.01.
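
One way to carry out the test in part b) is a two-sided Wald z-test on the Altitude coefficient (a likelihood ratio test is a reasonable alternative). The sketch below assumes the estimate and its standard error have already been read off the fitted model output; the function name `wald_test` and the numbers in the example call are hypothetical placeholders, not results from pacgeese.txt.

```python
from statistics import NormalDist


def wald_test(estimate, std_error, alpha=0.01):
    """Two-sided Wald z-test of H0: coefficient = 0.

    Returns the z statistic, the p-value, and whether H0 is rejected.
    """
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical inputs for illustration only (not estimates from the data).
z, p_value, reject = wald_test(estimate=0.5, std_error=0.25)
print(z, p_value, reject)
```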

c) Give a practical interpretation of the model parameters for Altitude and Distance. Explain how those values can be used to analyze the effect of altitude and distance on the odds of high flight response.
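
For part c), a coefficient in a logistic model is a change in log-odds, so exponentiating it gives the multiplicative change in the odds of a high flight response per unit increase in the predictor. Because Altitude and Distance are recorded in hundreds of meters, one unit corresponds to 100 m. The sketch below illustrates the arithmetic; the function names and the coefficient values in the example are hypothetical, not fitted values.

```python
import math


def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit increase
    in a predictor whose logistic coefficient is `beta`."""
    return math.exp(beta * delta)


def probability(intercept, b_alt, b_dist, altitude, distance):
    """Predicted probability of a high response from the linear predictor."""
    eta = intercept + b_alt * altitude + b_dist * distance
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficient for illustration only (not from the data):
# a negative coefficient gives an odds ratio below 1, meaning lower odds
# of a high response for each additional 100 m.
print(odds_ratio(-0.2))
print(odds_ratio(-0.2, delta=5.0))  # effect of an extra 500 m
```

An odds ratio of 1 means the predictor has no effect on the odds; values below 1 indicate the odds of a high response decrease as the predictor increases, and values above 1 indicate they increase.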