Simple Linear Regression and Predictive Modeling
The data for Assignment #10 is the Nutrition Study data. It is a 16-variable dataset with n = 315 records that you have seen and worked with previously. The data was obtained from medical record information and observational self-report of adults. The dataset consists of categorical, continuous, and composite scores of different types. A data dictionary is not available for this dataset, but the qualities measured can easily be inferred from the variable and category names for most of the variables. As such, higher scores for the composite variables translate into having more of that quality. The QUETELET variable is essentially a body mass index; it can be googled for more detailed information. It is the ratio of body weight (in pounds) to height (in inches) squared, and the ratio is then adjusted by a scaling factor so that the numbers become meaningful. Specifically, a QUETELET above 25 is considered overweight, while a QUETELET above 30 is considered obese. There is no other information available about this data.
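For reference, the usual scaling factor for a pounds-and-inches body mass index is 703. The short sketch below is illustrative only; the Nutrition Study file itself most likely contains just the finished QUETELET score, so nothing here is taken from the data.

# Standard BMI (Quetelet) calculation for imperial units; 703 is the usual adjustment factor
quetelet <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}
quetelet(160, 68)   # about 24.3, just under the overweight cutoff of 25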
1) Download the Nutrition Data and read it into RStudio. We will work with the entire dataset for this assignment.
2) There are 11 variables that are clearly continuous. For this assignment, you should consider the Quetelet variable to be the dependent response variable (Y). All other continuous variables should be considered independent or explanatory variables. Make a scatterplot of each continuous variable (X) against Y; you should have 10 different scatterplots. Obtain Pearson product-moment correlations for each X variable with Y. You can do this in table form or individually; it does not matter. Still, combine the scatterplot with the correlation information and discuss the appropriateness of simple linear regression for each scatterplot. Which variable seems most predictive of Quetelet (Y)?
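A minimal R sketch of tasks 1) and 2) follows. The file name and the predictor names in x_vars are assumptions; substitute your actual file name and the 10 continuous explanatory variables in your data.

# Read the data, then loop over the continuous predictors, plotting each
# against Quetelet and printing the Pearson correlation
nutrition <- read.csv("NutritionStudy.csv")
x_vars <- c("Age", "Calories", "Fat", "Fiber", "Alcohol",
            "Cholesterol", "BetaDiet", "RetinolDiet", "BetaPlasma", "RetinolPlasma")
for (v in x_vars) {
  plot(nutrition[[v]], nutrition$Quetelet,
       xlab = v, ylab = "Quetelet",
       main = paste("Quetelet vs.", v))
  cat(v, ": r =", round(cor(nutrition[[v]], nutrition$Quetelet), 3), "\n")
}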
3) Oftentimes, the explanatory variables are correlated among themselves. Obtain a standard correlation matrix for all of the explanatory variables. Then obtain a heat map of the correlations (see the correlation classroom for an example of this). Are there groups, or subsets, of explanatory variables that seem to clump together in that they are highly correlated among themselves?
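One way to produce both pieces of output in base R, reusing the x_vars vector from the sketch above:

# Correlation matrix of the explanatory variables and a simple heat map of it
cor_mat <- cor(nutrition[, x_vars])
round(cor_mat, 2)
heatmap(cor_mat, symm = TRUE, scale = "none",
        main = "Correlations among explanatory variables")
# corrplot::corrplot(cor_mat) is a common alternative if that package is installed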
4) Use the explanatory variable that is most highly correlated with Y and fit a simple linear regression model. Call this Model 1. Report the prediction equation for Model 1, interpret the coefficients, and report the R-squared statistic as a measure of goodness of fit. Set up and report the results of the hypothesis test for the slope parameter (beta1).
5) Pick one of the remaining explanatory variables and add it to Model 1 from task 4). Re-fit the linear regression model (note that it is now a multiple regression model; why?). Call this Model 2. Report the prediction equation for Model 2, interpret the coefficients, and report and interpret the R-squared statistic. How much has R-squared changed from Model 1 to Model 2? What is this change in R-squared uniquely attributable to? Does this change seem to have practical meaning or value? Discuss.
6) For the remaining explanatory variables, add them into Model 2 one at a time so that the model becomes one variable larger at each step. Note the R-squared value and the change in R-squared between each subsequent model. Which explanatory variables seem to contribute a lot (or a practical amount) to predicting Y, and which contribute little or nothing?
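A sketch of Models 1 and 2, assuming purely for illustration that Cholesterol turns out to be the most highly correlated predictor and that Fat is the variable added second; substitute whichever variables your correlations actually point to.

# Model 1: simple linear regression on the strongest single predictor
model1 <- lm(Quetelet ~ Cholesterol, data = nutrition)
summary(model1)            # coefficients, R-squared, and the t-test for beta1

# Model 2: add one more predictor, making it a multiple regression model
model2 <- update(model1, . ~ . + Fat)
summary(model2)

# Change in R-squared uniquely attributable to the added variable
summary(model2)$r.squared - summary(model1)$r.squared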
7) Re-fit a multiple regression model using only those explanatory variables from task 6) that seem to contribute a lot or a practical amount to predicting Y. Call this the Final Model. Report the prediction equation for the Final Model, interpret the coefficients, and report the R-squared statistic. Does this model seem to be meaningful, in the larger medical scope of things, for predicting Quetelet? Remember, a regression model is also information about the relationships between variables, so it should have meaning and be part of the data's story. Discuss. Is this modeling done? Or is there something else you would want to do to model this data? Write up your synthesis of what this dataset seems to be saying (up to this point) and where we should go from here.
Category: R
-
Exploring Risk Factors for Cardiovascular Disease in the Framingham Heart Study Data
“Exploring the Relationship Between Categorical Variables and Cardiovascular Health Outcomes”
For this assignment, you will be using the Framingham Heart Study data. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of subjects in the community of Framingham, Massachusetts. It was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease, and it identified the concept of risk factors and their joint effects. We will be using this original data.
As you look over the Framingham Heart Study data and data dictionary to familiarize yourself with the data, you will notice that the study had a longitudinal design. This means that there were multiple observations on the same individuals at different points in time. You will notice variables with the same name but with 1, 2, or 3 at the end of the name; these numbers indicate the data collection time points. For this assignment, we will only be using the primary variables and the variables at time point 1. Because of this, we can create an analysis file by retaining only the variables we want and removing the variables we do not need. This will make the data file easier to work with.
To reduce the dataset to a more manageable size, open the Framingham Heart Study data in Excel. Remove all variables whose names end in a ‘2’ or ‘3’; variables like sex2, sex3, age2, age3, etc., should all be removed. In Excel, you can simply highlight the variables you do not want and delete them. Next, remove all variables whose names start with “TIME”, such as TIMEAP, TIMEMI, etc. Save your reduced data file to your computer using a different filename; call it something like FHS_assign7.xlsx.
Check the records to see if there are missing values, and delete records with missing values. Re-save your dataset. Read your new analysis file into R. You are good to go.
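If you prefer to do the clean-up in R rather than Excel, a sketch along these lines produces the same reduced file; the raw file name here is an assumption, and readxl::read_excel() can be used in place of read.csv() for an .xlsx file.

# Drop the time-2/time-3 variables and the TIME... variables, remove incomplete
# records, and save the analysis file
fhs <- read.csv("frmgham.csv")
keep <- !grepl("[23]$", names(fhs)) & !grepl("^TIME", names(fhs), ignore.case = TRUE)
fhs_assign7 <- na.omit(fhs[, keep])
str(fhs_assign7)
write.csv(fhs_assign7, "FHS_assign7.csv", row.names = FALSE)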
ASSIGNMENT TASKS
Part A – Mechanics (25 points)
For this analysis, the variable “stroke” should be considered the response variable (Y) and the “diabetes1” variable should be considered the explanatory variable (X). Complete the following:
1) Construct a side-by-side bar graph to compare these two categorical variables. Describe what you see in this graph. Be sure to label the axes and give the graph a title.
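A base-R sketch of the bar graph, assuming the analysis file is a data frame named fhs_assign7 with columns DIABETES1 and STROKE coded 0/1 (match the names and coding in your own file):

# Cross-tabulate the two variables and draw grouped bars
tab <- table(Diabetes = fhs_assign7$DIABETES1, Stroke = fhs_assign7$STROKE)
barplot(tab, beside = TRUE,
        legend.text = c("No diabetes", "Diabetes"),   # rows assumed coded 0 then 1
        xlab = "Stroke (0 = no, 1 = yes)", ylab = "Count",
        main = "Stroke by diabetes status at time 1")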
2) Construct a contingency table, complete with marginal row and column totals, for these two variables, then answer the following:
a) What is the conditional probability of having a stroke given diabetes is present at time 1? What is the conditional probability of having a stroke given diabetes is NOT present at time 1?
b) What are the odds of having a stroke if diabetes is present at time 1? What are the odds of having a stroke if diabetes is not present at time 1?
c) Calculate the odds ratio of having a stroke when diabetes is present relative to when it is not. Interpret this result.
d) Specify the null and alternative hypotheses, and then conduct a hypothesis test to see if diabetes is related to having a stroke. Interpret the results.
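A sketch of the table-based quantities for parts a) through d), continuing with the table tab built for the bar graph above:

addmargins(tab)                            # contingency table with marginal totals
prop.table(tab, margin = 1)                # a) P(stroke | diabetes group), row proportions
p    <- prop.table(tab, margin = 1)[, "1"] # probability of stroke in each diabetes group
odds <- p / (1 - p)                        # b) odds of stroke in each group
odds
odds["1"] / odds["0"]                      # c) odds ratio: diabetes vs. no diabetes
chisq.test(tab)                            # d) test of independence between the two variables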
Part B – Open-Ended Analysis (75 points)
3) In professional practice, when you have an observational dataset like the Framingham Heart Study data, one is typically looking for risk factors: explanatory variables that are related to specific response variables of interest. For this task, you will identify and work with only categorical explanatory variables. The response variables of interest are ANYCHD, STROKE, and DEATH. Which categorical explanatory variables seem to indicate elevated risk of coronary heart disease, stroke, or death? Conduct an analysis. Report and interpret your results.
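One possible screening approach is a loop of chi-square tests; the candidate predictor names below are assumptions, so substitute the categorical variables actually present in your reduced file.

# Cross-tabulate each categorical candidate against each response and report the p-value
cat_vars  <- c("SEX1", "CURSMOKE1", "DIABETES1", "BPMEDS1")
responses <- c("ANYCHD", "STROKE", "DEATH")
for (y in responses) {
  for (x in cat_vars) {
    tab_xy <- table(fhs_assign7[[x]], fhs_assign7[[y]])
    cat(y, "~", x, ": chi-square p-value =",
        signif(chisq.test(tab_xy)$p.value, 3), "\n")
  }
}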
4) Which of the continuous explanatory variables do you think is most likely indicative of elevated risk of coronary heart disease (ANYCHD), stroke, or death? Pick one such variable. Create a new variable that maps the continuous variable's values into a categorical variable with at least 3 levels. Conduct contingency table analyses relating this newly created categorical variable to ANYCHD, STROKE, and DEATH. These analyses should be done separately; in other words, you will have at least 3 separate contingency tables. Do NOT attempt to build multi-dimensional contingency tables! Report on the results of your analysis and discuss them.
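A sketch of task 4), assuming hypothetically that systolic blood pressure at time 1 (a column named SYSBP1 here) is the chosen continuous variable; the cut points are illustrative groupings, not clinical recommendations.

# Map the continuous variable into three levels, then run a separate
# contingency-table analysis for each response
fhs_assign7$SYSBP_GRP <- cut(fhs_assign7$SYSBP1,
                             breaks = c(0, 120, 140, Inf),
                             labels = c("Normal", "Elevated", "High"),
                             right  = FALSE)
for (y in c("ANYCHD", "STROKE", "DEATH")) {
  tab_y <- table(SysBP = fhs_assign7$SYSBP_GRP, Outcome = fhs_assign7[[y]])
  print(addmargins(tab_y))
  print(chisq.test(tab_y))
}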
5) Reflect on your experiences here. What are your recommendations for future analysis? Congratulations! You have completed Assignment 8. Please save your R code, because you can re-use or cannibalize it in future assignments. Your write-up should address each task.
-
“Predicting Ticket Sales Patterns: Analysis and Forecasting for 2014-2016”
I was given this answer, BUT I need the answers to the questions with visualizations, not just the code written out.
Ticket sales patterns (15 points)
i. Create a model to predict ticket revenues or ticket revenue groups for 2014 using the previous five years of data.
ii. Test your model on 2015 data. Comment.
iii. Make predictions for ticket purchases in 2016 (like the Moneyball example, the data for 2016 is missing; assume that the coefficients and the intercept values for the model created in point (ii) will be the same for predicting 2016).
iv. Based on your model, who should be the top 10 ticket purchasers for 2016?
I need the answers to the questions above: the top 10 in an Excel file, and the rest as screenshots of the work done in R.
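A rough sketch of one way to approach parts i. through iv. The ticket data set is not described in this posting, so the data frame name (tickets) and the purchaser-by-year revenue columns (purchaser_id, rev_2009 through rev_2015) are hypothetical placeholders for whatever table you build.

# i. Model 2014 revenue from the previous five years (2009-2013)
fit_2014 <- lm(rev_2014 ~ rev_2009 + rev_2010 + rev_2011 + rev_2012 + rev_2013,
               data = tickets)
summary(fit_2014)

# ii. Test the same model structure on 2015 (predicting 2015 from 2010-2014)
fit_2015 <- lm(rev_2015 ~ rev_2010 + rev_2011 + rev_2012 + rev_2013 + rev_2014,
               data = tickets)
summary(fit_2015)

# iii. Reuse the 2015 coefficients for 2016 by shifting the predictors one year forward
newdata_2016 <- setNames(tickets[, c("rev_2011", "rev_2012", "rev_2013",
                                     "rev_2014", "rev_2015")],
                         c("rev_2010", "rev_2011", "rev_2012", "rev_2013", "rev_2014"))
tickets$pred_2016 <- predict(fit_2015, newdata = newdata_2016)

# iv. Top 10 predicted purchasers for 2016, written out for Excel
top10 <- head(tickets[order(-tickets$pred_2016), c("purchaser_id", "pred_2016")], 10)
write.csv(top10, "top10_2016.csv", row.names = FALSE)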