Due date: Feb. 18, 2015. Don't just give answers to these problems. Consider them as if they were job assignments given to you by your supervisor and add some brief discussion and interpretation of the results.

- Use the data contained in the file

http://www.utdallas.edu/~ammann/stat3355scripts/Smoking.txt

- Find the means and standard deviations for each variable.
- Which states are more than 2 sd's above the mean for cigarette consumption? for bladder cancer? for lung cancer?
- Which states are in the top 10% of cigarette consumption? of bladder cancer? of lung cancer? (see documentation for **R** function *quantile()*)
- Plot cigarette consumption versus lung cancer and add an informative title.
- Repeat for bladder cancer.
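
The steps in this problem can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the values, the state labels `S1` to `S12`, the column names `CIG` and `LUNG`, and the helper functions `above_2sd()` and `top_10pct()` are hypothetical and are not taken from Smoking.txt. The real file would be read with `read.table()` and its own column names used instead.

```r
# Toy stand-in for the Smoking data; all values and names are invented.
smoke <- data.frame(
  CIG  = c(18, 25, 22, 29, 21, 42, 24, 20, 23, 26, 19, 27),
  LUNG = c(17, 19, 16, 22, 18, 20, 27, 15, 21, 19, 14, 23),
  row.names = paste0("S", 1:12)
)

# Mean and standard deviation of every column.
col_means <- sapply(smoke, mean)
col_sds   <- sapply(smoke, sd)
print(rbind(mean = col_means, sd = col_sds))

# Row names whose value is more than 2 sd above the column mean.
above_2sd <- function(df, var) {
  x <- df[[var]]
  rownames(df)[x > mean(x) + 2 * sd(x)]
}

# Row names whose value exceeds the 90th percentile of the column.
top_10pct <- function(df, var) {
  x <- df[[var]]
  rownames(df)[x > quantile(x, 0.90)]
}

print(above_2sd(smoke, "CIG"))
print(top_10pct(smoke, "CIG"))

# Scatterplot with an informative title and axis labels.
plot(smoke$CIG, smoke$LUNG,
     xlab = "Cigarette consumption", ylab = "Lung cancer",
     main = "Lung cancer versus cigarette consumption (toy data)")
```

The same two helpers can be applied to each remaining column by changing the `var` argument.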

- Use the data contained in the file

http://www.utdallas.edu/~ammann/stat3355scripts/Sleep.data

A description of this data is given in

http://www.utdallas.edu/~ammann/stat3355scripts/Sleep.txt

The `Species` column should be used as row names.

- Construct histograms of each variable.
- The strong asymmetry for all variables except `Sleep` indicates that a *log* transformation is appropriate for those variables. Construct a new data frame that contains `Sleep`, replaces *BodyWgt, BrainWgt, LifeSpan* by their log-transformed values, and then construct histograms of each variable in this new data frame.
- Plot `LifeSpan` vs `BrainWgt` with `LifeSpan` on the y-axis. Repeat using these variables after applying a log-transformation to both variables. Superimpose lines corresponding to the respective means of the variables for each plot.
- What proportion of species are within 2 s.d.'s of mean `LifeSpan`? What proportion are within 2 s.d.'s of mean `BrainWgt`? Answer these for the original variables and for the log-transformed variables.
- Obtain the correlation between *LifeSpan* and *BrainWgt*. Repeat for *log(LifeSpan)* and *log(BrainWgt)*. Interpret these correlations.
- Obtain the least squares regression line to predict *LifeSpan* based on *BrainWgt*. Repeat to predict *log(LifeSpan)* based on *log(BrainWgt)*. Predict *LifeSpan* of Homo sapiens based on each of these regression lines. Which would you expect to have the best overall accuracy? Which prediction is closest to the actual *LifeSpan* of Homo sapiens?
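
A minimal sketch of the log-transformation, 2 s.d. proportion, correlation, and regression steps follows. It runs on invented numbers, not on Sleep.data: the species labels `sp1` to `sp8`, the values, the helper `prop_within_2sd()`, and the brain weight `new_brain` used for prediction are all hypothetical. Note that a prediction from the log-log fit is on the log scale and has to be back-transformed with `exp()` before it can be compared with an actual life span.

```r
# Toy stand-in for two columns of the Sleep data; all values are invented.
toy <- data.frame(
  BrainWgt = c(0.3, 1.2, 5.5, 25, 120, 450, 1300, 4600),
  LifeSpan = c(3, 4.5, 7, 12, 20, 30, 50, 70),
  row.names = paste0("sp", 1:8)
)

# log() applied to a numeric data frame transforms every column.
toy_log <- log(toy)

# Proportion of observations within 2 sd of the mean.
prop_within_2sd <- function(x) mean(abs(x - mean(x)) <= 2 * sd(x))
print(sapply(toy, prop_within_2sd))
print(sapply(toy_log, prop_within_2sd))

# Correlations on the original and on the log scale.
print(cor(toy$LifeSpan, toy$BrainWgt))
print(cor(toy_log$LifeSpan, toy_log$BrainWgt))

# Scatterplot with dashed lines at the two means.
plot(toy$BrainWgt, toy$LifeSpan,
     xlab = "BrainWgt", ylab = "LifeSpan",
     main = "LifeSpan versus BrainWgt (toy data)")
abline(h = mean(toy$LifeSpan), v = mean(toy$BrainWgt), lty = 2)

# Least squares fits on each scale.
fit_raw <- lm(LifeSpan ~ BrainWgt, data = toy)
fit_log <- lm(LifeSpan ~ BrainWgt, data = toy_log)

# Predict at a hypothetical brain weight; back-transform the log-scale fit.
new_brain <- 1000
pred_raw <- predict(fit_raw, newdata = data.frame(BrainWgt = new_brain))
pred_log <- exp(predict(fit_log, newdata = data.frame(BrainWgt = log(new_brain))))
print(c(raw = unname(pred_raw), log = unname(pred_log)))
```

With the real data, the value passed as `new_brain` would be the `BrainWgt` entry in the row for the species of interest.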

- Use the data contained in the file

http://www.utdallas.edu/~ammann/stat3355scripts/HappyPlanet.csv

This data comes from the*Happy Planet Index*, http://www.happyplanetindex.org

Note that one of the countries is `Cote d'Ivoire`, which requires use of the `quote="\""` argument in `read.table()`.

- Obtain the quartiles of LifeExpectancy.
- Construct a histogram of GDP. Obtain the mean and s.d. of GDP. How many countries are within 2 s.d.'s of the mean GDP?
- Since GDP is heavily skewed, construct a new variable called **logGDP** which is the logarithm of GDP. Answer the previous two items for this variable. Are the quartiles of **logGDP** the same as the logarithm of the quartiles of GDP? What about the mean?
- The SubRegion variable in this data set represents both region and sub-region. Region is the first character and sub-region is the second character. If this data has been read into a data frame named *HappyPlanet*, then region can be extracted using the **substring()** function. We want these numeric codes to be treated as categories, not numbers, so we can convert the result of this operation to a factor.

    Region = substring(HappyPlanet[,"SubRegion"],1,1)
    Region = factor(Region)

- Plot LifeExpectancy vs logGDP, use different colors for different regions, include an informative title, and include a legend that indicates which color corresponds to which region.
- Find the correlation between **LifeExpectancy, logGDP** and interpret.
- Repeat the previous two items for **HappyLifeYears**.
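
The reading, quartile, and coloring steps can be sketched as follows. This is an illustration only: the inline text `csv_text` stands in for HappyPlanet.csv, and every value in it, along with the country names other than `Cote d'Ivoire` and the data frame name `toy`, is invented. The sketch also assumes comma-separated input with a header row, so `sep = ","` and `header = TRUE` are passed to `read.table()` alongside `quote = "\""`.

```r
# Invented comma-separated text standing in for HappyPlanet.csv.
csv_text <- "Country,SubRegion,GDP,LifeExpectancy
Cote d'Ivoire,4a,1600,47
Alpha,1a,32000,79
Beta,1b,28000,78
Gamma,2a,9000,72
Delta,2b,4000,65
Epsilon,4b,900,52"

# quote = "\"" keeps the apostrophe in Cote d'Ivoire from being read as a quote.
toy <- read.table(text = csv_text, header = TRUE, sep = ",", quote = "\"")

# Quartiles, mean, sd, and the count within 2 sd of the mean.
print(quantile(toy$LifeExpectancy, c(0.25, 0.50, 0.75)))
gdp_mean <- mean(toy$GDP)
gdp_sd   <- sd(toy$GDP)
print(sum(abs(toy$GDP - gdp_mean) <= 2 * gdp_sd))

# Log-transform GDP, then compare summaries on the two scales.
toy$logGDP <- log(toy$GDP)
print(rbind(quartiles_of_log = quantile(toy$logGDP, c(0.25, 0.50, 0.75)),
            log_of_quartiles = log(quantile(toy$GDP, c(0.25, 0.50, 0.75)))))
print(c(mean_of_log = mean(toy$logGDP), log_of_mean = log(gdp_mean)))

# Region is the first character of SubRegion, treated as a category.
Region <- factor(substring(toy[, "SubRegion"], 1, 1))

# One color per region; the legend uses the same level-to-color mapping.
plot(toy$logGDP, toy$LifeExpectancy,
     col = as.integer(Region), pch = 19,
     xlab = "log(GDP)", ylab = "Life expectancy",
     main = "Life expectancy versus log GDP by region (toy data)")
legend("bottomright", legend = levels(Region),
       col = seq_along(levels(Region)), pch = 19, title = "Region")

print(cor(toy$LifeExpectancy, toy$logGDP))
```

Using `as.integer(Region)` for the point colors and `seq_along(levels(Region))` for the legend keeps the two in step, since both index the factor levels in the same order.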

2015-04-20