
































This report presents the results of various statistical tests performed on house price and living area data. The tests include one-sample t-tests, two-sample t-tests, and analysis of variance (ANOVA). The analysis focuses on two variables: livarea and sprice. The tests aim to determine whether the means of these variables differ significantly from a hypothesized value, or whether there is a significant difference between the means of two groups.
Activity 1
Introduction
I. Qualitative analysis
II. Quantitative analysis
III. Quantitative and qualitative
IV. Hypothesis Testing
V. Prediction model
VI. Solving heteroskedasticity
Conclusion for the final model

Activity 2
Introduction
I. Clean dataset
II. Preprocess outliers
III. Descriptive Statistics
IV. Testing hypothesis
V. Prediction model
VI. Recommend different model
VII. Explain the problems
Conclusion
This dataset contains 1500 houses sold in Stockton, California, during 1996-1998. The purpose of the dataset is to examine how the sale price of houses in Stockton, California, is affected by house characteristics. There are 7 variables:
Variable    Definition
sprice      Selling price of home, in dollars
livarea     Living area, in hundreds of square feet
beds        Number of beds
baths       Number of baths
lgelot      1 if lot size is greater than 0.5 acres, 0 otherwise
age         Age of home at time of sale, in years
pool        1 if home has a pool, 0 otherwise
Data source: Dr. John Knight, Department of Finance, University of the Pacific.
a) pool:
category    no pool    pool
count       1402       98
Comment: The number of houses without a pool is large (1402), while houses with a pool make up only a small part (98) of the collected data. There is a significant difference between the number of houses with and without a pool. The proportion of houses with a pool in the dataset is smaller than 7%.
b) lgelot:
category <=0.
Comment: The proportion of houses with a lot size larger than 0.5 acres is smaller among houses without a pool than among houses with a pool.
a) sprice
Histogram of sprice
Comment: The histogram is positively skewed. The most frequent selling price falls around 100, dollars. Most selling prices fall in the range of 50,000 dollars to 140,000 dollars. The average selling price is about 100,000 dollars.
Comment: The data has about 101 outliers, which account for 6.7% of the dataset. The data does not have large fluctuations. The highest price is a little over 700,000 dollars and the lowest price is about 22, dollars. The price difference is about 63,000 dollars.
b) age
Histogram of age
Comment: Houses that are about 18 years old are the most numerous. House ages range from newly built to about 50 years, with most houses being 10 to 22 years old. Houses older than 60 years are rare and almost non-existent. The average age of a house is around 18 years.
Comment: The number of outliers is not significant compared to the size of the dataset. The data fluctuates, but not by much. The largest area is close to 50 hundreds of square feet and the smallest is around 5 hundreds of square feet. The area difference is about 5 hundreds of square feet.
d) sprice and livarea
Comment: Based on the graph, the larger the living area of a house, the higher its selling price. The correlation coefficient is nearly 0.8.
e) livarea and baths
Comment: As in the previous plot, the graph shows that the more baths a house has, the larger its living area. The correlation between baths and livarea is about 0.72.
f) sprice and beds
Comment: Turning to beds, the boxplot shows that the more bedrooms a house has, the larger its living area. On average, the area of a house with 2 bedrooms is smaller than that of a house with 3 bedrooms. There is a moderate correlation of about 0.58.
a) livarea and lgelot
I. Qualitative
a) The proportion of the houses with pool in the dataset is not more than 7%.
p is the proportion of the houses with pool in the dataset.
H0: p = 7%
Ha: p < 7%
Because p-value = 0.2394 > 0.05 → fail to reject H0. Therefore, at risk level α = 5%, we cannot conclude that the proportion of houses with a pool in the dataset is less than 7%.
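The test in a) can be sketched as a one-sample z-test for a proportion. This is a minimal sketch, assuming the normal approximation without continuity correction (the report does not say which procedure produced the p-value); the function name is hypothetical. The counts 98 and 1500 come from the pool frequency table above.

```python
import math


def one_sample_proportion_ztest(successes, n, p0, alternative="less"):
    """z-test for one proportion (normal approximation, no continuity correction)."""
    p_hat = successes / n
    # Standard error is computed under the null hypothesis p = p0.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Standard normal CDF evaluated at z.
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    if alternative == "less":
        p_value = cdf
    elif alternative == "greater":
        p_value = 1 - cdf
    else:  # two-sided
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value


# 98 of the 1500 houses have a pool; H0: p = 7%, Ha: p < 7%.
z, p_value = one_sample_proportion_ztest(98, 1500, 0.07, alternative="less")
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

Under these assumptions the p-value comes out close to the 0.2394 reported above, which suggests the report used a comparable uncorrected test.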
b) The proportion of the houses with size larger than 0.5 acres is bigger than 6%.
p is the proportion of the houses with size larger than 0.5 acres.
H0: p = 6%
Ha: p > 6%
Because p-value = 0.2934 > 0.05 → fail to reject H0. Therefore, at risk level α = 5%, we cannot conclude that the proportion of houses with a size larger than 0.5 acres is bigger than 6%.
c) The proportion of houses with a size larger than 0.5 acres among houses without a pool is smaller than the proportion of houses with a size larger than 0.5 acres among houses with a pool.
p1 is the proportion of houses with a size larger than 0.5 acres among houses without a pool.
p2 is the proportion of houses with a size larger than 0.5 acres among houses with a pool.
H0: p1 = p2
Ha: p1 < p2
Because p-value < 0.05 → reject H0. Therefore, at risk level α = 5%, we conclude that the proportion of houses with a size larger than 0.5 acres is smaller among houses without a pool than among houses with a pool.
II. Quantitative
a) The average selling price is about 100,000 dollars
H0: μ = 100000
Ha: μ ≠ 100000
Because p-value < 0.05 → reject H0. Therefore, at risk level α = 5%, we conclude that the average selling price is not 100,000 dollars.
Ha: μ1 ≠ μ2
Because p-value = 0.5675 > 0.05 → fail to reject H0. Therefore, at risk level α = 5%, we cannot conclude that the average age of houses with swimming pools differs from the average age of houses without swimming pools.
e) The houses with a size >0.5 acres have a higher average selling price compared to houses with a size <=0.5 acres.
μ1: the average selling price of houses with a size >0.5 acres
μ2: the average selling price of houses with a size <=0.5 acres
H0: μ1 = μ2
Ha: μ1 > μ2
Because p-value < 2.2e-16 < 0.05 → reject H0. Therefore, at risk level α = 5%, we conclude that houses with a size >0.5 acres have a higher average selling price than houses with a size <=0.5 acres.
f) Houses with a pool have a higher average selling price compared to houses without a pool.
μ1: the average selling price of houses with a pool
μ2: the average selling price of houses without a pool
H0: μ1 = μ2
Ha: μ1 > μ2
Because p-value = 2.262e-08 < 0.05 → reject H0. Therefore, at risk level α = 5%, we conclude that houses with a pool have a higher average selling price than houses without a pool.
a) Eliminate outliers from the dataset
As mentioned above, sprice has 101 outliers, accounting for 6.7% of the dataset. We remove them.
After removing the outliers, some new outliers appear, accounting for 1.4% of the data, which is not significant. Therefore, we can ignore them.
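The outlier removal step can be sketched as follows. This is only an illustration, assuming the usual boxplot rule (values beyond 1.5 times the interquartile range from the quartiles); the report does not state which rule was used, and the prices below are hypothetical, not taken from the dataset.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the boxplot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical selling prices, in dollars.
prices = [60_000, 85_000, 90_000, 100_000, 110_000, 120_000, 140_000, 700_000]
cleaned = remove_outliers(prices)
print(len(prices) - len(cleaned), "outlier(s) removed")
```

Because the quartiles are recomputed on the cleaned data, a second pass can flag new points as outliers, which is the effect described above.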
b) Split training and validation sets
Because p-value < 2.2e-16 → reject H0. This is a highly significant result: the null hypothesis should be rejected at any reasonable significance level, so at least one variable can be used to explain livarea. Moreover, at risk level α = 5%, sprice, beds, and baths are each significant in explaining livarea in this model. In particular, the p-value of beds is smaller than 0.05, so we reject H0 for its coefficient and can use beds to explain livarea.
Moreover, R^2 = 74.02% means that 74.02% of the observed variation in livarea can be explained by the linear regression relationship based on sprice, beds, and baths.
The fitted model is: Ŷ = -2.797 + 5.724e-05 sprice + 1.684 beds + 3.284 baths. Holding the other variables fixed, each additional dollar of selling price is associated with an increase of 5.724e-05 hundreds of square feet in living area; each additional bed with an increase of 1.684 hundreds of square feet; and each additional bath with an increase of 3.284 hundreds of square feet.
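To make the coefficient interpretation concrete, the fitted equation can be evaluated directly. The coefficients below are the ones reported above; the example house (100,000 dollars, 3 beds, 2 baths) and the function name are hypothetical.

```python
# Coefficients of the fitted model for livarea (hundreds of square feet).
COEF = {"intercept": -2.797, "sprice": 5.724e-05, "beds": 1.684, "baths": 3.284}


def predict_livarea(sprice, beds, baths):
    """Evaluate the fitted linear model for one house."""
    return (
        COEF["intercept"]
        + COEF["sprice"] * sprice
        + COEF["beds"] * beds
        + COEF["baths"] * baths
    )


# Hypothetical house: 100,000 dollars, 3 beds, 2 baths.
base = predict_livarea(100_000, 3, 2)
# Same house with one more bath: the prediction rises by the baths coefficient.
one_more_bath = predict_livarea(100_000, 3, 3)
print(f"{base:.3f} -> {one_more_bath:.3f}")
```

For this hypothetical house the model predicts about 14.5 hundreds of square feet, and adding one bath raises the prediction by exactly 3.284.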
d) Examine the multicollinearity of the model:
The VIF values of the three variables sprice, beds, and baths are not large. Therefore, multicollinearity is not a concern for this model.
e) Predict validation data
After computing the predicted values, the root mean square error (RMSE) is 2.326186. Some predicted values still differ significantly from the actual values. However, the model can be considered acceptable for now.
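For reference, the root mean square error is the square root of the average squared difference between actual and predicted values. A minimal sketch, with hypothetical living areas rather than the validation data used in the report:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical living areas, in hundreds of square feet.
actual = [12.0, 15.5, 20.0]
predicted = [13.0, 15.0, 18.0]
score = rmse(actual, predicted)
print(f"RMSE = {score:.4f}")
```

The RMSE is expressed in the units of the response, so the reported value of about 2.33 corresponds to a typical error of roughly 233 square feet.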
f) Examine the independence of the residuals
Because p-value = 0.672 > 0.05 → fail to reject H0. Therefore, at risk level α = 5%, there is no evidence of correlation among the residuals.
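The report does not name the test behind this p-value. One common check for first-order autocorrelation is the Durbin-Watson statistic, which is assumed here purely for illustration; the residuals below are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that alternate in sign (negative autocorrelation).
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(dw)
```

The statistic ranges from 0 to 4. Values near 2 are consistent with uncorrelated residuals, values near 0 point to positive autocorrelation, and values near 4 to negative autocorrelation.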
g) Examine the stability of the model
Because p-value = 0.01943 < 0.05 → reject H0. Therefore, the variance of the residuals is not constant.
In the plot above, a fan or cone shape indicates the presence of heteroskedasticity. This is a problem because linear regression assumes that the spread of the residuals is constant across the plot. If the residuals are scattered unequally, the population used in the regression has unequal variance, and the analysis results may be invalid.
For the normality test, p-value = 1.885e-06 < 0.05 → reject H0. Therefore, the residuals do not follow a normal distribution at risk level α = 5%.
a) Transforming the outcome variable
We transform livarea using a log transformation.
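A minimal sketch of the transformation, assuming the natural logarithm (the report does not state the base). Predictions made on the log scale must be exponentiated before they are compared with the original livarea values, for example when computing the RMSE. The function names and sample values are hypothetical.

```python
import math


def log_transform(values):
    """Natural-log transform of a strictly positive response."""
    return [math.log(v) for v in values]


def back_transform(log_predictions):
    """Map predictions from the log scale back to the original scale."""
    return [math.exp(p) for p in log_predictions]


# Hypothetical living areas, in hundreds of square feet.
livarea = [8.5, 14.0, 22.3]
log_livarea = log_transform(livarea)
recovered = back_transform(log_livarea)
print(recovered)
```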
With the studentized Breusch-Pagan test, p-value = 0.3555 > 0.05 → fail to reject H0. Therefore, at risk level α = 5%, there is no evidence against constant variance of the residuals. However, with the Shapiro-Wilk normality test, p-value = 0.02724 < 0.05 → reject H0, so the residuals do not follow a normal distribution at risk level α = 5%. At risk level α = 2%, on the other hand, normality is not rejected. The histogram of the residuals also looks approximately normal.
Compared to model 1, the RMSE is approximately equal.
c) Conclude
None of the three models has residuals that follow a normal distribution at risk level α = 5%. Among them, model 3 is the best: it shows no correlation among the residuals, the variance of its residuals is constant, and its residuals follow a normal distribution at a lower risk level. The problem may arise from the influence of outliers in the variables, even though they have been processed, or simply because the initial assumption that a linear regression model suits this dataset is not appropriate.
d) Choose other models using the AIC criterion
According to the AIC criterion, the best model is livarea^ = B0 + B1 sprice + B2 beds + B3 baths + B4 lgelot + B5 age. As shown by the plot above, age and lgelot do not appear to be correlated with livarea. Let us test whether we can drop them.
Ha: ∃ Bi ≠ 0
Because p-value = 0.000002041 < 0.05 → reject H0. Therefore, at least one of the coefficients of age and lgelot differs from zero, so we cannot drop both variables in this case.
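The AIC comparison in d) can be sketched for least-squares models. The formula below is the Gaussian AIC up to an additive constant that is the same for every model fitted to the same data; the residual sums of squares, sample size, and coefficient counts are hypothetical, chosen only to show how the comparison works.

```python
import math


def gaussian_aic(rss, n, k):
    """AIC (up to an additive constant) of a least-squares fit with
    residual sum of squares rss, n observations, and k coefficients."""
    return n * math.log(rss / n) + 2 * k


# Hypothetical fits on the same n = 1000 training rows.
aic_small = gaussian_aic(rss=5400.0, n=1000, k=4)  # sprice, beds, baths
aic_full = gaussian_aic(rss=5350.0, n=1000, k=6)   # plus lgelot and age
best = "full" if aic_full < aic_small else "small"
print(f"small: {aic_small:.2f}, full: {aic_full:.2f}, preferred: {best}")
```

The model with the lower AIC is preferred. The penalty term 2k means that extra variables are kept only if they reduce the residual sum of squares enough to offset the added complexity.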
e) Examine the AIC model