---
title: "Assignment 1" 
subtitle: "Case study: Avocadopocalypse"
output: html_document
---

![](avocado-Cover.jpg)

## 1. Load the dataset into R
In a [Kaggle competition](https://www.kaggle.com/neuromusic/avocado-prices), this dataset was provided by the Hass Avocado Board to analyze the perception that avocado prices surged in 2017. The R package `avocado` contains a more recent version of this dataset, which combines an [extensive description of variables](https://rdrr.io/cran/avocado/f/vignettes/a_intro.Rmd) with three datasets:

- at the country level (`hass_usa`)
- at the region level (`hass_region`)
- at the city or subregion level (`hass`)

You can decide which dataset you will work with. If you feel uncertain about your capabilities, I advise you to start with the more general dataset (at the USA level).


```{r}
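getwd() # only relevant if you read the Kaggle csv file from your working directory
# install the package once if it is not available yet
if (!requireNamespace("avocado", quietly = TRUE)) install.packages("avocado")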
library(avocado)
data('hass_region')
head(hass_region)
summary(hass_region)

table(hass_region$region)
```


Your goal is to predict prices in the last observed time period. You are free to select regions, time periods, and variables as you think necessary to predict avocado prices well. The quality of your prediction is determined by the difference between the predicted value and the observed value in the last time period.

In order to inspect the dataset over time, we create three new variables, day, month and year.

```{r, warning=F}
# week_ending has the form YYYY-MM-DD; format() is vectorised, so no loop over rows is needed
dates <- as.Date(as.character(hass_region$week_ending))
hass_region$year  <- format(dates, "%Y")
hass_region$month <- format(dates, "%m")
hass_region$day   <- format(dates, "%d")
```


I advise you to work from this notebook, so the results are well organized, and represented in such a way that others can make sense of it. For the assignment, it is important to show that you've learned how to code, but also how to interpret the plots, hypothesis tests and regression results. To get started, I added some examples below. Please make sure you add code and text below each question, and that the analyses combined become a report that is easy to read, and attractive to see.

## 2. Inspect variables
When you inspect the variables, pay special attention to measurement levels (number of categories, numeric values, etc.) and distributions (use plots!). Also explore conditional distributions (plots separated by groups of observations), and make selections on the data to investigate how 'complex' the data are.

For the assignment, select the inspections that were most insightful to you, and interpret these plots and summaries. As an example, I added my first inspection of the data below:

```{r}
pairs(hass_region[,c('avg_price_nonorg','small_nonorg_bag','large_nonorg_bag','xlarge_nonorg_bag')])
boxplot(avg_price_nonorg~year, hass_region)
```

The boxplot shows how the price changed over the years. Keep in mind that regional differences confound these comparisons over the years, as the price change might have differed per region. In the following we explore the price per region, and focus only on the larger regions in the US that are listed in the `hass_region` dataset. This is just to illustrate what you can do with the data. Of course you can further explore the city-level dataset `hass` if you want!

```{r}
# first convert region to character (this drops the factor levels), so that you can focus on a specific part of the data
hass_region$region <- as.character(hass_region$region)
unique(hass_region$region)
boxplot(avg_price_nonorg~region, hass_region)
boxplot(avg_price_org~region, hass_region)
```

As you can see, the price of avocados in the West is much more variable than in SouthCentral (where the price is the lowest). This might be because of the volume that is consumed in those regions (see the variable `Total.Volume` in the Kaggle csv; run `names(hass_region)` to find the volume columns in the package data). In the Northeast, the price is the highest.
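To put a number on "more variable", you can compute the mean and standard deviation of the price within each region. The chunk below is only a sketch on made-up data: the data frame `toy_var` and its values are hypothetical and do not come from the `avocado` package.

```{r}
# Toy data (made-up values, not taken from the avocado package)
toy_var <- data.frame(
  region = rep(c("West", "SouthCentral"), each = 4),
  price  = c(0.90, 1.30, 1.10, 1.50, 0.80, 0.85, 0.90, 0.85)
)

# Mean and standard deviation of price within each region
region_stats <- aggregate(price ~ region, data = toy_var,
                          FUN = function(x) c(mean = mean(x), sd = sd(x)))
region_stats
```

The same call should work on `hass_region` once you replace `price` with one of its price columns, for example `avg_price_nonorg`.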

Now let's inspect the time dimension using `tapply` and `month`.

```{r}
tapply(hass_region$avg_price_nonorg , hass_region$month, mean)
```

It seems that the average price of avocados is somewhat higher from August through October. Indeed, https://www.wisegeek.com/when-is-avocado-season.htm states the following:

*The U.S. state of California is one of the world’s largest producers of avocados. Farmers there tend to grow the “Hass” variety, which matures quickly and produces a large, attractive fruit. The Hass is generally available beginning in late March and ending in early to mid-September, though of course a lot depends on weather patterns, storms, and basic growing conditions. Droughts and unusually cold weather tend to reduce crop availability, which can alter the start and end points of avocado season while also impacting price. Export demands may also place a strain on availability. Many California farms sell their goods to grocers around the world during the peak growing season.*

Apparently, on average, the price of avocados is higher when they are in season in California, which suggests that these Hass avocados might be more expensive than imported avocados. You can find how many Hass avocados are sold in the columns `X4046`, `X4225` and `X4770` of the Kaggle csv (run `names(hass_region)` to find their counterparts in the package data), which are the PLU codes for small, large and extra large Hass avocados.
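The seasonal pattern may not be the same in every region. As a sketch, `tapply` also accepts a list of grouping factors, which returns a month-by-region table of means. The data frame `toy_month` and its values are made up for illustration.

```{r}
# Toy data (made-up values, not taken from the avocado package)
toy_month <- data.frame(
  month  = c("01", "01", "08", "08", "01", "08"),
  region = c("West", "Northeast", "West", "Northeast", "West", "Northeast"),
  price  = c(1.00, 1.40, 1.30, 1.80, 1.10, 1.70)
)

# A list of two factors gives a table with months in rows and regions in columns
month_region <- tapply(toy_month$price,
                       list(toy_month$month, toy_month$region),
                       mean)
round(month_region, 2)
```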

## 3. Preliminary analyses
In preliminary analyses, the main goal is to get a feel for the data. How are the variables constructed, how do they relate to one another, and what is the structure of the dataset? You might have noted during your inspection of the variables that this is a so-called long file, which means that all years are placed below one another. As you will see towards the end of this assignment, this makes time analysis easy (when you include trend variables). For the first analyses, you do not have to analyze the effect of time, but you do need to control for time. If you do not, the effect of time will confound the effects of the other variables.

The following questions help you to get to know the dataset better. If you think of other questions, feel free to add those analyses below.

#### a. How does price differ per region?
#### b. Does the volume of avocados produced affect the price?
#### c. Are organic avocados more expensive than conventional ones?
#### d. Does the effect of type on price differ, given the volume produced?

For the assignment, you need to answer these questions using plots, hypothesis tests, and regression models. Again, feel free to add any other analysis that you may have found useful to understand the data.
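As one possible starting point for question c, the sketch below compares organic and conventional prices with a paired t-test (both prices are observed in the same week) and with a regression that controls for time. All data are simulated; `toy_type`, `toy_long` and the premium of 0.40 are assumptions made for this illustration only.

```{r}
set.seed(1)
n_weeks <- 30

# Simulated weekly prices (made-up numbers)
toy_type <- data.frame(
  week         = seq_len(n_weeks),
  price_nonorg = 1.20 + 0.005 * seq_len(n_weeks) + rnorm(n_weeks, sd = 0.05)
)
# Organic price = conventional price plus an assumed premium of about 0.40
toy_type$price_org <- toy_type$price_nonorg + 0.40 + rnorm(n_weeks, sd = 0.05)

# Paired t-test: the two prices belong to the same week
tt <- t.test(toy_type$price_org, toy_type$price_nonorg, paired = TRUE)
tt

# Regression on the stacked (long) data, controlling for time through week
toy_long <- data.frame(
  week  = rep(toy_type$week, 2),
  type  = rep(c("conventional", "organic"), each = n_weeks),
  price = c(toy_type$price_nonorg, toy_type$price_org)
)
fit_type <- lm(price ~ type + week, data = toy_long)
coef(fit_type)
```

The coefficient `typeorganic` estimates the price difference between the two types at a given week. Whether a paired test is appropriate for your own selection of the data is something you should argue in your report.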

## 4. Compare regions
#### a. Do these effects differ per region? 
Look into the different regions here: https://cran.r-project.org/web/packages/avocado/vignettes/a_intro.html

```{r}
summary(lm(avg_price_nonorg ~ year + region, data = hass_region))
```
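The model above lets the price level differ per region, but not the effects themselves. To test whether an effect differs per region, you can add an interaction term and compare the two models. The sketch below does this for a hypothetical `volume` variable on simulated data; `toy_reg`, `volume` and all numbers are made up.

```{r}
set.seed(2)
# Simulated data: volume (in millions, made up) for two regions
toy_reg <- expand.grid(volume = seq(1, 5, length.out = 40),
                       region = c("California", "Northeast"),
                       stringsAsFactors = FALSE)
# Assumption for this toy example: volume lowers the price more in the Northeast
slope <- ifelse(toy_reg$region == "Northeast", -0.10, -0.02)
toy_reg$price <- 1.8 + slope * toy_reg$volume + rnorm(nrow(toy_reg), sd = 0.03)

fit_main <- lm(price ~ volume + region, data = toy_reg)  # same volume effect in both regions
fit_int  <- lm(price ~ volume * region, data = toy_reg)  # volume effect may differ per region

# F-test: does a region-specific volume effect improve the fit?
cmp <- anova(fit_main, fit_int)
cmp
coef(fit_int)["volume:regionNortheast"]
```

The interaction coefficient is the difference in the volume effect between the Northeast and the reference region (here California, the first level alphabetically).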

## 5. Explore time
#### a. Do these effects differ per year?
#### b. Do the effects differ per month?
#### c. Did prices increase over time? (tip: make a trend variable)
#### d. To what extent was this a linear effect? NOTE: explain a bit what is needed here...
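
One way to approach questions c and d is to number the weeks with a trend variable and then compare a linear trend model with a polynomial one. The chunk below is a sketch on simulated data, not a prescribed solution; `toy_trend`, the dates and the price pattern are all made up.

```{r}
set.seed(3)
# 52 simulated weeks with two observations each (for example two regions)
toy_weeks <- seq(as.Date("2017-01-01"), by = "week", length.out = 52)
toy_trend <- data.frame(week_ending = rep(toy_weeks, each = 2))

# Trend variable: 1 for the first week, 2 for the second, and so on
toy_trend$trend <- match(toy_trend$week_ending, sort(unique(toy_trend$week_ending)))

# Made-up prices that rise at first and then level off (a curved pattern)
toy_trend$price <- 1 + 0.03 * toy_trend$trend - 0.0004 * toy_trend$trend^2 +
  rnorm(nrow(toy_trend), sd = 0.05)

fit_lin  <- lm(price ~ trend, data = toy_trend)           # linear trend only
fit_poly <- lm(price ~ poly(trend, 3), data = toy_trend)  # linear + quadratic + cubic

# F-test: do the higher-order terms add anything beyond the linear trend?
lin_vs_poly <- anova(fit_lin, fit_poly)
lin_vs_poly
```

A small p-value in this comparison suggests that the effect of time is not purely linear.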

## 6. Predict

It would be nice to see how well your explanatory model can predict prices. To do this, you need the coefficients of your model, and you need to extrapolate what the price would be given these coefficients. 
I present an example below. First I reorder the data, so that the first row contains the first time point and the last row the last time point measured. Additionally, I create a trend variable that can be used to model an increase or decrease in price over time.

```{r}
hass_region<-hass_region[order(hass_region$year, hass_region$month, hass_region$day),]

# trend numbers the weeks: 1 for the first week_ending, 2 for the second, and so on
hass_region$trend <- match(hass_region$week_ending, sort(unique(hass_region$week_ending)))
```

Using linear regression I can estimate the effect of trend and region on the price. As I will use the coefficients to predict the last time point measured, I exclude this time point from the analysis. The coefficients are later used to predict it. As you can see below, I overpredict the price of both organic and conventional avocados.

```{r}
# observed
hass_region$date <- as.character(hass_region$week_ending)
hass_region[hass_region$date=="2018-08-05" ,c("avg_price_nonorg","avg_price_org","region","trend")]
```

```{r}
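# predicted
# Hold out the last observed week and predict it for California.
# predict() applies the intercept, trend and region coefficients itself;
# the region coefficient is named 'regionCalifornia' (or is absent when
# California is the reference level), so indexing with 'California' gives NA.
train  <- hass_region[hass_region$date != "2018-08-05", ]
newdat <- hass_region[hass_region$date == "2018-08-05" & hass_region$region == "California", ]
# conventional
fit <- lm(avg_price_nonorg ~ trend + region, train)
predict(fit, newdata = newdat)
# organic
fit <- lm(avg_price_org ~ trend + region, train)
predict(fit, newdata = newdat)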

# inspect two regions
glm_region <- glm(I(as.numeric(factor(region))-1) ~ small_nonorg_bag + avg_price_nonorg, hass_region[hass_region$region == "Northeast" | hass_region$region == "California",], family=binomial)
summary(glm_region)

# create data for decision boundary
summary(hass_region$small_nonorg_bag[hass_region$region == "Northeast" ])
plot_x <- c(1300000,1700000) # pick values based on x value in region of boundary decision
plot_y <- (-1 /coef(glm_region)[3]) * (coef(glm_region)[2] * plot_x + coef(glm_region)[1])
db.data <- data.frame(cbind(plot_x, plot_y))
colnames(db.data) <- c('x','y')
db.data

library(ggplot2)
ggplot(hass_region[hass_region$region == "Northeast" | hass_region$region == "California",], aes(x=small_nonorg_bag,y=avg_price_nonorg, color=region))+ 
  labs(x="small_nonorg_bag", y="avg_price_nonorg", title="Average conventional avocado prices")+
  geom_point(size = 4, shape = 19, alpha=0.8)+
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))  +
  geom_line(data=db.data,aes(x=x, y=y), col = "black", size = 1)  # decision boundary

```
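
Since the quality of your prediction is judged by the difference between the predicted and the observed value in the last time period, it helps to compute that difference explicitly. The chunk below is a minimal sketch on simulated data; `toy_pred` and its values are made up, and the root mean squared error is just one possible summary.

```{r}
set.seed(4)
# Simulated prices for 20 weeks in two regions (made-up numbers)
toy_pred <- expand.grid(trend = 1:20, region = c("California", "Northeast"),
                        stringsAsFactors = FALSE)
toy_pred$price <- 1.1 + 0.01 * toy_pred$trend +
  0.4 * (toy_pred$region == "Northeast") + rnorm(nrow(toy_pred), sd = 0.02)

# Hold out the last week and fit the model on all earlier weeks
last_week <- max(toy_pred$trend)
train_toy <- toy_pred[toy_pred$trend <  last_week, ]
test_toy  <- toy_pred[toy_pred$trend == last_week, ]
fit_toy   <- lm(price ~ trend + region, data = train_toy)

# Predicted minus observed: a positive error means overprediction
test_toy$predicted <- predict(fit_toy, newdata = test_toy)
test_toy$error     <- test_toy$predicted - test_toy$price
test_toy[, c("region", "price", "predicted", "error")]

# Root mean squared error over the held-out rows
rmse <- sqrt(mean(test_toy$error^2))
rmse
```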

#### b. Explore how trend differs across regions.

```{r}
fit<-lm(avg_price_org~trend+ I(trend^2) + I(trend^3), hass_region)
# pass the data frame to ggplot() and refer to columns by name in aes() and facet_wrap()
ggplot(hass_region, aes(y=avg_price_org, x=trend, colour=region))+
  geom_point()+
  geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3))+
  facet_wrap(~region)
```


--------------------------------------------------------------------------------
## 7. Classification

You can use PCA or cluster analysis to classify the regions based on multiple variables. Select only one year to avoid correlated observations over time. You could start with the variables you've chosen to study above, and include them in the PCA. The components that are obtained from this PCA can be used to create classes (i.e. groups of states/regions that have similar values). Interpret the components, and argue why you have chosen a specific classification.
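
As a rough sketch of this workflow, the chunk below runs a PCA with `prcomp` on standardised variables and then groups the units with hierarchical clustering on the first two component scores. The data frame `toy_pca`, its variables (`avg_price`, `volume`, `share_org`) and the unit labels A to F are all made up; the choice of two clusters is an assumption for this toy example, not a recommendation.

```{r}
# Toy summaries per unit for a single year (all values are made up)
toy_pca <- data.frame(
  avg_price = c(1.10, 1.15, 1.60, 1.65, 1.05, 1.58),
  volume    = c(5.0, 4.8, 2.1, 2.0, 5.2, 2.3),
  share_org = c(0.03, 0.04, 0.08, 0.09, 0.03, 0.07),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# PCA on standardised variables
pca <- prcomp(toy_pca, scale. = TRUE)
summary(pca)    # proportion of variance per component
pca$rotation    # loadings, used to interpret the components

# Hierarchical clustering on the first two component scores, cut into 2 groups
clusters <- cutree(hclust(dist(pca$x[, 1:2])), k = 2)
clusters
```

With real data you would inspect the dendrogram (`plot(hclust(...))`) and the explained variance before settling on the number of components and classes.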


## The end  --- Good Luck!
![](avocadocopocalypse.jpg)