R studio - Exploratory Data Analysis - Assessment Answer

November 20, 2018
Author : Sara Lanning

Solution Code: 1GJB

Question: R studio


Case Scenario/ Task

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental and health issues.

Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. It is expected that most important events in the city could be detected by monitoring these data.

In the final project, you will be analyzing the two-year historical log corresponding to years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA. The data set contains records of 17379 hourly counts of rentals. It was originally compiled by Fanaee and Gama in “Event labeling combining ensemble detectors and background knowledge” (2013).

You will download the data set bikeshares.csv. The data set contains

- instant: record index

- dteday : date

- season : season (1: spring, 2: summer, 3: fall, 4: winter)

- yr : year (0: 2011, 1:2012)

- mnth : month ( 1 to 12)

- hr : hour (0 to 23)

- holiday : whether the day is a holiday or not (extracted from https://dchr.dc.gov/page/holidayschedule)

- weekday : day of the week

- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.

- weathersit : weather condition

- 1: Clear, Few clouds, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

- temp : Normalized temperature in Celsius. The values are divided by 41 (max)

- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

- hum: Normalized humidity. The values are divided by 100 (max)

- windspeed: Normalized wind speed. The values are divided by 67 (max)

- casual: count of casual users

- registered: count of registered users

- cnt: count of total rental bikes including both casual and registered

This analysis is intentionally open ended. While you explore the data, recall the tools you have learned in class.


Solution: R studio

ABSTRACT

Bike-sharing systems allow people to rent a bicycle at one of many automatic rental stations scattered across a city, use it for a short journey and return it at any other station in the city. The aim of the study was to investigate factors that affect bike rentals in Capital Bikeshare. Exploratory data analysis (EDA) and Poisson and Quasi-Poisson models were used to analyze the data. The data was explored using frequency tables and graphical representations. To handle overdispersion in the dataset, Quasi-Poisson and Negative Binomial models were fitted. Bike rentals are affected by weather conditions, time of day, month of the year and season, among other factors.

Key words: Poisson model, Quasi-Poisson model, Overdispersion

1 INTRODUCTION

Public bike sharing (PBS) systems are spreading across the globe and have been gaining popularity in transportation plans as a strategy to multiply travel choices, promote the use of active modes of transport, decrease dependence on automobiles and, especially, reduce greenhouse gas emissions (Contardo et al., 2012). Bike-sharing systems allow people to rent a bicycle at one of many automatic rental stations scattered across the city, use it for a short journey and return it at any other station in the city (Raviv et al., 2011).

The Institute for Transportation & Development Policy (ITDP) has stated that more than 600 cities around the globe have their own bike-share systems, and more programs are starting every year. The largest systems are in China, in cities such as Hangzhou and Shanghai. In Paris, London, and Washington, D.C., highly successful systems have helped to promote cycling as a viable and valued transport option.

In this study, we analyzed the two-year historical log corresponding to years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA. Capital Bikeshare is the largest bike sharing program in the United States. The aim of the study was to investigate factors that affect bike rentals in Capital Bikeshare. An exploratory data analysis (EDA) was carried out to gain insight into and a better understanding of the dataset, using frequency tables and graphical representations.

1.1 Data description

The data set contains records of 17379 hourly counts of rentals. It was originally compiled by Fanaee and Gama in “Event labeling combining ensemble detectors and background knowledge” (2013). The variables used are described as follows:

- instant: record index

- dteday : date

- season : season (1: spring, 2: summer, 3: fall, 4: winter)

- yr : year (0: 2011, 1: 2012)

- mnth : month (1 to 12)

- hr : hour (0 to 23)

- holiday : whether the day is a holiday or not (extracted from https://dchr.dc.gov/page/holidayschedule)

- weekday : day of the week

- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.

- hum: Normalized humidity. The values are divided by 100 (max)

- windspeed: Normalized wind speed. The values are divided by 67 (max)

- weathersit : weather condition

- 1: Clear, Few clouds, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

- temp : Normalized temperature in Celsius. The values are divided by 41 (max)

- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

- casual: count of casual users

- registered: count of registered users

- cnt: count of total rental bikes including both casual and registered

The response variable was cnt, the count of total rental bikes including both casual and registered users.


1.2 METHODOLOGY

A Poisson distribution was assumed in this study because the response variable is a count. The Poisson distribution is used for counts of events that occur randomly over time or space, when outcomes in disjoint periods or regions are independent. The mean and variance of the Poisson distribution are equal (Agresti, 2012). The Poisson distribution has a positive mean µ, so it is more common to model the log of the mean, which can take any real value. A Poisson loglinear model with explanatory variables is given by

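$$\text{cnt}_i \sim \text{Poisson}(\mu_i)$$

$$\log(\mu_i) = \text{mnth}_{ik} + \text{season}_{ij} + \text{holiday}_i + \text{hr}_{ir} + \text{workingday}_i + \text{hum}_i + \text{windspeed}_i + \text{weathersit}_{im} + \text{temp}_i + \text{atemp}_i$$

where k = 1, ..., 12; j = 1, ..., 4; r = 0, ..., 23; m = 1, ..., 4; i = 1, ..., 17379.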

For count data, the Poisson assumption is often unrealistic because of overdispersion: the variance exceeds the mean (Agresti, 2012). One can use Quasi-Poisson or Negative Binomial models to handle overdispersion. The Quasi-Poisson model additionally estimates a scale parameter and uses it to correct the estimated standard errors. It relies on quasi-likelihood estimation, which assumes only a mean-variance relationship rather than a specific distribution for the response variable. The Negative Binomial model contains an extra dispersion parameter, which is the parameter of a multiplicative random effect, and it still permits µ to depend on explanatory variables (Agresti, 2012). All the models were fitted and compared using AIC.
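The mean-variance assumptions that distinguish the three models can be written side by side. In the sketch below, $\phi$ denotes the Quasi-Poisson scale parameter and $\theta$ the Negative Binomial dispersion parameter. Both symbols are notation assumed here (the NB2 parameterization, as used for example by `glm.nb` in R's MASS package), not notation confirmed by the original text.

```latex
\begin{aligned}
\text{Poisson:} \quad & \operatorname{Var}(Y_i) = \mu_i \\
\text{Quasi-Poisson:} \quad & \operatorname{Var}(Y_i) = \phi\,\mu_i \\
\text{Negative binomial:} \quad & \operatorname{Var}(Y_i) = \mu_i + \mu_i^{2}/\theta
\end{aligned}
```

Under this notation, overdispersion corresponds to $\phi > 1$ in the Quasi-Poisson model, and the Negative Binomial model approaches the Poisson model as $\theta \to \infty$.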

R software version 3.1.1 was used to analyze the data.

2 RESULTS AND DISCUSSION

2.1 Exploratory Data Analysis

The mean and variance of the response variable (cnt) were not equal: they were 189.4631 and 32901.46, respectively. This implies overdispersion, i.e. greater variability in the count data than a Poisson model allows. To see how the response variable is distributed, a histogram of cnt was plotted. We observe a strongly positively skewed distribution, with the largest observed value equal to 977.

Figure 1: Response variable
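The overdispersion check described above amounts to comparing the sample variance with the sample mean. The following is a minimal Python sketch (the study itself used R). The function name `dispersion_ratio` is illustrative, and the two constants are the mean and variance reported in the text rather than values recomputed from bikeshares.csv.

```python
import statistics


def dispersion_ratio(counts):
    """Return sample variance divided by sample mean for a list of counts.

    Under a Poisson model this ratio should be close to 1;
    values far above 1 suggest overdispersion.
    """
    return statistics.variance(counts) / statistics.mean(counts)


# Mean and variance of cnt as reported in the text (not recomputed here).
REPORTED_MEAN = 189.4631
REPORTED_VARIANCE = 32901.46

reported_ratio = REPORTED_VARIANCE / REPORTED_MEAN
print(f"variance / mean = {reported_ratio:.1f}")
```

With the full data loaded, `dispersion_ratio` could be applied directly to the cnt column. A ratio this far above 1 is what motivates the Quasi-Poisson and Negative Binomial models.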

From Table 1, it can be observed that there were more bike rentals on working days than on non-working days. There were many bike rentals during summer (season 2) and fall (season 3) and fewer during winter. The table also shows that under Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog (weather condition 4), only 3 hourly observations were recorded over the two years, so very few bike rentals took place under this weather condition. There is also a decreasing trend in bike rentals as weather conditions grow worse.
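A frequency table of this kind counts records per category, and it is often useful to report the mean count alongside it. The sketch below illustrates the idea in Python on a handful of made-up rows. The values in `toy_rows` and the helper name `summarize_by_level` are hypothetical and are not taken from bikeshares.csv or from Table 1.

```python
from collections import defaultdict

# Toy (weathersit, cnt) rows; illustrative values only, not from bikeshares.csv.
toy_rows = [(1, 120), (1, 80), (2, 60), (3, 20), (1, 100), (2, 40)]


def summarize_by_level(rows):
    """Return {level: (number of records, mean count)} for (level, count) pairs."""
    totals = defaultdict(lambda: [0, 0])  # level -> [n_records, sum_of_counts]
    for level, count in rows:
        totals[level][0] += 1
        totals[level][1] += count
    return {level: (n, total / n) for level, (n, total) in sorted(totals.items())}


for level, (n, mean_cnt) in summarize_by_level(toy_rows).items():
    print(f"weathersit={level}: records={n}, mean cnt={mean_cnt:.1f}")
```

Keeping the number of records separate from the mean count makes it clear whether a category is rare, has low rentals per hour, or both.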

The data was also explored using scatter plots and box plots. Figure 2 shows scatter plots of the number of bike rentals against atemp, temp and wind speed, respectively. A random sample of observations was selected to make these plots. It can be observed that more people tend to rent bikes when it is warmer and the wind is calm.

Figure 2: Weather condition variables

On average, there were also more bike rentals in 2012 than in 2011. The box plots in Figure 3 show the times at which bikes were rented. It can be seen clearly that, on average, each day had a peak at around 8 a.m. and another in the evening at around 5 p.m. to 6 p.m. On average in each year, bike rentals decrease from December to February (winter season) and increase from March.

Figure 3: Time variables

2.2 Poisson regression

Dummy variables were created for all categorical explanatory variables. In the Poisson regression model, all variables included were significant. With all other variables held constant, the average number of rented bikes in February (mnth 2) was about 1.089 (exp(0.0855)) times that in January (mnth 1, the reference category), an increase of about 9%. In addition, the average number of rented bikes at 8 a.m. was about 6.8093 (exp(1.9183)) times that at 12 midnight.
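Because the model uses a log link, each coefficient is interpreted by exponentiating it. The short Python sketch below reproduces that arithmetic for the two coefficients quoted above. The helper names `rate_ratio` and `percent_change` are illustrative, and the coefficients are copied from the text rather than re-estimated.

```python
import math


def rate_ratio(coefficient):
    """Convert a log-link coefficient into a multiplicative effect on the mean."""
    return math.exp(coefficient)


def percent_change(coefficient):
    """Percentage change in the expected count relative to the reference level."""
    return (math.exp(coefficient) - 1.0) * 100.0


# Coefficients quoted in the text.
coef_february = 0.0855  # mnth 2 versus January
coef_8am = 1.9183       # hr 8 versus midnight (hr 0)

print(f"February: x{rate_ratio(coef_february):.3f} ({percent_change(coef_february):+.1f}%)")
print(f"8 a.m.:   x{rate_ratio(coef_8am):.3f}")
```

The same conversion applies to every dummy-variable coefficient in the model, always relative to that variable's reference category.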

The deviance (744445.1) and degrees of freedom (17378) of the Poisson regression model suggested that the model did not fit the data well. This was in line with the overdispersion observed in the EDA. To handle overdispersion, Quasi-Poisson and Negative Binomial models were fitted. Table 2 gives the parameter estimates of the Poisson/Quasi-Poisson regression and Negative Binomial models. The standard errors of the Quasi-Poisson regression model are larger than those of the Poisson regression model. The dispersion parameter of the Quasi-Poisson regression model was 43.4224, which implies that there is indeed overdispersion in the data set and that the Poisson model underestimated the standard errors. The AIC of the Negative Binomial model (191,283) was smaller than that of the Poisson regression model (855,440). The deviance of the Negative Binomial model was also much smaller than that of the Poisson regression model. This implies that the Negative Binomial model improves on the Poisson model and deals with overdispersion quite well. The estimated dispersion parameter for the Negative Binomial model was 3.1038, indicating that the Negative Binomial model is more appropriate than the Poisson regression model for this data set.
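The diagnostics quoted in this section can be checked with a few lines of arithmetic. The Python sketch below uses only the figures reported in the text, and the variable names are illustrative. It relies on the standard result that Quasi-Poisson standard errors equal the Poisson standard errors multiplied by the square root of the estimated dispersion.

```python
import math

# Values reported in the text.
POISSON_DEVIANCE = 744445.1
POISSON_DF = 17378
QUASI_DISPERSION = 43.4224
AIC_POISSON = 855440
AIC_NEG_BINOMIAL = 191283

# Rough dispersion estimate: deviance per degree of freedom (about 1 under Poisson).
deviance_ratio = POISSON_DEVIANCE / POISSON_DF

# Quasi-Poisson standard errors equal the Poisson ones times sqrt(dispersion).
se_inflation = math.sqrt(QUASI_DISPERSION)

# Lower AIC is preferred; the difference shows by how much.
aic_gain = AIC_POISSON - AIC_NEG_BINOMIAL

print(f"deviance / df          = {deviance_ratio:.2f}")
print(f"SE inflation factor    = {se_inflation:.2f}")
print(f"AIC(Poisson) - AIC(NB) = {aic_gain}")
```

The deviance-based ratio and the reported Quasi-Poisson dispersion differ slightly, which is expected if the latter was estimated from Pearson residuals, as R's `quasipoisson` family does.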

Table 2: Estimated coefficients of the Poisson/Quasi-Poisson regression and Negative Binomial models

3 CONCLUSION and RECOMMENDATION

The study aimed to find whether there are factors that affect bike rental patterns in the Capital Bikeshare system. From the EDA it was found that season, year, month, hour, holiday, working day and weather conditions affect bike rental patterns.

There was overdispersion in the dataset. This was observed both in the EDA and when fitting the Poisson regression model. To handle overdispersion, Quasi-Poisson and Negative Binomial models were fitted. It was found that the Negative Binomial model improved on the Poisson regression model. The standard errors in the Quasi-Poisson model were larger than in the Poisson regression model. This implies that there was overdispersion in the data set and that the Poisson model underestimated the standard errors. All variables were statistically significant, implying that the variables can be used to predict bike rental patterns.

In this study, it was found that there was heterogeneity among the variables used. Other methods that account for heterogeneity can be explored; this was not done due to limited time. For further studies, interactions between explanatory variables can be included in the analyses.
