Introduction to R and RStudio

2019-05-27 12:17:24

Announcement

Assignment

Evaluate whether each rounded measurement is an odd or an even number.
Count the number of even values using the results above.
Submit the number to the following.

Example

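iris.tidy %>% 
  # Remainder after dividing the rounded value by 2: 1 = odd, 0 = even
  mutate(round.measurement = 
           round(measurement, digits = 0) %% 2) %>% 
  # This example keeps the odd values; the assignment asks for the even ones
  dplyr::filter(round.measurement == 1) %>% 
  nrow()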

## [1] 280
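
When counting odd and even values, it helps to know how round() treats exact halves such as 4.5 or 5.5, which occur in the iris measurements. The following is a minimal sketch in base R; the vector `halves` is a made-up illustration, not part of the assignment.

```r
# round() follows the IEC 60559 rule: an exact half goes to the nearest EVEN integer
halves <- c(0.5, 1.5, 2.5, 3.5)   # hypothetical example values
rounded <- round(halves, digits = 0)
print(rounded)

# Remainder after division by 2: 0 = even, 1 = odd
parity <- rounded %% 2
print(parity)
```

All four halves end up even here, so values ending in .5 always add to the even count.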

Density plot

We will make one today.

Density plot?

A figure (graph) showing the shape of a distribution's density
Differences from a histogram
- histogram: for both continuous and discrete numbers
- density plot: only for continuous numbers
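
The difference can be sketched numerically in base R. This is an illustration only: it uses Sepal.Length from the built-in iris data, and the trapezoidal-rule check is an added assumption, not part of the lecture.

```r
x <- iris$Sepal.Length

# Histogram: counts per bin (plot = FALSE returns the numbers without drawing)
h <- hist(x, plot = FALSE)
print(sum(h$counts))   # every observation falls into exactly one bin

# Kernel density estimate: a smooth curve evaluated on a grid of x values
d <- density(x)

# Trapezoidal rule: the area under a density curve should be close to 1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)
```

The histogram reports counts, while the density curve is scaled so that its area is about 1, which is why the y axis of a density plot is not a count.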

ggplot2 package

A comprehensive package for drawing figures such as graphs and maps
Much more flexible and better suited to academic publication than the base::plot() function

Procedure

Make plot area and fix type of graph
Arrange attributes, theme, and others
Assign the figure to an object
Save the figure

iris data

A world-famous data set
- The data set includes 5 attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
- We use the data set frequently for practice such as drawing figures, clustering, ANOVA, and others.
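
The code on these slides uses an object called iris.tidy with the columns Species, attribute, and measurement. Its construction is not shown here, so the following is only one possible sketch, assuming tidyr 1.0.0 or later (for pivot_longer()); older versions would use tidyr::gather() instead.

```r
library(tidyr)  # assumption: tidyr >= 1.0.0 is installed

# Reshape iris from wide format (one column per attribute) to long (tidy) format
iris.tidy <- pivot_longer(
  iris,
  cols = -Species,            # every column except Species
  names_to = "attribute",     # former column names, e.g. "Sepal.Length"
  values_to = "measurement"   # the measured values
)

head(iris.tidy)
```

Each of the 150 flowers contributes four rows, one per attribute.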

Make plot area and fix type of graph

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

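iris.tidy %>% 
  ggplot2::ggplot(aes(x = measurement,
                      # after_stat() needs ggplot2 >= 3.3.0; older versions use ..density..
                      y = after_stat(density),
                      fill = Species)
  ) +
  # Semi-transparent, overlapping density curves, one per species
  geom_density(alpha = 0.5, 
               colour = "black", 
               position = "identity"
  )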

Minimum requirement?

The figure should be improved

The figure should be plotted by both attribute and species
Labels of axes should be corrected.
The legend should be moved inside the plot area.
Background color and grid are not necessary.

Arrange arguments and theme

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

iris.tidy.density <- 
  iris.tidy %>% ggplot2::ggplot(aes(x = measurement,
                                    # after_stat() needs ggplot2 >= 3.3.0
                                    y = after_stat(density),
                                    fill = Species
                                    )
                                ) +
  geom_density(alpha = 0.5, colour = "black", position = "identity"
  ) +
  # One panel per attribute; the full argument name is "scales"
  facet_wrap( ~ attribute, scales = "free_y") +
  labs(x = "Length", y = "Density")

Arrange (Continued)

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

iris.tidy.density <- 
  iris.tidy.density +
  scale_fill_hue(name = "Species", 
                 labels = c("setosa" = "St.", 
                            "versicolor" = "Vc.",
                            "virginica" = "Ve."
                            )
  ) +
  theme_classic()

To save the figure,

Assign the figure to an object
Save it using the ggplot2::ggsave() function.
- For details of the function, check its help page.
- ?ggplot2::ggsave

Save the figure in .pdf format

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

Open your project folder and find the saved file.
Adjust the arguments as you like and compare the results.

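ggplot2::ggsave(filename =  "iris.density.pdf", 
                plot = iris.tidy.density,
                # units applies to width and height; none are given here,
                # so ggsave() uses the size of the current graphics device
                units = "mm", 
                dpi = 300
                )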
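
To make the saved size reproducible, width and height can be given explicitly. The sketch below is an illustration under assumptions: the 160 x 120 mm size, the stand-in plot `p`, and the temporary output path are all hypothetical choices, not values from the lecture.

```r
library(ggplot2)

# A small stand-in plot so that the sketch runs on its own
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)

# Hypothetical output path in a temporary folder
out.file <- file.path(tempdir(), "iris.density.sketch.pdf")

# width and height are interpreted in the given units (here millimetres)
ggsave(filename = out.file, plot = p, width = 160, height = 120, units = "mm")

file.exists(out.file)
```

With an explicit size, the output no longer depends on how large the plot window happens to be.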

Assignment

Using the wine data set, make a density plot that meets the requirements below.
1. Transform the data set into tidy format.
2. Change the fill color by Type.
3. Save it in .pdf format and submit it.
To use the data set, install the rattle.data package.

Sample

Descriptive statistics

Statistics?

A discipline / subject: statistics as a field of study
An indicator expressing the characteristics of a distribution: a statistic

Why are statistics necessary?

To estimate a population’s distribution from a sample distribution
Statistics: parameters describing the characteristics of a distribution
- central tendency
- variance
- others

Mean (arithmetic mean; \(\bar{x}\))

The most frequently used statistic indicating central tendency
- \(\mu\): population mean
Equivalent to the center of mass (centroid) in physics

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i} \]

where \(n\): sample size (number of observations); \(i\): index; \(x_{i}\): the \(i\)-th observation (sample / data)
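
The formula can be checked against R's built-in function. This is a minimal sketch using Sepal.Length from the iris data; the name `mean.by.hand` is introduced only for this illustration.

```r
x <- iris$Sepal.Length
n <- length(x)            # sample size

# Sum of all observations divided by the sample size
mean.by.hand <- sum(x) / n

# Should agree with the built-in mean()
all.equal(mean.by.hand, mean(x))
```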

\(\bar{x}\) as an estimator

The sample mean \(\bar{x}\) is not always the same as the population mean \(\mu\).
The value of \(\bar{x}\) drifts from sample to sample, and it drifts less as \(n\) becomes larger.
We need to consider the following conditions.
- Variance of data
- \(n\)

Standard deviation (\(\sigma\))

A statistic indicating the variability of observations / data.
- Square root of the variance \(\sigma^{2}\)
- The mean is the point that minimizes the sum of squared deviations.

\[ \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}} \]

Standard error (SE)

Depending on the sample size (\(n\)), the sample mean (\(\bar{x}\)) drifts.

Drifting?

Blue: 100 observations from \(N(0, 1^{2})\)
Grey: 10,000 observations from \(N(0, 1^{2})\)
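
The drift can also be explored by simulation. The sketch below is not the code behind the figure; it is an assumed illustration that repeatedly draws samples of size 100 and 10,000 from \(N(0, 1^{2})\) and compares how much their means vary. The seed, the number of repetitions, and the helper `sample.means()` are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so that the run is reproducible

# Draw `reps` samples of size n from N(0, 1) and return the mean of each sample
sample.means <- function(n, reps = 200) {
  replicate(reps, mean(rnorm(n, mean = 0, sd = 1)))
}

means.100 <- sample.means(100)
means.10000 <- sample.means(10000)

# Spread of the sample means around the population mean 0
sd(means.100)
sd(means.10000)
```

The means of the larger samples should cluster much more tightly around 0.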

SE (Continued)

To account for the influence of \(n\) on the mean, we need to consider the \(SE\).
The statistic suggests that the mean becomes more precise as \(n\) increases.

\[ SE = \frac{sd}{\sqrt{n}} \]
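
The SD and SE can be computed step by step as a check. One point to be aware of: R's sd() divides by \(n - 1\) (the sample SD), whereas a definition with the divisor \(n\) gives a slightly smaller value. The sketch below uses Sepal.Length from iris; the names `sd.n`, `sd.n1`, and `se` are introduced only for this illustration.

```r
x <- iris$Sepal.Length
n <- length(x)

# SD with the divisor n: square root of the mean squared deviation
sd.n <- sqrt(sum((x - mean(x))^2) / n)

# sd() uses the divisor n - 1
sd.n1 <- sd(x)

# The two differ only by the factor sqrt((n - 1) / n)
all.equal(sd.n, sd.n1 * sqrt((n - 1) / n))

# Standard error of the mean, using sd() as in the formula above
se <- sd.n1 / sqrt(n)
se
```

For large \(n\) the two SD versions are nearly identical, so the choice matters mainly for small samples.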

In R…

Mean: mean()
SD: sd()
Example: Compute the mean and SD of Sepal.Length of iris data
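
A minimal sketch of the example, using the built-in iris data frame directly. The object names `m` and `s` are introduced only for this illustration.

```r
# Mean and SD of Sepal.Length over all 150 flowers
m <- mean(iris$Sepal.Length)
s <- sd(iris$Sepal.Length)

m   # 5.843333
s   # 0.8280661
```

These values pool all three species together.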

Is it appropriate?

Other statistics should be considered.
- min()
- max()
- median(): robust against outliers
Stratification (e.g. by species and attribute)
Making a summary table is essential.
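
Why the median is called robust can be seen with a tiny made-up example. The vectors below are hypothetical and only illustrate the point.

```r
x <- c(1, 2, 3, 4, 5)
x.outlier <- c(1, 2, 3, 4, 100)   # the last value is replaced by an outlier

mean(x)            # 3
median(x)          # 3

mean(x.outlier)    # 22: pulled strongly towards the outlier
median(x.outlier)  # 3: unchanged
```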

Compute the descriptive statistics

iris.tidy %>% 
  dplyr::group_by(Species, attribute) %>%
  dplyr::summarise(n = n(),
                   Min. = min(measurement),
                   Max. = max(measurement),
                   Mean = mean(measurement),
                   Median = median(measurement),
                   SD = sd(measurement),
                   SE = sd(measurement) / sqrt(n())
  )

Results

Species	attribute	n	Min.	Max.	Mean	Median	SD	SE
setosa	Petal.Length	50	1.0	1.9	1.462	1.50	0.1736640	0.0245598
setosa	Petal.Width	50	0.1	0.6	0.246	0.20	0.1053856	0.0149038
setosa	Sepal.Length	50	4.3	5.8	5.006	5.00	0.3524897	0.0498496
setosa	Sepal.Width	50	2.3	4.4	3.428	3.40	0.3790644	0.0536078
versicolor	Petal.Length	50	3.0	5.1	4.260	4.35	0.4699110	0.0664554
versicolor	Petal.Width	50	1.0	1.8	1.326	1.30	0.1977527	0.0279665
versicolor	Sepal.Length	50	4.9	7.0	5.936	5.90	0.5161711	0.0729976
versicolor	Sepal.Width	50	2.0	3.4	2.770	2.80	0.3137983	0.0443778
virginica	Petal.Length	50	4.5	6.9	5.552	5.55	0.5518947	0.0780497
virginica	Petal.Width	50	1.4	2.5	2.026	2.00	0.2746501	0.0388414
virginica	Sepal.Length	50	4.9	7.9	6.588	6.50	0.6358796	0.0899270
virginica	Sepal.Width	50	2.2	3.8	2.974	3.00	0.3224966	0.0456079
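
A single row of the table can be cross-checked in base R. The sketch below recomputes the setosa / Petal.Length row directly from iris; the name `row.check` is introduced only for this illustration.

```r
# Petal.Length of the 50 setosa flowers
x <- iris$Petal.Length[iris$Species == "setosa"]

row.check <- c(
  n      = length(x),
  Min.   = min(x),
  Max.   = max(x),
  Mean   = mean(x),
  Median = median(x),
  SD     = sd(x),
  SE     = sd(x) / sqrt(length(x))
)

round(row.check, 7)
```

The result should match the first row of the table.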

Assignment

Using the wine data set, make a summary table by type and attribute.

Enjoy!