2019-05-27 12:17:24

Announcement

Assignment

  1. Evaluate whether each rounded measurement is an odd number or an even number.
  2. Count the number of even numbers using the results above.
  3. Submit the number to the following.

Example

iris.tidy %>% 
  mutate(round.measurement = 
           round(measurement, digits = 0)%%2) %>% 
  dplyr::filter(round.measurement == 1) %>% 
  nrow()
## [1] 280

Density plot

We will make one today.

Density plot?

  • A figure (graph) showing the shape of a distribution's density
  • Difference from a histogram
    • histogram: for both continuous and discrete numbers
    • density plot: only for continuous numbers
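
As a rough illustration of the difference, the sketch below uses only base R (before turning to ggplot2) on Sepal.Length from the built-in iris data. The choice of variable and the default bin and bandwidth settings are assumptions made for this example.

```r
# Sepal.Length is a continuous measurement, so both views are possible.
x <- iris$Sepal.Length

# Histogram: counts the observations falling into each bin.
h <- hist(x, plot = FALSE)

# Kernel density estimate: a smooth curve with an area of about 1 under it.
d <- density(x)

# Draw the histogram on the density scale and overlay the density curve.
hist(x, freq = FALSE, main = "Sepal.Length", xlab = "Sepal.Length")
lines(d)
```

The histogram depends on the chosen bins, while the density curve depends on the chosen bandwidth.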

ggplot2 package

  • A comprehensive package for drawing figures such as graphs and maps
  • Much more flexible and better suited to academic publication than the base::plot() function

Procedure

  1. Make plot area and fix type of graph
  2. Arrange attributes, theme, and others
  3. Assign the figure to an object
  4. Save the figure

iris data

  • A world-famous data set
    • The data set includes 5 attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
    • We use the data set frequently for practice such as drawing figures, clustering, ANOVA, and others.
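
The code in this lecture refers to iris.tidy, which is not defined on these slides. Below is a minimal sketch of one way to build it. It assumes iris.tidy is a long-format copy of iris with the columns Species, attribute, and measurement (the names used in the code that follows), and that tidyr 1.0.0 or later is installed for pivot_longer().

```r
library(tidyr)

# Inspect the original (wide) data: 150 rows, 4 measurements plus Species.
str(iris)

# Reshape to long ("tidy") form: one row per flower and attribute.
# The column names "attribute" and "measurement" are assumed from later code.
iris.tidy <- pivot_longer(iris,
                          cols = -Species,
                          names_to = "attribute",
                          values_to = "measurement")

head(iris.tidy)
```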

Make plot area and fix type of graph

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

  iris.tidy %>% 
  ggplot2::ggplot(aes(x = measurement,
                      y = ..density..,
                      fill = Species)
  ) +
  geom_density(alpha = 0.5, 
               colour = "black", 
               position = "identity"
  )

Minimum requirement?

The figure should be improved

  • The figure should be plotted by both attribute and species.
  • The axis labels should be corrected.
  • The legend should be placed inside the plot area.
  • The background color and grid are not necessary.

Arrange arguments and theme

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

iris.tidy.density <- 
  iris.tidy %>% ggplot2::ggplot(aes(x = measurement,
                                    y = ..density..,
                                    fill = Species
                                    )
                                ) +
  geom_density(alpha = 0.5, colour = "black", position = "identity"
  ) +
  facet_wrap( ~ attribute, scales = "free_y") +
  labs(x = "Length", y = "Density")

Arrange (Continued)

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

iris.tidy.density <-   
iris.tidy.density +
  scale_fill_hue(name = "Species", 
                 labels = c("setosa" = "St.", 
                            "versicolor" = "Vc.",
                            "virginica" = "Ve."
                            )
  ) +
  theme_classic()
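
The code so far leaves the legend outside the plot area. Below is a minimal sketch of one way to move it inside, shown on a small stand-alone plot built from the built-in iris data rather than on iris.tidy.density. The coordinates (0.98, 0.98) are arbitrary, and the version check is an assumption about the installed ggplot2 (the legend.position.inside argument is available from ggplot2 3.5.0).

```r
library(ggplot2)

# Stand-alone example plot (not the lecture's iris.tidy.density object).
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5, colour = "black") +
  theme_classic()

# Coordinates are relative to the plot area: (0, 0) bottom-left, (1, 1) top-right.
legend.xy <- c(0.98, 0.98)

if (utils::packageVersion("ggplot2") >= "3.5.0") {
  inside <- theme(legend.position = "inside",
                  legend.position.inside = legend.xy,
                  legend.justification = c(1, 1))
} else {
  inside <- theme(legend.position = legend.xy,
                  legend.justification = c(1, 1))
}

# Add the legend settings after theme_classic();
# a complete theme added later would reset them.
p.inside <- p + inside
p.inside
```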

To save the figure,

  • Assign the figure to an object.
  • Save it using the ggplot2::ggsave() function.
    • For details of the function, check its help page.
    • ?ggplot2::ggsave

Save the figure with .pdf format

Please write the code, save it (Ctrl+s), and run (select->Ctrl+Enter).

  • Open your project folder and find the saved file.
  • Adjust the arguments as you like and compare the results.

ggplot2::ggsave(filename =  "iris.density.pdf", 
                plot = iris.tidy.density,
                units = "mm", 
                dpi = 300
                )
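
In ggsave(), units applies to width and height. When these are not supplied, the size of the current graphics device is used. Below is a sketch with an explicit size, using a small stand-alone plot. The 160 mm by 120 mm size and the file name are assumptions chosen for illustration.

```r
library(ggplot2)

# Stand-alone example plot (not the lecture's iris.tidy.density object).
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5, colour = "black") +
  theme_classic()

# Hypothetical file name; the file is written to the working directory.
out.file <- "iris.density.example.pdf"

# Width and height are given in the unit named by `units`.
ggsave(filename = out.file,
       plot = p,
       width = 160,
       height = 120,
       units = "mm")

# Confirm that the file was written.
file.exists(out.file)
```

Changing width and height and reopening the file is a quick way to see how text and margins scale with the figure size.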

Assignment

  • Using the wine data set, make a density plot that satisfies the constraints below.
    1. Transform the data set into tidy format.
    2. Set the fill color by Type.
    3. Save it in .pdf format and submit it.
  • To use the data set, install the rattle.data package.

Sample

Descriptive statistics

Statistics?

  1. A discipline / subject: statistics (統計学)
  2. An indicator expressing a characteristic of a distribution: a statistic (統計量)

Why are statistics necessary?

  • To estimate a population’s distribution from a sample distribution
  • Statistics: quantities describing characteristics of a distribution
    • central tendency
    • variance
    • others

Mean (arithmetic mean; \(\bar{x}\))

  • The most frequently used statistic indicating central tendency
    • \(\mu\): population mean
  • Equivalent to the centroid in physics

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i} \]

where \(n\): sample size (number of observations); \(i\): index; \(x_{i}\): the \(i\)-th sample value (observation / data)
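
As a small worked example, take the first five Sepal.Length values of the iris data (5.1, 4.9, 4.7, 4.6, 5.0), so that \(n = 5\). The choice of these five values is only for illustration.

\[ \bar{x} = \frac{1}{5}(5.1 + 4.9 + 4.7 + 4.6 + 5.0) = \frac{24.3}{5} = 4.86 \]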

\(\bar{x}\) as an estimator

  • The sample mean \(\bar{x}\) is not always the same as the population mean \(\mu\).
  • The \(\bar{x}\) drifts from sample to sample; the drift becomes smaller as \(n\) becomes larger.
  • We need to consider the following conditions.
    • Variance of data
    • \(n\)

Standard deviation (\(\sigma\))

  • A statistic indicating the dispersion of observations / data.
    • Square root of the variance \(\sigma^{2}\)
    • The mean is the point that minimizes the sum of squared deviations.

\[ \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}} \]

Standard error (SE)

  • Depending on the sample size (\(n\)), the sample mean (\(\bar{x}\)) drifts.

Drifting?

  • Blue: 100 observations from \(N(0, 1^{2})\)
  • Grey: 10,000 observations from \(N(0, 1^{2})\)
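
The comparison described above can be approximated with a short simulation. This is a sketch in base R, not the code behind the original figure. The seed and the use of a running mean are assumptions made for illustration.

```r
set.seed(20190527)  # arbitrary seed so the example is reproducible

# Draw observations from N(0, 1^2) for a small and a large sample.
small <- rnorm(100, mean = 0, sd = 1)
large <- rnorm(10000, mean = 0, sd = 1)

# Running mean: the mean of the first k observations, for k = 1, 2, ..., n.
running.small <- cumsum(small) / seq_along(small)
running.large <- cumsum(large) / seq_along(large)

# Plot both running means on a log-scaled x axis; the true mean is 0.
plot(seq_along(running.large), running.large, type = "l", col = "grey",
     log = "x", xlab = "Number of observations", ylab = "Running mean",
     ylim = range(running.small, running.large))
lines(seq_along(running.small), running.small, col = "blue")
abline(h = 0, lty = 2)
```

With few observations the running mean typically wanders, and it settles closer to 0 as more observations are added.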

SE (Continued)

  • To account for the influence of \(n\) on the mean, we need to consider the \(SE\).
  • The statistic shows that the mean becomes more precise as \(n\) increases.

\[ SE = \frac{sd}{\sqrt{n}} \]
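
As a numeric check of the SD and SE definitions, the sketch below uses a small made-up vector. Note that R's sd() divides by \(n - 1\) rather than \(n\), so it differs slightly from an SD computed with divisor \(n\). The SE here is computed from sd(), as in the summary code later in this lecture.

```r
# A small made-up sample, chosen so that the numbers come out round.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

x.bar <- sum(x) / n                    # mean: 5

# SD with divisor n.
sd.n <- sqrt(sum((x - x.bar)^2) / n)   # 2

# SD as computed by sd(), which uses divisor n - 1.
sd.n1 <- sd(x)                         # about 2.138

# Standard error of the mean.
se <- sd.n1 / sqrt(n)                  # about 0.756

c(mean = x.bar, sd.n = sd.n, sd.n1 = sd.n1, se = se)
```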

In R…

  • Mean: mean()
  • SD: sd()
  • Example: Compute the mean and SD of Sepal.Length of iris data
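
A minimal sketch of this example, using the built-in iris data frame directly (all 150 flowers pooled, without grouping by species):

```r
# Mean and SD of Sepal.Length across all 150 flowers.
sepal.length.mean <- mean(iris$Sepal.Length)
sepal.length.sd <- sd(iris$Sepal.Length)

sepal.length.mean  # about 5.843
sepal.length.sd    # about 0.828
```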

Is it appropriate?

  • Other statistics should be considered.
    • min()
    • max()
    • median(): robust against outliers
  • Stratification
  • Making a summary table is essential.

Compute the descriptive statistics

iris.tidy %>% 
  dplyr::group_by(Species, attribute) %>%
  dplyr::summarise(n = n(),
                   Min. = min(measurement),
                   Max. = max(measurement),
                   Mean = mean(measurement),
                   Median = median(measurement),
                   SD = sd(measurement),
                   SE = sd(measurement) / sqrt(n())
  )

Results

Species     attribute      n  Min.  Max.   Mean  Median         SD         SE
setosa      Petal.Length  50   1.0   1.9  1.462    1.50  0.1736640  0.0245598
setosa      Petal.Width   50   0.1   0.6  0.246    0.20  0.1053856  0.0149038
setosa      Sepal.Length  50   4.3   5.8  5.006    5.00  0.3524897  0.0498496
setosa      Sepal.Width   50   2.3   4.4  3.428    3.40  0.3790644  0.0536078
versicolor  Petal.Length  50   3.0   5.1  4.260    4.35  0.4699110  0.0664554
versicolor  Petal.Width   50   1.0   1.8  1.326    1.30  0.1977527  0.0279665
versicolor  Sepal.Length  50   4.9   7.0  5.936    5.90  0.5161711  0.0729976
versicolor  Sepal.Width   50   2.0   3.4  2.770    2.80  0.3137983  0.0443778
virginica   Petal.Length  50   4.5   6.9  5.552    5.55  0.5518947  0.0780497
virginica   Petal.Width   50   1.4   2.5  2.026    2.00  0.2746501  0.0388414
virginica   Sepal.Length  50   4.9   7.9  6.588    6.50  0.6358796  0.0899270
virginica   Sepal.Width   50   2.2   3.8  2.974    3.00  0.3224966  0.0456079
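
One row of the table can be cross-checked without dplyr. Below is a sketch in base R for the setosa Sepal.Length row, using the built-in iris data. The variable names are chosen for this example.

```r
# Sepal.Length of the 50 setosa flowers.
x <- iris$Sepal.Length[iris$Species == "setosa"]

# The same statistics as in the summary table, for this one group.
stats <- c(n = length(x),
           Min. = min(x),
           Max. = max(x),
           Mean = mean(x),
           Median = median(x),
           SD = sd(x),
           SE = sd(x) / sqrt(length(x)))

round(stats, 7)
```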

Assignment

  • Using the wine data set, make a summary table by type and attribute.

Enjoy!