---
title: "GGPlot2 Introduction to Grammar of Graphics"
author: "Arno Kimeswenger"
output:
  html_document:
    code_folding: none # no hide/show button in html document
    number_sections: true
    css: "styles.css"
editor_options:
  markdown:
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  warning = TRUE,
  message = TRUE,
  error = TRUE
)
options(warn = 1)
```

See [ggplot2](https://ggplot2.tidyverse.org/), e.g. for the cheatsheet.

The grammar of graphics is based on the insight that you can uniquely
describe any plot as a combination of:

- **data**
- **mappings**
- **geoms**
- **stats**
- **position**
- **coordinate**
- **scales**
- **facets**
- **themes**
- **annotations and labels**

We use the **mtcars** dataset for demonstration.

Code basis:

```
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>), stat = <STAT>, position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>
```

# Preliminary Work

Load tidyverse:

```{r}
library("tidyverse")
```

# Data

Data is given as a **tibble** (tidyverse) or a **data.frame**. In general
it can be tricky to get data into the correct shape for plots; often the
command **pivot_longer()** helps. Here it is simple.

Read the csv file mtcars.csv. It uses German notation (semicolon as
separator, comma as decimal mark), so we use read_csv2.

```{r}
mtcars.temp <- read_csv2("mtcars.csv")
mtcars.temp |>
  head(3)
```

Get the values of the variable cyl:

```{r}
cyl.values <- mtcars.temp |>
  pull(cyl) |>
  unique() |>
  as.character()
cyl.values
```

Change the ordering (maybe it was already ok) and erase NA:

```{r}
cyl.values <- cyl.values[c(1, 2, 3)]
cyl.values
```

Same for the variable gear:

```{r}
gear.values <- mtcars.temp |>
  pull(gear) |>
  unique() |>
  as.character()
gear.values
```

Change the ordering (maybe it was already ok) and erase NA:

```{r}
gear.values <- gear.values[c(2, 1, 3)]
gear.values
```

Load mtcars.csv again with this information.
```{r}
data.coltypes <- cols(
  name = col_character(),
  mpg = col_double(),
  cyl = col_factor(levels = cyl.values, ordered = TRUE),
  hp = col_double(),
  mass = col_double(),
  am = col_logical(),
  gear = col_factor(levels = gear.values, ordered = TRUE)
)
mtcars <- read_csv2("mtcars.csv", col_types = data.coltypes)
mtcars |>
  head(3)
```

Use glimpse for a quick look at the data:

```{r}
mtcars |>
  glimpse()
```

Levels of all factor variables:

```{r}
mtcars |>
  select(where(is.factor)) |>
  summary()
```

Levels of all ordered variables:

```{r}
mtcars |>
  select(where(is.ordered)) |>
  summary()
```

Full summary:

```{r}
mtcars |>
  summary()
```

# MAPPING

A mapping defines e.g. what is displayed on the x-axis and y-axis in a
scatter plot. We can also define color, fill, shape, etc. depending on a
variable (aes stands for aesthetics).

Here we use mass for the x-axis, mpg for the y-axis and the factor
variable cyl for color, but the plot is empty because we did not define
the geometric object, e.g. scatter plot, bar plot, etc. (see section
GEOM_FUNCTION).

```{r, fig.width=5}
mtcars |>
  ggplot(mapping = aes(x = mass, y = mpg, color = cyl))
```

# GEOM_FUNCTION

To get a real plot we have to define the geometric information too. E.g.
for a scatter plot we use geom_point and for a bar chart we use geom_bar.

```{r, fig.width=5}
mtcars |>
  ggplot(mapping = aes(x = mass, y = mpg, color = cyl)) + # we can use "+" to tell ggplot that more information is coming
  geom_point()
```

We got the above warning because there are NA entries:

```{r}
mtcars |>
  filter(if_any(everything(), is.na))
```

We can manipulate the data to get the same plot without a warning. But be
careful here! Do not use drop_na on the whole data. Otherwise, Ferrari
Dino is deleted too, which is not necessary!

```{r, fig.width=5, fig.height=3}
mtcars |>
  select(mass, mpg, cyl) |> # first select relevant features
  drop_na() |> # then drop rows with NA-entries, the NA-entry for hp is not relevant here and Ferrari Dino will not be deleted because of select!
  ggplot(mapping = aes(x = mass, y = mpg, color = cyl)) +
  geom_point()
```

Same plots:

```{r}
# mtcars |>
#   ggplot(aes(x = mass, y = mpg, color = cyl)) + # we do not need to write mapping=aes(...)
#   geom_point()

# mtcars |>
#   ggplot() +
#   geom_point(aes(x = mass, y = mpg, color = cyl)) # we can use mapping/aes in each geom object
```

Later in section SCALING we will analyse the scaling part in more
detail. For now we only have to know that:

- On the x-axis the variable mass is used and its values [1.513, 5.424]
  are scaled to the plot width of 5 inch used above (see
  `mtcars |> pull(mass) |> min()` and `mtcars |> pull(mass) |> max()`).
- On the y-axis the variable mpg is used and its values [10.4, 33.9] are
  scaled to the plot height of 3 inch used above (see
  `mtcars |> pull(mpg) |> min()` and `mtcars |> pull(mpg) |> max()`).
- For coloring the variable cyl is used with the values "4", "6" and
  "8" (see `mtcars |> pull(cyl)`). These values are automatically scaled
  to the 3 colors above, which can be changed later in section SCALING.

Next we use a continuous variable for color, resulting in a color
gradient:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = mpg, color = hp)) +
  geom_point()
```

But of course we can also use a single color for all points. Attention:
color = "darkred" is outside of aesthetics.

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point(color = "darkred")
```

Same with:

```{r, fig.width=5}
# mtcars |>
#   ggplot() +
#   geom_point(aes(x = mass, y = mpg), color = "darkred") # here we see that color is defined outside aesthetics
```

Another geom object is e.g. geom_bar. geom_bar uses the variable cyl and
automatically computes the counts.

```{r, fig.width=5}
mtcars |>
  ggplot(mapping = aes(x = cyl)) +
  geom_bar()
```

If you want to supply the counts via a variable, you can use geom_col.
```{r, fig.width=5}
mtcars |>
  select(cyl) |>
  group_by(cyl) |>
  summarise(
    count = n()
  ) |>
  ggplot(mapping = aes(x = cyl, y = count)) +
  geom_col()
```

# STAT

In most cases we can compute some statistics for the data ourselves and
plot this information. But ggplot can also compute statistical
information directly, which is quite useful. E.g. for a density plot we
need the statistical information density:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_line(stat = "density")
```

But there is a simpler shortcut, which is more commonly used:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density()
```

We can compute the statistical information (density) outside of ggplot
and plot it, but this is more complicated:

```{r, fig.width=5}
density_function <- mtcars |>
  pull(mass) |>
  density(kernel = "gaussian", cut = 0) # define the range of the density by using [min, max] of variable mass

density_data <- tibble(
  x = density_function$x,
  y = density_function$y
)

density_data |>
  ggplot(aes(x = x, y = y)) +
  geom_line()
```

Often we want to use statistical information for a variable in
aesthetics. E.g. we want to use geom_bar with relative counts. But this
is not what we want (all values are 100%):

```{r, fig.width=5}
mtcars |>
  ggplot() +
  geom_bar(aes(x = cyl, y = after_stat(prop)))
```

Same problem in the next plot, but here we see in more detail what is
happening: each value of the variable cyl forms its own group, and the
proportion is computed within that group. We get:

- cyl4: 9 of 9 is 1
- cyl6: 5 of 5 is 1
- cyl8: 12 of 12 is 1
- NA: 2 of 2 is 1

```{r, fig.width=5}
mtcars |>
  ggplot() +
  geom_bar(aes(x = cyl, y = after_stat(prop), group = cyl))
```

But we do not want to split the comparison, we only want one "group"
with 9+5+12+2=28 observations.
We get:

- cyl4: 9 of 28 is 32.14%
- cyl6: 5 of 28 is 17.86%
- cyl8: 12 of 28 is 42.86%
- NA: 2 of 28 is 7.14%

```{r, fig.width=5}
mtcars |>
  ggplot() +
  geom_bar(aes(x = cyl, y = after_stat(prop), group = 1))
```

Again we can compute this statistical information (relative counts)
manually and plot it:

```{r, fig.width=5}
mtcars |>
  select(cyl) |>
  group_by(cyl) |>
  summarise(
    count = n()
  ) |>
  mutate(
    rel_count = count / sum(count)
  ) |>
  ggplot(aes(x = cyl, y = rel_count)) +
  geom_col()
```

To get percentages, e.g. 40% instead of 0.4, see section SCALING.

# POSITION

E.g. show points in a boxplot. For the first time we use two geometric
objects in one plot. The default for geom_point with x = 0 is not nice:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(y = mass)) +
  geom_boxplot() +
  geom_point(aes(x = 0))
```

And here with jittered position:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(y = mass)) +
  geom_boxplot() +
  geom_point(aes(x = 0), position = position_jitter(width = 0.3))
```

For a bar chart we get stacked bars here:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = cyl, fill = am)) +
  geom_bar()
```

If we want to get the bars side by side, we can use a position argument:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = cyl, fill = am)) +
  geom_bar(position = "dodge")
```

# COORDINATE_FUNCTION

First we have a look at a scatter plot without changing coordinates:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point()
```

Logarithmic axis for y by changing the coordinate. We do not modify the
data here. The difference will be discussed in section SCALING.

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  coord_transform(y = "log10")
```

Next, we use a density function and zoom out/in. But first without zoom:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density()
```

Next with zoom out. The density exists outside of [min, max], but nothing
is plotted outside. We change the x-coordinate but not the
data/function.
Here `mtcars |> pull(mass) |> min()` gives 1.513 and
`mtcars |> pull(mass) |> max()` gives 5.424.

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 7))
```

If we want to plot the function outside too, we have to scale the
data/function, which can be done with scale_x_continuous or the short
form xlim, see section SCALING.

Next we zoom in:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density() +
  coord_cartesian(xlim = c(3, 4))
```

# SCALING

Of course there is always an implicit scaling (x-axis: mass in [1.513,
5.424] to e.g. [0, 5inch], or color: cyl with cyl4 = violet, cyl6 =
turquoise, cyl8 = yellow), but we can also use explicit scaling:

```{r, fig.width=5}
mtcars |>
  ggplot(mapping = aes(x = am, y = mpg, color = cyl)) +
  geom_point() +
  scale_color_manual("cylinders", values = c("4" = "darkred", "6" = "darkblue", "8" = "darkgreen"), na.value = "black") + # scale color
  scale_y_continuous(breaks = seq(0, 40, by = 10), minor_breaks = seq(0, 40, by = 5)) + # scale y-axis (continuous) breaks
  scale_x_discrete(labels = c("FALSE" = "automatic", "TRUE" = "manual")) # scale x-axis (discrete) labels
```

We have already seen the following bar plot with relative values in
section STAT. But now we scale to percentages, e.g. 0.2 to 20%.

```{r, fig.width=5}
mtcars |>
  ggplot() +
  geom_bar(aes(x = cyl, y = after_stat(prop), group = 1)) +
  scale_y_continuous(labels = scales::percent)
```

We have already seen the following plot with a logarithmic y-axis in
section COORDINATE_FUNCTION. We only change the breaks on the y-axis:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  coord_transform(y = "log10") +
  scale_y_continuous(breaks = seq(10, 35, by = 5))
```

We get the same with the following scaling command. So a logarithmic
plot can be made by modifying the coordinate or by modifying the data.
```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  scale_y_log10(breaks = seq(10, 35, by = 5))
```

This is equivalent to (except the values on the y-axis, which have to be
understood as 10^...)

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass, y = log10(mpg))) +
  geom_point() +
  scale_y_continuous(breaks = round(log10(seq(10, 35, by = 5)), 1))
```

The practical difference between changing the coordinate and scaling is:
a coordinate transformation is applied after statistics are computed
(e.g. the fit of geom_smooth), whereas scaling is applied before
statistics are computed.

In section COORDINATE_FUNCTION we have also zoomed in/out by changing
the coordinate. Now we will change the data by scaling and see the
difference.

With scale_x_continuous(limits = c(0, 7)) or the short version xlim(0,
7) we scale the data/function to [0, 7], therefore the function is
plotted in the full interval.

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density() + # density is plotted in [0, 7]
  scale_x_continuous(limits = c(0, 7))
```

Compare with the above command which does not change the data/function,
only the coordinates:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 7))
```

If we want to zoom in, we have already seen the command in section
COORDINATE_FUNCTION:

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density() +
  coord_cartesian(xlim = c(3, 4))
```

If we use scaling instead, we get something weird: a warning tells us
that some information (rows) is removed from our data/function. This is
because we scale the data to [3, 4] and compute the density only for the
points inside this interval. This is not a zoom of the original plot
because we use only some of the observations! Do we really want that?

```{r, fig.width=5}
mtcars |>
  ggplot(aes(x = mass)) +
  geom_density() +
  scale_x_continuous(limits = c(3, 4))
```

This is the same but without warnings because we filter out the
small/large values.
```{r, fig.width=5}
mtcars |>
  filter(3 <= mass & mass <= 4) |> # here we see what is happening: values outside [3, 4] are not used in the further computations/plot
  ggplot(aes(x = mass)) +
  geom_density() +
  scale_x_continuous(limits = c(3, 4))
```

# FACET_FUNCTION

Multiple plots (matrix) where e.g. rows correspond to cyl and columns to
am.

```{r, fig.width=8}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  # facet_grid(cyl ~ am) # e.g. with labels "FALSE" and "TRUE"
  facet_grid(cyl ~ am, labeller = label_both) # here e.g. with labels "am: FALSE" and "am: TRUE"
```

Or only with cyl in columns:

```{r, fig.width=8}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  facet_grid(. ~ cyl, labeller = label_both)
```

Or only with cyl in rows:

```{r, fig.width=8}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  facet_grid(cyl ~ ., labeller = label_both)
```

Alternatively we can get a list of 6 plots (2 were empty) in one row:

```{r, fig.width=8}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  facet_wrap(cyl ~ am, labeller = label_both, nrow = 1)
```

Or a list of 6 plots in two rows:

```{r, fig.width=8}
mtcars |>
  ggplot(aes(x = mass, y = mpg)) +
  geom_point() +
  facet_wrap(cyl ~ am, labeller = label_both, nrow = 2)
```

# THEME

Of course we can change the theme too (here we use the black/white theme
with a larger text size and change the legend position):

```{r, fig.width=5}
mtcars |>
  select(mass, mpg, cyl) |> # first select relevant features
  drop_na() |> # then drop rows with NA-entries, the NA-entry for hp is not relevant here and Ferrari Dino will not be deleted because of select!
  ggplot(mapping = aes(x = mass, y = mpg, color = cyl)) +
  geom_point() +
  theme_bw(base_size = 18) +
  theme(
    legend.position = "bottom"
  )
```

# Annotations and labels

Change annotations and labels to make the plot clearer:

```{r, fig.width=5}
my.car.x <- 2.05 # I manually define x coordinate, FALSE -> 1, TRUE -> 2, so we use 2 plus a little bit
my.car.y <- mtcars |>
  filter(name == "Ferrari Dino") |>
  pull(mpg)
my.car.text <- "my car"

mtcars |>
  ggplot(mapping = aes(x = am, y = mpg, color = cyl)) +
  geom_point() +
  scale_color_manual("cylinders", values = c("4" = "darkred", "6" = "darkblue", "8" = "darkgreen"), na.value = "black") +
  scale_y_continuous(breaks = seq(0, 40, by = 10), minor_breaks = seq(0, 40, by = 5)) +
  scale_x_discrete(labels = c("FALSE" = "automatic", "TRUE" = "manual")) +
  labs(title = "mtcars", x = "transmission", y = "miles per gallon") +
  annotate("text", x = my.car.x, y = my.car.y, label = my.car.text, hjust = 0)
```

Or the above bar chart with scaling to percentages and new labels.

```{r, fig.width=5}
mtcars |>
  ggplot() +
  geom_bar(aes(x = cyl, y = after_stat(prop), group = 1)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "cylinders", y = "")
```

# Data (again) and pivot_longer

Often we need the data in long format (pivot_longer) to get the correct
plot. E.g.

```{r}
relig_income |>
  head(3)
```

First we bring the data into long format with pivot_longer:

```{r}
relig_income_new <- relig_income |>
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )
relig_income_new |>
  head(10)
```

We want to plot the large groups only.

```{r}
relig_used <- relig_income_new |>
  group_by(religion) |>
  summarise(total = sum(count)) |>
  filter(total > 1000) |>
  pull(religion)
relig_used
```

And the corresponding bar chart. geom_bar is used when ggplot has to
count. geom_col is used when the counts are already given in a variable.
```{r, fig.width=10}
relig_income_new |>
  filter(religion %in% relig_used) |>
  ggplot(aes(x = income, y = count, fill = religion)) +
  geom_col() +
  scale_fill_viridis_d() # viridis is a nice color palette, d stands for discrete
```
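
As a minimal sketch of what pivot_longer does, the following chunk uses a
small made-up table. The names toy_wide and toy_long and all values in
them are hypothetical and not part of relig_income. Each income column
becomes one row per religion, which is the shape geom_col needs when the
counts are already given in a variable.

```{r}
library("tidyverse")

# hypothetical toy data in wide format: one column per income class
toy_wide <- tibble(
  religion = c("A", "B"),
  low = c(10, 20),
  high = c(30, 40)
)

# long format: one row per combination of religion and income class
toy_long <- toy_wide |>
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )
toy_long

# in long format each variable can be mapped to one aesthetic
toy_long |>
  ggplot(aes(x = income, y = count, fill = religion)) +
  geom_col()
```

The wide table has 2 rows and 2 income columns, so the long table has 2
times 2 = 4 rows, while the total of the counts stays the same.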