ggplot2 Function Cheat Sheet and R Tutorial

Fedor Karmanov | 7 minute read | January 15, 2019

The ggplot2 package, created by Hadley Wickham, provides a fast and efficient way to produce good-looking data visualizations that you can use to derive and communicate insights from your data sets. The package was designed to help you create all different types of data graphics in R, including histograms, scatter plots, bar charts, box plots, and density plots. This textbook has numerous examples of visualizations you can build in ggplot2.

The ggplot2 package offers a powerful graphics language for creating elegant and complex plots. Originally based on Leland Wilkinson’s The Grammar of Graphics, ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner. Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots (i.e., conditioning), graphs that show relationships between different variables, is relatively simple.

In recent years, ggplot2’s popularity has grown rapidly, and so has the functionality built into the package, which can be overwhelming for someone just getting started. So I created this ggplot2 tutorial and cheat sheet to help you learn the basics.

This is a quick tour of the ggplot2 basics: enough to let you create beautiful visualizations in R.

You can use it as a handy reference, or cheat sheet, if you have just started your data science journey with ggplot2 in R. It can also guide you to what you need when you want to create a specific data visualization in R.

ggplot2 Cheat Sheet of Essential Functions

Here is a downloadable version as a PDF in case you want to have it handy with you as you navigate ggplot2 and data visualization in R.

ggplot2 R Tutorial: Bar Charts

The GitHub repository containing all of the code in this ggplot2 in R tutorial can be found here.

When I get my hands on a new dataset, I often want to take a quick look at the shape of the data and at preliminary results before developing my research any further. While many tutorials offer easy ways of plotting data in one way or another, few tutorials lead you through the first steps of data exploration in R. This ggplot2 in R tutorial will help you make sense of large datasets and give you a framework to do some exploratory graphing of your own.

This ggplot2 in R tutorial assumes that you have already installed R, an IDE of your choice (I use RStudio), and the ggplot2 package. All these programs and packages are easy to access and free to install, so if you don’t have them already, you can use this guide to figure out how to get started. Jupyter with R is the most intuitive way to start with R if you don’t have anything installed. You can install ggplot2 and other libraries using the install.packages() function in R.
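As a minimal sketch of that install step, the snippet below checks which of the three packages used in this tutorial are missing before installing anything. The helper name missing_packages is made up for illustration and is not part of R or ggplot2, and the install call is left commented out because it needs an internet connection.

```r
# Hypothetical helper (not a base R function): returns the names of
# packages from `pkgs` that are not installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The three packages loaded later in this tutorial
needed <- c("ggplot2", "dplyr", "reshape2")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing (requires internet access):
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages you already have every time the script runs.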

For the rest of the tutorial, I will be working on a sample dataset obtained from The Metropolitan Museum of Art in New York City. This dataset contains a set of metadata for all the artworks housed in the museum’s collection, and can be found on GitHub thanks to the Met Museum’s Open Access Initiative.  

First things first: make sure you have installed your libraries, then add the following lines of code at the top of your script.

library(ggplot2)

library(dplyr)

library(reshape2)

You shouldn’t get any errors after running the code above if ggplot2 has been installed correctly.

Now, let’s read in the Metropolitan dataset, which is a raw CSV file.

met.collection <- read.csv(file = "~/Documents/Springboard-Blog/Springboard-Blog-Tutorials/data/MetObjects.csv")

Make sure you change the file path here to whatever it is on your computer! Here’s a quick guide to how to import CSVs into R. You may also have to work with Git LFS, GitHub’s Large File Storage extension for Git, to get the CSV file we’re working with, as it exceeds 200 MB in file size. Here’s a short tutorial on that.
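To get a feel for what the import step does before loading the full file, here is a small sketch that writes a made-up three-row CSV to a temporary file and reads it back. The rows and the header names are invented for illustration and are not Met data. It shows two read.csv behaviors that matter later: headers containing spaces are rewritten with dots (which is how a column can end up being called Artist.Nationality), and stringsAsFactors = TRUE stores text columns as factors.

```r
# Write a tiny, made-up CSV to a temporary file (not the real Met data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Object ID,Artist Nationality,City",
    "1,American,New York",
    "2,French,Paris",
    "3,French,Paris"),
  csv_path
)

# Read it back; text columns become factors because of stringsAsFactors = TRUE
toy <- read.csv(file = csv_path, stringsAsFactors = TRUE)

names(toy)  # spaces in headers are replaced with dots
dim(toy)    # number of rows and columns
str(toy)    # column types at a glance
```

Running dim() and str() on the real met.collection object is a quick way to confirm that the file loaded as expected.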

After R has ingested the table (it may take a while!), we can move to one of my favorite R functions: summary()!

summary(met.collection)

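Summary is a great function because it looks at every column in your dataset and returns an insightful set of statistics about it. If the column is made of numeric values, it will return the minimum, first quartile, median, mean, third quartile, and maximum of the column’s values, along with a count of missing values if there are any.

If a column is made up of strings stored as factors (as in our case; this was the read.csv default before R 4.0), summary returns the counts of the most frequent values in that column. In R 4.0 and later, strings are read as plain character vectors unless you pass stringsAsFactors = TRUE to read.csv, and summary only reports their length and type. The summary() function makes for a great first step for any exploratory data analysis using R.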
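The difference between these cases is easiest to see on a toy example. The data frame below is made up for illustration and has nothing to do with the Met data; it simply shows what summary() returns for a numeric column, a factor column, and a plain character column.

```r
# Made-up data for illustration only
toy <- data.frame(
  height_cm = c(10.5, 22.0, 35.2, 18.4),
  city = factor(c("Paris", "Paris", "New York", "London"))
)

# Numeric column: min, quartiles, median, mean, and max
num_summary <- summary(toy$height_cm)
names(num_summary)

# Factor column: a count for each level
city_summary <- summary(toy$city)
city_summary

# Character column: only length, class, and mode
chr_summary <- summary(as.character(toy$city))
chr_summary

# Whole data frame: one block of statistics per column
summary(toy)
```

If summary(met.collection) shows only Length, Class, and Mode for the text columns, the strings were read as characters rather than factors.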

 

I decided to use the summary() function to narrow where I should explore the data — the dataset has 43 columns in total!

This analysis got me to three interesting columns: which countries artists are from (their nationality), which cities they are from, and a column that collected the number of artworks associated with a particular artist. While a lot of the top-scoring values are obvious (the Met is an American museum, after all), some of the more interesting values are found in other columns, such as “City.” Paris, for instance, is the top-scoring city for artworks across the whole collection, beating New York by a fairly wide margin, which suggests that Paris is a particularly great place to meet talented artists.

Exploratory graphs of a few of these categories could help us find trends in the dataset that are ripe for further exploration. Let’s start with a bar plot of artists’ nationalities found in the Met Collection.

[code lang=”r” toolbar=”true” title=”Bar Plot of Artists Nationalities”]
# Count how often each nationality label appears, then sort by descending frequency
nationality <- data.frame(table(met.collection$Artist.Nationality))
nationality <- nationality[order(nationality$Freq, decreasing = TRUE), ]

# Rows 2 to 11: skip the top row and keep the next ten labels
df <- nationality[2:11, ]
ggplot(df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by Country\n", x = "\nCountry", y = "Frequency\n") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x-axis labels
[/code]

The above code creates a frequency table of all elements found in the “Artist.Nationality” column in the dataframe, and then sorts it in descending order of frequency. I then skip the first row, grab the next ten most frequent terms, and plot them as a bar graph, rotating the x-axis labels by 90 degrees to make them readable.
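One thing the sort does not control is the left-to-right order of the bars: ggplot2 places a factor’s categories by level order (alphabetical by default), not by row order. The sketch below is a possible variation, not the code used for the plots in this tutorial. It uses a made-up vector of nationalities instead of the Met data to show how reorder() could draw the bars from most to least frequent.

```r
library(ggplot2)

# Made-up sample standing in for met.collection$Artist.Nationality
toy <- data.frame(
  Artist.Nationality = c(rep("American", 5), rep("French", 3), "British")
)

# Same frequency-table step as in the tutorial code
freq <- data.frame(table(toy$Artist.Nationality))
levels(freq$Var1)  # alphabetical level order

# reorder() sorts factor levels by a score; -Freq puts the largest count first
ordered_levels <- levels(reorder(freq$Var1, -freq$Freq))
ordered_levels

# Applying reorder() inside aes() changes the bar order without touching the data
p <- ggplot(freq, aes(x = reorder(Var1, -Freq), y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by Country\n", x = "\nCountry", y = "Frequency\n") +
  theme_classic()
p  # displays the plot in an interactive session
```

Ordering bars by height usually makes a ranking easier to read, while alphabetical order makes a specific label easier to find.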

The resulting graph indicates two things: 1) the Met Collection is primarily an American collection, with some affinity for French artists; 2) the Nationality labels need to be cleaned, especially the duplicate labels, so that the results can be read more easily.

Let’s see if we can add nuance to the nationality data above by looking at the most popular cities of origin for the Met Collection Archives:

[code lang=”r” toolbar=”true” title=”Bar Plot of Artists Cities”]
# Count how often each city appears, then sort by descending frequency
city <- data.frame(table(met.collection$City))
city <- city[order(city$Freq, decreasing = TRUE), ]

# Rows 2 to 11: skip the top row and keep the next ten cities
df <- city[2:11, ]
ggplot(df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by City\n", x = "\nCity", y = "Frequency\n") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x-axis labels
[/code]

Wow! Paris really does a number on New York and London. Venice, usually considered a disproportionately large source of the world’s visual art, lags far behind the big culture capitals.

Finally, after all of this geographic analysis, it might be worth knowing which time frame or period predominates in the Met Collection:

[code lang=”r” toolbar=”true” title=”Bar Plot of Art Timeframes”]
# Count how often each date label appears, then sort by descending frequency
date <- data.frame(table(met.collection$Object.Date))
date <- date[order(date$Freq, decreasing = TRUE), ]

# Rows 3 to 11: skip the top two rows and keep the next nine date labels
df <- date[3:11, ]
ggplot(df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", color = "black", fill = "grey") +
  labs(title = "Frequency by Date\n", x = "\nDate", y = "Frequency\n") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x-axis labels
[/code]

The code above produces the final plot. Taken together, the plots suggest that the Met is primarily composed of 19th- and 18th-century artworks, coming either from America or from Europe (most coming from France or Italy). There seems to be a passing interest in art from ancient Egypt or Greece, but not much else by way of non-classical European artworks.
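The three plots above repeat the same steps: count, sort, slice, and plot. As a rough sketch of how that pattern could be wrapped up for reuse, here are two helpers. The names top_values and plot_top_values, and the skip and n parameters, are hypothetical and do not come from the original code or from ggplot2. The example call uses a made-up vector rather than the Met data.

```r
library(ggplot2)

# Hypothetical helper: frequency table of `x`, sorted by descending count,
# dropping the first `skip` rows and keeping at most the next `n` rows.
top_values <- function(x, n = 10, skip = 1) {
  freq <- as.data.frame(table(x), stringsAsFactors = FALSE)
  names(freq) <- c("Var1", "Freq")
  freq <- freq[order(freq$Freq, decreasing = TRUE), ]
  rows <- seq(from = skip + 1, length.out = n)
  freq[rows[rows <= nrow(freq)], ]
}

# Hypothetical helper: bar plot in the same style as the plots above
plot_top_values <- function(x, label, n = 10, skip = 1) {
  df <- top_values(x, n = n, skip = skip)
  ggplot(df, aes(x = Var1, y = Freq)) +
    geom_bar(stat = "identity", color = "black", fill = "grey") +
    labs(title = paste0("Frequency by ", label, "\n"),
         x = paste0("\n", label), y = "Frequency\n") +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

# Made-up example: three blank entries, two Paris, one London
cities <- c("", "", "", "Paris", "Paris", "London")
top_two <- top_values(cities, n = 2, skip = 1)
top_two

city_plot <- plot_top_values(cities, label = "City", n = 2, skip = 1)
```

With the real data, a call such as plot_top_values(met.collection$City, "City") would be the equivalent of the city plot above, assuming the row being skipped is one you actually want to exclude.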

Proper data visualization is essential in the field of data science. Using R’s summary() function and the ggplot2 library, we’ve started breaking down a large data set and looking for insights in this ggplot2 in R tutorial. That work is never finished in a proper data analysis: we urge you to take this ggplot2 in R tutorial and use it to dig out the insights you’d like to see. ggplot2 is also quickly becoming a popular data visualization package among data scientists and data analysts.
