Tuesday, October 27, 2015

Mapping with ggplot: Create a nice choropleth map in R

I was working on making a map in R recently and after an extensive search online, I found a hundred different ways to do it and yet each way didn't work quite right for my data and what I wanted to do. Eventually, I found a way to make it work after pulling together code from [this blog](http://blog.revolutionanalytics.com/2009/11/choropleth-challenge-result.html ), [this blog](http://www.kevjohnson.org/making-maps-in-r/ ), and [this stack overflow page](http://stackoverflow.com/questions/11052544/convert-map-data-to-data-frame-using-fortify-ggplot2-for-spatial-objects-in-r ). Combining all of those sources, along with a good amount of trial and error, has led to this blog post. This post shows how to use ggplot to map a chloropleth map from shapefiles, as well as change legend attributes and scale, color, and titles, and finally export the map into an image file. ### Preparing the data We need a number of packages to make this work, as you can see below so make sure to install and load those. We load in shapefiles using the **readShapeSpatial()** function from the maptools library. A shapefile is a geospatial vector data storage format for storing map attributes like latitude and longitude. You can download lots of shape files for free! For example, for any country in the world you can download shape files [here](http://www.diva-gis.org/gdata). Or just google it. My project is to graph some data on India so I have downloaded a shapefile with state boundaries of India and I will now load that into R. library(ggplot2) library(maptools) library(rgeos) library(Cairo) library(ggmap) library(scales) library(RColorBrewer) set.seed(8000) ##input shapefiles setwd("/Users/Analysis/Shapefiles/") The class of the loaded shapefiles are of a spatial class, and we can see what is in it by checking its names. You can print out these components to see what they are. For example, let's print out the state names like this: print(states.shp$NAME_1) Now we'll get the data we want to plot on the map. Here I'll just make the data up, but you can import a csv of it or extract it from your model estimates. The important thing is that the id numbers are the same in the shape file as in your data you want to plot, because that will be necessary for them to merge properly (you could also merge by the name itself). Important Note: ID_1 is the ID that uniquely identifies that state in this shapefile so that is what I'm using in my prevalence data and that is what I use when I fortify in the next paragraph. You should look at your shapefiles and determine what ID uniquely determines the data and use that (for example, it may be ID_2). This ID must also be the SAME ID as in the mapping data (what is called here mydata). Please do this to check: If those are not exactly the same (and unique), you will encounter an error later on. Please change your id to match the states.shp ID you are using. ##create (or input) data to plot on map Now we need to merge the shapefile and the dataset together. First, we fortify() the shapefile (a function of ggplot) to get it into a dataframe. We need to include a region identifier so that ggplot keeps that id when it turns it into a dataframe; otherwise it will make it up and it will not be possible to merge properly. We can see that the class is now a common dataframe and each row is a longitude and latitude. Now we can merge the two dataframes together by the id variable, making sure to keep all the observations in the spatial dataframe. Importantly, we need to order the data by the "order" variable. You can see what happens if you don't do this and it's not pretty. Basic plot We're ready to plot the data using ggplot(), along with geom_polygon() and coord_map(). Note that in the aes statement, group=group does NOT change; "group" is a variable in the fortified dataframe so just leave it. You do need to change "prevalence" to whatever variable you want to plot. ggplot() + geom_polygon(data = final.plot, aes(x = long, y = lat, group = group, fill = prevalence), color = "black", size = 0.25) + coord_map() Troubleshooting: I have had many comments about an error occuring at this step, with usually this error: If you receive this error, that means that the shapefile did not merge properly with the mapping data. Go back to the Important Note above, and follow the instructions for making sure you have unique and matching IDs across your shapefile and mapping data. This is a great start, but there are a number of options that can make the map look much nicer. First, we can remove the background and gridlines by including **theme_nothing()** from the ggmap package, but indicating that we still want the legend. We can also change the title of the legend and add in a title for the whole map. Next, we can change the palette to "YlGn" which is Yellow to Green. You can find palette options as well as help on color in R [here](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf), or type in display.brewer.all() to see a plot of them in R. We can also use **pretty_breaks()** from the scales package to get nice break points (the pretty algorithm creates a sequence of about n+1 equally spaced round values, that are 1, 2, or 5 times a power of 10). If you don't like the way pretty_breaks looks, you can change your scale with scale_fill_gradient() and set the limits and colors like this: scale_fill_gradient(name="My var", limits=c(0,100), low="white", high="red") I do this in the third map, below. More about using color gradients in R, [here](http://docs.ggplot2.org/current/scale_gradient.html). #nicer plot ggplot() + geom_polygon(data = final.plot, aes(x = long, y = lat, group = group, fill = prevalence), color = "black", size = 0.25) + coord_map()+ scale_fill_distiller(name="Percent", palette = "YlGn", breaks = pretty_breaks(n = 5))+ theme_nothing(legend = TRUE)+ labs(title="Prevalence of X in India") Other options: If we wanted to reverse the color to go from 0 being dark instead of light, we can just add in the option trans="reverse" into the scale_fill_distiller statement (see the plot below). If we were plotting a discrete variable instead of a continuous one, we would use scale_fill_manual instead of distiller, like this: scale_fill_manual(name="My discrete variable", values=c("red", "blue", "green", "yellow")) Additionally, we can put the names of the states on the map. This may or may not be a great idea depending on the size of the areas you're mapping, but if it's useful, you can do it easily using __geom_text()__. First we need to identify a longitude and latitude where you want to plot each state name. One way we can do this is using **aggregate()** to calculate the mean of the range of the longitude and latitude for each state. #aggregate data to get mean latitude and mean longitude for each state ggplot() + geom_polygon(data = final.plot, aes(x = long, y = lat, group = group, fill = prevalence), color = "black", size = 0.25) + coord_map() + scale_fill_gradient(name="Percent", limits=c(0,70), low="white", high="red")+ theme_nothing(legend = TRUE)+ labs(title="Prevalence of X in India")+ geom_text(data=cnames, aes(long, lat, label = NAME_1), size=3, fontface="bold") This doesn't look great on this particular map, but it may be useful in other settings. You could also use geom_text() to put other text on the map to identify certain features or highlight an area. Exporting map Finally, one way we can export the map is by using Cairo. For this to work on my mac, I had to install X11, found [here](http://xquartz.macosforge.org/landing/). We save the plot in an object and then use ggsave() to export the object as a png. First save the ggplot statement in an object, and then export with ggsave, indicating the width and height. You can also specify the dpi (default is 300). Another way to save to pdf is via the pdf function, also shown below. #save map ggsave(p2, file = "map1.png", width = 6, height = 4.5, type = "cairo-png") #another way to save pdf("mymap.pdf") print(p2) dev.off()

Tuesday, August 25, 2015

How to use lists in R

In the [last post](http://rforpublichealth.blogspot.com/2015/03/basics-of-lists.html), I went over the basics of lists, including constructing, manipulating, and converting lists to other classes. Knowing the basics, in this post, we'll use the **apply()** functions to see just how powerful working with lists can be. I've done two posts on apply for dataframes and matrics, [here](http://rforpublichealth.blogspot.com/2012/09/the-infamous-apply-function.html) and [here](http://rforpublichealth.blogspot.com/2013/10/loops-revisited-how-to-rethink-macros.html), so give those a read if you need a refresher. Intro to apply-based functions for lists There are a variety of apply functions that can be used depending on what you want to do. The table below shows the function, what it inputs, and what it outputs: For example, if we have a list and you want to produce a vector (of the same length), we use **sapply()**. If we have a vector and want to produce a list of the same length, we use **lapply()**. Let's try an example. The syntax of lapply is: lapply(INPUT, function(x) (Some function here)) where INPUT, as we see from the table above, must be a vector or a list, and function(x) is any kind of function that takes **each element of the INPUT** and applies the function to it. The function can be something that already exists in R, or it can be a new function that you've written up. For example, let's construct a list of 3 vectors like so: mylist<-list(x=c(1,5,7), y=c(4,2,6), z=c(0,3,4)) and now we can use lapply to find the mean of each element of the list (each of the vectors x, y, and z), and output to a new list: lapply(mylist, function(x) mean(x)) But let's say we wanted the result in a vector, not in a list, for whatever reason. Instead of doing the above and then converting the list into a vector (using unlist() or ldply() or whatever), we can do this directly using **sapply()** instead of **lapply()**. That's because as you can see in table, **sapply()** can take in a list as the input, and it will return a vector (or matrix). Let's try it: sapply(mylist, function(x) mean(x)) This is really great! Anytime you want to do the same thing over and over again, put all those things in a list and then use one of the apply functions. This reduces the need to run a loop, which can take a lot longer. Let's do another example where you write your own function this time: #write function to find the span of numbers in a vector and check if it's larger than 5 span.fun<-function(x) {(max(x)-min(x))>=5} #apply that function to the list sapply(mylist, span.fun) Creating a list using lapply You don't need to have a list already created to use lapply() - in fact, lapply can be used to _make_ a list. This is because the key about **lapply()** is that it *returns* a list of the same length as whatever you input. For example, let's initialize a list to have 2 empty matrices that are size 2x3. We'll use lapply(): our input is just a vector containing 1 and 2, and the function we specify uses the matrix() function to construct a 2x3 matrix of empty cells for each element of this vector, so it returns a list of two such matrices. If instead of empty matrices you wanted to fill these matrices with random numbers, you could do so too. Check out both possibilities below. #initialize list to to 2 empty matrices of 2 by 3 list2<-lapply(1:2, function(x) matrix(NA, nrow=2, ncol=3)) list2 #initialize list to 2 matrices with random numbers from standard normal distribution list2<-lapply(1:2, function(x) matrix(rnorm(6, 10, 1), nrow=2, ncol=3)) Again, we can use **lapply()** or **sapply()** on this newly created list to get the sum of each column of each matrix: #input list, output column sums of each matrix into a new list lapply(list2, colSums) #input list, output column sums into a **vector** (which binds them into a matrix) sapply(list2, colSums) #instead of binding, we can stack these column sums by using tranpose function t(): t(sapply(list2, colSums)) Practical uses of lists using lapply Finally, what are lists good for? Often, I find a lists are great when I want to store multi-dimensional objects into one object, for example group a bunch of data.frames into a list, or store all my model results into one list. Here's an example, where I run four linear models for four different outcomes. I want to store all my models into one object. There are two ways to do this: Use a for() loop and insert the results of each iteration into the list Use lapply! Faster and less code #create some data set.seed(2000) x=rbinom(1000,1,.6) mydata<-data.frame(trt=x,out1=x*3+rnorm(1000,0,3),out2=x*5+rnorm(1000,0,3),out3=rnorm(1000,5,3),out4=x*1+rnorm(1000,0,8)) head(mydata) Now I want to run each of the four outcomes on the trt variable using linear regression and save the results. I'll do this first as a loop, then using **lapply()**: #1. Use a loop #first, initialize the results list results<-vector("list", 4) #now use a loop for each outcome for(i in 1:4){ results[[i]]<-lm(mydata[,i+1]~trt, data = mydata) } #2.Use lapply in one statement! results<-lapply(2:5, function(x) lm(mydata[,x]~trt, data = mydata)) In the second case, we are taking the vector c(2,3,4,5) and for each component of this vector, we're running the model that we describe in the function. We can always name the components of the list as below, and I'll print out the first two elements: names(results)<-names(mydata)[2:5] print(results, max=2) Why is this a great way to store data? Well, we can __keep__ using the **apply()** functions, for example to put together all of the treatment effects for each outcome into one matrix: #extract coefficient and std error for each outcome and store in a matrix sapply(results, function(x) summary(x)$coefficients[2,1:2]) You can also easily use other functions like **stargazer()** (previous post on this function [here](http://rforpublichealth.blogspot.ie/2013/08/exporting-results-of-linear-regression_24.html)) to create a quick table of results like so (in latex code): require(stargazer) stargazer(results, column.labels=names(results),keep.stat=c("rsq","n"),dep.var.labels="") library(png) library(grid) img <- readPNG("~/Dropbox/Harvard Doctoral/Rforpublichealth/post30/sgtable2.png") grid.raster(img) Or easily create a graph of the model estimates and 95 confidence intervals: #extract coefficients from the list coefs<-as.data.frame(t(sapply(results, function(x) summary(x)$coefficients[2,1:2]))) #add outcome columnn and change name of SE column coefs$Outcome<-rownames(coefs) names(coefs)[2]<-"SE" #use ggplot to plot all the estimates require(ggplot2) ggplot(coefs, aes(Outcome,Estimate)) + geom_point(size=4) + theme(legend.position="none")+ labs(title="Treatment effect on outcomes", x="", y="Estimate and 95% CI")+ geom_errorbar(aes(ymin=Estimate-1.96*SE,ymax=Estimate+1.96*SE),width=0.1)+ geom_hline(yintercept = 0, color="red")+ coord_flip() I hope that was useful! There are many great ways to use lists and the **apply()** functions to make your programming more efficient and less prone to errors. For another resource on using the **apply()** functions with lists, definitely check out [this StackOverflow page](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega).

Monday, March 9, 2015

Basics of Lists

Lists are a data type in R that are perhaps a bit daunting at first, but soon become amazingly useful. They are especially wonderful once you combine them with the powers of the apply() functions. This post will be part 1 of a two-part series on the uses of lists. In this post, we will discuss the basics - how to create lists, manipulate them, describe them, and convert them. In part 2, we'll see how using lapply() and sapply() on lists can really improve your R coding. Creating a list Let's start with some basics: what lists are and how to create them. A list is a data structure that can hold any number of any types of other data structures. If you have vector, a dataframe, and a character object, you can put all of those into one list object like so: create three different classes of objects add all three objects to one list using list() function print list list1 We can also turn an object into a list by using the as.list() function. Notice how every element of the vector becomes a different component of the list. coerce vector into a list Manipulating a list We can put names on the components of a list using the names() function. This is useful for extracting components. We could have also named the components when we created the list. See below: name the components of the list list1 could have named them when we created list Extract components from a list (many ways): the first is using the [[ ]] operator (notice two brackets, not just one). Note that we can use the single [ ] operator on a list, but it will return a list rather than the data structure that is the component of the list, which is normally not what we would want to do. See what I mean here: extract 3rd component using -> this returns a *string* list1[[3]] print a list containing the 3rd component -> this returns a *list* list1[3] It is also possible to extract components using the operator or the components name. Again, be careful about the [ ] vs [[ ]] operator in the second way. You need the [[ ]] to return the data structure of the component. extract 3rd component using extract 3rd component using [[ ]] and the name of the component Subsetting a list - use the single [ ] operator and c() to choose the components subset the first and third components list1[c(1,3)] We can also add a new component to the list or replace a component using the $ or [[ ]] operators. This time I'll add a linear model to the list (remember we can put anything into a list). add new component to existing list using add a new component to existing list using Finally, we can delete a component of a list by setting it equal to NULL. delete a component of existing list list1 Notice how the 5th component doesn't have a name because we didn't assign it one when we added it in. Now if we want to extract the dataframe we have in the list, and just look at it's first row, we would do list1[[2]][1,]. This code would take the second component in the list using the [[ ]] operator (which is the dataframe) and once it has the dataframe, it subsets the first row and all columns using only the [ ] operator since that is what is used for dataframes (or matrices). For help on subsetting matrices and dataframes, check out [this post](http://rforpublichealth.blogspot.com/2012/10/quick-and-easy-subsetting.html). extract first row of dataframe that is in a list list1[[2]][1,] Describing a list To describe a list, we may want to know the following: the class of the list (which is a list class!) and the class of the first component of the list. describe class of the whole list class(list1) describe the class of the first component of the list class(list1[[1]]) the number of components in the list - use the length function() find out how many components are in the list length(list1) a short summary of each component in the list - use str(). (I take out the model because the output is really long) take out the model from list and then show summary of what's in the list str(list1) Now we can combine these functions to append a component to the end of the list, by assigning it to the length of the list plus 1: Initializing a list To initialize a list to a certain number of null components, we use the vector function like this: initialize list to have 3 null components and print list2 Converting list into matrix or dataframe Finally, we can convert a list into a matrix, dataframe, or vector in a number of different ways. The first, most basic way is to use unlist(), which just turns the whole list into one long vector: convert to one long string - use unlist unlist(list1) But often we have matrices or dataframes as components of a list and we would like to combind them or stack them into one dataframe. The following shows the two good ways I've found to do this (from [this StackOverflow](http://stackoverflow.com/questions/4227223/r-list-to-data-frame) page) using ldply() from the plyr package or rbind(). Here, we first create a list of matrices and then convert the list of matrices into a dataframe. create list of matrices and print mat.list convert to data frame 1. use ldply require(plyr) ldply(mat.list, data.frame) 2. use rbind do.call(rbind.data.frame, mat.list) Get ready for part 2 next time, when we'll get to what we can use lists for and why we should use them at all.

Friday, December 26, 2014

Animations and GIFs using ggplot2

Tracing a regression line Diverging density plots Happy New Year plot

Happy New Year everyone! For the last post of the year, I thought I'd have a little fun with the new animation package in R. It's actually really easy to use. I recently had some fun with it when I presented my research at an electronic poster session, and had an animated movie embedded into the powerpoint. All of the GIFs above use ggplot and the animation packages. The main idea is to iterate the same plot over and over again, changing incrementally whatever it is that you want to move in the graph, and then save all those plots together into one GIF. Let's start with first plot that traces a regression line over the scatterplot of the points. We'll make up some data and fit a loess of two degrees (default). You could easily do this with any kind of regression. #make up some data tracedat<-data.frame(x=rnorm(1000,0,1)) tracedat$y<-abs(tracedat$x)*2+rnorm(1000,0,3) #predict a spline fit and add predicted values to the dataframe loess_fit <- loess(y ~ x, tracedat) tracedat$predict_y<-predict(loess_fit) Now let's make the completed scatterplot and loess line using ggplot. If you need help on how to plot a scatterplot in ggplot, see my post here: [ggplot2: Cheatsheet for Scatterplots](http://rforpublichealth.blogspot.com/2013/11/ggplot2-cheatsheet-for-scatterplots.html). It is possible to use **stat_smooth()** within **ggplot** to get the loess fit without predicting the values and using **geom_line()**, but the predicted values are going to make it easier to make the animation. #plot finished scatterplot with loess fit ggplot(tracedat, aes(x,y)) + geom_point() + geom_line(data=tracedat, aes(x,predict_y), color="red", size=1.3) + scale_x_continuous(limits=c(-3, 3)) + scale_y_continuous(limits=c(-10, 10)) Now what we need is for the loess fit to appear bit by bit, so to do this we'll cut off the dataframe for **geom_line** for only those x-values up to a certain cutoff x-value (by subsetting the dataframe called tracedat in the **geom_line** statement). Then we'll just keep moving that cutoff forward as we iterate over the range of all x-values. First, we will build a function that takes the cutoff value as an argument. Then we can pass whatever value of x we want and it will only graph the line up to that cutoff. Notice how the scatterplot itself, however, is for the full data. For more on how to write functions, see my post about them, [here](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html). #function to draw the scatterplot, but the curve fit only up to whatever index we set it at #try it out: draw curve up to cutoff x-value of -2 draw.curve(cutoff=-2) Almost done! Now we just need to iterate the **draw.curve()** function we just created for the full range of the values of x. So we'll use **lapply()** to iterate the **draw.curve()** function over the sequence of i=-3 to 3 (incrementing by .2) and we call the **draw.curve()** function for each value of i. Finally, we'll use the saveGIF() function from the animation package to stick all the images together successively into one GIF. The interval argument tells you how fast the GIF will move from one image to the next, and you can give it a name. If you don't want the GIF to loop back to the start again, you would add an argument "loop=FALSE" into the function call. #function to iterate over the full span of x-values trace.animate <- function() { lapply(seq(-3,3,.2), function(i) { draw.curve(i) }) } #save all iterations into one GIF saveGIF(trace.animate(), interval = .2, movie.name="trace.gif") ####2. Diverging density plots The same idea used above is used to make the diverging density plots (plot 2 above). Here we are showing the distribution of scores before some intervention and after the intervention. We need to create the code for a ggplot density plot, and turn it into a function that can take as an argument the variable we want to plot (that way we can use the same code for the "before" intervention plot and the "after" plot. Then to animate, we'll iterate between them. We'll make up some data and plot the "before" plot to start. Notice that we'll use **aes_string** rather than simply **aes** in the ggplot statement in order to be able to pass the data in as an argument when we turn this into a function. More on how to plot distributions using **ggplot** in my post [ggplot2: Cheatsheet for Visualizing Distributions](http://rforpublichealth.blogspot.ie/2014/02/ggplot2-cheatsheet-for-visualizing.html). It's important to set the scale for x and y axes so that when we iterate over the two plots, we have the same dimensions each time. The alpha argument in **geom_density** makes the colors more transparent. dist.data<-data.frame(base=rnorm(5000, 0, 3), follow=rnorm(5000, 0, 3), type=c(rep("Type 1",2500),rep("Type 2",2500))) dist.data$follow<-ifelse(dist.data$type=="Type 2", dist.data$follow+12, dist.data$follow) #plot one to make sure it's working. Use aes_string rather than aes p<-ggplot(dist.data, aes_string("base", fill="type")) + geom_density(alpha=0.5) + theme(legend.position="bottom") + scale_x_continuous(limits=c(-10, 20)) + scale_y_continuous(limits=c(0, 0.20)) + scale_fill_manual("", labels=c("Type 1", "Type 2"), values = c("orange","purple")) + labs(x="Score",y="Density", title="title") Again, we write two functions: one that draws a density plot based on the arguments passed to it (**plot.dens()**), and one that iterates over the two different plots (called **distdiverge.animate()**). In the **dist.diverge.animate()** function, we pass the plot.item (which is a character class that aes_string will understand as the name of the column in dist.data to plot), and the title, which is also a character class. #function that plots a density plot with arguments for the variable to plot and the title plot.dens<-function(plot.item, title.item){ p<-ggplot(dist.data, aes_string(plot.item, fill="type"))+ geom_density(alpha=0.5) + theme(legend.position="bottom") + scale_x_continuous(limits=c(-10, 20)) + scale_y_continuous(limits=c(0, 0.20)) + scale_fill_manual("", labels=c("Type 1", "Type 2"), values = c("orange","purple"))+ labs(x="Score",y="Density", title=title.item) print(p)} #try it out - plot it for the follow data with the title "After Intervention" plot.dens(plot.item="follow", title.item="After Intervention") #function that iterates over the two different plots distdiverge.animate <- function() { items<-c("base", "follow") titles<-c("Before Intervention","After Intervention") lapply(seq(1:2), function(i) { plot.dens(items[i], titles[i]) })} We'll make the interval slower so there's more time to view each plot before the GIF moves to the next one. #save in a GIF saveGIF(distdiverge.animate(), interval = .65, ,movie.name="dist.gif") ####3. Happy New Year plot Finally, to make the fun "Happy 2015" plot, we just make a black background plot, create new random data at every iteration for the snow, and then iterate over the letters of the sign. Let's start with the first plot for the letter "H". We create some data and the objects that will hold the letters of the sign that scrolls through, the colors (I use the colorspace package to pull some colors for me for this), and the x- and y-coordinates for where we want the letters to go. You can think about randomizing the coordinates too. Then we just plot a scatterplot and use annotate to add the letter to it. #create dataset happy2015<-data.frame(x=rnorm(500, 0, 1.5), y=rnorm(500, 0, 1.5), z=rnorm(500,0,1.5)) #create objects to hold the letters, colors, and x and y coordinates that we will scroll through sign<-c("H","A","P","P","Y","2","0","1","5","!!") colors <- rainbow_hcl(10, c=300) xcoord<-rep(c(-2, -1, 0, 1, 2),2) ycoord<-c(2, 1.7, 2.1, 1.5, 2, -.5, 0, -1, -.8, -.7) We set up the ggplot theme and test the first plot. #set up the theme in an object (get rid of axes, grids, and legend) theme.both<- theme(legend.position="none", panel.background = element_blank(), axis.ticks = element_blank(), axis.line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), plot.background = element_rect(fill = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) #plot the first letter (set index=1 to get the first element of color, letter, and coordinates) index<-1 ggplot(happy2015, aes(x, y, alpha = z, color=z)) + geom_point(alpha=0.2) + labs(title="", x="", y="") + theme.both + scale_colour_gradient(low = "white", high="lightblue")+ annotate("text", x=xcoord[index], y=ycoord[index], size=15, label=sign[index], color=colors[index]) Finally, we again go through the structure of two functions - one to draw a plot based on the "index" we give it as an argument, and one to iterate through all the letters using lapply(). Notice we put the dataframe statement in the first function - this will make the scatterplot different every time, rather than stay static, which is more festive (more or less like falling snow). Again, we save it in a GIF, with a slow interval in order to give time to read it. #set up function to create a new dataset, plot it, and annotate it by an index argument draw.a.plot<- function(index){ #make up a new dataframe happy2015<-data.frame(x=rnorm(500, 0, 1.5), y=rnorm(500, 0, 1.5), z=rnorm(500,0,1.5)) #plot according to the index passed g<-ggplot(happy2015, aes(x, y, alpha = z, color=z)) + geom_point(alpha=0.2) + labs(title="", x="", y="") + theme.both + scale_colour_gradient(low = "white", high="lightblue")+ annotate("text", x=xcoord[index], y=ycoord[index], size=15, label=sign[index], color=colors[index]) #print out the plot print(g)} #set up function to loop through the draw.a.plot() function loop.animate <- function() { lapply(1:length(sign), function(i) { draw.a.plot(i) })} #save the images into a GIF saveGIF(loop.animate(), interval = .5, movie.name="happy2015.gif")

Monday, October 20, 2014

Easy Clustered Standard Errors in R

Public health data can often be hierarchical in nature; for example, individuals are grouped in hospitals which are grouped in counties. When units are not independent, then regular OLS standard errors are biased. One way to correct for this is using clustered standard errors. This post will show you how you can easily put together a function to calculate clustered SEs and get everything else you need, including confidence intervals, F-tests, and linear hypothesis testing. Under standard OLS assumptions, with independent errors, $$V_{OLS} = \sigma^2(X'X)^{-1}$$ We can estimate $\sigma^2$ with $s^2$: $$s^2 = \frac{1}{N-K}\sum_{i=1}^N e_i^2$$ where N is the number of observations, K is the rank (number of variables in the regression), and $e_i$ are the residuals from the regression. But if the errors are not independent because the observations are clustered within groups, then confidence intervals obtained will not have $1-\alpha$ coverage probability. To fix this, we can apply a sandwich estimator, like this: $$V_{Cluster} = (X'X)^{-1} \sum_{j=1}^{n_c} (u_j'*u_j) (X'X)^{-1}$$ where $n_c$ is the total number of clusters and $u_j = \sum_{j_{cluster}}e_i*x_i$. $x_i$ is the row vector of predictors including the constant. Programs like Stata also use a degree of freedom adjustment (small sample size adjustment), like so: $$\frac{M}{M-1}*\frac{N-1}{N-K} * V_{Cluster}$$ where M is the number of clusters, N is the sample size, and K is the rank. In this example, we'll use the Crime dataset from the plm package. It includes yearly data on crime rates in counties across the United States, with some characteristics of those counties. Let's load in the libraries we need and the Crime data: library(plm) library(lmtest) data(Crime) We would like to see the effect of percentage males aged 15-24 (pctymle) on crime rate, adjusting for police per capita (polpc), region, and year. However, there are multiple observations from the same county, so we will cluster by county. In Stata the commands would look like this. reg crmrte pctymle polpc i.region year, cluster(county) In R, we can first run our basic ols model using lm() and save the results in an object called m1. Unfortunately, there's no 'cluster' option in the lm() function. But there are many ways to get the same result #basic linear model with standard variance estimate Crime$region<-factor(Crime$region) m1<-lm(crmrte~pctymle+polpc+region+year, data=Crime) Get the cluster-adjusted variance-covariance matrix First, I'll show how to write a function to obtain clustered standard errors. If you are unsure about how user-written functions work, please see my posts about them, [here](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html) (How to write and debug an R function) and [here](http://rforpublichealth.blogspot.com/2014/07/3-ways-that-functions-can-improve-your.html) (3 ways that functions can improve your R code). There are many sources to help us write a function to calculate clustered SEs. Check out these helpful links: Mahmood Arai's paper [found here](http://people.su.se/~ma/clustering.pdf) and DiffusePrioR's blogpost [found here](http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/). I'll base my function on the first source. It uses functions from the sandwich and the lmtest packages so make sure to install those packages. However, instead of returning the coefficients and standard errors, I am going to modify Arai's function to return the variance-covariance matrix, so I can work with that later. The function will input the lm model object and the cluster vector. After that, I'll do it the super easy way with the new multiwayvcov package which has a cluster.vcov() function. Help on this package [found here](http://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf). The function also needs the model and the cluster as inputs. Here are both ways: #write your own function to return variance covariance matrix under clustered SEs get_CL_vcov<-function(model, cluster){ require(sandwich, quietly = TRUE) require(lmtest, quietly = TRUE) #calculate degree of freedom adjustment M <- length(unique(cluster)) N <- length(cluster) K <- model$rank dfc <- (M/(M-1))*((N-1)/(N-K)) #calculate the uj's uj <- apply(estfun(model),2, function(x) tapply(x, cluster, sum) #use sandwich to get the var-covar matrix vcovCL <- dfc*sandwich(model, meat=crossprod(uj)/N) return(vcovCL)} #call our new function and save the var-cov matrix output in an object m1.vcovCL<-get_CL_vcov(m1, Crime$county) #equivalent way: use the cluster.vcov function to get variance-covariance matrix library(multiwayvcov) m1.vcovCL.2<-cluster.vcov(m1, Crime$county) Now, in order to obtain the coefficients and SEs, we can use the coeftest() function in the lmtest library, which allows us to input our own var-covar matrix. Let's compare our standard OLS SEs to the clustered SEs. #the regular OLS standard errors coeftest(m1) #the clustered standard errors by indicating the correct var-covar matrix coeftest(m1, m1.vcovCL) We can see that the SEs generally increased, due to the clustering. Now, let's obtain the F-statistic and the confidence intervals. Getting the F-statistic and confidence intervals for clustered SEs To obtain the F-statistic, we can use the waldtest() function from the lmtest library with test="F" indicated for the F-test. For the 95% CIs, we can write our own function that takes in the model and the variance-covariance matrix and produces the 95% CIs. You can modify this function to make it better and more versatile, but I'm going to keep it simple. #function to return confidence intervals get_confint<-function(model, vcovCL){ t<-qt(.975, model$df.residual) ct<-coeftest(model, vcovCL) est<-cbind(ct[,1], ct[,1]-t*ct[,2], ct[,1]+t*ct[,2]) colnames(est)<-c("Estimate","LowerCI","UpperCI") return(est)} Now, we can get the F-stat and the confidence intervals: #do a wald test to get F-statistic waldtest(m1, vcov = m1.vcovCL, test = "F") Note that now the F-statistic is calculated based on a Wald test (using the cluster-robustly esimtated var-covar matrix) rather than on sums of squares and degrees of freedom. The degrees of freedom listed here are for the model, but the var-covar matrix has been corrected for the fact that there are only 90 independent observations. If you want to save the F-statistic itself, save the waldtest function call in an object and extract: #save waldtest in an object w<-waldtest(m1, vcov = m1.vcovCL, test = "F") #check out what is saved in w names(w) #use $F to get the F-statistic w$F[2] For confidence intervals, we can use the function we wrote: #obtain the confidence interval using our function get_confint(m1, m1.vcovCL) As an aside, to get the R-squared value, you can extract that from the original model m1, since that won't change if the errors are clustered. Again, remember that the R-squared is calculated via sums of squares, which are technically no longer relevant because of the corrected variance-covariance matrix. But it can still be used as a measure of goodness-of-fit. #r-squared summary(m1)$r.squared One function to do everything Notice, that you could wrap all of these 3 components (F-test, coefficients/SEs, and CIs) in a function that saved them all in a list, for example like this: super.cluster.fun<-function(model, cluster){ require(multiwayvcov) require(lmtest) vcovCL<-cluster.vcov(model, cluster) coef<-coeftest(model, vcovCL) w<-waldtest(model, vcov = vcovCL, test = "F") ci<-get_confint(model, vcovCL) return(list(coef, w, ci))} super.cluster.fun(m1, Crime$county) Then you could extract each component with the [[]] operator. Check out [this post](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html)("Returning a list of objects") if you're unsure. Hypothesis testing when errors are clustered Now what if we wanted to test whether the west region coefficient was different from the central region? Again, we need to incorporate the right var-cov matrix into our calculation. Fortunately the car package has a __linearHypothesis()__ function that allows for specification of a var-covar matrix. The inputs are the model, the var-cov matrix, and the coefficients you want to test. Check out the help file of the function to see the wide range of tests you can do. library(car) linearHypothesis(m1, vcov=m1.vcovCL, "regionwest = regioncentral") Troubleshooting and error messages One reason to opt for the cluster.vcov() function from the multiwayvcov package is that it can handle missing values without any problems. When doing the variance-covariance matrix using the user-written function get_CL_vcov above, an error message can often come up: #error messages get_CL_vcov(m1, Crime$couunty) There are two common reasons for this. One is just that you spelled the name of the cluster variable incorrectly (as above). Make sure to check that. The second is that you have __missing values__ in your outcome or explanatory variables. In this case, the length of the cluster will be different from the length of the outcome or covariates and tapply() will not work. One possible solutions is to remove the missing values by subsetting the cluster to include only those values where the outcome is not missing. Another option is to run na.omit() on the entire dataset to remove all missing vaues. Here's an example: #force in missing value in outcome Crime2<-Crime Crime2$crmrte[1]<-NA #rerun model m2<-lm(crmrte~pctymle+polpc+region+year, data=Crime2) #this produces errors v1<-get_CL_vcov(m2, Crime2$county) #but we can remove the observations in county for which crmrte is missing v2<-get_CL_vcov(m2, Crime2$county[!is.na(Crime2$crmrte)]) #or can use na.omit Crime3<-na.omit(Crime2) m3<-lm(crmrte~pctymle+polpc+region+year, data=Crime3) v3<-get_CL_vcov(m3, Crime3$county) However, if you're running a number of regressions with different covariates, each with a different missing pattern, it may be annoying to create multiple datasets and run na.omit() on them to deal with this. To avoid this, you can use the cluster.vcov() function, which handles missing values within its own function code, so you don't have to. Using plm Finally, you can also use the plm() and vcovHC() functions from the plm package. You still need to do your own small sample size correction though. #use plm function to define formula, dataset, that it's a pooling model, and the cluster variable p1 <- plm(crmrte~pctymle+polpc+region+year, Crime, model='pooling', index=c('county')) #calculate small sample size adjustment G <- length(unique(Crime$county)) N <- length(Crime$county) dfa <- (G/(G - 1)) * (N - 1)/p1$df.residual #use coeftest and the vcovHC functions, specifying HC0 type and identifying cluster as 'group' coeftest(p1, vcov=function(x) dfa*vcovHC(x, cluster="group", type="HC0")) A website that goes further into this function [is here](http://landroni.wordpress.com/2012/06/02/fama-macbeth-and-cluster-robust-by-firm-and-time-standard-errors-in-r/). Update: A reader pointed out to me that another package that can do clustering is the rms package, so definitely [check that out](http://cran.r-project.org/web/packages/rms/rms.pdf) as well.

Monday, July 7, 2014

3 ways that functions can improve your R code

My [previous blog post](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html) went over the basics of the syntax and debugging of user-written functions. In this post, I'll show you examples of useful functions that you can write to make your life easier when using R. Here is the data we'll be using for this post: set.seed(10) bpdata<-data.frame(bp=rnorm(1000,140,20), age=rnorm(1000,50,3), sex=rbinom(1000,1,.5), race=as.factor(c(rep(1,500),rep(2,500))), out=rbinom(1000,1,.8)) bpdata[c(100,200,400),2]<-NA bpdata[c(300),1]<-400 Functions to help you avoid repeating yourself One of the main reasons to use functions is that you can re-use them over and over again, without repeating the same code again and again. Copying and pasting code that you change just a little bit every time leads to errors. With a function, if you decide to change the way you do something, you just change the one function and it will now automatically update your code everywhere you use that function. As an example, in a previous blog post I wrote about [calculating robust standard errors and exporting](http://rforpublichealth.blogspot.com/2013/08/exporting-results-of-linear-regression_24.html) them. The calculation went like this: library(sandwich) cov.fit1 <- vcovHC(fit1, type = "HC") rob.std.err <- sqrt(diag(cov.fit1)) Well, if you are going to be using robust standard errors more than once in your analysis, it makes more sense to make a function. That way you don't need to copy and paste this code and change the object fit1 everytime. To do this, I just take the exact same code, but I change the fit1 to a more general modelfit that is passed in as an argument to a function I call **robust.SE()**. Also, instead of loading the library sandwich, I can just require it for that function (note that you still have to have the package installed for R to load it using require). Then you can pass in the model fit as the argument and return the robust standard errors. Like this: #function to calculate robust standard errors for a model robust.SE<-function(modelfit){ require(sandwich, quietly=TRUE) cov.fit1 <- vcovHC(modelfit, type = "HC") rob.std.err <- sqrt(diag(cov.fit1)) return(rob.std.err)} #get robust SEs for model 1 model1<-lm(out~age + sex, data=bpdata) robust.SE(model1) #get robust SEs for model2 model2<-lm(bp~age+sex, data=bpdata) robust.SE(model2) Instead of copying and pasting the code for robust standard errors twice, I just wrap it up in a function that allows me to evaluate it for any model fit I want to throw at it. Functions to customize your output Another reason to use functions is to make it easier and faster to do the things you always need to do, in the exact way you want. For example, it's important when you start with a new dataset to look at it descriptively and look at whether there are strange values or missing values. There are many ways to do this. You can of course do **summary(data)** which is a good start: #use the basic summary built-in function summary(bpdata) But this doesn't give me all the information I want and it's not in the format I want. If I had 20 variables, which is not out of the ordinary, it would be hard to read this and it takes up too much space in a report. I could also use the describe() function in the psych package, but it doesn't deal well with factor variables. I want to create a table with the variables down the rows and some specific stats going across the columns, namely mean, standard deviation, min, max, and the number missing. I want to make sure to treat factors as dummies instead of numeric variables. Here I create a new function that takes a dataset and produces this kind of table for all the variables within the dataset. I call the function **summarize.vars**. The only argument I pass through is a dataset. I use the package dummies to create dummy variables of all the factors so that I have only numeric variables in my data (it ignores all character variables if I had any). Then I use the **apply()** function to do my summarizing. Check out my previous post on how apply works, [here](http://rforpublichealth.blogspot.com/2012/09/the-infamous-apply-function.html). #function to summarize the variables in the data summarize.vars<-function(data){ #use dummies package to turn all factors into dummies require(dummies, quietly=TRUE) dat.d<-dummy.data.frame(data, dummy.class="factor") #use apply to calculate statistics for each variable mat<-t(apply(dat.d, 2, function(x) c(length(x), round(mean(x, na.rm=TRUE),2), round(sd(x, na.rm=TRUE),2), round(min(x, na.rm=TRUE),2), round(max(x, na.rm=TRUE),2), length(x)-length(x[!is.na(x)])))) #assign column names and rownames to output table colnames(mat)<-c("N","Mean","SD","Min","Max","Num Missing") rownames(mat)<-colnames(dat.d) return(mat)} summarize.vars(bpdata) I can also use this to summarize only part of my data, for example by subsetting my data like so: summarize.vars(bpdata[bpdata$race==1,]) You can think about making this function better. One way would be to have the N for the factors count within the level, since that is more useful information. You can customize it in whatever way makes sense to get the kind of information that is most useful. Let's do another example of a function to summarize information. This time we'll put together a function that runs a linear model, succinctly summarizes the results, and produces some diagnostic plots at the same time. This is useful if you run linear models often. Remember that you can pass anything through a function, including a formula. In this function, I pass in a formula for the model and the dataframe and I don't return any values; instead, I print directly from inside the function. You could change this so that it returned the output and you could store it in an object outside the function. #linear model function lm.model.diagnostics<-function(formula, dataset){ #run model and print specific output model1<-lm(formula=formula, data=dataset) stats<-round(c(summary(model1)$fstatistic[c(1,3)], summary(model1)$sigma, summary(model1)$r.squared, summary(model1)$adj.r.squared),3) names(stats)<-c("F","DF", "Sigma","Rsq","AdjRsq") l1<-list(round(summary(model1)$coefficients,3), stats) names(l1)<-c("Coefficients","Stats") print(l1) #run specific diagnostic tests par(mfrow=c(1,3)) hist(model1$residuals, main="Histogram of residuals", xlab="") plot(model1, 1) plot(model1, 2)} #run function for model of blood pressure on age and sex lm.model.diagnostics(bp~age+sex, bpdata) So we can see that there is a funny outlier in observation 300. That observation has a blood pressure of 400, which we think is an incorrect value. We can see what happens when we take out that outlier: #take out the outlier lm.model.diagnostics(bp~age+sex, bpdata[c(-300),]) Note that if your formula is long and you use it more than once, you can avoid copying and pasting it as well by just saving it in an object like this: form1<-bp~age+sex lm.model.diagnostics(form1, bpdata) Functions to aid in your analysis Another useful way to use functions is for improving your analysis. For example, R summary output of a model fit doesn't provide confidence intervals. It's useful to have a function to calculate the confidence intervals and put them in a nice table. As a bonus, you can make the function versatile so that it can provide nice output of logit or poisson models, by exponentiating the results to get odds ratios or incidence rate ratios, respectively. Notice how in this function I set up the defaults for the parameters that I pass in. Unless I indicate otherwise, exponent=FALSE, alpha=.05, and digits=2. That way the function runs even without specifying those parameters. If I want to change them, I can do so the way I do in the second example. #function to get confidence intervals for glm output, can get exponentiated output for logit or poisson glmCI <- function(glmfit, exponent=FALSE, alpha=0.05, digits=2){ #get SE from model fit se <- sqrt(diag(summary(glmfit)$cov.scaled)) #calculuate CI for linear case mat <- cbind(coef(glmfit), coef(glmfit) - qnorm(1-alpha/2)*se, coef(glmfit) + qnorm(1-alpha/2)*se) colnames(mat) <- c("Beta", "LowerCI", "UpperCI") #if exponent=TRUE, exponeniate the coefficients and CIs if(exponent == TRUE) { mat <- exp(mat) if(summary(glmfit)$family$link=="logit") colnames(mat)[1] <- "OR" if(summary(glmfit)$family$link=="log") colnames(mat)[1] <- "IRR"} #return a rounded matrix of results return(round(mat, digits=digits))} #1. use glm with identity link on continuous response data (default family is gaussian) g.glm<-glm(bp~age+sex, data=bpdata) glmCI(g.glm) #2. use glm with logit link on binary response data b.glm<-glm(out~age+sex+bp, family=binomial(link="logit"), data=bpdata) glmCI(b.glm, exponent=TRUE, digits=3) There are tons of other possibilities for functions, but hopefully this post has convinced you that functions can improve your coding by reducing repetition, increasing customizability, and improving your analysis and reports.

Sunday, June 8, 2014

How to write and debug an R function

I've been asked on a few occasions what is the deal with R user-written functions. First of all, how does the syntax work? And second of all, why would you ever want to do this? In Stata, we don't write functions; we execute built-in commands like **browse** or **gen** or **logit**. You can write your own programs that create new commands (like ado files) but it's less common for users to do so. In R, there are built-in functions like **summary()** or **glm()** or **median()**, but you can also write your own functions. You can write a quick, one-line function or long elaborate functions. I use functions all the time to make my code cleaner and less repetitive. In this post I'll go over the basics of how to write functions. In the next post, I'll explain what kinds of functions I have used commonly in public health research that have improved my data cleaning and analyses. Basic syntax of a function A function needs to have a name, probably at least one argument (although it doesn't have to), and a body of code that does something. At the end it usually should (although doesn't have to) return an object out of the function. The important idea behind functions is that objects that are created within the function are local to the environment of the function - they don't exist outside of the function. But you can "return" the value of the object from the function, meaning pass the value of it into the global environment. I'll go over this in more detail. Functions need to have curly braces around the statements, like so: name.of.function <- function(argument1, argument2){ statements return(something) } The argument can be any type of object (like a scalar, a matrix, a dataframe, a vector, a logical, etc), and it's not necessary to define what it is in any way. As a very simple example, we can write a function that squares an incoming argument. The function below takes the argument x and multiplies it by itself. It saves this value into the object called square, and then it returns the value of the object square. Writing and calling a function square.it<-function(x){ square<-x*x return(square) } I can now call the function by passing in a scalar or a vector or matrix as its argument, because all of those objects will square nicely. But it won't work if I input a character as its argument because although it will pass "hi" into the function, R can't multiply "hi". #square a number square.it(5) #square a vector square.it(c(1,4,2)) #square a character (not going to happen) square.it("hi") I can also pass in an object that I already have saved. For example, here I have a matrix called matrix1, so I pass that into the **square.it() function**. R takes this matrix1 into the function as x. That is, in the local function environment it is now called x, where it is squared, and returned. matrix1<-cbind(c(3,10),c(4,5)) square.it(matrix1) Local vs global environment Now, it's not necessarily the case that you must use **return()** at the end of your function. The reason you return an object is if you've saved the value of your statements into an object inside the function - in this case, the objects in the function are in a local environment and won't appear in your global environment. See how it works in the following two examples: fun1<-function(x){ 3*x-1 } fun1(5) fun2<-function(x){ y <- 3*x-1 } fun2(5) In the first function, I just evaluate the statement 3*x-1 without saving it anywhere inside the function. So when I run fun1(5), the result comes popping out of the function. However, when I call fun2(5), nothing happens. That's because the object y that I saved my result into doesn't exist outside the function and I haven't used __return(y)__ to pass the value of y outside the function. When I try to print y, it doesn't exist because it was created in the local environment of the function. print(y) I can return the *value* of y using the **return(y)** at the end of the function fun2, but I can't return the object itself; it's stuck inside the function. Getting more complex Obviously writing a whole function to square something when you could just use the ^ operator is silly. But you can do much more complicated things in functions, once you get the hang of them. Calling other functions and passing multiple arguments First, you can pass multiple arguments into a function and you can call other functions within your function. Here's an example. I'm passing in 3 arguments which I want to be a matrix, a vector, and a scalar. In the function body, I first call my previous function **square.it()** and use it to square the scalar. Then I multiply the matrix by the vector. Then I multiply those two results together and return the final object. my.fun <- function(X.matrix, y.vec, z.scalar){ #use my previous function square.it() to square the scalar and save result sq.scalar<-square.it(z.scalar) #multiply the matrix by the vector using %*% operator mult<-X.matrix%*%y.vec #multiply the two resulting objects together to get a final object final<-mult*sq.scalar #return the result return(final) } When you have multiple arguments in a function that you call, R will just evaluate them in order of how you've written the function (the first argument will correspond to X.matrix, the second y.vec, and so on), but for clarity I would name the arguments in the function call. In this example below, I already have two saved objects, my.mat and my.vec that I pass through as the X.matrix and y.vec arguments, and then I just assign the z.scalar argument the number 9. #save a matrix and a vector object my.mat<-cbind(c(1,3,4),c(5,4,3)) my.vec<-c(4,3) #pass my.mat and my.vec into the my.fun function my.fun(X.matrix=my.mat, y.vec=my.vec, z.scalar=9) #this is the same as my.fun(my.mat, my.vec, 9) Returning a list of objects Also, if you need to return multiple objects from a function, you can use **list()** to list them together. An example of this is my [blog post on sample size functions.](http://rforpublichealth.blogspot.com/2013/06/sample-size-calculations-equivalent-to.html) For example: another.fun<-function(sq.matrix, vector){ #transpose matrix and square the vector step1<-t(sq.matrix) step2<-vector*vector #save both results in a list and return final<-list(step1, step2) return(final) } #call the function and save result in object called outcome outcome<-another.fun(sq.matrix=cbind(c(1,2),c(3,4)), vector=c(2,3)) #print the outcome list print(outcome) Now to separate those objects for use in your further code, you can extract them using the [[]] operator: ###extract first in list outcome[[1]] ##extract second in list outcome[[2]] Tricks for troubleshooting and debugging When you execute multiple statements in a function, sometimes things go wrong. What's nice about functions is that R evalutes every statement until it reaches an error. So in the last function, the dimensions of the objects really matter. You can't multiply matrices of incomptabile dimensions. Like this: my.fun(X.matrix=my.mat, y.vec=c(2,3,6,4,1), z.scalar=9) Using the Debug() function When you have an error, one thing you can do is use R's built-in debugger **debug()** to find at what point the error occurs. You indicate which function you want to debug, then run your statement calling the function, and R shows you at what point the function stops because of errors: debug(my.fun) my.fun(X.matrix=my.mat, y.vec=c(2,3,6,4,1), z.scalar=9) We see that the first line calling the **square.it()** function was fine, but then an error occurred in the line defining mult. This debugging is useful especially if you had many more statements in your function that multiplied matrices and you weren't sure which one was causing the issues. So now we know the problem is that X.matrix and y.vec won't multiply. But we still need to know why they won't multiply. More on debugging [can be found here.](http://www.stats.uwo.ca/faculty/murdoch/software/debuggingR/debug.shtml) Printing out what's happening (sanity checks) At this point, a good way to troubleshoot this is to print out the dimensions or lengths of the objects or even the objects themselves that are going into the statement causing errors. The great part about functions is that they evaluate all the way until there's an error. So you can see what is happening inside your function before the error. If the object is too long, you can **print(head(object))**. This helps to see if you're doing what you think you're doing. Note that you have to use the function **print()** to actually print out anything from within a function. my.fun <- function(X.matrix, y.vec, z.scalar){ print("xmatrix") print(X.matrix) print("yvec") print(y.vec) print("Dimensions") print(dim(X.matrix)) print(length(y.vec)) #use my previous function square.it() to square the scalar and save result sq.scalar<-square.it(z.scalar) print(paste("sq.scalar=", sq.scalar)) #multiply the matrix by the vector using %*% operator mult<-X.matrix%*%y.vec #multiply the two resulting objects together to get a final object final<-mult*sq.scalar #return the result return(final) } my.fun(X.matrix=my.mat, y.vec=c(2,3,6,4,1), z.scalar=9) Now we can see the actual dimensions of the objects and fix them accordingly. This example is really simple, but you can imagine that if you've written a long function that uses many arguments, you could easily lose track of them and not be sure where the issue in your function was. You can also throw in these statements along the way as sanity checks to make sure that things are proceeding as you think they should, even if there isn't any error. Using the stop() and stopifnot() functions to write your own error messages One other trick you can use is writing your own error messages using the **stop()** and **stopifnot()** functions. In this example, if I know I need dimensions to be the right size, I can check them and print out a message that says they are incorrect. That way I know what the issue is immediately. Here's an example: my.second.fun<-function(matrix, vector){ if(dim(matrix)[2]!=length(vector)){ stop("Can't multiply matrix%*%vector because the dimensions are wrong") } product<-matrix%*%vector return(product) } #function works when dimensions are right my.second.fun(my.mat, c(6,5)) #function call triggered error my.second.fun(my.mat, c(6,5,7)) You can do these kinds of error messages for yourself as checks so you know exactly what triggered the error. You can think about putting in a check for if the value of an object is 0 if you are dividing by it as another example. Good function writing practices Based on my experience, there are a few good practices that I would recommend keeping in mind when writing function. 1. Keep your functions short. Remember you can use them to call other functions! * If things start to get very long, you can probably split up your function into more manageable chunks that call other functions. This makes your code cleaner and easily testable. * It also makes your code easy to update. You only have to change one function and every other function that uses that function will also be automatically updated. 2. Put in comments on what are the inputs to the function, what the function does, and what is the output. 3. Check for errors along the way. * Try out your function with simple examples to make sure it's working properly * Use debugging and error messages, as well as sanity checks as you build your function. The next post will go over examples of useful functions that you can use in your day to day R coding.