A few days ago I asked this question on Stack Overflow. The answer was awesome, but it left me somewhat confused about what exactly was happening. And thus the idea for this post was born.
Summarizing data across a dimension happens a lot. And while it mostly happens with simple functions like mean, median, standard deviation, min, max, etc., occasionally you want to summarize using a more complex operation. Enter dplyr in all its glory.
Simple summarizing of data with `dplyr` looks like this:
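A minimal sketch of that pattern, assuming a made-up data frame `weather` with `week` and `temp` columns (the data and column names are illustrative, not from a real data set):

```r
library(dplyr)

# Hypothetical data: four temperature readings in each of two weeks
weather <- data.frame(
  week = c(1, 1, 1, 1, 2, 2, 2, 2),
  temp = c(10, 12, 14, 16, 20, 20, 20, 20)
)

# One row per week, holding the standard deviation of that week's readings
weekly_sd <- weather %>%
  group_by(week) %>%
  summarize(temp_sd = sd(temp))

print(weekly_sd)
```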
`group_by()` groups your data by week and `summarize()` applies `sd()` to each group.
`%>%` is a pipe operator that passes the output of the previous function to the next function as its first argument. You can have lots of fun reading about it and dplyr in this nice post.
So, that’s great and all. But you can extend this to apply to any function, not just built-in summaries like `sd()`:
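As a sketch, here is a made-up helper `scaledRange()` whose second argument `b` is a divisor. Both names are hypothetical and only serve to show a custom function with a tunable argument inside `summarize()`:

```r
library(dplyr)

# Hypothetical custom summary: the range of x divided by a constant b
scaledRange <- function(x, b = 2) {
  (max(x) - min(x)) / b
}

# Made-up readings for two weeks
weather <- data.frame(
  week = c(1, 1, 1, 2, 2, 2),
  temp = c(10, 14, 18, 20, 21, 26)
)

# The custom function slots into summarize() just like sd() would
weekly_range <- weather %>%
  group_by(week) %>%
  summarize(half_range = scaledRange(temp, b = 2))

print(weekly_range)
```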
But what if you want to vary `b` in the example above based on some other data? In effect, what if you want to wrap summarize in another function to allow for different argument values? This is not a theoretical question. I needed to do this for a Shiny app I’m working on. In it we want to display “Growing Degree Days”, which are a measure of how much of the heat in a day is available to crops. It’s calculated like this:
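A minimal sketch of one common averaging approach, assuming a vector of temperature readings for a single day. Only the name `calcGDD()` comes from this post; the argument names `base_temp` and `upper_temp`, their default values, and the Celsius units are assumptions for illustration:

```r
# Hypothetical GDD helper; thresholds are assumed defaults in degrees Celsius
calcGDD <- function(temp, base_temp = 10, upper_temp = 30) {
  # Cap the day's extremes at the thresholds before averaging
  t_max <- min(max(temp), upper_temp)
  t_min <- max(min(temp), base_temp)
  # Average the capped extremes, subtract the base, and never go below zero
  max((t_max + t_min) / 2 - base_temp, 0)
}

day_temps <- c(8, 12, 19, 26, 22, 14)  # made-up readings for one day
gdd_today <- calcGDD(day_temps)
print(gdd_today)
```

Days that never warm past the base threshold come out as zero rather than negative, which is the usual convention for degree-day totals.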
GDD can be calculated using the defaults in the function above, but it can also be tuned to a specific crop, for which different values may apply. Apples, for example, would be:
We want to wrap `calcGDD()` in a summarize statement, but make it flexible enough to specify the variables to group and summarize by. For a built-in function like `mean` that’s relatively easy, and looks like this:
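Here is a sketch of such a wrapper. One caveat: the underscore verbs discussed in this post were deprecated in later dplyr releases, so this version swaps in `as.name()` plus the `!!` operator, which should behave the same way for string column names. The name `dynamicSummarize()` comes from this post; its argument names and the data are assumptions:

```r
library(dplyr)

# Column names arrive as strings; as.name() turns each into a symbol and
# !! splices that symbol into the verb.
# (In the underscore style this is roughly group_by_(data, group_var).)
dynamicSummarize <- function(data, group_var, summary_var) {
  data %>%
    group_by(!!as.name(group_var)) %>%
    summarize(mean_value = mean(!!as.name(summary_var)))
}

# Made-up data: two readings on each of two days
weather <- data.frame(
  day  = c(1, 1, 2, 2),
  temp = c(10, 20, 30, 50)
)

daily_mean <- dynamicSummarize(weather, group_var = "day", summary_var = "temp")
print(daily_mean)
```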
Notice that `group_by` is now `group_by_` and `summarize` is now `summarize_`. The trailing `_` on many (all?) dplyr functions marks the standard-evaluation (SE) counterpart of a verb that normally uses non-standard evaluation (NSE). Basically, the arguments are evaluated to their values before the verb sees them, rather than being captured as typed. So here `group_var` is first evaluated to its value `"day"`, which is then passed to the group_by function. Wickham has a great vignette you can access in R with `vignette("nse")`, or you can read about it in Advanced R.
With the dynamic function, if you had multiple data sets for which you wanted to calculate the mean over a group, instead of grouping and summarizing each one, you could just make a series of calls to `dynamicSummarize()`. A custom function works the same way:
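A sketch of the same wrapper around a custom function, again using `!!` in place of the underscore verbs. The `calcGDD()` body, the wrapper name `dynamicGDD()`, and the data are all assumptions made for illustration:

```r
library(dplyr)

# Hypothetical GDD helper (assumed thresholds, degrees Celsius)
calcGDD <- function(temp, base_temp = 10, upper_temp = 30) {
  t_max <- min(max(temp), upper_temp)
  t_min <- max(min(temp), base_temp)
  max((t_max + t_min) / 2 - base_temp, 0)
}

# Same shape as the mean() wrapper, with calcGDD() and its defaults inside
dynamicGDD <- function(data, group_var, temp_var) {
  data %>%
    group_by(!!as.name(group_var)) %>%
    summarize(gdd = calcGDD(!!as.name(temp_var)))
}

# Made-up readings: three per day
weather <- data.frame(
  day  = c(1, 1, 1, 2, 2, 2),
  temp = c(8, 19, 26, 12, 15, 20)
)

daily_gdd <- dynamicGDD(weather, "day", "temp")
print(daily_gdd)
```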
### The Hard Part
Suppose we sometimes want to replace some of the default values of `calcGDD()`. Suppose our crop data includes values for all the defaults for some crops, but only one or two for others. We need to pass a flexible list of additional arguments through to `calcGDD()`.
Unfortunately (as so often happens), the first thing I did didn’t work:
Of course it didn’t, because `...` is not evaluated correctly. And none of these lines work to replace it:
Welcome to the wonderful world of… something. Anyway, as you can imagine, I spent quite a while trying to figure out how to get this working, which eventually led to my swearing off programming, deciding to be a hobo, and finally posting the question I linked to at the beginning of this post on Stack Overflow.
The solution was less convoluted than I had imagined, and ends up using `call`:
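The following is a sketch of the idea rather than the exact code from the answer. It builds the unevaluated call with `as.call()`, a close relative of `call()` that takes its pieces as a list, and splices the result into `summarize()` with `!!` instead of the underscore verbs. The `calcGDD()` body, the wrapper name `dynamicGDD()`, and the data are assumptions:

```r
library(dplyr)

# Hypothetical GDD helper (assumed thresholds, degrees Celsius)
calcGDD <- function(temp, base_temp = 10, upper_temp = 30) {
  t_max <- min(max(temp), upper_temp)
  t_min <- max(min(temp), base_temp)
  max((t_max + t_min) / 2 - base_temp, 0)
}

dynamicGDD <- function(data, group_var, temp_var, ...) {
  # Build calcGDD(<temp column>, <whatever extras were supplied>) without running it
  gdd_call <- as.call(c(
    list(as.name("calcGDD"), as.name(temp_var)),
    list(...)
  ))
  data %>%
    group_by(!!as.name(group_var)) %>%
    summarize(gdd = !!gdd_call)  # the call is evaluated once per group
}

weather <- data.frame(
  day  = c(1, 1, 1, 2, 2, 2),
  temp = c(8, 19, 26, 12, 15, 20)
)

daily_default <- dynamicGDD(weather, "day", "temp")                 # all defaults
daily_tuned   <- dynamicGDD(weather, "day", "temp", base_temp = 6)  # override one
```

Any subset of the optional arguments can be supplied; the ones left out fall back to the defaults in `calcGDD()`.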
But why does it work? Well, `call` returns an unevaluated function call: the function named in the call is not run, but the arguments passed to `call` are evaluated before they are stored in it. For example:
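A small base R sketch of that behaviour (the variable names are made up):

```r
x <- 1:10

# x is evaluated right now and its value is stored in the call; mean() is not run
cl <- call("mean", x)

# Quoting x keeps it as a symbol, to be looked up only when the call is evaluated
lazy_cl <- call("mean", quote(x))

x <- c(100, 200)  # change x after both calls were built

frozen_result <- eval(cl)       # uses the values captured when cl was built
lazy_result   <- eval(lazy_cl)  # uses the current value of x
```

After the reassignment, `cl` still averages the original `1:10`, while `lazy_cl` picks up the new `x`.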
A simple test function reveals more:
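One way to write such a probe, as a sketch: the hypothetical helper `buildCall()` returns the constructed call instead of running it, so you can inspect what would be handed to summarize. `calcGDD` does not even need to exist for this to work:

```r
# Hypothetical probe: build the call and hand it back for inspection
buildCall <- function(temp_var, ...) {
  as.call(c(list(as.name("calcGDD"), as.name(temp_var)), list(...)))
}

with_name <- buildCall("temp", base_temp = 6)
deparse(with_name)  # "calcGDD(temp, base_temp = 6)"

# Without as.name() the column name stays a string, so calcGDD() would
# receive the text "temp" rather than the temp column
with_string <- as.call(list(as.name("calcGDD"), "temp"))
deparse(with_string)
```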
So by converting `temp_var` to a name and quoting it, the constructed call becomes `calcGDD(temp, <list of optional variables>)`. The name and the optional values then get evaluated by `lazy_` in the summarize call. To annotate each line:
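As a stand-in that avoids package-version details, the same sequence can be traced step by step in base R, with `eval()` playing the part of the summarize step. The `calcGDD()` body, the option list, and the data are assumptions for illustration:

```r
# Hypothetical GDD helper (assumed thresholds, degrees Celsius)
calcGDD <- function(temp, base_temp = 10, upper_temp = 30) {
  t_max <- min(max(temp), upper_temp)
  t_min <- max(min(temp), base_temp)
  max((t_max + t_min) / 2 - base_temp, 0)
}

weather  <- data.frame(temp = c(8, 19, 26))  # made-up readings for one day
temp_var <- "temp"                           # column name supplied as a string
opts     <- list(base_temp = 6)              # optional overrides, possibly empty

# Step 1: turn the string into a symbol, so it refers to a column, not to text
col_sym <- as.name(temp_var)

# Step 2: assemble calcGDD(temp, base_temp = 6); nothing has been computed yet
gdd_call <- as.call(c(list(as.name("calcGDD"), col_sym), opts))

# Step 3: evaluate with the data frame as the lookup scope, so temp resolves
# to the column while calcGDD is found in the surrounding environment
gdd <- eval(gdd_call, weather)
print(gdd)
```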
If you’re here looking for help, I hope you found it. I know it definitely helped me to write this.