Filling in empty values in a time series in R

In this previous post, I had shown how to use Python to fill in zeroes in a time series. I thought I’d use this post to show how to do the same in R.

By way of review, I track my bike rides, runs and walks in an Android app called KeepTrack and I wanted to graph a YTD cumulative total. One problem I encountered is that if I don’t ride on a day, I don’t enter zero for the day; rather I just do no entry for that day. So if I skipped the 19th and 22rd of July, a sorted data snippet might look like:

07-18-2014,30
07-20-2014,20
07-21-2014,23
07-23-2014,20

When I want a cumulative mileage graph, I want to show the mileage as it grows for ALL days, not just the days I have ridden. That is, I want the above snippet to look like

07-18-2014,30
07-19-2014,0
07-20-2014,20
07-21-2014,23
07-22-2014,0
07-23-2014,20

As in Python, what I wanted to accomplish in R would be essentially a left outer join in SQL. I wrote a function to do this that I will show in its entirety at the end of the post. But first I’ll go through step by step.

Let’s say I’ve read in the existing dates into a dataframe called exer.df (recall this has date “holes” in it as the first listing of dates above). Now, we’ll form another dataframe which contains all dates for the entire year (note that we are assuming we are doing a whole year here; in the Python post, the year wasn’t over yet)

  seqdays<-seq(as.Date("2014/1/1"), as.Date("2014/12/31"), "days")
  seqdays.df<-as.data.frame(seqdays)
  names(seqdays.df)<-c('exerdt')

Now we can use the R merge command to join this list of all dates to our partial list of dates (with actual mileage) in exer.df. Note that the second command assigns a 0 to any merged row which did not have a row in exer.df. This is essentially our outer join in SQL terms.

mrg.exer.df <- merge(seqdays.df,exer.df,all.x=TRUE, by="exerdt")
mrg.exer.df[is.na(mrg.exer.df)]<-0

Our final step is then to perform a cumulative sum operation so we have a running sum throughout the year

mrg.exer.df$cummiles<-cumsum(mrg.exer.df$miles)

Here is the entire function in R. The function takes an xtype, which is an exercise type, basically to distinguish among file names. xyr does the same for exercise year (so I expect a file like walk_2014.csv).

function(xtype,xyr){
    library(ggplot2)
    library(Hmisc)
    fname=paste(paste('trk',xyr,xtype,sep='_'),'csv',sep='.')
    print(fname)
    exer.df<-read.csv(fname,header=FALSE)
    exer.df$V1<-as.Date(exer.df$V1,format="%m/%d/%Y")
    seqdays<-seq(as.Date("2014/1/1"), as.Date("2014/12/31"), "days")
    seqdays.df<-as.data.frame(seqdays)
    names(seqdays.df)<-c('exerdt')
    names(exer.df)<-c('exerdt','milesfact','miles')
    mrg.exer.df
}

plot_2014_walk

0 Comments on “Filling in empty values in a time series in R