My problem deals in particular with time-series data about server performance, but the solution is sure to be applicable to many types of data sets. Pardon me if the answer is well-known; I don't know the right vocabulary to effectively search for these things.
Suppose I have two sets of time-series observations (let's call them A and B) containing thousands to millions of observations each. For example,
A: [ (1367678860, 30), (1367678870, 40), (1367678880, 33), ... ]
B: [ (1367678863, 0.18), (1367678873, 0.11), (1367678884, 0.12), ... ]
The data in A and B are, in general, collected by different processes with different sampling rates. Furthermore, any number of data points could be straight-up missing from either series.
I want to parameterize this data by time with a one-to-one correspondence. That is, I need an algorithm that will find a sequence of ordered pairs
(a_i, b_i) whose corresponding times are close to each other. For example, the data above would be converted to:
[ (30, 0.18), (40, 0.11), (33, 0.12), ... ]
As I said, it needs to work on data sets that don't necessarily share a sampling rate, and for which arbitrary points may be missing. Points that don't match up nicely (whatever that means) with the other time series should be discarded.
I've been banging my head against this for a while, trying to adapt the least-squares linear regression method to the problem, but so far I haven't come up with anything good. Stats nerds: what do you suggest?
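For concreteness, here's a sketch of the kind of pairing I mean, in Python. The 5-second tolerance is arbitrary (picking it well is part of my question), and this greedy version can reuse a point from B for several points of A, so it's not truly one-to-one:

```python
def pair_series(a, b, tolerance=5):
    """Pair (time, value) observations from a and b whose times are within
    `tolerance` seconds of each other; unmatched points are discarded.
    Both inputs are assumed sorted by time. Greedy: a B point may be
    matched more than once, so this is only a sketch of the goal."""
    pairs = []
    j = 0
    for t_a, v_a in a:
        # advance j while the next B observation is at least as close to t_a
        while j + 1 < len(b) and abs(b[j + 1][0] - t_a) <= abs(b[j][0] - t_a):
            j += 1
        if j < len(b) and abs(b[j][0] - t_a) <= tolerance:
            pairs.append((v_a, b[j][1]))
    return pairs

A = [(1367678860, 30), (1367678870, 40), (1367678880, 33)]
B = [(1367678863, 0.18), (1367678873, 0.11), (1367678884, 0.12)]
print(pair_series(A, B))  # [(30, 0.18), (40, 0.11), (33, 0.12)]
```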
cross correlation? I don't know what language you're working in, but if it's R, try
ccf(x, y, na.action=na.exclude). Make sure the missing time points have NAs in the series (instead of making a series shorter by deleting missing or bad points, keep it the same length and use NA).
ccf will show you the correlation between the two series at different lags of x relative to y (it reports the correlation of x[t+lag] with y[t]). Look for the lag where the correlation is highest, e.g.
# Simple case: lagged series, some missing obs, but mostly overlapping obs
x <- rnorm(1000)
y <- c(rep(NA, 3), head(x, -3) * 1.3)
ccf(x, y, na.action = na.exclude)

# More complicated: some missing obs, y sampled at half frequency
x <- rnorm(1000)
y <- c(rep(NA, 3), head(x, -3) * 1.3)
m <- sample(1:1000, size = 200)
x[m] <- NA
y[m] <- NA
ccf(x, y, na.action = na.exclude)

# More complicated, more revealing: it gets the lag wrong when there are fewer obs
x <- rnorm(1000)
y <- c(rep(NA, 3), head(x, -3) * 1.3)
m <- sample(1:1000, size = 500)
x[m] <- NA
y[m] <- NA
ccf(x, y, na.action = na.exclude)

# Most complicated: mild missingness, different sampling rates (but rates are multiples)
x <- rnorm(1000)
y <- c(rep(NA, 3), head(x, -3) * 10)
mx <- (1:1000) %% 1 == 0 & rbinom(n = 1000, size = 1, prob = 0.01) == 0  # multiple of 1 [baseline sampling rate] & randomly introduce missingness
my <- (1:1000) %% 2 == 0 & rbinom(n = 1000, size = 1, prob = 0.01) == 0  # multiple of 2 [every other sample] & randomly introduce missingness
# Note that every other time y is observed, x is also observed -- until you
# account for missingness, and then sometimes that won't hold true.
x[!mx] <- NA
y[!my] <- NA
ccf(x, y, na.action = na.exclude)
# This doesn't do so well; I think it's because neither time series is
# autocorrelated. Maybe a more realistic time series would work, but given
# glenb's criticism I should stick to the conservative case.

# Terribly hopeless: there are no overlapping observations
x <- rnorm(1000)
y <- c(rep(NA, 3), head(x, -3) * 1.3)
m <- (1:1000) %% 2 == 0
x[m] <- NA
y[!m] <- NA
ccf(x, y, na.action = na.exclude)
Maybe I misunderstood your question, but I think you want Dynamic time warping (DTW).
DTW allows you to match similar or equivalent segments in time series that have different sampling rates. That makes DTW the preferred alternative to good old Euclidean distance here.
The following image shows you the intuition behind how DTW works. You can see it is extremely flexible, and fortunately there are really fast implementations available.
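In case it helps the intuition, the classic dynamic-programming recurrence behind DTW is only a few lines. This is a minimal Python sketch, not one of the optimized implementations (which add windowing and lower bounds), and the example series are made up:

```python
from math import inf

def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) DTW with absolute difference as the
    local cost. Returns the total cost of the cheapest warping path."""
    n, m = len(x), len(y)
    # D[i][j] = cost of the best warping path aligning x[:i] with y[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # a step may repeat a point in either series -- the "warping"
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A triangle wave and a coarser sampling of the same shape still align cheaply:
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 2, 2, 0]
print(dtw_distance(x, y))  # 3.0
```

Because the recurrence lets a path stay on the same element of one series while advancing through the other, the coarsely sampled series "stretches" to match the finer one, which is exactly why DTW tolerates different sampling rates.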