Finding a correspondence between time-series elements

by danslimmon   Last Updated February 12, 2019 10:19 AM

My problem deals in particular with time-series data about server performance, but the solution is sure to be applicable to many types of data sets. Pardon me if the answer is well-known; I don't know the right vocabulary to effectively search for these things.

Suppose I have two sets of time-series observations (let's call them A and B) containing thousands to millions of observations each. For example,

A: [ (1367678860, 30), (1367678870, 40), (1367678880, 33), ... ]
B: [ (1367678863, 0.18), (1367678873, 0.11), (1367678884, 0.12), ... ]

The data in A and B are, in general, collected by different processes with different sampling rates. Furthermore, any number of data points could be straight-up missing from either series.

I want to parameterize this data by time with a one-to-one correspondence. That is, I need an algorithm that will find a sequence of ordered pairs (a_i, b_i) whose corresponding times are close to each other. For example, the data above would be converted to:

[ (30, 0.18), (40, 0.11), (33, 0.12), ... ]

As I said, it needs to work on data sets that don't necessarily share a sampling rate, and for which arbitrary points may be missing. Points that don't match up nicely (whatever that means) with the other time series should be discarded.

I've been banging my head against this for a while, trying to adapt the least-squares linear regression method to the problem, but so far I haven't come up with anything good. Stats nerds: what do you suggest?



Answers 2


cross correlation? I don't know what language you're working in, but if it's R, try ccf(x, y, na.action=na.exclude). Make sure that the missing time points have NA s in there (instead of just making the series shorter by deleting missing or bad points, keep it the same length and use the NA).

ccf will show you the correlation between the two variables at different lags of x relative to y (where they give you x[t+lag] relative to y[t]). Look for the lag where the acf is highest, e.g.

EDIT:

#Simple case: lagged series, some missing obs, but mostly overlapping obs
x <- rnorm(1000)
y <- c(rep(NA,3), head(x, -3)*1.3)
ccf(x,y, na.action=na.exclude)

#More complicated: some missing obs, y sampled at half frequency
x <- rnorm(1000)
y <- c(rep(NA,3), head(x, -3)*1.3)
m <- sample(1:1000, size=200)
x[m] <- NA
y[m] <- NA
ccf(x,y, na.action=na.exclude)

#More complicated, more revealing: it gets the lag wrong when fewer obs
x <- rnorm(1000)
y <- c(rep(NA,3), head(x, -3)*1.3)
m <- sample(1:1000, size=500)
x[m] <- NA
y[m] <- NA
ccf(x,y, na.action=na.exclude)


#Most complicated: mild missingness, different sampling rates (but rates are multiples)
x <- rnorm(1000)
y <- c(rep(NA,3), head(x, -3)*10)
mx <- (1:1000)%%1 == 0 & rbinom(n=1000, size=1, prob=0.01)==0 #test for multiple of 1 [basline sampling rate] & randomly introduce missingness
my <- (1:1000)%%2 == 0 & rbinom(n=1000, size=1, prob=0.01)==0 #test for multiple of 2 [every other sample] & randomly introduce missingness
#Note that for every other time y is observed, x is also observed, until you account for missingness, and then sometimes that won't hold true.
x[!mx] <- NA
y[!my] <- NA
ccf(x,y, na.action=na.exclude) #this doesn't do so well, I think it's because neither time series is autocorrelated. Maybe if I used a more realistic time series it'd work, but I think that given glenb's criticism, I should stick to the conservative case


#Terribly hopeless: there are no overlapping observations
x <- rnorm(1000)
y <- c(rep(NA,3), head(x, -3)*1.3)
m <- (1:1000)%%2 == 0
x[m] <- NA
y[!m] <- NA
ccf(x,y, na.action=na.exclude)
rbatt
rbatt
May 05, 2013 12:04 PM

Maybe I misunderstood your question, but I think you want Dynamic time warping (DTW).

DTW allows you to match similar or equivalent segments in time series that have different sampling rate. Therefore, DTW is the preferred alternative to good old euclidean distance.

The following image shows you the intuition behind how DTW works. You can see it is extremely flexible and fortunately and there are really fast implementations available.

enter image description here

Robert Smith
Robert Smith
October 03, 2013 03:55 AM

Related Questions


Ways to stabilize OLS betas

Updated August 12, 2015 19:08 PM



GARCH vs ARCH models - which is more parsimonious?

Updated August 31, 2017 18:19 PM

How can I write an asymmetric-BEKK(1,1,1) model

Updated March 11, 2019 12:19 PM