Removal of correlated variables and substitution of variables in scikit-learn pipeline

by eizanendoso   Last Updated October 17, 2018 13:19 PM

I'm working on a dataset where I have generated features by performing transformations on count-based features. I'm trying to solve a problem where I include:

  • the removal of correlated variables ('the correlation function')
  • substitution of variables with easier-to-interpret transformations (binary, ratios, log) of the same variable, e.g. replacing a count-based feature with a binary transformation of that feature ('the substitution function')

in a scikit-learn pipeline in order to find the best parameters (in a subsequent GridSearch), as I perform these operations after a Lasso for feature selection.

I've looked at the code here, here, here and here. I would like to include the correlation and substitution functions as part of my pipeline using a FunctionTransformer.

It's easy enough for the correlation function (return all columns that are not correlated, or only one of each group of correlated variables), but I am struggling with the substitution function. I'm not sure how to get hold of the names of the features selected by the Lasso in order to perform the substitution. My code currently looks like this:

Pipeline:

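from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, RidgeClassifier

pipeline = Pipeline([
        ('standard', StandardScaler()),
        ('lasso', SelectFromModel(Lasso(fit_intercept=False, normalize=False, random_state=state), prefit=False)),
        # FunctionTransformer expects the function itself, not the result of calling it
        ('substitution', FunctionTransformer(substitute)),
        ('correlation removal', FunctionTransformer(correlation)),
        ('ridge', RidgeClassifier(fit_intercept=False, random_state=state))
        ])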
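For reference, here is a minimal sketch of how FunctionTransformer wraps a function. It takes the callable itself, and fixed extra arguments can be supplied through kw_args. The helper drop_columns and the toy array are made up for illustration and are not part of my actual code.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def drop_columns(X, columns=()):
    """Hypothetical stateless step: remove the given column positions."""
    return np.delete(np.asarray(X), list(columns), axis=1)


# Pass the function object; kw_args holds fixed keyword arguments for it
step = FunctionTransformer(drop_columns, kw_args={'columns': [1]})

X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
out = step.fit_transform(X_demo)
print(out)
```

The limitation is that kw_args is fixed when the pipeline is built, so a function wrapped this way cannot see anything that an earlier step learned during fit, such as which features the Lasso kept.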

Substitution Function (needs refactoring):

def substitute(info, X):
    lst = [ 'reg', 'bool', 'ratio', 'log' ]
    new_df = []
    del_df = []

    correlations = X.corr()

    for item in info:

        # Transformation prefix of this feature name ('' if none matches)
        word = next((p for p in lst if item.startswith(p)), '')
        base = item[len(word):]

        reg = 'reg' + base
        b = 'bool' + base
        r = 'ratio' + base
        l = 'log' + base

        if item.startswith('bool'):
            pass
        if item.startswith('reg'):
            if round(correlations[item][b], 3) == 1 and item != reg:
                del_df.append(item)
                new_df.append(b)
        if item.startswith('log'):
            if round(correlations[item][b], 3) == 1 and item != reg:
                del_df.append(item)
                new_df.append(b)
            elif round(correlations[item][reg], 3) == 1 and item != reg:
                del_df.append(item)
                new_df.append(b)
        if item.startswith('ratio'):
            if round(correlations[item][b], 3) == 1 and item != reg:
                del_df.append(item)
                new_df.append(b)
            elif round(correlations[item][reg], 3) == 1 and item != reg:
                del_df.append(item)
                new_df.append(b)
            elif round(correlations[item][l], 3) == 1 and item != reg:
                del_df.append(item)
                new_df.append(b)

    new_X = X.copy()[list(info)]

    for variable in del_df:
        new_X.drop([variable], axis=1, inplace=True)

    for variable in new_df:
        # Add each substitute once; join() returns a new frame, so reassign it
        if variable not in new_X.columns:
            new_X = new_X.join(data_X[variable])

    return new_X

I think I would need to find a way to pass the names of the new features into the function.
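One possible way to get at those names, sketched under assumptions: SelectFromModel exposes get_support(), a boolean mask over its input columns, so a small custom transformer can fit the Lasso selection and return a DataFrame that still carries column names. The class NamedLassoSelector, the alpha value and the toy feature names below are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso


class NamedLassoSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: Lasso-based selection that remembers column names."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        X = pd.DataFrame(X)
        self.selector_ = SelectFromModel(Lasso(alpha=self.alpha)).fit(X.values, y)
        # get_support() is a boolean mask over the input columns
        self.selected_ = list(X.columns[self.selector_.get_support()])
        return self

    def transform(self, X):
        # Return a DataFrame so later steps can look features up by name
        return pd.DataFrame(X)[self.selected_]


# Toy data with made-up feature names: only 'reg_a' drives the target
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)),
                     columns=['reg_a', 'reg_b', 'reg_c'])
y_toy = 3.0 * X_toy['reg_a']

selector = NamedLassoSelector(alpha=0.1).fit(X_toy, y_toy)
X_sel = selector.transform(X_toy)
print(selector.selected_)
```

A step like this only helps if a DataFrame actually reaches it. StandardScaler returns a plain array, so in my pipeline the names would already be lost before the Lasso step unless they are re-attached.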

And for removing correlated variables (as in the question linked above):

import pandas as pd

def correlation(dataset, threshold=0.99):
    # Accept both DataFrames and the arrays passed between pipeline steps
    df = pd.DataFrame(dataset)
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] >= threshold:
                colname = corr_matrix.columns[i] # getting the name of column
                print(colname)
                col_corr.add(colname)
    # Drop the later column of each correlated pair, keeping column order
    return df.drop(columns=list(col_corr))
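Since the goal is to tune this in a grid search, one option (a sketch, not a tested solution for my data) is to turn the same rule into a transformer class. The threshold then becomes a searchable parameter, and the columns to drop are learned in fit and reused in transform. The class name CorrelationFilter, the step names and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline


class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drop the later column of each correlated pair."""

    def __init__(self, threshold=0.99):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = pd.DataFrame(X).corr()
        # Keep the strict upper triangle so each pair is checked once
        mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
        upper = corr.where(mask)
        self.to_drop_ = [c for c in upper.columns
                         if (upper[c] >= self.threshold).any()]
        return self

    def transform(self, X):
        # Reuse the columns chosen on the training data
        return pd.DataFrame(X).drop(columns=self.to_drop_)


# Toy data: 'b' is an exact linear function of 'a'
a = np.arange(10.0)
X_toy = pd.DataFrame({'a': a,
                      'b': 2 * a + 1,
                      'c': [3., 1., 4., 1., 5., 9., 2., 6., 5., 3.]})
y_toy = np.array([0, 1] * 5)

filtered = CorrelationFilter(threshold=0.99).fit_transform(X_toy)
print(list(filtered.columns))

pipe = Pipeline([('corr', CorrelationFilter()),
                 ('ridge', RidgeClassifier())])
# The threshold is addressable by name, e.g. in a GridSearchCV param_grid
pipe.set_params(corr__threshold=0.9)
pipe.fit(X_toy, y_toy)
```

Like the function above, this only treats positive correlations at or above the threshold as duplicates.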

Thank you very much in advance!


