pandas groupby apply is really slow

by user5406764   Last Updated November 05, 2017 15:26 PM

When I call df.groupby([...]).apply(lambda x: ...) the performance is horrible. Is there a faster / more direct way to do this simple query?

To demonstrate my point, here is some code to set up the DataFrame:

import pandas as pd

df = pd.DataFrame(data=
    {'ticker': ['AAPL','AAPL','AAPL','IBM','IBM','IBM'],
       'side': ['B','B','S','S','S','B'],
       'size': [100, 200, 300, 400, 100, 200],
      'price': [10.12, 10.13, 10.14, 20.3, 20.2, 20.1]})


    price   side     size   ticker
0   10.12   B        100    AAPL
1   10.13   B        200    AAPL
2   10.14   S        300    AAPL
3   20.30   S        400    IBM
4   20.20   S        100    IBM
5   20.10   B        200    IBM

Now here is the part that is extremely slow that I need to speed up:

%timeit avgpx = df.groupby(['ticker','side']) \
.apply(lambda group: (group['size'] * group['price']).sum() / group['size'].sum())

3.23 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This produces the correct result but as you can see above, takes super long (3.23ms doesn't seem like much but this is only 6 rows... When I use this on a real dataset it takes forever).

ticker  side
AAPL    B       10.126667
        S       10.140000
IBM     B       20.100000
        S       20.280000
dtype: float64


Answers 1


You can save some time by precomputing the product and getting rid of the apply.

df['scaled_size'] = df['size'] * df['price']
g = df.groupby(['ticker', 'side'])

g['scaled_size'].sum() / g['size'].sum()
100 loops, best of 3: 2.58 ms per loop
cᴏʟᴅsᴘᴇᴇᴅ
cᴏʟᴅsᴘᴇᴇᴅ
November 05, 2017 15:24 PM

Related Questions


Do not map item to any output using apply()

Updated July 30, 2018 21:26 PM


Add daily_volume so far to dataframe pandas

Updated June 18, 2020 21:26 PM

Duplicate values pandas

Updated September 18, 2019 12:26 PM

Chaining groupby and apply pandas

Updated May 22, 2018 02:26 AM