Pandas Efficient Groupby Season For Every Year
I have a multi-year time series an want the bounds between which 95% of my data lie. I want to look at this by season of the year ('DJF', 'MAM', 'JJA', 'SON'). I've tried following
Solution 1:
In case it helps, I would suggest replacing the following list comprehension and dict lookup that you identified as slow:
month_to_season_dct = {
1: 'DJF', 2: 'DJF',
3: 'MAM', 4: 'MAM', 5: 'MAM',
6: 'JJA', 7: 'JJA', 8: 'JJA',
9: 'SON', 10: 'SON', 11: 'SON',
12: 'DJF'
}
grp_ary = [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
with the following, which uses a numpy array as a lookup table.
month_to_season_lu = np.array([
None,
'DJF', 'DJF',
'MAM', 'MAM', 'MAM',
'JJA', 'JJA', 'JJA',
'SON', 'SON', 'SON',
'DJF'
])
grp_ary = month_to_season_lu[df.index.month]
Here's a timeit comparison of the two approaches on ~3 years of minutely data:
In [16]: timeit [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
1 loops, best of 3: 12.3 s per loop
In [17]: timeit month_to_season_lu[df.index.month]
1 loops, best of 3: 549 ms per loop
Solution 2:
The fastest so far is a combination of creating a low-frequency timeseries with which to do the season lookup and @Garrett's method of using a numpy.array
index lookup rather than a dict
.
season_lookup = np.array([
None,
'DJF', 'DJF',
'MAM', 'MAM', 'MAM',
'JJA', 'JJA', 'JJA',
'SON', 'SON', 'SON',
'DJF'])
SEASON_HALO = pd.datetools.relativedelta(months=4)
start_with_halo = df.index.min() - SEASON_HALO
end_with_halo = df.index.max() + SEASON_HALO
seasonal_idx = pd.DatetimeIndex(start=start_with_halo, end=end_with_halo, freq='QS-DEC')
seasonal_ts = pd.DataFrame(index=seasonal_idx)
seasonal_ts[SEAS] = season_lookup[seasonal_ts.index.month]
seasonal_minutely_ts = seasonal_ts.resample(df.index.freq, fill_method='ffill')
df_via_resample = df.join(seasonal_minutely_ts)
gp_up_sample = df_via_resample.groupby(SEAS)
gp_up_sample.quantile(FRAC_2_TAIL)
with 10 years of minute data, on my machine: this is about:
- 2% faster than low frequency
dict
lookup then up-sample - 7% faster than the normal frequency
np.array
lookup - >400% improvement on my original method
YMMV
Solution 3:
This can be helpful
data = pd.read_excel(DATAPATH)
data["Date"] = pd.to_datetime(data["Date"])
def MonthToSeason(x):
global season
if x == 6 or x == 7 or x == 8 or x == 9:
season = "Monsoon"
elif x == 10 or x == 11:
season = "Post-monsoon"
elif x == 12 or x == 1 or x == 2:
season = "Winter"
elif x == 3 or x == 4 or x == 5:
season = "Summer"
else:
season = np.nan
return season
data['Season'] = data['Date'].dt.month.apply(lambda x : MonthToSeason(x))
GroupedData = data.groupby(data["Season"]).agg(['count','min','mean','max','std'])
Post a Comment for "Pandas Efficient Groupby Season For Every Year"