Pandas Generating Dataframe Based On Columns Being Present
Solution 1:
Pandas DataFrame
s support most of the dict
methods, include get
(which allows you to substitute a value for an empty key). So you can do the statistics you want on all columns, then get the values for the column you want, substituting an empty Series
for missing columns, then drop NaN
columns (I use Bad Value
to demonstrate what happens to missing columns):
eser = pd.Series()
count = p.count().max(axis=1)
all_max = p.max()
score_max = all_max.get('Score', eser)
wins_max = all_max.get('Wins', eser)
wins_mean = p.mean().get('Wins', eser)
type_nunique = p.agg(lambda x: x.nunique()).get('Type', eser)
summary = pd.DataFrame({'Total Games': count,
'Average Wins': wins_mean,
'Greatest Wins': wins_max,
'Unique games': type_nunique,
'Max Score': score_max})
summary.dropna(axis=1, how='all', inplace=True)
Or a one-liner (which involves calculating the max
on all columns twice, which shouldn't be a problem for a small number of values):
summary = pd.DataFrame({'Total Games': p.count().max(axis=1),
'Average Wins': p.mean().get('Wins', pd.Series()),
'Greatest Wins': p.max().get('Wins', pd.Series()),
'Unique games': p.agg(lambda x: x.nunique()).get('Type', pd.Series()),
'Max Score': p.max().get('Score', pd.Series())}).dropna(axis=1, how='all')
The result of either approach:
Average Wins Greatest Wins Total Games Unique games
Greg 11 30 4 2
Sam 15 20 2 2
Steve 11 20 4 2
Without the dropna
:
Average Wins Greatest Wins Max Score Total Games Unique games
Greg 1130NaN42
Sam 1520NaN22
Steve 1120NaN42
If performance is an issue, the above will be slow because they require computing multiple statistics on all columns, which means statistics are being computed and then thrown away. A faster, but uglier, approach is similar to your approach of using a loop over a dict
.
The problem with your implementation is that the dict
items are not lazily evaluated, they are evaluated when the dict
is created, which means it still tries to access the non-existent columns.
The approach below gets both the item and applies the function only when the column is found (with special handling for the count
case, since any existing column will work):
sumdict = {'Total Games': (None, 'count'),
'Average Wins': ('Wins', 'mean'),
'Greatest Wins': ('Wins', 'max'),
'Unique games': ('Type', 'nunique'),
'Max Score': ('Score', 'max')}
summary = []
for key, (column, op) in sumdict.items():
if column isNone:
res = p.agg(op).max(axis=1)
elif column notin df:
continueelse:
res = p[column].agg(lambda x: getattr(x, op)())
summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)
It gives the same result as my above approaches, although with a different column order.
Post a Comment for "Pandas Generating Dataframe Based On Columns Being Present"