Skip to content Skip to sidebar Skip to footer

Pandas Generating Dataframe Based On Columns Being Present

I have a dataframe import pandas as pd df=pd.DataFrame({'Players': [ 'Sam', 'Greg', 'Steve', 'Sam', 'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],

Solution 1:

Pandas DataFrames support most of the dict methods, include get (which allows you to substitute a value for an empty key). So you can do the statistics you want on all columns, then get the values for the column you want, substituting an empty Series for missing columns, then drop NaN columns (I use Bad Value to demonstrate what happens to missing columns):

eser = pd.Series()
count = p.count().max(axis=1)
all_max = p.max()
score_max = all_max.get('Score', eser)
wins_max = all_max.get('Wins', eser)
wins_mean = p.mean().get('Wins', eser)
type_nunique = p.agg(lambda x: x.nunique()).get('Type', eser)

summary = pd.DataFrame({'Total Games': count,
                        'Average Wins': wins_mean,
                        'Greatest Wins': wins_max,
                        'Unique games': type_nunique,
                        'Max Score': score_max})
summary.dropna(axis=1, how='all', inplace=True)

Or a one-liner (which involves calculating the max on all columns twice, which shouldn't be a problem for a small number of values):

summary = pd.DataFrame({'Total Games': p.count().max(axis=1),
                        'Average Wins': p.mean().get('Wins', pd.Series()),
                        'Greatest Wins': p.max().get('Wins', pd.Series()),
                        'Unique games': p.agg(lambda x: x.nunique()).get('Type', pd.Series()),
                        'Max Score': p.max().get('Score', pd.Series())}).dropna(axis=1, how='all')

The result of either approach:

       Average Wins  Greatest Wins  Total Games  Unique games
Greg             11             30            4             2
Sam              15             20            2             2
Steve            11             20            4             2

Without the dropna:

       Average Wins  Greatest Wins  Max Score  Total Games  Unique games
Greg             1130NaN42
Sam              1520NaN22
Steve            1120NaN42

If performance is an issue, the above will be slow because they require computing multiple statistics on all columns, which means statistics are being computed and then thrown away. A faster, but uglier, approach is similar to your approach of using a loop over a dict.

The problem with your implementation is that the dict items are not lazily evaluated, they are evaluated when the dict is created, which means it still tries to access the non-existent columns.

The approach below gets both the item and applies the function only when the column is found (with special handling for the count case, since any existing column will work):

sumdict = {'Total Games': (None, 'count'),
           'Average Wins': ('Wins', 'mean'),
           'Greatest Wins': ('Wins', 'max'),
           'Unique games': ('Type', 'nunique'),
           'Max Score': ('Score', 'max')}

summary = []
for key, (column, op) in sumdict.items():
    if column isNone:
        res = p.agg(op).max(axis=1)
    elif column notin df:
        continueelse: 
        res = p[column].agg(lambda x: getattr(x, op)())
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

It gives the same result as my above approaches, although with a different column order.

Post a Comment for "Pandas Generating Dataframe Based On Columns Being Present"