Skip to content Skip to sidebar Skip to footer

Pandas: Drop Quasi-duplicates By Column Values

I have a list that, let's say, looks like this (which I'm putting into a DF): [ ['john', '1', '1', '2016'], ['john', '1', '10', '2016'], ['sally', '3', '5', '2016'], ['sally', '4',

Solution 1:

You can sort the data frame by year, month, day and then take the first row from each name:

df.sort_values(by = ['year', 'month', 'day']).groupby('name').first()

#   month   day year# name          # john  1   1   2016#sally  3   5   2016

Data:

df = pd.DataFrame([['john', '1', '1', '2016'],
                   ['john', '1', '10', '2016'],
                   ['sally', '3', '5', '2016'],
                   ['sally', '4', '1', '2016']], 
                   columns = ['name', 'month', 'day', 'year'])

Solution 2:

Option 1 use pd.to_datetime to parse ['year', 'month', 'day'] columns. groupby('name') then take first

df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.sort_values(['name', 'date']).groupby('name').first()

enter image description here

Option 2 Same pd.to_datetime usage. groupby('name') take idxmin to locate smallest date.

df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.ix[df.groupby('name').date.idxmin()]

enter image description here

Post a Comment for "Pandas: Drop Quasi-duplicates By Column Values"