Pandas: Drop Quasi-duplicates By Column Values
I have a list that, let's say, looks like this (which I'm putting into a DF): [ ['john', '1', '1', '2016'], ['john', '1', '10', '2016'], ['sally', '3', '5', '2016'], ['sally', '4',
Solution 1:
You can sort the data frame by year, month, day and then take the first row from each name:
df.sort_values(by = ['year', 'month', 'day']).groupby('name').first()
# month day year# name # john 1 1 2016#sally 3 5 2016Data:
df = pd.DataFrame([['john', '1', '1', '2016'],
['john', '1', '10', '2016'],
['sally', '3', '5', '2016'],
['sally', '4', '1', '2016']],
columns = ['name', 'month', 'day', 'year'])
Solution 2:
Option 1
use pd.to_datetime to parse ['year', 'month', 'day'] columns.
groupby('name') then take first
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.sort_values(['name', 'date']).groupby('name').first()
Option 2
Same pd.to_datetime usage.
groupby('name') take idxmin to locate smallest date.
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.ix[df.groupby('name').date.idxmin()]


Post a Comment for "Pandas: Drop Quasi-duplicates By Column Values"