Skip to content Skip to sidebar Skip to footer

Group By And Concatenate Dataframe

I have df with frame, m_label, and details so all of them can be duplicated, in same frame may be different labels with different details, but you need to know m_label+details have

Solution 1:

First I find it useful to move the start/end info to a new column, which is done by merging together the rows that have start/end on one side and the ones that don’t on the other:

>>> detail_type = df['details'].isin({'start', 'end'})
>>> df = pd.merge(df[~detail_type], df[detail_type].rename(columns={'details': 'detail_type'}))
>>> df
    frame   m_label details detail_type
0       0      BBPS       3       start
1       0  Findings      DV       start
2       0  Findings      DV         end
3       0  Findings      DV       start
4       0  Findings      DV         end
5      31   Actions      IR       start
6      99      BBPS       2         end
7      99  Findings      PL       start
8      99  Findings      PL         end
9      99  Findings      PL       start
10     99  Findings      PL         end
11    101    Action      IR         end

Now we can replace the 2 columns by their concatenated text:

>>> df = df.drop(columns=['m_label', 'details']).join(df['m_label'].str.cat(df['details'], sep='.'))
>>> df.drop_duplicates()
    frame detail_type      m_label
0       0       start       BBPS.3
1       0       start  Findings.DV
2       0         end  Findings.DV
5      31       start   Actions.IR
6      99         end       BBPS.2
7      99       start  Findings.PL
8      99         end  Findings.PL
11    101         end    Action.IR

You could even pivot to have a start and an end column:

>>> df.drop_duplicates().pivot(columns='detail_type', index='m_label', values='frame')
detail_type    end  start
m_label                  
Action.IR    101.0    NaN
Actions.IR     NaN   31.0
BBPS.2        99.0    NaN
BBPS.3         NaN    0.0
Findings.DV    0.0    0.0
Findings.PL   99.0   99.0

But for that to be efficient you’ll first need to define rules that uniquely name your labels, e.g. BBPS regardless of details 2 and 3, Action / Actions always spelled the same way, etc.


Solution 2:

I don't think groupby would help, as the order inside the group also matter.

Try this (since you didn't post the df in a copiable way, I can't test it myself):

df = df.assign(new_label=None).sort_values(['frame', 'm_label'])
df.loc[~df['details'].isin(['start', 'end']), 'new_label'] = df['m_label'] + '.' + df['details']
df.loc[(df['frame'] == df['frame'].shift(-1).fillna('')) & (df['m_label'] == df['m_label'].shift(-1).fillna('')) & df['details'].shift(-1).isin(['start', 'end']), 'details'] = df['details'].shift(-1).fillna('')
df = df.loc[pd.notna(df['new_label']) & df['details'].isin(['start', 'end']), ['frame', 'new_label', 'details']]

Post a Comment for "Group By And Concatenate Dataframe"