Skip to content Skip to sidebar Skip to footer

Expand Pandas Dataframe Based On Range In A Column

I have a pandas dataframe like this: Name SICs Agric 0100-0199 Agric 0910-0919 Agric 2048-2048 Food 2000-2009 Food 2010-2019 Soda 2097-2097 The SICs column gives a ran

Solution 1:

Quick and dirty but I think this gets you to what you need:

from io import StringIO
import pandas as pd

players=StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")

df = pd.DataFrame.from_csv(players, sep=",", parse_dates=False).reset_index()


df2 = pd.DataFrame(columns=('Name', 'SIC'))

count = 0for idx,r in df.iterrows():
    data = r['SICs'].split("-")
    for i inrange(int(data[0]), int(data[1])+1):
                   df2.loc[count] = (r['Name'], i)
                   count += 1

Solution 2:

The neatest way I found (building on from Andy Hayden's answer):

# Extract date min and maxdf = df.set_index("Name")
df = df['SICs'].str.extract("(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')

# Enumerate dates into wide tableenumerated_dates = [np.arange(row['min'], row['max']+1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_dates, index=df.index)

# Convert from wide to long tabledf = df.stack().reset_index(1, drop=True)

It is however slow due to the for loop. A vectorised solution would be amazing but I cant find one.

Solution 3:

You can use str.extract to get strings from a regular expression:

In [11]:dfOut[11]:NameSICs0Agri0100-01991Agri0910-09192Food2000-2009

First take out the name as that's the thing we want to keep:

In [12]:df1=df.set_index("Name")In [13]:df1Out[13]:SICsNameAgri0100-0199Agri0910-0919Food2000-2009In [14]:df1['SICs'].str.extract("(\d+)-(\d+)")Out[14]:01NameAgri0100  0199Agri0910  0919Food2000  2009

Then flatten this with stack (which adds a MultiIndex):

In [15]:df1['SICs'].str.extract("(\d+)-(\d+)").stack()Out[15]:NameAgri00100101990091010919Food0200012009dtype:object

If you must you can remove the 0-1 level of the MultiIndex:

In [16]:df1['SICs'].str.extract("(\d+)-(\d+)").stack().reset_index(1,drop=True)Out[16]:NameAgri0100Agri0199Agri0910Agri0919Food2000Food2009dtype:object

Post a Comment for "Expand Pandas Dataframe Based On Range In A Column"