
Fill NaN Values From Another DataFrame (with Different Shape)

I'm looking for a faster approach to improve the performance of my solution for the following problem: a certain DataFrame has two columns with a few NaN values in them. The challe

Solution 1:

You can use an index to speed up the lookup and combine_first() to fill the NaN values:

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))

print(merged_df[cols])

The result:

  day_of_week  holiday_flg
0     Tuesday          0.0
1   Wednesday          0.0
2    Thursday          0.0
3    Saturday          1.0
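
For anyone who wants to try this without the full dataset, here is a minimal, self-contained sketch of the same combine_first idea. The two small DataFrames are invented for illustration (they are not the asker's data), and the intermediate name lookup is only there for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
merged_df = pd.DataFrame({
    "visit_date": ["2017-05-23", "2017-05-24", "2017-05-25", "2017-05-27"],
    "day_of_week": ["Tuesday", np.nan, np.nan, np.nan],
    "holiday_flg": [0.0, np.nan, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(
        ["2017-05-23", "2017-05-24", "2017-05-25", "2017-05-26", "2017-05-27"]),
    "day_of_week": ["Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
    "holiday_flg": [0, 0, 0, 0, 1],
})

cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df["visit_date"])

# Pull one row of date_info_df per visit date, then relabel those rows
# with merged_df's index so that combine_first can align them row by row.
lookup = date_info_df.set_index("calendar_date").loc[visit_date, cols]
lookup = lookup.set_index(merged_df.index)

# combine_first keeps existing values and takes from `lookup` only where NaN.
filled = merged_df[cols].combine_first(lookup)
print(filled)
```

Note that .loc raises a KeyError if a visit date is missing from date_info_df, so this sketch assumes every visit date has a calendar entry.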

Solution 2:

This is one solution. It should be efficient as there is no explicit merge or apply.

merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date']) 
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date']) 

# Key both lookups by calendar_date. day_of_week repeats across dates,
# so it cannot serve as a unique index for looking up holiday_flg.
s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('calendar_date')['holiday_flg']

merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['visit_date'].map(t))

Result

  air_store_id area_name day_of_week genre_name  holiday_flg hpg_store_id  \
0       air_a1     Tokyo     Tuesday   Japanese          0.0       hpg_h1   
1       air_a2       NaN   Wednesday        NaN          0.0          NaN   
2       air_a3       NaN    Thursday        NaN          0.0          NaN   
3       air_a4       NaN    Saturday        NaN          1.0          NaN   

   latitude  longitude     reserve_datetime  reserve_visitors visit_date  \
0    1234.0     5678.0  2017-04-22 11:00:00              25.0 2017-05-23   
1       NaN        NaN                  NaN              35.0 2017-05-24   
2       NaN        NaN                  NaN              45.0 2017-05-25   
3       NaN        NaN                  NaN               NaN 2017-05-27   

        visit_datetime  
0  2017-05-23 12:00:00  
1                  NaN  
2                  NaN  
3                  NaN  

Explanation

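  • s is a pd.Series that maps calendar_date to day_of_week, built from date_info_df.
  • pd.Series.map accepts a pd.Series as its argument and looks each value up in that Series' index. fillna then uses the looked-up values only where the original column is missing, so existing values are kept and dates without a match stay NaN.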
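
A small sketch of the map-then-fillna pattern described above, on made-up data. The last visit date has deliberately no entry in date_info_df, to show what "where possible" means; the name looked_up is just an illustrative intermediate.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
merged_df = pd.DataFrame({
    "visit_date": pd.to_datetime(
        ["2017-05-23", "2017-05-24", "2017-05-27", "2017-06-01"]),
    "day_of_week": ["Tuesday", np.nan, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-23", "2017-05-24", "2017-05-27"]),
    "day_of_week": ["Tuesday", "Wednesday", "Saturday"],
})

# Series indexed by calendar_date; map() looks each visit_date up in its index.
s = date_info_df.set_index("calendar_date")["day_of_week"]
looked_up = merged_df["visit_date"].map(s)

# fillna only touches rows where day_of_week is currently missing.
merged_df["day_of_week"] = merged_df["day_of_week"].fillna(looked_up)
print(merged_df)
```

The row for 2017-06-01 stays NaN because that date is not in the lookup Series.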

Solution 3:

Edit: one can also use merge to solve the problem. It is about 10 times faster than the old approach. (Make sure "visit_date" and "calendar_date" have the same dtype, e.g. both datetime.)

# No need to `set_index` on date_info_df; just select the columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                left_on="visit_date",
                right_on="calendar_date",
                how="left")  # "outer" gives the same rows only if every calendar_date occurs in merged_df

The desired result is now in the "day_of_week_y" and "holiday_flg_y" columns. In this approach and the map approach, we don't use the old "day_of_week" and "holiday_flg" columns at all; we just map the values from date_info_df onto merged_df.

merge also works here because the calendar_date entries in date_info_df are unique, so no duplicate rows are created.
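
To make the "_y" columns and the uniqueness point concrete, here is a minimal sketch on invented data. The validate="many_to_one" argument is my addition, not part of the answer above: it asks pandas to raise an error if calendar_date is not unique in date_info_df, instead of silently duplicating rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
merged_df = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-23", "2017-05-23", "2017-05-27"]),
    "day_of_week": [np.nan, np.nan, np.nan],
    "holiday_flg": [np.nan, np.nan, np.nan],
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-23", "2017-05-27"]),
    "day_of_week": ["Tuesday", "Saturday"],
    "holiday_flg": [0, 1],
})

merged = merged_df.merge(
    date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
    left_on="visit_date",
    right_on="calendar_date",
    how="left",
    validate="many_to_one",  # added check: right-hand keys must be unique
)

# Overlapping column names get the default suffixes: _x (left) and _y (right).
print(merged[["visit_date", "day_of_week_y", "holiday_flg_y"]])
```

With unique calendar dates, the merged frame has exactly as many rows as merged_df.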


You can also use pandas.Series.map. From the documentation:

Map values of Series using input correspondence (which can be a dict, Series, or function)

# set "calendar_date" as the index such that 
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")

# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])

Note that merged_df.visit_date was originally of string type. Thus, we use

merged_df.visit_date = pd.to_datetime(merged_df.visit_date)

to make it datetime.

Timings

The date_info_df and merged_df datasets were provided by karlphillip.

date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")   
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)

# merge method I proposed at the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))    
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One can see that HYRY's method runs 3 times faster when the result is assigned back to merged_df. This is why I thought HYRY's method was faster than mine at first glance. I suspect this is due to the nature of combine_first: its speed probably depends on how many NaN values merged_df contains. Once the result is assigned back, the columns are fully populated, so the repeated runs that %timeit performs are faster.

The performance of the merge and combine_first methods is nearly equivalent. There may be circumstances in which one is faster than the other, so each user should run tests on their own datasets.

Another difference between the two methods is that the merge method assumes every date in merged_df is contained in date_info_df. If some dates are in merged_df but not in date_info_df, merge returns NaN for them, and that NaN can override parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in "Pandas replace, multi column criteria".
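
Here is a tiny sketch of that pitfall on invented data. The second visit date has no match in date_info_df but already carries a value; the names overwritten and preserved are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
merged_df = pd.DataFrame({
    "visit_date": pd.to_datetime(["2017-05-23", "2017-06-01"]),
    "day_of_week": [np.nan, "Thursday"],  # second row already has a value
})
date_info_df = pd.DataFrame({
    "calendar_date": pd.to_datetime(["2017-05-23"]),  # 2017-06-01 is missing
    "day_of_week": ["Tuesday"],
})

merged = merged_df.merge(date_info_df, left_on="visit_date",
                         right_on="calendar_date", how="left")

# Taking the right-hand column as-is loses the existing "Thursday".
overwritten = merged["day_of_week_y"]

# combine_first keeps the original value and fills only the gaps.
preserved = merged["day_of_week_x"].combine_first(merged["day_of_week_y"])

print(overwritten.tolist())
print(preserved.tolist())
```

So if unmatched dates are possible, it seems safer to combine the "_x" and "_y" columns than to replace one with the other.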

