Fill NaN Values From Another DataFrame (with Different Shape)
Solution 1:
you can use Index
to speed up the lookup, use combine_first()
to fill NaN:
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
print(merged_df[cols])
the result:
day_of_week holiday_flg
0 Tuesday 0.0
1 Wednesday 0.0
2 Thursday 0.0
3 Saturday 1.0
Solution 2:
This is one solution. It should be efficient as there is no explicit merge
or apply
.
merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date'])
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']
merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))
Result
air_store_id area_name day_of_week genre_name holiday_flg hpg_store_id \
0 air_a1 Tokyo Tuesday Japanese 0.0 hpg_h1
1 air_a2 NaN Wednesday NaN 0.0 NaN
2 air_a3 NaN Thursday NaN 0.0 NaN
3 air_a4 NaN Saturday NaN 1.0 NaN
latitude longitude reserve_datetime reserve_visitors visit_date \
0 1234.0 5678.0 2017-04-22 11:00:00 25.0 2017-05-23
1 NaN NaN NaN 35.0 2017-05-24
2 NaN NaN NaN 45.0 2017-05-25
3 NaN NaN NaN NaN 2017-05-27
visit_datetime
0 2017-05-23 12:00:00
1 NaN
2 NaN
3 NaN
Explanation
s
is apd.Series
mapping calendar_date to day_of_week fromdate_info_df
.- Use
pd.Series.map
, which takespd.Series
as an input, to update missing values, where possible.
Solution 3:
Edit: one can also use merge
to solve the problem. 10 times faster than the old approach. (Need to make sure "visit_date"
and "calendar_date"
are of the same format.)
# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
left_on="visit_date",
right_on="calendar_date",
how="left") # outer should also work
The desired result will be at "day_of_week_y"
and "holiday_flg_y"
column right now. In this approach and the map
approach, we don't use the old "day_of_week"
and "holiday_flg"
at all. We just need to map the results from data_info_df
to merged_df
.
merge
can also do the job because data_info_df
's data entries are unique. (No duplicates will be created.)
You can also try using pandas.Series.map
. What it does is
Map values of Series using input correspondence (which can be a dict, Series, or function)
# set "calendar_date" as the index such that
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")
# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])
Note merged_df.visit_date
originally was of string type. Thus, we use
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
to make it datetime.
Timings date_info_df dataset and merged_df provided by karlphillip.
date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
# merge method I proprose on the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can see that HYRY's method runs 3 times faster if assigning the result back to the merged_df
. This is why I thought HARY's method was faster than mine at first glance. I suspect that is because of the nature of combine_first
. I guess that the speed of HARY's method will depend on how sparse it is in merged_df
. Thus, while assigning the results back, the columns become full; therefore, while rerunning it, it is faster.
The performances of the merge
and combine_first
methods are nearly equivalent. Perhaps there can be circumstances that one is faster than another. It should be left to each user to do some tests on their datasets.
Another thing to note between the two methods is that the merge
method assumed every date in merged_df
is contained in data_info_df
. If there are some dates that are contained in merged_df
but not data_info_df
, it should return NaN
. And NaN
can override some part of merged_df
that originally contains values! This is when combine_first
method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria
Post a Comment for "Fill NaN Values From Another DataFrame (with Different Shape)"