Skip to content Skip to sidebar Skip to footer

Pandas Convert String Columns To Datetime, Allowing Missing But Not Invalid

I have a pandas data frame with multiple columns of strings representing dates, with empty strings representing missing dates. For example import numpy as np import pandas as pd

Solution 1:

pandas doesn't have an option that exactly replicates what you want, here's one way to do it, which should be relatively efficient.

In [4]: dfBad
Out[4]: 
   custId   eventDate registerDate
0106/10/199206/08/20021208/24/201220/08/20122304/24/201504/20/2015344510/14/200910/10/2009In [7]: cols
Out[7]: ['eventDate', 'registerDate']

In [9]: dts = dfBad[cols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))

In [10]: dts
Out[10]: 
   eventDate registerDate
01992-06-102002-06-0812012-08-24          NaT
22015-04-242015-04-203        NaT          NaT
42009-10-142009-10-10In [11]: mask = pd.isnull(dts) & (dfBad[cols] !='')

In [12]: mask
Out[12]: 
  eventDate registerDate
0FalseFalse1FalseTrue2FalseFalse3FalseFalse4FalseFalseIn [13]: mask.any()
Out[13]: 
eventDate       False
registerDate     True
dtype: bool

In [14]: is_bad = mask.any()

In [23]: if is_bad.any():
    ...:     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
    ...: else:
    ...:     df[cols] = dts
    ...:     
---------------------------------------------------------------------------
ValueError                                Traceback (most recent calllast)
<ipython-input-23-579c06ce3c77>in<module>()
      1 if is_bad.any():
----> 2     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))3else:
      4     df[cols] = dts
      5 

ValueError: bad dates in col(s) ['registerDate']

Solution 2:

Just to take the accepted answer a little further, I replaced the columns of all valid or missing strings with their parsed datetimes, and then raised an error for the remaining unparsed columns:

dtCols = ['eventDate', 'registerDate']
dts = dfBad[dtCols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))

mask = pd.isnull(dts) & (dfBad[dtCols] != '')
colHasError = mask.any()

invalidCols = colHasError[colHasError].index.tolist() 
validCols = list(set(dtCols) - set(invalidCols))

dfBad[validCols] = dts[validCols] # replace the completely valid/empty string cols with datesif colHasError.any():
    raise ValueError("bad dates in col(s) {0}".format(invalidCols))
# raises:  ValueError: bad dates in col(s) ['registerDate']print(dfBad) # eventDate got converted, registerDate didn't

The accepted answer contains the main insight, though, which is to go ahead and coerce errors to NaT and then distinguish the non-empty but invalid strings from the empty ones with the mask.

Post a Comment for "Pandas Convert String Columns To Datetime, Allowing Missing But Not Invalid"