
I want to fill NaN using a mean of values from the same hour-minutes of the previous days. To simplify, this is an example of my df.

timstamp         data
22/04/2016 09:00 1
22/04/2016 09:05 2
...
23/04/2016 09:00 3
23/04/2016 09:05 4
...
24/04/2016 09:00 5
24/04/2016 09:05 6
...
25/04/2016 09:00 7
25/04/2016 09:05 8
...
25/04/2016 10:00 NaN
25/04/2016 10:05 NaN

The real data contains many days with continuous 5-minute intervals.

I tried df = df.groupby(df.index.minute).fillna(df.data.rolling(3).mean()) to take a rolling mean of the same hour-minute over the past days, but it did not work.

An alternative, df = df.groupby(df.index.minute).ffill(), takes the values from the previous two rows (i.e. 7 and 8), which come from the same minutes of the previous hour on the same day.
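A small sketch of why this happens, using made-up toy values rather than the real df: grouping by the minute alone puts 09:00 and 10:00 in the same group, so the forward fill reaches back to the previous hour. Grouping by both hour and minute keeps each time of day in its own group, so the fill comes from the previous day instead.

```python
import numpy as np
import pandas as pd

# Toy series (assumed values): one slot from the previous day, two from today.
idx = pd.to_datetime(["2016-04-24 10:00", "2016-04-25 09:00", "2016-04-25 10:00"])
s = pd.Series([5.0, 7.0, np.nan], index=idx, name="data")

# Minute-only grouping: all three rows share minute 0, so the NaN is filled
# from 09:00 on the same day.
by_minute = s.groupby(s.index.minute).ffill()

# Hour-and-minute grouping: the NaN at 10:00 is filled from 10:00 on the
# previous day.
by_time = s.groupby([s.index.hour, s.index.minute]).ffill()

print(by_minute.iloc[-1])  # 7.0
print(by_time.iloc[-1])    # 5.0
```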

However, I want the following result:

timstamp         data
22/04/2016 09:00 1
22/04/2016 09:05 2
...
23/04/2016 09:00 3
23/04/2016 09:05 4
...
24/04/2016 09:00 5
24/04/2016 09:05 6
...
25/04/2016 09:00 7
25/04/2016 09:05 8
25/04/2016 10:00 3
25/04/2016 10:05 4

where the value 3 (second last line) is the mean of the values from the same hour-minute of the previous days (mean of 1, 3, and 5), and 4 (last line) is the mean of 2, 4, and 6. Given the size of my df, I want to take a mean from tens of previous days.

Edit
I am getting closer. Using the following code, the mean of the data is calculated by similar hours and minutes as I wanted:

df.set_index('timstamp', inplace=True)
df=df.groupby([df.index.hour, df.index.minute]).mean()
df.index.names = ["hour", "minute"]

But it uses the entire dataset to compute the hour-minute mean. What I want is to use the same hour-minute from the previous days only, with the number of past days configurable. The resulting mean should then be used to fill the NaN.
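One way the requirement above could be sketched is below. The helper name fill_with_past_days_mean and its n_days parameter are hypothetical, and the sketch assumes a sorted DatetimeIndex with exactly one row per time slot per day (no missing days), so that "previous n rows within a slot" means "previous n days". Within each time-of-day group, the values are shifted by one row so the current day is excluded, then averaged over a rolling window.

```python
import numpy as np
import pandas as pd

def fill_with_past_days_mean(series, n_days=3):
    """Fill NaN with the mean of the same time of day over the previous n_days.

    Hypothetical helper; assumes one row per time slot per day.
    """
    slot = series.index.time  # time of day for every row
    past_mean = series.groupby(slot).transform(
        # shift(1) drops the current day from the window; min_periods=1
        # lets the mean use fewer days when not all n_days are available
        lambda s: s.shift(1).rolling(n_days, min_periods=1).mean()
    )
    return series.fillna(past_mean)

# Toy data: four days, two 5-minute slots per day, last day missing.
days = pd.date_range("2016-04-22", periods=4, freq="D")
idx = pd.DatetimeIndex([d + pd.Timedelta(minutes=m) for d in days for m in (540, 545)])
s = pd.Series([1, 2, 3, 4, 5, 6, np.nan, np.nan], index=idx, name="data")

filled = fill_with_past_days_mean(s, n_days=3)
print(filled.tail(2))  # 09:00 -> 3.0 (mean of 1, 3, 5), 09:05 -> 4.0 (mean of 2, 4, 6)
```

If days can be missing from the data, a row-based window no longer lines up with calendar days, and the series would need to be reindexed to a complete 5-minute grid first.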


Let's try this:

import numpy as np
import pandas as pd

# time sample every 5 mins
idx = pd.date_range('2018-01-01', '2018-01-31', freq='300s')
np.random.seed(2019)

# create toy data with roughly 5% of the values missing
df = pd.DataFrame({'idx': idx,
                   'data': np.random.uniform(0, 5, len(idx))})
df.loc[np.random.uniform(0, 1, len(idx)) > 0.95, 'data'] = np.nan

# means by the hour, can also use median
# (newer pandas versions prefer the lowercase alias 'h')
means = df.resample('H', on='idx').data.mean()

# get the timestamp on the hour
df['hour'] = df['idx'] - pd.to_timedelta(df.idx.dt.minute, unit='m')

# get the hour stamp of previous day
df['hour'] -= pd.to_timedelta(1, unit='d')

# update NaN with the hourly mean of the same hour on the previous day;
# map() looks each hour stamp up in means and tolerates repeated stamps,
# which is what raised the ValueError with means[nan_hour].
# NaNs on the first day stay NaN because there is no previous day.
mask = df.data.isna()
df.loc[mask, 'data'] = df.loc[mask, 'hour'].map(means)
  • here, the last line returns ValueError: cannot reindex from a duplicate axis – k.ko3n Apr 12 at 8:47
  • @k.koen Edited to fix error. – Quang Hoang Apr 12 at 12:56
