Python: Filling in date gaps in a MultiIndex pandas DataFrame

I would like to modify a pandas MultiIndex DataFrame so that each index group includes every date in a specified range. Each group should have the missing dates from 2013-06-11 to 2013-12-31 filled in with the value 0 (or NaN).

Group A  Group B  Date         Value
loc_a    group_a  2013-06-11      22
                  2013-07-02      35
                  2013-07-09      14
                  2013-07-30       9
                  2013-08-06       4
                  2013-09-03      40
                  2013-10-01      18
         group_b  2013-07-09       4
                  2013-08-06       2
                  2013-09-03       5
         group_c  2013-07-09       1
                  2013-09-03       2
loc_b    group_a  2013-10-01       3

I've seen a few discussions of reindexing, but those cover simple (non-grouped) time-series data.

Is there an easy way to do this?


Following are some attempts I've made at accomplishing this. For example: Once I've unstacked by ['A', 'B'], I can then reindex.

import datetime as dt

import pandas as pd

df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                'Date': ["2013-06-11",
                        "2013-07-02",
                        "2013-07-09",
                        "2013-07-30",
                        "2013-08-06",
                        "2013-09-03",
                        "2013-10-01",
                        "2013-07-09",
                        "2013-08-06",
                        "2013-09-03",
                        "2013-07-09",
                        "2013-09-03",
                        "2013-10-01"],
                 'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

df.Date = df['Date'].apply(lambda x: pd.to_datetime(x).date())
df = df.set_index(['A', 'B', 'Date'])

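# Cover the full requested range, 2013-06-11 through 2013-12-31 (inclusive)
dt_start = dt.date(2013, 6, 11)
dt_end = dt.date(2013, 12, 31)
all_dates = [dt_start + dt.timedelta(days=x) for x in range((dt_end - dt_start).days + 1)]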

df2 = df.unstack(['A', 'B'])
df3 = df2.reindex(index=all_dates).fillna(0)
df4 = df3.stack(['A', 'B'])

## df4 is about where I want to get, now I'm trying to get it back in the form of df...

df5 = df4.reset_index()
df6 = df5.rename(columns={'level_0' : 'Date'})
df7 = df6.groupby(['A', 'B', 'Date'])['Value'].sum()

The last few lines make me a little sad. I was hoping that at df6 I could simply set_index back to ['A', 'B', 'Date'], but that did not group the values as they are grouped in the initial df DataFrame.

Any thoughts on how I can reindex the unstacked DataFrame, restack, and have the DataFrame in the same format as the original?

Answer:1

Your question wasn't clear about exactly which dates you were missing; I'm just assuming that you want to fill NaN for any date for which you do have an observation elsewhere. My solution will have to be amended if this assumption is faulty.

Side note: it may be nice to include a line to create the DataFrame

In [55]: df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
   ....:                    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
   ....:                    'Date': ["2013-06-11",
   ....:                            "2013-07-02",
   ....:                            "2013-07-09",
   ....:                            "2013-07-30",
   ....:                            "2013-08-06",
   ....:                            "2013-09-03",
   ....:                            "2013-10-01",
   ....:                            "2013-07-09",
   ....:                            "2013-08-06",
   ....:                            "2013-09-03",
   ....:                            "2013-07-09",
   ....:                            "2013-09-03",
   ....:                            "2013-10-01"],
   ....:                     'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})

In [56]: df.Date = pd.to_datetime(df.Date)

In [57]: df = df.set_index(['A', 'B', 'Date'])

In [58]: print(df)
                          Value
A     B       Date             
loc_a group_a 2013-06-11     22
              2013-07-02     35
              2013-07-09     14
              2013-07-30      9
              2013-08-06      4
              2013-09-03     40
              2013-10-01     18
      group_b 2013-07-09      4
              2013-08-06      2
              2013-09-03      5
      group_c 2013-07-09      1
              2013-09-03      2
loc_b group_a 2013-10-01      3

To get the unobserved values filled, we'll use the unstack and stack methods. Unstacking will create the NaNs we're interested in, and then we'll stack them up to work with.

In [71]: df.unstack(['A', 'B'])
Out[71]: 
              Value                           
A             loc_a                      loc_b
B           group_a  group_b  group_c  group_a
Date                                          
2013-06-11       22      NaN      NaN      NaN
2013-07-02       35      NaN      NaN      NaN
2013-07-09       14        4        1      NaN
2013-07-30        9      NaN      NaN      NaN
2013-08-06        4        2      NaN      NaN
2013-09-03       40        5        2      NaN
2013-10-01       18      NaN      NaN        3


In [59]: df.unstack(['A', 'B']).fillna(0).stack(['A', 'B'])
Out[59]: 
                          Value
Date       A     B             
2013-06-11 loc_a group_a     22
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-07-02 loc_a group_a     35
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-07-09 loc_a group_a     14
                 group_b      4
                 group_c      1
           loc_b group_a      0
2013-07-30 loc_a group_a      9
                 group_b      0
                 group_c      0
           loc_b group_a      0
2013-08-06 loc_a group_a      4
                 group_b      2
                 group_c      0
           loc_b group_a      0
2013-09-03 loc_a group_a     40
                 group_b      5
                 group_c      2
           loc_b group_a      0
2013-10-01 loc_a group_a     18
                 group_b      0
                 group_c      0
           loc_b group_a      3

Reorder the index levels as necessary.
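As a minimal sketch of that reordering step, assuming the same sample data as above: the names `filled` and `restored` are made up for illustration, and `reorder_levels` followed by `sort_index` is one way to get back to the original (A, B, Date) layout, not necessarily the only one.

```python
import pandas as pd

dates = ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
         "2013-08-06", "2013-09-03", "2013-10-01",
         "2013-07-09", "2013-08-06", "2013-09-03",
         "2013-07-09", "2013-09-03", "2013-10-01"]

df = pd.DataFrame({
    "A": ["loc_a"] * 12 + ["loc_b"],
    "B": ["group_a"] * 7 + ["group_b"] * 3 + ["group_c"] * 2 + ["group_a"],
    "Date": pd.to_datetime(dates),
    "Value": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
}).set_index(["A", "B", "Date"])

# Unstack to expose the gaps as NaN, fill them, then stack back.
filled = df.unstack(["A", "B"]).fillna(0).stack(["A", "B"])

# The stacked index comes back as (Date, A, B); put it back in the
# original (A, B, Date) order and sort so the groups are contiguous.
restored = filled.reorder_levels(["A", "B", "Date"]).sort_index()

print(restored.head(8))
```

Note that `Value` ends up as a float column here, because the intermediate NaNs force a float dtype; add `.astype(int)` if integers matter.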

I had to slip that fillna(0) in the middle there so that the NaNs weren't dropped. stack does have a dropna argument, and I would have expected dropna=False to keep the all-NaN rows around. A bug, maybe?
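If the assumption above is wrong and every group should instead cover the full calendar range from the question (2013-06-11 to 2013-12-31), the same unstack/stack idea can be combined with a reindex. This is only a sketch under that assumption: `all_dates`, `wide`, and `full` are illustrative names, and daily frequency is a guess at what "missing dates" means.

```python
import pandas as pd

dates = ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
         "2013-08-06", "2013-09-03", "2013-10-01",
         "2013-07-09", "2013-08-06", "2013-09-03",
         "2013-07-09", "2013-09-03", "2013-10-01"]

df = pd.DataFrame({
    "A": ["loc_a"] * 12 + ["loc_b"],
    "B": ["group_a"] * 7 + ["group_b"] * 3 + ["group_c"] * 2 + ["group_a"],
    "Date": pd.to_datetime(dates),
    "Value": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3],
}).set_index(["A", "B", "Date"])

# Every calendar day in the requested range (assumed daily frequency).
all_dates = pd.date_range("2013-06-11", "2013-12-31", freq="D")

# One column per (A, B) group, one row per observed date.
wide = df["Value"].unstack(["A", "B"])

# Extend the rows to the full range, fill the gaps with 0, and make
# sure the index keeps the name "Date".
wide = wide.reindex(all_dates).fillna(0).rename_axis("Date")

# Stack back to long form and restore the original (A, B, Date) layout.
full = (
    wide.stack(["A", "B"])
        .rename("Value")
        .astype(int)
        .reorder_levels(["A", "B", "Date"])
        .sort_index()
        .to_frame()
)

print(full.loc[("loc_a", "group_b")].head())
```

Drop the `.fillna(0)` and `.astype(int)` calls if NaN is preferred over 0 for the missing dates; depending on the pandas version, `stack` may then drop the NaN rows, which is the behaviour discussed above.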
