Week 6 - More about pandas

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

We’ve looked at Series, DataFrame, how to view, select and index data, and how to make plots with pandas. Now we will look at time series data, which pandas is simply brilliant for working with.

‘Time series data’ refers to any form of data that is represented or indexed with ordered timestamps. Examples include stock prices, electrical activity in the brain, and temperature throughout the day. As you might have guessed, this involves dates, times, and differences between them.

Timestamp and Timedelta

In pandas, dates and times are represented with pd.Timestamp, which is a replacement for Python’s native datetime.datetime object. The two are interchangeable in many respects, but a key difference is that pd.Timestamp uses NumPy’s datetime64 and timedelta64 data types and incorporates a much wider range of functionality and powerful features for creating and manipulating time series.



Let’s create timestamps to mark two memorable points in history. The First World War ended in 1918 at the eleventh hour of the eleventh day of the eleventh month, and the Second World War ended on the 08 May at one-minute past 11 in the evening.

[2]:
end_of_ww1 = pd.Timestamp('1918-11-11 11:00:00')  # Armistice Day
end_of_ww2 = pd.Timestamp('1945-05-08 23:01:00')  # VE Day

Timestamp data have many useful methods. An especially useful method is .strftime(), which creates a str representation of a timestamp using various formatting codes. Here are some examples.

[3]:
print("The First World War ended", end_of_ww1.strftime('%d %B %Y'), "at", end_of_ww1.strftime('%I:%M %p'))
print("The Second World War ended", end_of_ww2.strftime('%d %B %Y'), "at", end_of_ww2.strftime('%I:%M %p'))
The First World War ended 11 November 1918 at 11:00 AM
The Second World War ended 08 May 1945 at 11:01 PM

Here, the % symbols and subsequent characters tell the strftime() method how to represent the timestamp. You may wish to experiment with other formatting codes to format the timestamps in different ways. Copy the above code into the cell below and try to format the dates using abbreviated month names and 24-hour time.

If you are a history buff, you may already know the day of the week when each of The Great Wars ended. If not, you can find out using the pd.Timestamp.day_name() method.

[4]:
print('The First World War ended on a', end_of_ww1.day_name())
print('The Second World War ended on a', end_of_ww2.day_name())
The First World War ended on a Monday
The Second World War ended on a Tuesday

If we have two timestamps, we can calculate the difference between them, which is known as a Timedelta. Let’s do this to find out how much time elapsed between the end of the First and Second World Wars.

[5]:
tdelta = end_of_ww2 - end_of_ww1  # Subtract the end of WW1 from the end of WW2
print('Time between the end of the First and Second World Wars:')
print(f'{tdelta.days} days and {tdelta.seconds} seconds')
Time between the end of the First and Second World Wars:
9675 days and 43260 seconds

A Timedelta behaves like a number in many respects and can be operated on with most mathematical operators. But there are restrictions. For example, one can divide or multiply a Timedelta by any number, but addition and subtraction can only take place with other Timestamp or Timedelta objects. If you think about it, this makes perfect sense.

[6]:
print('Time between end of WW1 and WW2: ', tdelta)
print('Multiplied by 3:                 ', tdelta * 3)
print('Divided by 2:                    ', tdelta / 2)
print('Added to itself:                 ', tdelta + tdelta)
print('Minus twice itself:              ', tdelta - (tdelta * 2))
Time between end of WW1 and WW2:  9675 days 12:01:00
Multiplied by 3:                  29026 days 12:03:00
Divided by 2:                     4837 days 18:00:30
Added to itself:                  19351 days 00:02:00
Minus twice itself:               -9676 days +11:59:00

If you try to perform an operation that is not supported, such as raising a Timedelta to some power or multiplying two Timedelta’s together, you will get an error.

[7]:
tdelta ** 2  # Can not raise a Timedelta to a power...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/c9/7yddvl1n2ss863cgfngj0wpm0000gp/T/ipykernel_41875/977053077.py in <module>
----> 1 tdelta ** 2  # Can not raise a Timedelta to a power...

TypeError: unsupported operand type(s) for ** or pow(): 'Timedelta' and 'int'

In case you are wondering, leap years and other time-related idiosyncrasies are dealt with implicitly.

[8]:
leap_year = pd.Timestamp('27 Feb 2020')  # 2020 was a leap year
print((leap_year + pd.Timedelta('2D')).strftime('%d %B %Y'))  # Add 2 days
29 February 2020
[9]:
not_a_leap_year = pd.Timestamp('27 Feb 2021')  # 2021 wasn't a leap year
print((not_a_leap_year + pd.Timedelta('2D')).strftime('%d %B %Y'))  # Add 2 days
01 March 2021

Working with dates and times

Previously we have encountered functions such as range() and np.linspace(), which produce sequences of evenly-spaced numeric data. In pandas, we can do the same with dates using the pd.date_range(...) function. Let’s use it to create a timestamp for every day of this year.

[13]:
ts = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D', inclusive='both')
ts
[13]:
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
               '2022-01-09', '2022-01-10',
               ...
               '2022-12-22', '2022-12-23', '2022-12-24', '2022-12-25',
               '2022-12-26', '2022-12-27', '2022-12-28', '2022-12-29',
               '2022-12-30', '2022-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')

The result is a DatetimeIndex, which may be used as the index for a Series or DataFrame. Let’s work some magic on these timestamps to get a list of all future dates for this year in long-format.

[14]:
# Get all future dates for this year and format them nicely
ts[ts > pd.Timestamp.now()].strftime('%A %d %B %Y').tolist()
[14]:
['Wednesday 21 December 2022',
 'Thursday 22 December 2022',
 'Friday 23 December 2022',
 'Saturday 24 December 2022',
 'Sunday 25 December 2022',
 'Monday 26 December 2022',
 'Tuesday 27 December 2022',
 'Wednesday 28 December 2022',
 'Thursday 29 December 2022',
 'Friday 30 December 2022',
 'Saturday 31 December 2022']

As you can see, a DatetimeIndex is a powerful and flexible tool for representing dates and times. When it really starts to shine is when it is used as the index for time series data, so let’s invent some.

Imagine there is a shop that opened at the turn of the millennium which sells apples, bananas and cakes. At the time of opening, these items were sold at the respective prices of £0.45, £0.28, and £0.62 per item, but there’s been some inflation over time. Each day, a random(ish) number of each item is sold. Cakes are most popular, then bananas, then apples.

[122]:
# Initial cost in year 2000
cost_of_apple = 0.45
cost_of_banana = 0.28
cost_of_cake = 0.62

# Create a range of dates from 2000 to 2022 with 1-day frequency.
# This means we will have a row for every day.
dates = pd.date_range('2000', '2022', freq='1D')
number_of_days = len(dates)

# Create the DataFrame
df = pd.DataFrame(
    {
        'date': dates,
        'apples_sold': np.random.randint(0, 200, number_of_days),
        'bananas_sold': np.random.randint(0, 350, number_of_days),
        'cakes_sold': np.random.randint(0, 700, number_of_days),
        'cost_of_apple' : cost_of_apple,
        'cost_of_banana': cost_of_banana,
        'cost_of_cake': cost_of_cake
    },
)

# Adjust costs for inflation
cost_cols = ['cost_of_apple', 'cost_of_banana', 'cost_of_cake']
inflation = np.exp(np.linspace(0, 1, number_of_days))
df[cost_cols] = df[cost_cols].mul(inflation, axis=0)

# New columns for profit
df['apples_profit'] = df['apples_sold'] * df['cost_of_apple']
df['bananas_profit'] = df['bananas_sold'] * df['cost_of_banana']
df['cakes_profit'] = df['cakes_sold'] * df['cost_of_cake']
df['total_profit'] = df.apples_profit + df.bananas_profit + df.cakes_profit
df = df.round(2)
df
[122]:
date apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit
0 2000-01-01 199 347 539 0.45 0.28 0.62 89.55 97.16 334.18 520.89
1 2000-01-02 179 19 693 0.45 0.28 0.62 80.56 5.32 429.71 515.59
2 2000-01-03 178 215 447 0.45 0.28 0.62 80.12 60.21 277.21 417.54
3 2000-01-04 22 165 578 0.45 0.28 0.62 9.90 46.22 358.49 414.61
4 2000-01-05 23 1 294 0.45 0.28 0.62 10.36 0.28 182.37 193.01
... ... ... ... ... ... ... ... ... ... ... ...
8032 2021-12-28 157 6 411 1.22 0.76 1.68 191.95 4.56 692.33 888.84
8033 2021-12-29 1 255 106 1.22 0.76 1.68 1.22 194.01 178.58 373.81
8034 2021-12-30 71 326 545 1.22 0.76 1.68 86.83 248.06 918.28 1253.17
8035 2021-12-31 106 36 617 1.22 0.76 1.69 129.65 27.40 1039.72 1196.76
8036 2022-01-01 167 169 40 1.22 0.76 1.69 204.28 128.63 67.41 400.32

8037 rows × 11 columns

Now, if we look at the data types of the DataFrame, we can see that the date column has type datetime64[ns].

[123]:
df.dtypes
[123]:
date              datetime64[ns]
apples_sold                int64
bananas_sold               int64
cakes_sold                 int64
cost_of_apple            float64
cost_of_banana           float64
cost_of_cake             float64
apples_profit            float64
bananas_profit           float64
cakes_profit             float64
total_profit             float64
dtype: object

When working with the world population and titanic datasets, we used the str accessor method to interface with additional string methods (e.g., df['name'].str.contains('Smith')). For datetime-like data, there is a dt accessor method, which opens the door to datetime functionality. Below, we use the dt accessor method to create a separate column containing the day of the week, the year, and the fiscal year.

[124]:
df['Day'] = df['date'].dt.day_name()  # Get day of week
df['Year'] = df['date'].dt.year  # Get current year
df['Fiscal Year'] = df['date'].dt.to_period('Q-APR').dt.qyear  # Get fiscal year
df.tail()
[124]:
date apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit Day Year Fiscal Year
8032 2021-12-28 157 6 411 1.22 0.76 1.68 191.95 4.56 692.33 888.84 Tuesday 2021 2022
8033 2021-12-29 1 255 106 1.22 0.76 1.68 1.22 194.01 178.58 373.81 Wednesday 2021 2022
8034 2021-12-30 71 326 545 1.22 0.76 1.68 86.83 248.06 918.28 1253.17 Thursday 2021 2022
8035 2021-12-31 106 36 617 1.22 0.76 1.69 129.65 27.40 1039.72 1196.76 Friday 2021 2022
8036 2022-01-01 167 169 40 1.22 0.76 1.69 204.28 128.63 67.41 400.32 Saturday 2022 2022

We now have three new and useful columns in our DataFrame. At the moment, the original date column is just a column of data, but what we really need is for it to be our index.

[125]:
# Set the date column as the index
df = df.set_index('date')
df
[125]:
apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit Day Year Fiscal Year
date
2000-01-01 199 347 539 0.45 0.28 0.62 89.55 97.16 334.18 520.89 Saturday 2000 2000
2000-01-02 179 19 693 0.45 0.28 0.62 80.56 5.32 429.71 515.59 Sunday 2000 2000
2000-01-03 178 215 447 0.45 0.28 0.62 80.12 60.21 277.21 417.54 Monday 2000 2000
2000-01-04 22 165 578 0.45 0.28 0.62 9.90 46.22 358.49 414.61 Tuesday 2000 2000
2000-01-05 23 1 294 0.45 0.28 0.62 10.36 0.28 182.37 193.01 Wednesday 2000 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-12-28 157 6 411 1.22 0.76 1.68 191.95 4.56 692.33 888.84 Tuesday 2021 2022
2021-12-29 1 255 106 1.22 0.76 1.68 1.22 194.01 178.58 373.81 Wednesday 2021 2022
2021-12-30 71 326 545 1.22 0.76 1.68 86.83 248.06 918.28 1253.17 Thursday 2021 2022
2021-12-31 106 36 617 1.22 0.76 1.69 129.65 27.40 1039.72 1196.76 Friday 2021 2022
2022-01-01 167 169 40 1.22 0.76 1.69 204.28 128.63 67.41 400.32 Saturday 2022 2022

8037 rows × 13 columns

Now we can index the DataFrame using any valid date/time string, of which there are many. For example, if we wanted to look at the information for a particular day, say, 23 April 2012, we could use any of the following:

  • '2012-04-23'

  • 'April 23, 2012'

  • '23 Apr 2012'

  • '23 April 2012'

  • '2012/04/23'

[134]:
df.loc['23 April 2012']  # St Georges day, 2012
[134]:
apples_sold           54
bananas_sold         250
cakes_sold           605
cost_of_apple       0.79
cost_of_banana      0.49
cost_of_cake        1.08
apples_profit      42.52
bananas_profit    122.48
cakes_profit      656.34
total_profit      821.34
Day               Monday
Year                2012
Fiscal Year         2012
Name: 2012-04-23 00:00:00, dtype: object

Try experimenting with different indexing strings!

[ ]:

If you want the data for an entire year, just pass in the year.

[135]:
df.loc['2012']  # Get all data from 2012
[135]:
apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit Day Year Fiscal Year
date
2012-01-01 68 166 389 0.78 0.48 1.07 52.80 80.19 416.12 549.10 Sunday 2012 2012
2012-01-02 77 238 240 0.78 0.48 1.07 59.79 114.99 256.76 431.54 Monday 2012 2012
2012-01-03 37 319 98 0.78 0.48 1.07 28.73 154.15 104.86 287.74 Tuesday 2012 2012
2012-01-04 133 173 8 0.78 0.48 1.07 103.30 83.61 8.56 195.47 Wednesday 2012 2012
2012-01-05 128 316 339 0.78 0.48 1.07 99.43 152.73 362.81 614.97 Thursday 2012 2012
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2012-12-27 158 144 552 0.81 0.51 1.12 128.31 72.76 617.61 818.68 Thursday 2012 2013
2012-12-28 167 75 72 0.81 0.51 1.12 135.63 37.90 80.57 254.10 Friday 2012 2013
2012-12-29 41 1 11 0.81 0.51 1.12 33.30 0.51 12.31 46.12 Saturday 2012 2013
2012-12-30 152 277 680 0.81 0.51 1.12 123.48 140.02 761.11 1024.61 Sunday 2012 2013
2012-12-31 78 255 45 0.81 0.51 1.12 63.37 128.91 50.37 242.66 Monday 2012 2013

366 rows × 13 columns

Looks like 2012 was a leap year, because there are 366 days. Let’s check this by getting the data for February of that year. To do this, we just need to add the month.

[136]:
df.loc['2012-02']  # February 2012
[136]:
apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit Day Year Fiscal Year
date
2012-02-01 198 309 540 0.78 0.48 1.07 154.32 149.85 579.87 884.05 Wednesday 2012 2012
2012-02-02 130 140 48 0.78 0.49 1.07 101.33 67.90 51.55 220.79 Thursday 2012 2012
2012-02-03 83 164 500 0.78 0.49 1.07 64.71 79.55 537.05 681.31 Friday 2012 2012
2012-02-04 34 221 292 0.78 0.49 1.07 26.51 107.22 313.68 447.40 Saturday 2012 2012
2012-02-05 130 18 180 0.78 0.49 1.07 101.37 8.73 193.39 303.49 Sunday 2012 2012
2012-02-06 113 126 361 0.78 0.49 1.07 88.13 61.14 387.90 537.17 Monday 2012 2012
2012-02-07 77 337 92 0.78 0.49 1.07 60.06 163.55 98.87 322.48 Tuesday 2012 2012
2012-02-08 156 343 576 0.78 0.49 1.07 121.69 166.49 619.07 907.25 Wednesday 2012 2012
2012-02-09 101 194 116 0.78 0.49 1.07 78.80 94.18 124.69 297.66 Thursday 2012 2012
2012-02-10 65 234 179 0.78 0.49 1.08 50.72 113.61 192.43 356.76 Friday 2012 2012
2012-02-11 60 248 446 0.78 0.49 1.08 46.82 120.42 479.53 646.77 Saturday 2012 2012
2012-02-12 167 31 444 0.78 0.49 1.08 130.34 15.05 477.44 622.83 Sunday 2012 2012
2012-02-13 131 64 32 0.78 0.49 1.08 102.25 31.08 34.41 167.75 Monday 2012 2012
2012-02-14 144 232 244 0.78 0.49 1.08 112.42 112.69 262.44 487.55 Tuesday 2012 2012
2012-02-15 174 232 603 0.78 0.49 1.08 135.85 112.71 648.66 897.21 Wednesday 2012 2012
2012-02-16 100 25 503 0.78 0.49 1.08 78.09 12.15 541.15 631.38 Thursday 2012 2012
2012-02-17 4 133 649 0.78 0.49 1.08 3.12 64.63 698.31 766.06 Friday 2012 2012
2012-02-18 95 71 357 0.78 0.49 1.08 74.20 34.51 384.17 492.88 Saturday 2012 2012
2012-02-19 134 224 518 0.78 0.49 1.08 104.67 108.87 557.50 771.05 Sunday 2012 2012
2012-02-20 10 337 170 0.78 0.49 1.08 7.81 163.82 182.99 354.62 Monday 2012 2012
2012-02-21 85 63 388 0.78 0.49 1.08 66.41 30.63 417.69 514.73 Tuesday 2012 2012
2012-02-22 190 284 312 0.78 0.49 1.08 148.47 138.09 335.92 622.48 Wednesday 2012 2012
2012-02-23 190 18 562 0.78 0.49 1.08 148.49 8.75 605.15 762.40 Thursday 2012 2012
2012-02-24 96 246 331 0.78 0.49 1.08 75.04 119.64 356.46 551.14 Friday 2012 2012
2012-02-25 4 151 333 0.78 0.49 1.08 3.13 73.45 358.66 435.23 Saturday 2012 2012
2012-02-26 118 176 388 0.78 0.49 1.08 92.26 85.62 417.95 595.82 Sunday 2012 2012
2012-02-27 111 337 277 0.78 0.49 1.08 86.79 163.96 298.42 549.17 Monday 2012 2012
2012-02-28 60 257 163 0.78 0.49 1.08 46.92 125.05 175.63 347.60 Tuesday 2012 2012
2012-02-29 197 309 288 0.78 0.49 1.08 154.08 150.38 310.35 614.80 Wednesday 2012 2012

We can also ‘slice’ a DataFrame with a DatetimeIndex using date strings. Let’s grab the data for the last four days of August 2014.

[137]:
df.loc['2014-08-28':'2014-08-31']  # Last 4 days of August 2014
[137]:
apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit Day Year Fiscal Year
date
2014-08-28 125 63 266 0.88 0.55 1.21 109.50 34.34 321.05 464.89 Thursday 2014 2015
2014-08-29 77 135 450 0.88 0.55 1.21 67.46 73.59 543.19 684.25 Friday 2014 2015
2014-08-30 8 122 435 0.88 0.55 1.21 7.01 66.52 525.15 598.68 Saturday 2014 2015
2014-08-31 124 8 467 0.88 0.55 1.21 108.67 4.36 563.85 676.88 Sunday 2014 2015

To round things off, let’s group the data by quarters of the fiscal year (which starts in April) and plot mean profit over time.

[141]:
# A new column for quarters of the fiscal year
df['Fiscal Year'] = df.index.to_period('Q-APR')
df.head()
[141]:
apples_sold bananas_sold cakes_sold cost_of_apple cost_of_banana cost_of_cake apples_profit bananas_profit cakes_profit total_profit Day Year Fiscal Year
date
2000-01-01 199 347 539 0.45 0.28 0.62 89.55 97.16 334.18 520.89 Saturday 2000 2000Q3
2000-01-02 179 19 693 0.45 0.28 0.62 80.56 5.32 429.71 515.59 Sunday 2000 2000Q3
2000-01-03 178 215 447 0.45 0.28 0.62 80.12 60.21 277.21 417.54 Monday 2000 2000Q3
2000-01-04 22 165 578 0.45 0.28 0.62 9.90 46.22 358.49 414.61 Tuesday 2000 2000Q3
2000-01-05 23 1 294 0.45 0.28 0.62 10.36 0.28 182.37 193.01 Wednesday 2000 2000Q3
[142]:
# Get the profit columns
profit_cols = df.columns[df.columns.str.endswith('profit')]
(
    df.groupby('Fiscal Year')[profit_cols]
    .mean()
    .plot(ylabel='Profit (£)', figsize=(12, 4), rot=45)
);
_images/06_more_pandas_44_0.png

If you can master tricks like this in pandas, the world of time series data is at your fingertips.