Welcome to part 2 of the data analysis with Python and Pandas tutorials, where we're currently looking at avocado prices. We'll move on to a new dataset soon, but let's learn a few more things with this one first. Where we left off, we were graphing the price in Albany over time, but the result was quite messy. Here's a recap:

import pandas as pd

df = pd.read_csv("datasets/avocado.csv")

albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)

albany_df["AveragePrice"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x11fd925f8>

Dates are a funky type of data: here they are stored as strings, but they also have an order, at least to us. When it comes to dates, we have to help computers out a bit. Luckily for us, Pandas comes with built-in ways to handle dates. First, we need to convert the date column to datetime objects:

df = pd.read_csv("datasets/avocado.csv")

df['Date'] = pd.to_datetime(df['Date'])

albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)

albany_df["AveragePrice"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x11fa86828>
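
To see what that conversion changes, here is a minimal sketch on a small made-up frame (the values are invented for illustration and are not taken from avocado.csv). Before the conversion the column holds plain strings. Afterwards it has a datetime dtype, so Pandas can pull out date parts and order the values chronologically.

```python
import pandas as pd

# Toy data standing in for the real CSV; these values are invented.
toy = pd.DataFrame({
    "Date": ["2015-12-27", "2015-01-04", "2015-06-14"],
    "AveragePrice": [1.33, 1.22, 1.10],
})

# Before conversion the column is not a datetime type.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(toy["Date"])

toy["Date"] = pd.to_datetime(toy["Date"])

# After conversion it is, and the .dt accessor becomes available.
is_datetime_after = pd.api.types.is_datetime64_any_dtype(toy["Date"])
months = toy["Date"].dt.month.tolist()

print(is_datetime_before, is_datetime_after)
print(months)
```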

Alright, the x-axis formatting looks better, but that graph is pretty wild! Could we settle it down a bit? We could smooth the data with a rolling average.

To do this, let's apply a rolling mean over 25 rows and plot the result:

albany_df["AveragePrice"].rolling(25).mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1223cc278>

Hmm, so what happened? Pandas understands that a date is a date and sorts the x-axis accordingly, but I am now wondering if the dataframe itself is sorted. If it's not, that would seriously screw up our moving average calculations, since the rolling window runs over the rows in the order they are stored. This data may be indexed by date, but is it sorted? Let's see.
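
One quick way to check before sorting is the index's `is_monotonic_increasing` attribute, which is true when the index never decreases (repeated dates are fine). A minimal sketch, using a small invented frame rather than the real Albany data:

```python
import pandas as pd

# Invented rows, deliberately out of chronological order.
toy = pd.DataFrame(
    {"AveragePrice": [1.33, 1.22, 1.10]},
    index=pd.to_datetime(["2015-12-27", "2015-01-04", "2015-06-14"]),
)

sorted_before = toy.index.is_monotonic_increasing

# sort_index() returns a new, sorted frame; reassigning avoids inplace=True.
toy = toy.sort_index()
sorted_after = toy.index.is_monotonic_increasing

print(sorted_before, sorted_after)
```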

albany_df.sort_index(inplace=True)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

What's this warning above? Should we be worried? Basically, it's telling us that we're operating on what may be a copy of a slice of another dataframe, and to watch out because we might not be modifying what we were hoping to modify (like the main df). In this case, we're not trying to modify the main dataframe, so I think this warning is just plain annoying, but whatever. It's just a warning, not an error.
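
If the warning bothers you, a common way to avoid it is to take an explicit copy when filtering, so that Pandas knows the new frame is independent of the original. The sketch below uses a tiny invented frame in place of the full dataset, and the `doubled` column is purely hypothetical:

```python
import pandas as pd

# Invented stand-in for the full avocado frame.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-11", "2015-01-04", "2015-01-04"]),
    "region": ["Albany", "Albany", "Boston"],
    "AveragePrice": [1.24, 1.22, 1.05],
})

# .copy() makes albany_df its own frame rather than a slice of df.
albany_df = df[df["region"] == "Albany"].copy()
albany_df = albany_df.set_index("Date").sort_index()

# Adding a column now only touches albany_df.
albany_df["doubled"] = albany_df["AveragePrice"] * 2  # hypothetical column

print(albany_df)
print("doubled" in df.columns)
```

The trade-off is a bit of extra memory for the copy, which is usually negligible for a single region's rows.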

albany_df["AveragePrice"].rolling(25).mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1223ccf98>

And there we have it! A more useful summary of avocado prices for Albany over the years.

Visualizations are cool, but what if we want to save our new, smoother data? We can store it in a new column in our dataframe:

albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25).mean()

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

albany_df.head()

	Unnamed: 0	AveragePrice	Total Volume	4046	4225	4770	Total Bags	Small Bags	Large Bags	XLarge Bags	type	year	region	price25ma
Date
2015-01-04	51	1.22	40873.28	2819.50	28287.42	49.90	9716.46	9186.93	529.53	0.0	conventional	2015	Albany	NaN
2015-01-04	51	1.79	1373.95	57.42	153.88	0.00	1162.65	1162.65	0.00	0.0	organic	2015	Albany	NaN
2015-01-11	50	1.24	41195.08	1002.85	31640.34	127.12	8424.77	8036.04	388.73	0.0	conventional	2015	Albany	NaN
2015-01-11	50	1.77	1182.56	39.00	305.12	0.00	838.44	838.44	0.00	0.0	organic	2015	Albany	NaN
2015-01-18	49	1.17	44511.28	914.14	31540.32	135.77	11921.05
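
The NaN values in the price25ma column are expected: with a window of 25, the first 24 rows don't have a full window behind them yet. The sketch below shows the same effect on a short invented series with a window of 3, and how the optional `min_periods` argument changes it:

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # invented values

# With a window of 3, the first 2 positions lack a full window, so they are NaN.
full_window = prices.rolling(3).mean()

# min_periods=1 averages whatever values are available so far instead.
partial_window = prices.rolling(3, min_periods=1).mean()

print(full_window.tolist())
print(partial_window.tolist())
```

Whether to keep the NaN rows or fill them with partial-window averages depends on what you plan to do with the column; partial windows are noisier than full ones.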

Graphing/visualization - Data Analysis with Python 3 and Pandas