Welcome to part 2 of the data analysis with Python and Pandas tutorials, where we're currently exploring avocado prices. We'll move on to a new dataset soon, but let's learn a few more things with this one first. Where we left off, we were graphing the average price for Albany over time, but the result was quite messy. Here's a recap:
import pandas as pd
df = pd.read_csv("datasets/avocado.csv")
albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)
albany_df["AveragePrice"].plot()
Dates are a funky type of data: here they are stored as strings, yet they also have an order, at least to us. When it comes to dates, we have to help computers out a bit. Luckily for us, Pandas comes with built-in ways to handle dates. First, we need to convert the Date column to datetime objects:
df = pd.read_csv("datasets/avocado.csv")
df['Date'] = pd.to_datetime(df['Date'])
albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)
albany_df["AveragePrice"].plot()
Alright, the axis formatting looks better, but that graph is pretty wild! Could we settle it down a bit? We could smooth the data with a rolling average.
To do this, let's take a rolling mean over a 25-row window and plot the result:
albany_df["AveragePrice"].rolling(25).mean().plot()
Hmm, so what happened? Pandas understands that a date is a date, and how to order the X axis, but now I'm wondering whether the dataframe itself is sorted. If it's not, that would seriously throw off our moving average calculations, since the rolling window runs over the rows in the order they are stored. This data may be indexed by date, but is it sorted? Let's sort the index to be sure.
albany_df.sort_index(inplace=True)
What's this warning above? Should we be worried? Pandas calls it a SettingWithCopyWarning. Basically, all it's telling us is that we're doing operations on a copy of a slice of a dataframe, and to watch out because we might not be modifying what we were hoping to modify (like the main df). In this case, we're not trying to change the main dataframe, so I think this warning is just plain annoying, but whatever. It's just a warning, not an error.
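If the warning bothers you, one common way to sidestep it is to take an explicit copy of the filtered rows before modifying them. Here's a minimal sketch of that idea. The tiny dataframe below is a made-up stand-in for the avocado data (the values are hypothetical), so you can run it without the CSV:

```python
import pandas as pd

# Hypothetical stand-in for the avocado data; these values are made up.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-18", "2015-01-04", "2015-01-11", "2015-01-04"]),
    "AveragePrice": [1.10, 1.22, 1.24, 0.99],
    "region": ["Albany", "Albany", "Albany", "Boston"],
})

# .copy() makes albany_df its own dataframe instead of a slice of df,
# so in-place changes below clearly apply to albany_df only.
albany_df = df[df["region"] == "Albany"].copy()
albany_df.set_index("Date", inplace=True)
albany_df.sort_index(inplace=True)

print(albany_df)
```

The main df is left untouched here, and albany_df ends up indexed and sorted by date.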
albany_df["AveragePrice"].rolling(25).mean().plot()
And there we have it! A more useful summary of avocado prices for Albany over the years.
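To see why the sort mattered so much, here's a small sketch using a made-up four-row price series (hypothetical values, not the avocado data). A rolling mean works over rows in the order they are stored, not in date order, so the same data shuffled gives different averages for the same dates:

```python
import pandas as pd

# Made-up prices on four weekly dates, stored in date order.
prices = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18", "2015-01-25"]),
)

# Same data, but with the rows out of date order.
shuffled = prices.iloc[[2, 0, 3, 1]]

# A 2-row window averages each row with the row stored just before it.
sorted_ma = prices.rolling(2).mean()
shuffled_ma = shuffled.rolling(2).mean()

print(sorted_ma)
print(shuffled_ma.sort_index())
```

For 2015-01-25, the sorted series averages that week with the week before it (3.5), while the shuffled series averages it with whatever row happened to sit above it (2.5).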
Visualizations are cool, but what if we want to save our new, smoother data like above? We can give it a new column in our dataframe:
albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25).mean()
albany_df.head()
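One thing to expect in that head() output: the first 24 values of price25ma will be NaN, because by default a 25-row window doesn't produce a value until it has 25 rows to work with. Here's a quick sketch of the same behavior on a made-up five-value series with a window of 3, plus two possible ways to deal with it (dropping the empty rows, or relaxing the requirement with the min_periods parameter). Which one you want, if either, depends on what you're doing with the data:

```python
import pandas as pd

# Made-up values, just to show the shape of the result.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: no value until the window is full, so the first 2 rows are NaN.
strict = s.rolling(3).mean()

# min_periods=1 averages whatever rows are available so far.
lenient = s.rolling(3, min_periods=1).mean()

# Or simply drop the rows where the window wasn't full yet.
trimmed = strict.dropna()

print(strict)
print(lenient)
print(trimmed)
```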