Mathematical Modeling with Time Series Analysis: Modeling and Validation
Time Series Forecasting
Background
This article is the fourth in a series on time-series data. We began with data preparation techniques and then built a robust framework. In the previous article, we covered a wide range of forecasting techniques that should be explored before moving on to machine learning algorithms.
In this article, we apply all of these lessons to a real-life dataset. We will work through a time series forecasting project end to end: importing the dataset, analyzing and transforming the time series, training the model, and making predictions on new data. The steps of the project are as follows:
1. Problem Description
2. Data Preparation and Analysis
3. Set up an Evaluation Framework
4. Stationarity Check: Augmented Dickey-Fuller Test
5. ARIMA Models
6. Residual Analysis
7. Bias-Corrected Model
8. Model Validation
Problem Description
The problem is to predict the number of monthly airline passengers. We will use the Airline Passengers dataset for this exercise. The dataset records the total number of airline passengers over time, in units of thousands of passengers. There are 144 monthly observations from 1949 to 1960. Below is a sample of the first few rows of the dataset.
Sample dataset
You can download this dataset from .
Python Libraries for this Project
We need the following libraries for this project. Most of the names are self-explanatory, but don't worry if some of them are unfamiliar. Their usage will become clear as we go along.
import numpy
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from math import sqrt
from math import log
from math import exp
from scipy.stats import boxcox
from pandas import DataFrame
from pandas import Grouper
from pandas import Series
from pandas import concat
from pandas.plotting import lag_plot
from matplotlib import pyplot
from statsmodels.tsa.stattools import adfuller
# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13;
# on newer versions, import from statsmodels.tsa.arima.model instead.
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.arima_model import ARIMAResults
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.gofplots import qqplot
Data Preparation and Analysis
We will use the read_csv() function to load the time series data as a Series object, a one-dimensional array with a time label for each row. It is always a good idea to take a peek at the data to confirm that it has been loaded correctly.
# The squeeze=True argument was removed in pandas 2.0; squeeze the one-column frame instead.
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.head())
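To see what the loaded object looks like without the file at hand, here is a minimal sketch that reads a few inline rows the same way. The column names and the five rows are assumptions standing in for the real file, not a copy of it.

```python
from io import StringIO

from pandas import read_csv

# Stand-in for 'airline-passengers.csv'; column names and rows are assumed.
csv_text = StringIO(
    "Month,Passengers\n"
    "1949-01-01,112\n"
    "1949-02-01,118\n"
    "1949-03-01,132\n"
    "1949-04-01,129\n"
    "1949-05-01,121\n"
)

# Parse the first column as a date index, then reduce the one-column frame to a Series.
frame = read_csv(csv_text, header=0, index_col=0, parse_dates=True)
sample = frame.squeeze("columns")

print(type(sample).__name__)
print(sample.index.dtype)
print(sample.head())
```

The result is a Series whose index holds the dates, which is what the plotting and grouping steps in this section rely on.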
Let's begin the data analysis by looking at the summary statistics, which give a quick idea of the data distribution.
print(series.describe())
We can see that the number of observations matches our expectations. The mean is about 280, which we can treat as the level of this series. Other statistics, such as the standard deviation and the percentiles, suggest a large spread in the data.
As a next step, we will visualize the values on a line plot, a tool that can provide a lot of insight into the problem.
series.plot()
pyplot.show()
Here, the line plot suggests an increasing trend in airline passengers over time. We can also observe a systematic yearly seasonality in the travel pattern, and the seasonal signal appears to grow over time, which suggests a multiplicative relationship.
This hints that the data may not be stationary, so we can explore one or two levels of differencing to make it stationary before modeling.
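As a minimal sketch of what one or two levels of differencing do, the example below uses a synthetic series with a curved trend (an assumption for illustration, not the airline data). One difference leaves a linear trend; a second difference makes the series constant.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumed for illustration) with a quadratic trend.
index = pd.date_range("1949-01-01", periods=48, freq="MS")
t = np.arange(48)
demo = pd.Series(100 + 2.0 * t + 0.5 * t**2, index=index)

# First-order differencing: y[t] - y[t-1]. A linear trend still remains.
diff1 = demo.diff().dropna()

# Second-order differencing: difference the differenced series again.
diff2 = demo.diff().diff().dropna()

print(diff1.head())
print(diff2.head())
```

On real data the result is never exactly constant, which is why a formal stationarity test is still needed after differencing.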
We can confirm our assumption with yearly line plots.
For the following plot, we split the data into separate groups by year and drew a line plot for each year from 1949 to 1957. You can create this plot for any number of years.
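# Group the observations by calendar year (1949-1957) and draw one line plot per year.
# On pandas 2.2 and later, use freq='YE' in place of 'A'.
groups = series.loc['1949':'1957'].groupby(Grouper(freq='A'))
pyplot.figure()
i = 1
n_groups = len(groups)
for name, group in groups:
    pyplot.subplot(n_groups, 1, i)
    i += 1
    pyplot.plot(group)
pyplot.show()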
Looking at the line plots by year, we can see that the seasonality follows a yearly cycle, with a dip at each year-end and a rise in July and August. This pattern persists across the years, which again suggests adopting season-based modeling.
Let's explore the density of observations for further insight into the structure of our data.
pyplot.figure(1)
pyplot.subplot(211)
series.hist()
pyplot.subplot(212)
series.plot(kind='kde')
pyplot.show()
We can observe that the distribution is not Gaussian, which encourages us to explore log or power transforms of the data before modeling.
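A minimal sketch of such a transform follows, written out by hand so the formula is visible. The function name box_cox and the toy data are assumptions for illustration; scipy.stats.boxcox, imported earlier, applies the same family of transforms and can also estimate lambda from the data.

```python
import numpy as np

def box_cox(values, lam):
    """Box-Cox power transform of strictly positive values for a given lambda."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)  # limiting case as lambda approaches 0
    return (values ** lam - 1.0) / lam

# Assumed toy data: a rising level whose seasonal swing grows with the level.
t = np.arange(48)
level = 100.0 * np.exp(0.02 * t)
seasonal = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)
demo = level * seasonal

logged = box_cox(demo, 0)    # log transform
rooted = box_cox(demo, 0.5)  # square-root-like transform

# Compare the within-year range of the first and last year, before and after the log.
print(np.ptp(demo[:12]), np.ptp(demo[-12:]))
print(np.ptp(logged[:12]), np.ptp(logged[-12:]))
```

In this toy case the raw range grows from the first year to the last, while the logged range stays the same, which is the effect a log transform is expected to have on a multiplicative series.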
Let's analyze the monthly data by year to get an idea of the spread of observations within each year.
We will perform this analysis with a box-and-whisker plot.
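# Assumption: the grouping step is not shown in the source; here the full series is grouped by year.
# On pandas 2.2 and later, use freq='YE' in place of 'A'.
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.boxplot()
pyplot.show()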
The spread of the data (blue boxes) shows a growth trend over the years, which supports our assumption that the data is non-stationary.
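The same reading can be checked numerically. The sketch below, on assumed synthetic data rather than the airline series, summarizes each year with the quantities a box plot draws.

```python
import numpy as np
import pandas as pd

# Assumed synthetic monthly data: the level and the seasonal swing both grow each year.
index = pd.date_range("1949-01-01", periods=36, freq="MS")
t = np.arange(36)
demo = pd.Series((100 + 3.0 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)), index=index)

# Group by calendar year and summarise what each box in a box plot would show.
by_year = demo.groupby(demo.index.year)
summary = by_year.agg(["median", "min", "max"])
summary["iqr"] = by_year.quantile(0.75) - by_year.quantile(0.25)

print(summary)
```

A median that rises from one year to the next is the numerical counterpart of boxes that climb across the plot.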
Decomposing the time series gives more clarity on its components: level, trend, seasonality, and noise.
Based on our analysis so far, we have an intuition that our time series is multiplicative. So, we can decompose the series assuming a multiplicative model.
result = seasonal_decompose(series, model='multiplicative')
result.plot()
pyplot.show()
We can see that the trend and seasonality extracted from the series confirm our earlier findings: the series has a growing trend and yearly seasonality. The residuals are also interesting, showing periods of high variability in the early and later years of the series.
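To make the multiplicative model concrete, here is a rough sketch of a classical decomposition done by hand with pandas on assumed synthetic data. It illustrates the general idea (observed = trend × seasonal × residual) and is not a copy of the statsmodels implementation, which may differ in details.

```python
import numpy as np
import pandas as pd

# Assumed synthetic monthly series: rising level times a fixed 12-month seasonal factor.
index = pd.date_range("1949-01-01", periods=72, freq="MS")
t = np.arange(72)
true_seasonal = 1 + 0.3 * np.sin(2 * np.pi * t / 12)
demo = pd.Series((100 + 2.0 * t) * true_seasonal, index=index)

# 1. Trend: centred 2x12 moving average (12-month mean, then 2-month mean, re-centred).
trend = demo.rolling(12).mean().rolling(2).mean().shift(-6)

# 2. Detrend by division, because the components multiply.
detrended = demo / trend

# 3. Seasonal indices: mean detrended value per calendar month, scaled to average 1.
month_means = detrended.groupby(detrended.index.month).mean()
month_means = month_means / month_means.mean()
seasonal = pd.Series(month_means.loc[demo.index.month].to_numpy(), index=demo.index)

# 4. Residual: what trend and seasonality together do not explain.
resid = demo / (trend * seasonal)

print(seasonal.iloc[:12].round(3))
```

The trend is undefined for the first and last six months because the centred window needs data on both sides, so the residual is missing there as well.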