In Chapter 1 we discussed briefly the types of data that are generally available for empirical analysis, namely, time series, cross section, and panel.
In time series data we observe the values of one or more variables over a period of time (e.g., GDP for several quarters or years). In cross-section data, values of one or more variables are collected for several sample units, or entities, at the same point in time (e.g., crime rates for 50 states in the United States for a given year).
In panel data the same cross-sectional unit (say a family or a firm or a state) is surveyed over time. In short, panel data have space as well as time dimensions.
There are other names for panel data, such as pooled data (pooling of time series and cross-sectional observations), combination of time series and cross-section data, micropanel data, longitudinal data (a study over time of a variable or group of subjects), event history analysis (e.g., studying the movement over time of subjects through successive states or conditions), and cohort analysis (e.g., following the career path of 1965 graduates of a business school). Although there are subtle variations, all these names essentially connote movement over time of cross-sectional units. We will therefore use the term panel data in a generic sense to include one or more of these terms. And we will call regression models based on such data panel data regression models.
We will analyze two kinds of data sets in this chapter. An independently pooled cross section is obtained by sampling randomly from a large population at different points in time (usually, but not necessarily, different years). For instance, in each year, we can draw a random sample on hourly wages, education, experience, and so on, from the population of working people in the United States. Or, in every other year, we draw a random sample on the selling price, square footage, number of bathrooms, and so on, of houses sold in a particular metropolitan area. From a statistical standpoint, these data sets have an important feature: they consist of independently sampled observations. This was also a key aspect in our analysis of cross-sectional data: among other things, it rules out correlation in the error terms for different observations.
An independently pooled cross section differs from a single random sample in that sampling from the population at different points in time likely leads to observations that are not identically distributed. For example, distributions of wages and education have changed over time in most countries. As we will see, this is easy to deal with in practice by allowing the intercept in a multiple regression model, and in some cases the slopes, to change over time. We cover such models in Section 13.1. In Section 13.2, we discuss how pooling cross sections over time can be used to evaluate policy changes.
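As a minimal sketch of letting the intercept change over time, the Python snippet below regresses a wage variable on a constant and a single year dummy. The two years, the wage figures, and all variable names are hypothetical and chosen only for illustration; a real application would add regressors such as education and experience.

```python
# Hypothetical pooled cross section: two independent random samples of
# hourly wages, one per year. All numbers are invented for illustration.
sample = [
    (1978, 5.0), (1978, 6.0), (1978, 7.0),
    (1985, 8.0), (1985, 9.0), (1985, 10.0), (1985, 11.0),
]

# Year dummy: 1 for the later year, 0 for the base year.
d = [1.0 if year == 1985 else 0.0 for year, _ in sample]
y = [wage for _, wage in sample]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# OLS of y on a constant and the dummy, using the simple-regression formulas.
d_bar, y_bar = mean(d), mean(y)
shift = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y)) / sum(
    (di - d_bar) ** 2 for di in d
)
intercept = y_bar - shift * d_bar

print(f"base-year intercept: {intercept:.2f}")
print(f"intercept shift in the later year: {shift:.2f}")
```

With only a constant and a dummy on the right-hand side, the estimated intercept is the base-year sample mean and the dummy coefficient is the difference between the two yearly means, which is why the dummy can be read as a shift in the intercept.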
A panel data set, while having both a cross-sectional and a time series dimension, differs in some important respects from an independently pooled cross section. To collect panel data—sometimes called longitudinal data—we follow (or attempt to follow) the same individuals, families, firms, cities, states, or whatever, across time. For example, a panel data set on individual wages, hours, education, and other factors is collected by randomly selecting people from a population at a given point in time. Then, these same people are reinterviewed at several subsequent points in time. This gives us data on wages, hours, education, and so on, for the same group of people in different years.
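Some sources treat panel data as distinct from pooled time-series, cross-section data: data with a long time dimension and few cross-sections are called pooled time-series, cross-section data, while data with a wide cross-section and a short time dimension are called panel data.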
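Data can be stored in two ways: as unstacked data or as stacked data. In unstacked data, the observations for a given cross-section member and a given variable are kept together, but separate from the data for other variables and other cross-section members (Gao Tiemei (高铁梅), 2nd ed., p. 334).
Stacked data in turn come in two forms: stacked by cross-section member or stacked by date.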
Generally speaking, we distinguish between the two by noting that pooled time-series, cross-section data refer to data with relatively few cross-sections, where variables are held in cross-section-specific individual series, while panel data correspond to data with large numbers of cross-sections, with variables held in single series in stacked form.
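The difference between the storage layouts can be sketched in a few lines of Python. The variable name `inv`, the member names `firm_A` and `firm_B`, the years, and the values are all hypothetical; the sketch only illustrates the layouts and does not reproduce any particular software's internal format.

```python
# Hypothetical unstacked layout: one separate series for each
# (variable, cross-section member) pair. All values are invented.
years = [2001, 2002, 2003]
unstacked = {
    ("inv", "firm_A"): [10.0, 11.0, 12.0],
    ("inv", "firm_B"): [20.0, 21.0, 22.0],
}

# Stacked layout: a single series for the variable, with one row per
# (member, year) pair.
stacked = [
    (member, year, value)
    for (variable, member), series in unstacked.items()
    if variable == "inv"
    for year, value in zip(years, series)
]

# Stacked by cross-section member: all years of one member, then the next.
by_member = sorted(stacked, key=lambda row: (row[0], row[1]))

# Stacked by date: all members for one year, then the next year.
by_date = sorted(stacked, key=lambda row: (row[1], row[0]))

for row in by_member:
    print(row)
```

Both stacked orderings hold exactly the same rows; only the sort key differs, so converting between them is a matter of re-sorting.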
If each cross-sectional unit has the same number of time series observations, then such a panel is called a balanced panel. In the present example we have a balanced panel, as each company in the sample has 20 observations.
If the number of observations differs among panel members, we call such a panel an unbalanced panel. In this chapter we will largely be concerned with a balanced panel.
If we have the same T time periods for each of N cross-sectional units, we say that the data set is a balanced panel (p. 430).
Some panel data sets, especially on individuals or firms, have missing years for at least some cross-sectional units in the sample. In this case, we call the data set an unbalanced panel (p. 448).
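A balance check can be sketched in Python as follows. The function name `is_balanced` and the toy observations are hypothetical. The sketch uses the stricter of the two definitions given here, requiring every unit to be observed in the same set of periods rather than merely the same number of periods.

```python
def is_balanced(observations):
    """Return True if every unit is observed in the same set of periods.

    `observations` is an iterable of (unit, period) pairs.
    """
    periods_by_unit = {}
    for unit, period in observations:
        periods_by_unit.setdefault(unit, set()).add(period)
    # Union of all periods that appear anywhere in the data.
    all_periods = set().union(*periods_by_unit.values())
    return all(periods == all_periods for periods in periods_by_unit.values())


# Hypothetical panel: units "A" and "B", each observed in periods 1 to 3.
balanced = [(unit, t) for unit in ("A", "B") for t in (1, 2, 3)]

# Same panel with one missing year for unit "B".
unbalanced = [obs for obs in balanced if obs != ("B", 2)]

print(is_balanced(balanced))
print(is_balanced(unbalanced))
```

Comparing sets of periods rather than counts also flags a data set in which two units have equally many observations but for different years.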
