Descriptive Statistics Using Python
Descriptive Statistics
After data collection, most Psychology researchers summarise their data in one way or another. In this tutorial we will learn how to do descriptive statistics in Python. Python, being a general-purpose programming language, offers many ways to carry out descriptive statistics. Pandas makes data manipulation and summary statistics quite similar to how you would do them in R. I believe that the data frame in R is very intuitive to use, and Pandas offers a DataFrame object similar to R's. Also, many Psychology researchers may have experience with R.
Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start by using Pandas to obtain summary statistics and some variance measures. After that we continue with measures of central tendency (e.g., mean and median) using Pandas and NumPy. Pandas and NumPy have no dedicated functions for the harmonic, geometric, and trimmed means, so we use SciPy for those. Towards the end we learn how to get some measures of variability (e.g., the variance using Pandas).
import numpy as np
import pandas as pd  # needed later for pd.Series
from pandas import DataFrame as df
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean
Simulate response time data
In experimental psychology, response time is often the dependent variable. I simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data also have two independent variables (IVs): "iv1" has 2 levels and "iv2" has 3 levels. The data are simulated at the same time as the dataframe is created, and the first descriptive statistics are obtained using the method describe.
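The simulation step can be sketched as follows. This is a minimal sketch rather than the original code: the column names `rt`, `iv1`, and `iv2` follow the text, while the level labels, the number of trials per cell, and the normal distribution centred on 500 ms are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 design with 20 simulated trials per cell (assumed values).
rng = np.random.default_rng(42)
n_per_cell = 20
iv1_levels = ['a', 'b']        # assumed labels for the 2 levels of iv1
iv2_levels = ['x', 'y', 'z']   # assumed labels for the 3 levels of iv2

rows = []
for iv1 in iv1_levels:
    for iv2 in iv2_levels:
        # Response times in ms, drawn from an assumed normal distribution.
        rts = rng.normal(loc=500, scale=50, size=n_per_cell)
        rows.extend({'iv1': iv1, 'iv2': iv2, 'rt': rt} for rt in rts)

data = pd.DataFrame(rows)

# describe() summarises the numeric column(s): count, mean, std, min, quartiles, max.
summary = data.describe()
print(summary)
```

Because `iv1` and `iv2` hold labels rather than numbers, only `rt` appears in the summary table.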
Descriptive statistics using Pandas
data.describe()
Pandas outputs summary statistics with this method. The output is a table, as you can see below.
Output table of data.describe()
Typically, a researcher is interested in the descriptive statistics for each level of the IVs. Therefore, I group the data by these. Using describe on the grouped data gives aggregated statistics for each level of each IV. As can be seen from the output, it is somewhat hard to read. Note that the method unstack is used to get the mean, standard deviation (std), etc. as columns, which makes the output somewhat easier to read.
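A minimal sketch of the grouping step, assuming a small stand-in dataframe with the columns `iv1`, `iv2`, and `rt` (the values and level labels are invented for illustration). In recent pandas versions, describe on a grouped column already returns the statistics as columns, so the extra unstack call may only be needed on older versions.

```python
import numpy as np
import pandas as pd

# Stand-in data: 2 x 3 design with 10 assumed trials per cell.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'iv1': np.repeat(['a', 'b'], 30),
    'iv2': np.tile(np.repeat(['x', 'y', 'z'], 10), 2),
    'rt': rng.normal(500, 50, 60),
})

# Group by both independent variables, then describe the response times.
grouped_data = data.groupby(['iv1', 'iv2'])
grouped_descr = grouped_data['rt'].describe()
print(grouped_descr)
```

Each row of the result is one combination of `iv1` and `iv2`, and each column is one summary statistic.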
Output from describe on the grouped data
Central tendency
Often we want to know something about the “average” or “middle” of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode can also be obtained using Pandas, but for the mode and the trimmed mean I will use functions from SciPy.
Mean
There are at least two ways of doing this using our grouped data. First, Pandas has the method mean:
grouped_data['rt'].mean().reset_index()
But the method aggregate in combination with NumPy's mean can also be used:
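A minimal sketch of the aggregate variant next to the mean method, assuming a tiny stand-in dataframe (the values are invented). Recent pandas versions may emit a warning suggesting the string `'mean'` in place of `np.mean`; the result is the same here.

```python
import numpy as np
import pandas as pd

# Stand-in data with two cells (assumed values, in ms).
data = pd.DataFrame({
    'iv1': ['a', 'a', 'b', 'b'],
    'iv2': ['x', 'x', 'y', 'y'],
    'rt': [400.0, 500.0, 600.0, 700.0],
})
grouped_data = data.groupby(['iv1', 'iv2'])

# Way 1: the built-in mean method of the grouped column.
via_method = grouped_data['rt'].mean().reset_index()

# Way 2: aggregate with NumPy's mean function.
via_aggregate = grouped_data['rt'].aggregate(np.mean).reset_index()
print(via_aggregate)
```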
Both methods give the same output, but the aggregate method has some advantages that I will explain later.
Output of mean and aggregate using NumPy – Mean
Geometric & Harmonic mean
Sometimes the geometric or harmonic mean can be of interest. These two descriptives can be obtained using the method apply with the functions gmean and hmean (from SciPy) as arguments. That is, there is no method in Pandas or NumPy that enables us to calculate geometric and harmonic means.
Geometric
grouped_data['rt'].apply(gmean, axis=None).reset_index()
Harmonic
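A minimal sketch of the harmonic mean per group, following the same apply pattern as the geometric mean. The stand-in data are invented, and importing `hmean` and `gmean` from `scipy.stats` (rather than `scipy.stats.mstats`) is an assumption made to keep the example short; the `axis` argument is left at its default because each group is one-dimensional.

```python
import numpy as np
import pandas as pd
from scipy.stats import gmean, hmean

# Stand-in data: one factor with two levels (assumed values).
data = pd.DataFrame({
    'iv1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'rt': [1.0, 2.0, 4.0, 2.0, 2.0, 2.0],
})
grouped_data = data.groupby('iv1')

# Harmonic mean: n divided by the sum of reciprocals.
harmonic = grouped_data['rt'].apply(hmean).reset_index()

# Geometric mean, shown for comparison: n-th root of the product.
geometric = grouped_data['rt'].apply(gmean).reset_index()
print(harmonic)
print(geometric)
```

For group `a` the harmonic mean is 3 / (1 + 1/2 + 1/4) and the geometric mean is the cube root of 8; for group `b`, where all values are equal, both means equal that value.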
Trimmed mean
Trimmed means are used at times. Pandas and NumPy do not seem to have methods for obtaining the trimmed mean. However, we can use the function trim_mean from SciPy. By using apply on our grouped data we can call trim_mean with an argument of 0.1, which removes 10 % of the values from each tail (the smallest and the largest) before the mean is calculated.
trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
trimmed_mean.reset_index()
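To see what the trimming does, here is a small worked example on an invented array with one extreme value. With ten observations and a proportion of 0.1, `trim_mean` drops one value from each tail before averaging.

```python
import numpy as np
from scipy.stats import trim_mean

# Invented response times with one outlier (100).
rts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

plain = rts.mean()              # the outlier pulls the plain mean upwards
trimmed = trim_mean(rts, 0.1)   # drops the smallest and the largest value first

print(plain)    # 14.5
print(trimmed)  # 5.5
```

The trimmed mean is the mean of the values 2 to 9, so the single outlier no longer dominates the result.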
Output from the mean values above (trimmed, harmonic, and geometric means):
Trimmed Mean
Harmonic Mean
Geometric Mean
Median
As with the mean, there are at least two ways of obtaining the median:
grouped_data['rt'].median().reset_index()
grouped_data['rt'].aggregate(np.median).reset_index()
Output of aggregate using NumPy – Median.
Mode
Pandas has a mode method for DataFrame objects. However, it cannot be called directly on the grouped data, so I will use mode from SciPy:
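A minimal sketch of getting the mode per group with SciPy, assuming invented data in which each group has one repeated value. The helper `first_mode` is hypothetical and not part of any library: it wraps `scipy.stats.mode` and flattens the result, because the shape of the returned mode has differed between SciPy versions.

```python
import numpy as np
import pandas as pd
from scipy.stats import mode

# Stand-in data: each group contains one repeated response time (assumed values).
data = pd.DataFrame({
    'iv1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'rt': [300.0, 300.0, 450.0, 500.0, 520.0, 520.0],
})
grouped_data = data.groupby('iv1')

def first_mode(values):
    """Hypothetical helper: return the most frequent value as a plain scalar."""
    result = mode(np.asarray(values))
    # np.ravel copes with both scalar and array-valued results.
    return np.ravel(result.mode)[0]

modes = grouped_data['rt'].apply(first_mode).reset_index()
print(modes)
```

With continuous response times every value is usually unique, so the mode is mostly informative after rounding or binning.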
Most of the time I would probably want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy functions. In the example below the median, standard deviation (std), and mean are all in the same output. Note that we have to add the trimmed means afterwards.
descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index()
descr['trimmed_mean'] = pd.Series(trimmed_mean.values, index=descr.index)
descr
Output of aggregate using some of the methods.
Measures of variability
Central tendency (e.g., the mean and median) is not the only type of summary statistic that we want to calculate. When analysing data we also want a measure of the variability of the data.
Standard deviation
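A minimal sketch of the standard deviation per group, assuming invented stand-in data. Note that the pandas std method uses the sample formula (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population formula (`ddof=0`), so the two can give different numbers for the same data.

```python
import numpy as np
import pandas as pd

# Stand-in data: one factor with two levels (assumed values).
data = pd.DataFrame({
    'iv1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'rt': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})
grouped_data = data.groupby('iv1')

# Sample standard deviation (ddof=1), the pandas default.
sd = grouped_data['rt'].std().reset_index()

# Population standard deviation (ddof=0), matching NumPy's default.
population_sd = grouped_data['rt'].std(ddof=0).reset_index()
print(sd)
print(population_sd)
```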
Interquartile range
Note that unstack() is used here to get the quantiles as columns, which makes the output easier to read.
grouped_data['rt'].quantile([.25, .5, .75]).unstack()
IQR
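The quantile call gives the quartiles, and the interquartile range itself is the difference between the third and the first quartile. A minimal sketch, assuming invented stand-in data:

```python
import numpy as np
import pandas as pd

# Stand-in data: one factor with two levels (assumed values).
data = pd.DataFrame({
    'iv1': ['a'] * 5 + ['b'] * 5,
    'rt': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped_data = data.groupby('iv1')

# Quartiles per group, with the quantiles as columns.
quartiles = grouped_data['rt'].quantile([.25, .5, .75]).unstack()

# IQR = third quartile minus first quartile.
iqr = quartiles[0.75] - quartiles[0.25]
print(quartiles)
print(iqr)
```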
Variance
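A minimal sketch of the variance per group using the pandas var method, assuming invented stand-in data. Like std, var uses the sample formula (`ddof=1`) by default, so the variance equals the squared sample standard deviation.

```python
import numpy as np
import pandas as pd

# Stand-in data: one factor with two levels (assumed values).
data = pd.DataFrame({
    'iv1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'rt': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})
grouped_data = data.groupby('iv1')

# Sample variance per group.
variance = grouped_data['rt'].var().reset_index()

# Sample standard deviation per group, for comparison.
sd = grouped_data['rt'].std().reset_index()
print(variance)
```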
Variance
That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy make these calculations almost as easy as doing them in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can pass in other methods or functions to obtain other types of descriptives.