Python Histogram Plotting: NumPy, Matplotlib, Pandas, and Seaborn
In this tutorial, you’ll be equipped to make production-quality, presentation-ready Python histogram plots with a range of choices and features.
If you have introductory to intermediate knowledge of Python and statistics, you can use this article as a one-stop shop for building and plotting histograms in Python using libraries from its scientific stack, including NumPy, Matplotlib, Pandas, and Seaborn.
A histogram is a great tool for quickly assessing a probability distribution, and it is intuitively understood by almost any audience. Python offers a handful of different options for building and plotting histograms. Most people know a histogram by its graphical representation, which is similar to a bar graph.
This article will guide you through creating basic histogram plots as well as more complex ones. Here’s what you’ll cover:
Building histograms in pure Python, without the use of third-party libraries
Constructing histograms with NumPy to summarize the underlying data
Plotting the resulting histogram with Matplotlib, Pandas, and Seaborn
Histograms in Pure Python
When you are preparing to plot a histogram, it is simplest to not think in terms of bins but rather to report how many times each value appears (a frequency table). A Python dictionary is well-suited for this task:
>>> # Need not be sorted, necessarily
>>> a = (0, 1, 1, 1, 2, 3, 7, 7, 23)

>>> def count_elements(seq) -> dict:
...     """Tally elements from `seq`."""
...     hist = {}
...     for i in seq:
...         hist[i] = hist.get(i, 0) + 1
...     return hist

>>> counted = count_elements(a)
>>> counted
{0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}
count_elements() returns a dictionary with unique elements from the sequence as keys and their frequencies (counts) as values. Within the loop over seq, hist[i] = hist.get(i, 0) + 1 says, “for each element of the sequence, increment its corresponding value in hist by 1.” The call hist.get(i, 0) falls back to 0 the first time an element is seen.
In fact, this is precisely what is done by the collections.Counter class from Python’s standard library, which subclasses a Python dictionary and overrides its .update() method.
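The listing that originally accompanied this paragraph is not present, so here is a minimal sketch of how collections.Counter could be applied to the same sequence. The name recounted is taken from the comparison that follows; building it with Counter(a) is an assumption.

```python
from collections import Counter

# Same sample sequence as in the frequency-table example
a = (0, 1, 1, 1, 2, 3, 7, 7, 23)

# Counter tallies hashable items, much like the handmade count_elements()
# (assumption: this is how `recounted` was created)
recounted = Counter(a)
print(recounted)
print(recounted.most_common(1))
```

Because Counter is a dict subclass, the result can be compared directly against a plain dictionary of counts.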
You can confirm that your handmade function does virtually the same thing as collections.Counter by testing for equality between counted and recounted, the Counter built from the same sequence a:
>>> recounted.items() == counted.items()
True
Technical Detail: The counting routine behind collections.Counter defaults to a more highly optimized C function if it is available. Within the Python function count_elements(), one micro-optimization you could make is to declare get = hist.get before the for loop. This would bind the method to a variable for faster calls within the loop.
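As a rough illustration of that micro-optimization, the sketch below binds hist.get to a local name before the loop. The function name count_elements_bound is hypothetical, and any speedup is likely to be small and dependent on the interpreter version.

```python
def count_elements_bound(seq) -> dict:
    """Tally elements from `seq`, binding `hist.get` once (hypothetical variant)."""
    hist = {}
    get = hist.get  # look up the bound method a single time
    for i in seq:
        hist[i] = get(i, 0) + 1
    return hist


a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
counted_bound = count_elements_bound(a)
print(counted_bound)
```

The result is identical to the unoptimized version; only the attribute lookup inside the loop is avoided.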
It can be helpful to build simplified functions from scratch as a first step to understanding more complex ones. Let’s further reinvent the wheel a bit with an ASCII histogram function, ascii_histogram(), that takes advantage of Python’s output formatting.
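The definition of ascii_histogram() is not present in this copy of the article. The following is a minimal sketch consistent with the description in the text: it sorts the dictionary keys and prints one row of plus signs per value. The helper histogram_lines() and the exact row format are assumptions made for illustration.

```python
def count_elements(seq) -> dict:
    """Tally elements from `seq`."""
    hist = {}
    for i in seq:
        hist[i] = hist.get(i, 0) + 1
    return hist


def histogram_lines(seq) -> list:
    """Build one text row per unique value (hypothetical helper)."""
    counted = count_elements(seq)
    # sorted() on a dict yields its keys in sorted order
    return ["{0} {1}".format(k, "+" * counted[k]) for k in sorted(counted)]


def ascii_histogram(seq) -> None:
    """Print a horizontal frequency-table plot."""
    for line in histogram_lines(seq):
        print(line)


ascii_histogram((0, 1, 1, 1, 2, 3, 7, 7, 23))
```

Splitting the row construction from the printing keeps the formatting logic easy to check in isolation.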
This function creates a sorted frequency plot where counts are represented as tallies of plus (+) symbols. Calling sorted() on a dictionary returns a sorted list of its keys, and then you access the corresponding value for each with counted[k]. To see this in action, you can create a slightly larger dataset with Python’s random module:
>>> # No NumPy ... yet
>>> import random
>>> random.seed(1)

>>> vals = [1, 3, 4, 6, 8, 9, 10]
>>> # Each number in `vals` will occur between 5 and 15 times.
>>> freq = (random.randint(5, 15) for _ in vals)

>>> data = []
>>> for f, v in zip(freq, vals):
...     data.extend([v] * f)

>>> ascii_histogram(data)
1 +++++++
3 ++++++++++++++
4 ++++++
6 +++++++++
8 ++++++
9 ++++++++++++
10 ++++++++++++
Here, you’re simulating plucking from vals with frequencies given by freq (a generator expression). The resulting sample data repeats each value from vals a certain number of times between 5 and 15.
Note: random.seed() is used to seed, or initialize, the underlying pseudorandom number generator (PRNG) used by random. It may sound like an oxymoron, but this is a way of making random data reproducible and deterministic. That is, if you copy the code here as is, you should get exactly the same histogram because the first call to random.randint() after seeding the generator will produce identical “random” data using the Mersenne Twister.
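To make the reproducibility point concrete, here is a small sketch that reseeds the generator and draws the same sequence twice. The helper name draw() is hypothetical.

```python
import random


def draw(seed: int, n: int = 5) -> list:
    """Seed the module-level PRNG, then draw `n` integers in [5, 15]."""
    random.seed(seed)
    return [random.randint(5, 15) for _ in range(n)]


first = draw(1)
second = draw(1)  # same seed, so the same "random" values
print(first == second)
```

The specific values may vary between Python versions, but two runs with the same seed on the same interpreter should match.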
Building Up From the Base: Histogram Calculations in NumPy
Thus far, you have been working with what could best be called “frequency tables.” But mathematically, a histogram is a mapping of bins (intervals) to frequencies. More technically, it can be used to approximate the probability density function (PDF) of the underlying variable.
Moving on from the “frequency table” above, a true histogram first “bins” the range of values and then counts the number of values that fall into each bin. This is what NumPy’s histogram() function does, and it is the basis for other functions you’ll see here later in Python libraries such as Matplotlib and Pandas.
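Before turning to NumPy, the bin-then-count idea can be sketched in pure Python. This is an illustrative approximation under simple assumptions (equal-width bins spanning the minimum to the maximum, with the maximum placed in the last bin), not NumPy's actual implementation. The function name bin_counts() is hypothetical.

```python
def bin_counts(values, n_bins: int = 10) -> list:
    """Count how many values fall into each of `n_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            # Clamp so that the maximum lands in the last bin
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
print(bin_counts(a))
```

With ten bins over the range 0 to 23, each bin is 2.3 wide, so the values 0, 1, 1, 1, and 2 all share the first bin.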
Consider a sample of floats drawn from the Laplace distribution (referred to as d in the code that follows). This distribution has fatter tails than a normal distribution and has two descriptive parameters (location and scale).
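The sampling code is not present in this copy, so the following is only a sketch of how such a sample might be drawn with NumPy. The seed, loc, scale, and size values are assumptions chosen for illustration, and the counts shown elsewhere in this article came from the author's own sample, so they will not match.

```python
import numpy as np

# Assumed parameters: location 15, scale 3, 500 draws, arbitrary seed
rng = np.random.default_rng(444)
d = rng.laplace(loc=15, scale=3, size=500)

# Show a few values and basic summary statistics
print(d[:5])
print(d.min(), d.max())
```

The location parameter sets the center of the distribution, and the scale parameter controls how spread out the values are.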
In this case, you’re working with a continuous distribution, and it wouldn’t be very helpful to tally each float independently, down to the umpteenth decimal place. Instead, you can bin or “bucket” the data and count the observations that fall into each bin. The histogram is the resulting count of values within each bin:
>>> hist, bin_edges = np.histogram(d)

>>> hist
array([ 1,  0,  3,  4,  4, 10, 13,  9,  2,  4])

>>> bin_edges
array([ 3.217,  5.199,  7.181,  9.163, 11.145, 13.127, 15.109, 17.091,
       19.073, 21.055, 23.037])
This result may not be immediately intuitive. np.histogram() by default uses 10 equally sized bins and returns a tuple of the frequency counts and corresponding bin edges. They are edges in the sense that there will be one more bin edge than there are members of the histogram: here, 11 edges for 10 counts.
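A quick way to check the relationship between counts and edges is to compare the sizes of the two returned arrays. This sketch uses its own small sample, with an assumed seed and parameters, rather than the data shown above.

```python
import numpy as np

# Assumed sample for illustration only
rng = np.random.default_rng(0)
sample = rng.laplace(loc=15, scale=3, size=50)

hist, bin_edges = np.histogram(sample)  # bins=10 by default
print(hist.size, bin_edges.size)

# Each bin's width is the gap between consecutive edges
widths = np.diff(bin_edges)
print(widths)
```

np.diff() on the edges gives the bin widths, which are all equal when the default equal-width binning is used.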
Technical Detail: All but the last (rightmost) bin are half-open. That is, all bins but the last are [inclusive, exclusive), and the final bin is [inclusive, inclusive].
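The half-open behavior can be seen with a tiny example in which the data values sit exactly on the bin edges. The input values here are made up for illustration.

```python
import numpy as np

# Two bins: [1, 2) and [2, 3]; the last bin includes its right edge
counts, edges = np.histogram([1, 2, 3], bins=[1, 2, 3])
print(counts)
```

The value 2 is excluded from the first bin and counted in the second, while 3 is still counted because the final bin is closed on the right.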
A very condensed breakdown of how the bins are constructed looks like this:
>>> import numpy as np

>>> # The leftmost and rightmost bin edges of the tuple `a` defined earlier
>>> first_edge, last_edge = min(a), max(a)

>>> n_equal_bins = 10  # NumPy's default
>>> bin_edges = np.linspace(start=first_edge, stop=last_edge,
...                         num=n_equal_bins + 1, endpoint=True)
...
>>> bin_edges
array([ 0. ,  2.3,  4.6,  6.9,  9.2, 11.5, 13.8, 16.1, 18.4, 20.7, 23. ])
The case above makes a lot of sense: 10 equally spaced bins over a peak-to-peak range of 23 means intervals of width 2.3.
From there, the function delegates to either np.bincount() or np.searchsorted(). bincount() itself can be used to effectively construct the “frequency table” that you started off with here, with the distinction that values with zero occurrences are included.
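The accompanying listing is not present in this copy, so here is a minimal sketch of how np.bincount() might be used on the earlier sequence, along with a histogram called hist that reproduces the same counts. Using explicit integer edges with np.arange() is an assumption made so that every bin has width exactly 1.0.

```python
import numpy as np

a = np.array([0, 1, 1, 1, 2, 3, 7, 7, 23])

# One count per integer from 0 through a.max(), zeros included
bcounts = np.bincount(a)
print(bcounts)

# Equivalent histogram with unit-width bins: edges 0, 1, ..., 24
edges = np.arange(a.max() + 2)
hist, _ = np.histogram(a, bins=edges)
same = np.array_equal(hist, bcounts)
print(same)

# Drop the zero entries to recover the original frequency table
nonzero = {int(v): int(c) for v, c in enumerate(bcounts) if c}
print(nonzero)
```

Filtering out the zero counts gives back the same mapping that the pure-Python frequency table produced.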
Note: A hist array that matches bincount() is really using bins of width 1.0 rather than “discrete” counts. Hence, this only works for counting integers, not floats such as [3.9, 4.1, 4.15].
Visualizing Histograms with Matplotlib and Pandas
Now that you’ve seen how to build a histogram in Python from the ground up, let’s see how other Python packages can do the job for you. Matplotlib provides the functionality to visualize Python histograms out of the box with a versatile wrapper around NumPy’s histogram().
