Okay, so I want to spend a little time on
the term, "Big Data" and I'm not too
concerned with any sort of technical
definition of it, of the term.
Because it probably doesn't exist but I
want to arm you with some of the language
that people use when they describe Big
Data.
So, you know, you can speak intelligently about it when asked.
Okay?
So probably the main thing to recognize is this notion of the three V's of Big Data, which are volume, velocity, and variety.
And we talked a little bit about this in
a previous segment.
So just to repeat, you know, volume is
the size of the data.
And you measure it in bytes, or number of rows, or number of objects, or what have you; sort of the vertical dimension of the data.
Velocity, what I'll say here, is the latency of the data processing relative to the demand for interactivity, and that's maybe a mouthful.
But what I mean by that is, you know, how fast is the data coming relative to how fast it needs to be consumed.
And so there are a lot of applications for which interactive response times are increasingly important, if not strictly required.
Okay?
And so when this becomes the bottleneck, when this becomes the challenge, then velocity starts to become pretty relevant.
And the one that I think is really pretty interesting, that is near and dear to my heart and to my research, is the notion of variety.
And so here the problem is, you know, an increasing number of different data sources are being applied to any particular task. So you need to pull out, you know, ASCII files, as well as download data from the web, as well as pull data out of some database, as well as use some of these SQL systems, and so on.
And the integration of all these data sources is a pretty significant problem, and can end up occupying a lot of your time.
So I made this point a couple of segments ago, about researchers who spend nearly 90% of their time, quote, handling data.
This is where a lot of that time is
going.
That's the notion of variety.
Okay.
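To make that variety point concrete, here is a minimal sketch, in Python, of what pulling from a flat file, the web, and a SQL database can look like. All the names here (data.csv, the URL, observations.db, the readings table) are hypothetical placeholders, not real sources from the course.

```python
# A sketch of the "variety" problem: one analysis, several kinds of sources.
# All names here (data.csv, the URL, observations.db) are hypothetical.
import csv
import json
import sqlite3
import urllib.request

# 1. A flat ASCII/CSV file on disk.
with open("data.csv", newline="") as f:
    csv_rows = list(csv.DictReader(f))

# 2. A JSON feed downloaded from the web.
with urllib.request.urlopen("https://example.org/feed.json") as resp:
    web_records = json.load(resp)

# 3. A relational database queried with SQL.
conn = sqlite3.connect("observations.db")
db_rows = conn.execute("SELECT station, time, temp FROM readings").fetchall()
conn.close()

# The hard part is rarely any single read; it is reconciling the formats,
# units, and semantics of all three before the analysis can even begin.
print(len(csv_rows), len(web_records), len(db_rows))
```

The reads themselves are a few lines each; the integration work the lecture describes is everything that has to happen after them.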
So all three of these are relevant in performing data science tasks. Alright, let me give you another notion, and I'm going to go back to using science examples.
And you've seen some of these before, but
if you make a plot with the number of bytes on the Y-axis versus the number of data sources on the X-axis (maybe columns of data in a single table, or columns of data across multiple tables, or the number of distinct data sources), you can map out different fields of study, or different problems, and see where they lie.
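As an illustrative sketch, a plot like that can be mocked up in a few lines of matplotlib; the coordinates below are invented to convey the idea, not real measurements of any field.

```python
# Illustrative mock-up of the volume-vs-variety plot; the coordinates are
# invented to convey the idea, not real measurements of any field.
import matplotlib.pyplot as plt

fields = {
    "Astronomy":      (3, 1e15),   # few source types, huge volume
    "Ocean sciences": (30, 1e13),  # many instrument types, moderate volume
    "Life sciences":  (50, 1e12),  # very high variety
}

for name, (n_sources, n_bytes) in fields.items():
    plt.scatter(n_sources, n_bytes)
    plt.annotate(name, (n_sources, n_bytes))

plt.xlabel("Number of distinct data sources")
plt.ylabel("Number of bytes")
plt.yscale("log")
plt.show()
```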
And so typically astronomy has been challenged by the sheer volume of data. So that's right about here, high on the Y-axis, but, you know, the number of actual sources in astronomy is not too high.
There are telescopes, there are these spectral imagers, and then there are simulations of the galaxy, and so that's relatively few.
In, say, the ocean sciences, and certainly in the life sciences (although I only show one example here), the variety is really more of a challenge. The actual sheer scale is not as high as, you know, the hundreds of petabytes that could be generated by these telescope projects, like a large survey telescope.
But the number of different types of instruments you can use to acquire data is large and ever-growing, right?
So you have these glider systems that
will go out for months at a time and kind
of porpoise through the water.
You have autonomous underwater vehicles that are more for short-term missions.
You know, there are oceanographic cruises where they deploy these conductivity-temperature-depth (CTD) instruments that can take profiles of the water, right?
So this is, you know, at a fixed X and Y with a varying Z and a varying T, that is, at a varying depth and a varying time, while the gliders are varying in all four dimensions.
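One way to see that difference is in the shape of the records each platform produces. Here is a minimal sketch, with invented field names and values rather than any real schema:

```python
# Sketch of the two sampling geometries; field names and values are invented.
from dataclasses import dataclass

@dataclass
class Sample:
    x: float      # longitude
    y: float      # latitude
    z: float      # depth (m)
    t: float      # time (s since some epoch)
    temp: float   # the measured quantity, e.g. temperature

# A CTD cast: x and y stay fixed for the whole profile; z and t vary.
ctd_cast = [
    Sample(x=-124.5, y=47.0, z=float(d), t=1000.0 + d, temp=8.0)
    for d in range(0, 100, 10)
]

# A glider track: all four coordinates vary as it porpoises through the water.
glider_track = [
    Sample(x=-124.5 + 0.01 * i, y=47.0 + 0.005 * i,
           z=50.0 * (i % 2), t=1000.0 + 60.0 * i, temp=8.0)
    for i in range(10)
]
```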
You have these simulations that are
probably one of the largest sources of information, right.
So these can be at very large scales, on the order of the entire northern hemisphere or the whole eastern Pacific, or there can be models of a particular bay, or an inlet or estuary connected to a river or to the open ocean; a much smaller-scale thing.
So, there's a lot of diversity there.
And I say stations to mean these sort of fixed stations, where particular sensors are deployed at one location and just measure across time.
ADCP is an Acoustic Doppler Current Profiler, where they're using sound waves, recording the time the sound waves take to bounce off particulate matter in the ocean.
And they can therefore measure velocity. And so this gives you an entire profile
of the velocities in the ocean.
And you can mount these on the sea floor pointing upwards, or you can mount them on the bottom of a boat pointing downwards, and so on.
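As a back-of-the-envelope sketch of the idea (not a real ADCP processing pipeline), the echo delay gives the range of each measurement, and the Doppler shift of the returned sound gives the water velocity at that range; the numbers below are illustrative.

```python
# Back-of-the-envelope ADCP idea: echo delay gives range, Doppler shift
# gives radial water velocity. Numbers are illustrative, not a real instrument.
SOUND_SPEED = 1500.0  # m/s, a typical speed of sound in seawater

def echo_range(delay_s: float) -> float:
    """Distance to the scatterer: the ping travels out and back."""
    return SOUND_SPEED * delay_s / 2.0

def doppler_velocity(f_transmit_hz: float, f_shift_hz: float) -> float:
    """Radial velocity of the scatterer from the Doppler shift it imparts."""
    return SOUND_SPEED * f_shift_hz / (2.0 * f_transmit_hz)

# E.g. a 600 kHz ping whose echo returns 20 ms later, shifted by 40 Hz:
print(echo_range(0.020))              # 15.0 m away
print(doppler_velocity(600e3, 40.0))  # moving at 0.05 m/s
```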
And then there are satellite images that measure, sort of, sea color and wave breaking as well.
Okay.
So fine.
So just a little more on the term Big Data.
A quote, the notion that Mike Franklin at UC Berkeley uses, which I like, is that, you know, "Big Data is really relative, right; it's any data that's expensive to manage and hard to extract value from."
So it's not so much about a particular cutoff, you know, what makes it big: is petabyte scale big, versus terabyte scale small, or gigabyte scale very small since it fits in memory on your machine?
You know, not necessarily.
It depends on what you're trying to do
with it and it depends on what sort of resources and infrastructure you have to bring to bear on the problem.
And so in some sense, difficult data is perhaps what big data really means.
It's not so much really big, it's about
being challenging, okay.
This is really important to remember,
that big is relative.
So let me give you a little bit of the history of the term Big Data.
The earliest use I could find was from Erik Larson in 1989, in Harper's Magazine, in a piece that eventually went into a book, where he says: "The keepers of Big Data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended."
So his point was not really about technology at all.
It was just a notion that data is being collected for one purpose and being reused for another.
Which is a theme that I mentioned in the very first segment of this course and that we'll come back to over and over again. And so I think he had it right in that sense; his real point was that, you know, consumers' private data was starting to be commoditized.
Which was absolutely true, and fairly prescient at the time, since it's become a big issue now. And it's especially impressive given that this predates the rise of the internet, and already foreshadows very topical issues in Big Data: the ethics and privacy and sensitivity and so forth that we'll talk a little bit about.
But this isn't quite what we mean by Big Data nowadays, typically, because it didn't have that technology aspect to it. It didn't talk about the challenge of actually managing these data sets, alright.
So another point of reference: more recent reports from these consulting firms get credit for this notion of the three V's, but this is really the original source. This was a report from Gartner in 2001, written by a guy named Doug Laney. And so it talks about volume, velocity, and variety, which we've said, but let me just give you a chance to look at these quotes.
You know, and so on volume, he's really talking about, sort of, business-to-business interactions.
If you think about 2001, this is around
the dot com boom.
And so everyone was trying to figure out what this new era of technology was going to get them, what the internet was really going to give them beyond just, sort of, putting up a webpage and serving it out to your customers. How are you going to be able to interact with your supply chain, or your vendors, and so on? Okay, and so that's what he means by this notion of e-channels.
But, you know, up to ten times the quantity of data about an individual transaction may be collected.
You know, absolutely true: this data exhaust, a point we've made a couple of times, is giving rise to a larger scale of data being collected.
You know, with velocity, well, he's talking about increased point-of-interaction speed, right.
So this is that need for interactivity.
It didn't used to be so required, but as the velocity of all business and all transactions has sort of increased, so have the constraints on the infrastructure used to process it.
And so on variety, I like this one a lot. "Through 2003/04", right, so he's being fairly conservative about how far out he wanted to predict.
"No greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics."
So this is great; you know, you could have said this through 2015 and been arguably correct. This problem has not gone away.
So, another point in the history of the term Big Data: there was a series of talks, a lot of work, by John Mashey, who was formerly the chief scientist at SGI, and who would talk about Big Data being the next wave of "infrastress".
And so what he meant by infrastress was: what's really going to drive the technology forward, where we're going to feel the pain. And his point was that the I/O interfaces were where it was going to be tough.
So in particular, disk capacities were growing incredibly fast, and still are, and the latencies are not keeping pace, right. So you can go down to a local store and buy a three-terabyte drive for probably $200.
But the rate at which you can pull data off that is essentially the same as it
has been for many, many years.
And so now it takes you hours to actually read every byte of data on that disk that you stored.
And so this is a problem for the actual analysis you can do: you know, we can keep all the data, and that's really cheap, but we cannot do anything with it because the pipe is so small.
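To put rough numbers on that point: assuming, say, a three-terabyte drive and roughly 150 megabytes per second of sustained sequential read throughput (both illustrative figures, not benchmarks of any particular drive), reading the disk once takes hours.

```python
# Rough arithmetic on the capacity-vs-bandwidth gap; both figures are
# illustrative assumptions, not benchmarks of any particular drive.
capacity_bytes = 3e12    # a 3 TB consumer drive
read_rate_bps = 150e6    # ~150 MB/s sustained sequential read

seconds = capacity_bytes / read_rate_bps
print(f"{seconds / 3600:.1f} hours to read every byte once")  # ~5.6 hours
```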