Okay, so I want to spend a little time on
the term, "Big Data" and I'm not too
concerned with any sort of technical
definition of it, of the term.
Because it probably doesn't exist but I
want to arm you with some of the language
that people use when they describe Big
Data.
So, you know, you can speak intelligently about it when asked.
Okay?
So probably the main thing to recognize is this notion of the three V's of Big Data, which are volume, velocity, and variety.
And we talked a little bit about this in
a previous segment.
So just to repeat, you know, volume is
the size of the data.
And you measure it in bytes, or number of rows, or number of objects, or what have you; sort of the vertical dimension of the data.
Velocity, what I'll say here, is the latency of the data processing relative to the demand for interactivity, and that's maybe a mouthful.
But what I mean by that is, you know, how fast is the data coming relative to how fast it needs to be consumed.
And so there are a lot of applications for which interactive response times are increasingly important, if not strictly required.
Okay?
And so when this becomes the bottleneck, when this becomes the challenge, then velocity starts to become pretty relevant.
And the one that I think is really pretty interesting, that is near and dear to my heart and to my research, is the notion of variety.
And so here the problem is, you know, an increasing number of different data sources are being applied to any particular task. So you need to pull out, you know, ASCII files, as well as download data from the web, as well as pull data out of some database, as well as use some of these SQL systems, and so on.
And the integration of all these data sources is a pretty significant problem, and can end up occupying a lot of your time.
So I made this point a couple of segments ago, about researchers who spend nearly 90% of their time, quote, handling data.
This is where a lot of that time is
going.
That's the notion of variety.
Okay.
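To make that variety point concrete, here is a minimal sketch, in Python, of what pulling from a flat file, the web, and a SQL database can look like. All the names here (data.csv, the URL, observations.db, the readings table) are hypothetical placeholders, not real sources from the course.

```python
# A sketch of the "variety" problem: one analysis, several kinds of sources.
# All names here (data.csv, the URL, observations.db) are hypothetical.
import csv
import json
import sqlite3
import urllib.request

# 1. A flat ASCII/CSV file on disk.
with open("data.csv", newline="") as f:
    csv_rows = list(csv.DictReader(f))

# 2. A JSON feed downloaded from the web.
with urllib.request.urlopen("https://example.org/feed.json") as resp:
    web_records = json.load(resp)

# 3. A relational database queried with SQL.
conn = sqlite3.connect("observations.db")
db_rows = conn.execute("SELECT station, time, temp FROM readings").fetchall()
conn.close()

# The hard part is rarely any single read; it is reconciling the formats,
# units, and semantics of all three before the analysis can even begin.
print(len(csv_rows), len(web_records), len(db_rows))
```

The reads themselves are a few lines each; the integration work the lecture describes is everything that has to happen after them.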
So all three of these are relevant in performing data science tasks. Alright, let me give you another notion, and I'm going to go back to using science examples.
And you've seen some of these before, but
if you make a plot with the number of bytes on the Y-axis versus the number of data sources on the X-axis (maybe columns of data in a single table, or columns of data across multiple tables, or the number of distinct data sources), you can map out different fields of study, or different problems, and see where they lie.
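As an illustrative sketch, a plot like that can be mocked up in a few lines of matplotlib; the coordinates below are invented to convey the idea, not real measurements of any field.

```python
# Illustrative mock-up of the volume-vs-variety plot; the coordinates are
# invented to convey the idea, not real measurements of any field.
import matplotlib.pyplot as plt

fields = {
    "Astronomy":      (3, 1e15),   # few source types, huge volume
    "Ocean sciences": (30, 1e13),  # many instrument types, moderate volume
    "Life sciences":  (50, 1e12),  # very high variety
}

for name, (n_sources, n_bytes) in fields.items():
    plt.scatter(n_sources, n_bytes)
    plt.annotate(name, (n_sources, n_bytes))

plt.xlabel("Number of distinct data sources")
plt.ylabel("Number of bytes")
plt.yscale("log")
plt.show()
```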
And so typically astronomy has been challenged by the sheer volume of data. So that's right about here, high on the Y-axis, but, you know, the number of actual sources in astronomy is not too high.
There are telescopes, there are these spectral imagers, and then there are simulations of the galaxy, and so that's relatively few.
In, say, the ocean sciences, and certainly in the life sciences (although I only show one example here), the variety is really more of a challenge. The actual sheer scale is not as high as, you know, the hundreds of petabytes that could be generated by these telescope projects, like a large survey telescope.
But the number of different types of instruments you can use to acquire data is large and ever-growing, right?
So you have these glider systems that
will go out for months at a time and kind
of porpoise through the water.
You have autonomous underwater vehicles that are more for short-term missions.
You know, there are oceanographic cruises where they deploy these conductivity-temperature-depth (CTD) instruments that can take profiles of the water, right?
So this is, you know, at a fixed X and Y with a varying Z and a varying T, that is, at a varying depth and a varying time, while the gliders are varying in all four dimensions.
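One way to see that difference is in the shape of the records each platform produces. Here is a minimal sketch, with invented field names and values rather than any real schema:

```python
# Sketch of the two sampling geometries; field names and values are invented.
from dataclasses import dataclass

@dataclass
class Sample:
    x: float      # longitude
    y: float      # latitude
    z: float      # depth (m)
    t: float      # time (s since some epoch)
    temp: float   # the measured quantity, e.g. temperature

# A CTD cast: x and y stay fixed for the whole profile; z and t vary.
ctd_cast = [
    Sample(x=-124.5, y=47.0, z=float(d), t=1000.0 + d, temp=8.0)
    for d in range(0, 100, 10)
]

# A glider track: all four coordinates vary as it porpoises through the water.
glider_track = [
    Sample(x=-124.5 + 0.01 * i, y=47.0 + 0.005 * i,
           z=50.0 * (i % 2), t=1000.0 + 60.0 * i, temp=8.0)
    for i in range(10)
]
```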
You have these simulations that are
probably one of the largest sources of information, right.
So these can be at very large scales, on the order of the entire northern hemisphere or the whole eastern Pacific, or there can be models of a particular bay, or an inlet or estuary connected to a river or to the open ocean; a much smaller-scale thing.
So, there's a lot of diversity there.
And I say stations to mean these sort of fixed stations, where particular sensors are deployed at one location and just measure across time.
ADCP is an Acoustic Doppler Current Profiler, where they're using sound waves, recording the time the sound waves take to bounce off particulate matter in the ocean.
And they can therefore measure velocity. And so this gives you an entire profile
of the velocities in the ocean.
And you can mount these on the sea floor pointing upwards, or you can mount them on the bottom of a boat pointing downwards, and so on.
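As a back-of-the-envelope sketch of the idea (not a real ADCP processing pipeline), the echo delay gives the range of each measurement, and the Doppler shift of the returned sound gives the water velocity at that range; the numbers below are illustrative.

```python
# Back-of-the-envelope ADCP idea: echo delay gives range, Doppler shift
# gives radial water velocity. Numbers are illustrative, not a real instrument.
SOUND_SPEED = 1500.0  # m/s, a typical speed of sound in seawater

def echo_range(delay_s: float) -> float:
    """Distance to the scatterer: the ping travels out and back."""
    return SOUND_SPEED * delay_s / 2.0

def doppler_velocity(f_transmit_hz: float, f_shift_hz: float) -> float:
    """Radial velocity of the scatterer from the Doppler shift it imparts."""
    return SOUND_SPEED * f_shift_hz / (2.0 * f_transmit_hz)

# E.g. a 600 kHz ping whose echo returns 20 ms later, shifted by 40 Hz:
print(echo_range(0.020))              # 15.0 m away
print(doppler_velocity(600e3, 40.0))  # moving at 0.05 m/s
```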
And then there are satellite images that measure, sort of, sea color and wave breaking as well.
Okay.
So fine.
So just a little more on the term Big Data.
A quote, the notion that Mike Franklin at UC Berkeley uses, which I like, is that, you know, "Big Data is really relative, right; it's any data that's expensive to manage and hard to extract value from."
So it's not so much about a particular cutoff, you know, what makes it big: is petabyte scale big, versus terabyte scale small, or gigabyte scale very small since it fits in memory on your machine?
You know, not necessarily.
It depends on what you're trying to do
with it and it depends on what sort of resources and infrastructure you have to bring to bear on the problem.
And so in some sense, difficult data is perhaps what big data really means.
It's not so much really big, it's about
being challenging, okay.
This is really important to remember,
that big is relative.
So let me give you a little bit of the history of the term Big Data.
The earliest use I could find was from Erik Larson in 1989, in Harper's Magazine, in a piece that eventually went into a book, where he says: "The keepers of Big Data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended."
So his point was not really about technology at all.
It was just a notion that data is being collected for one purpose and being reused for another.
Which is a theme that I mentioned in the very first segment of this course and that we'll come back to over and over again. And so I think he had it right in that sense; his real point was that, you know, consumers' private data was starting to be commoditized.
Which was absolutely true, and fairly prescient at the time, since it's become a big issue now. And it's especially impressive given that this predates the rise of the internet, and already foreshadows very topical issues in Big Data: the ethics and privacy and sensitivity and so forth that we'll talk a little bit about.
But this isn't quite what we mean by Big Data nowadays, typically, because it didn't have that technology aspect to it. It didn't talk about the challenge of actually managing these data sets, alright.
So another point of reference: more recent reports from these consulting firms get credit for this notion of the three V's, but this is really the original source. This was a report from Gartner in 2001, written by a guy named Doug Laney. And so it talks about volume, velocity, and variety, which we've said, but let me just give you a chance to look at these quotes.
You know, and so on volume, he's really talking about, sort of, business-to-business interactions.
If you think about 2001, this is around
the dot com boom.
And so everyone was trying to figure out what this new era of technology was going to get them, what the internet was really going to give them beyond just, sort of, putting up a webpage and serving it out to your customers. How are you going to be able to interact with your supply chain, or your vendors, and so on? Okay, and so that's what he means by this notion of e-channels.
But, you know, up to ten times the quantity of data about an individual transaction may be collected.
You know, absolutely true: this data exhaust, a point we've made a couple of times, is giving rise to a larger scale of data being collected.
You know, with velocity, well, he's talking about increased point-of-interaction speed, right.
So this is that need for interactivity.
It didn't used to be so required, but as the velocity of all business and all transactions has sort of increased, so have the constraints on the infrastructure used to process it.
And so on variety, I like this one a lot. "Through 2003/04", right, so he's being fairly conservative about how far out he wanted to predict.
"No greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics."
So this is great; you know, you could have said this through 2015 and been arguably correct. This problem has not gone away.
So, another point in the history of the term Big Data: there was a series of talks, a lot of work, by John Mashey, who was formerly the chief scientist at SGI, and who would talk about Big Data being the next wave of "infrastress".
And so what he meant by infrastress was: what's really going to drive the technology forward, where we're going to feel the pain. And his point was that the I/O interfaces were where it was going to be tough.
So in particular, disk capacities were growing incredibly fast, and still are, and the latencies are not keeping pace, right. So you can go down to a local store and buy a three-terabyte drive for probably $200.
But the rate at which you can pull data off that is essentially the same as it
has been for many, many years.
And so now it takes you hours to actually read every byte of data on that disk that you stored.
And so this is a problem for the actual analysis you can do: you know, we can keep all the data, and that's really cheap, but we cannot do anything with it because the pipe is so small.
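To put rough numbers on that point: assuming, say, a three-terabyte drive and roughly 150 megabytes per second of sustained sequential read throughput (both illustrative figures, not benchmarks of any particular drive), reading the disk once takes hours.

```python
# Rough arithmetic on the capacity-vs-bandwidth gap; both figures are
# illustrative assumptions, not benchmarks of any particular drive.
capacity_bytes = 3e12    # a 3 TB consumer drive
read_rate_bps = 150e6    # ~150 MB/s sustained sequential read

seconds = capacity_bytes / read_rate_bps
print(f"{seconds / 3600:.1f} hours to read every byte once")  # ~5.6 hours
```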