
welcome

what is this school about?

1.it is about applications of computer science tools, technologies, and statistics to scientific analysis.

2.it is not about computer science proper, nor about high performance computing.

transformation and synergy

1.all science in the 21st century is becoming cyber-science (aka e-science), and with this change comes the need for a new scientific methodology.

2.the challenges we are tackling

(1)management of large, complex, distributed data sets

(2)effective exploration of such data -> new knowledge

3.a great synergy of computationally enabled science and science-driven technology.

the evolving paths to knowledge

1.the first paradigm: experiment/measurement

2.the second paradigm: analytical theory

3.the third paradigm: numerical simulation

4.the fourth paradigm: data-driven science

challenges

1.from data poverty to data glut

2.from data sets to data streams

3.from static to dynamic, evolving data

4.from anytime to real-time analysis and discovery

5.from centralized to distributed resources

6.from ownership of data to ownership of expertise 

a modern scientific discovery process

1.data gathering -> data farming -> data mining -> data understanding -> new knowledge

(1)data gathering: e.g., from sensor networks, telescopes...

(2)data farming:

storage/archiving, indexing, searchability, data fusion, interoperability -> database technologies

(3)data mining (or knowledge discovery in databases, KDD)

pattern or correlation search, clustering analysis, classification, outlier/anomaly searches, hyperdimensional visualization -> key technical challenges

(4)data understanding -> key methodological challenges

(5)new knowledge
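the five steps above can be sketched end-to-end in a few lines of python; the sensor readings, the index, and the z-score threshold below are all invented for illustration:

```python
# a toy sketch of the discovery pipeline above, in plain python
import statistics

# (1) data gathering: pretend these values came from a sensor network
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 42.0, 10.2, 9.7]

# (2) data farming: index the readings so they stay searchable
indexed = {i: value for i, value in enumerate(readings)}

# (3) data mining: a simple z-score outlier/anomaly search (clustering,
# classification, etc. from the list above would slot in here instead)
mean = statistics.mean(readings)
stdev = statistics.stdev(readings)
outliers = {i: v for i, v in indexed.items() if abs(v - mean) / stdev > 2}

# (4)-(5) data understanding -> new knowledge: a human interprets
# the flagged points
print(outliers)  # the anomalous reading at index 5
```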

conclusion:

since the information volume grows exponentially, and information complexity is also increasing greatly, we need to create a new scientific methodology for computational science in the 21st century.

and the goal of the class is to help you start learning about the modern tools of scientific data analysis.

 

Original transcript:

Hello and welcome to the JPL-Caltech Virtual Summer School on Big Data Analytics.

My name is George Djorgovski. I'm a Professor of Astronomy and Director of the Center for Data-Driven Discovery at Caltech.

(What the summer school covers)

And let me introduce the summer school for you.

So first, what is it about?

It's about applications of advanced computational and statistical tools for data analysis. Whether the data are big or not doesn't really matter. But it's becoming increasingly important for scientists, engineers, and indeed anybody dealing with vast amounts of data to master such tools.

What we can cover in this school is only a tiny subset of all the skills that the modern data scientist should have.

And we started with a subset that some of our lecturers could provide, but in the future we hope to grow it and add additional material.

So what this means is that these lectures will really just be a start on learning any given subject: to tell you roughly what it is about and show you how to use it.

But to get the full benefit you really need to explore further, and we'll provide links and other resources for you to do so.

And I should point out that this is not about computer science per se, even though we use tools derived from computer science, and it's certainly not about high performance computing.

As such, it's really about analyzing data and extracting knowledge from data.

(The broader context)

So let me put this in a somewhat broader context of what‘s happening.

I think everybody now knows that everything is being completely transformed by information and computation technology, and science is certainly no exception. Now that brings interesting new possibilities, but also new challenges.

And many of those are universal and common to all the different fields: how do we actually deal with vast amounts of data?

How to store it, how to address it, how to check it, and so on? But more importantly, how to explore it, how to discover knowledge in it?

(Historical development)

And the tools and methods developed for this really form new parts of the scientific methodology, adding to the tools we've been developing over centuries so far.

It is also really an excellent synergy between the domain sciences, like astronomy, or physics, or biology, or the geological sciences, and information and computation technology.

Whereas the domain sciences use tools that come from computer science or statistics, our challenges push those tools to be improved further, and new and better tools to be developed, which may then find other applications.

There is a concept of the Fourth Paradigm, introduced by the great computer scientist Jim Gray. If you think about how we learn about the world, how we understand it: science started experimentally, with the likes of Galileo and so on. That was very quickly followed by the analytical approach, say by Newton and others. And we still use experimental and analytical methods, and will forever.

But then in the mid-20th century something new came about: computers. And we can call that the third paradigm,

where we use machines to simulate what physical processes do in nature, not because we are too lazy to write formulas, but because there is no other way in which we can do it.

And then over the last 20 years or so, we saw a different kind of computing arise: computing that's not about number crunching in the traditional sense, but really about accessing and understanding data.

It's a different kind of computing, with different demands, optimized in a different way, and that's in fact what most scientists do. Some of us still run large numerical simulations, but their output is also a huge data set.

And to understand your theoretical output, you also need to understand how to do data-driven computing.

(My own field as an example of the relationship to big data)

So my own field, astronomy, is a good example of this.

Astronomy has been completely transformed by modern digital and computational technology. Most of our data come in the form of large digital sky surveys, which are typically now tens or hundreds of terabytes each, if not more; some now reach a petabyte.

And we're already talking about exabytes and beyond. We have at least 10, I would say probably a few tens of petabytes stored in good-quality archives as of circa mid-2014, and we generate tens of terabytes per day, many tens of terabytes per day. The interesting point is that the data volume doubles every year and a half.

It follows Moore's law, for the exact same reason: the technology that gives us the data follows Moore's law. And that's rather stunning if you think about it, because it means that in the next year and a half, from whenever you start counting, we will generate as much data as in all of past history.
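This doubling claim can be checked with a quick geometric-series sketch in Python (the units and number of periods are arbitrary): if the volume doubles every year and a half, each new period produces as much as all previous periods combined.

```python
# If the data volume doubles every 1.5 years, the amount generated in
# each new period equals everything generated before it, plus one unit.
volumes = [2 ** k for k in range(10)]  # volume added per 1.5-year period
for n in range(1, 10):
    past_total = sum(volumes[:n])      # everything generated so far
    next_period = volumes[n]           # generated in the next 1.5 years
    assert next_period == past_total + 1  # 2^n = (2^n - 1) + 1
print("each new period matches all of past history combined")
```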

Now, it's not about the data; it's about discovering stuff. And let me put these large numbers in context.

The human genome itself can be coded in less than one gigabyte, and a terabyte is about two million books in pure text. Humans can maybe process information at roughly one terabyte per year; more if it's images, less if it's text.

But you can see that this implies we're now getting into a regime [INAUDIBLE] where we simply cannot even follow the data, let alone do something with it, in a simple fashion.
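A back-of-envelope sketch of these numbers in Python; the archive size and the human processing rate are the rough figures quoted in the talk, not exact measurements.

```python
# Rough scale comparison: archived data vs. human reading capacity.
TB_PER_PB = 1024                 # binary prefixes; powers of 10 are similar
archive_tb = 20 * TB_PER_PB      # "a few tens of petabytes", in terabytes
human_rate_tb_per_year = 1       # rough human information-processing rate
years_to_review = archive_tb / human_rate_tb_per_year
print(f"{years_to_review:.0f} person-years just to look at the archive")
# on the order of 20,000 person-years: no human can follow the data
```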

And the progress continues.

Astronomers are now building large new survey facilities, like the Large Synoptic Survey Telescope, which will generate about 30 terabytes per night; roughly speaking, it will do one full digital sky survey every week. And they are building an even more ambitious machine, the Square Kilometre Array, where the raw data generation of the instrument will be about an exabyte per second, to be reduced to maybe a few petabytes per year.

(Current trends and the challenges we face: ever more data, ever more content, ever more distributed)

So there are some general trends that all sciences are following, which are worth looking at.

First and most obvious is the exponential growth of data volumes, which is why people talk about big data.

But much more interesting is the growth of data complexity. The informational content of the data has also been increasing, and that's where things get really interesting.

So we've moved from data poverty and starvation to exponential over-abundance. We're also moving from fixed data sets, which you obtain once and that's it, to constantly arriving new data streams from different kinds of sensors, whether telescopes or environmental sensors; it doesn't matter. It means that we have to understand the data in real time, and that poses a whole new set of challenges for data analysis.
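Real-time analysis of a stream means updating statistics as each value arrives, without storing the whole stream. A minimal sketch using Welford's online algorithm; the stream values are invented stand-ins for sensor data.

```python
# Running mean and variance over a data stream, one value at a time,
# without keeping the stream in memory (Welford's online algorithm).

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # sample variance; undefined for fewer than two values
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [3.0, 5.0, 4.0, 7.0, 6.0]:  # stand-in for a live stream
    stats.update(value)
print(stats.mean, stats.variance())  # 5.0 and 2.5
```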

Also, the data are very distributed geographically; even within a given discipline there will be many different data centers and many contributing groups or labs. And intrinsically it has to be that way.

So we have to have ways by which they are connected very efficiently, in order to put the data together for further scientific analysis.

(What matters is not the data but what you can discover in it)

Since there is so much data now, the value of merely having data is not very big. In the past, data were the currency of the realm: whoever had access to the data could do the science. Now the data are over-abundant, and most of it is free. The real value is in having the expertise to extract knowledge from the data, and that is what this school is all about.

So a simple way to represent the modern scientific process is: you gather the data, whether it's from gene-sequencing machines or satellites or telescopes, it doesn't matter. Then you do what I call data farming: the data have to be organized, indexed, and made easily accessible and findable, with ways in which they can be combined. We know how to do this very well, but it still takes some skill. Then comes the interesting part, discovering regularities in the data themselves.

That's what science is really about: finding patterns in nature and trying to understand them. And those could be correlations, or clusters, or outliers, anomalies, and so on.
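As one concrete instance of these pattern searches, here is a correlation computed in plain Python; the paired measurements are invented for illustration.

```python
# Pearson correlation between two invented measurement series.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]  # roughly ys ≈ 2 * xs

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(xs, ys)
print(round(r, 3))  # close to 1: a strong linear correlation
```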

And a lot of interesting tools from statistics and machine learning can be used for that purpose. At the end, of course, comes the real role of scientists, which is understanding what it all means, and therefore the new knowledge. This is grossly oversimplified, of course, and there are feedback loops at every step.

But I think you get the idea.

Data science is not about data; it's about what you find in the data. And so it's interesting to think: how is this different from what we had in the past? The first and obvious difference is that, for the first time in history, we'll never be able to see all of our data. That means we need reliable technology that will, in a sense, look after the data without human intervention and enable us to find the pieces we want.

But then, perhaps even more interesting, the informational content of the data is so high that there are meaningful constructs in the data that humans cannot easily comprehend unaided. And we're moving ever more towards collaborative human-computer discovery. So all this, I think, adds to our toolkit of scientific methodology in the 21st century. And it's not only about science, because every field of human endeavour (medicine, security, finance, everything) depends on good handling and understanding of large data sets.

(Wrapping up)

So with that I'll end; my colleagues from JPL will also provide their take on what we are doing here and why.

 
