Learniverse

Optimization Fundamentals

00:00

The following content is provided under a Creative Commons license.

00:05

Your support will help MIT OpenCourseWare continue to offer high-quality educational

00:10

resources for free.

00:12

To make a donation or to view additional materials from hundreds of MIT courses, visit

00:17

MIT OpenCourseWare at ocw.mit.edu.

00:23

Well, OK, I'm happy to be back.

00:26

And I'm really happy about the project proposals that are coming in.

00:32

This is really a good part of the course.

00:37

And so keep them coming.

00:40

And I'm happy to give whatever feedback I can on those proposals

00:45

and do make a start.

00:49

They're really good.

00:50

And if some are completed before the end of the semester,

00:55

and we can offer you a chance to report on them,

01:01

that's good too. So well done with those proposals.

01:08

So today, I'm jumping to Part 6.

01:13

So Part 6 and Part 7 are optimization,

01:16

which is the fundamental algorithm that goes into deep learning.

01:20

So we've got to start with optimization.

01:22

Everybody has to get that picture.

01:26

And then Part 7 will be the structure of CNNs,

01:32

convolutional neural nets and all the kinds of applications.

01:38

And so can we start with optimization?

01:41

So first, could I get the basic facts about three terms of a Taylor series?

01:49

So that's the typical.

01:51

It's seldom that we would go up to third derivatives in optimization.

01:57

So that's the most useful approximation to a function.

02:02

Everybody recognizes it.

02:04

So I'm thinking of F as just one function and X as just one variable.

02:12

But now I really want to go to more variables.

02:18

So what do I have to change if F is a function of more variables?

02:26

So now I'm thinking of X as well.

02:31

No, let me see.

02:38

Yeah, I want n variables here.

02:43

X is X1 up to Xn.

02:49

So what?

02:50

I just want to get the words straight.

02:53

So we can begin on optimization.

02:56

So what will be the similar step?

02:59

So the function F at X, remember X is n variables.

03:06

Okay, now what do I have?

03:08

Delta X.

03:09

So what's the point about Delta X now?

03:12

It's a vector.

03:14

Delta X1 to Delta Xn.

03:17

And what about the derivative of F?

03:20

It's a vector too.

03:22

The derivative of F with respect to X1.

03:25

The derivative of F with respect to X2 and so on.

03:28

So what do I have?

03:30

What do I have to change about that?

03:32

I know those guys are vectors.

03:35

So it's their dot product.

03:37

So it's Delta X transpose.

03:40

That vector times this dF/dx.

03:46

So now I'm replacing this by all the derivatives.

03:53

It's the gradient.

03:58

So the gradient of F at X is the derivatives, let's see.

04:09

It's essential to get the notation straight here.

04:14

Yeah.

04:15

So it'll be the partial derivatives of the function F.

04:19

So grad F is the partial derivatives of F with respect to X1 down

04:28

to the partial derivative with respect to Xn.

04:32

Okay, good.

04:34

That's the linear term.

04:35

And now what's the quadratic term?

04:37

One half.

04:39

Now Delta X isn't a scalar anymore.

04:42

It's a vector.

04:43

So I'm going to have Delta X transpose.

04:47

And Delta X.

04:50

And what goes in between is the second derivatives.

04:54

But I've got a function of n variables.

04:58

And so what does this, so now I have a matrix of second derivatives.

05:03

Right?

05:04

And I'll call it H.

05:06

This is the matrix of second derivatives.

05:10

H_jk is the second derivative of F with respect to x_j and x_k.

05:20

And what's the name for this guy?

05:23

The Hessian, Hessian matrix.

05:30

How the Hessians got into this picture?

05:32

I don't know.

05:33

The only Hessians I know are the ones who fought in the revolutionary war for somebody.

05:39

Who, which side were they on?

05:41

I think maybe the wrong side, the French were on our side.

05:46

And anyway, Hessian matrix.

05:49

And what are the facts about that matrix?

05:52

Well, the first fact, the key fact, is that it's symmetric.

05:57

Yeah.

06:00

Okay.

06:02

And again, it's an approximation.
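
Written out, the three-term expansion described above looks like the following. This is a sketch in standard notation: the symbols $\nabla F$ for the gradient and $H$ for the Hessian follow the names used in the lecture, the layout is an assumption about what was on the board, and the symmetry of $H$ assumes $F$ has continuous second derivatives.

```latex
F(x + \Delta x) \approx F(x) + (\Delta x)^{T} \nabla F(x)
  + \tfrac{1}{2} (\Delta x)^{T} H(x) \, \Delta x ,
\qquad
\nabla F = \begin{bmatrix} \partial F / \partial x_1 \\ \vdots \\ \partial F / \partial x_n \end{bmatrix},
\qquad
H_{jk} = \frac{\partial^2 F}{\partial x_j \, \partial x_k} = H_{kj}.
```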

06:05

And so, and everybody recognizes that if we have a function, if n is very large,

06:12

and we have a function of many variables, then we have n derivatives to compute here,

06:19

and about half of n squared derivatives here; the half comes from the symmetry.

06:24

But the key point is the n squared derivatives to compute there.

06:30

So, computing the gradient is feasible if n is small or moderately large.

06:39

Actually, by using automatic differentiation, the key idea of backpropagation (backprop),

06:47

you can speed up the computation of derivatives quite amazingly,

06:57

but still, for the size of deep learning problems, that's out of reach.

07:04

Okay.

07:05

So, that's the picture, and then I will want to solve, use this to solve equations.

07:15

Let me, there's a parallel picture for a vector F.

07:24

So, now, this is a vector function.

07:27

This is f1 of x up to fn of x, and x is x1 to xn.

07:42

So, I have n functions of n variables, n functions of n variables.

07:48

Well, that's exactly what I have in the gradient.

07:53

That's the sense in which these two are parallel.

07:57

The parallel being that little f corresponds to the gradient of capital F: n functions of n variables.

08:11

Okay.

08:13

Okay.

08:14

Now, maybe what I'm after here is to solve f equals zero.

08:19

I'm going to think about the f at x plus delta x.

08:25

So, it starts with f of x, and then we have the correction times the matrix of first derivatives.

08:36

And what's the name for that matrix of first derivatives?

08:46

Well, if I'm just given n functions, and what am I after here?

08:56

I'm looking for the Jacobian.

09:04

So, here will go the Jacobian.

09:07

This is the Jacobian named after Jacobi.

09:12

Jacobian matrix.

09:15

And what are its entries?

09:19

J, the jk entry, is the derivative of the jth function with respect to the kth variable.

09:28

Okay.

09:31

And I'm stopping at first order there.
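
To make the first-order statement concrete, here is a minimal sketch that compares f(x + Δx) with f(x) + J Δx for a small Δx. The function `f` and its hand-computed `jacobian` are hypothetical examples chosen for illustration; they are not from the lecture.

```python
import math

def f(x):
    """Hypothetical vector function f: R^2 -> R^2, for illustration only."""
    x1, x2 = x
    return [x1 * x1 + x2, math.sin(x1) * x2]

def jacobian(x):
    """J[j][k] = derivative of f_j with respect to x_k, worked out by hand for f above."""
    x1, x2 = x
    return [[2.0 * x1, 1.0],
            [math.cos(x1) * x2, math.sin(x1)]]

def first_order(x, dx):
    """f(x) + J(x) dx: the expansion stopped at first order."""
    fx, J = f(x), jacobian(x)
    return [fx[j] + sum(J[j][k] * dx[k] for k in range(2)) for j in range(2)]

x = [1.0, 2.0]
dx = [1e-3, -2e-3]
exact = f([x[0] + dx[0], x[1] + dx[1]])
approx = first_order(x, dx)
# The mismatch is second order in dx, so it should be far smaller than dx itself.
errors = [abs(e - a) for e, a in zip(exact, approx)]
print(errors)
```

The leftover error is the part that a second-order term would capture, which is why it shrinks like the square of the step.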

09:34

Okay.

09:35

So, these are the sort of facts of calculus.

09:38

Facts of 18.02, you could say.

09:41

Multi-variable calculus.

09:43

That's the point.

09:44

Notice that we're doing just the first half of 18.02, just differential calculus.

09:51

Derivatives.

09:52

Taylor series.

09:53

We're not doing multiple integrals.

09:55

That's not part of our world here.

09:57

Okay.

09:58

So, that's the background.

10:00

Now, I want to look at optimization.

10:04

Okay.

10:05

Okay.

10:06

So, over here, I want to optimize.

10:12

Well, over here, let me try to minimize f of x.

10:24

And I'll be in the vector case here.

10:27

And over here, I want to solve f equals zero.

10:38

And of course, that means f1 equals zero all the way along to fn equals zero.

10:48

Here, I have n equations in n unknowns.

10:55

Let me start with that one.

11:02

And I'll start with Newton's method.

11:04

Newton's method to solve these n equations in n unknowns.

11:09

Okay.

11:10

So, Newton's method.

11:17

Which is often not presented in 18.02.

11:24

That's a crime.

11:25

Because that's the big application of gradients and Jacobians.

11:33

Okay.

11:34

So, I'm trying to solve n equations in n unknowns.

11:37

And so, I want f at x plus delta x to be zero.

11:42

Right?

11:43

So, I want f of x plus delta x to be zero.

11:46

So, for f of x plus delta x, I'm putting in a zero.

11:51

I'm just copying that equation.

11:53

It's f at the point where I am.

11:57

Let me use k for the kth iteration.

12:02

So, I'm at a point x k.

12:04

I want to get to a point x k plus 1.

12:07

And so, it's f of xk plus J at that point times delta x,

12:17

which is x k plus 1 minus x k.

12:22

Good.

12:23

That's Newton's method.

12:25

Of course, the zero isn't quite true.

12:29

Well, zero will be true for the,

12:32

I'm constructing x k plus 1 here.

12:35

I'm constructing x k plus 1.

12:38

Okay.

12:39

So, let me just rewrite that,

12:41

and we've got Newton's method.

12:43

So, we're looking for this change.

12:47

x k plus 1 minus x k.

12:53

I'll put it on this side as plus x k.

12:58

So, that's this.

13:00

Now, I have to invert that and put it on the other side of the equation.

13:04

So, that'll go with a minus.

13:06

This guy will be inverted.

13:09

And f at x k.

13:15

So, that's Newton's method.

13:17

Natural.

13:19

Let me just repeat that.

13:21

You see where the x k plus 1 minus x k is sitting.

13:25

Right?

13:27

And I moved f x k to the other side with a minus sign.

13:32

And then I multiplied through by j inverse.

13:36

So, I got that.

13:37

So, that's Newton's method.

13:39

For a system of equations.
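
The update x_{k+1} = x_k - J^{-1} f(x_k) can be sketched in a few lines. The two-equation system below (a circle of radius 2 and the line x1 = x2) is a hypothetical example, not one from the lecture, and the 2-by-2 solve by Cramer's rule stands in for whatever linear solver one would really use.

```python
def f(x):
    """Hypothetical system: f1 = x1^2 + x2^2 - 4, f2 = x1 - x2."""
    x1, x2 = x
    return [x1 * x1 + x2 * x2 - 4.0, x1 - x2]

def jacobian(x):
    """J[j][k] = derivative of f_j with respect to x_k for the system above."""
    x1, x2 = x
    return [[2.0 * x1, 2.0 * x2],
            [1.0, -1.0]]

def solve_2x2(A, b):
    """Solve A d = b for a 2-by-2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def newton_system(x, steps=8):
    """Repeat: solve J(x_k) dx = -f(x_k), then x_{k+1} = x_k + dx."""
    for _ in range(steps):
        fx = f(x)
        dx = solve_2x2(jacobian(x), [-fx[0], -fx[1]])
        x = [x[0] + dx[0], x[1] + dx[1]]
    return x

root = newton_system([1.0, 0.5])
print(root)  # both components should be close to the square root of 2
```

Solving the linear system J dx = -f is the same step as multiplying by J inverse; in practice one solves rather than inverts.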

13:44

But over there, I'm going to write down Newton's method

13:47

for minimizing a function.

13:49

This is such basic stuff that we have to begin here.

13:53

Let me even begin with a,

13:57

extremely straightforward example of Newton's method here.

14:02

Suppose my function,

14:05

I suppose I've only got one function actually.

14:10

Suppose I only have one function.

14:13

So, I'm, so suppose my function is x squared minus 9.

14:21

And I want to solve f of x equals 0.

14:25

I want to find the square root of 9.

14:29

Okay.

14:30

So, what is Newton's method for it?

14:32

We just should see.

14:34

My point is just to see how Newton's method is written.

14:37

And then rewrite it a little bit so that we see that convergence.

14:43

Okay.

14:44

So, of course, the Jacobian is 2x.

14:48

So, Newton's method says that xk plus 1 is xk,

14:52

I'm just going to copy that Newton's method minus 1 over 2xk.

14:59

Right?

15:01

That's the derivative.

15:03

Times f of xk, which is xk squared minus 9.

15:09

Okay.

15:13

We followed the formula where this determines xk plus 1.

15:17

And let's simplify it.

15:20

So, here I have xk minus that looks like a half of xk.

15:26

So, I think I have a half of xk.

15:30

And then this times this is 9 halves

15:35

of 1 over xk.

15:40

Is that right?

15:43

Half of xk from this stuff.

15:46

And plus 9 halves

15:48

of 1 over xk.

15:50

Okay.

15:53

Can I just check that I know the answer is 3.

15:59

Can I be sure that I get the right answer 3.

16:03

So, if xk was exactly 3, then of course, I expect xk plus 1 to stay at 3.

16:11

So, does that happen?

16:13

So, half of 3 and 9 halves of one

16:17

third, what's that?

16:20

Half of 3 and 9 halves of one

16:24

third.

16:26

That's 3 halves and 3 halves.

16:29

That's 6 halves and that's 3.

16:32

Okay.

16:35

So, we've checked that the method is consistent, which just means we kept the algebra straight.

16:42

But then the really important point about Newton's method is to discover how fast it converges.

16:50

So, now let me do xk plus 1 minus 3.

16:55

So, now I'm looking at the error, which is I hope approaching 0.

17:02

Is it approaching 0?

17:04

How quickly is it approaching 0?

17:06

These are the fundamental questions of optimization.

17:09

So, I'm going to subtract 3 from both sides.

17:13

Somehow.

17:14

Okay.

17:15

From here.

17:16

I guess I'm going to subtract 3.

17:19

Can I just, so I was just checking that it was correct.

17:24

Okay.

17:25

Now, so xk plus 1 minus 3.

17:29

I'm going to subtract 3 from both sides.

17:32

I'm going to subtract 3 there.

17:34

And then I hope that that box is what goes down here, right?

17:40

Subtracted 3 from both sides.

17:42

So, I'm hoping now that things go to 0.

17:48

Okay.

17:49

So, what do I have there?

17:52

Let me factor out the 1 over xk.

17:59

So, what do I have then left?

18:02

1 over xk.

18:03

So, there's a 9 halves from there.

18:08

1 over xk.

18:09

So, I really have a half of xk squared, because I've divided by an xk.

18:15

And this minus 3, I better put minus 3xk, because I'm dividing by the xk.

18:23

I claim that that's, now I've got it.

18:31

And let's see.

18:32

Let me take out the 2 to forget these 2s and make that a 6.

18:41

So, I have 1 over 2 xk times 9 plus xk squared minus 6xk.

18:47

Anything good about that?

18:51

We hope so.

18:56

We hope that that is something attractive.

19:00

So, this is again.

19:01

This is the error at step k plus 1.

19:05

And it's 1 over 2 xk times this thing in brackets 9 plus xk squared minus 6xk.

19:15

And we recognize that as xk minus 3 squared.

19:26

xk squared minus 6 of them plus 9, that's xk minus 3 squared.

19:34

Okay, that was the goal, of course.

19:40

That's the goal that shows why Newton's method is fantastic.

19:44

If you can execute it, if you can start near enough, notice that.

19:49

So, how do I describe this great equation?

19:54

It says that the error is squared at every step.

19:58

Squared at every step.
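
The identity just derived, x_{k+1} - 3 = (x_k - 3)^2 / (2 x_k), can be checked numerically. This is a minimal sketch; the starting point 4.0 and the number of steps are arbitrary choices for illustration.

```python
def newton_sqrt9(x, steps):
    """Newton's method for f(x) = x^2 - 9; returns all iterates x_0 .. x_steps."""
    history = [x]
    for _ in range(steps):
        x = x - (x * x - 9.0) / (2.0 * x)  # algebraically the same as x/2 + 9/(2x)
        history.append(x)
    return history

xs = newton_sqrt9(4.0, 5)
for xk, xnext in zip(xs, xs[1:]):
    predicted = (xk - 3.0) ** 2 / (2.0 * xk)  # the squared-error formula
    print(xk, xnext - 3.0, predicted)
```

Each printed row shows the actual new error next to the error predicted by the formula; the two columns agree up to rounding.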

20:01

So, if I'm converging to a limit, it will satisfy the, it will be 3.

20:11

Or I guess minus 3, is that possible?

20:15

Yeah, minus 3 is another solution here.

20:18

So, we've got 2 solutions.

20:20

The Newton's method could converge to 3.

20:26

Am I right? It could converge to minus 3.

20:29

So, I could, I'd have a similar equation sort of centered at minus 3.

20:35

Or does it always do one of those?

20:40

It could blow up.

20:43

So, those are sort of regions of attraction.

20:48

There are all the starting points that approach 3.

20:51

And the whole point of that equation is that, with quadratic convergence,

20:57

The error being squared at every step.

21:00

It zooms in on 3.

21:02

Then there's a whole, all these starting points that would go to minus 3.

21:07

And then there's starting points that would blow up.
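
For this particular scalar example the regions are easy to probe numerically. The sketch below is an illustration under stated assumptions: the sample starting points and the step count of 60 are arbitrary. In this simple case, positive starts head to 3, negative starts head to -3, and the start x0 = 0 fails immediately because the update divides by x0.

```python
def newton_limit(x0, steps=60):
    """Run the iteration x <- x/2 + 9/(2x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = 0.5 * x + 4.5 / x
    return x

for start in [0.001, 1.0, 1000.0, -0.5, -50.0]:
    print(start, newton_limit(start))

# The one start that breaks down here: the very first update divides by zero.
try:
    newton_limit(0.0)
    failed_at_zero = False
except ZeroDivisionError:
    failed_at_zero = True
print(failed_at_zero)
```

A start very close to 0 is thrown far away by the first step and then has to walk back, so convergence is only fast once the iterate is near a root.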

21:10

So, and those, maybe for this very simple problem,

21:14

that the picture is not too difficult to sort out those 3 regions.

21:19

But if we had, and this is allowing for a vector.

21:25

Two equations, or n equations, then we're in n variables.

21:32

And there, really, you get beautiful pictures.

21:35

You get some of the type of pictures that gave rise to these books on fractals.

21:42

The picture books on fractals.

21:44

For these basins of attraction, does the starting point lead you to one of the solutions,

21:52

or does it lead you to infinity?

21:54

Here, that would be interesting to just draw it for this.

21:58

Okay, but the essential point is the quadratic convergence.

22:03

If it's close enough, you see that it has to be close.

22:07

If x0 is pretty near 3, then this is about one sixth of that.

22:15

And there'll be a good region of attraction in this case.

22:21

Okay, so that's Newton's method for equations.

22:29

And now I want to do Newton's method.

22:32

I just want to convert all those words over to Newton's method for optimization.

22:38

So remember, that board was solving little f equals 0.

22:45

This board is minimizing capital F, and what's the connection between them?

22:50

Well, of course, this corresponds to solving the gradient equals 0.

23:04

At a minimum, if I'm minimizing, I'm finding a point where all the first derivatives are 0.

23:11

So that'll be the match between these.

23:15

The grad F in this picture is the small f in that picture.

23:22

Okay.

23:26

Now, I guess here I have, and this is sort of the heart of our applications to deep learning.

23:35

We have very complicated loss functions to minimize.

23:39

They're functions of thousands or hundreds of thousands of variables.

23:43

Okay.

23:45

So that means that we would like to use Newton's method but often we can't.

23:50

So let me, so I need to put down here two methods.

23:54

One that doesn't involve those second derivatives, and one, Newton's, that does.

24:02

So, so first I'll write down a method that does not.

24:08

So, so method 1.

24:13

And this will be steepest descent.

24:25

And what is that?

24:27

That says that xk plus 1, the new x, is the old x minus (steepest descent means that I move in

24:37

the steepest direction) a multiple of the gradient of F.

24:44

I move some distance and I better have freedom to decide what that distance should be.

24:51

So this is a step size S or in the language of deep learning.

25:01

It's often called the learning rate.

25:04

So obviously learning rate.

25:11

Okay.

25:18

And it's natural to choose Sk.

25:23

We're going along. Do you see what this right-hand side looks like?

25:28

I'm at a point in n dimensions.

25:31

We're in n dimensions here.

25:34

We have functions of n variables.

25:38

There's a vector.

25:39

There's a direction to move down the steepest slope of the graph.

25:46

And here's a distance to move.

25:49

And we will stop.

25:52

We'll have to get off this step.

25:58

If we stay on it, it'll swing, it'll take us off to infinity.

26:06

You would like to choose Sk so that you minimize capital F.

26:13

You take the point on this line.

26:17

So this is a line in Rn.

26:22

And for all the points on that line in that direction, F has some value.

26:35

And what you expect is that initially, because you chose it sensibly,

26:42

the value of the graph will drop.

26:45

But then at a certain point it will turn back on you and increase.

26:51

So that would be the natural stopping point.

26:54

We would call that an exact line search.

26:58

So exact line search would be the best choice of Sk.

27:12

Of course that would take time to compute.
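
For a quadratic function the exact line search has a closed form, which makes a small illustration possible. The sketch below assumes a hypothetical bowl F(x, y) = (x^2 + B y^2)/2 with B = 10 and uses the standard formula s = (g·g)/(g·Sg), where g is the gradient and S is the constant Hessian; neither the function nor the formula appears in the lecture.

```python
B = 10.0  # hypothetical curvature in the y direction

def F(p):
    x, y = p
    return 0.5 * (x * x + B * y * y)

def grad_F(p):
    x, y = p
    return [x, B * y]

def exact_step(p):
    """Step size minimizing F along the line p - s * grad F(p), for this quadratic only."""
    g = grad_F(p)
    gg = g[0] * g[0] + g[1] * g[1]
    gSg = g[0] * g[0] + B * g[1] * g[1]  # g^T S g with S = diag(1, B)
    return gg / gSg

p = [10.0, 1.0]
g = grad_F(p)
s = exact_step(p)
p_new = [p[0] - s * g[0], p[1] - s * g[1]]
g_new = grad_F(p_new)

# At the exact minimum along the line, the new gradient is perpendicular to the old direction.
dot = g_new[0] * g[0] + g_new[1] * g[1]
print(s, F(p), F(p_new), dot)
```

For a general loss function no such formula exists, and each trial value of s costs another function evaluation, which is the expense the lecture refers to.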

27:17

And probably, in deep learning,

27:20

That's time you can't afford.

27:22

So you fix the learning rate S.

27:26

Maybe you choose 0.01 to be pretty safe.

27:30

Okay.

27:31

So that's method one steepest descent.
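
Steepest descent with a fixed learning rate, as just described, can be sketched as follows. The test function F(x, y) = (x^2 + 10 y^2)/2, the starting point, and the step count are assumptions made for illustration; the learning rate 0.01 is the "pretty safe" value mentioned above.

```python
def grad_F(p):
    """Gradient of the hypothetical bowl F(x, y) = 0.5 * (x^2 + 10 y^2)."""
    x, y = p
    return [x, 10.0 * y]

def steepest_descent(p, s=0.01, steps=1000):
    """x_{k+1} = x_k - s * grad F(x_k) with a fixed step size (learning rate) s."""
    for _ in range(steps):
        g = grad_F(p)
        p = [p[0] - s * g[0], p[1] - s * g[1]]
    return p

p_final = steepest_descent([10.0, 1.0])
print(p_final)  # both components shrink toward the minimizer (0, 0)
```

With this function each step multiplies the x component by 1 - s = 0.99 and the y component by 1 - 10 s = 0.9, so the flat direction is the slow one.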

27:33

Now, method two will be Newton's method.

27:44

So we have xk plus one equals xk minus something times grad F.

27:59

And now I'm going to do the right thing.

28:02

I'm going to live right here.

28:05

And the right thing is the Hessian.

28:07

The second derivative.

28:09

This was cheap.

28:11

We just took the direction and went along it.

28:14

Now we're getting really the right direction by using the second derivative.

28:20

So that's h inverse.

28:24

Okay.

28:25

And what I've done is to set that to 0.

28:33

So that would be.

28:38

You see that's Newton's method.

28:40

It's totally parallel to this guy.

28:43

Actually, I'm really happy to have these two on the board parallel to each other

28:50

because you have to kind of keep straight.

28:53

Are you solving equations or are you minimizing functions?

28:57

And you're using different letters in the two problems.

29:01

But now you see how they match.

29:03

The Jacobian of.

29:07

So again, the match is: think of little f as the gradient of capital F.

29:13

That's the way you should think of it.

29:16

So the Jacobian of the gradient is the Hessian.

29:24

The Jacobian of the gradient is the Hessian.

29:27

And that makes sense because we take the first derivative of the first derivative

29:32

is the second derivative.

29:33

Only we're doing it matrix-wise.

29:35

So that's.

29:36

So the Jacobian of the gradient.

29:39

We're doing a vector matrix sentence instead of a scalar sentence.

29:43

The Jacobian of the gradient is a Hessian.
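
Newton's method for minimization, x_{k+1} = x_k - H^{-1} grad F(x_k), can be sketched the same way as Newton for equations, with the gradient playing the role of f and the Hessian playing the role of the Jacobian. The function F(x, y) = e^x - x + y^2 + xy is a hypothetical example with a minimum at (0, 0); its Hessian is positive definite near that point but not everywhere, so the starting point below is an assumption that happens to work.

```python
import math

def grad_F(p):
    """Gradient of the hypothetical F(x, y) = exp(x) - x + y^2 + x*y."""
    x, y = p
    return [math.exp(x) - 1.0 + y, 2.0 * y + x]

def hessian_F(p):
    """Hessian of F: the Jacobian of grad_F."""
    x, _ = p
    return [[math.exp(x), 1.0],
            [1.0, 2.0]]

def newton_minimize(p, steps=10):
    """Repeat: solve H d = -grad F, then move to p + d."""
    for _ in range(steps):
        g = grad_F(p)
        H = hessian_F(p)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        dx = (-g[0] * H[1][1] + H[0][1] * g[1]) / det
        dy = (-H[0][0] * g[1] + g[0] * H[1][0]) / det
        p = [p[0] + dx, p[1] + dy]
    return p

p_min = newton_minimize([1.0, 1.0])
print(p_min, grad_F(p_min))  # the gradient should be essentially zero at the end
```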

29:48

Yeah.

29:49

Right.

29:50

Okay.

29:51

So that's what I wanted to start with.

29:53

Just to get those basic facts down.

29:56

And so the basic facts were the three term Taylor series.

30:03

And then the basic algorithms followed naturally from it by setting F at the new point to zero.

30:13

If that's what you were solving.

30:15

Or by assuming you had the minimum.

30:17

Right.

30:18

Good.

30:19

Good.

30:20

Okay.

30:21

Now what?

30:22

Now we have to think about solving these problems.

30:25

We have to study them.

30:28

Do they converge?

30:29

What rate do they converge?

30:31

Well, the rate of convergence is

30:35

why I separated off this example.

30:39

So the convergence rate for Newton's method will be quadratic.

30:44

The error gets squared.

30:47

And of course that means super fast convergence.

30:50

If you start close enough.

30:53

The rate of convergence for a steepest descent is of course not.

30:57

You're not squaring errors here because you're just taking some number instead of the inverse of the correct matrix.

31:05

So you can't expect super speed.

31:12

So linear rate of convergence would be right.

31:17

You expect that the error.

31:19

You would like to know that the error is multiplied at every step by some constant below one.

31:26

That would be a linear rate, compared to being squared at every step.
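
The gap between the two rates shows up clearly if we count steps to a fixed accuracy. This sketch compares Newton's iteration for x^2 - 9 with an idealized linear iteration whose error is multiplied by a constant c at every step; the value c = 0.5, the starting error of 1, and the tolerance are all assumptions chosen for illustration.

```python
def newton_steps(x0=4.0, tol=1e-12):
    """Count Newton steps for x^2 - 9 = 0 until the error |x - 3| drops below tol."""
    x, steps = x0, 0
    while abs(x - 3.0) >= tol:
        x = 0.5 * x + 4.5 / x
        steps += 1
    return steps

def linear_steps(e0=1.0, c=0.5, tol=1e-12):
    """Count steps of a model linear iteration e <- c * e until |e| < tol."""
    e, steps = e0, 0
    while abs(e) >= tol:
        e = c * e
        steps += 1
    return steps

n_newton = newton_steps()
n_linear = linear_steps()
print(n_newton, n_linear)
```

Both start with an error of 1. Squaring the error roughly doubles the number of correct digits per step, while the linear model gains a fixed number of digits per step, so it needs about ten times as many steps here.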

31:35

Okay.

31:36

And so this will be our basic.

31:40

This is the basic formula that we've built on.

31:45

For large scale.

31:48

For really large scale problems.

31:51

And there are methods.

31:53

Of course people are going to come up with methods that.

31:57

There's sort of a cheap Newton's method.

32:00

Levenberg-Marquardt.

32:03

And it's in the notes at the end of this section.

32:07

At the end of Section 6.4, which we'll get to.

32:11

So Levenberg-Marquardt is sort of a cheap man's Newton's method.

32:16

It does not compute the Hessian.

32:19

But it says, okay, I, from the gradient I can see a sort of one term in the Hessian.

32:27

So it grabs that term.

32:29

But it's not fully second order.

32:34

Okay.

32:36

So now we have to think about problems.

32:41

And I guess the whole, the message here is.

32:44

And our starting point has to be convexity.

32:49

Convexity is the key word for this.

32:52

For these, for these problems.

32:55

For the function that we want to minimize.

32:58

If that's a convex function.

33:00

Well first of all, the convex function is likely to have one minimum.

33:07

And the picture that's in our mind of steepest descent is a picture of a bowl.

33:16

A bowl is the graph of a convex function.

33:19

So I'm turning to convexity now.

33:22

I'll leave that board there because that's pretty crucial.

33:28

And speak about the idea of convexity.

33:32

Convex function.

33:34

Convex set.

33:36

So let's call the function f of x.

33:43

And the typical convex set I'll call K.

33:48

Okay.

33:50

So we just want to remember what does that word convex mean?

33:55

And how do you know if you have a convex function or a convex set?

33:59

Okay, let me start with convex set.

34:02

So because here's my general problem.

34:06

Convex minimization.

34:16

Which you hope to have and in many applications you do have.

34:20

So you minimize a convex function for points in a convex set.

34:31

So that's like the ideal situation.

34:34

That's the ideal situation.

34:36

Okay, you've got something on your side.

34:39

Something powerful: convexity.

34:41

The function is convex.

34:43

And so let me draw a convex function.

34:46

The graph.

34:47

Okay, so I'll draw a convex function.

34:50

So a bowl.

34:54

So that's a graph of f of x.

34:57

And then so here are the x's.

35:00

Let me put x1 and x2 in the base.

35:06

And the graph of f of x1 and x2 up here.

35:16

Okay, actually I'm over there.

35:18

I should be calling this function capital F, I think.

35:25

Is that right?

35:29

Yeah, little f would be the gradient of this guy.

35:33

Yeah, I think so.

35:38

Okay.

35:41

So now I'm minimizing over x, and over certain x's, not all x's.

35:49

I might be minimizing for example.

35:55

K might be the set where Ax equals b.

36:02

K might be in that case a subspace.

36:07

Or a shifted subspace when I said subspace.

36:11

But then 18.06 is reminding me in my mind that I only have a subspace when b is zero.

36:17

But you know the word for a subspace that sort of moved over.

36:23

So I'll just put that word down.

36:30

A bunch of words to learn for this topic.

36:35

But they're worth learning.

36:37

Okay.

36:38

So it's like a plane but not necessarily through the origin.

36:43

If b is zero, it does go through the origin.

36:45

If b is not zero, it doesn't go through the origin.

36:47

Okay.

36:48

Anyway, or I have some other convex set.

36:50

Let me just put this convex set K in the base.

36:57

And did I make it convex?

36:59

I think pretty likely I did.

37:04

So now, well, the convex set is the constraint.

37:11

So this is the constraint.

37:17

It's that x must be in the set K.

37:23

Okay.

37:24

And I drew it as a convex blob.

37:28

Here was an example where it's a flat, not a blob, but a flat plane.

37:36

But let me come back to what convex means.

37:41

What's a convex set?

37:43

We have to do that.

37:46

Should have done that before.

37:55

In the notes, I had the fun of figuring out if I took a triangle.

38:02

Is that a convex set?

38:04

Let's just be sure.

38:09

So what's a convex set?

38:11

That is a convex set because if I take any two points in the set,

38:16

and draw the line between them, it stays in the set.

38:21

So that's convexity.

38:23

Any line from x1 to x2 stays in the set.
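
In symbols, the definition just given reads as follows. This is the standard statement; the parameter $t$ that sweeps along the line segment is notation introduced here, not from the lecture.

```latex
K \text{ is convex} \iff
(1 - t)\, x_1 + t\, x_2 \in K
\quad \text{for all } x_1, x_2 \in K \text{ and all } 0 \le t \le 1 .
```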

38:37

Okay, good.

38:39

So here's my little exercise to myself.

38:43

What if I took the union of two triangles?

38:46

All I wanted to get you to do is to sort of think visualize convex

38:52

and not convex possibilities.

38:54

Suppose I have one triangle even if it was obtuse.

39:01

That's still convex, right?

39:04

No problem.

39:06

But now, what if I put those two triangles together?

39:10

Take their union.

39:11

Well, if I take them sitting with a big gap between, then I've lost.

39:17

I've never had a chance.

39:20

Because if I took one of my, if it was the union of these two,

39:24

well, you know what I'm going to say.

39:26

If I took that point, and that point, of course, it goes outside and stupid.

39:31

What about, but what if the triangles,

39:36

what if that triangle that lower triangle kind of overlaps the upper triangle?

39:43

Is that a convex set?

39:46

Everybody's right, saying no.

39:49

How do I say that the union of those two triangles is not a convex set?

39:55

Guys, tell me where to pick two points.

39:59

Where the line goes out.

40:01

Well, I take one from that corner and one from that corner and the line between them went outside.

40:08

So union is usually not convex.
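
The separated-triangles case can be checked with a few lines of code. This is a minimal sketch: the two triangles `T1` and `T2` and the sample points are hypothetical, and the point-in-triangle test uses the usual sign-of-cross-product idea.

```python
def cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_triangle(p, tri):
    """True if p is inside or on the boundary of the triangle tri = [a, b, c]."""
    a, b, c = tri
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)  # all signs agree: p is on one side of every edge

T1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
T2 = [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0)]  # same shape, shifted right with a gap

def in_union(p):
    return in_triangle(p, T1) or in_triangle(p, T2)

p = (0.5, 0.25)                                   # a point in T1
q = (3.5, 0.25)                                   # a point in T2
mid = (0.5 * (p[0] + q[0]), 0.5 * (p[1] + q[1]))  # halfway along the line from p to q
print(in_union(p), in_union(q), in_union(mid))
```

Both endpoints are in the union, but the midpoint of the line between them lands in the gap, so the union fails the definition of convexity.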

40:21

Well, if I think of the union of two sets, my mind actually automatically goes to the other corresponding possibility,

40:30

which is the intersection of the two sets.

40:36

So if I take the intersection of the two sets.

40:43

Now, what's the deal with that?

40:46

What was the, when I had two triangles, two separated triangles,

40:51

what can we say about the intersection of those two triangles?

40:56

It's empty.

40:58

So should we regard the empty set as a convex set?

41:02

Yes.

41:04

Isn't it?

41:05

Yes.

41:06

It's vacuous.

41:08

So it hasn't got any problems, right?

41:11

Okay.

41:12

But now, the intersection is always convex.

41:18

I'm assuming the two sets that we start with are.

41:22

Now, that's an important fact.

41:25

That the intersection of convex sets, let's just draw a picture that shows an example.

41:34

So what's the intersection, just this part, and it's convex?

41:39

Okay.

41:40

Can you give me a little proof that the intersection is convex?

41:48

So I take two points in the intersection.

41:51

Let me start the proof.

41:54

To test if something's convex, how do you test it?

41:57

You take two points in the set in the intersection.

42:01

And you want to show that the line between them is in the intersection.

42:06

Okay.

42:07

Why is that?

42:08

So take two points.

42:10

Take x1 in the intersection of two sets here.

42:16

And that's the symbol for intersection.

42:19

And we've got another point in the intersection.

42:22

And now we want to look at the line between them.

42:28

The line from x1 to x2, what's the deal with that one?

42:38

Is that fully in K1?

42:43

Why is it fully in k1?

42:45

I took two points in the intersection.

42:49

I'm looking at the line between them.

42:52

And I'm asking is it in the first set k1?

42:55

And the answer is yes.

42:57

Because those points were in k1 and k1's convex.

43:02

And is that line between them in k2?

43:05

Yes, same reason.

43:08

The two endpoints were in k2.

43:11

So the line between them is in k2.

43:14

So the intersection of convex sets is always convex.

43:18

The intersection of convex sets is convex.

43:24

Good.

43:26

So you'll see in the notes these possibilities with two triangles.

43:31

Sometimes you can take the union but not very often.

43:36

Okay.

43:37

Now what's the next thing I have to do?

43:41

Convex functions.

43:44

We got convex sets.

43:45

What are convex functions?

43:47

And we're good.

43:48

Right.

43:49

Because this is our prototype of a problem.

43:53

And I now would know what it means to be for that F to be.

43:57

I'm sorry.

43:58

I now know what it means for the set K to be a convex set.

44:03

But now I have to look at the other often more important part of the problem.

44:08

What's the function I'm minimizing?

44:10

And I'm looking for functions with this kind of a picture.

44:15

Okay.

44:16

The coolest way is to connect the definition of a convex function

44:24

to the definition of a convex set.

44:28

This is really the nicest way.

44:31

It's a little quick.

44:33

It just switches by you.

44:35

Tell me.

44:36

Do you see a convex set in that picture?

44:41

Do you see a convex set in that picture?

44:43

That's a picture of a graph of a convex function.

44:46

It's a picture of a bowl.

44:48

Take the points on that surface.

44:50

Is that a convex set?

44:52

No.

44:53

Certainly not.

44:55

No.

44:56

But where is a convex set to be found here in that picture?

45:03

Yes.

45:04

Yes.

45:07

Yes.

45:08

The points on and above the bowl.

45:11

The inside the bowl we could say.

45:14

These points.

45:16

So convex function, yes.

45:21

A function is convex when the points on and above the graph

45:31

are a convex set.
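
In symbols, the set of points on and above the graph, and the definition built on it, can be written as below. The name $\operatorname{epi} f$ and the extra coordinate $y$ are standard notation introduced here for illustration.

```latex
\operatorname{epi} f = \{ (x, y) : y \ge f(x) \},
\qquad
f \text{ is a convex function} \iff \operatorname{epi} f \text{ is a convex set}.
```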

45:47

You could say, okay, mathematicians are just being lazy.

45:52

Having got one definition straight for a convex set.

45:55

Now, they're just using that to give a easy definition of a convex function.

46:01

Actually, it's quite useful for functions that could maybe equal infinity, sort of generalized function.

46:09

But it's not the quickest way to tell if the function is convex.

46:15

It's not our usual test for convex functions.

46:19

So now I want to give such a test.

46:25

Now, a definition of convex function.

46:33

Of a smooth convex.

46:36

Yeah.

46:37

Yeah.

46:38

The really official French name for the set above the graph is the epigraph.

47:01

But I won't even write that word down.

47:04

Okay.

47:05

Why do I come back to that for a minute?

47:08

Because I would like to think about two functions, F1 and F2.

47:16

Out of two functions, I can always create the minimum or the maximum.

47:23

So suppose I have two convex functions, convex functions F1 and F2.

47:34

Okay.

47:35

Then I could choose the minimum.

47:38

I could choose my new function.

47:41

So I call it little M for minimum.

47:44

M of X is the minimum of F1 and F2.

47:51

And I could choose a maximum function, which would be the maximum of F1 of X and F2 of X at the same point X.

48:03

It's just natural. It's saying, okay, I have two functions.

48:07

I've got a bowl and I've got another bowl.

48:11

And suppose they're both convex.

48:14

So I'm just stretching you to think here.

48:20

I've got the graphs of two convex functions.

48:24

And I would like to consider the minimum of those two functions.

48:29

And also the maximum of those two functions.

48:32

I believe life is good.

48:34

One of these will be convex and the other won't.

48:38

And can you identify which one is convex and which one is not convex?

48:46

What about the minimum?

48:48

Is that a convex function?

48:50

So just look at the graph.

48:52

What does the minimum look like?

48:54

The minimum is this guy until they meet somehow on some surface and then this guy.

49:01

Is that convex?

49:02

We have like one minute to answer that question.

49:06

Absolutely no.

49:07

It's got this bad kink in it.

49:10

What about the maximum of the two functions?

49:14

So the maximum is the one that's above. All the points that are

49:20

above or on both graphs: there's the maximum function.

49:27

That was the minimum function.

49:29

It had a kink.

49:31

The maximum function is like that.

49:33

And it is convex.

49:36

So maximum, yes; minimum, no.
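
A quick numerical check of the two-bowls picture might look like the sketch below. The two parabolas, the grid, and the helper name `violates_midpoint_convexity` are illustrative assumptions, not from the lecture. The helper only tests the midpoint inequality on a grid, so it can find a counterexample to convexity but cannot prove convexity.

```python
def f1(x):
    # First bowl (assumed example): parabola centered at x = -1.
    return (x + 1.0) ** 2


def f2(x):
    # Second bowl (assumed example): parabola centered at x = +1.
    return (x - 1.0) ** 2


def m(x):
    # Pointwise minimum of the two convex functions.
    return min(f1(x), f2(x))


def M(x):
    # Pointwise maximum of the two convex functions.
    return max(f1(x), f2(x))


def violates_midpoint_convexity(f, points, tol=1e-12):
    """Return a pair (a, b) with f((a+b)/2) > (f(a)+f(b))/2, or None if no pair on the grid fails."""
    for a in points:
        for b in points:
            mid = 0.5 * (a + b)
            if f(mid) > 0.5 * (f(a) + f(b)) + tol:
                return (a, b)
    return None


grid = [i / 10.0 for i in range(-30, 31)]
min_witness = violates_midpoint_convexity(m, grid)
max_witness = violates_midpoint_convexity(M, grid)
print("minimum fails the midpoint test at:", min_witness)
print("maximum fails the midpoint test at:", max_witness)
```

The kink shows up directly: the minimum is 0 at x = -1 and at x = 1 but equals 1 at the midpoint x = 0, so the chord lies below the graph there.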

49:41

And we could have the maximum of 1500 functions.

49:50

If the 1500 functions are all convex, the maximum will be.

49:55

Because it's the part above everybody's graph.

50:00

And that would be the graph of the maximum.
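
The "part above everybody's graph" argument can be sketched in symbols as follows. The notation epi f for the set on and above the graph of f is assumed here, and the last step uses the fact that an intersection of convex sets is convex.

```latex
\[
  M(x) = \max_{1 \le i \le k} f_i(x)
  \quad\Longrightarrow\quad
  \operatorname{epi} M
  = \{\, (x, t) : t \ge f_i(x) \text{ for all } i \,\}
  = \bigcap_{i=1}^{k} \operatorname{epi} f_i .
\]
% Each epi f_i is a convex set because f_i is convex, so the intersection,
% which is epi M, is convex, and therefore M is a convex function.
```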

50:03

OK.

50:04

Good.

50:05

And now finally let me just say how do you know whether a function is convex?

50:13

How to test?

50:15

How to test?

50:17

OK.

50:20

So let me take just a function of one variable.

50:24

What's the test you learned in calculus?

50:28

Freshman calculus actually.

50:30

Just show that this is a convex function.

50:36

What's the test for that?

50:40

Second derivative should be positive or possibly zero.

50:45

So second derivative greater than or equal to zero.

50:51

Everywhere.

50:53

That's convex.
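
As a rough illustration of the one-variable test, the sketch below estimates the second derivative with a central difference and checks its sign at sample points. The helper names, the interval, the step size, and the tolerance are all assumptions chosen for this example. A sampled check like this can expose non-convexity but does not prove convexity.

```python
import math


def second_derivative(f, x, h=1e-4):
    # Central-difference estimate of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def looks_convex(f, lo, hi, samples=201, tol=1e-6):
    # True if the estimated second derivative is >= 0 (up to tol) at every sample point.
    step = (hi - lo) / (samples - 1)
    return all(second_derivative(f, lo + i * step) >= -tol for i in range(samples))


square_ok = looks_convex(lambda x: x * x, -2.0, 2.0)    # f'' = 2 everywhere
exp_ok = looks_convex(math.exp, -2.0, 2.0)              # f'' = e^x > 0 everywhere
cubic_ok = looks_convex(lambda x: x ** 3, -2.0, 2.0)    # f'' = 6x < 0 for x < 0

print(square_ok, exp_ok, cubic_ok)
```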

50:58

Final question.

51:00

Suppose f is a vector.

51:03

So this is a vector.

51:06

And so I have n functions of n variables.

51:09

No, I don't.

51:10

I have one function.

51:13

But I'm in n variables.

51:16

So what's the test for convexity?

51:27

So it would be passed, for example, by x1 squared plus x2 squared.

51:35

Would it be passed by?

51:37

So here would be the question.

51:39

Would it be passed by x transpose S x, for some symmetric matrix S?

51:45

That would be a quadratic, a pure quadratic, would it be convex?

51:54

What would be the test?

51:57

I'm looking for an n-dimensional equivalent of positive second derivative.

52:04

The n-dimensional equivalent of positive second derivative is convexity.

52:08

And we have to recognize what's the test.

52:12

So I could apply it to this function, or I could apply it to any function of n variables.

52:23

And I maybe should be here.

52:28

What's the test here?

52:29

Here I have a matrix instead of a number.

52:33

So what's the requirement going to be?

52:38

Positive definite, or semidefinite.

52:43

Or semidefinite, just as here.

52:46

So the test is a positive semidefinite Hessian.

52:54

And here the Hessian is actually that S, because the second derivatives will produce,

53:00

I'll put a half in there.

53:02

The second derivatives will produce S, equal to the Hessian H.

53:06

So here it's S.

53:08

So a positive semidefinite Hessian in general, the second derivative matrix; S for a quadratic.
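
For a pure quadratic f(x) = 1/2 x^T S x in two variables, the Hessian is the constant matrix S, so the convexity test reduces to checking the smallest eigenvalue of S. The sketch below does that with the closed-form eigenvalues of a symmetric 2 by 2 matrix. The function names and the three example matrices are assumptions for illustration. For larger n, a library eigenvalue routine would be the usual choice.

```python
import math


def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


def is_positive_semidefinite(a, b, c, tol=1e-12):
    # S is positive semidefinite exactly when its smallest eigenvalue is >= 0.
    smallest, _ = eigenvalues_2x2_symmetric(a, b, c)
    return smallest >= -tol


# x1^2 + x2^2 = 1/2 x^T S x with S = [[2, 0], [0, 2]]: positive definite, a bowl.
bowl = is_positive_semidefinite(2.0, 0.0, 2.0)
# S = [[1, 1], [1, 1]] has eigenvalues 0 and 2: semidefinite, still convex.
trough = is_positive_semidefinite(1.0, 1.0, 1.0)
# S = [[1, 2], [2, 1]] has eigenvalues -1 and 3: indefinite, a saddle, not convex.
saddle = is_positive_semidefinite(1.0, 2.0, 1.0)

print(bowl, trough, saddle)
```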

53:17

So it's convex problems that we're going to get farther with.

53:28

We run into no saddle points.

53:31

We run into no local minima.

53:34

Once we found the minimum, it's the global minimum.

53:37

These are the good problems.

53:38

OK.

53:39

Again, happy to see you today, and I look forward to Wednesday.

00:00

Introduction to Optimization

01:40

Taylor Series in Optimization

10:00

Gradient and Hessian Concepts

13:20

Overview of Newton's Method

15:53

Convergence of Newton's Method

24:20

Steepest Descent Method

27:33

Newton's Method for Optimization

30:38

Convergence Rate

32:34

Convexity in Minimization Problems

41:20

Intersection and Union of Convex Sets

44:10

Understanding Convex Functions

47:40

Maximum and Minimum of Convex Functions

49:14

Identifying Maximum Functions

50:04

Understanding Convex Functions

50:13

Testing Convexity of Functions

52:43

Convexity Conditions with the Hessian

05:30

What is the role of the Hessian matrix in optimization?

11:00

How does Newton's method help you solve equations systematically?

09:40

What role do derivatives play in the optimization process?

20:00

What makes Newton's method converge to solutions so quickly?

30:50

Why is steepest descent not as efficient as Newton's method?

28:40

What important role does the Hessian play in optimization methods?

32:50

How does convexity affect the minimization process in optimization?

43:05

Which proof shows that the intersection of convex sets is always convex?

47:51

Can we define new convex functions as the minimum or maximum of existing ones?

50:30

How do we find out whether a function is convex using derivatives?

52:46

What role does the Hessian matrix play in testing convexity for n-dimensional functions?

53:34

Why is it important to find global minima in optimization problems?


Statistics, Convex analysis, Newton's method, Mathematical model, Algorithm, Computer science, Data analysis, Machine learning, Mathematical optimization, Jacobian matrix and determinant, Gradient descent, Numerical analysis

Description

This topic covers the fundamentals of optimization and focuses on minimizing a function step by step. The content will likely explore various algorithms and techniques used in optimization that are crucial for applications in machine learning and artificial intelligence. You can expect to learn the essential concepts and principles underlying optimization, such as finding the minimum value of a function. The author will guide you through the process of applying these concepts to solve problems, using real examples and practical demonstrations. This educational content may also address the importance of optimization in deep learning and highlight its role as a fundamental algorithm. By covering the fundamentals of optimization, this material aims to provide a solid foundation for understanding more advanced topics in machine learning and AI.