What you see here is one part of a really elaborate process that defines the universally used
to JPEG image compression format.
JPEG is rather complex, and in this video, the majority of the focus will be on understanding
how computer scientists came up with an algorithmic and mathematical framework
resolving the complex problems that image compression presents.
But to understand the motivation behind the ideas in JPEG, we'll have to dive into the inner
workings of the many components involved.
The perspectives will take will be a little bit unorthodox, but my hope is that you come
away with the better understanding of the big themes in image compression, which apply to other
compression related problems. It's not an exaggeration to say that these concepts are used
every time you open an image, play a video, or listen to some music. As we go through JPEG,
we'll interact with the wide variety of beautiful ideas in the world of data compression
and signal processing that make the technology around us possible.
Before we dive into JPEG, let's talk about how computers represent images.
The standard color space that computers use is the RGB model.
Every pixel of an image stores three values from 0 to 255 with higher values representing a
larger weighting of the respective color.
So assuming each color component is expressed in 8 bits, or a single byte of memory,
an image has 3 bytes per pixel.
Here's an image with a little more than 5 million pixels. Based on our assumptions,
the total size of this image should be about 15 megabytes. But with JPEG compression,
the actual file is only 0.8 megabytes.
Same number of pixels, but 5% of the expected size, and the image looks absolutely beautiful.
This is the real magic of JPEG compression.
JPEG aggressively takes advantage of several clever ideas to achieve
seemingly ridiculous amounts of compression with minimal effects on the quality of the
original image. One of the primary reasons JPEG works so well is it uses lossy compression.
To understand what that means, let's think about compression from a big picture perspective.
We start with an RGB representation of an image, and then we encode it using a compression algorithm.
This is what we store in memory and it's more compact, but quite different than our original RGB
representation. So part of a compression scheme requires also defining a decoding component
that converts the stored representation of our data into the RGB format that a computer can render
as an image. Part of the JPEG standard is defining how both the encoding and decoding work.
A key point in JPEG is that the final decoded image is not going to be the same as the original
uncompressed image. That's why we call it lossy compression. In the compression part of the
pipeline, we are going to deliberately lose information. To get compression on the levels of
5%, there's really no other option other than to actually discard some information from our
original image. Now the fun question to ask is what sort of information from an image can we get
rid of and how do we get rid of it? Answering this question is going to be the primary focus
of our journey into understanding JPEG. Here's an interesting image for you. If I were to ask you
what colors the squares of A and B were, I imagine most of you would quickly say that A
is a darker shade of gray than B. But what if I told you that A and B were actually the same color?
It's okay if you don't see that. This picture is designed to trick our visual system.
But once we have a connector of the common color between the two squares, it's much easier for us
to see that they are in fact the same color. So what's going on here? Over the years, scientists
have developed a human visual system model through the study of our eyes and one incredibly
interesting finding through experimentation is that our eyes are much more sensitive to brightness
than they are to colors. And part of the JPEG compression scheme can take advantage of this,
but to understand how we have to dive into the world of color spaces.
As we've discussed, the RGB color space is a combination of red, green, and blue color
components. If we put each value on a separate axis in a three-dimensional space, we can see
how all the possible colors are just a point on this cube. One aspect of RGB color space is as
you progress on the diagonal from the origin to the color 255, 255, 255, you get gradually brighter
colors. And in fact, the exact line between these points defines all possible gray-scale colors,
which are a direct measure for brightness. This idea of separating brightness is
core to another color space called YCBCR. YCBCR stands for Y, Chroma Blue, and Chroma Red.
Our Y component is going to measure the luma or brightness of an image, and our CB and
CR components are going to encode the colors. If we look at the color space, the Y can be thought
of as a single vertical axis with larger values encoding more brightness.
Every cross section of the space defines a range of colors at that particular brightness.
For our purposes with JPEG, using the color space gives us direct access to the part of
color that our eyes perceive best. As a result of being more sensitive to brightness than colors,
one idea to compress our original image involves sampling less of the CBCR components
and keeping all of the luma components. The technique is referred to as Chroma Down sampling,
or more commonly, Chroma Sub sampling. Suppose I have this 8x8 image which has the following Y,
CB and CR components. The key idea of Chroma Down sampling or Sub sampling is to take fewer
samples from CB and CR components since our eyes are less sensitive to them. Here's one approach
that defines a 4 to 0 Chroma Down sampling scheme. We go through our original 8x8 image
in 2x2 blocks and simply average the group of pixels to get a shared value of the four pixels
in the original image. Averageing the pixels by the way is all down sampling really means.
Chroma Sub sampling is the same exact idea, but instead of averaging, we just choose one of the
samples usually the top left pixel to be the color of the entire 2x2 block.
Once we have these fewer samples from the color components, we can merge them with the luma
component which will retain the original 16 pixels and this gives us our sub sampled image.
In this case, you can see quite a difference since our 8x8 pixel image is significantly scaled up,
but in real world images, it's often hard to see any changes after sub sampling.
By merging 2x2 blocks on the CB and CR channels into one color, we are left with the
quarter of the original data in each color channel, shrinking the total file size by 50%.
We're still quite far from the 5% levels that we saw in JPEG, so we're going to have to exploit
more than just even perception of brightness. For the following components of JPEG,
let's focus on the Y channel, which essentially defines gray scale images.
The principles we'll discuss from here on out will also apply to the color components of an image.
The next clever idea in JPEG requires looking at images in a completely different perspective,
one that can be a little bit counterintuitive. One way to think about images is treating them
as signals. If I slice a particle row of an image, I essentially have a row of pixels
each with some value between 0 and 255. If we plot these values, we can get an approximation
of a signal. Visualizing an image as a signal allows us to talk about frequency components within
an image. Higher frequency components correspond to rapid changes between pixels,
while lower frequency components are related to smoother changes between pixels.
There are two key aspects of frequencies within images that are incredibly important to JPEG
compression. The first is that a lot of real world images shot from cameras are mostly composed
of lower frequency components. In other words, if I take a random portion of a realistic image,
it's pretty likely that the pixels in that area do not change that rapidly.
And the second key fact is that from a variety of experiments, the human visual system is generally
less sensitive to higher frequency detail in images. JPEG takes advantage of these ideas by
strategically removing less important and less common higher frequency components from an
image to achieve even more compression. But there's one big problem. How do we get frequency
components from an image? This is where some particularly clever and beautiful math comes into
play. The answer to this question lies in a special operation called the discrete cosine transform
or the DCT. The DCT works for any size input, but to simplify things, let's focus on an input
of eight pixels. Just as we did earlier, let's suppose these eight pixels form some sort of signal.
We'll never be sure what exactly the signal looks like since we only have eight points,
but the clever and definitely not obvious idea of the DCT is to represent these eight points as
sums of sample points from cosine waves. And I really want to emphasize the fact that we only
care about the discrete samples. Visually, I think it's nice to see the continuous signals
and cosine waves, but throughout a discussion, the only values that really matter are the sample
points from these functions. The DCT takes an input of sample points from our original signal
and gives us an output of the same size, we'll refer to the outputs of the DCT as coefficients.
These coefficients represent the weights of cosine waves of different frequencies that contribute
to the original signal. A nice analogy is to think of this as unraveling a complex signal
into a weighted sum of simpler cosine waves. If you've never interacted with this type of idea
before, it's natural to be confused. What cosine waves do we even use? How do cosine waves
relate to pixels on an image? None of it makes any sense. Don't worry, these are important
questions that we will answer. Let's start simple and talk about cosine waves. Here's a graph
of cosine from 0 to pi. I've given you this general notion that the DCT is supposed to tell us
how much of a specific cosine wave is contained in a signal. So let's test this out.
What happens if I provide an actual cosine wave as the input signal to the DCT? What do we expect
to happen? Okay, we can try this, but there's a problem. To follow our existing example,
we need eight sampled points from the cosine wave to make this work. How exactly should we sample
the cosine wave? Well, there are a few options, but let me present to you the most common one.
What we can do is split our cosine waves domain into eight even sizes, and then we take the midpoint
of each of these sizes. This gives us the following input points, which we can generalize
for any number of points. But for our purposes, we'll stick with the smaller n equals eight example.
So going back to our question, what should we expect the output to be when we pass
in sampled points from a standard cosine function? This is an interesting experiment.
When we pass these points from a cosine wave into a DCT transform, we get the following output.
Only one coefficient has a non-zero value, meaning there's only one cosine wave that contributes
to our input. And that seems to make some sense since the input is literally from a cosine wave.
In this case, the first index is the only coefficient with a non-zero value. When trying to
understand complex ideas, it really helps to play around with these simple examples. A cool follow-up
to our experiment is to see what happens when we change the amplitude of this cosine wave.
The first index DCT coefficient increases. If we flip the cosine wave by multiplying negative one,
the DCT coefficient changes sign. It's exactly acting like a weight or a cosine wave.
When the amplitude of the input cosine wave changes, the weight correspondingly reflects that
change. So taking a step back, how does this relate to images?
Well, just as we took images and represented them as signals, the reverse also works.
Standard grade scale images have pixels ranging from zero to 255.
The intuition with cosine waves to images makes more sense when we shift the range of pixel
values by 128. With pixel values from negative 128 to 127, we can see a better mapping between
this cosine wave to an actual set of 8 pixels. This particular wave is a nice way to represent
a row of gradually decreasing pixel values. And the magnitude of that change as well as the
direction of the change is reflected in the amplitude of the original cosine wave and consequently
the DCT coefficient. So let's continue this experiment to see what else we can uncover about the DCT.
We've messed with the amplitude of a cosine wave. What other parameters could we change?
A simple one is to just shift the cosine wave up or down. Let's see what happens when we try that.
It looks like shifting up or down the signal only affects the zero index coefficient.
That's an interesting data point that we'll come back to. Another parameter of cosine waves is
the frequency. What we're going to do now is show the DCT coefficients as we wind up the frequency
of this cosine wave. I'll keep the sampling strategy we discussed earlier consistent among
all frequencies. Let's see what happens. As we increase the frequencies, we get a few different
DCT coefficients for the respective cosine wave. That is until we get to this cosine wave.
For this particular cosine wave, only the second index has a non-zero coefficient.
This cosine wave is actually just double the frequency of the previous cosine wave.
This is super interesting. The first index of the output seems to nicely correspond with the
cosine wave of frequency 1, while the second index correlates with a cosine wave of frequency 2.
Let's continue this experiment of increasing frequencies. But before I continue,
see if you can take a guess at what frequencies the other coefficients will correspond to.
Here we go. We slowly increase the frequency and boom. The index 3 coefficient
corresponds to a cosine wave of frequency 3. Then frequency 4 comes next.
And this pattern continues until we get to a cosine wave of frequency 7.
Pretty insane, right? So for the coefficients index 1 to 7, it looks like they represent
the weight on a cosine wave with the frequency that matches the index.
So what about the remaining index 0? We saw shifting cosine waves up and down led to a change
in the 0th index. What cosine wave does that represent? Some of you have probably figured it out,
but if you think about what a 0 frequency cosine wave is, it's just a constant signal.
What that means in terms of images is it gives us a measure of the overall brightness of a set
of pixels. Brighter images will have a larger 0th coefficient than darker images.
This is why shifting up a cosine wave only impacts the 0th coefficient.
Putting this all together, each of these frequencies correspond to a different pattern of images.
And what the core DCT does is break down how each of these fundamental patterns contribute
to the original image. And it turns out that all possible combinations of 8 pixel values can be
represented as a sum of these 8 cosine waves. Why that's true is not at all obvious,
but we can begin to understand it once we translate this intuition
to the actual math behind the DCT. If you look at the mathematical definition of the DCT,
we usually have a vector definition of the original signal and the output coefficients.
We want to define the kate index of the coefficient vector mathematically.
What you'll often see is something that looks like the following.
And with the intuition that we just built up, we'll see that this equation is doing exactly what
we want. Let's start with the cosine term. This function should be familiar. It's the exact
representation of a sample point from a cosine wave using our earlier sampling scheme,
and it incorporates the frequency of the cosine wave as well. Now what's interesting is in order
to get the kate index, we are actually summing over a product of each sample point with samples
from the cosine wave. Why does that make sense? This type of expression might look vaguely
familiar to a lot of you. Let me rewrite this another way. We know that the original signal points
can be represented as a vector, but what if we rewrote the sample points from the cosine wave
as a vector as well? What does this expression mean in the context of these two vectors?
It's a dot product, and what we know about dot products is there a nice way to measure
similarity between two vectors. That's why when we pass in sample points from a cosine wave
of frequency k as the input to the DCT, we got large values at the kate index coefficient.
These two vectors were just scaled versions of each other, so the dot product was maximized.
And this perspective reveals what I think is truly the most surprising and elegant part of the DCT.
By picking the points through the sampling method, we can think of the entire DCT as a matrix
vector product. All we're doing here is a linear transformation. The rows of the matrix are the
sampled points from the cosine waves of the respective frequencies. And what's truly astounding
is that all row vectors in this matrix are orthogonal to each other. What I mean by that is if
you take the dot product of any two row vectors representing cosine waves, you will get zero
if they are different rows of the matrix. Intuitively, this is why in our earlier experiments,
when we pass in a cosine wave of a particular frequency as an input into the DCT, we didn't get
a contribution from any of the other coefficients which represent a different frequency cosine waves.
The orthogonality of the sampled points from different cosine waves generates this behavior.
It's really quite beautiful. Another great property of the DCT that follows from these facts
is invertibility. I've talked about the DCT as a way of decomposing a signal into a coefficient
representation of weights associated with cosine waves. We can also reverse this process.
If I take my coefficient representation of the signal, I can apply what's called the inverse DCT
to get back the original signal. And it is the exact same signal. No information is lost in this
step. How we do that is by multiplying our coefficient representation with the inverse of the matrix.
What's cool about this is that because of the orthogonality of the vectors,
the inverse is just the transpose of our original matrix with some additional normalization constants.
Now there's a super nice interpretation of the inverse DCT. The sampled cosine wave
points are now column vectors. So what the inverse DCT is doing is essentially summing over
a weighted combination of cosine waves directly to get the original signal. And because these
columns are orthogonal to each other, that's what allows us to represent any set of eight points
with these eight cosine waves. Absolutely incredible. I know we spent some time and went
through some fairly complex math to get here. But it's precisely these details that are the most
fundamental part of not only the DCT, but many other similar transforms in the world of signal
processing. Now that we have a good intuition on the one dimensional DCT, let's talk about how
JPEG specifically uses it. JPEG takes an image and splits it into eight by eight blocks and then
centers their values around zero by subtracting 128. Then we take the block and apply the DCT
to each row of the block, giving us eight sets of DCT coefficients. We then apply the DCT to each
column of the block. This process is what defines the two dimensional DCT. So in the end,
we have 64 coefficients, each of which are a weight on a specific eight by eight pattern.
Notice the first row and column correspond to the earlier one dimensional patterns,
and the other elements are compositions of these patterns. And just like in the one dimensional
case, the big idea here is that we can build up any eight by eight image using these 64
fundamental patterns. The same signal perspective we talked about earlier also applies here,
except now with 2D waveforms. What's going on here is we are plotting the pixel value on the
z-axis with brighter pixels having larger values. What's fun to play around with is seeing
how the waveform and image come together as we slowly put together the 64 coefficients
in increasing frequencies. Seeing this in action makes you realize that one interesting
property is that by the time we incorporate a small portion of the coefficients, our signal and
image already look pretty close to the original versions. There's an even more direct experiment we
can run to quantify this notion. This particular eight by eight block was randomly picked out
of the original image. If we map out the magnitude of the DCT coefficients on this block,
we see that most of the largest values are in the upper left section, which corresponds to lower
frequency components. And what's even more interesting, if I take any eight by eight block on this
image, almost all of them have the same property. This property of the DCT is what's commonly
referred to as energy compaction. After applying the DCT, most of the largest values are
concentrated in a few low frequency coefficients and this holds true in a lot of real world
images. The concept of energy compaction is incredibly important in image compression. As we will
see, it's exactly the property that will allow us to aggressively compress images while still
retaining high visual quality. Fun fact, the original discovery of the DCT centered around
approximating other transforms that had better energy compaction properties, but were too expensive
to carry out. The DCT is just one example of a transform that has this property for real world
images and we use it because it's quite easy to compute. There's a lot of complexity involved here,
but one of my goals in this discussion of the DCT and JPEG was directly interacting with these
deep and important ideas through questions and visual experiments. Interactivity is a core part
of learning and a website that does a fantastic job of interactive explanations is brilliant,
the sponsor for this video. From the basics of mathematics and algorithmic thinking to more
complex ideas in deep learning and probability, brilliant offers a variety of courses and learning
paths for those interested in getting hands-on practice. I would discussions of JPEG interacted
with some linear algebra in the application of image compression and brilliant has an entire
linear algebra module that goes through the fundamentals and even shows applications of these
ideas in image compression, cryptography, error correcting codes and much more. When I was a student,
I really enjoyed their computer science fundamentals course, which has engaging visualizations
of concepts and great practice problems that helped me solidify my foundations. You can get started
for free by going to brilliant.org slash reducible, which is linked in the description below.
Brilliant is providing a special offer through this channel where the first 200 members to sign
up get 20% off the annual subscription. It's a great way to learn more about the topics in these
videos and also a good way to support this channel. Big thanks to Brilliant for sponsoring this video.
Let's put everything we've discussed with the DCT together in one more experiment. We'll split
our image into 8 by 8 blocks and then basically rebuild the image with each block having only a
certain number of DCT coefficients. We're going to start off with zero coefficients and slowly build
up the image. After one coefficient, we end up with basically a blur of the original image.
And as we add DCT coefficients slowly, notice how quickly the image starts looking like the original.
By the time we get the less than 25% of the DCT coefficients, you almost can't even tell the
difference between the two images. This confirms the key aspects of why JPEG works for this particular
image. Almost all the blocks are composed of the lowest frequency components and we are generally
less sensitive to changes in high frequency details. So at this point, we know we can eliminate
higher frequency components from the DCT. But the next natural question is how we actually do this.
The process for eliminating higher frequency components in JPEG is called quantization.
Quantization is a simple idea. Given an 8 by 8 matrix of frequency coefficients from the DCT,
what we're going to do is basically divide each element by a scalar value and round it to an
integer. These values are defined in terms of a quantization table. Notice larger values in the
bottom right of the table leading to zero values in the higher frequency components.
In the decoding state of JPEG, we'll actually be multiplying this result by the same quantization
matrix element by element. And as you can see, the final coefficient matrix will be quite
different from the original one. So what that means is we're purposely losing information in this
step. But the key idea here is most of the lower frequency components will be retained.
This is why the energy compaction property of the DCT is so useful. When the largest values lie
in the lowest frequencies, we will end up with a lot of zeros in the less important high
frequency components. These quantization tables are provided by the JPEG standard from visual
experiments and are the main way for JPEG to define quality of compression. High quality
compression parameters can be translated to lower quantization table values. In practice,
JPEG also defines a separate quantization table for both the Luma and Color channels. Notice
that in the Color channels, quantization can be even more aggressive. After performing quantization,
we have a matrix of quantized DCT coefficients where we can now exploit redundancy to get even
more compression. The last part of JPEG encoding involves a combination of run-length encoding
and Huffman encoding. One clever trick is that a JPEG encoded will order the coefficients in a
zigzag manner to maximize the chance of a large sequence of zeros in order. Classic run-length
encoding can compress this fairly easily. All that's going on here is we are compressing every
sequence of zeros into account of the occurrences in a continuous sequence. JPEG actually performs
something a little bit more sophisticated by keeping track of a triplet. For every coefficient,
this triplet encodes the number of preceding zeros, the number of bits required to encode the
coefficient, and finally, the actual coefficient value. We also have an end of block value to signal
that everything from here on out will be zeros. This particular scheme works well with Huffman
encoding to further exploit redundancy. The big idea of Huffman codes is that more frequently
used data can be encoded with fewer bits, and it turns out especially with the nature of quantization,
these triplets can be further compressed and some of these values will be more frequent than others.
However, I'm purposefully not going to go into the details of how JPEG users have been
codes to compress the data because it really does get quite tricky. To give you some sense of the
problems, we have to deal with encoding signs of coefficients as well as triplets for all 8x8 blocks.
Most encoders also encode the top left coefficient separate from all the other coefficients.
And when you handle that, you have to deal with this on both Luma and Color channels.
And when you eventually get that working, a good chunk of your logical break when you have to
deal with the different types of chroma subsampling. Implementing an optimized fully functional JPEG
encoder in decoder is no joke. I wouldn't give that task to even my worst enemies.
But in terms of the big picture, all that's going on in this component is taking advantage of
the redundancy that quantization creates. A JPEG decoder will be able to use the Huffman code data
in the files to get back all quantized DCT coefficients that were encoded.
This part of the JPEG algorithm does not lose any information.
JPEG as a whole brings about an interesting discussion on the philosophy of data compression.
The classic and most straightforward way to compress data is by taking advantage of redundancy.
This is the basis of loss's image compression algorithm such as those found in PNG file formats.
In fact, for images where it's really important not to lose any information,
PNG format is recommended over JPEG. But on most real world images, being aware of the
medium of presentation introduces another really powerful perspective.
A lot of innovation in JPEG compression comes from experiments and understanding of human
visual systems. It's from these experiments that we realized human eyes are less sensitive to color
and also less sensitive to higher frequencies, so we can remove that information without a
significant visual impact. This is why JPEG is so much more effective at compressing images than
lossless formats. You'll find these same types of techniques used in audio and video compression,
where algorithms use our perceptions of sound and motion respectively to remove less relevant data.
In fact, variations of the discrete cosine transform and quantization show up in both audio and
video compression. It really is incredible to me how people in these fields came up with the
mathematical and algorithmic framework to utilize the way we actually perceive the digital
technology around us. There's so much depth to these topics that I can never hope to cover in
just one video, but I do hope this gives you a sense and appreciation for the complexity of the
technology around us that we use on a daily basis. Thanks for watching and I'll see you all in the
next one.
Understanding JPEG Compression
Color Representation in RGB Model
Chroma Subsampling Techniques
Discrete Cosine Transform in Image Processing
Understanding Coefficients in DCT
Matrix Representation of DCT
Energy Compaction in DCT
Understanding JPEG Information Retention
The Philosophy of Data Compression
Impact of Human Perception in Compression
What are the key advantages of using JPEG compression over other methods?
How does Chroma Subsampling reduce image file sizes while maintaining quality?
What role does the Discrete Cosine Transform play in JPEG compression?
How do DCT coefficients correspond to cosine wave frequencies in images?
What role does energy compaction play in image compression techniques like JPEG?
How does quantization affect JPEG compression quality and size?
How does quantization utilize redundancy in JPEG compression?
What makes JPEG more effective than lossless formats like PNG?
The video starts by explaining the complexity of image compression and how computer scientists developed an algorithmic and mathematical framework to address these challenges. It delves into the inner workings of JPEG and its various components, providing an unconventional perspective on the subject. The video will cover a wide range of topics, including data representation, color spaces, and signal processing. By the end of the video, viewers will have a deeper understanding of the big themes in image compression and how they apply to other compression-related problems. The video will also explore the beauty of the ideas in data compression and signal processing that make technology possible.