Have you ever wondered how your internet search browser predicts what you're about to ask?
For instance, the algorithm knows that after you type "will dinosaurs," there's a pretty good chance
that next you'll type "come back." But how do programmers actually build a trained algorithm
to make this kind of prediction?
Well, in this course, you'll learn the process for creating complex models,
which can be used to offer suggestions about all kinds of things.
These models work by applying previously gathered data to make educated guesses
about everything from dinosaurs to the music you listen to, to the route you might take to work.
And without talented data professionals and the machine learning models they create,
we wouldn't have this useful feature.
Throughout this course, in addition to models, you will explore the major characteristics of machine learning
and keep developing your skills in Python.
But before we start, let me offer my congratulations on everything you've accomplished.
You've practiced data cleaning, reviewed key statistics concepts, and explored regression models.
Along the way, you've become better equipped to navigate the more complex parts of being a data professional.
I will be your guide as we consider a big picture perspective of the world of machine learning and complex models.
My name is Sushila. I'm a data scientist and I work on projects for YouTube here at Google.
At YouTube, we use machine learning every day to deliver content to users, from music reviews to travel vlogs to my personal favorite, cat videos.
I'm excited to explore machine learning with you.
As you've learned previously, machine learning is the use and development of algorithms and statistical models to teach computer systems to analyze and discover patterns in data.
Its applications are diverse, from optimizing microchip design and improving earthquake predictions to recommending videos to watch on YouTube, and more.
This course will help you on your journey to becoming a data professional who can use machine learning to work on applications similar to the ones I just mentioned.
Before you begin, it may be helpful to review linear and logistic regression and statistical models covered earlier in this program.
The terms complex models and machine learning are often used interchangeably.
In this course, when the term complex model is used, we are referring to mathematical or computational models in general, inclusive of everything from regression to deep learning.
First, you'll learn the different types of machine learning, like supervised and unsupervised learning.
Then, you'll learn how to implement some of them. Each type will be broken down by its characteristics so that the uses and purposes of each type are clear.
Next, we'll consider another perspective of the data science workflow and discover how to apply PACE to machine learning.
In addition, you'll learn how exploratory data analysis, or EDA, relates to machine learning.
You will also practice additional skills like feature engineering.
Another part of your introduction to the machine learning landscape will be selecting relevant models and metrics.
You'll determine the most appropriate model for your purpose and data. You will learn which tools to use to evaluate your model's performance.
You will also continue to work in Python. You'll explore the most commonly used Python libraries and packages for specific types of machine learning.
And you'll use the same resources that Google's data experts rely on in their work.
By the end of this course, you will have a virtual tool belt packed with the tools you need to build and improve on complex models.
Lastly, you will have the opportunity to practice your new job-ready skills in a portfolio project that's based on a common workplace scenario for data professionals.
Machine learning is an exciting field for data professionals. It's evolving and developing all the time with new applications every day.
The concepts in this course will help prepare you to join the field and advance your career in data science.
Let's discuss some topics that we'll cover in this part of the course.
First, we'll consider the main types of machine learning.
Then we'll review different variable types used for machine learning, like continuous, categorical and discrete.
It's essential to have an understanding of these concepts so you can build an appropriate model.
Next, we'll introduce you to recommendation systems and explore how they use content-based and collaborative filtering to make content suggestions.
Both types of filtering are important techniques for building many of the recommendation systems used today.
Finally, we'll learn about the ethics and applications of machine learning.
We'll examine the questions that data professionals must ask as they plan, analyze, construct and execute models.
Because machine learning models are very powerful, it's important for data professionals to consider the ethical implications of their work.
Coming up, we'll give you the tools to avoid some common mistakes that can mean the difference between a problematic project and a successful one.
Now let's get started.
Earlier in the program, you learned that logistic regression and linear regression have different purposes.
For example, when you discover data points are grouped into two different categories in a data set, a straight line may not describe the data very well or be useful as a means of predicting outcomes.
A logistic regression or sigmoid function is a better fit for a data set like that, particularly when new data points are introduced to it.
As you've learned, linear regression is particularly good for data sets in which data points can be represented with a straight line.
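To make the contrast concrete, here is a minimal sketch of how a sigmoid function turns a straight-line score into a probability between 0 and 1. The slope and intercept are hypothetical values chosen only for illustration, and the function names are not from any particular library.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, slope, intercept):
    """Logistic regression idea: pass a straight-line score through the sigmoid."""
    return sigmoid(slope * x + intercept)

# Hypothetical coefficients, not fitted to any real data.
slope, intercept = 2.0, -6.0

for x in [1.0, 3.0, 5.0]:
    p = predict_probability(x, slope, intercept)
    predicted_class = 1 if p >= 0.5 else 0
    print(f"x={x}: probability={p:.3f}, predicted class={predicted_class}")
```

The straight line on its own would produce values far outside 0 and 1, which is why the sigmoid is a better fit when the outcome is one of two categories.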
With some data sets, simple regression models may not be sufficient for your analysis.
For example, this graph shows a distribution of three classes.
A simple linear regression model wouldn't be very useful if we didn't know what class our observations belonged to.
And a simple logistic regression model wouldn't be able to predict a problem with three classes.
This is where data professionals need more complex machine learning.
But many machine learning models use regression principles as a foundational layer to begin the process of teaching a computer model to make decisions.
Depending on the type of data available and the kind of problem you want to solve, you'll probably select one of two machine learning types: supervised and unsupervised.
Because supervised learning problems occur more frequently in the workplace, data professionals use this type most often.
Supervised machine learning uses labeled datasets to train algorithms to classify or predict outcomes.
Data professionals use supervised machine learning for prediction.
Labeled data is data that has been tagged with a label that represents a specific metric, property, or class identification.
For example, imagine you need an algorithm to predict whether a bird is a penguin or an ostrich based on height.
You have a data set of heights and an indicator that specifies whether that measurement came from a penguin or an ostrich.
The height value is the x data.
The indicator is the label or the y data.
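As a rough sketch of what labeled data looks like in Python, the heights below are the x data and the species names are the y data. All of the numbers are hypothetical, and the "closest average height" rule is just one simple way to classify, not necessarily the method used later in this course.

```python
from statistics import mean

# Hypothetical labeled data: heights in centimeters (X) and species labels (y).
heights = [95, 110, 102, 88, 210, 240, 225, 198]
labels = ["penguin", "penguin", "penguin", "penguin",
          "ostrich", "ostrich", "ostrich", "ostrich"]

def fit_class_means(X, y):
    """Learn the average height for each label from the labeled examples."""
    return {label: mean(x for x, lab in zip(X, y) if lab == label)
            for label in set(y)}

def classify(height, class_means):
    """Assign the label whose average height is closest to the new height."""
    return min(class_means, key=lambda label: abs(height - class_means[label]))

class_means = fit_class_means(heights, labels)
print(classify(105, class_means))  # penguin
print(classify(230, class_means))  # ostrich
```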
Here's another example.
You own a restaurant and you have data on how many customers visit per month and how much revenue you generate per month.
If x is the number of customers and y is the amount of revenue, then you can use an algorithm to predict next month's revenue based on the number of projected customers.
Whether you're measuring birds or predicting revenue, you need labeled data for supervised machine learning.
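The restaurant example can be sketched with a straight-line fit. The monthly figures here are invented for illustration, and the least-squares formula is written out by hand so the example has no dependencies; in practice you would likely use a library.

```python
# Hypothetical monthly records: customers (X) and revenue in dollars (y).
customers = [800, 950, 1100, 1200, 1400]
revenue = [16000, 19000, 22000, 24000, 28000]

def fit_line(X, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(X)
    x_mean = sum(X) / n
    y_mean = sum(y) / n
    covariance = sum((x - x_mean) * (yi - y_mean) for x, yi in zip(X, y))
    variance = sum((x - x_mean) ** 2 for x in X)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = fit_line(customers, revenue)

# Predict next month's revenue from a projected customer count (assumed value).
projected_customers = 1300
predicted_revenue = slope * projected_customers + intercept
print(f"Predicted revenue: {predicted_revenue:.2f}")
```

Because the made-up data follows a perfect line, the fit is exact here. Real revenue data would be noisier, and the prediction would be an estimate.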
Next, think about the terms classify and predict as they apply to supervised learning.
We can use our bird and restaurant examples to help.
The bird example requires an algorithm that seeks to classify or collect different types together into categories, classes, or groups.
In the restaurant example about predicting revenue, the algorithm's goal is to forecast or estimate a value given data that is already labeled.
To summarize, supervised machine learning algorithms use data with answers already in it and use it to make more answers, either by categorizing or by estimating future data.
As a data professional, you will manually adjust these types of models to meet business needs.
Using your knowledge of data cleaning, statistics, and regression, you will learn to train, tune, and optimize complex models to deliver more accurate results.
The other most common type of machine learning used by data professionals is unsupervised learning.
Unsupervised machine learning uses algorithms to analyze and cluster unlabeled data sets.
In this type, data professionals ask the model to give them information without telling the model what the answer should be.
At this point, you may have an idea of what unlabeled data means.
Think back to the ostrich example we discussed earlier.
Unlabeled data would describe a set of flightless birds and not contain any kind of labels, tags, or categorizations.
When you receive a data set like this, the goal is to group the birds by their similarity based on patterns detected by your model, without there necessarily being a correct answer.
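Here is a small sketch of grouping unlabeled bird heights. It uses k-means, which is one common clustering algorithm and is an assumption on my part, since this passage doesn't name a specific method. The heights are hypothetical, and notice that no labels are supplied.

```python
def k_means_1d(values, k=2, iterations=10):
    """Group numbers into k clusters (k must be at least 2) with a basic k-means loop."""
    ordered = sorted(values)
    # Spread the starting centers evenly across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each value joins its nearest center.
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled heights in centimeters.
unlabeled_heights = [95, 210, 110, 240, 102, 225, 88, 198]

centers, clusters = k_means_1d(unlabeled_heights, k=2)
print(centers)
print(clusters)
```

The model finds two groups on its own. It's up to you to interpret what those groups mean.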
Once an algorithm is deployed, unsupervised learning will manage data as it comes in and classify or analyze it.
For example, when a news aggregator categorizes an article by topic, or a media platform categorizes a video by genre, this is done by unsupervised learning algorithms.
Later in the program, you will learn how these algorithms work on a conceptual level, how to implement them, and how to apply them to data sets you'll encounter on the job.
There are a couple of other types of machine learning besides supervised and unsupervised.
Reinforcement learning is often used in robotics and is based on rewarding or punishing a computer's behaviors.
The computer will take action based on a policy or set of rules that it has learned.
If the action results in a favorable outcome, it will receive a reward.
In an unfavorable outcome, the computer will receive a penalty.
Based on whether it received a reward or a punishment, the computer will update its policy, trying to optimize for rewards or minimize penalties.
This process will repeat until a satisfactory policy is found.
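The reward-and-penalty loop can be sketched in a few lines. This is a deliberately tiny, hypothetical setup with two actions and fixed outcomes. The action names, learning rate, and exploration rate are all assumptions for illustration, and real reinforcement learning problems are far richer.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical environment: one action earns a reward, the other a penalty.
OUTCOMES = {"left": -1.0, "right": 1.0}

# The learned estimates of how good each action is (the basis of the policy).
values = {action: 0.0 for action in OUTCOMES}
learning_rate = 0.1
explore_rate = 0.2

for step in range(200):
    if random.random() < explore_rate:
        action = random.choice(list(values))   # occasionally try something new
    else:
        action = max(values, key=values.get)   # otherwise follow the current policy
    outcome = OUTCOMES[action]                 # reward (positive) or penalty (negative)
    # Nudge the estimate for this action toward the outcome it just produced.
    values[action] += learning_rate * (outcome - values[action])

best_action = max(values, key=values.get)
print(values)
print(best_action)  # right
```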
Finally, there is deep learning.
Deep learning models are made of layers of interconnected nodes.
Each layer of nodes receives signals from its preceding layer.
Nodes that are activated by the input they receive then pass transformed signals either to another layer or to a final output.
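To illustrate how signals move from layer to layer, here is a minimal forward pass through one hidden layer and one output layer. The weights, biases, and the choice of ReLU as the activation are hypothetical. A real deep learning model would learn its weights from data and would typically be built with a dedicated library.

```python
def relu(x):
    """A node 'activates' only when its incoming signal is positive."""
    return max(0.0, x)

def identity(x):
    """Pass the signal through unchanged (used for the final output)."""
    return x

def layer_forward(inputs, weights, biases, activation):
    """Each node sums its weighted inputs, adds a bias, and applies an activation."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        signal = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(signal))
    return outputs

# Hypothetical input and parameters, chosen only for illustration.
inputs = [1.0, 2.0]
hidden = layer_forward(inputs, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = layer_forward(hidden, [[1.0, 0.5]], [0.1], identity)

print(hidden)  # [0.0, 2.0]  (the first node did not activate)
print(output)
```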
Another term you often hear in connection with machine learning is artificial intelligence.
Artificial intelligence includes all types of machine learning.
So we won't rely on the term for the purposes of this course.
Instead, we will focus on supervised and unsupervised learning.
These are the most common applications of machine learning, and having strong skills in these domains is valuable to potential employers.
Supervised and unsupervised learning use many of the same principles as reinforcement learning and deep learning.
So you'll have the foundation you need to explore these topics further on your own.
Now you're familiar with the machine learning landscape.
In this course, machine learning falls under the scope of artificial intelligence, which is illustrated on this map.
Machine learning and artificial intelligence refer to the same principle, training a computer to detect patterns in data without being explicitly programmed to do so.
Under the category of machine learning, you'll also find all the other types of learning we've discussed.
Finally, there's one aspect of machine learning and data science that every data professional should know.
Quality is more important than quantity.
A small amount of diverse and representative data is often more valuable for data professionals than a large amount of biased and unrepresentative data.
The concept of infinity can be difficult to comprehend.
Whether we're cooking a meal or reading a book, most activities humans partake in have a beginning, middle and end.
In other words, they're finite.
But as you've learned from previous courses, we data professionals deal with infinity all the time.
And this is also true when building complex models.
As you learned earlier in the program, continuous features can take on an infinite and uncountable set of values.
Understanding this concept is critical when selecting a machine learning model and choosing the measurements to check that model's utility.
Imagine you own a citrus tree farm, and you want to learn the average weight of this year's yield of kumquats.
The entire yield or population is 100 bushels.
Using simple random sampling, you pull three kumquats from each bushel and weigh them individually.
The recorded individual weights of all these kumquats are considered continuous data,
because the set of possible weights for one kumquat is infinite and uncountable.
In other words, one kumquat doesn't weigh exactly 15 grams.
And out of the 300 kumquats you weighed, their weights may have been measured to two decimal places,
like 15.76, 16.09, and 15.56.
But the measurement is continuous because the true weight could be any of an infinite number of values between those measured points,
like 15.762950.
Simply put, weight is a continuous feature because it has an uncountable set of possible values.
Conversely, the total number of kumquats in your 100 bushels has a fixed quantity.
Because of this, the total number of kumquats is not a continuous feature.
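The kumquat example can be mimicked in code. Since there's no real farm data here, the weights are simulated with a hypothetical distribution centered near 16 grams, and the `weigh_kumquat` function is a stand-in for an actual scale. The point is only to contrast a continuous measurement (a float) with a fixed count (an integer).

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is repeatable

BUSHELS = 100
SAMPLE_PER_BUSHEL = 3

def weigh_kumquat():
    """Hypothetical stand-in for a scale: returns a weight in grams."""
    return random.gauss(16.0, 0.5)

# Weigh three kumquats from each of the 100 bushels.
sample_weights = [weigh_kumquat()
                  for _ in range(BUSHELS)
                  for _ in range(SAMPLE_PER_BUSHEL)]

# The scale reports two decimal places, but the underlying weight is continuous.
recorded_weights = [round(w, 2) for w in sample_weights]

sample_size = len(recorded_weights)  # a count: always a whole number
print(sample_size)                   # 300
print(mean(recorded_weights))        # estimated average weight in grams
```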
As a data professional, knowing whether the features you input into a machine learning algorithm are continuous or fixed will be essential to choosing the correct model and the evaluation metric for that model.
Recognizing whether data features are continuous is not the only indicator to consider when deciding which machine learning model to use, but it is a very helpful one.
Here's our machine learning map again with some new information added to it.
Beneath supervised, you'll find a block of models used to predict continuous outcomes, including several regressors.
Supervised learning models that make predictions that are on a continuum are called regression algorithms.
Some of these models were introduced earlier in the program.
Others will be defined later in the course.
For now, just know that data professionals use these types of models to work with continuous data.
The goal of these models is to predict outcomes or values based on the data sets provided.
The nature of a data professional's job in this case is to train the model to predict the values as accurately as possible.
And that's what you'll learn coming up.
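One way to put a number on "as accurately as possible" for continuous predictions is to measure the typical size of the errors. The sketch below uses mean absolute error and root mean squared error, two common metrics that I'm assuming for illustration, since this passage doesn't name specific ones. The values are hypothetical.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Like the average error, but larger misses are penalized more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical true values and model predictions.
actual = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 12.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae)   # 1.0
print(rmse)
```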
Sorting candies by size and shape and color is pretty simple.
So it may seem like teaching a machine to sort wouldn't be that difficult.
Whether or not a particular model is appropriate for a problem like sorting by characteristic is largely determined by what type of variable it must predict.
In this video, we'll review a few types of variables that are helpful for determining the right supervised machine learning model for your data.
As a reminder, continuous variables are variables that can take on an infinite and uncountable set of values.
On the other hand, categorical variables and discrete variables are not continuous by nature.
Rather, categorical variables contain a finite number of groups or categories.
For example, you might use a categorical variable to classify a vehicle type, like car, motorbike or bus.
Next, there are discrete variables, which have a countable number of values between any two given values.
In this way, discrete variables are unlike continuous variables, which are uncountable and have an infinite set of values.
So the height of a tree is a continuous variable, but the number of trees in a park is a discrete variable.
Discrete variables are able to be counted and categorical variables are able to be grouped.
For example, the paint color of a house is categorical, while the number of houses in a neighborhood painted lavender is discrete.
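As a rough illustration of the three variable types, the helper below guesses a type from the Python values it's given. This is a simplistic heuristic written for this example, not a standard function. For instance, categories stored as integer codes would fool it, so treat it as a sketch of the idea only.

```python
def variable_kind(values):
    """Rough heuristic: guess a variable's type from its Python values."""
    if all(isinstance(v, str) for v in values):
        return "categorical"   # can be grouped
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete"      # can be counted
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous"    # can be measured to any precision
    return "unknown"

# Hypothetical data matching the examples above.
tree_heights_m = [4.2, 7.85, 12.0]
trees_per_park = [12, 40, 7]
paint_colors = ["lavender", "white", "blue"]

print(variable_kind(tree_heights_m))  # continuous
print(variable_kind(trees_per_park))  # discrete
print(variable_kind(paint_colors))    # categorical
```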
Recall the definition of supervised machine learning.
A category of machine learning that uses labeled datasets to train algorithms to classify or predict outcomes.
Categorical classification and discrete variables are part of that supervised learning definition.
Many machine learning algorithms are trained using large datasets that group data inputs into two or more groups.
Knowing what types of features you have in a dataset and what outcomes you're looking for will help you to determine the most applicable machine learning model.
Let's consider an example from the manufacturing sector.
You are the lead data scientist at a stuffed animal manufacturer.
You're lucky to have an automated system, which stuffs, stitches, and tags plush cats and dogs at the same time.
The system was set up that way because the stuffed animals are sold in packs of two, one cat and one dog.
But now, the retailer is requesting that cats and dogs be sold separately.
Rather than buying new parts to update the machine, the plant manager asks you to use a camera to identify the cats and dogs so they can be separated automatically.
The algorithm for grouping the cats and dogs based on images from a camera would use categorical data as part of a supervised machine learning model.
The algorithm will ask, is this a dog or a cat?
You'll train the computer using the visual data to recognize and separate the incoming dogs and cats.
With that problem solved, you're asked to build a model to predict how many shipping containers are needed to ship all of the stuffed animals.
This has a discrete target variable because you're counting a number of containers.
Now, let's revisit our machine learning map with the newly added categorical area. You'll find a few new terms.
Classification is the broad category under which logistic regression, decision tree classifiers, naive Bayes classifiers, and some others reside.
Notice that the decision tree, random forest and boosting models are present in both the continuous area and the categorical area as both regressors and classifiers.
Just like any other field of study, there are functions and applications which can't be categorized into just one group.
As you continue to develop your understanding of these models, the placement in both areas will make more sense.
You'll spend more time on these algorithms later in the course and pretty soon, you'll be building some models of your own.
Now that you understand the different types of features used in machine learning models, let's investigate how they can be used together in a type of model you're likely very familiar with.
Have you ever been streaming your favorite new album and when you reach the end something entirely new begins to play?
You've never heard it before, but you really like it.
How did your streaming service do such a good job choosing a new song for you? It used a recommendation system.
Recommendation systems are a subclass of machine learning algorithms that offer relevant suggestions to users.
And as you probably realize, they're everywhere.
Just about any website or app that matches you with something, whether it's an outfit to wear or a recipe to cook, most likely uses a recommendation system.
The main goal of a recommendation system is to quantify how similar one thing is to another and use this information to suggest a closely related option.
In this way, recommendation systems make it easier for users to find and connect with information, products, and content that's relevant and enjoyable.
Let's examine how this works.
First, you selected a song on the music streaming service.
Then, when the song ended, the service played more music related to your initial choice.
This is an example of content-based filtering, in which comparisons are made based on attributes of content itself.
In this case, attributes of the music you played are compared to attributes of other music to determine similarity.
To make this comparison, there must be data about each song that's a deconstruction of its attributes.
In other words, everything that makes the song unique is identified and labeled, like the artist's voice type, the rhythm or beat, or whether a certain instrument is featured.
Then, when you search for a song, the content-based recommendation system will access the list of attributes for that song and every other song in its library.
Finally, the system will compare them all using the same list of attributes. Good song recommendation systems compare hundreds of attributes.
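Here's a small sketch of comparing songs by their attributes. The three attributes, their scores, and the song names are all hypothetical. Cosine similarity is one common way to measure how alike two attribute lists are, and it's my choice for this example rather than something a specific streaming service is known to use.

```python
import math

def cosine_similarity(a, b):
    """Return a score near 1 when two attribute lists point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical attribute scores per song: [tempo, acoustic, vocals], each 0 to 1.
songs = {
    "Song A": [0.9, 0.1, 0.8],
    "Song B": [0.8, 0.2, 0.9],
    "Song C": [0.1, 0.9, 0.2],
}

def most_similar(title, library):
    """Compare one song against every other song using the same attributes."""
    target = library[title]
    scores = {name: cosine_similarity(target, attrs)
              for name, attrs in library.items() if name != title}
    return max(scores, key=scores.get)

print(most_similar("Song A", songs))  # Song B
```

With hundreds of attributes instead of three, the comparison works the same way. Only the length of each list changes.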
Content-based filtering has benefits and drawbacks. Some of the benefits are that content-based systems are easy to understand.
They help recommend more of what a user likes, even niche things that few others are interested in, and they don't need information from any other users to work.
Another advantage is that the filtering is not limited to comparing items, like songs.
They can map users and items in the same space, and then recommend things that are closest to a user's typical preferences.
Interestingly, sometimes a benefit can also be a drawback and vice versa.
For instance, the fact that content-based systems always recommend more of the same type of thing can be a drawback.
Users won't be introduced to something that diverges from what they've selected in the past, or learn something new.
Another disadvantage is that the attributes often have to be selected and mapped manually for all the items, which is an enormous amount of work.
Finally, content-based filtering is ineffective at making recommendations across content types, because different content types don't use the same features.
For instance, a book doesn't have beats per minute, so the same streaming service won't be able to use your song preferences to recommend a new novel.
So the use cases can be limited.
Note that, in this music streaming example, you didn't actually rate anything. You just listened to your playlist, and the algorithm found similar songs.
When you stream videos on the other hand, you might rate or review something when you like it.
The recommendation system can also use your feedback to suggest other videos you might like.
In the video streaming example, you both viewed and actively participated in the feedback process by liking the videos you enjoyed.
A drawback of this method of recommendation is that you probably like videos about various topics, but the system will use your feedback to only suggest similar videos to the ones you liked.
Here's another example of how recommendation systems work based on your feedback called collaborative filtering.
When a user actively likes content by rating it or giving it a good review, it leads to collaborative filtering.
A recommendation system will use collaborative filtering to make comparisons based on who else liked the content.
Then it will suggest videos to someone else with similar preferences.
Collaborative filtering is different from content-based filtering in that the recommendation system doesn't need to know anything about the content itself.
All that matters is if you liked it. It's a different flavor of recommendation system.
Which brings me to ice cream.
Perhaps we want to know which ice cream flavors David will enjoy.
A collaborative filtering system would compare past items that he has liked to what other people have liked.
And because David's likes and dislikes are very similar to Fatima's, the algorithm will predict that he'll enjoy fudge brownie,
because Fatima does.
Collaborative filtering works regardless of what the items are. The flavors don't matter. It doesn't even matter that it's ice cream.
It could be cars or restaurants or hair products.
All that matters is that David and Fatima have similar tastes.
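The ice cream example can be sketched as follows. David and Fatima come from the example above, while the third user, the flavors, and every rating are invented for illustration. The similarity measure (the share of commonly rated items on which two users agree) is a simple assumption, and production systems use more sophisticated techniques.

```python
# Hypothetical like (1) and dislike (0) records. Chen and all ratings are invented.
ratings = {
    "David":  {"vanilla": 1, "mint chip": 0, "strawberry": 1},
    "Fatima": {"vanilla": 1, "mint chip": 0, "strawberry": 1, "fudge brownie": 1},
    "Chen":   {"vanilla": 0, "mint chip": 1, "strawberry": 0, "fudge brownie": 0},
}

def agreement(user_a, user_b):
    """Share of commonly rated items on which two users gave the same rating."""
    shared = set(ratings[user_a]) & set(ratings[user_b])
    if not shared:
        return 0.0
    matches = sum(ratings[user_a][item] == ratings[user_b][item] for item in shared)
    return matches / len(shared)

def recommend(user):
    """Suggest items the most similar user liked that this user hasn't rated."""
    others = [u for u in ratings if u != user]
    neighbor = max(others, key=lambda u: agreement(user, u))
    return [item for item, liked in ratings[neighbor].items()
            if liked and item not in ratings[user]]

print(agreement("David", "Fatima"))  # 1.0
print(recommend("David"))            # ['fudge brownie']
```

Notice that nothing in the code knows what a flavor is. The item names could be swapped for cars or restaurants and the logic would not change.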
The ability to recommend across content types is one of the main advantages of collaborative filtering.
Other benefits are that it finds hidden correlations in the data and it doesn't require tedious manual mapping.
Drawbacks of collaborative filtering are that these systems need lots of data to even start getting useful results.
And every user must give the system lots of data.
Also, collaborative filtering data is very sparse, which means it has a lot of missing values.
Let's use movies as an example.
There are hundreds of thousands of movies in existence, but most people have only viewed a small fraction of them.
So each person's movie data would have missing values for all the movies they haven't experienced.
A recommendation system would need to use advanced filtering techniques to manage all that empty space.
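To see what sparse data looks like, here is a tiny hypothetical example with three users and five movies. The names and ratings are made up, and `None` is used as the placeholder for a missing value.

```python
# Hypothetical ratings: only the movies each person has actually rated appear.
movie_ratings = {
    "user_1": {"Movie A": 5, "Movie C": 3},
    "user_2": {"Movie B": 4},
    "user_3": {"Movie E": 2},
}
all_movies = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]

def sparsity(ratings, items):
    """Fraction of the user-by-item grid that has no rating."""
    possible = len(ratings) * len(items)
    observed = sum(len(user_ratings) for user_ratings in ratings.values())
    return 1 - observed / possible

# Build the full user-by-movie grid, using None for missing values.
grid = [[movie_ratings[user].get(movie) for movie in all_movies]
        for user in movie_ratings]

for row in grid:
    print(row)
print(f"Sparsity: {sparsity(movie_ratings, all_movies):.2f}")  # Sparsity: 0.73
```

Even in this toy case, most of the grid is empty. With hundreds of thousands of movies, the share of missing values would be far higher.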
Most recommendation systems are highly complex and use hybridized models that make use of elements from both content-based and collaborative filtering.
But even in the simplest use cases, there are different needs, resources, and strategies available for data professionals to use.
And it's the data professional's job to recommend the best problem-solving approach.
Recently, you learned about recommendation algorithms and how they help users discover everything from new music to stream to new hair care products to try or new games to play with friends and family.
These algorithms can be extremely accurate, but they can also miss the mark.
For example, one limitation of recommendation tools is the problem of popularity bias, which is the phenomenon of more popular items being recommended too frequently.
This leaves the majority of other items, which might be just as pleasing to users, not getting the attention they deserve.
As machine learning becomes more accessible, and its power is applied to an ever-broadening set of challenges, there's greater potential for models to have unintended and even harmful consequences.
So, as a data professional, it's important that you prioritize fairness in the data that you have and use.
Responsible data stewardship means taking steps to reduce the potential for unintended consequences from your machine learning applications.
Data professionals must also consider risk.
They may need to make decisions that could expose a business and the people it serves to negative consequences.
Recognizing the potential for bias will help to minimize risk.
Bias in machine learning is particularly deceptive, because it stems from human bias.
But because a computer makes the prediction, it's easy for the result to seem objective.
Often, the bias is unintentional.
Let's consider an example of creating an unintentionally unethical model.
Suppose you're building a facial recognition model to deploy as part of a service for a sunglasses retailer.
To generate a database of facial templates, you recruit people in your office to have their faces scanned.
Eventually, you have several hundred scans, which you think should be plenty to generate the templates you need.
You're excited to test what you've made, so you ask people from another department to act as a test group.
The test results exceed your expectations.
You even make the project open source so others can use it and build on it.
But then the service goes live.
It's not performing nearly as well as expected.
One reason might be that you didn't consider the full range of people who would be using the service.
If all the people you used to generate the templates were, say, older than 30,
perhaps the service didn't work well on young adults.
Or maybe you used far more people on one end of the gender spectrum.
Since your work was released to the public, other people might now be using your templates in their own models,
without realizing that there could be a problem with the template and the facial recognition model.
None of this was caused by bad intentions.
The negative consequences were the result of bias in the training data, which was inherited from bias in the data collection process.
Specifically, the faces used to generate the template didn't represent a wide enough variety of people.
By using this repository to build your model, you ended up with a data set that has a class imbalance.
In other words, the input data was biased before the modeling even began.
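A quick check like the one below could surface this kind of imbalance before modeling begins. The age groups, counts, and the 10 percent threshold are all hypothetical. What counts as underrepresented depends on who will actually use the service.

```python
from collections import Counter

# Hypothetical age groups for the people whose faces were scanned.
training_age_groups = (["30-39"] * 180 + ["40-49"] * 150
                       + ["50+"] * 60 + ["18-29"] * 10)

def group_shares(groups):
    """Share of the data set that belongs to each group."""
    counts = Counter(groups)
    total = len(groups)
    return {group: count / total for group, count in counts.items()}

def underrepresented(groups, expected_groups, threshold=0.10):
    """List expected groups whose share of the data falls below the threshold."""
    shares = group_shares(groups)
    return [g for g in expected_groups if shares.get(g, 0.0) < threshold]

expected = ["18-29", "30-39", "40-49", "50+"]
print(group_shares(training_age_groups))
print(underrepresented(training_age_groups, expected))  # ['18-29']
```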
This is just one example of how machine learning solutions have the potential to carry with them unintended consequences related to equity and fairness.
You'll learn about other examples later in this course.
You'll also learn some fundamental questions you can ask at each step of the modeling process to help reduce the risk of a model causing harm.
Predictive models are central to machine learning and have the potential to do a lot of good, but with that potential comes risk.
Therefore, it's critical for data professionals to ensure that their models are ethical.
First, let's consider what that means.
There's no simple guide for this because there are so many different kinds of models that are applied to a variety of tasks.
As a data professional at any level, you'll probably have to make decisions about models that carry ethical implications, no matter what problem you're trying to solve.
Nonetheless, it's always important to ask questions that help you consider the fairness of your model.
Let's explore some of the questions that you should ask during the planning stage of model development.
Right away, you should ask yourself what the intended purpose of the model is, how will its predictions be used and by whom?
Who is affected by the model and how harmful or significant could the effects be?
If your model uses personal information, have these people given their consent for you to collect and use this data?
Is there a way for them to withdraw their consent?
Are they aware of what you're doing with their information?
A common application of machine learning is making predictions of merit.
Who is eligible to receive something?
It may be a loan, admission to a university, or access to government services.
These are ethically sensitive situations because models that make these predictions affect people's lives.
If you're designing a model to predict whether a bank should issue someone a loan, who will use that information?
What could happen if the model is wrong?
What are the long-term consequences if the loan is denied?
Once you've considered these questions, the next step is to analyze the data.
This step carries its own set of questions.
You must ask whether the data you intend to use to build your model is appropriate, well-sourced and representative.
To continue our example, does your loan data go back many years?
If so, it's likely that marginalized and oppressed classes are underrepresented in the data,
because previous measurements were influenced by bias or prejudice.
There's a common saying in data science.
Garbage in, garbage out.
If there are problems with your data, then there will be problems with your predictions.
After you've planned your process and analyzed the data,
it's time to construct the model and ask additional questions.
For instance, is it important that the model's predictions be explainable?
With some modeling methodologies, it may be difficult to know where their predictions came from.
This is sometimes known as a black box model.
Neural networks are widely known for being difficult to explain,
and therefore they're not appropriate for many applications where transparency is important.
Algorithms like random forest, AdaBoost, and XGBoost aren't completely black box,
but they may require additional efforts to explain and justify their predictions.
At the other end of the spectrum, linear and logistic regression methods are highly explainable. So are single decision trees.
Once you've planned everything out, analyzed the data, and built your model,
there are more questions to ask before you complete the execute phase.
First of all, ask yourself if you understand your model and its predictions.
Do they make sense? Are their predictions fair?
One way of evaluating model fairness is by checking to see how the model's error is distributed over a population.
If the model only makes errors in particular cases that are similar, it could carry higher ethical risk.
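One simple way to check how error is distributed is to compute the error rate separately for each group. The group names and results below are hypothetical, and a real fairness review would look at more than a single metric.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group, actual label, predicted label).
results = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 1),
]

def error_rate_by_group(records):
    """Fraction of wrong predictions within each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, actual, predicted in records:
        totals[group] += 1
        if actual != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

rates = error_rate_by_group(results)
print(rates)  # {'group_a': 0.0, 'group_b': 0.75}
```

The overall error rate here is 3 out of 8, which hides the fact that every mistake falls on one group. That concentration is the warning sign to look for.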
Another question to ask is whether someone is assigned the responsibility of reviewing and monitoring the model,
both pre- and post-deployment, to make sure it's performing well and to assess the potential for harm.
Finally, make sure you're considering the issue of consent at each stage of the PACE process.
As you can appreciate, there are a lot of questions to ask to ensure ethical model development.
Few data professionals will be responsible for answering all of these questions themselves,
but nearly every data professional must answer some of them, so it's important to always keep them in mind.
When you ask the right questions at each stage of the PACE workflow, you help to ensure your models are good for business and good for everyone involved.
Today, there are many tools and programs that can help you perform data analytics and build machine learning models.
Knowing what's available in your digital tool belt is important as you approach and solve problems as a data analytics professional.
At YouTube, we need to wrangle vast quantities of data.
Rather than reinventing the wheel each time, I use tools and libraries that other data professionals have already created to help me clean, validate and visualize my data efficiently.
You've been applying many of these tools throughout this program, but let's take a moment to review some of the software you've used, some that you haven't, and figure out how they relate.
When creating any Python script or program, development is almost always done inside an integrated development environment or IDE.
An IDE is a piece of software with an interface to write, run, and test a piece of code.
If you took the original Google Data Analytics certificate, you coded using the R language and R's accompanying IDE, RStudio.
It's possible to create and run scripts inside any standard text editor, but IDEs provide many tools to support the development of your code.
You already used an IDE in this program, maybe without even knowing it.
You've coded in Python and executed that code within the Jupyter Notebook interface.
In this case, Jupyter Notebook is the IDE.
For most coding languages, there are many IDEs available to a developer.
They all perform similarly with differences in functionality and included tools.
Selecting one will often depend on your personal preference or on your employer's preference.
Later on, you'll learn more about IDEs.
But for now, let's examine how they each handle different Python files.
The two most common file types are Python scripts, denoted with the extension .py, and Python notebook files, denoted with the extension .ipynb.
Depending on the task, a data professional might use both of these file types and may even alternate while working on one problem.
Although both of these file types can execute code, they each have their own advantages.
It's important to remember that Python is not just a language used in data science.
It's a flexible general purpose language that can be used for web development, automation, cryptography, and other tasks.
Many times, all you'll need is a Python script.
A Python script is Python code written in a plain text file and executed by the computer without the need for human supervision.
In situations when it's not necessary for a human to check the code while it's running, data professionals generally prefer to use Python scripts.
Scripts are especially useful when the program incorporates several files.
Scripts are also helpful when there are many errors in the program that require debugging,
since scripts can take advantage of additional functionality that notebooks cannot.
However, Python scripts typically aren't ideal for data science.
A data analytics professional, especially during EDA, needs to use Python to interactively explore a dataset and view the outputs of their code in near real time.
Often, these results are shared with colleagues and must be in a human readable format.
Python notebooks are preferable for data tasks that use code to tell a story.
Notebooks can be really useful for pairing code with human readable descriptions and outputs.
Noncode elements like images, links, and general text can be embedded directly into the file.
They also have some nice functional advantages, such as the ability to export the file as a PDF.
It might seem like .py files are preferable to .ipynb files, but that isn't necessarily true.
Python notebooks are just another tool common in the data space, both for learners and industry professionals.
Many employers seek candidates who have experience working with existing Python notebooks and know how to create new ones.
You'll continue to use Jupyter notebooks for this part of the course.
However, all the code and concepts you've learned will work just as well in a standard .py file if you're ever in a situation that requires that file type.
Python scripts, notebooks, and IDEs are just part of the tool belt.
Think of them as the foundation for the rest of the projects you'll be working on in this program and as a data professional.
Selecting the right combination of these tools will help you successfully complete any task you're given.
Earlier, you explored integrated development environments.
Keeping in mind the different types of Python files that are available to you, in this video you'll learn more about specific options for IDEs that you might use as a data professional.
Knowing whether or not you want to use a Python notebook or a Python script can help you visualize the overall workflow of your project.
But when you start a project using one type, it doesn't mean you need to use it throughout the whole project.
Data professionals change their development environment in the middle of their workflow more often than you might think.
You can always switch if you realize that you need some functionality that is offered by a different IDE or file type.
Jupyter Notebook is one of the most commonly used IDEs that support Python notebooks.
However, it only offers support for Python notebooks, not Python scripts.
Other IDEs, such as Spyder, will only support Python scripts.
Some IDEs, such as Visual Studio Code, can support both Python notebooks and scripts.
Something else to consider when you're selecting your IDE is the tools that are built into the software.
Many of them are relatively simple and make development more efficient.
Code completion is a very common feature.
It auto completes what you type based on the functions and variables that are present in the code.
Additionally, many IDEs include a file manager, which is helpful when you're managing a larger project.
Debugging support and code testing are also available for many IDEs.
However, they're more advanced and might not be something you use in your day-to-day work as a data professional.
Sometimes you might start working in an IDE, only to find that it's missing some tools you need or want.
Once you figure out which features you need or which changes you'd like to make, you can then either customize the IDE you're using or use a different IDE that better fits your needs.
Think of your IDE as a kitchen.
There are various tools and utensils that are vital for cooking a meal.
But the kitchen is the physical space where all the work is going to happen.
If you're the person cooking, having a kitchen where you feel comfortable and find everything you need makes the process of cooking much easier.
Applying the same logic to your tools and software will make you a better data professional throughout your career.
Throughout this program, you've learned a lot about Python and its built-in functions.
You've also used some Python packages and libraries for coding.
If you recall, Python packages are a collection of modules that include functionality that is not necessarily present in the base Python language.
Modules are used to organize functions, classes, and other data in a structured way.
Libraries in Python are simply collections of packages.
In this video, you'll explore some common packages, including what they can do and how they can work together to help you accomplish a task.
Generally, there are three types of Python packages that you'll be using as a data professional.
They can be used to accomplish the same tasks.
However, they often have small differences that can make one more useful depending on the situation.
The first category we'll explore is operational packages.
They're also the first packages you'll normally use in the analytical process.
Operational packages load, structure, and prepare a data set for further analysis.
When creating a Python file for analysis, the first thing you have to do is read in your data.
The pandas package is often the most useful for doing this.
But the pandas read_csv function, which reads your data into a DataFrame, is only a tiny percentage of what's included in the package.
For efficient analysis and modeling, you can use functions that are built into the pandas package.
This makes it easier to complete tasks, including preliminary data inspection, cleaning data, and merging and joining the data frames.
Other operational packages, such as NumPy and SciPy, provide functions for advanced mathematical operations.
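Here's a minimal sketch of those pandas steps: reading data, inspecting it, cleaning it, and merging it. The column names and values are made up, and the CSV text is inlined with `io.StringIO` only so the example runs without a file on disk; in practice you'd pass a file path to `read_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example needs no file on disk.
csv_text = """customer_id,balance,churned
1,2500.0,0
2,,1
3,1200.5,0
"""

# Read the data into a DataFrame.
customers = pd.read_csv(io.StringIO(csv_text))

# Preliminary inspection: first rows and missing values per column.
print(customers.head())
print(customers.isna().sum())

# Cleaning: fill the missing balance with the median balance.
customers["balance"] = customers["balance"].fillna(customers["balance"].median())

# Merging: join in a second, made-up table on the shared key.
regions = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["north", "south", "north"]}
)
merged = customers.merge(regions, on="customer_id", how="left")
print(merged)
```

Filling with the median is just one possible cleaning choice; the right one depends on your data and the business need.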
The second category is data visualization packages.
There are many different packages that can help you create the perfect plots and graphs based on the needs of a project.
While simple plotting functions exist across the most popular packages, there are small differences among them.
You should become familiar with as many as possible.
Matplotlib is usually the go-to library for basic visualizations in Python.
It has a wide range of features and can be challenging to master, but is extremely powerful, allowing developers to create almost anything they can imagine.
Seaborn is another visualization package that is focused on statistical visualization.
Statistical visualizations are simple to create using Seaborn, though other types of plots can be impossible or take too much effort to create with it.
Plotly is often used for presentations or publications, such as creating a data visualization for an interactive dashboard.
It's similar to Matplotlib in the sense that it can be challenging to master, but it can create incredible graphs and even allows you to add interactive elements to the visualizations.
The final category of packages used in this course is machine learning packages.
Scikit-learn is a machine learning library that is built upon many of the packages we've already discussed.
This library enables you to build a variety of model types, both supervised and unsupervised.
It also provides a great interface in which to analyze the results of a model.
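As a rough illustration of that range, here's a minimal sketch that fits one supervised model and one unsupervised model with scikit-learn. The one-feature data is made up, and the choice of `LogisticRegression` and `KMeans` is an assumption for the example, not something this course prescribes at this point.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up data: small values belong to class 0, large values to class 1.
X = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

# Supervised: learn from labeled examples, then predict labels for new points.
classifier = LogisticRegression()
classifier.fit(X, y)
predictions = classifier.predict([[2.5], [9.5]])

# Unsupervised: group the same points without using the labels at all.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = clusterer.fit_predict(X)

# Analyze results: accuracy on the training data.
# A real project would evaluate on held-out data instead.
training_accuracy = classifier.score(X, y)
print(predictions, cluster_labels, training_accuracy)
```

Both models share the same fit-and-predict pattern, which is a big part of what makes the library convenient to work with.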
Packages vastly expand the functionalities of Python, and experienced data professionals will use many of them in their work.
As you become more familiar with all of these exciting tools and features, you'll be even more prepared for a successful data career.
Here's a little secret about the data field.
Almost no one knows what they're doing 100% of the time.
Even the most experienced professionals encounter issues with their code or the data analysis process itself and need to search for answers.
That's why it's so important to understand the available resources that can help you find the solutions you need.
So let's consider a situation. You just finished a piece of code and press the run button, but an error appears.
It could be a problem with the way you imported your data or something not totally correct in the way you prepared the data for analysis.
What can you do? The first step might be searching for the error, as most problems will have been encountered by other developers before.
Python is particularly good at telling you where the error is, whether it's a simple syntax error or some other exception.
Your IDE will often specifically identify the problem, along with the exact line number where it was caught.
So take the error output and search for it online. This usually yields some pretty helpful results.
In fact, you'll often find your exact same error in the first few search results.
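To see what Python reports when something fails, here's a minimal sketch that triggers an error on purpose and pulls out the details you'd search for. The `average` function is a hypothetical example, not code from this course.

```python
import traceback


def average(values):
    # Raises ZeroDivisionError when the list is empty.
    return sum(values) / len(values)


try:
    average([])
except ZeroDivisionError as error:
    error_type = type(error).__name__
    # The last traceback entry records where the exception was raised.
    last_frame = traceback.extract_tb(error.__traceback__)[-1]
    print(f"{error_type}: {error}")
    print(f"Raised in {last_frame.name}() at line {last_frame.lineno}")
```

The error type and message printed here are the pieces that tend to work best as search terms.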
If you're not getting the help you need there, you can search on a public platform like Stack Overflow directly.
Stack Overflow is a go-to resource for coding issues, considered by many data professionals to be the definitive collection of coding questions and answers.
Not only that, but the community is very responsive and helpful, so you should feel comfortable posting your own questions.
Another resource that can be great is the documentation for the package or module you're working with.
Documentation is an in-depth guide written by the developers who created the package.
Documentation features specific information about various functions and features and usually includes helpful examples.
Kaggle is another resource that many data professionals use at some point in their journey.
The online community features tens of thousands of public data sets, along with Python notebooks that provide examples of how to conduct a variety of analyses.
Many people who are just starting out in the data science industry use Kaggle, because it offers tutorials and data sets to learn and practice machine learning techniques.
Data professionals who've been in the industry for many years also use Kaggle to learn about new advances in technology and keep their skills sharp.
When you encounter an error message or other coding issue, it's also a best practice to consider whether your tools are up to date.
Package updates can sometimes break pre-existing code or change the way a particular function is accessed and used.
Your code will usually throw an error that tells you if a certain package or piece of software is out of date.
But staying proactive can prevent the problems in the first place, so make sure you have correct versions of all your software and tools.
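One way to stay proactive is to check which versions you actually have installed. Here's a minimal sketch using Python's standard library; `installed_version` is a hypothetical helper, and the package names in the list are only examples.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print("Python:", sys.version.split()[0])
for name in ["pandas", "numpy", "scikit-learn"]:
    print(name, installed_version(name))
```

You can compare the printed versions against the ones your project or team expects.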
These resources are just a few to consider when you need help with your programming.
You'll find developers collaborating all over the internet, and the more you learn, the easier it will get to find solutions yourself.
There's a lot of technology that helps you perform data tasks, and there's an extensive knowledge base online to reference whenever you have a problem during the process.
But it's equally important, especially when working in industry, to know which team members can help you answer questions or solve problems.
And this doesn't necessarily mean other data professionals. There are many stakeholders who contribute to data tasks.
Information technology teams, business intelligence departments, and marketing professionals can all provide critical information.
Companies often have their own specific tech, and the IT department can explain what's available for use.
For example, if you realize you require a specific software or hardware to complete the problem you're working on, then the IT department can help you access what you need.
Business intelligence teams take in raw data and make it accessible for further analysis.
This could be in the form of dashboards for quick insights, or they may provide preliminary information about a data set, which you'll then use to inform your models.
There are also marketing departments, sometimes known as sales, accounts, insights, or product management teams.
They can give you more of the why behind a particular analysis from a final product perspective.
If you want to confirm that your data project is headed in the right direction, checking in with marketing can give you a clearer target to work toward.
In addition to those we've mentioned, there are many other teams you might consult. Knowing the professionals who work adjacent to your project can help you tremendously.
So learning about them is valuable whenever you start a new job.
The different teams at a workplace are something you can and should ask about in an interview situation.
Questions such as "What other teams will I be working with?" or "What resources are available if I encounter this issue?" can give you a great perspective on what your day-to-day work environment will be like.
Remember, data work is collaborative.
Your projects will always have other stakeholders involved, and you should be taking advantage of these key resources as much as possible.
Take a moment to reflect on how far you've come.
We covered quite a few new and complex concepts.
Let's review what you've learned.
You identified and differentiated the main types of machine learning, including supervised, unsupervised, reinforcement, and deep learning.
We discussed the difference between continuous and discrete features.
You also learned about categorical features, which are finite sets of discrete features.
Then we provided an overview of recommendation systems and how they're useful for suggesting content from music to ice cream flavors.
You're now familiar with how recommendation systems work, and you can identify the differences between content-based and collaborative filtering, as well as their benefits and drawbacks.
Finally, you learned why it's important to be a responsible steward of data, and how to ask essential questions at each step of model development.
By thinking through the ethical implications, you learned how to reduce the potential for your model to cause unintended or harmful consequences.
All these concepts and skills are essential to a career as a data professional.
Having this knowledge will be especially helpful as you continue in this course.
Coming up, you'll select machine learning models, learn ways to measure them, and build the models in Python.
These important skills will enable you to be a data-driven storyteller, a powerful influencer of change wherever you work.
Congratulations!
Meet you in the next part of the program.
Welcome to another section in this course. I'm so happy to be here with you again.
Together, we're about to apply the knowledge and skills you've been developing to do something really exciting.
Build machine learning models from start to finish.
As you've been discovering, a lot goes into the process of model building.
Data professionals apply many different techniques when working to achieve a business goal,
and we have tons of opportunities to keep learning and improving upon what we create.
Data professionals are also keenly aware that we're human. This means we're inherently imperfect.
Well, just like us, the models and the datasets we build are also imperfect.
But that doesn't mean they're not useful. It just means we need to be aware of their limitations.
In fact, expecting imperfection can be an advantage.
Plus, there are many tools to help us better understand data limitations, and even turn them into effective data solutions.
Coming up, we're going to explore these benefits as we build and refine our models.
This will be a valuable proficiency for you as you move forward into the data career space.
I can't wait to begin. Join me in the next video to get started.
Welcome back. In this video, we're going to focus on the PACE workflow for building effective models.
As a refresher, PACE consists of four stages: plan, analyze, construct, and execute.
PACE helps guide the steps data professionals take as we align our data and models with the business needs.
Soon, we'll use it to gain a more comprehensive understanding of the data and prepare it for modeling using feature engineering
and techniques to manage class imbalances.
From there, we'll use Python to practice building a supervised learning model called Naive Bayes.
The goal will be to model bank customer churn rate to predict whether a customer will close their bank account.
In this context, churn is the rate at which customers stop doing business with a company over a given period of time.
And for continuous improvement, we'll create different model iterations.
Finally, we'll use performance metrics to evaluate how successfully the model addressed the business need.
Using the PACE workflow not only provides a good baseline for approaching the problem, but also helps data professionals stay focused throughout the process.
By applying and referencing this framework, you'll have the skills to manage data driven problems in your career.
Ready? Let's go!
Happy to have you back. I have a plan for this video.
And the first part is discussing the plan step in PACE.
There are two parts, centering the business need and considering the most appropriate machine learning model.
When using PACE, it's essential to align your plan with the business and data teams.
For machine learning, this means ensuring that the machine learning model you plan to construct meets the actual business needs.
This may seem a bit obvious, but given the potential complexity of the problem, multiple departments will likely be involved in the output.
You also have to consider the data that's available to you before getting into the rest of the process of building the model.
Remember that you will most often be building models for a company or an organization.
Because of that, you'll need to ensure that, throughout the other stages of PACE, your data, modeling, metrics, and optimizing strategies stay focused on what you developed during the plan stage.
Let's explore an example. Suppose that you're working in the finance industry, and your firm is trying to predict housing prices.
You have a massive data set about houses in a certain area.
This data set contains information about the houses, such as square footage, the number of bedrooms and bathrooms, and location.
Most importantly, the data set also contains the most recent sale price for each house.
As the second part of the plan stage of your PACE workflow, you use the context and requirements of the business need to consider what type of machine learning model would be best suited for the problem of predicting housing prices.
Based on what you know of machine learning models so far, you need to create a supervised continuous model to get the desired numerical result.
Continuous models are the only ones that can help you predict housing prices for this example.
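To picture what a supervised continuous model looks like, here's a minimal sketch using scikit-learn's `LinearRegression`. The houses and the pricing rule are invented so the result is easy to check, and linear regression is just one assumed choice of continuous model, not the one your firm would necessarily use.

```python
from sklearn.linear_model import LinearRegression

# Made-up houses: [square footage, number of bedrooms].
houses = [[1000, 2], [1500, 3], [2000, 3], [2500, 4], [1200, 2]]

# Invented pricing rule so the example is easy to verify:
# 100 per square foot + 10,000 per bedroom + 50,000 base price.
sale_prices = [100 * sqft + 10_000 * beds + 50_000 for sqft, beds in houses]

# Supervised: the model learns from houses with known sale prices.
model = LinearRegression()
model.fit(houses, sale_prices)

# Continuous: the prediction is a number, not a category.
predicted_price = model.predict([[1800, 3]])[0]
print(round(predicted_price))
```

Because the made-up prices follow an exact linear rule, the model recovers it almost perfectly; real housing data would be much noisier.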
Now that you have your plan for the business problem and the scope for the machine learning model, you're ready for the next step.
Analyze. See you soon.
Earlier, we explored the plan stage of our PACE framework. Now that you've finished preliminary planning, it's time for the analyze stage of PACE.
While keeping the business need in focus is particularly important for the plan stage, it's also essential throughout the process of developing a machine learning model.
The business need informs a data professional on what the model needs to produce. The result indicates what type of model is required.
The business need also informs data professionals about the data that's necessary to train the model to achieve the desired result.
The main focus of the analyze stage is to develop a deeper understanding of the data while keeping in mind what the model needs to eventually predict.
For example, if you're creating a supervised learning model, the first thing you'll need to know is what your model is trying to predict.
In other words, you'll need to understand your response variable. You've already done this earlier in the course when you were deciding which type of supervised learning model to use, continuous or categorical.
Let's use weather as an example.
If you need to predict the exact amount of precipitation in inches, you'll probably go with a continuous model.
But if the business needs a model that can predict whether it will be rainy, cloudy, or sunny, that requires a categorical model.
Often, as a data professional, your data isn't structured exactly the way you need it to be.
This is where you can take advantage of many of the exploratory data analysis principles you learned earlier.
You'll be using all the techniques you learned so far to develop an understanding of what data you have available and how it's structured.
Let's return to the continuous model example about predicting precipitation in inches.
In your data set, the rainfall amounts that were recorded might not be in the exact units needed, or the data could be split between rain, snow, and other types of precipitation.
These are all details that a data professional needs to analyze before building the model itself.
For the categorical model example, where you're predicting whether it's rainy, cloudy, or sunny, your data set might not be labeled with the categories you need.
The individual days in the data set might be labeled with only cloud cover metrics, which is something you'll have to change to be able to analyze the results effectively.
After getting a solid understanding of what your response variables are and how they're structured, the next step is exploring your predictor variables.
Understanding the relationships that exist between variables in your data set is essential to building a model that will produce valuable results.
Similar to the response variables, the predictor variables might not be in the format or style that you want.
In those cases, the same considerations apply. You'll need to figure out how you want your data structured before building your model.
This process of carefully considering the variables you have and what you need leads into the next part of the analyze stage.
Feature Engineering. Coming up, you'll learn about feature engineering, various techniques that are available, which scenarios to use them in and what they can do for your data.
Welcome back. Previously, you learned that the main focus of the analyze stage is to develop a deeper understanding of the data.
Carefully considering the variables you have and what you need leads into the next part of the analyze stage. Feature Engineering.
Feature Engineering techniques solve the problems in how your data is structured and, if done well, can improve your model's performance.
In this video, you'll learn more about feature engineering, how it works and when to use it.
Feature Engineering is the process of using practical, statistical and data science knowledge to select, transform or extract characteristics, properties and attributes from raw data.
This definition has quite a bit to consider.
First, the process of feature engineering is highly dependent on the type of data you're working with.
Before we go much further, let's check out some examples.
Earlier, you learned about types of features, or variables, called continuous and categorical.
Remember, continuous variables are variables with values obtained by measurement.
As a result, they can take on an infinite and uncountable set of values.
Categorical variables on the other hand are variables that contain a finite number of groups, categories, or countable numerical values.
Your process for feature engineering will be changing and altering these variables with the end goal of using them to train a model.
This can often be a challenging process.
The process used in the workplace can sometimes require multiple rounds of EDA and feature engineering to get everything in a suitable format to train a model.
This process highlights one reason why the PACE framework is so beneficial to data professionals.
The analyze stage builds directly from the plan stage, or, put more simply, the plan informs how you analyze.
Without a strategically aligned business and technical plan, the analyze stage and feature engineering process would be like trying to build a skyscraper without having blueprints.
The three general categories of feature engineering are feature selection, extraction, and transformation.
Let's learn more about those categories.
We'll start with feature selection.
The goal of this type of feature engineering is to select the features in the data that contribute the most to predicting your response variable.
In other words, you drop features that do not help in making a prediction.
This can be done either manually or algorithmically.
Let's use a simple weather dataset as an example.
It contains five different variables and 14 data points.
The variable on the far right represents whether or not you would want to play soccer or football outside based on the other data.
In our example, whether or not it's windy outside might not affect our playing soccer.
If that's the case, then we would want to select Outlook, Temperature, and Humidity and drop Windy from our dataset.
Feature selection would mean selecting the variables that help the most in making a prediction.
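Done manually, that selection might look like the following minimal sketch in pandas. The four rows are made up in the shape of the weather example, and the lowercase column names are an assumption.

```python
import pandas as pd

# A few made-up rows in the shape of the weather example.
weather = pd.DataFrame(
    {
        "outlook": ["sunny", "overcast", "rainy", "sunny"],
        "temperature": [85, 83, 70, 72],
        "humidity": [85, 86, 96, 95],
        "windy": [False, False, False, True],
        "play": ["no", "yes", "yes", "no"],
    }
)

# Keep the predictors we believe matter; drop windy and the response variable.
features = weather.drop(columns=["windy", "play"])

# The response variable is kept separately.
target = weather["play"]
print(features.columns.tolist())
```

Note that `drop` returns a new DataFrame, so the original `weather` table still has every column if you change your mind later.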
The next category is feature transformation.
In feature transformation, data professionals take the raw data in the dataset and create features that are suitable for modeling.
This process is done by modifying the existing features in a way that improves accuracy when training the model.
In our weather dataset example, your data might include exact temperatures.
But you might only need a feature that indicates if it's hot, cold, or temperate.
To make that transformation, you could define some cut-off points for the data and create a new categorical feature from the numerical data.
You might define anything above 80 degrees Fahrenheit as hot, anything below 70 as cold, and anything in between as temperate.
Feature transformation would mean transforming the degrees into the temperature categories you've defined.
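Here's a minimal sketch of that transformation, using the cut-off points from the example. The function name and the sample temperatures are made up, and treating exactly 70 and exactly 80 degrees as temperate is an assumption about the boundaries.

```python
def temperature_category(degrees_f):
    """Map a Fahrenheit temperature to hot, cold, or temperate."""
    if degrees_f > 80:
        return "hot"
    if degrees_f < 70:
        return "cold"
    # Everything from 70 through 80, inclusive, counts as temperate here.
    return "temperate"


# Made-up temperature readings in degrees Fahrenheit.
temperatures = [85, 64, 72, 80, 70]
categories = [temperature_category(t) for t in temperatures]
print(categories)
```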
Finally, let's discuss feature extraction.
This type of feature engineering involves taking multiple features to create a new one that would improve the accuracy of the algorithm.
For example, imagine we want to create a new variable called muggy that could be used to model whether or not we play soccer.
If the temperature is warm and the humidity is high, the variable muggy would be true.
If either temperature or humidity is low, then muggy would be false.
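A minimal sketch of building the muggy variable could look like this. The thresholds for "warm" and "high humidity" aren't given in the example, so the values of `WARM_F` and `HIGH_HUMIDITY` below are assumptions, and the data is made up.

```python
import pandas as pd

# Made-up weather readings.
weather = pd.DataFrame(
    {
        "temperature": [85, 64, 81, 72],
        "humidity": [90, 95, 60, 85],
    }
)

# Assumed thresholds; pick values that make sense for your data.
WARM_F = 75
HIGH_HUMIDITY = 80

# Combine two existing features into one new feature.
weather["muggy"] = (weather["temperature"] >= WARM_F) & (
    weather["humidity"] >= HIGH_HUMIDITY
)
print(weather)
```

Only the first row is both warm and humid, so it's the only one marked as muggy.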
Remember, the feature engineering techniques are designed to improve your model's performance.
While you can make a lot of improvements by tweaking and optimizing the model, the most sizable performance increases often come from developing your variables into a format that will best work for the model.
In my work, this often appears as needing to make an outcome variable binary. For example, we may get survey ratings from users that are on a scale of 1 to 5 stars.
But we need to predict whether a piece of content is good or bad. In this case, we need to change our response variable by mapping the star ratings to either the good or bad label.
This is one example of how data professionals better understand their data in the analyze stage.
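Here's a minimal sketch of that kind of mapping. The ratings are made up, and treating 4 or more stars as good is an assumption; the right cut-off depends on the business need.

```python
import pandas as pd

# Made-up survey ratings on a scale of 1 to 5 stars.
ratings = pd.DataFrame({"stars": [5, 2, 4, 1, 3]})

# Assumed cut-off: 4 or more stars counts as good.
GOOD_THRESHOLD = 4

# Map the star ratings to a binary response variable.
ratings["label"] = ratings["stars"].apply(
    lambda stars: "good" if stars >= GOOD_THRESHOLD else "bad"
)
print(ratings)
```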
During EDA, you're just beginning to develop an understanding of your data.
Feature engineering is a step beyond EDA. You're selecting, extracting, or transforming variables or features from data sets for the construction of machine learning models.
Now that you have a better idea of the concept of feature engineering, you're ready to perform some of your own feature engineering in Python, which you'll do in a later video.
Hello again. In this video, we'll continue our exploration of the analyze stage of PACE.
Understanding what your variables are and how they're structured is only part of the process.
It's also essential to understand the frequency with which their values occur. For classification problems, you need to specifically understand the frequencies of the response variable.
As a data professional, you might encounter data sets that are unequal in terms of their response variables.
One example of unequal data sets is in the context of fraud detection. You could have millions of examples of non-fraudulent transactions, and only a few thousand examples of actual fraudulent transactions.
How can a model be built to detect fraud with such limited data to train the model?
This issue is known as class imbalance. A class imbalance is when a data set has a response variable that contains more instances of one outcome than another.
The class with more instances is called the majority class, while the class with fewer instances is called the minority class.
It's extremely rare for a data set to have a perfect 50-50 split of the outcomes. There is normally some degree of imbalance.
However, this isn't necessarily a problem. Believe it or not, a 70-30 or an 80-20 split can be fine.
Major issues only arise when the majority class makes up 90% or more of the data set. You'll only know if there's an imbalance issue after the model is built.
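To get a feel for how you might check the split before modeling, here is a minimal sketch using pandas. The labels are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent.
labels = pd.Series([0] * 95 + [1] * 5, name="is_fraud")

# normalize=True returns each class's share of the data instead of a raw count.
class_share = labels.value_counts(normalize=True)
majority_share = class_share.max()

print(class_share)
print(f"Majority class share: {majority_share:.0%}")
```

A majority share at or above roughly 90% is the range where the text suggests major issues may start to appear.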
There are two techniques that allow us to fix any potential issues: upsampling and downsampling. Both of them involve altering the data in a way that preserves the information contained in the data while removing the imbalance.
Downsampling involves altering the majority class by using less of the original data set to produce a split that's more even.
The number of entries of the majority class decreases leading to more of a balance.
You can use different techniques to achieve this, but generally they are all based on this concept.
One technique is to do this randomly by selecting entries to remove, or you can follow a formula.
For example, you can take the mean of two data points in the majority class, remove those data points and add the average data point.
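Here is one way random downsampling could look in pandas. This is a sketch on made-up data; the column names and the 80-20 target are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical imbalanced data: 950 non-fraud rows and 50 fraud rows (a 95-5 split).
df = pd.DataFrame({
    "amount": range(1000),
    "is_fraud": [0] * 950 + [1] * 50,
})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Keep a random subset of the majority class: 4 majority rows per minority row,
# which produces an 80-20 split. All minority rows are kept.
majority_down = majority.sample(n=4 * len(minority), random_state=42)

# Recombine and shuffle so the classes aren't grouped together.
downsampled = (
    pd.concat([majority_down, minority])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(downsampled["is_fraud"].value_counts(normalize=True))
```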
Upsampling is the opposite of downsampling. Instead of reducing the frequency of the majority class, you artificially increase the frequency of the minority class.
Similar to downsampling, there are multiple ways you can achieve this. The simplest technique is called random oversampling, where randomly selected data points in the minority class are copied and added back to the data set.
Or mathematical techniques can be used to generate non-identical copies, which are then also added to the data set. So, if both upsampling and downsampling achieve the same result, you might be wondering which one to use.
Most of the time, you won't know which one is preferred until you've built the model and observed how it performs. However, there are some general rules you can follow regarding when to upsample and when to downsample.
Downsampling is normally more effective when working with extremely large data sets. If you have a data set that has 100 million points, but has a class imbalance, you don't need all of that data to build a good model. You definitely don't need the additional data that would come from upsampling.
Alternatively, upsampling can be better when working with a small data set. If you're working with a data set that only has 10,000 entries, removing any of that data will more than likely have a negative impact on the model's performance.
Keep in mind, class balancing is a fickle process and may require some trial and error. Building models with both upsampled data and downsampled data will determine which technique is better in any given situation.
Additionally, you'll have to experiment with what sort of split your rebalancing achieves.
Balancing the data so the classes are split 50-50 might not always be optimal. On the other hand, turning a 99-1 split into a 70-30 split might be fine, and that is something to consider as you develop your model.
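The following sketch shows random oversampling toward a chosen split, here turning a 99-1 split into roughly 70-30. The data and names are hypothetical, and the target share is just one value you might experiment with.

```python
import pandas as pd

# Hypothetical imbalanced data: 990 non-fraud rows and 10 fraud rows (a 99-1 split).
df = pd.DataFrame({
    "amount": range(1000),
    "is_fraud": [0] * 990 + [1] * 10,
})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Assumed target: the minority class should make up about 30% of the result.
target_minority_share = 0.30

# Solve minority / (minority + majority) = target for the minority count.
n_minority_needed = round(
    len(majority) * target_minority_share / (1 - target_minority_share)
)

# replace=True lets the same minority row be copied more than once.
minority_up = minority.sample(n=n_minority_needed, replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up]).reset_index(drop=True)

minority_share = upsampled["is_fraud"].mean()
print(f"Minority share after upsampling: {minority_share:.1%}")
```

Changing `target_minority_share` and rebuilding the model is one way to carry out the trial and error described above.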
Welcome back. In this video, you'll code in Python to wrap up the analyze stage of the PACE framework. In the analyze stage, you gain a deeper understanding of your data.
This is the stage where you prepare your data so it can be used to train the model. Then, you'll move on to the construct stage of the PACE workflow.
For the rest of this part of the course, we'll build models that will predict customer churn at a bank.
Customer churn is the business term that describes how many and at what rate customers stop using a product or service or stop doing business with a company altogether.
The models will be supervised classification models because they'll each predict a categorical target. In this case, it's a binary one, whether or not each customer churned or stayed.
Before we get into the data set, we need to import the packages that we'll use. All we'll need for this notebook are NumPy and Pandas because we're only preparing the data for modeling.
Now, we'll read in the data set from a CSV file to a pandas data frame. We'll call the data frame df_original and inspect it using the head function.
The data set that we'll be using to solve this problem contains customer data, where each entry in the data set represents one customer.
For each customer, there are several features that describe the customer's relationship with the bank and information about the customer's finances.
Additionally, there's metadata for each customer, like name, gender, and customer identification number.
The two features that might not be completely intuitive are tenure, which represents how many years the customer has used the bank and geography, which identifies which country the customer lives in.
Additionally, we have the feature labeled Exited. This indicates whether the customer left the bank: a one signifies that they stopped doing business with the bank, and a zero indicates that they are still a customer.
The variable Exited will be the response variable, or the variable that our model will attempt to predict.
When modeling, the best practice is to perform a rigorous examination of your data before beginning feature engineering and feature selection.
This process is important because not only does it help you understand your data, what it's telling you and what it's not telling you, but it can also give you clues about what new features to create.
We've already learned the fundamentals of exploratory data analysis or EDA, so this notebook will skip that essential part of the modeling process.
Just remember that a good data science project will always include EDA.
Let's get a quick overview of our data. We'll use the info function to inspect the data frame.
From this table, we can confirm that the data has 14 features and 10,000 observations.
We also know that nine features are integers, two are floats, and three are strings.
Finally, we can tell that there are no null values, because there are 10,000 observations, and each column has 10,000 non null values.
Next, we'll prepare this data set to be used to train the model.
The first thing we're going to do is feature selection.
If you recall, this is the process of picking out the features that we want the model to use to predict the outcome, and we drop any features that aren't useful to the model.
In our bank data, notice that the first column is called row number. This just enumerates the rows.
Since a row number shouldn't have any intrinsic correlation with our response variable, we'll remove this feature from our data set.
The same is true for customer ID, which appears to be a number assigned to the customer for administrative purposes, and surname, which is the customer's last name. We'll drop these two.
As you complete feature selection, keep the ethical concerns you learned earlier in the course in mind.
Consider the implications of your data and the resultant model. For example, in this modeling exercise, we will not include the gender column.
This data feature raises a set of complex issues, technically, culturally, and ethically.
We recognize that the most rigorous approach would be to model both with and without this feature and examine how it influences predictions.
Whatever the approach, it should be driven by an aim for equitable outcomes and for your particular use case.
We'll remove these columns by calling the drop function on our data frame.
We pass to it a list of names of the columns that we want to remove, and we indicate that we're dropping columns, not rows, by including axis=1.
We'll assign the results to a new data frame called churn_df.
The resultant data frame is shown by calling the head method. However, there's still more to do before we start training the model.
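Here is a small sketch of that drop call. The three-row data frame is made up, and the exact column spellings (RowNumber, CustomerId, and so on) are assumptions about how the columns are named in the file.

```python
import pandas as pd

# Tiny made-up stand-in for the bank data; values and column spellings are hypothetical.
df_original = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [1001, 1002, 1003],
    "Surname": ["Alpha", "Beta", "Gamma"],
    "CreditScore": [620, 710, 540],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "Age": [40, 50, 25],
    "Tenure": [2, 10, 5],
    "Exited": [1, 0, 1],
})

# axis=1 tells pandas to drop columns rather than rows.
churn_df = df_original.drop(
    ["RowNumber", "CustomerId", "Surname", "Gender"], axis=1
)

print(churn_df.head())
```

Assigning the result to a new name leaves df_original untouched, which makes it easy to revisit the feature selection later.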
Next, let's practice feature extraction. This is the process of taking two or more features and using them to create a brand new feature that will make the model more accurate.
Normally, feature extraction is done using statistics to analyze how predictive each variable is and whether the new feature that is extracted is more predictive than the original variables on their own.
For now, we're going to extract a feature as an example, without conducting the analysis we would normally perform if we were trying to build a production ready model.
Let's create a new variable and call it loyalty.
We'll do this by taking the tenure of a customer and dividing it by their age.
The logic behind using tenure and dividing it by the customer's age is that it represents the percentage of a person's life that they've been customers of the bank.
People with greater percentages may be more loyal customers.
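The extraction itself is a single line of pandas. This sketch uses made-up values, and the column names Tenure, Age, and Loyalty are assumed spellings.

```python
import pandas as pd

# Made-up customers; column spellings are assumptions.
churn_df = pd.DataFrame({
    "Age": [40, 50, 25],
    "Tenure": [2, 10, 5],
})

# Share of each customer's life spent as a customer of the bank.
churn_df["Loyalty"] = churn_df["Tenure"] / churn_df["Age"]

print(churn_df)
```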
Now we have a new column called loyalty, which we can verify by inspecting the data frame. Let's move on to feature transformation.
There are some features in this data set that need to be transformed. Remember, feature transformation is the process of changing how a single feature is represented in the data set with the goal of improving the accuracy of the model.
Many classification models require you to convert categorical features to make them numeric.
Our data set has one categorical feature called geography.
Let's check how many classes appear in the data for this feature by using the unique function on the series.
There are three unique values, France, Spain and Germany. Let's encode this data so it can be represented using Boolean features.
We'll use a pandas function called get dummies to do this.
When we call pd.get_dummies on this feature, it will replace the geography column with three new Boolean columns, one for each possible category contained in the column being dummied.
When we specify drop_first=True in the function call, it means that instead of replacing geography with three new columns, it will instead replace it with two columns.
We can do this because no information is lost, and the data set has fewer columns and is simpler.
In this case, we end up with two new columns called Geography_Germany and Geography_Spain. We don't need a Geography_France column. Why not?
Because if a customer's values in Geography_Germany and Geography_Spain are both zero, we'll know they're from France.
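A minimal sketch of the encoding step, using a four-row made-up data frame. The column spellings are assumptions.

```python
import pandas as pd

# Made-up rows; column spellings are assumptions.
churn_df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Exited": [1, 0, 1, 0],
})

# Check which categories appear in the column.
print(churn_df["Geography"].unique())

# drop_first=True drops the first category in sorted order (France here),
# leaving one indicator column for each remaining category.
churn_df = pd.get_dummies(churn_df, columns=["Geography"], drop_first=True)

print(churn_df)
```

In the output, a row with False (or 0) in both indicator columns is a customer from France.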
After feature selection, extraction, and transformation, the next step is modeling.
We'll be using this data set for most of the modeling you'll do in the remainder of this part of the course.
And you now have a solid foundation for the construct and execute stages of PACE.
Hello again, you're now about halfway through the PACE workflow. During the plan stage, you developed a better understanding of the business need and the data available.
And in the analyze stage, you investigated the data using exploratory data analysis practices.
You applied feature engineering techniques to select, transform, and extract data into a format that was suitable for training.
This video continues onto the construct stage, where you'll bring the model to life. You'll do this by building a model called Naive Bayes.
Naive Bayes is a supervised classification technique based on Bayes' theorem, with an assumption of independence among predictors.
The effect of the value of a predictor variable on a given class is not affected by the values of other predictors.
Let's break it down. Bayes' theorem gives us a method of calculating the posterior probability, which is the likelihood of an event occurring after taking into consideration new information.
In other words, when you calculate the probability of something happening, you take relevant observations into account.
It can be represented with this equation, which calculates the posterior probability of c given x.
Probability of x given c is the probability of the predictor given the class, which is multiplied by p of c, the prior probability of the class.
The product of these two terms is then divided by p of x, which is the prior probability of the predictor.
The posterior probability equation can be rewritten to reveal what's going on behind the variables.
The probability of the first predictor variable given the class, p of x1 given c, is multiplied by the probability of the second predictor variable given the class, p of x2 given c, and so on for all predictor variables used in the model. That product is then multiplied by p of c, the prior probability of the class.
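Written out, the two forms described above would look something like this. The notation is a standard way to write Bayes' theorem and the Naive Bayes factorization, not necessarily the exact layout shown in the video; c is the class and x_1 through x_n are the predictor variables.

```latex
% Bayes' theorem: posterior probability of class c given predictor x
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}

% Naive Bayes: with predictors assumed independent given the class,
% the likelihood factors into one term per predictor
P(c \mid x_1, \dots, x_n) \propto P(x_1 \mid c)\, P(x_2 \mid c) \cdots P(x_n \mid c)\, P(c)
```

The second line uses "proportional to" because the denominator is the same for every class, so it doesn't change which class has the highest posterior probability.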
This is complex.
So let's return to the weather-based example that we discussed in an earlier video and apply Naive Bayes to gain a better understanding.
The weather dataset will help you build a model to decide whether to go outside and play soccer.
This dataset has five columns.
The first four are the predictor variables, and the final column is the label of the dataset.
It shows whether we should play soccer.
The outlook variable identifies if it is rainy, cloudy, or sunny.
The humidity variable indicates the relative humidity, and of course the windy variable determines if there's wind.
Start with the outlook variable.
Calculate the posterior probability of one of the features in the dataset.
To do this, construct a frequency table for each attribute against the target by tallying the number of times soccer is and isn't played for a given attribute.
Then, transform the frequency tables into likelihood tables by calculating the proportion of times soccer is and isn't played for each attribute.
Use this information to find the probability of the predictor given the class, p of x given c, the probability of the class, p of c, and the probability of the predictor, p of x.
In other words, everything needed to calculate the posterior probability.
As a reminder, you may pause and review the video if needed.
The process of finding the posterior probability needs to be done for every possible class that is potentially being predicted.
In this case, there are only two outcomes, play or don't play.
Once these values are found, the prediction is made based on the class with the highest posterior probability.
When you repeat the process for the second class, don't play, you're only required to modify the calculations after finding the likelihood table.
Observe that the posterior probability of playing while it is sunny is higher than the posterior probability of not playing.
So if it's sunny outside, a Naive Bayes model would predict that the conditions are right to play soccer.
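To make the calculation concrete, here is a small sketch that goes from a frequency table to posterior probabilities. The counts below are hypothetical, chosen only for illustration, and may not match the table shown in the video.

```python
# Hypothetical frequency table: how often soccer was or wasn't played per outlook.
frequency = {
    "sunny":  {"play": 3, "dont_play": 2},
    "cloudy": {"play": 4, "dont_play": 0},
    "rainy":  {"play": 2, "dont_play": 3},
}


def posterior(outlook, label):
    """Return P(label | outlook) using Bayes' theorem on the frequency table."""
    total = sum(sum(counts.values()) for counts in frequency.values())
    class_total = sum(counts[label] for counts in frequency.values())
    outlook_total = sum(frequency[outlook].values())

    p_x_given_c = frequency[outlook][label] / class_total  # likelihood
    p_c = class_total / total                              # prior of the class
    p_x = outlook_total / total                            # prior of the predictor
    return p_x_given_c * p_c / p_x


p_play = posterior("sunny", "play")
p_dont = posterior("sunny", "dont_play")
prediction = "play" if p_play > p_dont else "dont_play"

print(f"P(play | sunny)      = {p_play:.2f}")
print(f"P(dont_play | sunny) = {p_dont:.2f}")
print("Prediction:", prediction)
```

With these made-up counts, the posterior for playing on a sunny day works out to 0.60 and the posterior for not playing to 0.40, so the predicted class is play.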
Later, you'll explore how multiple predictor variables can be used to make a prediction.
All the same concepts will apply.
No matter the number of variables that are used, Naive Bayes calculates posterior probabilities and makes predictions based on which outcome has the highest probability.
You're doing great work. Keep up the momentum and I'll be with you again soon.
Now that we've discussed how Naive Bayes works, it's time to implement it using Python.
We're going to continue our work with the bank churn data frame that we prepared in the feature engineering notebook.
Remember, we dropped the row number, customer ID, surname, and gender columns.
We also dummy encoded the geography column to convert it from categorical to Boolean, and engineered a new feature called loyalty by dividing each customer's tenure by their age.
Recall that the predictor variables of this data set are of different types.
For example, balance and estimated salary are continuous while geography is categorical.
Also, remember that scikit-learn has a few different implementations of the Naive Bayes algorithm.
And each assumes that all of your predictor variables are of a single type.
As a data professional, one of the first things you'll learn on the job is that real-world data is never perfect.
Sometimes the data violates the assumptions of your model. In practice, you'll have to do the best you can with what you have.
For this lesson, we're going to use the GaussianNB classifier.
This implementation assumes that all of your variables are continuous, and that they have a Gaussian or normal distribution.
Our data doesn't perfectly adhere to these assumptions, but a Gaussian model may still give us usable results, even with imperfect data.
Let's get started.
As always, the first thing to do is import any packages and libraries that you'll need.
We'll begin by importing numpy, pandas, and matplotlib.
We'll also import train_test_split to help us split our data into training and test sets.
The model we'll be using is called GaussianNB, which we'll import from scikit-learn's naive_bayes module.
Next, we'll import functions that we'll use to calculate our model's accuracy, precision, recall, and F1 scores.
Finally, we'll import confusion_matrix and ConfusionMatrixDisplay, which will help us calculate and plot a confusion matrix of our model's results.
Let's read in the data frame and call it churn_df.
Before we begin modeling, let's do a couple more things.
First, we'll check the class balance of the exited column, which is our target variable.
We can do this by calling value_counts on the pandas series.
The class is split roughly 80-20.
In other words, about 20% of the people in this data set churned.
This is an unbalanced data set, but it's not extreme, so we'll proceed without doing any class rebalancing of our target variable.
Secondly, Naive Bayes models operate best when the predictor variables are conditionally independent from each other.
When we prepared our data, we engineered a feature called loyalty by dividing tenure by age.
Because this new feature is just the quotient of two existing variables, it's no longer conditionally independent, so we're going to drop tenure and age.
This step may or may not be beneficial, but we'll do it to help adhere to the assumptions of our model.
We've prepared our data and we're ready to model.
Now, we need to split the data first into features and target variable, and then into training data and test data.
Let's assign our predictive features to a variable called X, and the exited column, our target, to a variable called y.
Then, we can split into training and test data.
We do this using the train_test_split function.
We'll put 25% of the data into our test set and use the remaining 75% to train the model.
Notice that we include the argument stratify=y.
If our master data has a class split of 80-20, stratifying ensures that this proportion is maintained in both the training and test data.
Setting it equal to y tells the function that it should use the class ratio found in the y variable, which is our target.
The less data you have overall and the greater your class imbalance, the more important it is to stratify when you split the data.
If we didn't stratify, then the function would split the data randomly, and we could get an unlucky split that doesn't get any of the minority class in the test data.
In that case, we wouldn't be able to effectively evaluate our model.
Worst of all, we might not even realize what went wrong without doing some detective work.
Finally, we set a random seed so we and others can reproduce our work.
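Here is a sketch of the stratified split on synthetic data. The features are random numbers standing in for the bank data, and the seed value of 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 1,000 rows with an 80-20 class split.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 800 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # 25% of rows go to the test set
    stratify=y,       # keep the class ratio of y in both splits
    random_state=42,  # arbitrary seed so the split is reproducible
)

print("Train churn share:", y_train.mean())
print("Test churn share: ", y_test.mean())
```

Both shares should come out at about 20%, matching the full data set.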
Now, it's time to build the model.
Just as with linear and logistic regression, our modeling process will begin with fitting our model to the training data and then using the model to make predictions on the test data.
First, we'll instantiate the GaussianNB model, assigning it to a variable called gnb.
Then, we'll fit it to the X and y training data.
Lastly, we'll use the predict method to use the model to make predictions on the X test data, assigning the results to a variable called y_preds.
Now we can check how our model performs using the evaluation metrics we imported.
For each one, we pass to it first the actual Y test data and then the predictions.
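The fit, predict, and evaluate steps might look like the sketch below. It runs on synthetic data rather than the bank data, so the scores it prints will differ from the ones discussed in this video.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the churn data: two continuous features, 80-20 class split.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(800, 2)),  # class 0
    rng.normal(2.0, 1.0, size=(200, 2)),  # class 1
])
y = np.array([0] * 800 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Instantiate, fit to the training data, then predict on the test data.
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_preds = gnb.predict(X_test)

# Each metric takes the actual labels first, then the predictions.
accuracy = accuracy_score(y_test, y_preds)
print("Accuracy: ", accuracy)
print("Precision:", precision_score(y_test, y_preds))
print("Recall:   ", recall_score(y_test, y_preds))
print("F1 score: ", f1_score(y_test, y_preds))
```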
Hmm, this isn't very good.
Our precision, recall, and F1 scores are all zero.
What's going on?
Well, let's consider our precision formula: true positives divided by the sum of true positives and false positives.
There are two ways for the model to have a precision of zero.
The first is if the numerator is zero, which would mean that our model didn't predict any true positives.
The second is if the denominator is also zero, which would mean that our model didn't predict any positives at all.
Dividing by zero results in an undefined value, but scikit-learn will return a value of zero in this case.
Depending on your modeling environment, you may get a warning that tells you there's a denominator of zero.
We don't have a warning, so let's check which situation is occurring here.
To do this, we'll call NumPy's unique function on the predictions.
The model predicted zero, or not churned, for every sample in the test data. So both the numerator and the denominator are zero.
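The same situation can be reproduced on a tiny made-up example. The zero_division argument used below is a scikit-learn option that sets the value returned when the denominator is zero; passing it explicitly avoids the warning.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels: two actual positives, but the "model" predicts 0 every time.
y_test = np.array([0, 0, 0, 1, 1])
y_preds = np.zeros(5, dtype=int)

# Only one distinct value appears in the predictions.
predicted_classes = np.unique(y_preds)
print("Predicted classes:", predicted_classes)

# True positives = 0 and false positives = 0, so precision is 0 / 0.
# zero_division=0 tells scikit-learn to return 0 in that case without warning.
precision = precision_score(y_test, y_preds, zero_division=0)
print("Precision:", precision)
```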
Consider why this might be, perhaps we did something wrong in our modeling process,
or maybe using GaussianNB on predictor variables of different types and distributions just doesn't make a good model.
Maybe there were problems with the data.
Before we give up, maybe the data can give us some insight into what might be happening, or what further steps we can take.
Let's use describe to inspect the X data.
Something that stands out is that the loyalty variable we created is on a vastly different scale than some of the other variables we have, such as balance or estimated salary.
The maximum value of loyalty is 0.56, while the maximum value for balance is over 250,000, almost six orders of magnitude greater.
One thing that you can try when modeling is scaling your predictor variables.
Some models require you to scale the data in order for them to operate as expected while others don't.
Naive Bayes does not require data scaling. However, sometimes packages and libraries need to make assumptions and approximations in their calculations.
We're already breaking some of these assumptions by using the GaussianNB classifier on this data set, and it may not be helping that some of our predictor variables are on very different scales.
In general, scaling might not improve the model, but it probably won't make it worse. Let's try it.
We'll use a scaler called MinMaxScaler, which we'll import from the sklearn.preprocessing module. MinMaxScaler normalizes each column, so every value falls in the range of 0 to 1.
The column's maximum value would scale to 1, and its minimum value would scale to 0. Everything else would fall somewhere in between. This is the formula: x_scaled = (x - x_min) / (x_max - x_min).
To use a scaler, you must fit it to the training data and transform both the training data and the test data using that same scaler.
Let's apply this and retrain the model. First, import the scaler.
Then, we'll instantiate it and assign it to a variable called scaler.
Now, we fit the scaler by passing our X_train data to it. Next, we use the transform method to scale the X_train data.
Finally, we transform the X_test data. Now, we'll repeat the steps to fit a model. Only this time, we'll fit it to our new scaled data.
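Here is a sketch of the scaling steps on a few made-up rows, with one small-scale column and one large-scale column standing in for loyalty and balance. It also checks the result against the min-max formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: column 0 is on a small scale, column 1 on a large one.
X_train = np.array([
    [0.05, 0.0],
    [0.20, 100000.0],
    [0.56, 250000.0],
])
X_test = np.array([[0.30, 50000.0]])

# Fit on the training data only, then transform both sets with the same scaler.
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The same result by hand: (x - column min) / (column max - column min).
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

Because the scaler only learns the minimum and maximum from the training data, test values outside that range can land slightly below 0 or above 1.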
When we calculate the performance metrics for this model, we don't get an error. The model isn't perfect, but at least it's now predicting customers who churn.
Let's examine more closely how our model classified the test data. We'll do this with a confusion matrix.
Remember that a confusion matrix is a graphic that shows you your model's true and false positives and true and false negatives.
We can plot this using the ConfusionMatrixDisplay and confusion_matrix functions that we imported.
Here's the helper function that will allow us to plot a confusion matrix for our model.
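A helper along these lines could do the job. This is a sketch, not the exact function from the notebook: the function name and its parameters are assumptions, and it is demonstrated on a handful of made-up labels.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix


def conf_matrix_plot(y_true, y_pred, labels):
    """Compute a confusion matrix, plot it, and return the raw counts.

    Hypothetical helper: the name and signature are assumptions.
    """
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(values_format="d")  # draws the matrix with integer counts
    return cm


# Made-up labels: rows of the matrix are actual classes, columns are predictions.
y_test = np.array([0, 0, 0, 1, 1, 1])
y_preds = np.array([0, 0, 1, 0, 0, 1])

cm = conf_matrix_plot(y_test, y_preds, labels=[0, 1])
print(cm)
```

In a notebook the figure appears automatically; in a script you would typically call matplotlib's `plt.show()` afterward. With labels ordered as 0 then 1, the bottom-left cell holds the false negatives.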
All of our model metrics can be derived from the confusion matrix and each metric tells its own part of the story.
What stands out most in the confusion matrix is that the model misses a lot of customers who will churn. In other words, there are a lot of false negatives.
355, to be exact. This is why our recall score is only 0.303.
Coming up, you'll investigate the various model evaluation metrics and when to use them. You'll explore the use of several metrics to evaluate model performance.
Then, you will determine which of the models best satisfies the business requirements for the data project.
Meet you there.
Wow, you've reached the last stage of the PACE workflow: execute.
This is where model analysis happens and the model finally becomes production ready.
You've already learned a lot about model metrics, the options available to you, and what they can demonstrate to a data professional about the model that has been built.
And, you've built supervised learning models, including a categorical model in the form of logistic regression.
The metrics you used to evaluate those models will also make it possible to evaluate a Naive Bayes model.
As a review, accuracy reflects the number of correct predictions divided by the total number of predictions.
However, accuracy doesn't always tell the full story.
Some data sets will feature a strong class imbalance, which occurs when the majority of items belong to only one class. Then, the data set is considered imbalanced.
Here's an example using a binary classification problem.
An IT professional wants to use a model to detect malware in the computers at their company.
Perhaps there are 5,000 instances in their data set, but only 500 positive instances where malware was actually present in a computer.
This person would have an imbalanced data set, as the chances of finding malware among all the checks that happen is actually comparatively low.
This is where the precision and recall metrics can help.
Precision measures what proportion of positive predictions were correct.
In other words, if the model predicted that malware was going to be present, how many times was it actually on a computer?
Precision is calculated by dividing the number of true positives by the sum of the true and false positives.
And the recall metric, on the other hand, finds the proportion of actual positives that were identified correctly.
In the context of our example, recall indicates how many would-be malware threats were correctly identified.
Recall is calculated by dividing the number of true positives by the sum of the true positives and false negatives.
F1 score combines both precision and recall in one metric.
Accuracy, precision, recall, and F1 score are top metrics in classification techniques.
More specifically, precision, recall, and F1 score are especially useful for measuring unbalanced classes.
In any case, data professionals use all four metrics to evaluate categorical supervised learning models.
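To see why accuracy alone can mislead on the malware example, consider the sketch below. The 5,000 checks and 500 infections come from the example above; the two sets of predictions are hypothetical, invented to contrast a model that never flags malware with one that catches some of it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 5,000 checks, 500 of which actually involve malware (1 = malware present).
y_true = np.array([1] * 500 + [0] * 4500)

# Hypothetical model A: never predicts malware.
y_never = np.zeros(5000, dtype=int)

# Hypothetical model B: catches 300 of the 500 infections, raises 100 false alarms.
y_model = np.zeros(5000, dtype=int)
y_model[:300] = 1      # 300 true positives (200 infections are missed)
y_model[500:600] = 1   # 100 false positives

acc_never = accuracy_score(y_true, y_never)
rec_never = recall_score(y_true, y_never)

precision = precision_score(y_true, y_model)  # TP / (TP + FP)
recall = recall_score(y_true, y_model)        # TP / (TP + FN)
f1 = f1_score(y_true, y_model)                # 2 * P * R / (P + R)

print(f"Model A: accuracy={acc_never:.2f}, recall={rec_never:.2f}")
print(f"Model B: precision={precision:.2f}, recall={recall:.2f}, F1={f1:.3f}")
```

Model A scores 90% accuracy while finding no malware at all, which is the gap that precision, recall, and F1 score are meant to expose.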
As you've begun to discover, each model performs differently.
And some algorithms work better than others.
When building any model intended for production, it's essential to improve the results.
You might change specific parameters to discover how the performance improves.
So, you should always keep in mind that model building is an inherently iterative process.
The first model that you produce will almost never be the one that gets deployed.
The iterative process provides the information needed to get the model working optimally.
After tweaking the parameters, or changing how features are engineered in each model, the performance metrics provide a basis for comparing the models to each other and against themselves.
Coming up, we'll find out what these metrics reveal about our model,
use them to evaluate other models we've built, and examine how to improve model performance.
Continuous improvement is a key part of being a data professional.
So these exercises are preparing you to keep advancing all kinds of data processes.
Developing a machine learning model is a complex process.
But having a solid framework to rely on helps set you up for success, no matter the business need.
In this part of the course, the PACE workflow provided support to help you navigate the different stages of addressing an example business scenario.
During the plan stage, you assessed the business need to determine what type of model is best suited for predicting bank customer churn.
This decision was based on the available data.
Next, in the analyze stage, you examined the data using EDA practices and feature engineering techniques.
This process revealed more details about the data to help inform your plans for building the model.
From there, you went on to construct the first iteration of your Naive Bayes model.
You then tested the model with preliminary evaluation metrics to determine its performance against the test data.
And finally, in the execute stage, you closely evaluated the model's performance and considered how it could be improved.
And that is the PACE workflow for machine learning models.
Having this process in your tool belt will allow you to solve many business problems.
While the models may very well get more complex as you continue your journey,
sticking within the framework will help you achieve the results you need.
You've come a long way on your journey into the world of data and you've been building a strong foundation for data modeling.
So far, you've learned about linear regression, logistic regression, and Naive Bayes models,
all supervised learning techniques that make predictions on labeled data.
Most machine learning applications today are based on supervised learning.
But when we consider all the available data in the world, the vast majority of it is unlabeled.
Photographs, voice recordings, videos, social media posts, these are all examples of unlabeled data.
You may be familiar with this concept in the context of data analytics.
If you earned your Google Data Analytics career certificate, you learned that any data that's not organized in an easily identifiable manner is known as unstructured.
In this program, we'll sometimes refer to it as unlabeled.
But the meaning is the same.
So, how do we make sense of all that unlabeled data?
We use unsupervised learning techniques. When our data is unlabeled, these methods make it possible for data professionals to learn about the data's underlying structure,
and find out how different features relate to each other.
Earlier in the course, you explored one very common type of unsupervised learning, recommendation systems.
You learned that they are a subclass of machine learning algorithms, which offer relevant suggestions to users,
such as new songs for your playlist or a new coat for winter.
Now, you'll get to know many other exciting methodologies and applications of unsupervised learning.
In this section of the course, you'll first learn about k-means, an unsupervised modeling technique.
You'll investigate how it clusters data based on each observation's similarity to others in the data.
You'll also build a k-means model and learn how to evaluate it using metrics called inertia and silhouette score.
I'm thrilled to have you with me as we explore unsupervised learning.
There's so much great potential for future development.
We've only just begun tapping into the vast amount of unstructured data in the world.
Let's start modeling.
In this video, we'll introduce you to the k-means algorithm.
K-means is an unsupervised partitioning algorithm.
It's used to organize unlabeled data into groups or clusters.
It does this by creating a logical scheme to make sense of the data.
With k-means, each cluster is defined by a central point or a centroid.
Its position represents the center of the cluster, also known as the mathematical mean, hence the name k-means.
There are four steps to building a k-means model.
Let's examine them one at a time.
In step one, you choose the number of centroids and place them in the data space.
K represents the number of centroids in your model, which is how many clusters you'll have.
This is a decision that you make.
Sometimes, you'll have an idea about the number of clusters necessary for a project.
For example, if your company manufactures five different products, you might want to set your k-value to five.
Other times, you won't know how many clusters your data should be split into.
So try different values for k, and determine what provides the best results.
Here, it's apparent that our data is grouped into two clusters, one on top, and the other on bottom.
At step one, we'll randomly initialize two centroids represented by the blue and red X's.
Step two is to assign each data point to its nearest centroid.
The nearest centroid is the one that's closest in space.
In this example, the top two observations are assigned to the blue centroid, and the bottom two observations are assigned to the red centroid.
As a quick refresher, in this context, an observation is simply any data point being observed.
Step three is to recalculate the centroid of each cluster.
Again, the centroid's location is calculated by taking the mean of all of the points in its cluster.
Note that the centroids move to the midpoint of their clusters.
This will happen each iteration until the algorithm reaches convergence.
This is the stable point found at the end of a sequence of solutions.
Step four is to repeat steps two and three until the algorithm converges.
In this case, we have very little data, so the model is simple and has already converged.
If you had a lot more data, the centroids would get increasingly closer to their related clusters.
You also might find that the cluster assignment of each data point changes as the centroid locations move within each iteration.
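As a rough illustration of the four steps, here is a minimal NumPy sketch. The function name kmeans_sketch, the four made-up points, and the starting centroid positions are all assumptions for this example, and it is not how any particular library implements the algorithm.

```python
import numpy as np

def kmeans_sketch(X, initial_centroids, max_iter=100):
    """Run steps 2-4 of k-means from caller-supplied centroids (step 1)."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Step 4: repeat until the centroids stop moving (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Four hypothetical observations: two on top, two on the bottom.
X = np.array([[1.0, 4.0], [3.0, 4.0], [1.0, 0.0], [3.0, 0.0]])
centroids, labels = kmeans_sketch(X, initial_centroids=[[2.0, 5.0], [2.0, -1.0]])
print(labels)
print(centroids)
```

With so little data, the centroids settle after a single update, which matches the idea that a simple model can converge almost immediately.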
Something to be mindful of is that it's important to run the model with different starting positions for the centroids.
This helps avoid poor clustering caused by local minima.
In other words, not having an appropriate distance between clusters.
Let's explore this concept using our example.
What if this had been the initial positions of our centroids?
Notice what happens.
In step two, we assign the points to their nearest centroid.
With these particular starting positions, the two observations on the left are assigned to the red cluster
and the two observations on the right are assigned to the blue cluster.
For step three, we recalculate the position of each cluster centroid.
The model has converged, but the clusters aren't what we'd expect.
We know that the most intuitive clustering would be for the top two observations to be in one cluster and the bottom two in another.
But that's not what happened, and further iterations will not change this.
This is why it's important to run the model with different centroid initializations and to avoid poor clustering due to the model converging in local minima.
Note that this clustering isn't wrong, it's still a valid resolution of the model.
It just doesn't make much sense in this context.
After all, the goal is to find a clustering scheme that makes sense of your data.
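To make the effect of starting positions concrete, here is a small sketch using scikit-learn's KMeans with hand-picked initial centroids. The four points and both sets of starting positions are made up for this example. Setting n_init=1 means each start is run only once, so the result depends entirely on where the centroids begin.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four hypothetical observations: two on top, two on the bottom.
X = np.array([[1.0, 4.0], [3.0, 4.0], [1.0, 0.0], [3.0, 0.0]])

# Two hypothetical sets of starting centroids.
starts = {
    "top/bottom start": np.array([[2.0, 5.0], [2.0, -1.0]]),
    "left/right start": np.array([[0.0, 2.0], [4.0, 2.0]]),
}

inertias = {}
for name, init in starts.items():
    # n_init=1: use exactly the starting centroids we passed in.
    model = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
    inertias[name] = model.inertia_
    print(name, model.labels_, model.inertia_)
```

Both runs converge, but the left/right start ends with a noticeably larger sum of squared distances than the top/bottom start. Comparing results across several initializations is one way to spot a run that got stuck in a local minimum.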
Finally, note that even though k-means is a partitioning algorithm, data professionals typically talk about it as a clustering algorithm.
The difference is that outlying points in clustering algorithms can exist outside of the clusters.
However, for partitioning algorithms, all points must be assigned to a cluster.
In other words, k-means does not allow unassigned outliers.
So to recap, k-means is an unsupervised learning technique that groups unlabeled data into k clusters based on similarity.
The clustering process has four steps that repeat until the model converges.
The value for k is a decision that the modeler makes.
And finally, it's important to build multiple models to avoid poor clustering.
Later in this section, you'll learn how to determine the best value for k.
You'll also discover some of the limitations of k-means models.
Lots more coming up!
At this point, you're familiar with the basic intuition behind the k-means algorithm.
In this video, we're going to demonstrate how to apply your knowledge of k-means to an actual example.
We'll use the k-means algorithm to compress colors in a photographic image.
This demonstration is intended to lead you through an application of the k-means theory to give you a deeper understanding of how it works.
So, for this video, focus less on the mechanics of the code itself and more on the results.
Let's get started!
We're going to use a photograph of some tulips as our data, which we'll read in as an array using matplotlib's imread function,
and display using its imshow function.
This is the image that we're going to manipulate using K means.
When we check the shape of the image in pixels, we're told that it's 320 by 240 by 3.
We can interpret these numbers as pixel information.
Each dot on the screen is a pixel.
This photograph has 320 vertical pixels and 240 horizontal pixels.
But what is the dimension of 3?
This dimension refers to the values that encode the color of each pixel.
Each pixel has 3 parameters: red or R, green or G, and blue or B.
Together, these values are known as RGB values.
The value for each color, R, G, and B, can range from 0 to 255.
This means that there are 256 cubed, or more than 16 million, different possible combinations
of RGB values, each resulting in a unique color.
To prepare this data for modeling, we'll reshape it into an array,
where each row represents a single pixel's RGB color values.
Now, we have an array that is 76,800 by 3.
Each row is a single pixel's color values.
Let's create a pandas DataFrame to help us understand and visualize this data.
Each row of the DataFrame represents a single pixel,
and the three columns are its R, G, and B values.
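Here is a small sketch of that reshaping step. Because the tulip photo isn't available here, it uses a randomly generated stand-in array with the same 320 by 240 by 3 shape. The variable names and the column names r, g, and b are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the photo: random RGB values with the same shape as the image.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(320, 240, 3), dtype=np.uint8)
print(img.shape)

# One row per pixel, one column per color channel.
img_flat = img.reshape(img.shape[0] * img.shape[1], 3)
print(img_flat.shape)

# The same data as a DataFrame with named color columns.
img_flat_df = pd.DataFrame(img_flat, columns=["r", "g", "b"])
print(img_flat_df.head())
```

The first row of the flattened array is the top-left pixel of the image, and the rows continue left to right, then top to bottom.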
Because we have only three columns, we can visualize this data in three-dimensional space.
This graph plots each of the photograph's pixels in a 3D coordinate space.
Each axis ranges from 0 to 255,
just like each value in RGB, and each dot in the graph is the color specified by its RGB values,
just like in the original photograph.
The more intense the color, the more dots are concentrated in that area.
The most represented colors here are the most abundant colors in the photograph,
mostly reds, greens, and yellows.
We can examine this graph from different angles, and even zoom in and out.
We can also train a k-means model on this data.
The algorithm would create k clusters by minimizing the squared distances from each point to its nearest centroid.
Here's an experiment.
What do you expect to happen if we build a k-means model with just a single centroid?
In other words, with k equal to 1? Let's find out.
We'll first instantiate the model. As a refresher,
instantiation involves creating an instance of the class,
which inherits all class variables and methods.
So, let's set the number of clusters to 1 and fit the model to our data.
Now, we're going to copy the original image,
replace each of its rows with the values of its closest cluster center,
and reshape the image so we can display it.
The image we get back doesn't resemble tulips at all.
So what happened?
Well, let's run through the k-means steps.
First, the algorithm randomly placed a centroid in the 255 by 255 by 255 color space.
Then, it assigned each point to its nearest centroid,
because there was only one centroid, all points were assigned to it,
and therefore to the same cluster.
Next, the algorithm updated the centroid's location to the mean location of all of its points.
Again, there's only a single centroid, so it updated to the mean location of every point in the image.
Usually, these steps would repeat until the model converges,
but in this case, it took only one iteration.
We updated each pixel's RGB values to be the same as those of our centroid.
The result is the image of our tulips where every pixel is replaced with the average color.
We can verify this for ourselves by manually calculating the average for each column in the array.
This will return the mean R value, G value, and B value.
Let's compare this to what the k-means model calculated as the final location of its one centroid.
We'll do this by using the cluster_centers_ attribute of the fitted k-means model object.
They're the same.
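This check can be sketched as follows, using random stand-in pixel values rather than the actual photo. The variable names, n_init=10, and the random_state value are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in pixel data: 1,000 random RGB rows instead of the real image.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3)).astype(float)

# A k-means model with a single centroid.
kmeans1 = KMeans(n_clusters=1, n_init=10, random_state=42).fit(pixels)

# Mean R, G, and B values, calculated manually.
column_means = pixels.mean(axis=0)
print(column_means)

# Final location of the model's one centroid.
print(kmeans1.cluster_centers_[0])
```

The two printed arrays should match up to floating-point rounding, since a single centroid always ends up at the mean of all the points.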
Now, let's return to the 3D rendering of our color space.
Only this time, we'll add the centroid.
The centroid is a large circle in the middle of the color space.
Notice that this is the center of gravity, so to speak, of all the points in the graph.
Okay, now let's refit another k-means model to the data.
Only this time, using k equals 3.
Take a moment now to consider what you might expect to result from this.
Go through the steps of what the model is doing like we did above.
What colors are you likely to see?
So, we refit the model, setting our number of clusters to 3, and we get three centroid locations,
which are the RGB values we can use to display the colors of each centroid.
We'll use a helper function to display our color swatches, and there they are.
You might have hypothesized that there'd be similar colors as a result of the three-cluster model.
And that's correct. The photo's dominant colors of red, green, and yellow are present here.
Again, we can replace each pixel in the original image with the RGB values of the centroid
to which it was assigned by the new k-means model.
This is a function that will display the photo for any value of k that we choose.
We'll call this function with three as its argument.
We now have a photo with just three colors, the same three colors from the swatches above.
Each color's RGB values correspond to the values of the location of its nearest centroid.
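A function along these lines could be sketched as shown here. The name compress_colors and its parameters are hypothetical, it returns the compressed image rather than displaying it, and a small random array stands in for the photo.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_colors(img, k, random_state=42):
    """Return a copy of img where each pixel takes its k-means centroid's color."""
    flat = img.reshape(-1, 3).astype(float)
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(flat)
    # Look up each pixel's centroid through its cluster label,
    # then restore the original image shape.
    compressed = model.cluster_centers_[model.labels_]
    return compressed.reshape(img.shape).round().astype(np.uint8)

# Small random stand-in for the photo.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 24, 3), dtype=np.uint8)

img_k3 = compress_colors(img, k=3)
n_colors = len(np.unique(img_k3.reshape(-1, 3), axis=0))
print(img_k3.shape, n_colors)
```

The returned array has the same shape as the original, so it could be passed to an image display function such as matplotlib's imshow.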
We can return once more to our 3D coordinate space.
This time we'll re-color each dot to correspond with the color of its centroid.
This will allow us to see how the k-means algorithm clustered our data spatially.
Check it out.
Each pixel is now colored according to the RGB value of its centroid,
and the clusters are at the vertices of the space.
This whole process can be applied for any value of k.
Here's the output of each photo for k equals 2 through 10.
Notice that it becomes increasingly difficult to see the difference between the images each time a color is added.
This is a visual example of something that happens with all clustering models,
even if the data is not an image that you can see.
As you group the data into more and more clusters,
additional clusters beyond a certain point contribute less and less to your understanding of your data.
This demonstration has deepened your understanding of how the k-means algorithm works.
Soon we'll explore methods for numerically determining which k value is best for particular data.
As always, feel free to explore the notebook more on your own to keep building your skill set.
You now have some familiarity with the intuition behind the k-means methodology.
In some of the examples we presented, we plotted points in two dimensional space.
In those cases, it was clear if the model was correctly assigning points to clusters.
We also studied an example where we were able to visualize the data in three dimensions.
Unfortunately, most cases data professionals encounter on the job are not so easy.
Your data will have many more than three dimensions,
so you will not be able to visualize how each observation relates to those around it.
You might not even know how many clusters there should be.
So, how do you decide the value for k?
And once you do, how do you know if your model is working as intended?
In linear and logistic regression, you used metrics such as R squared, mean squared error,
area under the ROC curve, precision, and recall to evaluate the effectiveness of your model.
But in unsupervised learning, you don't have any labeled data to compare your model against.
The metrics aren't applicable.
In fact, your model isn't predicting anything.
Instead, it's grouping observations based on their similarities.
It's up to you to investigate and understand the different clusters.
Remember, if you have some domain knowledge or the problem you're trying to solve has its own constraints,
use these things to your advantage.
For example, maybe you're investigating customer segmentation for a service that offers four different subscription levels.
Then you'd probably want to use four as your value for k.
But if you have no way of knowing in advance what value to use for k, don't worry.
There are other ways to figure it out.
Before we get started with evaluation metrics, consider what makes for a good clustering model.
Basically, you want clearly identifiable clusters.
This means that within any cluster, or intra-cluster, the points are close to each other.
It also means that between the clusters themselves or inter-cluster, you want lots of empty space.
One way to evaluate the intra-cluster space in a k-means model is to identify its inertia.
This is a different concept from inertia as it's defined in physics.
Here, inertia is defined as the sum of the squared distances between each observation and its nearest centroid.
Essentially, this is a measurement of how closely related observations are to other observations within the same cluster.
That information is then aggregated across all the clusters to produce a single score for the particular metric being measured.
Another important metric for evaluating your k-means model is the silhouette score.
This is a more precise evaluation metric than inertia because it also takes into account the separation between clusters.
Silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model.
We'll cover this in more depth later.
For now, just know that the silhouette score helps evaluate your model, provides insight as to what the optimal value for k should be,
and uses both intra-cluster and inter-cluster measurements in its calculation.
Nice work! You now have a better understanding of what makes a good clustering model.
You also understand more about some of the metrics to determine your model's effectiveness.
Coming up, we'll further explore these metrics and learn how data professionals put them to use on the job.
Previously, you were introduced to inertia and silhouette scores as metrics to evaluate k-means models.
These are indispensable tools for data professionals who work with k-means models and any other model in the clustering family.
Now let's expand on these concepts.
Consider again, what makes a good clustering model.
Ideally, you'd have tight clusters of closely related observations, and each cluster is well separated from other clusters.
Remember that inertia is the sum of the squared distances between each observation and its closest centroid.
It's a measurement of intra-cluster distance, so it gauges how closely related each observation is to the other observations in its own cluster.
Inertia can be represented by this formula, where n is the total number of observations in the data and C sub k is the centroid of the cluster that the observation x sub i is in.
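The on-screen formula isn't captured in this transcript. Based on the definition just given, it can be reconstructed as follows, assuming the distance is the usual Euclidean distance:

```latex
\text{Inertia} = \sum_{i=1}^{n} \lVert x_i - C_k \rVert^2
```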
The more compact the clusters, the lower the inertia, because there's less distance between each observation and its nearest centroid.
Therefore, it's important for inertia to be as close to zero as possible.
Can inertia ever be zero? Well, it's possible, but this scenario wouldn't offer any new insight into the data.
Here's why.
In one case, if all observations were identical, this would mean all data points are in the same location.
Then inertia equals zero for all values of k.
And the second case is when the number of clusters is equal to the number of observations.
If each observation is in its own cluster, then its centroid is itself.
Inertia is a great metric because it helps us to decide on the optimal k value.
We do this by using the elbow method.
In the elbow method, we first build models with different values of k.
Then we plot the inertia for each k value.
Here's an example.
Notice that the greater the value is for k, the lower the inertia.
So should you always select high k values?
Well, no.
A low inertia is great, but if it results in meaningless or inexplicable clusters, it doesn't help you at all.
A good way of choosing an optimal k value is to find the elbow of the curve.
This is the value of k at which the decrease in inertia starts to level off.
In this example, that occurs when we use three clusters.
Sometimes it might be difficult to choose between two consecutive values of k.
In that case, it's up to you to determine which is best for your particular project.
The second important metric for evaluating your k-means model is the silhouette score.
This is a more precise evaluation metric than inertia because it takes into account the separation between the clusters.
Silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model.
Each observation has its own silhouette coefficient, which is calculated as b minus a, over whichever value is greater, a or b.
Where a is the mean distance from that observation to all other observations in the same cluster,
and b is the mean distance from that observation to each observation in the next closest cluster.
The silhouette coefficient can be anywhere between negative one and one.
Consider this schematic.
If an observation has a silhouette coefficient close to one, it means that it's both nicely within its own cluster and well separated from other clusters.
A value of zero indicates that the observation is on the boundary between clusters.
If your observation has a silhouette coefficient close to negative one, it may be in the wrong cluster.
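To see the calculation in action, here is a small sketch that computes the silhouette coefficient by hand and compares it with scikit-learn's silhouette_samples and silhouette_score functions. The four points, their labels, and the helper name silhouette_coefficient are made up for this example. The helper assumes exactly two clusters, so the "next closest cluster" is simply the other one.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Four hypothetical observations in two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_coefficient(i, X, labels):
    """(b - a) / max(a, b) for observation i; assumes two clusters of 2+ points."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the observation itself
    a = dists[same].mean()                   # mean distance within its own cluster
    b = dists[labels != labels[i]].mean()    # mean distance to the other cluster
    return (b - a) / max(a, b)

manual = np.array([silhouette_coefficient(i, X, labels) for i in range(len(X))])
print(manual)
print(silhouette_samples(X, labels))

# The silhouette score is the mean of the coefficients.
print(manual.mean(), silhouette_score(X, labels))
```

Because the two clusters here are tight and far apart, every coefficient lands fairly close to one.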
So, as you've just experienced, when using silhouette score to help determine how many clusters your model should have,
you'll generally want to opt for the k value that maximizes your silhouette score.
Inertia and silhouette score are important metrics to help you determine the most appropriate number of clusters for your k-means model.
Now that you're familiar with how they're derived, you'll be well prepared to continue working with them.
Previously, you learned about inertia and silhouette scores.
You understand that they're metrics used to help decide an effective value for k in a k-means model.
And because k-means is an unsupervised learning model, it's used to find structure and relationships within data, but there are no right answers.
Therefore, data professionals who use these models have to rely on these metrics to help them determine whether their model is identifying characteristics of their data that are useful for their needs.
In this video, we're going to build a k-means model and evaluate it using inertia and silhouette score.
We'll go over which packages to import, how to scale data, and how to instantiate and fit the model.
And of course, we'll use the labels_ and inertia_ attributes and the silhouette_score function to determine a final value for k.
Once again, we'll return to Jupyter notebooks as the platform where we'll build our models.
In a new notebook, the first step, as always, is to begin with import statements.
This will create the computing environment with the necessary packages and tools for your project.
In this case, we'll import NumPy and pandas as our operational packages.
We'll also import the following task-specific items from scikit-learn: KMeans, silhouette_score, and StandardScaler.
Note the syntax, as each item is imported from a different package.
KMeans comes from sklearn.cluster, silhouette_score comes from sklearn.metrics, and StandardScaler comes from sklearn.preprocessing.
The function make_blobs is something we'll use just for this demonstration to help us create synthetic data.
We'll use seaborn for graphing.
In practice, you'd have a real dataset, and you'd read in this data and perform EDA, data cleaning, and other manipulations to prepare it for modeling.
For simplicity and to help us focus on modeling and analysis, we're going to use synthetic data for this demonstration.
We'll start by creating a random number generator.
This is to help create reproducible synthetic data.
We'll use it to generate cluster data.
For now, we won't know how many clusters there are.
By calling the random number generator and assigning the result to a variable, we can avoid viewing the true number of clusters our data has.
This will let us use inertia and silhouette coefficients to determine it.
This next step uses make_blobs and our random number generator to create data that has an unknown number of clusters.
At least to us.
These steps return a NumPy array, but it's usually helpful to view your data as a pandas DataFrame.
This is often how your data will be organized when modeling on the job.
So we'll convert our data to a pandas DataFrame.
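These steps might look something like the following sketch. The seed, the range the number of clusters is drawn from, the sample size, and the random_state are all assumptions for illustration. Only the six features and the variable name centers come from the video.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

# Random number generator with a fixed seed, for reproducible synthetic data.
rng = np.random.default_rng(seed=44)

# Draw the number of clusters without printing it, so it stays unknown to us.
# The range 3 through 6 is an assumption for this sketch.
centers = int(rng.integers(low=3, high=7))

# Synthetic data with six features and that many clusters.
X, y = make_blobs(n_samples=1000, n_features=6, centers=centers, random_state=42)

# Convert the NumPy array to a DataFrame for easier viewing.
X = pd.DataFrame(X)
print(X.head())
```

Note that rng.integers excludes the high value, so with these settings the number of clusters falls between three and six.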
In the six columns, we find that our data has six features.
This is too many dimensions for us to visualize in 2D or 3D space.
We can't observe how many clusters there are, so we'll need to use our detective skills to figure it out.
Because k-means uses distance between observations as its measure of similarity, it's important to scale our numerical data before modeling if it's not already scaled.
For this, we'll use scikit-learn's StandardScaler.
StandardScaler scales each point x sub i by subtracting the mean for that feature and dividing by the feature's standard deviation.
This ensures that all the variables have a mean of zero and a variance and standard deviation of 1.
There are a number of scaling techniques available in scikit-learn's preprocessing package, including StandardScaler, MinMaxScaler, Normalizer, and others.
There's no firm rule for determining which method will work best, but with k-means models, using any scaler will almost always lead to better results than not scaling at all.
We can instantiate StandardScaler and transform our data in a single step by using the .fit_transform() method and passing our data to it as an argument.
Here's a tip.
If your computer has enough memory, it's helpful to keep an unscaled copy of your data to use later.
So we'll assign the scaled data to a new variable called X_scaled.
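As a quick sketch of the scaling step, the following compares StandardScaler's output with the same calculation done by hand. The synthetic data and its parameters are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with six features.
X, _ = make_blobs(n_samples=200, n_features=6, centers=4, random_state=42)

# Instantiate the scaler and transform the data in one step,
# keeping the unscaled X for later use.
X_scaled = StandardScaler().fit_transform(X)

# The same calculation by hand: subtract each feature's mean,
# then divide by that feature's standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Each feature ends up with a mean of roughly zero and a standard deviation of roughly one.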
Now that the data is scaled, we can start modeling.
Because we don't know how many clusters exist in the data, we'll begin by examining the inertia values for different values of k.
Let's start with k equals 3, an arbitrary number, in this case.
One thing to note is that by default, scikit-learn implements an optimized version of the k-means algorithm called k-means++.
This helps ensure optimal model convergence by initializing centroids far away from each other.
Because we're using k-means++, we will not rerun the model multiple times.
Now, let's instantiate the model.
Because we want to build a model that puts our data into three clusters, we set the n_clusters parameter to three.
We'll also set the random_state to an arbitrary number.
This is only so others can reproduce our results.
If we left this value blank, it's possible others could replicate our code exactly and still get different results due to the random initial placement of centroids.
The next step is to fit the model to the data.
We do this by using the fit method and passing in our scaled data.
This returns a model object that has learned your data.
You can now call its different attributes to view inertia, location of centroids, and class labels, among others.
We can get the cluster assignments by using the labels_ attribute.
Similarly, find the inertia by using the inertia_ attribute.
Let's find out what happens when we check the cluster assignments and inertia for this model.
The labels_ attribute returns an array of values that is the same length as the training data.
Each value corresponds to the number of the cluster to which that point is assigned.
Because our k-means model clustered the data into three clusters, the value assigned to each observation will be zero, one, or two.
The inertia_ attribute returns the sum of the squared distances from each sample to its closest cluster center.
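Putting these steps together, a sketch might look like this. The synthetic data, n_init=10, and random_state=42 are assumptions for illustration, and kmeans3 is simply the name used here for the three-cluster model. The last lines recompute inertia by hand from the definition.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical scaled data with six features.
X, _ = make_blobs(n_samples=500, n_features=6, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Instantiate and fit a three-cluster model.
kmeans3 = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans3.fit(X_scaled)

print(kmeans3.labels_[:10])         # cluster assignment of the first ten points
print(np.unique(kmeans3.labels_))   # the distinct cluster numbers
print(kmeans3.inertia_)

# Inertia by hand: squared distance from each point to its assigned centroid.
diffs = X_scaled - kmeans3.cluster_centers_[kmeans3.labels_]
manual_inertia = (diffs ** 2).sum()
print(manual_inertia)
```

The manual value should agree with the inertia_ attribute up to floating-point rounding.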
This inertia value isn't helpful by itself.
We need to compare the inertia of multiple k values.
To do this, create a function called kmeans_inertia that fits a k-means model for multiple values of k, in our case, two through ten.
The function calculates the inertia for each value, appends it to a list, and returns that list.
Then we plot it.
The x-axis is the number of clusters, and the y-axis is the inertia.
This plot contains an unambiguous elbow at five clusters.
Models with more than five clusters don't appear to reduce inertia at all.
Right now, it seems like a five cluster model might be optimal.
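A function like the one described could be sketched as follows. The parameter names, n_init=10, random_state=42, and the synthetic data are assumptions for this example. The data here is generated with four clusters, so it won't reproduce the plot from the video.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

def kmeans_inertia(num_clusters, x_vals):
    """Fit a k-means model for each k in num_clusters and return the inertias."""
    inertia = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, n_init=10, random_state=42)
        kms.fit(x_vals)
        inertia.append(kms.inertia_)
    return inertia

# Hypothetical scaled data for the demonstration.
X, _ = make_blobs(n_samples=500, n_features=6, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

num_clusters = list(range(2, 11))  # k = 2 through 10
inertia = kmeans_inertia(num_clusters, X_scaled)
for k, value in zip(num_clusters, inertia):
    print(k, round(value, 1))
```

Plotting num_clusters on the x-axis against inertia on the y-axis, for example with seaborn's lineplot, gives the elbow curve.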
But let's check the silhouette scores. Hopefully the results will corroborate our findings.
To get a silhouette score, we call the function and pass to it two required parameters, the training data, and its assigned labels.
Let's check this out for the kmeans3 model we created earlier.
It worked. However, this value isn't very useful if we have nothing to compare it to.
Just as we did for inertia, we'll write a function that compares the silhouette score of each value of k from two through ten.
Now, plot these silhouette scores.
This plot indicates that the silhouette score is closest to one when our data is partitioned into five clusters.
It confirms the inertia analysis.
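The silhouette comparison could be sketched in the same way. The function name kmeans_sil, its parameters, and the data are assumptions for illustration. To keep the example easy to follow, the data is two-dimensional with four hand-placed, well-separated blob centers rather than the six-feature data from the video.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def kmeans_sil(num_clusters, x_vals):
    """Fit a k-means model for each k and return the silhouette scores."""
    sil_score = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, n_init=10, random_state=42)
        kms.fit(x_vals)
        sil_score.append(silhouette_score(x_vals, kms.labels_))
    return sil_score

# Hypothetical data: four tight blobs placed far apart.
blob_centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=400, centers=blob_centers, cluster_std=0.5,
                  random_state=42)
X_scaled = StandardScaler().fit_transform(X)

num_clusters = list(range(2, 11))  # k = 2 through 10
sil_score = kmeans_sil(num_clusters, X_scaled)

# The k value with the highest silhouette score.
best_k = num_clusters[int(np.argmax(sil_score))]
print(best_k)
```

For blobs this clearly separated, the highest score falls at the true number of clusters. With real data the peak is often less distinct.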
In this case, because we used synthetic data, we can review how many clusters actually existed in our data.
This is the variable created by the random number generator at the beginning of the video. We called it centers.
We were right. We were able to use inertia and silhouette score to correctly deduce that our data has five clusters.
At this point, we'll want to do some further analysis to determine whether we can understand our clusters and if they are appropriate for our use case.
We'll instantiate a new k-means model with n_clusters equal to five and fit it to our scaled data.
Okay, now we can confirm that there are five unique labels ranging from zero through four.
So, we can use them to create a new column in the unscaled DataFrame.
Next, we could perform analysis on the different clusters to identify what makes them different from one another.
This would not have been possible with the previously scaled data because the numbers wouldn't make a lot of sense.
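As a sketch of this last step, the following fits a five-cluster model on scaled data, attaches the labels to the unscaled DataFrame, and then summarizes each cluster. The synthetic data, the column name cluster, the model name kmeans5, and the use of per-cluster means as a first analysis are all assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled data with six features and five clusters.
X, _ = make_blobs(n_samples=1000, n_features=6, centers=5, random_state=42)
X = pd.DataFrame(X)
X_scaled = StandardScaler().fit_transform(X)

# Fit the final model on the scaled data.
kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans5.fit(X_scaled)

# Add the cluster assignments to the unscaled DataFrame.
X["cluster"] = kmeans5.labels_
unique_labels = sorted(X["cluster"].unique().tolist())
print(unique_labels)

# One possible starting point: average feature values per cluster,
# expressed in the original, unscaled units.
cluster_means = X.groupby("cluster").mean()
print(cluster_means)
```

Working in the original units is what makes the per-cluster summaries interpretable.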
Note that in many cases, it's not always clear what differentiates one cluster from another.
And it can take a fair bit of effort to determine whether it makes sense to cluster your data a given way.
This is where domain knowledge and expertise through practice are very valuable.
Congratulations! You've reached the end of another section of the course.
Along the way, you've discovered that unsupervised learning is a vast field with many different applications.
First, you learned about k-means models for deriving structure from your data.
You were introduced to the concept of clustering similar groups of data around centroids and building a k-means model.
We explored the importance of running a k-means model multiple times with different values for k.
We also explained the issues with local minima and how to build models with different centroid initializations to make sure you're getting the most accurate results.
You're now much more familiar with inertia and silhouette score, methods for choosing the best number of clusters and evaluating the effectiveness of your model.
And you can identify the elbow of an inertia curve and use it to help determine an optimal k value.
There's a whole world of unsupervised learning models and methodologies out there.
This is just the beginning, but now you're empowered with key tools for navigating the landscape and developing your talents as a data professional.
Hello and welcome back, you've come so far building models of your own using the tools and skills you've been learning throughout this program.
Now, in this final part of the course, you'll revisit supervised machine learning by investigating some more advanced classification techniques.
These advancements are very exciting for data professionals because they enable us to overcome some typical modeling limitations.
One such method is tree-based learning.
Tree-based learning is a type of supervised machine learning that performs classification and regression tasks.
It uses a decision tree as a predictive model to go from observations about an item, represented by the branches, to conclusions about the item's target value, represented by the leaves.
Soon, you'll learn how single decision trees provide a foundation for more advanced approaches to all kinds of data work.
Then, you'll move on to ensemble learning techniques, which enable you to use multiple decision trees simultaneously in order to produce very powerful models.
In addition to learning how these new models work and their use cases, you'll be introduced to hyperparameter tuning.
Knowing how and when to adjust or tune a model can help a data professional significantly increase performance.
Together, we're going to build models that can be very impactful for tons of different business applications.
What you're about to learn can really make you stand out to employers in the industry. Let's get started.
In the world of supervised learning, there are tons of techniques that can help you make predictions.
One popular tool for classification and prediction is the decision tree.
It serves as a foundation for some of the most effective models used in industry today.
A decision tree is a flow chart like supervised classification model and a representation of various solutions that are available to solve a given problem based on the possible outcomes of related choices.
Like all supervised learning classification techniques, decision trees enable data professionals to make predictions about future events based on the information that is currently available.
They also have some very specific advantages in certain areas over other supervised learning models.
Decision trees require no assumptions regarding the distribution of underlying data.
Unlike the models we've covered previously, they can handle collinearity easily.
Additionally, preparing data to train a decision tree can be a much less complex process, requiring little pre-processing if any at all.
However, decision trees are not perfect. No model is.
Decision trees can be particularly susceptible to overfitting.
The model might get extremely good at predicting seen data, but as soon as new data is introduced, it may not work nearly as well.
This is something that you'll need to keep in mind while building these types of models.
A decision tree consists of nodes and edges.
The edges connect together the nodes, essentially directing from one node to the next along the tree.
Decisions are made at each node.
At each, a single feature of the data is considered and decided on.
By the end, any relevant features will have been resolved, resulting in the classification prediction.
Let's explore this a little further.
Here's a decision tree that will help you decide whether or not to go outside and play soccer or football on any given day.
The first decision that will be made relates to the weather outlook.
For this tree, there are three options, sunny, cloudy, or rainy.
This node where the first decision is made is called the root node.
It's the first node in the tree, and all decisions needed to make the prediction will stem from it.
It's a special type of decision node because it has no predecessors.
The nodes where a decision is made are decision nodes.
Decision nodes always point to a leaf node or other decision nodes within the tree.
So for our example, if it's supposed to be sunny or rainy, the tree will continue making more decisions to arrive at a final prediction.
However, if it's cloudy, the tree arrives at a prediction.
Soccer will be played.
This brings us to a leaf node.
Leaf nodes are where a final prediction is made.
The whole process ends here, so no further decisions are required after this point.
Now, view where the decision tree would have gone if the outlook had been sunny.
We're not at a leaf node yet. There are still decisions to be made.
This time, the consideration is about the humidity.
If the humidity is above 75%, the tree ends at a leaf node that says, don't play soccer.
However, if the humidity is below 75%, the decision tree will say, play soccer.
The nodes that are pointed to, whether leaf nodes or other decision nodes, are child nodes.
The node that is pointing to them is a parent node.
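One way to picture this tree is as a set of nested conditions. The following is a hand-written sketch of the example, not a trained model. The function name play_soccer is hypothetical, a humidity of exactly 75% is treated as play by assumption, and the rainy branch returns None because its follow-up decisions aren't spelled out here.

```python
def play_soccer(outlook, humidity=None):
    """Hand-written version of the example tree (not a trained model)."""
    # Root node: the first decision is about the weather outlook.
    if outlook == "cloudy":
        return "play"  # leaf node
    if outlook == "sunny":
        # Child decision node: humidity, with 75% as the threshold.
        # Treating exactly 75% as "play" is an assumption.
        return "don't play" if humidity > 75 else "play"
    if outlook == "rainy":
        # More decisions follow on this branch, but they aren't described.
        return None
    raise ValueError(f"unknown outlook: {outlook}")

print(play_soccer("cloudy"))
print(play_soccer("sunny", humidity=80))
print(play_soccer("sunny", humidity=60))
```

A real decision tree learns which features to split on, and where, from training data rather than having the conditions written by hand.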
The algorithm decides what and where variables are split based on what will provide the most predictive power.
So, for example, if 90% of the time that it rains, soccer is not played, this variable would be very predictive.
Splitting the data on outlook would give new groups, each of which has a clear majority of either play or don't play.
Now, you know the basics of decision trees.
This foundation will be helpful as you continue learning about tree-based modeling.
Coming up, you'll cover the aspects of building the tree and using your training data to develop the nodes and edges.
Then, you'll learn how to optimize tree-based models and what you as a data professional can do to maximize their capabilities.
Now that you've developed a solid understanding of classification models, it's time to build a model that's a bit more advanced.
In this video, you'll examine how to create a single standard decision tree in Python.
We're going to use a decision tree to approach the same business need as with the naive Bayes model from earlier, modeling customer bank churn.
The first thing, as always, is to import any necessary packages and libraries into our notebook.
You've experienced most of these before, but the packages for the decision tree itself are new.
Let's import DecisionTreeClassifier, the scikit-learn implementation of a single decision tree.
Additionally, import the plot_tree function to produce a visual of the decision tree after it's built.
We also import confusion_matrix and ConfusionMatrixDisplay to help us calculate and plot a confusion matrix for our model.
And lastly, we have our four evaluation metrics.
We're going to read in the original dataset as a pandas data frame as usual.
Remember, this is where you'd normally do exploratory data analysis or EDA.
Then you would use what you learned from EDA and what you know about the use case of your model to decide on an appropriate evaluation metric.
For our bank churn models, we're going to assume that a metric that balances precision and recall is best.
The metric that helps us achieve this balance is F1 score, which is defined as the harmonic mean of precision and recall.
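To make the harmonic mean concrete, here's a small sketch. The helper name `f1_from` and the precision and recall values are made up for illustration. Notice how the harmonic mean punishes a lopsided pair much more than a simple average would.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_from(0.5, 0.5)        # both metrics equal
lopsided = f1_from(0.9, 0.1)        # high precision, poor recall
arithmetic_mean = (0.9 + 0.1) / 2   # simple average, for comparison

print(balanced, lopsided, arithmetic_mean)
```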
Again, there are many metrics to choose from.
The important thing is that you make an informed decision that is based on your use case.
Now that we've decided on an evaluation metric, let's prepare the data for modeling.
Just as before, we'll drop the unpredictive features and the gender column so our model doesn't predict based on gender.
Then we'll dummy encode the geography column, creating boolean columns from the categorical column.
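Here's a minimal sketch of those two preparation steps with pandas. The rows and the column names (`Geography`, `Gender`, `Age`, `Exited`) are hypothetical stand-ins for the real dataset, and `drop_first=True` is a choice made for this sketch rather than something the lesson specifies.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has more columns and rows.
df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 50],
    "Exited": [1, 0, 0, 1],
})

# Drop the column the model should not use.
df = df.drop(columns=["Gender"])

# Dummy encode the categorical column.
# Assumption: drop_first=True removes one redundant category column.
encoded = pd.get_dummies(df, columns=["Geography"], drop_first=True)

print(encoded.columns.tolist())
```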
Our last preparation is to separate our target variable from the rest of the data and then split the data into training and test sets using the train test split function.
Don't forget to stratify based on the target.
The first thing we'll do is train a baseline decision tree model.
We won't tune it. It's just to give us scores that we can use as points of reference.
We do this by instantiating the classifier and setting the random state.
We'll assign it to a variable called decision tree.
Next, we'll fit it to the training data.
This grows a decision tree on our data. It all happens behind the scenes.
Finally, we'll use the predict method with the tree we just grew to make predictions on the X test data, assigning the results to a variable called dt_pred.
Now we can get the results by using the different evaluation metric functions we imported.
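The baseline steps look roughly like this. It's a sketch on synthetic data from `make_classification`, used here only as an imbalanced stand-in for the churn dataset, so the scores it prints won't match the ones in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (roughly 80% negatives).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Instantiate with a random state, fit to the training data, then predict.
decision_tree = DecisionTreeClassifier(random_state=0)
decision_tree.fit(X_train, y_train)
dt_pred = decision_tree.predict(X_test)

# The four evaluation metrics, computed on the held-out data.
scores = {
    "accuracy": accuracy_score(y_test, dt_pred),
    "precision": precision_score(y_test, dt_pred),
    "recall": recall_score(y_test, dt_pred),
    "f1": f1_score(y_test, dt_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```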
This model's F1 score is better than what we got from the naive Bayes model we built.
Let's inspect the confusion matrix of our decision tree's predictions.
First, we'll write a short helper function to help us display the matrix.
Notice from this confusion matrix that the model correctly predicts many true negatives.
This is to be expected because the data set is imbalanced in favor of negatives.
When the model makes an error, it appears slightly more likely to predict a false positive than a false negative, but it's generally balanced.
This is reflected in the precision and recall scores both being very close to each other.
Next, let's examine the splits of the tree.
We'll do this by using the plot tree function that we imported.
We pass to it our fitted model as well as some additional parameters.
Note that if we did not set max depth equals 2, the function would return a plot of the entire tree, all the way down to the leaf nodes.
But we are most interested in the splits nearest to the root because these tell us the most predictive features.
Class names displays what the majority class of each node is, and filled colors the nodes according to their majority class.
How do we read this plot?
The first line of information in each node is the feature and split point that the model identified as being the most predictive.
In other words, this is the question that's being asked at that split.
For our root node, the question was, is the customer less than or equal to 42 and a half years old?
At each node, if the answer to the question being asked is yes, the sample would move to the child node on the left.
If the answer is no, the sample would go to the child node on the right.
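If you'd like to read the same splits without drawing a figure, scikit-learn also offers `export_text`. Note that this swaps in a text rendering instead of the `plot_tree` figure from the video. The data here is synthetic and the feature names `f0` to `f3` are placeholders. In the output, the `<=` line of each split is the "yes" branch.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Limit the rendering to the splits nearest the root.
rules = export_text(tree, feature_names=feature_names, max_depth=2)
print(rules)
```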
Gini refers to the node's Gini impurity.
This is a way of measuring how pure a node is.
For a two-class problem like this one, the value can range from 0 to 0.5.
A Gini score of 0 means there's no impurity.
The node is a leaf, and all of its samples are of a single class.
A score of 0.5 means the classes are all equally represented in that node.
Samples is how many samples are in that node and value indicates how many of each class are in the node.
Returning to the root node, we have value equals 5,972 and 1,528.
Notice that these numbers sum to 7,500, which is the number of samples in the node.
This tells us that 5,972 customers in this node stayed and 1,528 customers churned.
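As a quick check, Gini impurity can be computed directly from the class counts in a node: it's one minus the sum of the squared class proportions. The helper name `gini_impurity` below is just for illustration. The root node counts come from the plot, and the other two calls show the two extremes for a two-class node.

```python
def gini_impurity(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


root = gini_impurity([5972, 1528])  # the root node's value counts
pure = gini_impurity([100, 0])      # every sample in one class
even = gini_impurity([50, 50])      # two classes equally represented

print(round(root, 4), pure, even)
```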
Lastly, we have class. This tells us the majority class of the samples in each node.
If we look at the top of the tree, this plot tells us that if we could only do a single split on a single variable,
the one that would most help us predict whether a customer would churn is their age.
If we look at the nodes at depth 1, we notice that the number of products and whether or not the customer is an active member
are also both strong predictors of whether or not they will churn.
This is a good indication that it might be worthwhile to return to your EDA and examine these features more closely.
Now that you have a basic understanding of how tree-based modeling works in Python,
you have two more techniques to learn before moving on to some more powerful optimization techniques:
hyperparameter tuning and cross-validation.
Using these, we can optimize single decision trees even further and you'll use those concepts to supercharge the models you'll learn later on.
Meet you again soon.
Recently, you've been exploring how to build a decision tree classifier model.
For many of the models you've worked with throughout this course, you used evaluation metrics,
such as F1 score, to gauge their performance.
But throughout this section of the course, you'll be taking an extra step to gain some additional performance increases from your models.
A very popular and widely used technique to improve performance after creation is known as hyperparameter tuning.
Hyperparameters are parameters that can be set before the model is trained.
They can be tuned to improve model performance, directly affecting how the model is fit to the data.
Hyperparameter tuning is the process of adjusting the hyperparameters to find the best values that will result in the most optimal model.
Just like a musician tuning the strings on their guitar, the idea is to achieve balance and a beautiful result.
For tree-based modeling, there are many hyperparameters that can be tuned, and they can have a big impact on the model itself.
You've actually already used one previously in this course when you worked with K-means.
As you'll recall, when building a K-means model, you set the value of K to produce different cluster results.
When you changed its value, you performed hyperparameter tuning.
One of the more basic hyperparameters for a decision tree is called max depth.
Setting this hyperparameter defines a limit on how deep a decision tree can get.
The depth of a decision tree is the number of levels between the root node and the farthest node from the root node with the root node itself being level zero.
Consider our previous example.
This tree has three levels.
The root node is level zero.
The nodes in the middle are level one, and the leaf nodes all the way at the bottom are level two.
So this tree has a depth of two.
However, this decision tree has a depth of four.
Even though this decision tree isn't as filled out as the previous example,
what matters is the distance of the farthest node from the root,
regardless of whether it's a leaf node or a decision node.
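Here's one way to sketch that definition of depth in code. The nested dictionary is a hypothetical tree made up for illustration, with one long branch and several short ones. Only the farthest node from the root determines the depth.

```python
def depth(node):
    """Levels below this node; a node with no children has depth 0."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)


# Hypothetical tree: sparse, but with one branch reaching level 4.
sparse_tree = {"children": [
    {},  # leaf at level 1
    {"children": [
        {"children": [
            {"children": [{}, {}]},  # node at level 3, leaves at level 4
            {},
        ]},
        {},
    ]},
]}

print(depth(sparse_tree))
```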
This leads us back to max depth.
When working with very large datasets, you could potentially create massive trees that are very deep.
But this isn't necessarily what you want for your model.
So, setting a value for max depth can help reduce overfitting problems by limiting how deep the tree will go.
Additionally, it can reduce the computational complexity of training and using the model in the first place.
For example, if you're finding that a decision tree has the same performance with a depth of 10 versus a depth of 100,
you can set max depth to 10 and achieve the desired performance more quickly.
Another very commonly used hyperparameter is called min samples leaf.
This hyperparameter defines the minimum number of samples that must be contained in a leaf node.
It means that a split will only happen if there are enough samples in each of the result nodes to satisfy the required value.
For example, maybe part of the way down your tree, there's a decision node that currently has 10 samples.
However, the min samples leaf hyperparameter is set to 6.
There would be no way to split the data so that each leaf node has 6 samples and therefore no further split can take place.
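The arithmetic behind that example can be checked with a few lines of Python. The helper name `valid_split_sizes` is hypothetical. It lists every way to divide a node's samples into two children where both children meet the minimum leaf size.

```python
def valid_split_sizes(n_samples, min_samples_leaf):
    """List (left, right) child sizes that satisfy the minimum leaf size."""
    return [
        (left, n_samples - left)
        for left in range(1, n_samples)
        if left >= min_samples_leaf and n_samples - left >= min_samples_leaf
    ]


# 10 samples with a minimum of 6 per leaf: no split is possible.
print(valid_split_sizes(10, 6))
# Lowering the minimum to 5 allows exactly one split, 5 and 5.
print(valid_split_sizes(10, 5))
```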
There are other hyperparameters for decision trees that you'll learn, but first, let's explore finding the optimal values for the parameters.
And here's where something called grid search is useful.
Grid search is a tool to confirm that a model achieves its intended purpose by systematically checking every combination of hyperparameters to identify which set produces the best results based on the selected metric.
So, at the end, you'll have values that produce optimal results for your model.
When performing a grid search, the first step is to specify which parameters you want to tune and the set of values that you want to search over.
For example, maybe we want to tune max depth and min samples leaf.
We would define potential values for each of these.
For max depth, we could check depths of 4, 8, 12, 20, and 30.
For min samples leaf, we could try 10, 50, and 100.
The algorithm will check every combination of values to see which pair has the best evaluation metrics.
It would first check max depth of 4 with min samples leaf of 10, then 50, then 100.
The algorithm would then check max depth of 8 with min samples leaf of 10, then 50, then 100.
This continues until every combination has been analyzed.
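The order in which those pairs get checked can be sketched with `itertools.product`. This only enumerates the combinations from the example. It doesn't train anything, and it isn't how any particular library implements the search internally.

```python
from itertools import product

# The candidate values from the example above.
param_grid = {
    "max_depth": [4, 8, 12, 20, 30],
    "min_samples_leaf": [10, 50, 100],
}

# Every (max_depth, min_samples_leaf) pair, in the order described.
combinations = list(product(param_grid["max_depth"], param_grid["min_samples_leaf"]))

print(len(combinations))
print(combinations[:4])
```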
Remember, you can try any values and any number of values during grid search if you believe the benefits are worth the cost of your computing time.
Coming up, we'll put into practice many of the tree-based modeling concepts you've learned so far.
All the way from constructing the tree to optimizing it and using it to make some classification predictions.
Looking forward to it!
You now know about many of the tools involved with building some pretty powerful models.
Now in this video, you'll explore model validation.
Model validation is the set of processes and activities intended to verify that models are performing as expected.
This is achieved with a validation dataset, which is a sample of data that's held back during training.
The validation dataset is instead used to give an unbiased estimate of the skill of the final tuned model.
Note that validation data is different from test data. The test data must remain unseen until the very end of the process.
In some of the models we've built so far, we haven't been doing this.
But that's okay. We were just using those models to understand the process of building and evaluation.
Previously, before training the model, we took our data and split it into two sets, one training set and one testing set.
These sets were used to train and test the model, respectively.
With validation, the data is actually split into three sets.
The first two are training and testing sets as before, but now there's an additional validation set.
This validation set is used instead of the test set to evaluate the model, leaving the test set untouched.
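One common way to get three sets is to call `train_test_split` twice. In this sketch the data is a toy placeholder, and the 60/20/20 proportions are an assumption chosen for illustration, not a rule from the lesson.

```python
from sklearn.model_selection import train_test_split

# Toy placeholder data: 100 observations, balanced binary target.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First hold out the test set, which stays untouched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Then carve a validation set out of what remains.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```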
In addition, another popular method is cross validation.
Cross validation is a process that uses different portions of the data to test and train a model across several iterations.
It works like validation, but with a slight twist.
Instead of having one validation set to evaluate the model, the training data is split into multiple sections, known as folds.
Then, the model is trained on different combinations of these folds.
For example, perhaps we want five folds.
First, the data would be split into the training and test data.
Then, the training data would be split into the five folds.
The first model iteration will train with folds one, two, three and four, using the fifth fold to get metrics for the model.
The next will train with folds one, two, three and five, using the fourth fold to get metrics.
This process repeats until every fold has been used once to get metrics, and the evaluation metrics are averaged to get final validation scores.
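To see how the folds rotate, here's a small sketch in plain Python. The function name `k_fold_splits` is hypothetical, the ten index values stand in for training observations, and the sketch assumes the number of observations divides evenly by the number of folds. A real workflow would fit a model on each training portion and score it on the held-out fold.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices), holding each fold out once."""
    indices = list(range(n_samples))
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Hold out the last fold first, matching the order described above.
    for held_out in reversed(range(k)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


splits = list(k_fold_splits(10, 5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```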
Which validation technique you choose mainly depends on the data set you're working with.
Cross validation is particularly useful when working with smaller data sets, as it maximizes the utility of the data available, more so than standard validation.
However, cross validation is not necessary when working with very large data sets.
There's so much data that maximizing the utility is not required, and actually can be problematic depending on the computing resources at your disposal.
However, if limited computing resources or constraints in the data are not issues,
then cross validation is almost always applied.
Validation schemes are essential to building and selecting effective models.
Data professionals working on these types of business projects are responsible for determining the best scheme to use.
This comes with experience, along with an understanding of the data and the available tools.
You're well on your way to developing these important skills.
Hello, and welcome back. In this video, we'll be taking the foundation we built when creating a decision tree classifier in Python and expanding on it to fine tune our models.
If you recall, hyperparameter tuning involves changing parameters that directly affect how the model trains before the learning process begins.
Different models have different types of hyperparameters that are available for you to adjust.
You've learned about two that apply to tree based models, max depth and min samples leaf.
As a reminder, max depth defines how deep a decision tree can get, and min samples leaf defines the minimum number of samples for a leaf node.
These are the hyperparameters that will be tuned on a single decision tree.
When originally exploring hyperparameter tuning, we also considered the steps for finding the optimal values for hyperparameters.
Randomly entering values won't produce the best results, which is why we use grid search.
As a refresher, grid search specifies a series of values for each hyperparameter to be tuned.
It systematically checks every combination of those values to determine which set produces the best results based on the selected metric.
Think of it as brute forcing the different hyperparameter values.
Imagine forgetting your PIN and trying every single number between 0000 and 9999.
Sure, it would take time, but eventually you'd find it.
This is how grid search works.
Okay, now let's get into the code. You'll work within the same framework as the other classification models you've created.
We'll begin where we left off in the decision tree notebook.
Remember, we've already performed feature engineering, and the data has already been split into X and Y data, as well as training and test sets.
But now, we're going to add a new function.
GridSearchCV is imported from the model_selection module of scikit-learn, enabling the hyperparameters to be tuned.
The CV in GridSearchCV stands for cross validation.
Each time a set of hyperparameters is used, it's scored against a validation set, keeping the test data unseen.
You'll use the validation scores when comparing models moving forward.
Note that you won't be comparing this tuned decision tree to the existing models.
All the other models were scored and compared using test data, which is actually an improper practice.
When data professionals perform model selection in the workplace, the test data must always remain unseen to the models being worked on.
That data is only used at the very end of the model development process.
As mentioned before, the parameters you'll tune will be max depth and min samples leaf.
A dictionary is defined, where the key is the name of the hyperparameter, and the value is a list of numbers that will be tried as that hyperparameter.
While the grid search is based on F1 score, you still want to find out what the other scores are.
So, create a set called scoring with the names of each of the desired metrics.
Next, create an instance of a decision tree classifier named tuned_decision_tree.
GridSearchCV is then called.
As arguments, pass in the decision tree classifier object, the parameters, the scoring methods, the number of cross validation folds, and specify the metric the search will focus on.
Finally, fit the model to the data.
Check which hyperparameters the grid search identified.
By getting the best estimator attribute from the grid search object, you can observe the values it found.
So, a max depth of 8 and a min samples leaf of 20 were best when using the F1 score as a measure.
Getting the best score attribute confirms the best average F1 score across the different folds among all the combinations of hyperparameters.
Note that this model achieved a score of about 0.5607.
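Put together, the tuning steps look roughly like this. It's a sketch on synthetic data, so the best hyperparameters and score it prints won't match the ones from the notebook. The candidate values in the grid are placeholders, and the metric names are passed as a list, which scikit-learn accepts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hyperparameter name -> list of values to try (placeholder values).
tree_params = {"max_depth": [4, 8, 12], "min_samples_leaf": [5, 20, 50]}
scoring = ["accuracy", "precision", "recall", "f1"]

tuned_decision_tree = DecisionTreeClassifier(random_state=0)
clf = GridSearchCV(tuned_decision_tree, tree_params, scoring=scoring, cv=5, refit="f1")

# Only the training data is used; the test data stays unseen.
clf.fit(X_train, y_train)

print(clf.best_estimator_)
print(clf.best_params_)
print(round(clf.best_score_, 4))  # best mean F1 across the validation folds
```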
This final code block is a helper function to extract scores for the model.
It produces a data frame that has the name of the model along with the four scores you've been using.
Call this function at the very end and save the resulting data frame as a CSV file for later use.
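A helper along these lines might look like the sketch below. The function name `make_results`, the model label, and the column headings are assumptions, not the notebook's actual code. It reads the mean validation scores for the best combination from the grid search object's `cv_results_`. Here `to_csv` is called without a path so it returns text. Pass a filename instead to write a file.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic grid search so the helper has something to summarize.
X, y = make_classification(n_samples=400, n_features=6, weights=[0.8], random_state=0)
clf = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [4, 8], "min_samples_leaf": [5, 20]},
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=5,
    refit="f1",
)
clf.fit(X, y)


def make_results(model_name, grid_search):
    """One-row DataFrame of mean validation scores for the best combination."""
    cv_results = pd.DataFrame(grid_search.cv_results_)
    best_row = cv_results.iloc[grid_search.best_index_]
    return pd.DataFrame({
        "Model": [model_name],
        "F1": [best_row["mean_test_f1"]],
        "Recall": [best_row["mean_test_recall"]],
        "Precision": [best_row["mean_test_precision"]],
        "Accuracy": [best_row["mean_test_accuracy"]],
    })


results = make_results("Tuned Decision Tree", clf)
csv_text = results.to_csv(index=False)  # pass a filename to write to disk
print(results)
```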
Right now, there's nothing to compare this score with.
However, note the model has an F1 score of 0.5605.
And soon, you'll go on to create other more advanced tree-based models and find scores to compare with this one.
With those numbers, you'll be able to determine which model is not only the best performer, but also the best in the context of the business needs.
At this point, you've learned that decision trees are useful because they're easy to understand and interpret, flexible with regard to the data they use, and highly versatile.
Decision trees can be good predictors, but you also know that they're prone to overfitting and they're very sensitive to variations in the training data.
How do we solve these problems?
The answer is by using the wisdom of the crowd.
Perhaps you're familiar with this concept because it can apply to everyday situations as well.
If I have a jar filled with jelly beans and I ask a spatial math expert to examine it and guess how many jelly beans there are,
their estimate will typically be less accurate than if I ask a thousand ordinary people to do the same thing and then take the average of their guesses.
We can apply the same concept to modeling, using a process called ensemble learning or simply ensembling.
Ensemble learning involves building multiple models and then aggregating their outputs to make a final prediction.
Just like in our jelly bean example, predictions using an ensemble of models can be very accurate even when the individual models themselves are barely more accurate than a random guess.
A best practice when building an ensemble is to use very different methodologies for each model it contains, such as a logistic regression, a naive Bayes model, and a decision tree classifier.
This way, when the models make errors, and they always will, the errors will be uncorrelated.
The goal is for them to not all make the same errors for the same reasons.
You could build an ensemble using the three models I just mentioned.
You'd train each model on the same data, then use each model's individual predictions to make a final prediction.
Say by taking the majority vote if it's a classification task or averaging the results if it's a regression task.
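The two aggregation rules can be sketched with Python's `statistics` module. The three predictions in each list are made up for illustration. They stand in for the outputs of three different models on a single observation.

```python
from statistics import mean, mode

# Hypothetical predictions from three different models for one observation.
class_votes = ["churn", "stay", "churn"]   # classification task
regression_outputs = [10.0, 12.0, 14.0]    # regression task

majority_vote = mode(class_votes)          # most common class wins
averaged = mean(regression_outputs)        # average of the predictions

print(majority_vote, averaged)
```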
But there's another way to build an ensemble, a way that uses the same methodology for every contributing model.
In this kind of ensemble, each individual model that comprises it is called a base learner.
For this method to work, you usually need a lot of base learners and each is trained on a unique random subset of the training data.
If the base learners were all trained on the exact same data, there would be too much correlation between the errors.
This would affect the strength of the base learners.
And if a base learner's prediction is only slightly better than a random guess, it becomes a weak learner.
So, to address this, data professionals do something called bagging in order to ensure random subsets of the data and strong learners.
The word bagging comes from bootstrap aggregating.
Let's break this down.
Remember from statistics that bootstrapping refers to sampling with replacement.
That's what happens during bagging, too.
Each base learner samples from the data with replacement.
For bagging, this means the various base learners can all sample the same observation, and a single learner can sample that observation multiple times during training.
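Sampling with replacement is easy to try with Python's `random.choices`. In this sketch, the ten integers are placeholder observations and three samples stand in for three base learners. In each sample, some observations typically appear more than once while others are missing.

```python
import random

rng = random.Random(0)          # seeded so the sketch is repeatable
observations = list(range(10))  # placeholder training observations

# Each base learner draws its own sample, the same size as the original data,
# with replacement.
samples = [rng.choices(observations, k=len(observations)) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    missing = sorted(set(observations) - set(sample))
    print(f"learner {i}: sample={sorted(sample)} missing={missing}")
```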
The aggregation part of bagging refers to the fact that the predictions of all the individual models are aggregated to produce a final prediction.
For regression models, this is typically the average of all the predictions.
For classification models, it's often whichever class receives the most predictions, which is the mode.
When bagging is used with decision trees, we get a random forest.
A random forest is an ensemble of decision tree base learners that are trained on bootstrap data.
The base learners' predictions are all aggregated to determine a final prediction.
Random forest takes the randomization from bagging one step further.
A regular decision tree model will seek the best feature to use to split a node.
A random forest model will grow each of its trees by taking a random subset of the available features in the training data,
and then splitting each node at the best feature available to that tree.
This means that each base learner in a random forest model has different combinations of features available to it,
which helps to prevent the problem of correlated errors between learners in the ensemble.
Each individual base learner is a decision tree.
It may be fully grown, so each leaf is a single observation, or it may be very shallow, depending on how you choose to tune your model.
Ensembling many base learners helps reduce the high variance that you typically get from a single decision tree.
Ensembling is powerful because it combines the results of many models to help make more reliable final predictions.
Plus, these predictions have less bias and lower variance than other standalone models.
Coming up, we'll explore random forests in more detail. Lots to come!
Now that you've been introduced to random forests, let's examine a little more closely what they are and how they function.
It's important to understand this methodology because it's commonly used in data work, and many of its component steps are used by other more advanced modeling strategies.
As a refresher, a random forest is an ensemble of decision trees whose predictions are all aggregated to determine a final prediction.
Each tree in a random forest model uses bootstrapping to randomly sample the observations in the training data with replacement.
Remember, this means that any tree in the model can use the same observation, and the same observation can be sampled more than once by the same tree.
Bootstrapping is a critical component of random forest models.
It ensures that every base learner in the ensemble is trained on different data, while allowing each learner to train on a dataset that's the same size as the original training data.
Because there are duplicated observations in each tree's training data, each tree will be missing some of the observations from the original training dataset.
One more important principle of random forest models is that all trees in the ensemble are trained on a random subset of the available features in the dataset.
No single tree sees all the features.
Again, this is to introduce another element of randomness and ensure that each tree is as different from the others as possible.
You learned that one of the main weaknesses of decision trees is that they are very sensitive to new data, so they're prone to overfitting.
Therefore, randomizing both the data and the features used by each base learner means that no single tree can overfit all the data.
This is because no single tree sees all the data.
In fact, the trees underfit the data. They are high bias, but together they can be very powerful predictors that are more stable than a regular single decision tree.
In addition, random forests are very scalable.
All the base learners they rely on can be trained in parallel using different processing units, even across many different servers.
Finally, just like decision trees, random forest models need to be tuned to find the combination of hyperparameter settings that results in the best predictions.
After all, random forests are made up of many decision trees and data professionals want them all to be as effective as possible.
Hey, welcome back. In this video, we'll build on your understanding of how decision trees grow. This will be the basis on which we'll tune a random forest model.
You've learned that random forests make predictions by sampling features and observations to grow trees.
With decision trees, splits are decided by which variables and which cutoff values offer the most predictive power.
Now, let's consider that decision trees continue to split until one of a certain set of conditions is met.
The first condition has to do with the observations that a leaf contains.
When all of the observations belong to the same class, this means the leaf node is pure.
The second condition affecting where a tree splits is whether the minimum leaf size or maximum depth is reached.
Also, a decision tree may stop growing if it achieves a certain performance threshold.
The value and metric for this threshold can both be specified by the modeler.
You'll recall that settings such as these are known as hyperparameters and they can be tuned to improve model performance,
directly affecting how the model is fit to the data.
We demonstrated that one of the most important hyperparameters in a decision tree is its max depth.
This specifies how many levels the tree can have and ultimately determines how many splits it can make.
Remember, every time a node splits, your data gets separated into smaller subsets.
The model is drawing another decision boundary.
We also introduced you to min samples leaf, which defines the minimum number of samples for a leaf node.
With min samples leaf, a split can only occur if it guarantees a minimum number of observations in the resulting nodes.
Now, a new concept: min samples split can be used to control the threshold below which nodes become leaves.
Random forest models have these same hyperparameters because they are ensembles of decision trees.
These hyperparameters control how the learner trees are grown.
But random forests also have some other hyperparameters, which control the ensemble itself.
This first hyperparameter controls the randomness of the trees, and it's called max features.
This setting specifies the number of features that each tree selects randomly from the training data to determine its splits.
For example, if you have a dataset with features A, B, C, D, and E,
and you build a random forest model with max features set to 3, your first tree might use features A, C, and E to determine its splits,
and the next tree might use features B, D, and E, and so on.
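The random feature selection from that example can be sketched with `random.sample`. This follows the per-tree description given here. One caveat: in scikit-learn, `max_features` is the number of features considered when looking for the best split, so the random draw is repeated at each split rather than once per tree.

```python
import random

rng = random.Random(0)  # seeded so the sketch is repeatable
features = ["A", "B", "C", "D", "E"]
max_features = 3

# Draw a random subset of features for each of three trees.
subsets = [sorted(rng.sample(features, max_features)) for _ in range(3)]

for tree_number, subset in enumerate(subsets, start=1):
    print(f"tree {tree_number}: {subset}")
```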
The second hyperparameter, number of estimators, controls how many decision trees your model will build for its ensemble.
For example, if you set your number of estimators to 300, your model will train 300 individual trees.
If you're building regression trees, then the model's final prediction would average the predictions of all 300 trees.
If you're building classification trees, the final prediction would be determined by whichever class received the majority vote from the 300 individual trees.
For random forest models, performance will typically increase as trees are added to the ensemble, but only to a certain point.
After this point, improvement will level off, and adding new trees will only increase your computing time.
This happens because the new trees will become very similar to existing trees, so they won't contribute anything new to the model.
As a final point, many data professionals build models without hand-setting each hyperparameter.
In fact, when using scikit-learn, the model might perform well without setting any hyperparameters at all.
That's because it has effective default settings.
And remember to make the most of grid search to help you iterate.
Data professionals know how to experiment with combinations of hyperparameters in order to build the model that makes the very best predictions.
Now that you're familiar with the logic behind random forests and some of its most important hyperparameters, you're ready to build a model.
In this video, we'll build a random forest model that uses grid search to cross-validate and tune the hyperparameters.
Let's open up a Jupyter notebook and get started.
Recall that in this scenario, we're trying to predict which customers will close their bank accounts.
Import numpy and pandas, matplotlib, GridSearchCV, and train_test_split.
Then, all the evaluation metrics.
Also, import RandomForestClassifier.
Note that we're using the classifier because we're trying to solve a classification problem,
but we could also import RandomForestRegressor if we were predicting on continuous data.
Remember that we've prepared our data by dropping the row number, customer ID, and surname columns because they don't have predictive value.
We've also dropped the gender column because we don't want the model to predict on the basis of gender.
Then, we dummy encoded our categorical variables to prepare for modeling.
The last step before modeling is to split the data into the training and test sets using train test split.
We're going to compare our model's F1 score to what we got from our naive Bayes and decision tree models.
So it's important to split the data in the same exact way to ensure that all models train and test on the same data.
Therefore, we'll make sure that our test data is 25% of all the data.
We'll also stratify based on our target column and set the random seed to 42, just as we did previously.
This splits our X and Y data into X train, X test, Y train, and Y test.
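Here's a sketch of that split on toy data, so you can check that stratifying keeps the class proportions the same in both sets. The 200 placeholder rows with 20% positives are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy placeholder data: 200 rows, 20% positives.
X = [[i] for i in range(200)]
y = [1 if i % 5 == 0 else 0 for i in range(200)]

# 25% test data, stratified on the target, with the random seed set to 42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(Counter(y_train), Counter(y_test))
```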
Okay, we're ready to model.
Let's find out what happens if we tune our model using cross validation.
One thing you'll notice when we build ensemble models is that the training time will usually be much longer than what you've experienced so far.
That's because, instead of building a single tree, we're now building from 75 up to 150 trees for each combination of hyperparameters we specify.
It's useful to know how long it takes a model to train.
You can get a cells run time by entering percent percent time at the top.
This is called a magic command.
Magic commands, often just called magics, are commands that are built into IPython to simplify common tasks.
They always begin with either % or %%.
Now, define the hyperparameter grid.
Tune five hyperparameters:
max_depth, min_samples_leaf, min_samples_split, max_features, and n_estimators.
Notice that for max_depth, None is included.
This means that one of the options allows the trees to grow without a specific limit on their depth.
Next, instantiate our classifier and assign it a random state for reproducibility.
And specify the metrics the model will capture.
Finally, instantiate the grid search object.
It has two positional arguments, the classifier and the parameter grid.
Tell it to use the scoring metrics specified above and set cv to 5.
This means the model will be cross-validated using five folds.
Lastly, specify refit='f1'.
This is necessary when we've given multiple scoring metrics because it tells grid search that even though we want to check a few different metrics,
the one we care most about is the F1 score.
As a quick refresher, F1 score is a combination of precision and recall,
combining the two into a single metric.
In this instance, when we call the best estimator, it's the one with the highest average F1 score across all five validation folds.
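The setup described above could be sketched roughly as follows. The grid values, the random_state of 0, and the synthetic data are all assumptions chosen so the search finishes in seconds. They are not the values used in the original notebook. The sketch also runs the fit so that it is complete on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the bank data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative grid, kept small on purpose. None lets trees grow
# without a specific limit on their depth.
cv_params = {
    "max_depth": [2, None],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2],
    "max_features": ["sqrt"],
    "n_estimators": [25, 50],
}

rf = RandomForestClassifier(random_state=0)

# Metrics recorded for every hyperparameter combination.
scoring = ["accuracy", "precision", "recall", "f1"]

# refit="f1" tells GridSearchCV which metric selects the best estimator.
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit="f1")
rf_cv.fit(X_train, y_train)

print(rf_cv.best_params_)
```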
Now, fit the model to the training data.
Depending on the processing power available, the number of hyperparameter combinations specified in the grid search,
the size of the dataset, and the number of folds used to cross-validate,
this could take a long time.
In this example scenario, the time magic tells us it took about 20 minutes to fit.
There's always a trade-off between searching over a large hyper parameter space and good runtime.
The more hyperparameter values you search, the better your model is likely to be, but the longer it will take to fit.
When models take a long time to fit, you don't want to have to run them again.
If your kernel disconnects or you shut down the notebook and lose the cell's output, you'll have to refit the model,
which can be frustrating and time consuming.
The good news is that there's a method that enables you to save the fit model object to a specified location,
and then quickly read it back in.
And in the next video, we'll discover how that works.
Meet you there.
In the last video, we began creating a random forest model that used grid search to cross validate and tune hyper parameters.
Now, we'll build on that by using a separate validation data set to validate a model.
But first, recall where we left off.
We discovered a common issue in the data realm, the trade-off between searching over a large hyper parameter space and a good runtime.
As we observed, the more hyper parameters searched, the better the model, but the longer it takes to fit.
When models take a long time to fit, it's inefficient to have to keep running and refitting them again.
Once you find a model you're happy with, you don't want to start from scratch every time you open your notebook.
And that's where Pickling comes in.
Pickle is a tool that saves the fit model object to a specified location, and then quickly reads it back in.
It also allows you to use models that were fit somewhere else without having to train them yourself.
So let's pick up where we left off and pickle the model.
First, specify a file path to the directory where the model will be saved.
Then, create a with open statement, passing to it the file path, plus the name you want to use to save this model followed by .pickle.
This creates an empty pickle file.
The second argument, 'wb', opens the file for writing in binary mode, which is how pickling works.
Use as to assign the return value of open to a local variable named to_write.
Call pickle.dump and pass it the fit model object, then the to_write variable.
In the next cell, read back in the Pickled model from where it's saved.
The only difference in syntax is using 'rb' to specify that we'll be reading binary and using pickle.load to assign a new variable, which points to the fit model.
Make sure you call this new variable by the same name you used for your fit model above.
In this case, rf_cv.
If you comment out the line of code where you fit the model and the cell where you pickle the model, you can close the notebook, reopen it, and rerun all the cells without having to wait for the model to fit.
You can also send the pre-fit model to other people to use.
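Here is a minimal sketch of the write and read steps just described. A small random forest stands in for the fit grid search object. The temporary directory, the file name rf_cv_model.pickle, and the variable name to_read are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A small fitted model stands in for the fit grid search object.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rf_cv = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
original_preds = rf_cv.predict(X)

# Directory where the model will be saved (a temporary folder here).
path = Path(tempfile.mkdtemp())

# "wb" opens the file for writing in binary mode.
with open(path / "rf_cv_model.pickle", "wb") as to_write:
    pickle.dump(rf_cv, to_write)

# "rb" opens the file for reading in binary mode.
with open(path / "rf_cv_model.pickle", "rb") as to_read:
    rf_cv = pickle.load(to_read)

# The reloaded model predicts exactly as the original did.
print((rf_cv.predict(X) == original_preds).all())
```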
Now, use the best_params_ attribute to identify the hyperparameter values of the model that had the best average F1 score across all the cross-validation folds.
To find the average F1 score of the best model, use the best_score_ attribute.
Then, use the make_results function to generate a table of all the results and concatenate that with the overall table to compare all the models.
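The make_results helper comes from an earlier notebook and isn't shown here, so the version below is a hypothetical reconstruction. It pulls the mean cross-validated scores for the best-F1 combination out of cv_results_. The column names and the tiny grid are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def make_results(model_name, model_object):
    """Return a one-row table of mean CV scores for the best-F1 model.

    Hypothetical reconstruction; the original helper is not shown here.
    """
    cv_results = pd.DataFrame(model_object.cv_results_)
    # Row for the hyperparameter combination with the best mean F1 score.
    best = cv_results.loc[cv_results["mean_test_f1"].idxmax()]
    return pd.DataFrame({
        "Model": [model_name],
        "F1": [best["mean_test_f1"]],
        "Recall": [best["mean_test_recall"]],
        "Precision": [best["mean_test_precision"]],
        "Accuracy": [best["mean_test_accuracy"]],
    })


# Small illustrative search on synthetic data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rf_cv = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [2, 4]},
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=5,
    refit="f1",
).fit(X, y)

print(rf_cv.best_params_)  # hyperparameters of the best-F1 model
print(rf_cv.best_score_)   # its mean F1 across the validation folds

rf_cv_results = make_results("Random Forest CV", rf_cv)
# Tables from earlier models would be added to this list before concatenating.
results = pd.concat([rf_cv_results], ignore_index=True)
print(results)
```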
Interesting.
The cross validated random forest model has an average F1 score of 0.58 across all five validation folds, which is a little better than the single tuned decision tree.
It also has better recall, precision, and accuracy.
Nice.
Okay.
Now, let's use a separate validation dataset to validate the model.
To do this, split the training data into a new training set and validation set.
Use train_test_split.
Remember to stratify the y data.
Use an 80/20 split.
Don't forget, this is only splitting the training data, which itself is 75% of all data.
This means that our new training set will be 80% of 75% of the data.
And the new validation set will be 20% of 75% of the data.
The test data remains untouched.
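The arithmetic works out to 60% training, 15% validation, and 25% test. Here is a short sketch on synthetic data. The names X_tr, X_val, y_tr, and y_val, and the second random_state, are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# First split: 75% training, 25% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Second split: 80/20 on the training data only (the seed here is arbitrary).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=10
)

n = len(X)
print(len(X_tr) / n, len(X_val) / n, len(X_test) / n)  # 0.6 0.15 0.25
```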
This next part is a little tricky.
GridSearchCV wants to cross-validate the data.
In fact, if the CV parameter was left blank, it would split the data into five folds for cross validation by default.
Because you're using a separate validation set, it's important to explicitly tell the function how to perform the validation.
This means telling it, for every row, whether that row belongs to the training set or the validation set.
Use a list comprehension to generate a list of the same length as our X_train data,
where each value is either a negative one or a zero.
Use this list to indicate to GridSearchCV that each row labeled negative one is in the training set,
and each row labeled as zero is in the validation set.
Call this list split_index.
Now, import a new class called PredefinedSplit from scikit-learn's model_selection module.
PredefinedSplit provides train/test indices to split data into training and testing sets using a predefined scheme.
The next step is almost identical to what you did before.
Search over all the same hyper parameters and keep the syntax the same as when cross validating.
But now, pass the new split_index list to PredefinedSplit and assign it to a variable.
We'll call the variable custom_split.
Finally, set the cv parameter of the grid search equal to custom_split.
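Put together, the steps above might look like this sketch. It uses pandas objects so that rows can be matched by index. The synthetic data, the tiny grid, and the name rf_val are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

X_arr, y_arr = make_classification(n_samples=400, n_features=6, random_state=0)
X = pd.DataFrame(X_arr)
y = pd.Series(y_arr)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=10
)

# One label per row of X_train: 0 marks a validation row, -1 a training row.
val_rows = set(X_val.index)
split_index = [0 if idx in val_rows else -1 for idx in X_train.index]

custom_split = PredefinedSplit(split_index)

rf_val = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [2, None]},
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=custom_split,
    refit="f1",
)
rf_val.fit(X_train, y_train)

print(custom_split.get_n_splits())  # 1
```

Note that the grid search is still fit on X_train and y_train. The split_index list is what tells it which of those rows to hold out for validation.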
Now it's time to fit the model.
Use the %%time magic to get the time it takes the model to train.
Wow, in this example scenario, the model took only about four minutes to train.
During cross validation, the training data was divided into five folds.
An ensemble of trees was grown with a particular combination of hyper parameters on four folds of the data,
validating it against the fifth fold that was held out.
This whole process happened for each of five holdout folds.
Then, another ensemble was trained with the next combination of hyper parameters, repeating the whole process.
This continued until there were no more combinations of hyper parameters to run.
But now, there's a separate holdout set for validation.
An ensemble is built for each combination of hyper parameters.
Each ensemble is trained on the new training set and validated on the validation set.
But this only happens one time for each combination of hyper parameters.
Instead of five times with cross validation.
That's why the training time was only a fifth as long.
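To make the one-fifth figure concrete, here is a small count of model fits under each approach. The grid below is illustrative, since the exact values searched aren't listed here. The final refit on the full training data is left out of the count.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid; the actual values searched are not given here.
cv_params = {
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 2],
    "n_estimators": [75, 100, 150],
}
n_combinations = len(ParameterGrid(cv_params))  # 3 * 2 * 3 = 18

fits_with_5_fold_cv = n_combinations * 5   # each combination is fit five times
fits_with_validation_set = n_combinations  # each combination is fit once

print(fits_with_5_fold_cv, fits_with_validation_set)  # 90 18
```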
All right.
Now, pickle the model again.
Run the cell where the pickle is written.
Then, go back and comment out the line of code that fits the model, as well as the call to write the pickle.
Remember to have a cell where the pickled model can be read back in.
Check the results.
When you call best_params_, notice that the ensemble with the best F1 score used slightly different hyperparameters than the cross-validated model.
Now, the F1 score is 0.576, better than the single decision tree model, but not as good as the cross validated model.
Both would likely produce similar results.
Just keep in mind that the cross validated model is a little more reliable because it was more rigorously validated.
In fact, if a different random seed had been used to create the validation set, it's possible that we might have gotten lucky and even had a model that performed a little better than our cross validated model.
But note that this doesn't mean it would be expected to perform better on the test data.
This notebook's purpose was to demonstrate the different processes involved in validating a random forest model using cross-validation and validation with a separate dataset.
In practice, it's unlikely that you'd do both.
Instead, it would be more effective to choose an approach based on time requirements, the amount of data, and the number of different hyperparameter combinations to be explored.
As always, the work that data professionals do requires thoughtful trade-offs and adaptability.
Random forest is one methodology for building tree based ensemble models.
Now, I'm going to introduce you to another related methodology called boosting.
Boosting is one of the most powerful modeling methodologies in the field.
It's used in nearly every industry that relies on predictive modeling.
Many winning models from Kaggle and other competitions use boosting.
It's an essential tool in any modeler's tool belt.
Boosting is a supervised learning technique where you build an ensemble of weak learners.
This is done sequentially with each consecutive base learner trying to correct the errors of the one before.
Remember, a weak learner is a model whose prediction is only slightly better than a random guess, and a base learner is any individual model in an ensemble.
This practice is similar to random forest and bagging.
Like random forest, boosting is an ensemble technique, and it also builds many weak learners then aggregates their predictions.
But there are some key differences.
Unlike random forest, which builds base learners in parallel, boosting builds learners sequentially.
This is because each new base learner in the sequence focuses on what the preceding learner got wrong.
Another difference from random forest is that for boosting models, the methodology you choose for the weak learner isn't limited to tree-based methods.
However, we will use tree-based implementations in this course because these are common and effective ways of building boosting models.
There are various different boosting methods available, but throughout this part of the program, we'll explore two of the most commonly used methodologies.
The first is called adaptive boosting, or AdaBoost.
AdaBoost is a tree-based boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner.
Here's a demonstration.
AdaBoost builds its first tree on training data that gives equal weight to each observation.
Then, the algorithm evaluates which observations were incorrectly predicted by this first tree.
It increases the weights for the observations that the first tree got wrong, and decreases the weights for those that it got right.
This process repeats until either a tree makes a perfect prediction, or the ensemble reaches the maximum number of trees, which is a hyper parameter that is specified by the data professional.
Once all the trees have been built, the ensemble makes predictions by aggregating the predictions of every model in the ensemble.
Because AdaBoost can be used for both classification and regression problems, this final step is a little different depending on which type is being addressed.
For classification, the ensemble uses a voting process that places weight on each vote.
Base learners that make more accurate predictions are weighted more heavily in the final aggregation.
For regression, the model calculates a weighted mean prediction for all the trees in the ensemble.
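The re-weighting step described above can be sketched for a single round. This follows one common formulation of discrete AdaBoost, with labels coded as -1 and +1. The toy data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): one feature, labels in {-1, +1}.
X = np.arange(10).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

# Every observation starts with equal weight.
weights = np.full(len(y), 1 / len(y))

# First weak learner: a one-split tree, often called a stump.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=weights)
wrong = stump.predict(X) != y

# Weighted error, and how much say this learner gets in the final vote.
error = weights[wrong].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weights of misclassified rows, lower the rest, then renormalize.
weights = weights * np.exp(np.where(wrong, alpha, -alpha))
weights = weights / weights.sum()

print(np.round(weights, 3))
```

The next tree would be fit with these updated weights, so it concentrates on the rows the first tree got wrong.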
There's one disadvantage to note about boosting.
You can't train your model in parallel across many different servers because each model in the ensemble is dependent on the one that preceded it.
This means that, in terms of computational efficiency, it doesn't scale well to very large datasets when compared to bagging.
However, this generally isn't a concern, unless you're working with particularly large datasets.
But there are many noteworthy advantages, including being one of the more accurate methodologies available today.
Also, just like random forest, the fact that it's based on an ensemble of weak learners means that the problem of high variance is reduced.
This is because no single tree weighs too heavily in the ensemble.
Boosting has a few key advantages.
First, unlike random forest, it reduces bias.
It's also easy to understand and doesn't require the data to be scaled or normalized.
Boosting can handle both numeric and categorical features, and it can still function well, even with multicollinearity among the features.
Plus, it's robust to outliers.
Now, note that resilience to outliers is a major strength of all tree-based methodologies.
This is because the model splits the data the same way regardless of how extreme a value is.
Here's an example.
Suppose you have six elephants.
Three are females that weigh 2,000, 2,500, and 3,000 kilograms.
And three are males that weigh 4,000, 4,500, and 5,000 kilograms.
If you grew a decision tree using this data, it would draw a decision boundary between males and females at 3,500 kilos.
The midpoint between the weights of the heaviest female and the lightest male.
Now, suppose that instead of weighing 5,000 kilos, the last male elephant weighed 10,000 kilos.
Your model would still divide males and females at 3,500 kilos.
It doesn't matter that the last elephant doubled in size.
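The elephant example can be checked directly with scikit-learn, whose decision trees place a split at the midpoint between neighboring feature values. The variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

labels = np.array(["female"] * 3 + ["male"] * 3)

weights_kg = np.array([2000, 2500, 3000, 4000, 4500, 5000]).reshape(-1, 1)
tree = DecisionTreeClassifier(random_state=0).fit(weights_kg, labels)

# Same data, but the heaviest male now weighs 10,000 kg.
outlier_kg = np.array([2000, 2500, 3000, 4000, 4500, 10000]).reshape(-1, 1)
tree_outlier = DecisionTreeClassifier(random_state=0).fit(outlier_kg, labels)

# The root node's split threshold is the decision boundary.
print(tree.tree_.threshold[0], tree_outlier.tree_.threshold[0])  # 3500.0 3500.0
```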
Speaking of doubling in size, your experience and skills are growing at an enormous rate.
I'm really proud to be taking this journey with you, and can't wait to keep it up.
Previously, you learned that boosting is an ensemble technique that builds models sequentially,
with each model in the sequence focusing on the mistakes of the previous one.
And you discovered that AdaBoost works by assigning greater weight in each model to the incorrect predictions of the model that preceded it.
Now, we're going to explore gradient boosting.
Gradient boosting is different from adaptive boosting because instead of assigning weights to incorrect predictions,
each base learner in the sequence is built to predict the residual errors of the model that preceded it.
Here's a demonstration.
This example uses a decision tree regressor, so imagine that the target is a continuous variable.
Let's start with a set of features, X, and a target variable Y.
We'll train the first base learner decision tree on this data and call it learner 1.
Learner 1 makes its predictions, which we'll call Y hat sub 1.
The residual errors of learner 1's prediction are found by subtracting the predicted values from the actual values.
Call the set of residual errors error 1.
Now, train a new base learner using the same X data,
but instead of the original Y data, use error 1 as the target.
That's because this learner is predicting the error made by learner 1.
Call this new base learner learner 2.
Learner 2's predictions are assigned to Y hat sub 2.
Then, compare learner 2's predictions to the actual values and assign the difference to error 2.
In this case, the actual values are the errors made by learner 1.
This process will continue for as many base learners as we specify.
For now, repeat it just once more.
Stopping here results in an ensemble that contains three base learners.
To get the final prediction for any new X, add together the predictions of all three learners.
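The three-learner walkthrough above can be written out by hand with scikit-learn's DecisionTreeRegressor. The synthetic data, the tree depth of 2, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Learner 1 is trained on the original target.
learner1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
y_hat1 = learner1.predict(X)
error1 = y - y_hat1

# Learner 2 is trained to predict learner 1's residual errors.
learner2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, error1)
y_hat2 = learner2.predict(X)
error2 = error1 - y_hat2

# Learner 3 is trained to predict learner 2's residual errors.
learner3 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, error2)


def predict(X_new):
    """Final prediction: the sum of all three learners' predictions."""
    return (learner1.predict(X_new)
            + learner2.predict(X_new)
            + learner3.predict(X_new))


mse_first = np.mean((y - y_hat1) ** 2)
mse_ensemble = np.mean((y - predict(X)) ** 2)
print(mse_first, mse_ensemble)
```

On the training data, the three-learner ensemble should have a lower mean squared error than learner 1 alone, because each added learner accounts for part of what was left over.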
If you like, pause the video now and repeat the process to review how it works.
Ensembles that use gradient boosting are called gradient boosting machines or GBMs.
GBMs are among the most widely used modeling techniques today because of their many advantages.
One of these is high accuracy.
As we mentioned earlier, many machine learning competition winners succeeded largely because of the accuracy of their boosting models.
Another advantage is that GBMs are scalable.
Even though they can't be trained in parallel, like random forests, because their base learners are developed sequentially,
they still scale well to large data sets.
GBMs also work well with missing data.
The fact that a value is missing is viewed as valuable information, so GBMs treat missing values just like any other value when determining how to split a feature.
This makes gradient boosting relatively easy to use with messy data.
Also, because they're tree-based, GBMs don't require the data to be scaled, and they can handle outliers easily.
Gradient boosting also has its drawbacks.
One is that GBMs have a lot of hyperparameters, and tuning them can be a time-consuming process.
Another drawback is that they can be difficult to interpret.
GBMs can provide feature importance, but unlike linear models, they do not have coefficients or directionality.
They only show how important each feature is relative to the other features.
Because of this, they're often called black box models.
This is a model whose predictions cannot be precisely explained.
In some industries such as medicine and banking, it's essential that your model's predictions be explainable.
Therefore, GBMs are not well-suited for some applications.
GBMs can also have difficulty with extrapolation.
Extrapolation is a model's ability to predict new values that fall outside of the range of values in the training data.
For instance, if one loaf of bread costs $1, two loaves of bread cost $2, and three loaves cost $3,
a linear regression model would have no trouble predicting that 10 loaves cost $10.
But a GBM wouldn't be able to, unless it saw the cost of 10 loaves in the training data.
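The bread example can be reproduced in a few lines. This sketch uses scikit-learn's GradientBoostingRegressor with its default settings as the GBM, which is an assumption. Other GBM implementations should behave similarly here because tree-based predictions stay within the range of the training targets.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

loaves = np.array([[1], [2], [3]])
cost = np.array([1.0, 2.0, 3.0])

linear = LinearRegression().fit(loaves, cost)
gbm = GradientBoostingRegressor(random_state=0).fit(loaves, cost)

ten_loaves = np.array([[10]])
linear_pred = linear.predict(ten_loaves)[0]
gbm_pred = gbm.predict(ten_loaves)[0]

# The linear model extends the trend to about 10; the GBM's prediction
# cannot exceed the largest cost it saw during training, which is 3.
print(linear_pred, gbm_pred)
```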
Finally, GBMs are prone to overfitting if not trained carefully.
Usually, this is caused by tuning too many hyperparameters, which can result in the trees growing to fit the training data,
but not generalizing well to unseen data.
You're doing a wonderful job filling up your data toolkit.
Everything we're exploring together is priming you for an exciting career.
Keep up the great work.
There are numerous machine learning packages that include implementations of the boosting models we examined earlier.
Most of them have many of the same tunable hyperparameters as decision trees, because the most popular ones use tree-based learners.
They are also similar to random forests in that they have additional hyperparameters that control the ensemble as a whole.
This video will explore some of these hyperparameters, so you'll be able to assemble models that fit well to your data and make accurate predictions.
The implementation of GBM modeling that we'll be using from this point on is called XGBoost.
XGBoost stands for extreme gradient boosting.
XGBoost is used widely in the field of predictive modeling, and as a data professional, you're likely to encounter it frequently if you work with models.
Scikit-learn has its own GBM implementation, which is similar.
But XGBoost is a commonly used gradient boosting package that has many useful optimizations.
These optimizations include fast training, effective regularization of features, and tunable hyperparameters, which can improve model predictions.
Let's return to max_depth.
As you'll recall, this was used both in decision trees and random forests.
It has the same functionality in XGBoost, which is that it controls how deep each base learner tree will grow.
The best way to find this value is through cross validation.
The model's final max_depth value is usually low.
Remember, the deeper the tree, the more a model learns feature interactions that could be very specific to the training data, but may not generalize well to new information.
Even short trees are powerful because of the ensemble.
Typical values for max_depth are two through ten, but this depends on the number of features and observations in the data.
The second hyperparameter is n_estimators, which is the number of estimators or maximum number of base learners that the ensemble will grow.
This is best determined using grid search.
For smaller datasets, more trees may be better than fewer.
For very large datasets, the opposite could be true.
Typical ranges are between 50 and 500.
Now, let's investigate some hyperparameters that we haven't used before.
This first one is very important.
It's called learning rate.
You'll remember that each time an ensemble builds a new base learner, it fits the data to the error from the previous model.
In a basic implementation, the predictions of all the trees could then be summed to determine a final prediction.
In this case, each tree's prediction is considered equally important to the final prediction.
In practice, we use the learning rate to indicate how much weight the model should give to each consecutive base learner's prediction.
Lower learning rates mean that each subsequent tree contributes less to the ensemble's final prediction.
This helps prevent over-correction and over-fitting.
Another common name for this concept is shrinkage, because less and less weight is given to each consecutive tree's prediction in the final ensemble.
Think of it like riding a bike for the first time.
Before you find your balance, you might move the handlebars too far in one direction, causing you to veer off course.
So you need to make an adjustment, but often you'll over-correct.
So you need to shift back the other way.
Each time you correct, you move the handlebars less and less until you're traveling smoothly in the direction you want to go.
This is the same idea as what happens when you lower the learning rate.
Each correction affects the prediction less than the one before it.
Also, if you use a low learning rate, your model will often require more trees to compensate.
Again, this is best determined using grid search.
Typical values are from 0.01 to 0.3.
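To illustrate how a low learning rate needs more trees to compensate, here is a hand-rolled boosting loop in which each tree's prediction is scaled by the learning rate before being added. It's a simplified sketch rather than XGBoost's actual algorithm. The data and the function name are assumptions. It reports training error only, so it shows the compensation effect and says nothing about how well a model generalizes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)


def boosted_training_mse(n_trees, learning_rate):
    """Fit n_trees on successive residuals, scaling each tree's
    contribution by learning_rate, and return the training MSE."""
    prediction = np.full(len(y), y.mean())  # start from the mean of the target
    for _ in range(n_trees):
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
    return np.mean((y - prediction) ** 2)


few_trees_low_rate = boosted_training_mse(n_trees=10, learning_rate=0.05)
few_trees_high_rate = boosted_training_mse(n_trees=10, learning_rate=0.3)
many_trees_low_rate = boosted_training_mse(n_trees=200, learning_rate=0.05)

print(few_trees_low_rate, few_trees_high_rate, many_trees_low_rate)
```

With only 10 trees, the 0.05 learning rate leaves more training error than 0.3 does. Giving the 0.05 model 200 trees closes that gap.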
The last hyperparameter we'll examine is very similar to min_samples_leaf from decision trees, but it has a different name.
It's called min_child_weight.
A tree will not split a node if it results in any child node with less weight than what you specify in this hyperparameter.
Instead, the node would become a leaf.
This is a regularization parameter, so values that are too high cause the model to underfit the data.
The range of this setting is 0 to infinity.
If set between 0 and 1, the algorithm interprets this as a percentage of your data.
So 0.1 would mean that a node could not split unless its children each have greater than or equal to 10% of the training observations.
Generally, think of values greater than 1 as being equivalent to the number of observations in a child node.
So a value of 10 would mean no child node could contain fewer than 10 observations.
We're nearing the end of this course and you've learned so much already.
As always, make the most of course resources and always feel free to return to any of these videos to keep practicing.
In this video, we'll demonstrate how to build and tune an XGBoost classification model.
Then, we'll compare its performance with that of all our previous models and select a final one.
Let's return to the bank churn data.
Import most of the same libraries, packages and functions used in previous models.
NumPy, pandas, matplotlib, pickle, all model metrics, GridSearchCV, and train_test_split.
We have two new imports as well.
XGBClassifier and plot_importance, both of which come from the XGBoost library.
Remember that certain columns have been dropped, including customer ID and gender.
Also, the geography column was dummy encoded.
Just as before, assign features and target data to variables X and y.
Then use train_test_split to split it into X_train, X_test, y_train, and y_test.
Stratify based on the target and set the same test size and the same random seed used for previous models.
This helps ensure a direct comparison when evaluating model performance.
Now, begin modeling.
Use grid search to tune some hyper parameters.
Specifically, focus on max depth, min child weight, learning rate, and number of estimators.
Define the values that grid search will permute as a dictionary called cv_params.
Next, instantiate the classifier.
Note that the objective parameter is set to binary:logistic.
This means that the model is performing a binary classification task that outputs a logistic probability.
The objective would be different for different kinds of problems,
for instance, if you were trying to predict more than two classes, or performing a regression on continuous data.
Now, set the random state.
Score in the same way as random forest.
Accuracy, precision, recall, and F1 score.
And finally, instantiate the grid search.
Remember to set refit to 'f1', as this tells grid search to refit the model that had the best average F1 score when it finishes its search.
Now, fit the model to the training data.
Use the handy %%time magic, so the cell outputs the time it takes to run.
This example scenario took nine minutes and 45 seconds.
Okay, now, pickle the model.
It's also possible to use models that were built in other notebooks.
For example, to use the random forest model, import it to this notebook using another with open statement.
Call it rf_cv.
Now, compare the models by using the best_score_ attribute for both the new XGBoost model
and the random forest model from earlier.
In this case, XGBoost outperformed the cross-validated random forest model by a very close margin of just three thousandths.
Now, use the make_results function created previously to generate a results table for this model,
and append it to the overall results table.
This makes it possible to compare the scores across all models.
Sort the results on the F1 column in descending order.
The table clearly shows that our XGBoost model outperformed all other models when measuring on F1 score.
All right.
Now, it's time to evaluate how the superior XGBoost model performs when making predictions on the test holdout data.
Use GridSearchCV's predict method to make predictions on the X_test data and assign the results to a variable.
Then, compare these predictions to the actual values contained in Y test and generate evaluation metrics.
Wow!
The model performed better on the test data than on the validation data for all four metrics.
This is always a possibility, but don't be alarmed if your model performs slightly worse on the test data.
After all, test data is completely unseen by the model.
Successful data professionals know that a job isn't finished simply because they've produced a model that results in an effective performance metric.
It's equally important to interpret that model and make recommendations based on the findings.
A confusion matrix is very helpful when assessing a model's variables and features.
For instance, in our model, from the 2,500 people in our test data, there are 509 customers who left the bank.
Of those, our model captures 256.
The confusion matrix indicates that when the model makes an error, it's usually a type 2 error.
In other words, it gives a false negative by failing to predict that a customer will leave.
On the other hand, it makes far fewer type 1 errors, which are false positives.
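To make the vocabulary concrete, here is a small confusion matrix computed with scikit-learn. The labels are made up for illustration and are not the model's actual predictions. In practice, y_pred would come from the fitted model's predict method.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = customer left, 0 = customer stayed.
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2

# fn counts type 2 errors (missed leavers); fp counts type 1 errors.
precision = precision_score(y_test, y_pred)  # tp / (tp + fp)
recall = recall_score(y_test, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_test, y_pred)                # harmonic mean of the two
print(precision, recall, f1)
```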
Whether these results are acceptable depends on the costs of the measures taken to prevent a customer from leaving, versus the value of retaining them.
In this case, bank leaders may decide that they'd rather have more true positives, even if it means also capturing significantly more false positives.
If so, perhaps optimizing the models based on their F1 scores is insufficient. Maybe we'd retrain them to focus on recall instead.
What is certain is that our model helps the bank.
Consider the results if decision makers had done nothing.
In that case, they'd expect to lose 509 customers.
Alternatively, they could give everybody an incentive to stay.
That would cost the bank for each of the 2500 customers in our test set.
Finally, the bank could give incentives at random, say by flipping a coin.
Doing this would incentivize about the same number of true responders as our model selects.
But the bank would lose a lot of money offering the incentives to people who aren't likely to leave.
Plus, our model is very good at identifying these customers.
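The comparison among these three alternatives is simple arithmetic on the figures already given. The coin flip is treated as a fair coin, so the counts for that option are expected values. The variable names are assumptions.

```python
n_customers = 2500       # customers in the test set
n_leavers = 509          # customers who actually left
captured_by_model = 256  # leavers the model identified

# Do nothing: every leaver is lost.
lost_if_nothing = n_leavers

# Incentivize everyone: pay for all 2,500 customers.
offers_if_everyone = n_customers

# Coin flip: about half of all customers get an offer,
# so about half of the leavers are reached.
offers_if_coin_flip = n_customers * 0.5
leavers_reached_by_coin_flip = n_leavers * 0.5

print(offers_if_coin_flip, leavers_reached_by_coin_flip)  # 1250.0 254.5
```

The coin flip reaches about as many leavers as the model's 256, but only by making roughly 1,250 offers.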
Another way to help explain our model is by checking the most important features.
XGBoost gives us a very useful function called plot_importance to let us observe the relative feature importance of our model.
After we've imported the function, we can use it to output a bar graph by passing to it the best estimator from grid search.
In our model, estimated salary, balance, credit score, and age were the top predictors of whether a customer will leave.
It would probably be useful to return and do another EDA focused on these features.
At this point, you might also want to add back in the gender column as well as a column of your final model's predictions to your original data.
This would allow you to measure how evenly your model distributed its error across reported gender identities.
From linear and logistic regression to naive Bayes, decision trees, random forest, and XGBoost, you're now equipped with a powerful set of tools.
They'll help you stand out as a professional in the exciting and rewarding data career space.
Out of all the models you've learned throughout this program,
the tree-based modeling techniques are going to be some of the ones you'll use most throughout your data journey.
In this section of the course, you discovered why tree-based models are often preferred over other supervised learning techniques, such as naive Bayes and logistic regression.
From there, you explored decision trees. You learned how they worked, how to build them, and how they're used to make predictions on future events.
Then, you considered hyperparameter tuning with decision trees. You learned about max depth and min samples leaf,
understanding how changing these hyperparameters affects how the model is trained, and in turn affects your model's performance.
You then explored some more advanced tree-based modeling topics, such as ensemble learning.
Two of these techniques in particular, bagging and boosting, enable multiple decision trees to arrive at a model that works better than any one tree ever could.
You implemented one of the most popular bagging methods, random forests, and observed how this technique compares to single decision trees.
Then, you learned several boosting methods and came to understand two popular approaches, adaptive boosting and gradient boosting.
After learning their differences and unique advantages, you implemented them in Python, and again witnessed how they compare to random forests and single decision trees.
And for each of these ensembling methods, you discovered how hyperparameter tuning can come into play.
Whereas this made relatively minor differences and improvements with the single decision trees,
you observed how tuning hyperparameters such as learning rate and number of estimators can take an ensemble learning model from good to great.
These models are some of the most cutting-edge in data science today. The skills you gain with these tools will absolutely make you stand out to potential employers.
What you've learned will also expand your data science education and serve as a launch pad for you to really boost into the data world.
Hi, it's great to be with you again. You might recognize me from the last course. I'm Tiffany, and it's time again to complete a portfolio project and apply what you've learned throughout this course.
Just as in the earlier courses, this portfolio project will guide you to complete several tasks and create artifacts that showcase your skills.
During interviews, you may be asked questions to test your understanding of different machine learning models.
Also, having projects on your resume can help you stand out to a hiring manager who may invite you to complete an interview.
During the interview, you can rely on your portfolio to discuss data science in general or explain modeling strategies more specifically.
In order to complete the portfolio project, you'll be presented with details about some business cases.
Choose one and use the instructions to complete a new entry in your PACE strategy document and create machine learning models to solve the problem.
By the time you complete this project, you'll have machine learning models that you can add to your portfolio.
At this point, you're almost finished with this course. You have learned everything you need to complete this project, and you're well on your way to advancing your career as a data professional.
In this project, you will solve a data problem using the models you learned in this course, and then make a business recommendation.
Following the PACE workflow, you will create a plan and communicate your process for completing the project.
Ready? Then let's get started.
In this course, you learned about supervised and unsupervised machine learning models, how they work and how to build them in Python.
In the previous course, you practiced building, interpreting, and evaluating regression models.
Up to this point, you've also been working hard to develop skills with Python, data visualizations, and statistics too.
Now it's time to compile all of that knowledge as you complete this portfolio project.
Here, you'll be presented with a business problem and a data set.
You will then go through the PACE workflow to create a plan, build machine learning models, document your work, and select the model that would be the best solution to the problem.
All of the models you build will be models that you learned in this course, and may include naive Bayes, decision tree, random forest, XGBoost, and even K-means.
You'll also select an appropriate evaluation metric to gauge your model's performance.
As you learned in this course, data professionals analyze and discover patterns in data, which inform the most appropriate models needed to solve business problems.
Then, they communicate about their work and recommendations to colleagues and stakeholders.
And remember, building these models will require a bit of patience.
You've done an excellent job developing your skills, and they will definitely be useful as you complete this project.
Please feel free to return to any of the other videos and course materials if you need a refresher on any of the content.
At this point in your program progress, you've covered so many topics.
Everything from understanding the data career space to Python, visualizations, statistics, modeling, and more.
Your portfolio includes machine learning models on a data set to solve a problem, and your PACE strategy document has a new entry, where you explain your work at each stage of the process.
Through every step of this program, you've created a number of artifacts to add to your portfolio that demonstrate your knowledge and skills.
There is so much to be proud of.
There are multiple ways you can highlight your work and explain what you've done to potential employers and hiring managers in future interviews.
As I've mentioned previously, you'll want to dedicate interview time to talking about the tools you've learned about,
the transferable skills you've developed, and the experiences you've had in this program.
As a data professional, you may be asked to learn and adapt to new tools on the job, just as we've illustrated in this program.
There are a lot of great tools out there and different businesses use different tools and skills depending on their needs.
As you apply for jobs, keep in mind that you have learned a lot of transferable skills that can be applied across different tools.
In this course, we discussed the importance of determining the most appropriate models to use based on the problem you're solving and the data available.
Along the way, you discovered you can use different machine learning models to help find a business solution.
And you acknowledged that it's important to explain your process when working with machine learning models.
During an interview, it may be the case that you're asked: How would you use supervised learning models to address a business problem?
How can you use different machine learning models to help you find a business solution?
Why is it important to explain your process when working with machine learning models?
I encourage you to consider what you have learned in this program to begin answering these questions.
Of course, there will likely be other points of discussion in the interview as well.
In this portfolio project that you just completed, you built different machine learning models using Python.
These models helped identify potential solutions to a unique business challenge.
Additionally, you recorded a new entry in your PACE strategy document that details your thoughts, considerations, and process steps for this project.
I'll also highlight that this project built upon the knowledge you developed as you progressed through this program.
Now you're prepared to perform the tasks and responsibilities of a data professional.
As a reminder, your interviewers have a business challenge, just like stakeholders on data projects.
They have an open job position they need to fill.
Think about what they need to know about you to make a decision that solves that business challenge,
just like you've been practicing in each portfolio project.
Coming up in the Capstone course, you'll bring together all of the content and skills from across the program and apply them in one project.
This will be an opportunity to apply skills from each of the portfolio projects in the earlier courses to solve a new business problem.
This will provide you with even more artifacts to add to your portfolio.
Congratulations! You just finished the final instructive portion of the program and you're ready to move on to your Capstone project.
You've learned a whole lot throughout this final section, and you're now ready to take this new knowledge and move on in your data journey.
Whether your next steps are to continue your education or take what you've learned into industry, you now have a comprehensive foundation to build on or use.
We started this section learning about the foundations of machine learning, with a focus on the different types of models that are available to a data professional.
You saw how different types of business needs required different types of models.
Additionally, you learned about recommendation systems and many of the most common use cases for these types of models, along with the different popular techniques and their advantages and disadvantages.
From there, you started building out your machine learning tool belt. Different integrated development environments, types of Python files, and data-oriented Python packages together give you the tools you need to approach any data-driven problem.
The PACE workflow is the framework in which you can put those tools to use.
By taking the time to follow the steps of plan, analyze, construct, and execute, you can ensure that you stay on track to produce a model that will deliver meaningful results.
The plan stage involved taking a close look at the business need and the data available and determining what type of model would be appropriate.
In the analyze stage, you applied many of the exploratory data analysis principles that you learned earlier in the program.
Additionally, you learned about a new subset of techniques called feature engineering that allow you to manipulate and prepare your data for modeling in a variety of ways.
The construct stage introduced you to a new type of supervised classification model, naive Bayes.
You learned about and built this model along with applying evaluation metrics to gauge the performance of the model.
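As a quick refresher on that construct-stage step, the sketch below fits a Gaussian naive Bayes classifier and computes common evaluation metrics. The synthetic data, the choice of the Gaussian variant, and the particular metrics shown are assumptions for illustration, not a reproduction of the course notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=1
)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

Which metric matters most depends on the business problem: recall when missing a positive case is costly, precision when false alarms are costly, and F1 when you need a balance of the two.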
Finally, in the execute stage, you performed any needed validation techniques, further evaluated the model, and made any needed tweaks to get the most performance out of it.
Next, you took a deeper look at unsupervised learning models.
K-means is one of the most widely used unsupervised learning techniques, and in this part of the program, you built a K-means model and used common evaluation techniques to fully understand its results.
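Two of those evaluation techniques, inertia and the silhouette score, can be sketched as follows. The synthetic blobs, the range of k values, and the scaling step are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic, unlabeled data generated around 3 centers (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# K-means is distance based, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    # inertia_: sum of squared distances from each point to its nearest centroid.
    # silhouette_score: ranges from -1 to 1; higher means better-separated clusters.
    results[k] = (kmeans.inertia_, silhouette_score(X_scaled, kmeans.labels_))

for k, (inertia, silhouette) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={silhouette:.3f}")
```

Inertia generally keeps falling as k grows, so it is usually read with an elbow plot, while the silhouette score can be compared directly across values of k.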
Finally, you learned about tree-based modeling.
Tree-based techniques are some of the most effective models that currently exist in the industry.
You saw how single decision trees work, learning how they function conceptually, along with building one for yourself.
With this foundation, you were introduced to two ensemble techniques, bagging and boosting.
Within tree-based modeling, you saw one of the most important aspects of using advanced machine learning techniques.
Hyperparameter tuning is essential for building models in industry, allowing you to optimize models to fit your specific needs.
In the next section, you'll be taking everything you learned throughout the entire program and applying it to a capstone project that will be an invaluable piece of your portfolio.
See you soon!
Introduction to Machine Learning
Understanding Algorithms and Predictions
Types of Machine Learning
Supervised vs. Unsupervised Learning
Continuous and Categorical Variables
Understanding Feature Types in Supervised Learning
Practical Applications of Machine Learning
Exploring Recommendation Systems
Content-Based Filtering Explained
Overview of Collaborative Filtering
Ethical Considerations in Model Development
Using Data Tools and Libraries
Integrated Development Environments (IDEs)
Understanding Python File Types
Popular Python Packages in Data Analysis
Overview of Recommendation Systems
Understanding Data Limitations
Introduction to the PACE Workflow
Plan Stage in PACE
Analyze Stage in PACE
Class Balancing Techniques
The Analyze Stage in PACE
Understanding Customer Churn
Feature Selection in Modeling
Feature Extraction and Transformation
Splitting Training and Test Data
The Importance of Stratification
Understanding Model Evaluation Metrics
Interpreting the Confusion Matrix
Introduction to the K-Means Algorithm
Understanding K-Means Initialization and Iteration
Let's Talk About the Concept of Centroids in K-Means Clustering
Applying K-Means to Image Color Compression
Visualizing Pixel Data in 3D Space
Determining the Optimal Value of K in Clustering
K-Means Model Evaluation Techniques
Building a K-Means Model in Jupyter Notebooks
Using Synthetic Data for Clustering
Data Scaling Techniques for K-Means
Cluster Analysis with Decision Trees
Preparing Data for Decision Trees
Understanding Evaluation Metrics
Training a Baseline Decision Tree Model
Displaying the Confusion Matrix
Visualizing Decision Tree Splits
Grid Search for Hyperparameter Optimization
Understanding Model Evaluation Results
What Is Ensemble Learning?
Exploring Bagging Techniques
Introduction to Random Forests
Grid Search for Random Forests
Fitting the Model
Serializing the Model
Cross-Validation and Validation Sets
Advantages of Boosting
Understanding Gradient Boosting Machines
Advantages and Disadvantages of GBMs
Hyperparameters in XGBoost
Building an XGBoost Model
Introduction to the Portfolio Project
Applying Skills in Portfolio Projects
Constructing Machine Learning Models
Overview of the Capstone Project
How do algorithms predict your next search term?
What role do data professionals play in machine learning?
What are the key differences between supervised and unsupervised learning?
Why is it important to understand the different variable types when building models?
How do you identify categorical and discrete data in machine learning models?
What challenges arise when recommending content across different domains?
How can popularity bias affect recommendations?
What ethical considerations should data professionals keep in mind?
What are the most important ethical questions to consider when developing models?
Why is it important for data professionals to have a digital toolkit?
How do IDEs help increase coding efficiency for data professionals?
What advantages do Python notebooks offer compared to scripts?
How do recommendation systems suggest what you should watch next?
Why is it important to understand the limitations of your models?
What are the four stages of the PACE workflow for building models?
How does the plan stage align models with the needs of the business?
How do you decide between upsampling and downsampling techniques?
Why is exploratory data analysis so important before starting feature engineering?
How does the naive Bayes model predict outcomes by using conditional independence?
Why is it important to stratify data when splitting datasets?
What does a confusion matrix tell you about the model's performance?
How does K-means clustering organize unlabeled data?
How does changing the number of centroids affect the clustering results?
What happens to our tulip image when we apply K-means with one centroid?
How can we visually assess the effectiveness of our K-means clustering?
How do we evaluate the effectiveness of K-means models with metrics such as inertia and silhouette scores?
Can synthetic data help us accurately determine the optimal number of clusters in a model?
How does scaling affect the performance of a K-means clustering model?
What does the F1 score tell you about the model's performance?
How do we interpret the splits in a decision tree?
Why is hyperparameter tuning so important for optimizing models?
How does unseen test data affect accuracy in model selection?
Why experiment with so many hyperparameter combinations when modeling?
What are the advantages of ensemble methods such as bagging?
How can we save time with pickle when fitting models?
What is the difference between cross-validation and a separate validation set?
Why are boosting techniques so important in predictive modeling?
Why are gradient boosting machines so popular with data professionals?
What makes hyperparameter tuning so important for the success of GBMs?
How does the learning rate affect the model's final predictions?
How can a portfolio help you stand out in job interviews?
Why is it important for data professionals to explain their modeling processes?
What do you need to emphasize in job interviews to solve the employer's challenges?
The topics covered include the fundamentals of machine learning, its applications in data analysis, and how to use it to build complex models. You will learn how to build these models, including how to apply previously gathered data to make educated guesses. The content also examines the key characteristics of machine learning, which enables computer systems to analyze data and discover patterns. Another important aspect is the use of the Python programming language, which is essential for working with machine learning algorithms and statistical models. You will also explore the importance of the talented data professionals who develop these complex models. The educational content assumes some prior knowledge of data analysis and cleaning, regression models, and basic statistical concepts.