Azure Machine Learning Tutorial
In this tutorial we'll introduce Azure Machine Learning (AML), considerations for organizing an Advanced Analytics team, and then show you how to develop your first predictive model. There are several data preparation and release steps we take into consideration while delivering the final predictive model:
- Create Machine Learning Experiment
- Data Preparation
- Data Selection and Transformation
- Data Modeling
- Interpretation and Evaluation
- Data visualization and Dashboards
Note: This article assumes you already have an Azure subscription and have set up your Azure Machine Learning account. You may set up a free account here.
What is Azure Machine Learning?
Azure Machine Learning is a cloud service for performing predictive analytics. Starting with raw data, Azure ML provides tools for cleansing that data (e.g., removing duplicate records), then running different learning algorithms on the data to search for patterns. Azure ML can then generate a model, which is software that an application calls to detect pattern matches in new data. The model can return a probability indicating how strong the match is. This lets the application make better decisions about what to do.
What are examples and use of predictive models?
Predictive analytics helps your organization identify unforeseen patterns and develop models that can be used to predict what may occur.
Use machine learning to anticipate market trends and predict areas of focus, whether you're a healthcare organization looking to improve patient outcomes and organizational efficiency, or a financial services company looking to gain insight into financial drivers, customer behaviors, and operational performance.
Defining your Advanced Analytics Team
Advanced Analytics commonly comprises three phases:
- Data Preparation - data is taken in, transformed, cleansed, and denormalized. Then, the data is profiled, explored, and visualized.
- Modeling - feature and algorithm selection takes place, along with model testing and validation, before the model is deployed to the operationalization phase.
- Operationalization - the model's output is scored, visualized, and measured, and then used for predictive analytics.
It's not uncommon to have multiple resources with unique skillsets, working together to accomplish the goal of implementing an advanced analytics solution. However, there are resources who are quite capable of completing all the advanced analytics lifecycle phases on their own.
Advanced Analytics Lifecycle
Advanced Analytics Team Consideration
- Data Engineer - Responsible for managing source data from the relational and online analytical processing (OLAP) environment(s), including the Extract, Transform and Load (ETL) of data between discrete systems.
- Data Steward - This is usually a business unit resource outside of IT. The data steward is responsible for maintaining data elements in a metadata registry and adjudicating data disputes.
- Data Scientist - Responsible for creating and testing machine learning algorithm models, among other complex statistical modeling tasks.
- Business Analyst - Responsible for having a thorough understanding of business subject matter. Additionally, the analyst should be experienced in defining requirements for solving business problems or advancing business growth efforts.
- Executive - Responsible for running a specific department and/or the business as a whole. Their primary goal is to achieve corporate growth.
- BI Developer - Resource responsible for managing the architecture and development of systems applications. In this case, they would be developing web services to interface with Azure Machine Learning experiments and the development of dashboards and data visualizations.
In all likelihood, your organization already has 90% of the team resources described above. If you're reading this article, you're either fulfilling the role as Data Scientist, or soon to be. :) Remember, team collaboration is key in the advanced analytics lifecycle. You'll work more efficiently and effectively, and deploy more successful solutions by delegating skilled resources to their respective tasks.
Before we get started:
The purpose of this tutorial is to expose you to some of the general concepts of working with Azure Machine Learning. It is not intended for use as a production-ready solution. Optionally, you may contact us directly here, should you require consulting assistance with your Azure Machine Learning initiatives.
ML Experiment Overview
We will be building a predictive model to determine patient diabetes readmissions. The dataset used is derived from Diabetes 130-US hospitals for years 1999-2008: This data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes. You may find a copy of the data here.
Let's get started!
Step 1 - Create AML Experiment
In this step we will create our Azure Machine Learning Experiment. We assume you have already created an Azure Subscription.
2. From the bottom left corner, click
you should see the following:
3. Before we create the new canvas for our experiment, let's first upload a sample dataset to work with. You will need to download the diabetic_data.zip (3.19 MB) file and extract it to your local file system - remember the location, so that you may reference it.
4. Click on the "Dataset" menu item on the left pane. You should see the following:
5. Click on "From Local File", navigate to where you saved your file and click the OK check after completing the following:
6. Click on the Experiment menu item from the left pane and select "Blank Experiment" from the tile choices, as depicted in Step 2. You should see the AML designer:
Step 2 - Data Preparation
In this step we will prepare the data by taking steps to ingest, transform and explore the raw source data.
1. Change the name of the experiment to something meaningful. For example, I named mine Diabetes Readmissions - Classification.
2. From the component pane on the left, expand "My Datasets" and select the "Diabetes Dataset" (or the given name) you created earlier. Drag that to the designer canvas like so:
3. Let's explore the data a little. Right click on the dataset and select Dataset >> Visualize
A data grid showing the total number of rows and columns, along with relevant patient encounter data, is presented. If you click on one of the columns, you'll also notice some general statistics on the selected column and a histogram. Close the dataset viewer.
4. Let's handle some basic transformations. From the modules pane on the left, type in "Clean Missing Data" to find the Data Transformation module. Drag the module twice to the designer, just below the diabetes dataset.
Hey, what's Clean Missing Data transformation?
The Clean Missing Data transformation is used to handle missing values from within a dataset. For this example, we will focus on both String and Numeric based values. You can set the cleaning mode and choose what the replacement value should be. To learn more about this module please visit here.
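To make the idea concrete, here is a minimal sketch in plain Python of what replacing missing String and Numeric values amounts to. It is not the module's implementation. The column name race, the "Unknown" replacement text, and the rule that None or an empty string counts as missing are all assumptions made for illustration.

```python
# Illustrative only: rows are plain dicts, and a value counts as "missing"
# when it is None or an empty string (an assumption about the input).
MISSING = (None, "")


def clean_missing(rows, string_columns, numeric_columns,
                  string_replacement="Unknown", numeric_replacement=0):
    """Return new rows with missing values replaced, one column type at a time."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for col in string_columns:
            if new_row.get(col) in MISSING:
                new_row[col] = string_replacement
        for col in numeric_columns:
            if new_row.get(col) in MISSING:
                new_row[col] = numeric_replacement
        cleaned.append(new_row)
    return cleaned


# Made-up sample rows; "race" is a hypothetical String column.
rows = [
    {"race": "Caucasian", "num_medications": 18},
    {"race": "", "num_medications": None},
]
cleaned = clean_missing(rows, ["race"], ["num_medications"])
print(cleaned[1])
```

Two passes, one per column type, mirror the two Clean Missing Data modules used in this tutorial.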
5. Now connect the dataset to the first Clean Missing Data (CMD) module and select the module, so that you may edit its properties. The first CMD will focus on cleaning missing String values.
6. Click on the "Launch column selector" from the properties pane. Configure the properties as depicted below and save.
7. Now, connect the first CMD left connector (Cleaned Dataset) to the second CMD below.
8. Configure the second CMD for missing Numeric values. Follow the steps above, but change the column type to Numeric and the Replacement value to 0.
9. Save your project and run the experiment. Optionally, you may visualize each of the CMDs to explore the data.
10. We will now select a subset of columns we're interested in for our modeling operations. From the module pane on the left, search for "select columns in dataset". Drag the Select Columns in Dataset transformation module to the designer and connect the last CMD left connector (Cleaned Dataset) to the Select Columns in Dataset module.
Hey, what's Select Columns in Dataset transformation?
Select Columns in Dataset is used to include or exclude columns from a dataset in an operation. This is useful for limiting columns for downstream operations and/or reducing size of the dataset by removing unneeded columns. To learn more about this module please visit here.
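As a rough sketch of the exclude behavior, the snippet below drops the three identifier columns this tutorial excludes (admission_type_id, encounter_id and patient_nbr) from rows held as plain Python dicts. The sample values and the helper name exclude_columns are made up for illustration.

```python
def exclude_columns(rows, excluded):
    """Return copies of the rows without the excluded column names."""
    excluded = set(excluded)
    return [{name: value for name, value in row.items() if name not in excluded}
            for row in rows]


EXCLUDED = ["admission_type_id", "encounter_id", "patient_nbr"]

# One made-up row, just to show the shape before and after.
rows = [{
    "encounter_id": 1,
    "patient_nbr": 10,
    "admission_type_id": 6,
    "time_in_hospital": 3,
    "readmitted": "NO",
}]
selected = exclude_columns(rows, EXCLUDED)
print(selected)
```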
11. While having active focus on the Select Columns in Dataset module, click on the "Launch column selector" from the right properties pane.
12. Configure the Select Columns to exclude the following: admission_type_id, encounter_id and patient_nbr. Click the OK check.
13. Save and run your experiment. Optionally, you may visualize the data in the Select Columns in Dataset module.
14. We'll now rescale some existing numeric values, so that they are constrained to a standard range. From the modules pane on the left, search for "normalize data". Drag the Normalize Data transformation module to the designer, just below the Select Columns in Dataset transformation module.
Hey, what's Normalize Data transformation?
The Normalize Data transformation module transforms data to a common scale. For example, say you have one column ranging from 0 to 1 and the next ranging from 5,000 to 50,000. The difference in the scale of the numbers can cause problems when you attempt to combine the values as features during modeling.
Normalization remedies this problem, by transforming values so they maintain general distribution levels and conform to a common scale. To learn more about this module please visit here.
15. Connect the Select Columns in Dataset to the Normalize Data module. Your designer should look like the following:
16. While you still have focus on the Normalize Data module, you will configure its properties. On the right properties pane, click on "Launch column selector", so that we may select which numeric columns to rescale. The properties should be set as follows:
Transformation Method: ZScore
ZScore value transformation formula: z = (x - mean(x)) / stdev(x)
Use 0 for constant columns when checked: Checked
Columns to transform: time_in_hospital, num_procedures, num_lab_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses
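The ZScore setting above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the module's code: it uses the population standard deviation (the module may use the sample version), and it returns 0 for a constant column to mirror the "Use 0 for constant columns" option. The sample values are made up.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # Constant column: every value equals the mean, so emit 0 for each.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up values for one numeric column, e.g. time_in_hospital.
time_in_hospital = [1, 3, 5]
print(zscore(time_in_hospital))
print(zscore([4, 4, 4]))
```

After the transformation, columns that started on very different scales are directly comparable, which is the point of normalizing before modeling.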
17. We will now make our String values Categorical by using the Edit Metadata transformation module. From the modules pane on the left, search for "edit metadata". Drag the Edit Metadata transformation module to the designer, just below the Normalize Data transformation module.
Hey, what's Edit Metadata transformation?
The Edit Metadata transformation module allows you to change the metadata associated with columns in your dataset. For the purpose of our tutorial, we will make String values Categorical, so that these columns will be treated as categories and not as results, scores, labels, or other value types. To learn more about this module please visit here.
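One way to picture what "categorical" means: the column is treated as a fixed set of levels rather than as free text. The sketch below collects the distinct levels of a string column and maps each row to its level index. The gender column and the helper name are hypothetical, and this is not how Azure ML stores categorical columns internally.

```python
def make_categorical(rows, column):
    """Return the sorted distinct levels of a column and each row's level index."""
    levels = sorted({row[column] for row in rows})
    index = {level: position for position, level in enumerate(levels)}
    codes = [index[row[column]] for row in rows]
    return levels, codes


# Made-up rows with a hypothetical String column.
rows = [{"gender": "Female"}, {"gender": "Male"}, {"gender": "Female"}]
levels, codes = make_categorical(rows, "gender")
print(levels, codes)
```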
18. Connect the left connector (Transformed Dataset) from the Normalize Data module to the Edit Metadata module.
19. While you still have focus on the Edit Metadata module, you will configure its properties. On the right properties pane, click on "Launch column selector", so that we may select which column types to categorize. The properties should be set as follows:
Selected Columns: Column type: String, All
Data Type: Unchanged
Categorical: Make Categorical
New Column Names: <leave blank>
Your designer should look similar to the image below.
20. Save and Run your experiment. Optionally, you may visualize your Edit Metadata dataset; you'll notice that the Feature Type of the string columns is now Categorical Feature.
21. Now we will Split the data into two distinct sets. This will be useful for our training and testing sets used in Step 3 - Data Modeling. From the modules pane on the left, search for "split data". Drag the Split Data transformation module to the designer, just below the Edit Metadata transformation module.
Hey, what's Split Data transformation?
The Split Data transformation module allows you to divide a dataset into two distinct sets. It's useful for creating the training and testing sets used in modeling. For the purpose of our tutorial, we will split the rows 70-30. This means that 70% of the rows go to the first dataset, used for training, and 30% go to the second dataset, used for testing. To learn more about this module please visit here.
22. Connect the connector (Results dataset) from Edit Metadata module to the Split Data module.
23. While you still have focus on the Split Data module, you will configure its properties. On the right Properties pane, the properties should be set as follows:
Splitting Mode: Split Rows
Fraction of rows in the first output dataset: .70
Randomized split: Checked
Random Seed: 123
Stratified split: False
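The settings above describe a seeded, randomized 70/30 row split without stratification. A minimal plain-Python sketch of the same idea follows. Note that Python's random generator with seed 123 will not reproduce the exact rows Azure ML picks; the seed only makes this sketch repeatable.

```python
import random


def split_rows(rows, fraction=0.70, seed=123):
    """Shuffle a copy of the rows with a fixed seed, then cut at the fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 dataset rows
train, test = split_rows(rows)
print(len(train), len(test))  # 70 30
```

Every row lands in exactly one of the two outputs, and running the split again with the same seed returns the same partition.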
24. Save and Run your experiment. Optionally, you may visualize the Split Data outputs; you'll notice two datasets, Results Dataset1 and Results Dataset2. Visualize each of the datasets and notice that the row counts differ according to the split percentage.
This concludes our Data Preparation steps. Now we'll move onto the fun part of Modeling!
Step 3 - Data Modeling
In this step we will model the machine learning solution by selecting an appropriate statistical algorithm, then training, scoring, evaluating and finally deploying the machine learning model via web service.
1. We will now select an algorithm to predict one of two states of our target variable "Readmitted". From the module pane on the left, search for "Two-Class Logistic Regression". Drag the Two-Class Logistic Regression classification module to the designer. You will not connect this to the Split Data module.
2. Configure the Two-Class Logistic Regression module properties like so:
Hey, what's Two-Class Logistic Regression classification module?
The Two-Class Logistic Regression classification module is a logistic regression model that can be used to predict the probability of an outcome for the target variable, for example a simple YES or NO. Logistic regression is a supervised learning method, and therefore requires a labeled dataset. To learn more about this module please visit here.
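For intuition, a logistic regression model turns a weighted sum of the feature values into a probability between 0 and 1 using the logistic (sigmoid) function. The sketch below shows only that calculation. The weights and bias are hypothetical numbers, not values learned from the diabetes data.

```python
import math


def predict_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model parameters for two features.
weights = [0.8, -0.5]
bias = 0.0

print(predict_probability([0.0, 0.0], weights, bias))  # 0.5
print(predict_probability([2.0, 0.0], weights, bias))
```

A weighted sum of zero gives a probability of exactly 0.5, so the model is undecided; positive sums push the probability toward 1 and negative sums toward 0.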
3. Now we will Train our classification model. For our solution to provide a prediction, the model must learn from known data in a process known as training. During this process, data is evaluated by the machine learning algorithm, which looks for rules and patterns that can be used for prediction.
From the modules pane on the left, search for "train model". Drag the Train Model module to the designer, just below the Two-Class Logistic Regression classification module. Connect the Two-Class Logistic Regression module (Untrained Model) connector to the left connector (Untrained Model) of the Train Model module. Your designer should look like so:
Hey, what's Train Model module?
Train Model module is used to train a classification or regression model. Training a classification or regression model is a type of supervised learning. This means you must provide a dataset that contains historical data from which to learn patterns. To learn more about this module please visit here.
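To show what "learning patterns from historical data" can look like, here is a toy training loop for logistic regression using plain gradient descent. This is a sketch under assumptions: the optimizer, learning rate, epoch count and data are all chosen for illustration, and the Train Model module is not claimed to work this way internally.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row: predicted probability minus label.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Toy labeled data: one feature, label 1 when the feature is positive.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic(X, y)
print(weights, bias)
```

The labels are what make this supervised learning: without the known outcomes in y, there would be no error to correct.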
4. Connect the bottom left connector (Results Dataset 1) of the Split Data to the top right connector (Dataset) of the Train Model module. Your designer should look like so:
5. While you have focus on the Train Model module, you will configure its properties. On the right Properties pane, select "Launch column selector" and include the "Readmitted" column name. The properties should be set as follows:
6. Save and Run your experiment. When the model is trained, right-click the output and select Visualize to view the model parameters and feature weights.
7. It's time to score our model. From the modules pane on the left, search for "score model". Drag the Score Model module to the designer, just below the Train Model module. Connect the Train Model module (Trained Model) connector to the left connector (Trained Model) of the Score Model module. Then connect the bottom right connector (Results dataset 2) of the Split Data dataset to the top right connector (Dataset) of the Score Model module. Your designer should look like so:
Hey, what's Score Model module?
The Score Model module is used to generate predictions using a trained classification or regression model. The predicted value can be in many different formats, depending on the model and your input data. In this tutorial we are using a classification model to create the scores, so Score Model outputs a predicted value for the class, as well as the probability of the predicted value. To learn more about this module please visit here.
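As a sketch of what scoring produces, the snippet below appends a predicted class and its probability to each row, using the column names "Scored Labels" and "Scored Probabilities" that appear in this tutorial's scored output. The 0.5 threshold, the YES/NO class names, and the sample probabilities are assumptions for illustration.

```python
def score_rows(rows, probabilities, threshold=0.5, positive="YES", negative="NO"):
    """Append a predicted label and its probability to a copy of each row."""
    scored = []
    for row, probability in zip(rows, probabilities):
        new_row = dict(row)
        new_row["Scored Labels"] = positive if probability >= threshold else negative
        new_row["Scored Probabilities"] = probability
        scored.append(new_row)
    return scored


# Made-up rows and model probabilities.
rows = [{"time_in_hospital": 7}, {"time_in_hospital": 1}]
scored = score_rows(rows, [0.8, 0.2])
print(scored)
```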
8. Score Model Properties - make sure the Append score columns to output option is checked.
9. Save and Run your experiment. Optionally, visualize the Score Model dataset and scroll to the far right of the columns. You will notice two new columns, "Scored Labels" and "Scored Probabilities". In addition, to the right you'll see statistical information for the new columns.
10. Lastly, we will evaluate our model. From the modules pane on the left, search for "evaluate model". Drag the Evaluate Model module to the designer, just below the Score Model module. Connect the Score Model module (Scored Dataset) bottom connector to the top left connector (Scored Dataset) of the Evaluate Model module. Optionally, you may also enter a Summary and Description of your Evaluate Model module. Your designer should look like so:
Hey, what's Evaluate Model module?
Evaluate Model module is used to measure the accuracy of a trained model. You provide a dataset containing scores generated from a model, and the Evaluate Model module computes a set of industry-standard evaluation metrics. To learn more about this module please visit here.
11. Save and Run your experiment. Right click the Evaluate Model module to visualize the results. You should see similar results:
The view above shows the Receiver Operating Characteristic (ROC) chart for our classification model, along with a table of statistical metrics. The ROC curve plots the true positive rate against the false positive rate, and the Area Under the Curve (AUC) value summarizes the curve in a single number. The closer the curve is to the top left corner, the better the predictive performance. The closer the curve is to the diagonal line, the more poorly the predictive model tends to perform. In our example, we could benefit from making some modifications to our model and/or trying out other algorithms to enhance performance. Tuning model performance will be covered in another tutorial.
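The two ideas behind the chart can be sketched in plain Python. The first function computes one point on the ROC curve (false positive rate, true positive rate) at a given threshold. The second computes AUC as the share of positive/negative pairs in which the positive row gets the higher score, with ties counted as half. The labels and probabilities are made up, and this is not the Evaluate Model module's code.

```python
def roc_point(labels, probabilities, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)


def auc(labels, probabilities):
    """Share of positive/negative pairs ranked correctly; ties count as half."""
    positives = [p for y, p in zip(labels, probabilities) if y == 1]
    negatives = [p for y, p in zip(labels, probabilities) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up true labels (1 = readmitted) and scored probabilities.
labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_point(labels, probabilities, 0.5))
print(auc(labels, probabilities))
```

An AUC of 0.5 corresponds to the diagonal line (no better than guessing), while 1.0 corresponds to a curve that hugs the top left corner.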
12. We will now deploy our predictive model as a web service. From the bottom tool bar, click Set Up Web Service and select Predictive Web Service (Recommended). Azure ML Studio will create the predictive experiment automatically.
After it's finished, you may click Close to hide the results bar.
Your predictive experiment designer should look similar:
13. Save and Run your predictive experiment. After the process completes, you will need to deploy the web service by clicking on the Deploy Web Service button. When this process completes, you'll be redirected to the Diabetes Readmissions - Classification Demo [predictive exp.] web service page.
Web Service Page
Two key items to take notice of:
1. API key - this is used by other applications needing to consume the web service.
2. REQUEST/RESPONSE - this is the URL to your web service. You can obtain it by right clicking on the link and selecting "Copy shortcut".
Copy the API Key and the URL shortcut to notepad. You'll need to reference these in the upcoming Step 4 - Operationalize step.
Your Predictive Web Service is now ready for consumption by other applications. For example, you can consume this web service using Excel or Power BI.
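If you'd rather call the web service from code, the sketch below builds the HTTP request using only Python's standard library. Treat the details as assumptions: the URL and API key are placeholders for the values you copied, and the JSON shape (an "Inputs" object with "input1", "ColumnNames" and "Values") is assumed here, so check the REQUEST/RESPONSE help page of your own service for the exact schema.

```python
import json
import urllib.request


def build_request(url, api_key, column_names, rows):
    """Build a POST request carrying the rows to score as JSON."""
    payload = {
        # Assumed payload shape; confirm it on your service's help page.
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Placeholders: substitute the URL and API key saved from the web service page.
request = build_request(
    "https://example.invalid/score",
    "YOUR_API_KEY",
    ["time_in_hospital", "num_medications"],
    [["3", "18"]],
)

# Sending the request is left commented out, since it needs a live service:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```

The API key travels in the Authorization header, which is why it should be kept out of source control and shared documents.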
This concludes our Data Modeling steps. Now we'll move onto Operationalizing our Azure Machine Learning solution.
Step 4 - Operationalize
In this step we will operationalize the Azure Machine Learning model web service. For this tutorial, we will use Excel as the tool of choice for consuming our Diabetes Readmissions web service.
Note: you are not limited to using just Excel. Other tools like Power BI, Reporting Services or even custom managed code can consume this web service and integrate into existing applications at your organization.
1. Open Excel and select the Insert tab from the menu ribbon. Find the "Add-ins" group and click "My Add-ins".
You should see the Azure Machine Learning add-in prompt displayed. If you do not see it, please visit Azure Machine Learning add-in to install the add-in.
2. Click Add from the bottom right corner of the Office Add-ins dialog. You should now see the Azure Machine Learning Add-in pane to the right within Excel.
3. Click on the + Add Web Service to configure your web service properties.
4. Copy and Paste the URL & API Key you saved earlier in notepad, to the related inputs. You should have configured something similar:
5. Click Add button to save configuration and connect to the web service.
6. You can expand the View Schema, Predict and Errors sections to review the properties and content.
7. Expand the Predict section. Place your cursor on the Input textbox and click on the grid icon to the right of the input box.
This will open the Select Data dialog. Select Sheet1!$A$1 for the value and click OK. You should now see the following:
8. For the Output value: Create a new worksheet (Sheet2) and enter Sheet2!$A$1 for the Output value. Your configuration should look like below:
9. We will need some sample data to work with before we can get some results. We'll need to open the existing diabetic_data.csv you saved to your local system and copy the first 51 rows.
10. Place your cursor on A1 of Sheet1 and paste your csv data to Sheet1 like so:
11. Make sure to modify the Input textbox to "Sheet1!A1:AX51", so that all column headers and data are captured for the prediction.
12. From the right pane, click on the Predict button and the web service should return the prediction results on Sheet2 - click on the Sheet2 to review the results. Scroll to the far right of the worksheet to see the Scored Probability.
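If you're wondering why the input range ends at AX51, a quick calculation helps. Excel column letters are a base-26 numbering, so AX is column 50, and rows 1 to 51 hold one header row plus 50 data rows. The helper name below is made up for this sketch.

```python
def column_number(letters):
    """Convert an Excel column label such as 'AX' to its 1-based column number."""
    number = 0
    for ch in letters.upper():
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


print(column_number("AX"))  # 50
```

So "Sheet1!A1:AX51" covers 50 columns and 51 rows, which matches the 51 rows copied from diabetic_data.csv.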
Azure Machine Learning is a powerful, secure and fun predictive modeling tool to use. Machine learning solutions that once took weeks to build now take only hours to develop and deploy. You have all the tools necessary to begin creating your machine learning models and integrating them with modern business intelligence tools, or custom-built applications, through Azure Machine Learning web services.