Machine Learning Model
In traditional programming, a function or program reads a set of input data, processes them and outputs the results. Machine Learning (ML) takes a different approach. Lots of input data and corresponding outputs are given. ML employs an algorithm to learn from this dataset and outputs a "function". This function or program is what we call an ML Model.
Essentially, the model encapsulates a relationship or pattern that maps the input to the output. The model learns this automatically without being explicitly programmed with fixed rules or patterns. The model can then be given unseen data for which it predicts the output.
ML models come in different shapes and formats. Model metadata and evaluation metrics can help compare different models.
Discussion
- Could you explain ML models with some examples?
Consider a function that reads a Celsius value and outputs the Fahrenheit value. In traditional programming, this function implements a simple mathematical formula. In ML, once the model is trained on the dataset, the formula is implicit in the model. It can read new Celsius values and give the corresponding Fahrenheit values.
Let's say we're trying to estimate house prices based on attributes. It may be that houses with more than two bedrooms fall into a higher price bracket. Areas of 8500 sq.ft. and 11500 sq.ft. may be important thresholds at which prices tend to jump. Rather than encode these rules into a function, we can build an ML model to learn these rules implicitly.
In another dataset, there are three species of irises. Each iris sample has four attributes: sepal length/width, petal length/width. An ML model can be trained to recognize three distinct clusters based on these attributes. All flowers belonging to a cluster are of the same species.
In all these examples, ML saves us the trouble of writing functions to predict the output. Instead, we train an ML model to implicitly learn the function.
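As a minimal sketch of the Celsius-to-Fahrenheit example, the snippet below fits a straight line to a handful of (Celsius, Fahrenheit) pairs using ordinary least squares in plain Python. The sample values and the names `fit_line` and `predict` are illustrative assumptions and don't come from any library. The formula is never written into the code; the two parameters are recovered from the data.

```python
# Training data: Celsius inputs and the corresponding Fahrenheit outputs.
celsius = [-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit = [-40.0, 14.0, 32.0, 46.4, 59.0, 71.6, 100.4]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned parameters to an unseen input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(celsius, fahrenheit)   # here the "model" is just two numbers
print(model)                            # learned slope and intercept
print(predict(model, 100.0))            # prediction for an unseen value
```

With noise-free data such as this, the learned slope and intercept come out very close to 1.8 and 32. Real datasets are noisy, so the learned function is usually only an approximation.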
- What are the essentials that help an ML model learn?
There are many types (aka shapes/structures/architectures) of ML models. Typically, this structure is not selected automatically. The data scientist pre-selects the structure. Given data, the model learns within the confines of the chosen structure. We may say that the model is fine-tuning the parameters of its structure as it sees more and more data.
The model learns in iterations. Initially, it will make poor predictions, that is, the predicted outputs deviate from the actual outputs. As it sees more data, it gets better. Prediction error is quantified by a cost/loss function. Every model needs such a function to know how well it's learning and when to stop learning.
The next essential aspect of model training is the optimizer. It tells the model how to adjust its parameters with each iteration. Essentially, the optimizer attempts to minimize the loss function.
If results are poor, the data scientist may modify or even select a different structure. She may pre-process the input differently or focus on certain aspects of the input, called features. These decisions could be based on experience or analysis of wrong predictions.
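The three essentials above (structure, loss function, optimizer) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not how any particular framework implements training: the structure is a line with two parameters, the loss is mean squared error, the optimizer is plain gradient descent, and the toy data, learning rate and tolerance are made up for the example.

```python
# Toy dataset generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse_loss(w, b):
    """Loss function: mean squared error of the predictions w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Partial derivatives of the MSE loss with respect to w and b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def train(learning_rate=0.05, max_iterations=5000, tolerance=1e-10):
    """Optimizer: plain gradient descent with a loss-based stopping rule."""
    w, b = 0.0, 0.0                      # structure: a line with two parameters
    history = [mse_loss(w, b)]
    for _ in range(max_iterations):
        grad_w, grad_b = gradients(w, b)
        w -= learning_rate * grad_w      # step against the gradient
        b -= learning_rate * grad_b
        history.append(mse_loss(w, b))
        if history[-1] < tolerance:      # loss is small enough: stop learning
            break
    return w, b, history


w, b, history = train()
print(f"w={w:.4f}, b={b:.4f}, first loss={history[0]:.4f}, last loss={history[-1]:.2e}")
```

The `history` list records the loss after each iteration. Inspecting it is one simple way to see whether the model is still improving or whether a different structure or learning rate should be tried.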
- What possible structures, loss functions and optimizers are available to train an ML model?
Classical ML offers many possible model structures. For example, Scikit-Learn has model structures for regression, classification and clustering problems. Some of these include linear regression, logistic regression, Support Vector Machine (SVM), linear models trained with Stochastic Gradient Descent (SGD), nearest neighbour, Gaussian process, Naive Bayes, decision tree, ensemble methods, k-Means, and more.
For building neural networks, many architectures are possible: Feed-Forward Neural Network (FFNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Long Short Term Memory (LSTM), Autoencoder, Attention Network, and many more. In code, these can be built using building blocks such as convolution, pooling, padding, normalization, dropout, linear transforms, non-linear activations, and more.
TensorFlow supports many loss functions: BinaryCrossentropy, CategoricalCrossentropy, CosineSimilarity, KLDivergence, MeanAbsoluteError, MeanSquaredError, Poisson, SquaredHinge, and more. Among the optimizers are Adadelta, Adagrad, Adam, Adamax, Ftrl, Nadam, RMSprop, and SGD.
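To make a few of these loss names concrete, here is a plain-Python sketch of the formulas behind mean squared error, mean absolute error and binary cross-entropy. These are not TensorFlow's implementations, which operate on tensors and offer further options. The function names and the clipping constant `eps` are assumptions chosen for this example.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """y_true holds 0/1 labels; y_pred holds predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Regression-style losses on two predictions.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
mae = mean_absolute_error([3.0, 5.0], [2.0, 7.0])

# Classification loss: confident correct vs. confident wrong predictions.
bce_good = binary_crossentropy([1, 0], [0.9, 0.1])
bce_bad = binary_crossentropy([1, 0], [0.1, 0.9])
print(mse, mae, bce_good, bce_bad)
```

Regression problems commonly use the first two, while binary cross-entropy suits two-class classification, where confident wrong predictions are penalized far more than confident correct ones.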
- What exactly is saved in an ML model?
ML frameworks typically support different ways to save the model:
- Only Weights: Weights or parameters represent the model's current state. During training, we may wish to save checkpoints. A checkpoint is a snapshot of the model's current state. A checkpoint includes model weights, optimizer state, current epoch and training loss. For inference, we can create a fresh model and load the weights of a fully trained model.
- Only Architecture: Specifies the model's structure. If it's a neural network, this would be details of each layer and how they're connected. Data scientists can share model architecture this way, with each one training the model to suit their needs.
- Complete Model: This includes the model architecture, the weights, the optimizer state, and a set of losses and metrics. In PyTorch, saving the complete model is less flexible since the serialized data is bound to the specific classes and directory structure used when the model was saved.
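The three options above can be sketched with plain Python dictionaries and JSON. This is only a framework-neutral illustration: the layer names, values and field names are hypothetical, and real frameworks use their own binary formats rather than JSON for weights.

```python
import json

# Hypothetical tiny model: the architecture describes the structure,
# the weights hold the learned parameters.
architecture = {
    "type": "feed_forward",
    "layers": [
        {"name": "hidden", "units": 2, "activation": "relu"},
        {"name": "output", "units": 1, "activation": "linear"},
    ],
}
weights = {
    "hidden": {"kernel": [[0.5, -0.2], [0.1, 0.8]], "bias": [0.0, 0.1]},
    "output": {"kernel": [[1.5], [-0.7]], "bias": [0.3]},
}
optimizer_state = {"name": "sgd", "learning_rate": 0.01, "step": 1200}

# 1. Only weights, saved here as a training checkpoint with extra state.
checkpoint = json.dumps({
    "weights": weights,
    "optimizer_state": optimizer_state,
    "epoch": 12,
    "training_loss": 0.043,   # placeholder value
})

# 2. Only architecture: enough to rebuild an untrained model.
architecture_only = json.dumps(architecture)

# 3. Complete model: architecture, weights, optimizer state, losses, metrics.
complete_model = json.dumps({
    "architecture": architecture,
    "weights": weights,
    "optimizer_state": optimizer_state,
    "loss": "mean_squared_error",
    "metrics": ["mean_absolute_error"],
})

# For inference, rebuild the structure and load the trained weights into it.
restored = json.loads(complete_model)
print(sorted(restored.keys()))
```

Someone who receives only `architecture_only` must train the model themselves. Someone who receives `checkpoint` can resume training, provided they already have code that builds the same architecture.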
In Keras, when saving only weights or the complete model, *.tf and *.h5 file formats are applicable. YAML or JSON can be used to save only the architecture.
- Which are the formats in which ML models are saved?
Open Neural Network Exchange (ONNX) is an open format that enables interoperability. A model in ONNX can be used with various frameworks, tools, runtimes and compilers. ONNX also makes it easier to access hardware optimizations.
A number of ML frameworks are out there, each saving models in its own format. TensorFlow saves models as protocol buffer files with *.pb extension. PyTorch saves models with *.pt extension. Keras saves in HDF5 format with *.h5 extension. An older XML-based format supported by Scikit-Learn is Predictive Model Markup Language (PMML). SparkML uses MLeap format and files are packaged into a *.zip file. Apple's Core ML framework uses *.mlmodel file format.
In Python, Scikit-Learn adopts pickled Python objects with *.pkl extension. Joblib with *.joblib extension is an alternative that's faster than Pickle for large NumPy arrays. If XGBoost is used, then a model can be saved in *.bst, *.joblib or *.pkl formats.
With some formats, it's possible to save not just models but also pipelines composed of multiple models. Scikit-Learn is an example that can export pipelines in Joblib, Pickle, or PMML formats.
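As a small illustration of the pickled-object approach, the sketch below saves a stand-in "model" to a *.pkl file with Python's standard `pickle` module and loads it back. The dictionary of parameters is a hypothetical placeholder for a fitted estimator; a real Scikit-Learn estimator would be passed to `pickle.dump` in the same way.

```python
import os
import pickle
import tempfile

# Hypothetical trained parameters standing in for a fitted model object.
model = {"type": "linear", "slope": 1.8, "intercept": 32.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Serialize the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Deserialize it later, typically in a separate prediction process.
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = restored["slope"] * 100.0 + restored["intercept"]
print(prediction)
```

Pickle stores references to classes by name rather than the class code itself, so the code that defines a pickled model's class must be importable wherever the file is loaded.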
- What metadata could be useful along with an ML model?
Data scientists conduct multiple experiments to arrive at a suitable model. Without metadata and proper management of such metadata, it becomes difficult to reproduce the results and deploy the model into production. ML metadata also enables us to do auditing, compare models, understand provenance of artefacts, identify reusable steps for model building, and warn if data distribution in production deviates from training.
To facilitate these, metadata should include model type, types of features, pre-processing steps, hyperparameters, metrics, performance of training/test/validation steps, number of iterations, if early stopping was enabled, training time, and more.
A saved model (also called an exported or serialized model) will need to be deserialized when doing predictions. Often, the versions of packages or even the runtime will need to be the same as those used during serialization. Some recommend saving a reference to an immutable version of training data, version of source code that trained the model, versions of libraries and their dependencies, and the cross-validation score. For reproducible results across platform architectures, it's a good idea to deploy models within containers, such as Docker.
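One way to picture such metadata is as a small record stored next to the serialized model. The sketch below is an assumption-laden example, not a standard schema: the function `build_metadata`, the field names, and all values (metrics, dataset reference, code version) are placeholders invented for illustration.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_metadata(model_type, features, hyperparameters, metrics,
                   training_data_ref, code_version):
    """Collect details needed to reproduce and audit a trained model."""
    return {
        "model_type": model_type,
        "features": features,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "training_data_ref": training_data_ref,   # immutable dataset version
        "code_version": code_version,             # e.g. source control revision
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


# All values below are placeholders, not results from a real experiment.
metadata = build_metadata(
    model_type="linear_regression",
    features={"celsius": "float"},
    hyperparameters={"learning_rate": 0.05, "max_iterations": 5000,
                     "early_stopping": True},
    metrics={"train_mse": 0.0012, "validation_mse": 0.0015},
    training_data_ref="dataset-v1 (placeholder)",
    code_version="revision-id (placeholder)",
)
print(json.dumps(metadata, indent=2))
```

Recording the runtime version alongside the hyperparameters and metrics makes it easier to recreate a matching environment when the model is later deserialized.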
- Which are some useful tools when working with ML models?
There are tools to visualize an ML model. Examples include Netron and VisualDL. These display the model's computational graph. We can see data samples, histograms of tensors, precision-recall curves, ROC curves, and more. These can help us optimize the model better.
Since the ONNX format aids interoperability, there are converters that can convert from other formats to ONNX. One such tool is ONNXMLTools that supports many formats. It's also a wrapper for other converters such as keras2onnx, tf2onnx and skl2onnx. The ONNX GitHub code repository lists many more converters. Many formats can be converted to Apple Core ML's format using Core ML Tools. For Android, tf.lite.TFLiteConverter converts a Keras model to TFLite.
Sometimes converters are not required. For example, PyTorch can natively export to ONNX.
ONNX models themselves can be simplified and there are optimizers to do this. ONNX Optimizer is one tool. ONNX Simplifier is another, built using ONNX Optimizer. It basically looks at the whole graph and replaces redundant operators with their constant outputs. There's a ready-to-use online version of ONNX Simplifier.
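The idea of replacing operators with their constant outputs (often called constant folding) can be shown on a toy graph. The sketch below is not how ONNX Simplifier is implemented and does not use the ONNX format. The dictionary-based graph, the operator table `OPS`, and the functions `fold_constants` and `prune` are all assumptions made for illustration.

```python
# A toy computational graph: each node is (operator, inputs).
# Inputs are names of other nodes or the runtime input "x".
graph = {
    "two": ("const", 2.0),
    "three": ("const", 3.0),
    "scale": ("mul", ["two", "three"]),   # depends only on constants
    "out": ("mul", ["x", "scale"]),       # depends on the runtime input
}

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def fold_constants(graph):
    """Replace operators whose inputs are all constants with a constant node."""
    folded = dict(graph)
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(folded.items()):
            if op == "const":
                continue
            if all(a in folded and folded[a][0] == "const" for a in args):
                values = [folded[a][1] for a in args]
                folded[name] = ("const", OPS[op](*values))
                changed = True
    return folded


def prune(graph, output):
    """Drop nodes that the output no longer depends on."""
    keep, stack = set(), [output]
    while stack:
        name = stack.pop()
        if name in graph and name not in keep:
            keep.add(name)
            op, args = graph[name]
            if op != "const":
                stack.extend(args)
    return {name: node for name, node in graph.items() if name in keep}


simplified = prune(fold_constants(graph), "out")
print(simplified)
```

After folding, the multiplication of the two constants is computed once ahead of time, and the graph evaluated at inference has two nodes instead of four.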
Milestones
1952
At IBM, Arthur Samuel writes the first learning program. Applied to the game of checkers, the program is able to learn from mistakes and improve its gameplay with each new game. In 1959, Samuel popularizes the term Machine Learning in a paper titled Some Studies in Machine Learning Using the Game of Checkers.
1986
Rumelhart et al. publish the method of backpropagation and show how it can be used to optimize the weights of neurons in artificial neural networks. This kindles renewed interest in neural networks. Although backpropagation was invented in the 1960s and developed by Paul Werbos in 1974, it was ignored back then due to the general lack of interest in AI.
2006
Hinton et al. publish a paper showing how a network of many layers can be trained by smartly initializing the weights. This paper is later seen as the start of the Deep Learning movement, which is characterized by many layers, lots of training data, parallelized hardware and scalable algorithms. Subsequently, many DL frameworks are released, particularly in 2015.
2016
Vartak et al. propose ModelDB, a system for ML model management. Data scientists can use this to compare, explore or analyze models and pipelines. The system also manages metadata, quality metrics, and even training and test data. In general, from the mid-2010s we see interest in ML model management and platforms. Examples include Data Version Control (DVC) (2017), Kubeflow (2018), ArangoML Pipeline (2019), and TensorFlow Extended (TFX) (2019 public release).
2017
Microsoft and Facebook come together to announce Open Neural Network Exchange (ONNX). This is proposed as a common format for ML models. With ONNX, we obtain framework interoperability (developers can move their models across frameworks) and shared optimizations (hardware vendors and others can target ONNX for optimizations).
2019
While there are tools to convert from other formats to ONNX, one ML expert notes some limitations. For example, ATen operators in PyTorch are not supported in ONNX because they are not standardized in ONNX. However, it's still possible to export to ONNX by updating PyTorch source code, which is something only advanced users are likely to do.
2020
In an image classification task, a performance comparison of ONNX format with PyTorch format shows that ONNX is faster during inference. Improvements are higher at lower batch sizes. On another task, ONNX shows as much as 600% improvement over Scikit-Learn. Further improvements could be obtained by tuning ONNX for specific hardware.
References
- Advani, Vaishali. 2020. "What is Machine Learning? How Machine Learning Works and future of it?" Blog, Great Learning, April 29. Accessed 2020-12-31.
- Apple GitHub. 2020. "Core ML Tools." Apple, on GitHub, December 18. Accessed 2020-12-31.
- ArangoDB. 2019. "ArangoDB Extends Open Source Solution with ArangoML Pipeline; First Multi-Model Metadata Layer for Machine Learning Pipelines." News, ArangoDB, October 2. Accessed 2021-01-01.
- Boyd, Eric. 2017. "Microsoft and Facebook create open ecosystem for AI model interoperability." Blog, Microsoft Azure, September 7. Accessed 2021-01-01.
- Brownlee, Jason. 2016. "Save and Load Machine Learning Models in Python with scikit-learn." Machine Learning Mastery, June 8. Updated 2020-08-28. Accessed 2020-12-31.
- Dowling, Jim. 2019. "Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store." Towards Data Science, on Medium, October 26. Accessed 2020-12-31.
- Google Cloud. 2020a. "Exporting models for prediction." Docs, AI Platform, Google Cloud, December 14. Accessed 2020-12-31.
- Google Cloud. 2020b. "Getting model metadata." Docs, BigQuery ML, Google Cloud, November 16. Accessed 2020-12-31.
- Inkawhich, Matthew. 2020. "Saving and Loading Models." Tutorial, PyTorch, October 2. Accessed 2020-12-31.
- Janapati, Vishnuvardhan. 2020. "Part I: Saving and Loading of Keras Sequential and Functional Models." The Startup, on Medium, July 8. Accessed 2020-12-31.
- Joblib. 2019. "Persistence." Documentation, Joblib, May 29. Accessed 2020-12-31.
- Juarez, Seth, and Anna Soracco. 2019. "Machine Learning Models." AI Show, Channel 9, MSDN, October 1. Accessed 2020-12-31.
- Katsiapis, Konstantinos. 2020. "Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)." Blog, TensorFlow, September 25. Accessed 2021-01-01.
- Kurenkov, Andrey. 2020. "A Brief History of Neural Nets and Deep Learning." Skynet Today, September 27. Accessed 2021-01-01.
- Lewi, Jeremy, and David Aronchick. 2018. "Announcing Kubeflow 0.1." Blog, Kubernetes, May 4. Accessed 2021-01-01.
- Mao, Lei. 2019. "PyTorch Model Export to ONNX Failed Due to ATen." Blog, July 3. Accessed 2021-01-01.
- Marr, Bernard. 2016. "A Short History of Machine Learning -- Every Manager Should Read." Forbes, February 19. Accessed 2021-01-01.
- Maynard-Reid, Margaret. 2019. "E2E tf.Keras to TFLite to Android." On: Medium, September 7. Accessed 2020-12-31.
- Microsoft Docs. 2019. "What is a machine learning model?" Windows Machine Learning, Microsoft Docs, April 1. Accessed 2020-12-31.
- Nazrul, Syed Sadat. 2018. "Clustering Based Unsupervised Learning." Towards Data Science, on Medium, April 3. Accessed 2020-12-31.
- ONNX. 2020. "Homepage." ONNX. Accessed 2020-12-31.
- ONNX GitHub. 2020a. "ONNXMLTools." ONNX, on GitHub, December 19. Accessed 2020-12-31.
- ONNX GitHub. 2020b. "ONNX Optimizer." ONNX, on GitHub, December 25. Accessed 2020-12-31.
- PaddlePaddle GitHub. 2020. "VisualDL." PaddlePaddle, on GitHub, December 30. Accessed 2020-12-31.
- Patruno, Luigi. 2019. "Storing Metadata from Machine Learning Experiments." ML in Production, April 8. Accessed 2020-12-31.
- Petrov, Dmitry. 2020. "DVC 3 Years and 1.0 Pre-release." Blog, DVC, May 4. Accessed 2021-01-01.
- PyTorch. 2018. "torch.onnx." Documentation, v0.4.0, PyTorch, April. Accessed 2021-01-01.
- PyTorch. 2020. "torch.nn." Documentation, v1.7.1, PyTorch, December. Accessed 2021-01-01.
- Roeder, Lutz. 2020. "Netron." On: GitHub, December 31. Accessed 2020-12-31.
- Schad, Jörg. 2020. "From Data to Metadata for Machine Learning Platforms." insideBIGDATA, May 15. Accessed 2020-12-31.
- Shin, Terence. 2020. "All Machine Learning Models Explained in 6 Minutes." Towards Data Science, on Medium, January 3. Accessed 2020-12-31.
- TensorFlow. 2020a. "Module: tf.keras.losses." API Docs, TensorFlow, v2.4.0, September 12. Accessed 2021-01-01.
- TensorFlow. 2020b. "Module: tf.keras.optimizers." API Docs, TensorFlow, v2.4.0, December 14. Accessed 2021-01-01.
- TensorFlow GitHub. 2020. "l02c01_celsius_to_fahrenheit.ipynb." Examples, TensorFlow, on GitHub, September 9. Accessed 2020-12-31.
- UCI. 2020. "Iris Data Set." UCI Machine Learning Repository. Accessed 2020-12-31.
- Van Kuppevelt, D., C. Meijer, F. Huber, A. van der Ploeg, S. Georgievska, and V.T. van Hees. 2020. "Mcfly: Automated deep learning on time series." SoftwareX, Elsevier, June 12. Accessed 2021-01-01.
- Van Veen, F. and S. Leijnen. 2016. "The Neural Network Zoo." The Asimov Institute, September 14. Updated 2019-04-22. Accessed 2021-01-01.
- Vartak, Manasi, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, and Matei Zaharia. 2016. "ModelDB: A System for Machine Learning Model Management." HILDA'16, San Francisco, CA, USA, June 26. Accessed 2021-01-01.
- Wikipedia. 2020a. "Arthur Samuel." Wikipedia, December 15. Accessed 2021-01-01.
- Wikipedia. 2020b. "Machine learning." Wikipedia, December 31. Accessed 2021-01-01.
- Wikipedia. 2020c. "Comparison of deep learning software." Wikipedia, December 15. Accessed 2021-01-01.
- Xu, Faith, and Prabhat Roy. 2020. "Tutorial: Accelerate and Productionize ML Model Inferencing Using Open-Source Tools." Tutorial, ODSC, March 6. Accessed 2020-12-31.
- daquexian. 2020. "ONNX Simplifier." On: GitHub, December 25. Accessed 2020-12-31.
- scikit-learn. 2020. "9. Model persistence." Docs, v0.24.0, scikit-learn, December. Accessed 2020-12-31.
Further Reading
- Juarez, Seth, and Anna Soracco. 2019. "Machine Learning Models." AI Show, Channel 9, MSDN, October 1. Accessed 2020-12-31.
- Patruno, Luigi. 2019. "Storing Metadata from Machine Learning Experiments." ML in Production, April 8. Accessed 2020-12-31.
- Dowling, Jim. 2019. "Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store." Towards Data Science, on Medium, October 26. Accessed 2020-12-31.
- Katsiapis, Konstantinos. 2020. "Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)." Blog, TensorFlow, September 25. Accessed 2021-01-01.
See Also
- ML Model Debugging
- MLOps
- Machine Learning
- Open Neural Network Exchange
- Machine Learning as a Service
- Evaluation Metrics in Machine Learning