Machine Learning in Python
Python is the language most widely used in industry for Machine Learning across business domains. In the 2018 StackOverflow developer survey, data scientists voted it the ‘language most desirable to learn’.
Several factors influence the choice of language for ML implementation. Programming expertise in the organisation and interoperability with existing code or data frameworks might be reasons to stick to traditional languages such as C++ or Java. But newer languages such as Python and R bring plenty of advantages to Machine Learning.
Ease of coding, a strong supporting developer community, and excellent data manipulation features are the key reasons for Python’s suitability for Machine Learning.
What are the major steps in the ML process and their corresponding Python libraries?
- Step 1 – Collecting and preparing data
- Step 2 – Choosing the model or algorithm
- Step 3 – Training the model with historical data
- Step 4 – Making an outcome prediction from training data and comparing the accuracy with the test data
How is the data collection process (Step 1) handled in Python?
Data about the system is first collected from various sources. This raw data is typically unstructured, with unclear labels and several invalid or incorrect entries. It is then sorted, labelled, validated and stored in organized data structures.
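The cleaning described above could look like the following minimal sketch. It assumes the pandas library, which the text does not name for this step, and a hypothetical five-row table; the column names and the rule that negative ages are invalid are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records: inconsistent labels, a missing value, an invalid entry
raw = pd.DataFrame({
    "age": [34, np.nan, 29, -5, 41],
    "label": ["Yes", "yes ", "NO", "no", "Yes"],
})

# Normalise the labels to one consistent lower-case form
raw["label"] = raw["label"].str.strip().str.lower()

# Treat impossible ages as missing, then drop incomplete rows
raw.loc[raw["age"] < 0, "age"] = np.nan
clean = raw.dropna().reset_index(drop=True)

print(clean)
```

Dropping incomplete rows is only one possible policy; filling missing values with a column mean or median is a common alternative when data is scarce.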
How is the ML model chosen and implemented (Step 2) in Python?
Once the data is ready, it is divided into training data and test data. The appropriate ML algorithm is then chosen based on the nature of the data and the type of prediction sought. The size of the training data, the linearity of the data, and the number of parameters affecting the outcome are all important factors to consider. Linear regression, classification, SVM, clustering, and decision trees are some of the frequently used algorithms.
All popular algorithm implementations are present in the sklearn (scikit-learn) library of Python. It provides a list of supervised and unsupervised algorithms with a uniform interface for invocation, which makes programming easy.
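As a rough illustration of the split and of the uniform interface, the sketch below trains two different scikit-learn algorithms with exactly the same calls. The Iris dataset bundled with scikit-learn, the 25% test fraction and the two chosen algorithms are assumptions made for the example, not recommendations from the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two different algorithms, invoked through the same interface
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # train on the training split
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split

print(scores)
```

Because every estimator exposes the same `fit` and `score` methods, swapping one algorithm for another means changing a single line.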
How is the ML model trained with data (Step 3) using Python?
The next step is to feed the model with the training data and allow it to adjust its parameters incrementally. The training process involves learning how the values of the dependent variables follow from the values of the independent variables in the data set. Some variables are more relevant to the outcome than others.
The estimator.fit() function from the sklearn library takes the training X values together with their known Y values and fits the model to them, so that the model can later infer the Y value for new X values. In supervised learning, prediction is then done using either predictor.predict() or predictor.predict_proba() (the latter for classification problems).
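A minimal sketch of training and prediction with scikit-learn follows. The choice of `LogisticRegression` and of the bundled Iris dataset is an assumption for illustration; any scikit-learn classifier that offers `predict_proba` would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)           # learn parameters from the training data

y_pred = clf.predict(X_test)        # one class label per test sample
y_prob = clf.predict_proba(X_test)  # one probability per class, per test sample

print(y_pred[:5])
print(y_prob[:5].round(3))
```

Each row of `y_prob` sums to 1, and `predict` returns the class with the highest probability in that row.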
How is the final outcome predicted and evaluated (Step 4) in Python?
In the final step, the Y values derived by the model are verified and evaluated for accuracy. This is done by comparing each derived Y value against the actual Y value in the test data. A high degree of accuracy indicates that the model fits the problem well, meaning its predictions are reliable.
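The comparison can be sketched with scikit-learn's `accuracy_score`, which reports the fraction of derived values that match the actual ones. The two label lists below are hypothetical values made up for the example.

```python
from sklearn.metrics import accuracy_score

# Hypothetical actual labels from the test data and labels derived by a model
y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Six of the eight derived labels match the actual ones
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
```

Accuracy suits classification problems; for regression, error measures such as mean squared error are the usual counterpart.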
Once the outcomes are ready, the data needs to be presented in an easily human-readable form. This step is known as data visualization. It involves converting tabular data into visual representations such as graphs, maps, images, and so on.
Python takes care of prediction with the sklearn library, which also contains the accuracy calculation functions. Data visualization is done using the matplotlib library. Popular visualizations are scatter plots, time series, box plots, histograms and heatmaps.
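As an illustration of two of these plot types, the sketch below draws a scatter plot and a histogram with matplotlib. The data is synthetic, generated with NumPy, and the output file name `ml_plots.png` is an arbitrary choice for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x, plus some noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_title("Scatter plot")
ax_hist.hist(y, bins=20)
ax_hist.set_title("Histogram")
fig.tight_layout()
fig.savefig("ml_plots.png")
```

In an interactive session, the backend line can be dropped and `plt.show()` used instead of saving to a file.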
What are the advantages of Python over other programming languages for Machine Learning?
Listed below are some distinct features of Python that make it extremely suitable for ML:
- Ease of coding – Python is often described as a programmer’s delight, and its code footprint is among the smallest across high-level languages. Short, to-the-point syntax is a big plus, making programs easily readable.
- Dynamic typing – Python doesn’t require you to declare variable data types. So mid-way through your program, the same variable can change from holding an integer to a string to any other object. When large volumes of data are being collected, the data type of a field is sometimes unknown or changes frequently. Thus, the dynamic typing feature is useful in data preparation for Machine Learning.
- Strong developer community – The open and collaborative development of Python libraries has made it one of the fastest growing developer communities. Quick consultation for queries and excellent sample code sharing make it easy for beginners. Since ML is an emerging field, community support is a vital plus.
- Embeddable code – Python code can be embedded as script snippets into traditional C, C++ or Java programs. This makes feature extensions easy, as legacy code need not be scrapped altogether.
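The dynamic typing point above can be seen in a few lines of plain Python. The `to_number` helper and the sample field values are hypothetical, written only to show how fields of mixed type might be handled during data preparation.

```python
# The same name can be rebound to objects of different types
value = 42
print(type(value).__name__)   # int
value = "forty-two"
print(type(value).__name__)   # str
value = {"number": 42}
print(type(value).__name__)   # dict

# Raw fields often arrive as a mix of types; coerce whatever can be parsed
raw_fields = [17, "23", 4.5, "n/a"]

def to_number(field):
    """Return the field as a float, or None if it cannot be parsed."""
    try:
        return float(field)
    except (TypeError, ValueError):
        return None

cleaned = [to_number(f) for f in raw_fields]
print(cleaned)  # [17.0, 23.0, 4.5, None]
```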
What are the disadvantages of Python, and when should alternative programming languages be used for Machine Learning?
Python also comes with some inherent drawbacks. Dynamic typing makes efficient memory management and garbage collection harder. So whenever performance-critical features are to be implemented, it is better to write them in C/C++ and expose them as Python extensions. Python is not the ideal choice for real-time data analysis either, since it processes data more slowly than C/C++.
R is another extremely popular language for Machine Learning. Unlike Python, which is a general-purpose high-level language, R was designed especially for scientific computation and statistical analysis.
When an algorithm relies heavily on Java features (such as Android, IoT sensors, or JDBC extensions), it is quite natural to write the ML portion in Java as well.
How are advanced ML features supported in Python?
Apart from general purpose Machine Learning, there are other allied applications such as Deep Learning, Neural Networks, and Natural Language Processing. These form the basket of frameworks for Artificial Intelligence. Python is ahead of alternative languages in support for such advanced applications.
Among the ML frameworks or packages are TensorFlow, Keras, PyTorch, fastai, Caffe2, Scikit-Learn, and Chainer. Theano is an older DL framework that is no longer maintained. In many of these frameworks, the underlying implementation may be in another language, but they expose Python APIs for use in Python-based applications. Popular Deep Learning frameworks include TensorFlow from Google and PyTorch from Facebook. Although they are written primarily in C++, developers can use the Python APIs to invoke these frameworks. They do, however, also have interfaces in C++, R, and Java.
Could you mention some relevant tools for ML in Python?
For general scientific computing, SciPy is a handy package. Dask enables parallel computing. High Performance Analytics Toolkit (HPAT) is an alternative to Apache Spark to scale your apps to multiple clusters on the cloud. To optimize your code for performance, consider using Numba and Cython. To run efficiently on specific hardware, Intel offers Intel® Math Kernel Library (MKL) and Intel® Distribution for Python. Likewise, PyCUDA allows Python code to access Nvidia's CUDA parallel computation API.
Among the data visualization packages in Python are Matplotlib, Seaborn, ggplot, Bokeh, pygal, Plotly, geoplotlib, Gleam, missingno, and Leather. Many of them support a wide range of plot types, while others fulfil specific needs.
Several Python-based Deep Learning (DL) frameworks (TensorFlow, Chainer, Keras) have been released. TensorFlow, the most popular of them, is an open-source Python library for fast numerical computing. Developed by the Google Brain team, it comes with strong support for Machine Learning and Deep Learning. Its flexible numerical computation core is used across many other scientific domains.