Machine Learning in Python

Popular Python ML libraries. Source: Orekhova 2021.

Machine Learning algorithms enable a computer system to learn about the problem environment directly from historical/real-time data, without being explicitly programmed. Algorithm implementation can be done using any programming language such as C, C++, Java, Python, JavaScript or R.

However, Python is the language most widely used in industry for Machine Learning across business domains. In a 2018 StackOverflow survey, data scientists voted it the ‘language most desirable to learn’.

Several factors influence the choice of language for ML implementation. Programming expertise in the organisation and interoperability with existing code/data frameworks might be reasons to stick to traditional languages such as C++ or Java. But newer languages such as Python and R bring plenty of advantages to Machine Learning.

Ease of coding, strong supporting developer community, and excellent data manipulation features are key reasons for Python’s suitability for Machine Learning.

Discussion

  • What are major steps in the ML process and their corresponding Python libraries?

    Machine Learning programs essentially consist of four steps. In Python, each of these steps is supported by well-developed libraries:

    • Step 1 – Collecting and preparing data (numpy, pandas library)
    • Step 2 – Choosing the model or algorithm (sklearn library)
    • Step 3 – Training the model with historical data (sklearn library)
    • Step 4 – Predicting outcomes with the trained model and comparing them against the test data to measure accuracy (matplotlib, sklearn libraries)
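
The four steps can be sketched end to end as follows. This is a minimal illustration, not a recommended pipeline: the data is synthetic, and the column names x0, x1 and y, the 75/25 split and the choice of LinearRegression are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Step 1: collect and prepare data (synthetic values, for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x0": rng.uniform(0, 10, 100),
    "x1": rng.uniform(0, 10, 100),
})
df["y"] = 1.0 * df["x0"] + 2.0 * df["x1"] + 3.0  # noise-free linear target

# Hold back a quarter of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    df[["x0", "x1"]], df["y"], test_size=0.25, random_state=0
)

# Step 2: choose the model
model = LinearRegression()

# Step 3: train the model on the training data
model.fit(X_train, y_train)

# Step 4: predict on unseen test data and measure how well it matches
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 on test data: {r2:.3f}")
```

Because the target here is an exact linear function of the inputs, the score should be very close to 1; real data would normally score lower.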
  • How is the data collection process (Step 1) handled in Python?

    Data about the system is first collected from various sources. This data is often unstructured, with unclear labels and several invalid or incorrect entries. It is then sorted, labelled, validated and stored in organized data structures.

    The NumPy and Pandas libraries provide data structures such as ndarray (NumPy), and DataFrame and Series (Pandas), along with grouping, indexing and data manipulation functions.
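
As a rough sketch of this kind of clean-up, the snippet below tidies a small hand-made table. The column names city and temp_c and all the values are hypothetical; the point is only to show typical Pandas calls for validating, de-duplicating and grouping data before handing it to NumPy.

```python
import pandas as pd

# Hypothetical raw data with inconsistent labels and invalid entries
raw = pd.DataFrame({
    "city": ["Pune", "pune ", None, "Delhi", "Delhi"],
    "temp_c": ["31.5", "n/a", "28.0", "35.2", "35.2"],
})

clean = raw.copy()
# Normalise labels: trim whitespace and use consistent capitalisation
clean["city"] = clean["city"].str.strip().str.title()
# Convert text to numbers; anything unparseable becomes NaN
clean["temp_c"] = pd.to_numeric(clean["temp_c"], errors="coerce")
# Drop rows with missing values, then drop exact duplicates
clean = clean.dropna().drop_duplicates().reset_index(drop=True)

# Grouping and indexing on the cleaned DataFrame
mean_by_city = clean.groupby("city")["temp_c"].mean()

# Hand the numeric column to NumPy as an array
features = clean["temp_c"].to_numpy()

print(clean)
print(mean_by_city)
```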

  • How is the ML model chosen and implemented (Step 2) in Python?
    A cheatsheet from scikit-learn guides data scientists in choosing the right estimator for their ML problem. Source: scikit-learn 2018b.

    Once the data is ready, it is divided into training and test data. The appropriate ML algorithm is then chosen based on the nature of the data and the type of prediction we seek. The size of the training data, the linearity of the data, and the number of affecting parameters are all important factors to consider. Linear regression, classification, SVM, clustering, and decision trees are some of the frequently used algorithms.

    All popular algorithm implementations are present in the sklearn (scikit-learn) library of Python. It provides supervised and unsupervised algorithms behind a uniform interface, so switching from one algorithm to another needs little change in code.
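
The uniform interface means several candidate algorithms can be tried with the same few lines. The sketch below assumes the Iris dataset bundled with scikit-learn and an arbitrary 70/30 split; the three estimators are just examples, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Small example dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Divide the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Different algorithms, same fit/score interface
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)  # mean accuracy
    print(f"{name}: {scores[name]:.2f}")
```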

  • How is the ML model trained with data (Step 3) using Python?

    Feeding the model with all the gathered data and allowing it to interpret the parameters incrementally is the next step. The training process involves deducing values for dependent variables based on independent variable values in the data set. Some variables are more relevant to the outcome than the others.

    The estimator.fit() function from the sklearn library takes the training X values together with their known Y values and fits the model's parameters to them. Once trained, a supervised model infers Y values for new X values using predictor.predict(), or predictor.predict_proba() to obtain class probabilities in classification problems.
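
A small sketch of training followed by prediction, using scikit-learn's LogisticRegression. The data (hours studied versus pass/fail) is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: hours studied -> failed (0) or passed (1)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# fit() learns the model parameters from training X and known Y
clf = LogisticRegression().fit(X_train, y_train)

# New, unseen inputs
X_new = np.array([[0.5], [9.5]])

labels = clf.predict(X_new)        # hard class labels
probs = clf.predict_proba(X_new)   # one probability per class, per row

print(labels)
print(probs.round(3))
```

Each row of probs sums to 1, and predict() returns the class with the highest probability in that row.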

  • How is the final outcome predicted and evaluated (Step 4) in Python?
    A heatmap visualization generated with geoplotlib package. Source: Bierly 2016.

    In the final step, the derived Y values from the model are verified and evaluated for accuracy. This is done by comparing the derived Y value against the Y value from test data. A high degree of accuracy on the test data indicates a good fit of the model to the problem, meaning the predictions are likely to be reliable.

    Once the outcomes are ready, the data needs to be presented in an easy, human-readable form. This step is known as data visualization. It involves converting tabular data into visual representations such as graphs, maps, images, and so on.

    Python takes care of prediction with the sklearn library. Its metrics module also provides accuracy calculation functions. Data visualization is done using the matplotlib library. Popular visualizations are scatter plots, time series, box plots, histograms and heatmaps. See the code snippet in the Sample Code section.
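
The accuracy comparison can be sketched with functions from sklearn.metrics. The two label arrays below are made up so that the arithmetic is easy to follow; in practice y_pred would come from a trained model's predict() call.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels from the test data and labels derived by a model
y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Fraction of predictions that match the test labels: 6 of 8 here
acc = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy: {acc:.2f}")
print(cm)
```

The confusion matrix shows where the two mistakes fall: one class-0 sample predicted as 1 and one class-1 sample predicted as 0.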

  • What are the advantages of Python over other programming languages for Machine Learning?

    Listed below are some distinct features of Python that make it extremely suitable for ML:

    • Ease of coding – Often described as a programmer’s delight, Python’s code footprint is among the smallest across high level languages. Short, to-the-point syntax is a big plus, making programs easily readable.
    • Dynamic typing – Python doesn’t require you to declare variable data types. So mid-way through your program, the same variable can hold an integer, then a string, then any other object. When large volumes of data are being collected, the data type of a value is sometimes unknown or changes frequently. Thus, the dynamic typing feature is useful in data preparation for Machine Learning.
    • Strong developer community – The open and collaborative development of Python libraries has made it one of the fastest growing developer communities. Quick consultation for queries and excellent sample code sharing make it easy for beginners. Since ML is an emerging field, community support is a vital plus.
    • Embeddable code – Python code can be embedded as script snippets into traditional C, C++ or Java programs. This makes feature extensions easy, as legacy code need not be scrapped altogether.
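
To illustrate the dynamic typing point above, here is a short sketch in plain Python. The helper to_float and the sample readings are hypothetical; they show one way mixed-type input might be normalised during data preparation.

```python
# The same name can be rebound to values of different types
value = 42
types_seen = [type(value).__name__]
value = "forty-two"
types_seen.append(type(value).__name__)
value = [4, 2]
types_seen.append(type(value).__name__)
print(types_seen)


def to_float(raw):
    """Return raw as a float, or None if it cannot be converted."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


# Collected data often arrives with mixed or unexpected types
readings = [21, "22.5", 23.0, "n/a", None]
parsed = [to_float(r) for r in readings]
print(parsed)
```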
  • What are the disadvantages of Python and when to use alternative programming languages for Machine Learning?

    Python also comes with some inherent drawbacks. Dynamic typing makes efficient memory management and garbage collection harder. So performance-critical features are better written in C/C++ and exposed as Python extensions. Python is not the ideal choice for real-time data analysis either, since it processes data more slowly than C/C++.
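
One common way to get compiled-code speed without writing an extension yourself is to use a library such as NumPy, whose array operations run in compiled code. The sketch below compares a pure-Python loop with a NumPy call for the same computation. The array size and repeat count are arbitrary assumptions, and the timings will vary by machine.

```python
import timeit

import numpy as np

N = 100_000
values = list(range(N))
array = np.arange(N, dtype=np.int64)


def sum_squares_loop(items):
    """Sum of squares using an interpreted Python loop."""
    total = 0
    for v in items:
        total += v * v
    return total


def sum_squares_numpy(arr):
    """Sum of squares delegated to NumPy's compiled dot product."""
    return int(np.dot(arr, arr))


loop_result = sum_squares_loop(values)
numpy_result = sum_squares_numpy(array)

# Time a few repetitions of each approach
loop_time = timeit.timeit(lambda: sum_squares_loop(values), number=5)
numpy_time = timeit.timeit(lambda: sum_squares_numpy(array), number=5)

print(f"same result: {loop_result == numpy_result}")
print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```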

    R is another extremely popular language for Machine Learning as it is a language designed especially for scientific computation and statistical analysis, unlike Python which is a general purpose high-level language.

    When an application relies heavily on the Java ecosystem (such as Android, IoT sensors, JDBC extensions), it is quite natural to write the ML portion in Java as well.

  • How are advanced ML features supported in Python?
    ML Libraries and Frameworks in Python. Source: https://www.ipsr.edu.in/wp-content/uploads/2018/10/image1.png

    Apart from general purpose Machine Learning, there are other allied applications such as Deep Learning, Neural Networks, and Natural Language Processing. These form the basket of frameworks for Artificial Intelligence. Python is ahead of alternative languages in support for such advanced applications.

    Among the ML frameworks or packages are TensorFlow, Keras, PyTorch, fastai, Caffe2, Scikit-Learn, and Chainer. Theano is an old DL framework that's no longer maintained. In many of these frameworks, the underlying implementation may be in another language, but they expose Python APIs for use in Python-based applications. Popular Deep Learning frameworks include TensorFlow from Google and PyTorch from Facebook. Although they are primarily written in C++, developers can use the Python APIs to invoke these frameworks. They do, however, also have interfaces in C++, R, and Java.

    For Natural Language Processing (NLP), the NLTK Python library is widely used. There's also Pattern for data mining, NLP, and ML.

    Image processing functions are supported by OpenCV, scikit-image, and Pillow frameworks.

  • Could you mention some relevant tools for ML in Python?

    For general scientific computing, SciPy is a handy package. Dask enables parallel computing. High Performance Analytics Toolkit (HPAT) is an alternative to Apache Spark to scale your apps to multiple clusters on the cloud. To optimize your code for performance, consider using Numba and Cython. To run efficiently on specific hardware, Intel offers Intel® Math Kernel Library (MKL) and Intel® Distribution for Python. Likewise, PyCUDA allows Python code to access Nvidia's CUDA parallel computation API.

    Among the IDEs suitable for ML applications in Python are Jupyter/IPython Notebook, PyCharm (free or paid), Spyder, Rodeo, and Geany. Rodeo has been built specifically for ML and Data Science.

    Among the data visualization packages in Python are Matplotlib, Seaborn, ggplot, Bokeh, pygal, Plotly, geoplotlib, Gleam, missingno, and Leather. Many support a wide range of plot types, while others fulfil specific needs.

Milestones

1990

This is the decade when Machine Learning algorithms move from being rule driven to data driven.

2000

Python 2.0 is released. The development of the language becomes more open and community driven.

2006

The NumPy package for scientific computing in Python is released. Statistical modelling for machine learning becomes simpler with this package.

2008

Python 3.0 is released. Memory management becomes more efficient. This helps us process large data structures, as is common in ML.

2008

Pandas package of Python, for data manipulation and processing, is released. This simplifies data clean up and pre-processing, which are the first steps in a typical ML data pipeline.

2010

Scikit-learn package of Python sees its public release. It covers all the major ML algorithm implementations such as for regression, classification and clustering. This library simplifies ML programming for developers.

2015
ML Deep Learning Frameworks in Python. Source: Fojo et al. 2018, slide 29.

Several Python-based Deep Learning (DL) frameworks (TensorFlow, Chainer, Keras) are released. TensorFlow, the most popular of them, is an open source Python library for fast numerical computing. Developed by the Google Brain team, it comes with strong support for Machine Learning and Deep Learning. Its flexible numerical computation core is used across many other scientific domains.

2017

Facebook brings GPU-powered Machine Learning to Python. PyTorch is a Python implementation of the Torch Machine Learning framework for deep neural network programming. It can complement or partly replace existing Python packages for math and stats, such as NumPy.

Sample Code

  • # Sample code for an ML algorithm - Linear Regression
    # Source - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
    # Accessed - 2019-02-07

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training data generated from y = 1 * x_0 + 2 * x_1 + 3
    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3

    # Train the model on X and y
    reg = LinearRegression().fit(X, y)

    print(reg.score(X, y))                  # R^2 on the training data, 1.0 here
    print(reg.coef_)                        # approximately [1. 2.]
    print(reg.intercept_)                   # approximately 3.0
    print(reg.predict(np.array([[3, 5]])))  # approximately [16.]
     

References

  1. Bierly, Melissa. 2016. "10 Useful Python Data Visualization Libraries for Any Discipline." Mode Blog, June 08. Accessed 2019-02-07.
  2. Carvalho, Neville. 2016. "Python Memory Issues: Tips and Tricks." DZone, October 04. Accessed 2019-02-07.
  3. Fojo, Dani, Victor Campos, and Xavier Giro-Nieto. 2018. "Reproducing and Analyzing Adaptive Computation Time in PyTorch and TensorFlow." Universitat Politècnica de Catalunya, on SlideShare, February 6. Accessed 2020-08-01.
  4. Jain, Rashmi. 2016. "5 Free Python IDE for Machine Learning." HackerEarth Blog, December 22. Accessed 2019-02-07.
  5. Marr, Bernard. 2016. "A Short History of Machine Learning -- Every Manager Should Read." Forbes, February 19. Accessed 2020-08-01.
  6. Matplotlib. 2018. "Gallery." Matplotlib, ver. 3.0.2, doc version v3.0.2-2-g91e2d00a8, November 11. Accessed 2019-02-07.
  7. Mindfire Solutions. 2017. "Advantages and Disadvantages of Python Programming Language." Mindfire Solutions, on Medium, April 24. Accessed 2019-02-07.
  8. NVIDIA. 2019. "PyCUDA." NVIDIA Developer. Accessed 2019-02-07.
  9. OpenCV Tutorials. 2013. "Image Processing in Python using OpenCV library." OpenCV Tutorials, on Read The Docs, September 21. Accessed 2019-02-07.
  10. Orekhova, Kate. 2021. "10 Best Python Libraries for Machine Learning in 2021." Blog, Selected Firms, October 11. Updated 2021-12-30. Accessed 2022-03-26.
  11. Pandas. 2019. "Release Notes." Pandas, NUMFOCUS Project. Accessed 2019-02-07.
  12. Papadopoulou, Eirini-Eleni. 2018. "Top 10 Python tools for machine learning and data science." JAXenter, April 25. Accessed 2019-02-07.
  13. Peng, Tony. 2017. "RIP Theano." Synced, September 29. Accessed 2019-02-07.
  14. Pillow. 2019. "Pillow docs – The Python Imaging Library." Pillow, ver. f38f01bb, on Read The Docs. Accessed 2019-02-07.
  15. PyPI. 2019. "Numpy Release History." PyPI, Python Software Foundation. Accessed 2019-02-07.
  16. Python Docs. 2019. "General Python FAQ." Python, ver. 3.7.0, Python Software Foundation, February 07. Accessed 2019-02-07.
  17. Python Docs. 2019b. "Python Documentation by Version." Python Docs, Python Software Foundation. Accessed 2019-02-07.
  18. Pythonspot. 2017. "NLTK docs for NLP in Python." Accessed 2019-02-07.
  19. Robert C. 2016. "Installing Intel® Distribution for Python* and Intel® Performance Libraries with Anaconda." Developer Zone, Intel Software, May 28. Accessed 2019-02-07.
  20. Scikit-learn. 2019. "Release History." Scikit-learn. Accessed 2019-02-07.
  21. Scikit-learn. 2020. "Developing scikit-learn estimators." v0.23.1, Scikit-learn. Accessed 2019-02-07.
  22. StackOverflow. 2018. "Developer Survey Results 2018." StackOverflow. Accessed 2019-02-07.
  23. Sullivan, John. 2018. "Data Cleaning with Python and Pandas: Detecting Missing Values." Towards Data Science, on Medium, October 05. Accessed 2019-02-07.
  24. Wikipedia. 2020. "Comparison of deep-learning software." Wikipedia, July 26. Accessed 2020-08-01.
  25. Yegulalp, Serdar. 2017. "Facebook brings GPU-powered machine learning to Python." InfoWorld, January 19. Accessed 2020-08-01.
  26. Yufeng G. 2017. "The 7 Steps of Machine Learning." Medium, September 01. Accessed 2019-02-07.
  27. scikit-image. 2019. "scikit-image: Homepage." scikit-image. Accessed 2019-02-07.
  28. scikit-learn. 2018a. "sklearn.linear_model.LinearRegression." scikit-learn v0.20.2, December 20. Accessed 2019-02-07.
  29. scikit-learn. 2018b. "Choosing the right estimator." Tutorial, scikit-learn, v0.20.2. Accessed 2019-02-07.

Further Reading

  1. Raschka, Sebastian. 2015. Python Machine Learning. Birmingham - Mumbai: Packt Publishing.
  2. TowardsDataScience. 2019. "TowardsDataScience - Machine Learning." Accessed 2019-02-07.
  3. Real Python. 2019. "Real Python Tutorials - Machine Learning." Accessed 2019-02-07.


Cite As

Devopedia. 2022. "Machine Learning in Python." Version 11, March 26. Accessed 2024-06-25. https://devopedia.org/machine-learning-in-python
Contributed by
2 authors


Last updated on
2022-03-26 08:46:30