
How to Create an Endpoint with a Machine Learning Model – The Basic Way

Sometimes you just want to deploy an ML model without using fancy tools.

When you took your first machine learning lessons, one of your first thoughts was probably: “OK, now that I've trained my model, how can I use it in a real application?”

If you work with ML-related cloud tools (AWS SageMaker, Google Cloud Vertex AI, Azure Machine Learning, etc.), you know how easy it can be to configure and deploy a machine learning model so it runs in inference mode and starts making predictions with real data. However, that may not be an option for different reasons, for example:

  • You work in a company where paying for a cloud-based ML service is not worth it, probably because the project is too small, just a PoC, or not meant to stay alive for a long time
  • Your infrastructure is on-premise
  • The deployed model is hosted in an intranet, so there’s no internet connection
  • All of the previous points, plus a limited budget

Besides all the previous points, it’s a good idea to know how these services work behind the scenes, especially if this big world is new to you.

This blog post covers the different parts of a machine learning solution, from using the dataset to train the model up to having an endpoint running on your local computer. Putting this endpoint in a production environment is up to you, since it will depend on your infrastructure.

For this tutorial, we’ll create an endpoint that will predict the type of iris plant based on the sepal and petal measurements (yes, the classic iris problem with the classic iris dataset). You’ll need Python for sure. I’ll be using scikit-learn as the machine learning library and Flask as the web server. You can use the libraries of your choice (TensorFlow, PyTorch, Django…), since this post focuses on how to deploy the trained model.

You can find the source code and instructions on how to set up this project in my GitHub repository.

Creating the ML model

As mentioned before, our dataset contains data on 150 iris plants. The columns are:

  • sepal_length, in cm
  • sepal_width, in cm
  • petal_length, in cm
  • petal_width, in cm
  • class, which can be one of these three:
    • Iris-setosa
    • Iris-versicolour
    • Iris-virginica

The first 4 columns will be our independent variables and the class column our dependent variable:
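The repository has the exact code; a minimal sketch of this step, assuming the data is available as a CSV file named iris.csv with the column names listed above, could look like this:

import pandas as pd

# Load the dataset (the file name iris.csv is an assumption of this sketch)
dataset = pd.read_csv("iris.csv")

# Independent variables: the four measurements
X = dataset[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values

# Dependent variable: the plant class
y = dataset["class"].values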

Here comes the important part. As you know, data has some problems if we want to start training and predicting with it just as it is:

  • If you don’t scale numeric values, some features can carry more weight than others, for example salary vs. age, even though in some cases age could be more important.
  • Categorical variables need to be converted into dummy variables, while avoiding the dummy variable trap.

These and other transformations on the data are easy to perform, but they need to be consistent: for example, you have to remember the column order of the dummy variables, or know which dummy variable was omitted. To solve this problem, you’ll need to preserve all the transformers you use during training, and, in the same way, you’ll need to preserve the model itself.

In this sample project, we are using only one scaler:
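A minimal sketch of this step with scikit-learn's StandardScaler, assuming a standard train/test split (the exact code is in the repository):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out part of the data to evaluate the model later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training data only, then reuse it for any new data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)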

But for any extra transformer, keep an individual object. This means that, for example, if you have two columns with categorical values, you’re going to need two OneHotEncoder objects, one per column, as in the sketch below.
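The iris dataset has no categorical columns, so this project doesn’t need this step, but a hypothetical sketch of keeping one encoder per column could look like this (the data and column names are made up for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with two categorical columns (not part of the iris project)
df = pd.DataFrame({"color": ["red", "blue", "green"],
                   "region": ["north", "south", "east"]})

# One encoder per column, so each one can be serialized and reloaded on its own;
# drop="first" omits one dummy variable to avoid the dummy variable trap
encoder_color = OneHotEncoder(drop="first")
encoder_region = OneHotEncoder(drop="first")

color_dummies = encoder_color.fit_transform(df[["color"]])
region_dummies = encoder_region.fit_transform(df[["region"]])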

We need to do the same with our model, in this case, our classifier:
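The repository defines the actual classifier; as a stand-in for this sketch, a K-nearest neighbors classifier trained on the scaled data:

from sklearn.neighbors import KNeighborsClassifier

# Train the classifier on the scaled training data
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Quick check on the held-out test set
print("Test accuracy:", classifier.score(X_test, y_test))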

Once you have trained, validated, and tested the model, and you are happy with it, it’s the moment of truth: serializing the model and the scaler. To do that, we’ll use the dump function from the joblib module:
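A sketch of the serialization step (the file names are assumptions; what matters is that both the scaler and the model are saved):

from joblib import dump

# Persist the fitted scaler and the trained classifier so the endpoint can
# apply exactly the same transformation at prediction time
dump(scaler, "scaler.joblib")
dump(classifier, "classifier.joblib")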

Creating the endpoint

Now that we’ve trained and saved our model, let’s create an API endpoint that consumes it: it will receive the plant data and return the prediction in JSON format.

This endpoint will be created using Flask. If you have never used it before, find more details on how to configure it in the project repository or the official Flask documentation.
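The actual application file lives in the repository; below is a minimal sketch of what it might look like, assuming the four measurements arrive as query parameters and that the serialized files keep the names used above (the line numbers mentioned in the remarks refer to the listing in the repository):

from flask import Flask, request, jsonify
from joblib import load

app = Flask(__name__)

@app.route("/predict")
def predict():
    # Read the four measurements sent by the client
    plant = [[
        float(request.args["sepal_length"]),
        float(request.args["sepal_width"]),
        float(request.args["petal_length"]),
        float(request.args["petal_width"]),
    ]]

    # Deserialize the scaler and the trained model
    scaler = load("scaler.joblib")
    classifier = load("classifier.joblib")

    # Apply the same scaling used during training, then predict
    plant_scaled = scaler.transform(plant)
    prediction = classifier.predict(plant_scaled)

    # Return the predicted class as JSON
    return jsonify({"class": prediction[0]})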

The important things to point out here:

  • In line 2, we import the load function from the joblib module. This function is how we’ll deserialize the scaler and the model.
  • In line 19, we create a two-dimensional array named plant. It contains the four independent variables we need in order to predict which iris plant these features belong to.
  • In line 21, we load the standard scaler. If you have serialized more scalers, one-hot encoders, or any other objects, this is the time to load them.
  • In line 22 we do the same for the trained model.
  • In line 24 we scale the feature values.
  • In line 25 the model makes the prediction for the plant in the array; it returns an array with all the predictions (in this case only one).
  • In line 27, we return the predicted class.

Testing the solution

Now that we have everything set up, it’s time to test it all together.

In the repository, the endpoint folder includes a file called endpoint_tester.py. It’s a very basic file that makes a call to the local endpoint.
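A sketch of what such a script might look like, assuming the endpoint from the sketch above is reachable at http://127.0.0.1:5000/predict (the URL and parameter names are assumptions):

import requests

# Ask the user for the four plant measurements
sepal_length = input("Sepal length (cm): ")
sepal_width = input("Sepal width (cm): ")
petal_length = input("Petal length (cm): ")
petal_width = input("Petal width (cm): ")

# Call the local endpoint with the entered values
response = requests.get(
    "http://127.0.0.1:5000/predict",
    params={
        "sepal_length": sepal_length,
        "sepal_width": sepal_width,
        "petal_length": petal_length,
        "petal_width": petal_width,
    },
)

# Display the predicted plant type returned by the endpoint
print("Predicted class:", response.json()["class"])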

First, we need to run the Flask application. Open a terminal/command line and navigate to the endpoint folder to configure and run Flask:

For Windows:

> set FLASK_APP=app
> set FLASK_ENV=development

For macOS and Linux:

$ export FLASK_APP=app
$ export FLASK_ENV=development

After that, in the terminal/command line run the next command:

flask run

Keep the console open; it will show a message similar to this:

Flask application running in a local environment

In another terminal/command line, navigate as well to the endpoint folder and run the next command:

python endpoint_tester.py

This command will show you a prompt asking for the plant properties, and after you enter all four, it will call the endpoint and display the plant type.

Conclusion

Putting a trained model into inference mode is not complicated, even when you need to do it manually. Nowadays this approach is not the most popular, but it helps a lot to understand the full lifecycle, since this part is often not taught in ML courses.


Spyder vs Jupyter Notebook – Which IDE is better?

Spoiler: there’s no better IDE, just the one that satisfies your needs.

Writing code for data science is not the same as writing it for software development. By this I mean that, although we may use the same programming language, we are trying to achieve different goals.

Coding is a fundamental task that shows up in a wide range of areas of the data science field. For example, you’ll write code to train a machine learning model, do an EDA, perform statistical analysis, process data with ETL/ELT tools, or build custom visualizations for dashboards… just to mention a few.

In this post, we will focus on one programming language, Python, and on two of its most popular IDEs, Spyder and Jupyter Notebook, and see what the advantages and disadvantages of using them are.

The good news is that with Anaconda you can install both IDEs and manage and launch them either from the command line or from Anaconda Navigator.

Spyder

Source: spyder-ide.org

Spyder is an acronym for Scientific Python Development Environment. It’s an open-source IDE maintained by a community, and it is mostly used by data scientists thanks to its features and plugins, which are designed especially with them in mind. It is written in Python.

Pros:

  • An IDE with a more classic layout (the text editor, file explorer, plots pane, variable explorer, and other sections are arranged as in many other IDEs)
  • Since Spyder 4, it integrates Kite, an autocomplete tool that uses AI to suggest items based on the libraries’ documentation and on the objects most used by the person writing the code
  • You can run a single file, which you can do in every IDE (and, actually, you don’t need an IDE for that, since you can run a whole file from the command line). The interesting part is that you can run selected pieces of code, and Spyder keeps track of all the variables and objects in the variable explorer, so you can see the values contained in every object without having to type the object’s name in the console and read the result there (personally, this is the biggest advantage of Spyder)
  • Since the screen is divided into sections, things are easy to find without scrolling up and down looking for, say, a plot or a line of code. These sections also let you copy large blocks of code, export all plots, and perform similar actions enabled by this design

Cons:

  • The user interface can be a bit complex for beginners
  • Kite is really good, but in my experience it can also be annoying when you navigate through the code with the keyboard arrows: you want to go one line up or down, Kite’s prompt pops up, and you end up navigating through its suggestions instead
  • In recent versions of Spyder, there are times when you can feel the IDE freeze for a few milliseconds before responding. If you performed more than one action at that moment, all of those actions are executed after the freeze
  • Working with some frameworks such as PySpark can make Spyder really slow, although at least you keep the advantages described above

Jupyter Notebook

Source: jupyter.org

Just like Spyder, Jupyter Notebook is an open-source project maintained by a community, in this case Project Jupyter. It’s a web-based IDE, which allows many companies to have their own implementation of it and offer it as a service.

If you are used to working with notebooks during your data science projects, you probably know and use Jupyter Notebook or another tool based on it.

A notebook is built from blocks, which can contain text, code, or other types of content. The goal of a notebook is more specific than Spyder’s, which you can use for many different tasks: in Jupyter Notebook, the goal is to show, in an explainable way, how data is used to achieve a result, whether that is a machine learning model, an EDA, a statistical analysis, etc.

Pros:

  • It’s easy to work with: basically, you write a block of code and you run it
  • It’s a perfect IDE for explaining how code works. Although you can write comments in code, in a notebook you can mix blocks of code with blocks of markup that create titles, headers, and other formatted text. Personally, I consider this the most important Jupyter Notebook feature
  • Many enterprise companies and services have their own versions of Jupyter Notebook, such as AWS (Amazon SageMaker Notebooks) or Google (Google Colab), so if you’re familiar with this tool, it’ll be easier for you to work on real-life projects

Cons:

  • The use of notebooks is limited to data research (which is great), but don’t expect to create an ETL job or a FaaS here, since those tasks are more backend-oriented
  • Since it’s a web-based IDE, things can get stuck in the background without you noticing. If this happens, you’ll need to restart the kernel and run your code again

Conclusion

In the end, both IDEs are amazing and will help you to achieve your goals. This is more a matter of which one makes you feel more comfortable or which is used by your organization.

Something to mention is that I’ve only highlighted the main features of each IDE; there are more to explore and take advantage of, which is a big plus for any IDE. And, one last thing, the design of these IDEs is not that strict: for example, you can run notebooks in Spyder with a plugin, and you can write code as a single script in Jupyter Notebook through its text editor.