What is beyond our model?

We explore an architecture for a complete software solution: deploying a Python model, from the user interface to the connection with the model and the retrieval of its results.


For a long time, and especially in our exploration stage in the field of machine learning challenges, we focused on the search for a model that, starting from the data in a particular state, can act to obtain a result within the ranges of accuracy we consider acceptable.
Well, the field of work to obtain the model is so extensive, complex, and fascinating in itself that it can make us lose sight of the rest of the components involved in a software solution.
Therefore, our model will be surrounded by another set of problems as complex, extensive, and fascinating as the model itself, for example:

  • The extraction and transformation of data to have a dataset similar to the one we started with
  • The deployment of the model in production
  • Data collection and subsequent validation to monitor and correct our initial model.

We will then meet the fields of ETL, Data Wrangling and MLOps.
As data scientists, it is no longer enough to find a model that solves the challenge; we have to at least think about the feasibility of all these related areas.
Let us suppose for a moment that we manage to secure the data collection pipeline.
Let us suppose we defend ourselves during our feature engineering pipeline from new cases or anomalous data.
Let us suppose we get an efficient way to put it into production and store the datasets, and we can supervise the model behavior.
Ugh! It seems that we have everything planned. Right?

Probably not!

Possibly we lack software that interacts with the user to present the prediction results. Sometimes our model will connect with complex environments, and it is necessary to create an API between them so that the model can display its results; sometimes it will be necessary to develop software to use our model.
This article will present a complete solution for a test case; the components and tools used can be replaced with any other type of language, library, and architecture.

The use case

We start from a model developed in Python, which receives a CSV and returns an Excel file, adding a column with the prediction result.

We need to deploy this model on a server and develop a user interface that will:

  • Log the user in.
  • Allow the user to select a CSV file in the appropriate format from any source and upload it to the model.
  • Display a panel with all the user's uploaded files and their status (processed, failed, in progress).
  • Notify the user when the model has finished processing a file.

Additionally, we will wrap the model in an API so that other software can call it as a service.

Solution structure

The proposed solution consists of six components and three access points.

The components are in order of appearance:
1) User Interface (WordPress + PHP)
2) Database (MySQL)
3) Repository in the server to store the input and output datasets
4) Scheduler (cronJob) to schedule the search of pending jobs and invoke the model
5) API for model execution (Python — Uvicorn — Starlette)
6) Machine learning model in Python

The access points to the system are:
a) User, when uploading a new dataset to the application
b) Scheduler, when detecting a job to be processed
c) Someone with the right credentials who wants to invoke the model's API directly to process a dataset already registered in the database


The idea is that a user uploads, through the interface, a CSV to be processed.
Component 1 handles that upload and checks that the CSV is correct. It saves a job in "pending" status in the database and deposits the dataset in the input repository.
Then, every "x" time, the scheduler checks whether there is a pending job; if there is, it takes it, changes it to the "in process" state, and invokes the function that runs the model. The result of the process is deposited in the output repository, and the task's status is updated to "Processed."
From the user panel (component #1), you can detect jobs with errors, remove them from the job list, archive them, or retry them.
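The lifecycle described above can be sketched as a small transition table. The status strings follow the SQL queries shown later in this article; the ERROR → PENDING retry transition is an assumption based on the panel's retry option:

```python
# Sketch of the job lifecycle handled by the scheduler and the user panel.
# Status names follow the queries used later in this article; the
# ERROR -> PENDING transition is assumed from the panel's retry option.
ALLOWED_TRANSITIONS = {
    'PENDING':    {'IN PROCESS'},           # the scheduler picks up the job
    'IN PROCESS': {'PROCESS OK', 'ERROR'},  # the model finished or failed
    'ERROR':      {'PENDING'},              # the user retries from the panel
}

def change_status(current, new):
    """Validate a status change before writing it to the task table."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f'illegal transition {current} -> {new}')
    return new
```

Centralizing the allowed transitions in one place keeps the scheduler and the panel from writing inconsistent states to the same table.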

The components

1 — User panel

The user panel is developed in WordPress®; a plugin handles user access and management:


After logging in, the panel presents a header where the user can select a dataset; its structure is validated before the job is recorded in the database as pending and the file is left in the input repository.

2 — The database


The selected database was MySQL.

The tasks are recorded in the table "wp_records".
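As an illustration only, here is a minimal sketch of what "wp_records" might look like; the column names are inferred from the SQL statements used later in this article, and the types and any additional fields are assumptions:

```python
# Hypothetical DDL for the task table. Column names are taken from the
# queries used in the scheduler code in this article; types are assumptions.
WP_RECORDS_DDL = """
CREATE TABLE wp_records (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    path_up     VARCHAR(255),   -- input CSV in the input repository
    path_result VARCHAR(255),   -- file produced by the model
    status_file VARCHAR(20),    -- PENDING / IN PROCESS / PROCESS OK
    date_init   DATETIME,       -- when processing started
    date_result DATETIME        -- when the result was produced
);
"""
```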

3 — Data repository

The input and output repositories are two directories defined on the server, where the files corresponding to the entries in the task table are kept.
Once the Python model processes a dataset and returns a CSV with the result, the file is saved in the output directory.
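A minimal sketch of how the two repositories could be wired together, assuming hypothetical directory locations and a simple naming convention (the article does not prescribe either):

```python
import os

# Hypothetical repository locations; in practice these would match the
# directories configured on the server.
INPUT_DIR = '/srv/model/input'
OUTPUT_DIR = '/srv/model/output'

def output_path_for(input_file):
    # Derive the output location from the uploaded file's name,
    # keeping a 1:1 mapping between input and result files.
    name, _ = os.path.splitext(os.path.basename(input_file))
    return os.path.join(OUTPUT_DIR, name + '_result.csv')
```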

4 — Scheduler execution

The following code (process_files) is indicative of how to process the files: it opens the database (open_bd), checks whether there are tasks in the "PENDING" state, and, if there are, takes the first of them and invokes the prediction process (process_file).

import mysql.connector as mysql
from mysql.connector import Error
from datetime import datetime
from jproperties import Properties

def process_files():
    result = ''
    db = None
    try:
        db = open_bd()
        cursor = db.cursor()
        # Do not start a new job if one is already being processed
        sql_query = "SELECT id FROM wp_records WHERE status_file='IN PROCESS'"
        cursor.execute(sql_query)
        records = cursor.fetchall()
        if len(records) != 0:
            result = 'There are files in progress ' + str(records[0][0])
        else:
            sql_query = "SELECT id FROM wp_records WHERE status_file='PENDING'"
            cursor.execute(sql_query)
            records = cursor.fetchall()
            if len(records) > 0:
                result = process_file(records[0][0])
    except Error as e:
        result = 'Error ' + str(e)
    finally:
        if db is not None and db.is_connected():
            db.close()
            result = result + ' (DB correctly closed)'
    return result

if __name__ == '__main__':
    print(process_files())

The code to open the database reads a configuration file, ConfigFile.properties, where the database name and access credentials are defined.

def open_bd():
    configs = Properties()
    with open('ConfigFile.properties', 'rb') as config_file:
        configs.load(config_file)
    host = configs.get("HOST")[0]
    user = configs.get("USER")[0]
    password = configs.get("PASSWORD")[0]
    database = configs.get("DBNAME")[0]
    db = mysql.connect(
        host=host,
        user=user,
        passwd=password,
        database=database)
    return db
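The ConfigFile.properties file read by open_bd might look like this (the values are placeholders, not the real credentials):

```
HOST=localhost
USER=wp_user
PASSWORD=********
DBNAME=wordpress
```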

The code to process a specific file, given its database id, could be something like this:

def process_file(prm_id):
    os.environ['TZ'] = 'America/Montevideo'
    result = ''
    db = None
    try:
        db = open_bd()
        cursor = db.cursor()
        sql_query = "SELECT path_up FROM wp_records WHERE id=%s"
        cursor.execute(sql_query, (prm_id,))
        record = cursor.fetchone()
        input_file = str(record[0])
        # Mark the task as taken before invoking the model
        upd_query = "UPDATE wp_records SET status_file='IN PROCESS', date_init=%s WHERE id=%s"
        cursor.execute(upd_query, (str(datetime.today()), prm_id))
        db.commit()
        output_file = predictions.prediction(input_file)
        upd_query = "UPDATE wp_records SET status_file='PROCESS OK', path_result=%s, date_result=%s WHERE id=%s"
        cursor.execute(upd_query, (output_file, str(datetime.today()), prm_id))
        db.commit()
        result = 'OK - File was created [' + output_file + ']'
    except Error as e:
        result = 'Error ' + str(e) + ' processing id ' + str(prm_id)
    finally:
        if db is not None and db.is_connected():
            db.close()
            result = result + ' (DB closed)'
    return result
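To have the scheduler run periodically, a cron entry like the following could be used; the paths are placeholders, since the article only states that a cronJob invokes the search for pending jobs every "x" time:

```
*/5 * * * * /usr/bin/python3 /path/to/pendings.py >> /path/to/scheduler.log 2>&1
```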

5 — Uvicorn and Starlette API

The following is the Python code for developing an API using Uvicorn and Starlette, which must be installed (using pip install).

In our API we deploy three methods:

  • / verifies that the server is running and our API is working.
  • /process_file?id=id_base_data processes a specific file; we must pass as a parameter the id of the task-table entry corresponding to the file to be processed (remember that component 1 created the entries in this table and deposited the datasets in the repository).
  • /process_all takes no parameters; it searches for all pending entries and processes them.
from starlette.applications import Starlette
from starlette.responses import JSONResponse
import uvicorn
import connect
import pendings
from datetime import datetime

app = Starlette(debug=True)

@app.route("/", methods=["GET"])
async def homepage(request):
    return JSONResponse({'MySAMPLE': 'It works!'})

@app.route("/process_file", methods=["GET"])
async def process_file(request):
    elid = request.query_params["id"]
    before = datetime.today()
    result = connect.process_file(elid)
    after = datetime.today()
    return JSONResponse({
        "result": result,
        "Start": str(before),
        "End": str(after)})

@app.route("/process_all", methods=["GET"])
async def process_all(request):
    before = datetime.today()
    result = pendings.process_files()
    after = datetime.today()
    return JSONResponse({
        "result": result,
        "Start": str(before),
        "End": str(after)})

if __name__ == '__main__':
    uvicorn.run(app, host='', port=8000)
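As a usage sketch, a client could build the request URL like this; the host and port are assumed values for the uvicorn server, and the record id comes from the wp_records table:

```python
import urllib.parse

# Assumed base address for the uvicorn server in this article's setup.
BASE_URL = 'http://localhost:8000'

def process_file_url(record_id):
    # Build the URL for the /process_file method; the "id" query
    # parameter is the wp_records id of the dataset to process.
    return BASE_URL + '/process_file?' + urllib.parse.urlencode({'id': record_id})

# Once the server is running, the request itself can be issued with
# urllib.request.urlopen(process_file_url(some_id)) or any HTTP client.
```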

The code that iterates over the pending jobs, and the code that processes a specific file, were shown in item #4.

6 — The Python model

The model is in another module of the application and is not displayed in this article since it is not our goal to detail it here.


Unlike previous articles, where we focused on the tools and approaches to a specific machine learning challenge, this time we have presented another part of the effort, perhaps not as attractive, but essential to turning a model for a use case into a complete solution.

Of course, each solution's architecture is defined according to the problem, the client's existing infrastructure, the knowledge we have of the tools at each instance, and the people who complement our team.

There are many libraries and tools for each stage.

In a nutshell: we talked about other aspects, close to the model, that may be necessary for a complete solution.
