Preamble
Machine Learning in 10 Lines
Everyone who reads news articles, magazines, or other general interest publications has a basic understanding of what machine learning is. And this is not just a trend; machine learning already permeates our daily lives and will do so even more in the future. From customized Internet advertisements to robot dentists or self-driving cars, machine learning appears to be some sort of all-encompassing superpower.
But what exactly is machine learning? At its core, it is a collection of statistical algorithms that can extract insights from existing data. These algorithms fall into two broad families: supervised and unsupervised learning. In supervised learning, the goal is to make some kind of prediction, for example whether an email is spam or not (classification), or how many beers will be sold in a store next week (regression). Unsupervised learning, on the other hand, focuses on determining how the cases divide into groups: the algorithms bring similar items as close together as possible while keeping items that differ from one another as far apart as possible, so that each group has its own distinctive characteristics.
The popularization of machine learning has revolutionized the way we do business. Whether you are talking about a company with 10 employees or 10,000, if you do not use your data to make decisions, you are definitely falling behind your competitors.
Machine learning without leaving the database
Relational databases are definitely the most crucial tools for data persistence. Although other options might be appropriate for particular use cases, it is hard to find a company with even a basic IT infrastructure that does not run a database.
So if every business has a database, every business has valuable information in it. This means that, by using machine learning, every business has the chance to enhance its decision-making process with little effort. The drawback is that with most DBMS you have to run your machine learning processes outside the database. PostgreSQL does not have this problem.
Because PostgreSQL offers extensions for many other languages, you can train and use your machine learning algorithms without ever leaving the database.
Let’s examine how to use PL/Python inside PostgreSQL to run k-means, one of the most popular unsupervised learning algorithms.
KMeans in PostgreSQL Step by Step
Step 1: Load some test data
We’re going to use the publicly accessible iris dataset for this example.
Then, in your database, you must create the iris table:
CREATE TABLE iris(
    sepal_length REAL,
    sepal_width REAL,
    petal_length REAL,
    petal_width REAL,
    species varchar(20)
);
After the table is created, we can fill it with the data we just downloaded. Please remove the final empty line from the iris.data file before executing the following command.
COPY iris FROM '/path/to/iris.data' DELIMITER ',';
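To make sure the load went as expected, a quick check such as the following should report 50 rows for each of the three species:

SELECT species, count(*)
FROM iris
GROUP BY species;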
Now that we have the data we will be using, let’s move on to the main k-means function.
Step 2: Install the dependencies
To be able to create our function, we first need to install the prerequisites:
- Python: the code presented here runs on both main versions of Python (2.7 and 3.5), but if you are planning to keep using PL/Python in the future, the choice of version is something to take into consideration. You can install either or both versions of Python by following the instructions for the OS of the machine where PostgreSQL is installed.
- PL/Python: Starting with Postgres 9.5, there are two PL/Python extensions available, one for Python 2 (plpythonu) and one for Python 3 (plpython3u); again, you can have both installed simultaneously. PL/Python is the extension that enables you to run Python code without leaving Postgres.
- Once they are installed, you must create the extensions, so connect to the database and run
CREATE EXTENSION plpythonu;
and/or
CREATE EXTENSION plpython3u;
- Install additional Python libraries: install scikit-learn and pandas using your preferred Python package manager (pip, conda, etc.). These two packages are essential if you want to begin exploring machine learning with Python. A quick sanity check for this setup is sketched right after this list.
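As that sanity check, you can define a throwaway PL/Python function along these lines (the name pyml_sanity_check is purely illustrative, and this assumes the packages were installed for the same Python interpreter that PL/Python uses):

CREATE OR REPLACE FUNCTION pyml_sanity_check() RETURNS text AS
$$
# Raises an ImportError at call time if PL/Python cannot see the packages
import sklearn
import pandas
return "sklearn %s, pandas %s" % (sklearn.__version__, pandas.__version__)
$$ LANGUAGE plpythonu;

SELECT pyml_sanity_check();

If everything is in place, the call simply prints the library versions; an ImportError means the packages are not visible to that interpreter.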
Step 3: k-means in PostgreSQL, in a nutshell
PL/Python functions can be called just like any other SQL function, and since Python has a vast array of machine learning libraries, the integration is very straightforward. On top of that, PL/Python gives full access to Python and provides a set of functions that make it easy to run any parameterized query, so running machine learning algorithms only takes a few lines of code. Take a look at this:
CREATE OR REPLACE FUNCTION kmeans(input_table text, columns text[], clus_num int) RETURNS bytea AS
$$
from pandas import DataFrame
from sklearn.cluster import KMeans
from cPickle import dumps

all_columns = ",".join(columns)
if all_columns == "":
    all_columns = "*"

rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))

frame = []
for i in rv:
    frame.append(i)

df = DataFrame(frame).convert_objects(convert_numeric=True)
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df._get_numeric_data())
return dumps(kmeans)
$$ LANGUAGE plpythonu;
The script is quite straightforward, as you can see. First we import the functions we need; then we build a string from the columns passed, or replace it with * if an empty array is passed; finally we construct and run the query using PL/Python’s execute function. Although it is outside the scope of this article, I highly recommend reading about how to parameterize queries with PL/Python.
Once the query has been built and run, we cast the result into a data frame and convert the numeric variables to a numeric type (by default they may be interpreted as something else). Then we call KMeans, passing the number of groups to obtain (clus_num) as the n_clusters parameter. Finally, we dump the fitted model with cPickle and return the pickled object. The pickling is needed to restore the model later: without it, Python would not be able to rebuild the kmeans object directly from the byte array coming back from PostgreSQL.
The final line specifies the extension language: in this case we are using Python 2, so the extension is called plpythonu. If you want to run it on Python 3, you should use the plpython3u extension language instead.
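For reference, here is a minimal, untested sketch of how the same function could look under Python 3 (the name kmeans_py3 is only illustrative; cPickle is replaced by pickle, and the deprecated convert_objects call is avoided by simply keeping the numeric columns, which plpy already returns as floats for REAL fields):

CREATE OR REPLACE FUNCTION kmeans_py3(input_table text, columns text[], clus_num int) RETURNS bytea AS
$$
from pandas import DataFrame
from sklearn.cluster import KMeans
from pickle import dumps  # cPickle does not exist in Python 3

all_columns = ",".join(columns)
if all_columns == "":
    all_columns = "*"

rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))

# plpy returns REAL columns as Python floats, so keeping only the
# numeric columns is enough; no explicit conversion step is needed
df = DataFrame([dict(r) for r in rv])
numeric_df = df.select_dtypes(include=['number'])

kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(numeric_df)
return dumps(kmeans)
$$ LANGUAGE plpython3u;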
Step 4: Storing the model
Building a model and then doing nothing with it makes little sense, so we will need to store it.
To begin, let’s make a models table:
CREATE TABLE models (
    id SERIAL PRIMARY KEY,
    model BYTEA NOT NULL
);
In this instance, our table only has a primary key and a byte array field, which contains the serialized version of the model. Note that it has the same data type as the one returned by the kmeans function we defined.
Once we have the table, inserting a new record with the model is simple:
INSERT INTO models(model) SELECT kmeans('iris', array[]::text[], 3);
In this case, we pass an empty array as the columns parameter so that clustering is done with all of the numeric variables in the table. Please remember that this is only an illustration; in a production scenario you might want to add, for instance, a few extra fields that help you distinguish between the various models.
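As an illustration of that idea, a slightly extended version of the table could look like this (the extra columns are just one possible choice):

CREATE TABLE models (
    id          SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,                        -- human-readable label for the model
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),   -- when the model was trained
    model       BYTEA NOT NULL                        -- the pickled kmeans object
);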
Step 5: Displaying model information
So far we have been able to build and store a model, but retrieving it directly from the database is not very helpful. You can check that by running
select * from models;
To display relevant information about our model, we have to return to Python. We’ll use the following function:
CREATE OR REPLACE FUNCTION get_kmeans_centroids(model_table text, model_column text, model_id int) RETURNS real[] AS
$$
from pandas import DataFrame
from cPickle import loads

rv = plpy.execute('SELECT %s FROM %s WHERE id = %s;' % (plpy.quote_ident(model_column), plpy.quote_ident(model_table), model_id))
model = loads(rv[0][model_column])

ret = map(list, model.cluster_centers_)
return ret
$$ LANGUAGE plpythonu;
As before, we pass in the table that contains the models and the name of the binary column. This function also shows how the output of a PL/Python query is read back: the stored bytes are turned into a Python object with cPickle’s loads function.
Once the model has been loaded, we know that every kmeans object has a "cluster_centers_" attribute, which is where the centroids are kept. The centroids are the mean vectors of each group, i.e., the means of each variable within each group. They are originally stored as a numpy array, and since plpython cannot handle numpy arrays, we must convert them to a list of lists. That is why the returned object is the result of listing every row: a list of lists in which each sub-list represents the centroid of a group.
This is just an example of how to get one particular attribute out of a model; similar functions can be written to return other properties or even the whole set. One such variant is sketched after the example output below.
Let’s examine the results it returns:
hernan=> select get_kmeans_centroids('models','model',1);
                                     get_kmeans_centroids
-----------------------------------------------------------------------------------------------
 {{4.39355,1.43387,5.90161,2.74839},{1.464,0.244,5.006,3.418},{5.74211,2.07105,6.85,3.07368}}
(1 row)
Each element enclosed in braces represents a group, and its values are that group’s vector of means.
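For instance, one possible variant along the same lines (just a sketch, not part of the original walkthrough above) could expose the model’s inertia_ attribute, the within-cluster sum of squared distances that scikit-learn computes when the model is fitted:

CREATE OR REPLACE FUNCTION get_kmeans_inertia(model_table text, model_column text, model_id int) RETURNS real AS
$$
from cPickle import loads

rv = plpy.execute('SELECT %s FROM %s WHERE id = %s;' % (plpy.quote_ident(model_column), plpy.quote_ident(model_table), model_id))
model = loads(rv[0][model_column])

# inertia_ is the within-cluster sum of squared distances computed at fit time
return model.inertia_
$$ LANGUAGE plpythonu;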
Step 6: Making predictions
Now let’s use the model we have to make predictions! With kmeans, this means providing an array of values (one for each variable) and receiving back the group each case belongs to. The function is very similar to the previous one:
CREATE OR REPLACE FUNCTION predict_kmeans(model_table text, model_column text, model_id int, input_values real[]) RETURNS int[] AS
$$
from cPickle import loads

rv = plpy.execute('SELECT %s FROM %s WHERE id = %s;' % (plpy.quote_ident(model_column), plpy.quote_ident(model_table), model_id))
model = loads(rv[0][model_column])

ret = model.predict(input_values)
return ret
$$ LANGUAGE plpythonu;
In contrast to the previous function, we now have an additional parameter (input_values), used to pass the values that represent a case (one value per variable) for which we want to obtain the group assigned by the clustering.
We return an array of integers rather than an array of floats because we are referring to a group index.
hernan=> select predict_kmeans('models','model',1,array[[0.5,0.5,0.5,0.5]]);
 predict_kmeans
----------------
 {1}
(1 row)
Please be aware that even if you are passing just one element, you must pass an array of arrays. This has to do with how Python handles arrays.
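Since scikit-learn’s predict accepts a matrix with one row per case, the same call should also be able to score several cases at once, returning one group index per row; a hedged example of what that might look like:

select predict_kmeans('models','model',1,
                      array[[0.5,0.5,0.5,0.5],
                            [6.0,2.5,5.5,3.0]]);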
Additionally, we can supply the function with column names, as in:
select species,predict_kmeans('models','model',1,array[[petal_length,petal_width,sepal_length,sepal_width]]) from iris;
As can be seen, the associated group is strongly related to the species.
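One simple way to make that relationship visible (a sketch that only uses the functions defined above) is to count how many rows of each species end up in each group:

SELECT species,
       predict_kmeans('models','model',1,
                      array[[petal_length,petal_width,sepal_length,sepal_width]]) AS grp,
       count(*)
FROM iris
GROUP BY 1, 2
ORDER BY 1, 2;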
Conclusion
This article has demonstrated that Postgres can be used to train and apply machine learning models. Preparing everything does require some knowledge of Python, but even so, for those who cannot build a complete machine learning toolkit in Python or another language, this can be a very good solution.
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.