Preamble
You have probably noticed that artificial intelligence and machine learning are hot topics right now, and rightly so: they will undoubtedly shape how we live in the future. However, looking at most of the machine learning code available online, it strikes me how much "coding" people actually spend on preparing the data. Often the input data is "just a matrix," assembled over a long period of time by many people (and occasionally from memory).
The obvious question for me is: why not prepare the data in SQL? It is much easier, more flexible, and requires less effort.
Creating some sample data
I’ll create a straightforward table with just 20 random values to show what we can do with PostgreSQL. Let’s define a table first:
test=# CREATE TABLE inputdata (
           id    int,
           data  numeric DEFAULT random() * 1000
       );
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata SELECT * FROM generate_series(1, 20);
INSERT 0 20
Right now it is just a table containing 20 random values, but we already have some data to work with:
test=# SELECT * FROM inputdata;
 id |       data
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)
Thoughts on sampling, training, and verification
If you are training an AI model (a Support Vector Machine / SVM, a neural network, or whatever you choose), you will always start by splitting the data into different parts:
• Data needed to train your AI model
• Data for testing your AI model
The training data is used to train the model; the test data is then used to evaluate how well the model performs. Splitting the data properly is crucial and, in a way, the key to success.
Table sampling in PostgreSQL
Sampling has been part of the PostgreSQL core since version 9.5. Here is an example:
test=# SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);
 id |       data
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)
The TABLESAMPLE clause is added after the table name, together with instructions on how the sampling should work. In this case, I used the Bernoulli table sampling method (TSM) and made it repeatable. In the context of machine learning, the REPEATABLE clause makes sense because it ensures that we can train our model against the same input data over and over again. PostgreSQL will return the same sample no matter how often we run the code (as long as the underlying data does not change, of course).
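To illustrate the effect of REPEATABLE, here is a small sketch (the alternative seed value 1 is arbitrary and only for demonstration): running the first query repeatedly always yields the same rows, while changing the seed yields a different, but equally reproducible, sample.

-- The same seed always returns the same sample (as long as the data is unchanged):
SELECT id FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

-- A different seed returns a different, but equally reproducible, sample:
SELECT id FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (1);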
Machine learning: Lots and lots of data …
So far, so good. However, this approach has a small disadvantage: to work properly, a state-of-the-art model needs a LOT of input data, typically millions of rows. Sooner or later we will therefore run into the following issue (expressed in pseudo code):
SELECT * FROM inputdata WHERE id NOT IN (SELECT id FROM test_data_set);
This raises two issues (a concrete sketch of the naive approach follows the list):
• Storing the test data set requires a lot of space.
• The large NOT IN statement is fairly expensive.
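To make the problem tangible, this is roughly what the naive approach would look like if the test set were materialized in its own table (test_data_set is simply the hypothetical table from the pseudo code above, shown here only as a sketch):

-- Materialize the test set once (this duplicates the sampled rows):
CREATE TABLE test_data_set AS
    SELECT * FROM inputdata TABLESAMPLE BERNOULLI (50) REPEATABLE (87);

-- Training data = everything that is not part of the test set.
-- On millions of rows this NOT IN becomes expensive.
SELECT *
FROM   inputdata
WHERE  id NOT IN (SELECT id FROM test_data_set);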
So, is there another way to do it? Yes: the strategy I came up with is fairly easy to implement and works well.
Because PostgreSQL lacks a “NOT IN TABLESAMPLE”-clause and because we don’t want to duplicate our data, the idea is to use a view that can be used to extract the sample:
test=# CREATE VIEW preparation_step1 AS
           SELECT *, abs(hashtext(id::text) % 100)
           FROM   inputdata;
CREATE VIEW
The idea is to first generate a hash of the input data using the hashtext function. hashtext returns evenly distributed numbers, which is exactly what we need here. Applying "% 100" produces 100 slices of data, each containing roughly 1% of the total. Keep in mind that hashtext can return negative values; the abs function turns them into positive ones.
test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal
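To double-check that the modulo trick really distributes rows evenly, a quick sketch over a larger series can help (the row count of one million is arbitrary and serves only as an illustration; each slice should end up with roughly 1% of the rows):

SELECT abs(hashtext(g::text) % 100) AS slice, count(*)
FROM   generate_series(1, 1000000) AS g
GROUP  BY slice
ORDER  BY slice;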
Looking at the view, the result is already quite useful:
test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)
We can now filter the data. For example, rows with "abs < 50" could serve as training data, while the remaining rows can be used to validate and test our models.
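In practice, the split could look roughly like this (a sketch based on the view defined above; the 50/50 split is just an example):

-- Training data: roughly half of the rows
SELECT id, data FROM preparation_step1 WHERE abs < 50;

-- Test data: the remaining rows, without duplicating any storage
SELECT id, data FROM preparation_step1 WHERE abs >= 50;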
This strategy is generally fine if your dataset is really large (xxx million rows or so). It may not be ideal if your dataset is too small; in that case, it is better to use the sampling techniques offered by your library of choice (TensorFlow, sklearn, etc.). Why is that? If your dataset is small (say, just a few hundred rows), you run the risk of producing a biased sample. Why does that matter? Suppose you have a dataset containing information about men and women. The sample should show the same distribution of men and women as the original dataset. Libraries such as sklearn support "stratification" to solve exactly this problem. In my simple SQL example, I assume that the model will receive a large amount of data, so stratification will not be an issue.
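For completeness, here is a rough sketch of what a stratified split could look like in SQL; the people table with a gender column is purely hypothetical and not part of the example above. Within each gender, rows are assigned alternately to the training and test sets, so both sets keep the original gender distribution.

-- Hypothetical people(id int, gender text) table, for illustration only.
SELECT id, gender,
       CASE WHEN row_number() OVER (PARTITION BY gender
                                    ORDER BY hashtext(id::text)) % 2 = 0
            THEN 'training'
            ELSE 'test'
       END AS split
FROM   people;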
About Enteros
Enteros offers a patented database performance management SaaS platform. It finds the root causes of complex database scalability and performance problems that affect business across a growing number of cloud, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.