Excel into PostgreSQL

imputing missing values with rapidminer

Posted on June 29, 2016

Continuing on data import into PostgreSQL database, this covers a quick way of loading Excel data while guessing (aka “imputing”) missing values. In this case, I’ll use RapidMiner’s Impute Missing Values operator with a k-nearest neighbors learning algorithm.

Import Excel with RapidMiner

Here’s a sample sales spreadsheet to import. Note, that some of the Region values are missing, but can be guessed at from other records.

Excel

The RapidMiner top-level process looks likes this:

Process

However, the Impute Values operator takes a sub-process, a learning model that can make an educated guess at the missing values, based on surrounding data. Double-click into the ‘Impute Values’ operator. This will show it’s sub-process area. Drag a ‘k-NN’ operator over and connect the ‘exa’ and ‘mod’ ports.

Subprocess

Export to Postgres using RapidMiner

For the Write Database operator, creating the Postgres database connection is easy in RapidMiner:

Connection

Executing the process reads the spreadsheet, imputes the missing values and stores the data into a created ‘orders’ Postgres table:

Orders

Most importantly, you can see the missing values have been inserted as part of the data cleansing during this load process.

More in this series…