Continuing on data import into PostgreSQL database, this covers a quick way of loading Excel data while guessing (aka “imputing”) missing values. In this case, I’ll use RapidMiner’s Impute Missing Values operator with a k-nearest neighbors learning algorithm.
Here’s a sample sales spreadsheet to import. Note, that some of the Region values are missing, but can be guessed at from other records.
The RapidMiner top-level process looks likes this:
However, the Impute Values operator takes a sub-process, a learning model that can make an educated guess at the missing values, based on surrounding data. Double-click into the ‘Impute Values’ operator. This will show it’s sub-process area. Drag a ‘k-NN’ operator over and connect the ‘exa’ and ‘mod’ ports.
For the Write Database operator, creating the Postgres database connection is easy in RapidMiner:
Executing the process reads the spreadsheet, imputes the missing values and stores the data into a created ‘orders’ Postgres table:
Most importantly, you can see the missing values have been inserted as part of the data cleansing during this load process.