KMEANS PQL function need string to numeric conversion

Dear all,

I tried to use KMEANS PQL function for clustering O2C Sales order data.
I got error ‘Error in formula, String and boolean are not supported as input or model training’.

Normally in machine learning project, we need to convert Categorical (String) column to numeric columns x categories (E.g. in scikit-learn I can use OneHotEncoder to do it). Is there a something way in Celonis to do same conversion ?

Additionally I tried below transformation SQL in Event Collection. (referring vertica ONE_HOT_ENCODER_FIT)

select one_hot_enncoder_fit(‘one_hot_encoder_model’,‘vbak’,‘vkorg’ using parameters output_view=‘vbak_encoded’);

and returned
Execution error: [Vertica]VJDBC ERROR: Function one_hot_enncoder_fit(unknown, unknown, unknown) does not exist, or permission is denied for one_hot_enncoder_fit(unknown, unknown, unknown)

Any other ideas ?

Dear Kazuhiko,

I was able to replicate your error message. If your column contains numbers but is coded as a string you can use the STR_TO_INT function to covert the string to numbers. Please bear in mind that if the string isn’t a number it returns a null. Also if the string contains floats all decimal numbers are ignored. It would however be possible to work round that by first using a function such as SUBSTRING to split the string into two parts and then to convert both parts to ints, you could then divided the decimal part by say 10 to convert it back into a decimal and then adding it to the integer part.

I hope this helps. If you have further questions don’t hesitate to write back.

Best wishes,

Calandra

Dear Calandra,

Thanks for your cooperation. Now I realized I need to explain a little bit more.

For example, my case table has columns ‘Sales organization’ and ‘Document type’.
I have sales organization value A, B, C and document type value D, E, F, G.
These categorized string should be split to multiple columns like below, then run KMEANS.

Using your solution and converting A -> 1, B->2, C->3 in the same column does not work correctly because A and B and C are equally different but distance between 3 and 1 is different from that of 2 and 1.

If I can use machine learning workbench (new feature), I probably use Python code and smoothly solve this issue.

Dear Kazuhiko,

yes, using machine learning workbench for this problem is a good idea.

I wish you all the best with that. If you need help with using the machine learning workbench then please make a new ticket with ‘Python’ or ‘machine learning workbench’ in the title so that we can get our Python experts to answer your question as quickly as possible.

Best wishes,

Calandra

Dear Calandra,

I looked at another topic and found Pycelonis documentation.
I understand that I can use my laptop installed python environment and integrate with IBC, even I do not have access to use IBC ML workbench.

I will try it and feedback to this community if there is something issue.

Best regards,
Kazuhiko