
I was thinking of reading all the fields of the table and then performing a group by. Is this the right way?

In the table creation, you can find duplicates using ROW_NUMBER in combination with PARTITION BY.

 

ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [quantitative value] DESC) AS record_number

 

Then you can do:

 

DELETE FROM [table name] WHERE record_number > 1;
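
Since record_number only exists inside the query, the delete usually has to go through a subquery, or you rebuild the table keeping only the first record per key. A minimal sketch of that pattern, with placeholder names (field_1 and field_2 as the duplicate key, field_n as the column that decides which record to keep):

CREATE TABLE "my_table_dedup" AS
SELECT field_1, field_2, field_n
FROM (
    SELECT field_1,
           field_2,
           field_n,
           ROW_NUMBER() OVER (
               PARTITION BY field_1, field_2   -- columns that define a duplicate
               ORDER BY field_n DESC           -- which duplicate survives
           ) AS record_number
    FROM "my_table"
) ranked
WHERE record_number = 1;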

 

Hope this helps!


Thanks Matt, I was also looking at a similar solution, but as I see it, it is too table-dependent.

I am looking for a solution that can be applied to any table where the number of fields is not known in advance (the primary key must be constructed from the concatenation of all fields in the table). Something like this:

------------------------
SELECT field_1,
       field_2,
       ....,
       field_n
FROM "my_table"
GROUP BY field_1,
         field_2,
         ....,
         field_n;
------------------------

 

I know an alternative could be the following, but I would like to avoid using the DISTINCT clause.

------------------------
SELECT DISTINCT "my_table".*
FROM "my_table"
------------------------

 

I hope I have explained myself better.


Celonis had deployed a Machine Learning script called a "Duplicate Checker", but I'm not sure whether it is still available with the new PyCelonis changes. It's difficult to deploy a table-independent clean-up since primary keys differ across tables.

 

I do agree, DISTINCT makes it too computationally intensive.


You can concatenate all the columns into a single string and then calculate its hash. If two hashes are equal, the rows are duplicates.
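
A minimal SQL sketch of that idea, assuming the engine provides MD5() and CONCAT() (function names and field_1 ... field_n are placeholders, and the '|' separator avoids accidental matches when values run together):

SELECT MD5(CONCAT(field_1, '|', field_2, '|', field_n)) AS row_hash,
       COUNT(*) AS occurrences
FROM "my_table"
GROUP BY MD5(CONCAT(field_1, '|', field_2, '|', field_n))
HAVING COUNT(*) > 1;   -- hashes that appear more than once flag duplicate rows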

 

If you load the data into a dataframe, I am pretty sure there are built-in functions to remove duplicates (for example, pandas' drop_duplicates()).

 

HTH

