
I was thinking of reading all the fields of the table and then performing a group by. Is this the right way?

In the table creation, you can find duplicates using ROW_NUMBER in combination with PARTITION BY.

 

ROW_NUMBER() OVER (PARTITION BY [primary keys] ORDER BY [quantitative value] DESC) AS record_number

 

Then you can do:

 

DELETE FROM [table name] WHERE record_number > 1;
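
Since record_number only exists inside the query, the delete usually has to go through a subquery, or you rebuild the table keeping only the first record per key. A minimal sketch of that pattern, with placeholder names (field_1 and field_2 as the duplicate key, field_n as the column that decides which record to keep):

CREATE TABLE "my_table_dedup" AS
SELECT field_1, field_2, field_n
FROM (
    SELECT field_1,
           field_2,
           field_n,
           ROW_NUMBER() OVER (
               PARTITION BY field_1, field_2   -- columns that define a duplicate
               ORDER BY field_n DESC           -- which duplicate survives
           ) AS record_number
    FROM "my_table"
) ranked
WHERE record_number = 1;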

 

Hope this helps!


Thanks Matt, I was also looking at a similar solution, but as I see it, it is too table-dependent.

I am looking for a solution that can be applied to any table where the number of fields is not known in advance (the primary key must be constructed from the concatenation of all fields in the table). Something like this:

------------------------
SELECT field_1,
       field_2,
       ....,
       field_n
FROM "my_table"
GROUP BY field_1,
         field_2,
         ....,
         field_n;
------------------------

 

I know an alternative could be the following, but I would like to avoid using the DISTINCT clause.

------------------------
SELECT DISTINCT "my_table".*
FROM "my_table"
------------------------

 

I hope I have explained myself better.


Celonis had deployed a Machine Learning script called a "Duplicate Checker", but I'm not sure whether it is still available with the new PyCelonis changes. It's difficult to deploy a table-independent clean-up since primary keys differ across tables.

 

I do agree, DISTINCT makes it too computationally intensive.


You can concatenate all the columns into a single string and then calculate its hash. If two hashes are equal, the rows are duplicates.
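
A minimal SQL sketch of that idea, assuming the engine provides MD5() and CONCAT() (function names and field_1 ... field_n are placeholders, and the '|' separator avoids accidental matches when values run together):

SELECT MD5(CONCAT(field_1, '|', field_2, '|', field_n)) AS row_hash,
       COUNT(*) AS occurrences
FROM "my_table"
GROUP BY MD5(CONCAT(field_1, '|', field_2, '|', field_n))
HAVING COUNT(*) > 1;   -- hashes that appear more than once flag duplicate rows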

 

If you load the data into a dataframe, I am pretty sure there are built-in functions to remove duplicates (for example, pandas' drop_duplicates()).

 

HTH

