How would I go about removing database errors in event collection?

Question

Problem: In the dataset I am working with, a mistaken batch process in the source system added extra timestamps which are throwing off throughput times.  Objective: My objective is to map to null all activities in my activity table which are repeated back to back. Example: In the simplified table below, I would want the final activity to map to null since it is a repeated activity "c" of the same work order "3" Table = "Celonis_Remap_Practice_csv" Partial Solution:I created a custom dimension on a variant explorer with the following code: CASEWHEN   "Celonis_Remap_Practice_csv"."Activity" = ACTIVITY_LAG ("Celonis_Remap_Practice_csv"."ACTIVITY")   AND INDEX_ACTIVITY_ORDER ( "Celonis_Remap_Practice_csv"."ACTIVITY" ) > 1THEN NULLELSE "Celonis_Remap_Practice_csv"."Activity"END; This effectively removed the duplicate activity from the variant explorer, but I want duplicate activities completely removed from the analysis.  Question 1: How can I create an analysis level filter that removes duplicate activities without removing entire cases with duplicate activities? -or- Question 2: How can I write a valid "WHERE" statement in a data transformation in Event Collection given that functions such as Activity_Lag and Index_Activity_Order are not supported? -- This is what I have tried so farUPDATE "Celonis_Remap_Practice_csv"SET "Activity" = NULLWHERE "Celonis_Remap_Practice_csv"."Activity" = ACTIVITY_LAG ("Celonis_Remap_Practice_csv"."ACTIVITY")    AND INDEX_ACTIVITY_ORDER("Celonis_Remap_Practice_csv"."Activity") > 1;

tony.hoang11 · Answer

Hello,Please confirm my understanding is correct. There was an error within the source system, before the data came into Celonis.(If there was an error with the extraction into Celonis, you can simply do a full load to get a clean set of data and that should resolve your issue)This error manifests in that you have multiple representations of the same Activity, in the Activity table. From my understanding, these representations all have the same data points. They are in essence, carbon copies of each other. If this is the case, you can do a one-time creation of a new activity table:CREATE TABLE IF NOT EXISTS Celonis_Remap_Practice_v2 AS (SELECT DISTINCT * FROM Celonis_Remap_Practice_csv)Be sure to point the data model to this new activity table in the data pool. You can give it an alias in the data model to keep the same name so you don't have to alter anything else downstream. This will not null out anything, but will deduplicate your Activities table.If these duplicates are not true duplicates and have distinct values somewhere, we can still try the same concept above using ROW_NUMBER and a CTE.For example (replace the comments in  with relevant data):CREATE TABLE IF NOT EXISTS Celonis_Remap_Practice_v2 AS (WITH TEMP_ACTIVITY AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY  ORDER BY ) AS 'temprownumber' FROM Celonis_Remap_Practice_csv)SELECT  FROM TEMP_ACTIVITY WHERE temprownumber = 1)If you actually want to keep the records and want to null them out you can alter the very last select statement above:SELECT CASE WHEN temprownumber = 1 then "Activity" ELSE NULL END AS "Activity", FROM TEMP_ACTIVITY )

Reply

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded