Skip to main content

Problem: In the dataset I am working with, a mistaken batch process in the source system added extra timestamps which are throwing off throughput times.

 

Objective: My objective is to map to null all activities in my activity table which are repeated back to back.

 

Example: In the simplified table below, I would want the final activity to map to null since it is a repeated activity "c" of the same work order "3"

 

Table = "Celonis_Remap_Practice_csv"

Dataset 

Partial Solution:

I created a custom dimension on a variant explorer with the following code:

 

CASE

WHEN

  "Celonis_Remap_Practice_csv"."Activity" = ACTIVITY_LAG ("Celonis_Remap_Practice_csv"."ACTIVITY")

  AND INDEX_ACTIVITY_ORDER ( "Celonis_Remap_Practice_csv"."ACTIVITY" ) > 1

THEN NULL

ELSE "Celonis_Remap_Practice_csv"."Activity"

END;

 

This effectively removed the duplicate activity from the variant explorer, but I want duplicate activities completely removed from the analysis.

 

Question 1: How can I create an analysis level filter that removes duplicate activities without removing entire cases with duplicate activities?

 

-or-

 

Question 2: How can I write a valid "WHERE" statement in a data transformation in Event Collection given that functions such as Activity_Lag and Index_Activity_Order are not supported?

 

-- This is what I have tried so far

UPDATE "Celonis_Remap_Practice_csv"

SET "Activity" = NULL

WHERE "Celonis_Remap_Practice_csv"."Activity" = ACTIVITY_LAG ("Celonis_Remap_Practice_csv"."ACTIVITY")

    AND INDEX_ACTIVITY_ORDER("Celonis_Remap_Practice_csv"."Activity") > 1;

Hello,
Please confirm my understanding is correct. There was an error within the source system, before the data came into Celonis.
(If there was an error with the extraction into Celonis, you can simply do a full load to get a clean set of data and that should resolve your issue)
This error manifests in that you have multiple representations of the same Activity, in the Activity table. From my understanding, these representations all have the same data points. They are in essence, carbon copies of each other. If this is the case, you can do a one-time creation of a new activity table:
CREATE TABLE IF NOT EXISTS Celonis_Remap_Practice_v2 AS (SELECT DISTINCT * FROM Celonis_Remap_Practice_csv)

Be sure to point the data model to this new activity table in the data pool. You can give it an alias in the data model to keep the same name so you don't have to alter anything else downstream. This will not null out anything, but will deduplicate your Activities table.

If these duplicates are not true duplicates and have distinct values somewhere, we can still try the same concept above using ROW_NUMBER and a CTE.
For example (replace the comments in <> with relevant data):

CREATE TABLE IF NOT EXISTS Celonis_Remap_Practice_v2 AS (
WITH TEMP_ACTIVITY AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ORDER BY ) AS 'temprownumber' FROM Celonis_Remap_Practice_csv
)
SELECT FROM TEMP_ACTIVITY WHERE temprownumber = 1)

If you actually want to keep the records and want to null them out you can alter the very last select statement above:

SELECT
CASE WHEN temprownumber = 1 then "Activity" ELSE NULL END AS "Activity",
FROM TEMP_ACTIVITY )

As a standard, we should always have the following columns on a given table:

  • Case ID
  • Activity
  • Timestamp
  • Sort (Optional)

 

In case you have daily-based activities, you should use the Sort column as auxiliary.

 

On your table, seems like you don't have the Case ID defined. But you have the "work order", which can be understood as Sort. You need to define a Case ID as aggregator.


Reply