Hi all,
I have lately been thinking about how to make my life as a Data Engineer easier and get up-to-date documentation of the current backend relationships. The goal is to see which tables from the source are transformed into which VIEWS / TABLES downstream. I want a quick and accurate answer to:

  1. Which tables are sources for a given table in the datamodel and how are they connected?
  2. What is the potential impact on downstream tables if I change something in a given extraction?
  3. Which tables are not in use?
  4. Even better would be column-level granularity, not only view/table level

I believe these questions all belong to the concept of Data Lineage, which in my opinion is currently very poorly supported within Celonis.

I would like to open the discussion on how best to do this and whether this is a relevant problem for more people:

  • What is your workflow when you face the questions above?
  • (How) are you guys documenting the SQL Backend?
  • Is it done manually?

I’ve found PyCelonis works really well to answer these questions. I have a couple of scripts that put together documents/spreadsheets with all of this information. There’s some prep work that goes into it ahead of time but it is definitely doable. Vertica also has a lot of system tables you can use to get this information. 

SELECT *
FROM COLUMNS

will get you almost everything you’re looking for. You’ll have to substitute some values, especially because the data schemas appear under their 36-character alphanumeric IDs rather than whatever you’ve named them.
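To illustrate that substitution step, here is a minimal stdlib-only sketch. The pool ID and the name mapped to it are made-up placeholders; in practice you would build the mapping once, e.g. by reading your data pools' IDs and names out of PyCelonis or the EMS UI:

```python
import re

# Hypothetical mapping from a 36-character pool/schema ID to the name you
# actually gave the data pool. Fill this from your own environment.
POOL_NAMES = {
    "1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d": "sap_ecc",
}

# Matches the UUID-shaped schema identifiers Vertica reports.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

def substitute_pool_ids(text: str, mapping: dict) -> str:
    """Replace every UUID-shaped schema ID with its friendly name, if known."""
    return UUID_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
```

Running `substitute_pool_ids('"1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d"."VBAK"', POOL_NAMES)` turns the raw schema reference into `"sap_ecc"."VBAK"`, which makes the COLUMNS output readable.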


Hi Matt,
thanks for the hint. I was also thinking in the direction of PyCelonis and some scripts to run in the Machine Learning Workbench.
Can you elaborate a bit on how you would identify the lineage, i.e. the chain (or more likely tree) of which table is the source of which, from those system tables?
From my understanding, SELECT * FROM COLUMNS (or FROM TABLES, or FROM VIEWS) only gives you a static list of what currently exists, with little information about how a given table was created.


Right, so I started with the system tables first to get the high-level “end product” of tables.

I’m currently working on a script that goes and gets all the transformations and their queries. In PyCelonis you have get_jobs() and get_tasks(). These two functions let you see your transformations and then extract them. From there, you can do string manipulation on the transformation text to find what your source tables are and what you’re joining.
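As a rough sketch of that string-manipulation step, here is a naive, stdlib-only approach (a simple regex, not a full SQL parser: it will miss CTEs, comments and subqueries, but covers straightforward transformation statements):

```python
import re

# Identifiers that directly follow FROM, JOIN or INTO are candidates for
# source/target tables in a transformation statement.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN|INTO)\s+([\w.\"]+)", re.IGNORECASE)

def extract_tables(sql: str) -> set:
    """Collect table identifiers following FROM/JOIN/INTO in a statement."""
    return {m.group(1).strip('"') for m in TABLE_RE.finditer(sql)}
```

For example, `extract_tables("CREATE VIEW v_orders AS SELECT * FROM orders o JOIN customers c ON o.cid = c.id")` yields `{"orders", "customers"}`. Collecting these per transformation gives you the edges of the lineage tree.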


Hi Matt,
thanks for that nudge in the direction of using PyCelonis. I like the automated approach and think the initial setup effort is worthwhile for our rather large backend.

I’m still in the midst of setting it up, but two intermediate findings that may be helpful for future readers:

  • If you already have the tasks, as Matt described, via get_tasks(), you can grab their SQL statements with task.get_statement()
  • There are SQL parser libraries which might ease the process of interpreting the SQL statement and reading out the source tables, compared to custom Python string handling. To me, sqllineage looked rather mature and promising.
