Highly interested in the column granularity as well.
I use the Marketplace app "Monitoring" from Celonis to at least track and drill down APC at the table level, but that does not answer your question about APC per column or about which columns are unused.
Thank you Manuel, yes, I do use table-level monitoring with the Celonis Monitoring view, but it does not go down to the column level.
I am also trying to find out whether any standard Vertica system table gives the size at the column level, but no luck so far.
Hi Muthappan Alagappan,
From my perspective, finding the list of columns that have not been used in transformations can be done in the ML Workbench with Python, though it is a somewhat involved exercise. I cannot share the code I am using, but I can outline how it can be done (a sketch follows the list below):
1) Using pycelonis, find the data job that contains your transformation script.
2) Using Python's regular expression module, build a pattern that lets you list table and column names.
3) Extract the transformation statement from the data job with Python and apply the regex pattern to retrieve the table and column names.
4) Next, use Python to access your extraction data pool, list the tables/columns that have been extracted, and compare that list with the one you built from the transformations; the tables/columns that appear in the extraction pool but not in the transformations are the ones that were extracted but never used.
5) Using the requests and json modules, you can also retrieve the size of the extracted tables via API requests and merge that information into the table list.
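To make steps 2-4 concrete, here is a minimal sketch. It assumes the transformation SQL has already been pulled from the data job (for example with pycelonis) and that the extracted table/column pairs are available as a Python set; the SQL text, the regex pattern and all names below are purely illustrative.

import re

# Assumption: transformation_sql holds the SQL of one transformation,
# already retrieved from the data job (e.g. via pycelonis).
transformation_sql = '''
INSERT INTO "O2C_ACTIVITIES"
SELECT "VBAK"."MANDT", "VBAK"."VBELN", "VBAP"."POSNR"
FROM "VBAK" JOIN "VBAP" ON "VBAK"."VBELN" = "VBAP"."VBELN";
'''

# Illustrative pattern: matches quoted "TABLE"."COLUMN" references.
# Real transformations need a more robust pattern (aliases, unquoted names, etc.).
ref_pattern = re.compile(r'"([A-Za-z0-9_]+)"\."([A-Za-z0-9_]+)"')
used = {(t.upper(), c.upper()) for t, c in ref_pattern.findall(transformation_sql)}

# Assumption: extracted holds the (table, column) pairs of the extraction data pool.
extracted = {
    ("VBAK", "MANDT"), ("VBAK", "VBELN"), ("VBAK", "ERDAT"),
    ("VBAP", "VBELN"), ("VBAP", "POSNR"), ("VBAP", "MATNR"),
}

# Extracted but never referenced in a transformation statement.
for table, column in sorted(extracted - used):
    print(f"{table}.{column} is extracted but not used")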
Thanks,
Akshay
Thank you Akshay, I understand your approach, but the challenge I have is that the same columns might be used in multiple tables and multiple data jobs.
Still, I got a couple of ideas from your approach:
- The data pool can be exported as a JSON file; a regular expression can then find where each column is used and how many times it appears in that file (see the sketch after this list). Drawback: if a column name is used in multiple tables, the count might be wrong.
- I have the raw table with its list of columns and the transformed table with its list of columns at the data model level; comparing those two shows what is used. Drawback: if a column is used only in a transformation but not in any data model table, I will miss those counts/columns.
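A rough sketch of the first idea, assuming the data pool export is available as a local file; the file name and column list below are placeholders. As noted, a column name shared by several tables makes the count ambiguous, so a count of zero is the only reliable signal.

# Assumption: pool_export.json is the JSON export of the data pool,
# and raw_columns is the list of extracted column names to check.
raw_columns = ["VBELN", "POSNR", "ERDAT", "MATNR"]

with open("pool_export.json", encoding="utf-8") as f:
    export_text = f.read()

# Count raw string occurrences of each column name in the export.
# A name shared by several tables inflates the count, so treat 0 as
# "possibly unused" and any positive count only as a hint.
usage = {col: export_text.count(col) for col in raw_columns}

for col, count in sorted(usage.items(), key=lambda kv: kv[1]):
    status = "possibly unused" if count == 0 else f"referenced {count} time(s)"
    print(f"{col}: {status}")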
Thank you
Muthappan
There are system tables within SQL Vertica you can use.
SELECT *
FROM COLUMNS
This will show you every column and every table within your EMS instance. I've built out a number of data integrity tables using this one and "projections". Projections will tell you whether a table contains the full metadata to be able to transform in its most efficient way.
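As a slightly more targeted variant, the query below simply lists every column per table so the output can be compared against the column names found in your transformation scripts; it assumes the catalog view exposes the standard Vertica fields table_schema, table_name, column_name, data_type and ordinal_position.

SELECT table_schema
     , table_name
     , column_name
     , data_type
FROM COLUMNS
ORDER BY table_schema, table_name, ordinal_position;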
Hi Matt,
Thank you for the COLUMNS and PROJECTIONS tables; however, I don't see any field in them that reflects the size of a column.
Thanks
Muthappan
Hi @muthappan.alaga and @akshay.harid,
We created a Python script with this approach and used sql_metadata · PyPI to get the table names and columns from the SQL scripts.
One downside is that this method is not bullet-proof with 'WITH' statements and/or 'SELECT *' statements, since the column is then invisible or is linked to the WITH clause rather than the original table. If that does not apply to you, this is a great way of gaining insight into your transformations.
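For illustration, a minimal sketch of that extraction step with sql_metadata's Parser; the SQL string is a placeholder, and in practice it would be each transformation statement pulled from the data pool.

from sql_metadata import Parser  # pip install sql-metadata

# Placeholder transformation statement; in practice this comes from the data job.
sql = "SELECT VBAK.VBELN, VBAP.MATNR FROM VBAK JOIN VBAP ON VBAK.VBELN = VBAP.VBELN"

parser = Parser(sql)
print(parser.tables)   # e.g. ['VBAK', 'VBAP']
print(parser.columns)  # e.g. ['VBAK.VBELN', 'VBAP.MATNR', ...]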
Best regards,
Jan-peter