Classiffy | Community

Hi everyone,

I have a column with many numbers, and I have uploaded a screenshot as an example. I want to classify them into groups. FYI: I know the maximum, minimum, and median, and I also know the total number of numbers.

How can I do that?

Thanks in advance!

Page 1 / 1

Can you explain a bit more what you wanna do? What kind of groups do you want?

Hi Jonas,

Thanks for your reply! What I’m looking to do is categorize the numbers into different groups based on their values, but I want the ranges to be dynamic and adapt to the data itself.

For example, I would like to define categories based on the distribution of the numbers, such as grouping numbers that are significantly higher, middle-range, and lower.

For example something like this :

Category 1: Numbers greater than 10,000
Category 2: Numbers between 1,000 and 10,000
Category 3: Numbers between 500 and 1,000

Let me know if you need further clarification!

Hi Saba, do you want to do this in your SQL script? If yes, you can do the following,

You can add a column to the table lCategory] which will have these three categories - Category 1 (Numbers greater than 10,000), Category 2 (Numbers between 1,000 and 10,000) and Category 3 (Numbers between 500 and 1,000).

Or you can also create a table from the existing table, with a new column where you can see these Categories -

CREATE TABLE TABLE_A
AS SELECT *,
CASE
WHEN ‘MAX POTENTIAL SAVING’ > 10000 THEN 'CATEGORY 1'
WHEN ‘MAX POTENTIAL SAVING’ BETWEEN 1000 AND 10000 THEN 'CATEGORY 2'
WHEN ‘MAX POTENTIAL SAVING’ BETWEEN 500 AND 1000 THEN 'CATEGORY 3'
END AS CATEGORY
FROM TABLE

Thanks Tanmay!

However, I want to do it in PQL, and I need something dynamic. I don't want to explicitly mention values like 'greater than 1000' in my code, for example.

@Saba in pql it would be similar with case when statements to assign a bucket category
You have to define the logcal ‘range’ that defines a category.
If you have fixed categories but dynamic ranges you could work with percentages or variables that you could have user defined in the views themselves

CASE 	WHEN	0	<=	"case_amount_total"."CASE_AMOUNT" 	AND	"case_amount_total"."CASE_AMOUNT"  	<=	1000000	THEN 'A: 	0	<=	1 000 000	 '
	WHEN	1000000	<=	"case_amount_total"."CASE_AMOUNT"  	AND	"case_amount_total"."CASE_AMOUNT"  	<=	5000000	THEN 'B:	1 000 000	<=	5 000 000	 '
	WHEN	5000000	<=	"case_amount_total"."CASE_AMOUNT"  	AND	"case_amount_total"."CASE_AMOUNT"  	<=	10000000	THEN 'C:	5 000 000	<=	10 000 000	 '
	WHEN	10000000	<=	"case_amount_total"."CASE_AMOUNT"  	AND	"case_amount_total"."CASE_AMOUNT"  	<=	25000000	THEN 'D:	10 000 000	<=	25 000 000	 '
	WHEN	25000000	<=	"case_amount_total"."CASE_AMOUNT"  	AND	"case_amount_total"."CASE_AMOUNT"  	<=	50000000	THEN 'E:	25 000 000	<=	50 000 000	 '
	WHEN	50000000	<=	"case_amount_total"."CASE_AMOUNT"<AND	"case_amount_total"."CASE_AMOUNT"  	<=	100000000	THEN 'F:	50 000 000	<=	100 000 000	 '
	WHEN	100000000	<=	"case_amount_total"."CASE_AMOUNT"  	AND	"case_amount_total"."CASE_AMOUNT"  	<=	250000000	THEN 'G:	100 000 000	<=	250 000 000	 '
	WHEN	250000000	<=	"case_amount_total"."CASE_AMOUNT"  THEN 'H:	250 000 000 <=	1000000000	 '
END

@matthias.gielk11

Thanks for sharing that.

My main concern is that if new data falls outside the defined ranges, it won’t be categorized and will return NULL.

I’m having trouble finding a dynamic formula and writing the PQL code for it.

I need the categorization to adjust automatically as new data comes in without manually updating the ranges every time.

Any thoughts on how to approach this?

@Saba
Can you describe the context of the data and what ‘new’ data would look like?
Eg. You expect new cases with a new ‘created on’, or you expect new orders that have ‘new customer names’, etc..
It could be different solutions depending on if it is text, numbers, dates, etc…
I see up top you wrote about numbers - is that the one?

For numbers you could use perecentages where the range can be based on the MIN/MAX values.

@matthias.gielk11

For example, right now my maximum potential saving for one item number is 62,567.50.

We say that anything greater than 60,000 is our first priority.

But what if, as the data keeps updating, a new item number with a new maximum potential saving comes in with the value of 84,000?

In this case, 84,000 becomes our first priority, and 62,567.50 is still treated as the first priority as well, since both are greater than 60,000.

Alternatively, if we set anything greater than 60,000 as the first priority and anything greater than 50,000 as the second, but the new maximum potential saving is 52,000, the classification result would not have 1 as the first priority. Instead, it would start from 2.

The key point is that dynamic code is necessary to ensure that the data is properly updated, and that priorities are recalculated as new data comes in.

By new data i mean numbers…
Also to put it in a nutshell I do not want to mention these numbers like (greater than 60000 or 50000) in my code.

@Saba in that case i would consider using a logic that relies on percentages of the MAX value.

You could say that at any time anything above 80% of the MAX value is categorised as priority 1, above 50% could be 2 and anything below could be 3. This would resolve not having a first priority.

CASE 
    WHEN value_column >= (SELECT MAX(value_column) * 0.8 FROM your_table) THEN 'priority 1'
    WHEN value_column > (SELECT MAX(value_column) * 0.5 FROM your_table) THEN 'priority 2'
    ELSE 'priority 3'

@matthias.gielk11

Yes, I did.
CASE
WHEN KPI("max_potential_saving") > 0.8 * KPI("kpi_max_") THEN '1'
WHEN KPI("max_potential_saving") > 0.6 * KPI("kpi_max_") THEN '2'
WHEN KPI("max_potential_saving") > 0.4 * KPI("kpi_max_") THEN '3'
WHEN KPI("max_potential_saving") > 0.2 * KPI("kpi_max_") THEN '4'
ELSE '6'
END

max_potential_saving = 62,567.50

kpi_max_ = It is my value which should be classified

but the result is not in order.

For example It shows 12000 in the first row and 62,567.50 in the third row !

I have also mentioned it in my previous post :

@Saba not a specialist in SQL/PQL but i guess the result here is being calculated row by row, so each specific case would also be the max. You could resolve this by calculating the MAX on the column as a seperate KPI and referencing it. So KPI_max shouldn’t be the classified value but the reference value.

so the outcome of max_potential_saving should be the to be classified value
and kpi_max should return 62,567.50

I would recommend renaming your column names and KPI’s to be more descriptive as the table column and 2 KPI’s sound very alike and probably difficult to differentiate :)

@matthias.gielk11

You are right, I should rename them.

I already made a mistake :)

max_potential_saving = It is my value which should be classified

kpi_max_ =62,567.50

This is the correct one and as you said the kpi_max_ is the kpi which calculates the max on column but still the problem remains...

@Saba Think i have it. We can use global to find the global MAX and simplify the logic

with buckets on top 70%, above 50% and below

CASE WHEN "Pizza_Case_Pizza_Case"."REVENUE" / global (MAX("Pizza_Case_Pizza_Case"."REVENUE")) > 0.7 THEN 1 
WHEN "Pizza_Case_Pizza_Case"."REVENUE" / global (MAX("Pizza_Case_Pizza_Case"."REVENUE")) > 0.5 THEN 2
ELSE 3 END

but why 60 is in the third row ? it should be the first because it is greater than others

@Saba that’s just sorting priority which you can set by clicking the column name

Reply

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded