I have the following PySpark dataframe, and I need to calculate the 'Coverage' rows.
Coverage is calculated as follows: take the 'Inventory' value for each day; 'Coverage' is then the number of days of 'Shipments' (starting from the next day) that can be summed without exceeding that 'Inventory' value, per Material & Plant. So 'Coverage' would have the following values:
|   | Plant | Material | Category  | 14-06-22 | 15-06-22 | 16-06-22 | 17-06-22 | 18-06-22 | 19-06-22 | 20-06-22 | 21-06-22 | 29-06-22 |
|---|-------|----------|-----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 1 | 011   | 1234     | Shipments | 213      | 122      | 341      | 142      | 93       | 177      | 213      | 187      | 133      |
| 2 | 011   | 1234     | Inventory | 1245     | 944      | 1056     | 891      | 995      | 1333     | 1122     | 1289     | 943      |
| 3 | 011   | 1234     | Coverage  | 6        | 4        | 6        | 5        | 5        | 6        | 5        | 7        | 5        |
| 4 | Z21   | 546      | Shipments | 62       | 133      | 164      | 156      | 128      | 130      | 55       | 159      | 288      |
| 5 | Z21   | 546      | Inventory | 344      | 278      | 284      | 653      | 499      | 766      | 567      | 879      | 654      |
| 6 | Z21   | 546      | Coverage  | 2        | 1        | 2        | 4        | 3        | 5        | 3        | 5        | 5        |
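For reproducibility, this is roughly how the input data could be recreated (the column names and types are my own guesses based on the table above; the first column is just a row number, and the Coverage rows are what I want to produce):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

date_cols = ["14-06-22", "15-06-22", "16-06-22", "17-06-22",
             "18-06-22", "19-06-22", "20-06-22", "21-06-22", "29-06-22"]

# Only the Shipments and Inventory rows exist; Coverage must be computed.
rows = [
    ("011", "1234", "Shipments", 213, 122, 341, 142, 93, 177, 213, 187, 133),
    ("011", "1234", "Inventory", 1245, 944, 1056, 891, 995, 1333, 1122, 1289, 943),
    ("Z21", "546", "Shipments", 62, 133, 164, 156, 128, 130, 55, 159, 288),
    ("Z21", "546", "Inventory", 344, 278, 284, 653, 499, 766, 567, 879, 654),
]

df = spark.createDataFrame(rows, ["Plant", "Material", "Category"] + date_cols)
df.show()
```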
To explain further: on 14-06-22, Plant 011 and Material 1234 have an Inventory of 1245. Following the formula above, Coverage is the number of days whose Shipment values (starting from the next day) can be summed before exceeding 1245. Therefore:
Coverage = number of Shipment days (122 + 341 + 142 + 93 + 177 + 213 = 1088), i.e. we can sum 6 days and get as close as possible to the Inventory (1245) without exceeding it. So Coverage for 14-06-22, Plant 011 and Material 1234 is 6.
This applies to all Coverage values. When we reach the last day's column and still haven't reached the Inventory value, we wrap around and continue summing from the first day. Example:
On 19-06-22, Plant Z21 and Material 546 have an Inventory of 766. The Coverage formula gives:
Coverage = number of Shipment days (55 + 159 + 288 + 62 + 133 = 697), i.e. we can sum 5 days (wrapping around to the first columns) without exceeding the Inventory (766). So Coverage for 19-06-22, Plant Z21 and Material 546 is 5.
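To make the rule precise, here is a plain-Python sketch of the logic I am describing (just to illustrate the calculation on one Plant/Material, not the PySpark solution I'm asking for):

```python
# For a given day: starting from the NEXT day, keep adding Shipments
# (wrapping around to the first day after the last one) while the running
# total does not exceed that day's Inventory. Coverage = number of days added.
def coverage_for_day(day_idx, shipments, inventory):
    limit = inventory[day_idx]
    total, days = 0, 0
    n = len(shipments)
    for step in range(1, n + 1):               # look at most n days ahead
        nxt = shipments[(day_idx + step) % n]  # wrap around the date columns
        if total + nxt > limit:
            break
        total += nxt
        days += 1
    return days

# Plant 011 / Material 1234 from the table above
shipments = [213, 122, 341, 142, 93, 177, 213, 187, 133]
inventory = [1245, 944, 1056, 891, 995, 1333, 1122, 1289, 943]
print([coverage_for_day(i, shipments, inventory) for i in range(len(shipments))])
# -> [6, 4, 6, 5, 5, 6, 5, 7, 5], matching the Coverage row above
```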
How can I actually perform this calculation, iterating over each column and row, using PySpark/SQL? I am using a Databricks notebook.