Maryam H2 asked:

reward function

For a reward function in the RL tool, how can I set up the reward so that it refers to a row in the Performance Measures table?

For example:

Reward = Quantity_i - StayTime_i

where Quantity_i is the number of items in a queue for each item type i (set in the Performance Measures table) and StayTime_i is the stay time of item type i in a rack (also set in the Performance Measures table). StayTime_i enters with a negative sign because I want to penalize the time items remain in the racks.



FlexSim 24.1.0

Jeanette F commented:

Hi @Maryam H2, was Felix Möhlmann's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always comment back to reopen your question.


1 Answer

Felix Möhlmann answered:

Performance measure values are FlexScript nodes. You need to evaluate the node (passing in the stored reference, if there is one) to get the actual value.

(screenshot: capture1.png)

// Find the Value node of the first performance measure ("1" is its rank in the table)
treenode pfmValue = Model.find("/Tools/PerformanceMeasureTables/PerformanceMeasures>variables/performanceMeasures/1/Value");
// Evaluate the FlexScript node, passing in the stored object reference as a parameter
Variant value = pfmValue.subnodes[1].evaluate(pfmValue.subnodes[2].value);
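
For the reward from the original question you could, for example, evaluate two performance measures and combine them. A minimal sketch, assuming the quantity and stay time measures sit at ranks 1 and 2 of the table (adjust the ranks to match your model):

// Reward = Quantity - StayTime, reading both values from the Performance Measures table
treenode qtyValue = Model.find("/Tools/PerformanceMeasureTables/PerformanceMeasures>variables/performanceMeasures/1/Value");
treenode stayValue = Model.find("/Tools/PerformanceMeasureTables/PerformanceMeasures>variables/performanceMeasures/2/Value");
double quantity = qtyValue.subnodes[1].evaluate(qtyValue.subnodes[2].value);
double stayTime = stayValue.subnodes[1].evaluate(stayValue.subnodes[2].value);
double reward = quantity - stayTime;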


Maryam H2 commented:

Hi @Felix Möhlmann @Jeanette F

The code above does not return any value. I have the target inventory levels in a Parameters Table and the current inventory levels in a Performance Measures table. If I want to penalize deviation from the target inventory levels and encourage the agent to take actions that minimize this penalty, how should I structure the reward function? Do you have an example I could reference?

I was thinking of starting with a reward function defined like this:

def reward_function(current_inventory, target_inventory):
    # Calculate the absolute difference between current and target inventory levels
    deviation = abs(current_inventory - target_inventory)
    # Penalize the deviation (negative reward), scaled by an adjustable factor
    penalty_factor = 0.1  # Adjustable
    reward = -penalty_factor * deviation
    return reward

Also, is there a way to instruct the agent to minimize the frequency of actions (such as placing orders for item types and receiving them in queues) in order to reduce ordering costs and extend the time interval between orders as much as possible? If so, how can I do this?


Maryam H2 commented:
Hi @Felix Möhlmann, any idea about my question?
Felix Möhlmann commented:

The code from my original answer does return a value for me; I just tested it again in version 24.0.2. You might have to adjust the path to get the value of the correct PFM, though. The "1" in the path is the rank of the performance measure.

The fundamental logic of your code makes sense. It's just not FlexScript. I have heard and read that clamping the reward to lie between -1 and 1 works best for many RL algorithms, so that might be worth trying.

If you want to define a function in FlexSim, have a look at user commands. Your logic in a user command (plus clamping the value to the [-1, 1] interval) would look something like this:

// Parameters passed into the user command
double current_inventory = param(1);
double target_inventory = param(2);
// Penalize the absolute deviation from the target, scaled by a factor
double reward = -Math.fabs(current_inventory - target_inventory);
reward *= 0.1;
// Clamp the reward to the [-1, 1] interval
reward = Math.min(1, Math.max(reward, -1));
return reward;
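
To call this from the RL tool's reward function, you could pull the current inventory from the performance measure and the target from the Parameters Table. A sketch, where the user command name "inventoryReward" and the parameter name "TargetInventory" are hypothetical:

// Current inventory from the performance measure, as in the original answer
treenode pfmValue = Model.find("/Tools/PerformanceMeasureTables/PerformanceMeasures>variables/performanceMeasures/1/Value");
double current = pfmValue.subnodes[1].evaluate(pfmValue.subnodes[2].value);
// "TargetInventory" is an assumed model parameter holding the target level
double target = Model.parameters["TargetInventory"].value;
// "inventoryReward" is the hypothetical user command defined above
double reward = inventoryReward(current, target);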

You determine when a decision is made by setting up the decision events. If the agent can influence this (for example, by ordering a larger quantity when a decision is only made once stock falls below a certain level), then it should learn to do so, provided the reward function takes the amount of time since the last decision into account.
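
As a sketch of that idea: store the time of the previous decision in a label and reward longer intervals. The label name "lastDecisionTime" and the scale factor 0.001 are assumptions:

// Time elapsed since the previous decision (the label is created on first use)
double lastTime = model().labels.assert("lastDecisionTime", 0).value;
double elapsed = time() - lastTime;
model().labels["lastDecisionTime"].value = time();
// Reward longer intervals between orders, clamped to at most 1
double reward = Math.min(1, elapsed * 0.001);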
