question

mark zhen asked · Jeanette F commented

flexsim reward calculation and warm up issues

0926.fsm

There are some problems with the reward function. When I add the penalty, the reward calculated in the env file seems to be wrong (see the screenshots below).

1698652886689.png

1698652897636.png

I added a penalty to my reward function, but I think my reward should not be negative; even when the value is less than 0.1, there are still a lot of negative values. The last problem is my warm-up calculation: the warm-up does not reset my custom labels either.

FlexSim 23.0.0
reinforcement learning · warmup · reward function
1698652886689.png (693.4 KiB)
1698652897636.png (1.5 MiB)
0926.fsm (71.9 KiB)

Jason Lightfoot ♦ commented ·
Warmup just resets FlexSim's statistics (some objects use a flag to indicate whether they should do so), not user labels. If you've already sent rewards for a timestep before the warmup time, they won't be ignored or discounted, so I'm not sure what your warm-up is trying to achieve.
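On the label-reset point: since FlexSim's warmup clears built-in statistics but not user labels, a custom accumulator has to be zeroed manually at the warmup boundary. A minimal Python sketch of that bookkeeping (the class and its names are mine, purely illustrative, not FlexSim API):

```python
class TardinessTracker:
    """Accumulates tardiness; zeroes itself once when the warmup time is passed,
    mimicking how FlexSim zeroes its built-in statistics (but labels need this
    done by hand)."""

    def __init__(self, warmup_time):
        self.warmup_time = warmup_time
        self.total_tardiness = 0.0
        self.count = 0
        self._warmup_done = False

    def record(self, model_time, tardiness):
        # The first time we cross the warmup boundary, discard everything
        # collected so far -- the equivalent of resetting custom labels.
        if not self._warmup_done and model_time >= self.warmup_time:
            self.total_tardiness = 0.0
            self.count = 0
            self._warmup_done = True
        self.total_tardiness += tardiness
        self.count += 1

    def avg(self):
        return self.total_tardiness / self.count if self.count else 0.0
```

With a warmup of 100, a sample recorded at time 50 is wiped once the next sample arrives after time 100, so the average reflects only post-warmup data.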
mark zhen replied to Jason Lightfoot ♦ ·

I want to calculate my average, so I have to collect the data after the warm-up time, right? Without this my data would be wrong.

Jeanette F ♦♦ commented ·

Hi @mark zhen , was Kavika F's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always comment back to reopen your question.


1 Answer

Kavika F answered · Kavika F commented

Hey @mark zhen , I think part of your problem is the Tardiness calculation. The first item in the model starts with a date of 7001.21, which is just Model.time + 7000 at the time it's created.

1698679955633.png

1698680252598.png

What does this date try to accomplish? Is it a target finish date? If so, please label that better, maybe "TargetFinishDate".

However, because you have such a large difference between "date" and "finish" when you do

item.finish - item.date

you get a big negative number. I suspect most of the initial items will have this large negative value. And because you're listening to every processor's pull strategy to assign a reward based on a single sink's tardiness label (which holds a large negative number that only changes after an item finishes), you'll get a lot of negative rewards regardless of whether the items currently being pulled are actually tardy.
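To make the sign issue concrete, here is a tiny Python illustration (not the model's FlexScript; the sample numbers just echo the 7001.21 date above). Tardiness computed as finish minus target date is negative whenever an item finishes early, and those early items drag the running total far below zero:

```python
def tardiness(finish_time, target_date):
    """Raw tardiness: positive means late, negative means early."""
    return finish_time - target_date

# An item with target date 7001.21 that finishes at time 91.21
# is about 6910 time units early, i.e. tardiness ~ -6910.
print(tardiness(91.21, 7001.21))
```

Summing many values like this one is why the sink's total (and hence average) tardiness stays strongly negative for most of the run.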


1698679955633.png (9.7 KiB)
1698680252598.png (5.2 KiB)

mark zhen commented ·

I'm not quite sure what you mean, because I wrote this in the Sink: if the tardiness is negative, the item is counted as 1; if it is positive, it is counted as 0. That way I can calculate how many of my 930 orders are completed within the time limit.

Kavika F ♦ replied to mark zhen ·

@mark zhen, in the model you supplied, your reward function is as follows:

double reward = Model.find("Sink1").avgTardiness;

if (reward < 0.50) {
    reward = -1.0;
} else if (reward > 0.50) {
    reward = 1.0;
} else if (reward > 0.55) {
    reward = 3.0;
} else if (reward > 0.60) {
    reward = 7.0;
}

int done = Model.time > 18000;
return [reward, done];

You have a process flow that writes the avgTardiness value to a table; the sink then gets that value from the table and sets it as the reward.

That avgTardiness is the equation:

Model.find("Sink1").totalTardiness / Model.find("Sink1").as(Object).stats.input.value;

Essentially, this averages the tardiness over all inputs. The initial value is 0, so the first few times the processors use their Pull Strategy they get a -1 reward, because avgTardiness on the sink is 0 (and 0 is less than 0.50).

The first item that enters the sink has a Total Tardiness of -6910. So, every time a processor does a Pull Strategy call, it will have a -1 reward. After running the model to the end, every tardiness value I saw was negative, meaning the average was always negative, and so the reward is always -1. That's why you get such a large negative number for your reward total at the end.
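There is also an ordering problem in those thresholds. Because the branches are checked with else-if in ascending order, anything above 0.50 is caught by the first `> 0.50` branch, so the 0.55 and 0.60 tiers can never fire. A minimal Python sketch of the quoted logic and one way to reorder it (function names are mine, purely illustrative):

```python
def shaped_reward_as_written(avg):
    """Mirrors the branch order in the quoted FlexScript."""
    if avg < 0.50:
        return -1.0
    elif avg > 0.50:
        return 1.0   # catches 0.55, 0.60, 0.99... everything above 0.50
    elif avg > 0.55:
        return 3.0   # unreachable
    elif avg > 0.60:
        return 7.0   # unreachable
    return avg       # only avg == 0.50 falls through

def shaped_reward_fixed(avg):
    """Check the highest threshold first so every tier is reachable."""
    if avg > 0.60:
        return 7.0
    elif avg > 0.55:
        return 3.0
    elif avg > 0.50:
        return 1.0
    return -1.0

print(shaped_reward_as_written(0.66))  # the 3.0 / 7.0 tiers never trigger
print(shaped_reward_fixed(0.66))
```

With the original ordering, an average of 0.66 yields a reward of 1.0; with the highest threshold checked first, the same 0.66 reaches the 7.0 tier.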

mark zhen replied to Kavika F ♦ ·

@Kavika F

I think something got mixed up here. You wrote:

That avgTardiness is the equation:

Model.find("Sink1").totalTardiness / Model.find("Sink1").as(Object).stats.input.value;

Your understanding of that part may be wrong.

The formula I set for avgTardiness uses item.Tardiness: when an item's Tardiness is greater than 0, I record it as delayed, and I divide that count by the total input to calculate the percentage.

1698782630349.png

1698782644297.png

double reward = Model.find("Sink1").avgTardiness;

So my reward should be equal to 0.66, right?
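The percentage described in this comment can be sketched in plain Python (not FlexScript; the function name and the 930-order sample are made up for demonstration). Note that counting late items and counting on-time items give complementary fractions, which may be the source of the disagreement:

```python
def late_fraction(tardiness_values):
    """Share of items whose Tardiness is greater than 0 (i.e. late)."""
    late = sum(1 for t in tardiness_values if t > 0)
    return late / len(tardiness_values)

# Hypothetical sample: 310 of 930 orders late, 620 on time.
sample = [5.0] * 310 + [-3.0] * 620
print(round(late_fraction(sample), 3))      # fraction late
print(round(1 - late_fraction(sample), 3))  # fraction on time
```

For this sample, the late fraction is about 0.333 and the on-time fraction about 0.667, so whether "0.66" is the right reward depends on which of the two the sink label is actually accumulating.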


1698782630349.png (9.2 KiB)
Show more comments
