question

mark zhen asked · Jeanette F commented

flexsim reward calculation and warm up issues

0926.fsm

There are some problems with the reward function. When I add the penalty, the reward calculated in the env file seems to be wrong (see the screenshots below).

1698652886689.png

1698652897636.png

I added a penalty to my reward function, but I think my reward should not be negative; even when the value is less than 0.1, there are still a lot of negative values. The last problem is my warm-up calculation: the warm-up does not reset my custom labels either.

FlexSim 23.0.0
reinforcement learning · warmup · reward function
1698652886689.png (693.4 KiB)
1698652897636.png (1.5 MiB)
0926.fsm (71.9 KiB)

Jason Lightfoot ♦ commented ·
Warmup just resets FlexSim's statistics (some objects use a flag to indicate whether they should do so), not user labels. If you've already sent rewards for a timestep before the warmup time, they won't be ignored or discounted, so I'm not sure what your warm-up is trying to achieve.
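On the label-reset point: since FlexSim's warmup clears built-in statistics but not user labels, a custom accumulator has to be zeroed manually at the warmup boundary. A minimal Python sketch of that bookkeeping (the class and its names are mine, purely illustrative, not FlexSim API):

```python
class TardinessTracker:
    """Accumulates tardiness; zeroes itself once when the warmup time is passed,
    mimicking how FlexSim zeroes its built-in statistics (but labels need this
    done by hand)."""

    def __init__(self, warmup_time):
        self.warmup_time = warmup_time
        self.total_tardiness = 0.0
        self.count = 0
        self._warmup_done = False

    def record(self, model_time, tardiness):
        # The first time we cross the warmup boundary, discard everything
        # collected so far -- the equivalent of resetting custom labels.
        if not self._warmup_done and model_time >= self.warmup_time:
            self.total_tardiness = 0.0
            self.count = 0
            self._warmup_done = True
        self.total_tardiness += tardiness
        self.count += 1

    def avg(self):
        return self.total_tardiness / self.count if self.count else 0.0
```

With a warmup of 100, a sample recorded at time 50 is wiped once the next sample arrives after time 100, so the average reflects only post-warmup data.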
mark zhen replied to Jason Lightfoot ♦ ·

I want to calculate my average, so I have to collect the data after the warm-up time, right? Without this my data would be wrong.

Jeanette F ♦♦ commented ·

Hi @mark zhen , was Kavika F's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always comment back to reopen your question.


1 Answer

Kavika F answered · Kavika F commented

Hey @mark zhen , I think part of your problem is the Tardiness calculation. The first item in the model starts with a date of 7001.21, which is just Model.time + 7000 at the time it's created.

1698679955633.png

1698680252598.png

What does this date try to accomplish? Is it a target finish date? If so, please label that better, maybe "TargetFinishDate".

However, because you have such a large difference between "date" and "finish" when you do

item.finish - item.date

you get a big negative number. I suspect most of the initial items will have this large negative value. And because you're listening to every processor's pull strategy to assign a reward based on a single sink's tardiness label (which holds a large negative number that only changes after an item finishes), you'll get a lot of negative rewards regardless of whether the items currently being pulled are actually tardy.
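To make the sign issue concrete, here is a tiny Python illustration (not the model's FlexScript; the sample numbers just echo the 7001.21 date above). Tardiness computed as finish minus target date is negative whenever an item finishes early, and those early items drag the running total far below zero:

```python
def tardiness(finish_time, target_date):
    """Raw tardiness: positive means late, negative means early."""
    return finish_time - target_date

# An item with target date 7001.21 that finishes at time 91.21
# is about 6910 time units early, i.e. tardiness ~ -6910.
print(tardiness(91.21, 7001.21))
```

Summing many values like this one is why the sink's total (and hence average) tardiness stays strongly negative for most of the run.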


1698679955633.png (9.7 KiB)
1698680252598.png (5.2 KiB)

mark zhen commented ·

I'm not quite sure what you mean, because I wrote this in the Sink: if the tardiness is negative, the item is counted as 1; if it is positive, it is counted as 0. That way I can calculate how many of my 930 orders are completed within the time limit.

Kavika F ♦ replied to mark zhen ·

@mark zhen, in the model you supplied, your reward function is as follows:

double reward = Model.find("Sink1").avgTardiness;

if (reward < 0.50) {
    reward = -1.0;
} else if (reward > 0.50) {
    reward = 1.0;
} else if (reward > 0.55) {
    reward = 3.0;
} else if (reward > 0.60) {
    reward = 7.0;
}

int done = Model.time > 18000;
return [reward, done];

You have a process flow that writes the avgTardiness value to a table; the sink then gets that value from the table and sets it as the reward.

That avgTardiness is the equation:

Model.find("Sink1").totalTardiness / Model.find("Sink1").as(Object).stats.input.value;

Essentially, this averages the tardiness over all inputs. The initial value is 0, so the first few times the processors use their Pull Strategy they get a -1 reward, because avgTardiness on the sink is 0 (and 0 is less than 0.50).

The first item that enters the sink has a Total Tardiness of -6910. So, every time a processor does a Pull Strategy call, it will have a -1 reward. After running the model to the end, every tardiness value I saw was negative, meaning the average was always negative, and so the reward is always -1. That's why you get such a large negative number for your reward total at the end.
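There is also an ordering problem in those thresholds. Because the branches are checked with else-if in ascending order, anything above 0.50 is caught by the first `> 0.50` branch, so the 0.55 and 0.60 tiers can never fire. A minimal Python sketch of the quoted logic and one way to reorder it (function names are mine, purely illustrative):

```python
def shaped_reward_as_written(avg):
    """Mirrors the branch order in the quoted FlexScript."""
    if avg < 0.50:
        return -1.0
    elif avg > 0.50:
        return 1.0   # catches 0.55, 0.60, 0.99... everything above 0.50
    elif avg > 0.55:
        return 3.0   # unreachable
    elif avg > 0.60:
        return 7.0   # unreachable
    return avg       # only avg == 0.50 falls through

def shaped_reward_fixed(avg):
    """Check the highest threshold first so every tier is reachable."""
    if avg > 0.60:
        return 7.0
    elif avg > 0.55:
        return 3.0
    elif avg > 0.50:
        return 1.0
    return -1.0

print(shaped_reward_as_written(0.66))  # the 3.0 / 7.0 tiers never trigger
print(shaped_reward_fixed(0.66))
```

With the original ordering, an average of 0.66 yields a reward of 1.0; with the highest threshold checked first, the same 0.66 reaches the 7.0 tier.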

mark zhen replied to Kavika F ♦ ·

@Kavika F

I think something got mixed up here. You wrote:

That avgTardiness is the equation:

Model.find("Sink1").totalTardiness / Model.find("Sink1").as(Object).stats.input.value;

Your understanding of that part may be wrong.

The formula I set for avgTardiness uses item.Tardiness: when an item's Tardiness is greater than 0, I record it as delayed, and I divide that count by the total input to calculate the percentage.

1698782630349.png

1698782644297.png

double reward = Model.find("Sink1").avgTardiness;

So my reward should be equal to 0.66, right?
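The percentage described in this comment can be sketched in plain Python (not FlexScript; the function name and the 930-order sample are made up for demonstration). Note that counting late items and counting on-time items give complementary fractions, which may be the source of the disagreement:

```python
def late_fraction(tardiness_values):
    """Share of items whose Tardiness is greater than 0 (i.e. late)."""
    late = sum(1 for t in tardiness_values if t > 0)
    return late / len(tardiness_values)

# Hypothetical sample: 310 of 930 orders late, 620 on time.
sample = [5.0] * 310 + [-3.0] * 620
print(round(late_fraction(sample), 3))      # fraction late
print(round(1 - late_fraction(sample), 3))  # fraction on time
```

For this sample, the late fraction is about 0.333 and the on-time fraction about 0.667, so whether "0.66" is the right reward depends on which of the two the sink label is actually accumulating.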


1698782630349.png (9.2 KiB)
Show more comments
