RL reward function

I have been studying the RL example model for some time now. It seems to work fine, but there is one thing I don't understand. It may be the nature of RL, or there may be something wrong in my model:

In the manual it says (about the reward function): "This will give a larger reward for items that processed faster, and a smaller reward if the item had a long changeover time." Shouldn't this lead to a situation where the processor always takes the same type of item (after type 3 it should try to take type 3 again), because same-type items need no setup time and therefore give the shortest lead time and the maximum reward? Right now the RL seems to tell the model to take item types more or less at random. I have tried training the model longer, and I even initialized the buffer with items of one type to make sure that an item of the same type is always available. The behaviour did not change. I'd be thankful for any help, as this topic interests me a lot.
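To make the expectation above concrete, here is a minimal toy sketch, independent of FlexSim. The processing time, changeover time, number of item types, and the reward formula are all assumptions chosen for illustration; they are not taken from the example model or the manual. Under these assumptions, a policy that always repeats the previous item type earns a higher average reward than one that picks at random.

```python
import random

# Hypothetical numbers, NOT taken from the FlexSim example model.
PROCESS_TIME = 5.0      # time to process any item
CHANGEOVER_TIME = 10.0  # extra setup time when the item type changes
NUM_TYPES = 5           # assumed number of item types


def reward(prev_type, next_type):
    """Assumed reward: larger when the item took less total time."""
    changeover = 0.0 if prev_type == next_type else CHANGEOVER_TIME
    return 10.0 / (PROCESS_TIME + changeover)


def repeat_policy(prev_type, rng):
    """Always take the same type as the previous item."""
    return prev_type


def random_policy(prev_type, rng):
    """Take a uniformly random type, ignoring the previous item."""
    return rng.randint(1, NUM_TYPES)


def average_reward(policy, steps=10_000, seed=0):
    """Run a policy for a number of steps and return its mean reward."""
    rng = random.Random(seed)
    prev_type = 1
    total = 0.0
    for _ in range(steps):
        next_type = policy(prev_type, rng)
        total += reward(prev_type, next_type)
        prev_type = next_type
    return total / steps


print("repeat:", average_reward(repeat_policy))
print("random:", average_reward(random_policy))
```

This toy assumes every type is always available and that the reward depends only on the last two item types, which is the situation the buffer initialization was meant to create. It says nothing about whether a trained agent will actually find that policy.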

changeovertimesrl.fsm

Software Version:

FlexSim 24.0.1


Sam Stubbs ♦ commented · Oct 16 at 06:44 AM

That is part of the issue with RL as a concept. A reward function isn't a guarantee of behavior. The game of RL is figuring out what rewards to give, based on the observation space and the actions, to coax the agent in the direction you want, and that ends up being a much more complicated problem to solve. Unfortunately, this is an issue with the concept of RL itself. You might want to post the question on a board that deals with RL/Gymnasium directly, as you may find a more expert response on RL behavior there. The behavior/learning of the agent isn't really specific to FlexSim. FlexSim only acts as the environment to train in; all of the behavior is bound up in the RL code, in this case Gymnasium.


Joerg Vogel commented · Oct 16 at 05:42 AM

@Tomi Kosunen I think you are currently gathering data in the learning state of the RL engine. In that state the input should be random so that the engine can find a pattern.


Tomi Kosunen commented · Oct 16 at 06:00 AM

@Joerg Vogel I first train the model. The On Request Action is set to "Take random" and the Python script flexsim_training.py collects the data.

After that I change the On Request Action code to "Query a server for a predicted action from a trained model", start the Python script flexsim_inference.py, and run the model. The trained model then controls the simulation and sets the Action table parameter ItemType.

Have you tested the model, and what is your opinion: should the algorithm control the model so that the machine always tries to take the same item type as the previous one?
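One way to put a number on "quite randomly" during the inference run is to log the sequence of ItemType values the trained model chooses and measure how often it repeats the previous type. The sketch below is plain Python; the helper name `repeat_rate` and the logged sequence are hypothetical, and how the sequence gets exported from the model is left open.

```python
def repeat_rate(item_types):
    """Fraction of decisions where the chosen type equals the previous one."""
    if len(item_types) < 2:
        return 0.0
    pairs = zip(item_types, item_types[1:])
    repeats = sum(1 for prev, cur in pairs if prev == cur)
    return repeats / (len(item_types) - 1)


# Hypothetical log of chosen item types, for illustration only.
logged = [3, 3, 3, 1, 1, 4, 4, 4, 4, 2]
print(repeat_rate(logged))
```

A choice that is uniformly random over N always-available types would repeat about 1/N of the time, so a rate clearly above that would suggest the agent has learned at least some preference for avoiding changeovers.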