Hi
I have been studying the RL example model for some time now. It seems to work fine but one thing I don't understand. It may be the nature of the RL or there might also be something wrong in my model:
In the manual it says (about the reward function): "This will give a larger reward for items that processed faster, and a smaller reward if the item had a long changeover time." Shouldn't this lead to situation, where the processor always takes same type of items (after type 3 it should try to take 3) because they don't have setup time and it will lead to shortest lead time --> max reward? Now it seems that the RL tells the model to take item types quite randomly. I have tried to train the model longer and even initialized the buffer with one type of items to make sure that there is always same type of item available. The behaviour did not change. I'd be thankful for any help as this topic interests me a lot.