
mark zhen asked · Joerg Vogel commented

I want to understand how I can tune my reinforcement learning model.

My current model is shown below. I want my reinforcement learning agent to learn the best policy for pulling the goods.

1665641917869.png

I am currently using odd-job processing; the red part in front is the WIP area. What I want to do is use reinforcement learning to find the best arrangement. How do I tune this, and do I need to adjust my reward function? 0816-2-1.fsm

1665642085695.png

Also, why are my results like this? Did I do something wrong?

FlexSim 22.0.0
reinforcement learning
1665641917869.png (42.5 KiB)
1665642085695.png (141.4 KiB)
0816-2-1.fsm (310.3 KiB)

Jeanette F ♦♦ commented ·

Hi @mark zhen, was Jordan Johnson's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.


1 Answer

Jordan Johnson answered · Joerg Vogel commented

Properly tuning the reward in a Reinforcement Learning model is very difficult, and each unique model needs to be individually tuned. The only way to find the correct rewards is to iterate through the following process over and over:

  1. Plan out a set of observations, actions, and a reward system (a minimal reward-function skeleton is sketched after this list).
  2. Train an AI using that reward system. Be sure to train long enough to determine whether it is working. For small models like this, I'd observe for 1 million steps to see if the policy learns something. In that time, the AI should learn to improve its reward. If the average reward has not improved by then, it usually means the AI doesn't have the right set of observations.
  3. Use the model from that training to see what the AI did learn. How did it maximize reward? It is very common for an AI to maximize reward in an unexpected way. Observe what behaviors the AI uses to maximize its reward.
  4. You may discover that the reward system you created incentivized improper behavior, or that the AI was unable to learn anything at all. Depending on how your AI performs, you may need major changes or only minor ones. Using what you have learned, return to step 1.
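
To make step 1 concrete: in FlexSim's Reinforcement Learning tool, the reward system ultimately takes the form of a Reward Function written in FlexScript that returns both a reward and a done flag. Below is a minimal skeleton of that shape; the episode-end condition is a placeholder, not a recommendation.

    // Reward Function skeleton (a sketch; the numbers are placeholders).
    double reward = 0;                // compute the reward for the most recent action here
    int done = Model.time >= 100000;  // example: end the episode after a fixed stretch of model time
    return [reward, done];            // the tool expects both the reward and the done flag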

Also keep in mind this rule about reward: reward should not accumulate between actions. For example, if I want to reward throughput, I should reward only the items produced since the previous action, not the running total; otherwise every action gets credit for throughput that earlier actions produced.
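
For instance, a non-accumulating throughput reward could be written along these lines in the Reward Function. This is only a sketch: the sink name "Sink1" and the "LastInput" label are hypothetical, and the label would need to exist (e.g., created on model reset) before the first action.

    // Non-accumulating throughput reward (names are hypothetical).
    Object sink = Model.find("Sink1");          // hypothetical sink name
    double curInput = sink.stats.input.value;   // total items the sink has received so far
    double reward = curInput - sink.LastInput;  // reward only the change since the last action
    sink.LastInput = curInput;                  // remember the running total for next time
    return [reward, 0];                         // leave the episode-end decision to some other condition

Because only the difference is returned, each action is credited with exactly the throughput that occurred after it was taken.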

It is very helpful to keep a journal or log of what you tried and how it performed. This process is time consuming and difficult, and a log can help you remember what you learned from previous attempts.

If you get "inf" in your reward value, that usually means there is a divide-by-zero somewhere. I note that the Sink updates its Reward label with this value:

1000/(Model.time-current.LastTime)

In your model, it is possible that two items enter the Sink at exactly the same clock time, so the denominator becomes zero.
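
A simple way around that, assuming the label update happens in a Sink trigger such as OnEntry, is to guard the denominator before dividing:

    // Guarded version of the Sink's reward-label update (a sketch, not the only fix).
    double elapsed = Model.time - current.LastTime;
    if (elapsed <= 0)    // two items entered at the same model time
        elapsed = 0.001; // clamp to a small positive value; tune to your model's time units
    current.Reward = 1000 / elapsed;
    current.LastTime = Model.time;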


mark zhen commented ·

Your answer is great, but I might need more help. Can you use my model as an example?

mark zhen commented ·

I understand what you mean: because I may finish processing two products at exactly the same time, the denominator becomes zero. Can you provide me with another approach to reference?

mark zhen commented ·

What I want to do is find the best way to schedule, so will my observations and actions be the same as in the teaching guidelines?

And my goal now is to minimize the total completion time.
