question

Scarlett X asked Phil BoBo commented

Reward of reinforcement learning

Hi everyone, what is the intention of these 3 lines of code below?

a. Model.find("Sink1").Reward = 0;

b. int done = (Model.time > 1000);

c. return [reward, done];

Can I delete the code in row a in order to show the reward label on a dashboard?

If I want to increase the accuracy of the trained model, do I need to change the number in "model.learn(total_timesteps=1000)"?

If that's still not enough, can I use punishment or increase the reward to achieve this goal?

Thank you!

FlexSim 22.0.0
reinforcement learning

Kavika F ♦ commented ·

Hi @Scarlett X, was one of Phil BoBo's or Jacob W2's answers helpful? If so, please click the "Accept" button at the bottom of the one that best answers your question. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.

Phil BoBo answered Phil BoBo commented

a. This line makes it so that the reward returned is only the amount of reward accumulated on the sink since the last action. If you just return the accumulated reward on the label without setting it back to zero, then each action will return more and more reward, and the algorithm will learn that actions toward the end of the simulation are better than actions toward the beginning. That's not what you are trying to teach it. You want to return a reward that it can correlate with the action it took based on the state of the observations. You can return whatever reward you want and store it however you want so that you can see it on a dashboard if you want. The important thing is that line c needs to return the reward for taking the previous action. (Various reinforcement learning algorithms work with delayed rewards, but in general, they learn faster if the reward for a given action in a given state is as close in time as possible to the decision that caused that reward.)
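
If you want to see the reward on a dashboard while still resetting the per-action reward, one option is to accumulate it into a second label before zeroing the first. Here is a minimal sketch of the Reward Function body, assuming you add a "TotalReward" label to Sink1 (that label is not part of the example model):

Object sink = Model.find("Sink1");
double reward = sink.Reward; // reward accumulated since the last action
sink.TotalReward = sink.TotalReward + reward; // hypothetical cumulative label you can chart on a dashboard
sink.Reward = 0; // reset so the next action only sees its own reward
int done = (Model.time > 1000);
return [reward, done];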

b. This line makes it so the training episode ends after the model has passed 1000 time units. It is entirely arbitrary. You can stop a training episode whenever you want. It's up to you how you want to teach the algorithm. Instead of a specific time, you might stop the episode based on the state of the system, for example when you reach the end of a maze or finish processing all the orders in a factory.
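
As a rough sketch of a state-based stop, assuming all of the orders end up in Sink1 and the total order count is known ahead of time (100 here is an arbitrary placeholder):

Object sink = Model.find("Sink1");
double reward = sink.Reward;
sink.Reward = 0;
int done = (sink.stats.input.value >= 100); // episode ends once the assumed 100th order reaches the sink
return [reward, done];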

c. This is the critical line of this function that returns a reward and whether the episode is done.

Many different things can affect the "accuracy" of the trained model. If you want to train for more steps, then yes, you can change the total_timesteps that the model learns for in the Python script. In general, training longer will allow the algorithm to learn better, but you often hit diminishing returns once it has been sufficiently trained.
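
As a sketch of that change, assuming a training script along the lines of the FlexSim example (stable-baselines3 with a gym-style FlexSim environment; the module, class, and file names below are assumptions, and the environment constructor arguments are omitted):

# Python training script sketch
from stable_baselines3 import PPO
from flexsim_env import FlexSimEnv  # assumption: the environment class provided with the example script

env = FlexSimEnv()  # assumption: the real constructor needs paths to FlexSim and the model
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100000)  # was 1000 in the example; more steps generally trains better
model.save("flexsim_rl_model")  # hypothetical file name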

Yes, you can return negative rewards to punish bad actions.
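
For example, here is a sketch of punishing changeovers in the Reward Function, assuming the processor records the duration of its last changeover on a "LastChangeover" label that you maintain yourself (neither that label nor the object name is part of the example model):

Object sink = Model.find("Sink1");
Object processor = Model.find("Processor1"); // assumed name of the object whose changeovers you want to punish
double reward = sink.Reward - processor.LastChangeover; // changeover time counts against the last action
sink.Reward = 0;
processor.LastChangeover = 0; // reset so the next action is only punished for its own changeover
int done = (Model.time > 1000);
return [reward, done];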

That model is just an example showing how all the aspects of reinforcement learning work together. It isn't a precise solution to every machine learning problem. Training a reinforcement learning algorithm is like training a dog: you need to think about what you are trying to teach it by rewarding it based on the observed state of the system, the chosen action, and the result of that action. For example, if a dog hears you say sit (observation), then sits (action), and then receives a treat (reward), the dog will learn to sit when you tell it to sit. It will get better with consistent repetition of this process, and it will get confused if you are inconsistent in your rewards, for example if you tell it to sit and it accidentally sits on a spiky object (a negative reward for the same observation/action combination). In that case, you may need to add more state to your observation space (you said sit + is there something spiky under me) so that the dog can know whether something is under it and learn to look before sitting (actions) to avoid sitting on something bad (negative reward).

Reinforcement learning algorithms aren't going to magically learn how to make optimal choices; you have to teach the algorithm by designing your training episodes in a way that it will learn what you want it to learn.


Scarlett X commented ·

About a.

When will it get a reward in the official model?


About b.

1. How can I write the code if I want it to end when it finishes processing all the orders in a factory?

2. What if I just set done = 1 or done = 0?


About this: "It will get better with consistent repetition of this process, and it will get confused if you are inconsistent in your rewards."

When the model processes different item types, there is a changeover time. Does this sentence mean that setting the reward to minus the changeover time is inappropriate? (I mean, if the first processed item is type 1 and the second is type 2, I give it a reward of -10; if the second item is type 3, I give it -20.) Because I think using minus the changeover time is the best choice to fit "the reward for a given action in a given state is as close in time as possible to the decision that caused that reward." Or did I misunderstand?

Phil BoBo ♦♦ Scarlett X commented ·

Decision Events

When the model is reset, an initial observation and action will be taken. The model will then run. At each of the specified decision events, a reward will be received for the previous actions, and if the episode is not done, another observation and action will be taken. This cycle will continue until the Reward Function returns that the episode is done.

The Reinforcement Learning Tool (flexsim.com)

"How can I write the code if I want it end when it finishing processing all the orders in a factory."

When the last order is finished processing, return 1 for the done value in the Reward Function.
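
A minimal sketch of that done check, assuming every order is created by Source1 and leaves the model through Sink1 (the object names are assumptions):

Object source = Model.find("Source1");
Object sink = Model.find("Sink1");
double reward = sink.Reward;
sink.Reward = 0;
int done = (source.stats.output.value > 0 && sink.stats.input.value >= source.stats.output.value); // everything created has been finished
return [reward, done];

If the source keeps creating orders for the whole run, you would also need to check that it has finished creating before treating the episode as done.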

It sounds like you have very little FlexSim model building experience. Perhaps you should contact your local distributor for training in how to use the software before attempting to use FlexSim as an environment for reinforcement learning.

I cannot explain to you all the scenarios of what will happen when you return certain rewards in particular situations. If you want to know what will happen in a particular configuration of training, then configure it a particular way, train it, and see what it learned.

Scarlett X Phil BoBo ♦♦ commented ·

I don't want you to explain all the scenarios; I'm just trying to understand what you told me about "It will get better with consistent repetition of this process, and it will get confused if you are inconsistent in your rewards."

Because I made this huge mistake and I want to fix my model. When I set the reward to minus the changeover time, my model got stuck during the training step, and I don't know whether that is because of a coding error, computing power, or my understanding of reinforcement learning in FlexSim. That's why I asked you again.

Jacob W2 answered

Hi @Scarlett X,

A. This line initializes the value of the reward label on the sink in the model.

B. This tells the reinforcement training episode to end after 1000 time units have passed in the model.

C. This returns the values of reward and done every time an observation is made during the run. This is how the reinforcement training knows when to stop running.

I would not recommend deleting the code in row A; deleting it will not help you view the value of the reward label in a dashboard. You can view that value by adding a graph to your dashboard from the dashboard library. Once the graph is added, you just need to sample the sink and select "Reward" as the label you want to use.

To better train the AI from the model, you can increase the done threshold to be larger than 1000, but the more important value to change is the number of training timesteps in the Python training program. When training an AI I have generally used 500,000 to 1,000,000 timesteps.

Finally, you can modify how the reward function is set on the sink to change how the AI is rewarded for each item that comes through.

I hope that this helps.
