question

Scarlett X asked Phil BoBo commented

Reward of reinforcement learning

Hi everyone, what is the intention of the following three lines of code?

a. Model.find("Sink1").Reward = 0;

b. int done = (Model.time > 1000);

c. return [reward, done];

Can I delete the code in line a so that I can show the reward label on a dashboard?

If I want to increase the accuracy of the trained model, do I need to change the number in "model.learn(total_timesteps=1000)"?

If that's still not enough, can I use punishment or increase the reward to achieve this goal?

Thank you!

FlexSim 22.0.0
reinforcement learning

Phil BoBo answered Phil BoBo commented

a. This line makes it so that the reward returned is only the amount of reward accumulated on the sink since the last action. If you just return the accumulated reward on the label without setting it back to zero, then each action will return more and more reward, and the algorithm will learn that actions toward the end of the simulation are better than actions toward the beginning. That's not what you are trying to teach it. You want to return a reward that it can correlate with the action it took based on the state of the observations.

You can return whatever reward you want and store it however you want so that you can see it on a dashboard if you want. The important thing is that line c needs to return the reward for taking the previous action. (Various reinforcement learning algorithms work with delayed rewards, but in general, they learn faster if the reward for a given action in a given state is as close in time as possible to the decision that caused that reward.)
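
If what you want is a number to chart on a dashboard, one option (just a sketch; "TotalReward" is a hypothetical label you would add to the sink yourself, not part of the example model) is to accumulate a separate running total in the reward function before line a resets the label:

Model.find("Sink1").TotalReward += Model.find("Sink1").Reward; // running total kept only for dashboard display
Model.find("Sink1").Reward = 0; // line a: still reset the per-action reward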

b. This line makes it so the training episode ends after the model has passed 1000 time units. It is entirely arbitrary. You can stop a training episode whenever you want. It's up to you how you want to teach the algorithm. Instead of a specific time, you might stop the episode based on the state of the system: for example, when you reach the end of a maze or finish processing all the orders in a factory.
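
That last idea could look something like this in the reward function (just a sketch; "OrdersQueue" is a hypothetical object name, not something in the example model, and reward is computed the same way as before):

int done = (Model.find("OrdersQueue").subnodes.length == 0); // end the episode once every order has been processed
return [reward, done];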

c. This is the critical line of this function that returns a reward and whether the episode is done.
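
Putting a, b, and c together, the whole function is essentially the following sketch (the first line, which reads the label before it is reset, is implied by the explanation of line a above):

double reward = Model.find("Sink1").Reward; // read what was earned since the last action
Model.find("Sink1").Reward = 0; // line a: reset the accumulator
int done = (Model.time > 1000); // line b: stop the episode after 1000 time units
return [reward, done]; // line c: hand both values back to the Reinforcement Learning tool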

Many different things can affect the "accuracy" of the training model. If you want to train for more steps, then yes, you can change the total_timesteps that the model learns for in the Python script. In general, training longer will allow the algorithm to learn better, but you often see diminishing returns once it has been sufficiently trained.

Yes, you can return negative rewards to punish bad actions.
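
For instance (just a sketch; the trigger and the penalty value are chosen purely for illustration), whatever trigger detects the undesirable outcome could subtract from the same label that line a resets:

Model.find("Sink1").Reward -= 5; // penalty; the next reward returned to the algorithm will be lower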

That model is just an example showing how all the aspects of reinforcement learning work together. It isn't a precise solution to all machine learning problems. Training a reinforcement learning algorithm is like training a dog: you need to think about what you are trying to teach it by rewarding it based on the observed state of the system, the chosen action, and the result of that action. For example, if a dog hears you say sit (observation), then sits (action), and then receives a treat (reward), the dog will learn to sit when you tell it to sit. It will get better with consistent repetition of this process, and it will get confused if you are inconsistent in your rewards: for example, if you tell it to sit and it accidentally sits on a spiky object, it gets a negative reward for the same observation/action combination. In that case, you may need to add more state to your observation space (you said sit + is there something spiky under me) so that the dog can know whether something is under it and learn to look before sitting (actions) to avoid sitting on something bad (negative reward).

Reinforcement learning algorithms aren't going to magically learn how to make optimal choices; you have to teach the algorithm by designing your training episodes in a way that it will learn what you want it to learn.


Jacob W2 answered

Hi @Scarlett X,

A. This line sets the value of the reward label on the sink in the model back to zero.

B. This tells the reinforcement training to end the episode after 1000 time units have passed in the model.

C. This returns the values of reward and done every time an observation is made during the model run. This is how the reinforcement training knows when to stop running.

I would not recommend deleting the code in row A. It will not help you view the value of the reward label in a dashboard. You can view that value by adding a graph to your dashboard from the dashboard library. Once the graph is added, you just need to sample the sink and select "Reward" as the label you want to use.

To better train the AI from the model, you can increase the time threshold in the done condition beyond 1000, but the more important value to change is the number of training timesteps in the Python training script. When training an AI, I have generally used 500,000 to 1,000,000 timesteps.

Finally, you can modify how the reward is set on the sink to change how the AI is rewarded for each item that comes through.
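
For example (just a sketch, assuming the sink's On Entry trigger is where the Reward label gets incremented, and with "Priority" being a hypothetical label on the items), you could weight the reward by a property of each item:

Object current = ownerobject(c); // the sink
Object item = param(1); // the item that just entered
current.Reward += item.Priority; // reward more valuable items more heavily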

I hope that this helps.
