rltest-2.fsm

I want to complete my reinforcement learning model based on this literature. Here is the part that describes my action space. There is also a section on reward functions. The definition of s is as follows:
Hi @mark zhen,
We haven't heard back from you. Were you able to solve your problem? If so, please add and accept an answer to let others know the solution. Or please respond to the previous comment so that we can continue to help you.
If we don't hear back in the next 3 business days, we'll assume you were able to solve your problem and we'll close this case in our tracker. You can always comment back at any time to reopen your question, or you can contact your local FlexSim distributor for phone or email help.
Hi @mark zhen, was Joerg Vogel's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.
If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.
In a case where there is no next operation after "a", you need to define a maximum waiting time to compute a reward for the last items processed in a model run, or you must be absolutely sure that items still being processed when the run ends take no part in the reward system. Count items only if their rewards can be calculated completely.

Edit: If you don't find a suitable approach, you can set the maximum waiting time (Edit II: waiting time + process time = reward) to the model run time. This prevents any situation with an undefined reward and also gives you the lower boundary of your system. I would use this method to pre-store a reward and then update that value whenever the state of the involved processor changes in the model.

Edit II: The description above normalizes your rewards against the longest processing time, so percentage values greater than 100% can occur. If you instead set all pre-stored rewards to a self-defined lower boundary value and update them later, normalized against the run time length, then you can allocate all rewards.
lowest boundary reward = process time(a) / model run time length

To compare different experiments, scale each one by a factor of the run time length of the current experiment divided by the maximum run time length over all experiments.
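To make the pre-store-and-update scheme concrete, here is a minimal sketch in Python (FlexSim's Reinforcement Learning tool is typically driven from an external Python script; the names RewardTracker, start_item, finish_item, and scale_experiment are hypothetical illustrations, not FlexSim API calls):

```python
class RewardTracker:
    """Pre-stores a lower-boundary reward per item and updates it once
    the item's true waiting time is known, as described above."""

    def __init__(self, run_length):
        self.run_length = run_length   # total model run time length
        self.process_times = {}        # item id -> process time
        self.rewards = {}              # item id -> normalized reward

    def start_item(self, item_id, process_time):
        # Pre-store the lower boundary (waiting time assumed 0) so the
        # reward is defined even if the run ends before the item finishes.
        self.process_times[item_id] = process_time
        self.rewards[item_id] = process_time / self.run_length

    def finish_item(self, item_id, waiting_time):
        # Overwrite with the full reward: waiting time + process time,
        # normalized against the run time length so it cannot exceed 100%.
        total = waiting_time + self.process_times[item_id]
        self.rewards[item_id] = total / self.run_length


def scale_experiment(rewards, run_length, max_run_length):
    # Make experiments of different lengths comparable by scaling each
    # one by its run length relative to the longest run of all experiments.
    factor = run_length / max_run_length
    return [r * factor for r in rewards]


# Example: a 10 s process in a 1000 s run pre-stores a reward of 0.01;
# after 40 s of waiting, the reward becomes (40 + 10) / 1000 = 0.05.
tracker = RewardTracker(run_length=1000.0)
tracker.start_item("item_1", process_time=10.0)
tracker.finish_item("item_1", waiting_time=40.0)
print(tracker.rewards["item_1"])  # 0.05
```

In FlexSim itself, start_item and finish_item would correspond to updating the pre-stored value whenever the involved processor's state changes, as suggested above.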