Question

mark zhen asked · Jeanette F commented

Understand the definition of reinforcement learning

0325.fsm @Felix Möhlmann @Jason Lightfoot @Kavika F

Hi, thank you very much for your assistance, but I have some questions about the heuristic algorithm. I want to know how you designed this code, why there are three categories (0, 1, 2), and what effect the values 0, 1 and 2 have on the overall model.

I also need to clarify the definition and logic of each state and action.

I currently have three questions:

1. The states and actions in the reinforcement learning example provided with FlexSim

My understanding concerns this line:

Model.parameters["LastItemType"].value = getvarnum(Model.find("Processor1"), "f_lastlabelval");

I read this syntax as: the value of "f_lastlabelval" observed on the machine becomes our state.

But I would like to ask whether there is a better way to understand this instruction.

And regarding the syntax of the action:

item.Type == Model.parameters["ItemType1"].value

What is the logic of this action?

2. The job-shop production model I extended myself

My code is shown in the attached figure (1681191342327.jpg). I want to understand its overall logic and description more clearly.

Regarding the action, I also want to know how the action is generated.

3. The case with no setup time

Just like the previous question

I believe I will understand the complete problem better once these points are clarified.

FlexSim 22.0.16
reinforcement learning
1681191342327.jpg (28.4 KiB)
0325.fsm (328.5 KiB)

Jeanette F ♦♦ commented ·

Hi @mark zhen, was Felix Möhlmann's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.


1 Answer

Felix Möhlmann answered · Felix Möhlmann commented

There isn't really a general guideline you can follow. Problems are often unique and as such require different approaches or strategies to solve. I am by no means an expert when it comes to Reinforcement Learning. But as I understand it, reducing the size of the state space will speed up the learning. Take the example where the quantities of three different item types are used as the observations. For the algorithm, the quantities [300, 300, 300] and [299, 300, 301] represent two distinct states that will be handled separately, whereas in reality they are so similar that a human operator would very likely treat them as the same situation.

So in order to reduce the possible number of states, I categorize each type into one of only three quantity brackets: none (0), some (1) and many (2). As I mention at the start, this is not the only solution, nor is it likely to be the best (or even a good) one. It was an example to try and show how one can think about RL problems.
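
To make this concrete, the bracketing could be implemented with something like the following sketch. The threshold, object name and parameter name are assumptions for illustration; the actual cutoffs and names in the example model may differ.

int quantity = Model.find("Queue1").subnodes.length; // current count of one item type (object name assumed)
int bracket = 2;                                     // 2 = many
if (quantity == 0) bracket = 0;                      // 0 = none
else if (quantity < 50) bracket = 1;                 // 1 = some (threshold of 50 is an assumption)
Model.parameters["Type1Quantity"].value = bracket;   // parameter name assumed

Collapsing raw counts into three brackets this way means the algorithm only ever sees 3 states per item type instead of hundreds of distinct quantities, which is exactly the state-space reduction described above.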

In your previous posts you mostly cling to the same approach that is demonstrated in the example model with the varying setup times. The problem you are trying to solve differs significantly from it, so I tried to give you an idea of what observations other than the previous item type you could use.

Your questions 1 and 2 seem like very basic questions about the FlexScript language, so I would recommend having a look at the FlexSim coding manual.

1. "=" is an assignment: The value on the right is assigned to the variable on the left.

"==" is a comparison: It returns 1/true if the expressions on both sides evaluate to the same value or 0/false otherwise.

2. If you wrote the code, you should know how it works (?)

It first loops through all objects in the "Processors" group in a for-loop, determining which of them has the highest "LastFinishTime" value. It then updates the respective "LastItemType" parameter for that processor with the current value of the "f_lastlabelval" node, which is normally used by the code that sets the setup time depending on whether the item type changed.
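
In code, that loop might look roughly like the following sketch. It is reconstructed from the description above, not copied from the model; the "Processors" group, "LastFinishTime" label and "f_lastlabelval" node names come from this thread, everything else is assumed.

Group processors = Group("Processors");
Object latest = NULL;                  // processor that finished an item most recently
double bestTime = -1;
for (int i = 1; i <= processors.length; i++) {
    Object proc = processors[i];
    if (proc.LastFinishTime > bestTime) {
        bestTime = proc.LastFinishTime;
        latest = proc;
    }
}
// record the type of the last item processed on that machine as the observation
Model.parameters["LastItemType"].value = getvarnum(latest, "f_lastlabelval");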

3. I am not sure what you mean by this. As I wrote above, if your model does not have a setup time, then using the previous item type to decide which item to work on next doesn't really make sense in my mind.

Lastly, I often get the feeling that a lot of meaning is lost in translation (in both directions). Have you tried contacting your local distributor to possibly receive more effective support?

https://www.flexsim.com/contact/#asia


mark zhen commented ·

1681210135542.png

For example, this? What kind of actions will these combinations of 0, 1 and 2 prompt?

Or can I interpret it like this: Type1Step1 is 2,

so does that mean the proportion of items in the queue that need to execute Type1Step1 is high?

That is, the model should know what action to take when the proportion is high.

1681210135542.png (13.6 KiB)
mark zhen commented ·

About action spaces

item.Type == Model.parameters["ItemType"].value

So if I were to interpret my action space, this would be its syntax.

But my question is: where does the value of ItemType come from?

I just want to clarify the logic of the observations and actions for these three questions.


For example Model.parameters["LastItemType"].value = getvarnum(Model.find("Processor1"), "f_lastlabelval");

If I were to explain this line, I would say that the state in this problem is the observed value of "f_lastlabelval" on the machine, right?

So is my action then generated randomly?

Felix Möhlmann commented ·

The Reinforcement Learning algorithm returns an array of values to FlexSim with as many entries as there are action parameters. In FlexSim, these values are then assigned to the respective parameters.

The expression "item.Type == Model.parameters["ItemType"].value" is used in the Pull Strategy code to decide which item to pull. The code checks each item in the upstream object. If it can be pulled (Pull Requirement is fulfilled) and the item has a higher "value" based on the above expression than the previous best item, it is marked as the new best item. When all items where checked, the "best" item is pulled. Since the expression above returns either 0 or 1, either the first item of the fitting type is pulled, if there are any. Otherwise the first available item (of any other type) is pulled.

Regarding the question "What kind of actions will these combinations of 0, 1 and 2 prompt?":

It is the job of the RL algorithm to figure that out. If the answer were obvious, you wouldn't need it in the first place but could instead just implement a fixed logic. So the answer would be: an action that leads to as many processors as possible having items available to work on.

For example, processor 4 can only work on items of type 3. The algorithm should learn that if all observation values belonging to type 3 are 0, it should take actions to increase those values, which would be to have other processors pull items of type 3 so they are processed and eventually advance to step 4.

1681212779034.png

1681212779034.png (8.0 KiB)
