Hello!
I am working on a project that uses Reinforcement Learning with FlexSim to do the following:
Four sources feed four distinct products ("Product 1", "Product 2", "Product 3", and "Product 4") into the production line. The products are collected in Queue 1 and released to Processor 1, which has a varying setup time when switching between product types and a different processing time for each of the four types (the same setup as the FlexSim Reinforcement Learning tutorial). After Processor 1 finishes a product, it proceeds to Queue 2.
From Queue 2, products are routed to four "specialized" processors: Processor 2, Processor 3, Processor 4, and Processor 5. Each processor is specialized in one product type, which it processes faster than the other types: for example, Processor 2 processes "Product 1" faster than "Product 2", while Processor 3 processes "Product 2" faster than "Product 1". After processing, the product enters Sink 1 and the process is complete.
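To make the specialization concrete, the timing structure I have in mind looks roughly like this (illustrative numbers only, not my actual model parameters):

```python
# Illustrative processing times (model time units); each processor is
# fastest on exactly one product type. My actual model values differ.
PROCESS_TIMES = {
    #               P1  P2  P3  P4   (product types 1-4)
    "Processor 2": [10, 20, 20, 20],  # specialized in Product 1
    "Processor 3": [20, 10, 20, 20],  # specialized in Product 2
    "Processor 4": [20, 20, 10, 20],  # specialized in Product 3
    "Processor 5": [20, 20, 20, 10],  # specialized in Product 4
}
```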
To optimize the system's efficiency through product scheduling and routing, reinforcement learning (RL) will be implemented at two key points:
1. Processor 1 – The agent decides which product type to pull from Queue 1 into Processor 1, aiming for the processing sequence with the shortest total elapsed time (setup time plus processing time).
2. Queue 2 – The agent optimizes the routing of products to Processor 2, 3, 4, or 5. The goal is to send each product to its "specialized" processor, or to the next best available processor if the specialized one is currently busy (see the sketch after this list).
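Here is how I currently picture the two decision points as RL interfaces (a sketch using gymnasium; the names and shapes are my own assumptions, not taken from the tutorial code):

```python
from gymnasium import spaces

# Agent 1 (Processor 1): observe the product type Processor 1 is
# currently set up for, then choose which of the 4 types to pull
# from Queue 1 next.
agent1_observation_space = spaces.Discrete(5)  # 0 = no setup yet, 1-4 = last product type
agent1_action_space = spaces.Discrete(4)       # product type to pull next

# Agent 2 (Queue 2): observe the arriving product's type plus the
# busy/idle state of Processors 2-5, then choose a destination.
agent2_observation_space = spaces.MultiDiscrete([4, 2, 2, 2, 2])
agent2_action_space = spaces.Discrete(4)       # 0-3 -> Processor 2-5
```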
From my understanding, the scripts provided in the FlexSim Reinforcement Learning tutorial (flexsim_env.py and flexsim_training.py) only support training a single RL agent. I therefore built two otherwise identical models: one with the RL agent implemented at Processor 1 only, and one with it at Queue 2 only. The scripts can train the Processor 1 agent but fail to train the Queue 2 agent, so I would like to check whether I have done something wrong.
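To narrow down where the Queue 2 model fails, this is the kind of sanity check I plan to run: open the environment, print the spaces FlexSim reports, and take one random step. The constructor arguments follow my copy of the tutorial's flexsim_env.py (adjust if yours differs), and both paths are placeholders for my machine:

```python
from flexsim_env import FlexSimEnv

env = FlexSimEnv(
    flexsimPath="C:/Program Files/FlexSim 2024/program/flexsim.exe",  # placeholder path
    modelPath="C:/models/Queue2RoutingModel.fsm",                     # placeholder path
    verbose=True,
)
# Print the spaces the model's Reinforcement Learning tool exposes, so I
# can confirm they match what the Queue 2 decision event is sending.
print("observation space:", env.observation_space)
print("action space:", env.action_space)
# Avoid unpacking the return values, since older gym and newer gymnasium
# versions of the tutorial return different tuple shapes.
print("reset ->", env.reset())
print("step  ->", env.step(env.action_space.sample()))
env.close()
```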
Additionally, once both single-agent models are validated and working, I would like to combine them into one model with both agents. Is this possible?
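If combining them is possible, my current idea is to fold both decisions into a single policy, since stable-baselines3 (which flexsim_training.py uses, as far as I can tell) trains one policy per environment: tag each observation with which decision point is asking, and reuse one action space for both. A sketch of what I mean (all names here are hypothetical, not FlexSim's):

```python
from gymnasium import spaces

# One agent serving both decision points: the decision_point flag tells
# the policy which question is being asked, and a single Discrete(4)
# action doubles as "product type to pull" (decision 0) or "processor
# index to route to" (decision 1).
combined_observation_space = spaces.Dict({
    "decision_point": spaces.Discrete(2),      # 0 = Processor 1 pull, 1 = Queue 2 routing
    "last_setup_type": spaces.Discrete(5),     # used when decision_point == 0
    "item_type": spaces.Discrete(4),           # used when decision_point == 1
    "processors_busy": spaces.MultiBinary(4),  # Processors 2-5, used when decision_point == 1
})
combined_action_space = spaces.Discrete(4)
```

As far as I know, a Dict observation space would also mean switching the stable-baselines3 policy from "MlpPolicy" to "MultiInputPolicy". Would this approach work with the tutorial scripts, or is there a supported way to train two agents in one FlexSim model?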