SARSA and Q-learning in a Grid World: Java Implementation and Parameter Tuning (Reinforcement Learning)
Summary: This is Assignment 4 for CS 457/557. Students implement two basic reinforcement learning algorithms, SARSA and Q-Learning, in a simple but complete navigation task. Command-line arguments control key variables such as the initial learning rate (step size) α, the policy randomness value ε, and the discount rate γ; students implement and evaluate different strategies and ultimately produce a navigation policy table. The handout also covers environment details (start cell, blocks, cliffs, mines), how the parameters change over time, the output format, and optional extensions (a λ parameter for eligibility traces and a feature-based method for estimating state-action values). To preserve academic integrity, the assignment must be completed independently and any form of cheating is prohibited.
Intended audience: senior undergraduates or first-year graduate students and above, particularly computer science majors interested in machine learning and reinforcement learning.
Use case and goals: the assignment provides hands-on practice so that students gain a solid understanding of the two classic reinforcement learning algorithms — their core ideas, mechanics, use cases, and trade-offs. By running experiments under specific conditions, students observe convergence behavior and policy performance and consolidate the relevant theory and implementation skills. It also serves as a check on students' familiarity with the field.
Reading suggestions: review material on SARSA and Q-Learning first, then implement the program step by step. For each component, think carefully about what each parameter means and over what range it should be adjusted, and record and compare the results of every test. Since several action-selection rules are possible, trying different exploration strategies may lead to a better solution; adding extra features or combining several improvements may also yield unexpectedly good results.
CS 457/557: Assignment 4
Due: December 11, 2024, by 11:00 pm (Central)
Overview
In this assignment, you will be implementing the SARSA and Q-learning algorithms for a navigation
task in a simple grid world environment.
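The two algorithms differ only in how they bootstrap from the next state: SARSA uses the value of the action actually chosen by the ε-greedy behavior policy, while Q-learning uses the greedy maximum. For reference, a minimal sketch of the standard tabular updates is shown below; the table layout and variable names are assumptions for illustration, not the structure required by the handout.

```java
// Illustrative sketch of the standard tabular TD updates; table layout and
// variable names are assumptions, not code required by the handout.
public final class TdUpdates {

    // SARSA (on-policy): bootstraps with the value of the action aNext that the
    // epsilon-greedy behavior policy actually selects in state sNext.
    static void sarsaUpdate(double[][] qTable, int s, int a, double reward,
                            int sNext, int aNext, double alpha, double gamma) {
        double target = reward + gamma * qTable[sNext][aNext];
        qTable[s][a] += alpha * (target - qTable[s][a]);
    }

    // Q-learning (off-policy): bootstraps with the greedy (maximum) action value
    // in state sNext, regardless of which action is taken next.
    static void qLearningUpdate(double[][] qTable, int s, int a, double reward,
                                int sNext, double alpha, double gamma) {
        double best = Double.NEGATIVE_INFINITY;
        for (double value : qTable[sNext]) {
            best = Math.max(best, value);
        }
        qTable[s][a] += alpha * (reward + gamma * best - qTable[s][a]);
    }
    // For terminal next states (goal or mine cells), the bootstrapped term should
    // be treated as zero; that case is assumed to be handled by the caller here.
}
```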
Academic Integrity Policy
All work for this assignment must be your own work and it must be completed independently.
Using code from other students and/or online sources (e.g., Github) constitutes academic misconduct, as does sharing your code with others either directly or indirectly (e.g., by posting it online).
Academic misconduct is a violation of the UWL Student Honor Code and is unacceptable. Plagiarism or cheating in any form may result in a zero on this assignment, a negative score on
this assignment, failure of the course, and/or additional sanctions. Refer to the course syllabus for
additional details on academic misconduct.
You should be able to complete the assignment using only the course notes and textbook along with
relevant programming language documentation (e.g., the Java API specification). Use of additional
resources is discouraged but not prohibited, provided that this is limited to high-level queries and
not assignment-specific concepts. As a concrete example, searching for “how to use a HashMap in
Java” is fine, but searching for “Q-learning in Java” is not.
Deliverables
You should submit a single compressed archive (either .zip or .tgz format) containing the following
to Canvas:
1. The complete source code for your program. I prefer that you use Java in your implementation, but if you would like to use another language, check with me before you get started.
Additional source code requirements are listed below:
• Your name must be included in a header comment at the top of each source
code file.
• Your code should follow proper software engineering principles for the chosen language,
including meaningful comments and appropriate code style.
• Your code must not make use of any non-standard or third-party libraries.
2. A README text file that provides instructions for how your program can be compiled (as
needed) and run from the command line. If your program is incomplete, then your
README should document what parts of the program are and are not working.
Program Requirements
The general outline of the program is given below:
Algorithm 1 Program Flow
Process command-line arguments
Load environment from the specified file
Perform learning episodes, with periodic parameter updates and/or evaluation
Perform final evaluation of pure greedy policy based on learned Q values
Print final learned policy and additional details as appropriate
The program should be runnable from the command line, and it should be able to process command-
line arguments to update various program parameters as needed. The command-line arguments
are listed below:
• -f <FILENAME>: Reads the environment from the file named <FILENAME> (specified as a
String); see the File Format section for more details.
• -a <DOUBLE>: Specifies the (initial) learning rate (step size) α ∈ [0, 1]; default is 0.9.
• -e <DOUBLE>: Specifies the (initial) policy randomness value ϵ ∈ [0, 1]; default is 0.9.
• -g <DOUBLE>: Specifies the discount rate γ ∈ [0, 1] to use during learning; default is 0.9.
• -na <INTEGER>: Specifies the value Nα which controls the decay of the learning rate (step size) α; default is 1000.
• -ne <INTEGER>: Specifies the value Nϵ which controls the decay of the random action threshold ϵ; default is 200.
• -p <DOUBLE>: Specifies the action success probability p ∈ [0, 1]; default is 0.8.
• -q: Toggles the use of Q-Learning with off-policy updates (instead of SARSA with on-policy
updates, which is the default).
• -T <INTEGER>: Specifies the number of learning episodes (trials) to perform; default is 10000.
• -u: Toggles the use of Unicode characters in printing; disabled by default (see the Output
section for details).
• -v <INTEGER>: Specifies a verbosity level, indicating how much output the program should
produce; default is 1 (see the Output section for details).
The -f <FILENAME> option is required; all others are optional. You can assume that your program
will only be run with valid arguments (so you do not need to include error checking, though it may
be helpful for your own testing). Your program must be able to handle command-line arguments
in any order (e.g., do not assume that the first argument will be -f). Several example runs of the
program are shown at the end of this document.
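Because the flags can appear in any order and third-party libraries are not allowed, a plain loop over args that dispatches on each flag is sufficient. The sketch below is one possible approach; the Config class and its field names are assumptions for illustration, not part of the handout.

```java
// One possible way to process the flags in any order using only the standard
// library; the class name and all field names are illustrative assumptions.
public class Config {
    String filename;            // -f (required)
    double alpha = 0.9;         // -a: initial learning rate (step size)
    double epsilon = 0.9;       // -e: initial policy randomness
    double gamma = 0.9;         // -g: discount rate
    int nAlpha = 1000;          // -na: controls decay of alpha
    int nEpsilon = 200;         // -ne: controls decay of epsilon
    double successProb = 0.8;   // -p: action success probability
    boolean qLearning = false;  // -q: Q-learning instead of SARSA
    int trials = 10000;         // -T: number of learning episodes
    boolean unicode = false;    // -u: Unicode output
    int verbosity = 1;          // -v: verbosity level
    double lambda = 0.0;        // -l: eligibility trace decay (extra credit)
    boolean useFeatures = false;// -w: feature-based Q estimates (extra credit)

    static Config fromArgs(String[] args) {
        Config c = new Config();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-f":  c.filename = args[++i]; break;
                case "-a":  c.alpha = Double.parseDouble(args[++i]); break;
                case "-e":  c.epsilon = Double.parseDouble(args[++i]); break;
                case "-g":  c.gamma = Double.parseDouble(args[++i]); break;
                case "-na": c.nAlpha = Integer.parseInt(args[++i]); break;
                case "-ne": c.nEpsilon = Integer.parseInt(args[++i]); break;
                case "-p":  c.successProb = Double.parseDouble(args[++i]); break;
                case "-q":  c.qLearning = true; break;
                case "-T":  c.trials = Integer.parseInt(args[++i]); break;
                case "-u":  c.unicode = true; break;
                case "-v":  c.verbosity = Integer.parseInt(args[++i]); break;
                case "-l":  c.lambda = Double.parseDouble(args[++i]); break;
                case "-w":  c.useFeatures = true; break;
                // No error checking: the assignment guarantees valid arguments.
            }
        }
        return c;
    }
}
```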
Extra credit: Any program that includes support for one or both of the following options will be
eligible for a small amount of extra credit (see page 10 for details):
• -l <DOUBLE>: Specifies the λ parameter for eligibility trace decay; default is 0.0 (meaning
that eligibility traces should not be used by default).
• -w: Specifies that the agent should use a weighted sum of features to estimate Q values for
each state-action pair instead of maintaining these values in a table; disabled by default.
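For the -l option, one common formulation (assumed here, since the handout's exact requirements are on later pages) is SARSA(λ) with accumulating traces: each state-action pair keeps an eligibility value that is incremented when visited and decayed by γλ every step, and the TD error is applied to all pairs in proportion to their traces. A minimal sketch of one such update step:

```java
// Hedged sketch of an accumulating-trace SARSA(lambda) step; array shapes and
// names are assumptions for illustration only, not the required design.
public final class TraceUpdate {
    static void sarsaLambdaStep(double[][] q, double[][] trace,
                                int s, int a, double reward, int sNext, int aNext,
                                boolean terminal, double alpha, double gamma, double lambda) {
        double bootstrap = terminal ? 0.0 : gamma * q[sNext][aNext];
        double delta = reward + bootstrap - q[s][a];      // TD error for this step
        trace[s][a] += 1.0;                               // accumulate trace for the visited pair
        for (int i = 0; i < q.length; i++) {
            for (int j = 0; j < q[i].length; j++) {
                q[i][j] += alpha * delta * trace[i][j];   // credit all recently visited pairs
                trace[i][j] *= gamma * lambda;            // decay traces toward zero
            }
        }
        // Traces are typically reset to zero at the start of each new episode.
    }
}
```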
Environment Details
The environment in which the agent operates consists of a rectangular grid of cells, with each cell
belonging to one of the following types:
• S indicates the start cell for an agent
• G indicates a goal cell
• indicates an empty cell
• B indicates a block cell
• M indicates an explosive mine cell
• C indicates a cliff cell
In an environment, the agent begins in the start cell (S), and can choose to move in one of four
directions (up, down, left, and right). Goal (G) and mine (M) cells are terminal states, which means
no further action is possible upon reaching those states. Cliff (C) cells are “restart” cells: any action
taken in a cliff cell causes the agent to return to the start cell for the next step. Block (B) cells are
obstacles that the agent must circumvent (i.e., the agent cannot enter a block cell).
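A natural Java representation of the cell types is an enum keyed by the characters used in the environment file. The sketch below is an assumption for illustration; in particular, the character chosen for empty cells ('_') is a placeholder, since the actual characters are given in the File Format section of the handout.

```java
// Illustrative representation of the six cell types; the character mapping is
// an assumption pending the File Format section of the handout.
public enum Cell {
    START('S'), GOAL('G'), EMPTY('_'), BLOCK('B'), MINE('M'), CLIFF('C');

    final char symbol;

    Cell(char symbol) { this.symbol = symbol; }

    static Cell fromChar(char ch) {
        for (Cell c : values()) {
            if (c.symbol == ch) return c;
        }
        return EMPTY; // assumption: any unrecognized character denotes an empty cell
    }

    boolean isTerminal() { return this == GOAL || this == MINE; } // no further actions possible
    boolean isBlocked()  { return this == BLOCK; }                // cannot be entered
}
```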
In all other types of cells, any attempt at movement either succeeds as intended with some probability p ∈ [0, 1] or results in movement plus drift with probability 1 − p. The drift is perpendicular
to the intended direction of movement, and each drift direction is equally likely. For example, in
Figure 1 below, the agent is in the cell marked X and is attempting to move up to the cell marked
T. The following outcomes are possible:
• the agent moves up to enter cell T as intended with probability p;
• the agent moves up and drifts left to enter cell L with probability (1 − p)/2;
• the agent moves up and drifts right to enter cell R with probability (1 − p)/2.
Any movement (including movement plus drift) that would cause the agent to leave the bounds of
the grid or enter a block cell (B) will fail and result in the agent staying in its current location. For
example, in Figure 2 below, the agent is at the right edge of the environment and is attempting to
move up to the cell marked T. The following outcomes are possible:
• the agent moves up to enter cell T as intended with probability p;
• the agent moves up and drifts left to enter cell L with probability (1 − p)/2;
• the agent remains in place with probability (1 − p)/2 because moving up and drifting right would take the agent out of bounds.
Figure 3 below shows how block cells influence outcomes. The agent is again in the cell marked
X and attempting to move up. The only movement that can occur is moving up and drifting left
(with probability (1−p)/2); moving up without drift and moving up with rightward drift both lead
to block cells which the agent cannot enter, so these outcomes result in no movement. (Note that
the possibility of drift allows the agent to move on the diagonal in an indirect way; this can lead
to some interesting “wall-hack” behavior that allows the agent to squeeze through gaps between
two block cells that are adjacent to the agent’s current location.)
[Figure 1: the agent in cell X attempts to move up into cell T; cells L and R lie above-left and above-right, reachable by drift.]
[Figure 2: the agent in cell X at the right edge of the grid attempts to move up into cell T; cell L lies above-left, and rightward drift would leave the grid.]
[Figure 3: the agent in cell X attempts to move up; block cells (B) occupy the cell above and the cell above-right, so only leftward drift can move the agent.]
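Putting the movement rules together: sample whether the move succeeds as intended or drifts to one of the two perpendicular sides, compute the resulting (possibly diagonal) destination, and keep the agent in place if that destination is out of bounds or a block cell. The sketch below illustrates one way to do this; the grid representation, direction encoding, and method signature are assumptions, and cliff/terminal handling is assumed to live elsewhere (e.g., in the episode loop).

```java
import java.util.Random;

// Hedged sketch of the stochastic transition rule described above; the grid
// representation and direction encoding are illustrative assumptions.
public final class Transition {
    // Directions encoded as 0=up, 1=right, 2=down, 3=left (row 0 at the top).
    static final int[] DR = {-1, 0, 1, 0};
    static final int[] DC = {0, 1, 0, -1};

    /** Returns {row, col} of the agent's next cell given the intended action. */
    static int[] step(char[][] grid, int row, int col, int action,
                      double successProb, Random rng) {
        int targetRow = row + DR[action];
        int targetCol = col + DC[action];
        double roll = rng.nextDouble();
        if (roll >= successProb) {
            // Movement plus drift: add a perpendicular offset, each side equally
            // likely, so the agent ends up diagonal to its current cell.
            int side = (roll < successProb + (1 - successProb) / 2.0) ? 1 : 3;
            int driftDir = (action + side) % 4;
            targetRow += DR[driftDir];
            targetCol += DC[driftDir];
        }
        boolean outOfBounds = targetRow < 0 || targetRow >= grid.length
                || targetCol < 0 || targetCol >= grid[targetRow].length;
        if (outOfBounds || grid[targetRow][targetCol] == 'B') {
            return new int[] {row, col};   // failed move: the agent stays in place
        }
        return new int[] {targetRow, targetCol};
    }
}
```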
(The remaining 11 pages of the assignment are not included in this preview.)