Authors:
Masayuki Horiguchi,
Masami Kurano,
Masami Yasuda,
Volume: 1, Page 706 Paper number 1901
Abstract:
The optimization problem for a stopped Markov decision process is considered
to be taken over stopping times (tau) constrained so that E (tau)
<=q (alpha) for some fixed (alpha)>0. We introduce the concept of a
randomized stationary stopping time which is a mixed extension of the
entry time of a stopping region and prove the existence of an optimal
constrained pair of stationary policy and stopping time by utilizing
a Lagrange multiplier approach. Also, applying the idea of the one-step
look ahead (OLA) policy the optimal constrained pair is sought concretely.
As an example, constrained Markov deteriorating system is explained.
Authors:
Jerzy A. Filar,
Jacek Gondzio,
Alain B. Haurie,
Francesco Moresino,
Jean-Philippe Vial,
Volume: 1, Page 711 Paper number 1902
Abstract:
This paper deals with a class of ergodic control problems for systems
described by Markov chains with strong and weak interactions. These
systems are composed of a set of m subchains that are weakly coupled.
Using results recently established by Abbad et al. one formulates a
limit control problem the solution of which can be obtained via an
associated nondifferentiable convex programming (NDCP) problem. The
technique used to solve the NDCP problem is the Analytic Center Cutting
Plane Method (ACCPM) which implements a dialogue between, on one hand,
a master program computing the analytical center of a localization
set containing the solution and, on the other hand, an oracle proposing
cutting planes that reduce the size of the localization set at each
main iteration. The interesting aspect of this implementation comes
from two characteristics: (i) the oracle proposes cutting planes by
solving reduced sized Markov Decision Problems (MDP) via a linear program
(LP) or a policy iteration method; (ii) several cutting planes can
be proposed simultaneously through a parallel implementation on m processors.
The paper concentrates on these two aspects and shows, on a large scale
MDP obtained from the numerical approximation a la Kushner-Dupuis of
a singularly perturbed hybrid stochastic control problem, the important
computational speed-up obtained.
Authors:
Alexander S. Poznyak,
Kaddour Najim,
Volume: 1, Page 717 Paper number 1903
Abstract:
A two finite Markov chains repeated zero-sum stochastic game with unknown
transition matrices and payoffs is considered. The control objective
is to obtain the equilibrium point based only on current measurements.
The behavior of each players is modelled by a finite controlled Markov
chain. A novel adaptive policy is developed based on Lagrange multipliers
involved into ''learning through reinforcement'' procedure. A regularized
Lagrange function and a new normalization procedure are introduced.
The saddle-point of this function is shown to be unique. The convergence
properties are proved and the order of almost sure convergence is estimated.
Authors:
Eugene A. Feinberg,
Aleksey B. Piunovskiy,
Volume: 1, Page 723 Paper number 1904
Abstract:
We consider a Markov decision process with an uncountable state space
for which the vector performance functional has the form of expected
total rewards. Under the single condition that initial distribution
and transition probabilities are nonatomic, we prove that the performance
space coincides with that generated by nonrandomized Markov policies.
Authors:
Onésimo Hernández-Lerma,
Rosario Romera,
Volume: 1, Page 729 Paper number 1905
Abstract:
This paper presents two main results on partially observable (PO) stochastic
systems. In the first one, we consider a general PO system x_t+1=F(x_t,
a_t, (xi)_t), y_t=G(x_t, (eta)_t) (t=0,1,...) on Borel spaces,
with possibly unbounded cost-per- stage functions, and give conditions
for the existence of (alpha)-discount optimal control policies (0<(alpha)
<1). In the second result we specialize () to additive-noise systems
x_t+1=F_n(x_t, a_t)+ (xi)_t, y_t=G_n(x_t)+ (eta)_t (t=0,1,...)
in Euclidean spaces, with F_n(x, a) and G_n(x) converging pointwise
to functions F_(infinity)(x, a) and G_(infinity)(x), respectively,
and give conditions for the limiting PO model x_t+1=F_(infinity)(x_t,
a_t)+ (xi)_t, y_t=G_(infinity)(x_t)+ (eta)_t to have an (alpha)-discount
optimal policy.
Authors:
Rachid El Azouzy,
Eitan Altman,
Vladimir Gaitsgory,
Volume: 1, Page 730 Paper number 1906
Abstract:
We consider a class of singularly perturbed zero-sum differential games
with piecewise deterministic dynamics, where the changes from one structure
(for the dynamics) to another are governed by a finite-state Markov
process. Player 1 controls the continuous dynamics, whereas Player
2 controls the rate of transition for the finite-state Markov process;
both have access to the states of both processes. Player 1 wishes to
minimize a given quantity. For player 2, we consider two possible scenarios:
one in which it wishes to minimize the same quantity (team framework),
and one in which it wishes to maximize it (zero sum game). The transition
rates of the Markov process are fast, of the order of 1/(epsilon).
To solve the above problem, we use the dynamic programming approach.
In particular, we study the asymptotic properties of the underlying
system for sufficiently small epsilon. The viscosity solution method
is employed to verify the convergence of the value function, which
allows us to obtain the convergence in a general setting and helps
us to characterize the structure of the limit system. We apply this
to the special case of linear quadratic games with jump parameters,
which allows us to obtain an explicit solution for the limiting problem.
Authors:
Sandjai Bhulai,
Ger Koole,
Volume: 1, Page 736 Paper number 1907
Abstract:
In this paper we investigate the multi-armed bandit problem, where
each arm generates an infinite sequence of Bernoulli distributed rewards.
The parameters of these Bernoulli distributions are unknown and initially
assumed to be Beta-distributed. Every time a bandit is selected its
Beta-distribution is updated to new information in a Bayesian way.
The objective is to maximize the long term discounted rewards. We study
the relationship between the necessity of acquiring additional information
and the reward. This is done by considering two extreme situations
which occur when a bandit has been played N times; the situation where
the decision maker stops learning and the situation where the decision
maker acquires full information about that bandit. We show that the
difference in reward between this lower and upper bound goes to zero
as N grows large.
|