Saturday, October 9, 2010

Dependent Reward Revelation: Part II

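In my previous post I talked about price differentiation and how it motivates dependent reward revelation, which is a fancy way of saying that which rewards are revealed depends upon the reward values. I have to admit I'm somewhat stuck on how to exploit this additional information.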

I'll repeat the setup here:
  1. World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
  2. Player chooses $a \in A$ via $p (a | x, \omega)$.
  3. World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$, where $a \in \mathcal{A}$ is required.
  4. World reveals $\{ r (a) | a \in \mathcal{A} \}$.
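The usual "see what you did" scenario is $q (\mathcal{A} | x, \omega, r, a) = 1_{\mathcal{A} = \{ a \}}$, and for price differentiation it is (writing the set that $q$ selects with probability one) \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\
\{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0.
\end{cases}
\] Since $a \in \mathcal{A}$ is required, I can always throw away the additional information and convert any $q$ into $q = 1_{\mathcal{A} = \{ a \}}$, then use the offset tree. That seems wasteful, but at the moment I don't have another option for the price differentiation problem.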
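The revelation step for price differentiation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: actions are numeric prices, `r` is a dict from price to reward, and the names `revealed_actions` and `revealed_rewards` are hypothetical, not taken from any library. The `discard_extra` flag performs the reduction to the "see what you did" case described above.

```python
def revealed_actions(prices, r, a):
    """Set of prices whose rewards are revealed when price a is offered.

    prices: iterable of candidate prices (the action set A)
    r:      dict mapping each price to its reward (hidden from the player)
    a:      the price the player actually chose
    """
    if r[a] > 0:
        # Positive reward at a: rewards at all prices <= a are revealed.
        return {x for x in prices if x <= a}
    # Zero reward at a: rewards at all prices >= a are revealed.
    return {x for x in prices if x >= a}


def revealed_rewards(prices, r, a, discard_extra=False):
    """Step 4 of the protocol: the rewards the player gets to see.

    With discard_extra=True only r(a) is kept, i.e. q = 1_{A = {a}}.
    """
    shown = {a} if discard_extra else revealed_actions(prices, r, a)
    return {x: r[x] for x in shown}
```

For example, with prices `[1, 2, 3, 4]` and the (assumed) reward vector `{1: 1, 2: 2, 3: 3, 4: 0}`, offering price 2 reveals the rewards at prices 1 and 2, while offering price 4 reveals only the reward at price 4.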

A Filter-Offset Style Update (Fail)


Consider a filter-offset tree style solution. Fix $(x, \omega, r)$, and consider a fixed internal node with inputs $\lambda \not \in \omega$ and $\phi \not \in \omega$. The expected importance weight for $\lambda$ would be \[
\begin{aligned}
w_{\lambda|r} &= E_{a \sim p} \biggl[ E_{\mathcal{A} \sim q | r, a} \biggl[ \alpha_{\lambda, \neg \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \left( r (\lambda) - \frac{1}{2} \right) \\
&\quad\quad\quad + \alpha_{\neg \lambda, \phi} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) \leq \frac{1}{2}} \left( \frac{1}{2} - r (\phi) \right) \\
&\quad\quad\quad + \alpha_{\lambda, \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \left( r (\lambda) - r (\phi) \right) \biggr] \biggr] \biggl/ \\
&E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right].
\end{aligned}
\] The analogy with the filter-offset update suggests the choice \[
\begin{aligned}
\alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\
\alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}, \\
\alpha_{\lambda, \phi} &= \gamma \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]},
\end{aligned}
\] for some $\gamma \in [0, 1]$. Unfortunately, in general these quantities cannot be computed, since $r$ is only partially revealed per instance. For the price differentiation $q$, for example, they can be computed only when $a$ is the largest possible price and $r (a) > 0$, or when $a$ is the smallest possible price and $r (a) = 0$.
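To see concretely why these expectations depend on unrevealed rewards, here is a small Python sketch. It is an illustration under assumptions, not part of the filter-offset tree itself: prices are numeric, `p` is a dict of action probabilities, and the helper names are hypothetical. It evaluates the three normalizer terms $E_{a \sim p} [ E_{\mathcal{A} \sim q | r, a} [ \cdot ] ]$ for the price differentiation $q$ given a full reward vector, something the player never has.

```python
def revealed_actions(prices, r, a):
    """Price differentiation q: the set revealed when price a is offered."""
    if r[a] > 0:
        return {x for x in prices if x <= a}
    return {x for x in prices if x >= a}


def pair_probabilities(prices, p, r, lam, phi):
    """Return (P(lam in A, phi not in A), P(lam not in A, phi in A),
    P(lam in A, phi in A)) under a ~ p and the price differentiation q.

    Requires the FULL reward vector r, which is the whole problem.
    """
    only_lam = only_phi = both = 0.0
    for a in prices:
        shown = revealed_actions(prices, r, a)
        if lam in shown and phi in shown:
            both += p[a]
        elif lam in shown:
            only_lam += p[a]
        elif phi in shown:
            only_phi += p[a]
    return only_lam, only_phi, both


prices = [1, 2, 3, 4]
p = {a: 0.25 for a in prices}  # assumed uniform historical policy

# Two reward vectors that agree on everything revealed after offering a = 2
# (the rewards at prices 1 and 2) but differ at the unrevealed price 3.
r1 = {1: 1, 2: 2, 3: 3, 4: 0}
r2 = {1: 1, 2: 2, 3: 0, 4: 0}

print(pair_probabilities(prices, p, r1, lam=2, phi=3))  # (0.25, 0.0, 0.25)
print(pair_probabilities(prices, p, r2, lam=2, phi=3))  # (0.25, 0.25, 0.0)
```

The player who offered price 2 sees identical feedback under `r1` and `r2`, yet the normalizers differ, so the weighting factors cannot be recovered from the revealed rewards alone.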

My suspicion is that the only way to proceed with this filter-offset style update is if the set of rewards that $q$ depends upon is always revealed. So something like \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ \tilde a \} \cup \{ a^\prime | a^\prime \geq a \} & \mbox{if } r (\tilde a) > 0; \\
\{ a, \tilde a \} & \mbox{if } r (\tilde a) = 0; \\
\{ \tilde a \} \cup \{ a^\prime | a^\prime \leq a \} & \mbox{if } r (\tilde a) < 0,
\end{cases}
\] would work, since $q$ only depends upon $r (\tilde a)$, which is always revealed, so the above expectations can always be computed. With such a cooperative $q$, the rest of the filter-offset tree crank can be turned, and the weighting factors would be \[
\begin{aligned}
\alpha_{\lambda, \neg \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\
\alpha_{\neg \lambda, \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}, \\
\alpha_{\lambda, \phi} &= 1,
\end{aligned}
\] which is neat, but still leaves me wondering how to exploit the additional information available in the price differentiation problem.
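As a rough check that the cooperative $q$ really makes the weighting factors computable, here is a minimal Python sketch. The function names are hypothetical, prices are assumed numeric, `p` is a dict of exact action probabilities, and returning `None` for a zero denominator is a choice made here purely for illustration. Note that the only reward the code touches is $r (\tilde a)$.

```python
from fractions import Fraction


def cooperative_revealed(prices, a, a_tilde, r_tilde):
    """Cooperative q: the revealed set depends only on r_tilde = r(a_tilde)."""
    if r_tilde > 0:
        return {a_tilde} | {x for x in prices if x >= a}
    if r_tilde == 0:
        return {a, a_tilde}
    return {a_tilde} | {x for x in prices if x <= a}


def weighting_factors(prices, p, a_tilde, r_tilde, lam, phi):
    """Return (alpha_{lam, not phi}, alpha_{not lam, phi}, alpha_{lam, phi}).

    A factor is None when its denominator is zero (illustrative choice).
    """
    only_lam = only_phi = Fraction(0)
    for a in prices:
        shown = cooperative_revealed(prices, a, a_tilde, r_tilde)
        if lam in shown and phi not in shown:
            only_lam += p[a]
        elif phi in shown and lam not in shown:
            only_phi += p[a]
    exclusive = only_lam + only_phi  # shared numerator of both ratios
    alpha_lam = exclusive / only_lam if only_lam else None
    alpha_phi = exclusive / only_phi if only_phi else None
    return alpha_lam, alpha_phi, 1


prices = [1, 2, 3, 4]
p = {1: Fraction(1, 10), 2: Fraction(1, 5), 3: Fraction(3, 10), 4: Fraction(2, 5)}
print(weighting_factors(prices, p, a_tilde=1, r_tilde=0, lam=2, phi=3))
# (Fraction(5, 2), Fraction(5, 3), 1)
```

In this example $\lambda = 2$ is revealed alone with probability $1/5$ and $\phi = 3$ alone with probability $3/10$, so the two ratios come out to $5/2$ and $5/3$.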
