## Wednesday, January 6, 2016

### Attention: can we formalize it?

\[
\begin{aligned}
\hat{y} &= \mathrm{sgn\;} \left( w^\top X z \right), \\
z_i &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left( v^\top X_k \right)},
\end{aligned}
\]
i.e., $z \in \Delta^k$ is a softmax used to choose a weight for each column of $X$, and then $w$ linearly predicts the label given the attended input $X z \in \mathbb{R}^d$. If you want harder attention, you could force $z$ to be a vertex of the simplex.
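The two equations above can be sketched in a few lines of numpy. This is a minimal illustration, not the author's implementation; the function names and the max-subtraction trick for numerical stability are my additions.

```python
import numpy as np

def soft_attention_predict(X, w, v):
    """Soft attention: z = softmax(v^T X), yhat = sgn(w^T X z).

    X : (d, k) matrix whose k columns are the "parts" of the input.
    w : (d,) prediction weights.
    v : (d,) attention weights.
    """
    scores = v @ X                    # one attention score per column of X
    scores = scores - scores.max()    # stabilize the softmax numerically
    z = np.exp(scores) / np.exp(scores).sum()  # z lies on the simplex Delta^k
    return np.sign(w @ (X @ z)), z

def hard_attention_predict(X, w, v):
    """Harder attention: force z onto a vertex of the simplex (argmax column)."""
    j = np.argmax(v @ X)
    return np.sign(w @ X[:, j]), j
```

Note that the attention model has only $2d$ parameters ($w$ and $v$), regardless of the number of columns $k$.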

\[
\begin{aligned}
\hat{y} &= \mathrm{sgn\;} \left( u^\top \mathrm{vec\;} (X) \right),
\end{aligned}
\]
i.e., ignore the column structure in $X$, flatten the matrix, and use all the features for estimation.
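For contrast, the flattened baseline is a one-liner; a numpy sketch under the same conventions as above (the function name is mine, and $\mathrm{vec}$ is taken to stack columns, i.e., column-major order):

```python
import numpy as np

def flat_predict(X, u):
    """Baseline without attention: yhat = sgn(u^T vec(X)).

    u has d*k parameters, one per entry of X, versus 2*d for the
    attention model's (w, v).
    """
    return np.sign(u @ X.reshape(-1, order='F'))  # vec(X): stack columns
```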

#### 1 comment:

1. We are aware of an encouraging result for the case of static attention where the "parts" are features and there is no competition among them (i.e., the attention vector z does not have to sum to 1). This is the same as learning a sparse model, and Andrew Ng's analysis of L1 regularization ( http://ai.stanford.edu/~ang/papers/icml04-l1l2.pdf ) shows that it can exponentially reduce sample complexity (from O(number of features) to O(log(number of features))). At the same time, rotationally invariant methods (cf. the paper above) have to use O(number of features). When I read the paper, a long time ago, I did not find the analysis very enlightening, but perhaps the ideas in there are headed in the right direction.