Machine learning appendix

February 1, 2017

to be updated sporadically, suggestions are highly appreciated

Glossary of common ML terminologies

Glossary of common terms and their synonyms (or strongly related).

$\text{Pr}\left(x\right)$ is written in place of $\text{Pr}\left(X=x\right)$ for brevity. Read more.

Term	Common notations and more information (if any)
estimation/prediction/inference of a quantity	the quantity with a “hat” eg. $\hat{y}$ , $\hat{\theta}$ , $\text{Pr}\left(y\left\|x,\hat{\theta}\right.\right)$
ước lượng/dự đoán/suy diễn một đại lượng nào đấy
objective function, error function, loss function, cost function	$J\left(\cdot\right)$ ; or $L\left(\cdot\right)$ ; or $l\left(\cdot\right)$
hàm mục tiêu, hàm lỗi, hàm tổn thất, hàm chi phí
learning, training; related: parameter estimation
học mô hình, huấn luyện mô hình, ước lượng tham số
evaluating, forward pass - as in “computing some quantity given estimates of all unknown quantities”	note: not to be confused with `performance evaluation`
tính toán, chiều xuôi
score function; related: (inverse) link function	$s\left(\cdot\right)$ ; or $g\left(\cdot\right)$ if link function
hàm tính điểm, hàm liên kết
neuron/unit; related: dimension (of a vector)	$x_{d}$ where $d$ is a dimesion of vector $x$
nơ-ron, chiều (của một vec-tơ)
feature; related: independent variable, explanatory variable, predictor	$x_{d}$ where $d$ is a dimesion of vector $x$ . note: in practice, most of the time we imply `feature` as `feature vector`, thus use vector $x$ instead
vec-tơ đặc trưng, biến độc lập, biến giải thích, nhân tố giải thích
mid-/high-level feature, hidden/latent variable; related: hidden neurons/units	$z$ ; or $h$ ; or $\theta$ if assumed a random variable
đặc trưng bậc trung hoặc bậc cao, biến ẩn, các nơ-ron ẩn
target, label; related: dependent variable, response variable	$y$ ; or $t$
mục tiêu, nhãn, biến phụ thuộc, biến (?)
observed data/variable(s),	$\mathcal{D}$ ; or $x$ ; or $x$ and $y$ if supervised learning
dữ liệu/biến quan sát được (đã biết)
unobserved data/variable(s)	$\tilde{\mathcal{D}}$ ; or $\tilde{\mathcal{x}}$ ; or $x'$ ; or $x^{\left(\text{new}\right)}$
dữ liệu/biến không quan sát được (chưa biết)
data point	$x$ ; or (if working with more than 1 data point) $x^{\left(n\right)}$ , or $x_{n}$ if $x$ represents a set of data points
điểm dữ liệu
parameter; related: weight, bias (as the “offset” from the origin)	$\theta$ , $W$ , $b$ respectively
tham số, trọng số, độ lệch	note: `bias` is an umbrella term
	`bias` has more common meanings
parameterized function	$f\left(x,y\,\mathbf{;}\,\theta\right)$
hàm có tham số
parametric distribution	$p\left(x\mathbf{;}\,\theta\right)$ ; or $p_{\theta}\left(x\right)$ where $p\left(\cdot\right)$ is density function
phân bố xác suất có tham số

More notation

For brevity, $\text{Pr}\left(x\right)$ is often seen in ML literature instead of $\text{Pr}\left(X=x\right)$ as in statistical texts. I also use an overhead circle e.g. $\overset{\circ}{x}$ to denote random variable, in place of capital letters e.g. X, N, K which are preserved for either sets of data points, matrices, or total number of samples/classes/features/…

Notation	Description
$\overset{\circ}{\mathbf{y}_{n}}$	A random variable. Thus $\text{Pr}\left(\overset{\circ}{y_{n}}\|\cdot\right)$ implies a probability distribution.
$y_{n}$	A realization of random variable $\overset{\circ}{y_{n}}$ . Thus $\text{Pr}\left(y_{n}\|\cdot\right)$ is a value.
$\Omega_{\overset{\circ}{y_{n}}}$	Set of all possible realization $y_{n}$ ’s of random variable $\overset{\circ}{y_{n}}$ i.e. *sample space* of $\overset{\circ}{y_{n}}$ .
$\hat{y}_{n}$	An estimate of random variable $\overset{\circ}{y_{n}}$
$\underset{y_{n}^{'}}{\text{argmax}}\text{Pr}\left(y_{n}^{'}\|\cdot\right)$	short for $\underset{y_{n}^{'}\in\Omega_{\overset{\circ}{y_{n}}}}{\text{argmax}}\text{Pr}\left(\overset{\circ}{y_{n}}=y_{n}^{'}\|\cdot\right)$ . We use notation $y_{n}^{'}$ to not confuse with $y_{n}$ which is reserved for the realization provided by training data $\mathcal{D}=\left\{ x_{n},y_{n}\right\}_{n=1}^{N}$ .

Common machine learning models acronym

Acronym	Description
PGM	Probabilistic Graphical Models i.e. Probabilistic models
GLM	Generalized Linear Models
GMM, PPCA	(Gaussian) Mixture Models, Probabilistic Principle Component Analysis
HMM, LDS	Hidden Markov Models, Linear Dynamical Systems (for modelling sequential data)
Topic Models	Latent Dirichlet Allocation (LDA - not to be confused with Linear Discriminant Analysis) and variants
DNN	Deep Neural Networks
MLP, CNN	Multi-layer Perceptrons, Convolutional NNs i.e. FNN - Feed-forward NN
RNN	Recurrent NNs, also including Recursive NN and Bi-directional RNN (for modelling sequential data)
EBM	Energy-based Models (undirected PGM)
RBM, DBN, DBM	Restricted Boltzmann Machines, Deep Belief Networks (not to be confused with Dynamic Bayesian Networks), Deep Boltzmann Machines
VAE, DRAW, AIR	Variational Auto-encoder, Deep Recurrent Attentive Writer, Attention-Infer-Repeat
GAN	Generative Adversarial Networks
AAE	Adversarial Auto-encoders
SVM	Support Vector Machines
DP, GP	Dirichlet Processes, Gaussian Processes
Tree, RF	Decision Trees, Random Forests
kNN	k-Nearest Neighbours

Back-propogation algorithm

⊕Conditions: (i) J is differentiable every where w.r.t. theta; (ii) J and all thetas form a directed acyclic computational graph. For what: computes gradient $\nabla_{\theta}J=\dfrac{\partial J}{\partial\theta}$ for all $\theta$ ’s of interest.

Use case(s): update optimal parameter $\hat{\theta}$ ’s of a neural network (or a certain class of probabilistic models) by gradient descent algorithms;