Machine learning appendix

February 1, 2017

To be updated sporadically; suggestions are highly appreciated.

Glossary of common ML terminology

Glossary of common terms and their synonyms (or strongly related terms).

$p(x)$ is written in place of $P(X = x)$ for brevity (see the "More notation" section below).

Term | Common notations and more information (if any)
estimation/prediction/inference of a quantity | the quantity with a "hat", e.g. $\hat{x}$, $\hat{y}$, $\hat{\theta}$
objective function, error function, loss function, cost function | $J(\theta)$; or $\mathcal{L}(\theta)$; or $E(\theta)$
learning, training; related: parameter estimation |
evaluating, forward pass, as in "computing some quantity given estimates of all unknown quantities" | note: not to be confused with performance evaluation
score function; related: (inverse) link function | $s(\cdot)$; or $g^{-1}(\cdot)$ if $g$ is the link function
neuron/unit; related: dimension (of a vector) | $x_i$, where $x_i$ is a dimension of vector $x$
feature; related: independent variable, explanatory variable, predictor | $x_i$, where $x_i$ is a dimension of vector $x$. note: in practice "feature" usually means the whole feature vector, so the vector $x$ is used instead
mid-/high-level feature, hidden/latent variable; related: hidden neurons/units | $h$; or $z$; or $\mathring{z}$ if assumed a random variable
target, label; related: dependent variable, response variable | $y$; or $t$
observed data/variable(s) | $x$; or $D$; or $X$ and $y$ if supervised learning
unobserved data/variable(s) | $z$; or $h$; or $\theta$
data point | $x$; or $x_n$ (if working with more than one data point); or $X$ if $X$ represents a set of data points
parameter; related: weight, bias (as the "offset" from the origin) | $\theta$, $w$, $b$, respectively. note: "bias" is an umbrella term and has other common meanings
parameterized function | $f_\theta(x)$; or $f(x; \theta)$ (see the sketch after this table)
parametric distribution | $p(x; \theta)$; or $p_\theta(x)$, where $p$ is the density function
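
To make the table concrete, here is a minimal sketch in Python (the original has no code; names such as predict and mse_loss are illustrative, not a standard API) tying several rows together with a linear model: a feature vector $x$, parameters $\theta = (w, b)$, the parameterized function $f_\theta$, a forward pass producing the prediction $\hat{y}$, and the objective $J(\theta)$:

    import numpy as np

    # Toy data: 100 data points, each a feature vector x with 3 dimensions x_i.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=100)

    # Parameters theta = (w, b): weight vector w and bias b ("offset" from the origin).
    w = np.zeros(3)
    b = 0.0

    def predict(X, w, b):
        # Parameterized function f_theta(x) = w^T x + b; evaluating it is a forward pass.
        return X @ w + b

    def mse_loss(y_hat, y):
        # Objective/error/loss/cost function J(theta), here mean squared error.
        return np.mean((y_hat - y) ** 2)

    y_hat = predict(X, w, b)  # y_hat: the estimate/prediction of the target y
    print(mse_loss(y_hat, y))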

More notation

For brevity, $p(x)$ is often seen in ML literature instead of $P(X = x)$ as in statistical texts. I also use an overhead circle, e.g. $\mathring{x}$, to denote a random variable, in place of capital letters such as X, N, K, which are reserved for sets of data points, matrices, or total numbers of samples/classes/features/…

Notation | Description
$\mathring{x}$ | A random variable. Thus $p(\mathring{x})$ implies a probability distribution.
$x$ | A realization of random variable $\mathring{x}$. Thus $p(x)$ is a value.
$\mathcal{X}$ | Set of all possible realizations $x$ of random variable $\mathring{x}$, i.e. the sample space of $\mathring{x}$.
$\hat{x}$ | An estimate of random variable $\mathring{x}$, short for $\hat{\mathring{x}}$. We use the notation $\hat{x}$ so as not to confuse it with $x$, which is reserved for the realization provided by the training data $D$.
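
The same conventions, restated as a compilable LaTeX sketch (I assume the overhead circle corresponds to the standard \mathring accent):

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Random variable vs. realization vs. estimate, in this document's conventions:
    \[
      \mathring{x} \ \text{(random variable)}, \qquad
      x \in \mathcal{X} \ \text{(a realization from the sample space)}, \qquad
      \hat{x} \ \text{(an estimate of $\mathring{x}$)}.
    \]
    % ML shorthand: p(x) abbreviates the statistical P(X = x).
    \[
      p(x) \equiv P(\mathring{x} = x).
    \]
    \end{document}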

Common machine learning model acronyms

Acronym | Description
PGM | Probabilistic Graphical Models, i.e. probabilistic models
GLM | Generalized Linear Models
GMM, PPCA | (Gaussian) Mixture Models, Probabilistic Principal Component Analysis
HMM, LDS | Hidden Markov Models, Linear Dynamical Systems (for modelling sequential data)
Topic Models | Latent Dirichlet Allocation (LDA, not to be confused with Linear Discriminant Analysis) and variants
DNN | Deep Neural Networks
MLP, CNN | Multi-Layer Perceptrons, Convolutional NNs; both are FNNs (feed-forward NNs)
RNN | Recurrent NNs, also including Recursive NNs and Bi-directional RNNs (for modelling sequential data)
EBM | Energy-Based Models (undirected PGMs)
RBM, DBN, DBM | Restricted Boltzmann Machines, Deep Belief Networks (not to be confused with Dynamic Bayesian Networks), Deep Boltzmann Machines
VAE, DRAW, AIR | Variational Auto-Encoders, Deep Recurrent Attentive Writer, Attend-Infer-Repeat
GAN | Generative Adversarial Networks
AAE | Adversarial Auto-Encoders
SVM | Support Vector Machines
DP, GP | Dirichlet Processes, Gaussian Processes
Tree, RF | Decision Trees, Random Forests
kNN | k-Nearest Neighbours

Back-propagation algorithm

Conditions: (i) $J$ is differentiable everywhere w.r.t. $\theta$; (ii) $J$ and all $\theta$'s form a directed acyclic computational graph. Purpose: computes the gradient $\partial J / \partial \theta$ for every $\theta$ of interest.

Use case(s): update the parameters $\theta$ of a neural network (or of a certain class of probabilistic models) toward optimal values via gradient-descent algorithms.
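
Below is a minimal sketch of back-propagation (assumed code, not from the original) through a two-layer network. Both conditions hold: $J$ (mean squared error) is differentiable everywhere w.r.t. every parameter, and the computation X -> h -> y_hat -> J forms a directed acyclic graph. The backward pass applies the chain rule from $J$ back to each parameter, followed by the gradient-descent use case; shapes, the learning rate, and all names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 5))   # 64 data points, 5 features
    y = rng.normal(size=(64, 1))   # regression targets

    # Parameters theta; the graph X -> a -> h -> y_hat -> J is a DAG.
    W1, b1 = 0.1 * rng.normal(size=(5, 8)), np.zeros(8)
    W2, b2 = 0.1 * rng.normal(size=(8, 1)), np.zeros(1)
    lr = 0.1                       # learning rate for gradient descent

    for step in range(100):
        # Forward pass: evaluate each node given current parameter estimates.
        a = X @ W1 + b1
        h = np.tanh(a)             # hidden units (differentiable activation)
        y_hat = h @ W2 + b2        # prediction of the target y
        J = np.mean((y_hat - y) ** 2)

        # Backward pass: chain rule through the DAG, from J back to each theta.
        dy_hat = 2.0 * (y_hat - y) / len(y)   # dJ/dy_hat
        dW2 = h.T @ dy_hat
        db2 = dy_hat.sum(axis=0)
        dh = dy_hat @ W2.T
        da = dh * (1.0 - h ** 2)              # tanh'(a) = 1 - tanh(a)^2
        dW1 = X.T @ da
        db1 = da.sum(axis=0)

        # Use case: gradient-descent update of all thetas of interest.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(f"final J = {J:.4f}")

In a framework with automatic differentiation the backward pass would be generated from the forward pass; it is written out by hand here only to expose the chain rule.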

Hoa M. Le