Weekly DL study note: GFlowNet Code Tutorial (completed)

Completed code: https://github.com/gangfang/littlegfn/blob/main/face_generator.ipynb

Pre-requisites

  • Logarithm provides numerical stability because it maps extremely small numbers (like 10^-300) to moderate negative values and extremely large numbers (like 10^300) to moderate positive values, avoiding underflow and overflow. So we prefer to train our models in log space.
    Logarithm - Wikipedia
  • LogSoftmax: \text{LogSoftmax}(x_i) = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right). This is literally log(softmax).
  • Softmax isn’t typically considered a part or layer of a NN. It operates on and interprets the output of a NN.
  • NLLLoss (negative log likelihood loss) is commonly used for classification tasks while MSE loss (mean square error loss) is mainly used for regression tasks.
    • NLLLoss is often combined with LogSoftmax to form cross-entropy loss, which operates on logits directly (see the sketch below).
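
Not from the tutorial, but a quick PyTorch check of the two points above: summing logs avoids the underflow that direct multiplication of tiny probabilities hits, and CrossEntropyLoss is exactly LogSoftmax followed by NLLLoss.

import torch
import torch.nn as nn

# Log space avoids underflow: the product of two tiny probabilities collapses to 0,
# while the sum of their logs stays representable.
p = torch.tensor(1e-300, dtype=torch.float64)
print(p * p)              # tensor(0., dtype=torch.float64)  -- underflow
print(p.log() + p.log())  # about -1381.55, still representable

# CrossEntropyLoss(logits, target) == NLLLoss(LogSoftmax(logits), target)
logits = torch.randn(4, 3)               # batch of 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
print(torch.allclose(ce, nll))           # True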

Flow Networks (Nothing about training yet)

IMPORTANT – MAIN IDEA OF GFN

The main idea behind GFlowNet is to interpret the DAG as a flow network, and to think of each edge as a pipe through which some amount of water, or particles, flows. We then want to find a flow where (a) flow is preserved, and (b) the flow coming into a terminal state (a finished object) is equal to its reward (in fact, non-terminal states have no reward).

A central finding of the first GFlowNet paper is that if we associate with each edge a flow F(s,s') > 0, and this flow respects the following property (flow consistency):
\forall s', \sum_{s: (s,s')\in\mathcal{E}} F(s,s') = R(s') + \sum_{s'':(s',s'')\in\mathcal{E}} F(s',s'')
then if we define the following policy (the probability of transitioning to any particular s’ is proportional to the flow F(s, s’) from s to s’, normalized by the total flow out of s to all possible next states s''):
P_F(s'|s) = \frac{F(s, s')}{\sum_{s''}F(s, s'')}
and sample trajectories (paths through the DAG) using this policy, we will sample terminal states (finished objects x) with probability:
p(x) \propto R(x)

The intuition behind the flow consistency equation is that for an intermediate state, in-flow equals out-flow; for a terminal state, in-flow equals the reward of that terminal state. Given this consistency, we define the policy so that the step-wise probability distribution matches the distribution of flow.
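
To make this concrete, here is a tiny two-terminal DAG of my own (not from the tutorial) whose edge flows satisfy the consistency equation; the normalized policy then samples terminal states in proportion to their rewards:

# Toy DAG: s0 -> s1 -> x1 (R = 2) and s0 -> s2 -> x2 (R = 1).
# Flow consistency forces each edge on a single path to carry its terminal reward.
R = {"x1": 2.0, "x2": 1.0}
F = {
    ("s0", "s1"): 2.0, ("s1", "x1"): 2.0,   # in-flow of x1 = R(x1)
    ("s0", "s2"): 1.0, ("s2", "x2"): 1.0,   # in-flow of x2 = R(x2)
}

def P_F(s, s_next):
    # Probability of each transition = its flow, normalized by the total out-flow of s.
    out_total = sum(f for (a, b), f in F.items() if a == s)
    return F[(s, s_next)] / out_total

p_x1 = P_F("s0", "s1") * P_F("s1", "x1")    # (2/3) * 1
p_x2 = P_F("s0", "s2") * P_F("s2", "x2")    # (1/3) * 1
print(p_x1, p_x2)                           # 0.666..., 0.333...
print(p_x1 / p_x2, R["x1"] / R["x2"])       # both 2.0 -> p(x) ∝ R(x)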

The graph above is the graph of states the GFN moves through, i.e. the Markov Decision Process (MDP), not the NN itself. In this tutorial, that abstract graph looks like the one below. Note that this is a visualization of all possible valid paths.

One assumption is that the reward function R(x) is given. R(x) assigns a reward value to every possible object x. In this tutorial, R(x) is defined like so:

Here I’d like to only reward valid faces, so if there are missing parts or two overlapping parts, such as both a smile and a frown, the reward is 0. I’d also like to generate smiley faces more often than sad faces, so I’m giving the former a reward of 2 and the latter a reward of 1.
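
A minimal sketch of such a reward, assuming a face is represented as a set of part-name strings. Apart from "left_eb_down" (quoted later in these notes), the part names and grouping are placeholders rather than the notebook's exact identifiers:

# A face is a set of part names. Only "left_eb_down" is confirmed by these notes;
# the other names and the grouping are placeholders.
MOUTHS = {"smile", "frown"}
LEFT_EYEBROWS = {"left_eb_down", "left_eb_up"}
RIGHT_EYEBROWS = {"right_eb_down", "right_eb_up"}

def face_reward(face: set) -> float:
    # A valid face has exactly one part from each group; anything else
    # (missing or overlapping parts) gets reward 0.
    if any(len(face & group) != 1 for group in (MOUTHS, LEFT_EYEBROWS, RIGHT_EYEBROWS)):
        return 0.0
    return 2.0 if "smile" in face else 1.0   # smiley faces rewarded twice as much

print(face_reward({"smile", "left_eb_up", "right_eb_up"}))    # 2.0
print(face_reward({"smile", "frown", "left_eb_up"}))          # 0.0 (overlapping mouths)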

Key difference between RL and GFN:

What distinguishes GFN from RL methods is that we use the reward to learn a particular p(x), instead of trying to find reward-maximizing solutions. We are aiming to train a model such that p(x) \propto R(x).

  • With RL, we would end up with all happy faces when generating faces with the trained model, because R(happy face) is bigger than R(frown face). With GFN, the ratio between the number of generated happy faces and the number of generated frown faces follows R(happy face)/R(frown face), i.e., 2:1 here.

About the relationship between the distribution to learn and step-wise policy:

What we’re interested in is a policy, something that will tell us at each step what action to take (what patch to add). Since we are interested in learning a distribution, the policy itself will be a distribution over actions (possible patches to add), which is typically written \pi(a|s) in RL, but which we’ll call P_F (F for forward) here.

Building and training a flow model with flow-matching

We want to predict edge flows. To do so, we train a model F(state) which outputs a vector indexed by the action used to reach a child of state, i.e. F(s,s') is F(s)[a] if a leads from s to s' (e.g. a can be "add a raised eyebrow").
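
A minimal sketch of such a model, consistent with the F = self.mlp(x).exp() * (1 - x) line quoted below; the 6-action vocabulary, hidden size, and activation are assumptions rather than the notebook's exact values:

import torch
import torch.nn as nn

class FlowModel(nn.Module):
    # Predicts one edge flow F(s, s') = F(s)[a] per possible action a.
    def __init__(self, num_actions: int = 6, num_hid: int = 64):
        super().__init__()
        # The state s is a binary vector: x[a] = 1 if part a has already been added.
        self.mlp = nn.Sequential(
            nn.Linear(num_actions, num_hid),
            nn.LeakyReLU(),
            nn.Linear(num_hid, num_actions),
        )

    def forward(self, x):
        # exp() keeps flows positive; multiplying by (1 - x) zeroes the flow of any
        # part already present, so a duplicate part can never be sampled.
        return self.mlp(x).exp() * (1 - x)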

  • Now I can see the role of the NN. The flow network itself is a theoretical construct with the properties defined in the previous chapter. With such a small flow-network example, I could work out the flow of the entire network by hand (i.e., the flow amount of every edge) and hence the probability distribution over actions from any input state. But instead of hand-calculating these probabilities, the NN can learn them. This is useful in any non-trivial situation where the hand calculation is intractable or impossible. We also get fast amortized inference.
  • But do we not need to know the flow in order to train the NN? GUESS: maybe not. What we need is some examples of network execution. SECOND GUESS: we probably don’t need those examples either. AN: example actions are generated using the NN during training. A reward is obtained at the end of an episode, given the trajectory produced by the NN-chosen actions, and the loss is calculated using flow consistency. So no, we don’t need to know the flow in order to train the NN.
  • The simplest loss is the MSE between the in-flow and out-flow of a state. This is based on the flow consistency equation: it is there to TRAIN THE NN TO RESPECT flow consistency (see the training-loop sketch after this list). But don’t we still lack the flow, which is necessary to construct P_F(s'|s) = \frac{F(s, s')}{\sum_{s''}F(s, s'')}? AN: we don’t compute P_F(s'|s) from known flows using that formula; we learn it through training.
  • But how does a flow-consistent NN follow the policy P_F(s'|s) = \frac{F(s, s')}{\sum_{s''}F(s, s'')}? AN: it doesn’t, unless we construct the policy explicitly. In the training code, we have a line which constructs it:
# The policy is just normalizing, and gives us the probability of each action
policy = edge_flow_prediction / edge_flow_prediction.sum()
  • The model would never generate a face with a duplicate feature (e.g., two “left_eb_down”) because, in F = self.mlp(x).exp() * (1 - x), the flow value of an existing feature is set to 0, so the subsequent sampling can never yield that feature, which has probability 0.
  • Intuition: A GFN isn’t a new NN topology for DL. Rather, it is a framework in which a trained stochastic policy generates compositional objects, through a sequence of sampling steps, whose distribution is proportional to a reward function R(x).
    • That is why we needed the inner loop for t in range(3): in the training loop of the FlowModel. We know it takes 3 steps to make a complete face, so we must sample the FlowModel 3 times, to have a chance of getting a complete face, before assigning a reward.
  • The model isn’t generic. The setup and training both rely on the specific structure of the problem. For example, the inner loop for t in range(3) exists because we know it takes 3 steps to make a potentially valid, complete face.
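
Putting the pieces together, here is a condensed sketch of the flow-matching training loop described above. It reuses the FlowModel and face_reward sketches from earlier; the part ordering, learning rate, episode count, and parent-enumeration helper are assumptions, not the notebook's exact code:

import torch

# Hypothetical action ordering; assumes the FlowModel and face_reward sketches above.
PART_NAMES = ["smile", "frown", "left_eb_down", "left_eb_up",
              "right_eb_down", "right_eb_up"]

model = FlowModel(num_actions=len(PART_NAMES))
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

def parents_of(state):
    # All (parent_state, action) pairs that reach `state` by adding one part.
    eye = torch.eye(len(state))
    return [(state - eye[a], a) for a in range(len(state)) if state[a] == 1]

for episode in range(5000):
    state = torch.zeros(len(PART_NAMES))          # start from an empty face
    minibatch_loss = torch.tensor(0.0)
    for t in range(3):                            # a complete face takes 3 steps
        edge_flow_prediction = model(state)
        # The policy is just normalizing, and gives us the probability of each action
        policy = edge_flow_prediction / edge_flow_prediction.sum()
        action = torch.multinomial(policy, 1).item()
        new_state = state.clone()
        new_state[action] = 1

        # In-flow of new_state: predicted flow from every parent into it.
        in_flow = sum(model(p)[a] for p, a in parents_of(new_state))
        # Out-flow of new_state: its reward if terminal, otherwise its predicted flows.
        if t == 2:
            face = {PART_NAMES[i] for i in range(len(PART_NAMES)) if new_state[i] == 1}
            out_flow = torch.tensor(face_reward(face))
        else:
            out_flow = model(new_state).sum()

        # Flow-matching loss: MSE between in-flow and out-flow for each visited state.
        minibatch_loss = minibatch_loss + (in_flow - out_flow).pow(2)
        state = new_state

    opt.zero_grad()
    minibatch_loss.backward()
    opt.step()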

Trajectory Balance

  • Multiple objectives can be used for GFN training. Their efficiency and effectiveness differ:
    • Flow matching objective: computing the flow through all parents of a state is required
    • Detailed balance objective: constraint to satisfy:
      F(s)P_F(s'|s) = P_B(s|s')F(s')
    • Trajectory balance objective: a similar constraint to the above, but at the trajectory level
  • The trajectory balance objective uses a full trajectory from the source state to a terminal state. The constraint it satisfies focuses on F(\tau). The two sides of the equation, which are two ways to calculate F(\tau), look at the trajectory from the start and from the end.
    • From the start, we have Z \prod_{t} P_F(s_{t+1}|s_t) where Z is the total flow of the network originating from s_0.
    • From the end, we have R(x)\prod_t P_B(s_t|s_{t+1}) where R(x) is the reward of the terminal state x
    • The loss is the mismatch between the two. The optimization minimizes this mismatch so that we achieve (see the loss sketch after this list):
      Z \prod_{t} P_F(s_{t+1}|s_t) = R(x)\prod_t P_B(s_t|s_{t+1})
    • A toy example:
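
A minimal sketch of the trajectory-balance loss in log space (where the products become sums). The learned scalar log_Z, the probabilities in the usage lines, and the epsilon are illustrative assumptions, not the tutorial's code:

import torch

def trajectory_balance_loss(log_Z, log_P_F, log_P_B, reward, eps=1e-8):
    # Squared log-space mismatch between the two ways of writing F(tau):
    #   Z * prod_t P_F(s_{t+1}|s_t)   vs   R(x) * prod_t P_B(s_t|s_{t+1})
    lhs = log_Z + log_P_F.sum()
    rhs = torch.log(torch.tensor(reward + eps)) + log_P_B.sum()
    return (lhs - rhs).pow(2)

# Usage with made-up numbers: a 3-step trajectory ending in a face with R(x) = 2.
log_Z = torch.tensor(0.0, requires_grad=True)        # would be a learned parameter
log_P_F = torch.log(torch.tensor([0.5, 0.4, 0.9]))   # from the forward policy net
log_P_B = torch.log(torch.tensor([1.0, 0.5, 1/3]))   # e.g. uniform over parents
loss = trajectory_balance_loss(log_Z, log_P_F, log_P_B, reward=2.0)
loss.backward()                                      # gradient reaches log_Z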
