# Life optimization: inspiration from mathematics

As humans, we have both an ability to take actions that affect the environment around us and some ability to predict the effects of those actions. This is an opportunity to mold our lives as close as possible to what we want.

We start by rephrasing this initial assumption in the context of reinforcement learning, then take further inspiration from optimization theory to build a framework for thinking about how to select actions that improve one’s life. In essence, the framework offers a way of looking at the question: How does one make their life better?

Sections 1 and 2 of this article may come across as mathematical and dry, but they are crucial for motivating the ideas presented in later sections. Section 3 translates the mathematical framework into a human-applicable one. Simplifications have been made throughout for the sake of clarity and readability; those interested are directed towards the sources provided.

# 1. The reinforcement learning framework

Reinforcement learning is a category of machine learning where agents (which correspond to humans in our example) take actions in a system, modifying that system, with the aim of maximizing cumulative reward. Equivalently, the agent’s performance is measured by the sum of the rewards it obtains at each time step.

It is particularly well suited to complicated systems, such as the world around us.

**The reinforcement learning problem** consists of the following components:

- A set of states of the combined environment and agent: $S=\{{s}_{i}\}$
- A set of actions that the agent can take: $A=\{{a}_{i}\}$
- The probability ${P}_{a}(s,{s}^{\prime})=\Pr({s}_{t+1}={s}^{\prime}\mid {s}_{t}=s,{a}_{t}=a)$ of transitioning (at time $t$) from state $s\in S$ to state ${s}^{\prime}\in S$ under action $a\in A$.
- The immediate reward ${R}_{a}(s,{s}^{\prime})$ for transitioning from state $s\in S$ to state ${s}^{\prime}\in S$ under action $a\in A$. We call $R$ the reward function.

The **reinforcement learning loop** begins with an initial state ${s}_{0}$, which the agent observes before choosing an action ${a}_{0}$ (how this choice is made is the main focus of this article). Given the transition probabilities for this state and the chosen action, a new state ${s}_{1}$ results. This state transition leads to a reward ${r}_{0}={R}_{{a}_{0}}({s}_{0},{s}_{1})$. At the next time step this process repeats. A sequence of actions leads to a sequence of states and corresponding rewards.
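
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real RL system: the two-state world, its transition probabilities, the reward values and the names `P`, `reward`, `step`, `run_episode` and `simple_policy` are all assumptions invented for the example.

```python
import random

# Hypothetical two-state world; states, actions and numbers are illustrative only.
STATES = ["tired", "rested"]
ACTIONS = ["work", "sleep"]

# P[(s, a)] maps each possible next state s' to Pr(s' | s, a).
P = {
    ("tired", "work"): {"tired": 0.9, "rested": 0.1},
    ("tired", "sleep"): {"tired": 0.2, "rested": 0.8},
    ("rested", "work"): {"tired": 0.6, "rested": 0.4},
    ("rested", "sleep"): {"tired": 0.0, "rested": 1.0},
}


def reward(s, a, s_next):
    """R_a(s, s'): immediate reward for the transition s -> s' under action a."""
    if a == "work":
        return 2.0 if s == "rested" else 0.5
    return 0.0


def step(s, a, rng):
    """Sample the next state s' from P_a(s, .) and return (s', r)."""
    next_states = list(P[(s, a)].keys())
    weights = list(P[(s, a)].values())
    s_next = rng.choices(next_states, weights=weights, k=1)[0]
    return s_next, reward(s, a, s_next)


def run_episode(policy, s0, n_steps, rng):
    """Observe the state, choose an action, transition, collect the reward; repeat."""
    s, trajectory = s0, []
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = step(s, a, rng)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


def simple_policy(s):
    """A hand-written rule: work when rested, sleep when tired."""
    return "work" if s == "rested" else "sleep"


trajectory = run_episode(simple_policy, "tired", 10, random.Random(0))
cumulative_reward = sum(r for _, _, r, _ in trajectory)
```

The agent only controls `simple_policy`; the transition table `P` and the `reward` function belong to the environment, which mirrors the split between what is and is not under our control.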

The reinforcement learning agent’s objective is then to find the optimal (or a near optimal) sequence of actions that maximizes the cumulative (over the time steps) reward.
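
Written out, and assuming a finite horizon of $T$ time steps (the symbols $T$ and $G$ are introduced here for illustration and are not part of the definitions above), the cumulative reward is

$$G=\sum_{t=0}^{T-1}{r}_{t}=\sum_{t=0}^{T-1}{R}_{{a}_{t}}({s}_{t},{s}_{t+1})$$

Because the state transitions are probabilistic, a natural reading of the objective is to maximize the expected value of this sum:

$$\max_{{a}_{0},\dots ,{a}_{T-1}}\mathbb{E}\left[G\right]$$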

From the perspective of a machine learning engineer looking to build an RL system, the task is then to find a policy (a rule for choosing an action in each state) under which the agent’s actions maximize the cumulative reward. The reward function is the encoding (usually by the machine learning engineer / interpreter) of how good each state transition is, and is the principal method of incentivizing the desired behavior in the agent.

The beauty of the reinforcement learning perspective, for the challenge of optimizing one’s life, is that it shows a well-defined process for something we intuitively do, while also making clear which aspects are under our control and which ones are not. With the addition of a non-deterministic state transition, it also encodes the stochastic nature of our world. It provides a framework for looking at the world, to help make sense of what happened, what is happening, and what can be done in the future.

# 2. The Agent’s perspective

The framework of reinforcement learning is only useful insofar as it helps frame the central question of this article: How does one improve their life? Investigating this leads to further questions such as how to choose our actions, or what it means for an action to be the right one.

One differentiating factor between RL algorithms is whether or not they contain a model. Generally, an RL agent does not need to be able to approximate how the environment will react to its actions in order to choose an action; algorithms of this kind are called model-free RL algorithms. The alternative, which seems more appropriate in our case (and has been applied in state-of-the-art systems such as AlphaGo), is the family of model-based RL algorithms. These have (or build) an approximation of their environment, and query it many times before choosing an action. We shall focus on the perspective of model-based RL algorithms, congruent with the previous assumption that humans have some ability to predict the effects of their actions.

Model-based RL agents such as AlphaZero have three components:

- The model, which predicts the state of the world given an initial state and a sequence of actions.
- A search algorithm, which the agent uses to explore different scenarios, using the model to predict the effects.
- A value function that tells the agent how good each state is. This is an internal approximation to the external reward function. Note that the optimal value function does not have to be an exact copy of the reward function.

For a given action $a\in A$, the agent predicts the resulting state of the environment using the model $M$, and then assigns that state a score using the value function $V$. It then applies a search algorithm (for example a tree-based search) to find the actions that maximize this value.
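
One way to write this selection rule, assuming a deterministic model $M(s,a)$ that returns the predicted next state and a lookahead of a single step (both are simplifying assumptions made here), is

$$a^{*}=\arg\max_{a\in A}V(M(s,a))$$

The Python sketch below shows the three components working together on a made-up problem: the state is an integer position and the agent wants to reach position 3. The names `model`, `value` and `search` are hypothetical. As a design choice for this example, a plan is scored by the sum of the values of all predicted states along it; with `depth=1` this reduces to the formula above.

```python
from itertools import product

# Hypothetical toy problem: the state is an integer position, the goal is position 3.
ACTIONS = (-1, 0, 1)
GOAL = 3


def model(state, action):
    """M: predicted next state (assumed deterministic in this sketch)."""
    return state + action


def value(state):
    """V: the agent's internal score for a state (higher is better)."""
    return -abs(GOAL - state)


def search(state, depth):
    """Enumerate every action sequence of length `depth`, score each one by the
    summed value of its predicted states, and return the first action of the best."""
    best_score, best_first_action = float("-inf"), None
    for plan in product(ACTIONS, repeat=depth):
        predicted, score = state, 0.0
        for action in plan:
            predicted = model(predicted, action)
            score += value(predicted)
        if score > best_score:
            best_score, best_first_action = score, plan[0]
    return best_first_action


state = 0
for _ in range(5):
    # Take the first action of the best plan, then plan again from the new state.
    state = model(state, search(state, depth=2))
```

Exhaustive enumeration grows as `len(ACTIONS) ** depth`, which is why practical systems need smarter ways of narrowing the search.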

## 2.1 Model-based environment

The agent has an internal model of the external environment. This can be understood as an expectation of how the world changes depending on the initial state and a chosen action. The model may be stochastic, for example because the environment contains other agents whose behavior is random or cannot be estimated.

## 2.2 Value function

To be able to use a model to compare and contrast actions, the agent must be able to quantify how good each state is. For this, it needs to be able to predict the reward it will receive from each state transition.

## 2.3 Search Algorithm & Gradient descent

With a model and a value function, a crucial element still remains: the search algorithm. For discrete actions (e.g. move a pawn to E5) a search tree can be used, enumerating at each step the actions available. For continuous actions (e.g. move forward at 5.14 m/s) the process is more complicated. In both scenarios, however, the space of actions is very large, and reducing it to a manageable size is a challenge. AlphaZero uses a neural network to propose actions. In certain situations, where the model and the value function are differentiable, one can take the derivative of the combined model and value function with respect to the action, and then pick the action that increases the value (from the current state) the most.
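
The derivative-based approach for continuous actions can be sketched as follows. Everything here is an assumption made for illustration: the one-dimensional `model`, the quadratic `value` function that peaks at position 5.14, and the use of a finite-difference estimate in place of an analytic derivative. Because the value is being maximized, the update follows the gradient uphill (gradient ascent).

```python
def model(state, action):
    """M: hypothetical continuous model; the action is a speed applied for one second."""
    return state + action


def value(state):
    """V: hypothetical smooth value function that peaks at position 5.14."""
    return -((state - 5.14) ** 2)


def objective(state, action):
    """The combined model and value function, V(M(s, a))."""
    return value(model(state, action))


def gradient(state, action, eps=1e-6):
    """Central finite-difference estimate of dV(M(s, a))/da."""
    return (objective(state, action + eps) - objective(state, action - eps)) / (2 * eps)


def choose_action(state, action=0.0, step_size=0.1, n_iters=200):
    """Repeatedly nudge the action in the direction that increases the value."""
    for _ in range(n_iters):
        action += step_size * gradient(state, action)
    return action


best_action = choose_action(state=0.0)
```

Starting from position 0, the procedure settles on an action close to 5.14, the speed that lands the agent on the peak of this particular value function.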

# 3. The Human’s perspective

This is where we leave the structured mathematics behind to try and apply the reinforcement learning framework to a more qualitative, more uncertain, and ultimately human world. This section takes the components from Section 2 and attempts to draw conclusions from them for the human case. Note that this section presents both general ideas (e.g. planning helps) and specific techniques for applying those ideas (e.g. multi-term planning). The former are conclusions drawn from the mathematics; the latter are suggested implementations and therefore much more open to interpretation and personal perspective.

## 3.1 World Model and Learning

We all have some understanding of how our environment works. From the perspective of model-based RL, this is a model of the environment. Consider a scenario in which you are not physically in shape. Your internal model tells you that if you exercise frequently over a long enough time, you will become fitter. This is an internal belief, and you don’t need to actually do the exercise to find it out. It means we can consider actions and predict what effect they’ll have. Crucial to understand is that:

**Proposed implementation.** Improving our internal world model requires learning, ideally in a systematic way. Sources of learning include:

- Role models, mentors, or sponsors
- Formal education (university, night classes, etc)
- Independent study

## 3.2 Value functions and planning

The underlying reward function which we all strive for is happiness, in some form or other. Each of us has our own way of going about it: our own value function that guides our behavior. More often than not, this value function is neither precisely nor consciously defined. It strikes me as odd that people don’t realize this more often; it’s like driving at night without headlights or even a map. I suspect part of the reason is that defining our internal value function is not an intuitive task.

### 3.2.1 Defining the ideal state

Defining a value function is hard, but there is an intuitive path to it: defining your ideal state. For example, think 10 years into the future. What would you like your life to look like? What does your ideal day look like?

Defining very explicitly your morning routine (or lack thereof), what you do during the bulk of the day, any socializing, events, family time, etc that you want to have should help you rank your priorities.

This ranking and general description, the more explicit the better, will help guide what you are trying to achieve.

An important consideration when doing this is that others will have completely different answers to these questions. Their measure of success will look completely different, leading them to focus on different goals and achieve different successes. For example, if you decide you want to spend some time catching up with friends every day or week, then you will naturally be less productive than if you threw yourself into work, but that’s okay: you decide where you invest your time and energy. You decide what you care about. It is crucial to remember your own priorities and measure yourself with your own yardstick.

### 3.2.2 Multi-term planning

With the ideal state (the end goal) in mind, we can make plans on multiple timescales to help get there. First comes the long-term plan, which lists the large tasks required to get you from the present to that ideal state. I usually refer to this as the multi-year plan. Next comes the monthly plan, which subdivides the steps of the multi-year plan into chunks manageable on a monthly scale. You can repeat the process to create a weekly plan (and then do the tasks wherever they fit inside your week as it progresses). This multi-term planning has multiple aims:

- It connects every small task to the long-term goal, so you can feel that you are making progress. This is especially true if you keep a log of the tasks already done.
- It breaks down unmanageably large tasks into smaller chunks that directly contribute to the long-term goals.
- Checking the plan frequently ensures you don’t lose track of your long-term goals.
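
For those who like to keep such a plan in a structured form, the hierarchy can be represented as a simple tree. The sketch below is only one possible layout; the `Task` class, its fields and the example goal are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A node in the plan: either a concrete task or a goal made of subtasks."""
    name: str
    done: bool = False
    subtasks: list = field(default_factory=list)

    def leaves(self):
        """Return the concrete tasks (nodes without subtasks) under this node."""
        if not self.subtasks:
            return [self]
        return [leaf for sub in self.subtasks for leaf in sub.leaves()]

    def progress(self):
        """Fraction of concrete tasks under this node that are done."""
        leaves = self.leaves()
        return sum(task.done for task in leaves) / len(leaves)


# Hypothetical multi-year goal, split into monthly and then weekly items.
plan = Task("Run a marathon", subtasks=[
    Task("Build base fitness (this month)", subtasks=[
        Task("Run three times this week", done=True),
        Task("Buy running shoes"),
    ]),
    Task("Sign up for a race (next month)"),
])

overall = plan.progress()
```

Marking a weekly task as done immediately raises the progress of every goal above it, which is the link between small tasks and the long-term goal described in the first aim.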

As mentioned in the introduction of Subsection 3.2, this is just one way of going about it. What is invariant is the need to define the value function and choose a planning method. Other planning methods include Monte Carlo methods and real options analysis. These are well suited to complicated projects and company planning, but elements of them might well be useful to your process.

### 3.2.3 Cumulative reward optimization

An important idea of RL is the optimization of cumulative reward, meaning all time steps are considered. For this human version, this would correspond to optimizing for maximum happiness over time, rather than being overly focused on the here and now.

An intuitive way of doing this is considering the second order effects of your actions. For example, partying often will give you pleasure in the moment, but the longer term effect is lost productivity and therefore missed opportunities. Another example is exercise: short term there is pain, but the longer term benefits are numerous. I would classify this article as a third order activity, whereby we plan and consider choosing actions to optimize second order effects.
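
The difference between judging an action by its immediate reward and by its cumulative reward can be made concrete with a toy calculation. All numbers below are made up for illustration, as is the assumption that each action has one immediate effect and one delayed effect arriving a day later.

```python
# Hypothetical rewards; the numbers carry no meaning beyond this illustration.
IMMEDIATE = {"party": 3.0, "exercise": -1.0}   # first order effect, felt the same day
DELAYED = {"party": -4.0, "exercise": 2.5}     # second order effect, felt later
DELAY = 1                                      # days until the delayed effect arrives


def rewards(plan):
    """Per-day reward for a sequence of daily actions."""
    per_day = [0.0] * (len(plan) + DELAY)
    for day, action in enumerate(plan):
        per_day[day] += IMMEDIATE[action]
        per_day[day + DELAY] += DELAYED[action]
    return per_day


def cumulative_reward(plan):
    """Sum of rewards over all days, including the delayed effects."""
    return sum(rewards(plan))


party_plan = ["party"] * 7
exercise_plan = ["exercise"] * 7
```

With these numbers, partying wins on the first day (3.0 against -1.0) but loses over the week (-7.0 against 10.5), which is the pattern the second order view is meant to expose.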

The notion of optimizing for cumulative reward connects to the book “Solve for Happy”, where the author advises against optimizing for short term fun, and instead recommends dealing with the underlying problems and gaining a general level of happiness such that there is no need to escape from reality using fun. The author does mention two important uses for fun:

- As a supplement to allow you to relax and recharge your emotional capacity.
- As an emergency tool to distract you from a situation when appropriate.

### 3.2.4 Gradient descent

As the environment is stateful, actions over a period of time are needed to change it. Assuming a nonlinear environment (which I would argue our world certainly is) and an initial state, a standard procedure in mathematical optimization is gradient descent (or gradient ascent, when the objective is being maximized, as it is here). This involves taking small steps in the direction that seems best given your current state, and then reevaluating once you’ve made the step. Just as interest on financial investments accrues over time, steady investments in yourself also accrue.

It says something that for complicated situations (which I would argue our world certainly is) the best mathematics can offer is incremental improvements. So why would we be able to do better in our daily lives? Large improvements can usually be traced back to making small improvements every day.
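
As a rough worked example of how small steps accumulate, assume (purely for illustration) that each day’s improvement is a constant fraction $r$ of the current level and that improvements compound like interest. Starting from a level ${x}_{0}$, after $n$ days the level is

$${x}_{n}={x}_{0}{(1+r)}^{n}$$

With a hypothetical $r=0.01$ (one percent per day) and $n=365$, this gives ${(1.01)}^{365}\approx 37.8$, so the level after a year is roughly 38 times the starting level. Real progress is rarely this regular, so the figure should be read as an idealized upper illustration rather than a prediction.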

## 3.3 Interacting with the environment

Reading the state of the environment and taking actions in it are complicated enterprises in the human version of RL. The main challenges are the sheer size of the environment, finding out which information is relevant to you, and obtaining that information in an appropriate way.

For example, you might see someone doing something, and read the world at that level. However, you can delve deeper and try to understand why that person did what they did. This second approach will lead to far fewer misunderstandings. Other times you simply might not hear

### 3.3.1 Difficult Conversation

[Section related to the book: “Difficult Conversations”]

- It matters why people do what they do
- Understand and accept that people are different.
- More to be filled after trip to Isle of Wight.

### 3.3.2 Staying up to date

I propose systematizing the flow of information to you. This involves ensuring that as much relevant information as possible reaches you, after which you can decide what to do with it. It concerns both digital information and techniques and more human, qualitative sources. The general aim is to be up to date about:

- Relevant news (global, inside a company or organization, etc)
- Interesting opportunities
- External inspiration

For this the two main sources are the internet (no surprises there) and humans (networking).

# 4. Conclusion

- Introduced RL from the perspective of mathematics
- Applied the appropriate bits to a framework for RL in humans.
- Connected it with existing business ideas, todo’s. This shows the main value: you can now place globally why certain things matter.

# Disclaimers

- For those hoping for a rigorous mathematical article, this isn’t it. The qualitative nature and general messiness of life mean the author wasn’t able to create one. Nevertheless, any feedback for closing gaps or reducing mental leaps is greatly appreciated.
- This article focuses on individualism, seemingly ignoring all sense of community and other individuals. [Add explanation/justification/solution]
- The article assumes that everyone should optimize for long term happiness, but that is not a fundamental truth; it is very much up to everyone to choose. The author assumes that if you took the effort to read this article and think critically about taking actions for long term planning, you probably do care.