Defining the camera position

When displaying a 3D scene, we need to think about how such an image is formed. The obvious model is our own vision, although having two eyes makes that model a bit more complicated. So we simplify and use a camera with a single image sensor.

The first question to answer is how to place the camera in the 3D world and how to describe that placement mathematically.

A common way to describe a camera is by specifying three values:

$\mathbf{A}$ : The position of the camera in the world
$\mathbf{C}$ : The point the camera is pointing towards ("looking at")
$\mathbf{u}$ : The "up"-vector, which specifies the "up" direction in our model. Usually, the $y$ -axis is chosen.

The basic idea is to compute the coordinates of each object vertex in the coordinate system described by our camera. This involves two steps. The first step is to make our camera the new origin of the coordinates. This is actually pretty easy and we already covered the operation before: A translation by $-\mathbf{A}$ : $\mathbf{T}(-\mathbf{A})$ . This will move $\mathbf{A}$ to the origin and everything else will move relative to that.

Now we need to compute the coordinates of the moved points. This might sound hard, but it is actually pretty intuitive. Let's say you have a coordinate system drawn out on a sheet of paper. You now place a point somewhere on that paper. How do you find the $x$ and $y$ coordinates? You draw a line perpendicular to the axis such that it goes through the point. This line of course also intersects the axis you drew it perpendicular to. To get the coordinate, you measure how far away that intersection is from the origin.

This is exactly the projection of the position vector of the point onto the axis (you "drop the position straight down onto the axis")! How do we calculate this projection? If our axis vector has a length of $1$ , you may recall that the length of the projection is exactly the dot product of the vector and the axis!
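To make the projection idea concrete, here is a minimal sketch using plain arrays as vectors. The helper `dot` and the sample values are made up for illustration and are not part of any library used in this text.

```javascript
// Dot product of two 3D vectors given as plain arrays.
function dot(a, b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// A unit-length axis: the x-axis rotated by 45 degrees in the xy-plane.
const s = Math.SQRT1_2; // 1 / sqrt(2)
const axis = [s, s, 0];

// Position vector of some point.
const p = [2, 2, 0];

// Because the axis has length 1, the dot product is the coordinate
// of p along that axis.
const coordinate = dot(p, axis); // approximately 2.828, i.e. 2 * sqrt(2)
```

If the axis were not normalized, the result would be scaled by the length of the axis, so it would no longer be the coordinate we are after.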

So, given that we know the camera axes $\mathbf{x}^w,\mathbf{y}^w,\mathbf{z}^w$ in world space (the camera is placed in the world), we can find the camera coordinates $\mathbf{p}^c$ of a point $\mathbf{p}^w$, which is also given in the world system, as:

\begin{align*} \mathbf{p}^c_x &= \mathbf{p}^w \cdot \mathbf{x}^w \\ \mathbf{p}^c_y &= \mathbf{p}^w \cdot \mathbf{y}^w \\ \mathbf{p}^c_z &= \mathbf{p}^w \cdot \mathbf{z}^w \end{align*}

Now, if you recall how matrix multiplication works, this can be simply written as:

\begin{align*} \mathbf{p}^c &= \begin{pmatrix}(\mathbf{x}^w)^T \\ (\mathbf{y}^w)^T \\ (\mathbf{z}^w)^T\end{pmatrix} \mathbf{p}^w \\ &= \mathbf{R}_c \mathbf{p}^w \end{align*}

Note: If our axes are normalized and perpendicular and form a right-handed system, the matrix $\mathbf{R}_c$ will be a rotation matrix, satisfying $\mathbf{R}_c \mathbf{R}_c^T = \mathbf{I}$ .
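The matrix form can be sketched in a few lines: the axes become the rows of $\mathbf{R}_c$, and multiplying with a point is one dot product per row. The axis values below are a hypothetical example (the world axes rotated by 90 degrees about the world $z$ -axis), and the helper names are only for illustration.

```javascript
// Dot product of two 3D vectors given as plain arrays.
function dot(a, b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Hypothetical camera axes in world coordinates:
// orthonormal and right-handed (xAxis x yAxis = zAxis).
const xAxis = [0, 1, 0];
const yAxis = [-1, 0, 0];
const zAxis = [0, 0, 1];

// R_c has the axes as its rows.
const Rc = [xAxis, yAxis, zAxis];

// Multiplying a row-based matrix with a point: one dot product per row.
function applyRows(rows, p) {
  return rows.map((row) => dot(row, p));
}

const pWorld = [1, 2, 3];
const pCamera = applyRows(Rc, pWorld); // [2, -1, 3]

// R_c * R_c^T: entry (i, j) is the dot product of rows i and j.
// For orthonormal rows this is the 3x3 identity matrix.
const RRt = Rc.map((ri) => Rc.map((rj) => dot(ri, rj)));
```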

Also recall that we used matrices to transform the coordinates of points before, even if it was in 2D. To compose multiple such transformations, we used matrix multiplication, and we will do so again this time. We just need to once again make a $4\times 4$ matrix from $\mathbf{R}_c$ by putting it in the upper left part of a $4\times 4$ identity matrix. With that, we define the View matrix $\mathbf{V}$ as:

\mathbf{V} = \mathbf{R}_c \mathbf{T}(-\mathbf{A})

This matrix will transform a point in the world's coordinate system into one in the camera's system!
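As a rough sketch of this composition, the following builds $\mathbf{T}(-\mathbf{A})$ and the $4\times 4$ version of $\mathbf{R}_c$ from plain nested arrays (one inner array per row) and multiplies them. All function names and the sample camera are assumptions made for this illustration; a matrix library would normally provide these operations.

```javascript
// Multiply two 4x4 matrices stored as arrays of rows.
function mul4(a, b) {
  return a.map((row) =>
    [0, 1, 2, 3].map(
      (j) => row[0] * b[0][j] + row[1] * b[1][j] + row[2] * b[2][j] + row[3] * b[3][j]
    )
  );
}

// Multiply a 4x4 matrix with a homogeneous point [x, y, z, 1].
function transform(m, p) {
  return m.map((row) => row[0] * p[0] + row[1] * p[1] + row[2] * p[2] + row[3] * p[3]);
}

// T(-A): translation that moves the camera position A to the origin.
function translationByNegative(A) {
  return [
    [1, 0, 0, -A[0]],
    [0, 1, 0, -A[1]],
    [0, 0, 1, -A[2]],
    [0, 0, 0, 1],
  ];
}

// R_c placed in the upper left 3x3 part of a 4x4 identity matrix.
function rotationFromAxes(x, y, z) {
  return [
    [x[0], x[1], x[2], 0],
    [y[0], y[1], y[2], 0],
    [z[0], z[1], z[2], 0],
    [0, 0, 0, 1],
  ];
}

// Hypothetical camera: located at A, axes rotated 90 degrees about world z.
const A = [1, 2, 3];
const V = mul4(
  rotationFromAxes([0, 1, 0], [-1, 0, 0], [0, 0, 1]),
  translationByNegative(A)
);

// The camera position itself ends up at the view-space origin.
const cameraInView = transform(V, [1, 2, 3, 1]);
// A point one unit along the camera's x-axis ends up at (1, 0, 0).
const pointInView = transform(V, [1, 3, 3, 1]);
```

Note the order: the translation is applied first because it stands on the right, directly next to the point.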

As this matrix does not change per model, we can, for example, put it into the uniform data as a constant.

What is missing is how we can compute the $\mathbf{x}^w,\mathbf{y}^w,\mathbf{z}^w$ axes.

For that we think about how our screen coordinates should work. We want the $x$ -axis to point to the right and the $y$ axis to point up, such that we can align it with the screen where the origin is the lower left.

Note: This is the convention used in the OpenGL API, but others, such as Vulkan, are different and might have the $y$ axis point down. This isn't a big issue, you just need to be careful which definition is used.

If we use the "right-up" definition and want to stay with a right-handed coordinate system, the $z$ axis has to point "out" of the screen, that is, opposite to our viewing direction. And this is where we start.

The camera looks from $\mathbf{A}$ to the point $\mathbf{C}$ , so the normalized view direction is $\frac{\mathbf{C}- \mathbf{A}}{||\mathbf{C}-\mathbf{A}||}$ . As mentioned, the $z$ -axis points in the other direction, thus we have:

\mathbf{z} = \frac{\mathbf{A}- \mathbf{C}}{||\mathbf{A}-\mathbf{C}||}

Now we want to compute the "right" axis: $x$ . For that we use the up direction $\mathbf{u}$ . We can think about it as a first guess on where the final $y$ -axis will be. In a right-handed system, you get $x$ from $y$ and $z$ by computing $\mathbf{y} \times \mathbf{z}$ . Using our up vector in place of $\mathbf{y}$ and normalizing, we get our second axis:

\mathbf{x} = \frac{\mathbf{u} \times \mathbf{z}}{||\mathbf{u} \times \mathbf{z}||}

By the properties of the cross product, $\mathbf{x}$ and $\mathbf{z}$ are perpendicular. We now get the final vector as another cross product:

\mathbf{y} = \mathbf{z} \times \mathbf{x}

Note that we don't need to normalize, as the length of the cross product will be $||\mathbf{z} || || \mathbf{x} || \sin\alpha = 1 \cdot 1 \cdot 1 = 1$ , since both vectors are normalized and perpendicular ( $\alpha = 90^\circ$ ).
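The three formulas translate almost directly into code. The following is a minimal sketch with plain arrays as vectors. The function `cameraAxes` and its helpers are hypothetical names for this illustration, and the sketch assumes that $\mathbf{u}$ is not parallel to the view direction (otherwise the cross product would be the zero vector and could not be normalized).

```javascript
// Component-wise difference a - b.
function sub(a, b) {
  return [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
}

// Cross product a x b.
function cross(a, b) {
  return [
    a[1] * b[2] - a[2] * b[1],
    a[2] * b[0] - a[0] * b[2],
    a[0] * b[1] - a[1] * b[0],
  ];
}

// Scale a vector to length 1 (assumes a non-zero vector).
function normalize(a) {
  const len = Math.hypot(a[0], a[1], a[2]);
  return [a[0] / len, a[1] / len, a[2] / len];
}

// Camera axes from position A, look-at point C and up vector u.
function cameraAxes(A, C, u) {
  const z = normalize(sub(A, C)); // opposite to the view direction
  const x = normalize(cross(u, z)); // "right"
  const y = cross(z, x); // already unit length, no normalization needed
  return { x, y, z };
}

// Camera at (-0.5, 0, 0) looking at the origin, with (0, 1, 0) as up.
const axes = cameraAxes([-0.5, 0, 0], [0, 0, 0], [0, 1, 0]);
// axes.x is (0, 0, 1), axes.y is (0, 1, 0), axes.z is (-1, 0, 0)
```

In this example the up vector is already perpendicular to the view direction, so the final $y$ -axis is the same as $\mathbf{u}$ . In general, the second cross product corrects the first guess.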

With that, we have found everything we need to define the position and orientation of a camera using the View matrix $\mathbf{V}$ . We call the coordinate system defined this way the view space.

You can program this yourself or use the following function in code:

/**
 * Computes a 4x4 view matrix for 3D space
 *
 * @param {AbstractMat} eye - The camera position
 * @param {AbstractMat} center - The point to look at
 * @param {AbstractMat} up - The up vector
 * @returns {Mat} The view matrix
 */
jsm.lookAt(eye, center, up);

// Example
// The camera is located at (-0.5,0,0) and
//  looks to the origin (0,0,0) with (0,1,0)
//  being the world's up direction
const V = jsm.lookAt(
  vec3(-0.5,0,0), vec3(0,0,0), vec3(0,1,0)
);

Next up, we define how the camera itself works and how it represents perspective.