
January 17 2020

12:12

TUI Acceptance Testing

Update: I had created this blog post as a draft and forgotten about it. Since it went into a talk at LCA2020, I thought I'd publish it for completeness.

Since Purebred is closing in on its second birthday on the 18th of July, I wanted to highlight how useful our suite of acceptance tests has become and what work went into it.

The current state of affairs

We use tmux extensively to simulate user input and basically black-box test the entire application. Currently all 25 tests run on Travis in 1 minute and 25 seconds. Most of it comes down to IO performance. On a modern i5 laptop it's down to 30 seconds.

[Screenshot: Travis CI job #1366.1 for purebred-mua/purebred]

Each test performs a setup, starts the application and runs through a series of steps: performing user input, waiting for the terminal to repaint and asserting that a given text is present. Since everything in a terminal is text – even colours – this makes it really easy to design a test suite if you can bridge the gap using tmux.

I think I should explain more precisely what I mean by “waiting for the terminal to repaint”. At this point in time, we poll tmux for our assertion string to be present. That happens in quick succession by checking the hardcopy of the terminal window (basically a screen shot) with an exponential back off.
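
To make that concrete, here is a minimal sketch of how such a poll could look, using Python to drive tmux (the session name, helper and timings are illustrative, not Purebred's actual test code):

import subprocess
import time

def wait_for_string(session, needle, max_wait=1.0):
    """Poll the tmux pane until `needle` shows up, backing off exponentially."""
    delay = 0.01
    waited = 0.0
    while True:
        # grab a "hardcopy" of the current screen contents of the session
        screen = subprocess.run(
            ["tmux", "capture-pane", "-p", "-t", session],
            capture_output=True, text=True, check=True,
        ).stdout
        if needle in screen:
            return screen
        if waited >= max_wait:
            raise AssertionError(f"{needle!r} never appeared:\n{screen}")
        time.sleep(delay)
        waited += delay
        delay *= 2  # exponential back off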

Problems we encountered

Different tmux behaviour between releases

The Travis containers run a much older version of GNU/Linux and therefore of tmux. We've seen subcommands accepting different arguments, or changed escape sequences in the terminal screenshot.

Tests fail randomly because of too generic assertion strings

That took a bit of figuring out, but in hindsight it is really obvious. Since we determine the screen to be repainted once our assertion string shows up, we sometimes used a string which also showed up in other screens. The next steps were then executed and the test failed at a later step. This is confusing, since you wonder how the test has even gotten to this screen.

We solved this not really technically, but simply by fixing the offending assertions and documenting this potential pitfall.

Races between new-session and tmux server

Initially each test set up a new session and cleaned it up during tear down. While the intention was good, it led to randomly failing tests. Tearing down the single session meant that the server was killed as well. However, if the next session is created immediately after the old session is removed, we find ourselves in a race between the tmux server being killed and newly started.

We solved this problem with a keep-alive session which runs as long as the entire suite runs, and by numbering the test sessions.
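
A rough sketch of that idea, again in Python (the session names and the loop are illustrative, not the actual suite):

import subprocess

def tmux(*args):
    # run a tmux command and fail loudly if it errors
    subprocess.run(["tmux", *args], check=True)

# keep-alive session: pins the tmux server for the whole suite, so killing
# an individual test session never races with a server shutdown/restart
tmux("new-session", "-d", "-s", "keepalive", "-x", "80", "-y", "24")
try:
    for i in range(3):  # stand-in for the real list of tests
        session = f"purebredtest-{i}"
        tmux("new-session", "-d", "-s", session)
        try:
            pass  # drive the application under test here
        finally:
            tmux("kill-session", "-t", session)
finally:
    # only after the whole suite is done does the server go away
    tmux("kill-session", "-t", "keepalive")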

Asserting against terminal colours

Some tests assert specifically how widgets are rendered, including their ANSI colour codes. We write the tests on our own computers, which support more than 16 colours, while the terminals CI runs are typically less sophisticated. This can lead to randomly failing tests because the colours in CI differ depending on the type of terminal.

We solved this problem by simply setting a “dumb” terminal only supporting 16 colours.

Line wrapping in a terminal

A terminal comes with a hard character line width. By default it's 80 characters (and 24 lines in height). Part of those 80 characters will be eaten up by your shell's PS1 (the command prompt) setting; the rest can be consumed by command input.

We ran into randomly failing tests when lines wrapped at random points in the input and therefore introduced newlines in the output.

We solved this by invoking tmux with an additional “-J” parameter to join wrapped lines.
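
In terms of the sketch above, that presumably amounts to capturing the pane with tmux capture-pane -p -J -t <session>, so wrapped lines are joined back together before we search them.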

Optimisations

Initially, each step was waiting up to 6 seconds for a redraw. With an increasing number of tests, the wait time for a pass or fail increased as well. Since we faced some flaky tests, we felt we should fix those first before making the tests run faster. The downside of optimising first and fixing flaky tests afterwards is that random test failures become more pronounced, eroding the confidence in any automated test suite.

After we were sure we had caught all problems, we introduced an exponential back off patch which would wait cumulatively up to a second for the UI to be repainted. That's a long time for the UI to change.

October 20 2019

23:30

User permissions checking

Background

I've recently been working on restricting user access to executables on a Linux box. I removed all executable rights for others and added them back via access control lists for certain groups. For example, for cat it looked like this:
# getfacl /usr/bin/cat
getfacl: Removing leading '/' from absolute path names
# file: usr/bin/cat
# owner: root
# group: root
user::rwx
group::r-x
other::r--

For an executable I want the user to have access to, the permissions looked like this:


# getfacl /usr/bin/ping
getfacl: Removing leading '/' from absolute path names
# file: usr/bin/ping
# owner: root
# group: root
user::rwx
group::r-x
group:staff:--x
mask::r-x
other::r--

The user is in the staff group and can execute 'ping', while everyone else gets a permission denied.

The Test

As an automated test, I thought I'd go over all commands and produce a whitelist of executables a given user has access to.

The script looks a bit like this for a single executable:

# cat /tmp/foo.py
import os

# the user's group ID
os.setgid(2000)
# the user's user ID
os.setuid(2003)

print(os.access('/usr/bin/cat', os.X_OK))
print(os.access('/usr/bin/ping', os.X_OK))

When I ran it, I expected the check for the first executable to be false and the second to be true:


# python /tmp/foo.py
True
True

Huh, what? Doesn't the script run as my user, who's in the staff group?

Turns out there is more to a process than just the group and user ID. There are also supplementary groups and capabilities. Changing the script to call print(os.getgroups()) printed the supplementary groups of the user I was running the script as, which was root in this case. So I changed the script to also set the supplementary groups to those of the user:


import os

# the user's group ID
os.setgid(2000)
os.setgroups([2000, 2003])
# the user's user ID
os.setuid(2003)

print(os.access('/usr/bin/cat', os.X_OK))
print(os.access('/usr/bin/ping', os.X_OK))

and running it returns the right results:

# python /tmp/foo.py
False
True
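
Extending this into the whitelist over all commands mentioned above is then mostly a loop; a minimal sketch (the UID/GID values and the directory are illustrative):

import os

os.setgid(2000)
os.setgroups([2000, 2003])
os.setuid(2003)

bindir = '/usr/bin'
# collect every executable the user is actually allowed to run
whitelist = sorted(
    name for name in os.listdir(bindir)
    if os.access(os.path.join(bindir, name), os.X_OK)
)
for name in whitelist:
    print(name)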

Caveats

Restricting permissions with ACLs and testing the way I demonstrated above can lead to false positives for scripts. If you remove executable permissions from the script interpreter (e.g. /usr/bin/python) while keeping them, via an ACL, on the actual script, the test above will tell you it's all fine and dandy, while in reality the user will run into a permission denied.

February 07 2019

04:32

GHC: can’t find a package database

In case you're using the nix package manager and your nix build fails with:


these derivations will be built:
/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv
/nix/store/dmj2ax3qsa55jjl6by9fb9sk929k98nl-ghc-8.6.3-with-packages.drv
/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv
building '/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv'...
setupCompilerEnvironmentPhase
Build with /nix/store/cclv7n6jr311i5ywwkms1m3iz4lsg37j-ghc-8.6.3.
unpacking sources
unpacking source archive /nix/store/j23vlzlg2rmqy0a706h235j4v9zh4m9s-purebred
source root is purebred
patching sources
compileBuildDriverPhase
setupCompileFlags: -package-db=/build/setup-package.conf.d -j4 -threaded
Loaded package environment from /build/purebred/.ghc.environment.x86_64-linux-8.6.3
ghc: can't find a package database at /home/rjoost/.cabal/store/ghc-8.6.3/package.db
builder for '/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv' failed with exit code 1
cannot build derivation '/nix/store/dmj2ax3qsa55jjl6by9fb9sk929k98nl-ghc-8.6.3-with-packages.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv': 1 dependencies couldn't be built
error: build of '/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv' failed

then the solution is actually easier than you think. It happens when you run


cabal new-repl

inside a nix shell, because cabal creates a hidden environment file. So look for a


.ghc.environment.<arch>-<os>-<ghc version>
# for example on Linux with GHC 8.6.3
.ghc.environment.x86_64-linux-8.6.3

Delete it and you should be good to go.
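
From the project root, something like rm .ghc.environment.* should do it, though it's worth checking what the glob matches first.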

December 14 2018

02:04

Docker volume mount fails

I recently stumbled over this odd error message in one of our GitLab runners:

ERROR: for nginx_proxy Cannot start service load_balancer: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/builds/group/project/nginx/nginx.conf\\\" to rootfs \\\"/data/docker/overlay2/88fb8a0ee201dd14cfc9aa9befe4d7a5eb28e5ec816a2d76726040316853ed11/merged\\\" at \\\"/data/docker/overlay2/88fb8a0ee201dd14cfc9aa9befe4d7a5eb28e5ec816a2d76726040316853ed11/merged/etc/nginx/nginx.conf\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type

It is the result of running docker-compose up myservice for a service which is defined to use just an image and mounts files like so:

- ./nginx/nginx.conf:/etc/nginx/nginx.conf

I spent a bit of time figuring out what the underlying problem was. In hindsight, the error message already gives it away, but I was unable to reproduce the issue on my host machine. That is because the problem is actually more related to Docker than to your host.

When I found out that the GitLab runner is actually a Docker container, it dawned on me that what we do here is a container-in-a-container operation. The runner container typically shares the Docker instance with the host system, so the bind mount actually happens on the host machine. It tries to mount a directory/file which doesn't exist on the host machine.
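
(A common way this sharing is set up, and presumably what happened here, is that the host's /var/run/docker.sock is bind-mounted into the runner container, so any docker or docker-compose call inside it really talks to the host daemon.)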

To verify if we can reproduce the same error on the host, I tried to bind mount a volume with a path which doesn’t exist and voila:

$ sudo docker run --rm -it --volume /build/nginx/nginx.conf:/etc/nginx/nginx.conf nginx --help
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/build/nginx/nginx.conf\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/2fab14f3dc592d19b1408618a5ba26e88e334d88fe6b7524dc6c30bb0d26bbfc/rootfs\\\" at \\\"/var/lib/docker/devicemapper/mnt/2fab14f3dc592d19b1408618a5ba26e88e334d88fe6b7524dc6c30bb0d26bbfc/rootfs/etc/nginx/nginx.conf\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.

Be wary of running Docker in Docker when you need to bind mount volumes. Prefer a bare-metal or VM runner.

December 03 2018

09:30

“Start request repeated too quickly”

If one of your units is not running any more and you find this in your journal: 


● getmail.service - getmail
Loaded: loaded (/home/rjoost/.config/systemd/user/getmail.service; enabled; vendor preset: enabled)
Active: failed (Result: start-limit-hit) since Thu 2018-11-29 18:42:17 AEST; 3s ago
Process: 20142 ExecStart=/usr/bin/getmail --idle=INBOX (code=exited, status=0/SUCCESS)
Main PID: 20142 (code=exited, status=0/SUCCESS)

Nov 29 18:42:17 bali systemd[3109]: getmail.service: Service hold-off time over, scheduling restart.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Scheduled restart job, restart counter is at 5.
Nov 29 18:42:17 bali systemd[3109]: Stopped getmail.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Start request repeated too quickly.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Failed with result 'start-limit-hit'.
Nov 29 18:42:17 bali systemd[3109]: Failed to start getmail.

it might be because your command really does exit immediately; run the command manually to verify whether that's the case. Also check that you indeed have the unit configured with

Restart=always
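
For reference, a minimal user unit with restart behaviour could look roughly like this (a sketch only; apart from the ExecStart line from the journal above, the values are illustrative):

# ~/.config/systemd/user/getmail.service
[Unit]
Description=getmail
# raise these if legitimate restarts keep hitting the start limit
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
ExecStart=/usr/bin/getmail --idle=INBOX
Restart=always

[Install]
WantedBy=default.target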

If you're sure it really does not restart too quickly, you can reset the counter with:

$ systemctl reset-failed unit

Further information can be found in the man pages systemd.unit(5) and systemd.service(5).
