David Moreau Simard

We’ve had issues open for a while about benchmarking and improving the performance of both the ara callback plugin and the API server, and I only recently took some time to tackle the callback.

It paid off: the 1.5.3 release of ara already shows significant performance benefits, and the work builds a foundation for future improvements.

If you’d like to see the raw unformatted data that was used for this post, check out this gist on GitHub.

A benchmarking playbook

Whenever you want to improve something, it’s important to measure it first so you know how much better (or worse!) things become as a result of your changes.

The first step was to create a standardized benchmarking playbook that we could run across a variety of configurations and parameters.

The playbook was designed to run a specified number of tasks against a specified number of hosts.

It’s available in the git repository but it’s simple and small enough to include here:

# Copyright (c) 2020 The ARA Records Ansible authors
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

- name: Create many hosts
  hosts: localhost
  gather_facts: no
  vars:
    benchmark_host_count: 25
  tasks:
    - name: Add a host to the inventory
      add_host:
        ansible_connection: local
        hostname: "host-{{ item }}"
        groups: benchmark
      with_sequence: start=1 end={{ benchmark_host_count }}

- name: Run tasks on many hosts
  hosts: benchmark
  vars:
    benchmark_task_file: "{{ playbook_dir }}/benchmark_tasks.yaml"
    # Run N tasks per host
    benchmark_task_count: 50
    # Off by default to prevent accidental load spike on localhost
    benchmark_gather_facts: no
  gather_facts: "{{ benchmark_gather_facts }}"
  tasks:
    - name: Include a task file
      include_tasks: "{{ benchmark_task_file }}"
      with_sequence: start=1 end={{ benchmark_task_count }}

and then the benchmark_tasks.yaml file:

# Copyright (c) 2020 The ARA Records Ansible authors
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

# These are tasks meant to be imported by benchmark.yaml

- name: Run a task
  debug:
    msg: "{{ inventory_hostname }} running task {{ item }}/{{ benchmark_task_count }}"


All tests ran with Ansible 2.10.2 and ANSIBLE_FORKS=50, using the default sqlite database backend for ara.

The benchmark playbook was run three times:

# 25 hosts and 50 tasks: 1276 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml

# 100 hosts and 50 tasks: 5101 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
    -e benchmark_host_count=100

# 200 hosts and 200 tasks: 40201 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
    -e benchmark_host_count=200 \
    -e benchmark_task_count=200

Note: localhost and two bootstrap tasks are included in the results below
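The result counts in the comments above are consistent with a simple formula: one result per task for each benchmark host, plus one extra result per benchmark host and one for the bootstrap task on localhost. This breakdown is inferred from the numbers in this post rather than taken from ara's documentation, so treat the small Python check below as a sketch under that assumption:

```python
def expected_results(host_count: int, task_count: int) -> int:
    """Number of results for one benchmark run (inferred formula, an assumption)."""
    per_host_tasks = host_count * task_count  # the "Run a task" results
    per_host_bootstrap = host_count           # one extra result per benchmark host
    localhost_bootstrap = 1                   # the bootstrap task on localhost
    return per_host_tasks + per_host_bootstrap + localhost_bootstrap


for hosts, tasks in [(25, 50), (100, 50), (200, 200)]:
    print(f"{hosts} hosts x {tasks} tasks: {expected_results(hosts, tasks)} results")
```

If the formula holds, it gives a quick way to estimate how many results a different combination of hosts and tasks would record before running it.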

ansible without ara

tasks  hosts  results  duration
52     26     1276     0m 11s
52     101    5101     0m 41s
202    201    40201    6m 01s

Our control: how long these playbooks take to run without ara enabled, so we can calculate the overhead of enabling the ara callback.

ansible with ara 1.5.1

api client  api server  tasks  hosts  results  duration
offline     django      52     26     1276     0m 56s
http        django      52     26     1276     0m 32s
http        gunicorn    52     26     1276     0m 36s
offline     django      52     101    5101     3m 31s
http        django      52     101    5101     1m 39s
http        gunicorn    52     101    5101     2m 19s
offline     django      202    201    40201    30m 22s
http        django      202    201    40201    17m 28s
http        gunicorn    202    201    40201    21m 38s

1.5.1 is the last version without threading in the callback plugin.

It’s curious that the django built-in webserver outperformed running with gunicorn when using the http client. I was not able to reproduce this result in 1.5.3.

I was aware that enabling the callback added overhead, but I never realized how large the performance hit was until I took the time to measure it accurately. The table below uses the fastest 1.5.1 configuration (http client with the django server):

results  without ara  1.5.1    overhead
1276     0m 11s       0m 32s   21s
5101     0m 41s       1m 39s   58s
40201    6m 01s       17m 28s  ~11.4m
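The overhead column is simply the duration with ara minus the duration without it. A minimal Python sketch of that arithmetic, where `to_seconds` and `overhead` are hypothetical helper names introduced only for illustration:

```python
import re


def to_seconds(duration: str) -> int:
    """Convert strings such as '17m28s', '1m 09s' or '41s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


def overhead(without_ara: str, with_ara: str) -> int:
    """Seconds added to a playbook run by enabling the callback."""
    return to_seconds(with_ara) - to_seconds(without_ara)


for baseline, with_ara in [("11s", "32s"), ("41s", "1m39s"), ("6m01s", "17m28s")]:
    print(f"{baseline} -> {with_ara}: {overhead(baseline, with_ara)}s of overhead")
```

Working in seconds avoids rounding surprises when the durations are later compared across versions.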

ansible with ara 1.5.3

api client  api server  tasks  hosts  results  duration
offline     django      52     26     1276     0m 52s
http        django      52     26     1276     0m 30s
http        gunicorn    52     26     1276     0m 20s
offline     django      52     101    5101     3m 22s
http        django      52     101    5101     1m 37s
http        gunicorn    52     101    5101     1m 09s
offline     django      202    201    40201    29m 25s
http        django      202    201    40201    17m 24s
http        gunicorn    202    201    40201    13m 47s

1.5.2 introduced threading in the callback, and 1.5.3 was then released to work around an issue with the offline client by forcing it to use a single thread for now.

From the table above, we can tell:

  • Running a single thread with the offline client performs about the same as 1.5.1 without threading, if only a little bit faster.
  • There is a significant improvement in performance due to multi-threading when using the http client.
  • Running the API server with gunicorn outperforms the built-in django development server.
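To illustrate why threading helps with the http client, here is a generic Python sketch that submits slow "API calls" to a thread pool instead of making them one at a time. It is not ara's actual implementation: `record_result` is a hypothetical stand-in that sleeps to simulate network latency, and the thread count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def record_result(result_id: int, latency: float = 0.01) -> int:
    """Hypothetical stand-in for an HTTP request to an API server."""
    time.sleep(latency)  # simulated network and server time
    return result_id


def record_blocking(result_ids):
    """Send each result and wait for it before sending the next one."""
    return [record_result(result_id) for result_id in result_ids]


def record_threaded(result_ids, threads: int = 4):
    """Submit every result to a thread pool so that the waits overlap."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(record_result, result_id) for result_id in result_ids]
        return [future.result() for future in futures]


for recorder in (record_blocking, record_threaded):
    start = time.perf_counter()
    recorder(range(20))
    print(f"{recorder.__name__}: {time.perf_counter() - start:.2f}s")
```

Because the work is dominated by waiting on I/O rather than by computation, threads can overlap the waits even under Python's global interpreter lock.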

1.5.3 reduced the overhead of the callback plugin quite a bit compared to 1.5.1. The table below uses the fastest 1.5.3 configuration (http client with gunicorn):

results  without ara  1.5.3    overhead
1276     0m 11s       0m 20s   9s
5101     0m 41s       1m 09s   28s
40201    6m 01s       13m 47s  ~7.8m
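Put as a relative change, the overhead dropped by roughly 57%, 52% and 32% for the three runs. The sketch below reproduces that arithmetic in Python from durations converted to seconds. Note that it compares the fastest configuration of each release (http with django for 1.5.1, http with gunicorn for 1.5.3), so it is not a like-for-like comparison of a single setup.

```python
# Durations in seconds, transcribed from the tables in this post.
BASELINE = [11, 41, 361]   # ansible without ara
ARA_151 = [32, 99, 1048]   # 1.5.1, http client, django server
ARA_153 = [20, 69, 827]    # 1.5.3, http client, gunicorn


def overhead_reduction(baseline, before, after):
    """Percentage by which the callback overhead shrank for each run."""
    reductions = []
    for base, old, new in zip(baseline, before, after):
        old_overhead = old - base
        new_overhead = new - base
        reductions.append(100 * (old_overhead - new_overhead) / old_overhead)
    return reductions


for results, percent in zip([1276, 5101, 40201],
                            overhead_reduction(BASELINE, ARA_151, ARA_153)):
    print(f"{results} results: overhead reduced by {percent:.0f}%")
```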

For science: ara 0.16.8

tasks  hosts  results  duration
52     26     1276     0m 28s
52     101    5101     1m 56s
202    201    40201    19m 05s

Although ara 0.x is no longer supported, it turns out it still works if you use the stable/0.x git branch, even with Ansible 2.10!

It was interesting to run the same benchmark against 0.x because it runs a completely different backend. It uses flask instead of django and doesn’t provide an API: the callback talks directly to the database through flask-sqlalchemy.

Putting it all together

Tallying up the numbers, we can see that we’re on the right track and performance is improving:

tasks  hosts  results  without ara  0.16.8   1.5.1    1.5.3
52     26     1276     0m 11s       0m 28s   0m 32s   0m 20s
52     101    5101     0m 41s       1m 56s   1m 39s   1m 09s
202    201    40201    6m 01s       19m 05s  17m 28s  13m 47s
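Another way to read the summary table is as a cost per recorded result: the extra time divided by the number of results. This metric is not part of the original measurements, only a derived view of them. The Python sketch below computes it in milliseconds from the durations converted to seconds:

```python
# Durations in seconds, transcribed from the summary table.
RESULTS = [1276, 5101, 40201]
DURATIONS = {
    "without ara": [11, 41, 361],
    "0.16.8": [28, 116, 1145],
    "1.5.1": [32, 99, 1048],
    "1.5.3": [20, 69, 827],
}


def overhead_ms_per_result(version):
    """Milliseconds of overhead per recorded result for each run."""
    baseline = DURATIONS["without ara"]
    return [
        (with_ara - without) * 1000 / results
        for with_ara, without, results in zip(DURATIONS[version], baseline, RESULTS)
    ]


for version in ("0.16.8", "1.5.1", "1.5.3"):
    per_result = ", ".join(f"{ms:.1f}ms" for ms in overhead_ms_per_result(version))
    print(f"{version}: {per_result}")
```

A per-result figure makes runs of different sizes easier to compare, although with only three data points per version it should be read as indicative at best.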

There is definitely more work to do and more opportunities to improve performance. There will unfortunately always be some overhead, but it needs to be low enough to be worth it without sacrificing simplicity.

In the future, it could be interesting to measure the impact of other parameters on performance, like:

  • Ansible forks – what difference does having 25, 100 or 200 forks make?
  • Callback threads – is there a benefit to running more threads in the threadpool?
  • Version of Python – is there any difference between python 3.5 and 3.9?
  • Version of Ansible – were there any performance improvements or regressions between 2.8 and 2.10?
  • Database backend – is sqlite faster than mysql? What about postgresql?
  • Application backend – is gunicorn faster than uwsgi? What about apache mod_wsgi?
  • Latency – what’s the impact on performance of adding a jump box? What about 50ms? 250ms?
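Several of these parameters could be explored by wrapping the benchmark commands shown earlier in a small driver. The Python sketch below only builds the commands and a parameter matrix. `build_command`, `time_run` and `matrix` are hypothetical helpers, and `time_run` assumes `ansible-playbook` is installed and that it is run from the root of the ara repository.

```python
import itertools
import os
import subprocess
import time

# Playbook path taken from the commands earlier in this post.
PLAYBOOK = "tests/integration/benchmark.yaml"


def build_command(host_count: int, task_count: int) -> list:
    """ansible-playbook invocation for one benchmark run."""
    return [
        "ansible-playbook", "-i", "localhost,", "-c", "local", PLAYBOOK,
        "-e", f"benchmark_host_count={host_count}",
        "-e", f"benchmark_task_count={task_count}",
    ]


def time_run(host_count: int, task_count: int, forks: int) -> float:
    """Run the benchmark once and return the elapsed wall-clock seconds."""
    env = dict(os.environ, ANSIBLE_FORKS=str(forks))
    start = time.perf_counter()
    subprocess.run(build_command(host_count, task_count), env=env, check=True)
    return time.perf_counter() - start


def matrix(host_counts, task_counts, fork_counts):
    """Every combination of hosts, tasks and forks to benchmark."""
    return list(itertools.product(host_counts, task_counts, fork_counts))


for hosts, tasks, forks in matrix([25, 100], [50], [25, 100, 200]):
    print(f"forks={forks}:", " ".join(build_command(hosts, tasks)))
```

Calling `time_run(hosts, tasks, forks)` for each combination would give one duration per cell. Repeating each run a few times would help separate real differences from noise.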

If you’d like to help, have a look at the issues on GitHub or come chat with us on Slack or IRC!

See you around o/