We’ve had issues open for a while now about benchmarking and improving the performance of the ara callback plugin as well as the API server, and I only recently took some time to tackle the callback.

It paid off: we can already see significant performance benefits in the 1.5.3 release of ara, and it lays a foundation for future improvements.

If you’d like to see the raw unformatted data that was used for this post, check out this gist on GitHub.

A benchmarking playbook

Whenever you want to improve something, it’s important to measure it first so you know how much better (or worse!) things become as a result of your changes.

The first step was to create a standardized benchmarking playbook that we could run across a variety of configurations and parameters.

The playbook was designed to run a specified number of tasks against a specified number of hosts.

It’s available in the git repository but it’s simple and small enough to include here:

# Copyright (c) 2020 The ARA Records Ansible authors
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

- name: Create many hosts
  hosts: localhost
  gather_facts: no
  vars:
    benchmark_host_count: 25
  tasks:
    - name: Add a host to the inventory
      add_host:
        ansible_connection: local
        hostname: "host-{{ item }}"
        groups: benchmark
      with_sequence: start=1 end={{ benchmark_host_count }}

- name: Run tasks on many hosts
  hosts: benchmark
  vars:
    benchmark_task_file: "{{ playbook_dir }}/benchmark_tasks.yaml"
    # Run N tasks per host
    benchmark_task_count: 50
    # Off by default to prevent accidental load spike on localhost
    benchmark_gather_facts: no
  gather_facts: "{{ benchmark_gather_facts }}"
  tasks:
    - name: Include a task file
      include_tasks: "{{ benchmark_task_file }}"
      with_sequence: start=1 end={{ benchmark_task_count }}

and then the benchmark_tasks.yaml file:

# Copyright (c) 2020 The ARA Records Ansible authors
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)

# These are tasks meant to be imported by benchmark.yaml

- name: Run a task
  debug:
    msg: "{{ inventory_hostname }} running task {{ item }}/{{ benchmark_task_count }}"

Methodology

All tests ran under Ansible 2.10.2 with ANSIBLE_FORKS=50, using the default sqlite database backend for ara.

The benchmark playbook was run three times:

# 25 hosts and 50 tasks: 1276 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml

# 100 hosts and 50 tasks: 5101 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
    -e benchmark_host_count=100

# 200 hosts and 200 tasks: 40201 results
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
    -e benchmark_host_count=200 \
    -e benchmark_task_count=200
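
For the runs with ara enabled, the callback and its API client were configured through environment variables, roughly as follows. This is a sketch based on ara’s documented settings; refer to the ara documentation for the full list:

# make Ansible aware of the ara callback plugin
export ANSIBLE_CALLBACK_PLUGINS=$(python3 -m ara.setup.callback_plugins)

# offline client: the callback uses an embedded API server (the default)
export ARA_API_CLIENT=offline

# http client: the callback sends results to a running API server
export ARA_API_CLIENT=http
export ARA_API_SERVER=http://127.0.0.1:8000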

Note: localhost and the two bootstrap tasks are included in the results below.

ansible without ara

tasks   hosts   results   duration
52      26      1276      0m 11s
52      101     5101      0m 41s
202     201     40201     6m 01s

This is our control: how long these playbooks take to run without ara enabled, so we can calculate the overhead and performance impact of enabling the ara callback.

ansible with ara 1.5.1

api client   api server   tasks   hosts   results   duration
offline      django       52      26      1276      0m 56s
http         django       52      26      1276      0m 32s
http         gunicorn     52      26      1276      0m 36s
offline      django       52      101     5101      3m 31s
http         django       52      101     5101      1m 39s
http         gunicorn     52      101     5101      2m 19s
offline      django       202     201     40201     30m 22s
http         django       202     201     40201     17m 28s
http         gunicorn     202     201     40201     21m 38s

1.5.1 is the latest version that didn’t implement threading inside the callback plugin.

It’s curious that the django built-in webserver outperformed running with gunicorn when using the http client. I was not able to reproduce this result in 1.5.3.

I knew there was an overhead when enabling the callback, but until taking the time to measure it accurately, I never realized the performance hit was this large:

results   without ara   1.5.1     overhead
1276      11s           32s       21s
5101      41s           1m 39s    58s
40201     6m 01s        17m 28s   ~11.4m

ansible with ara 1.5.3

api client   api server   tasks   hosts   results   duration
offline      django       52      26      1276      0m 52s
http         django       52      26      1276      0m 30s
http         gunicorn     52      26      1276      0m 20s
offline      django       52      101     5101      3m 22s
http         django       52      101     5101      1m 37s
http         gunicorn     52      101     5101      1m 09s
offline      django       202     201     40201     29m 25s
http         django       202     201     40201     17m 24s
http         gunicorn     202     201     40201     13m 47s

1.5.2 introduced threading in the callback and 1.5.3 was subsequently released to work around an issue when using the offline client, forcing it to use a single thread for now.

From the table above, we can tell:

  • Running a single thread with the offline client performs about the same as 1.5.1 without threading, if only a little faster.
  • There is a significant performance improvement from the multi-threading when using the http client.
  • Running the API server with gunicorn outperforms the built-in django development server (see the sketch below).
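
For reference, here is roughly what the two ways of serving the API compared above look like. This is a minimal sketch: ara-manage wraps the django management commands and the gunicorn invocation assumes the WSGI application lives at ara.server.wsgi, so check the ara documentation for the authoritative commands.

# django built-in development server (not meant for production)
ara-manage runserver

# gunicorn serving the ara API with a few workers
gunicorn --workers=4 ara.server.wsgi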

1.5.3 reduced the overhead of the callback plugin quite a bit compared to 1.5.1:

results   without ara   1.5.3     overhead
1276      11s           20s       9s
5101      41s           1m 09s    28s
40201     6m 01s        13m 47s   ~7.8m

For science: ara 0.16.8

tasks   hosts   results   duration
52      26      1276      0m 28s
52      101     5101      1m 56s
202     201     40201     19m 05s

Although ara 0.x is no longer supported, it turns out it still works if you use the stable/0.x git branch... even with Ansible 2.10!

It was interesting to run the same benchmark against 0.x because it runs a completely different backend. It uses flask instead of django and doesn’t provide an API: the callback talks directly to the database through flask-sqlalchemy.
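
If you’d like to try it, installing from that branch looks something like this (a sketch, assuming the stable/0.x branch lives in the ansible-community/ara repository on GitHub):

# install ara 0.x from the stable/0.x git branch
python3 -m pip install "git+https://github.com/ansible-community/ara@stable/0.x"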

Putting it all together

Tallying up the numbers, we can see that we’re on the right track and performance is improving:

tasks   hosts   results   without ara   0.16.8    1.5.1     1.5.3
52      26      1276      11s           28s       32s       20s
52      101     5101      41s           1m 56s    1m 39s    1m 09s
202     201     40201     6m 01s        19m 05s   17m 28s   13m 47s

There is definitely more work to do and more opportunities to find for improving performance. There will unfortunately always be some overhead, but it needs to be low enough to be worth it without sacrificing simplicity.

In the future, it could be interesting to measure the impact of other parameters on performance, like the following (a sketch of varying a few of them comes after the list):

  • Ansible forks – what difference does having 25, 100 or 200 forks make?
  • Callback threads – is there a benefit to running more threads in the threadpool?
  • Version of Python – is there any difference between python 3.5 and 3.9?
  • Version of Ansible – were there any performance improvements or regressions between 2.8 and 2.10?
  • Database backend – is sqlite faster than mysql? what about postgresql?
  • Application backend – is gunicorn faster than uwsgi? what about apache mod_wsgi?
  • Latency – what’s the impact on performance of adding a jump box? what about 50ms? 250ms?
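
As a rough sketch, a few of these parameters are easy to vary from the command line or the environment. The ARA_DATABASE_* variables below are based on ara’s documented settings; double-check the names against the documentation for your version:

# vary the number of Ansible forks for a run
ANSIBLE_FORKS=100 ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml

# point the API server at postgresql instead of the default sqlite
export ARA_DATABASE_ENGINE=django.db.backends.postgresql
export ARA_DATABASE_NAME=ara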

If you’d like to help, have a look at the issues on GitHub or come chat with us on Slack or IRC!

See you around o/