Cheney on the M.T.A. with native threads.

Using the C Profiler and Debugger with Cyclone Scheme

June 5th, 2017

While going through a new run of the R7RS benchmarks from Larceny, I noticed Cyclone performed significantly worse than other schemes on the tail benchmark. Certainly when testing locally the results are disappointing:

[justin@justin-pc r7rs-benchmarks]$ ./bench cyclone tail

Testing tail under Cyclone
Including postlude /home/justin/Documents/r7rs-benchmarks/src/Cyclone-postlude.scm
Running tail:25
Elapsed time: 32.261715 seconds (32) for tail:25

real    0m32.379s
user    0m31.783s
sys     0m0.513s

One of the easiest things to do is run a profiler on the code to figure out what is going on. This lets us see if there is something in the runtime or compiled code that is dominating the execution time, and possibly slowing things down. This isn’t a catch-all - for example, it can’t show us if a compiler optimization is needed. But it helps paint a picture of what is going on.

A compiled Cyclone program is just a regular C program so we can use the standard GNU tools for profiling and debugging.

To get started we change Makefile.config in cyclone-bootstrap to enable profiling. The -O2 option in the lines below are replaced with -g -pg:

CFLAGS       ?= -g -pg -fPIC -rdynamic -Wall -Iinclude -L.
COMP_CFLAGS  ?= -g -pg -fPIC -rdynamic -Wall -I$(PREFIX)/include -L$(PREFIX)/lib

Then Cyclone must be rebuilt:

[justin@justin-pc cyclone-bootstrap]$ sudo make clean ; ./

Once this is done a gmon.out file will be generated each time Cyclone or a compiled Cyclone program is executed. This can be used to create a detailed analysis of what the program is doing at runtime.

Now we perform set up for running the tail benchmark directly:

[justin@justin-pc r7rs-benchmarks]$ cd /tmp/larcenous/Cyclone/
[justin@justin-pc Cyclone]$ cp -r ~/Documents/r7rs-benchmarks/inputs/ .
[justin@justin-pc Cyclone]$ mkdir outputs

And recompile tail.scm to get a version with profiling. We then run it to generate a gmon.out file:

[justin@justin-pc Cyclone]$ cyclone  tail.scm
[justin@justin-pc Cyclone]$ ./tail < inputs/tail.input

Then we run gprof to create a report:

[justin@justin-pc Cyclone]$ ls
gmon.out  inputs  outputs  tail  tail.c  tail.o  tail.scm
[justin@justin-pc Cyclone]$ gprof ./tail gmon.out > report.txt

Let’s examine the start of report.txt to see the functions that account for most of the program’s runtime:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 99.19     61.09    61.09  2331598     0.00     0.00  Cyc_length
  0.10     61.15     0.06    11221     0.00     0.00  gc_minor
  0.08     61.20     0.05   777200     0.00     0.00  __lambda_3
  0.08     61.25     0.05   777097     0.00     0.00  __lambda_300
  0.03     61.27     0.02  4664280     0.00     0.00  Cyc_st_add
  0.03     61.29     0.02  1562923     0.00     0.00  gc_thr_add_to_move_buffer
  0.03     61.31     0.02   777572     0.00     0.00  __lambda_278
  0.03     61.33     0.02   777178     0.00     0.00  dispatch
  0.03     61.35     0.02   777147     0.00     0.00  __lambda_297
  0.03     61.37     0.02   777135     0.00     0.00  __lambda_368
  0.03     61.39     0.02       59     0.00     0.00  gc_empty_collector_stack
  0.02     61.41     0.02  5441402     0.00     0.00  Cyc_is_pair
  0.02     61.42     0.02                             ck_pr_cas_char
  0.02     61.43     0.01  3109193     0.00     0.00  Cyc_is_null
  0.02     61.44     0.01  3109170     0.00     0.00  Cyc_car

Well that’s interesting, tail is spending all of its time computing Cyc_length.

Again, Cyclone compiles programs to C, so we can use gdb to debug them and figure out how Cyc_length is being called.

First we need to know what inputs to use:

[justin@justin-pc Cyclone]$ cat inputs/tail.input

To run gdb:

[justin@justin-pc Cyclone]$ gdb ./tail
GNU gdb (GDB) 7.12.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tail...done.
(gdb) run
Starting program: /tmp/larcenous/Cyclone/tail
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/".
[New Thread 0x7ffff6031700 (LWP 1850)]
Thread 1 "tail" received signal SIGINT, Interrupt.
0x00007ffff6fd44ed in read () from /usr/lib/
(gdb) break Cyc_length
Breakpoint 1 at 0x53b145: file runtime.c, line 1713.
(gdb) c

After continuing a few times, the code breaks here:

#0  Cyc_length (data=0x7cd4e0, l=0x7ffffffbc660) at runtime.c:1713
#1  0x00000000004af96b in __lambda_368 (data=0x7cd4e0, argc=3, _=0x7ffff6635340, k_734906=0x7ffff63331c0, f_731928=0x7ffffffb2dd0, lis1_731927=0x7ffffffbc660,
    lists_731926_raw=0x7ca420 <Cyc_void_symbol>) at scheme/base.c:22680
#2  0x00000000004afde1 in __lambda_367 (data=0x7cd4e0, argc=1, self_736727=0x7ffffff9d4c0, r_734921=0x7ca420 <Cyc_void_symbol>) at scheme/base.c:22703
#3  0x00000000004b7aa7 in __lambda_299 (data=0x7cd4e0, argc=3, _=0x7ffff66350e0, k_735061=0x7ffffff9d4c0, char_731994=0x2a, port_731993_raw=0x7ffff6333200) at scheme/base.c:23706
#4  0x000000000055cf5e in do_dispatch (data=0x7cd4e0, argc=3, func=0x4b7523 <__lambda_299>, clo=0x7ffff66350e0, b=0x7ffffff9ccf0) at dispatch.c:6

Opening the source code for scheme/base.c you can see the code breaks in the for-each function:

static void __lambda_368(void *data, int argc, closure _,object k_734906, object f_731928, object lis1_731927, object lists_731926_raw, ...) {
load_varargs(lists_731926, lists_731926_raw, argc - 3);
Cyc_st_add(data, "scheme/base.sld:for-each");
if( (boolean_f != Cyc_is_null(lis1_731927)) ){

And in scheme/base.sld you can see where length is being called:

(define (for-each f lis1 . lists)
  (if (not (null? lis1))
    (if (pair? lists)
      (let recur ((lists (cons lis1 lists)))
        (receive (cars cdrs) (%cars+cdrs lists)
          (if (pair? cars)
                (apply f cars)
                (recur cdrs)))))
      ;; Fast path.
      (if (eq? 1 (length lis1))
        (f (car lis1))
        (begin (f (car lis1))
               (for-each f (cdr lis1)))))))

The code can be simplified to make it more obvious what is going on:

(define (for-each f lis1 . lists)
  (if (not (null? lis1))
      (if (eq? 1 (length lis1))
        (f (car lis1))
        (begin (f (car lis1))
               (for-each f (cdr lis1))))))

Basically on every iteration of for-each the code is calling length to see if f can be called directly. Well, that’s not good - the main for-each loop itself has a time complexity of O(n). That is, its execution time depends directly on the length of lis1. But each time length is called it must examine the entire contents of lis1, which is another O(n) operation. Combined with the outer loop this raises the overall time complexity to O(n^2) - which can really add up for large values of n.

This reminds me of an old article from Joel Spolsky that talks about the same issue with respect to strings:

Is their function fast? Look at how many times they call strlen. I’ve seen O(n^2) algorithms for strrev when it should be O(n), because they are calling strlen again and again in a loop.

The solution is to check directly for null, instead of scanning the whole string:

(if (null? (cdr lis1))
  (f (car lis1))
  (begin (f (car lis1))

After rebuilding with this fix we can re-run the tail benchmark:

[justin@justin-pc r7rs-benchmarks]$ ./bench cyclone tail

Testing tail under Cyclone
Including postlude /home/justin/Documents/r7rs-benchmarks/src/Cyclone-postlude.scm
Running tail:25
Elapsed time: 0.72314 seconds (1) for tail:25

real    0m0.729s
user    0m0.540s
sys     0m0.187s 

Whoa! Remember how the older code took over 32 seconds to finish? Now it finishes in less than a second. Not bad.