[hatari-devel] Profiling Hatari code with Valgrind

Nicolas Pomarède npomarede at corp.free.fr
Fri Jan 7 22:53:37 CET 2011


On 07/01/2011 22:28, Eero Tamminen wrote:
> Hi,
>
> On Friday 07 January 2011, Laurent Sallafranque wrote:
>> I don't agree with you here.
>>
>> Update_e_u_n_z is called for nearly every other instruction in the DSP
>> (each instruction that is decoded in the else branch of the main DSP
>> instruction decoder).
>>
>> This means mac, mpy, add, sub, test, cmp, ...
>> Nearly all DSP programs use these instructions a lot (and use them
>> millions of times).
>>
>> I've done a quick Valgrind run tonight to see the difference before this
>> update and after it.
>> The difference is not negligible (compared to the png you sent last
>> time).
>>
>> I'm running Hatari under Valgrind without this optimization. I'll send
>> you the 2 pngs tonight.
>
> In general, while profiler output can indicate performance changes, it's
> better to have something that actually measures it[1], like I suggested in
> previous mail (memory snapshot, --run-vbls & --frame-skips etc).
> Especially if the change isn't localized within a single function, but
> also affects how functions are called.
>
> [1] Performance measuring and profiling/analysing are two different things
> (I've done that kind of stuff at work for years), you typically need to use
> different means for each.
>
>
> Looking at the diff for your changes, they seem to be localized though, so
> in this case a profiler (especially one like Valgrind) could be reliable.
>
> Note though that while the boxed view gives a nice overall picture,
> if you want to know the total individual percentage of a given function,
> look into the callgraph or the inclusive % column in the table at left.
>
>

I didn't try the new falcon code, but from the graphs it seems to be 
faster. Nevertheless, I also agree with Eero that a more precise way to 
measure the improvement is to run a snapshot for a fixed number of VBLs 
and see if it takes more or less time.

In the case of "simpler" functions, another good indicator could be to 
just build a small test program that does 1000000 calls of the old and 
new versions of Update_e_u_n_z and see which one is faster (this way 
you can get a real percentage for the speedup).
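
As a rough sketch of such a test program, the C code below times two 
versions of a flag-update function. Note that flags_old() and flags_new() 
are hypothetical stand-ins written for this example, not the real 
Update_e_u_n_z code, and the flag bit positions are assumptions as well.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins: NOT the real Hatari Update_e_u_n_z code. */
typedef uint32_t (*flag_fn)(uint32_t);

/* "Old" variant: sets a zero flag (bit 0) and a negative flag (bit 1)
 * using conditionals. */
uint32_t flags_old(uint32_t value)
{
    uint32_t flags = 0;
    if (value == 0)
        flags |= 1u;
    if (value & 0x800000u)      /* bit 23: sign bit of a 24-bit word */
        flags |= 2u;
    return flags;
}

/* "New" variant: same result, computed without conditionals. */
uint32_t flags_new(uint32_t value)
{
    return (uint32_t)(value == 0) | (((value >> 23) & 1u) << 1);
}

/* Call fn `calls` times and return the elapsed CPU time in seconds.
 * Results are summed into *sink so the compiler cannot drop the loop. */
double time_calls(flag_fn fn, unsigned long calls, uint32_t *sink)
{
    assert(fn != NULL && sink != NULL);
    uint32_t acc = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < calls; i++)
        acc += fn((uint32_t)i * 2654435761u);   /* spread the input bits */
    clock_t end = clock();
    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(flags_old, 1000000, &sink) and 
time_calls(flags_new, 1000000, &sink) from a main() and comparing the two 
results gives the speedup. With only 1000000 calls the run may be shorter 
than the resolution of clock(), so a larger count or several repeats may 
be needed, and both versions should be built with the same compiler flags 
as Hatari itself.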

It's not always possible, but an "if" can sometimes be removed by 
using binary and/or operations or similar tricks. Sometimes 2 instructions 
that are always executed can be faster than 1 conditional "if".
Code where you handle bits is often a good candidate for this kind of 
optimisation (I remember greatly optimizing some popular depacking 
routines on 68000 this way years ago).

But there's no general rule, it's really up to your imagination.
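
To illustrate the idea of replacing an "if" with binary and/or, here is a 
minimal C sketch. FLAG_Z and both function names are hypothetical and only 
chosen for this example; they are not taken from the Hatari DSP code.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_Z 0x04u    /* hypothetical position of a "zero" flag bit */

/* Branching version: one conditional decides whether the bit is
 * set or cleared. */
uint32_t set_zero_flag_branch(uint32_t sr, uint32_t result)
{
    if (result == 0)
        sr |= FLAG_Z;
    else
        sr &= ~FLAG_Z;
    return sr;
}

/* Branchless version: always clear the bit, then OR in a mask that is
 * all ones when result == 0 and all zeros otherwise. */
uint32_t set_zero_flag_mask(uint32_t sr, uint32_t result)
{
    uint32_t mask = -(uint32_t)(result == 0);   /* 0xFFFFFFFF or 0 */
    return (sr & ~FLAG_Z) | (mask & FLAG_Z);
}

/* Quick self-check that both versions agree. */
void check_zero_flag_versions(void)
{
    for (uint32_t sr = 0; sr < 16; sr++) {
        assert(set_zero_flag_branch(sr, 0) == set_zero_flag_mask(sr, 0));
        assert(set_zero_flag_branch(sr, 7) == set_zero_flag_mask(sr, 7));
    }
}
```

Whether the second form is really faster depends on the compiler and CPU 
(an optimizing compiler may already turn the "if" into a conditional move), 
so it is worth measuring both.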

>
> It might be possible that my i3 CPU scales its speed according to
> the load[2] though and that's why I don't see the differences.  I hadn't
> thought of that earlier. If it does scaling, then "top" isn't a valid way
> to measure anything.

It certainly does; check "cat /proc/cpuinfo" to see if the "cpu MHz" 
value changes depending on the overall load. In that case, using top is 
not a good option.
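
If you want to watch this from a small program instead of "cat", the C 
sketch below prints the "cpu MHz" value of each core. It assumes the usual 
Linux x86 layout of /proc/cpuinfo ("cpu MHz" followed by a colon and the 
value); the function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one /proc/cpuinfo line. Returns 1 and stores the value in *mhz
 * if this is a "cpu MHz" line, 0 otherwise. */
int parse_cpu_mhz(const char *line, double *mhz)
{
    assert(line != NULL && mhz != NULL);
    if (strncmp(line, "cpu MHz", 7) != 0)
        return 0;
    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return 0;
    return sscanf(colon + 1, "%lf", mhz) == 1;
}

/* Print the current frequency of every core listed in `path`
 * (normally "/proc/cpuinfo"). Returns the number of cores found,
 * or -1 if the file cannot be opened. */
int print_cpu_mhz(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char line[256];
    double mhz;
    int count = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_cpu_mhz(line, &mhz))
            printf("core %d: %.0f MHz\n", count++, mhz);
    }
    fclose(f);
    return count;
}
```

Calling print_cpu_mhz("/proc/cpuinfo") once while the machine is idle and 
once while Hatari is running shows whether the frequency follows the load.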

The best option is a profiler that would completely emulate an i5x6 cpu 
with cycle precise values for each instruction. The program will usually 
run very slowly, but in the end you get an exact cycle count of what 
happened. If I recall correctly, there are such profilers under linux, but 
I don't remember their names.



Nicolas


