[hatari-devel] Profiling Hatari code with Valgrind
Nicolas Pomarède
npomarede at corp.free.fr
Fri Jan 7 22:53:37 CET 2011
On 07/01/2011 22:28, Eero Tamminen wrote:
> Hi,
>
> On Friday 07 January 2011, Laurent Sallafranque wrote:
>> I don't agree with you here.
>>
>> Update_e_u_n_z is called for nearly every other instruction in the DSP
>> (every instruction that is decoded in the else branch of the main DSP
>> instruction decoder).
>>
>> This means mac, mpy, add, sub, test, cmp, ...
>> Nearly all DSP programs use these instructions a lot (and use them
>> millions of times).
>>
>> I've done a quick Valgrind run tonight to see the difference before this
>> update and after it.
>> The difference is not negligible (compared to the png you sent last
>> time).
>>
>> I'm running Hatari under Valgrind without this optimization. I'll send
>> you the 2 pngs tonight.
>
> In general, while profiler output can indicate performance changes, it's
> better to have something that actually measures it[1], like I suggested in
> my previous mail (memory snapshot, --run-vbls & --frame-skips etc).
> Especially if the change isn't localized within a single function, but
> also affects how functions are called.
>
> [1] Performance measuring and profiling/analysing are two different things
> (I've done that kind of stuff at work for years), you typically need to use
> different means for each.
>
>
> Looking at the diff for your changes, they seem to be localized though, so
> in this case a profiler (especially one like Valgrind) could be reliable.
>
> Note though that while the boxed view gives a nice overall picture,
> if you want to know the total individual percentage of a given function,
> look at the callgraph or the inclusive % column in the table on the left.
>
>
I didn't try the new Falcon code, but from the graphs it seems to be
faster. Nevertheless, I also agree with Eero that another precise way to
measure the improvement is to run a snapshot for a fixed number of VBLs
and see if it takes more or less time.
In the case of "simpler" functions, another good indicator could be to
just build a small test program that would do 1000000 calls of the
old/new versions of Update_e_u_n_z and see which one is faster (this way
you can get a real percentage of the speedup).
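A minimal sketch of such a test program in C. The two routines below are hypothetical stand-ins (flags_old, flags_new), not the real Update_e_u_n_z from Hatari; time_calls and speedup_percent are likewise names made up for this illustration.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical "old" version: computes two flag bits with conditionals.
   This is NOT the real Update_e_u_n_z, only a placeholder to time. */
uint32_t flags_old(uint32_t v)
{
    uint32_t f = 0;
    if (v == 0)
        f |= 1u;            /* bit 0: value is zero */
    if (v & 0x80000000u)
        f |= 2u;            /* bit 1: top bit is set */
    return f;
}

/* Hypothetical "new" version: same result without any "if". */
uint32_t flags_new(uint32_t v)
{
    return (uint32_t)(v == 0) | ((v >> 31) << 1);
}

typedef uint32_t (*flag_fn)(uint32_t);

/* Call fn `calls` times and return the CPU time used, in seconds.
   The accumulated result is written to *sink so that the compiler
   cannot discard the loop as dead code. */
double time_calls(flag_fn fn, uint32_t calls, uint32_t *sink)
{
    uint32_t acc = 0;
    clock_t start = clock();
    for (uint32_t i = 0; i < calls; i++)
        acc += fn(i * 2654435761u);   /* spread the inputs over 32 bits */
    clock_t end = clock();
    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Speedup of the new version relative to the old one, in percent. */
double speedup_percent(double t_old, double t_new)
{
    if (t_old <= 0.0)
        return 0.0;                   /* too fast to measure */
    return (t_old - t_new) / t_old * 100.0;
}
```

From main() one would call time_calls(flags_old, 1000000u, &sink) and time_calls(flags_new, 1000000u, &sink), then pass both times to speedup_percent. With only 1000000 calls the run may be too short for clock() to resolve, so repeating the measurement or raising the count is probably needed; an optimizing compiler may also inline both versions, which is part of what gets measured.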
It's not always possible, but an "if" can sometimes be removed by
using bitwise and/or or things like that. Sometimes 2 instructions that
are always executed can be faster than 1 conditional "if".
Code where you handle bits is often a good candidate for this kind of
optimisation (I remember greatly optimizing some popular depacking
routines on 68000 this way years ago).
But there's no general rule, it's really up to your imagination.
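As an illustration of replacing an "if" with bitwise and/or, here is a small sketch in C. The status register layout (SR_C as bit 0, carry taken from bit 24 of the result) is an assumption made for the example, not the DSP emulation's actual code.

```c
#include <stdint.h>

#define SR_C 0x01u  /* hypothetical carry flag: bit 0 of a status register */

/* Version with a conditional: copy bit 24 of result into the carry flag. */
uint32_t set_carry_branch(uint32_t sr, uint32_t result)
{
    if (result & (1u << 24))
        sr |= SR_C;
    else
        sr &= ~SR_C;
    return sr;
}

/* Version without "if": always clear the flag, then OR in the
   extracted bit. Both operations run on every call. */
uint32_t set_carry_nobranch(uint32_t sr, uint32_t result)
{
    return (sr & ~SR_C) | ((result >> 24) & SR_C);
}
```

Both functions return the same value for every input. Whether the second one is actually faster depends on the compiler and the CPU (a compiler may already generate branch-free code for the first), so it is worth timing both.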
>
> It might be possible that my i3 CPU scales its speed according to
> the load[2] though and that's why I don't see the differences. I hadn't
> thought of that earlier. If it does scaling, then "top" isn't a valid way
> to measure anything.
It certainly does; check "cat /proc/cpuinfo" to see if the "cpu MHz"
value changes depending on the overall load. In that case, using top is
not a good option.
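The same check can be done from a small C helper, sketched below under the assumption of a Linux x86 system whose /proc/cpuinfo contains "cpu MHz" lines (other architectures may not have that field). The function names are made up for this example.

```c
#include <stdio.h>

/* Parse one line of /proc/cpuinfo. Returns 1 and stores the frequency
   in *mhz if the line is a "cpu MHz" line, otherwise returns 0. */
int parse_cpu_mhz(const char *line, double *mhz)
{
    /* Spaces in the format match any run of blanks or tabs. */
    return sscanf(line, "cpu MHz : %lf", mhz) == 1;
}

/* Print the current frequency reported for each core.
   Returns the number of cores reported, or -1 if the file
   could not be opened. */
int print_cpu_mhz(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[512];
    double mhz;
    int cores = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_cpu_mhz(line, &mhz)) {
            printf("core %d: %.0f MHz\n", cores, mhz);
            cores++;
        }
    }
    fclose(f);
    return cores;
}
```

Calling print_cpu_mhz() once while the machine is idle and once while Hatari is running should show whether the frequency changes with the load.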
The best option is a profiler that completely emulates an x86 CPU
with a cycle-precise value for each instruction. The program will usually
run very slowly, but in the end you get an exact cycle count of what
happened. If I recall correctly, there are such profilers under Linux, but
I don't remember their names.
Nicolas