[hatari-devel] new version of video.c
Eero Tamminen
eerot at users.berlios.de
Fri Jul 3 22:54:08 CEST 2009
Hi,
On Monday 22 June 2009, Kåre Andersen wrote:
> So, skimming the code, the first thing that strikes me, is that there
> is no 32-bit conversion routine for High res - only 8 bit. This is
> quite likely to kill speed on OS X due to earlier discussed
> compositing display manager (Quartz).
The only problem I've heard of related to OSX compositing is that SDL
does (or did) VSync on all screen updates, not just on flip. An extra
conversion step doesn't cause extra VSyncs.
What frameskip do you use? Do you have the statusbar enabled, and does
switching it off change anything?
> I dont know how this will register with gprof, and i still have very
> little experience in using it, but the theory is sound enough :)
You could use Shark, AFAIK it's free (needs registration):
http://developer.apple.com/tools/shark_optimize.html
http://developer.apple.com/tools/performance/optimizingwithsystemtrace.html
(I don't have a Mac myself, but at work there was this one guy who
raved about Shark and a few features it has that are lacking in
Valgrind/KCachegrind. Apple also does nice GUIs for their developer
tools.)
> (8 bit converted to 32-bit while
> beeing shipped back and forth to the video hardware due to compositing
> is slower than 32-bit not beeing converted while beeing shipped back
> and forth...)
The graphics conversions are one-way; there's no need to read the written
data back (which can be slow if it's done from the gfx card memory).
Nowadays performance bottlenecks, especially in things like this, mostly
come from memory accesses, not CPU instructions.
As to conversion operations, the current code converts in two steps:
1-bit -> 8-bit -> 32-bit
which for fullscreen updates should mean 32kB reads + 250kB writes + 250kB
reads + 1MB writes. In total about 1.5MB of data per frame.
(Bit twiddling, how it's done and how it interacts with CPU caches can
vary the effect of the data amount. The current 1-bit -> 8-bit conversion
code handles 4 pixels at a time.)
The monochrome screen is (without frameskip) refreshed at 72Hz. This means
72*1.5 = 108MB/s memory bus load from the conversions.
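
The figures above can be reproduced with a few lines of arithmetic. This
is only a sketch: the function names are made up for illustration and are
not from Hatari, and it assumes that every conversion step reads its whole
source buffer and writes its whole destination buffer once. Python is used
just to keep the arithmetic short. The kB/MB values in the text are rounded
binary units, so the script reports MiB.

```python
WIDTH, HEIGHT = 640, 400   # ST high (monochrome) resolution
REFRESH_HZ = 72            # refresh rate quoted in the text
PIXELS = WIDTH * HEIGHT


def frame_bytes(bits_per_pixel):
    """Size in bytes of one full frame at the given depth."""
    return PIXELS * bits_per_pixel // 8


def chain_traffic(depths):
    """Bytes moved per frame when converting through each depth in turn.

    Each step is assumed to read its source once and write its
    destination once.
    """
    total = 0
    for src, dst in zip(depths, depths[1:]):
        total += frame_bytes(src) + frame_bytes(dst)
    return total


MIB = 1024 * 1024
two_step = chain_traffic([1, 8, 32])
per_frame_mib = two_step / MIB
per_second_mib = two_step * REFRESH_HZ / MIB
print(f"{per_frame_mib:.2f} MiB per frame, {per_second_mib:.0f} MiB/s")
```

With these assumptions the result is about 1.5 MiB per frame and roughly
108 MiB/s, matching the estimate above.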
Depending on the graphics system and how SDL works, this might be written
directly to the screen (as with a framebuffer), but mostly it isn't.
OSX uses compositing. This means that every time the window contents
change, they're re-composited to screen i.e. read from the window back
buffer and written (possibly composited with other data & transformed) to
the screen. For the 32-bit monochrome i.e. 640x400 window this means at
least an additional 1MB of reads + 1MB of writes for each frame. In total
3.5MB of data per frame and 252MB/s memory bus load.
You're suggesting doing conversion directly:
1-bit -> 32-bit
which means skipping the 8-bit reads & writes i.e. 1/2MB of the total data.
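
For reference, a direct 1-bit -> 32-bit conversion could look roughly like
the sketch below. It is not Hatari's code and not Kåre's routine. It
assumes that the most significant bit is the leftmost pixel, that a set bit
is black and a cleared bit white (the real mapping depends on the ST
palette), and that a per-byte lookup table is acceptable. The names LUT,
BLACK, WHITE and mono_to_32bit are hypothetical.

```python
import struct

WHITE = 0xFFFFFFFF  # assumed 32-bit value for a cleared bit
BLACK = 0xFF000000  # assumed 32-bit value for a set bit

# One entry per possible source byte: its 8 pixels, already expanded
# to 32 bits each (little-endian), leftmost pixel first.
LUT = [
    struct.pack(
        "<8I",
        *(BLACK if (byte >> (7 - bit)) & 1 else WHITE for bit in range(8))
    )
    for byte in range(256)
]


def mono_to_32bit(src):
    """Expand 1-bit-per-pixel data straight to 32-bit pixels.

    There is no 8-bit intermediate buffer: each source byte is read
    once and 32 output bytes are written once.
    """
    return b"".join(LUT[b] for b in src)


# One 640-pixel line is 80 source bytes and 2560 output bytes.
line = mono_to_32bit(bytes(80))
print(len(line))
```

The table trades 8kB of memory for one lookup per 8 pixels; whether that
beats bit twiddling depends on the cache behaviour mentioned earlier.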
So, with compositing you should expect about 0.5MB/3.5MB ~= 14%
improvement in the CPU usage of screen conversion. Screen conversion
can be a noticeable part of Hatari CPU usage if you aren't using
frameskip, but I think even then it's less than half of the whole Hatari
CPU usage.
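
The saving can be checked by dividing the skipped traffic by the total
traffic of the two-step path with compositing. Again a sketch only: the
variable names are illustrative, and the compositor is assumed to read and
write the 32-bit window contents exactly once per frame.

```python
PIXELS = 640 * 400        # ST high resolution

MONO = PIXELS // 8        # 1-bit source frame, in bytes
PAL8 = PIXELS             # 8-bit intermediate frame, in bytes
RGB32 = PIXELS * 4        # 32-bit output frame, in bytes

# Assumed compositor cost per frame: one read + one write of the window.
COMPOSITE = 2 * RGB32

# Two-step path: 1-bit -> 8-bit, then 8-bit -> 32-bit, then compositing.
two_step = (MONO + PAL8) + (PAL8 + RGB32) + COMPOSITE
# Direct path: 1-bit -> 32-bit, then compositing.
direct = (MONO + RGB32) + COMPOSITE

saved = two_step - direct
saving_fraction = saved / two_step
print(f"{saved} bytes saved per frame, {saving_fraction:.1%} of the traffic")
```

Only the 8-bit write and the 8-bit read drop out, which is about 14% of
the traffic under these assumptions.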
Note: My machine is a 1.4GHz AMD Athlon XP with an old (obsolete) Matrox
G550 gfx card. I'm running the display at 16 bits, no compositing. On this
system, the monochrome screen causes <3% CPU usage by Hatari itself and
~10% by the X server. I think the X server load comes just from pushing
the converted data to the display (+ occasional updates from other apps).
I guess on a 32-bit display Hatari with the current conversion routines
would use 4% CPU.
From the above one can deduce that the performance problems when
compositing is used are more likely related to screen update frequency
than to conversions.
> I guess there is no way around it now - i will implement my conversion
> routines for all ST modes (currently only for ST Low) and send them to
> whoever wants to review them. Hopefully i can get this done some time
> tonight...
Does it support partial screen updates or does it always do full screen
updates? For many use-cases (games and applications that don't constantly
do whole screen updates) partial screen updates improve performance a lot,
which is important on older & embedded machines.
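
To illustrate what a partial update means here, the sketch below compares
the previous and current frame line by line and returns only the changed
line ranges, so that just those lines would need converting and pushing to
the display. This is a generic approach and not necessarily how Hatari
detects changes; dirty_line_ranges and its parameters are hypothetical.

```python
def dirty_line_ranges(prev, curr, bytes_per_line):
    """Return inclusive (first_line, last_line) ranges that differ.

    prev and curr are equally sized frame buffers; adjacent changed
    lines are merged into one range.
    """
    ranges = []
    start = None
    lines = len(curr) // bytes_per_line
    for y in range(lines):
        off = y * bytes_per_line
        end = off + bytes_per_line
        changed = prev[off:end] != curr[off:end]
        if changed and start is None:
            start = y
        elif not changed and start is not None:
            ranges.append((start, y - 1))
            start = None
    if start is not None:
        ranges.append((start, lines - 1))
    return ranges


# A 640x400 monochrome frame has 80 bytes per line.
old = bytes(80 * 400)
new = bytearray(old)
new[80 * 10] = 0x01         # change something on line 10
new[80 * 11 + 5] = 0xFF     # and on line 11
print(dirty_line_ranges(old, bytes(new), 80))
```

In this example only 2 of the 400 lines would have to be converted.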
> From there on, the path to pure OpenGL rendering is much simpler...
Btw. Apple also has an OpenGL profiler.
- Eero