“There’s that kind of sense that many technical people will see or feel, where you see something and you know how it should be, and you think it should be important, and you see companies just doing it badly: there’s that great need to fix it and do it right and show them kind of how it’s supposed to be done.”
Ever since its launch in 2015, Batman: Arkham Knight has had a terrible performance profile. The initial launch had absolutely awful performance that was best covered by a TotalBiscuit video in which he described seeing “hitching and frame-rate drops, especially when driving or flying quickly around the city”, and that wasn’t an isolated experience: the situation got so bad that the PC version was removed from sale.
The game later returned to retail with an update which Digital Foundry covered in a 2015 video and found to be neither good enough nor efficient: hitches and stuttering were still around, and the GPU was far from being fully utilized. Digital Foundry revisited Arkham Knight once more in a 2018 video, with the video title calling it “One of PC’s Worst ports” and the intro voice-over referring to it as “The dreaded PC version”, and yet again in a 2019 video covering the re-release on the Epic Games Store, which comes sans DRM. Richard Leadbetter even calls Arkham Knight’s performance “a puzzle [he’s] been trying to crack for years now”. I liked the game a lot yet suffered from its performance even on top-tier hardware, and technical puzzles, especially ones related to game performance, fascinate me. So I decided I’d take a stab at this one. FWIW, all my work has been with the GameWorks effects turned off.
The fact that the GPU isn’t fully utilized pointed a finger towards the CPU-side code, and the stutters being associated with traversing the world by gliding or in the Batmobile made the game’s streaming system the prime culprit, so I decided to investigate further. This turned out to be more complicated than usual: I have the DRM-encumbered Steam version, and it seemed to resist any frame capture tool I threw at it, RenderDoc et al. I still wanted to see how the CPU was spending its time, so I decided to get a bit creative.
I grabbed the source code for ReShade, which implements a layer that intercepts Direct3D 11 API calls, and integrated the Tracy Profiler into it, so I could get an idea of which API calls the game makes and how much time it spends in each of them. That worked beautifully, and while I wasn’t able to hook the DXGI swapchain’s Present() call and get accurate per-frame timings, I was able to see the D3D11 calls, and I immediately saw a big red warning flag:
There were a few CreateTexture2D() calls taking over a millisecond each:
When a 60 FPS frame rate leaves a budget of roughly 16 milliseconds for the entire frame, that is quite a problem. Tracy gathers aggregate statistics, and inspecting them revealed that CreateTexture2D calls taking a long time are quite frequent. That was another smoking gun potentially pointing to the streaming system, but I still wanted to check one more data point.
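To make the frame-budget arithmetic concrete, here is a minimal sketch of the kind of per-call timing an interception layer could collect. It is not the Tracy integration described above: `CallProfiler`, `ScopedCallTimer`, and `frame_budget_ms` are hypothetical names invented for illustration, and the sketch assumes calls arrive from a single thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Frame budget in milliseconds for a target frame rate (about 16.67 ms at 60 FPS).
constexpr double frame_budget_ms(double fps) { return 1000.0 / fps; }

// Aggregate statistics for one intercepted entry point (hypothetical helper).
struct CallStats {
    std::uint64_t calls = 0;
    std::uint64_t slow_calls = 0;  // calls at or above the slow threshold
    double total_ms = 0.0;
    double worst_ms = 0.0;
};

// Not thread-safe: a real hook would need a lock or per-thread storage.
class CallProfiler {
public:
    explicit CallProfiler(double slow_threshold_ms)
        : slow_threshold_ms_(slow_threshold_ms) {}

    void record(const std::string& name, double ms) {
        assert(ms >= 0.0);
        CallStats& s = stats_[name];
        ++s.calls;
        s.total_ms += ms;
        if (ms > s.worst_ms) s.worst_ms = ms;
        if (ms >= slow_threshold_ms_) ++s.slow_calls;
    }

    // Returns zeroed statistics for names that were never recorded.
    CallStats stats(const std::string& name) const {
        auto it = stats_.find(name);
        return it == stats_.end() ? CallStats{} : it->second;
    }

private:
    double slow_threshold_ms_;
    std::map<std::string, CallStats> stats_;
};

// RAII timer: measures the enclosing scope and reports it on destruction.
class ScopedCallTimer {
public:
    ScopedCallTimer(CallProfiler& profiler, std::string name)
        : profiler_(profiler),
          name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedCallTimer() {
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start_;
        profiler_.record(name_, elapsed.count());
    }

    ScopedCallTimer(const ScopedCallTimer&) = delete;
    ScopedCallTimer& operator=(const ScopedCallTimer&) = delete;

private:
    CallProfiler& profiler_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

A wrapper would construct a `ScopedCallTimer` at the top of each intercepted function; a call that takes one millisecond then shows up as roughly 6% of the 60 FPS budget on its own.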
I added code to wrap the 2D texture objects and track their lifetime, in addition to tracking the number of texture creation calls, and the results were interesting:
The game hovered around 11,000 live textures at any one time, but the number of CreateTexture2D() calls kept climbing, a sign that the streaming system is almost certainly not recycling texture objects, which is a big no-no. While an efficient streaming system would keep a pool of textures to reuse and update as the player moves through the open world, this system just kept calling into the graphics driver, repeatedly asking for new textures and almost certainly causing a ton of fragmentation in the process.
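The diagnostic boils down to two counters: one that only ever goes up, and one that follows object lifetimes. The sketch below shows the idea with a hypothetical `TextureCounters` struct; in a real wrapper, `on_create` would presumably be called from the hooked creation function and `on_destroy` when the wrapped object's reference count reaches zero.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical counters a texture wrapper could update.
struct TextureCounters {
    std::atomic<std::uint64_t> created{0};  // total creation calls, never decreases
    std::atomic<std::uint64_t> live{0};     // textures currently alive

    void on_create() {
        created.fetch_add(1, std::memory_order_relaxed);
        live.fetch_add(1, std::memory_order_relaxed);
    }

    void on_destroy() {
        const std::uint64_t before = live.fetch_sub(1, std::memory_order_relaxed);
        assert(before > 0 && "destroy without a matching create");
        (void)before;  // silence unused-variable warnings when asserts are disabled
    }

    // Textures that were created and have since been destroyed. A value that
    // keeps growing while `live` stays flat suggests objects are not reused.
    std::uint64_t churn() const { return created.load() - live.load(); }
};
```

A flat `live` count next to a steadily rising `created` count is the pattern described above.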
I wanted to see how much this lack of texture reuse was hurting performance, and since I was already wrapping D3D device and texture objects it was pretty simple to implement a texture pool. I decided to skip the actual updating of texture contents until I could see what kind of performance impact this approach would have, so all I added was a quick-and-dirty pooling approach that matches texture descriptions exactly and ignores the pInitialData argument of ID3D11Device::CreateTexture2D.
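A pool that matches on the exact description can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the fix: `TextureDesc` is a hypothetical, simplified stand-in for the fields of `D3D11_TEXTURE2D_DESC`, and the pooled `Texture` type is a template parameter so the sketch builds without the Direct3D headers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Simplified stand-in for a texture description (hypothetical field set).
struct TextureDesc {
    std::uint32_t width, height, mip_levels, array_size;
    std::uint32_t format, usage, bind_flags, cpu_access_flags, misc_flags;
};

// Every field takes part in the key, so only exact matches are reused.
inline auto desc_key(const TextureDesc& d) {
    return std::tie(d.width, d.height, d.mip_levels, d.array_size, d.format,
                    d.usage, d.bind_flags, d.cpu_access_flags, d.misc_flags);
}

inline bool operator<(const TextureDesc& a, const TextureDesc& b) {
    return desc_key(a) < desc_key(b);
}

template <typename Texture>
class TexturePool {
public:
    // Hands back an idle texture with an identical description if there is
    // one; otherwise falls through to `create` (the expensive driver call).
    Texture acquire(const TextureDesc& desc, const std::function<Texture()>& create) {
        assert(create && "a creation callback is required");
        auto it = idle_.find(desc);
        if (it != idle_.end() && !it->second.empty()) {
            Texture t = std::move(it->second.back());
            it->second.pop_back();
            ++hits_;
            return t;
        }
        ++misses_;
        return create();
    }

    // Called instead of destroying the texture once the game lets go of it.
    void release(const TextureDesc& desc, Texture t) {
        idle_[desc].push_back(std::move(t));
    }

    std::uint64_t hits() const { return hits_; }
    std::uint64_t misses() const { return misses_; }

private:
    std::map<TextureDesc, std::vector<Texture>> idle_;
    std::uint64_t hits_ = 0;
    std::uint64_t misses_ = 0;
};
```

Note that a recycled texture still holds its previous contents, which is why ignoring the initial data produces visual corruption until the upload path is handled.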
The performance improvement was dramatic: even with a really naive implementation of a texture pool, moving around the city was a lot smoother than what I was used to, so that route seemed promising. Skipping the initial data upload led, unsurprisingly, to some image corruption, and the next step was to fix that and see whether the performance improvement remained. On the bright side, some of the corruption was fun to look at:
I ended up copying the initial data provided at the time of the CreateTexture2D() call, then flushing it to the actual texture with a call to ID3D11DeviceContext::UpdateSubresource before executing any draw calls. My first implementation used a deque and allocated memory once for every CreateTexture2D call, and even with this amount of overhead most of the problem was gone; as expected, this fixed the visual corruption issues. I later moved to using a preallocated region of system memory with a linear allocator, so that no calls to malloc() would happen during asset streaming: this seemed to fix the last stutters related to texture uploads.
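The deferred-upload idea can be sketched as a bump allocator over one preallocated region plus a fixed-capacity list of pending copies. The names `LinearAllocator`, `PendingUpload`, and `UploadQueue` are hypothetical, the integer `texture_id` stands in for a texture pointer, and the `upload` callback is where a real implementation would presumably call `UpdateSubresource`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bump allocator over a single region that is allocated once, up front.
class LinearAllocator {
public:
    explicit LinearAllocator(std::size_t capacity) : storage_(capacity), offset_(0) {}

    // Alignment is relative to the start of the region and must be a power
    // of two. Returns nullptr when the region is exhausted.
    void* allocate(std::size_t size, std::size_t alignment = 16) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > storage_.size() || size > storage_.size() - aligned) return nullptr;
        offset_ = aligned + size;
        return storage_.data() + aligned;
    }

    void reset() { offset_ = 0; }  // frees everything at once
    std::size_t used() const { return offset_; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t offset_;
};

struct PendingUpload {
    int texture_id;    // stand-in for the destination texture
    const void* data;  // points into the staging region
    std::size_t size;
};

class UploadQueue {
public:
    UploadQueue(std::size_t staging_bytes, std::size_t max_pending)
        : staging_(staging_bytes), max_pending_(max_pending) {
        pending_.reserve(max_pending);  // so push_back never reallocates later
    }

    // Copies the caller's initial data so it outlives the creation call.
    bool stage(int texture_id, const void* src, std::size_t size) {
        if (pending_.size() == max_pending_) return false;
        void* dst = staging_.allocate(size);
        if (dst == nullptr) return false;
        std::memcpy(dst, src, size);
        pending_.push_back(PendingUpload{texture_id, dst, size});
        return true;
    }

    // Intended to run before the next draw call: hands each staged copy to
    // `upload`, then recycles the whole staging region.
    template <typename UploadFn>
    void flush(UploadFn&& upload) {
        for (const PendingUpload& p : pending_) upload(p);
        pending_.clear();
        staging_.reset();
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t staged_bytes() const { return staging_.used(); }

private:
    LinearAllocator staging_;
    std::size_t max_pending_;
    std::vector<PendingUpload> pending_;
};
```

When `stage` returns false the region is full, and a caller would need a fallback such as flushing early or uploading immediately; which one the actual fix uses is not stated here.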
Buffers exhibited similar behavior to textures, but the impact on performance wasn’t as bad. Still, a pooling system (as inefficient as the one I ended up using is) fixed more stutters, and at that point I was able to play the game at a near-locked 60 FPS as long as my GPU could actually push the pixels on screen, which is quite the dramatic change from how things have been since, well, forever.
There’s still some low-hanging fruit to be picked: my pooling code matches on the exact texture description, so an RGBA8_UNORM texture cannot be used to satisfy an RGBA8_SRGB allocation. Converting to typeless DXGI formats before pooling would be a simple improvement to make, and based on the data I already have it would improve the number of pool hits. The buffer pooling code is even worse, matching on the exact buffer size, where a significantly better approach would be to pad buffers to the nearest multiple of N kilobytes, then have multiple buckets, with each bucket able to satisfy all allocation requests within a certain size range. But at this point my incentive isn’t as high, since I was able to get to the point of being GPU-bound simply by throwing more than enough CPU, RAM, and VRAM at the problem, given the newer hardware I have now.
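Both proposed improvements amount to normalizing the pool key before the lookup. The sketch below is illustrative only: the `Format` enum is a hypothetical stand-in rather than the real `DXGI_FORMAT` values, and `pool_key_format` and `bucket_size` are invented names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a handful of DXGI formats.
enum class Format {
    RGBA8_Typeless, RGBA8_Unorm, RGBA8_UnormSrgb,
    BGRA8_Typeless, BGRA8_Unorm, BGRA8_UnormSrgb,
    Other
};

// Collapse formats that share a memory layout onto one typeless key, so a
// UNORM texture can satisfy an sRGB request and vice versa.
inline Format pool_key_format(Format f) {
    switch (f) {
        case Format::RGBA8_Unorm:
        case Format::RGBA8_UnormSrgb:
            return Format::RGBA8_Typeless;
        case Format::BGRA8_Unorm:
        case Format::BGRA8_UnormSrgb:
            return Format::BGRA8_Typeless;
        default:
            return f;  // no known typeless sibling in this sketch
    }
}

// Round a requested buffer size up to the next multiple of `granularity`, so
// every request in the same range maps to the same bucket.
inline std::size_t bucket_size(std::size_t requested, std::size_t granularity) {
    assert(granularity > 0);
    return (requested + granularity - 1) / granularity * granularity;
}
```

One caveat with the typeless approach: views created on a typeless resource need an explicit, fully typed format, so a wrapper would have to remember the format the game originally asked for and supply it when views are created.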
I was blown away, and pretty disappointed, by the fact that such a simple optimization (implemented in under a week) was able to get such a dramatic performance improvement in a game this notorious for bad performance. I wonder how the people responsible for this implementation found it acceptable, and whether they realize how much money was thrown away by something that could’ve been fixed in a single man-week of work. I argue a lot for craftsmanship for craftsmanship’s sake, and I realize that it often doesn’t make financial sense, but in this case a little bit of attention and care could have saved the publisher millions of dollars in actual costs, and god only knows how much in reputation and goodwill. Sloppy engineering has a cost, good craftsmanship has real value, and the next time someone is allowed to ship something this terrible we might not be lucky enough to be able to make such a fix.
The fix is provided in this drop-in replacement DXGI DLL. To use it, just copy the dxgi.dll file to the same directory as BatmanAK.exe, the game’s executable, and make sure to disable all GameWorks effects and any overlays other than the Steam overlay, as some overlays are known to cause crashes. The source code for my changes is provided in the batman branch in this git repo.
Update: If you run into crashes with this, it should generate some .dmp files in the same directory as the DLL. Share the dump files with me and I’ll do my best to investigate the issue.