Fixing Blue Screens and crashes with a brand new RTX card by checking your RAM

So you just got a brand new Nvidia RTX 3000 or 4000 series GPU, you already have a powerful power supply (750W+) and your PC shows one of the following symptoms:

  • Blue Screens of Death;
  • Artifacting in games;
  • Random game and app crashes;
  • Computer freezes;
  • Black screen crashes.

Well it’s most likely not your CPU, motherboard or power supply dying.

What is XMP?

XMP stands for eXtreme Memory Profile. It’s an Intel thing; other names for it are DOCP (on some Asus AM4 motherboards) or RAMP (Ryzen Accelerated Memory Profile). To keep things simple I’ll call this XMP regardless of Intel or AMD since they all serve the same purpose.

When you build your computer (or have it built) you will almost certainly benefit from enabling XMP in your UEFI if your RAM comes with it. In my case I have some Corsair Vengeance RAM, so I’m using the profile to get the advertised 3600MHz on the kit I bought.

You have enabled XMP on your RAM

The crashes are most likely caused by having the XMP profile enabled. One quick way to check is to run memtest86 (it’s free): set it up on a flash drive and boot from it. Run a long test and you should start seeing errors pile up.

Depending on the amount of RAM you have you will see errors appear within an hour or even witness a crash.

Once you see errors feel free to cancel the test, reboot into the UEFI and disable the XMP profile (refer to your motherboard manual for instructions). Once this is done you will need to run another test but this time let it run for a couple of hours longer than previously.

Testing your system’s stability is very important and only memtest86 can do it effectively. DO NOT TRUST WINDOWS’ RAM CHECKING TOOL.

If everything goes well and no further errors are detected, then it means we have fixed the crashes.

Fix #1: No XMP profile

The cold, hard and sad truth is that you can no longer run the XMP profile. Just disable it to fix the problem and be done with it.

This might not be the answer you were looking for but this is the easiest way.

Memtest86 doesn’t lie: when it can’t find any errors over several hours with XMP disabled, you can consider the problem fixed and everything should run smoothly from there.

But… If you are willing to take risks there’s a second fix…

Fix #2: Increase RAM voltage

Disclaimer: You are responsible for whatever you do to your PC, everything is on you no matter what. Any damage caused by following these instructions is on you.
Overclocking and overvolting are not covered by your warranty and if something breaks it’s on you.

To get started you’ll have to enable your XMP profile once more, then you will have an input field for the RAM voltage. In my case the default setting was showing 1.35V.

I decided to increase the voltage to 1.37V and run memtest86. I found no errors within the first hour, which was good but not a promise. I let the test run for 16 hours total and I still didn’t have any errors.

In my case 1.37V is good enough. Your case is different, so you’ll have to try 1.36V first, then increase it until it’s stable or your hardware can’t take any more.
Be careful because you can damage your hardware, and this kind of damage isn’t covered by any warranty.
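
If you want to keep track of which voltages to try, here is a minimal sketch in Python. It only lists the steps to test and doesn’t touch your hardware: you still set each value by hand in the UEFI and run memtest86 after every change. The function name, the 10mV step and the ceiling used in the example are assumptions made for illustration, not a recommendation. Pick the ceiling yourself from what your RAM and motherboard vendors document.

```python
def voltage_candidates(default_mv, ceiling_mv, step_mv=10):
    """List the DRAM voltages (in volts) to try, one memtest86 run each.

    default_mv: the voltage the XMP profile sets, in millivolts (1350 in my case).
    ceiling_mv: the highest value YOU are willing to try (assumption, not advice).
    step_mv:    how much to raise the voltage between two test runs.
    """
    if step_mv <= 0:
        raise ValueError("step_mv must be positive")
    if ceiling_mv < default_mv:
        raise ValueError("ceiling_mv must not be below default_mv")
    # Work in integer millivolts so the steps don't drift with float rounding.
    return [mv / 1000 for mv in range(default_mv + step_mv, ceiling_mv + 1, step_mv)]


# Hypothetical example: XMP default of 1.35V, with a ceiling picked only for the demo.
for volts in voltage_candidates(1350, 1400):
    print(f"Set DRAM voltage to {volts:.2f}V, then run memtest86 for several hours")
```

Stop at the first value that survives a long memtest86 run; there is no point in going higher than you need.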

What caused the crashes actually?

I can only speculate but I think it’s related to the transient power spikes being caused by high powered GPUs such as the RTX 3000 and 4000 series.
I’m speculating that more power is also being drawn through the PCIe slot on the motherboard, potentially causing a voltage drop in other places where it matters, such as the RAM.

I’m not sure if this is a bad design coming from the GPU, the motherboard or even the RAM, but it does cause some sort of instability that then causes artifacting in some games, random blue screens (BSoD), sometimes with error codes unrelated to each other (see my screenshot below), or even other issues.

Gamers Nexus did produce a video to explain the transients in more detail but I feel that it might only be scratching the surface for the issues that some of us might encounter.

My hardware

At the time of writing I’m rocking my BEEF PC that is built around the following components:

  • AMD Ryzen 9 5900X;
  • NVidia RTX 3080Ti (Gigabyte);
  • 64GB Corsair Vengeance 3600MHz RAM;
  • Asus TUF GAMING X570-PRO motherboard;
  • Corsair 850W PSU.

This was tested and has been stable ever since I’ve increased my RAM voltage.
I have tested this on another computer with the following specs:

  • AMD Ryzen 7 5800X;
  • NVidia RTX 3090 (FE);
  • 32GB Corsair Vengeance 3200MHz RAM (voltage bumped to 1.37V);
  • MSI Tomahawk B450;
  • Corsair 1000W PSU.

Conclusion

Those new RTX 3000 and 4000 series are pretty powerful and pull all the power they can get and some more!

I understand that most of the reasoning I went through was speculation and it feels like I’m making up a reason. I’m honestly not even convinced by what I think is the problem, but increasing the voltage did fix it for me, so I’m not going to argue with a stable system.

 

No ShadowPlay overlay in games, no recording after installing a Stream Deck

I’ve recently purchased a Stream Deck MK2 by Elgato (Corsair) after playing around with the mobile app for a while and planning out some profiles for stuff like video editing and streaming.

It’s honestly a nice piece of equipment but I’ve had it for less than 24 hours and I’ve already found some problems.

Symptoms

After running the Stream Deck software to configure the keys and browse the plugins that are available I noticed that my ShadowPlay overlay isn’t showing in the critically acclaimed Final Fantasy XIV, which is unusual.

I closed the game, relaunched it, even rebooted and nothing. Time to try another game like Monster Hunter World, same. I confirmed that the main menu would show up but I had no overlay showing the replay buffer running.

Weirder still, when I attempted to record, it would ask me if I wanted to enable desktop recording. That’s when I started thinking that some software was grabbing ShadowPlay and keeping it to itself without ever releasing it.

I was looking for the icons on the bottom right

How to blacklist apps from ShadowPlay

The only way to fix this is to forbid ShadowPlay from hooking into the offending app, and of course Nvidia doesn’t give us tools to deal with that.
On the bright side, we have third-party tools. I used NVIDIA Profile Inspector; download a release here.

Download it, run it, then search for Discord in the box at the top:

We will now add the StreamDeck.exe utility to that list by clicking on the add application to current profile button, browse to your Stream Deck installation folder and add StreamDeck.exe to it.

You might need to relaunch the Stream Deck utility, relaunch ShadowPlay through GeForce Experience or even reboot.
After that the utility will no longer trigger the overlay notification you get when you run a game and you should be able to record gameplay without enabling desktop recording.

Closing notes

I’ll make a proper post about how I use it at a later date (maybe) but for now I’m pretty much pleased with how easy it is to configure and I have started experimenting with some plugins such as Discord (mute/deafen toggle), iCUE profile changer (for my Corsair iCUE H150i RGB PRO XT) and a plugin to show CPU and GPU temps with HWiNFO.

I’ll get started with streaming a bit more often I guess and will improve my general workflow and battlestation setup.

Remap Ansel hotkey

Ansel is Nvidia’s tool that is used for making high resolution, 360° and stereoscopic screenshots. You can move around freely in a paused scene to capture your character’s best side.

Sadly during action ALT+F2 is quite hard to hit. Let’s remap the key.

Launching the configuration utility

To launch the utility hit Windows+R and paste this path:

%PROGRAMFILES%\NVIDIA Corporation\Ansel\Tools\NvCameraConfiguration.exe

You should be greeted with this window:

The hotkey input field is used to remap the shortcut: hit the key or the combination you wish to use, then confirm and save.
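
If you’d rather script the launch than go through Windows+R, here is a minimal sketch in Python. It only rebuilds the path given above from the PROGRAMFILES environment variable and starts the utility if it exists. The function names are made up for this example, and I’m assuming the tool sits at that same location on your install.

```python
import ntpath
import os
import subprocess

# Relative location of the utility, taken from the path above.
ANSEL_TOOL = ntpath.join("NVIDIA Corporation", "Ansel", "Tools", "NvCameraConfiguration.exe")


def ansel_config_path(env=None):
    """Return the full path to NvCameraConfiguration.exe, or None if PROGRAMFILES is unset."""
    env = os.environ if env is None else env
    program_files = env.get("PROGRAMFILES")
    if not program_files:
        return None
    # ntpath always joins with backslashes, whatever OS this runs on.
    return ntpath.join(program_files, ANSEL_TOOL)


def launch_ansel_config():
    """Start the configuration utility; return False if it can't be found."""
    path = ansel_config_path()
    if path is None or not os.path.isfile(path):
        return False
    subprocess.Popen([path])
    return True
```

Call launch_ansel_config() from a Windows Python prompt; it returns False instead of raising when the utility isn’t installed where expected.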

I have tested this on Windows 10 with The Witcher 3.

Source: https://forums.geforce.com/default/topic/957949/ansel/any-way-to-rebind-keys-/

Asus ROG G750JM: Defective by design

Republic of Gamers

Update 18/07/2017: A BSoD with Nvidia’s WHQL drivers (dated 29/06/2017) was reported to me (in French) and fixed by installing Asus’ drivers. See conclusion.

If we ever have to speak about the looks, the form factor and the cooling system, I’d have to say that I love the Asus ROG (Republic of Gamers) laptop G750JM.

But it has issues. It actually has a number of issues you can’t overlook.

Technical specification

Its specs are not the best, but good enough to run multiple VMs and The Witcher 3, and to enjoy desktop-like performance on just about any task. Not high-end desktop-like performance, but not far from it.

  • CPU: Intel i7 4700HQ 2.4GHz;
  • RAM: 16GB;
  • GPU: NVIDIA GTX 860M (2GB);
  • Storage: 1TB Hitachi Travelstar 7K1000;
  • Windows 8.1 Pro 64-bit.

For the operating system I have installed my own license of Windows 8.1, just to make sure I don’t have to live with the bloat it comes with. No overclocking was involved at any moment, it’s still a laptop so I’d like to maximize its lifespan.

Blue Screen of Death: DRIVER_POWER_STATE_FAILURE 0x9F

Ever since I have owned this laptop, I remember having systematic Blue Screens of Death (DRIVER_POWER_STATE_FAILURE) when GeForce Experience was installed and ShadowPlay was enabled at startup. I have since removed GeForce Experience to avoid BSoDs at startup.
I never got to understand the issue. I have never installed ShadowPlay again since then, and I actually would like to be able to record gaming footage…

But recently I have been experiencing those BSoDs again. Strange since I didn’t update the NVidia driver in a long time (version 353.06). I could take a look at a crash dump.

The good thing about running into issues is that it’s the perfect time to learn stuff such as using WinDbg. With just a few commands I can find out what went wrong… Or get close to knowing.

This is the debugging process:

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 9F, {3, ffffe000e61bd060, ffffd0011789c960, ffffe000e546a010}

Probably caused by : pci.sys

Followup: MachineOwner
---------

2: kd> k
Child-SP          RetAddr           Call Site
ffffd001`1789c928 fffff802`a4e92782 nt!KeBugCheckEx
ffffd001`1789c930 fffff802`a4e926a2 nt!PopIrpWatchdogBugcheck+0xde
ffffd001`1789c990 fffff802`a4d302d8 nt!PopIrpWatchdog+0x32
ffffd001`1789c9e0 fffff802`a4dde7ea nt!KiRetireDpcList+0x4f8
ffffd001`1789cc60 00000000`00000000 nt!KiIdleLoop+0x5a
2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time.
Arguments:
Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time
Arg2: ffffe000e61bd060, Physical Device Object of the stack
Arg3: ffffd0011789c960, nt!TRIAGE_9F_POWER on Win7 and higher, otherwise the Functional Device Object of the stack
Arg4: ffffe000e546a010, The blocked IRP

Debugging Details:
------------------


DRVPOWERSTATE_SUBCODE:  3

IMAGE_NAME:  pci.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  53d0f1d4

MODULE_NAME: pci

FAULTING_MODULE: fffff80107fae000 pci

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  0x9F

PROCESS_NAME:  System

CURRENT_IRQL:  2

ANALYSIS_VERSION: 6.3.9600.17237 (debuggers(dbg).140716-0327) amd64fre

DPC_STACK_BASE:  FFFFD001178A3FB0

STACK_TEXT:  
ffffd001`1789c928 fffff802`a4e92782 : 00000000`0000009f 00000000`00000003 ffffe000`e61bd060 ffffd001`1789c960 : nt!KeBugCheckEx
ffffd001`1789c930 fffff802`a4e926a2 : ffffe000`e546af08 00000000`00000001 ffffe000`ea7417e0 fffff802`a4d2e391 : nt!PopIrpWatchdogBugcheck+0xde
ffffd001`1789c990 fffff802`a4d302d8 : 00000000`00000000 ffffd001`1789cae0 00000000`00000001 00000000`00000001 : nt!PopIrpWatchdog+0x32
ffffd001`1789c9e0 fffff802`a4dde7ea : ffffd001`17872180 ffffd001`17872180 ffffd001`1787e3c0 ffffe000`ed531080 : nt!KiRetireDpcList+0x4f8
ffffd001`1789cc60 00000000`00000000 : ffffd001`1789d000 ffffd001`17897000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

IMAGE_VERSION:  6.3.9600.17238

FAILURE_BUCKET_ID:  0x9F_3_ACPI_IMAGE_pci.sys

BUCKET_ID:  0x9F_3_ACPI_IMAGE_pci.sys

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:0x9f_3_acpi_image_pci.sys

FAILURE_ID_HASH:  {20ddeb92-07eb-ebdc-bd08-44da71ffbd68}

Followup: MachineOwner
---------

2: kd> !irp ffffe000e546a010
Irp is active with 6 stacks 4 is current (= 0xffffe000e546a1b8)
 No Mdl: No System Buffer: Thread 00000000:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e1 ffffe000e61b1240 00000000 fffff80107fb3b30-ffffe000e61ff730 Success Error Cancel pending
         \Driver\ACPI	pci!PciBusPowerUpCompletion
      Args: 00000000 00000001 00000001 00000000
 [ 16, 2]   0 e1 ffffe000e61ff5e0 00000000 fffff802a4da91cc-ffffe000e546af08 Success Error Cancel pending
         \Driver\pci	nt!PopRequestCompletion
      Args: 00000000 00000001 00000001 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-ffffe000e546af08    

      Args: 00000000 00000000 00000000 00000000

I get to understand a few things from it:

  • The “Probably caused by” line: pci.sys might have caused the BSoD;
  • Arg1 and Arg2: A physical device object blocked an IRP for too long;
  • Arg4: The blocked IRP.

What’s actually interesting is Arg4: ffffe000e546a010. At the end of the session I run the !irp command with that IRP address:

2: kd> !irp ffffe000e546a010
Irp is active with 6 stacks 4 is current (= 0xffffe000e546a1b8)
 No Mdl: No System Buffer: Thread 00000000:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e1 ffffe000e61b1240 00000000 fffff80107fb3b30-ffffe000e61ff730 Success Error Cancel pending
         \Driver\ACPI	pci!PciBusPowerUpCompletion
      Args: 00000000 00000001 00000001 00000000
 [ 16, 2]   0 e1 ffffe000e61ff5e0 00000000 fffff802a4da91cc-ffffe000e546af08 Success Error Cancel pending
         \Driver\pci	nt!PopRequestCompletion
      Args: 00000000 00000001 00000001 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-ffffe000e546af08    

      Args: 00000000 00000000 00000000 00000000
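
Going from the BugCheck line to the !irp command is mechanical, so here is a minimal sketch of it in Python. It assumes the summary line has the same shape as in my dumps, and the function names are made up for this example. It only builds the command text; you still type it into WinDbg yourself.

```python
import re

# Matches e.g. "BugCheck 9F, {3, ffffe000..., ffffd001..., ffffe000...}"
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [arg1, arg2, arg3, arg4]) as integers, or None if the line doesn't match."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    args = [int(arg.strip(), 16) for arg in match.group(2).split(",")]
    return code, args


def irp_command(line):
    """Build the '!irp <address>' command for a 0x9F bugcheck whose Arg1 is 3."""
    parsed = parse_bugcheck(line)
    if parsed is None:
        return None
    code, args = parsed
    # Arg4 is only "the blocked IRP" for DRIVER_POWER_STATE_FAILURE with Arg1 == 3.
    if code != 0x9F or len(args) < 4 or args[0] != 3:
        return None
    return f"!irp {args[3]:016x}"


line = "BugCheck 9F, {3, ffffe000e61bd060, ffffd0011789c960, ffffe000e546a010}"
print(irp_command(line))  # prints "!irp ffffe000e546a010"
```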

The marked entry (the one prefixed with >) is the current stack location. The line right below it shows that the IRP’s driver is \Driver\ACPI. ACPI is what puts some components to sleep and does status monitoring. What could possibly be the issue and how is it related to Nvidia?

I don’t know. Reports on the internet have pointed at some feature called Turbomaster being the culprit. Disabling it in the UEFI should fix the issue according to reports from people online. It did not for me.

Let’s check the most recent BSoD:

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 9F, {3, ffffe000842c7410, ffffd0002609c960, ffffe00084f9dca0}

Probably caused by : pci.sys

Followup: MachineOwner
---------

2: kd> !irp ffffe00084f9dca0
Irp is active with 6 stacks 4 is current (= 0xffffe00084f9de48)
 No Mdl: No System Buffer: Thread 00000000:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    

      Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e1 ffffe000842c7410 00000000 fffff8015a4a837c-ffffd000205ea8c0 Success Error Cancel pending
         \Driver\pci	dxgkrnl!DpiFdoPowerCompletionRoutine
      Args: 00000000 00000001 00000001 00000000
 [ 16, 2]   0 e1 ffffe000847a31f0 00000000 fffff802715201cc-ffffe0008709fea8 Success Error Cancel pending
        *** ERROR: Module load completed but symbols could not be loaded for nvlddmkm.sys
 \Driver\nvlddmkm	nt!PopRequestCompletion
      Args: 00000000 00000001 00000001 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-ffffe0008709fea8    

      Args: 00000000 00000000 00000000 00000000
2: kd> !sysinfo machineid
Machine ID Information [From Smbios 2.7, DMIVersion 39, Size=1687]
BiosMajorRelease = 4
BiosMinorRelease = 6
BiosVendor = American Megatrends Inc.
BiosVersion = G750JM.207
BiosReleaseDate = 07/17/2014
SystemManufacturer = ASUSTeK COMPUTER INC.
SystemProductName = G750JM
SystemFamily = G
SystemVersion = 1.0       
SystemSKU = ASUS-NotebookSKU
BaseBoardManufacturer = ASUSTeK COMPUTER INC.
BaseBoardProduct = G750JM
BaseBoardVersion = 1.0
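
To compare dumps quickly, the driver that holds the blocked IRP can be pulled out of the !irp output with a few lines of Python. This is a minimal sketch that assumes the output looks like the ones in this post: the current stack location is prefixed with >, and the driver name follows on one of the next lines. The function name and the trimmed sample are there for illustration only.

```python
def current_stack_driver(irp_output):
    """Return (driver, completion_routine) for the entry marked with '>', or None."""
    lines = irp_output.splitlines()
    for index, line in enumerate(lines):
        if not line.startswith(">"):
            continue
        # The driver is printed on a line following the marked entry.
        for follow in lines[index + 1:]:
            fields = follow.split()
            if fields and fields[0].startswith("\\Driver\\"):
                routine = fields[1] if len(fields) > 1 else None
                return fields[0], routine
            if follow.lstrip().startswith("["):
                break  # reached the next stack location without finding a driver
        return None
    return None


# Trimmed copy of the most recent dump's !irp output.
SAMPLE = r"""
 [  0, 0]   0  0 00000000 00000000 00000000-00000000
      Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e1 ffffe000842c7410 00000000 fffff8015a4a837c-ffffd000205ea8c0 Success Error Cancel pending
         \Driver\pci dxgkrnl!DpiFdoPowerCompletionRoutine
      Args: 00000000 00000001 00000001 00000000
 [ 16, 2]   0 e1 ffffe000847a31f0 00000000 fffff802715201cc-ffffe0008709fea8 Success Error Cancel pending
 \Driver\nvlddmkm nt!PopRequestCompletion
"""

print(current_stack_driver(SAMPLE))
```

For the sample above this returns \Driver\pci together with dxgkrnl!DpiFdoPowerCompletionRoutine, while the first dump’s marked entry belongs to \Driver\ACPI.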

Why is \Driver\pci the issue?
Well, it’s time to roll back to the OEM drivers, as suggested here, to be sure I’m not going to have any more issues. Version 347.52 from the 2nd of April 2014!

Damaging SATA controller

What’s great about the G750JM is that it has two 2.5″ bays, one of which is already populated with a 1TB hard drive (Hitachi Travelstar 7K1000). That’s a great opportunity to put another 1TB drive in the other bay so I can double my storage!

Except no. The SATA controller seems to damage hard drives put in the second bay. I got no response from Asus’ support team regarding that issue. When I RMA’d the laptop they changed the motherboard (on which the SATA controller sits?). In total I went through 4 hard drives in the second bay, all of the same brand as the one that came with the laptop.
They all ended up dying. And more often than not, not even reporting S.M.A.R.T. issues. Something is wrong, and it’s not the hard drives causing issues.

So I started thinking… Maybe I’m the only one experiencing this issue. Well I’m not really the only one actually. That might not mean much.

Asus’ support

Let’s rant a bit. I have been contacting the support since 2014 I believe. I have RMA’d this computer twice and it still damages the hard drives in the second bay, it still needs to be disassembled to rewire the screen correctly sometimes. It still Blue Screens with Nvidia’s drivers.

Is this Asus quality? How come I’ve had a flawless experience on a desktop computer with Asus but on a laptop I must pull out my hair?!

I have literally emailed Asus’ support team begging them to get the laptop repaired once and for all, begging them for the quality of service I believe every single customer deserves. Even when my warranty ended.

I have tried my best to diagnose issues, I have sent videos and logs. I have asked for help and nothing was ever done to help me. Instead it was always the kind of response where I would have to send the computer back and let them evaluate if the warranty covers the damages or not before doing anything.
Am I an idiot for trying to figure out what’s wrong and to get a fully functional laptop?

You tell me Asus.

Conclusion

The Asus ROG G750JM is a defective-by-design laptop. My motherboard, GPU, screen and “SATA cables” were changed during the first RMA. I then got a screen that had some of the worst burn-in you could possibly ever have, and it wasn’t even securely wired, so when the cable came loose the picture would just glitch.

While I think that customers don’t have to be treated as kings, I believe that quality should be a concern. Provide the best product and service possible to your customers.

It seems that the G552VW model has much the same problem with the Nvidia driver. I have done some support (in French) for a user whose BSoD happened in Phantasy Star Online 2.
Installing the old drivers from Asus’ support site resolved the instant BSoD that the user would get a few seconds after loading the game. I see a pattern.