Recently, me and Vadim decided to tag-team a 500 point challenge on pwnable called malware:

I have no respect for writing malware
but I do have respect for writing cool malware
lets find a way to beat ANUBIS (https://anubis.iseclab.org)

download : http://pwnable.kr/bin/malware.py

Running at : nc pwnable.kr 9017

The Challenge

We pulled down the supplied python code, and we were immediately greeted with this header:

Shall we play a game?

Hey man, I'm writing a malware to steal some bank account :)
anyway, I want to put some timing based emulator detection
logic into my malware. but the malware emulators are becoming
fast and smart in these days :( so simple timing check to see
if 'RDTSC count is big' stuffs won't work accurately.

I want to make code A and B such that A is slower than B in
case of emulator but B is slower than A in case of real CPU

can you make me such kind of x64 asm code?
also, I want the detection code to be very small...
once your code bypasses my stupid filter, I'll evaluate the code.
get me some magic code here :)

Taking this into account and reading through the code, I came to the conclusion that the script worked like this:

  1. Ask for x64 assembly code A.
  2. Ask for x64 assembly code B.
  3. Compile A and B into a C stub which runs each 10,000 times and taks note of the execution times. The stub returns emulator if B takes more than 120% the time of A, realcpu if A takes more than 120% the time of B, or unsure otherwise.
  4. Loop 100 times, calling the compiled stub either natively or in QEMU by random decision.
  5. Check to see if the stub guessed correctly whether it was emulated or run natively.
  6. If at least 97 out of 100 runs are correct, print the flag.

Additionally, the script imposes some restrictions on what the assembly code can contain:

black_list  = ['jne', 'je', 'jmp', 'ret', 'and', 'mov', 'push', 'pop', 'xor', 'loop', 'sh']
black_list += ['"', '/', '#', '(', ')', '*', '&', '\'', '\n', '\t', '\x00', '@', '^', '+']
black_list += ['sub', 'add', 'jz', 'jnz', 'byte', 'int', '80', 'leave', 'syscall', 'xchg', 'nop']

def check( code ):
	for e in black_list:
		if code.lower().find(e) != -1: return False
	if len(code) > 40: return False

That's right, we were banned from using JMP, MOV, AND, POP, PUSH, and XOR, among many others. We also had to fit each payload into 40 bytes or less. It's worth mentioning that that's 40 bytes of mnemonics, not 40 bytes of bytecode.

We brainstormed a bit but ultimately realized that there's not much we could do. After sitting with ourselves for a while, we came up with some tricks, but nothing that panned out. Thats when I thought about the C code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <seccomp.h>
#include <sys/prctl.h>

void sandbox(){
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
	if (ctx == NULL) {
		printf("seccomp error\n");
		exit(0);
	}

	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(connect), 0);
	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(dup), 0);
	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(dup2), 0);
	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(dup3), 0);
	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(open), 0);
	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(openat), 0);
	seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(execve), 0);

	if (seccomp_load(ctx) < 0) {
		seccomp_release(ctx);
		printf("seccomp error\\n");
		exit(0);
	}
	seccomp_release(ctx);
}

unsigned long long rdtsc(){
	unsigned long lo, hi;
        asm( "rdtsc" : "=a" (lo), "=d" (hi) ); 
        return( lo | (hi << 32) );
}

void A(){
	asm("{0}");
}

void B(){
	asm("{1}");
}

unsigned long long measure_time(void (*f)(), int cnt){
	register int i = cnt;
	unsigned long long t2, t1;
	t1 = rdtsc();
	prctl(PR_SET_TSC, PR_TSC_SIGSEGV);
	while(i--) f();
	prctl(PR_SET_TSC, PR_TSC_ENABLE);
	t2 = rdtsc();
	return t2-t1;
}

int main(int argc, char* argv[]){

	unsigned long long t_A=0, t_B=0;
	void* p = mmap((void*)0x66666000, 0x1000, PROT_WRITE|PROT_READ|PROT_EXEC, MAP_ANONYMOUS|MAP_SHARED, -1, 0);

	sandbox();	

	measure_time(A, 10);		// warm up cache
	t_A = measure_time(A, 10000);

	measure_time(B, 10);		// warm up cache
	t_B = measure_time(B, 10000);

	if( t_A > t_B*1.2 ){
		printf("emulator! (A:%llu, B:%llu)\n", t_A, t_B);
	}
	else if( t_B > t_A*1.2 ){
		printf("realcpu! (A:%llu, B:%llu)\n", t_A, t_B);
	}
	else{
		printf("im not sure... (A:%llu, B:%llu)\n", t_A, t_B);
	}

	return 0;
}

"Why the hell are they using mmap() to allocate a buffer they never use?", I thought. Well, then it seemed obvious: they were either trolling us, or we needed the buffer to solve the challenge. Given how stuck we were, we assumed it was the latter.

The Payload

We took a break to grab some food and head to our monthly meetup with our local DEFCON chapter, and I had a thought: "What would happen if we put a RET in the buffer and made a CALL to it? Could QEMU possibly have performance issues executing code from memory allocated with mmap?"

As soon as we arrived at the meetup, Vadim whipped out his laptop and we attempted to accomplish this in 40 bytes. I realized we could use the .long macro to emit 4-byte chunks of bytecode at a time, while Vadim found that the most succinct way to get 0xC3 (RET) into the buffer was using STOSB. Our first try looked something like this:

MOV AL, 0xC3
MOV RDI, 0x66666000
STOSB
DEC RDI
CALL [RDI]

Compiled to bytecode, this is 15 bytes. Using .long to emit the bytes meant 46 characters. No good. I then realized that we could ditch 2 bytes if we got the value 0x66666000 from the stack of main(), rather than defining it as an immediate value. We had to compile the code locally and use gdb to find the offset, and it turned out to be 0x78:

MOV AL, 0xC3
MOV RDI, [RSP+0x78]
STOSB
DEC RDI
CALL [RDI]

We were closer, but we still needed 42 characters to represent all 13 bytes. We needed to shave one more byte. It's at this point Vadim realized that CALL [RSP+0x78] is 4 bytes, as opposed to the 5 taken by DEC RDI; CALL [RDI]. And, there we had it, a succinct assembly snippet:

MOV AL, 0xC3
MOV RDI, [RSP+0x78]
STOSB
CALL [RSP+0x78]

Which could be given in 38 characters:

.long 0x8b48c3b0,0xaa78247c,0x782454ff

When I executed this payload as both A and B, I noticed something magnificent:

running your code with real CPU...
your code says : im not sure... (A:2552805, B:2750728)
running your code inside qemu-x86_64...
your code says : im not sure... (A:278716331, B:283544101)

When comparing the execution times between a real CPU and an emulator, this code had two orders of magnitude in performance variation. To ensure that this wasn't simply the expected consequence of emulation, I tested both A and B with 4 NOP instructions:

running your code with real CPU...
your code says : im not sure... (A:68037, B:68961)
running your code inside qemu-x86_64...
your code says : im not sure... (A:2373822, B:2292258)

This showed a large variation, but emulation was now only 33x slower; not 104x slower. This pretty much confirmed to us that, indeed, something about mmap() really slowed down the emulator. Neither of us knew what, and even though I did guess the solution right on my first try, I know I wouldn't have if they hadn't left us a monumental hint with that lonely mmap() call. Whatever the reason, though, we crafted B as a block of 16384 NOP instructions using the .rept macro:

rept 0x1000; .long 0x90909090; .endr

Then we submitted our solution:

x64 asm code for function A: .long 0x8b48c3b0,0xaa78247c,0x782454ff
x64 asm code for function B: .rept 0x1000; .long 0x90909090; .endr
got your code. let me see if it works...
...
correct!(100)
dude...! your code is really awesome!
thanks for the help. here is reward :)
{flag omitted}

And we celebrated with milkshakes.

The Aftershock

We were quite pleased with our success and, even though we didn't understand exactly what happened, we discussed some theories as I drove Vadim home. However, after a few days passed, I decided to revisit the problem.

First, I wanted to see if I could cut the payloads down in size. After some digging, I realized I could use .octa to emit 16 bytes at a time. Ultimately, this would save 6 bytes in A, as it would remove 2 occurrences of ,0x from the payload. However, because the payload is only encoding 12 bytes, it would be followed by 4 implicit 0x00 bytes, potentially leading to a crash. I realized I could JMP over these bytes at little cost. A short JMP is 2 bytes, meaning I'd have 2 remaining 0x00 bytes to jump over. The bytecode for this jump is 0xEB 0x02, or 0x02EB when endian-flipped. Of course, the leading 0 could be omitted because it is implicit, meaning the new payload was 38 - 6 + 3 = 35 characters:

.octa 0x2eb782454ffaa78247c8b48c3b0

Disassembled, this looks like:

MOV AL, 0xC3
MOV RDI, [RSP+0x78]
STOSB
CALL [RSP+0x78]
JMP 2  ----------.
ADD [RAX], AL    |
...    <---------'

Additionally, I realized I could cut B down from 37 to 16 characters using the .fill macro:

.fill 9999,1,144

This didn't really accomplish anything, I was just really curious to see how much smaller I could get it. For the life of me, I can't get A below 35 bytes. I haven't tried too hard to shorten B anymore, though.

If you manage to shorten A any further, I'd love to see how!

The Attack

Even though I had solved the challenge, and even managed to do decently well at payload golf, I still didn't understand exactly why it worked. When making the guess, I had a few theories:

  1. Because the allocated buffer is in shared memory, the emulator could be failing to optimize or JIT compile the code (because shared memory can possibly be modified without it's knowledge).
  2. The emulator builds some kind instruction cache, but does not include mmap() pages in that cache, causing execution from them to be quite slow.

These were close, but ultimately they were wrong. Though I'm not 100% certain of this answer, after reading a lot of QEMU code and documentation, I think it is the correct one:

  • QEMU does runtime translation of emulated code into native code, similar to what you might find in a JIT compiler. This is stored in what is called the translation cache.
  • QEMU strips PROT_WRITE from any pages which have been translated, meaning any writes to the pages trigger a segmentation fault exception.
  • When a segmentation fault is triggered, QEMU invalidates the translation cache for the target page, restores PROT_WRITE to it, sets the trap flag, and resumes execution.
  • The this allows the write to happen, but the trap flag instructs the CPU to throw a type-1 interrupt, which is caught by QEMU so that it can once again strip PROT_WRITE from the page.
  • Finally, when the page is executed, the translation cache is rebuilt because it is marked as invalidated.

So, in fact, the reason had little to do with mmap(), and much to do with the overhead of translation cache invalidation due to code overwrite as well as translation cache rebuild due to execution of invalidated code. To confirm this theory, I kept the same payload for A, but used this for B:

CALL [RSP+0x78]

Because A always runs before B, we can be sure that the 0xC3 would be in the right spot when B is executed, meaning it would still be hitting the mmap() buffer and returning immediately. When comparing the times, I found that, during emulation, A executed in 275855292 units, where B executed in 4586202. This is a difference of 60x. This shows that the CALL into and RET out of the mmap() memory is extremely cheap when no writes occur. This doesn't mean that the write carries all of the overhead, but, rather, the side-effect of the write (cache invalidation, with some overhead of it's own) leads to extra overhead when the CALL is executed due to forced translation cache rebuild.

Closing Thoughts

This was quite a brilliant challenge. Even with the flag in hand and a good understanding of what might be happening, I didn't manage to find any documentation on this method of emulator detection. This goes to show how much insight the creator must have had when writing this challenge.