Dennis Yurichev <dennis@conus.info>
FPGA is a construction set. Its chip mostly consists of typical blocks (cells), each of them can be programmed using information in flash-memory after powering. This allows building almost any digital circuit on top of it.
In Altera Quartus development environment one can draw logical primitives and connect them by wires using mouse, but this is used mostly for demonstration purposes. There are HDL (hardware description languages), which exist for a long time and which are able to describe logical blocks of any abstraction level and connections between them. The most popular ones are VHDL and Verilog. But even using them for programming, it is hard to create something practical, because of extremely low level of abstractions. Procedural programming paradigm is almost unusable here. Fortunately, many symmetrical cryptographical algorithms (like DES) traditionally can be very well described by logical primitives.
Often, reusable blocks of electronic design are called IP-core or "intellectual property core". Building a FPGA-based DES-encrypting IP-core is not very hard. It will contain two inputs (key and unencrypted data) and one output (encrypted data).
While implementing algorithms on FPGA, it is possible to concentrate on task entirely and not to do unnecessary actions. All our IP-core will do is only encrypting input stream and nothing more. Comparing with conventional microprocessor, we do not have necessity to shift results from register to register, from register to stack, from stack to cache-memory, from cache-memory to memory, etc. This is a reason of the speed burst.
FPGA chips can be reconfigured as many times as one would like, we can test implementations for different algorithms, etc, but for this feature chips contain interconnect logic, programmed during powering, describing which primitives and blocks to connect to which. When a signal is sent from one place to another, it will come up to 80% of time through interconnect logic and only 20% of time remain for the effective work. The situation is different in ASIC (Application-specific integrated circuit) where interconnect logic is absent, and contains only what is necessary. On the other hand, ASICs are not reconfigurable as FPGA. For example, EFF DES cracker [2] was made on ASIC chips.
Password hashing algorithm [3] in Oracle RDBMS (versions before 11) is based on well-known DES encryption algorithm [4]. Later, in version 11, SHA1 hashing method was implemented too, but old scheme is still used, partly because, ironically, new SHA1 algorithm can be brute-forced much faster [5]. So, how old scheme works: string containing username is concatenated with password string. Resulting string is encrypted by DES in CBC mode with key 0x0123456789ABCDEF. Result of last CBC iteration is the new key. String containing username and password is encrypted again, with new key. Result of last CBC iteration is the hash we need.
So, we need to create IP-core which can take username and password as input and give us resulting 64-bit hash. We will brute-force all possible passwords (in reasonable limits) and compare them with hash we are interested in. If we find a password where resulting hash is the same as we are looking for, this means that we have found the password with very high confidence (we are not taking cryptographical collisions into account).
Let's start with DES IP-core. Since the time it was introduced in 1977, no faster practical attack than brute-force is known today. In a nutshell, it contains 3 functions: preparing input data (initial permutation), encryption function repeating 16 times (called "rounds") and function of preparing output data (final permutation).
There are also so called S-boxes (substitution tables). DES uses 8 tables, each table indexed by 6-bit value and output is 4-bit. It is possible to reduce these tables to logical functions for optimization purposes (for example: [6]), and this solution will not need memory at all. This is fine for modern microprocessors which are able to execute several instructions simultaneously without accessing cache or memory. And it can be helpful there. But it's different on FPGA. It is known that Altera Quartus puts logical functions into small tables whenever it is possible, and FPGA cells contain lookup tables (LUT/ALUT) for it. More on this: [7].
Next question is creating IP-core capable of encrypting data in Cipher Block Chaining (CBC) [8] mode. Nothing special here: we add one more counter; also we add circuit XOR-ing inputs with outputs for DES module or not XOR-ing anything if the input block is the first.
Another question is Oracle password hashing function, which will take text string on input and give us result on output. There will be just two phases: the first one is string encrypting with the key 0x0123456789ABCDEF and the second one is encrypting the same string but with the key we got at the first phase. We do this by adding more registers and also register for indicating a phase (first or second).
So, we have such IP-core. For the username+password length not longer than 12 symbols it will work with such timing: 16*4*2=128. This means: 16 DES rounds * 4 CBC iteration for 12 symbol encryption * 2 phases.
Now we have much more complex task. In Altera Stratix II EP2S60 chip we are able to implement 60 of such IP-cores, so we need to make all them work in parallel. A set of all passwords we would like to check is what we call "keyspace", and we need to cut this keyspace into approximately 60 equal parts, and do this on-the-fly while execution. More of that, 60 is not a constant. We may have any different FPGA chip where 10 IP-cores may fit in, or 100, or any other number.
By Oracle's password standard, first password symbol is always Latin character (one of 26) and other symbols can be Latin characters, digits and "#", "$", "_" symbols (39 in total). If we could have only 26 or 39 brute-forcing blocks, our task would be simpler: we could just divide password on "base" and "last character". "Base" can be shared by all brute-forcing blocks and last character is different for all.
Here we are shifting to another part of the project, which is more embedded-programming like. We need a microprocessor, C language compiler and something more. Altera has soft-microprocessor Nios II [9] (their competitor Xilinx has MicroBlaze). This is typical RISC-processor, but it is different in the sense that it is does not exist as a chip, it is written in HDL language and it is configurable and it is IP-core. It is possible to turn on or off some features, change cache-memory size, define boot address, add debug features. All this also defines how much FPGA cells it will occupy and how fast it will work. There is Nios II back-end for well-known GCC compiler written by Altera. We also need a bus which processor, DDR-memory and other IP-cores will be connected with. Such bus in Alera environment is called Avalon [10]. We will also use Altera SOPC Builder [11] program for connecting all these modules, defining all address for them, interrupts etc.
Also we connect to Avalon bus everything we've prepared before: all those 60 brute-force blocks.
So, now we are able to write in C and access blocks via ports and read results from there.
Let's get back to paralleling problem. To be precise, 2 problems exist: 1) we need to place separate password variation for each block; 2) we need to read result from each block and compare it with the hash we're looking for, and if it is what we need, stop and tell user about it.
All device's working time can be divided into brute-forcing blocks' working time and what is going on on high-level in Nios II processor. As I said before, for calculating of at least one hash (if username and password are short enough), we need at least 128 cycles. Which comes to mind first is to write a separate password into each block (there 60 of them), wait 128 cycles and read result. First and last tasks will take much more than 128 cycles and it will be extremely inefficient. We need to minimize overhead. Last task (read result and compare it with what we're looking for) can be solves easily: add comparator to each block and also add register containing reference value; all these blocks will do comparing on their side and in case of success fire some bit: which will be read by Nios II via Avalon bus.
And still we have to solve the first task. So we place some memory in each block for input data. From a C program we can write to either each block's memory or to all of them simultaneously (setting special bit in address) via ports. For example, we are brute-forcing all 3-symbol passwords. At the start we write username + "AA" to first block's memory, username + "AB" to second block's memory, "AC" - to the third, and so for each. This operation is relatively slow. After that, we write "A" at position of password's third symbol to all blocks simultaneously and start brute-forcing search. After 128 cycles, if nothing is found, we write "B" to all blocks at the same position and start searching again. At the same time, all we wrote at first and second password's positions are still the same as we wrote at start. Then we write "C", and so on. Now, major part of our device's working time is: write one byte at some specific address via Avalon bus + wait for 128 cycles of brute-force search + simultaneously read all bits for all blocks and look if any block has found anything. Now it is fast enough.
One more thing. Let's get back to Oracle password hashing algorithm. If we need to get hash for user "SYS" and password "ABCDEFHJ", we need to form string "SYSABCDEFGHJ", encrypt it in CBC mode (3 calls to DES) and do this one more time (for the second algorithm phase). As we can see, at the first phase, "SYSA" is at the input at the first CBC iteration, "BCDE" on the second and "FGHJ" on the third. We know that major part of time, only last symbol in string "FGHJ" will be changing, this mean, results of first and second CBC iterations will be the same. So, we can cache it and save a lot of time: on long usernames it is tend to (but not equal to) 50%.
On user side we make web-interface. It is possible to use web-server example by Altera [12]. We also use well-known in embedded-programming NicheStack TCP/IP Stack [13]. Web-page is created on-the-fly (we do not even have such abstraction as files) and handles HTTP GET-requests. We have to work with brute-forcing circuits and, at the same time, handle user web-requests, so we need to use some multi-tasking operation system, like MicroC/OS-II [14]. In general, its features differ from conventional personal computer OS features, because it is intended for embedded devices, and we only use multitasking.
Board used: Nios II Development Kit, Stratix II Edition [15]. What we use on the board: EP2S60 chip, DDR-memory, Ethernet-controller chip and connector, RS-232 port, used for logging and debugging purproses. Brute-forcing circut is clocked at 99MHz.
Board connected to Internet on 24h basis and accessible at: http://ops.conus.info
References: