Tuesday, July 26, 2016

A Neural Network Scoring Engine in PL/SQL

Topic: In this post, you will find an example of how to build and deploy a basic artificial neural network scoring engine using PL/SQL for recognizing handwritten digits. This post is intended for learning purposes, in particular for Oracle practitioners who want a hands-on introduction to neural networks.


Introduction

Machine learning, and neural networks in particular, are currently hot topics in data processing. Many tools and platforms are now easily available to work and experiment with neural networks and deep learning (see also the links at the end of this post). Recognizing handwritten digits, in particular using the MNIST database by Yann LeCun et al., is currently the "hello world" example for neural networks.
In this post, you will see how to build and deploy a simple neural network scoring engine to recognize handwritten digits using Oracle and PL/SQL. The final result is a short PL/SQL package with an accuracy of about 98%. The neural network is built and trained using TensorFlow and then transferred to Oracle for serving.

One of the ideas that this post wants to illustrate is that scoring neural networks is much easier than training them: the operations required for serving a trained network can be implemented relatively easily in many computing languages/environments. Discussions on these topics are normally centered around platforms for "Big Data" (see for example Spark and MLlib). I find it interesting to note that neural networks can also be successfully applied to the RDBMS world. This can be useful, as large quantities of valuable data are currently stored in relational databases. In the case of Oracle, the implementation of a scoring engine is also made easier by the availability of a mature PL/SQL environment with a package for linear algebra: UTL_NLA.


Let's start from the end: how to deploy the PL/SQL package MNIST and recognize handwritten digits using Oracle

One short PL/SQL package and two tables are all you need to replay the following example (you can find the details of the code on Github). The tables are:
  • TENSORS_ARRAY: this table contains the numerical values for the vectors and matrices (tensors) that constitute the neural network. There is a total of 79510 floating point numbers encoded into four tensors using the data type UTL_NLA_ARRAY_FLT.
  • TESTDATA_ARRAY: this table contains the test images. There are 10K images, each composed of 28x28 = 784 pixels. Image data is also encoded using the data type UTL_NLA_ARRAY_FLT.
The engine for scoring the example neural network is in a package called MNIST. It has a procedure called INIT that loads the components of the neural network from the table tensors_array into PL/SQL variables, and a function called SCORE that takes an image as input and returns a number, the predicted value of the digit.
Here is an example of its usage, where the first image in the table testdata_array is examined and correctly predicted to represent the number 7 (the image label agrees with the prediction by MNIST.SCORE):

SQL> exec mnist.init

PL/SQL procedure successfully completed.

SQL> select mnist.score(image_array), label from testdata_array where rownum=1;

MNIST.SCORE(IMAGE_ARRAY)      LABEL
------------------------ ----------
                       7          7

Figure 1: This is a bitmap display of the test image used in the example. This confirms that the prediction of MNIST.SCORE is correct and indeed the image is a representation of the number 7 handwritten and encoded in a grid of 28x28 gray-scale pixels.


Processing all the test images is also a matter of a simple SQL command. In the example of Figure 2 it takes 2 minutes to process 10000 test images, that is about 12 ms per image on average. The accuracy of the scoring function is about 98%. It is calculated as follows: out of 10000 images, 9787 are scored correctly according to the data labels. Note also that the set of test images is disjoint from the images used to train the neural network. Therefore, we can expect the MNIST package to have an accuracy of about 98% also when recognizing digits from generic input (additional evaluations of the quality of the MNIST package as a classifier are beyond the scope of this post).
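As an illustration, this is a sketch of how the accuracy measurement could be reproduced from Python with cx_Oracle (the connection string is illustrative and the exact query used for Figure 2 may differ):

import cx_Oracle

ora_conn = cx_Oracle.connect('mnist/mnist@ORCL')   # illustrative connect string
cursor = ora_conn.cursor()

# Load the network tensors into the MNIST package variables
cursor.execute("begin mnist.init; end;")

# Score all the test images and compare the predictions with the labels
cursor.execute("""
    select sum(case when mnist.score(image_array) = label then 1 else 0 end) correct,
           count(*) total
    from testdata_array""")
correct, total = cursor.fetchone()
print("Accuracy: {0:.2f}%".format(100.0 * correct / total))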


The full PL/SQL code and the datapump dump file with the relevant tables can be found on Github. In the following paragraphs, you can read how to build and train the neural network.


Figure 2: The accuracy of the PL/SQL scoring function MNIST.SCORE on the test set of 10K images is about 98%. Processing takes about 12 ms per image.


The neural network

The neural network used in this post is composed of three layers (see also Figure 3): one input layer, one hidden layer and one output layer. If this topic is new to you, I recommend doing some additional reading (see references), in particular Michael Nielsen's "Neural Networks and Deep Learning", which provides an excellent introduction to the topic and a series of step-by-step examples on the problem of recognizing handwritten digits.


Figure 3: The artificial neural network used in this post is composed of three layers. The input layer has 784 neurons, one per pixel of the input image. A hidden layer of 100 neurons is added to improve the accuracy. The output layer has 10 neurons, one for each possible output value (that is, digits from 0 to 9).


Get the training and test data, build and train the neural network

Another important step for deploying neural networks is training. For this you need data, lots of it if possible. You also need an engine to do the necessary computation. Luckily there are many platforms available for working with neural networks that are free and relatively easy to deploy (see references). In this post, you will see how to use Google's TensorFlow and the Python environment. TensorFlow comes with a tutorial for recognizing handwritten digits in the MNIST database. Included in the tutorial are training and test data with labels and also example code.

You can find the code I used for training the neural network on Github. Some highlights and code snippets are discussed in the following.

Importing the data: The example dataset that comes with TensorFlow provides 55000 images for training and 10000 images for testing. These originally come from the work of Yann LeCun and coworkers. Having large amounts of high-quality data is very important to the success of the process. Moreover, the images come with labels: the labels tell which number each image is intended to depict and provide a very important piece of information as the exercise is to do supervised learning.

Defining the neural network: there are four tensors (vectors and matrices in this case) in the network: W0, W1, b0 and b1. They are defined in the following snippet of code. To better understand their role, and the key role that the cross entropy and the gradient descent optimizer play in training the network, see the references, in particular "Neural Networks and Deep Learning" and the TensorFlow tutorial.
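A sketch of such a definition with the TensorFlow API used here (layer sizes follow Figure 3; variable names and initializers are illustrative and may differ from the actual training script on Github):

import tensorflow as tf

# Input images (784 pixels each) and one-hot labels (10 classes)
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# Hidden layer: 784 -> 100 neurons, sigmoid activation
W0 = tf.Variable(tf.truncated_normal([784, 100], stddev=0.1))
b0 = tf.Variable(tf.zeros([100]))
hidden_layer = tf.nn.sigmoid(tf.matmul(x, W0) + b0)

# Output layer: 100 -> 10 neurons, softmax normalization
W1 = tf.Variable(tf.truncated_normal([100, 10], stddev=0.1))
b1 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(hidden_layer, W1) + b1)

# Cross entropy loss and gradient descent optimizer
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)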



Training the neural network: training proceeds with multiple steps of optimization. Training is performed using 55000 images with labels. It runs over 30000 iterations using a "mini-batch" size of 100 images. At each step, the gradient descent algorithm computes an update of the weights and biases (W0, W1, b0 and b1) with the goal of minimizing the loss function (cross_entropy). The relevant snippet of the code is:
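A sketch of the training loop, assuming the graph definition above and the mnist.train data loader from the TensorFlow tutorial (the actual snippet on Github may differ):

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

# 30000 optimization steps, each on a mini-batch of 100 labeled images
for step in range(30000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})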


Result: as a result, the trained network has an accuracy of about 98% in recognizing the images in the test set. Note that the test set is composed of 10000 images and is disjoint from the set of images used for training (the training set contains 55000 images).
It is possible to get higher accuracy with more advanced neural network configurations (see references for details), but that is beyond the scope of this post.


Manually scoring the neural network, a Python example

The main result of the training operations is that the tensors (matrices and vectors in this case) that make up the neural network are now populated with useful values. I believe that a good way to understand how all this works is to "run the network manually": that is, to walk through an example of how to go from an image of a handwritten digit to the prediction of its value by the trained neural network. As a first step, we extract the values of the trained tensors in our model into numpy arrays for later processing:
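A sketch of this extraction step, assuming the session and variable names from the snippets above (sess.run on a trained tf.Variable returns the corresponding numpy array):

# Evaluate the trained variables into numpy arrays
W0_matrix = sess.run(W0)   # shape (784, 100)
b0_array  = sess.run(b0)   # shape (100,)
W1_matrix = sess.run(W1)   # shape (100, 10)
b1_array  = sess.run(b1)   # shape (10,)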


An example of "manually" operating the network in Python is as follows:



W0_matrix, b0_array, W1_matrix and b1_array are the tensors that constitute the neural network after training, "testimage" is the input, sigmoid() is used as activation function, "hidden_layer" represents the hidden layer of the network, "predicted" is the output layer and softmax() is a function used to normalize the output as a probability distribution. At the end of the calculation, the array predicted[n] contains the prediction that the input image represents the digit "n". The function argmax() finds the value of "n" where predicted[n] is maximized.
The code shown above predicts the value 7 for a test image. The prediction is confirmed as correct by the value of the label and can also be visually confirmed by the bitmap display of the test image (see Figure 1).


Move test data and network to Oracle

The example in the previous paragraph, on how to manually run the scoring engine, illustrates that serving a neural network can be straightforward: in some cases it is just a matter of performing some basic computations with matrices. This contrasts with the complexity of training neural network models, where one often needs a specialized engine, large quantities of training data and, in the more complex cases, also specialized hardware, such as GPU cards.
The discussion of the previous paragraph has also prepared the ground for the following development: moving the neural network tensors and test data to Oracle and implementing a serving engine there.
There are many ways to export Python's numpy arrays. One way is to save them in a text format. Here you will see instead a method targeted to exporting directly into Oracle using cx_Oracle, the Python library to interact with Oracle. See also the notebook "Oracle and Python with cx_Oracle" for additional examples and references on how to use cx_Oracle.

You can find the code on Github, here are some relevant snippets:

- Create the tables to host the tensor definition and test data:

SQL> create table tensors (name varchar2(20), val_id number, val binary_float, primary key(name, val_id));

SQL> create table testdata (image_id number, label number, val_id number, val binary_float, primary key(image_id, val_id));

- From Python, open a connection to Oracle:

import cx_Oracle
ora_conn = cx_Oracle.connect('mnist/mnist@ORCL')
cursor = ora_conn.cursor()

- Example of how to transfer the matrix W0 into the Oracle table "tensors"

i=0
sql="insert into tensors values ('W0', :val_id, :val)"
for column in W0_matrix:
    array_values = []
    for element in column:
        array_values.append((i, float(element)))
        i += 1
    cursor.executemany(sql, array_values)
ora_conn.commit()
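- In a similar way, the test images and labels can be transferred into the table "testdata". This is a sketch, assuming the mnist.test images and one-hot labels from the TensorFlow tutorial data loader (the actual script on Github may differ):

sql = "insert into testdata values (:image_id, :label, :val_id, :val)"
for image_id in range(len(mnist.test.images)):
    label = int(mnist.test.labels[image_id].argmax())   # one-hot label -> digit
    array_values = []
    for val_id, element in enumerate(mnist.test.images[image_id]):
        array_values.append((image_id, label, val_id, float(element)))
    cursor.executemany(sql, array_values)
ora_conn.commit()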


Oracle's optimizations for linear algebra

From Oracle documentation: "The UTL_NLA package exposes a subset of the BLAS and LAPACK (Version 3.0) operations on vectors and matrices represented as VARRAYS". This is very useful for implementing the calculations needed to serve the neural network of this post.
A snippet of the MNIST code, to get the gist of how this works in practice, is reported below. The code performs the calculation v_Y0 = v_Y0 + g_W0_matrix * p_testimage_array, where g_W0_matrix is a 784x100 matrix, p_testimage_array is a vector of 784 elements (encoding the 28x28 images) and v_Y0 is a vector of 100 elements.



utl_nla.blas_gemv(
                      trans => 'N',
                      m => 100,
                      n => 784,
                      alpha => 1.0,
                      a => g_W0_matrix,
                      lda => 100,
                      x => p_testimage_array,
                      incx => 1,
                      beta => 1.0,
                      y => v_Y0,
                      incy => 1,
                      pack => 'C'
        );


In order to use UTL_NLA, the tensors that make up the neural network and the test images need to be stored in varrays of binary_float, that is, declared of data type UTL_NLA_ARRAY_FLT.
For this reason it is also convenient to post-process the tables "tensors" and "testdata" as follows:

SQL> create table testdata_array as
select a.image_id, a.label, 
cast(multiset(select val from testdata where image_id=a.image_id order by val_id) as utl_nla_array_flt) image_array 
from (select distinct image_id, label from testdata) a order by image_id;

SQL> create table tensors_array as
select a.name, cast(multiset(select val from tensors where name=a.name order by val_id) as utl_nla_array_flt) tensor_vals 
from (select distinct name from tensors) a;

Finally, you can export the tables for later use. In the Github repository you can find a dump file obtained with the following command (run as Oracle):

$ expdp mnist/mnist tables=testdata,tensors directory=DATA_PUMP_DIR dumpfile=MNIST_tables.dmp

The final step, which brings you back to the discussion in the paragraph "Let's start from the end" above, is to create the PL/SQL package MNIST, which loads the tensors and performs the operations needed to score the neural network. See the details of the code on Github.


Conclusions and comments

This post describes an example of how to implement a scoring engine for an artificial neural network using the Oracle RDBMS and PL/SQL. The discussion is about a simple implementation of the "hello world" example of neural networks: recognizing handwritten digits of the MNIST database. The network is trained using TensorFlow and later exported into Oracle. The final result is a short PL/SQL package which provides digit recognition with an accuracy of about 98%.

We can expect in the near future to find increasing deployments of neural networks close to data sources and data stores. The example in this post of how to implement a neural network serving engine on an Oracle database shows that this is not only possible but also easy to implement.
Serving neural networks is much simpler than training them. While training requires specialized software/platforms, domain knowledge and large amounts of training data, trained networks can be imported into target systems and executed there, in many cases with low usage of computing resources.
This post is intended as learning material: a simple feed-forward neural network has been used instead of the better-performing convolutional networks (see references). Moreover, the data movement from TensorFlow to Oracle and the implementation of the serving engine in PL/SQL are somewhat of a hack in their present state and are not intended for production usage.

The code accompanying this post is available on Github.


Notes on how to build the test environment

The main components and tools for testing the scripts in this post are:
  • The Python environment (on Linux with CentOS 7) installed using Anaconda 4.1: Python 2.7, Jupyter IPython notebook.
  • TensorFlow, version 0.9 (the latest as I write this), installed following the instructions at https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html
  • Oracle RDBMS running on Linux. The Oracle scripts have been tested on Oracle 11.2.0.4 and 12.1.0.2.


References and acknowledgments

An excellent introduction to neural networks and an inspiration for this blog post is Michael Nielsen's book "Neural Networks and Deep Learning".
The code for neural network training used in this post is an extension of Google's TensorFlow MNIST tutorial.
See also: tutorial on TensorFlow by Martin Gorner
Basic techniques for TensorFlow by Aaron Schumacher
Visualizing MNIST by Christopher Olah
Python Machine Learning by Sebastian Raschka
Other popular frameworks for working with neural networks and deep learning besides TensorFlow include Theano and Torch among many others, see also this page on Wikipedia.


Monday, June 13, 2016

IPython/Jupyter Notebooks for Oracle

Topic: In this short post you can find examples of how to use IPython/Jupyter notebooks for running SQL on Oracle.

IPython/Jupyter notebooks are one of the leading free platforms for data analysis, with many advantages, notably the interactive web-based interface and a large ecosystem of readily available packages for data analysis and visualization. Moreover IPython/Jupyter notebooks are a very handy format for sharing code and data as you will see in the examples.
See also this blog post with examples on how to use Jupyter for querying Apache Impala.

It is of interest to integrate many data sources into Jupyter notebooks to make the platform versatile and to fulfill many different use cases. In this short post you can find examples of how to query data from Oracle using Jupyter notebooks and simple integration with pandas and matplotlib.


Notebooks and short descriptions:
  • Oracle_IPython_sqlplus: examples of how to use sqlplus inside Jupyter notebooks. It is based on the use of %%bash cell magic and here documents to wrap up sqlplus inside Jupyter cells.
  • Oracle_IPython_cx_Oracle_pandas: examples of how to query Oracle from Python using cx_Oracle and how to integrate with pandas and visualization with matplotlib.
  • Oracle_IPython_SQL_magic: examples of how to query Oracle using %sql line magic (or %%sql cell magic) and of the integration with cx_Oracle and pandas.
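To give an idea of the cx_Oracle and pandas integration used in the notebooks, here is a minimal sketch (the connection string and query are illustrative); the resulting DataFrame can then be plotted with matplotlib as shown in the notebooks:

import cx_Oracle
import pandas as pd

# Illustrative connection string: username/password@host:port/service_name
ora_conn = cx_Oracle.connect('scott/tiger@dbserver:1521/orcl.mydomain.com')

# Read the result set of a query straight into a pandas DataFrame
df = pd.read_sql('select username, created from all_users order by created', ora_conn)
df.head()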


Dependencies and pointers to build a test environment:

  • Install IPython and Jupyter (the following assumes Anaconda from Continuum Analytics)
  • Install the Oracle client
    • Download the software from OTN: http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html
    • On that same link you can find the installation instructions (scroll down to the end of the page)
    • Example of the actions: perform ln -s libclntsh.so.12.1 libclntsh.so (12c version) and export LD_LIBRARY_PATH={oracle client home}
    • I have tested this with Oracle client versions 12.1.0.2 and 11.2.0.4 on Linux
  • Post client installation:
    • set environment: export ORACLE_HOME={path to the Oracle client installation}
    • If not already installed, install libaio (yum install libaio)
  • Check that the Oracle client works and all dependencies are set by running sqlplus from the Oracle client home, example:
    • check client connectivity with: sqlplus username/password@dbserver:port/service_name
  • Install cx_Oracle, for example with pip install cx_Oracle
  • Install ipython-sql
Reference: a web page by Julian Dyke with examples on how to use cx_Oracle


Tuesday, May 31, 2016

Linux BPF/bcc for Oracle Tracing

Topic: In this post you will find a short discussion and pointers to the code of a few sample scripts that I have written using Linux BPF/bcc and uprobes for Oracle tracing.

Previous work and motivations

Tools for dynamic tracing are very useful for troubleshooting and internals investigations of Oracle workloads. Dynamic tracing probes on the OS/kernel can be used, for example, to measure the details of I/O latency. Moreover, probes on the Oracle userspace functions can be used to complement Oracle instrumentation and probe deeper into the internals of the Oracle engine, when needed. For example, in work done in collaboration with Frits Hoogland, we have investigated how to link the Oracle wait event interface with tools able to probe the userspace, such as gdb and dynamic tracing tools like DTrace, SystemTap and Perf. More details and examples on this topic can be found in previous work: Modern Linux Tools for Oracle Troubleshooting (presentation at UKOUG Tech14), Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, Linux Perf Probes for Oracle Tracing, Flame Graphs for Oracle.


What's new with BPF/bcc

BPF, together with its frontend bcc, are new and exciting technologies in the Linux dynamic tracing ecosystem. If you are not familiar with them, you can get up to speed using the excellent material in Brendan Gregg's blog. While the tools are still in development as I write this, a number of important features are already available starting from kernel 4.5, notably the support for uprobes that I have used in the scripts discussed later in this post.

I have started experimenting with porting a few probes for the Oracle userspace that I had written using SystemTap and discussed in this blog. While the language/syntax is completely different between SystemTap and BPF/bcc, porting the probes has turned out to be relatively straightforward. The work has been made substantially easier by the availability of a large selection of well-written and documented scripts in the tools directory of bcc. At present, writing BPF/bcc scripts is a mixture of coding in Python and C, which feels to me both familiar and at the same time strangely low-level compared to the more mature environments of DTrace and SystemTap, for example.

The example scripts that I have written using BPF/bcc are on Github. Here is a list with comments:

Script names and short descriptions:
  • ora_sqlparse_trace.py: tracing of Oracle SQL parsing. This script traces SQL hard parsing on Oracle binaries hooking on the Oracle function "opiprs" and reads from function arguments (CPU registers) and from process memory.
  • ora_wait_trace.py: tracing of Oracle wait events. This script traces Oracle sessions by hooking on the functions "kskthewt" and "kews_update_wait_time" and reads from function arguments (CPU registers).
  • ora_logicalIO_histogram.py: logical IO latency histograms. This script measures the latency between call and return time for the Oracle function "kcbgtcr", which is an important part of the logical IO processing for consistent reads.
  • ora_wait_histogram.py: wait event latency histograms. This script traces Oracle sessions by hooking on the functions "kskthewt" and "kews_update_wait_time" and reads from function arguments (CPU registers). BPF computes the latency histogram for the wait events and the script prints the values on stdout.

An example of the usage of ora_wait_histogram.py to measure and display wait event latency:

# stdbuf -oL ./ora_wait_histogram.py 10 10|sed -e 's/event# = /event#=/g' -f eventsname.sed
Start tracing oracle wait events... Hit Ctrl-C to end.

event=db file sequential read
     wait time, microsec : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 25       |                                        |
       128 -> 255        : 24521    |********************                    |
       256 -> 511        : 46788    |****************************************|
       512 -> 1023       : 12169    |**********                              |
      1024 -> 2047       : 1132     |                                        |
      2048 -> 4095       : 660      |                                        |
      4096 -> 8191       : 248      |                                        |
      8192 -> 16383      : 29       |                                        |


Latency heat maps with PyLatencyMap

PyLatencyMap is a command-line tool for visualizing latency histograms as heat maps on terminal graphics, using ANSI escape codes. PyLatencyMap can be used to investigate I/O performance for random I/O; it is especially suited for cases with multiple modes of response time from the storage (SSD cache, HDD, latency outliers). The original idea for PyLatencyMap comes from the work of Brendan Gregg on latency heat maps. I have added to the PyLatencyMap repository a modified version of the biolatency.py script to measure I/O latency histograms: this is the link to the script pylatencymap_biolatency.py, and an example of heat maps generated with PyLatencyMap with the script Example11_BPF-bcc_blockIO_latency.sh:



The test workload has been generated using Kevin Closson's SLOB. Additional references with a more detailed discussion of the topic of testing Oracle I/O with SLOB and measuring latency heat maps are:
Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap and OraLatencyMap v1.1 and Testing I/O with SLOB 2.


Notes on the test environment

The scripts discussed in this post have been developed on Fedora 24 (alpha) running Linux kernel version 4.6 (using the rawhide kernel) and have been tested on workloads generated using Oracle version 11.2.0.4. This is not an Oracle-supported configuration and the scripts are intended mainly as a demonstration of the technology and for learning purposes.
Here are some pointers on the steps I used to setup a lab environment for testing:

Note in particular the step for configuring the rawhide kernel, probably a good choice when testing BPF, as new features are being added on a regular basis as I write this:
# sudo dnf config-manager --add-repo=http://alt.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/fedora-rawhide-kernel-nodebug.repo

# sudo dnf update

Additional pointers and recipes on how to install Oracle 11.2 on Fedora can be found on Tim Hall's website at: https://oracle-base.com/articles/11g/articles-11g


Conclusions

BPF with its bcc frontend are new and powerful tools for dynamic tracing for Linux. A discussion of the merits and shortfalls of BPF/bcc vs other existing solutions in the Linux dynamic tracing ecosystem is beyond the scope of this post. What you can find in this post are a few example scripts that I have written for tracing Oracle using BPF/bcc and uprobes and an additional script for integrating BPF/bcc with PyLatencyMap, which provides visualization as heat maps of the histograms generated using BPF/bcc.
Although the BPF/bcc environment is currently evolving, it already appears to be a very useful addition to the toolbox for troubleshooting and performance investigations of Linux workloads. The set of example scripts/tools that comes with the bcc repository is of great help for getting started both with using BPF/bcc and with writing new scripts. BPF/bcc can only run on relatively new kernels (as I write this, I have tested the scripts discussed in this post on kernel version 4.6), and this is an obstacle to its adoption in many environments, at least in the short term.


Credits and acknowledgements

Many of the original ideas and tools discussed here are inspired or directly derived from the awesome published work of Brendan Gregg.
Many thanks also to the development teams of BPF and bcc for providing and supporting these new and powerful tools. In particular, thanks to Brenden Blanco for his work on uprobes for bcc and for his support on issue #478.
The work of investigating Oracle userspace with dynamic tracing tools has been done in collaboration with Frits Hoogland.

Wednesday, April 27, 2016

IPython/Jupyter Notebooks for Querying Apache Impala

Topic: in this post you can find examples of how to get started with using IPython/Jupyter notebooks for querying Apache Impala.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for Apache Hadoop. This post explores the use of IPython for querying Impala and originates from the notes of a few tests I ran recently on our systems. For completeness, please note that several additional options exist to query Impala; some of the most popular are the command-line impala-shell and the web-based interface Hue.
IPython/Jupyter notebooks are one of the leading free platforms for data analysis, with many advantages, notably the interactive web-based interface and a large ecosystem of readily available packages for data analysis and visualization. Moreover IPython/Jupyter notebooks are a very handy format for sharing code and data as you will see in the examples.


Installation notes for a test environment:
  • To test the techniques described here you will need a target Hadoop system with a running instance of Impala.
  • You will need an installation of IPython, for example on your laptop.
    • I have used Anaconda from Continuum Analytics. Additional details from my test environment: Anaconda version 4.0.0, Linux version, Python 2.7 for x86_64 
  • Cloudera impyla, Python client for Impala and Hive
  • ipython-sql, this is a key ingredient of this post. It implements "SQL magic" for IPython by Catherine Devlin
    • Install with pip install git+https://github.com/LucaCanali/ipython-sql.git or download from https://github.com/LucaCanali/ipython-sql
    • as I write this you will need the version I forked and to which I added a couple of changes to integrate it with Impala/impyla and Kerberos authentication

Example notebooks:

The example notebooks described in this post are available on Github.

The first example is about using the impyla client to query Apache Impala from IPython notebooks and basically comes straight out of the impyla documentation. The next example adds "SQL magic" for an improved user interface. To run this you need to install ipython-sql by Catherine Devlin. As I write this, you will need the version I forked and extended with minor changes to integrate it with Impala/impyla and with Kerberos authentication.
In practice, ipython-sql adds the "IPython magic functions" %%sql and %sql, which can be used to run SQL directly in the notebooks.
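For a flavor of the impyla part, here is a minimal sketch (the host name and query are illustrative; see the notebooks for the complete examples, including the use of the %sql magic):

from impala.dbapi import connect

# Connect to an Impala daemon (hypothetical host, default port 21050)
conn = connect(host='impala-host.example.com', port=21050)
cur = conn.cursor()
cur.execute('show tables')
print(cur.fetchall())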

Example 1: Impala_Basic.ipynb

Example 2:  Impala_SQL_Magic.ipynb


Kerberos, if you need it:

Kerberos authentication is used in our production environments. This requires additional steps in the configuration and some modifications to the scripts. For a start, support for Kerberos requires installing additional components. In my test environment I had to run:

# yum install cyrus-sasl-devel
# yum install gcc-c++
$ pip install thrift_sasl

You also need to have a valid Kerberos ticket (use kinit to get the ticket if your Kerberos client is configured). Here are the example notebooks described in the previous paragraph, now extended with the steps needed for integrating with Kerberos authentication:
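For reference, a sketch of what the impyla connection can look like in the Kerberos case (host and service name are illustrative):

from impala.dbapi import connect

# Requires a valid Kerberos ticket (kinit) and the thrift_sasl package
conn = connect(host='impala-host.example.com', port=21050,
               auth_mechanism='GSSAPI', kerberos_service_name='impala')
cur = conn.cursor()
cur.execute('show tables')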

Example 3: Impala_Basic_Kerberos.ipynb

Example 4: Impala_SQL_Magic_Kerberos.ipynb


Conclusions

IPython/Jupyter notebooks can be used to build an interactive environment for data analysis with SQL on Apache Impala. This combines the advantages of using IPython, a well established platform for data analysis, with the ease of use of SQL and the performance of Apache Impala.
This post provides examples of how to integrate Impala and IPython using two python packages: the impyla client and ipython-sql. An additional contribution of this work is a minor change to the ipython-sql package to make it compatible with Impala and with Kerberos authentication. The examples described in this post are available on Github.


Acknowledgements and references

Catherine Devlin for ipython-sql.
Cloudera: the Apache Impala project, Wes McKinney for Pandas and Impyla.
The Jupyter project.
Many thanks to the CERN Hadoop service team.
Links to related technology: Hue (web-based SQL interface) and the Ibis project.



Tuesday, March 29, 2016

SystemTap Guru Mode and Oracle SQL Parsing

Introduction and motivations

SystemTap and dynamic tracing tools in general give administrators great control over their systems, with relatively little additional effort needed to learn the new tools. In this post you will see how SystemTap can be used to modify data on the fly at runtime. The outcome is a form of "live patching". Examples are provided on how to apply these ideas to the Oracle SQL parsing functionality. This type of "guru mode" use of SystemTap is a corner case, but I believe it is important to know that such techniques exist and how they can be deployed, also because they can be implemented with just a few lines of code.

SystemTap has been successfully used for emergency security band aid of Linux systems for many years, see this presentation by Frank Ch. Eigler for full details. See also an example of how these techniques have been used in practice, described in the CERN openlab 2013 summer student lecture "SystemTap: Patching the Linux kernel on the fly".

This post is about applying the techniques and ideas of "live patching on the fly by data modification" to closed source applications, when debuginfo is not available, and in particular to Oracle. The post is structured around three examples of increasing complexity on how to hook into and change the behavior of Oracle SQL hard parsing. Some of the topics that you will see addressed in the examples are:
  • how to find the relevant function(s) to hook SystemTap to
  • how to write into userspace memory with SystemTap probes
  • how to modify CPU registers with SystemTap probes

Disclaimer: The tools and techniques presented in this post are intended for learning/reference only and are best used on a sandbox as they are unsupported and can potentially put at risk systems stability and integrity. Administrator privileges are needed to run SystemTap probes.


Programmable SQL filter

In this example you will see a method for selectively blocking execution of SQL based on programmable filter rules implemented with SystemTap.

As a first step you need to identify a relevant function for SQL parsing in the Oracle binary. Functions in the Oracle binary are not documented, but luckily the function opiprs has been discussed previously in this blog and turns out to be a good choice for use with SystemTap probes. There are also other options, but the details are outside the scope of this post. What you need to know about opiprs for this blog post is summarized below:

Function name: opiprs
Purpose: Oracle Program Interface - Parse. This function is called when Oracle performs hard parsing (i.e. when a SQL statement that is not in the library cache needs to be parsed).
Notable function arguments:
  • register rdx -> SQL statement length
  • register rsi -> pointer to the SQL text string


Note two important arguments of opiprs that are passed using the CPU registers rdx and rsi: respectively, the length of the SQL statement and the pointer to the SQL statement text. The SQL text string is stored in memory, more precisely in the stack of the Oracle process; this can be confirmed by comparing the SQL text address with the process memory map from /proc/<pid>/maps.

A simple mechanism to implement the original goal of selectively blocking SQL execution is the following: write a SystemTap probe on the Oracle function opiprs that examines the SQL and, if it matches some programmable rules, blocks further parsing by forcing the SQL parsing to exit with an error.

SQL parsing can be forced to exit with an error by writing a 0 (the string terminator) in the first memory location of the buffer that contains the SQL text, effectively signaling a zero-length string. The effect of such a change is that Oracle will throw the error ORA-00900: invalid SQL statement.

SystemTap probes can write into userspace memory using embedded C functions. This requires running SystemTap in "guru mode" and requires some additional syntax as detailed in the SystemTap documentation.

The code to implement the ideas described so far is summarized in the example script filterSQL_opiprs.stp. It consists of two main parts: one is a probe on the Oracle function opiprs, the other is an auxiliary C function called block_parse, which performs the task of writing into memory, specifically to the memory location that contains the SQL text string. Here is a copy of the main text of the script, for convenience:

function block_parse(pointersql:long) %{
    char *sqltext;
    sqltext = (char *) STAP_ARG_pointersql;
    /* Write a string terminator over the SQL text: this will throw ORA-00900: invalid SQL statement */
    sqltext[0] = 0;
%}

probe process("oracle").function("opiprs") {
   sqltext = user_string2(register("rsi"),"error")
   # debug code
   # sqllength = register("rdx")
   # printf("opiParse: arg2=%s, arg3=%d\n",sqltext,sqllength)
   if (isinstr(sqltext, "UNWANTED SQL")) {
      printf("FOUND!\n")
      block_parse(register("rsi"))
   }
}

Test the example:

1. Run the SystemTap script as root (note the oracle executable needs to be in the path) with:
# stap -g -v filterSQL_opiprs.stp

2. On a different session using SQL*Plus:
SQL> select 'Hello world' from dual; -- this runs normally

'HELLOWORLD
-----------
Hello world

SQL> select /* UNWANTED SQL */ 'Hello world' from dual;
select /* UNWANTED SQL */ 'Hello world' from dual
 *
ERROR at line 1:
ORA-00900: invalid SQL statement


This illustrates how filterSQL_opiprs.stp blocks any SQL that contains the string "UNWANTED SQL". The example can be generalized to filter generic SQL statements based on keywords or other complex rules.


Modify SQL on the fly

The code below shows an example of altering the SQL statement on the fly. It is an artificial example for demo purposes. The C function "replace_SQL" (see code) is used to write into Oracle userspace the new SQL text, effectively modifying the statement that is being parsed. The SystemTap script livepatch_basic_opiprs.stp is available at this link. Here is a copy of the main text:

%{
/* SQL that will replace TARGET_SQL */
#define REPLACEMENT_SQL "select power(count(*),3) from dba_objects"
%}

global TARGET_SQL = "select count(*) from dba_objects, dba_objects, dba_objects"

function replace_SQL(pointersql:long) %{
    char *sqltext;

    sqltext = (char *) STAP_ARG_pointersql;
    /* This changes in memory (stack) the SQL text that will be parsed */
    strcpy(sqltext, REPLACEMENT_SQL);
%}

probe process("oracle").function("opiprs") {
    sqltext = user_string2(register("rsi"),"error")  
    # debug code
    # sqllength = register("rdx")
    # printf("opiParse: arg2=%s, arg3=%d\n",sqltext,sqllength)
    if (sqltext == TARGET_SQL) {
        printf("FOUND!\n") # debug code
        replace_SQL(register("rsi"))
    }
}

Test the example:

1. Consider this SQL. It may take days of CPU time to execute, as it has been built on purpose with cartesian joins:

SQL> select count(*) from dba_objects, dba_objects, dba_objects;


2. Run the SystemTap script livepatch_basic_opiprs.stp as root:

# stap -g -v livepatch_basic_opiprs.stp

3. Run the SQL again (flushing the shared pool is used to cause hard parsing):

SQL> set timing on
SQL> alter system flush shared_pool;
SQL> select count(*) from dba_objects, dba_objects, dba_objects;

POWER(COUNT(*),3)
-----------------
       7.5931E+25

Elapsed: 00:00:19.26

The SQL now runs in a few seconds because the statement with cartesian joins has been replaced "on the fly" by the SystemTap probe with an equivalent statement that executes much faster and without joins.
There is an important limitation to this implementation: the length of the "new" SQL statement must not exceed the length of the original SQL. A more general case is addressed in the next example.


Modify SQL on the fly, a more complex experiment

This example addresses the case of replacing SQL statements on the fly, removing the limitation of the previous example on the length of the SQL statement. The main point is that you also have to update the register rdx with the length of the new SQL statement. If the new statement is shorter than the original one, this step can be omitted (as was the case in the previous example).
How to modify the content of a CPU register with SystemTap? This is done by updating the CPU register value in CONTEXT->uregs. SystemTap takes care of restoring the register values when returning to Oracle userspace execution.

Another important point is about where to write the new SQL text, as we need a longer buffer than with the original SQL. Where to allocate the extra memory?
The example code referenced below writes the new SQL text in the process stack, using the value of the %rsp pointer and subtracting 0x2000 from it. This is an educated guess that the target memory location is allocated to the process (in the memory chunk allocated for the stack), but far enough "down in the stack" that it is free and will not be used by subsequent branches or leaf functions called by opiprs. From a few basic tests this approach seems to work; however, please note that this script is intended mainly for reference and education purposes and can be potentially dangerous for system stability.

The SystemTap script livepatch_opiprs.stp is available at this link.

The proposed example script livepatch_opiprs.stp replaces the SQL "select sysdate from dual" with "select sysdate -1 from dual". This is inspired by a hypothetical situation where you want to replay a workload with time-dependent SQL. Another example of SQL replacement you may want to test is adding SQL hints. Customize the SQL replacement as you wish by editing REPLACEMENT_SQL and TARGET_SQL in the script.

Test the example:

SQL> alter session set nls_date_format='YYYY-MM-DD HH24:MI';
SQL> select sysdate from dual; -- all normal up to this point

SYSDATE
--------------
2016-02-22 12:00

Run the SystemTap script as root
# stap -g -v livepatch_opiprs.stp

Now the same SQL will return a different result (that is sysdate -1 instead of sysdate):

SQL> alter system flush shared_pool;
SQL> select sysdate from dual;

SYSDATE
--------------
2016-02-21 12:00


Cleaning up after testing

Once a SQL statement is hard parsed, all subsequent executions will also run with the modified text. If you want to revert to normal Oracle behavior, you need to flush the statement out of the library cache and re-parse it (after having stopped the SystemTap script). For flushing statements out of the shared pool you can use "alter system flush shared_pool" or dbms_shared_pool.purge (see also this post on dbms_shared_pool).


Pointers to reproduce the results using gdb

In the examples discussed in this post, SystemTap has been used almost as an automated debugger, with the notable addition that SystemTap has low overhead compared to many debuggers and provides a powerful programmable interface for defining the SQL text search and replacement actions. Another advantage of SystemTap is that it can attach to all running processes (of the Oracle executable in this case) if desired.
For completeness and as a reference, this is a short list of relevant gdb commands that can be used to reproduce some of the actions described in the examples of this post:

gdb -p <pid> -> start gdb against an existing Oracle session
break opiprs -> define a breakpoint on opiprs
continue -> continue program execution till breakpoint
info reg -> show registers
bt -> backtrace
x/1s $rsi -> visualize the SQL statement string using $rsi as pointer

write the new SQL length into the CPU register rdx:
  set $rdx=24

copy a SQL string into memory and update $rsi (examples):
  set $rsi=$rsp-0x2000
  p strcpy($rsi, "select sysdate from dual")
  set $rdx=25


Conclusions

This post illustrates SystemTap techniques that can be used to modify userspace data on the fly at runtime. Examples are provided on how to apply these techniques to Oracle SQL parsing. The techniques discussed here can be generalized and used on other functions of the Oracle kernel as well as be extended to "live-patch" data at runtime for other applications in the Linux environment. In particular the provided example SystemTap probes show techniques for writing into userspace memory and into CPU registers, useful to address cases where debuginfo is not available.

Disclaimer: The tools and techniques presented in this post are intended for learning/reference only and are best used on a sandbox as they are unsupported and can potentially put at risk systems stability and integrity. Administrator privileges are needed to run SystemTap probes.


Acknowledgements and references

A shout-out to Frank Ch. Eigler, for his work on SystemTap and in particular for his presentation "Applying band-aids over security wounds with systemtap" and related blog post and also for the tip on how to write into CPU registers with SystemTap.
Many thanks to Frits Hoogland for comments and improvements to this post and for collaboration on the investigation of Oracle internals and the use of Linux dynamic tracing tools for Oracle troubleshooting.


Link to the example code in Github.
Links to previous work on this blog on using SystemTap for Oracle tracing: SystemTap into Oracle for Fun and Profit, Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, Diagnose High-Latency I/O Operations Using SystemTap.


Thursday, February 25, 2016

PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts

PerfSheet.js is a tool aimed at DBAs and Oracle performance analysts. It provides a simplified interface to extract and visualize AWR time series data in the browser using JavaScript.

The reason for a tool like PerfSheet.js is to make the analysis of AWR data easier by providing a graphical interactive interface and by automating several repetitive steps of data extraction and chart preparation. Pivot charts provide a flexible and easy-to-use way to navigate around the many metrics and dimensions of AWR data. PerfSheet4 is a tool that I have been using for a few years for AWR data analysis (following previous original work by Tanel Poder).
PerfSheet.js spins off from the work on PerfSheet4. While PerfSheet4 runs in MS Excel, PerfSheet.js is written in JavaScript and runs in the browser. This provides extra flexibility and compatibility across platforms.
The main components of PerfSheet.js are:
  • Visualization with interactive pivot charts. This is implemented using  PivotTable.js by Nicolas Kruchten.
  • Additional customizations provide pre-configured starter charts and a basic interface to load data from CSV files.
  • A set of SQL scripts to extract performance metrics of interest from Oracle AWR tables into CSV files.
  • A few example data files come with the tool, to help first-time users.

You can download PerfSheet.js from Github or from this webpage.

Here is a screenshot of PerfSheet.js:



Here is an introductory video:



References and acknowledgements

PerfSheet.js is based on PivotTable.js by Nicolas Kruchten. Additional dependencies include: C3.js, D3.js, jQuery, jQuery-ui, PapaParse. PerfSheet.js is based on previous work on PerfSheet4 which in turn is based on original work by Tanel Poder.


Monday, January 25, 2016

Linux Perf Probes for Oracle Tracing

Topic: this post is about Linux perf and uprobes for tracing and profiling Oracle workloads for advanced troubleshooting.

Context

The recent progress and maturity of some of the Linux dynamic tracing tools has raised interest in applying these techniques to Oracle troubleshooting and performance investigations. See Brendan Gregg's web pages for a summary and future developments of dynamic tracing for Linux. Some recent work on applying these tools and techniques to Oracle can be found in Tanel Poder's posts on "RAM is the new disk", Frits Hoogland's post on "Stapflame" and in this blog at "Extended stack profiling".

Dynamic tracers are proven tools for instrumenting OS calls: an example of interest for OLTP tuning is probes for measuring I/O latency at microsecond precision. Another topic, more complex but also more exciting, is tracing the Oracle userspace. This topic is of growing interest with modern servers hosting large amounts of RAM and workloads that are often CPU-bound.

DTrace is still a reference when talking about dynamic tracing. For Oracle userspace tracing, however, this is mostly relevant for the Solaris environment. Basic examples of DTrace probes for Oracle on Linux can be found in this blog at this link.

SystemTap is a tool that offers a mature interface for tracing the Linux kernel and the userspace. Notably, SystemTap offers the possibility to write probes with advanced aggregations, similarly to DTrace. Tracing Oracle userspace and wait events with SystemTap has been covered before in this blog, with work done in collaboration with Frits Hoogland: see this link and this link and the scripts at this Github repository. An advantage of SystemTap is that it can trace userspace on RHEL/OL 6.x using the utrace facility.

Linux Perf is another major player in this area. It is a powerful tool, integrated in the kernel and continuously being improved with new kernel releases. Perf is an excellent tool for data gathering for on-CPU flame graphs (see this link and references below). Perf has gained a reputation for being stable and having relatively low overhead, which makes it useful for troubleshooting production workloads (mileage may vary).

In this post you will find examples of how Perf can be used to trace the Oracle userspace: this includes tracing calls to userspace functions and also reading from memory, in particular from the Oracle SGA. This creates a link between Perf and Oracle instrumentation, and can be used for tracing Oracle, for building extended profiling tools (i.e. tools that pull together OS and Oracle instrumentation) and for flame graph visualization.


Perf probes for Oracle

If you are not familiar with Linux Perf, you can profit from reading/testing a few examples from http://www.brendangregg.com/perf.html. The focus of this post is on perf probes for Oracle userspace functions using uprobes. This functionality is available from kernel version 3.5; this means that to run the examples shown here you will need a relatively recent Linux version: RHEL 7.x (with 3.10.x kernel) or Oracle Linux with UEK3 (3.8.x kernel) or UEK4 (4.1.x kernel). The examples in this post have been tested for Oracle 11.2.0.4 running on OL7.2 with UEK4.

Perf meets Oracle wait events: probes for the wait event interface are of particular interest as they provide a link between what Perf can measure and the Oracle instrumentation. Some key Oracle userspace functions in this area have been described previously in this blog (see also the example scripts in Github). Here is a short description of the three functions of interest that will be used later in the example scripts:


Function name: kskthbwt
Purpose: Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event. kslwtbctx is its parent function call and marks the start of a wait event. The suffix "bwt" most likely stands for "begin wait".
Parameters and registers:
  • 11.2.0.4: register r14 -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent); note for 12.1.0.2: register r13 -> pointer into X$KSUSE
  • register rsi -> timestamp of the beginning of the wait (in microseconds)
  • register rdx -> wait event number

Function name: kskthewt
Purpose: Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call marking the end of a wait event. The suffix "ewt" most likely stands for "end wait".
Parameters and registers:
  • register r13 -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)
  • register rdi -> timestamp of the beginning of the wait (in microseconds)
  • register rsi -> wait event number

Function name: kews_update_wait_time
Purpose: Kernel event wait statistics update wait time. This function is called at the end of an Oracle wait event. kslwtectx is its parent function call marking the end of a wait event.
Parameters and registers:
  • register rsi -> wait time
  • register r13 (relevant for 11.2.0.4, it does not work on 12.1.0.2) -> event number
  • register r15 -> pointer into X$KSUSE (V$SESSION) array in the SGA with a fixed offset (version-dependent)


You can get started with Perf tracing of Oracle userspace using the following:

# export ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/rdbms
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si

If all worked well, you will have created a new probe for Oracle userspace tracing and will see the following output:

Added new event:
  probe_oracle:kskthewt (on kskthewt in /u01/app/oracle/product/11.2.0.4/rdbms/bin/oracle with timestamp=%di event=%si)

You can now use it in all perf tools, such as:

        perf record -e probe_oracle:kskthewt -aR sleep 1

Note: in case of errors, check the OS and Oracle version compatibility discussed elsewhere in this post. Go ahead and create two more probes:

# perf probe -x $ORACLE_HOME/bin/oracle kskthbwt timestamp=%si event=%dx
# perf probe -x $ORACLE_HOME/bin/oracle kews_update_wait_time wait_time=%si event=%r13

You should now have three probes on Oracle; you can verify this by listing them with "perf probe -l". If needed, you can delete probes with the syntax "perf probe --del <probe_name>". Now you can run perf record to gather data for the events defined by the probes. One last step is to identify an Oracle process to test (it can be a sqlplus session, for example). Find the OS pid of the Oracle process to probe and run the following:

# perf record -e probe_oracle:kews_update_wait_time -e probe_oracle:kskthbwt -e probe_oracle:kskthewt -p <pid>

Hit CTRL-C when done tracing. Display the trace output with the perf script command. Here is an example from "select count(*) from <testtable>" run on a test database:

# perf script

oracle 17270 [003] 4004734.472372: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x6fce81d4 event=0x162
oracle 17270 [003] 4004734.472408: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690de07 event=0x162
oracle 17270 [003] 4004734.472953: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e05c event=0x15e
oracle 17270 [003] 4004734.472977: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x1e event=0x15e
oracle 17270 [003] 4004734.472982: probe_oracle:kskthewt: (94862ec) timestamp=0x521b78690e07a event=0x15e
oracle 17270 [003] 4004734.474083: probe_oracle:kskthbwt: (9486678) timestamp=0x521b78690e4c6 event=0xc6
oracle 17270 [003] 4004734.474398: probe_oracle:kews_update_wait_time: (95af198) wait_time=0x13b event=0xc6
...

The fields displayed in the example output are: process name, pid, CPU number, timestamp since machine startup, probe name and address, probe parameters. Some additional processing is beneficial to make the output more easily readable. In the next example the perf script output is filtered by piping it through a series of commands: (1) to reduce the number of fields displayed, (2) to convert hexadecimal numbers into decimal, (3) to convert wait event numbers into wait event names. For this we use sed and eventsname.sed, a file generated with a sqlplus script.

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $5,$7,$8}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4004734.472372: probe_oracle:kews_update_wait_time: wait_time=1875804628 event=SQL*Net message from client
pid=17270 TS=4004734.472408: probe_oracle:kskthewt: timestamp=1444447003926023 event=SQL*Net message from client
pid=17270 TS=4004734.472953: probe_oracle:kskthbwt: timestamp=1444447003926620 event=SQL*Net message to client
pid=17270 TS=4004734.472977: probe_oracle:kews_update_wait_time: wait_time=30 event=SQL*Net message to client
pid=17270 TS=4004734.472982: probe_oracle:kskthewt: timestamp=1444447003926650 event=SQL*Net message to client
pid=17270 TS=4004734.474083: probe_oracle:kskthbwt: timestamp=1444447003927750 event=direct path read
pid=17270 TS=4004734.474398: probe_oracle:kews_update_wait_time: wait_time=315 event=direct path read
...

What the examples show is that trace data that you would normally find in a 10046 trace file (or a trace from EXEC DBMS_MONITOR.session_trace_enable(waits=>TRUE), if you prefer) can be collected using OS commands alone. Perf with uprobes can be used to trace Oracle as an alternative to setting Oracle diagnostic events. What you cannot find in the Perf examples above is information on the parameters of the wait events (usually called p1, p2, p3). This is the subject of the next paragraph.


It only gets better from here: perf with uprobes can read the Oracle SGA

Uprobes has received a major improvement with kernel 3.14 to include the possibility of fetching arguments from the memory of the traced process (see also http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt). As I write this Oracle has released UEK4 with kernel 4.1.x, which can be used to test this feature. Fedora is another option for a test environment with recent kernels.

Many details and metrics of the Oracle instrumentation exposed through V$ views and their underlying X$ tables can be accessed directly by reading memory structures in the SGA (this is not the case for all X$ tables, as many of them are implemented as internal functions rather than memory structures). X$KSUSE, for example, exposes many of the fields of V$SESSION. The memory locations of the fields of interest can be identified using SQL, as in this example.
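
For reference, the column offsets of the fields of an X$ table can be listed from the fixed table definitions (to be run as SYS; this is a generic query, not necessarily the exact example linked above):

select c.kqfconam, c.kqfcooff, c.kqfcosiz
from x$kqfco c, x$kqfta t
where c.kqfcotab = t.indx
and t.kqftanam = 'X$KSUSE'
order by c.kqfcooff;

Note that these offsets are relative to the beginning of the X$KSUSE record, while the probe shown below uses offsets relative to the value of register r13, hence the version-dependent correction discussed next.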
The link between one of the main wait event userspace functions, kskthewt, and X$KSUSE was found by experimentation, investigated together with Frits Hoogland. We find that the value of the CPU register r13 points into the X$KSUSE record with an offset that is version dependent. This work has already been done for SystemTap probes and working examples can be found at this link for 11.2.0.4 and this link for 12.1.0.2.

This is an example of a probe on the Oracle wait event interface that exposes selected fields from X$KSUSE memory structures:

# perf probe --del probe_oracle:kskthewt
# perf probe -x $ORACLE_HOME/bin/oracle kskthewt timestamp=%di event=%si p1='-2080(%r13)' p2='-2072(%r13)' p3='-2064(%r13)' sql_hash='-1828(%r13)'


# perf record -e probe_oracle:kskthewt -p <pid>

# perf script | \
  awk '{printf "pid=%s TS=%s ",$2,$4} {print $8,$9,$10,$11,$12}' | \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' | \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=17270 TS=4097944.225385: timestamp=1444540213679034 event=SQL*Net message from client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.225786: timestamp=1444540213679433 event=SQL*Net message to client p1=1413697536 p2=1 p3=0 sql_hash=11373786902
pid=17270 TS=4097944.227361: timestamp=1444540213681023 event=direct path read p1=1 p2=193 p3=7 sql_hash=11373786902
pid=17270 TS=4097944.229206: timestamp=1444540213682870 event=direct path read p1=1 p2=3896 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.230532: timestamp=1444540213684197 event=direct path read p1=1 p2=12032 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.232161: timestamp=1444540213685827 event=direct path read p1=1 p2=29464 p3=8 sql_hash=11373786902
pid=17270 TS=4097944.244675: timestamp=1444540213698344 event=direct path read p1=1 p2=177 p3=5 sql_hash=11373786902
...

More fields of interest from X$KSUSE or other areas of the SGA can be extracted as perf probe arguments by extending the example above.

Example of tracing Oracle parse functions:

Function name: opiprs
Purpose: Oracle Program Interface - Parse. Used to trace hard parsing: this function is called when doing hard parsing (i.e. when a SQL statement that is not in the library cache needs to be parsed).
Useful information when probing the function:
   register rdx -> SQL statement length
   register rsi -> pointer to the SQL text string

Function name: kksParseCursor
Purpose: Used to trace soft parsing. This is one of the functions in the call stack when Oracle parses a cursor already in the library cache.
Useful information when probing the function:
   11.2.0.4:
      register rsi -> SQL statement length
      register rbx -> pointer to the SQL text string
   12.1.0.2:
      register r9 -> SQL statement length
      register r8 -> pointer to the SQL text string



This is an example of a perf probe to trace hard parsing (tip: flush the shared pool before testing):

perf probe -x $ORACLE_HOME/bin/oracle opiprs length=%dx sql='+0(%si)':"string"

# perf record -e probe_oracle:opiprs -p <pid> 
# perf script

oracle 17270 [003] 4100899.401507: probe_oracle:opiprs: (27fd2f6) length=0x19 sql="select sysdate from dual"
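
For reference, the trace line above can be reproduced by forcing a hard parse in the traced session, for example as follows (flushing the shared pool is only advisable on a test system):

SQL> alter system flush shared_pool;
SQL> select sysdate from dual;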

Similarly, for tracing soft parsing on 11.2.0.4 you can use the following probe:

perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%si sql='+0(%bx)':"string"
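
Based on the register mapping listed in the table above, the corresponding probe for 12.1.0.2 would be along these lines (a sketch, not tested here):

perf probe -x $ORACLE_HOME/bin/oracle kksParseCursor length=%r9 sql='+0(%r8)':"string"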

Many more Oracle functions can be traced:

From the OS point of view, the Oracle executable is a big C program with a large number of functions. You can run "nm -D $ORACLE_HOME/bin/oracle" to get an idea of the scale: the number of symbols in the text (code) section is of the order of 10^5. This is obviously too much to cover in detail, and the functions are undocumented. Some pointers for understanding the function names can be found in MOS (in particular in bug-related documents), by running oradebug doc from sqlplus, and in various blogs and presentations (see references).
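
For example, a rough count of the code symbols exported by the Oracle binary can be obtained as follows (the exact number depends on the Oracle version):

# nm -D $ORACLE_HOME/bin/oracle | grep -c ' T '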
Here below is a short table of functions of interest for tracing logical and physical I/O, taken from this blog post. The ideas and examples discussed above can be applied to these functions too, if there is a need to trace logical and physical I/O (a probe sketch follows the table). Note that tracing logical I/O on a busy system can introduce significant overhead because of the large number of events per second.


Function name: kcbgtcr
Purpose: Kernel Cache Buffers Get Consistent Read. Note: key function for consistent reads.
Selected parameters:
   tbs#=user_int32(%rdi)
   rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
   block#=user_int32(%rdi+4) & 0x003FFFFF
   data_object_id#=user_int32(%rdi+8)
   object_id#=user_int32(%rdi+12)
   Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbgcur
Purpose: Kernel Cache Buffers Current. Note: key function for current reads.
Selected parameters:
   tbs#=user_int32(%rdi)
   rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF
   block#=user_int32(%rdi+4) & 0x003FFFFF
   data_object_id#=user_int32(%rdi+8)
   object_id#=user_int32(%rdi+12)
   Note: for bigfile tablespaces: block#=user_int32(%rdi+4)

Function name: kcbzib
Purpose: kcbZIB should stand for: Kernel Cache Buffers, Z (kcbz.o is a module for physical I/O helper functions), IB: Input Buffer. Note: this function is part of the physical I/O stack: it indicates that Oracle will perform physical read(s) into the buffer cache.

Function name: kcbzgb
Purpose: the suffix GB in kcbZGB should stand for: Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations).
Selected parameters:
   tbs n#=%rsi
   rel file n#=%rdx >> 22 & 0x003FFFFF
   block#=%rdx & 0x003FFFFF
   data_object_id#=%r8
   object_id#=%r9
   Note: for bigfile tablespaces: block#=%rdx

Function name: kcbzvb
Purpose: invoked after Oracle has read a given block. Note: this function is used both for reads into the buffer cache and for direct reads.
Selected parameters:
   tbs n#=%rsi
   rel file n#=%rdx >> 22 & 0x003FFFFF
   block#=%rdx & 0x003FFFFF
   data_object_id=user_int32(%rdi+24)
   Note: for bigfile tablespaces: block#=%rdx
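
As an example, the kcbgtcr parameters above (written in SystemTap-style notation) can be translated into a perf probe using the uprobes memory-dereference syntax. Note that perf probe does not apply the shift/mask arithmetic, so the relative file number and block number have to be decoded when post-processing the output (a sketch; the argument names are arbitrary):

# perf probe -x $ORACLE_HOME/bin/oracle kcbgtcr tbs='+0(%di)':u32 rdba='+4(%di)':u32 data_obj='+8(%di)':u32 obj='+12(%di)':u32
# perf record -e probe_oracle:kcbgtcr -p <pid>

In the recorded data the relative file number is rdba >> 22 & 0x003FFFFF and the block number is rdba & 0x003FFFFF (for bigfile tablespaces the block number is simply rdba).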



Combining Perf probes and Oracle traces

What you have seen so far can also be described as a way to extract information normally available via Oracle diagnostic events, using perf and uprobes instead. Additional value comes when you combine these methods with perf events for tracing kernel and hardware activity. Many perf events are available: see also the output of perf list and the references in this post.

Study of Oracle I/O and wait events:
The example below combines perf tracepoint events for system calls to the Linux I/O interface with probes on the Oracle wait event interface: it records a trace of the Oracle userspace functions marking the beginning and end of a wait, and of the system calls to the I/O interface:

# perf record -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

The output is processed for improved readability:

# perf script |  \
  awk '{printf "pid=%s TS=%s ", $2,$4} {print $5,$8,$9,$10,$11,$12}' |  \
  perl -p -e 's/0[xX][0-9a-fA-F]+\b/sprintf "%s", hex $&/ge' |  \
  sed s/event=/event#=/g |sed -f eventsname.sed

pid=25314 TS=4173676.363095: probe_oracle:kskthbwt: event=db file sequential read
pid=25314 TS=4173676.363109: syscalls:sys_enter_pread64: buf: 2644443136, count: 8192, pos:
pid=25314 TS=4173676.363667: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.363689: probe_oracle:kskthewt: event=db file sequential read p1=4 p2=147586 p3=1 sql_hash=2246415721
pid=25314 TS=4173676.364030: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364043: syscalls:sys_enter_pread64: buf: 5727326208, count: 40960, pos:
pid=25314 TS=4173676.364576: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.364597: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147587 p3=5 sql_hash=2246415721
pid=25314 TS=4173676.364790: probe_oracle:kskthbwt: event=db file scattered read
pid=25314 TS=4173676.364801: syscalls:sys_enter_pread64: buf: 5727326208, count: 65536, pos:
pid=25314 TS=4173676.365344: syscalls:sys_exit_pread64:
pid=25314 TS=4173676.365365: probe_oracle:kskthewt: event=db file scattered read p1=4 p2=147592 p3=8 sql_hash=2246415721

What this example achieves is a way to port to Perf the work done with SystemTap for tracing Oracle I/O. See also http://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html for additional details.

Oracle wait events and perf scheduler events

Perf scheduler events (see: perf list sched:*) are another source of very useful information. For example the event sched:sched_stat_runtime traces the amount of time spent executing on CPU. Note however that not all the scheduler probes are available on RHEL7 and UEK kernels, as CONFIG_SCHEDSTATS is not set on those platforms (see: grep CONFIG_SCHED /boot/config-`uname -r`). In particular this prevents the use of the technique described by Brendan Gregg for off-CPU flame graphs with perf.

This is an example of tracing the CPU scheduler together with I/O related events with Oracle wait events:

# perf record -e sched:* -e probe_oracle:kskthewt -e probe_oracle:kskthbwt  -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents  -e syscalls:sys_enter_pread64 -e syscalls:sys_exit_pread64 -p <pid>

After capturing the workload you can extract a trace file by running "perf script". The table below shows a snippet of the trace, filtered with a simple script to improve readability (a minimal sketch of such a script follows the table). You can see that perf provides valuable information by measuring CPU time, system calls and Oracle wait events together.

EVENT NAME                      BEGIN TS        END TS          ELAPSED (MicroSec)
event=SQL*Net message to client 4413050521226   4413050521261   35
sched_stat_runtime              4413050521179   4413050521291   112
sched_stat_runtime              4413050521293   4413050521530   237
syscall:pread64                 4413050521392   4413050535684   14292
event=db file sequential read   4413050521375   4413050535704   14329
syscall:io_submit               4413050536073   4413050536249   176
sched_stat_runtime              4413050535593   4413050536551   958
syscall:io_submit               4413050536343   4413050536623   280
sched_stat_runtime              4413050536548   4413050536686   138
syscall:io_getevents            4413050536678   4413050537650   972
event=direct path read          4413050536652   4413050537677   1025
syscall:io_submit               4413050537807   4413050537924   117
sched_stat_runtime              4413050537563   4413050537963   400
syscall:io_getevents            4413050537956   4413050539289   1333
event=direct path read          4413050537943   4413050539315   1372
sched_stat_runtime              4413050539134   4413050539553   419
..
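
The filtering used to produce the table above can be sketched with a short awk script that pairs begin and end events from the perf script output and computes the elapsed time. This is a simplified sketch (it prints only the event name and the elapsed time, and ignores scheduler events):

# perf script | awk '
  {
    ts = $4; sub(/:$/, "", ts)                  # timestamp in seconds
    ev = $5; sub(/:$/, "", ev)                  # probe or tracepoint name
    if (ev ~ /kskthbwt|sys_enter/) {
        begin[ev] = ts                          # remember the begin timestamp
    } else {
        b = ev
        sub(/kskthewt/, "kskthbwt", b)          # match wait end with wait begin
        sub(/sys_exit/, "sys_enter", b)         # match syscall exit with enter
        if (b in begin) {
            printf "%-45s elapsed=%.0f us\n", ev, (ts - begin[b]) * 1e6
            delete begin[b]
        }
    }
  }'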


Ideas for future work

Stack profiling and flame graphs: an additional step to get more value out of perf traces is to visualize their output. Flame graphs and extended stack profiling have proven to be useful in this area. See also previous work on flame graphs for Oracle (link1 and link2), on extended stack profiling with stapflame by Frits Hoogland (this link), and on stack profilers (this link).
This is still work in progress. For a sneak preview follow this link for a proof-of-concept graph.

Perf hardware events: combining Oracle traces with hardware and PMU counters is another area of interest for future investigations. This could be useful, for example, for studying logical reads and their dependencies on hardware-level events at the memory and/or processor cache level.
Adding and documenting more probes for Oracle userspace function tracing: for example, it would be useful to develop a set of perf probes replicating the information of the 10046 trace event.

Linux and Oracle version compatibility:

The definition of probes on the Oracle executable needs uprobes, available from kernel version 3.5 onwards. When testing the examples discussed in this post on RHEL/OL, you can use kernel UEK4 or UEK3, or RHEL 7.x. Notably, the examples discussed in this post will not run on RHEL 6.x kernels.
Issues with uprobes and Oracle versions: uprobes works fine for tracing Oracle 11.2. However, for Oracle 12.1 I find that uprobes works on RHEL 7.0 (kernel 3.10.0-123) and UEK3 (kernel 3.8.x), but does not work with the kernels that ship with RHEL 7.1, 7.2 and anything higher (including UEK4). When testing, the easiest option is to use Oracle 11.2; if you want to test Oracle 12.1, use a UEK3 or RHEL 7.0 kernel. More investigation is needed on this topic.


Conclusions

Perf events and uprobes are very powerful tools for dynamic tracing of Oracle. They make it possible to complement and extend Oracle and OS instrumentation for advanced troubleshooting and performance investigations. In particular, perf with uprobes can be used:
  • As an alternative to standard instrumentation for Oracle tracing with diagnostic events: see the examples of using perf probes for tracing wait events, SQL parsing, and physical and logical I/O.
  • As a tool for combining OS-related probes (tracepoints for I/O, network, scheduler or hardware events) and Oracle diagnostic information, for advanced troubleshooting.
  • As a data collection tool for extended profiling of Oracle workloads, which combines stack traces, scheduler information, wait events or other Oracle diagnostic data for flame graph visualization.

Credits and references

Kudos to Brendan Gregg for his work on flame graphs and for the articles and presentations he has published on Linux tracers. Many thanks to Frits Hoogland for his investigations of Oracle internals, for the collaboration on tracing tools, and for proofreading this post. Many thanks to Tanel Poder for his work on Oracle internals and performance.
Additional links on topics related to this post:

http://www.brendangregg.com/perf.html
https://fritshoogland.wordpress.com/2016/01/12/introducing-stapflame-extended-stack-profiling-using-systemtap-perf-and-flame-graphs/
http://externaltable.blogspot.ch/2015/12/extended-stack-profiling-ideas-tools.html
http://blog.tanelpoder.com/2015/09/21/ram-is-the-new-disk-and-how-to-measure-its-performance-part-2-tools/
http://externaltable.blogspot.ch/2014/11/life-of-oracle-io-tracing-logical-and.html
https://fritshoogland.wordpress.com/2014/02/27/investigating-the-wait-interface-via-gdb/
http://enkitec.tv/2013/03/14/tanel-poders-hacking-session-how-oracle-sql-plans-are-really-executed-part-2/
https://alexanderanokhin.wordpress.com/tools/digger/
http://lxr.free-electrons.com/source/tools/perf/Documentation/perf-probe.txt
http://linux.die.net/man/1/perf-probe
http://lxr.free-electrons.com/source/Documentation/trace/uprobetracer.txt