Source code: https://github.com/SvenvDam/HTypes

At my current job, we make heavy use of Apache HBase. HBase is a distributed database built on top of Hadoop and HDFS. It is extremely performant when you need consistent and random access to data which needs to be distributed across multiple servers. The aim of this post is not to provide an in-depth explanation of how HBase works though, but to describe a small extension to its (Java) API I wrote which makes it more pleasant to work with in Scala.

I mainly wrote this extension to get more familiar with typeclasses in Scala (more on this later) and using typeclasses to map objects to a database representation seemed like a good use-case. Besides, the developer friendliness of HBase leaves some things to be wished for and I figured adding some extensions for boilerplate work would not hurt. To be fair though, I’ve found the Java API to be quite pleasant to work with. Especially the construction of queries as values which can be passed around and composed before executed is great. Hence, I wanted to provide functionalities which do not get in the way of the existing API but are just an extension to what is already there. What I tried not to do was being clever with new ways to do things that were already possible, but then in a more ‘scala-esque’ way.

HTypes aims to provide three types of improvements:

Conversion of results to Scala collections
Asynchronous execution of queries
Easy conversion between objects and database representations

It relies heavily on implicit classes to add extra syntax to objects provided by the HBase client API. This makes that you as a user do not have to think about whether you should use methods provided by the core API or HTypes since both will be available at the same place. This also means that they can easily be combined to reach you desired goal with minimal work!

In the rest of this post, I will first give a short example of how to get up and running with HTypes. Then I will go into the types of improvements one-by-one with code examples. For the final improvement, I will also give an explanation of what typeclasses are and how they work in Scala since this is important to understand how it is implemented.

Getting started

First things first: install HTypes:

libraryDependencies += "com.svenvandam" %% "htypes" % "0.1"

HTypes is available for Scala 2.13, 2.12 and 2.11

To have all goodness of HTypes available in one go, do the following:

import com.svenvandam.htypes.Implicits._

That’s it! You now have all extensions to the client API available to you. Lets see what these are.

Conversion to Scala collection

The first improvement is a very simple one. When doing a scan on HBase, the client will hand you a ResultScanner. This can more or less be seen as a list of Result’s, but we cannot directly treat it as a collection. HTypes provides an implicit conversion to an Iterable[Result]. This has the advantage that you now map, filter and fold away on the retrieved results.

import org.apache.hadoop.hbase.client._
import com.svenvandam.htypes.converters.ScalaConverter._ // for ResultScanner syntax


val table = conn.getTable() // assuming you already have your connection set up

val scan = new Scan().addColumn("family".getBytes, "qualifier".getBytes)

val results: ResultScanner = table.getScanner(scan)
results.map { result => // this would normally not be possible

  result.getRow()
}

This improvement gives you a lot of flexibility with the ResultScanner out of the box. Note that you can achieve this without HTypes by importing scala/java converters provided by the scala stdlib and calling .asScala on your ResultScanner.

Asynchronous queries

An issue I encountered with the client was that the execution of queries is blocking by default. Having your code wait for the completion of a query on possibly petabyes of data (that is the scale HBase is designed for) can result in unnecessary slow programs. HTypes provides a couple of asynchronous methods to Table objects which will wrap the execution of your query in a Future. This makes that your program can continue to run while the query is running in the background and handle the results when they come in. The available methods are:

getScannerAsync(scan: Scan)(implicit ec: ExecutionContext): Future[ResultScanner]
getAsync(get: Get)(implicit ec: ExecutionContext): Future[Result]
putAsync(put: Put)(implicit ec: ExecutionContext): Future[Unit]
deleteAsync(delete: Delete)(implicit ec: ExecutionContext): Future[Unit]

The meaning of these methods is straightforward. An example of how this can be used:

import concurrent.ExecutionContext.Implicits.global
import org.apache.hadoop.hbase.client._
import com.svenvandam.htypes.async.TableSyntax._ // for additional table syntax


val table: Table = conn.getTable()
val get = new Get("rowKey".getBytes)

val f: Future[Result] = table.getAsync(get)

// continue with the rest of your code

Conversion between objects and database representations

Arguably the biggest addition HTypes provides is the ability to convert between objects in your program and a representation which can be stored in HBase. HTypes introduces a couple of classes for representations that correspond to how something should be stored:

case class Column(family: String, qualifier: String)
case class CellValue(value: Array[Byte], timestamp: Option[Long] = None)

case class Row(key: String, values: Map[Column, CellValue])

Row is the most important class here. HTypes needs the user to supply how an object can be converted to/constructed from a Row. Part of the information of your object can be stored in the key, the rest should be stored in columns. HBase stores everything as bytearrays. In order to not make assumptions on how your data can be stored most efficiently, HTypes requires you to specify yourself how these bytearrays should be created/read.

So how does this work then? If you know how to relate your objects to Row’s, how can we make HTypes work with this? Enter typeclasses.

Typeclasses

Typeclasses are a concept originating from category theory. According to wikipedia:

Category theory is a mathematical theory that deals in an abstract way with mathematical structures and relationships between them.

Cool, pretty vague sentence. In slightly more concrete (and probably slightly less correct) words: category theory is about how things can be grouped together based on their properties and how these groups (categories) relate to each other.

Typeclasses are different from classes in the object-oriented sense. In OO, whether something is of a certain class is determined by inheritance. Typeclasses are defined by means of their structure, not inheritance. This means that a typeclass describes certain functionalities, and everything that has these functionalities is part of the typeclass.

For now, think of them as traits (as you will see later on, traits are a key to the implementation of typeclasses in scala). As you will see, typeclasses are not a first-class citizen in scala the way they are in languages like haskell. The examples that follow show the conventional way to implement them in scala. For the examples that follow, we will use an actual typeclass of HTypes, the HBaseEncoder. Consider the following trait:

trait HBaseEncoder[T] {
  def encode(t: T): Row
}

This trait has one method, encode, which converts some object of type T to a Row. You can probably already see the use of this: say we have a function to store things in HBase:

def store[T](t: T): Unit

HTypes provides logic to store all the information contained in a Row. That means that as long as we have a type T which we can convert to a Row, we know it can be stored in HBase!

By now you might be thinking: why not do this?

trait HBaseEncoder {
  def encode: Row
}

def store(obj: HBaseEncoder): Unit

This way, we can make our classes we want to store implement the HBaseEncoder trait and call it a day. What is up with this category theory-talking guy and his difficult solutions? Stay with me for a bit.

What if we have an object of a class from a third-party library containing data we want to store in HBase? We cannot make this class implement our HBaseEncoder trait since we have no control over the library. In this case we need to make classes part of a type after they have been defined. This has a name: ad-hoc polymorphism. This is exactly what typeclasses facilitate.

So how is this done in scala? The conventional way is to provide an implementation of a certain class for your typeclass implicitly:

def store[T](t: T)(implicit encoder: HBaseEncoder[T]): Unit

Here, you can see the encoder as a sort of ‘evidence’ that T can be encoded. This way, it does not matter if T originates from your own code, a third-party library or the stdlib. As long as you provide an implementation of how it can be encoded, we can pass it to encode(). An example:

case class User(id: String, name: String, age: Int)

implicit val userEncoder = new hBaseEncoder[User] {
  def encode(user: User): Row = 
    Row(
      User.id, 
      Map(
        Column("profile", "name") -> Cellvalue(user.name.getBytes),
        Column("profile", "age") -> Cellvalue(user.age.toString.getBytes),
      )
    )
}
    
store(User("123", "Alice", 30)) // this now works!

A common pattern you’ll often see is that functionalities are added as syntax to classes which have an implementation for thatt typeclass. This can be done using implicit classes in scala:

object HBaseSyntax { // implicit classes can not be defined at the top level!

  implicit class HBaseEncoderOps[T](t: T) {
    def encode(implicit encoder: HBaseEncoder[T]): Row = encoder.encode(t)
  }
}

We can then call encode as if it was a method from User all along:

import HBaseSyntax._

val user = User("123", "Alice", 30)

// this works as long as we have an implicit typeclass instance available

val row: Row = user.encode

This syntax enrichment makes that typeclasses in scala are actually fairly convenient to work with.

If you encounter typeclasses in scala without being familiar with them, it might all seem very confusing at first. The are tons of implicits passed around which are hard to trace. After reading this however, I hope you can see the pattern of how they are used and more important: why they are useful.

Typeclasses in HTypes

Back to the topic of this post though. Now that we are familiar with how typeclasses work, let me introduce the one that HTypes provides:

HBaseEncoder[T]
HBaseDecoder[T]
HBaseCodec[T]
HBaseClassEncoder[T]

The first one, HBaseEncoder, is already discussed above. It describes the transformation of a T to a Row. HBaseDecoder does the inverse, it has a method decode(row: Row): T which constructs an object from the data out of HBase. The third typeclass, HBaseCodec, is simply the two above fused together. An instance of HBaseCodec[T] means we can convert a T to and from a Row. The final typeclass, HBaseClassEncoder, is slightly different. It is used to describe which columns should be fetched to construct a class without the need for an object of that class. It looks as follows:

trait HBaseClassEncoder[T] {
  def getColumns: Set[Column]
}

As you can see, getColumns takes no argument so we don’t need to pass it anything. It just returns a set of Column’s which are needed to construct a T.

So how are these typeclasses used then? HTypes uses these typeclasses to extend the syntax provided by the regular HBase API. There are two groups of syntax enrichment:

QuerySyntax, used to add extra methods to Get, Scan, Put and Delete
ResultSyntax, used to add extra methods to Result and ResultScanner

The first is all about encoding. It adds a method from to the four kinds of queries HTypes works with. This method allows you to construct a query by passing an object or its type. The resulting query will contain all data needed to perform the operation you want.

// requires implicit HBaseClassEncoder[User]

val userScan = new Scan().from[User]

// requires implicit HBaseClassEncoder[User]

val userGet = new Get("123").from[User]

val alice = User("123", "Alice", 30)

// requires implicit HBaseEncoder[User]

val alicePut = new Put("123").from(alice)
// or

val alicePut = PutUtils.createFrom(alice)

// requires implicit HBaseEncoder[User]

val aliceDelete = new Delete("123").from(alice)
// or

val aliceDelete = DeleteUtils.createFrom(alice)

All these constructed queries can then be executed or further refined by adding more attributes. Remember that the resulting objects are still part of the java API and that all original methods are still available.

ResultSyntax is all about decoding. It adds an as method to Result and ResultScanner. This method will convert a Result to a Iterable[(T, Long)]. Remember that HBase does not overwrite values but stores multiple updates of a value. A Result can contain multiple of these updates. This also means that we can possibly construct multiple T’s with values at different times. HTypes does the following: say we have our user Alice of age 30 stored in HBase at timestamp 0. At timestamp 1 her age gets updated and she now is 31. If all this info gets retrieved in a single Result, HTypes does the following:

val result = table.get(aliceGet)

// this requires an implicit HBaseDecoder[User]

val aliceHistory = result.as[User]
// Iterable((User("123, "Alice", 31), 1), (User("123", "Alice", 30), 0))

When you are are doing a scan rather than a get, you will get a ResultScanner. You can see this as a sequence of Result’s. Hence, HTypes converts a ResultScanner to a Iterable[Iterable[(T, Long)]]. It’s simply a sequence of what it converts a single Result to.

Composing encoders and decoders

If you know how to decode a Row to an A and you also know how to transform an A to a B, you actually know how to decode a Row to a B. You simply chain the decoding and transformation. This is also known as mapping and probably not unknown to anyone using scala. A HBaseDecoder has a map function:

trait HBaseDecoder[A] { self =>
  def decode(row: Row): Option[A]

  def map[B](f: A => B): HBaseDecoder[B] = new HBaseDecoder[B] {
    def decode(row: Row): Option[B] = self.decode(row).map(f)
  }
}

As you can see, once you have a decoder for some type, you can construct a decoder for another type as long as you can define a function which transforms one into the other!

In the case of encoding, things get flipped a bit. If you know how to encode and store an A and you want to do the same for a B, a function f: A => B is not going to be of much help. In the case of decoding, we followed the following path: Row => A => B. Encoding is the opposite and thus we want to follow this path: B => A => Row. That means that if we have a f: B => A, and an encoder for A, we can construct an encoder for B! The way we do this is similar to mapping, with one important difference. In the case of mapping, we ‘append’ the function f to get to a B. Here, we would first apply the B => A transformation before encoding and we are in a sense ‘prepending’ it to the encoding step. This is also known as contramapping. A HBaseEncoder ans a contramap function which does this:

trait HBaseEncoder[A] { self =>
  def encode(t: A): Row

  def contramap[B](f: B => A): HBaseEncoder[B] = new HBaseEncoder[B] {
    def encode(b: B): Row = self.encode(f(b))
  }
}

In the case of a HBaseCodec if we can supply two transforming functions for a bidirectoinal mapping between two types, we can compose a new codec using its imap method:

trait HBaseCodec[A] extends HBaseEncoder[A] with HBaseDecoder[A] { self =>
  def imap[B](f: A => B, g: B => A): HBaseCodec[B] = new HBaseCodec[B] {
    def decode(row: Row): Option[B] = self.map(f).decode(row)
    def encode(b: B): Row = self.contramap(g).encode(b)
  }
}

The methods described above allow you to easily compose encoders and decoders based on relations in your domain without rewriting lots of boilerplate code.

Conclusion

HTypes provides some nice additions to the regular HBase API which make it more pleasant to work with in scala. It relies heavily on the typeclass pattern and implicits in general to purely extend the existing API without getting in its way. You can use it to save yourself from some boilerplate code or if you are looking for a simple resource to get familiar with typeclasses.