Andreas Jim

Originally published at Medium.

CSV parsing with Scala and shapeless

The shapeless library serves as an excellent foundation for building generic, reusable components. We demonstrate using the types HList and Generic to parse strings into case classes.

Introduction

This post complements the upcoming article Real-time log processing with Akka streams, which involves processing log entries from a web server log. For convenience and clarity, we want to parse each line of the log file into a case class instance, so that we can work with typed fields instead of raw strings.

There are countless libraries available for parsing strings; we will use scala-csv and shapeless to demonstrate a generic and extensible approach. Some of the code we use is based on the CSV example in the shapeless codebase.

Get the source

The source code of the example project is available on GitHub.

Log format

Our example uses a web server log format with the following components:

  1. Remote IP address
  2. Time in milliseconds (Unix time stamp)
  3. Request path
  4. User agent

This roughly corresponds to the following nginx log configuration:

log_format my_log_format
  '"$remote_addr","$msec","$request","$http_user_agent"';

A log entry looks like this:

"1.2.3.4","1466585706027","/foo","Chrome"

We will use the following case class to represent log events:
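A minimal sketch of such a case class could look like this. The field names are assumptions, and java.time.Instant stands in for the Joda Time Instant mentioned later in the article so that the snippet needs no extra dependency.

```scala
import java.net.InetAddress
import java.time.Instant

// One field per column of the log format, in the same order.
// Field names are illustrative; java.time.Instant replaces Joda Time's Instant here.
final case class LogEvent(
  ip: InetAddress,
  instant: Instant,
  request: String,
  userAgent: String
)
```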

Parsing string values into objects

A parser type for individual CSV record elements is defined by the Parser trait. The parse result is either an object of type T or an error message:
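As a sketch of the trait described above, the following shows one possible shape. The Boolean instance and the name ParserExample are hypothetical and only illustrate how an instance is written by hand.

```scala
// Type class describing how to turn a single CSV field into a value of type T.
trait Parser[T] {
  // Left carries an error message, Right the parsed value.
  def apply(s: String): Either[String, T]
}

object ParserExample {
  // Hypothetical instance, written by hand for illustration.
  val booleanParser: Parser[Boolean] = new Parser[Boolean] {
    def apply(s: String): Either[String, Boolean] = s match {
      case "true"  => Right(true)
      case "false" => Right(false)
      case other   => Left(s"Cannot parse '$other' as Boolean")
    }
  }
}
```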

For convenience, we provide a parser which handles exceptions and returns an error message. This covers the typical scenario of using parser methods from Java libraries, e.g. Integer.parseInt, which throws NumberFormatException, or DateFormat.parse, which throws ParseException:
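One way to build such an exception-handling parser is a small factory method that wraps a throwing function. The name instance and the wording of the error message are assumptions made for this sketch.

```scala
import scala.util.control.NonFatal

trait Parser[T] {
  def apply(s: String): Either[String, T]
}

object Parser {
  // Hypothetical helper: turns a function that may throw into a Parser
  // that reports non-fatal exceptions as an error message.
  def instance[T](typeName: String)(parse: String => T): Parser[T] =
    new Parser[T] {
      def apply(s: String): Either[String, T] =
        try Right(parse(s))
        catch {
          case NonFatal(e) => Left(s"Cannot parse '$s' as $typeName: ${e.getMessage}")
        }
    }
}

object ParserUsage {
  // Integer.parseInt throws NumberFormatException on invalid input.
  val intParser: Parser[Int] = Parser.instance[Int]("Int")(s => Integer.parseInt(s))
}
```

Catching only NonFatal exceptions leaves errors such as OutOfMemoryError untouched, which is usually what a parser wants.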

The Parsers object provides a trivial parser for strings. The client application can provide additional custom parsers. Our application provides a parser for parsing millisecond expressions into Joda Time Instant objects and a parser for IP addresses:
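The split between library-provided and application-provided parsers could be sketched as follows. The object name LogParsers and the helper instance are assumptions; java.time.Instant is used in place of Joda Time's Instant to avoid an extra dependency, and InetAddress.getByName is one possible way to parse an address.

```scala
import java.net.InetAddress
import java.time.Instant
import scala.util.control.NonFatal

trait Parser[T] {
  def apply(s: String): Either[String, T]
}

object Parser {
  // Hypothetical helper that reports non-fatal exceptions as error messages.
  def instance[T](typeName: String)(parse: String => T): Parser[T] =
    new Parser[T] {
      def apply(s: String): Either[String, T] =
        try Right(parse(s))
        catch { case NonFatal(e) => Left(s"Cannot parse '$s' as $typeName: ${e.getMessage}") }
    }
}

// Library side: the trivial parser for strings never fails.
object Parsers {
  implicit val stringParser: Parser[String] = new Parser[String] {
    def apply(s: String): Either[String, String] = Right(s)
  }
}

// Application side: custom parsers for the log format.
object LogParsers {
  // Interprets the field as milliseconds since the Unix epoch.
  implicit val instantParser: Parser[Instant] =
    Parser.instance[Instant]("Instant")(s => Instant.ofEpochMilli(s.toLong))

  // Note: getByName also accepts host names and would then perform a lookup.
  implicit val inetAddressParser: Parser[InetAddress] =
    Parser.instance[InetAddress]("InetAddress")(s => InetAddress.getByName(s))
}
```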

Parsing CSV records

The LineParser[T] class parses a CSV record, which is represented as a List[String], into an instance of case class T. We use the HList (heterogeneous list) type from the shapeless library to express the types of the list elements. The string list is first converted to an HList instance, which in turn is converted to a case class instance.

We will need some types from the shapeless library:

First we define a LineParser[Out] trait which defines the capability of parsing a list of strings into an object of type Out. In the case of an error, we expect that the parsing errors of individual list elements will be aggregated into a list represented by the Left incarnation of the Either result type.
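The trait itself is small. To illustrate what the aggregated error list in Left is meant to contain, the sketch below adds a hand-written instance for a pair of integers; intPair and LineParserExample are hypothetical names and not part of the article's code.

```scala
import scala.util.Try

// Parses a whole CSV record; Left collects the errors of all fields that failed.
trait LineParser[Out] {
  def apply(fields: List[String]): Either[List[String], Out]
}

object LineParserExample {
  // Hand-written instance for illustration: expects exactly two integer fields.
  val intPair: LineParser[(Int, Int)] = new LineParser[(Int, Int)] {
    private def int(s: String): Either[String, Int] =
      Try(s.toInt).toOption.toRight(s"Not an integer: $s")

    def apply(fields: List[String]): Either[List[String], (Int, Int)] = fields match {
      case List(a, b) =>
        (int(a), int(b)) match {
          case (Right(x), Right(y)) => Right((x, y))
          // At least one field failed: keep every error message, in field order.
          case (x, y) => Left(List(x, y).collect { case Left(e) => e })
        }
      case other => Left(List(s"Expected 2 fields, found ${other.size}"))
    }
  }
}
```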

Now we define a companion object for our LineParser trait which provides methods for parsing List[String] instances into HList instances. The methods are declared as implicit, so the compiler can find them wherever a LineParser instance is required. The object provides one method for each of the two List cases, Nil and cons (::). In the pattern matches we use the +: extractor instead of ::, because :: collides with the shapeless :: type.

The hnilParser method expects an empty input list, for which it emits the empty HList (HNil). It returns an error if the list contains one or more elements:
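A possible implementation of this base case is sketched below, assuming shapeless 2.x is on the classpath. It is written as an implicit val rather than a method, and the error message is made up for the example.

```scala
import shapeless.HNil

trait LineParser[Out] {
  def apply(fields: List[String]): Either[List[String], Out]
}

object LineParser {
  // Base case of the recursion: no fields left means an empty HList.
  implicit val hnilParser: LineParser[HNil] = new LineParser[HNil] {
    def apply(fields: List[String]): Either[List[String], HNil] = fields match {
      case Nil   => Right(HNil)
      case extra => Left(List(s"Expected end of record, found ${extra.size} more field(s)"))
    }
  }
}
```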

The hconsParser method parses a concatenation of a head element (type H) and a tail list (type T), combining the errors which may occur during either parsing step. Note the Parser context bound on the H type, which ensures that the list element type H is a member of the Parser type class, i.e. that a parser for this type is provided. The parser is later obtained using the expression implicitly[Parser[H]].
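The recursive case could be sketched as follows, again assuming shapeless 2.x. The error messages, the Int parser and the HConsUsage object are assumptions added to make the example self-contained.

```scala
import scala.util.Try
import shapeless.{::, HList, HNil}

trait Parser[T] { def apply(s: String): Either[String, T] }

trait LineParser[Out] { def apply(fields: List[String]): Either[List[String], Out] }

object LineParser {
  implicit val hnilParser: LineParser[HNil] = new LineParser[HNil] {
    def apply(fields: List[String]): Either[List[String], HNil] =
      if (fields.isEmpty) Right(HNil) else Left(List(s"Unexpected extra fields: ${fields.size}"))
  }

  // H needs a field parser, T needs a line parser for the remaining fields.
  implicit def hconsParser[H: Parser, T <: HList: LineParser]: LineParser[H :: T] =
    new LineParser[H :: T] {
      def apply(fields: List[String]): Either[List[String], H :: T] = fields match {
        case Nil => Left(List("Expected more fields, but the record ended"))
        // +: is used because the name :: now refers to the shapeless type.
        case head +: tail =>
          val parsedHead = implicitly[Parser[H]].apply(head)
          val parsedTail = implicitly[LineParser[T]].apply(tail)
          (parsedHead, parsedTail) match {
            case (Right(h), Right(t)) => Right(h :: t)
            case (Left(e), Right(_))  => Left(List(e))
            case (Right(_), Left(es)) => Left(es)
            case (Left(e), Left(es))  => Left(e :: es)
          }
      }
    }
}

object HConsUsage {
  implicit val stringParser: Parser[String] = new Parser[String] {
    def apply(s: String): Either[String, String] = Right(s)
  }
  implicit val intParser: Parser[Int] = new Parser[Int] {
    def apply(s: String): Either[String, Int] =
      Try(s.toInt).toOption.toRight(s"Not an integer: $s")
  }

  // The compiler assembles this parser from hconsParser (twice) and hnilParser.
  val parser: LineParser[String :: Int :: HNil] = implicitly[LineParser[String :: Int :: HNil]]
}
```

Both the head and the tail are parsed before the results are combined, so a record with several bad fields reports all of them instead of stopping at the first.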

The following implicit method converts the HList R into the case class A, using the shapeless Generic type. We define an implicit parameter gen, an instance of the Generic trait for A whose representation type is R. The method call gen.from converts the HList representation into the desired instance of case class A.

The apply[A] method is parameterised with the type of the expected case class. It uses the implicitly provided parser to parse the list elements. The implicitly available method caseClassParser allows us to use a case class as the type parameter for the apply method.
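The Generic-based conversion and the apply entry point described above can be sketched together as shown here. This assumes shapeless 2.x and Scala 2.12 or later (for the right-biased Either). The helper instance, the error messages, the field names of LogEvent and the use of java.time.Instant instead of Joda Time are assumptions of this sketch.

```scala
import java.net.InetAddress
import java.time.Instant
import scala.util.Try
import shapeless.{::, Generic, HList, HNil}

trait Parser[T] { def apply(s: String): Either[String, T] }

object Parser {
  // Hypothetical helper: wraps a throwing function and reports failures as Left.
  def instance[T](name: String)(f: String => T): Parser[T] = new Parser[T] {
    def apply(s: String): Either[String, T] =
      Try(f(s)).toOption.toRight(s"Cannot parse '$s' as $name")
  }
  implicit val stringParser: Parser[String] = instance[String]("String")(s => s)
  implicit val instantParser: Parser[Instant] =
    instance[Instant]("Instant")(s => Instant.ofEpochMilli(s.toLong))
  implicit val inetAddressParser: Parser[InetAddress] =
    instance[InetAddress]("InetAddress")(s => InetAddress.getByName(s))
}

trait LineParser[Out] { def apply(fields: List[String]): Either[List[String], Out] }

object LineParser {
  implicit val hnilParser: LineParser[HNil] = new LineParser[HNil] {
    def apply(fields: List[String]): Either[List[String], HNil] =
      if (fields.isEmpty) Right(HNil) else Left(List(s"Unexpected extra fields: ${fields.size}"))
  }

  implicit def hconsParser[H: Parser, T <: HList: LineParser]: LineParser[H :: T] =
    new LineParser[H :: T] {
      def apply(fields: List[String]): Either[List[String], H :: T] = fields match {
        case Nil => Left(List("Too few fields"))
        case head +: tail =>
          (implicitly[Parser[H]].apply(head), implicitly[LineParser[T]].apply(tail)) match {
            case (Right(h), Right(t)) => Right(h :: t)
            case (Left(e), Right(_))  => Left(List(e))
            case (Right(_), Left(es)) => Left(es)
            case (Left(e), Left(es))  => Left(e :: es)
          }
      }
    }

  // Generic.Aux[A, R] links case class A to its HList representation R;
  // gen.from rebuilds an A from a successfully parsed R.
  implicit def caseClassParser[A, R <: HList](implicit
      gen: Generic.Aux[A, R],
      reprParser: LineParser[R]
  ): LineParser[A] = new LineParser[A] {
    def apply(fields: List[String]): Either[List[String], A] =
      reprParser(fields).map(gen.from)
  }

  // Entry point: LineParser[LogEvent](fields) resolves the parser implicitly.
  def apply[A](fields: List[String])(implicit parser: LineParser[A]): Either[List[String], A] =
    parser(fields)
}

final case class LogEvent(ip: InetAddress, instant: Instant, request: String, userAgent: String)

object LineParserUsage {
  val parsed: Either[List[String], LogEvent] =
    LineParser[LogEvent](List("1.2.3.4", "1466585706027", "/foo", "Chrome"))
}
```

Because the field parsers live in the Parser companion object in this sketch, they are found without an import; parsers defined elsewhere would have to be brought into scope at the call site.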

Putting it all together

The CsvReader[T] class converts comma-separated strings into objects of type T.

The read method parses the lines of an Akka stream source, which emits elements of type String, into objects of type T. Note that the type parameter T has a context bound of type LineParser, which ensures that a parser is available for this type, as explained in the previous section.

The class uses the CSV parser class from the scala-csv library to split lines into lists of strings. If a line is successfully split into a list of strings, the LineParser[T] instance is obtained and its apply method is invoked on the string list.

The CsvReader[T].read method can now be used to transform a source of strings into a source of elements of type T:
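A sketch of the reader and its use might look like this, assuming akka-stream and scala-csv are on the classpath. To keep the example short, a hand-written LineParser for a hypothetical two-field Request class stands in for the shapeless-derived one. The parseLine helper is also an assumption; it is split out so that the per-line logic can be called without running a stream.

```scala
import akka.NotUsed
import akka.stream.scaladsl.Source
import com.github.tototoshi.csv.{CSVParser, DefaultCSVFormat}

trait LineParser[Out] { def apply(fields: List[String]): Either[List[String], Out] }

object LineParser {
  def apply[A](fields: List[String])(implicit parser: LineParser[A]): Either[List[String], A] =
    parser(fields)
}

// The context bound guarantees that a LineParser[T] is available.
class CsvReader[T: LineParser] {
  private val csvParser = new CSVParser(new DefaultCSVFormat {})

  // Splits one line into fields, then hands the fields to the LineParser.
  def parseLine(line: String): Either[List[String], T] =
    csvParser.parseLine(line) match {
      case Some(fields) => LineParser[T](fields)
      // None signals that the line could not be split into fields.
      case None => Left(List(s"Invalid line: $line"))
    }

  // Maps every line emitted by the source to either a T or a list of errors.
  def read[Mat](source: Source[String, Mat]): Source[Either[List[String], T], Mat] =
    source.map(parseLine)
}

// Hypothetical stand-in for a derived parser.
final case class Request(path: String, userAgent: String)

object Request {
  implicit val lineParser: LineParser[Request] = new LineParser[Request] {
    def apply(fields: List[String]): Either[List[String], Request] = fields match {
      case List(path, agent) => Right(Request(path, agent))
      case other             => Left(List(s"Expected 2 fields, found ${other.size}"))
    }
  }
}

object CsvReaderUsage {
  val reader = new CsvReader[Request]
  val lines: Source[String, NotUsed] = Source(List("\"/foo\",\"Chrome\"", "\"/bar\""))
  // Nothing runs until the source is materialized, e.g. with runWith(Sink.seq).
  val events: Source[Either[List[String], Request], NotUsed] = reader.read(lines)
}
```

Emitting Either values rather than failing the stream keeps malformed lines visible downstream, where they can be logged or filtered.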

Further reading

This article was originally published on the BeCompany blog.

