Blazing Fast Concurrent Text I/O in Erlang

Why Is io:get_line/2 So Slow?

It is clear that io:get_line/2 is a terribly slow way to read files. From what I have read, it was intended for interactive I/O rather than machine or disk I/O, so performance was not a concern. However, lacking a decent line-reading mechanism appears to be a serious embarrassment for Erlang. Java, Python, Ruby, and many others can work with line-based data quite efficiently.
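For reference, this is roughly what the slow path looks like. It is a minimal sketch, not a benchmark: the module name getline_count and the function count_lines/1 are made up for illustration, and the only library calls are file:open/2, io:get_line/2, and file:close/1.

```erlang
-module(getline_count).
-export([count_lines/1]).

%% Count the lines in Path by calling io:get_line/2 once per line.
%% Every call is a request to the I/O server process that owns the file,
%% which is fine for a terminal but adds overhead on a large file.
count_lines(Path) ->
    {ok, Dev} = file:open(Path, [read]),
    try
        loop(Dev, 0)
    after
        file:close(Dev)
    end.

loop(Dev, N) ->
    case io:get_line(Dev, "") of
        eof -> N;
        {error, Reason} -> erlang:error(Reason);
        _Line -> loop(Dev, N + 1)
    end.
```

Calling getline_count:count_lines("access.log") (a hypothetical file name) walks the file one line at a time in a single process.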

What Naïve Solution Did I Try?

I have seen a few alternative solutions to line-based file-reading in Erlang. My initial response to these was that they unnecessarily complicated things by introducing either native code or expansive case expressions using binary I/O. What follows is my attempt to outdo these solutions by using cat through an Erlang port.

I ran the common utility program cat through an Erlang port so that the file's contents appear to be pushed to my Erlang process as messages. The results were terrible and the performance gap was extreme: on average it takes twice as long as a bfile-based solution to read a file and roughly thirty times as long as a Python line reader.
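The port approach can be sketched as follows. This is a reconstruction of the idea, not the code I measured: the module name cat_port, the 65536-byte line limit, and the use of os:find_executable/1 to locate cat are all assumptions made for the example.

```erlang
-module(cat_port).
-export([count_lines/1]).

%% Count newline-terminated lines in Path by letting cat push them to us.
%% Assumes a Unix-like system where cat is on the PATH; if it is not,
%% os:find_executable/1 returns false and open_port/2 fails.
count_lines(Path) ->
    Cat = os:find_executable("cat"),
    Port = open_port({spawn_executable, Cat},
                     [{args, [Path]},
                      {line, 65536},   % assumed maximum line length in bytes
                      binary,
                      exit_status]),
    loop(Port, 0).

loop(Port, N) ->
    receive
        %% A complete line, delivered without its newline.
        {Port, {data, {eol, _Line}}} ->
            loop(Port, N + 1);
        %% A fragment: either part of an over-long line or trailing data
        %% with no final newline. It is not counted as a line here.
        {Port, {data, {noeol, _Part}}} ->
            loop(Port, N);
        {Port, {exit_status, 0}} ->
            N;
        {Port, {exit_status, Status}} ->
            erlang:error({cat_failed, Status})
    end.
```

Each line arrives as its own message, which is what makes the file look like it is pushing data at the process. It also means one message per line, so the per-line overhead does not go away.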

What Did I Learn?

A single Erlang process is not good at reading large line-based files. Klacke's bfile is the de facto solution for this, and even it is quite slow. In principle, bfile should be as fast as Python. However, file I/O and Erlang have contrary needs. Sequential file I/O performance benefits from blocking and consuming as much CPU as possible in a single thread. Erlang's VM has a preemptive scheduler that thwarts this to avoid denial of service and non-determinism.

Wait! Isn't Erlang Concurrent?

Doing a huge chunk of sequential work in a single process is fundamentally bad for scalability (check with Amdahl if you don't believe me). If you want good I/O performance on a large file in Erlang, you pretty much need to read different segments of the file with different Erlang processes. Per Gustafsson developed the line_server module, which tears through line-based files at a blistering pace. It was one of many compelling alternatives that proved Erlang's worthiness in the Wide Finder debate, where it was getting roundly panned due to io:get_line/2.
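To make the segment idea concrete, here is a minimal sketch of my own. It is not Per Gustafsson's line_server and does not reproduce its API; the module name par_lines and the function count_lines/2 are hypothetical. Each worker opens the file itself, reads one byte range with file:pread/3, and reports how many newlines it found.

```erlang
-module(par_lines).
-export([count_lines/2]).

%% Count newline characters in Path using up to Workers processes,
%% each reading its own byte range of the file.
count_lines(Path, Workers) when is_integer(Workers), Workers > 0 ->
    Size = filelib:file_size(Path),
    %% Round up so the ranges cover the whole file; at least 1 byte each.
    Chunk = max(1, (Size + Workers - 1) div Workers),
    Parent = self(),
    Refs = [spawn_worker(Parent, Path, Offset, Chunk)
            || Offset <- lists:seq(0, Size - 1, Chunk)],
    lists:sum([collect(Ref) || Ref <- Refs]).

spawn_worker(Parent, Path, Offset, Length) ->
    Ref = make_ref(),
    spawn_link(fun() ->
        %% Raw file handles belong to the process that opened them,
        %% so every worker opens its own.
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        Count = case file:pread(Fd, Offset, Length) of
                    {ok, Bin} -> count_newlines(Bin);
                    eof -> 0
                end,
        ok = file:close(Fd),
        Parent ! {Ref, Count}
    end),
    Ref.

%% Wait for the result tagged with this worker's reference.
collect(Ref) ->
    receive
        {Ref, Count} -> Count
    end.

count_newlines(Bin) ->
    length(binary:matches(Bin, <<"\n">>)).
```

Counting newlines sidesteps the hard part: a line that straddles two segments is still counted exactly once, because only its terminating newline matters. Anything that hands out whole lines has to stitch those straddling pieces back together, and this sketch makes no attempt at that.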

IMHO, line_server is a much better solution than bfile because it encourages you to write concurrent I/O, which will scale far better than fast sequential I/O. If you want lines from a file, you might as well have a waiter serve them to you rather than fetching them yourself.

Erlang generally encourages concurrent solutions to problems and will disappoint you if you fight that. Concurrent I/O with Erlang delivers excellent performance and scalability. We need to change the way we think about many problems if we are going to get over this multi-core chasm that is opening in front of us. Concurrent I/O is a great start and Per Gustafsson demonstrates just how elegant and easy it is in Erlang.