I need to implement a small command-line tool that converts a CSV file from one format to another. The task itself is rather simple, but a few quirks in the input CSV make it a bit more interesting.

As an experiment, I decided to write the tool in several different languages and compare the experience. Somewhat randomly, I chose Ruby, Haskell, Java and Elixir.

Read on to see what happened.

Problem Definition

The input file is a bank statement from my bank’s online banking system. The output is a file in a format understandable by my personal finance management system.

Example input:

<U+FEFF>"ЗАО «МТБанк»"
"улица Толстого, 10, 220007, г.Минск."
"http://www.mtbank.by"
"тел.: +375 17 229-98-89"
"Выписка по счету  12322332233"
"Клиент:","Цой Виктор Робертович","15.01.2017 г. 19:58:04"
"Название:","VC USD 2Y (Сберегательная карта)"
"Номер счета:","12322332233"
"Номер и дата договора:","38555568 от 23.06.2015"
"Период выписки:","с 14.01.2017 г. по 16.01.2017 г."
"Входящий остаток:","1,000.00","USD"
"поступило:","0.00","USD"
"списано:","17.65","USD"
"Исходящий остаток:","1,000.81","USD"
Транзакции по счету с 14.01.2017 г. по 16.01.2017 г.
"Тип","Дата операции","Дата обработки","Место операции","Oписание операции","Карта","Валюта","Сумма в валюте операции","Сумма в валюте счета","Остаток счета"
"A","15.01.2017 12:19:46","..","SHOP "BRIOCHE PARIS" / MINSK / BY","Оплата товаров и услуг","123123******2222","BYN","-25.10","-12.88",""
"A","14.01.2017 17:10:38","..","0253379 / MINSK / BY","Выдача наличных","123123******2222","BYN","-100.00","-51.31",""
"T","12.01.2017 16:52:34","16.01.2017","SHOP"KALI LASKA" BPSB / MINSK / BY","Оплата товаров и услуг","123123******2222","BYN","-34.39","-17.65","1,000.81"
"  Типы операций"
"T - сумма обработана"
"A - сумма блокирована (ожидает обработки)"
"E - ошибка выполнения операции"
"Исходящий остаток:","1,000.81","USD"

Example output:

15.01.2017;ХЗ;Без категории;-12.88;ХЗ;"Место: SHOP ""BRIOCHE PARIS"" / MINSK / BY, Описание: Оплата товаров и услуг, Сумма: -25.10 BYN";
14.01.2017;ХЗ;Без категории;-51.31;ХЗ;Место: 0253379 / MINSK / BY, Описание: Выдача наличных, Сумма: -100.00 BYN;
12.01.2017;ХЗ;Без категории;-17.65;ХЗ;"Место: SHOP""KALI LASKA"" BPSB / MINSK / BY, Описание: Оплата товаров и услуг, Сумма: -34.39 BYN";
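
The mapping from one transaction row to one output line can be sketched roughly like this. This is a minimal illustration, not the actual script: the helper names (quote_field, to_output_line) and the constants are made up for the example, the column positions are read off the header row of the example input, and the quoting rule (quote only fields that contain the separator, a double quote or a line break) is inferred from the example output.

```ruby
# Hypothetical constants: both values are copied from the example output.
PLACEHOLDER = "ХЗ"
CATEGORY = "Без категории"

# Wrap a field in quotes only when it contains the separator, a double quote
# or a line break; inner double quotes are doubled.
def quote_field(field, sep = ";")
  return field unless field.include?(sep) || field.include?('"') || field.include?("\n")

  '"' + field.gsub('"', '""') + '"'
end

# row is an array of the 10 input columns, in the order of the input header.
def to_output_line(row)
  date = row[1].split(" ").first # "15.01.2017 12:19:46" -> "15.01.2017"
  note = "Место: #{row[3]}, Описание: #{row[4]}, Сумма: #{row[7]} #{row[6]}"
  fields = [date, PLACEHOLDER, CATEGORY, row[8], PLACEHOLDER, note]
  # The example output ends every line with a trailing separator.
  fields.map { |f| quote_field(f) }.join(";") + ";"
end
```

Fed the first transaction of the example input, this should reproduce the first line of the example output.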

A few notable things:

  1. Input data is encoded in UTF-8, starts with a byte order mark (U+FEFF) and contains Cyrillic characters.
  2. The first and last several lines of the input are not transactions and must be left out of the output.
  3. The input CSV is not valid: double quotes in the middle of fields are not escaped.
  4. Some fields require additional transformations besides renaming: e.g. "12.01.2017 16:52:34" -> "12.01.2017".
  5. The input is always small and can fit into memory. Performance is not a concern.
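
Points 1 and 2 could be handled along these lines. This is only a sketch under a few assumptions: the text has already been read as UTF-8, the transaction table always starts right after the header row beginning with "Тип","Дата операции", and it ends at the line containing "Типы операций". The function and constant names are hypothetical.

```ruby
BOM = "\uFEFF"
HEADER_PREFIX = '"Тип","Дата операции"' # assumed start of the table header row
FOOTER_MARKER = "Типы операций"         # assumed first line after the table

# Returns only the raw transaction lines of a statement, without line endings.
def transaction_lines(text)
  lines = text.delete_prefix(BOM).lines.map(&:chomp)
  header_index = lines.index { |line| line.start_with?(HEADER_PREFIX) }
  return [] if header_index.nil?

  # Skip everything up to and including the header, then stop at the footer.
  lines.drop(header_index + 1).take_while { |line| !line.include?(FOOTER_MARKER) }
end
```

Since the input always fits into memory, reading the whole file into a string first and slicing it like this is good enough.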

The experiment

To avoid wasting time, I decided to get something working really fast. No automated tests, no refactoring once the tool works, nothing like that. As soon as it works, ship it and move on to the next language.

Here is the pretty arbitrary scoring system I used to rate each implementation.

  1. Correctness - max 3 points. If the script doesn’t work, it’s pointless.
  2. Speed of development - max 2 points.
  3. Ease of packaging up the tool into a standalone binary or script - max 1 point.
  4. Number of external dependencies - max 1 point (fewer is better).
  5. Lines of code - max 1 point.
  6. Code readability - max 1 point. I’m not going to change this script very often, and it’s also small, so readability is not that important to me in this case.
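
For the record, the arithmetic behind the totals is just a sum of the six criteria. A tiny sketch, with hypothetical names and the maximum points taken from the list above:

```ruby
# Maximum points per criterion, as listed above.
MAX_POINTS = {
  correctness: 3,
  speed: 2,
  packaging: 1,
  dependencies: 1,
  lines_of_code: 1,
  readability: 1
}.freeze

# Sums a verdict (criterion => points) after checking it against the maxima.
def total_score(verdict)
  verdict.each do |criterion, points|
    max = MAX_POINTS.fetch(criterion)
    raise ArgumentError, "#{criterion} is capped at #{max}" if points > max
  end
  verdict.values.sum
end
```

A perfect score is 9 points. The Ruby verdict below, for example, adds up as 2 + 2 + 1 + 0.5 + 1 + 1 = 7.5.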

Results

Ruby

Ruby was my first choice since it’s the language I’m most comfortable with at the moment. The result is typical throwaway code, about 100 lines long. It has no external dependencies except Ruby itself (the CSV parser library is built into Ruby). It took me a couple of hours to build and debug. The one painful part was “fixing” the badly escaped double quotes in the middle of fields.
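
One possible way to deal with those quotes, which is not necessarily what the actual script does, is to skip the CSV parser for splitting. This sketch assumes that every field in a transaction line is wrapped in double quotes and that no field contains the sequence `","`. Both hold for the example input, but neither is guaranteed by the bank. The function names are hypothetical.

```ruby
FIELD_SEPARATOR = '","'

# Splits one fully quoted line into raw field values.
# Stray double quotes inside a field are kept exactly as they are.
def split_sloppy_line(line)
  stripped = line.strip
  unless stripped.length >= 2 && stripped.start_with?('"') && stripped.end_with?('"')
    raise ArgumentError, "expected a fully quoted line: #{line.inspect}"
  end

  inner = stripped[1...-1]
  return [""] if inner.empty?

  # The -1 limit keeps trailing empty fields such as the final "".
  inner.split(FIELD_SEPARATOR, -1)
end

# Re-emits the line as valid CSV by doubling every inner double quote.
def repair_line(line)
  split_sloppy_line(line).map { |field| '"' + field.gsub('"', '""') + '"' }.join(",")
end
```

After repair_line, a regular CSV parser should accept the line, or the fields from split_sloppy_line can be used directly.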

Verdict:

  1. Correctness - 2 points. There are a few bugs that I discovered later, but they are very easy to fix.
  2. Speed of development - 2 points. It took very roughly 2-3 hours to build the script. The experience was more or less smooth, with a few hiccups when I realized that the input CSV was invalid.
  3. Packaging standalone script - 1 point. It’s just one file that works anywhere Ruby 2 is installed.
  4. Number of external dependencies - 0.5 points. No external libraries are required, but Ruby is still needed to run the script.
  5. Lines of code - 1 point. ~100 LOC. Not that bad.
  6. Code readability - 1 point.

Total - 7.5.

Haskell

Having solved the problem in Ruby, I decided to give Haskell a shot. Luckily, Stack takes most of the pain out of managing external dependencies. The development experience was not that good, however. There are something like 5 different types for string-ish data: String, Data.Text (strict), Data.Text.Lazy, Data.ByteString.Lazy, and so on and so forth. Some of these types come from external packages. There are also a bunch of packages for regular expressions; I ended up using one that also requires installing an external library (icu4c). There are a few packages for parsing CSV as well, which I tried with varying degrees of success. I ended up using cassava.

So I spent a couple of evenings trying different combinations of packages. In the end, I had 7 (seven) dependencies in my cabal file: temporary, text-regex-replace, bytestring, cassava, text, text-icu, vector. My 100-line app even compiled successfully! But it didn’t do anything useful yet, because I had only made it parse one column of the CSV.

But even that was enough to show that the app couldn’t handle Cyrillic characters in the CSV data. I gave up.

Verdict:

  1. Correctness - 0 points.
  2. Speed of development - 0 points.
  3. Ease of packaging standalone script or executable - 0.5 points. It would be 1 point if I didn’t have to install or upgrade icu4c. Other than that, Stack does an incredible job managing dependencies for me.
  4. Number of external dependencies - 0 points. Too much stuff has to be pulled in to get anything working.
  5. Lines of code - 1 point. It’s fewer than a hundred lines of code (OTOH, the app doesn’t work at all, lol).
  6. Code readability - 1 point. The code is concise and clear (to me). Type annotations serve as very good documentation generally.

Total - 2.5.

Java

After failing the Haskell experiment, I decided to switch to something simple. Something where the IDE does most of the work for me. Java!

I have to say that I’m not really up to date with the modern ways of working with Java collections. Every time I needed to map over a collection, I wrote the loop by hand. Sad story.

I also failed to build a standalone executable uberjar. I pasted various snippets into my pom.xml, but then got bored and gave up.

Verdict:

  1. Correctness - 2 points. I introduced a few bugs which I found later.
  2. Speed of development - 2 points. I spent 2-3 hours, just like with the Ruby version.
  3. Ease of packaging standalone uberjar - 0 points. I failed.
  4. Number of external deps - 0.5 points. Java is needed to run the jar.
  5. Lines of code - 0.5 points. 200+ LoC - a bit too much. And if we add pom.xml then it’s gonna be almost 300 lines.
  6. Code readability - 0.5 points. On the one hand, the code is really stupid simple, but on the other hand, in some cases that simplicity leads to bloat. Manual for loops, juggling boolean flags, all this jazz from the world of Golang and similar languages. Not my thing.

Total - 5.5.

Elixir

With 3 implementations under my belt, writing the script in Elixir was a breeze. The pipeline operator is great. Everything worked almost on the first attempt. The source code is really simple and elegant. Although it was not a goal, the code works in a “streaming” manner: it uses constant memory no matter how big the input is. I enjoyed it so much.
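
To show what “streaming” means here, this is a rough Ruby analogue built on lazy enumerators. It is not the Elixir code, just the same idea sketched under the assumption that the table header row starts with "Тип" and the table ends at the "Типы операций" line. The function name is hypothetical.

```ruby
# Takes any enumerable of lines and lazily yields only the transaction lines.
# Lines are pulled one at a time, so memory use does not grow with input size.
def stream_transactions(lines)
  lines.lazy
       .map(&:chomp)
       .drop_while { |line| !line.start_with?('"Тип"') } # skip the statement header
       .drop(1)                                          # skip the table header row
       .take_while { |line| !line.include?("Типы операций") }
end
```

With a file on disk, something like `stream_transactions(File.foreach("statement.csv")).each { |line| puts line }` would process the statement line by line (the file name is made up).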

Verdict:

  1. Correctness - 3 points. Not a surprise, given that this was not my first attempt at writing this script.
  2. Speed of development - 2 points. Same 2-3 hours as with Ruby and Java.
  3. Ease of packaging standalone script - 1 point. Just one simple command.
  4. Number of external deps - 0.5 points. Erlang runtime is needed to run the script.
  5. Lines of code - 1 point. ~100 LoC, same size as Ruby and (non-working) Haskell.
  6. Code readability - 1 point.

Total - 8.5.

Conclusion

I wish I could give Haskell more points, but my experience with it was too painful, unfortunately. Sad, but true.

With the other 3 languages my productivity was about the same, but Elixir (and Ruby, to some extent) is more fun to work with, at least for me.

The source code is available here. Bear with me, it can be really shitty in some places. I spent 0 time polishing it.