Berlin Buzzwords feels more like a festival than a normal conference. You get wristbands that you are supposed to wear for a few days, and since it is held at a brewery there is of course plenty of beer. The organizers have adjusted the schedule accordingly, which means that the keynote starts at 10.30 and the first session around 12. Still, the afternoon was packed with interesting sessions of various lengths, and the schedule was followed with German precision. The conference attracts an interesting mix of developers, operations people and data scientists from all over Europe. Videos from last year are available, and the organizers were hoping to publish the videos from this year as soon as possible.
Ariel Waldman started the conference with a really inspiring keynote titled “The Hackers Guide to the Galaxy”, which turned out to be not just a clever title: she literally talked about our galaxy, black holes, dark matter and a lot of other interesting stuff. Her main point was that big achievements in science, such as putting a man on the moon, were made by people who at first didn’t have a clue what they were doing. Even “rocket scientists” are at some point beginners. By experimenting and making lots of mistakes (and of course learning from them!) they finally succeeded. She encourages everyone to participate in science and has created Science Hack Day to show that it can actually be done. Even crazy things like a beard detector can turn out to be useful for detecting cosmic rays. She has also created spacehack.org, which lists projects that you can participate in, such as Planet Hunters.
Nitay Joffe described Apache Giraph, which is a processing model for graph data. If you can state your problem in graph terms, you could probably use Giraph. It runs on top of Hadoop and uses ZooKeeper for coordination. He hinted at the size of the data sets that Facebook processes with it, which are on the order of 2 billion vertices and 200 billion edges. To optimize Giraph they have rewritten it to use Netty for all network communication and fastutil for data structures containing primitive types. They also used byte arrays to avoid creating too many objects and utilized sun.misc.Unsafe for serialization. The impression was that Giraph has really been tuned and hardened in production and probably deserves a closer look.
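To give a feeling for what "stating your problem in graph terms" can look like, here is a minimal sketch of the vertex-centric, message-passing style that Giraph is built around. It is plain Python on a single machine, not Giraph's Java API, and the function name `propagate_max` and the toy graph are made up for illustration. Each vertex repeatedly receives messages from its neighbours, keeps the largest value it has seen, and only sends new messages when its own value changed.

```python
def propagate_max(edges, values):
    """Spread the largest value through each connected part of an undirected graph.

    edges  -- iterable of (vertex, vertex) pairs
    values -- dict mapping every vertex to its initial value
    """
    neighbours = {v: set() for v in values}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    values = dict(values)  # do not modify the caller's dict

    # First round: every vertex sends its own value to all neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in neighbours[v]:
            inbox[n].append(val)

    # Later rounds: a vertex only speaks up when it learned a larger value.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbours[v]:
                    outbox[n].append(values[v])
        inbox = outbox  # messages are delivered at the start of the next round

    return values


if __name__ == "__main__":
    graph = [(1, 2), (2, 3), (4, 5)]
    start = {1: 3, 2: 6, 3: 2, 4: 1, 5: 9}
    print(propagate_max(graph, start))
```

The loop ends when no vertex has anything left to say. In a real cluster the vertices would be spread over many workers, which is where the tuning of network communication and serialization mentioned in the talk comes in.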
Clément Stenac gave a talk about Dataiku Flow, which can be used to describe dependencies between larger data processing jobs. It uses a declarative approach to describe the data sets, partitions and dependencies. As a side note he mentioned Dataiku Cloud Transport Client, which provides a unified way of working with different data sources from the command line, for example copying files between s3, ftp, ssh, http and hdfs.
Rashid Khan and Shay Banon gave a talk about open source logging tools, focusing on Logstash, ElasticSearch and Kibana. Definitely good stuff! After a question about how much data you can ingest into ElasticSearch, Shay mentioned a client indexing 10 TB every day.
The talk “Geo-spatial Event Detection in the Twitter Stream” by Michael Kaisser was, unlike many other talks, not about Hadoop or big clusters. Instead it was about how they were able to process large amounts of tweets to find geographical Twitter hot spots. This can for example be used to find places of accidents or areas with traffic problems. After keeping only the tweets that have geo-spatial data with a location in the requested region, areas with high tweeting frequencies are found. Tweets from the same area are then investigated to find similarities. Here they use Weka, which is a toolkit for machine learning.
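The first two steps (keep geo-tagged tweets inside a region, then look for areas with many tweets) can be sketched roughly as below. This is only my own illustration, not the method from the talk: the grid of fixed-size cells, the function name `hot_spots` and the parameters `cell_deg` and `min_count` are all assumptions, and the similarity step with Weka is left out.

```python
from collections import Counter


def hot_spots(tweets, bbox, cell_deg=0.01, min_count=3):
    """Count tweets per grid cell and return the busy cells.

    tweets    -- iterable of (lat, lon) pairs, (None, None) if not geo-tagged
    bbox      -- (min_lat, min_lon, max_lat, max_lon) of the requested region
    cell_deg  -- side length of a grid cell in degrees (assumed value)
    min_count -- tweets needed before a cell counts as a hot spot (assumed value)
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    cells = Counter()
    for lat, lon in tweets:
        if lat is None or lon is None:
            continue  # tweet carries no location
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue  # outside the requested region
        cell = (int((lat - min_lat) // cell_deg),
                int((lon - min_lon) // cell_deg))
        cells[cell] += 1
    return {cell: n for cell, n in cells.items() if n >= min_count}


if __name__ == "__main__":
    berlin = (52.0, 13.0, 53.0, 14.0)
    sample = [
        (52.5201, 13.4051), (52.5203, 13.4055), (52.5207, 13.4059),
        (52.705, 13.905),   # inside the region, but alone in its cell
        (48.1, 11.5),       # outside the region
        (None, None),       # not geo-tagged
    ]
    print(hot_spots(sample, berlin))
```

A fixed grid is the simplest possible choice. Tweets just across a cell border end up in different buckets, so a real system would probably use something smarter, such as overlapping cells or proper clustering.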
Sylvain Lebresne and Eric Evans had two separate talks about new features of Cassandra, unfortunately with some overlap. Eric also gave a talk at the barcamp held on Sunday evening. All Cassandra talks start with the ring and key distribution, and these were no different. Version 1.2 added virtual nodes for faster rebuilds and simpler load balancing. However, something to watch out for is that since all data is distributed across all nodes, losing any X nodes when you have X replicas means that you will lose data. On the other hand, since node rebuilds are now faster, it turns out that you were more likely to lose data previously, simply because the replicas died under the heavier load (not exactly sure about the reasoning, this came up after a few beers…). CQL3 has both INSERT and UPDATE, but there is no difference between them. In version 2.0 there will probably be support for UPDATE WHERE EXISTS. Other features scheduled for 2.0 (planned for this summer) are triggers and compare-and-set support. A good learning resource for how to model Cassandra data is Twissandra and the corresponding Java port. Some of the features include de-normalization using atomic batches and timeuuid.
Mikio Braun described a technique he called stream mining, which uses smart math to process large amounts of events on smaller hardware. Many people left during the session, but it was probably because of all the equations and the fact that it was the last session of the conference. In many cases a good approximation is good enough. For example, to find the most frequent terms we don’t have to compute the count of all terms. Instead we just keep track of the top k terms, and whenever a term is not in the list we replace the least frequent term with the new term. Similarly, when computing a moving average we can just use exponential decay. This will not be accurate, but we can get some worst case guarantees.
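Both tricks fit in a few lines of Python. This is a sketch from my notes rather than code from the talk. One detail is an assumption on my part: when a term is replaced, the new term inherits the old count plus one, which is how the Space-Saving algorithm does it. With that rule a stored count can be too high but never too low, which is one example of a worst case guarantee. The function names and the parameter `alpha` are made up for illustration.

```python
def top_k_terms(stream, k):
    """Approximate counts of the most frequent terms using at most k counters."""
    counts = {}
    for term in stream:
        if term in counts:
            counts[term] += 1
        elif len(counts) < k:
            counts[term] = 1
        else:
            # Replace the least frequent term; the newcomer inherits its
            # count plus one (Space-Saving rule, an assumption here).
            victim = min(counts, key=counts.get)
            counts[term] = counts.pop(victim) + 1
    return counts


def decayed_average(values, alpha):
    """Moving average with exponential decay; alpha in (0, 1] weights new values."""
    average = None
    history = []
    for value in values:
        if average is None:
            average = value  # first value starts the average
        else:
            average = alpha * value + (1 - alpha) * average
        history.append(average)
    return history


if __name__ == "__main__":
    words = ["a", "a", "a", "b", "b", "c", "a", "d", "a"]
    print(top_k_terms(words, k=2))
    print(decayed_average([10, 20], alpha=0.5))
```

In the small example the count for "a" comes out exact, while "d" is overestimated because it inherited the counts of the terms it pushed out. The decayed average needs only a single number of state, no matter how long the stream is.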