Developing Apache Kafka Producers and Consumers

All Things Hadoop

I gave a presentation recently on Real-time streaming and data pipelines with Apache Kafka.

A correction to the talk (~22 minutes in): I said that all of your topic data has to fit on one server. That is not true. A log cannot span servers, so all of the data for a single partition has to fit on one server, but Kafka will spread a topic's partitions across servers for you.

For that presentation I put together sample code for producing and consuming with an Apache Kafka broker using Scala.

To get up and running, use Vagrant.

1) Install Vagrant http://www.vagrantup.com/
2) Install VirtualBox https://www.virtualbox.org/

Your entry point is the test file.

On the producer side, I have started to look more into using Akka. The prototype for this implementation is in the test case above…
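
For illustration only, here is a minimal sketch of producing and consuming from Scala. It is not the sample code from the talk: it assumes the org.apache.kafka.clients Java client API (usable from Scala), a broker reachable at localhost:9092, and a hypothetical topic name.

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KafkaSketch extends App {
      // Produce a single message to a hypothetical topic.
      val producerProps = new Properties()
      producerProps.put("bootstrap.servers", "localhost:9092")   // assumed broker address
      producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](producerProps)
      producer.send(new ProducerRecord[String, String]("test-topic", "key", "hello kafka"))
      producer.close()

      // Consume the message back, starting from the earliest offset.
      val consumerProps = new Properties()
      consumerProps.put("bootstrap.servers", "localhost:9092")
      consumerProps.put("group.id", "sketch-group")
      consumerProps.put("auto.offset.reset", "earliest")
      consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      val consumer = new KafkaConsumer[String, String](consumerProps)
      consumer.subscribe(Collections.singletonList("test-topic"))
      val records = consumer.poll(Duration.ofSeconds(5)).iterator()
      while (records.hasNext) {
        val r = records.next()
        println(s"${r.key} -> ${r.value}")
      }
      consumer.close()
    }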

View original post 44 more words


The last difference between OpenJDK and Oracle JDK

Technology and fun

Recently I’ve spent a lot of time investigating font rasterization (a great topic which deserves a separate post). Most applications use the font engine built into their graphics library or widget toolkit. Only a few cross-platform applications that badly need to provide consistent text layout (Acrobat Reader, for example) use their own font engines (like Adobe CoolType). The Java platform is one such application, since it has its own graphics library. If you are curious, take a look at this article comparing font engines, including the one from the Java platform. From publicly available information I understood that OpenJDK uses the FreeType library. I thought: “That’s great, I have JDK 1.7 installed, so this library must be there; let’s take a look.” But I could not find any traces of freetype.dll in the JDK. I was puzzled and tried to find some answers in the sources of OpenJDK. Imagine my surprise when I found…

View original post 257 more words

My experience of learning R – from basic graphs to performance tuning

Mani's fun & useful blogs

Background

R, as some of you may know, is a statistical and graphics programming language (see Wikipedia [1]) used in academia and, more recently, by IT professionals in our ever-growing software industry. There is sudden demand for Data Scientists, Data Analysts and Statisticians with a background in R, among other data- and development-related subjects.

I have been fortunate to work with such a programming language, even though I had no prior experience with it, or with Data Scientists. My interest in Mathematics and affinity for numbers drew me to learning it, and with the further help of Herve Schnegg, our in-house Senior Data Scientist, I was able to pick up a fair bit of the subject.

R is a mix of object-oriented programming, Clojure-like functional programming, a JavaScript-like style of writing code and a Smalltalk-like programming interface. And…

View original post 2,942 more words

Go vs D vs Erlang vs C in real life: MQTT broker implementation shootout.

Átila on Code

At work we recently started using the MQTT protocol, which uses a publish/subscribe model. It’s simple in the good way and well thought out. We went with an open source implementation named Mosquitto. A few weeks ago, on the way back from a lunch break, my colleague Jeff told me he was writing an MQTT broker in Go, his new favourite language. We’re using MQTT at work, and I guess he was looking for a new project to write in Go, so voilà. It should be a good fit; after all, this is the type of application that Go was made for. But hubris caught up with him when he uttered “And, of course, it’ll be super fast. It won’t even be fair to other languages”. I’m paraphrasing, but that’s how I remember it. You can read Jeff’s account here.
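
To make the publish/subscribe model concrete, here is a minimal Scala sketch; it assumes the Eclipse Paho Java client (org.eclipse.paho.client.mqttv3) and a broker such as Mosquitto listening on localhost:1883, with hypothetical topic names.

    import org.eclipse.paho.client.mqttv3.{IMqttDeliveryToken, MqttCallback, MqttClient, MqttMessage}

    object MqttSketch extends App {
      val client = new MqttClient("tcp://localhost:1883", MqttClient.generateClientId())

      // Print every message that arrives on the subscribed topic filter.
      client.setCallback(new MqttCallback {
        def connectionLost(cause: Throwable): Unit = println(s"connection lost: $cause")
        def messageArrived(topic: String, message: MqttMessage): Unit =
          println(s"$topic -> ${new String(message.getPayload)}")
        def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
      })

      client.connect()
      client.subscribe("sensors/#")                                           // hypothetical topic filter
      client.publish("sensors/temperature", new MqttMessage("21.5".getBytes)) // publish one message
      Thread.sleep(1000)                                                      // allow time for delivery
      client.disconnect()
    }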

I’m not a fan of Go at…

View original post 1,180 more words

Large Java Heap with the G1 Collector – Part 1

Matt Pouttu-Clarke's Blog

A demonstration of efficient garbage collection on JVM heap sizes in excess of 128 gigabytes. Garbage collection behavior is analyzed via a mini-max optimization strategy: the study measures maximum throughput versus minimum garbage collection pause to find the optimization “sweet spot” across experimental phases. Results are replicable via well-defined, open source experimental methods executable on Amazon EC2 hardware.

Experimental Method

Goals

  • Demonstrate maximum feasible JVM size on current cloud hardware (specific to Amazon for now) using the G1 Garbage Collector (link).
  • Vary the JVM heap size (-Xmx) exponentially to find performance profile and breaking points.
  • Vary the ratio of new versus old generation objects exponentially.
  • Use an in-memory workload to stress the JVM (avoids network or disk waits); see the sketch after this list.
  • Produce replicable results on commodity hardware, open source operating systems, and open source tools.
  • Provide gap-free data for analysis, in spite of garbage collection pauses.
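
Something like the following Scala sketch is what such an in-memory workload might look like; it is not the study's actual harness, and the allocation sizes, run time, and retention ratio are illustrative assumptions. The heap size and collector under test are supplied on the command line (e.g. -Xmx and -XX:+UseG1GC).

    // Minimal sketch of an in-memory allocation workload (not the study's harness).
    // Launch with the flags under test, e.g.: java -Xmx64g -XX:+UseG1GC HeapStress
    object HeapStress extends App {
      // Long-lived data that survives collections and fills the old generation.
      val retained = new Array[Array[Byte]](100000)            // slot count is illustrative
      var allocations = 0L
      var sink = 0L                                            // prevents dead-code elimination
      val start = System.nanoTime()
      while (System.nanoTime() - start < 60L * 1000000000L) {  // run for roughly 60 seconds
        val temp = new Array[Byte](1024)                       // short-lived young-generation garbage
        sink += temp.length
        // Occasionally retain an allocation so some objects are promoted to the old
        // generation; varying this ratio varies new- versus old-generation pressure.
        if (allocations % 100 == 0)
          retained(((allocations / 100) % retained.length).toInt) = new Array[Byte](4096)
        allocations += 1
      }
      println(s"performed $allocations allocations (checksum $sink)")
    }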

Not (Yet) in Scope

In followup to this study, subsequent efforts may…

View original post 1,671 more words

Sorting Algorithms

Coping With Computers

Hope everyone had a wonderful Thanksgiving holiday! Right before the break, I had the opportunity to go up to MIT for a program called Splash, in which students spend their Saturday and Sunday taking classes taught by MIT students. I took many interesting classes up there, and their topics may find their way into posts I write. The basic idea behind this one came from my Interactive Algorithms class, where we acted as elements in a list and moved around, instead of simply writing down pseudocode. Here, we’ll be taking a less interactive approach to sorting algorithms.

Introduction

One of the first types of algorithms students are taught in a computer science class (after learning some basics and information about Big O notation) is sorting algorithms. Sorting algorithms are methods used to organize a group of objects in a specific order based on some set of characteristics. When we first…

View original post 1,426 more words
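
Since the excerpt stops before any code, here is a short Scala sketch of one of the first sorting algorithms usually taught, insertion sort; the input array is just an example, and its O(n^2) running time is exactly the kind of cost Big O notation describes.

    // Insertion sort: repeatedly insert each element into the sorted prefix to its left.
    def insertionSort(a: Array[Int]): Unit = {
      var i = 1
      while (i < a.length) {
        val key = a(i)
        var j = i - 1
        // Shift larger elements one slot to the right until key's position is found.
        while (j >= 0 && a(j) > key) {
          a(j + 1) = a(j)
          j -= 1
        }
        a(j + 1) = key
        i += 1
      }
    }

    val numbers = Array(5, 2, 9, 1)
    insertionSort(numbers)
    println(numbers.mkString(", "))   // prints: 1, 2, 5, 9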