Friday, November 19, 2010

Notes on Devoxx 4th day

This might be my last daily note on the Devoxx university and conference. Everything has an end, and Devoxx must end tomorrow (Friday, November 19). I'm not sure I will write notes on tomorrow's sessions, and if I do, they will come on Saturday at the earliest.


After having difficulties with public transport yesterday, this morning I left a little bit later, at the risk of missing the first couple of minutes of the keynote.


So, here is my summary of the day. 


Future Roadmap of JEE (Keynote), Jerome Dochez, Linda DeMichiel, Paul Sandoz


JEE on Cloud (Jerome Dochez, JD)
When I arrived at the conference, Jerome Dochez was presenting JEE in a cloud environment. He mentioned that cloud support should not be a revolution but an evolution: programmers should not be asked to change a lot from what they have known so far.

He mentioned at least two things in particular: state management and better packaging.

He finished the JEE-on-Cloud part of the presentation by running a small and successful demo on GlassFish.

Modularity (JD)
There are important ongoing efforts to make JEE more modular, especially to leverage the work on Java modularity in general (Jigsaw). Unfortunately, the dependency on Jigsaw means that modularity in JEE will also be late.

Some points on modularity that I noted:
  • Applications are made of modules (modules in the Jigsaw sense)
  • Dependencies are made explicit instead of by convention or configuration.
  • Versioning is built in.
JSF (JD)
He mentioned two kinds of modifications: short term and long term. Some of the short-term ones:
  • Transient state saving
  • XML view cleanup
  • Facelets cache API
  • XML free (oops, it was written "XML fee" on his slide, but yes, the idea is to remove the XML tax)
He also mentioned support for HTML 5. I'm not sure whether it is short or long term.

JMS  (JD)
There have been almost no important modifications to JMS so far, but this time it will change: there will be important modifications. They include resolving ambiguities, standardizing a couple of vendor extensions, integration with other specs, and also with non-Java languages (which ones?).

Web Tier (JD)
WebSocket support, a standard JSON API, and an NIO2-based web container (I'm not sure I understand the relation between NIO2 and the web container, but anyway...). He mentioned the Grizzly library in his presentation.

JPA (Linda DeMichiel)
There were a couple of interesting things in her talk. I only noted some of them here:

Mapping:
  • Support for custom mappings.
  • Dynamic fetch plans. This is in contrast to JPA today, which requires the fetching strategy to be defined upfront using annotations (EAGER, LAZY).
  • Better support for immutable attributes (read-only entities)
  • More flexible XML descriptors. Yeah, XML, why not make it Java based??
API:
  • Additional event listeners and callbacks (really?? Are they actually used in production?)
  • Support for dynamic persistence units. This one is cool, no XML, right?
  • Inspection of persistence units. Cool as well
Query:
  • Stored procedure support. 
  • Interoperability between JPQL and criteria queries, for example creating a criteria query from JPQL.
For me, all the programmatic features, like dynamic fetch plans and dynamic persistence units, are the most interesting improvements.
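The existing JPA 2.0 Criteria API already gives a taste of that programmatic style. Here is a minimal sketch of my own (the "name" attribute and the idea of querying by it are just assumptions for the example), which builds a query without any JPQL string or XML:

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Root;

    public class CriteriaExample {

        // Builds and runs a query programmatically for any mapped entity type
        // that has a String "name" attribute, e.g. findByName(em, Customer.class, "Duke").
        public static <T> List<T> findByName(EntityManager em, Class<T> entityClass, String name) {
            CriteriaBuilder cb = em.getCriteriaBuilder();
            CriteriaQuery<T> query = cb.createQuery(entityClass);
            Root<T> root = query.from(entityClass);
            query.select(root).where(cb.equal(root.get("name"), name));
            return em.createQuery(query).getResultList();
        }
    }

The proposals above would extend this kind of programmatic API toward fetch plans and persistence-unit configuration as well.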


JAX ??? (Paul Sandoz)
I did not really follow the last part of the presentation, so there are no notes I can share here.


The Essence of Caching by Greg Luck


A side story -- this is the only presentation so far that mentioned the company I work for, Amadeus. Yes!! :-)

Why caching?
Because of performance problems.




Amdahl's Law
What is important to keep in mind about performance optimization is Amdahl's Law:
Speedup = 1 / ((1 - f) + (f / s)), where f is the proportion of the program being sped up and s is the speedup of that part. Illustration: making 10% of the system 20 times faster gives a 1.105x overall improvement.
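Just to make the arithmetic concrete, here is a tiny sketch of that formula (the 10% / 20x numbers come from the talk, the code is mine):

    public class Amdahl {

        // Speedup = 1 / ((1 - f) + (f / s))
        // f: fraction of the program being sped up, s: speedup of that fraction.
        static double speedup(double f, double s) {
            return 1.0 / ((1.0 - f) + (f / s));
        }

        public static void main(String[] args) {
            // Speeding up 10% of the system by a factor of 20:
            System.out.println(speedup(0.10, 20.0)); // ~1.105
            // Even an infinite speedup of that 10% caps the overall gain at ~1.11:
            System.out.println(1.0 / (1.0 - 0.10));  // ~1.111
        }
    }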

The law is important when deciding which part of the system to speed up. For example, if the problem with a page lies in downloading its content, there is no point in improving the server-side code; maybe a CDN is needed in that case.

Performance Problem Sources  

  • Rendering
  • Program
  • Marshalling/unmarshalling data.
  • Database

Caching solves the problem mainly by offloading some data to a cache, e.g. a memory-based cache.

Cache Efficiency 
Cache efficiency = cache hits / total requests.
One needs to take the Pareto principle into account: a small fraction of the data usually accounts for most of the accesses.

Cache Coherency
To handle cache coherency, one of the simplest solutions is applying TTL + LRU. But there are other strategies: eternal items plus an invalidation strategy, or the write-through pattern.
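To illustrate the TTL + LRU idea, here is a toy sketch of my own built on LinkedHashMap's access-order eviction (this is not Ehcache's API, just the concept):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal TTL + LRU cache sketch: entries are evicted when the map grows
    // beyond maxEntries (LRU via access order) or when they are older than ttlMillis.
    public class TtlLruCache<K, V> {

        private static final class CacheValue<V> {
            final V value;
            final long createdAt = System.currentTimeMillis();
            CacheValue(V value) { this.value = value; }
        }

        private final long ttlMillis;
        private final Map<K, CacheValue<V>> map;

        public TtlLruCache(final int maxEntries, long ttlMillis) {
            this.ttlMillis = ttlMillis;
            // accessOrder = true turns LinkedHashMap into an LRU structure.
            this.map = new LinkedHashMap<K, CacheValue<V>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, CacheValue<V>> eldest) {
                    return size() > maxEntries;
                }
            };
        }

        public synchronized void put(K key, V value) {
            map.put(key, new CacheValue<V>(value));
        }

        public synchronized V get(K key) {
            CacheValue<V> e = map.get(key);
            if (e == null) return null;
            if (System.currentTimeMillis() - e.createdAt > ttlMillis) {
                map.remove(key); // expired: treat as a miss
                return null;
            }
            return e.value;
        }
    }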

Cache in Clustered Environment
A problem called the N* problem is inherent to caching in a clustered environment. It is addressed by also clustering the cache, which introduces further problems: the bootstrap problem and, once again, cache coherency.
CAP Theorem
CAP = Consistency, Availability, Partition tolerance => in a clustered environment there must be a trade-off; you cannot fully have all three at once.

The session was very interesting and informative, and all of that in only one hour. The audience for this session was quite large.


Akka by Viktor Klang

I had seen a couple of Akka presentations online, and this morning I saw one live. Even with the pretty distracting Devoxx template, the Akka presentation was still excellent. Great job by Viktor and the Akka team.

Akka is designed around the fact that it is hard to get concurrent programs right. Akka comes with two solutions: actors and STM. Akka has Scala and Java APIs. In general the Scala one is nicer, but Akka has succeeded in removing a lot of the boilerplate code almost inherent in Java.

Actor
An actor is a higher-level abstraction on top of threads. It has an important share-nothing property: one actor does not share anything with any other actor, so actors work in isolation (unlike Clint Eastwood, George Clooney, ..., who cannot work in isolation, although they might share nothing too). Communication between actors happens through message passing. Each actor has a mailbox where messages are queued.

There are three different types of message sending: fully one-way, one-way with an implicit future, and one-way with an explicit future. In Scala, they are represented by the !, !! and !!! methods.
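To make the mailbox / share-nothing idea concrete, here is a conceptual sketch in plain Java. This is not Akka's API, just my own illustration of the model:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Conceptual actor sketch: private state, a mailbox, and a single thread
    // that processes one message at a time, so no locks are needed inside.
    public abstract class ToyActor<M> implements Runnable {

        private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<M>();
        private final Thread worker = new Thread(this);

        public void start() { worker.start(); }

        // Fire-and-forget, roughly the idea behind the Scala "!" send.
        public void send(M message) { mailbox.offer(message); }

        // Subclasses define how a message is handled; state stays inside the actor.
        protected abstract void onReceive(M message);

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    onReceive(mailbox.take()); // one message at a time
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

The !! and !!! variants would additionally hand the sender a future that is completed when the receiving actor replies.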

For fault tolerance, Akka uses "let it crash" semantics stolen from Erlang. It also has a notion of supervisor hierarchies.

Actors can be remote, and there are two types of remote actors: client-managed and server-managed. Client-managed actors are handy, but of course they cannot be deployed in an untrusted environment.

The remote actor implementation is based on Netty and uses Protobuf.

Software Transactional Memory (STM)
Akka supports STM. It provides a couple of transactional data structures like transactional maps, transactional lists, and so on.

The Java implementation uses a library called Multiverse.

The combination of an actor and STM is called a transactor.

Miscellaneous
Akka has a couple of interesting add-ons: Spring, Camel, MongoDB, CouchDB, and a couple of other things.

Viktor's presentation was awesome. Akka is awesome.


Data Management at Twitter Scale by Dmitriy Ryaboy

Dmitriy's presentation was not only about Hadoop and how its whole ecosystem is used at Twitter. Hadoop is appropriate for offline processing, but the presentation was not only about offline processing; it was also about online processing. It was dense and delivered so quickly that I sometimes had difficulty following. But here are some notes:
Twitter is at 95 million tweets per day, with 3000 TPS (tweets per second).

First problem: unique ID generation. Twitter uses Snowflake: https://github.com/twitter/snowflake. The issue is how to make the generator scalable. The IDs do not need to be strictly sorted; they only need to be roughly sorted (k-sorted).
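As I understand it, the general idea of a Snowflake-style ID is to pack a millisecond timestamp, a worker id, and a per-millisecond sequence into a single 64-bit long, so ids generated on different machines are still roughly time-ordered. A toy sketch (the bit layout here is only illustrative, not necessarily Twitter's exact one):

    // Sketch of a Snowflake-style generator: roughly time-sorted (k-sorted) 64-bit ids.
    public class ToyIdGenerator {

        private static final long EPOCH = 1288834974657L; // custom epoch in milliseconds
        private static final int WORKER_BITS = 10;
        private static final int SEQUENCE_BITS = 12;

        private final long workerId;
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public ToyIdGenerator(long workerId) { this.workerId = workerId; }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now == lastTimestamp) {
                // Same millisecond: bump the sequence (sequence exhaustion handling omitted).
                sequence = (sequence + 1) & ((1L << SEQUENCE_BITS) - 1);
            } else {
                sequence = 0L;
                lastTimestamp = now;
            }
            return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                    | (workerId << SEQUENCE_BITS)
                    | sequence;
        }
    }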

Second problem: sharding. Twitter uses Gizzard (https://github.com/twitter/gizzard), a Scala-based framework for sharding. Sharding is, by the way, storing data across multiple nodes.
Gizzard maps a range of tweet ids to a particular shard. A shard is mapped to a replication tree (hmm... not really sure I understand this, but I'm writing it down anyway). A shard can be physical, when it refers to a particular backend, or logical, when it refers to other shards.
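To make the "range of ids to shard" part concrete, here is a trivial range-lookup sketch of my own (it only illustrates range partitioning and has nothing to do with Gizzard's actual implementation):

    import java.util.TreeMap;

    // Toy range-based shard lookup: each entry maps the lower bound of an id range
    // to a shard name; floorEntry finds the range a given id falls into.
    public class ToyShardMap {

        private final TreeMap<Long, String> lowerBoundToShard = new TreeMap<Long, String>();

        public void addRange(long lowerBound, String shard) {
            lowerBoundToShard.put(lowerBound, shard);
        }

        public String shardFor(long id) {
            return lowerBoundToShard.floorEntry(id).getValue();
        }

        public static void main(String[] args) {
            ToyShardMap map = new ToyShardMap();
            map.addRange(0L, "shard-a");
            map.addRange(1000000L, "shard-b");
            System.out.println(map.shardFor(42L));      // shard-a
            System.out.println(map.shardFor(5000000L)); // shard-b
        }
    }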

Third problem: fault tolerance. I got a bit lost in this section, so there is not much in my notes. The only thing I wrote down is that the system must tolerate eventual consistency and stay CALM (Consistency As Logical Monotonicity). Hmm... this is quite a puzzle for me for now. But anyway, that was about fault tolerance.

Fourth problem: timeline handling (message vector cache). Displaying timelines means that billions of tweets must be filtered to show only the messages from the people one follows. The solution: Haplocheirus.
Because it is a cache, cache efficiency matters here too.

FlockDB is the solution used for the social graph store. It is basically a customized distributed index database. It is used to handle, e.g., intersection operations on @: https://github.com/twitter/flockdb.

Cassandra is used for the geo database, e.g. nearby search, and for real-time data analysis.
Ganglia is used for monitoring.

For offline processing, Hadoop is used. Hadoop is appropriate for analyses that cannot be achieved using SQL.
Elephant Bird is used to work with data in Hadoop, and finally HBase is used to address mutability and random access on top of Hadoop.

An excellent presentation from a Twitter engineer. I loved it.

Hadoop, HBase, and Hive in Production at Facebook by Jonathan Gray

The previous presentation came from Twitter; this one is from Facebook. First, Jonathan explained why HDFS / Hadoop. Basically, the choice came from the fact that traditional database processing was slower than the rate of incoming data: it took 24 hours to process one day of data.

The use of Hadoop introduced other problems: it is difficult to write MapReduce jobs. Solution: Hive. So Hive is the data warehouse solution at Facebook. But Hive itself is still not user friendly: fear of the command line. HiPal was introduced for querying through a web UI.

Current limitation: the NameNode is still a single point of failure. Today's high-availability solution is not enough because it takes hours for the backup NameNode to take over. Facebook is now working on something called AvatarNode. Jonathan claimed failover in about 10 seconds with AvatarNode.

Another limitation is non-optimized MapReduce; Facebook is working on Hive optimizations. Another issue is better scheduling: a fair-share scheduler controls tasks by their priority/nature. Jonathan claimed that queries at Facebook now take less than 10 minutes.

HBase is used because it is linearly scalable, offers fast indexed access, and integrates with Hadoop. It is also suitable for real-time analysis because it has optimized increment operations.
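The increment he mentioned is the kind of atomic counter update the HBase client exposes. A sketch with the classic HTable API, as far as I know it (the table, family, and qualifier names are made up for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Atomic counter update on the server side: no read-modify-write round trip.
    public class CounterExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "page_views"); // hypothetical table
            long views = table.incrementColumnValue(
                    Bytes.toBytes("page-42"), // row key
                    Bytes.toBytes("stats"),   // column family
                    Bytes.toBytes("views"),   // qualifier
                    1L);                      // amount to add
            System.out.println("views = " + views);
            table.close();
        }
    }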

Why was Cassandra not selected for Facebook Messaging? There is a consistency issue with Cassandra that does not suit the messaging requirements. HBase is a good fit.

Modularity in Java by Mark Reinhold

Mark's presentation this afternoon was really interesting. Unfortunately, I was too tired to take notes, so I missed many points of his presentation. My notes always have errors, given the little time I have to write them, and that is OK for a blog; but here I'm afraid that anything I wrote would be completely wrong, even for a blog.

Toward the end of his presentation, Mark took some time to explain "Why not OSGi?", which should answer many questions on the subject.

At the end, Mark Reinhold gave us one URL to follow: http://openjdk.java.net/projects/jigsaw/. He said that the project has not been very active because of the JDK 7 delivery deadline, but it will come back soon.

Java Puzzlers by William Pugh and Josh Bloch
Before starting his own presentation, Mark opened with a joke: "The reason I'm here is to make sure I get the best seat for the Java Puzzlers session." That should tell you how popular this session is. Indeed, it was the most popular session so far at Devoxx.

The two speakers came with six brand-new puzzles around a couple of subjects. Well, I will not write them up here because that would not be fun, and even if it were fun, it would take some time. The subjects covered generics, collections, raw types, BigDecimal, and a couple of other things.
One thing that I will keep in mind: do not ignore the type-related warnings issued by the compiler.
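A classic example of why (this is my own illustration, not one of their puzzles): the raw-type code below compiles with only an "unchecked" warning, then blows up at runtime.

    import java.util.ArrayList;
    import java.util.List;

    public class WarningExample {
        public static void main(String[] args) {
            List<Integer> numbers = new ArrayList<Integer>();
            List raw = numbers;              // raw type
            raw.add("not a number");         // unchecked warning, but it compiles
            Integer first = numbers.get(0);  // ClassCastException at runtime
            System.out.println(first);
        }
    }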

I answered 3 of the 6 puzzles correctly, and I considered one of them a cheating puzzle :-), so 3 of 5, not bad, eh? Yeah, not bad. But to be honest, only one of them did I answer correctly with the correct explanation as well.

--

OK. Java Puzzlers completed my day. It was just a terrific day, the best day at Devoxx. Quite sad that it will end tomorrow... Back to Nice in the afternoon.
