This last section is all about examples of real-world problems and the solutions we can use.
But before diving deeper, we need to answer the question: what are the main aspects to look for when we get the requirements?
Again, there is no bulletproof, straightforward solution, and trade-offs are always there. But, as with choosing the database, there are some points to consider when making a decision.
First. Start from what the end result needs to be, from what's needed, not from what tools are available.
Figure out the common queries: analytical queries, many small transactional queries, or both? Availability? Consistency? Determine the demand for each.
Second. How quickly does the data need to be processed?
Every day or every minute? Use scheduled jobs with Oozie driving Hive, Pig, Spark, etc. Real time? Use Storm, Spark Streaming, or Flink with Kafka or Flume.
Third. Keep it simple. Use existing solutions if possible.
Our first example is showing the top links on a Reddit-like service. Given that there are many users, it must be fast. Consistency is not that important. It has to be updated hourly; not really real time.
The solutions.
· Database. Cassandra, or Amazon DynamoDB. It will store the top links.
· Kafka or Flume. To stream the data as it comes in.
· Spark Streaming. To analyze the data over a 1-hour window interval.
The workflow:
1. A Reddit-like service. Whenever a user submits a new link, it sends that transaction to Flume.
2. Flume will stream the links to Spark Streaming, which will …
3. Compute the top links over a window of 1 hour, and send them to Cassandra.
4. The result will be stored in Cassandra, where it can be exposed to the web application.
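Here is a minimal sketch of steps 2 to 4, assuming the links arrive one URL per line; a plain socket stands in for the Flume sink, and the keyspace and table names are made up:

```python
# A sketch of steps 2 to 4: count links over a sliding 1-hour window
# and push the top 10 into Cassandra. A plain socket stands in for the
# Flume sink; the 'reddit.top_links' table is hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TopLinks")
ssc = StreamingContext(sc, batchDuration=10)     # 10-second micro-batches
ssc.checkpoint("/tmp/top-links-checkpoint")      # required by windowed ops

links = ssc.socketTextStream("localhost", 9999)  # one submitted URL per line

# Count each link over the last hour, re-evaluated every minute.
counts = links.map(lambda url: (url, 1)) \
              .reduceByKeyAndWindow(lambda a, b: a + b,   # new batches in
                                    lambda a, b: a - b,   # old batches out
                                    windowDuration=3600,
                                    slideDuration=60)

def save_top_links(time, rdd):
    # takeOrdered brings the top 10 back to the driver, so the
    # Cassandra session lives on the driver only.
    top = rdd.takeOrdered(10, key=lambda kv: -kv[1])
    from cassandra.cluster import Cluster  # DataStax Python driver
    session = Cluster(["127.0.0.1"]).connect("reddit")
    for url, n in top:
        session.execute(
            "INSERT INTO top_links (url, submissions) VALUES (%s, %s)",
            (url, n))

counts.foreachRDD(save_top_links)
ssc.start()
ssc.awaitTermination()
```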
Trade-offs.
Instead of streaming the data, run Spark jobs every, say, 1 hour: load the data into HDFS, run the analytical queries there, and store the result back into Cassandra. Sqoop can be used to export the data from Cassandra to Spark (HDFS).
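A minimal sketch of that batch alternative, assuming the submissions have already been landed on HDFS as plain text; the paths, keyspace, and table are assumptions, and writing to Cassandra this way requires submitting the job with the spark-cassandra-connector package:

```python
# Hourly batch job: read the last hour of links from HDFS, compute the
# top 10, and write them back to Cassandra. Paths and names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("TopLinksBatch").getOrCreate()

top = (spark.read.text("hdfs:///data/links/last_hour")  # one URL per line
            .groupBy("value").count()
            .orderBy(F.desc("count"))
            .limit(10)
            .withColumnRenamed("value", "url"))

# Needs --packages com.datastax.spark:spark-cassandra-connector_2.12:<version>
(top.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="reddit", table="top_links")
    .mode("append")
    .save())
```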
Our second example is movie recommendations. Given that availability is a concern, and consistency is less important.
The recommendations should be in real time. So, if a user likes a movie, the user should immediately see the related movies.
The solutions.
It's overwhelming to compute recommendations every time a user gives a rating. So, we should compute the recommendations based on the existing users' behavior once, periodically. The result of this algorithm, let's call it 'Boltzman', is not likely to change frequently.
And based on our current understanding of the users' behavior, we can predict the recommendations across all the movies for a given user, then sort them, and filter out the ones the user has already seen (a quick sketch follows).
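As a quick sketch of that serving logic, where the model object and the data shapes are hypothetical:

```python
# Predict a score for every movie, sort, and drop what the user has
# already seen. 'model.score' is a stand-in for whatever Boltzman exposes.
def recommend(user_id, model, all_movie_ids, seen_movie_ids, top_n=10):
    scored = [(movie_id, model.score(user_id, movie_id))
              for movie_id in all_movie_ids
              if movie_id not in seen_movie_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best first
    return scored[:top_n]
```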
· Database. Cassandra, or any NoSQL database like HBase. It will store the users' ratings (coming from the stream) and Boltzman (generated periodically).
· Spark ML (or Flink) with Oozie, to run every few hours (infrequently) and re-compute the movie recommendations.
· Flume and Spark Streaming. For streaming and processing real-time user behavior.
The workflow:
1. The web server sends user ratings to Flume, which passes them to Spark Streaming, which then saves the result into HBase.
2. A recommendation service returns the user's recommendations, given the user's ratings and Boltzman from HBase.
3. Every, say, 1 day, Oozie will kick off a Spark ML job to re-compute the movie recommendations (Boltzman) and save them to HBase, as sketched below.
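A minimal sketch of step 3. Spark ML doesn't ship a Boltzmann-machine recommender, so ALS, its built-in collaborative filtering algorithm, stands in for Boltzman here; the paths are assumptions, and Oozie would simply submit this script on its daily schedule:

```python
# Daily re-computation of the per-user movie recommendations.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("RecomputeBoltzman").getOrCreate()

# Ratings landed by the streaming job: (user_id, movie_id, rating).
ratings = spark.read.parquet("hdfs:///data/ratings")

als = ALS(userCol="user_id", itemCol="movie_id", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Precompute the top 10 movies per user; a separate step would copy
# this result into HBase for the recommendation service to read.
model.recommendForAllUsers(10) \
     .write.mode("overwrite").parquet("hdfs:///data/boltzman")
```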
Our third example is internal analytics over the web server logs. We need to know how many users there are, their geo locations, the total number of requests, etc. It only runs once a day, based on the previous day's activity, and it's used only internally; it's not exposed to users.
The workflow:
1. The web server sends the user logs to Kafka, and then to Spark Streaming, which will …
2. Compute the answers to all the queries over a 1-hour interval.
3. Then store the data in Hive (which uses HDFS).
4. Every day, Oozie will kick off a query to run on Hive (sketched below).
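A minimal sketch of step 4, expressed through Spark's Hive support (Oozie could equally run it as a plain Hive action); the access_logs schema of (user_id, country, ts) is an assumption:

```python
# Daily roll-up over yesterday's logs: unique users and total requests
# per country. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("DailyLogReport")
         .enableHiveSupport()
         .getOrCreate())

report = spark.sql("""
    SELECT country,
           COUNT(DISTINCT user_id) AS users,
           COUNT(*)                AS total_requests
    FROM access_logs
    WHERE to_date(ts) = date_sub(current_date(), 1)   -- previous day only
    GROUP BY country
""")
report.write.mode("overwrite").saveAsTable("daily_report")
```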
If we are already storing the logs in an existing database, maybe we can export the data using Sqoop, or there may be no need to use Hive at all.