Tariq Abughofa

Posted on • Originally published at rabbitoncode.com

80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 2

This is a continuation of the resources I listed in part 1.

This part includes the following four categories:

Machine Learning and Algorithms in Big Data

Recommending items to more than a billion people: An article about collaborative filtering at Facebook.

Machine Learning with Sparkling Water: using H2O, the machine learning framework, with Apache Spark.

MLlib: scalable machine learning library on Apache Spark, from Stanford/Databricks.

TensorFlow: the famous large-scale machine learning library.

Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.

Data Processing Systems

Airflow: a workflow management system by Airbnb.

Oozie: a workflow management system for Hadoop by Yahoo!.

BlinkDB: approximate analytics on large-scale data, from Berkeley.

FlumeJava: a library for developing parallel data pipelines from Google.

MapReduce: the Google programming model behind Hadoop.

Pig: an engine that supports Pig Latin, a procedural dataflow language for Hadoop, from Yahoo!.

Hive (resource#2): a data warehouse on top of Hadoop.

The Dataflow Model: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.

MillWheel: stream processing engine from Google.

Photon: a tool to join data streams at Google.

Kinesis: stream processing engine from Amazon.

Apache Flink (resource#2): stream and batch processing engine from TU Berlin.

Trill: incremental data analytics engine from Microsoft.

Kafka: the famous distributed messaging system from LinkedIn.

Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. Spark 2 moved streaming to Structured Streaming (resource#2) (3) (4), and the Spark SQL library allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.

SparkR: a Spark library for writing processing applications in R.

GraphX (resource#2): distributed graph processing with Spark's RDDs.

GraphFrames: distributed graph processing with Spark's DataFrames.

SnappyData (resource#2): a transactional datastore on top of Spark.
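
For readers new to the MapReduce model listed above, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is not Hadoop's API and it is not distributed; the names `map_reduce`, `word_mapper`, and `count_reducer` are made up for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy, in-memory map/shuffle/reduce over an iterable of records."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: fold each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


counts = map_reduce(["big data", "big graphs"], word_mapper, count_reducer)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, but the contract between the three phases is the same.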

Real-time Processing

Samza (resource#2) (3) (4): stream processing engine from LinkedIn.

Storm: real-time data processing engine from Twitter.

Heron: the successor to Storm at Twitter.

Real-time data processing at Facebook.

Pulsar: real-time data processing engine from eBay.

Graph Processing

WTF: the "Who to Follow" service at Twitter.

GraphJet: real-time recommendation graph engine at Twitter.

Pregel: large-scale graph processing engine at Google.

Giraph: open-source implementation of Pregel, used at scale by Facebook.
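
To give a feel for the vertex-centric style that Pregel and Giraph are built around, here is a rough single-machine sketch in which every vertex learns the largest value in the graph by exchanging messages in supersteps. It only illustrates the idea and is not the API of either system; the function name `propagate_max` and the dictionary-based graph layout are assumptions made for this example.

```python
def propagate_max(values, edges):
    """Spread the maximum vertex value through the graph, superstep by superstep.

    values: dict mapping vertex -> initial number
    edges:  dict mapping vertex -> list of out-neighbors
    """
    values = dict(values)  # work on a copy so the caller's dict is untouched

    # Superstep 0: every vertex sends its value to its out-neighbors.
    inbox = {v: [] for v in values}
    for vertex, value in values.items():
        for neighbor in edges.get(vertex, []):
            inbox[neighbor].append(value)

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                # The vertex learned a larger value: update and tell neighbors.
                values[vertex] = max(messages)
                for neighbor in edges.get(vertex, []):
                    outbox[neighbor].append(values[vertex])
            # Otherwise the vertex stays silent, which acts as a vote to halt.
        inbox = outbox
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
print(result)
```

The computation stops on its own once no vertex has anything new to say, which is the same halting idea the real engines apply across many machines.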
