Large-scale data processing in the cloud

The lure of cloud computing is its elasticity: you can add as much capacity as you need to process and analyze your data. That processing typically runs on clusters of computers, with the analysis distributed across many machines.

Companies are considering this approach to help them manage their supply chains and inventory. Consider, for example, a company processing product data from across the country to determine when to change a price or introduce a promotion. This data might come from point-of-sale (POS) systems in multiple stores across multiple states. POS systems generate a lot of data, and the company might need to add computing capacity on demand to keep up.

This model is large-scale, distributed computing, and a number of frameworks are emerging to support it, including:

✓ MapReduce, a software framework introduced by Google to support distributed computing on large data sets. It is designed to take advantage of cloud resources by running computations across large numbers of computers, called clusters; each computer in a cluster is a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

✓ Apache Hadoop, an open-source distributed computing platform written in Java and inspired by MapReduce. It pools computers, each running the Hadoop Distributed File System (HDFS), and uses a hash function to group related data elements together on the same node. A Hadoop map function organizes data as key/value pairs that can be output to a table, to memory, or to a temporary file for analysis. HDFS stores three copies of each block of data by default, so nothing is lost if a node fails.
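The map and reduce functions described above can be sketched with a classic word-count example. This is a minimal, single-process illustration of the programming model, not of the distributed framework itself: `map_fn`, `reduce_fn`, and `map_reduce` are illustrative names, and the "shuffle" step that a real cluster performs across nodes is simulated here with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (key, value) pair for each word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share one key.
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key, standing in for what
    # the framework does across cluster nodes.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the cloud scales", "the cloud is elastic"]
print(map_reduce(docs))  # → {'the': 2, 'cloud': 2, 'scales': 1, 'is': 1, 'elastic': 1}
```

The key point is the division of labor: the user writes only the two small pure functions, while the framework handles partitioning the input, moving intermediate pairs between nodes, and collecting the merged output.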
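One concrete way to run such a job on Hadoop without writing Java is Hadoop Streaming, which pipes data through any executable via stdin/stdout. The sketch below assumes that setup: the mapper emits tab-separated `word<TAB>1` lines, and because Hadoop delivers pairs to the reducer sorted by key, the reducer can sum counts as the key changes. The script name and command line in the comment are illustrative.

```python
import sys

def mapper(lines):
    # Mapper: emit one "word<TAB>1" pair per word on stdin.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Reducer: input arrives sorted by key, so all counts for one
    # word are adjacent; emit the total whenever the key changes.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Illustrative invocation, e.g. as -mapper "wordcount.py map"
    # and -reducer "wordcount.py reduce" in a streaming job.
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    for out in (mapper if stage == "map" else reducer)(sys.stdin):
        print(out)
```

Piping `mapper` output through `sort` and then `reducer` on the command line reproduces locally what the cluster does at scale, which makes this style of job easy to debug before submitting it.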

Source: Cloud Computing For Dummies (2010)

