No mod points, so I'll just post instead: You seem to be blissfully ignorant of what you're talking about.
Big Data isn't just gathering tons of data and running it through the same old techniques on a big beefy cluster, hoping that answers will magically fall out. Rather, it's a philosophy applied throughout the architecture: gather a more complete view of the relevant metrics so you can arrive at a more complete answer to the problem. If I'd thrown in "empowering" and "synergy" that would read like a sales pitch, so I'm just going to give an example from an old boss of mine.
A typical approach to a problem, such as determining the most popular cable TV show, might be to have each cable provider record an event every time it sends a show to a subscriber. This is pretty simple to do and generates only a few million events each hour. That can easily be processed by one beefy server, and within a day or two the latest viewer counts for each show can be released. It doesn't measure how many viewers turned the show off halfway through, switched to another channel during the commercials, or "watched" the same channel for twelve hours because they left the cable box turned on. Those are just assumed to be innate errors that cannot be avoided.
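To make that concrete, here's a minimal sketch of that traditional pipeline. The event tuples and show IDs are made-up placeholders; the point is that the whole job is basically one big tally on one machine:

    # Sketch of the traditional approach: a single server tallies
    # "show sent to subscriber" events into per-show viewer counts.
    from collections import Counter

    # Each event is just (subscriber_id, show_id), recorded once per showing.
    events = [
        ("sub-001", "show-news"),
        ("sub-002", "show-drama"),
        ("sub-001", "show-drama"),
    ]

    viewer_counts = Counter(show_id for _, show_id in events)
    for show_id, count in viewer_counts.most_common():
        print(show_id, count)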
Now, though, with cheap NoSQL data stores, widespread high-speed Internet access, and the new "privacy-invading" TV sets, much more data can be gathered and processed, at a larger scale than ever before. A suitably equipped TV can send an event upstream not just for every show, but for every minute watched, every commercial seen, every volume adjustment, and possibly even a guess at how many people are in the room. The sheer volume of data to be processed is about a thousand times greater, and it's coming in about a thousand times as fast, to boot.
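As a rough illustration of what "more granular" means here, a per-minute event from such a TV might look something like this. Every field name is my own invention, not any real vendor's format:

    # Hypothetical per-minute event a "smart" TV might send upstream.
    # All field names are invented for illustration.
    import json
    import time

    event = {
        "device_id": "tv-8f3a",
        "timestamp": int(time.time()),
        "channel": 42,
        "show_id": "show-drama",
        "minute_of_show": 17,
        "volume_level": 11,
        "ad_break": False,
        "estimated_viewers": 2,   # e.g. a rough sensor-based guess
    }

    print(json.dumps(event))  # one of millions of such events per minute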
The Big Data approach to the problem is to first absorb as much data as possible, then process it down into a clear "big picture" view. That means dumping it all into a write-optimized store like HBase or Cassandra, then running MapReduce jobs over it in chunks to produce intermediate results, such as groupings of statistics for each show. Those intermediate results can answer direct questions about viewer counts or specific demographics, but nothing much more complicated. They are, however, probably only a few hundred values per show, which can easily be loaded into a traditional RDBMS and queried as needed.
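Here's a toy, in-memory sketch of that map/reduce step, assuming the made-up per-minute events from above. A real job would run across the cluster over the HBase/Cassandra data, but the shape is the same: map each raw event to a keyed pair, then reduce everything for one show down to a small summary record:

    # Toy map/reduce pass: boil per-minute events down to per-show summaries.
    from collections import defaultdict

    raw_events = [
        {"show_id": "show-drama", "minute_of_show": 1, "ad_break": False},
        {"show_id": "show-drama", "minute_of_show": 2, "ad_break": True},
        {"show_id": "show-news",  "minute_of_show": 1, "ad_break": False},
    ]

    def map_event(event):
        # Emit a (key, value) pair keyed by show: one minute watched,
        # and whether it was actual content rather than an ad break.
        yield event["show_id"], (1, 0 if event["ad_break"] else 1)

    def reduce_show(values):
        # Collapse all values for one show into a small summary record.
        return {
            "minutes_watched": sum(v[0] for v in values),
            "content_minutes": sum(v[1] for v in values),
        }

    grouped = defaultdict(list)
    for ev in raw_events:
        for key, value in map_event(ev):
            grouped[key].append(value)

    summaries = {show: reduce_show(vals) for show, vals in grouped.items()}
    print(summaries)  # a handful of numbers per show, ready for an RDBMS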
In effect, the massively parallel processing in the cluster takes the bulk of the work off of the RDBMS, so the RDBMS holds just the answers rather than the raw data. Those answers can then be retrieved far faster than if the RDBMS had to grind through all of the raw data for every query.
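For example, using SQLite as a stand-in for whatever relational database you'd actually use, and the made-up summary fields from the sketch above, the "most popular show" query only ever touches a few hundred summary rows:

    # Load the pre-digested per-show summaries into an RDBMS (SQLite here,
    # purely as a stand-in), then answer questions with cheap queries.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE show_stats (show_id TEXT PRIMARY KEY, "
        "minutes_watched INTEGER, content_minutes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO show_stats VALUES (?, ?, ?)",
        [("show-drama", 2, 1), ("show-news", 1, 1)],
    )

    # The query scans summary rows, not billions of raw per-minute events.
    top_shows = conn.execute(
        "SELECT show_id, minutes_watched FROM show_stats "
        "ORDER BY minutes_watched DESC LIMIT 10"
    ).fetchall()
    print(top_shows)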
Rather than dismissing measurement errors as unavoidable, a Big Data design relies on gathering more granular data and then distilling accurate answers out of it. Where enormous amounts of raw data are available, that's often worthwhile, because the improved accuracy means some questions that used to be impossible can now be answered. If enough data can't easily be collected, as is the case for most small websites (almost anybody short of Facebook and Google), Big Data is probably not the right approach.
Source: http://rss.slashdot.org/~r/Slashdot/slashdotScience/~3/yQAjtxjQOpA/how-big-data-became-so-big