Integrating open source tools into a data warehouse has its advantages.
Although next-gen big data tools such as Hadoop, Spark, and MongoDB are finding more and more uses, most organizations need to maintain data in traditional relational stores as well. Deriving the benefits of both key/value stores and relational databases takes a lot of juggling.
Three basic strategies are currently in use.
-Double up on your data storage. Log everything in your fast key/value repository and duplicate part of it (or perform some reductions and store the results) in your relational data warehouse.
-Store data primarily in a relational data warehouse, and use extract, transform, and load (ETL) tools to make it available for analytics. These tools run a fine-toothed comb through data to perform string manipulation, remove outlier values, etc. and produce a data set in the format required by data processing tools.
-Put each type of data into the repository best suited to it––relational, Hadoop, etc.––but run queries between the repositories and return results from one repository to another for post-processing. » […]
Read more
Andy Oram, Open source technologies and software engineering Specialist at O’Reilly Media, Inc
Source: radar.oreilly.com