This has the potential to really disrupt the enterprise data warehouse sector. All the MPP vendors today (HP Vertica, EMC Greenplum, Teradata) have exorbitant pricing and ridiculous licensing. At $1,000 per TB per year, I would be really worried if I were Teradata (not so much if I were IBM).
A lot of large enterprises won't be comfortable hosting their data outside of their own data centers. The killer application is making a portable, on premises version of this functionality without the high price.
I've heard this argument time and time again in the context of various solutions/technologies. I still feel today that for 95% of the companies that "feel this way", it's simply the result of foolish paranoia among older upper management. The type of thing that separates a "Fast Company" from stodgy companies that are likely to be disrupted.
Yes, the foolish neckbeards who aren't agile and dynamic, don't use git, and aren't iterative, but care deeply about obeying regulations and not sending data to the /dev/null that MongoDB on US-EAST is.
This mindset screams "I AM IN THE VALLEY AND EVERYONE WHO ISNT UNDER 30 AND USES APPLE PRODUCTS DOESNT GET WEB2.0" (also caps lock is cruise control for cool).
Apologies for the negativity. I think I get it: I want my data to be in the cloud, easily accessible, and all that jazz, but I want to keep it secret and safe, and most importantly I want it to be mine.
Says the guy who just signed up for the iCloud today... ;-)
I think you're projecting a bit. FWIW I'm pushing 30, have worked on the East coast for finance as well as "in the valley", and generally fall on Yegge's "conservative" side of the spectrum.
A lot of other commenters immediately jumped to the medical records argument, but all I was saying is that for a LOT of companies that make the "we have to have everything on site" argument...it's just not true.
Projecting? I'm unsure in which direction you mean, but FWIW I'm pushing 30 myself (26) and use Apple products.
But I agree with you, the medical records argument is kind of boring. Still, not everything needs to be outsourced; there is value in keeping things on site, if only for job creation!
My pet peeve here is that for a while now (and trending upward, fast) we've seen no problem at all, long or short term, with simply "shipping it to the cloud", whether it's medical records, phone contact lists, or personal communications with our significant others.
We as a community are quickly eroding any expectation of privacy and security all in the name of being agile. I guess it just rubs me the wrong way.
I should tweet about it on my iPhone, then copy it to a file for posterity and upload it to my Google Drive...
The problem falls into two categories. On one hand, you have non-technical end users: it takes a non-trivial amount of time to train them to roll their own crypto, if you will, and it's also hard to convince them it's worth it (this is a fair point, as security is a cost/benefit trade-off between ease of use and not getting caught with your ass in the wind).
On the other hand, you have companies using outsourced services, and with SaaS/PaaS/*aaS becoming all the rage, it's very important in my opinion that those service providers shoulder some of the responsibility, and not let their users operate (or serve their users, etc.) in a manner that's not conducive to security/privacy.
Punting this problem up the stack, where it most often ends up on end users' desks, is IMNSHO a bad idea, since then, as now, all the good things crypto promises are the exception rather than the norm.
This is obviously much, much more complicated in practice, but I at least see this problem reflected in the "to the cloud!" mentality.
Why does nontechnical users' inability to use crypto impact a business's decision on whether or not they should use externally hosted backend services?
Rewrote my previous comment, as muddling up both cases as a single obtuse analogy was a mistake on my part.
But can't we say that we have both a moral and an ethical obligation to protect our non-technical users and our fellow developers from mistakes, lack of training, or, in the worst case, malfeasance?
The business decision to use an externally hosted backend service, whatever it may be, must take into account what data goes into it, what comes out of it, and how it's computed on by both you and the provider, with respect to who the real end user is and how the data is going to live on.
And here, I think, is the crux of the problem: those questions and their solutions are generally very hard to put into practice (I don't have a silver bullet, or even something vaguely resembling a mold for one), so it's not very conducive to being a "Fast" company.
For example, being European, it scares me a great deal that companies, schools and the public sector are increasingly punting the business decision of "how to handle email" to "let's use gmail".
That in no way takes into account my concerns (and often I don't have a choice in the matter of using these services), since my mail, and by extension a large part of my life, is being handed to a for-profit US corporation that "does no evil".
I use Gmail privately though, since I did this particular cost/benefit analysis and decided that I don't really care if Google reads my mailing list traffic...
What is a solution? AWS is a general use compute resource. There is no reason that they should enforce crypto anywhere other than SSH/etc. That is obviously in the domain of the dependent service to decide and implement. Encryption has a cost/benefit ratio that is different for every client, there's no reason everyone should have to pay and use encryption resources if they don't need them.
I find your observation that this problem is especially reflected in service-oriented architectures questionable. Centralizing all resources (including documentation: http://aws.amazon.com/security/) makes it easier to enforce best practices and standard interfaces. But just because they can doesn't mean it's always a good idea to do that.
Some of those industries have legal and or regulatory reasons for not hosting all their data on AWS. In addition, AWS isn't always the economic boon it's made out to be. Those fees can really add up once you start moving enough data around.
I'm a big fan of AWS. But, like any other tool, it's not meant for every job.
Best I can figure, this will cost you around $180K for 44TB over three years. I think that's actually a very low estimate. The pricing is confusing. :-)
A Dell MD1220 with 24 Crucial M4 512GB SSDs will run you $12,600. That's 12TB. Multiply by 4, enable compression, etc etc.
You could buy two of those setups; pay for power, cabinet space, and bandwidth; have a ridiculous amount of IO available, with single-digit-millisecond latency, pushing 6Gbps; and still have money to burn. And it'll take you (much) less time to unbox and set up than it will to push that much data up to AWS.
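The arithmetic above can be sanity-checked with a quick back-of-envelope script. All figures are the rough estimates from this comment; the colo cost is an invented placeholder, not a quote:

```python
# Back-of-envelope comparison of the figures above (rough estimates
# from the comment, not vendor quotes).
redshift_3yr = 180_000          # ~$180K for 44 TB over three years (estimated)

md1220_price = 12_600           # Dell MD1220 + 24x 512 GB Crucial M4 SSDs
md1220_tb = 12                  # raw capacity per enclosure
enclosures = 4                  # 4 x 12 TB = 48 TB raw
hardware = md1220_price * enclosures

# Hypothetical colo overhead: power, cabinet space, bandwidth.
colo_per_month = 1_500
colo_3yr = colo_per_month * 12 * 3

# Two full setups, per the comment, for redundancy.
diy_3yr = hardware * 2 + colo_3yr

print(f"Redshift (est.):       ${redshift_3yr:,}")
print(f"DIY, redundant (est.): ${diy_3yr:,}")
```

Even with a generous colo allowance, the DIY route comes in under the estimated Redshift bill, which is the comment's point; the real gap depends heavily on staffing costs the script ignores.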
Granted, this new service is probably only 50% more expensive than hosting your own, and if you have zero IT staff it might make sense in some scenarios, but it's definitely not a no-brainer.
It doesn't need to compete with Teradata. It needs to compete with Dell, and in that field it's still the more expensive option by a significant margin, as well as being (odds are good) at least a couple of orders of magnitude slower.
I work for a company that's in direct competition with Amazon (in one of the many areas Amazon operates in).
Handing our customer lists, source code, finance and sales data to Amazon in plaintext form seems naive to me. There's lots of people at Amazon, and it only takes one ambitious middle manager who wants to get noticed by cleverly anticipating the competition. Most likely there's no audit trail, and no chance of getting caught.
Yes. The actual SSH keys to AWS servers will be very heavily guarded by people in AWS whose job is to not let anyone see them.
They will not have any skin in the "middle managers" personal game and so his only other resort is straightforward hacking which he could do in your data center anyway.
Nah. The cloud is as safe as your data center, with the exception of bad apples at Amazon (same diff at your data center). It's servers, in data centers, virtualised.
At this level, I suspect you will not even be multi-tenanted with others above a certain price point.
Amazon's external-facing security seems robust, but most places I've worked have given a lot of trust to people on the inside. I've worked at places where all developers get read access to all databases - and managers who are former developers usually retain that access.
Amazon might have good internal security procedures - but this stuff can't be audited effectively, we can only take amazon's word for it. Taking their word for it, with the security of all your customers' data, is a big ask.
That's not how it works. EC2 can't log in to your virts. S3 can't (trivially) read your unencrypted bits, and they certainly can't get to your ec2 dom0.
Everything is pretty well firewalled. There's no back door access. If service A uses service B they hit the same public API as every other customer. Beyond that AMZN is really three companies; Amazon.com (retail), AWS, & Amazon Digital (kindle/vod/etc).
That all said, you own your availability (and risk assessment etc).
Edit: " this stuff can't be audited effectively, we can only take amazon's word for it". Or a trusted third party. Go ask your aws sales rep about pci, fisma, etc.
Good luck talking a military customer into hosting their data in some "cloud" somewhere. Pretty sure banks and other industries will also feel that way. The market there is to deliver and support a small cloud infrastructure they can host and use themselves on their closed network.
So instead of building 1 cloud storage service, you need to effectively build a cloud storage factory, so you can deploy N cloud storage services on demand.
At that point you also potentially deliver a product (a rack of hardware) not just a pure service.
Totally agree, and they'll just be left in the dust while paying bucketloads for services that we can now acquire for far less. At the end of the day, Amazon should have no problem finding clients interested in this service.
What's the difference? We trust such data to be in all sorts of insecure places. Do you really think one of the low-paid secretaries at a medical office isn't easily bribable? Or that every IT system your data eventually touches has Bruce Schneier doing their security?
All the people I wouldn't want to have access to my medical history (governments, insurance companies, and doctors) already do. Ditto for shopping data. The legal/security front for that sort of data is entirely pointless, as the bad guys are authorized parties, so you may as well save some money by putting it in the cloud.
There were always only two defensible privacy fronts: keeping the data off electronic records or filling the records with shit.
What are the regulations on things like health records, personal information, etc.? Stuff that has tight restrictions on how the data is handled. Can these types of data be stored on Amazon or similar services and still be in compliance with data protection laws?
For health records the regulations are a bit of a confusing mess when it comes to cloud storage. Basically, it boils down to "whatever your organization's legal team says". In theory, if data is encrypted in transit, encrypted at rest, and access is limited/logged, then it should meet US HIPAA requirements. However, that may not be enough to satisfy a particularly conservative legal department. There are also nuances about who holds the encryption keys, how are they managed, etc... Notably, Amazon won't actually stick their neck out and certify AWS as HIPAA compliant through a business associates agreement (interestingly Microsoft will for Azure: http://www.windowsazure.com/en-us/support/trust-center/compl...). I've been told by consultants that Amazon has so much business it's just not worth their time to bother with the headache of setting up such agreements.
Depends on a variety of factors including which regulations are governing the data. Some privacy laws require such records can't leave the country in which they're obtained. Other records have strict rules about disposition or "destruction" of the record. It's a complex field and wide open with questions.
From courts to records managers/custodians everyone is still trying to understand those questions. In my experience, when in doubt, big business decides the safest legal answer is "probably not".
Another barrier will be the integration with their current "business intelligence" solutions. Microstrategy and Jaspersoft support is a good start, but what about Microsoft, Oracle and SAP offerings?
Switching to Amazon would involve rewriting your ETL process, and retooling your reporting software, and converting all your existing, currently-used reports.
A huge expense in data warehousing projects isn't the hardware - it's the consultants, the time, and the people to support the thing. I'm sure this is a great solution for companies looking to start a data warehouse, or maybe companies looking to revamp their reporting environment completely... but other than that it'd be a hard sell...
Cloudera is doing just that with the recent announcement / open sourcing of Impala. Based on Amazon's description of their hosted product, the technology is very similar. Impala is still in beta, and columnar storage (trevni/avro) is right around the corner...with that, you can do petabyte scale queries for a very low cost.
Platfora is doing some interesting work with interactive, in-memory BI for Hadoop. They essentially do away with the traditional DW/ETL model and create ephemeral in-memory 'lenses' for querying and visualization.
Impala is married to Hadoop. What if your data infrastructure isn't built on HBase and is too complex/large to integrate easily? Would Impala still serve that purpose?
Impala doesn't require HBase to operate, it can use raw HDFS. Simple example, if you had a few terabytes of TSV files, you could easily copy the raw data into HDFS and then create a simple schema around it. All queries on this data would be in parallel across all the nodes in the cluster, this is partly due to the distributed nature of HDFS.
If your data is too difficult to integrate into HDFS (doesn't have to be HBase) using existing Hadoop tools, I suspect you're going to have to do some work to use that data on any platform.
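As a rough local analogy for the "copy raw TSV in and wrap a schema around it" workflow described above (this is my own sketch, not Impala itself): SQLite stands in for the distributed HDFS/Impala layer, and the file layout and column names are invented for illustration.

```python
import csv
import os
import sqlite3
import tempfile

# Raw tab-separated "log" data, as it might land in HDFS.
rows = [("2012-11-28", "widget", 3),
        ("2012-11-28", "gadget", 5),
        ("2012-11-29", "widget", 2)]
path = os.path.join(tempfile.mkdtemp(), "sales.tsv")
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# "Create a simple schema around it" and load the raw file as-is.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, item TEXT, qty INTEGER)")
with open(path, newline="") as f:
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   csv.reader(f, delimiter="\t"))

# Plain SQL over what started life as flat TSV.
total = db.execute(
    "SELECT SUM(qty) FROM sales WHERE item = 'widget'").fetchone()[0]
print(total)  # 5
```

The difference, of course, is that Impala runs the same kind of query in parallel across every node holding a block of the file, which is what makes it work at terabyte scale.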
Most enterprises already outsource a lot of their IT, data warehousing especially whether its to a big player like IBM or boutique consultancies.
For most applications does it really matter whether the data sits in your data center or Amazons? Nope... cause the organisation your company contracted to manage already has full access to all your secrets.
So really Amazon is just another IT outsourcer except you don't need a long drawn out sales process.
Have to say that this is pretty amazing. The price is so low that it's a no-brainer to just give it a try. For the same 2TB capability, a Vertica license would run between $20-40K, with high annual subscription fees.
The bigger question for me is how Amazon has been able to figure out the technical details necessary to run this kind of service at this price. It's just ridiculous. Talk about taking the oxygen out of the market...
Does anyone have insight into how painful it is for non-technical people to query their data warehouses?
I'm building a tool that allows business people and non-technical analysts to query their data warehouses using natural language. (Currently, you must ask a technical person to write ad-hoc queries for you, or build you a dashboard. This bogs down your data people.)
Does anyone have insight into the demand for such a product?
[edit: I'd love to chat with anyone with insight into this topic. Reach me at Joseph at metaoptimize dot com]
Most of the time it's usually easier to have people learn a touch of SQL and ask developers for the harder queries. We used Tableau, and after a couple of weeks they had every query they ever wanted saved.
They call me up, ask me to do a "quick report across the inventory db with the project cost data." I send it off to them. If they like it, we push a report (maybe with a couple of parameters) into production.
My gut is that we aren't lacking for good technical options in analytics and data warehousing. To be honest, the lion's share of my work in data warehousing is helping the users know what questions to ask.
But there is lots of room, and probably several excellent businesses (from lifestyle-sized to 8 digits) to be built on good BI.
> Does anyone have insight into how painful it is for non-technical people to query their data warehouses?
Depends. Back when I did DW stuff my general workflow was to speak with the analysts about what they were trying to accomplish. From there I would create the cubes and additional metrics. I would also set up all the processing schedules at this time. The analysts would then use an Excel plugin that provided a pivot table interface to any cubes for which they had access. It worked pretty well.
For straight data access I would teach them basic SQL and/or build SQL templates for them that they could extend.
My goal was always to teach a man to fish and get out of the way.
The closest things I've seen are exploratory data visualization products such as Tableau (which is pretty awesome). The downside (or partial downside) is that it can end up writing some nasty, non-performant queries in certain cases.
Yes, it can be crazy painful, to the point that non-technical people just don't run the query unless it is business-burning critical. In large part this is because the technical people are often tasked on projects from IT, and getting them to do a query requires middle-management, department-to-department deal making, which is slow and painful.
At ExxonMobil, a place I worked, you're going to have VPs asking each other, and IT is going to hedge with, "Yeah, if we do this then project X will be late" (it's going to be late anyway, but they've kept quiet about it and no one knows).
My personal solution when I needed a query was to bring a six pack of beer down to IT friday afternoon, mostly because I wouldn't be given access to write queries because we had BI software.
I would suggest reading some books on the topic of Dimensional Modeling [1], such as "The Data Warehouse Toolkit" [2]. The critical thing you need to expose to your users is the ability to ask for things which make sense in their world but are actually really difficult for even an engineer to code. Things like: "Show me average 9am-12pm sales on Mondays, Wednesdays and Fridays for 1st quarter, 2012."
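To make that concrete, here is a toy sketch (mine, not from the book) of why a date/time dimension turns that question into a plain join-and-filter: all the awkward calendar logic lives in the dimension table. The tiny tables and column names are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    day_name TEXT,     -- 'Mon', 'Wed', ...
    hour     INTEGER,  -- 0-23
    quarter  TEXT      -- '2012Q1', ...
);
CREATE TABLE fact_sales (time_id INTEGER, amount REAL);
""")
db.executemany("INSERT INTO dim_time VALUES (?,?,?,?)", [
    (1, "Mon", 10, "2012Q1"),   # qualifies
    (2, "Wed", 11, "2012Q1"),   # qualifies
    (3, "Tue", 10, "2012Q1"),   # wrong day
    (4, "Mon", 14, "2012Q1"),   # outside 9am-12pm
])
db.executemany("INSERT INTO fact_sales VALUES (?,?)",
               [(1, 100.0), (2, 300.0), (3, 999.0), (4, 999.0)])

# "Average 9am-12pm sales on Mon/Wed/Fri for Q1 2012" is now trivial:
avg = db.execute("""
    SELECT AVG(f.amount)
    FROM fact_sales f JOIN dim_time t USING (time_id)
    WHERE t.quarter = '2012Q1'
      AND t.day_name IN ('Mon', 'Wed', 'Fri')
      AND t.hour BETWEEN 9 AND 11
""").fetchone()[0]
print(avg)  # 200.0
```

Without the precomputed day-of-week, hour, and quarter columns, every one of those predicates would be a fiddly date-math expression repeated in every report.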
Speaking as someone who does his fair share of dimensional modeling, I would just point out that the example you cite could only involve two tables in a well designed dimensional model (sales fact and time/date dimension, I reckon). The challenge is in getting to that point.
To speak to OP's point about difficulty in querying data warehouses, most business intelligence tools that I'm aware of provide semantic layer[1]-type capabilities, whereby the user interface of the tool is presented in the language of the business domain. Nevertheless, I agree that this is still difficult work, unfortunately. That it is getting more complicated in some respects, such as through unstructured data, doesn't help either.
I guess I wasn't clear enough if it came across as though my example were complex. It's one easily solved via DM, and one that's extremely hard to execute in most non-dimensionally-modeled setups. That's exactly why I'm a huge advocate of DM instead of just throwing a ton of servers, Hadoop, and MR at everything.
> Does anyone have insight into the demand for such a product?
Enormous, and there are dozens of such tools available.
Most of them work best if you build an actual data warehouse -- dimensionally structured, not normalised. This is because they can easily build query forms using the DW dimensions in a language that makes sense to end users.
You might want to look into rjmetrics or chart.io and see what they offer. I've been integrating with both, and it seems one of their goals is to (after the connection and datasources are set-up - that still requires technical knowledge) allow non-technical people access to analyze the data.
I'm curious what technology they are using to power it. According to the website, the technology described seems very similar to what Cloudera recently open sourced (Impala), which sits along side Hadoop allowing ad-hoc MPP style querying on petabytes of data.
I'm guessing it is quite a bit different from that. It is a relational data warehouse. It supports a Postgres protocol and API, which sounds more like what Netezza has built. In fact, I would expect Netezza to be one of the most likely companies to partner with Amazon at this kind of price-point.
It should be interesting to see whether this will be a viable competitor to column-oriented SQL engines like Vertica or other OLAP solutions like SAP HANA. It would be nice if there were a simple SQL-based OLAP solution I could spin up for offline reporting that scales to terabytes of data.
The answer is in the term "data warehousing" -- http://en.wikipedia.org/wiki/Data_warehouse -- which implies that you're going to be doing data mining on vast amounts of data, often historical data like logs or transaction histories.
Google has systems like this for analyzing its request logs. Think of how many HTTP requests hit Google's front-end servers per second or hour or day. Each one has a few dozen pieces of data associated with it -- URL, client IP, headers, etc. Suppose I want to make a bar chart of how many requests came from France containing a certain header, each day for the last year. The system can do this query quickly if the requests are already bucketed by time interval, organized by column, compressed, and stored so that exactly the information needed can be brought into RAM quickly.
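A toy illustration (my own, not Google's actual system) of why column-oriented storage makes that query cheap: counting matching requests per day only touches the two relevant columns, never the full request records.

```python
from collections import Counter

# Each column of the request log stored as its own array; a real
# system would also compress each column and bucket it by time.
# The values here are made up.
days    = ["11-26", "11-26", "11-27", "11-27", "11-27"]
country = ["FR",    "US",    "FR",    "FR",    "DE"]
# ...columns for URL, headers, etc. exist but are never read below.

# Requests from France, per day: scans only `days` and `country`.
hits = Counter(d for d, c in zip(days, country) if c == "FR")
print(hits)  # Counter({'11-27': 2, '11-26': 1})
```

With row-oriented storage, the same count would drag every header and URL through RAM just to inspect one field; laying the data out by column means the scan is proportional to the columns you actually use.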
It is a little funny, when you step back, that "storing," "archiving," and "warehousing" are different things and Amazon has services for each. Try explaining the difference between S3, RDS, EBS, Glacier, and Redshift to a layperson.
Thanks for the response. Would it make sense to say that this is more likely to be used for metadata (i.e. analytics, logs, etc.) while a normal RDB (or NoSQL DB) would be used for application data (i.e. users, settings, etc.)?
This can scale up much more than a single RDS database, since it spreads the data across multiple machines, but it's not exactly a replacement for a MySQL database. It's also possible that this doesn't make use of EBS, which could make it perform more predictably and protect it from failure when EBS fails.
> It's also possible that this doesn't make use of EBS
This quote from the product page seems to indicate that EBS is not used for primary data storage: "it runs on hardware that is optimized for data warehousing, with local attached storage and 10GigE network connections between nodes."
Wow... I just finished reading a sci-fi book a few weeks ago: "Redshift Rendezvous" by John E. Stith. I wonder if this is where the name comes from? In the book, Redshift is the name of the spaceship that runs cargo missions through folded space; the obvious problem is that since you are traveling within just a few m/s of the speed of light, just walking on the ship while underway causes a color shift, thus redshift.
I read that Stith has a physics degree and worked as an engineer at NORAD's Cheyenne Mountain. That made me really interested in what novel he would come up with.
http://www.neverend.com/short-bio-john-e-stith
Redshift is a real physical phenomenon describing the way light wavelengths get "shifted" (stretched, to visualize it) towards the red when they come from something moving away from the observer.
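The standard definition, in terms of observed and emitted wavelengths:

```latex
z = \frac{\lambda_{\mathrm{observed}} - \lambda_{\mathrm{emitted}}}{\lambda_{\mathrm{emitted}}}
```

A positive z (redshift) means the source is receding; a negative z (blueshift) means it is approaching.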
Very cool that this will support regular SQL queries, and that queries can be sent using PostgreSQL drivers. PostgreSQL drivers are super stable and supported everywhere. Driver support is usually overlooked by "enterprise" data warehousing solutions. I recall that it was really hard to get the Vertica drivers installed and stable under Linux.
I took a few screenshots from the keynote and included one showing the mention of Postgresql and ODBC/JDBC support. Included here if you want to see for yourself: http://wp.me/p2sRpx-1e
I cannot find information on whether Redshift supports queries in MDX. Lots of DWs today run on Microsoft SQL Server Analysis Services, and its MDX spec is now supported by several DW vendors. MDX support would mean it would be easy to switch the DW engine while keeping your visualisation suite (or Excel, what the hell), making for an easy switch to the cloud: you'd just pick a different data source in your tool.
Looks impressive and very interesting, signed up to review and compare with Teradata/Netezza.
Can we run more complex in-database processes implemented as stored procedures on this platform or is it going to be limited to pure SQL querying/analytics?
And does anyone have an idea how to upload 1 TB of data to this service using Internet connection from your in-house company server? ;)
No. Spanner is a globally distributed database which supports transactions. It is meant for applications which need to make frequent updates to a database, but the storage for the database may be distributed around the world.
Redshift is a different usage model. You upload your data once, then ask questions of it - but you don't update it. Google does have something similar to Redshift: BigQuery (https://cloud.google.com/products/big-query).