There are a number of new capabilities that have been introduced in SQL Server 2016. One of the features, that is not often discussed, is a tool called Polybase. I’ve personally been familiar with the general concept of Polybase for more than 3 years since it was first introduced with the PDW Appliance (MPP Architecture – now APS), but I’ve never been very well-versed on its functionality. It appears that Polybase has come a long way and I feel that integrating it with SQL Server 2016 was an excellent move on Microsoft’s part.
Why do I say that? Let’s answer that question by going over the purpose of Polybase. Assume you have data in structured (database like SQL Server) and unstructured (Hadoop or Blobs) formats. To make your databases system as efficient as possible, you need to pull data from both these worlds to create a single useable view – in your current database environment, this is quite a daunting task. Many customers solve this problem by using something like a SCOOP connector to Hadoop with an ETL package to bring the data from the Hadoop system into SQL Server or their data warehouse. From there, they build the needed reports.
Even though the above process works, Polybase can make it simpler by removing the need to have an ETL middle tier to transfer the data. Polybase gives you a query interface from SQL Server via T-SQL that can query both structured and unstructured data from SQL Server directly. Now you can join data sitting inside Hadoop and SQL Server in one single query and get the output. The ability to transfer data that is stored on an Azure Blob like this is impressive and incredibly useful for DBAs looking for more efficient solutions.
Now that we understand the functionality and benefits of Polybase, let’s review the TSQL Script I used to do the setup.
CREATEDATABASE Polybase GO USE Polybase GO EXECsp_configure'hadoop connectivity',7 RECONFIGURE
You’ll notice that I have used the option ‘7’. This is because we are going to use the Azure Blob as the source. Next let’s look at the security script.
-- Creating master key and credential CREATEMASTERKEYENCRYPTIONBYPASSWORD='My$uperC00LP@ssw0rd'; CREATEDATABASESCOPEDCREDENTIAL AzureBlobSecret WITHIDENTITY='sqlauthpolybase1', Secret='your Secret key from blob';
Next is the meat of the script which I will use to configure a pointer to the blob account. Then we will define the schema so it will recognize when SQL Server does a query. Finally, we will map it to a definition of an external table that marries the blob location of the file to the file format we used.
-- Creating external data source (Azure Blob Storage) w/ credential CREATEEXTERNALDATASOURCE AzureStore_Polybase with ( TYPE=HADOOP, LOCATION='wasbs://polybaseDemo@sqlauthpolybase1.blob.core.windows.net', CREDENTIAL= AzureBlobSecret ) -- define the file format being used CREATEEXTERNALFILEFORMAT Poly_FileFormat WITH ( FORMAT_TYPE=DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR='|', USE_TYPE_DEFAULT= TRUE)) -- Define the table's schema when we will access CREATEEXTERNALTABLE EmployeeExtTable( eid int, ename varchar(50) ) WITH ( LOCATION='/SQLAuth.txt', DATA_SOURCE= AzureStore_Polybase, FILE_FORMAT= Poly_FileFormat ); GO
Once these are configured, we can access the data sitting inside a delimited file like any other table:
SELECT*FROM EmployeeExtTable GO
In this example, we used a blob as the source for storing files. However, they can also reside on a Hadoop filesystem and a similar code can be written to access them.
You might have more interesting use cases in your journey to building such Hybrid scenarios with Azure; I would love to hear about them. What are you doing within your organization to leverage the benefits of Polybase? Do you find it useful? Have you found any additional capabilities we may not have covered? Please leave us your thoughts in the “Comments” section.
EPM applications help measure the business performance. This post will help you choose the best EPM solutions for your organization’s needs and objectives.
Which RAID should you use with SQL Server? Learn the differences between RAID 0, RAID 1, RAID 5, and RAID 10, along with best practices.