Merging the Hybrid World of Blobs and SQL Server 2016 with Polybase

Author: Eric Russo | 4 min read | July 26, 2016

There are a number of new capabilities that have been introduced in SQL Server 2016. One of the features, that is not often discussed, is a tool called Polybase. I’ve personally been familiar with the general concept of Polybase for more than 3 years since it was first introduced with the PDW Appliance (MPP Architecture – now APS), but I’ve never been very well-versed on its functionality. It appears that Polybase has come a long way and I feel that integrating it with SQL Server 2016 was an excellent move on Microsoft’s part.

Why do I say that? Let’s answer that question by going over the purpose of Polybase. Assume you have data in structured (database like SQL Server) and unstructured (Hadoop or Blobs) formats. To make your databases system as efficient as possible, you need to pull data from both these worlds to create a single useable view – in your current database environment, this is quite a daunting task. Many customers solve this problem by using something like a SCOOP connector to Hadoop with an ETL package to bring the data from the Hadoop system into SQL Server or their data warehouse. From there, they build the needed reports.

Even though the above process works, Polybase can make it simpler by removing the need to have an ETL middle tier to transfer the data. Polybase gives you a query interface from SQL Server via T-SQL that can query both structured and unstructured data from SQL Server directly. Now you can join data sitting inside Hadoop and SQL Server in one single query and get the output. The ability to transfer data that is stored on an Azure Blob like this is impressive and incredibly useful for DBAs looking for more efficient solutions.

Now that we understand the functionality and benefits of Polybase, let’s review the TSQL Script I used to do the setup.

CREATEDATABASE Polybase

GO

USE Polybase

GO

EXECsp_configure'hadoop connectivity',7

RECONFIGURE

You’ll notice that I have used the option ‘7’. This is because we are going to use the Azure Blob as the source. Next let’s look at the security script.

-- Creating master key and credential

CREATEMASTERKEYENCRYPTIONBYPASSWORD='My$uperC00LP@ssw0rd';

CREATEDATABASESCOPEDCREDENTIAL AzureBlobSecret WITHIDENTITY='sqlauthpolybase1',

Secret='your Secret key from blob';

Next is the meat of the script which I will use to configure a pointer to the blob account. Then we will define the schema so it will recognize when SQL Server does a query. Finally, we will map it to a definition of an external table that marries the blob location of the file to the file format we used.

-- Creating external data source (Azure Blob Storage) w/ credential

CREATEEXTERNALDATASOURCE AzureStore_Polybase with (

TYPE=HADOOP,

LOCATION='wasbs://[email protected]',

CREDENTIAL= AzureBlobSecret

)

-- define the file format being used

CREATEEXTERNALFILEFORMAT Poly_FileFormat WITH (

FORMAT_TYPE=DELIMITEDTEXT,

FORMAT_OPTIONS (FIELD_TERMINATOR='|',

USE_TYPE_DEFAULT= TRUE))

-- Define the table's schema when we will access

CREATEEXTERNALTABLE EmployeeExtTable(

eid int,

ename varchar(50)

)

WITH (

LOCATION='/SQLAuth.txt',

DATA_SOURCE= AzureStore_Polybase,

FILE_FORMAT= Poly_FileFormat

);

GO

Once these are configured, we can access the data sitting inside a delimited file like any other table:

SELECT*FROM EmployeeExtTable

GO

In this example, we used a blob as the source for storing files. However, they can also reside on a Hadoop filesystem and a similar code can be written to access them.

You might have more interesting use cases in your journey to building such Hybrid scenarios with Azure; I would love to hear about them. What are you doing within your organization to leverage the benefits of Polybase? Do you find it useful? Have you found any additional capabilities we may not have covered? Please leave us your thoughts in the “Comments” section.

Blog Author

Eric Russo

Senior Vice President of Database Services

Eric Russo is Senior Vice President of Database Services overseeing all of Datavail’s database practices including project and managed services for SQL Server, Oracle, MySQL, MongoDB, Db2, PostgreSQL, Cassandra and AWS. He is also the product owner for Datavail TechBoost™ (formerly Datavail Delta), a cloud-based automation platform. He has more than 20 years’ experience in IT with a majority of those years in database management. His management success and style has attracted top DBAs from around the world to create one of the most talented and largest SQL Server team. He has been with Datavail since 2008. Previous to that his work experiences include DBA Manager at StrataVia, Senior Web Developer at Manifest Information Systems and SQL Server DBA at Clark County, Nevada.

Merging the Hybrid World of Blobs and SQL Server 2016 with Polybase

Blog Author

Subscribe to Our Blog