How to import / export CSV files with R in SQL Server 2016
Introduction
Importing and exporting CSV files is a common task that DBAs face from time to time.
For import, we can use the following methods:
- OPENROWSET with the BULK option
- Writing a CLR stored procedure or using PowerShell
For export, we can use the following methods:
- Writing a CLR stored procedure or using PowerShell
But to do both the import and the export inside T-SQL, the only way so far has been a custom CLR stored procedure.
This changed with the release of SQL Server 2016, which has R integrated. In this article, we will demonstrate how to use R embedded inside T-SQL to do the import / export work.
R Integration in SQL Server 2016
To use R inside SQL Server 2016, we should first install R Services (In-Database). For detailed installation steps, please see the Microsoft documentation on setting up SQL Server R Services (In-Database).
T-SQL integrates R via a new stored procedure: sp_execute_external_script.
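As a quick sanity check that the integration works, here is a minimal sketch; the option only needs to be enabled once per instance, and SQL Server 2016 may require a service restart before it takes effect:

-- enable external script execution, then run a trivial R script that returns the R version
exec sp_configure 'external scripts enabled', 1;
reconfigure with override;

exec sp_execute_external_script
      @language = N'R'
    , @script = N'OutputDataSet <- data.frame(RVersion = R.version.string)'
with result sets ((RVersion varchar(200)));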
The main purpose of the R language is data analysis, especially statistical analysis. However, since any data analysis work naturally needs to deal with external data sources, among them CSV files, we can use this capability to our advantage.
What is more interesting here is that SQL Server R Services comes with RevoScaleR, an R package enhanced and tailored for SQL Server 2016, which contains some handy functions for moving data in and out of SQL Server.
Environment Preparation
Let’s first prepare some real-world CSV files. I recommend downloading them from an open data portal such as Data.gov.
We will download the first two datasets, “College Scorecard” and “Demographic Statistics By Zip Code”, as CSV files.
After downloading the two files, we can move “Demographic_Statistics_By_Zip_Code.csv” and “Most-Recent-Cohorts-Scorecard-Elements.csv” to a designated folder. In my case, I created a folder C:\RData and put them there.
These two files are fairly typical: Demographic_Statistics_By_Zip_Code.csv contains nothing but numeric values, while the other file has a large number of columns, 122 to be exact.
I will load these two files into my local SQL Server 2016 instance, i.e. [localhost\sql2016], into the [TestDB] database.
Data Import / Export Requirements
We will do the following for these import / export requirements:
1. Import the two csv files into staging tables in [TestDB]. The input parameter is a csv file name
2. Export the staging tables back to a csv file. The input parameters are the staging table name and the csv file name
3. Import / export should be done inside T-SQL
Implementation of Import
In most data loading work, we would first create the staging tables and then start loading. However, with some amazing functions in the RevoScaleR package, this staging table creation step can be omitted, because the R function auto-creates the staging table; that is such a relief when we have to handle a CSV file with 100+ columns.
The implementation is straightforward:
1. Read the csv file with the read.csv R function into variable c, which will be the source (line 7)
2. From the csv file’s full path, extract the file name (without directory and suffix); we will use this file name as the staging table name (lines 8 and 9)
3. Create a SQL Server connection string (line 11)
4. Create a destination SQL Server data source using the RxSqlServerData function (line 12)
5. Use the rxDataStep function to import the source into the destination (line 13)
6. If we want to import a different csv file, we just need to change the first declaration to assign the proper value to @filepath
One special note here: line 11 defines a connection string, and at this moment it seems we need a User ID (UID) and Password (PWD) to avoid problems; if we use Trusted_Connection = True, there can be problems. So in this case, I created a login XYZ and assigned it as a db_owner user in [TestDB].
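A minimal sketch of creating such a login (the password below is only a placeholder; use your own):

-- create the SQL login used in the R connection string and make it db_owner in TestDB
use [master];
create login XYZ with password = N'<StrongPassword-1>';
go
use [TestDB];
create user XYZ for login XYZ;
alter role db_owner add member XYZ;
go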
After this is done, we can check what the new staging table looks like.
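For example, one quick way to inspect the auto-created table from T-SQL (the table name follows from the file name, as the script builds it):

-- list the columns and data types that the R import auto-created
select COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
from [TestDB].INFORMATION_SCHEMA.COLUMNS
where TABLE_NAME = 'Demographic_Statistics_By_Zip_Code'
order by ORDINAL_POSITION;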
We notice that all columns are created with the original column names from the source csv file and with proper data types.
After assigning @filepath = ‘c:/rdata/Most-Recent-Cohorts-Scorecard-Elements.csv’ and re-running the script, we can see that a new table [Most-Recent-Cohorts-Scorecard-Elements] is created with 122 columns.
However, there is a problem with this csv file import, because some csv columns are treated as integers when they should not be. For example, [OPEID] and [OPEID6] should be treated as strings instead, because treating them as integers drops the leading 0s.
When we look at what is inside the table, we notice that in such a scenario we cannot rely on the automatic table creation alone.
To correct this, we have to instruct the R read.csv function by specifying the data type for these two columns, as shown below.
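A sketch of the adjusted read.csv call (only this line of the import script changes; colClasses accepts a named vector of column types):

# read OPEID and OPEID6 as character so their leading zeros are preserved
c <- read.csv(filepath, sep = ",", header = T,
              colClasses = c("OPEID" = "character", "OPEID6" = "character"))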
We can now see the correct values in the [OPEID] and [OPEID6] columns.
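A quick spot check on the reloaded table:

-- the leading zeros should now be intact
select top (5) OPEID, OPEID6
from [TestDB].dbo.[Most-Recent-Cohorts-Scorecard-Elements];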
Implementation of Export
If we want to dump the data out of a table to a csv file, we need to define two input parameters: one is the destination csv file path and the other is a query to select the table.
The beauty of sp_execute_external_script is that it can run a query against a table inside SQL Server via its @input_data_1 parameter, and then pass the result to the embedded R script as a named variable via its @input_data_1_name parameter.
So here are the details:
- Define the csv file full path (line 3); this information is handed to the embedded R script via an input parameter definition (lines 11 and 12) and consumed in line 8
- Define a query to retrieve the data inside the table (lines 4 and 9)
- The query result is passed to the embedded R script as a named variable, SrcTable (via @input_data_1_name), and it is consumed in the R script (line 8) to generate the csv file
We can modify @query to export whatever we want, such as a query with a where clause, or a selection of just some columns instead of all of them.
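Since the export script is only outlined above, here is a minimal sketch of what such a call can look like; the output file name is arbitrary, the table comes from the earlier import, and write.csv is simply one straightforward way to produce the file:

-- a sketch of the export: dump a staging table back out to a csv file
declare @filepath varchar(100) = 'c:/rdata/Demographic_Statistics_By_Zip_Code_export.csv';
declare @query nvarchar(max) = N'select * from dbo.[Demographic_Statistics_By_Zip_Code]';
exec sp_execute_external_script @language = N'R'
, @script = N'
write.csv(SrcTable, file = filepath, row.names = FALSE)  # SrcTable is the @query result, named via @input_data_1_name
'
, @input_data_1 = @query
, @input_data_1_name = N'SrcTable'
, @params = N'@filepath varchar(100)'
, @filepath = @filepath;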
The complete T-SQL script for the import is shown here:
-- import data 1: import from csv file by using default configurations
-- the only input parameter needed is the full path of the source csv file
declare @filepath varchar(100) = 'c:/rdata/Demographic_Statistics_By_Zip_Code.csv' -- using / to replace \
declare @tblname varchar(100);
declare @svr varchar(100) = @@servername;
exec sp_execute_external_script @language = N'R'
, @script = N'
c <- read.csv(filepath, sep = ",", header = T)
filename <- basename(filepath)
filename <- paste("dbo.[", substr(filename,1, nchar(filename)-4), "]", sep="") #remove .csv suffix