A problem I’ve been struggling with is this:
How can I export/import a parquet file without Spark?
There are various efforts such as Apache Arrow, and a Python parquet package which lets you read the data, but how might one export (write) it?
Enter Apache Drill
This isn’t exactly a new approach, but I thought I’d document what I did to get it working locally, as there aren’t any really clear instructions online!
Firstly, start Apache Drill as shown in the tutorial.
Next, the connection settings for the Apache Drill JDBC interface are shown below. Make sure that the JDBC URL is the one shown below, not jdbc:drill:zk=local, as the latter will not work.
library(RJDBC)
# load the Drill JDBC driver shipped with the Drill installation
drv <- JDBC("org.apache.drill.jdbc.Driver",
            "path/to/apache-drill-1.7.0/jars/jdbc-driver/drill-jdbc-all-1.7.0.jar", "'")
# connect directly to the local drillbit (not via zookeeper)
conn <- dbConnect(drv, "jdbc:drill:drillbit=localhost")
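To confirm the connection works, a quick query against one of Drill's system tables can be used; a minimal sanity check, using Drill's standard sys.version table:
# sanity check: ask Drill for its version via the sys schema
dbGetQuery(conn, "select * from sys.version")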
My dfs configuration is shown below. The key highlights are:
- the connection field should be file:///
- make use of your workspaces and their locations to navigate the file system effectively.
This configuration can be accessed under http://localhost:8047/storage after Drill is running (assuming the defaults are all used!)
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "config": null,
  "workspaces": {
    "root": {
      "location": "C:/Users/chapm/Documents/apache-drill-1.7.0",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "C:/tmp",
      "writable": true,
      "defaultInputFormat": null
    },
    "output": {
      "location": "C:/Users/chapm/Documents/apache-drill-1.7.0/output",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": ["tbl"],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": ["tsv"],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": ["json"]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": ["seq"]
    },
    "csvh": {
      "type": "text",
      "extensions": ["csvh"],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
Writing to Parquet
To write a parquet file, Drill needs to read from an existing file and output the result as a new table:
dbGetQuery(conn, paste0("create table dfs.output.iris as ",
"select * from dfs.root.`iris.csv`", collapse=" "))
N.B. if the command above doesn’t work and you are following along, make sure that the output workspace referenced in your configuration above actually exists, i.e. you may need to create the folder first.
Also notice that by default the csv parser in Drill does not skip the header row! You may need to alter the format configuration (e.g. use the csvh format with extractHeader) to get the behaviour you are probably expecting!
Notice that I had to create a csv file first! Drill cannot be used to insert, update, or delete data; it is only able to create table as (CTAS).
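For completeness, here is a minimal sketch of how the iris.csv referenced above could be produced from R. The path is an assumption based on the root workspace in the dfs configuration, and the header row is omitted because the plain csv format does not strip headers (alternatively, write a .csvh file and rely on extractHeader):
# write iris into the root workspace so Drill can query it as dfs.root.`iris.csv`
# (the path below is an assumption; adjust it to your root workspace location)
write.table(iris, "C:/Users/chapm/Documents/apache-drill-1.7.0/iris.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)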
This naturally leads to the question: how can I import an R object? Possible approaches are:
- write to an intermediate file format, like csv
- write to a SQLite database - this should be better, as it will better preserve the data types of the R data frame object.
Setting up SQLite
To set up SQLite, we need to download the appropriate JDBC jar and place it in the jars/3rdparty folder of the Drill installation.
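If you prefer, the jar can also be fetched from within R; a minimal sketch, assuming the usual Maven Central layout for the org.xerial sqlite-jdbc artifact (check the URL and version before relying on it):
# download the SQLite JDBC driver into Drill's 3rdparty jar folder
# (URL is an assumption based on the standard Maven Central layout)
download.file("https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/3.8.11.2/sqlite-jdbc-3.8.11.2.jar",
              "path/to/apache-drill-1.7.0/jars/3rdparty/sqlite-jdbc-3.8.11.2.jar",
              mode = "wb")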
Then we can connect in the normal way:
# load the SQLite JDBC driver we just placed in jars/3rdparty
sqlitedrv <- JDBC("org.sqlite.JDBC",
                  "path/to/apache-drill-1.7.0/jars/3rdparty/sqlite-jdbc-3.8.11.2.jar", "'")
# connect to (and create, if necessary) a local SQLite database file
sqliteconn <- dbConnect(sqlitedrv, "jdbc:sqlite:C:/Users/chapm/Documents/apache-drill-1.7.0/sqlite/test.sqlite")
We can write a table to this database using the normal RJDBC commands:
dbWriteTable(sqliteconn, "iris", iris)
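To confirm the write succeeded, the table can be listed and read back over the same connection (a quick sanity check, not strictly required):
# verify that the iris table now exists in the SQLite database
dbListTables(sqliteconn)
dbGetQuery(sqliteconn, "select count(*) from iris")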
Adding the SQLite config to Drill
Adding the configuration is quite simple (I called this configuration sqlite):
{
  "type": "jdbc",
  "driver": "org.sqlite.JDBC",
  "url": "jdbc:sqlite:path/to/apache-drill-1.7.0/sqlite/test.sqlite",
  "username": null,
  "password": null,
  "enabled": true
}
We can then verify that it has been added correctly by querying the SQLite database through the Drill connection:
dbGetQuery(conn, "select * from sqlite.iris")
If this works correctly, we can then write to parquet this way:
dbGetQuery(conn, paste0("create table dfs.output.iris2 as ",
"select * from sqlite.iris", collapse=" "))
And that is it! You should be able to export (and import) parquet files using R and Drill.
Python
Given that everything is set up as above, Python is much easier:
pip install pydrill
Then just use the library:
from pydrill.client import PyDrill
import pandas as pd
import sqlite3

# connect to the Drill web interface
drill = PyDrill(host='localhost', port=8047)

# read the iris table (stored in SQLite) through Drill into a pandas data frame
df = drill.query("select * from sqlite.iris").to_dataframe()

# connect to the same SQLite database directly
sqliteconn = sqlite3.connect('path/to/apache-drill-1.7.0/sqlite/test.sqlite')

# some dummy data
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(d)
df1.to_sql('dummy', sqliteconn)

# drill can see the table
drill.query("select * from sqlite.dummy").to_dataframe()

# write to parquet:
drill.query("""create table dfs.output.dummy as
               select * from sqlite.dummy""")
This could be a simple approach to convert data to Parquet format!