using-get_table.Rmd
The ideadata
package utilizes the simple concept of
get_*
functions, which is named after the table that you
want to connect to in the warehouse in one line. Functions like
[get_students] are useful because they lower your cognitive load (“What
DB and schema do I pull students from and what’s the SQL look like”) and
they ensure that anyone using this package (read: R&A) are always
pulling the same data when they start an analysis, save any processing
logic.
One big downside: they need to be created one at a time and so if a
new set of tables is added to the data warehouse, there can a lag
between when those tables are added and when there is a corresponding
get_*
function in this package.
get_table()
To mitigate against the time Chris has to develop new
get_*
functions and the ever evolving nature of data
warehouse, ideadata
now has a new workhourse function:
get_table()
.
This function does exactly what it says: it get’s tables from the
warehouse by looking up table location in the MetaData
table in the Documentation
schema of R&A’s SQL Server
instance. That table updates frequently and contains a comprehensive
view of the data warehouse at any time.
When you load ideadata
(i.e., when you run
library(ideadat)
the package makes a connection to the
MetaData
table and saves it as a dataframe names
warehouse_meta_data
that you can access by running
data(warehouse_meta_data)
.
Not that you might need to detach and reload ideadata
if
you know the warehouse has a new table but get_table()
can’t find it. This will force the warehouse_meta_data
table to update.
If the table is uniquely named in the data warehouse then the usage is very straightforward. You need to simply the table name to function:
library('ideadata')
#> ℹ Loading ideadata
#> ── Gathering warehouse metadata ─────────────────────────────────────────────── ideadata 3.0.1 ──
#> ℹ Checking credentials ...
#> christopher.haid's Kereberos TGT is current
#> ✓ Success: Warehouse metadata gathered!
#> ℹ The warehouse currently houses:
#> 2,226 tables
#> in 134 databases
#> on 17 servers
#> WHEE! ideadata and you are healthily neat!
#library(ideadata)
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.2
#> ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.6 ✓ purrr 0.3.4
#> ✓ tibble 3.1.6 ✓ stringr 1.4.0
#> ✓ tidyr 1.2.0 ✓ forcats 0.5.1
#> ✓ readr 2.1.2
#> Warning: package 'tibble' was built under R version 4.0.2
#> Warning: package 'tidyr' was built under R version 4.0.5
#> Warning: package 'readr' was built under R version 4.0.5
#> Warning: package 'forcats' was built under R version 4.0.2
#> ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x lubridate::as.difftime() masks base::as.difftime()
#> x lubridate::date() masks base::date()
#> x dplyr::filter() masks stats::filter()
#> x lubridate::intersect() masks base::intersect()
#> x dplyr::lag() masks stats::lag()
#> x lubridate::setdiff() masks base::setdiff()
#> x lubridate::union() masks base::union()
oe_responses <- get_table("OEResponses")
#> christopher.haid's Kereberos TGT is current
head(oe_responses)
#> # Source: lazy query [?? x 7]
#> # Database: Microsoft SQL Server 15.00.4198[dbo@RGVPDRA-DASQL/GPTW]
#> ID AcademicYear OEQuestionKey OEResponseKey GroupType Group Response
#> <int64> <chr> <int> <int> <chr> <chr> <chr>
#> 1 1 2020-2021 5 1 Region Austin "I would not change a thing;…
#> 2 2 2020-2021 5 2 Region Austin "As a new teacher I had hear…
#> 3 3 2020-2021 5 3 Region Austin "Choose locations more caref…
#> 4 4 2020-2021 5 4 Region Austin "At my previous campus, I wo…
#> 5 5 2020-2021 5 5 Region Austin "To treat it's employees lik…
#> 6 6 2020-2021 5 6 Region Austin "More check-ins with manager"
Not too bad.
This function will fail if the tables is not uniquely named in the warehouse.
Why?
Because there is no way to know which table you really mean. One
solution, if you know the location, is to provide the table name,
database name, and schema to the .table_name
,
database_name
, .schema
, and
.server_name
arguments of get_table()
. That
will locate the right server and make the connection and avoid
failure.
But here is something cool: What if you don’t know
that info? When get_table
fails in this scenario it fails
informatively. get_table
will show you all the tables it’s
found in the warehouse and give you code you can copy and paste to
get the table you want.
Let’s say you ask for the Schools
table
schools <- get_table("Schools")
#> ! There are 22 tables with that name in our warehouse
#> ℹ You'll need to specify the database and schema name with db target.
#> ✓ Any of these should work:
#> get_table(.table_name = "Schools", .database_name = "CSIInstruction", .schema = "dbo", .server_name = "1064618-SQLDI")
#> get_table(.table_name = "Schools", .database_name = "CSIInstructionStage", .schema = "dbo", .server_name = "1064618-SQLDI")
#> get_table(.table_name = "Schools", .database_name = "IDEAInstruction", .schema = "dbo", .server_name = "1064618-SQLDI")
#> get_table(.table_name = "Schools", .database_name = "PROD1_2021ss", .schema = "Schools", .server_name = "1065574-SQLPRD1")
#> get_table(.table_name = "Schools", .database_name = "PROD1_2020ss", .schema = "Schools", .server_name = "1065574-SQLPRD1")
#> get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "RGVPDSD-DWPRD1")
#> get_table(.table_name = "Schools", .database_name = "SRC_Florida_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
#> get_table(.table_name = "Schools", .database_name = "SRC_Louisiana_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
#> get_table(.table_name = "Schools", .database_name = "SRC_Texas_Travis_Schools", .schema = "Skyward", .server_name = "RGVPDSD-DWSRC1")
#> get_table(.table_name = "Schools", .database_name = "TravisSnapshotData", .schema = "MidLand.PM", .server_name = "RGVPDSD-DWSRC1")
#> get_table(.table_name = "Schools", .database_name = "SRC_EA", .schema = "dbo", .server_name = "RGVPDSD-DWSRC2")
#> get_table(.table_name = "Schools", .database_name = "SRC_Ohio_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
#> get_table(.table_name = "Schools", .database_name = "SRC_Florida_Schools_Focus_Dev", .schema = "Focus", .server_name = "RGVPDSD-DWSRC1")
#> get_table(.table_name = "Schools", .database_name = "SRC_Ohio_Assessments", .schema = "Edcite", .server_name = "RGVPDSD-DWSRC2")
#> get_table(.table_name = "Schools", .database_name = "SRC_StateAssessments", .schema = "dbo", .server_name = "RGVPDSD-DWSRC2")
#> get_table(.table_name = "Schools", .database_name = "CSIInstructionDebug", .schema = "dbo", .server_name = "1064618-SQLDI")
#> get_table(.table_name = "Schools", .database_name = "IDEAInstructionDebug", .schema = "dbo", .server_name = "1064618-SQLDI")
#> get_table(.table_name = "Schools", .database_name = "IDEAInstructionStage", .schema = "dbo", .server_name = "1064618-SQLDI")
#> get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "1065574-SQLPRD1")
#> get_table(.table_name = "Schools", .database_name = "Dashboard", .schema = "dbo", .server_name = "791150-HQVRA")
#> get_table(.table_name = "Schools", .database_name = "Dashboard", .schema = "dbo", .server_name = "RGVPDRA-DASQL")
#> get_table(.table_name = "Schools", .database_name = "SRC_Texas_Schools", .schema = "Powerschool", .server_name = "RGVPDSD-DWSRC1")
Oh no! There are 21 tables named Schools
in the
warehouse on 17 databases. Yikes.
But look at that error: It’s got what you need!
You can copy and paste any one of those lines to get the unique table
you are seeking (in this case the Schools
table in
Schools
on Prod1
).
schools <- get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools")
#> ! There are 2 tables with that name in our warehouse
#> ℹ You'll need to specify the database and schema name with db target.
#> ✓ Any of these should work:
#> get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "RGVPDSD-DWPRD1")
#> get_table(.table_name = "Schools", .database_name = "PROD1", .schema = "Schools", .server_name = "1065574-SQLPRD1")
head(schools)
#> NULL