Using odns


odns provides a base for exploring and obtaining data available through the Scottish Health and Social Care Open Data platform. The package wraps the underlying CKAN API and simplifies access to the data from R, allowing users to explore the available data quickly and start using it without having to write complex queries.


Language of CKAN

CKAN, and by extension this package, refers to packages and resources.

The term package refers to a dataset, which is a collection of resources. A resource contains the data itself.

Example structure:

.
├── package_1
│   ├── resource_1
│   ├── resource_2
│   └── resource_3
│
└── package_2
    ├── resource_1
    └── resource_2
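
The same hierarchy can be sketched in R as a named list. This is purely an illustration: the names are the placeholders from the diagram above, and it is not a description of how odns or CKAN store anything internally.

```r
# Placeholder catalogue mirroring the tree above: each package is a
# named element holding the names of its resources.
catalogue <- list(
  package_1 = c("resource_1", "resource_2", "resource_3"),
  package_2 = c("resource_1", "resource_2")
)

# Number of resources in each package
resource_counts <- lengths(catalogue)
print(resource_counts)

# Resources belonging to a single package
print(catalogue[["package_2"]])
```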


Exploring the available data

To view all the packages available we can use all_packages().

#' view all available packages

> all_packages()
  
# package_name                                      package_id
# covid-19-vaccination-in-scotland                  6dbdd466-…
# enhanced-surveillance-of-covid-19-in-scotland     3c5231ee-…
# hospital-onset-covid-19-cases-in-scotland         d67b13ef-…
# weekly-covid-19-statistical-data-in-scotland      524b42b4-…
# covid-19-in-scotland                              b318bddf-…
# … with 85 more rows


If we wish to search for packages by name, or by part of a name, we can use the contains argument of all_packages(). For example, if we want to find packages relating to populations:

> all_packages(contains = "population")

# package_name                           package_id                          
# population-estimates                   7f010430-6ce1-4813-b25c-f7f335bdc4dc
# standard-populations                   4dd86111-7326-48c4-8763-8cc4aa190c3e
# population-projections                 9e00b589-817e-45e6-b615-46c935bbace0
# gp-practice-populations                e3300e98-cdd2-4f4e-a24e-06ee14fcc66c
# scottish-index-of-multiple-deprivation 78d41fa9-1a62-4f7b-9edb-3e8522a93378


If we prefer to see more detail, including the names of resources within packages, we can use all_resources() with the package_contains argument.

#' view all resources within packages whose names contain "population"

> all_resources(package_contains = "population")

# resource_name         resource_id package_name package_id url   last_modified
# Data Zone (2011) Pop… c505f490-c… population-… 7f010430-… http… 2021-10-11T1…
# Intermediate Zone (2… 93df4c88-f… population-… 7f010430-… http… 2021-10-11T1…
# Council Area (2019) … 09ebfefb-3… population-… 7f010430-… http… 2021-07-06T0…
# Health and Social Ca… c3a393ce-2… population-… 7f010430-… http… 2021-07-06T0…
# Health Board (2019) … 27a72cc8-d… population-… 7f010430-… http… 2021-07-06T0…
# … with 53 more rows


The resource_contains argument of all_resources() can also be used to further narrow the results of a query.

#' view all resources whose names contain "european"

> all_resources(resource_contains = "european")

# resource_name         resource_id package_name package_id url   last_modified
# Population mortality… ec2af2be-8… hospital-st… c88a5231-… http… 2022-05-10T0…
# GP Practice Populati… 2c701f90-c… gp-practice… e3300e98-… http… 2022-05-10T0…
# GP Practice Populati… d07debcf-7… gp-practice… e3300e98-… http… 2022-02-07T1…
# GP Practice Populati… 4a3c438b-2… gp-practice… e3300e98-… http… 2021-11-02T1…
# GP Practice Populati… 0779e100-1… gp-practice… e3300e98-… http… 2022-02-17T1…
# … with 45 more rows


Search strings are case insensitive. The example above, all_resources(resource_contains = "european"), would return resources whose names contain ‘EUROPEAN’, ‘European’, or ‘european’.
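
As a rough illustration of what case-insensitive matching means, the sketch below uses base R's grepl() on a small vector of names. The vector is made up for the example (only ‘European Standard Population’ appears in this document), and this is not necessarily how odns performs the match internally.

```r
# Hypothetical resource names, differing only in case where they match
resource_names <- c(
  "European Standard Population",
  "EUROPEAN standard population",
  "World Standard Population"
)

# ignore.case = TRUE makes "european" match any capitalisation
matches <- grepl("european", resource_names, ignore.case = TRUE)

# Keep only the names that matched
print(resource_names[matches])
```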

Further information and metadata

Let's say that we are interested in the ‘European Standard Population’ resource within the ‘standard-populations’ package. We can view the metadata for both the package and the resource.

#' view metadata for a package using a valid package name

> package_metadata(package = "standard-populations")

# $nhs_language
# [1] "English"
# 
# $license_title
# [1] "UK Open Government Licence (OGL)"
# 
# $maintainer
# [1] ""
# 
# $version
# [1] ""
#
#...

#' view metadata for a resource using a valid resource id

> resource_metadata(resource = "edee9731-daf7-4e0d-b525-e4c1469b8f69")

#   id                          type   
#   _id                         int    
#   AgeGroup                    text   
#   EuropeanStandardPopulation  numeric


The package metadata contains useful information such as the time it was last modified, the publisher, a description, and notes.

The resource metadata provides details of the available fields and their types. This information is particularly useful when putting together a query where we want to return only a subset of data.
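
One way to use the field list is to check the names we intend to request before building a query. The sketch below assumes the metadata is available as a data frame with id and type columns, as in the printed output above. The table is typed in by hand here rather than fetched, and the field ‘Sex’ is a hypothetical name included only to show a mismatch.

```r
# Hand-written copy of the resource metadata shown above
metadata <- data.frame(
  id   = c("_id", "AgeGroup", "EuropeanStandardPopulation"),
  type = c("int", "text", "numeric")
)

# Fields we would like to request; "Sex" is hypothetical and not in
# the metadata, so it should be flagged.
wanted <- c("AgeGroup", "EuropeanStandardPopulation", "Sex")

# Names that do not exist in the resource
missing_fields <- setdiff(wanted, metadata$id)
print(missing_fields)

# Names that are safe to pass on
valid_fields <- intersect(wanted, metadata$id)
print(valid_fields)
```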

Importing data to R

We can import all of the resources within a package using get_resource(). This is often the quickest and simplest way to import data where all resources within one or more packages are required.



#' get all resources in a package

> get_resource(package = "4dd86111-7326-48c4-8763-8cc4aa190c3e")

#' get the first 10 rows of each resource in a package

> get_resource(package = "4dd86111-7326-48c4-8763-8cc4aa190c3e", limit = 10L)

#' both package IDs and names can be used

> get_resource(package = "standard-populations", limit = 10L)

#' multiple packages can be specified returning all resources under each

> get_resource(package = c("standard-populations", "population-projections"))


We can also use the resource argument of get_resource() to import a specific resource within a package. This is often the simplest way to get a single resource in its entirety.

#' get specific resources

> get_resource(
    resource = c("European Standard Population",
    "9e00b589-817e-45e6-b615-46c935bbace0"),
    limit = 5L
    )
    
#' get a specific resource, if it exists within a specified package

> get_resource(
    package = "standard-populations",
    resource = "European Standard Population"
    )
 


get_resource() always returns a list, even when only one resource is being imported. Where multiple resources have been imported, each resource is its own list element.
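
Because the result is always a list, a single resource still has to be extracted before it can be used as a table. The sketch below uses a hand-built stand-in for the returned object, so it runs without a network connection. It assumes each list element is a data frame, and it indexes by position because whether the elements are named is not stated here.

```r
# Stand-in for the value returned by get_resource(): a list with one
# data frame per imported resource. The values are the first rows shown
# in this document; the real object would come from the API.
resources <- list(
  data.frame(
    AgeGroup = c("0-4 years", "5-9 years", "10-14 years"),
    EuropeanStandardPopulation = c(5000, 5500, 5500)
  )
)

# Even a single resource arrives wrapped in a list, so extract it
# with [[ ]] before treating it as a data frame.
esp <- resources[[1]]
print(is.data.frame(esp))

# Row counts for every imported resource
row_counts <- vapply(resources, nrow, integer(1))
print(row_counts)
```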

Where more granular control is desired over the data imported, the get_data() function can be used. get_data() allows us to import selected fields and to filter the data. If we want to import the fields ‘AgeGroup’ and ‘EuropeanStandardPopulation’ from the ‘European Standard Population’ resource we can achieve this with get_data().

#' import specified fields from a resource

> get_data(
     resource = "edee9731-daf7-4e0d-b525-e4c1469b8f69",
     fields = c("AgeGroup", "EuropeanStandardPopulation")
   )

# AgeGroup     EuropeanStandardPopulati…
# 0-4 years                         5000
# 5-9 years                         5500
# 10-14 years                       5500
# 15-19 years                       5500
# 20-24 years                       6000
# … with 14 more rows


The where argument of get_data() can be used to exert further control over the data we import. If we only want to retrieve rows from ‘European Standard Population’ where the age group is 45-49 years, we can write a SQL-style where query to achieve this.

#' import specified fields from a data set utilising a SQL style where query

> get_data(
    resource = "edee9731-daf7-4e0d-b525-e4c1469b8f69",
    fields = c("AgeGroup", "EuropeanStandardPopulation"),
    where = "\"AgeGroup\" = \'45-49 years\'"
  )
  
# AgeGroup    EuropeanStandardPopulation
# 45-49 years                       7000


The where argument of get_data() requires specific formatting for compatibility with the CKAN API. Field names must be double quoted ("), and non-numeric values must be single quoted ('). Double quotes inside the R string must be escaped with a backslash, and single quotes may be escaped in the same way, for example: where = "\"AgeGroup\" = \'45-49 years\'".
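
Writing the escapes by hand is easy to get wrong, so a small helper can assemble the clause instead. build_where() below is a hypothetical function written for this example and is not part of odns. It handles only a single equality condition and does not escape quote characters that appear inside the field name or value.

```r
# Hypothetical helper (not part of odns): builds a single
# "field" = 'value' condition in the format described above.
build_where <- function(field, value) {
  # Field names are wrapped in double quotes
  quoted_field <- paste0("\"", field, "\"")

  if (is.numeric(value)) {
    # Numeric values are left unquoted
    quoted_value <- format(value, scientific = FALSE, trim = TRUE)
  } else {
    # Non-numeric values are wrapped in single quotes
    quoted_value <- paste0("'", value, "'")
  }

  paste(quoted_field, "=", quoted_value)
}

age_clause <- build_where("AgeGroup", "45-49 years")
pop_clause <- build_where("EuropeanStandardPopulation", 7000)

# cat() shows the strings as they would be sent, without R's escapes
cat(age_clause, "\n")
cat(pop_clause, "\n")
```

The resulting string could then be passed as the where argument, for example where = build_where("AgeGroup", "45-49 years") in the get_data() call shown earlier.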