Uptake’s Data Science Team Releases Elasticsearch Client For R

Open source is at the heart of what we do at Uptake, so we’re thrilled to announce that this month a few of our data scientists contributed back to the community by releasing uptasticsearch, an R package that provides an Elasticsearch client tailored specifically to data science workflows.

Elasticsearch is an open source, distributed, highly available document store built on top of a world-class open source search engine called Apache Lucene. It is a powerful technology for digital applications because it adapts to changing data and offers an expressive query language for asking complicated questions of that data. It’s a popular member of the NoSQL family that is explicitly designed for fast search and aggregation over semi-structured text.

uptasticsearch provides an interface for Elasticsearch that is explicitly designed to make data science workflows easy and fun. Our goal was to reduce as much friction as possible between data scientists and their data. Other R Elasticsearch clients expose an overwhelming number of the database’s features yet pay only passing attention to the common data science workflow of unpacking many pages of potentially nested JSON documents into a format that is friendly for statistical analysis. We took a different approach: we sought to create a simple interface that lets you focus on your data, rather than complexities like pagination and nested-result parsing. This package is truly by data scientists, for data scientists.

uptasticsearch supports retrieval and parsing of two types of requests, “search” and “aggregations”. Below, we show some examples of interacting with an Elasticsearch index that holds daily logs of user activity on a gaming site.

Sample Data:

data.json

[
  {
    "_source": {
      "dateTime": "2017-01-01",
      "userName": "Gauss",
      "details": {
        "interactions": 400,
        "userType": "active",
        "appData": [
          {"appName": "farmville","minutes": 500},
          {"appName": "candy_crush","minutes": 350},
          {"appName": "angry_birds","minutes": 422}
        ]
      }
    }
  },
  {
    "_source": {
      "dateTime": "2017-02-02",
      "userName": "Will Hunting",
      "details": {
        "interactions": 5,
        "userType": "very_active",
        "appData": [
          {"appName": "minesweeper","minutes": 28},
          {"appName": "pokemon_go","minutes": 190},
          {"appName": "pokemon_stay","minutes": 1},
          {"appName": "block_dude","minutes": 796}
        ]
      }
    }
  }
]

Example 1: Search

This type of request returns a batch of raw records matching the user’s query. The query below says “give me all the records dated on or after January 1, 2017”.

query1.json

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "dateTime": {
              "gte": "2017-01-01"
            }
          }
        }
      ]
    }
  }
}
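
If you would rather not keep the query in a separate file, the same JSON can be written inline as an R string. The snippet below is a minimal sketch of that alternative; the `compact_query` variable is only an illustrative sanity check and is not something uptasticsearch requires.

```r
# The same query as query1.json, written inline. Single quotes around the
# R string mean the JSON double quotes need no escaping.
SEARCH_QUERY <- '{
  "query": {
    "bool": {
      "must": [
        {"range": {"dateTime": {"gte": "2017-01-01"}}}
      ]
    }
  }
}'

# Illustrative check: strip whitespace to see the query on one line
compact_query <- gsub("\\s+", "", SEARCH_QUERY)
print(compact_query)
```

A string built this way can be passed as `query_body` in place of the result of `readLines()`.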

get_user_data.R

# Load dependencies
library(uptasticsearch)
library(data.table)
 
# Read in your query (could be specified as an R string instead)
SEARCH_QUERY <- paste0(readLines("query1.json"), collapse = "")

# Execute with uptasticsearch
resultDT <- uptasticsearch::es_search(es_host = "http://mydb.mycompany.com:9200"
                                      , es_index = "gameplay"
                                      , query_body = SEARCH_QUERY)
# Unpack arrays
resultDT <- uptasticsearch::unpack_nested_data(resultDT, "details.appData")
 
#         appName minutes   dateTime     userName details.interactions details.userType
# 1:    farmville     500 2017-01-01        Gauss                  400           active
# 2:  candy_crush     350 2017-01-01        Gauss                  400           active
# 3:  angry_birds     422 2017-01-01        Gauss                  400           active
# 4:  minesweeper      28 2017-02-02 Will Hunting                    5      very_active
# 5:   pokemon_go     190 2017-02-02 Will Hunting                    5      very_active
# 6: pokemon_stay       1 2017-02-02 Will Hunting                    5      very_active
# 7:   block_dude     796 2017-02-02 Will Hunting                    5      very_active
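
To make the unpacking step concrete, here is a rough base R sketch that performs the same reshaping by hand on the two sample documents. It is an illustration only, not how uptasticsearch implements `unpack_nested_data`; the `docs` list and the `unpack_doc` helper are hypothetical names introduced for this example.

```r
# Two documents mirroring data.json, written as nested R lists
docs <- list(
  list(dateTime = "2017-01-01", userName = "Gauss",
       details = list(interactions = 400, userType = "active",
                      appData = list(
                        list(appName = "farmville", minutes = 500),
                        list(appName = "candy_crush", minutes = 350),
                        list(appName = "angry_birds", minutes = 422)))),
  list(dateTime = "2017-02-02", userName = "Will Hunting",
       details = list(interactions = 5, userType = "very_active",
                      appData = list(
                        list(appName = "minesweeper", minutes = 28),
                        list(appName = "pokemon_go", minutes = 190),
                        list(appName = "pokemon_stay", minutes = 1),
                        list(appName = "block_dude", minutes = 796))))
)

# Hypothetical helper: emit one row per appData entry and repeat the
# document-level fields alongside it
unpack_doc <- function(doc) {
  apps <- doc$details$appData
  data.frame(
    appName = vapply(apps, function(a) a$appName, character(1)),
    minutes = vapply(apps, function(a) a$minutes, numeric(1)),
    dateTime = doc$dateTime,
    userName = doc$userName,
    details.interactions = doc$details$interactions,
    details.userType = doc$details$userType,
    stringsAsFactors = FALSE
  )
}

# Stack the per-document rows into one long table
unpackedDF <- do.call(rbind, lapply(docs, unpack_doc))
print(unpackedDF)
```

Each nested array element becomes its own row, which is the long format most statistical tools in R expect.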

Example 2: Aggregation

In this type of request, Elasticsearch returns a summary of the underlying data. The query below says “compute some summary statistics on the number of interactions with our platform, bucketed by week”.

query2.json

{
  "aggs": {
    "report_week": {
      "date_histogram": {
        "field": "dateTime",
        "interval": "week"
      },
      "aggs": {
        "interaction_stats": {
          "stats": {
            "field": "details.interactions"
          }
        }
      }
    }
  }
}

bin_customers.R

# Load dependencies
library(uptasticsearch)
library(data.table)
 
# Read in your query
# (could be specified as an R string instead)
SEARCH_QUERY <- paste0(readLines("query2.json"), collapse = "")

# Execute with uptasticsearch
statsDT <- uptasticsearch::es_search(es_host = "http://mydb.mycompany.com:9200"
                                     , es_index = "gameplay"
                                     , query_body = SEARCH_QUERY)
 
# Results
#     report_week intstats.count intstats.min intstats.max intstats.avg intstats.sum doc_count
#  1:  2017-02-27         201674            0            3     1.572528       317138    201674
#  2:  2017-03-06         295596            0            7     1.565011       462611    295596
#  3:  2017-03-13         277618            0            7     1.555738       431901    277618
#  4:  2017-03-20         259233            0            7     1.548264       401361    259233
#  5:  2017-03-27         265538            0            7     1.543233       409787    265538
#  6:  2017-04-03         299502            0            7     1.539489       461080    299502
#  7:  2017-04-10         303826            0            7     1.539927       467870    303826
#  8:  2017-04-17         305400            0            3     1.534974       468781    305400
#  9:  2017-04-24         325883            0            3     1.506403       490911    325883
# 10:  2017-05-01          92953            0            3     1.538143       142975     92953
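
For intuition about what a weekly histogram with a nested stats aggregation computes, the base R sketch below reproduces the idea locally on the two sample documents from data.json. It assumes weekly buckets begin on Monday and only lists weeks that contain data; the `logs` and `weekly_stats` names are hypothetical, and nothing here calls Elasticsearch or uptasticsearch.

```r
# Document-level fields from the two sample documents in data.json
logs <- data.frame(
  dateTime = as.Date(c("2017-01-01", "2017-02-02")),
  interactions = c(400, 5)
)

# Assumption: each weekly bucket is labeled by the Monday that starts it.
# "%u" is the ISO weekday number (Monday = 1 through Sunday = 7).
weekday_num <- as.integer(format(logs$dateTime, "%u"))
logs$report_week <- logs$dateTime - (weekday_num - 1L)

# Group interaction counts by week, then compute count, min, max, avg, sum
week_groups <- split(logs$interactions, format(logs$report_week))
weekly_stats <- do.call(rbind, lapply(names(week_groups), function(wk) {
  x <- week_groups[[wk]]
  data.frame(report_week = wk, count = length(x), min = min(x),
             max = max(x), avg = mean(x), sum = sum(x))
}))
print(weekly_stats)
```

On a real index, Elasticsearch performs this grouping server-side, so only the small summary table travels back to R.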

All major versions of Elasticsearch are supported, and the package has been tested on various Linux, Mac and Windows operating systems. You can download the most recent stable release of uptasticsearch directly from CRAN. Installation instructions for the development version are available on GitHub. Come on over to our repo on GitHub to submit issues and pull requests!

Happy Coding,
Uptake Data Science Team