January, 2017

Quick Recap!

Apache Spark

What is the most used keyword or library on the web?

  • 1 trillion pages
  • Sampling of 0.01%
  • 100 machines with 10GB RAM (rough arithmetic below)
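
Putting those figures together, a quick back-of-envelope in R, using only the numbers from the bullets above:

total_pages <- 1e12                         # ~1 trillion pages on the web
sample_rate <- 0.0001                       # a 0.01% sample
total_pages * sample_rate                   # ~100 million sampled pages
total_pages * sample_rate / 100             # ~1 million pages per machine on 100 machines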

Apache Spark is a fast and general engine for large-scale data processing, with support for in-memory datasets.

sparklyr - R interface for Spark

library(sparklyr)                           # Load sparklyr
spark_install()                             # Install Apache Spark
sc <- spark_connect(master="local")         # Connect to local instance
library(dplyr)                              # Data Manipulation Grammar
mtcars_tbl <- copy_to(sc, mtcars)           # Copy mtcars into Spark
mtcars_tbl %>% summarize(n = n())           # Count records
mtcars_tbl %>% ml_linear_regression(        # Perform linear regression
  response = "mpg",                         # Response vector
  features = c("wt", "cyl"))                # Features for the model fit
library(DBI)                                # R Database Interface
dbGetQuery(sc, "SELECT * FROM mtcars")      # Run SQL query in Spark
invoke(spark_context(sc), "version")        # Run sc.version in Scala
compile_package_jars()                      # Compile Scala code

sparklyr - 0.5 on CRAN!

  • New functions: sdf_quantile(), ft_tokenizer() and ft_regex_tokenizer() (see the sketch after this list).
  • Improved compatibility: na.action, dim(), nrow() and ncol().
  • Extended dplyr: do() and n_distinct().
  • Experimental Livy support.
  • Improved connections.
  • Certified with Cloudera.
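
A minimal sketch of two of these additions, reusing sc and mtcars_tbl from the recap above (exact signatures may vary slightly across sparklyr versions):

mtcars_tbl %>%                              # From the mtcars Spark table
  summarize(gears = n_distinct(gear))       # count distinct gear values (extended dplyr)
sdf_quantile(mtcars_tbl, "mpg")             # approximate quantiles of mpg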

Queries with sparklyr, at scale!

Connecting to Spark

devtools::install_github("javierluraschi/sparkwarc") # Install sparkwarc from GitHub

library(sparkwarc)                                   # sparklyr extension to read warcs
library(sparklyr)                                    # load sparklyr
library(dplyr)                                       # load dplyr
library(DBI)                                         # load DBI

config <- spark_config()                             # Create a config to tune memory
config[["sparklyr.shell.driver-memory"]] <- "10G"    # Set driver memory to 10GB

sc <- spark_connect(                                 # Connect to Spark
  master = "local",                                  # using local cluster
  version = "2.0.1",                                 # and Spark 2.0.1
  config = config)                                   # setting custom configs
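
Once connected, a quick sanity check; both helpers below are standard sparklyr functions:

spark_version(sc)                                    # Confirm the Spark version in use
spark_web(sc)                                        # Open the Spark web UI in a browser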

Importing from CommonCrawl

warc_small <- system.file("samples/sample.warc",    # Find a sample.warc file 
                          package = "sparkwarc")    # from the sparkwarch package

warc_big <- "cc.warc.gz"                            # Name a 5GB warc file
if (!file.exists(warc_big))                         # If the file does not exist
  download.file(                                    # download by
    gsub("s3n://commoncrawl/",                      # mapping the S3 bucket url
         "https://commoncrawl.s3.amazonaws.com/",   # into a adownloadable url
         sparkwarc::cc_warc(1)), warc_big)          # from the first archive file

spark_read_warc(                                    # Read the warc file
  sc,                                               # into the sc Spark connection
  "warc",                                           # save into 'warc' table
  warc_big,                                         # load warc_big (or warc_small)
  repartition = 8,                                  # maximize cores
  parse = TRUE                                      # load tags as table
)
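
To sanity-check what parse = TRUE produced, peek at a few parsed rows; the tag, attribute, value and original columns are the ones the queries below rely on:

tbl(sc, "warc") %>%                                 # From the CommonCrawl table
  head(10)                                          # inspect the first parsed rows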

Counting HTML Tags and Attributes

tbl(sc, "warc") %>%                                 # From the CommonCrawl table
  summarize(count = n())                            # count html attributes and tags

Total Pages

tbl(sc, "warc") %>%                                 # From the CommonCrawl table
  filter(tag == "WARC-Target-URI") %>%              # keep only the URLs wark tag
  summarize(total = n())                            # get the count

Top Sites

top_sites <- tbl(sc, "warc") %>%                    # From the CommonCrawl table
  filter(tag == "WARC-Target-URI") %>%              # over the URLs in the warc file
  transmute(site = regexp_extract(                  # use a regular expression
    value,                                          # over the value of the URL
    ".*://(www\\.)?([^/]+).*",                      # to extract only the domain name
    2)) %>%                                         # which is found in the 2nd (),
  group_by(site) %>%                                # then group by site
  summarize(count = n()) %>%                        # and count pages per site
  arrange(desc(count))                              # arranged top to bottom

Top Sites

top_sites
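
Note that top_sites is still a lazy Spark table: printing it runs the query and shows only the first rows. To pull results back into R, collect them (a minimal sketch):

top_sites %>%                                       # From the lazy Spark query
  head(10) %>%                                      # keep the top 10 sites
  collect()                                         # and bring them into an R data frame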

What is the most used library?

top_libraries <- tbl(sc, "warc") %>%                # From the CommonCrawl table
  filter(tag == "script",                           # over the <script> tags
         attribute == "src") %>%                    # that have a src attribute,
  transmute(library = regexp_extract(               # use a regular expression
    value,                                          # over the value of the src
    "[^/]+[.]js", 0)) %>%                           # to extract library.js,
  group_by(library) %>%                             # then group by library
  summarize(count = n()) %>%                        # and get the total count
  arrange(desc(count)) %>%                          # arranged top to bottom
  filter(library != "",                             # excluding empty entries
         length(library) < 30)                      # or entries that are long

What is the most used library?

top_libraries

What is the most used keyword?

top_keywords <- tbl(sc, "warc") %>%                 # From the CommonCrawl table
  filter(tag == "meta",                             # over the <meta> tags
         attribute == "content",                    # with a content attribute
         original %like% "%keywords%") %>%          # and a keywords attribute,
  transmute(keyword = explode(                      # make each element a row
    split(value, ",")) ) %>%                        # for words between commas,
  transmute(keyword = trim(keyword)) %>%            # removing whitespaces and
  filter(keyword != "") %>%                         # excluding empty entries,
  group_by(keyword) %>%                             # then group by keyword
  summarize(count = n()) %>%                        # and get the total count
  arrange(desc(count)) %>%                          # arranged top to bottom
  filter(!keyword %like% "%�%",                    # excluding mis-encoded entries
         length(keyword) < 30)                      # or entries that are too long

What is the most used keyword?

top_keywords

Running at scale in EMR
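
The same pipeline can run against a full YARN cluster on EMR instead of a local master. A minimal sketch, assuming sparklyr and sparkwarc are installed on the cluster; the executor setting is illustrative, not from the original slides:

config <- spark_config()                            # Create a config to tune memory
config[["spark.executor.memory"]] <- "10G"          # Illustrative executor memory
sc <- spark_connect(master = "yarn-client",         # Connect to Spark through YARN
                    config = config)                # with the custom config
spark_read_warc(sc, "warc",                         # Read WARC archives directly
                sparkwarc::cc_warc(1),              # from the first CommonCrawl file
                parse = TRUE)                       # loading tags as a table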