I’m a big fan of Google Cloud Platform. It can help your work in many ways from data storage (Google Big Query, Storage Bucket) to automate tasks (Cloud Scheduler) and a lot of other tasks.
Believe me, it is a rabbit hole. GCP has a lot of tools you can use in your daily tasks.
In this post, I go through how you can connect to automate your R tasks and save the result to Google Cloud Stora.
The Background
We need a few things. First of all, a simple task we would like to run on schedule. And we would like to make a cron job and finally save the output to GCS.
Task is to save the current weather data every hour. We use this API: https://openweathermap.org/current– I know it is not the best way to get weather data, but keep in mind it is just a demo, you can subsitute the task 🙂
We would like to run the API call every hour and save the output to Google Cloud Storage as a csv file to use it later.
Weather API call with R
First, sign up and get your API key. This service is free, until you reach the limit. Under 1 million calls/month the service is free, see more: https://openweathermap.org/price
If you are eager as I was right after the registration, you face this error message on the first api call:
{"cod":401, "message": "Invalid API key. Please see http://openweathermap.org/faq#error401 for more info."}
Don’t worry, it just take time to activate your api key:
Your API key is not activated yet. Within the next couple of hours, it will be activated and ready to use.
https://openweathermap.org/faq#error401
After the activation, you can make your first api call. I made it for Budapest:
http://api.openweathermap.org/data/2.5/weather?q=Budapest&appid={my_app_id}
The result is:
{"coord":{"lon":19.0399,"lat":47.498},"weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"01n"}],"base":"stations","main":{"temp":274.79,"feels_like":269.86,"temp_min":273.15,"temp_max":275.93,"pressure":1011,"humidity":86},"visibility":10000,"wind":{"speed":4.12,"deg":300},"clouds":{"all":0},"dt":1615953695,"sys":{"type":1,"id":6663,"country":"HU","sunrise":1615956774,"sunset":1615999891},"timezone":3600,"id":3054643,"name":"Budapest","cod":200}
It contains a lot of data in JSON, but we are able to get the data in XML and HTML, too.
For this example only the temp data is necessary. But it is Kelvin by default. Only one modification need to make it Celsius (or Fahrenheit). Change the unit with this parameter:
&units=metric
If you would like to use Fahrenheit, jut replace the metric to imperial.
The whole (and very complex 🙂 ) api call looks like this:
http://api.openweathermap.org/data/2.5/weather?q=Budapest&appid={my_app_id}&units=metric
And the result is:
{"coord":{"lon":19.0399,"lat":47.498},"weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"01n"}],"base":"stations","main":{"temp":1.48,"feels_like":-3.48,"temp_min":-1,"temp_max":2.78,"pressure":1011,"humidity":86},"visibility":10000,"wind":{"speed":4.12,"deg":310},"clouds":{"all":0},"dt":1615954501,"sys":{"type":1,"id":6663,"country":"HU","sunrise":1615956774,"sunset":1615999891},"timezone":3600,"id":3054643,"name":"Budapest","cod":200}
We know how the api works, just make the api call in R
The R Code
For the weather api call we only have to use 1 library, the ‘httr’
library(httr)
It makes the http requests, let’s how.
weatherURL <- "http://api.openweathermap.org/data/2.5/weather"
weatherAppID <- "123myappid456"
weatherData <- GET(weatherURL,
query = list(
q = "Budapest",
appid = weatherAppID,
units = "metric",
mode="xml"
))
It is just a simple GET request, similar what we use in the AdForm API tutorial (but it is a POST request)
The first line is the URL, it stored in a variable (weatherURL).
The second parameter in the GET request is the query. This is the part of the whole URL after the ‘?’: q=Budapest&appid={my_app_id}&units=metric
The query contains everything we need to get the data:
- q – the name of the city
- appid – the app ID, I stored in a variable
- units – royal (Fahrenheit) or metric (Celsius), if you don’t add it the default value is Kelvin
- mode – the output of the GET request, I choose XML, but if you don’t use it the output is JSON by default. For me XML output is suitable and the whole tutorial based on XML output
If everything works fine you get something like this:
Response [http://api.openweathermap.org/data/2.5/weather?q=Budapest&appid=123myappid456&units=metric&mode=xml]
Date: 2021-03-20 07:51
Status: 200
Content-Type: application/xml; charset=utf-8
Size: 860 B
<BINARY BODY>
So far it is not so informative, let’s encrypt it 🙂
Read the XML output
Since the output is XML we should call a new library XML (what a big surprise)
library(XML)
You can read the description of the package here: https://cran.r-project.org/web/packages/XML/index.html
First of all, for better understanding and cleaner code go back to the GET request and save the whole request in a variable, like this:
weatherData <- GET(weatherURL,
query = list(
q = "Budapest",
appid = weatherAppID,
units = "metric",
mode="xml"
))
The ‘weatherData’ stores the XML output, now we can encrypt it:
xmlParse(weatherData)
This simple line of code make the XML output readable in R. The output looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<current>
<city id="3054643" name="Budapest">
<coord lon="19.0399" lat="47.498"/>
<country>HU</country>
<timezone>3600</timezone>
<sun rise="2021-03-20T04:46:48" set="2021-03-20T16:55:46"/>
</city>
<temperature value="2.36" min="1.67" max="2.78" unit="celsius"/>
<feels_like value="-2.88" unit="celsius"/>
<humidity value="69" unit="%"/>
<pressure value="1021" unit="hPa"/>
<wind>
<speed value="4.12" unit="m/s" name="Gentle Breeze"/>
<gusts/>
<direction value="100" code="E" name="East"/>
</wind>
<clouds value="75" name="broken clouds"/>
<visibility value="10000"/>
<precipitation mode="no"/>
<weather number="803" value="broken clouds" icon="04d"/>
<lastupdate value="2021-03-20T07:51:22"/>
</current>
The next step is to convert it as a list, it is easy with this line of code:
xmlToList(xmlParse(weatherData))
Just add ‘xmlToList’ and a xml changes to a simple list:
$city $city$coord lon lat "19.0399" "47.498" $city$country [1] "HU" $city$timezone [1] "3600" $city$sun rise set "2021-03-20T04:46:48" "2021-03-20T16:55:46" $city$.attrs id name "3054643" "Budapest" $temperature value min max unit "2.36" "1.67" "2.78" "celsius" $feels_like value unit "-2.88" "celsius" $humidity value unit "69" "%" $pressure value unit "1021" "hPa" $wind $wind$speed value unit name "4.12" "m/s" "Gentle Breeze" $wind$gusts NULL $wind$direction value code name "100" "E" "East" $clouds value name "75" "broken clouds" $visibility value "10000" $precipitation mode "no" $weather number value icon "803" "broken clouds" "04d" $lastupdate value "2021-03-20T07:51:22"
This is a big nested list, contains a lot of useful data, but we only need the temperature:
xmlToList(xmlParse(weatherData))["temperature"]
The output is a list, since the whole list is a nested list.
$temperature value min max unit "0.3" "-3" "2.78" "celsius"
We have two options. The first one is to use this output and retrive the data we would like to use. This is tha ‘value’ parameter, so get it with this line:
xmlToList(xmlParse(weatherData))["temperature"][[1]][[1]]
The output is the value itself:
[1] "0.3"
If you would like to use other data than the value, maybe you should convert the whole ‘temperature’ list as a dataFrame. For me it is a bit easier:
as.data.frame(xmlToList(xmlParse(weatherData))["temperature"]) temperature value 2.36 min 1.67 max 2.78 unit celsius
But now we only would like to use the ‘value’ which is the actual temperature on the given city.
Create a new data frame with the actual date and the value. It’s pretty straighforward, just get the items (date and the value). But first we have to create a date variable.
dateForDf <- Sys.Date()
data.frame(date = dateForDf,
value = actualTemp)
The result is a simple data frame with the actual date and value:
date value 1 2021-03-26 0.3
Convert it as a csv
We have the data, we should convert it as a csv file we can upload to Google Cloud Storage
fileName <- paste("weather_data_",dateForDf, ".csv", sep = "")
write.csv(weatherDataDF, file = fileName, row.names=FALSE)
The first line generates the filename. It just add the weather_data_ prefix before the actual date and the .csv. It ends up a filename like this: weather_data_20210229.csv.
The script runs every day, so every day it generates a file with the actual date on the name and of course the data is also up to date 🙂
The second line converts the weatherDataDF (it contains the actual data), to a csv file with the filename we generated on the first line.
The ‘row.names=FALSE’ parameter drop the row names. In this example the row names are just row indexes which is useless for now, so just drop it.
If you run this code, you’ll get the csv in the working directory.
Next Steps
We have the data for the automation. In the next post we’ll go through how to upload it to Google Cloud Storage, schedule it to run every morning and load the csv to Google Big Query.
The scheduled automation of the R script and the BigQuery setup coming on the next posts.
Is there going to be a part 2 to this post? Looking to do a similar thing – thanks
Hi Charles, yes, it is on the roadmap 🙂
If you have specific questions, ideas, just let me know and I’ll add it