Error: parse error: trailing garbage


I’m using a package of learnr tutorials for teaching that I wrote (https://github.com/profandyfield/discovr ).

Some students intermittently get the following error when they click ‘start tutorial’.

Error: parse error: trailing garbage

          ":{},"value":["0.10.1"]}]}]} {"type":"list","attributes":{},

                     (right here) ------^

Execution halted

I had similar issues last year as well, which remained unresolved. I can’t post a reproducible example because we can’t reproduce it: we can’t see any consistent predictors of when it happens or for whom. It feels impossible to debug, but I also don’t really understand what happens when someone clicks the ‘start tutorial’ button (other than that it renders my Rmd into a tutorial), or how it uses a stored HTML file when a tutorial starts, or how a stored HTML file might become corrupt.

We have students who, for example, load a tutorial successfully using the RStudio tutorials pane, but then the next time they load it they get this error. Based on my explorations last year I assume it’s an error with the HTML rendering. Consistent with this assumption, if we (a) manually delete the HTML of the problematic tutorial from the package library on their system, or (b) force a reinstall (which also clears the HTML files), the problem goes away. However, it can return (for the same student).

I don’t know if this is a bug, and I appreciate (really I do, having tried to work it out for over a year) how impossible it is to offer a solution when we can’t reliably reproduce the issue, but I would really appreciate some pointers about why this might be happening: some options to explore, or a better understanding of the rendering process and why it might fail when it has succeeded before. Could this be down to user input in a previous session using the tutorial? Could it be to do with strange text in my hints? Is there something I can do within the tutorial to prevent it (I currently don’t error-catch any code that students type into exercises)? Any useful pointers?
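For what it's worth, the manual workaround described above can be scripted. This is only a sketch, under the assumption that learnr caches a pre-rendered .html next to the tutorial's .Rmd inside the installed package; the helper name and the example tutorial directory are mine, not learnr's:

```r
# Hypothetical helper: delete cached pre-rendered HTML in a tutorial
# directory so learnr re-renders from the .Rmd on the next launch.
clear_tutorial_html <- function(tut_dir) {
  html_files <- list.files(tut_dir, pattern = "\\.html$", full.names = TRUE)
  unlink(html_files)      # remove the (possibly corrupt) cached copies
  invisible(html_files)   # return, invisibly, what was deleted
}

# usage (assumed layout; tutorial name is made up):
# clear_tutorial_html(system.file("tutorials/discovr_02", package = "discovr"))
```

This is just the scripted equivalent of option (a) above, without forcing a full reinstall.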

I have a log file that looks like this.
I'm trying to parse the JSON in the Message column by:
library(readr)
library(jsonlite)
df <- read_csv("log_file_from_above.csv")
fromJSON(as.character(df$Message))
But, I'm hitting the following error:
Error: parse error: trailing garbage
"isEmailConfirmed": false } { "id": -1, "firstName":
(right here) ------^
How can I get rid of the "trailing garbage"?

fromJSON() isn't "apply"ing over the character vector; it's trying to convert the whole thing to a data frame at once. You can try
purrr::map(df$Message, jsonlite::fromJSON)
(what @Abdou provided), or
jsonlite::stream_in(textConnection(gsub("\n", "", df$Message)))
The latter two will create data frames. The first will create a list you can add as a column.
You can use the last method with dplyr::bind_cols to make a new data frame with all the data:
dplyr::bind_cols(df[,1:3],
jsonlite::stream_in(textConnection(gsub("\n", "", df$Message))))
Also suggested by @Abdou is an almost pure base R solution:
cbind(df, do.call(plyr::rbind.fill, lapply(paste0("[",df$Message,"]"), function(x) jsonlite::fromJSON(x))))
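To see why the error occurs and why these approaches fix it, here's a minimal, self-contained illustration (the messages are made up): fromJSON() chokes as soon as a second object follows the first, while stream_in() treats each element as its own record:

```r
library(jsonlite)

# two log messages, each a complete JSON object on its own
msgs <- c('{"id": 1, "firstName": "John"}',
          '{"id": 2, "firstName": "Jane"}')

# fromJSON(paste(msgs, collapse = " ")) fails with "trailing garbage":
# the parser finishes the first object and then finds another one after it.

# parsing element-by-element works...
as_list <- lapply(msgs, fromJSON)

# ...as does treating the vector as ndjson (one record per element/line)
df <- stream_in(textConnection(msgs), verbose = FALSE)
```

The `paste0("[", ..., "]")` trick in the base R solution is the same idea applied per message.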
Full working workflow:
library(dplyr)
library(jsonlite)
df <- read.table("http://pastebin.com/raw/MMPMwNZv",
quote='"', sep=",", stringsAsFactors=FALSE, header=TRUE)
bind_cols(df[,1:3], stream_in(textConnection(gsub("\n", "", df$Message)))) %>%
glimpse()
## Found 3 records...
## Imported 3 records. Simplifying into dataframe...
## Observations: 3
## Variables: 19
## $ Id <int> 35054, 35055, 35059
## $ Date <chr> "2016-06-17 19:29:43 +0000", "2016-06-17 1...
## $ Level <chr> "INFO", "INFO", "INFO"
## $ id <int> -2, -1, -3
## $ ipAddress <chr> "100.100.100.100", NA, "100.200.300.400"
## $ howYouHearAboutUs <chr> NA, "Radio", NA
## $ isInterestedInOffer <lgl> TRUE, FALSE, TRUE
## $ incomeRange <int> 60000, 1, 100000
## $ isEmailConfirmed <lgl> FALSE, NA, TRUE
## $ firstName <chr> NA, "John", NA
## $ lastName <chr> NA, "Smith", NA
## $ email <chr> NA, "john.smith@gmail.com", NA
## $ city <chr> NA, "Smalltown", NA
## $ birthDate <chr> NA, "1999-12-10T05:00:00Z", NA
## $ password <chr> NA, "*********", NA
## $ agreeToTermsOfUse <lgl> NA, TRUE, TRUE
## $ visitUrl <chr> NA, NA, "https://www.website.com/?purpose=X"
## $ isIdentityConfirmed <lgl> NA, NA, FALSE
## $ validationResults <lgl> NA, NA, NA

Related

How to scrape a web table from a website using R

I am trying to scrape the table found from the following site: https://finance.yahoo.com/gainers?e=us
However, I have searched on here for several different methods to scrape tables from sites, and none of them have worked for me.
I have tried:
library(xml2)
url <- "https://finance.yahoo.com/gainers?e=us"
tbl <- read_html(url)
also:
library(XML)
url <- "https://finance.yahoo.com/gainers?e=us"
tbl <- readHTMLList(url)
and other packages such as rvest; however, I cannot get the table to show!
They store that data in a javascript block in the page itself. You have two choices: either use RSelenium and then deal with a really complex table that you have to unwind, or perform some string surgery combined with some JSON mangling after getting some help from the V8 package:
library(V8)
library(xml2)
library(rvest)
library(stringi)
library(purrr)
library(jsonlite)
library(ndjson)
pg <- read_html("https://finance.yahoo.com/gainers?e=us")
This next bit extracts the data from the salient <script> tag. I'm doing this by position, which means this is the first potential thing to break if Yahoo! ever changes its format (to, say, insert useless Verizon Wireless ads via javascript if VZ ever does finish buying them). You can change this to look for text indicators, but that leads to the second issue…
The second issue with "scraping" this way is that if the javascript holding the data ever changes, this will also break. However, that's also true for HTML-based scraping. Unlike RSelenium, this method doesn't require starting up a browser plus a control server to drive it. Anyway…
html_nodes(pg, "script")[13] %>%
html_text() %>%
stri_replace_first_fixed("(function (root) {", "var root = { App : {}};\n") %>%
stri_replace_last_fixed("}(this));", "") -> js
We had to do ^^ since the javascript code in the <script> tag was expecting to be in a browser (which we won't be). Now, we evaluate that large javascript chunk via V8 and extract the main data element:
ctx <- v8()
ctx$eval(JS(js))
root <- ctx$get("root", flatten=TRUE)
That data element holds all of the data for that page (which is really a single page app). So we have to find the data we care about which is really far down a nested javascript hole:
quotes <- root$App$main$context$dispatcher$stores$`QuoteDataStore-Immutable`$quoteData
There are numerous ways to turn that nested list data into a nice, rectangular data frame. The method below chooses "turn it into JSON, then bring it back from JSON in a completely flat format". Feel free to use other methods you can find on SO.
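The toJSON()-then-flatten round-trip can be illustrated with jsonlite alone on a toy nested list (field names borrowed from the output below; this stands in for one `quotes[[...]]` entry):

```r
library(jsonlite)

# a toy stand-in for one nested quote entry
q <- list(symbol = "ABC",
          marketCap = list(raw = 758267968, fmt = "758.268M"))

json <- toJSON(q, auto_unbox = TRUE)

# wrap in [ ] so fromJSON() simplifies to a one-row data frame;
# flatten = TRUE turns the nesting into dotted column names
# (symbol, marketCap.raw, marketCap.fmt)
flat <- fromJSON(paste0("[", json, "]"), flatten = TRUE)
```

ndjson::flatten() used below does the same kind of dotting, just faster and in one call.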
The code starts by telling R to ignore the non-stocks (since you, presumably, want just the quotes in the pretty table on that page):
discard(names(quotes), ~grepl("[\\^=]", .x)) %>%
map_df(~ndjson::flatten(toJSON(quotes[[.]]))) %>%
glimpse()
## Observations: 30
## Variables: 53
## $ averageDailyVolume3Month.fmt.0 <chr> "437,939", "541,801", "1.033M", "992,278", "1.40...
## $ averageDailyVolume3Month.longFmt.0 <chr> "437,939", "541,801", "1,033,453", "992,278", "1...
## $ averageDailyVolume3Month.raw.0 <dbl> 437939, 541801, 1033453, 992278, 1402175, 537906...
## $ exchange.0 <chr> "NYQ", "NMS", "NYQ", "NGM", "NGM", "NMS", "NGM",...
## $ exchangeTimezoneName.0 <chr> "America/New_York", "America/New_York", "America...
## $ exchangeTimezoneShortName.0 <chr> "EST", "EST", "EST", "EST", "EST", "EST", "EST",...
## $ fiftyTwoWeekHigh.fmt.0 <chr> "15.36", "46.12", "15.82", "7.64", "5.97", "9.86...
## $ fiftyTwoWeekHigh.raw.0 <dbl> 15.360, 46.120, 15.820, 7.640, 5.970, 9.860, 5.0...
## $ fiftyTwoWeekHighChange.fmt.0 <chr> "-6.96", "-17.72", "-4.19", "-2.83", "-4.29", "-...
## $ fiftyTwoWeekHighChange.raw.0 <dbl> -6.960, -17.720, -4.190, -2.830, -4.290, -3.640,...
## $ fiftyTwoWeekHighChangePercent.fmt.0 <chr> "-45.31%", "-38.42%", "-26.49%", "-37.04%", "-71...
## $ fiftyTwoWeekHighChangePercent.raw.0 <dbl> -0.4531, -0.3842, -0.2649, -0.3704, -0.7186, -0....
## $ fiftyTwoWeekLow.fmt.0 <chr> "6.59", "23.75", "9.46", "4.22", "0.25", "4.97",...
## $ fiftyTwoWeekLow.raw.0 <dbl> 6.590, 23.750, 9.460, 4.220, 0.251, 4.970, 0.970...
## $ fiftyTwoWeekLowChange.fmt.0 <chr> "1.81", "4.65", "2.17", "0.59", "1.43", "1.25", ...
## $ fiftyTwoWeekLowChange.raw.0 <dbl> 1.810, 4.650, 2.170, 0.590, 1.429, 1.250, 0.190,...
## $ fiftyTwoWeekLowChangePercent.fmt.0 <chr> "27.47%", "19.58%", "22.94%", "13.98%", "569.32%...
## $ fiftyTwoWeekLowChangePercent.raw.0 <dbl> 0.2747, 0.1958, 0.2294, 0.1398, 5.6932, 0.2515, ...
## $ fullExchangeName.0 <chr> "NYSE", "NasdaqGS", "NYSE", "NasdaqGM", "NasdaqG...
## $ gmtOffSetMilliseconds.0 <dbl> -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07...
## $ invalid.0 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isLoading.0 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ language.0 <chr> "en-US", "en-US", "en-US", "en-US", "en-US", "en...
## $ longName.0 <chr> "Bankrate, Inc.", "Air Methods Corp.", "Bristow ...
## $ market.0 <chr> "us_market", "us_market", "us_market", "us_marke...
## $ marketCap.fmt.0 <chr> "758.268M", "1.082B", "407.786M", "494.03M", "33...
## $ marketCap.longFmt.0 <chr> "758,267,968", "1,081,625,344", "407,786,176", "...
## $ marketCap.raw.0 <dbl> 758267968, 1081625344, 407786176, 494030272, 338...
## $ marketState.0 <chr> "CLOSED", "CLOSED", "CLOSED", "CLOSED", "CLOSED"...
## $ messageBoardId.0 <chr> "finmb_30061", "finmb_24494", "finmb_292980", "f...
## $ quoteType.0 <chr> "EQUITY", "EQUITY", "EQUITY", "EQUITY", "EQUITY"...
## $ regularMarketChange.fmt.0 <chr> "1.20", "3.65", "1.44", "0.59", "0.27", "0.80", ...
## $ regularMarketChange.raw.0 <dbl> 1.20, 3.65, 1.44, 0.59, 0.27, 0.80, 0.15, 0.04, ...
## $ regularMarketChangePercent.fmt.0 <chr> "16.67%", "14.75%", "14.13%", "13.98%", "19.15%"...
## $ regularMarketChangePercent.raw.0 <dbl> 16.6667, 14.7475, 14.1315, 13.9810, 19.1489, 14....
## $ regularMarketDayHigh.fmt.0 <chr> "9.90", "30.25", "12.60", "5.29", "1.85", "6.30"...
## $ regularMarketDayHigh.raw.0 <dbl> 9.9000, 30.2500, 12.6000, 5.2900, 1.8500, 6.2950...
## $ regularMarketDayLow.fmt.0 <chr> "7.90", "25.30", "9.46", "4.30", "1.33", "5.79",...
## $ regularMarketDayLow.raw.0 <dbl> 7.9000, 25.3000, 9.4600, 4.3000, 1.3300, 5.7900,...
## $ regularMarketPrice.fmt.0 <chr> "8.40", "28.40", "11.63", "4.81", "1.68", "6.22"...
## $ regularMarketPrice.raw.0 <dbl> 8.40, 28.40, 11.63, 4.81, 1.68, 6.22, 1.16, 1.20...
## $ regularMarketTime.fmt.0 <chr> "4:02PM EDT", "4:00PM EDT", "4:00PM EDT", "4:00P...
## $ regularMarketTime.raw.0 <dbl> 1478289722, 1478289600, 1478289614, 1478289600, ...
## $ regularMarketVolume.fmt.0 <chr> "2.383M", "3.32M", "3.411M", "4.567M", "4.232M",...
## $ regularMarketVolume.longFmt.0 <chr> "2,382,856", "3,320,072", "3,411,052", "4,566,79...
## $ regularMarketVolume.raw.0 <dbl> 2382856, 3320072, 3411052, 4566790, 4232135, 185...
## $ sharesOutstanding.fmt.0 <chr> "90.27M", "38.085M", "35.063M", "102.709M", "20....
## $ sharesOutstanding.longFmt.0 <chr> "90,270,000", "38,085,400", "35,063,300", "102,7...
## $ sharesOutstanding.raw.0 <dbl> 90270000, 38085400, 35063300, 102709000, 2016800...
## $ shortName.0 <chr> "Bankrate, Inc. Common Stock", "Air Methods Corp...
## $ sourceInterval.0 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, ...
## $ symbol.0 <chr> "RATE", "AIRM", "BRS", "CERS", "EBIO", "ELNK", "...
## $ uuid.0 <chr> "79521cde-a3ef-383f-917d-31c49f9082f5", "f0432c1...
I think you still have some data conversion to do (depending on what you're after), but you have the data that you're looking for.

Parsing JSON arrays from a .txt file in R — several large files

I have recently been downloading large quantities of Tweets from Twitter. My starting point is around 400 .txt files containing Tweet IDs. After running a tool, Tweets are scraped from Twitter using the Tweet IDs, and for every .txt file I had with a large list of Tweet IDs, I get a very large .txt file containing JSON strings. Each JSON string contains all of the information about the Tweet. Below is a hyperlink to my OneDrive containing the file I am working on (once I get this to work, I will apply the code to the other files):
https://1drv.ms/t/s!At39YLF-U90fhKAp9tIGJlMlU0qcNQ
I have been trying to parse each JSON string in each file but with no success. My aim is to convert each file into a large dataframe in R. Each row will be a Tweet and each column a feature in the Tweet. Given their nature, the 'text' column will be very large (it will contain the body of the tweet), whereas the 'location' will be short. Each JSON string is formatted in the same way and there can be up to a million strings per file.
I have tried several methods (shown below) to obtain what I need with no success:
library('RJSONIO')
library('RCurl')
json_file <- fromJSON("Pashawar_test.txt")
json_file2 = RJSONIO::fromJSON(json_file)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘fromJSON’ for signature ‘"list", "missing"’
My other attempt:
library('RJSONIO')
json_file <- fromJSON("Pashawar_test.txt")
text <- json_file[['text']]
idstr <- json_file[['id_str']]
This code seems to parse only the first JSON string in the file. I say this because when I attempt to select 'text' or 'id_str', I only get one instance. It's also worth pointing out that 'json_file' is a large list that is 52.7 MB in size, whereas the source file is 335 MB.
Try the stream_in function of the jsonlite package. Your file contains one JSON object per line. Either read it line by line and convert each with fromJSON, or use stream_in directly, which is made for handling exactly this kind of file/connection.
require(jsonlite)
filepath <- "path/to/your/file"
# method A: read each line and convert
content <- readLines(filepath)
# this will take a while
res <- lapply(content, fromJSON)
# method B: use stream_in
con <- file(filepath, open = "rt")
# this will take a while
res <- stream_in(con)
Notice that stream_in will also simplify the result, coercing it to a data.frame, which might be handier.
That's a [n]ewline [d]elimited [json] (ndjson) file which was tailor-made for the ndjson package. Said package is very measurably faster than jsonlite::stream_in() and produces a "completely flat" data frame. That latter part ("completely flat") isn't always what folks really need as it can make for a very wide structure (in your case 1,012 columns as it expanded all the nested components) but you get what you need fast without having to unnest anything on your own.
The output of str() or even glimpse() is too large to show here but this is how you use it.
NOTE that I renamed your file since .json.gz is generally how ndjson is stored (and my package can handle gzip'd json files):
library(ndjson)
library(tidyverse)
twdf <- tbl_df(ndjson::stream_in("~/Desktop/pashwar-test.json.gz"))
## dim(twdf)
## [1] 75008 1012
Having said that…
I was alternatively going to suggest using Apache Drill since you have many of these files and they're relatively big. Drill would let you (ultimately) convert these to parquet and significantly speed things up, and there's a package to interface with Drill (sergeant):
library(sergeant)
library(tidyverse)
db <- src_drill("dbserver")
twdf <- tbl(db, "dfs.json.`pashwar-test.json.gz`")
glimpse(twdf)
## Observations: 25
## Variables: 28
## $ extended_entities <chr> "{\"media\":[]}", "{\"media\":[]}", "{\"m...
## $ quoted_status <chr> "{\"entities\":{\"hashtags\":[],\"symbols...
## $ in_reply_to_status_id_str <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ created_at <chr> "Tue Dec 16 10:13:47 +0000 2014", "Tue De...
## $ in_reply_to_user_id_str <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ source <chr> "<a href=\"http://twitter.com/download/an...
## $ retweeted_status <chr> "{\"created_at\":\"Tue Dec 16 09:28:17 +0...
## $ quoted_status_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ retweet_count <int> 220, 109, 9, 103, 0, 398, 0, 11, 472, 88,...
## $ retweeted <chr> "false", "false", "false", "false", "fals...
## $ geo <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ is_quote_status <chr> "false", "false", "false", "false", "fals...
## $ in_reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ id_str <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ in_reply_to_user_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ favorite_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ id <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ text <chr> "RT @afneil: Heart-breaking beyond words:...
## $ place <chr> "{\"bounding_box\":{\"coordinates\":[]},...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ favorited <chr> "false", "false", "false", "false", "fals...
## $ possibly_sensitive <chr> NA, "false", NA, "false", NA, "false", NA...
## $ coordinates <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ truncated <chr> "false", "false", "false", "false", "fals...
## $ entities <chr> "{\"user_mentions\":[{\"screen_name\":\"a...
## $ quoted_status_id_str <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ user <chr> "{\"id\":25968369,\"id_str\":\"25968369\"...
BUT
you've managed to create really inconsistent JSON. Not all fields with nested content are consistently represented that way and newcomers to Drill will find it somewhat challenging to craft bulletproof SQL that will help them unnest that data across all scenarios.
If you only need the data from the "already flat" bits, give Drill a try.
If you need the nested data and don't want to fight with unnesting from jsonlite::stream_in() or struggle with Drill unnesting, then I'd suggest using ndjson as noted in the first example and then carving out the bits you really need into more manageable, tidy data frames.
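That "carve out the bits you really need" step can be sketched with jsonlite alone (toy records with invented fields, standing in for the much wider Twitter frame):

```r
library(jsonlite)

# two toy tweets, ndjson-style: one JSON object per element/line
ndj <- c('{"id": 1, "user": {"name": "a"}, "text": "hi"}',
         '{"id": 2, "user": {"name": "b"}, "text": "yo"}')

df <- stream_in(textConnection(ndj), verbose = FALSE)
df <- flatten(df)    # nested user column becomes user.name

# keep only the manageable, tidy subset you actually need
tidy <- df[, c("id", "user.name", "text")]
```

With ndjson::stream_in() you'd get the dotted columns directly and skip the flatten() call.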

How to select a particular section of JSON Data in R?

I am trying to import data from url into R which is in the form of JSON, then export it to an Excel file.
url: https://api.typeform.com/v1/form/JlCM2J?key=ecab3f590a2af4ca55468adc95686a043bbf6c9a
This is my R code
library(data.table)
library(httr)
library(rjson)
set_config(config(ssl_verifypeer = 0L))
var1=fread('https://api.typeform.com/v1/form/JlCM2J?key=ecab3f590a2af4ca55468adc95686a043bbf6c9a')
head(var1)
Output:
Empty data.table (0 rows) of 161 cols: {"http_status":200,"stats":{"responses":{"showing":2,"total":2,"completed":1}},"questions":[{"id":"textfield_38991412","question":"What is your first name?","field_id":38991412},{"id":"statement_38991416","question":"Hi {{answer_38991412}},Thank you for taking this questionnaire.Your answers will help us build a great brand for you. One that is strong and memorable in your customers' minds. One that defines clearly what you are, what you stand for, and what makes you different.Let's get started!","field_id":38991416},{"id":"group_38991407...
Need:
I only need the responses section of this data to be exported as an Excel file.
Entire data can be viewed by pasting the above url in the jsonviewer.stack.hu site
The following is the more (IMO) idiomatic way to interface with REST APIs in modern R:
library(httr)
library(jsonlite)
library(dplyr)
res <- GET("https://api.typeform.com/v1/form/JlCM2J",
query=list(key="ecab3f590a2af4ca55468adc95686a043bbf6c9a"))
content(res, as="text") %>%
fromJSON(flatten=FALSE) -> out
glimpse(out$responses$answers)
## Observations: 2
## Variables: 26
## $ textfield_38991412 <chr> NA, "A"
## $ dropdown_38991418 <chr> NA, "Accounting"
## $ textarea_38991420 <chr> NA, "A"
## $ textfield_38991413 <chr> NA, "A"
## $ textarea_38991421 <chr> NA, "A"
## $ listimage_38991426_choice <chr> NA, "Company"
## $ textfield_38991414 <chr> NA, "A"
## $ website_38991435 <chr> NA, "http://A.com"
## $ textarea_38991422 <chr> NA, "A"
## $ listimage_38991427_choice <chr> NA, "Sincere"
## $ listimage_38991428_choice <chr> NA, "Male"
## $ list_38991436_choice <chr> NA, "17 or younger"
## $ list_38991437_choice <chr> NA, "Upper class"
## $ listimage_38991429_choice_49501105 <chr> NA, "Store"
## $ listimage_38991430_choice <chr> NA, "Product"
## $ textarea_38991423 <chr> NA, "A"
## $ listimage_38991431_choice <chr> NA, "Techy"
## $ listimage_38991432_choice_49501124 <chr> NA, "Fuchsia Rose"
## $ listimage_38991433_choice <chr> NA, "Classic"
## $ list_38991438_choice <chr> NA, "$3,000 or less"
## $ listimage_38991434_choice_49501140 <chr> NA, "Brand Design"
## $ textarea_38991424 <chr> NA, "A"
## $ textfield_38991415 <chr> NA, "A"
## $ dropdown_38991419 <chr> NA, "Afghanistan"
## $ email_38991439 <chr> NA, "A@a.com"
## $ textarea_38991425 <chr> NA, "A"
Using httr::GET() directly enables easier management of extra query parameters.
Using httr::content() to take the response and retrieve the raw text enables finer-grained processing (if needed).
Using jsonlite::fromJSON() directly provides far more granular control over individual JSON processing options.
However, there's an R package rtypeform which really simplifies everything (fun fact: it follows the idiom above under the covers):
library(rtypeform)
library(dplyr)
res <- get_results("JlCM2J")
glimpse(res$responses$answers)
## Observations: 2
## Variables: 26
## $ textfield_38991412 <chr> NA, "A"
## $ dropdown_38991418 <chr> NA, "Accounting"
## $ textarea_38991420 <chr> NA, "A"
## $ textfield_38991413 <chr> NA, "A"
## $ textarea_38991421 <chr> NA, "A"
## $ listimage_38991426_choice <chr> NA, "Company"
## $ textfield_38991414 <chr> NA, "A"
## $ website_38991435 <chr> NA, "http://A.com"
## $ textarea_38991422 <chr> NA, "A"
## $ listimage_38991427_choice <chr> NA, "Sincere"
## $ listimage_38991428_choice <chr> NA, "Male"
## $ list_38991436_choice <chr> NA, "17 or younger"
## $ list_38991437_choice <chr> NA, "Upper class"
## $ listimage_38991429_choice_49501105 <chr> NA, "Store"
## $ listimage_38991430_choice <chr> NA, "Product"
## $ textarea_38991423 <chr> NA, "A"
## $ listimage_38991431_choice <chr> NA, "Techy"
## $ listimage_38991432_choice_49501124 <chr> NA, "Fuchsia Rose"
## $ listimage_38991433_choice <chr> NA, "Classic"
## $ list_38991438_choice <chr> NA, "$3,000 or less"
## $ listimage_38991434_choice_49501140 <chr> NA, "Brand Design"
## $ textarea_38991424 <chr> NA, "A"
## $ textfield_38991415 <chr> NA, "A"
## $ dropdown_38991419 <chr> NA, "Afghanistan"
## $ email_38991439 <chr> NA, "A@a.com"
## $ textarea_38991425 <chr> NA, "A"
Either way, this must be your first time using R (or almost your first time) if you aren't used to using $ to access fields in lists. You should really spend some time learning R before trying to work with API data. Incorrect results and self-frustration are the only things you're going to get in return for cut-paste-and-praying your way through coding. You still need to get this into a CSV file (write.csv() for that).
Penultimately, if you're just going to end up using Excel, why are you programmatically retrieving the form responses? If you're not going to use R for the rest of the work, this seems like a needless step, unless you're trying to script the download of the data versus making someone log in to the site to fetch it.
Finally, invalidate and re-generate your API key immediately, since you posted it in an open forum. I now know you have 2 forms ("Branding Questionnaire" and "Test Form") and would be able to monitor what you do in Typeform and retrieve any of your form data at will. The rtypeform package will let you store your API key in the typeform_api environment variable (you can use ~/.Renviron to hold it) so you never have to expose it to the world again in your scripts.
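A minimal sketch of the environment-variable approach (the variable name follows the rtypeform convention mentioned above; the Sys.setenv() call here just stands in for a `typeform_api=XXXX` line in your ~/.Renviron, which R reads at startup):

```r
# Stand-in for a line in ~/.Renviron: typeform_api=XXXX
Sys.setenv(typeform_api = "XXXX")

# Scripts then fetch the key at runtime without ever hard-coding it
key <- Sys.getenv("typeform_api")
```

Anyone reading your script now sees only the variable name, never the key itself.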

parsing meta name/content using xml and r

Regarding the answer to: how to get information within <meta name...> tag in html using htmlParse and xpathSApply
My issue:
html <- htmlParse(domain, useInternalNodes=T);
names <- html['//meta/@name']
content <- html['//meta/@content']
cbind(names, content)
The meta tags in the page are:
<meta name="description" content="blah, blah...." />
<meta name="keywords" content="keyword1, keyword2" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="google-site-verification" content="1234jalsdkfjasdf928374-293423" />
What I find is this:
length(names)
[1] 3
length(content)
[1] 4
names content
[1, ] "description" [1, ] "blah, blah...."
[2, ] "keywords" [2, ] "keyword1, keyword2"
[3, ] "google-site-verification" [3, ] "text/html; charset=UTF-8"
[4, ] "description" [4, ] "1234jalsdkfjasdf928374-293423"
Seems like the parser is tripping up on "http-equiv": it skips ahead and returns the next tag's name ("google-site-verification"), but still returns the content for the "http-equiv" tag; then, since there are no more names, cbind recycles back around to "description" to match the last content value, which is the actual "google-site-verification" code.
Seems like a simple fix, but so far every conditional I've tried has failed. How can I make this right?
I realize you figured out what you needed (which doesn't really match the original question), but we'll take StackOverflow.com as an example since I had it coded up anyway as an addition to my original answer:
library(XML)
doc <- htmlParse("http://stackoverflow.com/", useInternalNodes=TRUE)
that has the following <meta> tags:
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta property="og:type" content="website" />
<meta property="og:image" itemprop="image primaryImageOfPage" content="http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6" />
<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&A for professional and enthusiast programmers" />
<meta property="og:url" content="http://stackoverflow.com/"/>
Not every tag has a name attribute, in fact of the 7, only 4 do:
length(doc["//meta/@name"])
## [1] 4
Notice that's the same as doing:
length(xpathSApply(doc, "//meta/@name"))
## [1] 4
which is pretty much what's happening under the covers.
It's only going to come back with what actually matches the search. You can see this laid out more clearly if you do:
xpathSApply(doc, "//meta", xmlGetAttr, "name")
## [[1]]
## [1] "twitter:card"
##
## [[2]]
## [1] "twitter:domain"
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## [1] "twitter:title"
##
## [[6]]
## [1] "twitter:description"
##
## [[7]]
## NULL
that list, when converted to a vector, truncates to 4 entries due to the NULLs. rvest (original answer) is just "smarter" when it comes to the extractions.
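The truncation is plain base R behaviour: unlist() silently drops NULL elements, so a 7-element list with three NULLs collapses to a 4-element vector that no longer lines up with the original tags:

```r
# mimic the xpathSApply() result above: 7 <meta> tags, 3 without a name attr
name_attrs <- list("twitter:card", "twitter:domain", NULL, NULL,
                   "twitter:title", "twitter:description", NULL)

as_vector <- unlist(name_attrs)   # the NULLs vanish
length(name_attrs)                # 7 tags went in
length(as_vector)                 # only 4 names come out
```

That misalignment is exactly what makes the cbind() in the question recycle values.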
ORIGINAL ANSWER
Working with rvest, you can grab all the <meta> attributes into a data frame pretty quickly (if that's what you're trying to do):
library(rvest)
library(dplyr)
pg <- html("http://facebook.com/")
all_meta_attrs <- unique(unlist(lapply(lapply(pg %>% html_nodes("meta"), html_attrs), names)))
dat <- data.frame(lapply(all_meta_attrs, function(x) {
pg %>% html_nodes("meta") %>% html_attr(x)
}))
colnames(dat) <- all_meta_attrs
glimpse(dat)
## Observations: 19
## Variables:
## $ charset (fctr) utf-8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ http-equiv (fctr) NA, refresh, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ content (fctr) NA, 0; URL=/?_fb_noscript=1, default, Facebook, h...
## $ name (fctr) NA, NA, referrer, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ id (fctr) NA, NA, meta_referrer, NA, NA, NA, NA, NA, NA, NA...
## $ property (fctr) NA, NA, NA, og:site_name, og:url, og:image, og:lo...
but it will also reliably extract the attributes for you:
pg %>% html_nodes("meta") %>% html_attr("http-equiv")
## [1] NA "refresh" NA
## [4] NA NA NA
## [7] NA NA NA
## [10] NA NA NA
## [13] NA NA NA
## [16] NA NA NA
## [19] "X-Frame-Options"
So I figured it out, at least what I was going for anyway. Ultimately I need to extract just "keywords" and "descriptions". The piece of code that needed changing was:
This...
html <- htmlParse(domain, useInternalNodes=T);
names <- html['//meta/@name']
content <- html['//meta/@content']
to this...
html <- htmlParse(domain, useInternalNodes=T);
keywords <- html['//meta[@name="keywords"]/@content']
description <- html['//meta[@name="description"]/@content']
Cheers

Error parsing JSON file with the jsonlite package

I'm trying to read a JSON file into R but I got this error:
Error in parseJSON(txt) : parse error: trailing garbage
[ 33.816101, -117.979401 ] } { "a": "Mozilla/4.0 (compatibl
(right here) ------^
I downloaded the file from http://1usagov.measuredvoice.com/ and unzipped it using 7zip, then I used the following code in R:
library(jsonlite)
jsonData <- fromJSON("usagov_bitly_data2013-05-17-1368832207")
I'm not sure why this error happens; I looked it up on Google but found no information. Could someone help me? Is this a problem with the file or with my code?
ANOTHER UPDATE
You can use the ndjson package to process this ndjson/streaming JSON data. It's faster than jsonlite::stream_in() and always produces a completely "flat" data frame:
system.time(bitly01 <- ndjson::stream_in("usagov_bitly_data2013-05-17-1368832207.gz"))
## user system elapsed
## 0.146 0.004 0.154
system.time(bitly02 <- jsonlite::stream_in(file("usagov_bitly_data2013-05-17-1368832207.gz"), verbose=FALSE, pagesize=10000))
## user system elapsed
## 0.419 0.008 0.427
If we examine the resultant data frames, you'll see that ndjson expands ll into ll.0 and ll.1, whereas jsonlite gives you a list column that you have to deal with later.
ndjson:
dplyr::glimpse(bitly01)
## Observations: 3,959
## Variables: 19
## $ a <chr> "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 ...
## $ al <chr> "en-US", "en-us", "en-US,en;q=0.5", "en-US", "en", "en-US", "en-US,en;q=0.5", "en-us", "e...
## $ c <chr> "US", NA, "US", "US", NA, "US", "US", NA, "AU", NA, "US", "US", "US", "US", "US", "US", "...
## $ cy <chr> "Anaheim", NA, "Fort Huachuca", "Houston", NA, "Mishawaka", "Hammond", NA, "Sydney", NA, ...
## $ g <chr> "15r91", "ifIpBW", "10DaxOu", "TysVFU", "10IGW7m", "13GrCeP", "YmtpnZ", "13oM0hV", "15r91...
## $ gr <chr> "CA", NA, "AZ", "TX", NA, "IN", "WI", NA, "02", NA, "OH", "MD", "KY", "OR", "IL", "TX", "...
## $ h <chr> "10OBm3W", "ifIpBW", "10DaxOt", "TChsoQ", "10IGW7l", "13GrCeP", "YmtpnZ", "15PUeH0", "10O...
## $ hc <dbl> 1365701422, 1302189369, 1368814585, 1354719206, 1368738258, 1368130510, 1363711958, 13687...
## $ hh <chr> "j.mp", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "go...
## $ l <chr> "pontifier", "bitly", "jaxstrong", "o_5004fs3lvd", "peacecorps", "bitly", "bitly", "nasat...
## $ ll.0 <dbl> 33.8161, NA, 31.5273, 29.7633, NA, 41.6123, 45.0070, NA, -33.8615, NA, 39.5151, 39.1317, ...
## $ ll.1 <dbl> -117.9794, NA, -110.3607, -95.3633, NA, -86.1381, -92.4591, NA, 151.2055, NA, -84.3983, -...
## $ nk <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ r <chr> "direct", "http://www.usa.gov/", "http://www.facebook.com/l.php?u=http%3A%2F%2F1.usa.gov%...
## $ t <dbl> 1368832205, 1368832207, 1368832209, 1368832209, 1368832208, 1368832209, 1368832210, 13688...
## $ tz <chr> "America/Los_Angeles", "", "America/Phoenix", "America/Chicago", "", "America/Indianapoli...
## $ u <chr> "http://www.nsa.gov/", "http://answers.usa.gov/system/selfservice.controller?CONFIGURATIO...
## $ _heartbeat_ <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kw <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
jsonlite:
dplyr::glimpse(bitly02)
## Observations: 3,959
## Variables: 18
## $ a <chr> "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 ...
## $ c <chr> "US", NA, "US", "US", NA, "US", "US", NA, "AU", NA, "US", "US", "US", "US", "US", "US", "...
## $ nk <int> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ tz <chr> "America/Los_Angeles", "", "America/Phoenix", "America/Chicago", "", "America/Indianapoli...
## $ gr <chr> "CA", NA, "AZ", "TX", NA, "IN", "WI", NA, "02", NA, "OH", "MD", "KY", "OR", "IL", "TX", "...
## $ g <chr> "15r91", "ifIpBW", "10DaxOu", "TysVFU", "10IGW7m", "13GrCeP", "YmtpnZ", "13oM0hV", "15r91...
## $ h <chr> "10OBm3W", "ifIpBW", "10DaxOt", "TChsoQ", "10IGW7l", "13GrCeP", "YmtpnZ", "15PUeH0", "10O...
## $ l <chr> "pontifier", "bitly", "jaxstrong", "o_5004fs3lvd", "peacecorps", "bitly", "bitly", "nasat...
## $ al <chr> "en-US", "en-us", "en-US,en;q=0.5", "en-US", "en", "en-US", "en-US,en;q=0.5", "en-us", "e...
## $ hh <chr> "j.mp", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "1.usa.gov", "go...
## $ r <chr> "direct", "http://www.usa.gov/", "http://www.facebook.com/l.php?u=http%3A%2F%2F1.usa.gov%...
## $ u <chr> "http://www.nsa.gov/", "http://answers.usa.gov/system/selfservice.controller?CONFIGURATIO...
## $ t <int> 1368832205, 1368832207, 1368832209, 1368832209, 1368832208, 1368832209, 1368832210, 13688...
## $ hc <int> 1365701422, 1302189369, 1368814585, 1354719206, 1368738258, 1368130510, 1363711958, 13687...
## $ cy <chr> "Anaheim", NA, "Fort Huachuca", "Houston", NA, "Mishawaka", "Hammond", NA, "Sydney", NA, ...
## $ ll <list> [<33.8161, -117.9794>, NULL, <31.5273, -110.3607>, <29.7633, -95.3633>, NULL, <41.6123, ...
## $ _heartbeat_ <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kw <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
UPDATE
The latest version of the jsonlite package supports streaming JSON (which is what this actually is). You can now read it with one line like so:
json_file <- stream_in(file("usagov_bitly_data2013-05-17-1368832207"))
See also Jeroen's answer below for stream-parsing it directly over http.
OLD ANSWER
It turns out this is a "pseudo-JSON" file. I come across these in many naive API systems I work with. Each line is valid JSON, but the individual objects aren't wrapped in a JSON array. You need to use readLines, build a valid JSON array from the lines, and pass that into fromJSON:
library(jsonlite)
# read in individual JSON lines
json_file <- "usagov_bitly_data2013-05-17-1368832207"
# turn it into a proper array by separating each object with a "," and
# wrapping that up in an array with "[]"'s.
dat <- fromJSON(sprintf("[%s]", paste(readLines(json_file), collapse=",")))
dim(dat)
## [1] 3959 18
str(dat)
## 'data.frame': 3959 obs. of 18 variables:
## $ a : chr "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile "| __truncated__ "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4"| __truncated__ "Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0" "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; SGH-T889 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile S"| __truncated__ ...
## $ c : chr "US" NA "US" "US" ...
## $ nk : int 0 0 1 1 0 0 1 0 0 0 ...
## $ tz : chr "America/Los_Angeles" "" "America/Phoenix" "America/Chicago" ...
## $ gr : chr "CA" NA "AZ" "TX" ...
## $ g : chr "15r91" "ifIpBW" "10DaxOu" "TysVFU" ...
## $ h : chr "10OBm3W" "ifIpBW" "10DaxOt" "TChsoQ" ...
## $ l : chr "pontifier" "bitly" "jaxstrong" "o_5004fs3lvd" ...
## $ al : chr "en-US" "en-us" "en-US,en;q=0.5" "en-US" ...
## $ hh : chr "j.mp" "1.usa.gov" "1.usa.gov" "1.usa.gov" ...
## ... (goes on for a while, many columns)
I combined the readLines call with the paste/sprintf call since the object.size of the resulting (temporary) object is 2,025,656 bytes (~2 MB) and I didn't feel like doing an rm on a separate temporary variable.
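If you prefer the intermediate step to be explicit, the one-liner above is equivalent to this two-step version (the only difference is the ~2 MB temporary character vector of lines):

```r
library(jsonlite)

# Two-step equivalent of the one-liner: read the lines into an explicit
# temporary, wrap them into a JSON array, then free the temporary.
json_file <- "usagov_bitly_data2013-05-17-1368832207"
json_lines <- readLines(json_file)                              # ~2 MB character vector
dat <- fromJSON(sprintf("[%s]", paste(json_lines, collapse = ",")))
rm(json_lines)                                                  # drop the temporary
```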
This format is called ndjson, and jsonlite::stream_in() is designed to import it as a stream (including gzipped input). Just use this:
con <- url("http://1usagov.measuredvoice.com/bitly_archive/usagov_bitly_data2013-05-17-1368832207.gz")
mydata <- jsonlite::stream_in(gzcon(con))
Or alternatively use the curl package for better performance or to customize the http request:
library(curl)
con <- curl("http://1usagov.measuredvoice.com/bitly_archive/usagov_bitly_data2013-05-17-1368832207.gz")
mydata <- jsonlite::stream_in(gzcon(con))
The package tidyjson can also read this "json lines" format:
read_json("my.json", format = "jsonl")
The output is then parsed using a series of pipes, rather than ending up as lists nested inside data frames.
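For example, here is a minimal sketch of such a pipeline. The field names (c, tz, u) are taken from the glimpse() output above; the exact verbs may vary between tidyjson versions, so treat this as illustrative rather than definitive:

```r
library(tidyjson)
library(dplyr)

# Read one JSON document per line, then pull a few named fields
# out into ordinary columns with spread_values()
bitly <- read_json("usagov_bitly_data2013-05-17-1368832207", format = "jsonl") %>%
  spread_values(
    country  = jstring("c"),
    timezone = jstring("tz"),
    url      = jstring("u")
  )
```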

fromJSON() isn't "applied" over a character vector; it tries to convert the whole thing into a data frame. You can try

purrr::map(df$Message, jsonlite::fromJSON)

which @Abdou provided, or

jsonlite::stream_in(textConnection(gsub("\n", "", df$Message)))

The last two will create data frames. The first will create a list that you can add as a column.

You can use the latter method with dplyr::bind_cols to create a new data frame containing all the data:

dplyr::bind_cols(df[, 1:3],
                 jsonlite::stream_in(textConnection(gsub("\n", "", df$Message))))

An almost pure base-R solution, also suggested by @Abdou:

cbind(df, do.call(plyr::rbind.fill, lapply(paste0("[",df$Message,"]"), function(x) jsonlite::fromJSON(x))))

A full, working workflow:

library(dplyr)
library(jsonlite)

df <- read.table("http://pastebin.com/raw/MMPMwNZv",
                 quote = '"', sep = ",", stringsAsFactors = FALSE, header = TRUE)

bind_cols(df[, 1:3], stream_in(textConnection(gsub("\n", "", df$Message)))) %>%
  glimpse()
## Found 3 records...
## Imported 3 records. Simplifying into dataframe...
## Observations: 3
## Variables: 19
## $ Id <int> 35054, 35055, 35059
## $ Date <chr> "2016-06-17 19:29:43 +0000", "2016-06-17 1...
## $ Level <chr> "INFO", "INFO", "INFO"
## $ id <int> -2, -1, -3
## $ ipAddress <chr> "100.100.100.100", NA, "100.200.300.400"
## $ howYouHearAboutUs <chr> NA, "Radio", NA
## $ isInterestedInOffer <lgl> TRUE, FALSE, TRUE
## $ incomeRange <int> 60000, 1, 100000
## $ isEmailConfirmed <lgl> FALSE, NA, TRUE
## $ firstName <chr> NA, "John", NA
## $ lastName <chr> NA, "Smith", NA
## $ email <chr> NA, "john.smith@gmail.com", NA
## $ city <chr> NA, "Smalltown", NA
## $ birthDate <chr> NA, "1999-12-10T05:00:00Z", NA
## $ password <chr> NA, "*********", NA
## $ agreeToTermsOfUse <lgl> NA, TRUE, TRUE
## $ visitUrl <chr> NA, NA, "https://www.website.com/?purpose=X"
## $ isIdentityConfirmed <lgl> NA, NA, FALSE
## $ validationResults <lgl> NA, NA, NA
