The Art of DOI Scraping ( Hacking ) lol, Zotero may rely too much on html

Mike Marchywka marchywka at hotmail.com
Mon Jul 18 12:48:39 CEST 2022


I ran across this page,


https://osf.io/preprints/socarxiv/eu578/

The Zotero webform returned this,

@misc{noauthor_notitle_nodate,
	url = {https://osf.io/preprints/socarxiv/eu578/},
	urldate = {2022-07-18},
}

I would note that when TooBib fails it does not return a
placeholder entry like this; it just fails. I have found that
ok, but I can see a point to the placeholder too.

In any case, AFAICT TooBib managed to scrape the right DOI,
as shown below.  As stated in the "documentation" draft,
TooBib is a collection of hacks and the DOI scraping is
one of the worst parts, but it does work amazingly well
on text and pdf files. When it seems to fail, it is often
due to a 404 returned from Crossref because the DOI
does not exist there ( or on DataCite ). Maybe for end
users that is ok, but I would think you have to manually
curate these quite often, and again deleting is easier than
hunting.
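
To illustrate the general idea of pulling a DOI out of text, here is a
minimal C++ sketch. It is not TooBib's actual code; the function names
find_doi and crossref_url are made up for this example, and the regex is
a common loose approximation of DOI syntax rather than a complete one.
The Crossref URL form is the one that appears in the citeurl line of the
output further down.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: return the first DOI-like string in text, or "" if none.
// The pattern is an approximation: "10.", 4 to 9 digits, "/", then a suffix
// that runs until whitespace, a quote, or an angle bracket.
std::string find_doi(const std::string& text)
{
    static const std::regex re(R"(10\.[0-9]{4,9}/[^[:space:]"'<>]+)");
    std::smatch m;
    if (!std::regex_search(text, m, re)) return "";
    std::string doi = m.str();
    // Drop trailing punctuation that usually belongs to the sentence, not the DOI.
    const std::string trailing = ".,;)";
    while (!doi.empty() && trailing.find(doi.back()) != std::string::npos)
        doi.pop_back();
    return doi;
}

// Hypothetical helper: build the Crossref lookup URL for a DOI.
std::string crossref_url(const std::string& doi)
{
    return "http://api.crossref.org/works/" + doi;
}
```

A 404 from that URL would then mean the DOI is not registered with
Crossref, which is a separate problem from the scraping itself.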

I did note here, for example, that TooBib did not extract the
publication year, so I just added two fields
to the list of source fields for that,

 vdate.push_back("publication-date");
 vdate.push_back("date-posted");
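
A list like this might be consulted roughly as follows. This is only a
sketch under assumptions: first_year and the map-based entry are
hypothetical names for illustration, and only vdate and the two field
names come from the actual change. It tries each candidate field in order
and takes the leading four digits of the first one that has them.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: return the year from the first field in vdate whose
// value starts with four digits (e.g. "2022-02-18"), or "" if none does.
std::string first_year(const std::map<std::string, std::string>& entry,
                       const std::vector<std::string>& vdate)
{
    for (const auto& key : vdate) {
        const auto it = entry.find(key);
        if (it == entry.end()) continue;          // field not present
        const std::string& v = it->second;
        if (v.size() < 4) continue;               // too short to hold a year
        const bool digits = std::all_of(v.begin(), v.begin() + 4,
            [](unsigned char c) { return std::isdigit(c) != 0; });
        if (digits) return v.substr(0, 4);
    }
    return "";
}
```

The order of vdate sets the priority, so new fallback fields can simply
be appended at the end.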


I'm not entirely sure of the tech behind Zotero, I think
I've seen nodejs and python mentioned, but having the
c++ code delegate a lot to bash invocations with
system() has worked well. Right now I'm starting
wscat to talk to headless chrome that way but can
insert a c++ websocket library when I get around to it or
never. For handling dates, initially the linux/cygwin "date"
command was useful for getting proven robust date code.
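
As a rough sketch of that kind of delegation, the following runs the
same date command that shows up in the urldate line of the output below
and captures what it prints. It uses POSIX popen() rather than system(),
since system() does not hand back the command's output; the names
run_command and today are hypothetical, not TooBib's.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: run a shell command and return its stdout with the
// trailing newline removed. Returns "" if the command could not be started.
std::string run_command(const std::string& cmd)
{
    std::string out;
    FILE* pipe = popen(cmd.c_str(), "r");   // POSIX; available on linux/cygwin
    if (!pipe) return out;
    char buf[256];
    while (fgets(buf, sizeof buf, pipe)) out += buf;
    pclose(pipe);
    while (!out.empty() && (out.back() == '\n' || out.back() == '\r'))
        out.pop_back();
    return out;
}

// Today's date as YYYY-MM-DD, letting the system "date" command do the work.
std::string today()
{
    return run_command("date \"+%Y-%m-%d\"");
}
```

The cost is a process launch per call, which hardly matters for a tool
that is mostly waiting on network requests anyway.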

And for whatever reason, more sites' html seems to be bad or
difficult to download without "chromate" ( headless chrome
used as a replacement for wget; not sure if the name is taken
but that is what I call it ). From the one post on the
Zotero forum it looked like one place wanted you to
pay to get the citation, so maybe it is now a money thing.



 toobib -clip
mjm>clip xxxx
./toobib.h546  cmd=clip p1=xxxx p2= flags=18 x.flag_to_string(flags,0)=show_trial paste_citation 
./mjm_med2bib_guesses.h990  uin=https://osf.io/preprints/socarxiv/eu578/ dest=xxxx flags=18
./mjm_med2bib_guesses.h1164 % mjmhandler: toobib handledoi(crossref)
% date 2022-07-18:05:27:01 Mon Jul 18 05:27:01 EDT 2022
% srcurl: https://osf.io/preprints/socarxiv/eu578/
% citeurl: http://api.crossref.org/works/10.31235/osf.io/eu578
@article{Ortenzi_Kolby_Lawrence_Limitations_Food_,
X_TooBib = {urldate: FixBeKvp s= cmd=date "+%Y-%m-%d" d=2022-07-18 dn=urldate},
X_TooBib = {journal: ReWriteParse be.get(s)=Center for Open Science be.get(dest)=},
X_TooBib = {author: Ortenzi , Flaminia and Kolby , Marit and Lawrence , Mark and Leroy , Frederic and Nordhagen , Stella and Phillips , Stuart and Vliet , Stephan van and Beal , Ty},
abstract = {<p>Nutrient Profiling Systems provide algorithms which are designed to assess the healthfulness of foods based on nutrient composition, and intended as a strategy to improve diets. Many Nutrient Profiling Systems are founded on a reductionist assumption that the healthfulness of foods is determined by the sum of their nutrients, with little consideration for the extent and purpose of processing and its health implications. A novel Nutrient Profiling System called Food Compass attempted to address existing gaps and provide a more holistic assessment of the healthfulness of foods. While a conceptually impressive effort, we propose that the chosen algorithm is not well justified and produces results that fail to discriminate for common shortfall nutrients, exaggerate the risks associated with animal-source foods, and underestimate the risks associated with ultra-processed foods. We caution against the use of Food Compass in its current form to inform consumer choices, policies, programs, industry reformulations, and investment decisions.</p>},
affiliation = {},
author = {Ortenzi , Flaminia and Kolby , Marit and Lawrence , Mark and Leroy , Frederic and Nordhagen , Stella and Phillips , Stuart and Vliet , Stephan van and Beal , Ty},
author_orig = {Flaminia Ortenzi and Marit Kolby and Mark Lawrence and Frederic Leroy and Stella Nordhagen and Stuart Phillips and Stephan van Vliet and Ty Beal},
bib-source = {Crossref},
content-domain = {false},
date-created = {2022-02-18T00:18:55Z},
date-deposited = {2022-02-182022-02-18T00:18:56Z},
date-indexed = {2022-03-30T00:05:20Z},
date-issued = {2022-02-18},
date-license = {2022-02-182022-02-18T00:00:00Z},
date-posted = {2022-02-18},
deposited = {1645143536000},
doi = {10.31235/osf.io/eu578},
group-title = {SocArXiv},
is-referenced-by-count = {0},
journal = {Center for Open Science},
license = {1645142400000, unspecified, 0, https://creativecommons.org/licenses/by/4.0/legalcode},
member = {15934},
prefix = {10.31235},
publication-date = {2022-02-18},
publisher = {Center for Open Science},
reference-count = {0},
references-count = {0},
resource = {https://osf.io/eu578},
score = {1},
subtype = {preprint},
title = {Limitations of the Food Compass Nutrient Profiling System},
type = {posted-content},
url = {http://dx.doi.org/10.31235/osf.io/eu578},
urldate = {2022-07-18},
final_assembly ={ TooBib handler handledoi(crossref)},
srcurl={https://osf.io/preprints/socarxiv/eu578/},
xsrcurl={https://osf.io/preprints/socarxiv/eu578/},
citeurl={http://api.crossref.org/works/10.31235/osf.io/eu578},
toobib-date={2022-07-18:05:27:01 Mon Jul 18 05:27:01 EDT 2022}

}


./mjm_med2bib_guesses.h1172  saving to  df=xxxx
./mjm_med2bib_guesses.h1186  have citation   nfound=1 cite=\cite{Ortenzi_Kolby_Lawrence_Limitations_Food_} something=1 paste_citation=1
mjm>
marchywka at happy:/home/documents/latex/proj/vd$ 



-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X


