abusing lynx

Postby nodir » December 10th, 2018, 10:34 am

Not sure where to put this. It isn't a how-to, just a hint at how you could do something.

That "something" I can hardly describe in words. Let me try:
I have a URL, in this case https://notabug.org/dragora/dragora
That page links to further pages, and those I want to search for a given pattern (a recipe for a package).

You might say: "git clone" the whole shebang, use "find", and call it a day. And you would be right.
So this is mostly a bit of training, something that might be useful for similar cases (if adapted), and a pointer to this:
http://mywiki.wooledge.org/BashFAQ/113? ... %28lynx%29
which boils down to:
Code:
 lynx -dump -listonly -nonumbers URL
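
A rough sketch of how that one-liner might be wrapped in a small function. The name page_links and the extra sed/grep/sort steps are assumptions for illustration, not part of the FAQ entry. The exact layout of lynx output can vary, which is why the sketch only keeps lines that start with http:// or https://.

```shell
# Print the absolute http(s) links found on the page at URL $1, one per line.
# page_links is a hypothetical helper name, not something lynx provides.
page_links() {
    lynx -dump -listonly -nonumbers "$1" |
        sed 's/^[[:space:]]*//' |   # drop any leading indentation
        grep -E '^https?://' |      # keep only lines that are web links
        sort -u                     # the same link often appears more than once
}
```

Something like `page_links https://notabug.org/dragora/dragora` would then print the sorted, de-duplicated link list of that page.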


This is what I did for one concrete problem (checking whether a recipe for a package already exists in dragora), to give an idea of how to play with the above command:
Code:
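#!/usr/bin/bash

# Require exactly one argument: the search pattern for the recipe.
if [[ $# -ne 1 ]]; then
   printf "\n\twrong usage, add searchpattern for recipe\n\n" >&2
   exit 1
fi

# Print the links on page $1 that contain "recipes", dropping lines that
# match lang, redirect or order, or that end in "recipes".
glynx() {
   lynx -dump -listonly "$1" |
      grep recipes |
      grep -v 'lang\|redirect\|order\|recipes$'
}

# Variables: start URL, search pattern, links found on the start page.
url=https://notabug.org/dragora/dragora/src/master/recipes
rp="$1" # recipe pattern
mapfile -t v < <(glynx "$url")

# Main loop: list the links of each sub-page, keep those matching the pattern.
for i in "${v[@]}"; do
   i=${i#*. }   # strip the numbering lynx puts in front of each link
   glynx "$i" | grep -- "$rp"   # "--" so a pattern starting with "-" is not taken as an option
done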
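
The `${i#*. }` step in the loop may look cryptic, so here is a tiny stand-alone illustration. The sample line is only an assumption about what numbered lynx output roughly looks like, and "foo" is a made-up directory name.

```shell
# A line roughly as numbered "lynx -dump -listonly" output might look;
# "foo" is a hypothetical category name.
line='   1. https://notabug.org/dragora/dragora/src/master/recipes/foo'

# "#" removes the shortest leading match of the pattern "*. ",
# i.e. everything up to and including the first dot followed by a space.
url=${line#*. }

printf '%s\n' "$url"
```

The dots inside the URL are left alone because none of them is followed by a space. With -nonumbers, as in the one-liner further up, this stripping step should not be needed at all.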


Last note: I didn't write it off the top of my head. I had to check things and I had to ask in #bash, but all in all it was done in less than 20 minutes, so it is rather easy to do.
It is also quite "raw" (the grep -v solution in particular is not what you would call error-resistant ...).
Still, it might give others ideas in case they ever need something similar.
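
On the remark that the grep -v filter is not very error-resistant: one possible alternative, sketched under assumptions, is to keep only links that sit exactly one path level below a given base URL instead of listing words to exclude. The function name direct_children is made up, and the sketch assumes its input is one bare URL per line (for example from lynx with -nonumbers). Whether this fits the real page layout is untested.

```shell
# Read URLs on stdin; print only those exactly one path level below base URL $1.
# direct_children is a hypothetical helper name.
direct_children() {
    local base=${1%/} url rest
    while IFS= read -r url; do
        [[ $url == "$base"/* ]] || continue   # must start with "<base>/"
        rest=${url#"$base"/}                  # the part after the base
        # keep it only if exactly one more path segment follows, without a query string
        [[ -n $rest && $rest != */* && $rest != *\?* ]] || continue
        printf '%s\n' "$url"
    done
}
```

Used as `lynx -dump -listonly -nonumbers "$url" | direct_children "$url"`, this drops the base page itself, links with a query string appended, and anything outside the base path, without naming them one by one.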

It sure is simple, intuitive and easy to understand.
compare it with this, which was proposed to me by "someone"
Code:
#!/usr/bin/bash

[[ $# != 1 ]] && exec echo 'need _one_ arg, recipe'

b='https://notabug.org'
m='/dragora/dragora/src/master/recipes'
mu="$b$m"

c() { curl -s "$@"; }
h() { gawk -v 'RS=[<>]' '{ gsub(/\r$/, ""); if ( $0 ~ /[a-zA-Z0-9]/ ) print $0; print " " RT "\n" }' "$@"; }
p() { gawk -v 'FS=[="]+' -v p="$b" '/octicon-file-directory/ { while ( $0 !~ /href/ ) getline; print p $( NF -1 ) }'; }

a=(
 $(
  c "$mu" |
   h |
    p
 )
)

for e in "${a[@]}"; do
 c "$e" | h | p | awk -v p="$1" '$0 ~ "/" p "$"' &
done

wait

Might be me, but i can't make no sense of it (it does the job though).
nodir
 
Posts: 307
Joined: June 16th, 2015, 10:10 pm

Re: abusing lynx

Postby debil » December 12th, 2018, 12:11 pm

nodir wrote:
Code:
lynx -dump -listonly -nonumbers URL

Cheers for the tip!

nodir wrote:It sure is simple, intuitive and easy to understand.

Your script was very readable.

nodir wrote:compare it with this, which was proposed to me by "someone"
Code:
<snip>

nodir wrote:Might be me, but i can't make no sense of it (it does the job though).

Yeah... Nowadays there's no reason to use one-letter variables, whatever the situation (loop counters being the exception). It just adds obfuscation. Self-describing variable names and sparse but to-the-point comments are the best.
Ultimimate fanboi edition contributor
debil
 
Posts: 662
Joined: February 9th, 2011, 12:02 pm

Re: abusing lynx

Postby nodir » December 12th, 2018, 4:57 pm

Thanks. :-)

What bugged me a bit is naming the array v (which is kind of the default for arrays?), but the brutal truth is: I don't know the web well enough to pick a different name
(to me a URL is a URL, but some have links to other URLs, and rather than make it worse with a confusing array name, I just used v. Not as in Vendetta, but as in something I recall as a default :-) ).

glynx is supposed to say something like get-links-via-url, so g as in get. Same problem as above, else I would have chosen a more meaningful name.

Usually I add a comment if the variables are not clear, but #get_lynx wouldn't have added much clarity ...

-
perhaps something like:
main_url=
sub_urls=()
get_sub_urls () { }
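
For illustration, here is how the script from the first post might look with names along these lines. This is only a sketch: list_links, search_recipes and recipe_pattern are additional made-up names, and it uses -nonumbers from the one-liner so that no numbering has to be stripped. Nothing is fetched until search_recipes is called.

```shell
#!/usr/bin/bash
# Same idea as the original script, only with longer names (all hypothetical).

main_url=https://notabug.org/dragora/dragora/src/master/recipes

# Keep the lynx call in one place: print every link found on page $1.
list_links() {
    lynx -dump -listonly -nonumbers "$1"
}

# Print the links on page $1 that point into the recipes tree,
# using the same grep filters as the original glynx function.
get_sub_urls() {
    list_links "$1" |
        grep recipes |
        grep -v 'lang\|redirect\|order\|recipes$'
}

# Print every link one level below main_url that matches pattern $1.
search_recipes() {
    local recipe_pattern=$1 sub_url
    local -a sub_urls
    mapfile -t sub_urls < <(get_sub_urls "$main_url")
    for sub_url in "${sub_urls[@]}"; do
        get_sub_urls "$sub_url" | grep -- "$recipe_pattern"
    done
}
```

Ending the script with a line such as `search_recipes "$1"` (after the same argument check as in the original) would give roughly the same behaviour as before.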

For such a short snippet it is OK (as far as I'm concerned), so I gave up on finding better names :-) In general I agree.
So: just chatting a bit.

