Retrieving specific string from html

Discussion in 'Spigot Plugin Development' started by Qruet, May 22, 2017.

  1. I'm attempting to retrieve a string from an HTML site. The string updates every now and then, but for now I'm simply looking, on startup, for two keywords within this large bundle of HTML code. The text between these two keywords would be the string I'm looking for. In my case, the string I'm looking for is shown in the image below...

    [​IMG]

    I got as far as being able to fetch the HTML code from a site. However, I was only successful with really simple HTML sites (such as http://ismycomputeron.com/, which returns "YES"). Now, when using plug.dj, I only seem to get the JavaScript code rather than all of the HTML code. I was wondering what I may be doing wrong.
    Code provided below:
    Code (Text):
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.bukkit.Bukkit;
    import org.bukkit.ChatColor;

    public static void main() throws Exception {
        // Download the raw response body of the page
        URL url = new URL("https://plug.dj/bectodj/");
        URLConnection connection = url.openConnection();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (InputStream in = connection.getInputStream()) {
            byte[] buf = new byte[8192];
            int len;
            while ((len = in.read(buf)) != -1) {
                baos.write(buf, 0, len);
            }
        }
        // getContentEncoding() reports compression (e.g. gzip), not the charset,
        // so the body is simply decoded as UTF-8 here
        String input = new String(baos.toByteArray(), StandardCharsets.UTF_8);
        Bukkit.broadcastMessage(input);

        // Swap the tag delimiters for marker words, then capture the text between them
        input = input.replace(">", " one ");
        input = input.replace("<", " two ");
        Bukkit.broadcastMessage(input);
        Pattern p = Pattern.compile("(?<=\\bone\\b).*?(?=\\btwo\\b)");
        Matcher m = p.matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        for (String s : matches) {
            Bukkit.broadcastMessage(ChatColor.RED + s);
        }

        Bukkit.broadcastMessage("Closing...");
    }
           

    ~Geekles
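For the "text between two keywords" part described above, one possible sketch uses plain `indexOf` rather than swapping `>` and `<` for marker words, which can misfire if the page itself contains the words "one" or "two". The class name, method name, and sample markup are made up for illustration and are not taken from plug.dj.

```java
import java.util.ArrayList;
import java.util.List;

final class BetweenMarkers {
    /** Returns every substring found between 'start' and 'end', in order of appearance. */
    static List<String> between(String text, String start, String end) {
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = text.indexOf(start, from);
            if (s < 0) break;                      // no further start marker
            int bodyStart = s + start.length();
            int e = text.indexOf(end, bodyStart);
            if (e < 0) break;                      // start marker without a matching end
            found.add(text.substring(bodyStart, e));
            from = e + end.length();               // continue searching after this match
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical markup, only for illustration
        String html = "<div id=\"now-playing-media\"><span>Artist - Song</span></div>";
        System.out.println(between(html, "<span>", "</span>")); // prints [Artist - Song]
    }
}
```

This only helps if the two keywords are actually present in the text the server sends back, so it is worth printing the raw response first.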
     
  2. Try this, plug.dj has a Java API :D
     
  3. The API is JavaScript only, and it's front-end.
     
  4. Oh, I got the URL from an old Bukkit post. I don't have an account, so I couldn't check it myself. Sorry for posting something wrong.
     
  5. I tried looking into whether you could run JavaScript code to get the object and have it return the values.
     
  6. Check the response code; you're most likely not sending the correct headers.
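A rough sketch of what checking the response code and sending headers could look like with `HttpURLConnection`. The header values are assumptions (the thread does not say which headers plug.dj expects), and the class and method names are invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class HeaderCheck {
    /** Prepares (but does not yet send) a GET request with browser-like headers. */
    static HttpURLConnection open(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestMethod("GET");
            // Assumed header values; adjust them to whatever the site expects
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            conn.setRequestProperty("Accept", "text/html");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and returns the HTTP status code, e.g. 200 or 403. */
    static int status(String address) {
        HttpURLConnection conn = open(address);
        try {
            return conn.getResponseCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `HeaderCheck.status("https://plug.dj/bectodj/")` would then show whether the server answers with 200 or with an error or redirect status. A 200 with a body that is mostly scripts points away from headers and towards content that is filled in later by JavaScript.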
     
  7. ? I am getting a response with text. However, it only contains the script code. Below is what I get back from the plugin.
    ~Geekles

    Warning: it's a lot, as you should expect when attempting to retrieve all of the HTML from a site.

    Code (Text):
    22.05 15:13:09 [Server] INFO window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;owindow.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","queueTime":0,"licenseKey":"17af7e6d07","agent":"","transactionName":"NQdaYURWW0UFVhEIWgxNfkBYVEFfC1tKEVkXBRZDX1JCRV5SABU=","applicationID":"10317296","errorBeacon":"bam.nr-data.net","applicationTime":58}
    22.05 15:13:09 [Server] INFO BectoDJ - plug.dj
    22.05 15:13:09 [Server] INFO var _v="1.5.6.10380";
    22.05 15:13:09 [Server] INFO -->
    22.05 15:13:09 [Server] INFO .async-hide { opacity: 0 !important}
    22.05 15:13:09 [Server] INFO (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
    22.05 15:13:09 [Server] INFO h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
    22.05 15:13:09 [Server] INFO (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
    22.05 15:13:09 [Server] INFO })(window,document.documentElement,'async-hide','dataLayer',4000,
    22.05 15:13:09 [Server] INFO {'GTM-NSRQTKC':true});
    22.05 15:13:09 [Server] INFO (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    22.05 15:13:09 [Server] INFO (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    22.05 15:13:09 [Server] INFO m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    22.05 15:13:09 [Server] INFO })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
    22.05 15:13:09 [Server] INFO ga('create', 'UA-28569875-1', 'auto');
    22.05 15:13:09 [Server] INFO ga('require', 'GTM-NSRQTKC');
    22.05 15:13:09 [Server] INFO //ga('send', 'pageview');
    22.05 15:13:09 [Server] INFO ga('send', 'pageview');      
    22.05 15:13:09 [Server] INFO window.__atudekey = "5a94dbbf6b2b505b192e9b4ff6b6818f";
    22.05 15:13:09 [Server] INFO window.intercomSettings = { app_id: 'fbynlv29' };
    22.05 15:13:09 [Server] INFO (function(){var w=window;var ic=w.Intercom;if(typeof ic==="function"){ic('reattach_activator');ic('update',intercomSettings);}else{var d=document;var i=function(){i.c(arguments)};i.q=[];i.c=function(args){i.q.push(args)};w.Intercom=i;function l(){var s=d.createElement('script');s.type='text/javascript';s.async=true;s.src='https://widget.intercom.io/widget/fbynlv29';var x=d.getElementsByTagName('script')[0];x.parentNode.insertBefore(s,x);}if(w.attachEvent){w.attachEvent('onload',l);}else{w.addEventListener('load',l,false);}}})()
    22.05 15:13:09 [Server] INFO window.__insp = window.__insp || []; __insp.push(['wid', '1312685934']); (function() { function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); }; setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp(); })();
    22.05 15:13:09 [Server] INFO window.__insp.push(["virtualPage"]);
    22.05 15:13:09 [Server] INFO REVAMP_CSS_PATH = "https://cdn.plug.dj/_/static/css/signuprevamp.b6986b6e3a55e829a7293e8a7b5815b4fcbf1813.css";
    22.05 15:13:09 [Server] INFO var _csrf="e70591d2419286970d61fd8e8ab7cbe61bb881f739aff3cb9bb79075ae95",_fb="216041638480603",_gws="wss://godj.plug.dj:443/socket",_jm="oa9o0HPsKsQxloUqjns4vSU4969HqWMFGIoyqcSKHd/RCxKw+BVsJQ+Vn8s8m1lwz2b+GJUHy33brnFwuMnN7W+7H7JCTLf5CVjgnAOMX6PBVuDLzid9+vSuQpmyIPpjD1GD2Te0LpkOlJnRwrkmK62nIFcItun3OKtlGPq0G7E=",_st="2017-05-22 19:11:53.616687",_loc="en-US";(function(){function p(){if(window.rjs && window.require){delete window.rjs;require(["https://cdn.plug.dj/_/static/js/app.cb1ad636ccfbe1871e792d55f28b176ba1d56a33.js"]);}else{setTimeout(p,100);}}p();})();
    22.05 15:13:09 [Server] INFO BectoDJplug.dj(function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(d.getElementById(id))return;js=d.createElement(s);js.id=id;js.src="//connect.facebook.net/en_US/sdk.js#xfbml=1&appId=216041638480603&version=v2.4";fjs.parentNode.insertBefore(js,fjs);}(document,'script','facebook-jssdk'));
    22.05 15:13:09 [Server] INFO window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;owindow.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","queueTime":0,"licenseKey":"17af7e6d07","agent":"","transactionName":"NQdaYURWW0UFVhEIWgxNfkBYVEFfC1tKEVkXBRZDX1JCRV5SABU=","applicationID":"10317296","errorBeacon":"bam.nr-data.net","applicationTime":58}
    22.05 15:13:09 [Server] INFO BectoDJ - plug.dj
    22.05 15:13:09 [Server] INFO var _v="1.5.6.10380";
    22.05 15:13:09 [Server] INFO -- one
    22.05 15:13:09 [Server] INFO .async-hide { opacity: 0 !important}
    22.05 15:13:09 [Server] INFO (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
    22.05 15:13:09 [Server] INFO h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
    22.05 15:13:09 [Server] INFO (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
    22.05 15:13:09 [Server] INFO })(window,document.documentElement,'async-hide','dataLayer',4000,
    22.05 15:13:09 [Server] INFO {'GTM-NSRQTKC':true});
    22.05 15:13:09 [Server] INFO (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    22.05 15:13:09 [Server] INFO (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    22.05 15:13:09 [Server] INFO m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    22.05 15:13:09 [Server] INFO })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
    22.05 15:13:09 [Server] INFO ga('create', 'UA-28569875-1', 'auto');
    22.05 15:13:09 [Server] INFO ga('require', 'GTM-NSRQTKC');
    22.05 15:13:09 [Server] INFO //ga('send', 'pageview');
    22.05 15:13:09 [Server] INFO ga('send', 'pageview');      
    22.05 15:13:09 [Server] INFO window.__atudekey = "5a94dbbf6b2b505b192e9b4ff6b6818f";
    22.05 15:13:09 [Server] INFO window.intercomSettings = { app_id: 'fbynlv29' };
    22.05 15:13:09 [Server] INFO (function(){var w=window;var ic=w.Intercom;if(typeof ic==="function"){ic('reattach_activator');ic('update',intercomSettings);}else{var d=document;var i=function(){i.c(arguments)};i.q=[];i.c=function(args){i.q.push(args)};w.Intercom=i;function l(){var s=d.createElement('script');s.type='text/javascript';s.async=true;s.src='https://widget.intercom.io/widget/fbynlv29';var x=d.getElementsByTagName('script')[0];x.parentNode.insertBefore(s,x);}if(w.attachEvent){w.attachEvent('onload',l);}else{w.addEventListener('load',l,false);}}})()
    22.05 15:13:09 [Server] INFO window.__insp = window.__insp || []; __insp.push(['wid', '1312685934']); (function() { function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); }; setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp(); })();
    22.05 15:13:09 [Server] INFO window.__insp.push(["virtualPage"]);
    22.05 15:13:09 [Server] INFO REVAMP_CSS_PATH = "https://cdn.plug.dj/_/static/css/signuprevamp.b6986b6e3a55e829a7293e8a7b5815b4fcbf1813.css";
    22.05 15:13:09 [Server] INFO var _csrf="e70591d2419286970d61fd8e8ab7cbe61bb881f739aff3cb9bb79075ae95",_fb="216041638480603",_gws="wss://godj.plug.dj:443/socket",_jm="oa9o0HPsKsQxloUqjns4vSU4969HqWMFGIoyqcSKHd/RCxKw+BVsJQ+Vn8s8m1lwz2b+GJUHy33brnFwuMnN7W+7H7JCTLf5CVjgnAOMX6PBVuDLzid9+vSuQpmyIPpjD1GD2Te0LpkOlJnRwrkmK62nIFcItun3OKtlGPq0G7E=",_st="2017-05-22 19:11:53.616687",_loc="en-US";(function(){function p(){if(window.rjs && window.require){delete window.rjs;require(["https://cdn.plug.dj/_/static/js/app.cb1ad636ccfbe1871e792d55f28b176ba1d56a33.js"]);}else{setTimeout(p,100);}}p();})();
    22.05 15:13:09 [Server] INFO BectoDJplug.dj(function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(d.getElementById(id))return;js=d.createElement(s);js.id=id;js.src="//connect.facebook.net/en_US/sdk.js#xfbml=1&appId=216041638480603&version=v2.4";fjs.parentNode.insertBefore(js,fjs);}(document,'script','facebook-jssdk'));
     
  8. No response for some time now. Still looking for help on this issue...
    *Bump* :cool:
     
  9. This is actually pretty simple to achieve using a library called "JSoup". You can find it at http://jsoup.org/. I, for instance, used JSoup to parse the HTML code from https://sosialis.me/resources/music/ so I could get each element and download it.

    Maven (latest version as of 24/05/2017)
    Code (Text):
    <dependency>
      <!-- jsoup HTML parser library @ http://jsoup.org/ -->
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.10.2</version>
    </dependency>
    By running this piece of code:
    Code (Text):
    Document doc = Jsoup.connect("https://www.spigotmc.org/threads/retrieving-specific-string-from-html.242262/").get(); // your thread

    // Select the logo block (looks like this: <div id="logoBlock" class="header__blockItem ">)
    Elements logo = doc.select("div#logoBlock");
    // Select its class attribute
    String clazz = logo.attr("class");

    // Print it
    System.out.println(clazz);
    This outputted:
    Code (Text):
    header__blockItem

    Process finished with exit code 0
    I believe yours would look like:
    Code (Text):
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    // Connect to your page, fetch it and parse it
    Document doc = Jsoup.connect("url-to-your-page").get();

    // Select the div with id "now-playing-media"
    Elements nowPlaying = doc.select("div#now-playing-media");

    // Get its "title" attribute (empty string if the div is not found)
    String title = nowPlaying.attr("title");

    // Print it for testing
    System.out.println(title);
    Now, I can't test it as I don't have the url to your page, but it should work. The parser is very fast! :)

    Note: you would probably have to shade the jar, as it is not included in the spigot api.

    Hope this helped.

    EDIT: I just saw your code and tested with your link, and unfortunately it does not work (it returns an empty string). This is because the page changes dynamically: when you load it, you can see the HTML change a lot, and the page you first load doesn't even include this div. I am, however, currently trying to find a separate solution.
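One way to confirm that the div is missing from the first response, without any extra library, is to search the raw HTML text for the id. This is only a rough text check, not a real HTML parser, and the two sample strings are made-up stand-ins for "what the server sends" and "what the browser shows after its scripts have run".

```java
import java.util.regex.Pattern;

final class StaticHtmlCheck {
    /** Rough check: true if the raw HTML text contains an attribute id="..." with the given value. */
    static boolean hasElementId(String rawHtml, String id) {
        Pattern p = Pattern.compile("\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"']");
        return p.matcher(rawHtml).find();
    }

    public static void main(String[] args) {
        // Made-up stand-ins for the two states of the page
        String firstResponse = "<html><body><script src=\"app.js\"></script></body></html>";
        String afterScripts =
                "<html><body><div id=\"now-playing-media\" title=\"Song\"></div></body></html>";
        System.out.println(hasElementId(firstResponse, "now-playing-media")); // false
        System.out.println(hasElementId(afterScripts, "now-playing-media"));  // true
    }
}
```

If the check is false for the text fetched by the plugin, no amount of string matching or selecting on that text will find the value, because the element is created later in the browser.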
     
    #9 ExpDev, May 24, 2017
    Last edited: May 24, 2017
  10. Yeah, I looked into JSoup; however, I don't get how I can use it on my server. My plugin will require it as a dependency, but since it's not a plugin itself it won't run from the plugins folder. I know I'm supposed to modify something in my plugin, but I didn't quite understand what, and unless someone is willing to explain more clearly how I can get JSoup to run with my plugin, I'm not sure how I'm going to be able to do this using JSoup.


    Edit: just read the rest of your response (should've done so to begin with). Let me know if you find another solution.
     
  11. you shade jsoup inside your plugin jar
     
  12. It does. But by my understanding, JSoup only loads the page once and does not wait around for the "final load". Load your plug.dj site, inspect the code, and hit refresh on the page. As you can see, there is first the HTML for the loading page, then some other page, then the final one (where your div lies). I believe the get() method only parses the first of these.

    However, I do believe there's a way around. Maybe another method JSoup offers where it waits. Honestly, I don't know, as I am not that familiar with it. You would have to do some googling and maybe ask at different forums.

    I tried downloading/finding the source for the plug.dj JS front-end API. However, I was unable to find either the download or the source. I was thinking that if I could check how it retrieves the data, I could try to rewrite/replicate those methods in Java, and maybe see where it sends the requests (if it does). To my surprise, I was not able to find any documentation or source code.
     
  13. It's very likely that it's written in JavaScript, but under the hood it's just an HTTP API which can be used by any programming language.
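To illustrate the "any programming language" point, here is a hedged sketch of calling a JSON-over-HTTP endpoint from Java with the built-in `java.net.http` client (Java 11 or newer). The endpoint is entirely up to the caller; nothing in this thread documents a real plug.dj URL, so none is given here, and the class and method names are invented.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ApiCall {
    /** Builds a GET request that asks the given endpoint for JSON. */
    static HttpRequest jsonGet(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body as text. */
    static String fetch(String endpoint) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(jsonGet(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The browser's network tab, opened while the page is running, is the usual place to find out which requests the front-end actually makes; whatever shows up there is what a Java client would have to reproduce.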
     
  14. Hence why I was looking for the source, but unable to find it :(
     
  15. Hmmm, perhaps I should contact the developers over at plug.dj and see if they'd be willing to share some of their source code, or at least help me find a way to retrieve information from the site with a Java application?
    ~Geekles
     
  16. I'm sorry, so this is just getting the HTML code from a webpage, correct? If so, it's fairly easy to get it, and you're pretty close.

    Code (Text):
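    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://plug.dj/bectodj/");
        URLConnection urlConnection = url.openConnection();

        // Read the response line by line as UTF-8 and close the reader when done
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConnection.getInputStream(), "UTF-8"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }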
    output: https://pastebin.com/MC2ztTLs
     
  17. I think maybe you'd like to read the comments above.
     
  18. Haha, if you'd looked at my code, you would've seen that it is pretty much the same as what you just suggested. For future reference, please read the comments before suggesting something.

    ~Geekles
     
    #20 Qruet, May 26, 2017
    Last edited: May 26, 2017