Solved How to replace text in part of a String but leave other part untouched?

Discussion in 'Spigot Plugin Development' started by Rsl1122, Sep 5, 2019.

  1. I'm trying to get some translation issues under control: I need to replace some words with their translated counterparts while leaving other parts (JavaScript function calls) untouched.

    What I have thought about so far:
    • Using regex to select anything that isn't supposed to change & placing them back after the translation
    • Splitting the String into parts (Similar to the regex above)
    I haven't been able to get my regex to work. It works fine in the PHP regex engine (I tested it here: https://regex101.com/r/cU5lC2/1), but in Java it does not match anything

    Here is my regex
    Code (Text):
    "(<script>[\\s\\S]*?<\\/script>)|<script src=[\"|'].*[\"|']><\\/script>|<link [\\s\\S]*?>"
    (This one works for the PHP engine)
    Code (Text):
    (<script>[\s\S]*?<\/script>)|<script src=[\"|'].*[\"|']><\/script>|<link [\s\S]*?>
    - <script> matches exactly that
    - [\s\S]*? matches minimum number of characters (any, including line breaks) to get a match
    - </script>
    - | or
    - <script src=[\"|'].*[\"|']><\/script> Matches <script src='...'> & <script src="...">
    - <link [\s\S]*?> Matches <link ... >

    Some things the regex covers
    Code (Text):
    <link href="https://fonts.googleapis.com/css?family=Nunito:400,700,800,900&display=swap&subset=latin-ext"
              rel="stylesheet">
    <script src="vendor/jquery/jquery.min.js"></script>
    <script>
        var gmPieColors = exampleFunctionCall();
    </script>
    My Java code (for checking what it matches, no replacement yet)
    Code (Text):
            StringBuilder builder = new StringBuilder();
            // the stuff above is read into this StringBuilder line by line but that is not relevant so I left it out

            Matcher matcher = Pattern.compile("(<script>[\\s\\S]*?<\\/script>)|<script src=[\"|'].*[\"|']><\\/script>|<link [\\s\\S]*?>").matcher(builder.toString());

            MatchResult matches = matcher.toMatchResult();
            for (int i = 0; i < matches.groupCount(); i++) {
                System.out.println(matches.group(i));
            }
     
    The error I get
    Code (Text):
    No match found
    java.lang.IllegalStateException: No match found
        at java.util.regex.Matcher.group(Matcher.java:536)
        at com.djrapitops.plan.settings.locale.PatternMatchTest.test(PatternMatchTest.java:38)
        ...
    ----

    TL;DR: How to preserve part of a string when replacing words in it
    • Is there another (simpler) way of doing this that I have overlooked?
    • What is wrong with my regex?
     
  2. String#replaceAll(String, String)
     
  3. This doesn't help as it will also replace the words in the parts that I want to preserve as they were before the replacement
     
  4. So you want the original string before replacing and the string after replacing?
     
  5. Two things.

    1. If I recall correctly, the Matcher must have the Matcher#find() method called before you can access any of its groups.
    2.
    Code (Text):
            Matcher matcher = Pattern.compile("(<script>[\\s\\S]*?<\\/script>)|<script src=[\"|'].*[\"|']><\\/script>|<link [\\s\\S]*?>").matcher(builder.toString());

            MatchResult matches = matcher.toMatchResult();
            for (int i = 0; i < matches.groupCount(); i++) {
                System.out.println(matches.group(i));
            }
    Here, you're using [\\s\\S], which means any whitespace character or any non-whitespace character, i.e. any character at all. That is also what the regex symbol . (period) does, although in Java the dot only matches line terminators when Pattern.DOTALL (or the inline flag (?s)) is set. Your parentheses are also only around your first alternative, so that is group #1.

    Here's the bottom line:
    • Call Matcher#find().
    • Your parentheses will capture the groups, but your group count goes up by 1 for each pair of capturing parentheses. To group something but not capture it (very useful in capturing specific bits of text, ironically enough), use:
    Code (Text):
    (?:[a-fA-F0-9])
    Non-capturing groups (above) are useful if you need a token to apply to a whole complicated expression (?, +, *, etc).
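
    A minimal sketch of both points, assuming the sample HTML from the first post: find() is called before group(), and the alternatives are wrapped in a non-capturing group so the whole match is read through group(). The class and method names are made up for illustration, and the quote classes are simplified to ["'] with a lazy .*? in between. Note also that groupCount() does not include group 0, so a loop with i < groupCount() skips the last capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindBeforeGroup {

    // (?: ... ) groups the alternatives without adding a capturing group,
    // so the whole match is read with group() (same as group(0)).
    private static final Pattern TAGS = Pattern.compile(
            "(?:<script>[\\s\\S]*?</script>|<script src=[\"'].*?[\"']></script>|<link [\\s\\S]*?>)");

    static List<String> findTags(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TAGS.matcher(html);
        // group() is only valid after a successful find(), matches() or lookingAt()
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<link href=\"style.css\"\n      rel=\"stylesheet\">\n"
                + "<script src=\"vendor/jquery/jquery.min.js\"></script>\n"
                + "<script>\n    var gmPieColors = exampleFunctionCall();\n</script>\n";
        // Prints the <link>, the <script src=...> and the inline <script> block
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
    }
}
```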
     
  6. Thanks for the detailed explanation on what I was doing wrong!

    I was able to get the Matcher to match all of the parts with modifications:

    Moved the capturing group to be the whole regex
    Code (Text):
    (<script>[\s\S]*?</script>|<script src=["|'].*["|']></script>|<link [\s\S]*?>)
    Called Matcher#find before getting results
    Code (Text):
            while (matcher.find()) {
                MatchResult matches = matcher.toMatchResult();
                System.out.println(matches.group(0));
            }
    I'll probably go with a solution that preserves the matches first and then replaces new matches after translation.

    eg.
    1. Match the original and store them
    2. Translate the String by replacing all words
    3. Split translated String with the same regex and place the original matches in the gaps
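
    The three steps above could be sketched roughly like this. It is not the actual Plan implementation; the class name, the method name and the replacements map are assumptions for illustration, and it relies on the translation not changing how many script/link tags the regex finds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreserveAndTranslate {

    private static final Pattern PRESERVED = Pattern.compile(
            "<script>[\\s\\S]*?</script>|<script src=[\"'].*?[\"']></script>|<link [\\s\\S]*?>");

    static String translate(String html, Map<String, String> replacements) {
        // 1. Match the original and store the protected parts in order
        List<String> originals = new ArrayList<>();
        Matcher matcher = PRESERVED.matcher(html);
        while (matcher.find()) {
            originals.add(matcher.group());
        }

        // 2. Translate the whole String (this also touches the protected parts)
        String translated = html;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            translated = translated.replace(entry.getKey(), entry.getValue());
        }

        // 3. Split with the same regex (limit -1 keeps trailing empty gaps)
        //    and put the stored originals back between the gaps
        String[] gaps = PRESERVED.split(translated, -1);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < gaps.length; i++) {
            result.append(gaps[i]);
            if (i < originals.size()) {
                result.append(originals.get(i));
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Players</p><script>var Players = load();</script><p>Online</p>";
        // Map.of needs Java 9 or newer
        System.out.println(translate(html, Map.of("Players", "Spieler")));
    }
}
```

    With the sample input in main, only the text inside the first <p> changes; the variable name inside the script block stays Players.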
     
  7. Do you still require aid? You didn't take my first point into consideration (replace [\s\S] with .). Also, said 'preservation of matches' can be achieved through capturing groups.

    Edit: Also, the while-loop implementation of the find function sketches me out.
     
    #7 Drkmaster83, Sep 9, 2019
    Last edited: Sep 9, 2019
  8. I am thinking that you may be going about this in the wrong way.

    How about, divide and conquer?

    Map the string to its component parts and work with them? (Separate at a space character and/or at an optional character)

    Natural language is full of ambiguities and can make translation very difficult due to noun, verb, adjective, etc. placement.

    It could be worth looking into a translation API, such as Google uses?
     
  9. The current implementation works in a way that leaves all script tags intact like I wanted.
    (It is here https://github.com/plan-player-anal...apitops/plan/settings/locale/Locale.java#L104)

    I'll replace [\S\s] with . - it was used since dot does not match newlines on all regex engines. Any other optimization suggestions are of course appreciated ^.^
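
    One caveat for Java specifically: the dot does not match line terminators unless Pattern.DOTALL (or the inline flag (?s)) is set, so swapping [\s\S] for . without the flag would stop multi-line script blocks from matching. A small check, using a made-up helper name:

```java
import java.util.regex.Pattern;

final class DotAllDemo {

    // Returns true if the regex, compiled with the given flags, matches the whole input
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String script = "<script>\n    var gmPieColors = exampleFunctionCall();\n</script>";

        // Plain dot stops at line terminators, so the multi-line block is not matched
        System.out.println(matchesWhole("<script>.*?</script>", 0, script)); // false
        // [\s\S] matches line terminators without any flag
        System.out.println(matchesWhole("<script>[\\s\\S]*?</script>", 0, script)); // true
        // With DOTALL, or the inline flag (?s), the dot matches line terminators too
        System.out.println(matchesWhole("<script>.*?</script>", Pattern.DOTALL, script)); // true
        System.out.println(matchesWhole("(?s)<script>.*?</script>", 0, script)); // true
    }
}
```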

    Matcher#find finds the next match in the input and returns false if none was found or the String has ended, so an iterator-like while loop seems to work as intended. How would you have done this instead? (The number of script tags can change in the future)

    The whole word sequence is replaced with translated line, sorry, I probably used a wrong word in the original post.

    There are very few whole sentences present that need to be translated so anything using natural language processing would be overkill.
     
  10. I took a look at it once more and the implementation makes sense given the context. I had used regex heavily for something over the past summer and was stuck in that mindset. I don't see anything I'd optimize right off the bat.

    Matcher#find is indeed intended to work with conditional structures.
     
  11. I'm too late, but hopefully, this will help someone in the future.
    Learn about the XY problem. This question is one :rolleyes:

    I guess you are trying to translate a web page.
    • Create translation config files for every language (even the original one) and load the desired one at runtime.
    • Create a class to handle the texts, with a method that looks like #getText(String node, String/enum language).
    • Follow one of the approaches (approach two is recommended):
    [Approach one] If the HTML is created at runtime by Java:
    I personally don't recommend creating the HTML at runtime.
    It will be a nightmare to maintain at scale.
    Code (Text):
    String html = "<html>";
    html += "<script> coolfunction(); var foo = 'Invalid Response';</script>";
    html += "<a>status: active</a>";
    You can just append the text directly. Results:
    Code (Text):
    String html = "<html>";
    html += "<script> coolfunction(); var foo = '" + translator.getText("error.invalidresponse", "de") + "';</script>";
    html += "<a>" + translator.getText("status.active", "en") + "</a>";

    [Approach two] If the HTML is created and saved as non-editable resources for Java:
    Use placeholders instead of the text and replace them during runtime.
    Code (Text):
    <a>%text_status_cool%</a>
    <script>
    var errorMessage = '%text_error_invalid_response%';
    showDialog(errorMessage);
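    </script>
    then do something like:
    Code (Text):
    String fullHtmlPage = resources.load("myPage.html");
    // Strings are immutable, so the result has to be assigned; replace() treats the placeholder literally (no regex)
    fullHtmlPage = fullHtmlPage.replace("%text_status_cool%", translator.getText("status.disabled", "en"));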
    Problem/Bug: You may have to HTML-encode the text or escape special characters (like ', ", >, <, etc.). Read "Tips and bugs to notice" below for solutions.
    Tip: You can replace the "%" with something else (like @, #, __, ##, etc.) if they are used.
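
    A minimal sketch of the placeholder approach, assuming the template from above is already loaded as a String. The PlaceholderTranslator class and its put/apply methods are hypothetical names for illustration, not an existing API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PlaceholderTranslator {

    // Placeholder -> translated text, kept in insertion order
    private final Map<String, String> texts = new LinkedHashMap<>();

    PlaceholderTranslator put(String placeholder, String text) {
        texts.put(placeholder, text);
        return this;
    }

    String apply(String html) {
        String result = html;
        for (Map.Entry<String, String> entry : texts.entrySet()) {
            // replace() is literal; replaceAll() would treat "$" and "\" in the text specially
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "<a>%text_status_cool%</a>\n"
                + "<script>\nvar errorMessage = '%text_error_invalid_response%';\n"
                + "showDialog(errorMessage);\n</script>";
        PlaceholderTranslator german = new PlaceholderTranslator()
                .put("%text_status_cool%", "Status: aktiv")
                .put("%text_error_invalid_response%", "Fehlerhafte Antwort");
        System.out.println(german.apply(template));
    }
}
```

    Because only the placeholders are replaced, the function calls in the script block are never touched, and no regex is needed to protect them.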

    [Approach three] If the HTML is created and retrieved from external resources for Java:
    First of all, why on earth do you want to translate a website not controlled by you in your Java plugin?
    This is unreliable and should never be considered a good approach. The website may change at any time, which can break your code.
    Things you can do:
    1. Use regex (as explained by others above)
    2. Use an API such as the Google Translate API
    3. Use HTML/XML parsers to parse and replace texts
    4. Use regex to extract the information and recreate the website with the info (re-create the website and use approach one or two to display it)
    5. *Insert some other crazy, non-recommended method*


    Tips and bugs to notice:
    Problem with text formatting: Equivalent sentences in other languages can use a different word order, e.g. "This is %text%" may become "This %text% is". This is difficult to translate with:
    • String text = "this is " + text;
    • String#format("this is %s", "text"). (No, it is not recommended either).

    Problem with JavaScript variables: Sometimes you want to use text in variables. This is a problem if you force HTML encoding on everything, since JavaScript will not decode it. You have to either:

    • decode them on the front end (only if you have an API)
    • detect them with special placeholders like "%translate_status%:escape" and escape/encode only what is found. You then need two replacements: first the one with the suffix, string.replace("%text%:escape", "active"), and then the one without it, string.replace("%text%", "active")
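
    The ordering of the two replacements could look like the sketch below. It assumes that the ":escape" variant is the one that gets HTML-encoded and that the plain variant is inserted as-is; the class, the fill method and the tiny escapeHtml helper are made up for illustration, and a real project would more likely use a library encoder.

```java
final class EscapingPlaceholders {

    // Minimal HTML escaping for illustration only
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    static String fill(String template, String placeholder, String value) {
        // The longer ":escape" form must be replaced first; otherwise the plain
        // replacement would consume its prefix and leave ":escape" behind
        return template
                .replace(placeholder + ":escape", escapeHtml(value))
                .replace(placeholder, value);
    }

    public static void main(String[] args) {
        String template = "<a>%text%:escape</a><script>var s = '%text%';</script>";
        System.out.println(fill(template, "%text%", "a < b"));
    }
}
```

    Doing the replacements in the opposite order would turn "%text%:escape" into "a < b:escape", which is why the suffixed placeholder has to go first.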

    Enhance formatting changes: Sometimes you will find a bug or want to improve something, and everything is scattered. It will be a pain to replace everything one by one. Create a class to handle the formatting.

    Code (Text):
    public final class LanguageFormatter {
      private final Config config = Config.getInstance();

      // Builds the translated status line and encodes it for the target context
      String formatStatus(String status, boolean isJavascript) {
        String text = config.translator.getText("no_magic_values_allowed", config.language);
        String result = text.replace("%translate_status%", status);

        return encoder(result, isJavascript);
      }

      // escapeChars and htmlEncoder stand in for whatever escaping helpers the project provides
      private static String encoder(String text, boolean isJavascript) {
        return isJavascript ? escapeChars(text) : htmlEncoder.encode(text);
      }
    }

    Crazy ideas that still fix most problems:
    I have some ideas to enhance this even more, but I won't write them here. I will let you discover them yourselves when you come across these problems ;)
     
    #11 aidn5, Sep 28, 2019
    Last edited: Sep 28, 2019
  12. Good write-up on translating html.

    I asked about this problem because I figured that resolving this issue and those related to it would be less work.
    An existing solution was already in place that replaces the default text with replacements if they are overridden, and the text to translate was already hardcoded in the files (case 2).
    So solving this was less effort overall.