[Solved] How to replace text in part of a String but leave the other parts untouched?

Discussion in 'Spigot Plugin Development' started by Rsl1122, Sep 5, 2019.

  1. I'm trying to get some issues with translation under control, where I need to replace some words with their translated counterparts while leaving other parts (JavaScript function calls) untouched.

    What I have thought about so far:
    • Using regex to select anything that isn't supposed to change & placing them back after the translation
    • Splitting the String into parts (Similar to the regex above)
    I haven't been able to get my regex to work. It works fine for this regex engine (I tested it here: https://regex101.com/r/cU5lC2/1), but in Java it does not match anything.

    Here is my regex
    Code (Text):
    "(<script>[\\s\\S]*?<\\/script>)|<script src=[\"|'].*[\"|']><\\/script>|<link [\\s\\S]*?>"
    (This one works for the PHP engine)
    Code (Text):
    (<script>[\s\S]*?<\/script>)|<script src=[\"|'].*[\"|']><\/script>|<link [\s\S]*?>
    - <script> matches exactly that
    - [\s\S]*? matches minimum number of characters (any, including line breaks) to get a match
    - </script>
    - | or
    - <script src=[\"|'].*[\"|']><\/script> Matches <script src='...'> & <script src="...">
    - <link [\s\S]*?> Matches <link ... >

    Some things the regex covers
    Code (Text):
    <link href="https://fonts.googleapis.com/css?family=Nunito:400,700,800,900&display=swap&subset=latin-ext"
              rel="stylesheet">
    <script src="vendor/jquery/jquery.min.js"></script>
    <script>
        var gmPieColors = exampleFunctionCall();
    </script>
    My Java code (for checking what it matches, no replacement yet)
    Code (Text):
            StringBuilder builder = new StringBuilder();
            // the stuff above is read into this StringBuilder line by line but that is not relevant so I left it out

            Matcher matcher = Pattern.compile("(<script>[\\s\\S]*?<\\/script>)|<script src=[\"|'].*[\"|']><\\/script>|<link [\\s\\S]*?>").matcher(builder.toString());

            MatchResult matches = matcher.toMatchResult();
            for (int i = 0; i < matches.groupCount(); i++) {
                System.out.println(matches.group(i));
            }
     
    The error I get
    Code (Text):
    No match found
    java.lang.IllegalStateException: No match found
        at java.util.regex.Matcher.group(Matcher.java:536)
        at com.djrapitops.plan.settings.locale.PatternMatchTest.test(PatternMatchTest.java:38)
        ...
    ----

    TL;DR: How to preserve part of a string when replacing words in it
    • Is there another (simpler) way of doing this that I have overlooked?
    • What is wrong with my regex?
     
  2. String#replaceAll(String, String)
     
  3. This doesn't help as it will also replace the words in the parts that I want to preserve as they were before the replacement
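    A minimal sketch of the problem, assuming a single hypothetical translation ("Players" to "Spieler") and a made-up HTML snippet; the class and method names are for illustration only. It shows a plain replaceAll also rewriting the identifier inside the script tag:

```java
class NaiveReplaceDemo {
    // Hypothetical one-entry "translation", used only for illustration.
    static String translateNaively(String html) {
        // replaceAll treats its first argument as a regex; \b marks a word boundary.
        return html.replaceAll("\\bPlayers\\b", "Spieler");
    }

    public static void main(String[] args) {
        String html = "<h1>Players</h1><script>var n = Players.count();</script>";
        // The heading is translated, but so is the JavaScript identifier:
        // <h1>Spieler</h1><script>var n = Spieler.count();</script>
        System.out.println(translateNaively(html));
    }
}
```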
     
  4. So you want the original string before replacing and the string after replacing?
     
  5. Two things.

    1. If I recall correctly, the Matcher must have the Matcher#find() method called before you can access any of its groups.
    2.
    Code (Text):
            Matcher matcher = Pattern.compile("(<script>[\\s\\S]*?<\\/script>)|<script src=[\"|'].*[\"|']><\\/script>|<link [\\s\\S]*?>").matcher(builder.toString());

            MatchResult matches = matcher.toMatchResult();
            for (int i = 0; i < matches.groupCount(); i++) {
                System.out.println(matches.group(i));
            }
    Here, you're using [\\s\\S], which means any whitespace character or any non-whitespace character; this is the same as the regex symbol . (period, meaning accept any character) when the pattern is compiled with Pattern.DOTALL, since by default . does not match line terminators in Java. Your parentheses are also only around your first alternative; that's group #1.

    Here's the bottom line:
    • Call Matcher#find().
    • Your parentheses will capture the groups, but your group count goes up by 1 for each pair of capturing parentheses. To group something but not capture it (very useful in capturing specific bits of text, ironically enough), use:
    Code (Text):
    (?:[a-fA-F0-9])
    Non-capturing groups (above) are useful if you need a token to apply to a whole complicated expression (?, +, *, etc).
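
    The two points above can be sketched roughly as follows. This is an illustration, not the code from the project: the class name and the shortened sample input are assumptions, and the quote character class is written as ["'] with a lazy .*? (inside brackets, | is a literal pipe), which differs slightly from the regex in the first post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindBeforeGroupDemo {
    // (?: ... ) groups the alternatives without adding a capturing group.
    static final Pattern TAGS = Pattern.compile(
            "(?:<script>[\\s\\S]*?</script>"
            + "|<script src=[\"'].*?[\"']></script>"
            + "|<link [\\s\\S]*?>)");

    static List<String> findAll(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TAGS.matcher(html);
        // find() must succeed before group() is called;
        // group() with no argument returns the whole match (group 0).
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<link href=\"a.css\"\n rel=\"stylesheet\">\n"
                + "<script src=\"vendor/jquery/jquery.min.js\"></script>\n"
                + "<script>\n    var gmPieColors = exampleFunctionCall();\n</script>";
        findAll(html).forEach(System.out::println);
        // No capturing parentheses in the pattern, so this prints 0.
        System.out.println(TAGS.matcher(html).groupCount());
    }
}
```

    Calling group() on a fresh Matcher without find() first throws the IllegalStateException from the first post.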
     
  6. Thanks for the detailed explanation on what I was doing wrong!

    I was able to get the Matcher to match all of the parts with modifications:

    Moved the capturing group to be the whole regex
    Code (Text):
    (<script>[\s\S]*?</script>|<script src=["|'].*["|']></script>|<link [\s\S]*?>)
    Called Matcher#find before getting results
    Code (Text):
            while (matcher.find()) {
                MatchResult matches = matcher.toMatchResult();
                System.out.println(matches.group(0));
            }
    I'll probably go with a solution that stores the matches first and then puts them back in place of the new matches after translation.

    E.g.
    1. Match the original and store them
    2. Translate the String by replacing all words
    3. Split translated String with the same regex and place the original matches in the gaps
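
    A rough sketch of those three steps, not the actual implementation linked later in the thread. The class name, the translations map, and the literal String#replace translation step are assumptions for illustration, and it only works if the translation never alters the tag delimiters themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreserveAndTranslate {
    static final Pattern PRESERVED = Pattern.compile(
            "<script>[\\s\\S]*?</script>"
            + "|<script src=[\"'].*?[\"']></script>"
            + "|<link [\\s\\S]*?>");

    static String translate(String html, Map<String, String> translations) {
        // 1. Match the originals and store them.
        List<String> originals = new ArrayList<>();
        Matcher matcher = PRESERVED.matcher(html);
        while (matcher.find()) {
            originals.add(matcher.group());
        }

        // 2. Translate the whole String, script contents included.
        String translated = html;
        for (Map.Entry<String, String> entry : translations.entrySet()) {
            translated = translated.replace(entry.getKey(), entry.getValue());
        }

        // 3. Split with the same regex. Limit -1 keeps trailing empty pieces,
        //    so there is one more piece than there are matches.
        String[] pieces = PRESERVED.split(translated, -1);
        if (pieces.length != originals.size() + 1) {
            throw new IllegalStateException("Translation changed the preserved tags");
        }
        StringBuilder result = new StringBuilder(pieces[0]);
        for (int i = 0; i < originals.size(); i++) {
            // Put each untouched original back into its gap.
            result.append(originals.get(i)).append(pieces[i + 1]);
        }
        return result.toString();
    }
}
```

    With a map containing "Players" to "Spieler", the input <h1>Players</h1><script>var n = Players.count();</script> comes back with the heading translated and the script tag exactly as it was.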
     
  7. Do you still require aid? You didn't take my first point into consideration (replace [\s\S] with .). Also, said 'preservation of matches' can be achieved through capturing groups.

    Edit: Also, the while-loop implementation of the find function sketches me out.
     
    #7 Drkmaster83, Sep 9, 2019
    Last edited: Sep 9, 2019
  8. I am thinking that you may be going about this in the wrong way.

    How about, divide and conquer?

    Map the string to its component parts and work with them? (Separate at a space character and/or at an optional character)

    Natural language is full of ambiguities and can make translation very difficult due to noun, verb, adjective, etc. placement.

    It could be worth looking into a translation API, such as Google uses?
     
  9. The current implementation works in a way that leaves all script tags intact like I wanted.
    (It is here https://github.com/plan-player-anal...apitops/plan/settings/locale/Locale.java#L104)

    I'll replace [\S\s] with . - it was used since dot does not match newlines on all regex engines. Any other optimization suggestions are of course appreciated ^.^
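
    One thing worth checking before that swap, sketched under the assumption of a small made-up multi-line script tag (class and method names are hypothetical): in Java, . skips line terminators unless Pattern.DOTALL (or the inline flag (?s)) is set.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns true if the regex, compiled with the given flags, is found in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String multiLine = "<script>\n    var x = 1;\n</script>";
        // Default flags: . stops at the line break, so nothing is found.
        System.out.println(found("<script>.*?</script>", 0, multiLine));              // false
        // DOTALL lets . match line terminators as well.
        System.out.println(found("<script>.*?</script>", Pattern.DOTALL, multiLine)); // true
        // [\s\S] matches any character regardless of flags.
        System.out.println(found("<script>[\\s\\S]*?</script>", 0, multiLine));       // true
    }
}
```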

    Matcher#find finds the next match in the input and returns false if none is found or the String ends, so an iterator-like while loop seems to work as intended. How would you have done this instead? (The number of script tags can change in the future)

    The whole word sequence is replaced with the translated line; sorry, I probably used the wrong word in the original post.

    There are very few whole sentences present that need to be translated so anything using natural language processing would be overkill.
     
  10. I took a look at it once more and the implementation makes sense given the context. I had used regex heavily for something over the past summer and was stuck in that mindset. I don't see anything I'd optimize right off the bat.

    Matcher#find is indeed intended to work with conditional structures.
     