Edit

My question was very badly written but the new title reflects the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466 @tuna @learnbyexample) I was able to find a solution for my files, so thank you guys !!!

For those who will randomly come across this post here are 3 possible ways to achieve the desired results.

Solution 1 (https://lemmy.ml/post/25346014/16383487)

#! /bin/bash
files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

Solution 2 (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Solution 3 (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Relevant links

https://mike.bailey.net.au/notes/software/apps/obsidian/issues/markdown-heading-anchors/#background


Hi everyone !

I’m in need of some assistance with string manipulation using sed and regex. I spent a whole day on trial and error and looked around the web for a solution, but it’s way beyond my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand !

From everything I gathered around the web, it seems to be a rather complicated regex and sed substitution, so here we go !

What Am I trying to achieve?

I have a lot of markdown guides I want to host on a self-hosted Forgejo-based git instance. However, the classic markdown links are not the same as the ones on GitHub/Forgejo…

Convert the following string:

[Some text](#Header%20Linking%20MARKDOWN.md)

Into

[Some text](#header-linking-markdown.md)

As you can see, these are the requirements:

  • Pattern: [Some text](#link%20to%20header.md)
  • Only edit what’s between parentheses
  • Replace space (%20) with -
  • Everything as lowercase
  • Links are sometimes in nested parentheses
    • e.g. (look here [Some text](#link%20to%20header.md))
  • Do not change a line that begins with https (external links)

While everything is probably a bit complex as a whole the trickiest part is probably the nested parentheses :/

What I tried

The furthest I got was the following:

sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase

sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -

These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, the second command would change every %20 occurrence in the file (outside of lines containing https), not just the ones inside links.

The closest solution I found on stackoverflow looks similar, but my lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.


I would appreciate any help, even if a change of tool is needed. However, I’m more interested in the learning process, so a script or CLI alternative is very much appreciated :) Actually, any help is appreciated :D !
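
To make the flaws concrete, here is a small sketch that runs both attempts on made-up sample lines. It assumes GNU sed (for \L); the function names are only for illustration.

```bash
#!/usr/bin/env bash
# Attempt 1: lowercase everything between parentheses.
lower_parens() { sed -E 's|\(([^\)]+)\)|\L&|g'; }
# Attempt 2: replace %20 on every line that does not contain "https".
dash_spaces() { sed '/https/ ! s/%20/-/g'; }

# The match starts at the first "(", so the link text is lowercased too.
printf '%s\n' '(look HERE [Some Text](#Link%20To%20Header.md))' | lower_parens
# Expected: (look here [some text](#link%20to%20header.md))

# %20 outside of any link is replaced as well.
printf '%s\n' 'Plain%20text and [a](#A%20B)' | dash_spaces
# Expected: Plain-text and [a](#A-B)

# A line that contains https anywhere is skipped completely.
printf '%s\n' '[x](https://e.com/a%20b) and [a](#A%20B)' | dash_spaces
# Expected: [x](https://e.com/a%20b) and [a](#A%20B)
```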

Thanks in advance.

  • nutcase2690@lemmy.dbzer0.com
    4 days ago

    Not home so I can’t try it but do you need to be so specific to match the whole markdown syntax?

    You might be able to get away with

    s/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g
    

    basically, matching #this%20is%20LIKELY%20a%20link.md as opposed to matching whole markdown link

    lowercasing that entire match, then on a search matching stuff that looks like that, replace the %20 with a hyphen (combined into a single sed command). this only fails when an http link falls within the same line as a markdown hyperlink
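
    A quick way to try this idea, as a sketch only: it assumes GNU sed called with -E (for the groups, {2,3} and \L), and the function name and sample lines are made up. The second example shows the failure case mentioned above.

    ```bash
    # Hypothetical wrapper; reads markdown on stdin.
    lower_and_dash() {
        sed -E 's/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g'
    }

    printf '%s\n' '[Some text](#Header%20Linking%20MARKDOWN.md)' | lower_and_dash
    # Expected: [Some text](#header-linking-markdown.md)

    # An http link on the same line gets its %20 replaced too.
    printf '%s\n' '[a](#A%20B.md) [b](https://e.com/x%20y)' | lower_and_dash
    # Expected: [a](#a-b.md) [b](https://e.com/x-y)
    ```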

    • N0x0n@lemmy.mlOP
      2 days ago

      Hello :) Sorry to ping you, I just gave pandoc a try but it doesn’t work and I had to dig a bit further into the web to find out why !

      Links to headings with spaces are not specified by CommonMark and each tool implements a different approach… Most replace spaces with hyphens, others use URL encoding (%20). So even though pandoc looks awesome it doesn’t work for my use case (or did I miss something? Feel free to comment).

      You can give it a try on https://pandoc.org/try/ with commonmark to gfm:

      [Just a test](#Just a test)
      [Just a link](https://mylink/%20with%20space.com)
      [External link](Readme.md#JUST%20a%20test)
      [Link with numbers](readme.md#1.3%20this%20is%20another%20test)
      [Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test)
      

      If you prefer a CLI version:

      pandoc --from=commonmark_x --to=gfm+gfm_auto_identifiers "/home/user/Documents/test.md" -o "pandoc_test.md"
      
      • bokherif@lemmy.world
        2 days ago

        Hey I just did a quick web search and found this. I haven’t used the tool specifically before. However I recommend either searching the web for a similar tool or using a chatgpt like tool to create a python script that’ll achieve your end result. Sed and regex are cool and useful, but they’re only going to make it more difficult to achieve what you need.

    • N0x0n@lemmy.mlOP
      2 days ago

      Yeah probably bare bone regex was a mistake however a friendly user gave me a step by step guide on how to achieve my goal:

      #! /bin/bash
      
      files="/home/USER/projects/test.md"
      
      mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
      mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
      
      while IFS= read -r line; do
      	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
      	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      	sed -i "s/$line/${dashlink}/" "$files"
      
      	#Puts everything to lowercase after a hashtag
      	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      	sed -i "s/$dashlink/${lowercaselink}/" "$files"
      
      	#Removes spaces (%20) from markdown links after a hashtag
      	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      	sed -i "s/$lowercaselink/${spacelink}/" "$files"
      
      done <<<"$mdlinks2"
      

      If you know a better way to achieve similar results I’m very open for every new lead and learn something new !

  • learnbyexample@programming.dev
    5 days ago

    Here’s a solution with perl (assuming you don’t want to change http/https after the start of ( instead of start of a line):

    perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge' ip.txt
    
    • e flag allows you to use Perl code in the substitution portion.
    • \[[^]]+\]\(\K match square brackets and use \K to mark the start of matching portion (text before that won’t be part of $&)
    • (?!https?) don’t match if http or https is found
    • [^)]+(?=\)) match non ) characters and assert that ) is present after those characters
    • $&=~s|%20|-|gr change %20 to - for the matching portion found, the r flag is used to return the modified string instead of change $& itself
    • lc is a function to change text to lowercase
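
    To see the r flag and lc working together outside of the one-liner, here is a small sketch. The variable names, the wrapper name and the sample strings are made up for illustration; the wrapper reads stdin instead of ip.txt.

    ```bash
    # /r returns the changed copy and leaves the original variable alone.
    perl -e 'my $s = "Header%20Link"; my $t = lc $s =~ s|%20|-|gr; print "$s -> $t\n";'
    # Expected: Header%20Link -> header-link

    # Hypothetical wrapper around the command above.
    fix_whole_url() { perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge'; }
    printf '%s\n' '[Some text](#Header%20Linking%20MARKDOWN.md)' | fix_whole_url
    # Expected: [Some text](#header-linking-markdown.md)
    ```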
    • N0x0n@lemmy.mlOP
      2 days ago

      Sorry for the late response… I was busy with another user :S My English is so bad I’m not able to respond to everyone at the same time… Whatever…

      I tried your perl regex substitution and it does indeed do what I asked for in my post, so thank you very much for your help ! However, I missed a few use cases where your regex breaks… But that’s on me, your command works as expected !!!

      [Link with numbers](Another%20Markdown%20file.md#1.3%20this%20is%20another%20test.md)
      

      The part before the hashtag needs to keep its original form (even with %20) because it links to a markdown file directly and not a header (hope that’s comprehensible?). It took me a lot of time with another user, and we came up with a wrapped-up script that does everything:

      #! /bin/bash
      
      files="/home/USER/projects/test.md"
      
      mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
      mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
      
      while IFS= read -r line; do
      	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
      	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      	sed -i "s/$line/${dashlink}/" "$files"
      
      	#Puts everything to lowercase after a hashtag
      	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      	sed -i "s/$dashlink/${lowercaselink}/" "$files"
      
      	#Removes spaces (%20) from markdown links after a hashtag
      	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      	sed -i "s/$lowercaselink/${spacelink}/" "$files"
      
      done <<<"$mdlinks2"
      

      If you are motivated you can still improve your regex if you want :) I’m kinda curious if it’s possible with a one-liner ! Thanks again for your help and sorry for the late response !!

      • learnbyexample@programming.dev
        2 days ago

        This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

        perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
        
    • bizdelnick@lemmy.ml
      5 days ago

      I didn’t test this, but it will change the whole URL while changes are only needed in its fragment component (after the first #).

  • tuna@discuss.tchncs.de
    5 days ago

    This is very close

    sed ':loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;t loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'
    

    example file

    [Some text](#Header%20Linking%20MARKDOWN.md)
    (#Should%20stay%20as%20is.md)
    Text surrounding [a link](readme.md#Other%20Page). Cool
    Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)
    Do [NOT](https://example.com/URL%20Should%20Be%20Untouched.html) CHANGE%20 [hyperlinks](http://example.com/No%20Touchy.html)
    

    but it doesn’t work if you have a http link and markdown link in the same line, and doesn’t work with [escaped \] square brackets](#and-escaped-\)-parenthesis) in the link

    but!! it was fun!

    • N0x0n@lemmy.mlOP
      2 days ago

      Hello :) Sorry for the very late response !

      Your regex is indeed very close as a one-liner, I’m pretty impressed ! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments…). There are 2 things missing from your beautiful and complex regex:

      1. Numbering with dots also needs to have a dash in between (actually I think every special character, like a space or a dot, is converted to a dash)
      FROM
      ---------------
      [Link with numbers](readme.md#1.3%20this%20is%20another%20test)
      
      TO
      ---------------
      [Link with numbers](readme.md#1-3-this-is-another-test)
      
      2. The part before the hashtag needs to keep its original form (links to a real file)
      FROM
      ---------------
      [Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)
      
      TO
      ---------------
      [Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)
      

      Sorry for the trouble, I wasn’t aware of all the GitHub-Flavored Markdown syntax :/. Together with another user I got a very cool script that works perfectly, but if you want to modify your regex and try to solve the issue in pure regex, feel free :) I’m very curious what it could look like (god, regex is so obscure and at the same time it has some beauty in it !)

      #! /bin/bash
      
      files="/home/USER/projects/test.md"
      
      mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
      mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
      
      while IFS= read -r line; do
      	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
      	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      	sed -i "s/$line/${dashlink}/" "$files"
      
      	#Puts everything to lowercase after a hashtag
      	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      	sed -i "s/$dashlink/${lowercaselink}/" "$files"
      
      	#Removes spaces (%20) from markdown links after a hashtag
      	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      	sed -i "s/$lowercaselink/${spacelink}/" "$files"
      
      done <<<"$mdlinks2"
      
      • tuna@discuss.tchncs.de
        2 days ago

        I did it!! It also handles the case where an external link and internal link are on the same line :D

        sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
        

        Here is my annotated file

        # Begin loop
        :l;
        
        # Bisect first link in pattern space into pattern space and append to hold space
        # Example: `text [label](file#fragment)'
        #   Pattern space: `file#fragment)'
        #   Hold space: `text [label]('
        # Steps:
        #   1. Strategically insert \n
        #       1a. If this fails, branch out
        #   2. Append to hold space (this creates two \n's. It feels weird for the
        #      first iteration, but that's ok)
        #   3. Copy hold space to pattern space, remove first \n, then trim off
        #      everything past the second \n
        #   4. Swap pattern/hold, and trim off everything up to and incl the last \n
        s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
        Te;
        H;
        g; s/\n//; s/\n.*//;
        x; s/.*\n//;
        
        # Modify only if it is an internal link
        /^https?:/! {
            # Add hyphens
            :h;
            s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
            th;
            # Make lowercase
            s/(#[^)]*\))/\L\1/;
        };
        
        # "conditional" branch so it checks the next conditional again
        tl;
        
        # Exit: join pattern space to hold space, then move to pattern space.
        # Since the loop uses H instead of h, have to make sure hold space is empty
        :e;
        H;
        z;
        x; s/\n//;
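
        A small sketch for trying the one-liner on sample input. It assumes a recent GNU sed (T, z and \L are GNU extensions); the wrapper name and the sample lines are made up.

        ```bash
        # Hypothetical wrapper around the one-liner above; reads markdown on stdin.
        fix_links() {
            sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
        }

        printf '%s\n' \
            'x [a](#A%20B) y [b](https://e.com/#C%20D) z' \
            '[n](Some%20File.md#1.3%20Sub%20Title)' |
            fix_links
        # Expected:
        # x [a](#a-b) y [b](https://e.com/#C%20D) z
        # [n](Some%20File.md#1-3-sub-title)
        ```

        Note that every dot in the fragment becomes a dash, not only dots between digits, and that the file name before the # keeps its case and its %20.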
        
        • N0x0n@lemmy.mlOP
          2 days ago

          Wow ! Thank you ! I did a quick test on a test-file.md

          [Just a test](#just-a-test)
          [Just a link](https://mylink/%20with%20space.com)
          [External link](readme.md#just-a-test)
          [Link with numbers](readme.md#1-3-this-is-another-test)
          [Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)
          

          Great job ! Thank you very much !!! I’m really impressed by what someone with proper knowledge can do ! However, I really do not want to mess around with your regex… That would only lead to disaster xD ! I will carefully keep your regex and annotated file in my knowledge base; I’m sure I will come back to it some time in the future and try to break it down as a learning exercise.

          Thank you very much !!! 👍

    • tuna@discuss.tchncs.de
      5 days ago

      annotated it is working like this:

      # use a loop to iteratively replace the %20 with -, since doing s/%20/-/g would replace too much. we loop until it cant substitute any more
      
      # label for looping
      :loop;
      # skip the following substitute command if the line contains an http link in markdown format
      /\[[^]]*\](http/!
      # capture each part of the link, and join it together with -
      s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;
      # if the substitution made a change, loop again, otherwise break
      t loop;
      
      # convert all insides to the link lowercase if the line doesnt contain an http link
      /\[[^]]*\](http/!
      # this is outside the loop rather than in the s command above because if the link doesnt contain %20 at all then it won't convert to lowercase
      s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g
      
      • bizdelnick@lemmy.ml
        5 days ago

        skip the following substitute command if the line contains an http link in markdown format

        Why you assume there’s only one link in the line?

        Also, you perform substitutions in the whole URL instead only the fragment component.

        • tuna@discuss.tchncs.de
          5 days ago

          Why you assume there’s only one link in the line?

          They did not want external (http) links to be modified as that would break it:

          • [Example](https://example.com/#Some%20Link)
          • [Example](https://example.com/#some-link)

          I compromised by thinking that it might be unlikely enough to have an external http link AND internal link within the same line. You could probably still do it, my first thought was [^h][^t][^t][^p] but that would cause issues for #ttp and #A so i just gave up. Instead I think you’d want a different approach, like breaking each link onto their own line, do the same external/internal check before the substitution, and join the lines afterward.

          Also, you perform substitutions in the whole URL instead of the fragment component

          That requirement i missed. I just assumed the filename would be replaced the same way too Lol. Not too hard to fix tho :)

  • bizdelnick@lemmy.ml
    5 days ago

    As I see, you’ve already got an answer how to convert text to lower case. So I just tell you how to replace all occurrences of %20 with -. You need to repeat substitution until no matches found. For such iteration you need to use branching to label. Below is sed script with comments.

    :subst                                         # label
    s/(\[[^]]+\]\([^)#]*#[^)]*)%20([^)]*\))/\1-\2/ # replace the first occurrence of `%20` in the URL fragment
    t subst                                        # go to the `subst` label if the substitution took place
    

    However there are some cases when this script will fail, e. g. if there is an escaped ] character in the link text. You cannot avoid such mistakes using only simple regexps, you need a full featured markdown parser for this.

    • bizdelnick@lemmy.ml
      5 days ago

      NB: global substitution s///g is not applicable here because you need to perform new substitutions in a substituted text. Both sed regexp syntaxes (basic and extended) don’t support lookarounds that could solve this issue.
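
      A short sketch of what happens with /g instead of the loop (function name and sample link are made up): each match swallows the whole link, so only one %20 per link is replaced.

      ```bash
      # Same substitution as above, but with /g and no loop.
      dash_once_g() {
          sed -E 's/(\[[^]]+\]\([^)#]*#[^)]*)%20([^)]*\))/\1-\2/g'
      }

      printf '%s\n' '[a](#x%20y%20z)' | dash_once_g
      # Expected: [a](#x%20y-z)
      ```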

  • Deckweiss@lemmy.world
    6 days ago

    Is the ‘%MARKDOWN’ part of your example correct? That should also be converted to a dash? Or did you forget the 20 there?

  • harsh3466@lemmy.ml
    6 days ago

    I’ve got a sed regex that should work, just writing up a breakdown of the whole command so anyone interested can follow what it does. Will post in a bit.

    • N0x0n@lemmy.mlOP
      6 days ago

      This would be awesome ! A breakdown of the whole command will give me a better understanding !

      Thank you in advance, waiting for your post :)

      • harsh3466@lemmy.ml
        6 days ago

        Okay, here’s the command and a breakdown. I broke down every part of the command, not because I think you are dumb, but because reading these can be complicated and confusing. Additionally, detailed breakdowns like these have helped me in the past.

        The command:

        sed -ri 's|]\(#.+\)|\L&|; s|%20|-|g' /path/to/somefile
        

        The breakdown:

        sed - calls sed

        -r - allows for the use of extended regular expressions

        -i - edit the file given as an argument at the end of the command in place (note: write the flags as -ri rather than -ir; with GNU sed, -ir is read as -i with a backup suffix of r, so extended regular expressions are not enabled)

        Now the regex piece by piece. This command has two substitution expressions to break down the goals into manageable chunks.

        Expression one is to convert the markdown links to lowercase. That expression is:

        's|]\(#.+\)|\L&|;

        The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don’t have to explicitly ignore the https as much as we just have to match all links starting with #. Here’s the breakdown:

        ' - begins the entire expression set. If you had to match the ' character in your expression you would begin the expression set with " instead of '.

        s| - invoking find and replace (substitution). Note, Im using the | as a separator instead of the / for easier readability. In sed, you can use just about any separator you want in your syntax

        ]\(# - This is how we find the link we want to work on. In markdown, every link is preceded by ]( to indicate a closing of the link text and the opening of the actual url. In the expression, the ( is preceded by a \ because it is a special regex character. So \( tells sed to find an actual opening parenthesis character. Finally the # will be the first character of the markdown links we want to convert to lowercase, as indicated by your example. The inclusion of the # ensures no https links will be caught up in the processing.

        .+ - this bit has two parts, . and +. These are two special regex characters. The . tells sed to find any character at all and the + tells it to find the preceding character one or more times. In the case of .+, it’s telling sed to find one or more of any characters. You might think this will eat ALL of the text in the document and make it all lowercase, but it will not: sed works one line at a time, and the next part of the regex limits the match further.

        \) - this tells sed to find a closing parenthesis. Like the opening parenthesis, it is a special regex character and needs to be escaped with the backslash to tell sed to find an actual closing parenthesis character. When you combine the previous bit with this bit like so .+\), you’re telling sed to find one or more of any character followed by a closing parenthesis. Note that .+ is greedy, so the match runs to the last closing parenthesis on the line, not the first.

        | - This tells sed we’re done looking for text to match. The next bits are about how to modify/replace that text

        \L - This tells sed to convert the given text to all lowercase

        & - This is the given text to modify. In this case the & is a special metacharacter that tells sed to modify the entire pattern matched in the matching portion of the expression. So when the & is preceded by the \L, this tells sed: take everything that was matched in the pattern matching expression and convert it to lowercase.

        ; - this tells sed that this is the end of the first expression, and that more are coming.

        So all together, what this first expression does is: Find a closing bracket followed by an opening parentheses followed by a pound/hash symbol followed by one or more of any characters until finding a closing parentheses. Then convert that entire chunk of text to lowercase. Because symbols don’t have case you can just convert the entire matched pattern to lowercase. If there were specific parts that had to be kept case sensitive, then you’d have to match and modify more precisely.
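
        One thing worth checking with this first expression is how far .+ reaches when a line has more than one closing parenthesis. A small sketch, assuming GNU sed and made-up sample lines (the function name is only for illustration):

        ```bash
        # Only the first expression, so the effect of \L is easy to see.
        lower_anchor() { sed -r 's|]\(#.+\)|\L&|'; }

        printf '%s\n' '[A](#One%20Two)' | lower_anchor
        # Expected: [A](#one%20two)

        # .+ is greedy: the match runs to the LAST ")" on the line.
        printf '%s\n' '[A](#One%20Two) and (Some Note)' | lower_anchor
        # Expected: [A](#one%20two) and (some note)
        ```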

        The next expression is pretty easy, UNLESS any of your https links also include the string %20:

        If no https links contain the %20 string, then this will do the trick:

        s|%20|-|g'

        s| - again opens the expression telling sed we’re looking to substitute/modify text

        %20 - tells sed to find exactly the character sequence %20

        | - ends the pattern matching portion of the expression

        - - tells sed to replace the matched pattern with the exact character -

        | - tells sed that’s the end of the modification instructions

        g - tells sed to do this globally throughout the document. In other words, to find all occurrences of the string %20 and replace them with the string -

        ' - tells sed that is the end of the expression(s) to be evaluated.

        So all together, what this expression does is: Within the given document, find every occurrence of a percent symbol followed by the number two followed by the number zero and replace them with the dash character.

        /path/to/somefile - tells sed what file to work on.

        Part of using regex is understanding the contents of your own text, and with the information and examples given, this should work. However, if the markdown links have different formatting patterns, or as mentioned any of the https links have the %20 string in them, or other text in the document might falsely match, then you’d have to provide more information to get a more nuanced regex to match.

        Edit: clarified the use of the & metacharacter.

        Edit 2: clarified that the + metacharacter indicates finding the preceding character (or character set) one or more times.

        • N0x0n@lemmy.mlOP
          6 days ago

          Sorry to spam your unread message 😅 !

          I played a bit around and came to the following conclusion:

          s|]\(#.+\)|\L&| - Works great for in document links so I further expanded to this s|]\(#.+\)|\L&|;s|]\(.+#.+\)|\L&| to also add the following pattern [Some Text](readme.md#hello%20world.md)

          s|%20|-|g - Works on every occurrence of %20 even for the following pattern [Some text](https://my/%20home%20page.com) which would break all external links to the web. So I used this /https/ ! s|%20|-|g

          It’s probably very sloppy what I’m doing and not as elegant as your command, but it does the trick :) If you want to expand on it further, feel free; however, the following command does exactly what I wanted:

          sed -re 's|]\(#.+\)|\L&|;s|]\(.+#.+\)|\L&|;/https/ ! s|%20|-|g'
          

          Thanks again from the bottom of my heart !

          • harsh3466@lemmy.ml
            6 days ago

            Nicely done! Happy I could help.

            There’s a million ways to do it and none are “right”, so I wouldn’t call yours sloppy at all. I’m still learning and have lots of slop in my own expressions. 🤣

            I’ll turn around and ask you a question if you don’t mind. That last bit you used, I kind of understand what it’s doing, but not fully. I’m getting that it’s excluding https, but if you could explain the syntax I’d really appreciate it!

            This is the bit:

            /https/ ! s|%20|-|g

            Edit: removed a redundancy

            • N0x0n@lemmy.mlOP
              5 days ago

              Sure :)

              I don’t know if it’s still a thing but in the past some web URLs had spaces in their addresses e.g.

              https://www.my/%20website%20with%20spaces.com
              

              In markdown you can link to external web addresses like so

              [some link to a web address](https://my/%20website%20with%20spaces.com)
              

              However, without the /https/ ! prefix, s|%20|-|g replaces all occurrences of %20 (which is considered a space in HTML? Sorry if I’m wrong here :s still have a lot to learn) with -. This would break the link to the web URL [some link to a web address](https://my-website-with-spaces.com/). The /https/ ! part tells sed to run the substitution only on lines that do not contain https. Am I wrong here?


              If I may I just found something else that doesn’t quite work 😅 and it seems a bit harder to fix i think ! Sometimes I have links in this form:

              [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)
              

              As you can see I prefix the header with 1.3, but as dumb as it is… it also needs to become 1-3-subtitles

              e.g.

              [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)
              

              Needs to become

              [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1-3-Subtitles)
              

              Sorry for my bad English trying my best haha ! Hope it’s comprehensible.

              Edit:

              I don’t know why, but lemmy adds /%20 instead of %20 in my fake URLs ://

              • harsh3466@lemmy.ml
                5 days ago

                Okay. To address the %20 and the https links, and the placeholder links, I came up with a bash script to handle this.

                Because of the variation in the links, I didn’t try to write a single sed command that matches only the %20 in anchor markdown links and placeholder links while ignoring https links and all other text in the document. Instead, the script first pulls out the non-https links and then fixes them one at a time.

                To do that, I used grep, a while loop, IFS, and sed

                Here’s the script:

                #! /bin/bash
                
                mdlinks="$(grep -Po ']\((?!https).*\)' /path/to/file)"
                
                while IFS= read -r line; do
                	dashlink="$(echo "$line" | sed 's/%20/-/g')"
                	sed -i "s/$line/${dashlink}/" /path/to/file
                done <<<"$mdlinks"
                

                I’m not sure how familiar you are with bash scripting, so I’ll do the same breakdown:

                #! /bin/bash - This tells the shell what interpreter to use for the script. In this case it’s bash.

                mdlinks="$(grep -Po ']\((?!https).*\)' /path/to/file)" - This line uses grep to search for markdown link enclosures excluding https links, outputs only the text that matches, and saves all of that into a variable called mdlinks. Each link match will be a new line inside the variable.

                The breakdown of the grep command is as follows:

                grep - invokes the grep command

                -Po - two command flags. The P tells grep to use perl regular expressions. The o tells grep to only print the output that matches, rather than the entire line.

                ' - opens the regex statement

                ]\( - finds a closing bracket followed by an opening parenthesis

                (?!https) - This is a negative lookahead, which is a feature available in Perl regex. This tells grep not to match if it finds the https string right after the opening parenthesis. The parentheses enclose the negative lookahead. The ?! is what indicates it’s a negative lookahead, and https is the string to look for and reject.

                .*\) - matches any run of characters up to the last closing parenthesis on the line, so the rest of the link ends up in the match.

                ' - closes the regex statement

                /path/to/file - the file to search for matches
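
To see what this grep pattern actually picks up, here is a small sketch that runs it on a throwaway file. The temp file and the variable name matches are made up for illustration, and it assumes GNU grep with -P support.

```shell
#!/bin/bash
# Hypothetical sample file, created only for this demonstration.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
[Just a test](#Just%20a%20test.md)
[Just a link](https://mylink/%20with%20space.com)
some%20placeholder text
EOF

# -P: Perl-compatible regex, -o: print only the matched part.
# (?!https) rejects a match when "https" directly follows "](".
matches="$(grep -Po ']\((?!https).*\)' "$sample")"
printf '%s\n' "$matches"
rm -f "$sample"
```

Only the first line produces a match. Note that a plain http:// link would still be matched, since the lookahead only rejects https.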

                while IFS= read -r line; do - this starts a while loop that reads its input one line at a time. The read command does what it says and reads the input given, in this case our variable mdlinks, one line per iteration. Setting the Internal Field Separator to empty (IFS=) for read keeps any leading or trailing whitespace on each line intact. The -r flag tells read to treat the backslash character as a normal part of the input rather than as an escape. line is the variable that each line will be saved in as they are worked through the loop. The ; ends the while setup, and do opens the loop for the commands we want to run using the input saved in line.
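
A minimal sketch of this loop form on its own, using two made-up input lines (one with leading spaces, one with a backslash) to show what IFS= and -r preserve. The variable names input and lines are only for this example.

```shell
#!/bin/bash
# Two made-up lines: one with leading spaces, one containing a backslash.
input=$'  leading spaces\nback\\slash'

lines=()
while IFS= read -r line; do
	# IFS= keeps the leading spaces; -r keeps the backslash literal.
	lines+=("$line")
done <<<"$input"

printf '[%s]\n' "${lines[@]}"
```

Each input line arrives in the loop unchanged, which is what the link-rewriting script relies on.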

                dashlink="$(echo "$line" | sed 's/%20/-/g')" - This command sequence runs the markdown link saved in the line variable into sed to find all instances of %20 and replace them with a -.

                dashlink - the variable we’re saving the new link with dashes to.

                = - separates the variable from the input being saved into the variable.

                " - opens this command string for variable expansion.

                $ - tells bash to do command substitution, meaning that the output of the following commands will be saved to the variable, rather than the actual text of the commands that follows.

                ( - opens the command set

                echo - prints the given text or arguments to standard output, in this case the given argument is the variable $line

                " - tells bash to expand any variables contained within the quote set while ignoring any nonstandard characters like spaces or special shell characters that are saved in the variable.

                $line - the variable containing our active markdown link from the text document

                " - the closing quote ending the argument and the expansion enclosure

                | - This is a pipe, which redirects the standard output of the command on the left into the command on the right. Meaning we’re taking the markdown link currently saved in the variable and feeding it into sed

                sed - invokes sed so we can manipulate our text, and because sed is receiving redirected input, and we’ve specified no flags, the modified text will be printed to standard output.

                's/%20/-/g' - Our pattern match/substitution, which will find all occurrences of the string %20 in the markdown link fed into sed and replace them with -.

                )" - closes our command sequence for command substitution, and the variable expansion. At this point the text printed to standard output by sed is saved to the variable dashlink

                The next line is: sed -i "s/$line/${dashlink}/" /path/to/file, which uses sed to take the line and dashlink variables and use them to find the exact original markdown link in the text containing the %20 sequences, and replace it with the properly formatted markdown link with dashes.

                sed -i - invokes sed and uses the -i flag to edit the file in place.

                " - The double quote enclosure allows the expansion of variables in the pattern match/replacement sequence so it searches for the markdown link, and not the literal text string $line.

                s/ - opens our match/modify sequence.

                $line - the original markdown link that will be found

                / - ends the pattern matching section

                ${dashlink} - The variable containing the previously modified markdown link that now has dashes. This expands to that properly formatted link which will be written into the text file replacing the malformed link. I don’t know why this link has to be enclosed in curly braces while the first one does not.
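
On the curly-brace question: as far as bash itself is concerned, the braces are optional in this spot, because the character after the variable name is a /, which cannot be part of a name. The sketch below uses a made-up value for dashlink to compare both spellings and to show the case where braces do matter.

```shell
#!/bin/bash
dashlink='#Just-a-test.md'   # sample value, made up for this demo

with_braces="s/x/${dashlink}/"
without_braces="s/x/$dashlink/"

# Braces matter only when a name character follows directly:
suffix_braces="${dashlink}_old"   # expands dashlink, then appends _old
suffix_plain="$dashlink_old"      # looks up a variable named dashlink_old (unset, so empty)

printf '%s\n' "$with_braces" "$without_braces" "$suffix_braces" "[$suffix_plain]"
```

So the two variables in the sed command could be written the same way; the braces are harmless but not required there.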

                /" - ends the text modification section and closes the variable expansion.

                /path/to/file - the file to be worked on

                Finally we have done<<<"$mdlinks", which ends the while loop and feeds the mdlinks variable into it.

                done - closes the while loop

                <<< - This feeds the given argument into the while loop for processing

                " - expands the variable within while ignoring nonstandard characters

                $mdlinks - the variable we’re feeding in with all of our links containing %20, except for https links.

                " - closes the variable expansion.

                If you’ve never written/created your own bash script, here’s what you need to do.

                • in your home directory, or in the directory you’re working in with these files, use a text editor like vim or nano or gedit or kate or whatever plain text editor you want to to create a new file. Call the file whatever you want.

                • Paste the entirety of the script text into the file. Modify the file paths as needed to work the file you want to work. if working multiple files, you’ll need to update the script for each new file path as you finish one and move on to the next

                • Save and exit the file

                • Make the file executable at the terminal with chmod +x /path/to/script/file (sudo isn’t needed for a file you own)

                • To run it:

                  • Change directory to the directory that contains the script file (if you’re not already there)
                  • at the command line use the command ./name-of-script-file
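
Before pointing the script at real notes, it can be tried on a throwaway file. The sketch below is the same loop with the command substitution closed properly, run against a temp file holding sample lines; the temp file and the result variable are assumptions for the demo, and it relies on GNU grep and GNU sed.

```shell
#!/bin/bash
# Throwaway copy to experiment on, filled with made-up sample lines.
file="$(mktemp)"
cat > "$file" <<'EOF'
[Just a test](#Just%20a%20test.md)
[Just a link](https://mylink/%20with%20space.com)
%20
EOF

# Collect the "](...)" part of every link that does not start with https.
mdlinks="$(grep -Po ']\((?!https).*\)' "$file")"

while IFS= read -r line; do
	# Build the dashed version of this one link...
	dashlink="$(echo "$line" | sed 's/%20/-/g')"
	# ...then swap it into the file in place.
	sed -i "s/$line/${dashlink}/" "$file"
done <<<"$mdlinks"

result="$(cat "$file")"
printf '%s\n' "$result"
rm -f "$file"
```

One caveat worth keeping in mind: $line is used as a sed pattern, so a link containing a / or other regex-special characters may need escaping before this works reliably.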
                • N0x0n@lemmy.mlOP
                  4 days ago

                  First, thanks again for sharing your knowledge with me, I really appreciate the time/effort you took to write all of this. I know that’s a lot of thank-yous :/ but I’m really grateful for all of this, this is very valuable information I will keep in my knowledge base. It’s really time I learn proper bash/python/Perl scripting with all those tools (grep/sed/regex).

                  Second, YOU MISSED A DAMNED parenthesis you fool xD ! mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn)" It took me some time to figure it out with a very uninformative error bashscript.sh: line 8: unexpected EOF while looking for matching "' but as expected it works !

                  From
                  -------
                  [Just a test](#Just%20a%20test.md)
                  [Just a link](https://mylink/%20with%20space.com)
                  %20
                  
                  To
                  -------
                  [Just a test](#Just-a-test.md)
                  [Just a link](https://mylink/%20with%20space.com)
                  %20
                  

                  Next, to show you my appreciation and not take everything for granted and be spoon-fed everything, I tried to find a solution myself for something else. I will try to explain as best I can how I solved it.

                  From
                  -------
                  [Just a test](Another%20markdown%20file.md#Hello%20World)
                  
                  To
                  -------
                  [Just a test](Another%20markdown%20file.md#hello-world)
                  

                  The part before the hashtag needs to keep its initial form (it links to the original markdown file). So, instead of just playing around with Perl and regex (which doesn’t end well when done blindly without the proper knowledge), I did some simple string manipulation. It’s not very elegant but does the trick, thanks to your well-written breakdown.

                  • I printed out the $mdlinks variable just to see what it prints out
                  • Copied and changed your Perl regex to find the first hashtag (#) and save it into a new variable ($mdlinks2)
                  • Fed your $mdlinks variable into my new Perl regex
                  • Fed my new variable into done? (I’m a bit confused here but okay xD)

                  #! /bin/bash
                  # Collect the "](...)" part of every non-https markdown link
                  mdlinks="$(grep -Po ']\((?!https).*\)' "/home/dany/newtest.md")"
                  echo "$mdlinks"
                  
                  # Keep only the part from the first # onward
                  mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
                  echo "$mdlinks2"
                  
                  while IFS= read -r line; do
                  	dashlink="$(echo "$line" | sed 's|%20|-|g')"
                  	sed -i "s/$line/${dashlink}/" "/home/dany/newtest.md"
                  done <<<"$mdlinks2"
                  

                  Yes, not very elegant, but it’s the best I could do currently :/ However, I still got a YES effect :P
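
The two-stage grep from the script above can be tried on a throwaway file. As a sketch only, the version below also lowercases the anchor with GNU sed's \L so the result matches the "To" example; the temp file and the names newlink and result are assumptions for this demo.

```shell
#!/bin/bash
# Throwaway file for the experiment, holding one made-up sample link.
file="$(mktemp)"
printf '%s\n' '[Just a test](Another%20markdown%20file.md#Hello%20World)' > "$file"

mdlinks="$(grep -Po ']\((?!https).*\)' "$file")"
# Keep only the part from the first # onward, so the file name is never touched.
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	# %20 -> dash, then lowercase the anchor (\L is a GNU sed extension).
	newlink="$(echo "$line" | sed -E 's|%20|-|g; s|#.+\)|\L&|')"
	sed -i "s/$line/$newlink/" "$file"
done <<<"$mdlinks2"

result="$(cat "$file")"
printf '%s\n' "$result"
rm -f "$file"
```

The file name before the # keeps its %20 and its capitals, and only the anchor is rewritten.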


                  To answer your question:

                  Quick question as I’m working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?

                  As you can see in my string manipulation above, the part before the # needs to keep its original form :) (Sorry, I wasn’t aware of this before working with the original files.) I solved it with some string manipulation as shown above.

                  I’m a bit tired from all this searching/trial & error, tomorrow I will try to wrap everything up and answer your post below :) ! Also, I need to clean up the mess I made in my home directory xD.

                  Thanks again for your help ! Have a good night/day !

              • harsh3466@lemmy.ml
                5 days ago

                Quick question as I’m working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?

                Edit: expanded the question to question case in the whole link

                • N0x0n@lemmy.mlOP
                  4 days ago

                  Hello !!!

                  Sorry for the very late response, I had something else to do. I will read everything carefully and respond to every post :) I also thought about it overnight and I think that sed and regex weren’t the best option here (as others have mentioned).

                  I think a python script or bash (as you mentioned a bit later) would be a better way. I’m sorry that I put you through all of this… wrong tool for the job :s.

        • N0x0n@lemmy.mlOP
          6 days ago

          Thank you, thank you very much for taking the time to help me out here ! I really appreciate your full breakdown and complete development ! I haven’t tried it out yet, but skimming through your post I’m sure it will work out !

          However, I forgot to mention something:

          The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don’t have to explicitly ignore the https as much as we just have to match all links starting with #.

          This is only true for links in the same file. If I link to another file, it looks something like this:

          [Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)
          

          I can try to wrap my head around it and find a solution by myself; with your well-written breakdown I’m sure I can try something out. But if you think it will be too complex for my limited knowledge, feel free to adjust :).

          Do you mind if I ping you if I’m not able to solve the issue?

          Thanks again !!! 👍

          • harsh3466@lemmy.ml
            6 days ago

            Feel free to ping me if you want some help! I’d say I’m intermediate with regex, and I’m happy to help where I can.

            Regarding the other file, you could pretty easily modify the command I gave you to adapt to the example you gave. There are two approaches you could take.

            This is focused on the first regex in the command. The second should work unmodified on the other files if they follow the same pattern.

            Here’s the original chunk:

            s|]\(#.+\)|\L&|

            In the new example given, the # is preceded by readme.md. The easy modification is just to insert readme\.md before the # in the expression, adding the \ before the . to escape the metacharacter and match the actual period character, like so:

            s|]\(readme\.md#.+\)|\L&|

            However, if you have other files that have similar, but different patterns, like (faq.md#%20link%20text), and so on, you can make the expression more universal by using the .* metacharacter sequence. This is similar to the .+ metacharacter sequence, with one difference. The + indicates one or more times, while the * indicates zero or more times. So by using .* before the # you can likely use this on all the files if they follow the two pattern examples you gave.

            If that will work, this would be the expression:

            s|]\(.*#.+\)|\L&|

            What this expression does is:

            Find a closing bracket followed by an opening parenthesis, followed by any sequence of characters (including no characters at all) until finding a pound/hash symbol, then one or more characters until finding a closing parenthesis, and convert that entire matched string to lowercase.

            And with that modified expression, this would be the full command:

            sed -ri 's|]\(.*#.+\)|\L&|; s|%20|-|g' /path/to/somefile
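
To check the expression with .* before the # without touching any file, the sketch below pipes two sample links through the same sed program (no -i). It assumes GNU sed, since \L is a GNU extension; the variable names input and output are only for this demo.

```shell
#!/bin/bash
# Two made-up sample links; nothing on disk is modified.
input='[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)
[1.3 Subtitles](#1.3%20Subtitles)'

# First lowercase everything from "](" to the closing ")", then turn %20 into dashes.
output="$(printf '%s\n' "$input" | sed -r 's|]\(.*#.+\)|\L&|; s|%20|-|g')"
printf '%s\n' "$output"
```

Note that \L& lowercases the whole match, so a file name before the # is lowercased too, and the second substitution replaces %20 anywhere on the line, not only inside links.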
            

            Edit: grammar

            Edit 2: added the full modified command.

            • N0x0n@lemmy.mlOP
              6 days ago

              Haha we cross-replied !

              .* did the trick and replaces the additional s|]\(.+#.+\) I had added to cover that pattern from my last reply !

              Last question, about https links ! s|%20|-|g changes every occurrence of %20 in the whole file. Is there any way to change it only when it appears inside the markdown link pattern []() and the link doesn’t begin with https?

              e.g. replace in [Some text](some%20text.md) but not in plain text like Hello I'm just some%20place holder text ?

              Thanks again for your easy to read and very informative walk through ! 🤩

  • d_k_bo@feddit.org
    6 days ago

    This is more of a general suggestion: if you use Regular Expression, use https://regex101.com/. It provides syntax highlighting, explains the syntax and allows you to test your regexes.

    Additionally, I think that sd is way more intuitive than sed.

    • bizdelnick@lemmy.ml
      5 days ago

      Bad advice for sed. regex101 doesn’t support POSIX regexes, so you are unable to get the same results as with sed.

    • N0x0n@lemmy.mlOP
      6 days ago

      Hello :) Thanks for your reply !

      That’s exactly what I did and how I came to my “final” result, but it doesn’t work as expected… because of my lack of knowledge and understanding !

      Will give sd a try and see if I can come up with something ! Thanks for the pointer !

  • just_another_person@lemmy.world
    6 days ago

    Honestly, I’d be looking at doing this in any other language that has a Markdown library to parse these. You’re doing this on “hard mode” with sed. There are probably already a ton of Python tools out there that do this.

    Have a look at this. Seems it could do the job: https://github.com/Wenzil/mdx_bleach

    • N0x0n@lemmy.mlOP
      6 days ago

      Hello,

      I have thought of a python script and looked a bit around but couldn’t find something satisfactory. Also I’m a tiny bit more versed in bash/CLI than with python… Even though that’s very arguable !

      I looked through the Github repo and at first glance I have no idea how this could do the job, again I probably have to dig a bit deeper and understand what this is actually doing !

      Thanks for the pointer will give it a try :)