This repository has been archived by the owner on Feb 13, 2020. It is now read-only.

Question about regular expression and loop #535

Open
chenchenguo opened this issue Nov 29, 2018 · 17 comments

@chenchenguo

Hi, I ran into a problem when trying to loop over regular expressions.
What if I want to search for each letter in a given list ("a", "b", "c", "d", ..., "z") one after another?
Right now I do this by writing 26 separate regular expressions, which I know is clumsy. How can I do it with a single loop or some other approach?
Thanks in advance.

@ChadFibke

Hey @chenchenguo,

Can you provide us with a bit more context (what is the input, output, and what did you want to accomplish)?

@chenchenguo
Author

> Hey @chenchenguo,
>
> Can you provide us with a bit more context (what is the input, output, and what did you want to accomplish)?

Thanks @ChadFibke

The data is the set of words from words.txt that start and end with the same letter, like "bob" and "kick".
Now I want to count how many such words there are for each letter (from "a" to "z"). For example, there might be 20 words that start and end with "a", and 50 that start and end with "b".
Right now I write this out for each letter: a <- str_subset(data, "^a"); a_number <- length(a). I repeat that 26 times.
Is there a way to do this with a loop?
Thanks a lot.

@zeeva85

zeeva85 commented Nov 29, 2018

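library(stringr)
library(tibble)

words <- readLines("words.txt")

# Build one regex per letter, e.g. "^a.*a$"
output <- vector("character", length(letters))
for (i in letters) {
  output[match(i, letters)] <- paste0("^", i, ".*", i, "$")
}

This gives the vector of regexes.

# Make a tibble; start_letter holds placeholder values 1:26 that are overwritten below
df <- tibble(letters,
             start_letter = seq_along(letters))

# For each regex, count the matching words and store the total in column 2
for (i in output) {
  df[match(i, output), 2] <- sum(str_count(words, pattern = i))
}

This gives the frequency table.

I think this should work.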
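
For comparison, here is a minimal sketch of the same counting idea without explicit loops, using only base R (`grepl()` and `vapply()` rather than the stringr and tibble calls above). The `words` vector is a hypothetical stand-in for the contents of words.txt.

```r
# Hypothetical sample standing in for readLines("words.txt")
words <- c("bob", "kick", "area", "apple", "bulb")

# paste0() is vectorised, so all 26 patterns are built in one call
patterns <- paste0("^", letters, ".*", letters, "$")

# For each pattern, count how many words match it
counts <- vapply(patterns, function(p) sum(grepl(p, words)), integer(1))
names(counts) <- letters

# Show only the letters that have at least one matching word
counts[counts > 0]
```

The result is a named integer vector, which can be turned into a two-column table with `data.frame(letter = names(counts), n = counts)` if a data frame is preferred.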

@ChadFibke

Ah I found something as well:

count_all_hits <- function(a_charater_vector, pattern_list) {

  require(purrr)

  # Let's make a list for our results
  results <- list()

  # For each letter, keep the words that start and end with that letter
  for (match in pattern_list) {
    results[[sprintf("Matches for %s", match)]] <-
      a_charater_vector[grepl(sprintf("^%s.*%s$", match, match), a_charater_vector)]
  }

  # Return the number of matching words for each letter
  return(map(results, length))
}

count_all_hits(a_charater_vector = wordss, pattern_list = letters)

@ChadFibke

sprintf() is definitely a function to look into. It lets you insert the values of variables into a character string. For example, sprintf("Matches for %s", match) places the character value of the match object into the string: the %s is a placeholder that is replaced by the value of match, formatted as a string.
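
A small sketch of that substitution, using a hypothetical variable `match_letter` so that it does not shadow base R's `match()`:

```r
# Hypothetical example value
match_letter <- "b"

# Each %s is replaced, in order, by the arguments that follow the format string
label   <- sprintf("Matches for %s", match_letter)
pattern <- sprintf("^%s.*%s$", match_letter, match_letter)

print(label)
print(pattern)

# sprintf() is vectorised, so all 26 patterns can also be built in one call
all_patterns <- sprintf("^%s.*%s$", letters, letters)
```

Here `label` becomes "Matches for b" and `pattern` becomes "^b.*b$".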

@ChadFibke

Also, I converted all the strings to lowercase using:

wordss<-str_to_lower(readLines("./words.txt"))

If you don't want the counts and actually want to see the matching words, replace:

return(map(results, length))

# with
return(results)
# which will give you a list with all the found words.

@bassamjaved

Here's another possibility...

There's an exercise from Hadley's R for Data Science in the strings chapter that can be adapted for this.

You could create a string along the lines of "^a|^b|^c", continuing all the way to the letter 'z'. Let's call that string letter_match; it is the regex we'll match against. Then,

# find and extract matches (the first letter of each word)
matches <- str_extract(words, letter_match)

# create a frequency table
Letters <- table(matches)
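
The letter_match string does not have to be typed by hand. A minimal base R sketch, assuming a hypothetical sample `words` vector and using `regexpr()`/`regmatches()` as a stand-in for `str_extract()`:

```r
# Build "^a|^b|...|^z" from the built-in letters vector
letter_match <- paste0("^", letters, collapse = "|")

# Hypothetical sample standing in for the contents of words.txt
words <- c("bob", "kick", "apple", "banana", "zebra")

# Locate the first match in each word, then pull out the matched text
m <- regexpr(letter_match, words)
matches <- regmatches(words, m)

# Frequency table of first letters
first_letter_counts <- table(matches)
first_letter_counts
```

Note that this only counts words by their first letter; it does not check that a word also ends with that letter.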

@ChadFibke

@bassamjaved,

Are you able to use that to find words that start with and end with a, b, c....z?

@bassamjaved

bassamjaved commented Nov 29, 2018 via email

@bassamjaved

Ah but I see you want start and end with the same letter. Okay, no I haven't tried that with this particular method...

@chenchenguo
Author

@zeeva85
Thanks, your code is concise and useful.
In df[match(i, output), 2], what does the 2 mean? Is it the start_letter row?

@chenchenguo
Author

Thanks @ChadFibke
I will try your suggestion

@zeeva85

zeeva85 commented Nov 29, 2018

> @zeeva85
> Thanks, your code is concise and useful.
> In df[match(i, output), 2], what does the 2 mean? Is it the start_letter row?

Correct: the loop sums the matches and then overwrites the placeholder values 1:26 in the 2nd column ("start_letter").

I think df[match(i, output), "start_letter"] should also work. It's more explicit and probably better, since it helps prevent errors.

The general form is df[row, column].
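
A short sketch of the two indexing styles, on a hypothetical three-row base R data frame with the same column names as df above:

```r
# Hypothetical small table mirroring the structure of df
df <- data.frame(letters = c("a", "b", "c"),
                 start_letter = 1:3)

# Position-based: row 2, column 2
df[2, 2] <- 50L

# Name-based: row 3 of the same column, selected by name;
# this keeps working even if the columns are later reordered
df[3, "start_letter"] <- 7L

df
```

Both assignments write into the start_letter column; only the way the column is selected differs.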

@chenchenguo
Author

> Ah but I see you want start and end with the same letter. Okay, no I haven't tried that with this particular method...

Yeah, the part about starting and ending with the same letter is already done. I will try the str_extract function here, thank you.

@chenchenguo
Author

> Also, I converted all the strings to lowercase using:
>
> wordss<-str_to_lower(readLines("./words.txt"))
>
> If you don't want the counts and actually want to see the matching words, replace:
>
> return(map(results, length))
>
> # with
> return(results)
> # which will give you a list with all the found words.

Nice, yeah, I forgot to convert them to lowercase. Thanks for pointing that out.

@bassamjaved

Here's a revision of the method I posted earlier:

# create one regex per letter that begins with the letter and ends with the same letter
(letters_for_regex <- str_c("(", "^", letters, ".+", letters, "$", ")"))

# collapse into one string, with the alternatives separated by "|"
(letter_match <- str_c(letters_for_regex, collapse = "|"))

# find and subset matches (words_lowercase is the lowercased vector of words)
(words_with_matches <- str_subset(words_lowercase, letter_match))

# extract the first letter of each matching word
(letters_in_matches <- str_extract(words_with_matches, "^."))

# create a frequency table
(Letters <- table(letters_in_matches))
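
As a possible alternative to building 26 alternatives, a regex backreference can express "ends with whatever character it started with" in one pattern. This is a sketch in base R, not something proposed in the thread, and the `words_lowercase` vector is a hypothetical sample. It uses `.*` rather than `.+`, so two-letter words such as "aa" would also match.

```r
# Hypothetical sample standing in for the lowercased contents of words.txt
words_lowercase <- tolower(c("Bob", "kick", "apple", "area", "bulb", "dad"))

# (.) captures the first character; \\1 requires the same character at the end
same_ends <- grepl("^(.).*\\1$", words_lowercase, perl = TRUE)

# Frequency table keyed by the shared first/last letter
letter_counts <- table(substr(words_lowercase[same_ends], 1, 1))
letter_counts
```

Letters with no matching words are simply absent from the table, just as in the `table()` call above.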

@ChadFibke

Well @chenchenguo has multiple answers to choose from now!
