Ticket #74 (closed defect: fixed)
Regular expression matching does not seem to work with certain input
| Reported by: | mooneer | Owned by: | mooneer |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | interpreter | Version: | 1.0b8 |
| Keywords: | | Cc: | |
Description
Example code (from old kirb):
(in.command == "message" and in.message|match("^.title http://")) [
    url_rgx = make System.regex("http://([^/: ]+)(:([0-9]+))?(/[^ ]*)?");
    rgx_result = url_rgx|match(in.message);
    while(not (rgx_result is System.null)) [
        decide [
            (rgx_result.captures[3]|length == 0) [
                port = 80;
            ],
            true [ port = rgx_result.captures[3]|int; ]
        ];
        "retrieving host %s port %d path %s"|format([rgx_result.captures[1], port, rgx_result.captures[4]])|print;
        run [
            resp = interface.http|get("GET", rgx_result.captures[1], port, rgx_result.captures[4], null);
            #decide [
            #    (resp.content_type == "text/html") [
            title_rgx = make System.regex("<title>([^<]+)</title>");
            title_res = title_rgx|match(resp.content);
            decide [
                (not (title_res is System.null)) [
                    bot|write("PRIVMSG %s :[%s:%d] %s\r\n", [in.receiver, rgx_result.captures[1],
                        port, title_res.captures[1]]);
                ],
                true [
                    bot|write("PRIVMSG %s :[%s:%d] %s\r\n", [in.receiver, rgx_result.captures[1],
                        port, "(no title)"]);
                ]
            ];
            #]
            #];
        ] catch [
            __exc|print;
        ];
        rgx_result = url_rgx|match(in.message, rgx_result.match_begin + rgx_result.match_length);
    ];
],
Sending ".title http://www.google.com/" makes the bot write "(no title)" to the channel or private message window the command came from, even though the page has a title. Most other URLs work fine. The common theme among the failing URLs seems to be that the remote site is sending back UTF-8, at least if the Content-Type header is to be believed.
Change History

On second thought, it might not be a UTF-8-specific issue. Google is sending back ISO-8859-1, if the headers are to be believed.
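
One way to narrow this down is to run the same title pattern in a different regex engine. The following is a minimal Python sketch, not the interpreter's own regex code. The response bodies are made-up samples rather than captured traffic, and `extract_title` is a hypothetical helper. It assumes that `match` in the ticket's code searches anywhere in the string, which is how the URL pattern is used above.

```python
import re

# Same pattern as title_rgx in the ticket's example code.
TITLE_RGX = re.compile(r"<title>([^<]+)</title>")


def extract_title(body: bytes, charset: str) -> str:
    """Decode body with the given charset and return the captured title,
    or "(no title)" when the pattern does not match."""
    text = body.decode(charset, errors="replace")
    m = TITLE_RGX.search(text)
    return m.group(1) if m else "(no title)"


# Hypothetical response bodies (assumptions, not real server output).
page = "<html><head><title>Café</title></head></html>"
latin1_body = page.encode("iso-8859-1")
utf8_body = page.encode("utf-8")
upper_body = b"<HTML><HEAD><TITLE>Google</TITLE></HEAD></HTML>"

print(extract_title(latin1_body, "iso-8859-1"))  # Café
print(extract_title(utf8_body, "utf-8"))         # Café
print(extract_title(upper_body, "iso-8859-1"))   # (no title)
```

In this sketch the pattern matches regardless of whether the body is ISO-8859-1 or UTF-8, and only misses when the tag case differs, since the pattern is case-sensitive. If the real response body from the failing site contains a lowercase `<title>` element, that would point at the interpreter's regex handling rather than at the pattern or the encoding.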