Ticket #74 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Regular expression matching does not seem to work with certain input

Reported by: mooneer Owned by: mooneer
Priority: major Milestone:
Component: interpreter Version: 1.0b8
Keywords: Cc:

Description

Example code (from old kirb):

            (in.command == "message" and in.message|match("^.title http://")) [
                url_rgx = make System.regex("http://([^/: ]+)(:([0-9]+))?(/[^ ]*)?");
                rgx_result = url_rgx|match(in.message);
                while(not (rgx_result is System.null)) [
                    decide [ 
                        (rgx_result.captures[3]|length == 0) [ 
                            port = 80; 
                        ],
                        true [ port = rgx_result.captures[3]|int; ]
                    ];
                    "retrieving host %s port %d path %s"|format([rgx_result.captures[1], port, rgx_result.captures[4]])|print;
                    run [
                        resp = interface.http|get("GET", rgx_result.captures[1], port, rgx_result.captures[4], null);
                        #decide [
                        #    (resp.content_type == "text/html") [
                                title_rgx = make System.regex("<title>([^<]+)</title>");
                                title_res = title_rgx|match(resp.content);
                                decide [
                                    (not (title_res is System.null)) [
                                        bot|write("PRIVMSG %s :[%s:%d] %s\r\n", [in.receiver, rgx_result.captures[1], port, title_res.captures[1]]);
                                    ],
                                    true [
                                        bot|write("PRIVMSG %s :[%s:%d] %s\r\n", [in.receiver, rgx_result.captures[1], port, "(no title)"]);
                                    ]
                                ];
                            #]
                        #];
                    ] catch [
                        __exc|print;
                    ];
                    rgx_result = url_rgx|match(in.message, rgx_result.match_begin + rgx_result.match_length);
                ];
            ],

Passing ".title http://www.google.com/" sends "(no title)", instead of the page title, to the channel or private message window the command was sent from. Most other URLs work fine. The common theme among the failing ones seems to be that the remote site is sending back UTF-8, at least if the Content-Type header is to be believed.

Change History

comment:1 Changed 3 years ago by mooneer

  • Status changed from new to assigned
  • Summary changed from Regular expression matching does not seem to work with UTF-8 to Regular expression matching does not seem to work with certain input

On second thought, this might not be a UTF-8-specific issue. Google is sending back ISO-8859-1, if the headers are to be believed.

comment:2 Changed 3 years ago by mooneer

  • Status changed from assigned to closed
  • Resolution set to fixed

The actual cause is that the pages this fails on don't send back a Content-Length header, so the regular expression never sees the page content. This is now fixed in interface.http in r483.
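
For anyone hitting something similar, here is a rough Python sketch of the general idea: when a response carries no Content-Length, the body is whatever arrives until the server closes the connection. This is not the code from r483; the function name read_http_body is made up for illustration, and chunked transfer encoding is deliberately ignored.

```python
import io


def read_http_body(stream):
    """Read a simple HTTP response from a binary stream.

    Hypothetical helper for illustration only. Returns (headers, body).
    Does not handle chunked transfer encoding.
    """
    stream.readline()  # status line, e.g. b"HTTP/1.0 200 OK\r\n" (not parsed here)

    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # blank line (or EOF) ends the header section
        name, _, value = line.decode("iso-8859-1").partition(":")
        headers[name.strip().lower()] = value.strip()

    length = headers.get("content-length")
    if length is not None:
        # The server told us exactly how many bytes to expect.
        body = stream.read(int(length))
    else:
        # No Content-Length: the body runs until the connection closes (EOF).
        body = stream.read()
    return headers, body


# Usage with an in-memory stream standing in for a socket.
demo = io.BytesIO(
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><title>Example</title></html>"
)
demo_headers, demo_body = read_http_body(demo)
```

A reader that only trusts Content-Length would return an empty body for the demo response above, and a title regex run on that would find nothing, which matches the "(no title)" symptom described in this ticket.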
