How to index words containing letters written as HTML entities?
The title says it all.
I currently replace known HTML entities with their Unicode counterparts up front, but I was wondering whether some feature built into Oracle Text could do the same and save me an extra headache.
Has anyone indexed HTML entities?
Thank you
Flavio
----
http://oraclequirks.blogspot.com
You can create your own procedure using any method you like, then use that procedure in a procedure filter, and use that filter in your index parameters. In the example below, I borrowed a strip_html function from
http://www.supermanhamuerto.com/doku.php?id=Oracle:fixhtml
and used it in the procedure.
SCOTT@orcl12c_11gR2> set scan off
SCOTT@orcl12c_11gR2> -- table, data, and lexer:
SCOTT@orcl12c_11gR2> create table example (t varchar2 (4000))
  2  /
Table created.
SCOTT@orcl12c_11gR2> insert all
  2  into example values ('crónicas y relatos')
  3  into example values ('crónicas y relatos')
  4  into example values ('CRÓnicas y Relatos de Mexico')
  5  into example values ('crónicas y relatos de Mexico')
  6  select * from dual
  7  /
4 rows created.
SCOTT@orcl12c_11gR2> begin
  2    ctx_ddl.create_preference ('mylex', 'BASIC_LEXER');
  3    ctx_ddl.set_attribute ('mylex', 'base_letter', 'YES');
  4  end;
  5  /
PL/SQL procedure successfully completed.
SCOTT@orcl12c_11gR2> -- function from http://www.supermanhamuerto.com/doku.php?id=oracle:fixhtml
SCOTT@orcl12c_11gR2> CREATE OR REPLACE FUNCTION strip_html (dirty IN clob,
  2                                        to_cvs IN NUMBER DEFAULT 0)
  3  RETURN clob IS OUT clob;
  4
  5  TYPE arr_string IS varray (200) OF VARCHAR2 (64);
  6
  7  entities_search_for arr_string;
  8  entities_replace arr_string;
  9  cont NUMBER;
 10
 11  BEGIN
 12
 13
 14  -- to speed things up
 15  IF dirty IS NULL THEN
 16     RETURN dirty;
 17  END IF; -- isnull (dirty)
 18
 19  IF LENGTH (dirty) = 0 THEN
 20     RETURN dirty;
 21  END IF; -- length (dirty)
 22
23 entities_search_for: = arr_string)
24 ' !'.
25 ' #'.
26 ' $'.
27 ' %'.
28 '& ',.
29 '' '.
30 ' ('.
31 ' )'.
32 ' *'.
33 ' +'.
34 ' ,'.
35 ' ‐'.
36 ' .'.
37 ' /'.
38 ' :'.
39 ' ;'.
40 ' < ',.
41 ' ='.
42 ' > '.
43 ' ?'.
44 ' @'.
45 ' ['.
46 ' \'.
47 ' ]'.
48 "BE."
49 ' _'.
50 ' `'.
51 ' {'.
52 ' |'.
53 ' }'.
54 "˜"
55' ',
56 ""
57 "¢."
58 "£"
59 ' ¤',.
60 «¥»,
61 '¦ ',.
62 «§»,
63 ' ¨'.
64 ' ©'.
65 "ª"
66 ' ' ', '.
67 '¬"
68 cm,
69 '®',
70 '¯ ',.
71 "°",.
72 "±"
73 '²',
74 '³',
75 "Honourable."
76 "µ",.
77 "¶"
78 "·"
79 '¸ ',.
80 '¹',
81 'º"
82 '' '.
83 ' &fr;'.
84 ' &fr;'.
85 ' &fr;'.
86 ""
87 'TO. "
88 'A ',.
89 'A ',.
90 'A ',.
91 'A ',.
92 'A ',.
93 "AE."
94 ' &il;'.
95 'E ',.
96 'E ',.
97 'E ',.
98 'E ',.
99 'I ',.
100 'I ',.
101 "I."
102 'I ',.
103 "D."
104 "N."
105 "O."
106 "O."
107 'O,
108 "O."
109 "O."
110 'x',
111 "O."
112 "U."
113 "U."
114 "U."
115 "U."
116 "Y."
117 'Þ ',.
118 "ss."
119 "to."
120 'a ',.
121 'a ',.
122 "a."
"123A",.
124 'e ',.
125 'e ',.
126 'e ',.
127 ' &etilde;'.
128 'e ',.
129 "i."
130 'i ',.
131 'i ',.
132 ' ĩ'.
133 'i ',.
134 "o."
135 "o."
136 'o,
137 "o."
138 "o."
139 "u."
140 "u."
141 "u."
142 ' ũ'.
143 'u');
144
145 entities_replace: = arr_string)
146 ""
147 'º"
148 "$."
149 '% ',.
150 '& ',.
151 '' '.
152 '(',)
153 ")',"
154 ' *'.
155 «+»,
156 «,»,
157 '-'.
158 '.',
159 "ground."
160 "colon."
161 ' *'.
162'<>
163 'is',
164 ' > '.
165 '?'.
166 «,»,
167 ' *'.
168 ' *'.
169 ' *'.
170 ' *'.
171 "_."
172 "',
173 ' *'.
174 ' *'.
175 ' *'.
176 cm '.
177', '
178 ""
179 "cent."
180 'L ',.
181 ' *'.
182 'Y',
183 ' *'.
184 ' *'.
185 '.',
186 '(c)"
187 ' *'.
188 ' *'.
189 '!'.
190 ' *'.
191 "(r)."
192 ' *'.
193 ' *'.
194 ' *'.
195 ' *'.
196 ' *'.
197 'a ',.
198 "u."
199 ' *'.
200 "·"
201 c '.
202 ' *'.
203 ' *'.
204 ' *'.
205 ' *'.
206 ' *'.
207 ' *'.
208 ""
209 'E ',.
210 'A ',.
211 'A ',.
212 "A."
213 ' *'.
214 ' *'.
215 "AE."
216 ' *'.
217 'E ',.
218 "E."
219 ' *'.
220 ' *'.
221 "I."
222 'I ',.
223 'I ',.
224 ' *'.
225 ' *'.
226 ' N ',.
227 "O."
228 "O."
229 'O,
230 ' W '.
231 ' *'.
232 ' *'.
233 ' O ',.
234 "U."
235 "U."
236 "U."
237 ' *'.
238 ' *'.
239 ' *'.
240 ' *'.
241 "to."
242 'a ',.
'a' 243,
244 'a ',.
245 ' *'.
246 'e ',.
247 'e ',.
248 'e ',.
249 "e."
250 ' *'.
251 'i ',.
252 'i ',.
253 'i ',.
254 'i ',.
255 ' *'.
256 "o."
257 "o."
258 'o,
259 ' o ',.
260 ' *'.
261 "u."
262 "u."
263 "u."
264 "u."
265 ' *');
266
267  OUT := dirty;
268
269 - replace which is bounded
270?-- -> lazy star (catch the minimum possible)
OUT 271: = regexp_replace (OUT, '
272 clean what it is inside the style tags
273 OUT: = regexp_replace (OUT, ' ', ", 1, 0, 'nor');
274
275  IF to_cvs = 2 THEN
276  -- sanitize (not clean) the html code
277
278  -- clean tag
279  OUT := regexp_replace (OUT, '<\?xml:.*?>', '', 1, 0, 'ni');
280 clean tags
281 OUT: = regexp_replace (OUT, '
282 comments 283 OUT: = regexp_replace (OUT,' ', ", 1, 0, 'nor'); 284 clean meta OUT 285: = regexp_replace (OUT, '
286 - own link 287 OUT: = regexp_replace (OUT, '
288 clean DIV 289 OUT: = regexp_replace (OUT, ",", 1, 0, 'nor'); 290 - DURATION of own OUT 291: = regexp_replace (OUT, ",", 1, 0, 'nor'); 292 clean 'class inside of the tags' 293 OUT: = regexp_replace (OUT, ' (<.*?)class="?[a-zA-Z0-9-_]*"?(.*?>) ","\1\2", 1, 0, 'nor');
294 - clean the 'style' inside the following tags: I p b 295 OUT: = regexp_replace (OUT, ' (<[ibp] .*?)style=".*?" (.*?="">) ","\1\2", 1, 0, 'nor');
296 clean namespaces
297 OUT: = regexp_replace (OUT, ' (<)[a-zA-Z0-9-_]*:(.*?>)', "\1\2", 1, 0, 'nor');)[a-zA-Z0-9-_]*:(.*?>
298 OUT: = regexp_replace (OUT, "()", "\1\2", 1, 0, 'nor');
299
300 clean empty tags opening / closing: it must be
301 - past twice or three times to clean things like this:
302-
303 TWEAK:
should be replaced by
304 OUT: = regexp_replace (OUT, e
','
1, 0, 'nor');
305 OUT: = regexp_replace (OUT, '<([a-zA-Z0-9-_]*)>', ", 1, 0, 'nor');
306 TWEAK:
should be replaced by
307 OUT: = regexp_replace (OUT, e
','
1, 0, 'nor');
308 OUT: = regexp_replace (OUT, '<([a-zA-Z0-9-_]*)>', ", 1, 0, 'nor');
309
310  ELSE
311  -- clean html
312
313  -- replace all the stuff that amounts to a carriage return
OUT 314: = regexp_replace ([OUT, '] * > ', Chr (10) |) CHR (13));
315 OUT: = regexp_replace ([OUT, '] * > ', Chr (10) |) CHR (13));
OUT 316: = regexp_replace ([OUT, '] * > ', Chr (10) |) CHR (13));
317
318  -- replace all other html stuff
319  OUT := regexp_replace (OUT, '<[^>]*>', '', 1, 0, 'ni');
320
321  -- replace all entities
322  FOR cont IN 1..119 LOOP
323     OUT := REPLACE (OUT, entities_search_for (cont), entities_replace (cont));
324  END LOOP;
325
326  -- cleaning for export to cvs
327  IF to_cvs = 1 THEN
328     OUT := REPLACE (OUT, CHR (10), '');
329     OUT := REPLACE (OUT, CHR (13), '');
330     OUT := REPLACE (OUT, CHR (9), '');
331     OUT := REPLACE (OUT, ';', ',');
332 OUT: = REPLACE (' OUTSIDE, ' "',"');
333 END IF;
334
335
336 END IF;
337
338
339  RETURN (OUT);
340  END strip_html;
341  /
Function created.
SCOTT@orcl12c_11gR2> -- procedure that uses the function:
SCOTT@orcl12c_11gR2> create or replace procedure normalize
  2    (p_input  in clob,
  3     p_output in out nocopy clob)
  4  as
  5  begin
  6    p_output := strip_html (p_input);
  7  end normalize;
  8  /
Procedure created.
SCOTT@orcl12c_11gR2> -- filter that uses the procedure:
SCOTT@orcl12c_11gR2> begin
  2    ctx_ddl.create_preference ('myfilt', 'procedure_filter');
  3    ctx_ddl.set_attribute ('myfilt', 'procedure', 'normalize');
  4    ctx_ddl.set_attribute ('myfilt', 'input_type', 'clob');
  5    ctx_ddl.set_attribute ('myfilt', 'output_type', 'clob');
  6  end;
  7  /
PL/SQL procedure successfully completed.
SCOTT@orcl12c_11gR2> -- index that uses the filter:
SCOTT@orcl12c_11gR2> create index myindex on example (t) indextype is ctxsys.context
  2  parameters ('filter myfilt lexer mylex')
  3  /
Index created.
SCOTT@orcl12c_11gR2> -- indexed tokens:
SCOTT@orcl12c_11gR2> select token_text from dr$myindex$i
  2  /
TOKEN_TEXT
----------------------------------------------------------------
CRÓNICAS
DE
MEXICO
RELATOS
Y
5 rows selected.
SCOTT@orcl12c_11gR2> -- searches:
SCOTT@orcl12c_11gR2> select * from example where contains (t, 'crónicas') > 0
  2  /
T
--------------------------------------------------------------------------------
Crónicas y relatos
Crónicas y relatos
CRÓnicas y Relatos de Mexico
Crónicas y relatos de Mexico
4 rows selected.
SCOTT@orcl12c_11gR2> select * from example where contains (t, 'Mexico') > 0
  2  /
T
--------------------------------------------------------------------------------
CRÓnicas y Relatos de Mexico
Crónicas y relatos de Mexico
2 rows selected.
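If the text only needs its entities decoded (no tag stripping) and fits in a VARCHAR2, a much shorter filter procedure may be enough. The sketch below is untested against the data above. It assumes the supplied function UTL_I18N.UNESCAPE_REFERENCE, which converts character references such as &oacute; or &#243; back to characters; it is worth checking on your version which named entities it covers. The names decode_entities, myfilt2, and myindex2 are made up for illustration.

```sql
set define off

-- Hypothetical alternative to strip_html: decode character references only.
create or replace procedure decode_entities
  (p_input  in            varchar2,
   p_output in out nocopy varchar2)
as
begin
  -- markup tags are left untouched; only references like &#243; are decoded
  p_output := utl_i18n.unescape_reference (p_input);
end decode_entities;
/

begin
  ctx_ddl.create_preference ('myfilt2', 'procedure_filter');
  ctx_ddl.set_attribute ('myfilt2', 'procedure', 'decode_entities');
  ctx_ddl.set_attribute ('myfilt2', 'input_type', 'varchar2');
  ctx_ddl.set_attribute ('myfilt2', 'output_type', 'varchar2');
end;
/

-- only one context index is allowed per column, so drop the first one
drop index myindex;

create index myindex2 on example (t) indextype is ctxsys.context
parameters ('filter myfilt2 lexer mylex')
/
```

With VARCHAR2 in and out, the procedure is limited to what a VARCHAR2 can hold, which is fine for the varchar2 (4000) column used here; for CLOB columns the CLOB-based procedure above is the safer pattern.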
Tags: Database
Similar Questions
-
How to change the drive names in XP
While repairing XP after the kids messed it up, the names of the CD/DVD drives changed to something like "Compact Flash" and "MS/SD". How can I change the names back to CD/DVD? The drive letters are fine; only the names are wrong.
Hello
Open My Computer. If the drive can be renamed, there will be a Rename option when you right-click it.
You can also consider a System Restore...
http://support.Microsoft.com/kb/306084
...provided that it happened fairly recently and you can go back far enough.
Tricky
-
HOW CAN I MAKE THE LETTERS LARGER ON MY DESKTOP SCREEN? THE LETTERS ON THIS SCREEN ARE TOO SMALL AND THE LIGHT BLUE IS DIFFICULT TO READ
Hello
If you are using Internet Explorer, press the Alt key to display the menu bar, and then select View/Text Size. Select a size that's comfortable for you.
For the desktop, you will need to change the DPI settings.
Right-click on the desktop and select Screen Resolution.
In the next window, select Make text and other items larger or smaller.
In the next window, select one of the sizes offered or set a custom size to suit your needs.
I hope this helps.
Thank you for using Windows 7
Ronnie Vernon MVP
-
How to index the condition 'null' or 'not null'?
Hello all,
First of all, for background: we would like to make the following changes to a table:
1. We have an old varchar2(50) column that is filled with strings.
2. We now have a new number(3) column that is empty.
Our goal is to migrate from the old column to the new column so that each distinct string is mapped to a number ('abc' -> 0, 'xyz' -> 1, etc.).
The table that contains the columns has 1.3 billion rows. There is no index on the old column.
If possible, the migration should be done online (without downtime), and the temporary additional space should be as low as possible. Because of the performance impact, we plan to split the migration into several parts that will run at low-load times.
To avoid full table scans, I wonder whether it is possible to index the status of a row. By status, I mean only "null" or "not null".
Is it possible to define a kind of bitmap index (0 = null, 1 = not null) that would speed up the migration and not use much space?
Unfortunately, I am not very familiar with indexes yet.
To split the migration into parts, I thought of using 'where rownum >= x and rownum <= x + 10,000,000 and new_column is not null' to do it in steps of 10 million.
Thanks in advance,
Andreas
Like this?
CREATE INDEX idx_test ON table_name (NVL2 (column_name, 1, 0)); -- NVL2 --> 0 if the column is null, else 1
SELECT * FROM table_name WHERE NVL2 (column_name, 1, 0) = 1;
G.
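Building on that idea, here is one way the batches could look. It is a sketch, not tested code. The names big_table, old_col, new_col, and code_map are placeholders, and the batch size is arbitrary. The index expression is NULL for rows that are already migrated, and a B-tree index stores no entry when the whole key is NULL, so the index holds only the pending rows and shrinks as the migration proceeds.

```sql
-- Placeholder names throughout; adapt to the real schema.
CREATE INDEX idx_pending ON big_table (CASE WHEN new_col IS NULL THEN 1 END);

-- One batch; rerun during low-load windows until it updates 0 rows.
UPDATE big_table t
   SET t.new_col = (SELECT m.code
                      FROM code_map m
                     WHERE m.old_value = t.old_col)
 WHERE CASE WHEN t.new_col IS NULL THEN 1 END = 1
   AND ROWNUM <= 100000;

COMMIT;
```

The WHERE clause repeats the index expression exactly so that the optimizer can use the index. Rows whose old value has no match in code_map keep a NULL new_col and would be picked up again by every batch, so the mapping table needs to cover every distinct string (and NULL values in old_col need a decision) before the loop can finish.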
-
Everything I open is too small to read. I can't read my emails because the lettering is tiny. The letters in the browser are legible. I went through Microsoft help and it did increase the letters in the browser, but I have the same problem here in Firefox. Help
Firefox remembers zoom settings on a per-site basis. Maybe you just need to zoom in more. Use the keyboard shortcut Ctrl and +.
That is, hold down the Ctrl key and press the plus sign key next to the Backspace key.
-
How to disable letters in an input
Hello everyone
I was wondering if there is a way to prevent the user from typing letters in a text box on a widget, allowing only numbers.
I wrote this function, but it clears the field if a letter is typed:
function processKeyPress(e) {
    var targ;
    if (!e) e = window.event;
    if (e.target) targ = e.target;
    else if (e.srcElement) targ = e.srcElement;
    // any key outside codes 49-57, except Backspace (8), empties the field
    if (e.keyCode < 49 || e.keyCode > 57) {
        if (e.keyCode != 8) {
            targ.value = "";
        }
    }
}
document.addEventListener("keypress", processKeyPress, false);
and I put
Do you have a better idea?
Thank you
Michel
Will the application run on BlackBerry Device Software 5.0 or higher? If so, simply use the new HTML5 input types. There is a numeric input.
-
How to put HTML code into WordPress
I want to put ads on my blog. How do I put the HTML code into WordPress?
Your question is better asked in the WordPress support forums:
http://WordPress.org/support/forum/3
-
How to index the occurrences in the table
Hi, is there a way we can index every occurrence in the table?
It seems that Search ID Array only finds it once, and I can't figure out how to do it.
Please advise
Thanks in advance
Clement
Do you only need the index that corresponds to the item you are looking for, like this?
-
Underlined letters are shown all the time in my menus and dialog boxes, and I want to disable them.
The first thing I did was go to Control Panel, Ease of Access, Make the keyboard easier to use, to turn off "Underline keyboard shortcuts and access keys", but it is already off.
So how do you disable these underlines in Windows 7?
Hello Dmbyrnes,
I understand that you may want to hide the menu bar. Here are steps that may help with your request:
(1) First, right-click at the bottom of the menu and uncheck Lock the Toolbars. Then click Organize > Layout > Menu bar and make sure it is unchecked.
From there you should be able to press the Alt key to make the menu bar reappear. If you would follow up with me at your convenience, I would be very happy.
Thank you
Aaron
Microsoft Answers Support Engineer
Visit our Microsoft Answers feedback forum and let us know what you think
-
How to lock the letters on the keyboard
Hello
I have had my ThinkPad Tablet 2 for less than a week.
This is a silly question, but I couldn't figure out how to turn on caps lock.
Thanks in advance,
Double-tapping the Shift "key" turns on caps lock. There is no need to go to the 'full' keyboard.
-
Hi guys,
I just copied the HTML code from YouTube and pasted it into the HTML region:
How can I avoid what you see in the following picture?
http://www.9M.com/upfiles/fm577745.PNG
However, it is the HTML code I use:
Best regards</head> <body> <object style="height: 390px; width: 640px"> <param name="movie" value="&P2_VIDEO."> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <embed src="&P2_VIDEO." type="application/x-shockwave-flash" allowfullscreen="true" allowScriptAccess="always" width="640" height="390"> </object> </body> </html>
I just want to make sure that there is nothing wrong in the HTML code.
There is. This is tag soup: